# Simple Linear Regression

Let's use the very small `buildings` dataset to see several ways to perform simple linear regression in Python. We'll also do some plotting.

----

In [None]:
# import packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Read the data into a DataFrame from the .csv file and look at its shape


In [None]:
# Small data set so print the whole thing out


## Correlation

Recall that correlation measures the strength and direction of the **linear** relationship between two variables. Also, remember that the correlation coefficient will be the same regardless of the order of the two variables. You, the human, must determine which variable is the output/dependent variable (the `y`) and which is the input/independent variable (the `x`).
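As a minimal sketch of the symmetry described above, the snippet below builds a tiny stand-in DataFrame and computes the correlation in both directions, followed by the full correlation matrix. The column name `Stories` and every value in the frame are assumptions made up for illustration; they are not the real `buildings` data.

```python
import pandas as pd

# Made-up stand-in for the buildings data. The `Stories` column name and
# all values are illustrative assumptions, not the real dataset.
df = pd.DataFrame({
    "Year": [1990, 1975, 2001, 1985, 2010, 1969, 1998],
    "Stories": [30, 40, 45, 55, 60, 70, 80],
    "Height": [410, 520, 600, 700, 790, 880, 1050],
})

# Pearson correlation between the two variables of interest
r_xy = df["Stories"].corr(df["Height"])

# Same calculation with the variables swapped
r_yx = df["Height"].corr(df["Stories"])

# Full pairwise correlation matrix for all numeric columns
corr_matrix = df.corr()

print(r_xy, r_yx)
print(corr_matrix)
```

Both printed coefficients should agree, and the same value appears twice in the matrix, once on each side of the diagonal.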

In [None]:
# Find the correlation between just the two variables of interest


In [None]:
# Find the correlation in the other direction to verify it will be the same


In [None]:
# Find the entire correlation matrix


### Heatmap

Even though the `Year` variable doesn't make a lot of sense for this dataset, we still went ahead and calculated the full correlation matrix. One of the nice, easy-to-use functions in the `seaborn` package is `heatmap()`. You will often pass a correlation matrix to it to get a quick picture of the correlations.
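One possible way to draw such a heatmap is sketched here. The data frame is a made-up stand-in (the `Stories` column and all values are assumptions), and the `annot`, `vmin`, `vmax`, and `cmap` choices are just one reasonable styling, not the only one.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in for the buildings data (illustrative values only)
df = pd.DataFrame({
    "Year": [1990, 1975, 2001, 1985, 2010, 1969, 1998],
    "Stories": [30, 40, 45, 55, 60, 70, 80],
    "Height": [410, 520, 600, 700, 790, 880, 1050],
})

# annot=True writes each coefficient in its cell; fixing the color scale
# to [-1, 1] keeps the colors comparable across different datasets
ax = sns.heatmap(df.corr(), annot=True, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_title("Correlation matrix")
```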

In [None]:
# Plot a heatmap of the correlation matrix


### Plotting the Data

Continuing with `seaborn`, let's create a scatter plot and add the fitted OLS regression line.
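The three plotting calls used in this part might look like the sketch below. It assumes `Stories` is the predictor and `Height` the response; the `Stories` name and the values are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up stand-in for the buildings data (illustrative values only)
df = pd.DataFrame({
    "Stories": [30, 40, 45, 55, 60, 70, 80],
    "Height": [410, 520, 600, 700, 790, 880, 1050],
})

# Figure-level scatter plot; returns a FacetGrid
g_scatter = sns.relplot(data=df, x="Stories", y="Height", kind="scatter")

# Axes-level scatter plot with the fitted OLS line;
# ci=None turns off the bootstrapped confidence band
fig, ax = plt.subplots()
sns.regplot(data=df, x="Stories", y="Height", ci=None, ax=ax)

# Figure-level counterpart of regplot; also returns a FacetGrid
g_lm = sns.lmplot(data=df, x="Stories", y="Height", ci=None)
```

`regplot` draws onto a single Axes, so it is the one to reach for when the plot has to sit inside an existing `matplotlib` figure; `lmplot` manages its own figure and can facet by additional variables.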

In [None]:
# Use relplot from seaborn to plot the data as a scatterplot


In [None]:
# Use regplot to add a regression line to scatter


In [None]:
# Use lmplot to add a regression line to scatter 


#### Using a `jointplot`

Within `seaborn` we have a function called `jointplot()` that can create a scatter plot, add the regression line, and add histograms for the two variables. This functionality can be very useful when we have a lot of data. Here, we only have seven data points, so it will not add much value. However, let's go ahead and create one to see it in action.

### Where is the Regression Equation?
Unfortunately, we cannot get the regression equation from either `regplot` or `lmplot`. (Remember, `seaborn` is a visualization package, not a statistical one.) We could use `numpy` to get it. Another option is to use the `statsmodels` package. Let's try that instead.
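For reference, the `numpy` route mentioned above could be sketched with `np.polyfit`, which fits a degree-1 polynomial by least squares. The data are a made-up stand-in (the `Stories` column and all values are assumptions).

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the buildings data (illustrative values only)
df = pd.DataFrame({
    "Stories": [30, 40, 45, 55, 60, 70, 80],
    "Height": [410, 520, 600, 700, 790, 880, 1050],
})

# polyfit returns coefficients from the highest power down,
# so a degree-1 fit gives (slope, intercept)
slope, intercept = np.polyfit(df["Stories"], df["Height"], deg=1)

print(f"Height = {intercept:.2f} + {slope:.2f} * Stories")
```

The slope should match the textbook formula, sample covariance of `x` and `y` divided by the sample variance of `x`.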

----

## Using the `statsmodels` API
Plotting alone does not give us the fitted equation, so there has to be a better way. There are several, and we will look at one now: the [statsmodels library](https://www.statsmodels.org/stable/index.html).
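The cells in this section can be sketched roughly as follows, using the formula interface of `statsmodels`. The formula `Height ~ Stories` assumes a predictor column named `Stories`, and the data are made-up values rather than the real `buildings` dataset.

```python
import pandas as pd
from statsmodels.formula.api import ols

# Made-up stand-in for the buildings data (illustrative values only)
df = pd.DataFrame({
    "Stories": [30, 40, 45, 55, 60, 70, 80],
    "Height": [410, 520, 600, 700, 790, 880, 1050],
})

# Fit Height = b0 + b1 * Stories by ordinary least squares
results = ols("Height ~ Stories", data=df).fit()

# .params is a pandas Series indexed by term name
intercept = results.params["Intercept"]
slope = results.params["Stories"]
print(results.params)

# Fitted values (the estimated y-hat for each observation)
print(results.fittedvalues)

# Explained sum of squares, residual sum of squares, and their total
print(results.ess, results.ssr, results.centered_tss)

# Mean squares for the model and for the residuals
print(results.mse_model, results.mse_resid)

# Coefficient of determination
print(results.rsquared)
```

For a model with an intercept, the explained and residual sums of squares add up to the total, and R-squared is the explained share of that total.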

In [None]:
# If we were importing the entire API, we would use the following
# import statsmodels.formula.api as smf

# Because we know we are only going to use OLS, let's import just that
from statsmodels.formula.api import ols

In [None]:
# Use the statsmodels library to fit an OLS model
# Store the fit in a variable called 'results'


In [None]:
# What is results?


In [None]:
# Call .summary() on the results variable


In [None]:
# Let's just look at the coefficients table


In [None]:
# You can get the intercept and slope using .params


In [None]:
# What is results.params?


In [None]:
# Lots of other attributes on the results object
# Fitted values (that's the estimated y-hat values)


In [None]:
# We can get the sum of squares due to regression (explained sum of squares)
# and the sum of squared residuals (errors)


In [None]:
# We can get the MSE for regression, MSE for residuals


In [None]:
# R-squared is easy too


#### Plotting the Residuals / Errors
We often want to plot the residuals / errors to help us determine the appropriateness of the linear model. It is easy to get the residuals from our regression results. We'll print out the residuals and then plot them two different ways.
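A sketch of both residual plots is given here. It refits the model on made-up stand-in data (the `Stories` column and all values are assumptions) so that the block runs on its own.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from statsmodels.formula.api import ols

# Made-up stand-in for the buildings data (illustrative values only)
df = pd.DataFrame({
    "Stories": [30, 40, 45, 55, 60, 70, 80],
    "Height": [410, 520, 600, 700, 790, 880, 1050],
})

results = ols("Height ~ Stories", data=df).fit()

# Residuals are the actual values minus the fitted values
residuals = results.resid
print(residuals)

# Option 1: plain scatter plot of residuals against fitted values
fig, ax = plt.subplots()
ax.scatter(results.fittedvalues, residuals)
ax.axhline(0, color="gray", linestyle="--")
ax.set_xlabel("Fitted Height")
ax.set_ylabel("Residual")

# Option 2: seaborn fits the regression itself and plots its residuals
fig2, ax2 = plt.subplots()
sns.residplot(data=df, x="Stories", y="Height", ax=ax2)
```

With an intercept in the model, the residuals sum to zero, so the points should scatter around the horizontal line with no obvious pattern if the linear form is appropriate.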

In [None]:
# We can also get and print the residuals (errors)


In [None]:
# Plot the residuals using a scatter plot


In [None]:
# seaborn has a convenience function called residplot()


### Comparing "Mean Model" to Simple Linear Regression
One idea is that we could simply use a "naive" model for forecasting. An example of this approach would be to use the average `Height` across all buildings (observations) as our prediction. We should expect this naive model to perform worse than our simple linear regression model, given, among other things, the R-squared value we saw earlier.

Let's first plot both the "mean model" and the simple linear regression model. Then we can compare their errors.
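The comparison could be sketched along these lines. To keep the block self-contained it fits the line with `np.polyfit` rather than reusing a `statsmodels` results object; the fitted values from either approach should serve equally well. The `Stories` column and all data values are made-up assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in for the buildings data (illustrative values only)
df = pd.DataFrame({
    "Stories": [30, 40, 45, 55, 60, 70, 80],
    "Height": [410, 520, 600, 700, 790, 880, 1050],
})
x, y = df["Stories"], df["Height"]

# Mean model: predict the average Height for every building
mean_pred = np.full(len(df), y.mean())

# Simple linear regression: predict from the fitted line
slope, intercept = np.polyfit(x, y, deg=1)
slr_pred = intercept + slope * x

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Left panel: data, mean line, and vertical error segments
axes[0].scatter(x, y)
axes[0].axhline(y.mean(), color="red")
axes[0].vlines(x, y, mean_pred, linestyles="dashed")
axes[0].set_title("Mean model")

# Right panel: data, regression line, and vertical error segments
axes[1].scatter(x, y)
axes[1].plot(x, slr_pred, color="red")
axes[1].vlines(x, y, slr_pred, linestyles="dashed")
axes[1].set_title("Simple linear regression")

# Absolute errors and sums of squared errors for each model
mean_abs_err = (y - mean_pred).abs()
slr_abs_err = (y - slr_pred).abs()
sse_mean = ((y - mean_pred) ** 2).sum()
sse_slr = ((y - slr_pred) ** 2).sum()

# Share of observations where the regression prediction is closer
share_better = (slr_abs_err < mean_abs_err).mean()
print(sse_mean, sse_slr, share_better)
```

Because least squares with an intercept minimizes the sum of squared errors over all lines, including the horizontal line at the mean, the regression's total squared error can never exceed that of the mean model on the data it was fit to.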

In [None]:
# Let's plot both the "mean model" and the SLR
fig, ax = plt.subplots(1, 2, figsize=(12,5))

# Mean model in first subplot


# Add the mean line


# Add vertical lines from actual to the mean line (prediction)


# Add title


# Now plot the SLR


# Add vertical lines from actual to the regression line (prediction)


# Add title


In [None]:
# We see that when using the "naive" mean model, the errors appear to be
# greater than when using the simple linear regression model.
# One way to visualize how much better the SLR is compared to the mean model
# is to plot the histogram of the absolute errors.
#
# We have already seen the SLR errors 


# Get the mean model's errors


In [None]:
# Now plot them as histograms on top of each other
# Note: because the dataset is so small, this may not be very pretty
plt.figure(figsize=(12,5))


In [None]:
# How often is the linear regression model better?


**&copy; 2021 - Present: Matthew D. Dean, Ph.D.   
Clinical Associate Professor of Business Analytics at William \& Mary.**