In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.graphics.regressionplots import abline_plot
import seaborn as sns
from sklearn.linear_model import LinearRegression
%matplotlib inline

# 3.6.2 Simple Linear Regression

Load the Boston housing dataset and fit a basic regression. The book uses R; I'm going to use statsmodels and scikit-learn.

In [None]:
df = sm.datasets.get_rdataset("Boston", "MASS", cache=True).data
df = sm.add_constant(df)

## StatsModels Implementation

Where possible I prefer to use StatsModels. scikit-learn is great, and if I only wanted predictions I might prefer it, but StatsModels gives me the inferential output I'm looking for (standard errors, t-statistics, confidence intervals).

I'm also curious about performance, so I'm going to time each of them.

In [None]:
df.head()

In [None]:
%%timeit
boston_sm_ols = sm.OLS(df["medv"], df[["const", "lstat"]]).fit()

In [None]:
boston_sm_ols = sm.OLS(df["medv"], df[["const", "lstat"]]).fit()
print(boston_sm_ols.summary())

## scikit-learn Implementation

scikit-learn expects a 2-D feature matrix, so with a single independent variable I had to add this awkward reshape. Maybe there's a better way?
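One option that seems to avoid the reshape is selecting the column with double brackets, which returns a one-column DataFrame (already 2-D) instead of a Series. The sketch below uses a made-up `toy` frame with the same column names rather than the real Boston data, purely so it runs on its own; the numbers in it are invented.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up stand-in for the Boston data (an assumption, not the real dataset).
rng = np.random.default_rng(0)
toy = pd.DataFrame({"lstat": rng.uniform(2, 38, size=50)})
toy["medv"] = 34.5 - 0.95 * toy["lstat"] + rng.normal(0, 6, size=50)

# toy["lstat"] is a 1-D Series; toy[["lstat"]] is a 2-D DataFrame with one column.
reshaped = LinearRegression().fit(toy["lstat"].to_numpy().reshape(-1, 1), toy["medv"])
bracketed = LinearRegression().fit(toy[["lstat"]], toy["medv"])

print(toy[["lstat"]].shape)
print(reshaped.coef_, bracketed.coef_)
```

Both fits should give the same slope and intercept. One thing to keep in mind: a model fitted on a DataFrame remembers the column name, so it is safest to pass a DataFrame with the same column to `predict` later.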

In [None]:
%%timeit
boston_skl_ols = LinearRegression()
boston_skl_ols.fit(df["lstat"].to_numpy().reshape(-1, 1), df["medv"])

In [None]:
boston_skl_ols = LinearRegression()
boston_skl_ols.fit(df["lstat"].to_numpy().reshape(-1, 1), df["medv"])
print("Coefficients: \n", boston_skl_ols.coef_)
print("Intercept: \n", boston_skl_ols.intercept_)

While its estimation output is not as verbose, or at least I don't see an easy way to make it so, it is a lot faster, which makes sense given what it's designed for.
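If I did want a bit more out of the scikit-learn fit, `score` returns R² directly, and the classical OLS standard errors can be worked out by hand from the residuals. This is only a sketch on made-up data (the `x` and `y` arrays are invented, and the standard-error formula assumes the usual homoskedastic errors); it is not something scikit-learn provides itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data, only for illustration.
rng = np.random.default_rng(1)
x = rng.uniform(2, 38, size=200)
y = 34.5 - 0.95 * x + rng.normal(0, 6, size=200)
X = x.reshape(-1, 1)

model = LinearRegression().fit(X, y)
r_squared = model.score(X, y)  # R^2 on the training data

# Classical OLS standard errors: sqrt(diag(sigma^2 * (X'X)^-1)),
# where the design matrix includes the intercept column
# and sigma^2 = RSS / (n - p).
design = np.column_stack([np.ones(len(x)), X])
resid = y - model.predict(X)
n, p = design.shape
sigma2 = resid @ resid / (n - p)
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(design.T @ design)))

print("R^2:", r_squared)
print("std err (intercept, slope):", std_err)
```

That is a fair amount of work to reproduce two numbers StatsModels prints for free in `summary()`.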

I think I'll stick with StatsModels for the rest of this.

In [None]:
boston_sm_ols.conf_int(alpha=0.05)

In [None]:
boston_sm_ols.get_prediction(sm.add_constant([5, 10, 15])).summary_frame(alpha=0.05)

StatsModels has its own way to plot the regression, but Seaborn is probably nicer. Let's try both.

In [None]:
sns.regplot(x='lstat', y='medv', data=df);

In [None]:
ax = df.plot(x='lstat', y='medv', kind='scatter')
abline_plot(model_results=boston_sm_ols, ax=ax);

In [None]:
# make a new dataframe for easier plotting
result_df = df[["lstat", "medv"]].copy()
result_df["fitted"] = boston_sm_ols.fittedvalues
result_df["resid"] = boston_sm_ols.resid

In [None]:
result_df.plot(x="lstat", y="resid", kind="scatter");

More to look into here: https://www.statsmodels.org/dev/examples/notebooks/generated/regression_plots.html