In [1]:
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
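# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import requires an older scikit-learn release.
from sklearn.datasets import load_boston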

# 3.6.2 Simple Linear Regression

Load the Boston housing dataset and fit a simple linear regression. The book uses R; I'm going to use statsmodels and scikit-learn instead.

In [2]:
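def boston_df():
    """Return the Boston housing data as a DataFrame with an intercept column."""
    boston_dict = load_boston()
    X = boston_dict["data"]
    X_names = boston_dict["feature_names"]
    Y = boston_dict["target"]
    df = pd.DataFrame(data=X, columns=X_names)
    df["target"] = Y
    # sm.OLS does not add an intercept on its own, so include a column of ones.
    df["B0"] = 1
    return df


df = boston_df()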

## StatsModels Implementation

Where possible I prefer StatsModels. scikit-learn is great, and for pure prediction I might prefer it, but StatsModels gives me the inferential output I'm looking for: standard errors, t-statistics, p-values, and confidence intervals.

I'm also curious about performance, so I'm going to time each of them.

In [3]:
%%timeit
boston_sm_ols = sm.OLS(df["target"], df[["B0", "LSTAT"]])
boston_sm_ols_result = boston_sm_ols.fit()

1.12 ms ± 7.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [4]:
boston_sm_ols = sm.OLS(df["target"], df[["B0", "LSTAT"]])
boston_sm_ols_result = boston_sm_ols.fit()
print(boston_sm_ols_result.summary())

                            OLS Regression Results                            
Dep. Variable:                 target   R-squared:                       0.544
Model:                            OLS   Adj. R-squared:                  0.543
Method:                 Least Squares   F-statistic:                     601.6
Date:                Sat, 28 Sep 2019   Prob (F-statistic):           5.08e-88
Time:                        19:14:51   Log-Likelihood:                -1641.5
No. Observations:                 506   AIC:                             3287.
Df Residuals:                     504   BIC:                             3295.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
B0            34.5538      0.563     61.415      0.0

## scikit-learn Implementation

I'm not happy that I had to do this awkward reshape just because I only have one independent variable. Maybe there's a better way?
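One possible answer to the reshape question: selecting the column with a list, `df[["LSTAT"]]`, returns a one-column DataFrame that is already two-dimensional, so scikit-learn accepts it directly. The sketch below checks this on synthetic data (the `demo` frame and its coefficients are illustrative assumptions, not the Boston data) and confirms that both approaches give the same fit.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption; this is not the Boston dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"LSTAT": rng.uniform(1, 40, size=50)})
demo["target"] = 34.5 - 0.95 * demo["LSTAT"] + rng.normal(0, 1, size=50)

# The approach from the notebook: Series -> 1-D array -> explicit reshape.
X_reshaped = demo["LSTAT"].to_numpy().reshape(-1, 1)

# Alternative: a list of column names keeps the result 2-D.
X_frame = demo[["LSTAT"]]

fit_reshaped = LinearRegression().fit(X_reshaped, demo["target"])
fit_frame = LinearRegression().fit(X_frame, demo["target"])

print(X_frame.shape)
print(np.allclose(fit_reshaped.coef_, fit_frame.coef_))
```

The double-bracket form also keeps the column name attached to the fitted model, which may be convenient when more predictors are added later.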

In [5]:
%%timeit
boston_skl_ols = LinearRegression()
boston_skl_ols.fit(df["LSTAT"].to_numpy().reshape(-1, 1), df["target"])

397 µs ± 41.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [6]:
boston_skl_ols = LinearRegression()
boston_skl_ols.fit(df["LSTAT"].to_numpy().reshape(-1, 1), df["target"])
print("Coefficients: \n", boston_skl_ols.coef_)
print("Intercept: \n", boston_skl_ols.intercept_)

Coefficients: 
 [-0.95004935]
Intercept: 
 34.55384087938311


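The estimation output is not as verbose, or at least I don't see an easy way to make it so, but it's a lot faster: about 397 µs per fit versus 1.12 ms for StatsModels, roughly three times as fast. That makes sense given that scikit-learn is designed for prediction rather than inference.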
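If the extra output is wanted without leaving scikit-learn, the usual OLS quantities can be computed by hand from the fitted model. This is a rough sketch under the standard homoskedastic OLS assumptions (the same "nonrobust" covariance type shown in the StatsModels summary), run on synthetic data; the variable names and the simulated coefficients are illustrative, not taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption; this is not the Boston dataset).
rng = np.random.default_rng(0)
x = rng.uniform(1, 40, size=200)
y = 34.5 - 0.95 * x + rng.normal(0, 6, size=200)

X = x.reshape(-1, 1)
model = LinearRegression().fit(X, y)

n, p = X.shape
residuals = y - model.predict(X)

# Unbiased residual variance: RSS / (n - p - 1), the -1 accounting for the intercept.
sigma2 = residuals @ residuals / (n - p - 1)

# Coefficient covariance: sigma^2 * (X'X)^-1, with X including the column of ones.
design = np.column_stack([np.ones(n), X])
cov = sigma2 * np.linalg.inv(design.T @ design)
std_err = np.sqrt(np.diag(cov))

coefs = np.concatenate([[model.intercept_], model.coef_])
t_stats = coefs / std_err
r_squared = model.score(X, y)  # LinearRegression.score returns R^2

print("coef:   ", coefs)
print("std err:", std_err)
print("t:      ", t_stats)
print("R^2:    ", r_squared)
```

This only covers a handful of the numbers in the StatsModels summary, so for anything beyond a quick check StatsModels is probably still the more practical choice.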