## OLS Python

#### Table of Contents
* [Setup](#Setup)
* [statsmodels.api](#statsmodels.api)
    - [sm Training](#sm-Training)
    - [sm Predicting](#sm-Predicting)
    - [sm Testing](#sm-Testing)
* [statsmodels.formula.api](#statsmodels.formula.api)
    - [smf Training](#smf-Training)
    - [smf Predicting](#smf-Predicting)
    - [smf Testing](#smf-Testing)
    - [smf Formulas](#smf-Formulas)
* [sklearn.linear_model](#sklearn.linear_model)
    - [sklearn Training](#sklearn-Training)
    - [sklearn Predicting](#sklearn-Predicting)
    - [sklearn Testing](#sklearn-Testing)

We are going to estimate the same linear regression model using three different interfaces.
Comparing the packages shows the bigger picture of what each field cares about.
`statsmodels` is geared toward statistics and econometrics, so its output is much more formal.
We will demonstrate both its base API and its R-like formula API.
`sklearn` is geared toward machine learning, which cares mainly about the prediction $\hat{y}$.

We are going to use the following regression:

$$
\begin{align*}
    \%\Delta rGDP_{i,t} = & \alpha_t + UrateBin_{i,t}^\prime\beta + LFPR_{i,t}\gamma + LFPR_{i,t}UrateBin_{i,t}^\prime\delta +\\
    & EmpPerEstab_{i,t}\zeta + EmpPerEstab_{i,t}^2\eta + \epsilon_{i,t}
\end{align*}
$$

*********
# Setup
[TOP](#OLS-Python)

In [1]:
import pandas as pd
import numpy as np

import statsmodels.api as sm
import statsmodels.formula.api as smf

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [2]:
df = pd.read_csv('C:/Users/johnj/Documents/Data/aml in econ 02 spring 2021/class data/class_data.csv')
df.set_index(['fips', 'year', 'GeoName'], inplace = True)
df
df['year'] = df.index.get_level_values('year')

In [3]:
df = pd.read_pickle('C:/Users/johnj/Documents/Data/aml in econ 02 spring 2021/class data/class_data.pkl')
df.columns

Index(['pct_d_rgdp', 'urate_bin', 'pos_net_jobs', 'emp_estabs',
       'estabs_entry_rate', 'estabs_exit_rate', 'pop', 'pop_pct_black',
       'pop_pct_hisp', 'lfpr', 'density', 'year'],
      dtype='object')

All but `statsmodels.formula.api` require the features and labels to be passed as separate arguments. So, let's create them!

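**IMPORTANT** The features matrix is the **design matrix**: the constant and every dummy, interaction, and squared term must already be in it as an explicit column.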
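
To make the idea concrete before building the real thing, here is a minimal sketch on made-up data. The values and the `higher` level of `urate_bin` are assumptions for illustration only; the steps mirror the next cell in miniature.

```python
import pandas as pd

# Hypothetical toy data: column names mimic the class data, values are made up.
toy = pd.DataFrame({
    "pct_d_rgdp": [1.2, -0.4, 2.5, 0.3],
    "lfpr": [60.0, 55.0, 65.0, 58.0],
    "urate_bin": ["higher", "lower", "similar", "lower"],
})

y_toy = toy["pct_d_rgdp"]                # label vector
x_toy = toy.drop(columns="pct_d_rgdp")   # raw features

# One dummy per level; the first level (alphabetically) is dropped as the baseline.
dummies = pd.get_dummies(x_toy["urate_bin"], prefix="urate", drop_first=True, dtype=float)
x_toy = x_toy.join(dummies).drop(columns="urate_bin")

# Explicit intercept column and one interaction term.
x_toy.insert(0, "const", 1.0)
x_toy["lfpr:urate_lower"] = x_toy["lfpr"] * x_toy["urate_lower"]

print(x_toy)
```

Every term in the regression equation ends up as its own column, which is exactly what the non-formula APIs expect.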

In [4]:
y = df['pct_d_rgdp']
x = df.drop(columns = 'pct_d_rgdp')

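# Creating dummies (dtype = float keeps the design matrix numeric on newer pandas, where dummies default to bool)
x = x.join([pd.get_dummies(x['year'], prefix = 'year', drop_first = True, dtype = float),
            pd.get_dummies(x['urate_bin'], prefix = 'urate', drop_first = True, dtype = float)]).drop(columns = ['year', 'urate_bin'])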
x = sm.add_constant(x)

# Creating interactions
x['lfpr:urate_lower'] = x.lfpr * x.urate_lower
x['lfpr:urate_similar'] = x.lfpr * x.urate_similar
x['emp_estabs_sq'] = x.emp_estabs**2

# Dropping features we do not want to use
x.drop(columns = ['pos_net_jobs', 'estabs_entry_rate', 'estabs_exit_rate',
                  'pop', 'pop_pct_black', 'pop_pct_hisp', 'density'], inplace = True)

# Sorting the columns for output
x.sort_index(axis = 'columns', inplace = True)

# Checking the final columns
x.columns

Index(['const', 'emp_estabs', 'emp_estabs_sq', 'lfpr', 'lfpr:urate_lower',
       'lfpr:urate_similar', 'urate_lower', 'urate_similar', 'year_2003',
       'year_2004', 'year_2005', 'year_2006', 'year_2007', 'year_2008',
       'year_2009', 'year_2010', 'year_2011', 'year_2012', 'year_2013',
       'year_2014', 'year_2015', 'year_2016', 'year_2017', 'year_2018'],
      dtype='object')

In [5]:
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 2/3, random_state = 490)
print(x_train.shape)
print(y_train.shape, '\n')

print(x_test.shape)
print(y_test.shape)

(33889, 24)
(33889,) 

(16945, 24)
(16945,)


*********
# statsmodels.api
[TOP](#OLS-Python)

`statsmodels.api`'s linear regression function is the capitalized `OLS`.

This package is also one of the few that use the order (y, x) instead of (x, y). Be careful out there! Read the documentation when available!

## sm Training 
[TOP](#OLS-Python)

Fitting `statsmodels` functions proceeds as follows:

1. call the desired function with `y` and `x` arguments
2. chain the `.fit()` method

This is different from `sklearn`.

In [6]:
fit_sm = sm.OLS(y_train, x_train).fit()
print(fit_sm.summary())

                            OLS Regression Results                            
Dep. Variable:             pct_d_rgdp   R-squared:                       0.028
Model:                            OLS   Adj. R-squared:                  0.027
Method:                 Least Squares   F-statistic:                     42.12
Date:                Tue, 23 Feb 2021   Prob (F-statistic):          4.05e-187
Time:                        17:55:38   Log-Likelihood:            -1.2370e+05
No. Observations:               33889   AIC:                         2.474e+05
Df Residuals:                   33865   BIC:                         2.476e+05
Df Model:                          23                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                  0.5566      0

In [7]:
print(fit_sm.summary2(alpha = 0.1))

                  Results: Ordinary least squares
Model:              OLS              Adj. R-squared:     0.027      
Dependent Variable: pct_d_rgdp       AIC:                247445.9529
Date:               2021-02-23 17:55 BIC:                247648.2932
No. Observations:   33889            Log-Likelihood:     -1.2370e+05
Df Model:           23               F-statistic:        42.12      
Df Residuals:       33865            Prob (F-statistic): 4.05e-187  
R-squared:          0.028            Scale:              86.754     
--------------------------------------------------------------------
                     Coef.  Std.Err.    t     P>|t|   [0.05   0.95] 
--------------------------------------------------------------------
const                0.5566   0.6366   0.8743 0.3819 -0.4906  1.6038
emp_estabs          -0.0718   0.0232  -3.1002 0.0019 -0.1100 -0.0337
emp_estabs_sq        0.0011   0.0006   1.8644 0.0623  0.0001  0.0020
lfpr                 0.0287   0.0079   3.6214 0.0003 

In [8]:
fit_sm.params
fit_sm.pvalues
fit_sm.resid
fit_sm.conf_int(alpha = 0.01)
fit_sm.rsquared

0.027808707874526828

## sm Predicting 

In [9]:
y_hat_sm = fit_sm.predict(x_test)
y_hat_sm.head()

fips   year  GeoName         
6013   2005  Contra Costa, CA    2.583403
29015  2006  Benton, MO          4.865207
40069  2007  Johnston, OK        2.259181
48235  2015  Irion, TX           3.086271
29075  2013  Gentry, MO          4.297925
dtype: float64

## sm Testing

In [10]:
rmse_sm = np.sqrt(np.mean((y_test - y_hat_sm)**2))
rmse_sm

9.274383341532479

How good is this fit, you ask.
Well, it is a little difficult to say without a comparison.
A good starting place is to compare this fit against the null model.
Then we can determine the percent improvement we obtain from it.

In [11]:
# null model
rmse_null = np.sqrt(  np.mean((y_test - np.mean(y_train))**2)  )
rmse_null

9.403229309446852

In [12]:
print(round((rmse_null - rmse_sm)/rmse_null*100, 3), '%', sep = '')

1.37%


Only 1.37%!? 
That is not much at all!

We should note that if we had made a 100% improvement, then we would have interpolated the data (overfit it).
I would say if we could improve upon the null model by 10%, then that is something to be excited about.
Let's see if we can get there this semester.
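
Since we will keep making this comparison, the null-model benchmark above can be wrapped in a small helper. This is only a sketch: the function names `rmse` and `pct_improvement_over_null` are hypothetical, not part of any package.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def pct_improvement_over_null(y_train, y_test, y_hat):
    """Percent reduction in test RMSE relative to predicting the training mean."""
    null_pred = np.full(len(y_test), np.mean(y_train))
    rmse_null = rmse(y_test, null_pred)
    return (rmse_null - rmse(y_test, y_hat)) / rmse_null * 100

# Tiny made-up example: the null model predicts 2 everywhere.
y_tr = [1.0, 2.0, 3.0]
y_te = [0.0, 4.0]
print(pct_improvement_over_null(y_tr, y_te, [1.0, 3.0]))
```

Calling `pct_improvement_over_null(y_train, y_test, y_hat_sm)` would reproduce the percentage computed by hand above.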

***
# statsmodels.formula.api
[TOP](#OLS-Python)

This works just like `R`!

In [13]:
df.columns

Index(['pct_d_rgdp', 'urate_bin', 'pos_net_jobs', 'emp_estabs',
       'estabs_entry_rate', 'estabs_exit_rate', 'pop', 'pop_pct_black',
       'pop_pct_hisp', 'lfpr', 'density', 'year'],
      dtype='object')

So here is something cool about `train_test_split()`: with the same `random_state`, any input with the same number of rows is split identically, so `df_train` lines up with `x_train`:

In [14]:
df_train, df_test = train_test_split(df, train_size = 2/3, random_state = 490)
all(x_train.index == df_train.index)

True
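
The same behavior can be checked on a toy example. This is a sketch with made-up frames (`features` and `full` are hypothetical names): two objects with the same rows but different columns, split with the same seed.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Two made-up frames sharing an index but holding different columns.
idx = pd.Index(range(100), name="row")
features = pd.DataFrame({"a": np.arange(100.0)}, index=idx)
full = pd.DataFrame({"a": np.arange(100.0), "b": np.arange(100.0) * 2}, index=idx)

# Same number of rows + same random_state -> same shuffled row positions.
f_train, f_test = train_test_split(features, train_size=2/3, random_state=490)
d_train, d_test = train_test_split(full, train_size=2/3, random_state=490)

print(f_train.index.equals(d_train.index))
print(f_test.index.equals(d_test.index))
```

This is why `y_test` from the earlier split can be reused to score predictions made from `df_test`.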

## smf Training
[TOP](#OLS-Python)

In [15]:
fit_smf = smf.ols(formula = 'pct_d_rgdp ~ emp_estabs + I(emp_estabs**2) + C(urate_bin)*lfpr + C(year)', data = df_train).fit()
print(fit_smf.summary2())

                       Results: Ordinary least squares
Model:                 OLS                 Adj. R-squared:        0.027      
Dependent Variable:    pct_d_rgdp          AIC:                   247445.9529
Date:                  2021-02-23 17:55    BIC:                   247648.2932
No. Observations:      33889               Log-Likelihood:        -1.2370e+05
Df Model:              23                  F-statistic:           42.12      
Df Residuals:          33865               Prob (F-statistic):    4.05e-187  
R-squared:             0.028               Scale:                 86.754     
-----------------------------------------------------------------------------
                              Coef.  Std.Err.    t     P>|t|   [0.025  0.975]
-----------------------------------------------------------------------------
Intercept                     0.5566   0.6366   0.8743 0.3819 -0.6912  1.8045
C(urate_bin)[T.lower]        -0.2319   0.8808  -0.2633 0.7923 -1.9584  1.4946
C(urate_b
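
The `C(urate_bin)*lfpr` term in the formula is shorthand: `a*b` expands to `a + b + a:b`, the two main effects plus their interaction. Here is a small sketch on simulated data (the columns `g`, `x`, `y` are made up) showing that the short and long spellings give the same fit.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: a two-level category g and a continuous feature x.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "g": rng.choice(["a", "b"], size=200),
    "x": rng.normal(size=200),
})
toy["y"] = (1.0 + 0.5 * toy["x"]
            + (toy["g"] == "b") * (2.0 + 1.5 * toy["x"])
            + rng.normal(scale=0.1, size=200))

# Short form with * versus the spelled-out main effects and interaction.
fit_star = smf.ols("y ~ C(g)*x", data=toy).fit()
fit_long = smf.ols("y ~ C(g) + x + C(g):x", data=toy).fit()

print(list(fit_star.params.index))
print(np.allclose(fit_star.params, fit_long.params))
```

That is the formula-API counterpart of building the `lfpr:urate_lower` and `lfpr:urate_similar` columns by hand in the Setup section.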

## smf Predicting
[TOP](#OLS-Python)

In [16]:
yhat_smf = fit_smf.predict(df_test)

## smf Testing
[TOP](#OLS-Python)

In [17]:
rmse_smf = np.sqrt(  np.mean((yhat_smf - y_test)**2)  )
rmse_smf

9.274383341532479

In [18]:
rmse_sm

9.274383341532479

## smf Formulas
[TOP](#OLS-Python)

Here are a few more examples on how to use formulas in `statsmodels.formula.api`:

In [19]:
df_train.columns

Index(['pct_d_rgdp', 'urate_bin', 'pos_net_jobs', 'emp_estabs',
       'estabs_entry_rate', 'estabs_exit_rate', 'pop', 'pop_pct_black',
       'pop_pct_hisp', 'lfpr', 'density', 'year'],
      dtype='object')

In [20]:
# no intercept
smf.ols(formula = 'pct_d_rgdp ~ density + pop - 1', data = df_train).fit().params

density    0.000027
pop        0.000002
dtype: float64

In [21]:
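# only specific levels
# NOTE: isin([range(...)]) tests against the range object itself, so the second term is all False (hence its 0.000000 coefficient);
# isin(range(2007, 2010)) would flag the years 2007-2009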
smf.ols(formula = "pct_d_rgdp ~ I(year == 2003) + I(year.isin([range(2007,2010)]))", data = df_train).fit().params

Intercept                                    1.918966
I(year == 2003)[T.True]                      1.084449
I(year.isin([range(2007, 2010)]))[T.True]    0.000000
dtype: float64
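
Be careful with that last term! A coefficient of exactly zero is a hint that the indicator never switched on. Here is a minimal sketch on a made-up `year` series comparing a `range` wrapped in a list with a `range` passed directly.

```python
import pandas as pd

# Made-up years for illustration.
year = pd.Series([2003, 2007, 2008, 2009, 2010])

# A list holding one range object: isin() looks for the range itself among the years.
wrapped = year.isin([range(2007, 2010)])

# The range passed directly: isin() looks for 2007, 2008, and 2009.
direct = year.isin(range(2007, 2010))

print(wrapped.tolist())
print(direct.tolist())
```

So the formula that was probably intended is `"pct_d_rgdp ~ I(year == 2003) + I(year.isin(range(2007, 2010)))"`. That is an assumption about the intent, and it has not been rerun on the class data here.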

*********
# sklearn.linear_model
[TOP](#OLS-Python)

`sklearn` is the best machine learning package for everything *other than* neural networks.
The lack of statistical detail in its OLS function shows the difference between data scientists and statisticians/econometricians.

## sklearn Training 
[TOP](#OLS-Python)

Fitting `sklearn` functions proceeds as follows:

1. call the desired function without arguments
2. chain the `.fit()` method with `x` and `y` arguments

This is different from `statsmodels`.
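
A minimal sketch of this pattern on simulated data (names and true coefficients are made up) also shows why the next cell uses `fit_intercept = False`: our design matrix already carries a constant column, so `sklearn` should not add a second intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated data with known coefficients: intercept 1, slopes 2 and -3.
rng = np.random.default_rng(0)
x_raw = rng.normal(size=(500, 2))
y_sim = 1.0 + 2.0 * x_raw[:, 0] - 3.0 * x_raw[:, 1] + rng.normal(scale=0.1, size=500)

# Option A: no constant column, sklearn estimates the intercept itself.
fit_a = LinearRegression().fit(x_raw, y_sim)

# Option B: constant column in the design matrix, sklearn's intercept switched off.
X_const = np.column_stack([np.ones(len(x_raw)), x_raw])
fit_b = LinearRegression(fit_intercept=False).fit(X_const, y_sim)

print(fit_a.intercept_, fit_a.coef_)
print(fit_b.coef_)
```

Both options describe the same model; in option B the intercept simply shows up as the first entry of `coef_`.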

In [22]:
fit_sk = LinearRegression(fit_intercept = False).fit(x_train, y_train)
print(fit_sk.score(x_train, y_train)) # R-squared
fit_sk.coef_ # unnamed coefficients

0.027808707874526828


array([ 5.56640139e-01, -7.18430813e-02,  1.06712265e-03,  2.87370942e-02,
        2.43563032e-02,  3.03341962e-04, -2.31892706e-01,  9.12733619e-01,
       -5.97577390e-02, -2.43014292e-01,  6.88890844e-01,  2.60155665e+00,
       -1.45210522e+00, -2.14993324e+00, -3.67096841e+00, -1.40544332e-01,
       -9.14729510e-01, -1.90804270e+00, -2.67455590e-01, -1.51310812e+00,
       -1.15016125e+00, -2.69127428e+00, -1.42011928e+00, -6.93515668e-01])
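
An unlabeled array like this is hard to read. One way to get names back is to pair `coef_` with the columns of the design matrix, since they share the same order. Here is a sketch on a made-up, noise-free design matrix (values and true coefficients are assumptions).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up design matrix with an explicit constant column.
X_toy = pd.DataFrame({
    "const": 1.0,
    "lfpr": [60.0, 55.0, 65.0, 58.0, 62.0],
    "emp_estabs": [10.0, 12.0, 9.0, 15.0, 11.0],
})
# Noise-free label so the true coefficients (0.5, 0.1, -0.2) are recovered.
y_toy = 0.5 + 0.1 * X_toy["lfpr"] - 0.2 * X_toy["emp_estabs"]

fit_toy = LinearRegression(fit_intercept=False).fit(X_toy, y_toy)

# coef_ follows the column order of the matrix passed to .fit().
coefs = pd.Series(fit_toy.coef_, index=X_toy.columns)
print(coefs)
```

With the class data, the equivalent would be `pd.Series(fit_sk.coef_, index = x_train.columns)`.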

In [23]:
fit_sk.intercept_

0.0

## sklearn Predicting
[TOP](#OLS-Python)

In [24]:
yhat_sk = fit_sk.predict(x_test)

## sklearn Testing
[TOP](#OLS-Python)

In [25]:
rmse_sk = np.sqrt(mean_squared_error(y_test, yhat_sk)) # (y_true, y_pred) order; avoids the `squared` argument, which newer sklearn versions deprecate
rmse_sk

9.274383341532479

In [26]:
rmse_sm

9.274383341532479

In [27]:
rmse_smf

9.274383341532479