# Model Fitting I

We want to estimate the causal effect of studying on grades, so we turn to econometric techniques for causal inference. We will first run a naive OLS fit and then demonstrate why it is inappropriate in this context.

### Naive OLS Fit

The naive approach would be to fit a "kitchen sink" OLS regression to these data. So let's see what this regression yields, and then assess how plausible the results are. We run one model for each of our `studytime` mapping schemes.

*Note: We are using MacKinnon and White's (1985) HC3 heteroskedasticity-robust covariance estimator.*

In [4]:
# Loading the libraries we will use and setting global options

# Suppressing warnings
import warnings
warnings.filterwarnings(action = "ignore")

# Data manipulation and math/stats functions
import numpy as np
np.set_printoptions(suppress=True)
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from linearmodels.iv import IV2SLS 

# Plotting preferences
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import seaborn as sns

# Import self-made functions
from p3functions import *

In [94]:
# Loading the data
student_perf = pd.read_pickle('data/student_por_v2.pkl')

Before we fit the OLS model, let's clean up our dataset by converting strings to indicators and converting the final `G3` score into a percentage. We'll be using our `make_indicators` function located in `p3functions.py`.
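The source of `make_indicators` is not shown here, so the following is only a hypothetical stand-in that is consistent with how it is called: it takes a dataframe and a `{new_name: (column, value)}` dictionary and adds one 0/1 column per entry in place. The name `make_indicators_sketch` and the toy dataframe are assumptions for illustration.

```python
import pandas as pd

def make_indicators_sketch(df, mapping):
    """Hypothetical stand-in for make_indicators.

    For each entry new_name: (column, value), add a 0/1 column that is 1
    where df[column] equals value. Modifies df in place.
    """
    for new_name, (column, value) in mapping.items():
        df[new_name] = (df[column] == value).astype(int)

# Toy data using two of the column names from the dataset
toy = pd.DataFrame({
    'sex': ['M', 'F', 'M'],
    'guardian': ['mother', 'other', 'father'],
})
make_indicators_sketch(toy, {
    'male': ('sex', 'M'),
    'no_parent': ('guardian', 'other'),
})
print(toy[['male', 'no_parent']])
```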

In [95]:
# Data formatting - converting strings to indicators
indicator_names = {
    'school_GP': ('school', 'GP'),
    'male': ('sex', 'M'),
    'urban': ('address', 'U'),
    'fam_small': ('famsize', 'LE3'),
    'fam_split': ('Pstatus', 'A'),
    'no_parent': ('guardian', 'other'),
    'father': ('guardian', 'father'),
    'mother': ('guardian', 'mother'),
    'school_sup': ('schoolsup', 'yes'),
    'famsup': ('famsup', 'yes'),
    'paid': ('paid', 'yes'),
    'activities': ('activities', 'yes'),
    'nursery': ('nursery', 'yes'),
    'higher': ('higher', 'yes'),
    'internet': ('internet', 'yes'),
    'romantic': ('romantic', 'yes')
}
make_indicators(student_perf, indicator_names)

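# Converting G3 to percent
## Note: dividing by 12 gives a share of the maximum only if G3 is capped at 12; the original UCI grades run from 0 to 20
student_perf['G3_perc'] = student_perf.G3 / 12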

In [78]:
# Running the OLS model with discrete mapping
## Note: we leave out the first `studytime` group as the reference category to avoid perfect collinearity with the constant
Y = student_perf.G3_perc
X = student_perf[['studytime_dscr2', 'studytime_dscr3', 'studytime_dscr4', 
                  'school_GP', 'male', 'age', 'urban', 'fam_small', 'fam_split', 'Medu', 'Fedu', 
                  'mother', 'father', 'traveltime', 'freetime', 'failures', 'school_sup', 'famsup', 'paid', 
                  'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'goout', 'Dalc', 'Walc', 
                  'health', 'absences']]
X = sm.add_constant(X)
model = sm.OLS(Y, X)
results_discrete = model.fit(cov_type='HC3')
print(results_discrete.summary(title = 'OLS Regression Results: Discrete Map'))

                     OLS Regression Results: Discrete Map                     
Dep. Variable:                G3_perc   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.316
Method:                 Least Squares   F-statistic:                     9.783
Date:                Sat, 07 Apr 2018   Prob (F-statistic):           1.17e-35
Time:                        15:20:19   Log-Likelihood:                 70.081
No. Observations:                 649   AIC:                            -78.16
Df Residuals:                     618   BIC:                             60.58
Df Model:                          30                                         
Covariance Type:                  HC3                                         
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
const               0.6312      0.174     

In [81]:
# Running the OLS model with continuous mapping
Y = student_perf.G3_perc
X = student_perf[['studytime_continuous',  
                  'school_GP', 'male', 'age', 'urban', 'fam_small', 'fam_split', 'Medu', 'Fedu', 
                  'mother', 'father', 'traveltime', 'freetime', 'failures', 'school_sup', 'famsup', 'paid', 
                  'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'goout', 'Dalc', 'Walc', 
                  'health', 'absences']]
X = sm.add_constant(X)
model = sm.OLS(Y, X)
results_continuous = model.fit(cov_type='HC3')
print(results_continuous.summary(title = 'OLS Regression Results: Continuous Map'))

                    OLS Regression Results: Continuous Map                    
Dep. Variable:                G3_perc   R-squared:                       0.346
Model:                            OLS   Adj. R-squared:                  0.316
Method:                 Least Squares   F-statistic:                     10.31
Date:                Sat, 07 Apr 2018   Prob (F-statistic):           6.89e-36
Time:                        15:21:19   Log-Likelihood:                 68.893
No. Observations:                 649   AIC:                            -79.79
Df Residuals:                     620   BIC:                             50.00
Df Model:                          28                                         
Covariance Type:                  HC3                                         
                           coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                    0.6055 

In both models the $R^2$ (about 0.35) and adjusted $R^2$ (about 0.32) are low: the regressors explain only around a third of the variation in `G3_perc`.
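As a quick arithmetic check, the adjusted $R^2$ values can be recovered from the reported $R^2$, the number of observations $n$, and the number of regressors $k$ (excluding the constant) via $\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - k - 1}$. The helper name `adjusted_r2` below is ours, introduced only for this illustration.

```python
def adjusted_r2(r2, n_obs, n_regressors):
    """Adjusted R^2 for a model with an intercept and n_regressors slopes."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)

# R-squared, No. Observations and Df Model as printed in the two summaries
discrete = adjusted_r2(0.348, 649, 30)
continuous = adjusted_r2(0.346, 649, 28)
print(round(discrete, 3), round(continuous, 3))  # 0.316 0.316
```

Both values match the `Adj. R-squared` entries in the regression output, so the two mapping schemes fit the data about equally well once the number of regressors is taken into account.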

In [13]:
#Saving our cleaned dataset and results of our Naive OLS fit
student_perf.to_pickle('data/student_por_v3.pkl')
results_discrete.save('results/Naive_OLS_discrete.pickle')
results_continuous.save('results/Naive_OLS_continuous.pickle')

In [161]:
# Loading the data
student_perf = pd.read_pickle('data/student_por_v2.pkl')

# Data formatting - converting strings to indicators
indicator_names = {
    'school_GP': ('school', 'GP'),
    'male': ('sex', 'M'),
    'urban': ('address', 'U'),
    'fam_small': ('famsize', 'LE3'),
    'fam_split': ('Pstatus', 'A'),
    
    'no_parent': ('guardian', 'other'),
    'father': ('guardian', 'father'),
    'mother': ('guardian', 'mother'),
    
    'school_sup': ('schoolsup', 'yes'),
    'famsup': ('famsup', 'yes'),
    'paid': ('paid', 'yes'),
    'activities': ('activities', 'yes'),
    'nursery': ('nursery', 'yes'),
    'higher': ('higher', 'yes'),
    'internet': ('internet', 'yes'),
    'romantic': ('romantic', 'yes'),
    
    # New in this cell: one indicator per level of the categorical and ordinal variables
    
    'Mjob_teach': ('Mjob', 'teacher'),
    'Mjob_health': ('Mjob', 'health'),
    'Mjob_civil': ('Mjob', 'services'),
    'Mjob_other': ('Mjob', 'other'),
    'Fjob_teach': ('Fjob', 'teacher'),
    'Fjob_health': ('Fjob', 'health'),
    'Fjob_civil': ('Fjob', 'services'),
    'Fjob_other': ('Fjob', 'other'),
    
    'Medu_primary': ('Medu', 1),
    'Medu_5_9': ('Medu', 2),
    'Medu_secondary': ('Medu', 3),
    'Medu_higher': ('Medu', 4),
    'Fedu_primary': ('Fedu', 1),
    'Fedu_5_9': ('Fedu', 2),
    'Fedu_secondary': ('Fedu', 3),
    'Fedu_higher': ('Fedu', 4),
    
    'reason_home' : ('reason', 'home'),
    'reason_course' : ('reason', 'course'),
    'reason_reputation' : ('reason', 'reputation'),
    
    'traveltime_0_15m' : ('traveltime', 1),
    'traveltime_15_30m' : ('traveltime', 2),
    'traveltime_30m_1h' : ('traveltime', 3),
    'traveltime_1h_plus' : ('traveltime', 4),
    
    'famrel_1' : ('famrel', 1),
    'famrel_2' : ('famrel', 2),
    'famrel_3' : ('famrel', 3),
    'famrel_4' : ('famrel', 4),
    'famrel_5' : ('famrel', 5),
    
    'freetime_1' : ('freetime', 1),
    'freetime_2' : ('freetime', 2),
    'freetime_3' : ('freetime', 3),
    'freetime_4' : ('freetime', 4),
    'freetime_5' : ('freetime', 5),
    
    'goout_1' : ('goout', 1),
    'goout_2' : ('goout', 2),
    'goout_3' : ('goout', 3),
    'goout_4' : ('goout', 4),
    'goout_5' : ('goout', 5),

    'Dalc_1' : ('Dalc', 1),
    'Dalc_2' : ('Dalc', 2),
    'Dalc_3' : ('Dalc', 3),
    'Dalc_4' : ('Dalc', 4),
    'Dalc_5' : ('Dalc', 5),

    'Walc_1' : ('Walc', 1),
    'Walc_2' : ('Walc', 2),
    'Walc_3' : ('Walc', 3),
    'Walc_4' : ('Walc', 4),
    'Walc_5' : ('Walc', 5),

    'health_1' : ('health', 1),
    'health_2' : ('health', 2),
    'health_3' : ('health', 3),
    'health_4' : ('health', 4),
    'health_5' : ('health', 5),
}
make_indicators(student_perf, indicator_names)

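# Converting G3 to percent
student_perf['G3_perc'] = student_perf.G3 / 12

# Running the OLS model with continuous mapping
## Note: level 3 of each 5-point scale and the shortest travel time are left out as reference groups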
Y = student_perf.G3_perc
X = student_perf[['studytime_continuous',
                  'Mjob_teach', 'Mjob_health', 'Mjob_civil', 'Mjob_other', 
                  'Fjob_teach', 'Fjob_health', 'Fjob_civil', 'Fjob_other',
                  'Medu_primary', 'Medu_5_9', 'Medu_secondary', 'Medu_higher', 
                  'Fedu_primary', 'Fedu_5_9', 'Fedu_secondary', 'Fedu_higher', 
                  'reason_home', 'reason_course', 'reason_reputation',
                  'traveltime_15_30m', 'traveltime_30m_1h', 'traveltime_1h_plus',
                  'famrel_1', 'famrel_2', 'famrel_4', 'famrel_5',
                  'freetime_1', 'freetime_2', 'freetime_4', 'freetime_5',
                  'goout_1', 'goout_2', 'goout_4', 'goout_5',
                  'Dalc_1', 'Dalc_2', 'Dalc_4', 'Dalc_5',
                  'Walc_1', 'Walc_2', 'Walc_4', 'Walc_5',
                  'health_1', 'health_2', 'health_4', 'health_5',
                  
                  'school_GP', 'male', 'age', 'urban', 'fam_small', 'fam_split', 
                  'mother', 'father', 'failures', 'school_sup', 'famsup', 'paid', 
                  'activities', 'nursery', 'higher', 'internet', 'romantic', 'absences']]
X = sm.add_constant(X)
model = sm.OLS(Y, X)
results_continuous = model.fit(cov_type='HC3')
#print(results_continuous.summary(title = 'OLS Regression Results: Continuous Map').as_latex())
print(results_continuous.summary(title = 'OLS Regression Results: Continuous Map'))

                    OLS Regression Results: Continuous Map                    
Dep. Variable:                G3_perc   R-squared:                       0.408
Model:                            OLS   Adj. R-squared:                  0.342
Method:                 Least Squares   F-statistic:                     5.905
Date:                Sat, 07 Apr 2018   Prob (F-statistic):           1.29e-33
Time:                        17:29:07   Log-Likelihood:                 101.26
No. Observations:                 649   AIC:                            -70.53
Df Residuals:                     583   BIC:                             224.9
Df Model:                          65                                         
Covariance Type:                  HC3                                         
                           coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                    0.5286 

In [137]:
# Computing the Variance Inflation Factor
vif = pd.DataFrame({
    'Features' : X.columns,
    'VIF Factor' : [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
})

vif.sort_values(by = 'VIF Factor', ascending = False)

          Features  VIF Factor
0            const  660.764338
13     Medu_higher   27.490742
11        Medu_5_9   26.160910
15        Fedu_5_9   24.140741
12  Medu_secondary   22.142360
10    Medu_primary   21.756254
14    Fedu_primary   21.361527
17     Fedu_higher   19.169556
16  Fedu_secondary   18.518848
26        famrel_4    9.331149


In [160]:
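# Lasso fit as a quick check on which regressors survive an L1 penalty
from sklearn import linear_model
clf = linear_model.Lasso(alpha = 0.01)
test = clf.fit(X, Y)
# Coefficients are returned in the column order of X
test.coef_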

array([ 0.        ,  0.01022711,  0.        ,  0.        ,  0.        ,
       -0.        ,  0.        , -0.        , -0.        ,  0.        ,
       -0.        , -0.        ,  0.        ,  0.0284    , -0.00171195,
       -0.        ,  0.        ,  0.        ,  0.        , -0.        ,
        0.        , -0.        ,  0.        , -0.        , -0.        ,
       -0.        ,  0.01321529, -0.        , -0.        ,  0.        ,
        0.        , -0.        , -0.        ,  0.        ,  0.        ,
       -0.        ,  0.02676692, -0.        , -0.        , -0.        ,
        0.        ,  0.        , -0.        , -0.        ,  0.        ,
        0.        ,  0.        , -0.00780183,  0.08481742, -0.00714887,
        0.        ,  0.        ,  0.        ,  0.        , -0.        ,
        0.        , -0.11505379, -0.        ,  0.        , -0.        ,
        0.        , -0.        ,  0.06352521,  0.        , -0.        ,
       -0.00243664])
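The raw array is hard to read because the coefficients are only identified by position. A minimal sketch of one way to label them, on synthetic data (the columns `x1` to `x5`, the outcome `y_toy`, and the penalty `alpha=0.05` are assumptions for illustration): wrap `coef_` in a pandas Series indexed by the column names and keep the non-zero entries.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic data (illustrative only): y depends on x1 and x4, the rest is noise
rng = np.random.default_rng(2)
n = 300
X_toy = pd.DataFrame(rng.normal(size=(n, 5)),
                     columns=['x1', 'x2', 'x3', 'x4', 'x5'])
y_toy = 0.5 * X_toy['x1'] - 0.3 * X_toy['x4'] + rng.normal(scale=0.1, size=n)

lasso = Lasso(alpha=0.05).fit(X_toy, y_toy)

# Label each coefficient with its column name, then keep the non-zero ones
coefs = pd.Series(lasso.coef_, index=X_toy.columns)
selected = coefs[coefs != 0]
print(selected)
```

Applied to our fit, the same idea would be `pd.Series(test.coef_, index=X.columns)`. Two caveats apply when reading the result: the Lasso penalty depends on the scale of each regressor, and our regressors are not standardized, so the set of surviving variables should be treated as a rough screen rather than a formal selection.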