# Model Benchmarks
---

### Objective
This notebook evaluates a benchmark multiple linear regression (MLR) model using the list of interesting features chosen in notebook 02. Cross-validation scores are calculated with sklearn and interpreted, and a statsmodels analysis of p-values and coefficients allows for the removal of highly insignificant features.

---
#### External Libraries Import

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
import statsmodels.api as sm
import warnings
warnings.filterwarnings('ignore')



#### Read Cleaned and Preprocessed Datasets

In [2]:
df_train = pd.read_csv('../datasets/preprocessed_train.csv')

### Fit the Benchmark Multiple Linear Regression Model

In [3]:
# get interesting features from 02 notebook

interesting_features = ['neighborhood', 'overall_qual', 'year_built', 'year_remod/add', 'exterior_1st',
                        'mas_vnr_type', 'exter_qual', 'bsmt_qual', 'foundation', 'total_bsmt_sf',
                        'gr_liv_area', 'full_bath', 'kitchen_qual', 'fireplaces', 'garage_area',
                        'heating_qc']

In [4]:
# set X and y and train-test split on training data

X = df_train[interesting_features]
y = df_train['saleprice']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=77)

In [5]:
# fit a linear regression model on the training set and
# compute the mean R-squared across 5 cross-validation folds

lr = LinearRegression()
lr.fit(X_train, y_train)
lr_cv = cross_val_score(lr, X_train, y_train, cv=5).mean()
print('Linear regression with the training data produces a mean R-squared score of {}.'.format(lr_cv))

Linear regression with the training data produces a mean R-squared score of 0.8602092947450769.


This base model produces a mean cross-validation R-squared score of 0.86. This implies that, compared to simply predicting the mean house price in Ames, Iowa, about 86% of the variation in sale price is explained by the 16 variables I included in this MLR.
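
To make the "compared to the mean" interpretation concrete, the sketch below computes R-squared by hand as one minus the ratio of the model's squared error to the squared error of a mean-only baseline. The prices and predictions are hypothetical toy numbers, not values from the Ames data.

```python
# Toy numbers (hypothetical, not from the Ames data) showing how R-squared
# compares a model's errors with those of a mean-only baseline.
actual = [200_000, 150_000, 320_000, 180_000, 250_000]
predicted = [210_000, 140_000, 300_000, 190_000, 260_000]

mean_price = sum(actual) / len(actual)

# Residual sum of squares: error left over after using the model.
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
# Total sum of squares: error from always predicting the mean price.
ss_tot = sum((a - mean_price) ** 2 for a in actual)

r_squared = 1 - ss_res / ss_tot
print(f"R-squared: {r_squared:.3f}")
```

A value near 1 means the model's errors are small relative to the mean-only baseline; a value near 0 means the model does little better than predicting the mean.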
<br><br>

In [6]:
# refit the linear regression model on the testing set and
# compute the mean R-squared across 5 cross-validation folds within it

lr.fit(X_test, y_test)
lr_cv = cross_val_score(lr, X_test, y_test, cv=5).mean()
print('Linear regression with the testing data produces a mean R-squared score of {}.'.format(lr_cv))

Linear regression with the testing data produces a mean R-squared score of 0.8664782641435853.


By comparing the two cross-validation scores from the training and testing data, I can identify any bias and/or variance. Because the testing data produces a score that is higher by about 0.006, this suggests that my model is, at most, slightly underfit.
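
A common alternative way to make this comparison is to fit once on the training split and score the untouched testing split directly with `LinearRegression.score`. The sketch below illustrates that pattern on synthetic data from `make_regression`, which is an assumption for illustration only; the numbers it prints say nothing about the Ames model.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the housing data (assumption for illustration only).
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=77)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=77
)

lr = LinearRegression()

# Cross-validated R-squared estimated from the training split only.
cv_r2 = cross_val_score(lr, X_train, y_train, cv=5).mean()

# Fit once on the training split, then score data the model has never seen.
lr.fit(X_train, y_train)
holdout_r2 = lr.score(X_test, y_test)

print(f"CV R-squared (train): {cv_r2:.3f}")
print(f"Held-out R-squared:   {holdout_r2:.3f}")
print(f"Gap (train CV - held-out): {cv_r2 - holdout_r2:+.3f}")
```

A training score far above the held-out score would point to high variance (overfitting), while two similar but low scores would point to high bias (underfitting).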
<br><br>

In [7]:
# add an intercept column, then fit OLS with statsmodels
# to inspect coefficients and p-values
X_train = sm.add_constant(X_train)

model = sm.OLS(y_train, X_train).fit()
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | saleprice | R-squared: | 0.863 |
| Model: | OLS | Adj. R-squared: | 0.862 |
| Method: | Least Squares | F-statistic: | 559.0 |
| Date: | Thu, 06 Dec 2018 | Prob (F-statistic): | 0.0 |
| Time: | 22:33:08 | Log-Likelihood: | -16812.0 |
| No. Observations: | 1434 | AIC: | 33660.0 |
| Df Residuals: | 1417 | BIC: | 33750.0 |
| Df Model: | 16 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.061e+05 | 2.26e+05 | -0.469 | 0.639 | -5.5e+05 | 3.38e+05 |
| neighborhood | 1144.7006 | 185.306 | 6.177 | 0.000 | 781.196 | 1508.205 |
| overall_qual | 1.104e+04 | 1050.624 | 10.509 | 0.000 | 8980.341 | 1.31e+04 |
| year_built | -6402.4439 | 4813.236 | -1.330 | 0.184 | -1.58e+04 | 3039.391 |
| year_remod/add | 146.7923 | 58.085 | 2.527 | 0.012 | 32.851 | 260.734 |
| exterior_1st | 536.0570 | 403.594 | 1.328 | 0.184 | -255.650 | 1327.764 |
| mas_vnr_type | 3401.8772 | 1292.784 | 2.631 | 0.009 | 865.901 | 5937.853 |
| exter_qual | 1.227e+04 | 2372.472 | 5.170 | 0.000 | 7612.785 | 1.69e+04 |
| bsmt_qual | 1.064e+04 | 1339.245 | 7.942 | 0.000 | 8009.475 | 1.33e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 387.67 | Durbin-Watson: | 1.973 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 2662.446 |
| Skew: | 1.076 | Prob(JB): | 0.0 |
| Kurtosis: | 9.319 | Cond. No. | 788000.0 |


### Metrics Summary

- P-values
    - Statistically insignificant p-values are attached to: 'year_built', 'heating_qc', 'foundation', and 'exterior_1st'.
    - There is insufficient evidence to reject the null hypothesis that these four variables have no effect on sale price.
- Coefficients
    - The sign (+/-) in front of each coefficient indicates the direction of that feature's relationship with sale price. 'full_bath' is the only statistically significant variable that negatively affects price.
    - The magnitude of each coefficient indicates the strength of that variable's effect on sale price per unit of the feature, so magnitudes are most comparable between features measured on similar scales. The three strongest are: 'overall_qual', 'exter_qual', and 'full_bath'.
    
<br><br>
This model sets the benchmark for future models to beat.
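
To show where the t and P>|t| columns come from, the sketch below recomputes them for 'year_built' and 'exterior_1st' from the coef and std err values in the OLS summary. It uses a normal approximation to the t-distribution, which is an assumption made here to keep the example dependency-free; statsmodels uses the exact t-distribution, so the third decimal place can differ slightly.

```python
import math

# (coef, std err) pairs copied from the OLS summary above.
rows = {
    "year_built": (-6402.4439, 4813.236),
    "exterior_1st": (536.0570, 403.594),
}


def t_and_p(coef, std_err):
    """Return the t-statistic and a two-sided p-value (normal approximation)."""
    t = coef / std_err
    # With 1417 residual degrees of freedom the t-distribution is close to
    # the standard normal, so erfc gives a good approximation here.
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p


results = {name: t_and_p(coef, se) for name, (coef, se) in rows.items()}
for name, (t, p) in results.items():
    print(f"{name}: t = {t:.3f}, p = {p:.3f}")
```

Both p-values come out at roughly 0.18, well above the conventional 0.05 cutoff and in line with the 0.184 reported in the summary.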

#### Create new list of interesting features by removing weak predictors

In [8]:
# remove heating_qc and foundation

interesting_features.remove('foundation')
interesting_features.remove('heating_qc')
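
Because `list.remove` raises a `ValueError` when the item is no longer in the list, re-running the cell above in the same session fails. A minimal sketch of a re-run-safe alternative is to build a new list with a comprehension; the abbreviated `features` list and the names `weak_predictors` and `final_features` are hypothetical and used only for illustration.

```python
# Abbreviated feature list for illustration; the two names to drop are the
# weak predictors removed in the cell above.
features = ["overall_qual", "foundation", "gr_liv_area", "heating_qc", "garage_area"]
weak_predictors = {"foundation", "heating_qc"}

# Build a new list instead of mutating in place, so re-running is harmless.
final_features = [f for f in features if f not in weak_predictors]
print(final_features)
```

The original list is left untouched, which also makes it easy to compare the feature set before and after filtering.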

#### Final Features for Model Tuning

In [9]:
interesting_features

['neighborhood',
 'overall_qual',
 'year_built',
 'year_remod/add',
 'exterior_1st',
 'mas_vnr_type',
 'exter_qual',
 'bsmt_qual',
 'total_bsmt_sf',
 'gr_liv_area',
 'full_bath',
 'kitchen_qual',
 'fireplaces',
 'garage_area']