### Assignments

As in previous checkpoints, please submit links to two Jupyter notebooks (one for each assignment below).

Please submit links to all your work below. This is not a graded checkpoint, but you should discuss your solutions with your mentor. Also, when you're done, compare your work to these example solutions.

Part 2. House prices model

In this exercise, you'll work on your house prices model. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

    Load the houseprices data from Thinkful's database.
    Run your house prices model again and assess the goodness of fit of your model using F-test, R-squared, adjusted R-squared, AIC and BIC.
    Do you think your model is satisfactory? If so, why?
    In order to improve the goodness of fit of your model, try different model specifications by adding or removing some variables.
    For each model you try, get the goodness of fit metrics and compare your models with each other. Which model is the best and why?



#### Load the houseprices data from Thinkful's database

In [1]:
import numpy as np
import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

import warnings
warnings.filterwarnings('ignore')

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

In [2]:
engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
houseprices_df = pd.read_sql_query('select * from houseprices',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()

In [3]:
houseprices_df['central_air_ohc'] = pd.get_dummies(houseprices_df.centralair, drop_first=True)

mszoning_ohc_df = pd.get_dummies(houseprices_df.mszoning, prefix = 'mszoning', drop_first=True)
houseprices_df = pd.concat([houseprices_df, mszoning_ohc_df], axis=1)

kitchenqual_ohc_df = pd.get_dummies(houseprices_df.kitchenqual, prefix = 'kitchenqual',drop_first=True)
houseprices_df = pd.concat([houseprices_df, kitchenqual_ohc_df], axis=1)

fullbath_ohc_df = pd.get_dummies(houseprices_df.fullbath, prefix = 'fullbath', drop_first=True)
houseprices_df = pd.concat([houseprices_df, fullbath_ohc_df], axis=1)

In [4]:
numerics_df = houseprices_df.select_dtypes(include=[np.number])

In [5]:
np.abs(numerics_df.iloc[:,1:].corr().loc[:,'saleprice']).sort_values(ascending=False)

saleprice          1.000000
overallqual        0.790982
grlivarea          0.708624
garagecars         0.640409
garagearea         0.623431
totalbsmtsf        0.613581
firstflrsf         0.605852
fullbath           0.560664
totrmsabvgrd       0.533723
yearbuilt          0.522897
fullbath_1         0.520796
kitchenqual_TA     0.519298
yearremodadd       0.507101
garageyrblt        0.486362
masvnrarea         0.477493
fireplaces         0.466929
fullbath_2         0.425672
bsmtfinsf1         0.386420
lotfrontage        0.351799
wooddecksf         0.324413
kitchenqual_Gd     0.321641
fullbath_3         0.319596
secondflrsf        0.319334
openporchsf        0.315856
mszoning_RM        0.288065
halfbath           0.284108
lotarea            0.263843
central_air_ohc    0.251328
mszoning_RL        0.245063
bsmtfullbath       0.227122
bsmtunfsf          0.214479
bedroomabvgr       0.168213
kitchenqual_Fa     0.157199
kitchenabvgr       0.135907
enclosedporch      0.128578
screenporch        0

Choosing only variables with corr >0.5:
overallqual        0.790982,
grlivarea          0.708624,
garagecars         0.640409,
garagearea         0.623431,
totalbsmtsf        0.613581,
firstflrsf         0.605852,
fullbath           0.560664,
totrmsabvgrd       0.533723,
yearbuilt          0.522897,
fullbath_1         0.520796,
kitchenqual_TA     0.519298,
yearremodadd       0.507101

In [6]:
features = ['overallqual', 'grlivarea', 'garagecars',
            'garagearea', 'totalbsmtsf', 'firstflrsf', 
            'fullbath', 'totrmsabvgrd', 'yearbuilt',
            'fullbath_1', 'kitchenqual_TA', 'yearremodadd']

##### Run your house prices model again (M1)
Assess the goodness of fit of the model using the F-test, R-squared, adjusted R-squared, AIC and BIC.

In [7]:
import statsmodels.api as sm

# Target and the features selected above
Y = houseprices_df['saleprice']
X = houseprices_df[features]

# Note: sm.OLS does not add an intercept on its own, and none is added here
results = sm.OLS(Y, X).fit()

results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | saleprice | R-squared: | 0.963 |
| Model: | OLS | Adj. R-squared: | 0.962 |
| Method: | Least Squares | F-statistic: | 3121.0 |
| Date: | Wed, 23 Oct 2019 | Prob (F-statistic): | 0.0 |
| Time: | 11:37:24 | Log-Likelihood: | -17472.0 |
| No. Observations: | 1460 | AIC: | 34970.0 |
| Df Residuals: | 1448 | BIC: | 35030.0 |
| Df Model: | 12 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| overallqual | 2.125e+04 | 1179.331 | 18.016 | 0.000 | 1.89e+04 | 2.36e+04 |
| grlivarea | 44.4716 | 4.244 | 10.480 | 0.000 | 36.147 | 52.796 |
| garagecars | 1.434e+04 | 3065.587 | 4.677 | 0.000 | 8323.571 | 2.04e+04 |
| garagearea | 11.9567 | 10.426 | 1.147 | 0.252 | -8.495 | 32.408 |
| totalbsmtsf | 22.2268 | 4.320 | 5.145 | 0.000 | 13.753 | 30.700 |
| firstflrsf | 12.8085 | 4.970 | 2.577 | 0.010 | 3.060 | 22.557 |
| fullbath | 1.389e+04 | 4912.915 | 2.828 | 0.005 | 4254.880 | 2.35e+04 |
| totrmsabvgrd | -8.9566 | 1133.766 | -0.008 | 0.994 | -2232.956 | 2215.043 |
| yearbuilt | 126.9411 | 46.524 | 2.729 | 0.006 | 35.680 | 218.203 |

| | | | |
|---|---|---|---|
| Omnibus: | 442.552 | Durbin-Watson: | 1.977 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 40022.268 |
| Skew: | -0.374 | Prob(JB): | 0.0 |
| Kurtosis: | 28.639 | Cond. No. | 24800.0 |


**Assess the goodness of fit of your model using F-test, R-squared, adjusted R-squared, AIC and BIC.**

**F-statistic: 3121,  Prob (F-statistic): 0.00.**
The p-value of the F-statistic is reported as 0.00, so the features jointly provide useful information compared with a reduced model that has no features.
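
For reference, the overall F-statistic can be approximated from the R-squared and the degrees of freedom shown in the summary. The sketch below is illustrative only: `f_from_r2` is a hypothetical helper, and because it starts from the rounded R-squared of 0.963 it lands near, not exactly on, the reported 3121.

```python
def f_from_r2(r2, df_model, df_resid):
    """Overall F-statistic implied by R-squared and the degrees of freedom."""
    explained_per_df = r2 / df_model
    unexplained_per_df = (1.0 - r2) / df_resid
    return explained_per_df / unexplained_per_df

# Values read from the M1 summary (R-squared is rounded to three decimals).
f_m1 = f_from_r2(0.963, df_model=12, df_resid=1448)
print(round(f_m1))
```

The same formula shows why the F-statistic can rise when features are dropped: with R-squared unchanged, a smaller `df_model` makes the explained variation per feature larger.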

**R-squared: 0.963,  Adj. R-squared: 0.962.**
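The Adj R-sq value of 0.962 suggests that the model explains about 96.2% of the variation in house prices. A value this high may be an indication of overfitting. Note also that no constant was added to X, so statsmodels reports the uncentered R-squared, which is typically much higher than the usual centered value and should be read with caution.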
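
To see how far the centered and uncentered definitions of R-squared can diverge, here is a toy sketch with made-up numbers (all names and data are hypothetical). It fits a line through the origin, as `sm.OLS` does when no constant column is supplied (for example via `sm.add_constant`), and computes both versions by hand.

```python
# Toy data (made up): y hovers around 200 and is unrelated to x.
x = [2000, 2001, 2002, 2003, 2004]
y = [210, 190, 205, 195, 200]

# Least-squares slope for a line forced through the origin (no intercept).
slope = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
ssr = sum((yi - slope * xi) ** 2 for xi, yi in zip(x, y))

y_mean = sum(y) / len(y)
uncentered_tss = sum(yi ** 2 for yi in y)            # variation around zero
centered_tss = sum((yi - y_mean) ** 2 for yi in y)   # variation around the mean

r2_uncentered = 1 - ssr / uncentered_tss
r2_centered = 1 - ssr / centered_tss

print(f"uncentered R^2: {r2_uncentered:.3f}")
print(f"centered R^2:   {r2_centered:.3f}")
```

In this toy case the uncentered value is close to 1 even though x carries essentially no information about y, which is why an R-squared from a no-constant model is hard to compare with the usual one.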

**AIC: 3.497e+04,  BIC: 3.503e+04**
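On their own, these values are hard to interpret: AIC and BIC are only meaningful when compared across models fit to the same data, where lower is better. The BIC exceeds the AIC simply because it penalizes each parameter by ln(1460) ≈ 7.3 rather than 2, so the gap reflects the number of features and is not by itself a sign of overfitting.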
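
As a rough check, both criteria can be recomputed from the log-likelihood in the summary. This is a minimal sketch with a hypothetical helper `aic_bic`; it uses the rounded log-likelihood of -17472, so the results agree with the summary only up to rounding.

```python
import math

def aic_bic(log_likelihood, n_params, n_obs):
    """Return (AIC, BIC) from a model's log-likelihood."""
    aic = 2 * n_params - 2 * log_likelihood
    bic = n_params * math.log(n_obs) - 2 * log_likelihood
    return aic, bic

# M1 summary values: log-likelihood -17472, 12 parameters, 1460 observations.
aic, bic = aic_bic(-17472.0, n_params=12, n_obs=1460)
print(round(aic), round(bic))
```

Both criteria share the `-2 * log_likelihood` term and differ only in the per-parameter penalty.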


#### Do you think your model is satisfactory? If so, why?

The model appears satisfactory in the sense that the Adj R-sq value is high and the p-value of the F-statistic is 0. However, some features are not statistically significant, and there are indications of possible overfitting.

#### In order to improve the goodness of fit of your model, try different model specifications by adding or removing some variables.

**Which features are statistically significant, and which are not?**

Significant: 'overallqual', 'grlivarea', 'garagecars', 'totalbsmtsf', 'firstflrsf', 'fullbath', 'yearbuilt', 'fullbath_1', 'kitchenqual_TA', 'yearremodadd'

Not significant: 'garagearea', 'totrmsabvgrd'
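
The split above can be reproduced mechanically. The sketch below copies the p-values from the rows shown in the M1 coefficient table into a plain dict and applies a 0.05 cutoff; the cutoff is an assumption (the conventional level), and on a fitted statsmodels result the same values are available as `results.pvalues`.

```python
# p-values copied from the rows shown in the M1 coefficient table.
p_values = {
    'overallqual': 0.000, 'grlivarea': 0.000, 'garagecars': 0.000,
    'garagearea': 0.252, 'totalbsmtsf': 0.000, 'firstflrsf': 0.010,
    'fullbath': 0.005, 'totrmsabvgrd': 0.994, 'yearbuilt': 0.006,
}
alpha = 0.05  # conventional significance level (an assumption)

# Keep the features whose p-value does not clear the cutoff.
not_significant = [name for name, p in p_values.items() if p >= alpha]
print(not_significant)
```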

##### Model 2 (M2)

In [8]:
Y = houseprices_df['saleprice']

X = houseprices_df[['overallqual', 'grlivarea', 'garagecars',
            'totalbsmtsf', 'firstflrsf', 'fullbath', 'yearbuilt',
            'fullbath_1', 'kitchenqual_TA', 'yearremodadd']]

results = sm.OLS(Y, X).fit()

results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | saleprice | R-squared: | 0.963 |
| Model: | OLS | Adj. R-squared: | 0.962 |
| Method: | Least Squares | F-statistic: | 3747.0 |
| Date: | Wed, 23 Oct 2019 | Prob (F-statistic): | 0.0 |
| Time: | 11:57:55 | Log-Likelihood: | -17473.0 |
| No. Observations: | 1460 | AIC: | 34970.0 |
| Df Residuals: | 1450 | BIC: | 35020.0 |
| Df Model: | 10 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| overallqual | 2.123e+04 | 1177.093 | 18.036 | 0.000 | 1.89e+04 | 2.35e+04 |
| grlivarea | 44.7407 | 3.017 | 14.829 | 0.000 | 38.822 | 50.659 |
| garagecars | 1.717e+04 | 1804.843 | 9.515 | 0.000 | 1.36e+04 | 2.07e+04 |
| totalbsmtsf | 22.6440 | 4.291 | 5.277 | 0.000 | 14.226 | 31.061 |
| firstflrsf | 13.2023 | 4.957 | 2.664 | 0.008 | 3.479 | 22.925 |
| fullbath | 1.384e+04 | 4903.414 | 2.822 | 0.005 | 4218.682 | 2.35e+04 |
| yearbuilt | 128.5306 | 46.416 | 2.769 | 0.006 | 37.480 | 219.581 |
| fullbath_1 | 2.02e+04 | 5203.250 | 3.882 | 0.000 | 9993.598 | 3.04e+04 |
| kitchenqual_TA | -1.746e+04 | 2460.242 | -7.096 | 0.000 | -2.23e+04 | -1.26e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 431.663 | Durbin-Watson: | 1.976 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 37480.171 |
| Skew: | -0.336 | Prob(JB): | 0.0 |
| Kurtosis: | 27.813 | Cond. No. | 24600.0 |


#### For each model you try, get the goodness of fit metrics and compare your models with each other. Which model is the best and why?


**Assess the goodness of fit of each model using F-test, R-squared, adjusted R-squared, AIC and BIC. Which model is the best and why?**

**M1: F-statistic: 3121,  Prob (F-statistic): 0.00.
M2: F-statistic:  3747,   Prob (F-statistic): 0.00.**

The p-value of the F-statistic is 0.00 in each model, so the features jointly provide useful information compared with a reduced model that has no features.

**M1:  R-squared: 0.963,  Adj. R-squared: 0.962.
M2:  R-squared: 0.963,  Adj. R-squared: 0.962**

The R-sq and Adj R-sq values remained the same from M1 to M2. The Adj R-sq value of 96.2% appears to indicate a successful model; however, a value this high may be an indication of overfitting.
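
The adjustment itself is simple to reproduce. The following is a sketch only, with a hypothetical helper `adjusted_r2`; it assumes the models were fit without a constant (as in the cells above) and starts from the rounded R-squared of 0.963, so it will not match the summary to the last digit.

```python
def adjusted_r2(r2, n_obs, df_resid, has_constant=True):
    """Adjusted R-squared; the total df drops by one when a constant is fit."""
    df_total = n_obs - 1 if has_constant else n_obs
    return 1 - (1 - r2) * df_total / df_resid

# M1 and M2 were fit without a constant; R-squared is the rounded summary value.
adj_m1 = adjusted_r2(0.963, n_obs=1460, df_resid=1448, has_constant=False)
adj_m2 = adjusted_r2(0.963, n_obs=1460, df_resid=1450, has_constant=False)
print(f"{adj_m1:.4f} {adj_m2:.4f}")
```

With 1460 observations and only a dozen features, the penalty is tiny, which is why R-squared and adjusted R-squared are nearly identical in every summary here.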

**M1: AIC: 3.497e+04,  BIC: 3.503e+04, 
M2: AIC: 3.497e+04,  BIC: 3.502e+04.**

There was little movement in these values. The BIC decreased slightly, which may be an indication of an improved model.

**Features**
All features remain statistically significant. I am mostly curious about the effect of 'fullbath', whose coefficient decreased slightly in this model, and 'fullbath_1', whose coefficient increased slightly. I will remove 'fullbath' and perhaps add the more specific number of full baths.

I am also curious about the effect of 'kitchenqual' on the model. As 'kitchenqual_TA' has a negative effect on the price of the house, I wonder if including the other quality levels will improve the model.

##### Model 3 (M3)

In [9]:
Y = houseprices_df['saleprice']

X = houseprices_df[['overallqual', 'grlivarea', 'garagecars',
                    'totalbsmtsf', 'firstflrsf', 'fullbath', 
                    'yearbuilt', 'fullbath_1', 'kitchenqual_TA',
                    'yearremodadd', 'fullbath_2', 'fullbath_3',
                    'kitchenqual_Gd', 'kitchenqual_Fa']]

results = sm.OLS(Y, X).fit()

results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | saleprice | R-squared: | 0.967 |
| Model: | OLS | Adj. R-squared: | 0.967 |
| Method: | Least Squares | F-statistic: | 3269.0 |
| Date: | Wed, 23 Oct 2019 | Prob (F-statistic): | 0.0 |
| Time: | 12:18:10 | Log-Likelihood: | -17383.0 |
| No. Observations: | 1460 | AIC: | 34790.0 |
| Df Residuals: | 1447 | BIC: | 34860.0 |
| Df Model: | 13 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| overallqual | 1.8e+04 | 1167.016 | 15.421 | 0.000 | 1.57e+04 | 2.03e+04 |
| grlivarea | 40.8509 | 2.889 | 14.139 | 0.000 | 35.184 | 46.518 |
| garagecars | 1.587e+04 | 1724.371 | 9.202 | 0.000 | 1.25e+04 | 1.93e+04 |
| totalbsmtsf | 19.0031 | 4.049 | 4.693 | 0.000 | 11.061 | 26.945 |
| firstflrsf | 10.2441 | 4.680 | 2.189 | 0.029 | 1.065 | 19.423 |
| fullbath | -2618.0818 | 5078.396 | -0.516 | 0.606 | -1.26e+04 | 7343.723 |
| yearbuilt | 173.6178 | 43.828 | 3.961 | 0.000 | 87.644 | 259.591 |
| fullbath_1 | -2.539e+04 | 7471.084 | -3.398 | 0.001 | -4e+04 | -1.07e+04 |
| kitchenqual_TA | -6.423e+04 | 4748.507 | -13.526 | 0.000 | -7.35e+04 | -5.49e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 647.212 | Durbin-Watson: | 2.02 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 52467.764 |
| Skew: | -1.157 | Prob(JB): | 0.0 |
| Kurtosis: | 32.277 | Cond. No. | 5.46e+19 |


**Assess the goodness of fit of each model using F-test, R-squared, adjusted R-squared, AIC and BIC. Which model is the best and why?**

**M1: F-statistic: 3121,  Prob (F-statistic): 0.00,
M2: F-statistic:  3747,   Prob (F-statistic): 0.00,
M3: F-statistic:  3269,   Prob (F-statistic): 0.00.**

The p-value of the F-statistic is 0.00 in each model, so the features jointly provide useful information compared with a reduced model that has no features.

**M1:  R-squared: 0.963,  Adj. R-squared: 0.962,
M2:  R-squared: 0.963,  Adj. R-squared: 0.962,
M3:  R-squared: 0.967,  Adj. R-squared: 0.967**

The R-sq and Adj R-sq values show an improvement from M2 to M3. The Adj R-sq value of 96.7% appears to indicate a successful model; however, a value this high may be an indication of overfitting.

**M1: AIC: 3.497e+04,  BIC: 3.503e+04, 
M2: AIC: 3.497e+04,  BIC: 3.502e+04,
M3: AIC: 3.479e+04,  BIC: 3.486e+04.**

Both the AIC and BIC have decreased, which may be an indication of an improved model.  

**Features**
All added kitchen and bathroom features are statistically significant. All other features are significant except 'fullbath', which will be removed.

Although 'firstflrsf' remains significant, it is less so in this model. Due to its reduction in significance and the fact that it likely overlaps with 'totalbsmtsf', I will remove it as well.

##### Model 4 (M4)

In [10]:
# remove 'fullbath' and 'firstflrsf'
Y = houseprices_df['saleprice']

X = houseprices_df[['overallqual', 'grlivarea', 'garagecars',
                    'totalbsmtsf', 'yearbuilt', 'fullbath_1',
                    'kitchenqual_TA', 'yearremodadd', 
                    'fullbath_2', 'fullbath_3',
                    'kitchenqual_Gd', 'kitchenqual_Fa']]

results = sm.OLS(Y, X).fit()

results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | saleprice | R-squared: | 0.967 |
| Model: | OLS | Adj. R-squared: | 0.967 |
| Method: | Least Squares | F-statistic: | 3532.0 |
| Date: | Wed, 23 Oct 2019 | Prob (F-statistic): | 0.0 |
| Time: | 12:35:48 | Log-Likelihood: | -17385.0 |
| No. Observations: | 1460 | AIC: | 34790.0 |
| Df Residuals: | 1448 | BIC: | 34860.0 |
| Df Model: | 12 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| overallqual | 1.766e+04 | 1158.052 | 15.246 | 0.000 | 1.54e+04 | 1.99e+04 |
| grlivarea | 42.7740 | 2.756 | 15.521 | 0.000 | 37.368 | 48.180 |
| garagecars | 1.624e+04 | 1718.227 | 9.452 | 0.000 | 1.29e+04 | 1.96e+04 |
| totalbsmtsf | 25.5907 | 2.712 | 9.435 | 0.000 | 20.270 | 30.911 |
| yearbuilt | 169.1023 | 43.837 | 3.858 | 0.000 | 83.112 | 255.093 |
| fullbath_1 | -2.825e+04 | 1.21e+04 | -2.337 | 0.020 | -5.2e+04 | -4539.149 |
| kitchenqual_TA | -6.461e+04 | 4751.495 | -13.598 | 0.000 | -7.39e+04 | -5.53e+04 |
| yearremodadd | -150.1688 | 43.282 | -3.470 | 0.001 | -235.070 | -65.267 |
| fullbath_2 | -3.195e+04 | 1.22e+04 | -2.612 | 0.009 | -5.59e+04 | -7960.803 |

| | | | |
|---|---|---|---|
| Omnibus: | 652.349 | Durbin-Watson: | 2.017 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 53848.549 |
| Skew: | -1.169 | Prob(JB): | 0.0 |
| Kurtosis: | 32.66 | Cond. No. | 76700.0 |


In [11]:
# remove 'fullbath_3', as it is not significant
Y = houseprices_df['saleprice']

X = houseprices_df[['overallqual', 'grlivarea', 'garagecars',
                    'totalbsmtsf', 'yearbuilt', 'fullbath_1',
                    'kitchenqual_TA', 'yearremodadd', 
                    'fullbath_2', 'kitchenqual_Gd', 'kitchenqual_Fa']]

results = sm.OLS(Y, X).fit()

results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | saleprice | R-squared: | 0.967 |
| Model: | OLS | Adj. R-squared: | 0.967 |
| Method: | Least Squares | F-statistic: | 3851.0 |
| Date: | Wed, 23 Oct 2019 | Prob (F-statistic): | 0.0 |
| Time: | 12:36:37 | Log-Likelihood: | -17386.0 |
| No. Observations: | 1460 | AIC: | 34790.0 |
| Df Residuals: | 1449 | BIC: | 34850.0 |
| Df Model: | 11 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| overallqual | 1.772e+04 | 1157.152 | 15.310 | 0.000 | 1.54e+04 | 2e+04 |
| grlivarea | 43.5988 | 2.672 | 16.317 | 0.000 | 38.357 | 48.840 |
| garagecars | 1.615e+04 | 1716.923 | 9.406 | 0.000 | 1.28e+04 | 1.95e+04 |
| totalbsmtsf | 25.4047 | 2.709 | 9.379 | 0.000 | 20.092 | 30.718 |
| yearbuilt | 171.4779 | 43.801 | 3.915 | 0.000 | 85.558 | 257.398 |
| fullbath_1 | -4.072e+04 | 6432.167 | -6.331 | 0.000 | -5.33e+04 | -2.81e+04 |
| kitchenqual_TA | -6.457e+04 | 4752.203 | -13.588 | 0.000 | -7.39e+04 | -5.53e+04 |
| yearremodadd | -146.7210 | 43.196 | -3.397 | 0.001 | -231.455 | -61.987 |
| fullbath_2 | -4.499e+04 | 5933.184 | -7.583 | 0.000 | -5.66e+04 | -3.34e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 652.043 | Durbin-Watson: | 2.016 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 54797.252 |
| Skew: | -1.163 | Prob(JB): | 0.0 |
| Kurtosis: | 32.923 | Cond. No. | 32200.0 |


**Assess the goodness of fit of each model using F-test, R-squared, adjusted R-squared, AIC and BIC. Which model is the best and why?**

**M1: F-statistic: 3121,  Prob (F-statistic): 0.00,
M2: F-statistic:  3747,   Prob (F-statistic): 0.00,
M3: F-statistic:  3269,   Prob (F-statistic): 0.00,
M4: F-statistic:  3851,   Prob (F-statistic): 0.00**

The p-value of the F-statistic is 0.00 in each model, so the features jointly provide useful information compared with a reduced model that has no features.

**M1:  R-squared: 0.963,  Adj. R-squared: 0.962,
M2:  R-squared: 0.963,  Adj. R-squared: 0.962,
M3:  R-squared: 0.967,  Adj. R-squared: 0.967,
M4:  R-squared: 0.967,  Adj. R-squared: 0.967.**

The R-sq and Adj R-sq values show an improvement from M2 to M3 and remain the same from M3 to M4. The Adj R-sq value of 96.7% appears to indicate a successful model; however, a value this high may be an indication of overfitting.

**M1: AIC: 3.497e+04,  BIC: 3.503e+04, 
M2: AIC: 3.497e+04,  BIC: 3.502e+04,
M3: AIC: 3.479e+04,  BIC: 3.486e+04,
M4: AIC: 3.479e+04,  BIC: 3.485e+04.**

The BIC has decreased, which may be an indication of an improved model.  
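
The comparison can also be written down as a small ranking. This is a minimal sketch that uses the rounded AIC and BIC values reported in the four summaries; ranking by BIC first and breaking ties with AIC is one reasonable choice, not the only one.

```python
# (AIC, BIC) pairs as reported in the four model summaries.
metrics = {
    'M1': (34970, 35030),
    'M2': (34970, 35020),
    'M3': (34790, 34860),
    'M4': (34790, 34850),
}

# Rank by BIC, using AIC to break ties; lower is better for both.
ranking = sorted(metrics, key=lambda m: (metrics[m][1], metrics[m][0]))
best = ranking[0]
print(ranking)
```

On these numbers M4 comes out first, although its margin over M3 is small relative to the rounding in the summaries.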

**Features**
All features remain statistically significant.

Features which have increased in impact: 
'garagecars', 'totalbsmtsf', 'yearbuilt', 'fullbath_1' (although now a negative impact).

The impact of the full bath features is interesting. In the original model, the 'fullbath' feature added 1.389e+04 and the 'fullbath_1' feature added 1.986e+04. With the inclusion of the other full bath features, 'fullbath_1' now has a larger but negative impact of -4.072e+04, and a second full bath ('fullbath_2') appears to reduce the value even further (-4.499e+04). I do not understand this. One possible explanation is that dummy coefficients are measured relative to the omitted categories, so a negative sign would mean a lower price than that reference group, not that a bathroom lowers the price.

The addition of all the kitchen quality features has increased the magnitude of each coefficient, and shows that the impact depends on the quality of the kitchen. Relative to the quality level dropped by `drop_first=True`, a good quality kitchen reduces the value less than typical and fair quality kitchens do.

M1 'kitchenqual_TA' = -1.736e+04,
M4 'kitchenqual_TA' = -6.457e+04,
M4 'kitchenqual_Fa' = -6.299e+04, and
M4 'kitchenqual_Gd' = -4.831e+04.
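
The negative signs on the dummy features are easier to read once the reference level is made explicit. The sketch below uses made-up prices and hypothetical labels, and it assumes the simplest case: a model with an intercept and no regressors other than the dummies. In that case each dummy coefficient equals the gap between that group's mean price and the mean of the left-out group.

```python
from statistics import mean

# Made-up sale prices grouped by a hypothetical kitchen-quality label.
prices = {
    'Ex': [320_000, 300_000, 340_000],
    'Gd': [210_000, 190_000],
    'TA': [150_000, 130_000, 140_000],
}
reference = 'Ex'  # the level left out of the dummies

# With an intercept and only these dummies, each OLS coefficient equals
# the gap between that group's mean price and the reference group's mean.
baseline = mean(prices[reference])
effects = {level: mean(vals) - baseline
           for level, vals in prices.items() if level != reference}
print(effects)
```

Every effect is negative here simply because the reference group is the most expensive one. The models above have other features and no intercept, so their coefficients will not reduce to group means this cleanly.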
