# 19.04 Interpreting Estimated Coefficients
## Assignment 03 House Prices Model

In this exercise, you'll interpret your house prices model. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Run your house prices model again and interpret the results. Which features are statistically significant, and which are not?
* Now, exclude the insignificant features from your model. Did anything change?
* Interpret the statistically significant coefficients by quantifying their relations with the house prices. Which features have a more prominent effect on house prices?
* Do the results sound reasonable to you? If not, try to explain the potential reasons.

### Load the houseprices Data

In [1]:
import warnings

import numpy as np 
import pandas as pd 
import statsmodels.formula.api as smf
import statsmodels.api as sm 
import matplotlib.pyplot as plt 
import seaborn as sns

from sklearn import linear_model
from sqlalchemy import create_engine 
from sqlalchemy.engine.url import URL 
import scipy.stats as stats 
from scipy.stats import bartlett
from scipy.stats import levene
from scipy.stats.stats import pearsonr
from scipy.stats import jarque_bera
from scipy.stats import normaltest
from scipy.stats import percentileofscore # Ref 17.10 challenge
from scipy.stats.mstats import winsorize # Ref 17.10 challenge
from statsmodels.tsa.stattools import acf

warnings.filterwarnings(action="ignore")

kagle = dict(
    drivername = "postgresql",
    username = "dsbc_student",
    password = "7*.8G9QH21",
    host = "142.93.121.174",
    port = "5432",
    database = "houseprices"
)

In [2]:
# Load the data from the "houseprices" database
engine=create_engine(URL(**kagle), echo=True)

houses_raw=pd.read_sql_query("SELECT * FROM houseprices", con=engine)

engine.dispose()

2020-01-06 09:07:36,522 INFO sqlalchemy.engine.base.Engine select version()
2020-01-06 09:07:36,531 INFO sqlalchemy.engine.base.Engine {}
2020-01-06 09:07:36,645 INFO sqlalchemy.engine.base.Engine select current_schema()
2020-01-06 09:07:36,647 INFO sqlalchemy.engine.base.Engine {}
2020-01-06 09:07:36,754 INFO sqlalchemy.engine.base.Engine SELECT CAST('test plain returns' AS VARCHAR(60)) AS anon_1
2020-01-06 09:07:36,755 INFO sqlalchemy.engine.base.Engine {}
2020-01-06 09:07:36,807 INFO sqlalchemy.engine.base.Engine SELECT CAST('test unicode returns' AS VARCHAR(60)) AS anon_1
2020-01-06 09:07:36,809 INFO sqlalchemy.engine.base.Engine {}
2020-01-06 09:07:36,863 INFO sqlalchemy.engine.base.Engine show standard_conforming_strings
2020-01-06 09:07:36,864 INFO sqlalchemy.engine.base.Engine {}
2020-01-06 09:07:36,971 INFO sqlalchemy.engine.base.Engine SELECT * FROM houseprices
2020-01-06 09:07:36,973 INFO sqlalchemy.engine.base.Engine {}


In [28]:
houses_working = houses_raw.copy()

Select a subset of variables on which to base the model on 

In [29]:
houses_df = houses_working[["neighborhood","overallqual","lotarea",
                            "totalbsmtsf","firstflrsf","grlivarea",
                            "totrmsabvgrd","garagecars","saleprice"]]

Create a set of dummy columns for the two categorical columns, neighborhood and overallqual

In [30]:
# Create a set of dummies for the neighborhood variable, prefix the dummies with "neighborhood"
houses_df = pd.concat([houses_df, pd.get_dummies(houses_df["neighborhood"], prefix="neighborhood",drop_first=True)], axis=1)

# Create a set of dumies for the overallqual variable, previs the dummies with "overallqual"
houses_df = pd.concat([houses_df, pd.get_dummies(houses_df["overallqual"], prefix="overallqual",drop_first=True)], axis=1)

In [31]:
# Get a list of column names to be used for feature consideration
feature_names = houses_df.iloc[:,2:].columns.to_list()

# Pop saleprice from the list of feature_names
feature_names.pop(6)

'saleprice'

In [42]:
# Y is the target variable
Y = houses_df["saleprice"]

# X is the feature set
X = houses_df[feature_names]

# Add a constant to the model
X = sm.add_constant(X)

# Fit an OLS model using statsmodel
results = sm.OLS(Y,X).fit()

# Print the results
print(results.summary())

# Tear out the columns that I'm intersted in comparing 
first_model = results.summary2().tables[1]
first_model = first_model[["Coef.","P>|t|"]].round(4)

                            OLS Regression Results                            
Dep. Variable:              saleprice   R-squared:                       0.834
Model:                            OLS   Adj. R-squared:                  0.829
Method:                 Least Squares   F-statistic:                     182.7
Date:                Mon, 06 Jan 2020   Prob (F-statistic):               0.00
Time:                        11:25:01   Log-Likelihood:                -17234.
No. Observations:                1460   AIC:                         3.455e+04
Df Residuals:                    1420   BIC:                         3.476e+04
Df Model:                          39                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                  2.73e+04 

The estimated model for houseprices data is: 
$$
saleprice = 27300 + 0.5104 lotarea + 15.2633 totalbsmtsf + 2.3280 firstflrsf + 41.6097 grlivarea + 9.0013 totrmsabvgrd + 12340 garagecars - 16070 neighborhood\_Blueste -24190 neighborhood\_BrDale - 8703.2186 neighborhood\_BrkSide + 18350 neighborhood\_ClearCr +  11660 neighborhood\_CollgCr + 23850 neighborhood\_Crawfor - 18140 neighborhood\_Edwards + 7306.2099 neighborhood\_Gilbert  - 25990 neighborhood\_IDOTRR - 15600 neighborhood\_MeadowV - 2017.7272 neighborhood\_Mitchel - 5003.4636 neighborhood\_NAmes - 11070 neighborhood\_NPkVill - 277.0948 neighborhood\_NWAmes + 56770 neighborhood\_NoRidge + 40100 neighborhood\_NridgHt - 27180 neighborhood\_OldTown - 19330 neighborhood\_SWISU - 4799.7841 neighborhood\_Sawyer + 5665.8848 neighborhood\_SawyerW + 17120 neighborhood\_Somerst + 50380 neighborhood\_StoneBr + 13770 neighborhood\_Timber + 35270 neighborhood\_Veenker - 205.3532 overallqual\_2 + 2447.1836 overallqual\_3 + 15760 overallqual\_4 + 22430 overallqual\_5 + 31540 overallqual\_6 + 48460 overallqual\_7 + 78650 overallqual\_8 + 148100 overallqual\_9 + 172300 overallqual\_10
$$

As you can see from the report above, there are several features that are underperforming.  The p-values for the following features are $ > 0.61 $ and will be excluded from the second fitting for the model: firstflrsf (0.61), totrmsabvgrd (0.993), neighborhood_Mitchel (0.833), neighborhood_NWAmesd (0.976), neighborhood_Sawyer (0.607), overallqual_2 (0.995), \& overallqual_3 (0.921).  Hopefully, it will yield a better performing model.

In [39]:
insignfigant_features = ["firstflrsf", "totrmsabvgrd", "neighborhood_Mitchel", 
                         "neighborhood_NWAmesd", "neighborhood_Sawyer", 
                         "overallqual_2", "overallqual_3"]

features2 = [i for i in feature_names if not i in insignfigant_features or insignfigant_features.remove(i)]

## Y is the target variable
Y = houses_df["saleprice"]

# X is the feature set
X = houses_df[features2]

# Add a constant to the model
X = sm.add_constant(X)

# Fit an OLS model using statsmodel
results2 = sm.OLS(Y,X).fit()

# Print the results
print(results.summary2())

# Tear out the columns that I'm interested in comparing
second_model = results2.summary2().tables[1]
second_model = second_model[["Coef.","P>|t|"]].round(4)

                         Results: Ordinary least squares
Model:                   OLS                   Adj. R-squared:          0.829     
Dependent Variable:      saleprice             AIC:                     34547.4614
Date:                    2020-01-06 11:22      BIC:                     34758.9091
No. Observations:        1460                  Log-Likelihood:          -17234.   
Df Model:                39                    F-statistic:             182.7     
Df Residuals:            1420                  Prob (F-statistic):      0.00      
R-squared:               0.834                 Scale:                   1.0773e+09
----------------------------------------------------------------------------------
                        Coef.     Std.Err.     t    P>|t|     [0.025      0.975]  
----------------------------------------------------------------------------------
const                 27300.6254 25016.0006  1.0913 0.2753 -21771.6618  76372.9127
lotarea                   0.51

The model minus the insignfigant features is: 
$$
saleprice = 26840 + 0.5103 lotarea + 16.7071 totalbsmtsf + 42.0095 grlivarea + 12520 garagecars -13610 neighborhood\_Blueste - 21480 neighborhood\_BrDale - 5605.4957 neighborhood\_BrkSide + 21370 neighborhood\_ClearCr + 14430 neighborhood\_CollgCr + 27050 neighborhood\_Crawfor - 14810 neighborhood\_Edwards + 9926.6190 neighborhood\_Gilbert - 22980 neighborhood\_IDOTRR - 12590 neighborhood\_MeadowV - 1690.7554 neighborhood\_NAmes - 8314.9794 neighborhood\_NPkVill + 2818.6992 neighborhood\_NWAmes + 59280 neighborhood\_NoRidge + 42800 neighborhood\_NridgHt - 24100 neighborhood\_OldTown - 16280 neighborhood\_SWISU + 8718.1671 neighborhood\_SawyerW +19680 neighborhood\_Somerst + 53160 neighborhood\_StoneBr + 16750 neighborhood\_Timber + 38370 neighborhood\_Veenker + 13450 overallqual\_4 + 20020 overallqual\_5 + 29230 overallqual\_6 + 46380 overallqual\_7 + 76600 overallqual\_8 + 146100 overallqual\_9 + 170100 overallqual\_10   
$$

Removing the "insignfigant features" from the model caused changes to make the model less predictive.  I would need to return to feature engineering to, possibly, add some new features to recover perormance.

Do a side-by-side comparison of models 1 & 2 to see the impact of removing the "insignifigant features"

In [37]:
comparison = first_model.merge(second_model, 
                               how="left", 
                               left_index=True, 
                               right_index=True, 
                               suffixes=("_firstm","_secondm"))

comparison[["Coef._firstm","Coef._secondm","P>|t|_firstm","P>|t|_secondm"]].round(4)

Unnamed: 0,Coef._firstm,Coef._secondm,P>|t|_firstm,P>|t|_secondm
const,27300.6254,26835.9662,0.2753,0.0007
lotarea,0.5104,0.5103,0.0,0.0
totalbsmtsf,15.2633,16.7071,0.0,0.0
firstflrsf,2.328,,0.6096,
grlivarea,41.6097,42.0095,0.0,0.0
totrmsabvgrd,9.0013,,0.9928,
garagecars,12339.851,12524.1768,0.0,0.0
neighborhood_Blueste,-16072.3589,-13609.4957,0.5174,0.5616
neighborhood_BrDale,-24191.2205,-21481.4665,0.0437,0.0152
neighborhood_BrkSide,-8703.2186,-5605.4957,0.3641,0.2864
