# Evaluate House Price Models Performances

This is part two of two of the evaluating performance assignment.
1. [Evaluate Weather Models Performances](https://github.com/philbowman212/Thinkful_repo/blob/master/assignments/3_supervised_learning/regression_problems/eval_temp_perf.ipynb)
2. [Evaluate House Price Models Performances](https://github.com/philbowman212/Thinkful_repo/blob/master/assignments/3_supervised_learning/regression_problems/eval_hp_perf.ipynb)

### House prices model

In this exercise, you'll work on your house prices model. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Run your house prices model again and assess the goodness of fit of your model using F-test, R-squared, adjusted R-squared, AIC and BIC.
* Do you think your model is satisfactory? If so, why?
* In order to improve the goodness of fit of your model, try different model specifications by adding or removing some variables. 
* For each model you try, get the goodness of fit metrics and compare your models with each other. Which model is the best and why?

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

In [2]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

df = pd.read_sql_query('select * from houseprices',con=engine)

engine.dispose()

Cleaning...

In [3]:
df.drop(columns='id', inplace=True)

In [4]:
cat_list = ['mssubclass','mszoning','street','alley','lotshape','landcontour','utilities','lotconfig','landslope','neighborhood','condition1','condition2',
            'bldgtype','housestyle','roofstyle','roofmatl','exterior1st','exterior2nd','masvnrtype','exterqual','extercond','foundation','bsmtqual','bsmtcond',
            'bsmtexposure','bsmtfintype1','bsmtfintype2','heating','heatingqc','centralair','electrical','kitchenqual','functional','fireplacequ','garagetype',
            'garagefinish','garagequal','garagecond','paveddrive','poolqc','fence','miscfeature','saletype','salecondition', 'overallqual', 'overallcond', 
            'yearbuilt', 'yearremodadd', 'mosold', 'yrsold']
for var in cat_list:
    df[var] = df[var].astype('category')

In [5]:
def add_cat_fillna(variable, new_cat='None'):
    # Register new_cat as a valid category first; fillna on a categorical column fails otherwise.
    df[variable] = df[variable].cat.add_categories(new_cat).fillna(new_cat)

In [6]:
nulls_list = ['alley','bsmtqual','bsmtcond','bsmtexposure','bsmtfintype1','bsmtfintype2','fireplacequ','garagetype','garagefinish','garagequal','garagecond',
             'poolqc','fence','miscfeature']
for var in nulls_list:
    add_cat_fillna(var)

In [7]:
df.masvnrtype = df.masvnrtype.fillna('None').copy()
df.masvnrarea = df.masvnrarea.fillna(0).copy()
df.lotfrontage = df.lotfrontage.fillna(df.lotfrontage.median()).copy()
df.electrical = df.electrical.fillna(df.electrical.mode()[0]).copy()
df.drop(columns='garageyrblt', inplace=True)

In [8]:
def outliers_std(data, columns, thresh=2):
    # Return the index labels of rows lying more than `thresh` standard
    # deviations from the mean in at least one of the given columns.
    outlier_indexes = set()
    for col in columns:
        mean = data[col].mean()
        sd = data[col].std()
        is_outlier = (data[col] > mean + thresh*sd) | (data[col] < mean - thresh*sd)
        outlier_indexes.update(data[is_outlier].index)
    return list(outlier_indexes)

In [9]:
df.drop(outliers_std(df, df.describe().columns), inplace=True)

Feature selection...

In [10]:
data = pd.DataFrame()
data['target'] = df.saleprice
data['totalsf'] = df.totalbsmtsf + df.firstflrsf + df.secondflrsf
data['grlivarea'] = df.grlivarea
data['bedroomabvgr'] = df.bedroomabvgr
data['totalbath'] = df.fullbath + df.halfbath * .5
data['garagearea'] = df.garagearea
data['selling_age'] = df.yrsold.astype('int') - df.yearbuilt.astype('int')

In [11]:
import statsmodels.api as sm

In [12]:
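def OLS_sum(data):
    # Expects the target in the first column and the regressors in the rest.
    target = data.iloc[:, 0]
    features = data.iloc[:, 1:]
    sm_data = sm.add_constant(features)  # add the intercept term
    results = sm.OLS(target, sm_data).fit()
    print(results.summary())

OLS_sum(data)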

                            OLS Regression Results                            
Dep. Variable:                 target   R-squared:                       0.792
Model:                            OLS   Adj. R-squared:                  0.790
Method:                 Least Squares   F-statistic:                     387.1
Date:                Mon, 11 Nov 2019   Prob (F-statistic):          3.00e-204
Time:                        11:17:17   Log-Likelihood:                -7053.5
No. Observations:                 617   AIC:                         1.412e+04
Df Residuals:                     610   BIC:                         1.415e+04
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const         3.874e+04   8012.830      4.835   

The totalbath variable does not appear to be significant in this model. Let's remove it and see whether the model improves.

In [13]:
data = pd.DataFrame()
data['target'] = df.saleprice
data['totalsf'] = df.totalbsmtsf + df.firstflrsf + df.secondflrsf
data['grlivarea'] = df.grlivarea
data['bedroomabvgr'] = df.bedroomabvgr
data['garagearea'] = df.garagearea
data['selling_age'] = df.yrsold.astype('int') - df.yearbuilt.astype('int')

OLS_sum(data)

                            OLS Regression Results                            
Dep. Variable:                 target   R-squared:                       0.791
Model:                            OLS   Adj. R-squared:                  0.790
Method:                 Least Squares   F-statistic:                     463.9
Date:                Mon, 11 Nov 2019   Prob (F-statistic):          2.80e-205
Time:                        11:17:17   Log-Likelihood:                -7054.2
No. Observations:                 617   AIC:                         1.412e+04
Df Residuals:                     611   BIC:                         1.415e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const         4.362e+04   6952.996      6.274   

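The adjusted R-squared is identical to the previous model's (0.790), and the F-statistic is higher (463.9 versus 387.1), which is what we would expect after dropping a regressor that contributed almost nothing. The AIC and BIC are practically identical. Let's see whether the interaction between totalsf and grlivarea adds anything to the model.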
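
The change in the F-statistic follows from its definition, `F = (R2 / k) / ((1 - R2) / (n - k - 1))`, where `k` is the number of regressors and `n` the number of observations: with R-squared almost unchanged, dividing by a smaller `k` raises F. A small sketch that recomputes it from the rounded values printed in the two summaries above (the function name `f_from_r2` is introduced here for illustration):

```python
def f_from_r2(r2, n_obs, k):
    """Overall F-statistic of an OLS model with an intercept and k regressors."""
    return (r2 / k) / ((1.0 - r2) / (n_obs - k - 1))

# Inputs as printed in the summaries (R-squared is rounded to three digits).
f_with_totalbath = f_from_r2(0.792, 617, 6)
f_without_totalbath = f_from_r2(0.791, 617, 5)
print(round(f_with_totalbath, 1), round(f_without_totalbath, 1))
```

The recomputed values land close to the reported 387.1 and 463.9; the small gap comes from R-squared being rounded in the printout.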

In [14]:
data = pd.DataFrame()
data['target'] = df.saleprice
data['totalsf'] = df.totalbsmtsf + df.firstflrsf + df.secondflrsf
data['grlivarea'] = df.grlivarea
data['totalsf_grl_rel'] = (df.totalbsmtsf + df.firstflrsf + df.secondflrsf) * df.grlivarea
data['bedroomabvgr'] = df.bedroomabvgr
data['garagearea'] = df.garagearea
data['selling_age'] = df.yrsold.astype('int') - df.yearbuilt.astype('int')

OLS_sum(data)

                            OLS Regression Results                            
Dep. Variable:                 target   R-squared:                       0.801
Model:                            OLS   Adj. R-squared:                  0.799
Method:                 Least Squares   F-statistic:                     409.2
Date:                Mon, 11 Nov 2019   Prob (F-statistic):          4.34e-210
Time:                        11:17:17   Log-Likelihood:                -7039.8
No. Observations:                 617   AIC:                         1.409e+04
Df Residuals:                     610   BIC:                         1.412e+04
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const            1.253e+05   1.66e+04     

In this model, totalsf and grlivarea are no longer significant on their own. Perhaps their interaction is enough? The fit improves (R-squared rises from 0.791 to 0.801), and the AIC and BIC have both decreased. Let's remove totalsf and grlivarea and see what happens.
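
A common reason main effects lose significance once their product is added is collinearity: for strictly positive variables, the product is strongly correlated with each factor. Centering the variables before multiplying usually reduces that correlation. The sketch below uses synthetic square-footage-like data (the variable names and distributions are assumptions, not the houseprices columns) to show the effect:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical positive, correlated size measures.
total_sf = rng.normal(2500, 500, size=1000)
living_area = 0.6 * total_sf + rng.normal(0, 150, size=1000)

# Raw product versus the product of mean-centered variables.
raw_product = total_sf * living_area
total_c = total_sf - total_sf.mean()
living_c = living_area - living_area.mean()
centered_product = total_c * living_c

corr_raw = np.corrcoef(total_sf, raw_product)[0, 1]
corr_centered = np.corrcoef(total_c, centered_product)[0, 1]
print(f"corr(total_sf, raw product):      {corr_raw:.2f}")
print(f"corr(total_sf, centered product): {corr_centered:.2f}")
```

Centering does not change the fit of the full model, only how the individual coefficients and their standard errors are expressed, so this is a diagnostic aid rather than a different specification.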

In [15]:
data = pd.DataFrame()
data['target'] = df.saleprice
data['totalsf_grl_rel'] = (df.totalbsmtsf + df.firstflrsf + df.secondflrsf) * df.grlivarea
data['bedroomabvgr'] = df.bedroomabvgr
data['garagearea'] = df.garagearea
data['selling_age'] = df.yrsold.astype('int') - df.yearbuilt.astype('int')

OLS_sum(data)

                            OLS Regression Results                            
Dep. Variable:                 target   R-squared:                       0.799
Model:                            OLS   Adj. R-squared:                  0.798
Method:                 Least Squares   F-statistic:                     608.0
Date:                Mon, 11 Nov 2019   Prob (F-statistic):          1.58e-211
Time:                        11:17:17   Log-Likelihood:                -7043.0
No. Observations:                 617   AIC:                         1.410e+04
Df Residuals:                     612   BIC:                         1.412e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const            1.193e+05   5784.148     

Not too shabby: the fit is nearly unchanged with two fewer variables. I wonder whether the interaction between total square footage and bedroomabvgr would alter anything.

In [16]:
data = pd.DataFrame()
data['target'] = df.saleprice
data['totalsf'] = df.totalbsmtsf + df.firstflrsf + df.secondflrsf
data['grlivarea'] = df.grlivarea
data['totalsf_grl_rel'] = (df.totalbsmtsf + df.firstflrsf + df.secondflrsf) * df.grlivarea
data['bedroomabvgr'] = df.bedroomabvgr
data['bedroomabvgr_sf_rel'] = df.bedroomabvgr * (df.totalbsmtsf + df.firstflrsf + df.secondflrsf)
data['garagearea'] = df.garagearea
data['selling_age'] = df.yrsold.astype('int') - df.yearbuilt.astype('int')

OLS_sum(data)

                            OLS Regression Results                            
Dep. Variable:                 target   R-squared:                       0.805
Model:                            OLS   Adj. R-squared:                  0.802
Method:                 Least Squares   F-statistic:                     358.1
Date:                Mon, 11 Nov 2019   Prob (F-statistic):          3.75e-211
Time:                        11:17:18   Log-Likelihood:                -7034.2
No. Observations:                 617   AIC:                         1.408e+04
Df Residuals:                     609   BIC:                         1.412e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                7.786e+04   2

This model, though more complicated, fits slightly better: the adjusted R-squared rises to 0.802 and the AIC falls to 1.408e+04, while the BIC stays at 1.412e+04. I am curious how adding a dummy variable to the equation would change this.

In [17]:
data = pd.DataFrame()
data['target'] = df.saleprice
data['totalsf'] = df.totalbsmtsf + df.firstflrsf + df.secondflrsf
data['grlivarea'] = df.grlivarea
data['totalsf_grl_rel'] = (df.totalbsmtsf + df.firstflrsf + df.secondflrsf) * df.grlivarea
data['bedroomabvgr'] = df.bedroomabvgr
data['bedroomabvgr_sf_rel'] = df.bedroomabvgr * (df.totalbsmtsf + df.firstflrsf + df.secondflrsf)
data['garagearea'] = df.garagearea
data['selling_age'] = df.yrsold.astype('int') - df.yearbuilt.astype('int')
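# dtype=float keeps the dummies numeric; recent pandas versions default to bool, which statsmodels may reject.
data = data.join(pd.get_dummies(df.mszoning, prefix='mszoning', drop_first=True, dtype=float))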

In [18]:
OLS_sum(data)

                            OLS Regression Results                            
Dep. Variable:                 target   R-squared:                       0.811
Model:                            OLS   Adj. R-squared:                  0.807
Method:                 Least Squares   F-statistic:                     235.8
Date:                Mon, 11 Nov 2019   Prob (F-statistic):          1.93e-210
Time:                        11:17:18   Log-Likelihood:                -7024.1
No. Observations:                 617   AIC:                         1.407e+04
Df Residuals:                     605   BIC:                         1.413e+04
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                5.316e+04   2

It looks like a number of these dummies are insignificant. What if we drop the insignificant ones?
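
Dropping individual dummy columns changes what the remaining coefficients mean: with `drop_first=True` the first category is the baseline, and every category whose dummy is later removed is merged into that baseline. A small illustration, assuming the zoning codes below stand in for the `mszoning` categories:

```python
import pandas as pd

# Assumed zoning codes, for illustration only.
zoning = pd.Series(["C (all)", "FV", "RH", "RL", "RM", "RL"], dtype="category")

dummies = pd.get_dummies(zoning, prefix="mszoning", drop_first=True, dtype=float)
print(list(dummies.columns))  # 'C (all)' is the implicit baseline

reduced = dummies.drop(columns=["mszoning_RH", "mszoning_RM"])
# Rows that are all zeros now form the merged baseline group.
baseline_rows = zoning[reduced.sum(axis=1) == 0]
merged_baseline = sorted(baseline_rows.astype(str).unique())
print(merged_baseline)
```

The coefficients on the remaining dummies are then measured relative to this pooled group rather than to the first category alone.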

In [19]:
data = pd.DataFrame()
data['target'] = df.saleprice
data['totalsf'] = df.totalbsmtsf + df.firstflrsf + df.secondflrsf
data['grlivarea'] = df.grlivarea
data['totalsf_grl_rel'] = (df.totalbsmtsf + df.firstflrsf + df.secondflrsf) * df.grlivarea
data['bedroomabvgr'] = df.bedroomabvgr
data['bedroomabvgr_sf_rel'] = df.bedroomabvgr * (df.totalbsmtsf + df.firstflrsf + df.secondflrsf)
data['garagearea'] = df.garagearea
data['selling_age'] = df.yrsold.astype('int') - df.yearbuilt.astype('int')
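# dtype=float keeps the dummies numeric; recent pandas versions default to bool, which statsmodels may reject.
data = data.join(pd.get_dummies(df.mszoning, prefix='mszoning', drop_first=True, dtype=float))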
data.drop(columns=['mszoning_RH', 'mszoning_RM'], inplace=True)

OLS_sum(data)

                            OLS Regression Results                            
Dep. Variable:                 target   R-squared:                       0.810
Model:                            OLS   Adj. R-squared:                  0.807
Method:                 Least Squares   F-statistic:                     287.4
Date:                Mon, 11 Nov 2019   Prob (F-statistic):          2.99e-212
Time:                        11:17:18   Log-Likelihood:                -7025.7
No. Observations:                 617   AIC:                         1.407e+04
Df Residuals:                     607   BIC:                         1.412e+04
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                7.212e+04   2

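This model appears to be the best of those tried: its adjusted R-squared (0.807) matches the highest seen, its AIC (1.407e+04) is tied for the lowest, and its BIC (1.412e+04) is lower than that of the model with all the zoning dummies (1.413e+04), while using two fewer parameters.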
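
Because the summaries print AIC and BIC to only four significant digits, several models look tied. Both can be recomputed at finer resolution from the printed log-likelihoods using `AIC = 2k - 2*lnL` and `BIC = k*ln(n) - 2*lnL`, where `k` counts the estimated parameters (Df Model plus the intercept) and `n = 617`. The sketch below does this for the seven models; the labels are shorthand introduced here, and the inputs are rounded printout values, so the last digit is approximate:

```python
import math

N_OBS = 617

# (label, log-likelihood, Df Model) as printed in the summaries above.
models = [
    ("baseline", -7053.5, 6),
    ("no totalbath", -7054.2, 5),
    ("plus sf*grlivarea", -7039.8, 6),
    ("interaction only", -7043.0, 4),
    ("plus bedroom*sf", -7034.2, 7),
    ("all zoning dummies", -7024.1, 11),
    ("reduced zoning dummies", -7025.7, 9),
]

def aic_bic(log_lik, df_model, n_obs=N_OBS):
    k = df_model + 1  # regressors plus the intercept
    aic = 2 * k - 2 * log_lik
    bic = k * math.log(n_obs) - 2 * log_lik
    return aic, bic

scores = {label: aic_bic(ll, df) for label, ll, df in models}

# Print the models from lowest to highest BIC.
for label, (aic, bic) in sorted(scores.items(), key=lambda item: item[1][1]):
    print(f"{label:<24} AIC={aic:9.1f}  BIC={bic:9.1f}")

best_by_aic = min(scores, key=lambda label: scores[label][0])
best_by_bic = min(scores, key=lambda label: scores[label][1])
print("Lowest AIC:", best_by_aic)
print("Lowest BIC:", best_by_bic)
```

On these inputs the final specification has both the lowest AIC and the lowest BIC, which is consistent with the conclusion above.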