# House Prices Model

* Load the houseprices data from Thinkful's database.
* Run your house prices model again and assess the goodness of fit of your model using F-test, R-squared, adjusted R-squared, AIC and BIC.
* Do you think your model is satisfactory? If so, why?
* In order to improve the goodness of fit of your model, try different model specifications by adding or removing some variables.
* For each model you try, get the goodness of fit metrics and compare your models with each other. Which model is the best and why?

## Load Data

In [1]:
import numpy as np
import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt
from sqlalchemy import create_engine
import statsmodels.api as sm
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

# Load data from PostgreSQL database and print out
# observations
engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

house_df = pd.read_sql_query('select * from houseprices',con=engine)

# No need for an open connection, as we're only doing a single query
engine.dispose()

house_df.head()

,id,mssubclass,mszoning,lotfrontage,lotarea,street,alley,lotshape,landcontour,utilities,...,poolarea,poolqc,fence,miscfeature,miscval,mosold,yrsold,saletype,salecondition,saleprice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [2]:
# Encode street and overallqual as binary (0/1) indicator variables
house_df['street_is_paved'] = np.where(house_df['street'] == 'Pave', 1, 0)
house_df['overallqual_above_6'] = np.where(house_df['overallqual'] > 6, 1, 0)

house_df[['id', 'street', 'street_is_paved', 'overallqual', 'overallqual_above_6']].head(10)

,id,street,street_is_paved,overallqual,overallqual_above_6
0,1,Pave,1,7,1
1,2,Pave,1,6,0
2,3,Pave,1,7,1
3,4,Pave,1,7,1
4,5,Pave,1,8,1
5,6,Pave,1,5,0
6,7,Pave,1,8,1
7,8,Pave,1,7,1
8,9,Pave,1,7,1
9,10,Pave,1,5,0


## Build 1st Model

In [3]:
# Y is the target variable
Y = house_df['saleprice']

# X is the feature set
X = house_df[['street_is_paved', 'overallqual_above_6', 'lotarea', 'totalbsmtsf', 'grlivarea', 'garagearea']]

# Manually add constant
X = sm.add_constant(X)

# Use fit method to build model
results = sm.OLS(Y, X).fit()

results.summary()

Dep. Variable:,saleprice,R-squared:,0.719
Model:,OLS,Adj. R-squared:,0.717
Method:,Least Squares,F-statistic:,618.5
Date:,"Tue, 31 Dec 2019",Prob (F-statistic):,0.0
Time:,21:33:09,Log-Likelihood:,-17618.0
No. Observations:,1460,AIC:,35250.0
Df Residuals:,1453,BIC:,35290.0
Df Model:,6,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-3.904e+04,1.82e+04,-2.151,0.032,-7.46e+04,-3437.927
street_is_paved,3.872e+04,1.78e+04,2.178,0.030,3847.798,7.36e+04
overallqual_above_6,4.742e+04,2830.163,16.755,0.000,4.19e+04,5.3e+04
lotarea,0.4752,0.120,3.967,0.000,0.240,0.710
totalbsmtsf,40.0037,3.092,12.937,0.000,33.938,46.069
grlivarea,53.5868,2.666,20.102,0.000,48.358,58.816
garagearea,74.1952,6.519,11.382,0.000,61.408,86.982

Omnibus:,546.237,Durbin-Watson:,1.992
Prob(Omnibus):,0.0,Jarque-Bera (JB):,61601.254
Skew:,-0.719,Prob(JB):,0.0
Kurtosis:,34.789,Cond. No.,333000.0


### Assess goodness of fit of model

F-test: F statistic=618.5; p-value=0.00

The p-value is below 0.05, so we reject the null hypothesis that all slope coefficients are zero: the model explains house prices better than an "empty" (intercept-only) model.

R-squared=0.719
Adjusted R-squared=0.717

This means that our model explains 71.9% of the variance in house prices (71.7% after adjusting for the number of predictors), leaving roughly 28% unexplained.

AIC=3.525e+04
BIC=3.529e+04

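AIC and BIC have no meaningful absolute scale, so these values cannot be judged on their own. They are useful only for comparing models fit to the same data, where lower values are better.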
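
As a rough check of where these two numbers come from, AIC and BIC can be recomputed from the log-likelihood in the summary table. The sketch below assumes the convention statsmodels uses for OLS, where the parameter count includes the constant (6 features plus 1 here). The helper name `aic_bic` is made up for illustration, and because the printed log-likelihood is rounded, the results match the table only approximately.

```python
import math

def aic_bic(log_likelihood, n_obs, n_params):
    """Return (AIC, BIC) for a model with n_params estimated coefficients."""
    aic = 2 * n_params - 2 * log_likelihood
    bic = n_params * math.log(n_obs) - 2 * log_likelihood
    return aic, bic

# Values read off the model 1 summary (the log-likelihood is rounded there)
aic, bic = aic_bic(log_likelihood=-17618.0, n_obs=1460, n_params=7)
print(round(aic), round(bic))
```

BIC penalizes each extra parameter by ln(n) instead of 2, so with 1,460 observations it punishes additional features more heavily than AIC does.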

The model is reasonably good, but there is room for improvement: roughly 28% of the variance is still unexplained, and alternative specifications may achieve lower AIC and BIC values.

## Build 2nd Model

In [4]:
# Y is the target variable
Y = house_df['saleprice']

# X is the feature set
# Removed totalbsmtsf, grlivarea, and garagearea since
# they are highly correlated with each other and
# the other features
X = house_df[['street_is_paved', 'overallqual_above_6', 'lotarea']]

# Manually add constant
X = sm.add_constant(X)

# Use fit method to build model
results = sm.OLS(Y, X).fit()

results.summary()

Dep. Variable:,saleprice,R-squared:,0.481
Model:,OLS,Adj. R-squared:,0.48
Method:,Least Squares,F-statistic:,449.8
Date:,"Tue, 31 Dec 2019",Prob (F-statistic):,9.57e-207
Time:,21:48:36,Log-Likelihood:,-18065.0
No. Observations:,1460,AIC:,36140.0
Df Residuals:,1456,BIC:,36160.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,5.726e+04,2.42e+04,2.363,0.018,9724.768,1.05e+05
street_is_paved,6.602e+04,2.4e+04,2.755,0.006,1.9e+04,1.13e+05
overallqual_above_6,1.046e+05,3113.801,33.596,0.000,9.85e+04,1.11e+05
lotarea,1.7730,0.154,11.517,0.000,1.471,2.075

Omnibus:,715.652,Durbin-Watson:,1.962
Prob(Omnibus):,0.0,Jarque-Bera (JB):,8848.861
Skew:,1.97,Prob(JB):,0.0
Kurtosis:,14.399,Cond. No.,329000.0


### Compare goodness of fit to model 1

* F-test: F statistic=449.8; p-value=9.57e-207
* R-squared=0.481
* Adjusted R-squared=0.480
* AIC=3.614e+04
* BIC=3.616e+04

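This model performed worse than the first model (lower adjusted R-squared and higher AIC and BIC values). Dropping the three area variables cost substantial explanatory power, so their correlation with the other features does not justify removing them.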

## Build 3rd Model

In [5]:
# Y is the target variable
Y = house_df['saleprice']

# X is the feature set
# Replace overallqual_above_6 dummy variable with overallqual
X = house_df[['street_is_paved', 'overallqual', 'lotarea', 'totalbsmtsf', 'grlivarea', 'garagearea']]

# Manually add constant
X = sm.add_constant(X)

# Use fit method to build model
results = sm.OLS(Y, X).fit()

results.summary()

Dep. Variable:,saleprice,R-squared:,0.763
Model:,OLS,Adj. R-squared:,0.762
Method:,Least Squares,F-statistic:,779.3
Date:,"Tue, 31 Dec 2019",Prob (F-statistic):,0.0
Time:,21:57:11,Log-Likelihood:,-17493.0
No. Observations:,1460,AIC:,35000.0
Df Residuals:,1453,BIC:,35040.0
Df Model:,6,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.334e+05,1.67e+04,-7.976,0.000,-1.66e+05,-1.01e+05
street_is_paved,3.088e+04,1.63e+04,1.892,0.059,-1142.381,6.29e+04
overallqual,2.572e+04,1046.040,24.589,0.000,2.37e+04,2.78e+04
lotarea,0.6437,0.110,5.828,0.000,0.427,0.860
totalbsmtsf,26.9389,2.932,9.188,0.000,21.187,32.691
grlivarea,42.4296,2.539,16.712,0.000,37.449,47.410
garagearea,57.3485,6.060,9.463,0.000,45.461,69.236

Omnibus:,556.787,Durbin-Watson:,1.982
Prob(Omnibus):,0.0,Jarque-Bera (JB):,66708.644
Skew:,-0.739,Prob(JB):,0.0
Kurtosis:,36.082,Cond. No.,333000.0


### Compare goodness of fit to model 1

* F-test: F statistic=779.3; p-value=0.00
* R-squared=0.763
* Adjusted R-squared=0.762
* AIC=3.500e+04
* BIC=3.504e+04

This model performed better than the first model (higher adjusted R-squared and lower AIC and BIC values).

## Build 4th Model

In [6]:
# Y is the target variable
Y = house_df['saleprice']

# X is the feature set
# Add garagecars variable to feature set
X = house_df[['street_is_paved', 'overallqual_above_6', 'lotarea', 'totalbsmtsf', 'grlivarea', 'garagearea', 'garagecars']]

# Manually add constant
X = sm.add_constant(X)

# Use fit method to build model
results = sm.OLS(Y, X).fit()

results.summary()

Dep. Variable:,saleprice,R-squared:,0.728
Model:,OLS,Adj. R-squared:,0.726
Method:,Least Squares,F-statistic:,553.8
Date:,"Tue, 31 Dec 2019",Prob (F-statistic):,0.0
Time:,21:59:39,Log-Likelihood:,-17595.0
No. Observations:,1460,AIC:,35210.0
Df Residuals:,1452,BIC:,35250.0
Df Model:,7,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-4.291e+04,1.79e+04,-2.401,0.016,-7.8e+04,-7846.769
street_is_paved,3.574e+04,1.75e+04,2.042,0.041,1400.308,7.01e+04
overallqual_above_6,4.358e+04,2841.346,15.338,0.000,3.8e+04,4.92e+04
lotarea,0.4803,0.118,4.072,0.000,0.249,0.712
totalbsmtsf,41.1287,3.048,13.492,0.000,35.149,47.108
grlivarea,52.3953,2.630,19.923,0.000,47.237,57.554
garagearea,10.9042,11.213,0.972,0.331,-11.092,32.900
garagecars,2.194e+04,3188.337,6.883,0.000,1.57e+04,2.82e+04

Omnibus:,434.337,Durbin-Watson:,1.977
Prob(Omnibus):,0.0,Jarque-Bera (JB):,39788.756
Skew:,-0.323,Prob(JB):,0.0
Kurtosis:,28.566,Cond. No.,333000.0


### Compare goodness of fit to model 1

* F-test: F statistic=553.8; p-value=0.00
* R-squared=0.728
* Adjusted R-squared=0.726
* AIC=3.521e+04
* BIC=3.525e+04

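This model performed slightly better than the first model (higher adjusted R-squared and lower AIC and BIC values). Note that garagearea is no longer statistically significant (p=0.331) once garagecars is included, which suggests the two garage variables carry overlapping information.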

## Build 5th Model

In [7]:
# Y is the target variable
Y = house_df['saleprice']

# X is the feature set
# Replace overallqual_above_6 dummy variable with overallqual
# Add garagecars variable to feature set
X = house_df[['street_is_paved', 'overallqual', 'lotarea', 'totalbsmtsf', 'grlivarea', 'garagearea', 'garagecars']]

# Manually add constant
X = sm.add_constant(X)

# Use fit method to build model
results = sm.OLS(Y, X).fit()

results.summary()

Dep. Variable:,saleprice,R-squared:,0.767
Model:,OLS,Adj. R-squared:,0.766
Method:,Least Squares,F-statistic:,681.4
Date:,"Tue, 31 Dec 2019",Prob (F-statistic):,0.0
Time:,22:01:26,Log-Likelihood:,-17482.0
No. Observations:,1460,AIC:,34980.0
Df Residuals:,1452,BIC:,35020.0
Df Model:,7,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.311e+05,1.66e+04,-7.893,0.000,-1.64e+05,-9.85e+04
street_is_paved,2.93e+04,1.62e+04,1.808,0.071,-2488.529,6.11e+04
overallqual,2.442e+04,1073.004,22.759,0.000,2.23e+04,2.65e+04
lotarea,0.6390,0.110,5.829,0.000,0.424,0.854
totalbsmtsf,28.3166,2.924,9.683,0.000,22.580,34.053
grlivarea,42.1823,2.520,16.736,0.000,37.238,47.126
garagearea,16.7361,10.381,1.612,0.107,-3.628,37.100
garagecars,1.435e+04,2990.265,4.800,0.000,8486.365,2.02e+04

Omnibus:,468.123,Durbin-Watson:,1.974
Prob(Omnibus):,0.0,Jarque-Bera (JB):,49803.706
Skew:,-0.428,Prob(JB):,0.0
Kurtosis:,31.6,Cond. No.,333000.0


### Compare goodness of fit to model 1

* F-test: F statistic=681.4; p-value=0.00
* R-squared=0.767
* Adjusted R-squared=0.766
* AIC=3.498e+04
* BIC=3.502e+04

This model performed the best of the five models (highest adjusted R-squared and lowest AIC and BIC values).