In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sqlalchemy import create_engine

import warnings
warnings.filterwarnings('ignore')

In [2]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

In [10]:
engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
df = pd.read_sql_query('select * from houseprices',con=engine)

engine.dispose()

# Creating the model

In [11]:
numeric_variables = ['overallqual','firstflrsf','totrmsabvgrd']
variables = numeric_variables + ['neighborhood','exterior1st','saleprice']

In [12]:
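# dtype=float keeps the dummy columns numeric; newer pandas defaults to bool, which statsmodels can reject
df2 = pd.get_dummies(df[variables], drop_first=True, dtype=float)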
X = df2.drop('saleprice', axis=1)
y = df2.saleprice

In [13]:
X = sm.add_constant(X)
results = sm.OLS(y,X).fit()
results.summary()

Dep. Variable:,saleprice,R-squared:,0.787
Model:,OLS,Adj. R-squared:,0.781
Method:,Least Squares,F-statistic:,127.6
Date:,"Sat, 24 Jul 2021",Prob (F-statistic):,0.0
Time:,12:17:20,Log-Likelihood:,-17416.0
No. Observations:,1460,AIC:,34920.0
Df Residuals:,1418,BIC:,35140.0
Df Model:,41,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-9.879e+04,1.46e+04,-6.776,0.000,-1.27e+05,-7.02e+04
overallqual,2.292e+04,1144.838,20.023,0.000,2.07e+04,2.52e+04
firstflrsf,43.1906,3.391,12.737,0.000,36.539,49.842
totrmsabvgrd,8869.1029,726.944,12.201,0.000,7443.102,1.03e+04
neighborhood_Blueste,1.755e+04,2.81e+04,0.624,0.533,-3.76e+04,7.27e+04
neighborhood_BrDale,-9569.3846,1.36e+04,-0.702,0.483,-3.63e+04,1.72e+04
neighborhood_BrkSide,1.507e+04,1.09e+04,1.388,0.165,-6225.160,3.64e+04
neighborhood_ClearCr,5.422e+04,1.19e+04,4.564,0.000,3.09e+04,7.75e+04
neighborhood_CollgCr,2.749e+04,9580.200,2.870,0.004,8701.504,4.63e+04

Omnibus:,535.89,Durbin-Watson:,1.924
Prob(Omnibus):,0.0,Jarque-Bera (JB):,11606.3
Skew:,1.179,Prob(JB):,0.0
Kurtosis:,16.61,Cond. No.,60000.0


# Evaluating the model

The F-statistic is 127.6 with a Prob (F-statistic) of 0.0, so this model fits significantly better than an intercept-only model. The R-squared is 0.787 and the adjusted R-squared is 0.781, meaning roughly 78 percent of the variance in saleprice is explained by this model. The AIC (34920) and BIC (35140) have no meaningful scale on their own; they are only useful for comparing models fit to the same data, where lower is better. The model could use some improvement. I will keep adding variables until the adjusted R-squared reaches the eighties, as long as the AIC and BIC keep decreasing.
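To see how the two R-squared figures relate, the adjusted value can be recomputed from the numbers in the summary. This is a minimal sketch that uses only the reported R-squared, No. Observations and Df Model; the helper name adjusted_r2 is my own and is not part of statsmodels. Because the reported R-squared is rounded, the result should only be expected to agree to about three decimals.

```python
def adjusted_r2(r2, n_obs, df_model):
    """Adjusted R-squared for a model that includes an intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - df_model - 1)

# Values copied from the summary table of the first model
adj = adjusted_r2(r2=0.787, n_obs=1460, df_model=41)
print(round(adj, 3))
```

The penalty factor (n_obs - 1) / (n_obs - df_model - 1) grows with every added dummy column, which is why adjusted R-squared can stall even when plain R-squared keeps creeping up.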

# Improving the model

From my simple linear regression model notebook, it appears saleprice may vary with basement condition (bsmtcond). Since I do not expect basement condition to be correlated with the exterior covering (exterior1st), I will add it to the model and see whether the model improves.
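The assumption that two categorical variables are unrelated can be checked rather than taken on faith. The sketch below shows one common approach, a chi-square test of independence on a contingency table, using scipy. It runs on synthetic data with made-up column names (cond, exterior) and made-up category labels, not on the houseprices table, so the printed numbers say nothing about the real data.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic stand-in data: two categorical columns drawn independently
rng = np.random.default_rng(1)
n = 1000
demo = pd.DataFrame({
    'cond': rng.choice(['Fa', 'Gd', 'TA'], size=n),
    'exterior': rng.choice(['Brick', 'Metal', 'Vinyl', 'Wood'], size=n),
})

# Contingency table of counts, then the chi-square test of independence
table = pd.crosstab(demo['cond'], demo['exterior'])
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print(chi2, p_value, dof)
```

A small p-value would suggest the two variables are associated. On the real data the same two calls could be applied to df['bsmtcond'] and df['exterior1st'], keeping in mind that sparse categories make the chi-square approximation less reliable.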

In [14]:
variables2 = variables + ['bsmtcond']

In [15]:
# dtype=float keeps the dummy columns numeric for statsmodels on newer pandas
df2 = pd.get_dummies(df[variables2], drop_first=True, dtype=float)
X = df2.drop('saleprice', axis=1)
y = df2.saleprice

In [16]:
X = sm.add_constant(X)
results = sm.OLS(y,X).fit()
results.summary()

Dep. Variable:,saleprice,R-squared:,0.789
Model:,OLS,Adj. R-squared:,0.782
Method:,Least Squares,F-statistic:,120.1
Date:,"Sat, 24 Jul 2021",Prob (F-statistic):,0.0
Time:,12:17:31,Log-Likelihood:,-17409.0
No. Observations:,1460,AIC:,34910.0
Df Residuals:,1415,BIC:,35150.0
Df Model:,44,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.089e+05,1.48e+04,-7.346,0.000,-1.38e+05,-7.98e+04
overallqual,2.204e+04,1170.291,18.829,0.000,1.97e+04,2.43e+04
firstflrsf,43.5062,3.382,12.864,0.000,36.872,50.140
totrmsabvgrd,9097.4610,727.409,12.507,0.000,7670.544,1.05e+04
neighborhood_Blueste,1.775e+04,2.8e+04,0.634,0.526,-3.72e+04,7.27e+04
neighborhood_BrDale,-8435.6162,1.36e+04,-0.620,0.535,-3.51e+04,1.82e+04
neighborhood_BrkSide,1.568e+04,1.08e+04,1.449,0.148,-5552.515,3.69e+04
neighborhood_ClearCr,5.456e+04,1.18e+04,4.608,0.000,3.13e+04,7.78e+04
neighborhood_CollgCr,2.75e+04,9549.953,2.880,0.004,8767.830,4.62e+04

Omnibus:,553.633,Durbin-Watson:,1.925
Prob(Omnibus):,0.0,Jarque-Bera (JB):,12527.047
Skew:,1.223,Prob(JB):,0.0
Kurtosis:,17.14,Cond. No.,60100.0


# Evaluating the new model

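Both R-squared values and the AIC have improved, but not by much: R-squared rose from 0.787 to 0.789, adjusted R-squared from 0.781 to 0.782, and the AIC fell from 34920 to 34910. By those measures this model is slightly better. However, the F-statistic has decreased from 127.6 to 120.1 and the BIC has increased from 35140 to 35150, which favors the first model. The overall F-statistic tends to fall when added predictors explain little, because the explained variance is averaged over more predictors. Since the gains are so small, I will try one more model that leaves out basement condition and adds a different variable instead.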

# Attempting to improve the model more

For this model, I am going to include the variable that denotes whether or not the house has central air (centralair).

In [17]:
variables2 = variables + ['centralair']

In [18]:
# dtype=float keeps the dummy columns numeric for statsmodels on newer pandas
df2 = pd.get_dummies(df[variables2], drop_first=True, dtype=float)
X = df2.drop('saleprice', axis=1)
y = df2.saleprice

In [19]:
X = sm.add_constant(X)
results = sm.OLS(y,X).fit()
results.summary()

Dep. Variable:,saleprice,R-squared:,0.788
Model:,OLS,Adj. R-squared:,0.782
Method:,Least Squares,F-statistic:,125.3
Date:,"Sat, 24 Jul 2021",Prob (F-statistic):,0.0
Time:,12:22:54,Log-Likelihood:,-17412.0
No. Observations:,1460,AIC:,34910.0
Df Residuals:,1417,BIC:,35140.0
Df Model:,42,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.033e+05,1.46e+04,-7.055,0.000,-1.32e+05,-7.45e+04
overallqual,2.238e+04,1159.450,19.299,0.000,2.01e+04,2.47e+04
firstflrsf,42.9745,3.384,12.699,0.000,36.336,49.613
totrmsabvgrd,8983.1416,726.453,12.366,0.000,7558.102,1.04e+04
neighborhood_Blueste,1.69e+04,2.81e+04,0.602,0.547,-3.81e+04,7.19e+04
neighborhood_BrDale,-1.043e+04,1.36e+04,-0.767,0.443,-3.71e+04,1.63e+04
neighborhood_BrkSide,1.574e+04,1.08e+04,1.453,0.146,-5511.970,3.7e+04
neighborhood_ClearCr,5.315e+04,1.19e+04,4.482,0.000,2.99e+04,7.64e+04
neighborhood_CollgCr,2.715e+04,9558.921,2.841,0.005,8402.953,4.59e+04

Omnibus:,546.19,Durbin-Watson:,1.935
Prob(Omnibus):,0.0,Jarque-Bera (JB):,12095.348
Skew:,1.206,Prob(JB):,0.0
Kurtosis:,16.893,Cond. No.,60000.0


# Evaluating the third model

* F-statistic -- 125.3, still lower than the first (127.6) but higher than the second (120.1)
* R-squared -- 0.788, better than the first (0.787) but not as good as the second (0.789)
* Adjusted R-squared -- 0.782, better than the first (0.781), the same as the second
* AIC -- 34910, better than the first (34920), the same as the second
* BIC -- 35140, the same as the first, better than the second (35150)
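The AIC and BIC rows can be reproduced approximately from the log-likelihood and parameter count of each summary, which also shows why the two criteria disagree about the second model. This is a sketch using the standard definitions AIC = 2k - 2*llf and BIC = k*ln(n) - 2*llf, where k counts the constant as a parameter. The helper names aic and bic and the dictionary labels are my own, and the log-likelihoods are the rounded values printed in the summaries, so the results are only approximate.

```python
import math

def aic(llf, k):
    """Akaike information criterion from log-likelihood and parameter count."""
    return 2 * k - 2 * llf

def bic(llf, k, n_obs):
    """Bayesian information criterion; the penalty per parameter is ln(n_obs)."""
    return k * math.log(n_obs) - 2 * llf

n_obs = 1460
# (Log-Likelihood, Df Model) copied from the three summaries
models = {
    'first': (-17416.0, 41),
    'bsmtcond': (-17409.0, 44),
    'centralair': (-17412.0, 42),
}
for name, (llf, df_model) in models.items():
    k = df_model + 1  # the constant counts as a parameter
    print(name, round(aic(llf, k)), round(bic(llf, k, n_obs)))
```

With 1460 observations each extra parameter costs about 7.29 in BIC but only 2 in AIC, so the three dummy columns added for bsmtcond are enough to push BIC up even though AIC goes down.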

# Choosing the model

Since my second and third attempts did not greatly improve the model evaluation metrics, I am going to choose my original, simpler model for house prices.