Suppose that we would like to know how much families in the US are spending on recreation annually. We've estimated the following model:

expenditure = 873 + 0.0012 × annual_income + 0.00002 × annual_income^2 − 223.57 × have_kids

expenditure is the annual spending on recreation in US dollars, annual_income is the annual income in US dollars, and have_kids is a dummy variable indicating families with children. Interpret the estimated coefficients. What additional statistics should be given in order to make sure that your interpretations make sense statistically? Write up your answer.

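To make sure our interpretations make sense statistically, we should be given the standard error, t-statistic, and p-value for each coefficient. Otherwise we do not know which variables are significant and which aren't. Overall fit statistics such as R-squared and the F-statistic would also help. Assuming each coefficient is significant, we can interpret them as follows. The constant, 873, is the predicted annual recreation spending of a family with no children and zero income. Because annual_income enters the model both linearly and squared, its effect is not constant: one additional dollar of income is associated with roughly 0.0012 + 2 × 0.00002 × annual_income dollars of additional spending, so the effect grows with income. Holding income fixed, families with kids spend about 223.57 dollars less per year on recreation than families without kids.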
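
As a small illustration of how the quadratic term changes the interpretation, the sketch below evaluates the estimated equation and the income derivative at a few income levels. The coefficients are copied from the model above; the income levels and the function names are hypothetical choices made only for this example.

```python
# Coefficients copied from the estimated model above.
INTERCEPT = 873.0
B_INCOME = 0.0012
B_INCOME_SQ = 0.00002
B_KIDS = -223.57


def predicted_expenditure(annual_income, have_kids):
    """Predicted annual recreation spending in US dollars."""
    return (
        INTERCEPT
        + B_INCOME * annual_income
        + B_INCOME_SQ * annual_income ** 2
        + B_KIDS * have_kids
    )


def marginal_effect_of_income(annual_income):
    """Derivative of predicted spending with respect to income."""
    return B_INCOME + 2 * B_INCOME_SQ * annual_income


# Hypothetical income levels, chosen only for illustration.
for income in (20_000, 50_000, 100_000):
    effect = marginal_effect_of_income(income)
    gap = predicted_expenditure(income, 1) - predicted_expenditure(income, 0)
    print(f"income={income}: marginal effect={effect:.4f}, kids gap={gap:.2f}")
```

The gap between families with and without kids is the same at every income level, because have_kids is not interacted with income in this model.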

# Weather Model

First, load the dataset from the weatherinszeged table from Thinkful's database.

In [1]:
%reload_ext nb_black


import numpy as np
import pandas as pd

from sklearn import linear_model
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

from sqlalchemy import create_engine

import warnings

import statsmodels.api as sm

warnings.filterwarnings("ignore")

postgres_user = "dsbc_student"
postgres_pw = "7*.8G9QH21"
postgres_host = "142.93.121.174"
postgres_port = "5432"
postgres_db = "weatherinszeged"

engine = create_engine(
    "postgresql://{}:{}@{}:{}/{}".format(
        postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db
    )
)
weather = pd.read_sql_query("select * from weatherinszeged", con=engine)

engine.dispose()


In [2]:
weather.head(2)

Unnamed: 0,date,summary,preciptype,temperature,apparenttemperature,humidity,windspeed,windbearing,visibility,loudcover,pressure,dailysummary
0,2006-03-31 22:00:00+00:00,Partly Cloudy,rain,9.472222,7.388889,0.89,14.1197,251.0,15.8263,0.0,1015.13,Partly cloudy throughout the day.
1,2006-03-31 23:00:00+00:00,Partly Cloudy,rain,9.355556,7.227778,0.86,14.2646,259.0,15.8263,0.0,1015.63,Partly cloudy throughout the day.



Build a linear regression model where your target variable is the difference between the apparenttemperature and the temperature. As explanatory variables, use humidity and windspeed. Now, estimate your model using OLS. Are the estimated coefficients statistically significant? Are the signs of the estimated coefficients in line with your previous expectations? Interpret the estimated coefficients. What are the relations between the target and the explanatory variables?

In [11]:
X = weather[["humidity", "windspeed"]]
y = weather["apparenttemperature"] - weather["temperature"]

X = sm.add_constant(X)

results = sm.OLS(y, X).fit()
results.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.288
Model:,OLS,Adj. R-squared:,0.288
Method:,Least Squares,F-statistic:,19490.0
Date:,"Tue, 01 Sep 2020",Prob (F-statistic):,0.0
Time:,14:44:43,Log-Likelihood:,-170460.0
No. Observations:,96453,AIC:,340900.0
Df Residuals:,96450,BIC:,340900.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.4381,0.021,115.948,0.000,2.397,2.479
humidity,-3.0292,0.024,-126.479,0.000,-3.076,-2.982
windspeed,-0.1193,0.001,-176.164,0.000,-0.121,-0.118

0,1,2,3
Omnibus:,3935.747,Durbin-Watson:,0.267
Prob(Omnibus):,0.0,Jarque-Bera (JB):,4613.311
Skew:,-0.478,Prob(JB):,0.0
Kurtosis:,3.484,Cond. No.,88.1
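
Both slope coefficients in the summary above have reported p-values of 0.000, so they are statistically significant, and both are negative. To make the magnitudes concrete, the sketch below plugs the estimated coefficients into the fitted equation. The humidity and windspeed values are hypothetical inputs, and `predicted_gap` is a name introduced only for this example.

```python
# Coefficients copied from the OLS summary above.
CONST = 2.4381
B_HUMIDITY = -3.0292
B_WINDSPEED = -0.1193


def predicted_gap(humidity, windspeed):
    """Predicted apparenttemperature minus temperature."""
    return CONST + B_HUMIDITY * humidity + B_WINDSPEED * windspeed


# Hypothetical conditions, in the same units as the dataset columns.
base = predicted_gap(humidity=0.5, windspeed=10.0)
more_humid = predicted_gap(humidity=0.6, windspeed=10.0)
windier = predicted_gap(humidity=0.5, windspeed=11.0)

print(f"baseline gap:            {base:.4f}")
print(f"change for +0.1 humidity: {more_humid - base:.4f}")
print(f"change for +1 windspeed:  {windier - base:.4f}")
```

In this model each effect is constant: a 0.1 increase in humidity lowers the predicted gap by about 0.30 degrees, and one extra unit of windspeed lowers it by about 0.12 degrees, regardless of the other variable.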



Next, add the interaction of humidity and windspeed to the model above and estimate the model using OLS. Are the coefficients statistically significant? Did the signs of the estimated coefficients for humidity and windspeed change? Interpret the estimated coefficients.

In [13]:
weather["humid_wind"] = weather["humidity"] * weather["windspeed"]
X = weather[["humidity", "windspeed", "humid_wind"]]
y = weather["apparenttemperature"] - weather["temperature"]

X = sm.add_constant(X)

model = sm.OLS(y, X).fit()
model.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.341
Model:,OLS,Adj. R-squared:,0.341
Method:,Least Squares,F-statistic:,16660.0
Date:,"Tue, 01 Sep 2020",Prob (F-statistic):,0.0
Time:,14:49:44,Log-Likelihood:,-166690.0
No. Observations:,96453,AIC:,333400.0
Df Residuals:,96449,BIC:,333400.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.0839,0.033,2.511,0.012,0.018,0.149
humidity,0.1775,0.043,4.133,0.000,0.093,0.262
windspeed,0.0905,0.002,36.797,0.000,0.086,0.095
humid_wind,-0.2971,0.003,-88.470,0.000,-0.304,-0.291

0,1,2,3
Omnibus:,4849.937,Durbin-Watson:,0.265
Prob(Omnibus):,0.0,Jarque-Bera (JB):,9295.404
Skew:,-0.378,Prob(JB):,0.0
Kurtosis:,4.32,Cond. No.,193.0
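
All four coefficients in the summary above are significant at the 5% level (the largest p-value is 0.012), and the humidity and windspeed coefficients have flipped from negative to positive. With an interaction term, each of those coefficients is only the effect when the other variable equals zero, so the full effect has to be computed at a given value of the other variable. The sketch below does that with the estimated coefficients; the evaluation points and function names are hypothetical.

```python
# Coefficients copied from the interaction-model summary above.
B_HUMIDITY = 0.1775
B_WINDSPEED = 0.0905
B_INTERACTION = -0.2971


def humidity_effect(windspeed):
    """Change in the predicted gap per unit of humidity at a given windspeed."""
    return B_HUMIDITY + B_INTERACTION * windspeed


def windspeed_effect(humidity):
    """Change in the predicted gap per unit of windspeed at a given humidity."""
    return B_WINDSPEED + B_INTERACTION * humidity


# Humidity level at which the windspeed effect switches from positive to negative.
humidity_threshold = B_WINDSPEED / -B_INTERACTION

# Hypothetical evaluation points.
for h in (0.2, 0.5, 0.9):
    print(f"humidity={h}: windspeed effect={windspeed_effect(h):.4f}")
for w in (0.0, 5.0, 10.0):
    print(f"windspeed={w}: humidity effect={humidity_effect(w):.4f}")
print(f"windspeed effect changes sign near humidity={humidity_threshold:.3f}")
```

Under these estimates the windspeed effect is negative once humidity exceeds roughly 0.30, and the humidity effect is negative for all but very low windspeeds, which is consistent with the negative signs in the model without the interaction.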



# House Prices Model

Load the houseprices data from Thinkful's database.

In [15]:
postgres_user = "dsbc_student"
postgres_pw = "7*.8G9QH21"
postgres_host = "142.93.121.174"
postgres_port = "5432"
postgres_db = "houseprices"

engine = create_engine(
    "postgresql://{}:{}@{}:{}/{}".format(
        postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db
    )
)
house_df = pd.read_sql_query("select * from houseprices", con=engine)

engine.dispose()


In [24]:
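# One-hot encode housestyle once and reuse the result. Re-running this cell
# appends the dummy columns again (see the SFoyer.1 and SLvl.1 columns below).
style_dummies = pd.get_dummies(house_df.housestyle)
dummy = list(style_dummies.columns)
house_df = pd.concat([house_df, style_dummies], axis=1)
house_df.head()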

Unnamed: 0,id,mssubclass,mszoning,lotfrontage,lotarea,street,alley,lotshape,landcontour,utilities,...,SFoyer,SLvl,1.5Fin,1.5Unf,1Story,2.5Fin,2.5Unf,2Story,SFoyer.1,SLvl.1
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,0,0,0,0,0,0,1,0,0
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,0,0,0,1,0,0,0,0,0
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,0,0,0,0,0,0,1,0,0
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,0,0,0,0,0,0,1,0,0
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,0,0,0,0,0,0,1,0,0



In [25]:
house_df.columns

Index(['id', 'mssubclass', 'mszoning', 'lotfrontage', 'lotarea', 'street',
       'alley', 'lotshape', 'landcontour', 'utilities', 'lotconfig',
       'landslope', 'neighborhood', 'condition1', 'condition2', 'bldgtype',
       'housestyle', 'overallqual', 'overallcond', 'yearbuilt', 'yearremodadd',
       'roofstyle', 'roofmatl', 'exterior1st', 'exterior2nd', 'masvnrtype',
       'masvnrarea', 'exterqual', 'extercond', 'foundation', 'bsmtqual',
       'bsmtcond', 'bsmtexposure', 'bsmtfintype1', 'bsmtfinsf1',
       'bsmtfintype2', 'bsmtfinsf2', 'bsmtunfsf', 'totalbsmtsf', 'heating',
       'heatingqc', 'centralair', 'electrical', 'firstflrsf', 'secondflrsf',
       'lowqualfinsf', 'grlivarea', 'bsmtfullbath', 'bsmthalfbath', 'fullbath',
       'halfbath', 'bedroomabvgr', 'kitchenabvgr', 'kitchenqual',
       'totrmsabvgrd', 'functional', 'fireplaces', 'fireplacequ', 'garagetype',
       'garageyrblt', 'garagefinish', 'garagecars', 'garagearea', 'garagequal',
       'garagecond', 'paved


In [18]:
X = house_df[["overallqual", "grlivarea", "garagecars"] + dummy]
Y = house_df["saleprice"]

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

0,1,2,3
Dep. Variable:,saleprice,R-squared:,0.766
Model:,OLS,Adj. R-squared:,0.764
Method:,Least Squares,F-statistic:,474.3
Date:,"Tue, 01 Sep 2020",Prob (F-statistic):,0.0
Time:,14:53:44,Log-Likelihood:,-17484.0
No. Observations:,1460,AIC:,34990.0
Df Residuals:,1449,BIC:,35050.0
Df Model:,10,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.139e+05,4969.209,-22.919,0.000,-1.24e+05,-1.04e+05
overallqual,2.594e+04,1033.361,25.106,0.000,2.39e+04,2.8e+04
grlivarea,70.6682,2.950,23.952,0.000,64.881,76.456
garagecars,1.644e+04,1776.303,9.256,0.000,1.3e+04,1.99e+04
1.5Fin,-1.488e+04,3740.372,-3.980,0.000,-2.22e+04,-7547.841
1.5Unf,-786.7970,9563.600,-0.082,0.934,-1.95e+04,1.8e+04
1Story,1.415e+04,2961.928,4.776,0.000,8335.872,2e+04
2.5Fin,-5.742e+04,1.29e+04,-4.466,0.000,-8.26e+04,-3.22e+04
2.5Unf,-5.729e+04,1.06e+04,-5.381,0.000,-7.82e+04,-3.64e+04

0,1,2,3
Omnibus:,358.075,Durbin-Watson:,2.001
Prob(Omnibus):,0.0,Jarque-Bera (JB):,19883.657
Skew:,0.07,Prob(JB):,0.0
Kurtosis:,21.079,Cond. No.,1.82e+19



From the model above, we see that all of the variables are statistically significant except for 'SLvl' and '1.5Unf'.
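
The condition number of 1.82e+19 in the summary above is a warning sign of perfect multicollinearity. A likely cause, assuming every house has exactly one housestyle value, is the dummy variable trap: a full set of one-hot columns always sums to 1 in every row, which duplicates the constant column. The sketch below shows this on a tiny hypothetical list of styles (not the real data), using plain Python rather than pandas.

```python
# Hypothetical sample of house styles, for illustration only.
styles = ["1Story", "2Story", "1.5Fin", "2Story", "SLvl", "1Story"]

categories = sorted(set(styles))

# Full one-hot encoding: one column per category.
dummies = [[1 if s == c else 0 for c in categories] for s in styles]

# The constant column that sm.add_constant would prepend.
constant = [1] * len(styles)

# Every row of a full dummy set sums to 1, so the dummies add up to the constant.
row_sums = [sum(row) for row in dummies]
print("full dummy set sums to the constant:", row_sums == constant)

# Dropping one category (the baseline) breaks the exact linear dependence.
reduced = [row[1:] for row in dummies]
reduced_sums = [sum(row) for row in reduced]
print("reduced dummy set sums to the constant:", reduced_sums == constant)
```

With one category left out, each remaining dummy coefficient is read as the difference from that omitted baseline style.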

Now, exclude the insignificant features from your model. Did anything change?

In [28]:
# Keep every regressor except the two insignificant style dummies, SLvl and 1.5Unf.
X = X[
    [
        "overallqual",
        "grlivarea",
        "garagecars",
        "1.5Fin",
        "1Story",
        "2.5Fin",
        "2.5Unf",
        "2Story",
        "SFoyer",
    ]
]
Y = house_df["saleprice"]

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

0,1,2,3
Dep. Variable:,saleprice,R-squared:,0.766
Model:,OLS,Adj. R-squared:,0.765
Method:,Least Squares,F-statistic:,527.3
Date:,"Tue, 01 Sep 2020",Prob (F-statistic):,0.0
Time:,15:02:02,Log-Likelihood:,-17484.0
No. Observations:,1460,AIC:,34990.0
Df Residuals:,1450,BIC:,35040.0
Df Model:,9,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.13e+05,6277.246,-18.005,0.000,-1.25e+05,-1.01e+05
overallqual,2.593e+04,1031.701,25.137,0.000,2.39e+04,2.8e+04
grlivarea,70.7127,2.939,24.060,0.000,64.948,76.478
garagecars,1.646e+04,1773.772,9.277,0.000,1.3e+04,1.99e+04
1.5Fin,-1.579e+04,5481.682,-2.881,0.004,-2.65e+04,-5039.596
1Story,1.325e+04,4571.239,2.898,0.004,4280.913,2.22e+04
2.5Fin,-5.838e+04,1.5e+04,-3.897,0.000,-8.78e+04,-2.9e+04
2.5Unf,-5.82e+04,1.25e+04,-4.642,0.000,-8.28e+04,-3.36e+04
2Story,-1.655e+04,4911.176,-3.370,0.001,-2.62e+04,-6915.939

0,1,2,3
Omnibus:,358.053,Durbin-Watson:,2.0
Prob(Omnibus):,0.0,Jarque-Bera (JB):,19899.291
Skew:,0.068,Prob(JB):,0.0
Kurtosis:,21.086,Cond. No.,25400.0



Interpret the statistically significant coefficients by quantifying their relations with the house prices. Which features have a more prominent effect on house prices?


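* 2.5Unf and 2.5Fin have the most negative effect on house prices: holding the other features fixed, these styles are associated with sale prices roughly 58,200 and 58,380 lower, respectively, than the house styles left out of the model.
* overallqual and garagecars have the most positive effect on house prices: each additional point of overall quality is associated with a sale price about 25,930 higher, and each additional unit of garagecars with a sale price about 16,460 higher.
* grlivarea has a small per-unit coefficient (about 70.71 per unit of living area), but because it varies over a much wider range than the other features, its total contribution to the predicted price can still be large.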
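
To make these magnitudes concrete, the sketch below plugs the coefficients from the second model's summary into the fitted equation. The coefficients are the rounded values shown in that table (SFoyer is left out because its row does not appear in the output), and the example house and the `predict_price` helper are hypothetical.

```python
# Rounded coefficients as displayed in the second model's summary table.
COEFS = {
    "const": -1.13e5,
    "overallqual": 2.593e4,
    "grlivarea": 70.7127,
    "garagecars": 1.646e4,
    "1.5Fin": -1.579e4,
    "1Story": 1.325e4,
    "2.5Fin": -5.838e4,
    "2.5Unf": -5.82e4,
    "2Story": -1.655e4,
}


def predict_price(features):
    """Predicted sale price for a dict of feature values (missing dummies are 0)."""
    return COEFS["const"] + sum(COEFS[name] * value for name, value in features.items())


# Hypothetical house, chosen only for illustration.
house = {"overallqual": 7, "grlivarea": 1500, "garagecars": 2, "1Story": 1}
better_quality = {**house, "overallqual": 8}
two_story = {"overallqual": 7, "grlivarea": 1500, "garagecars": 2, "2Story": 1}

print(f"baseline prediction:      {predict_price(house):,.2f}")
print(f"one more quality point:   {predict_price(better_quality) - predict_price(house):,.2f}")
print(f"2Story instead of 1Story: {predict_price(two_story) - predict_price(house):,.2f}")
```

Because the displayed coefficients are rounded, predictions from this sketch will differ slightly from what `results.predict` would return on the same inputs.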