Suppose that we would like to know how much families in the US are spending on recreation annually. We've estimated the following model:

$$expenditure = 873 + 0.0012 \cdot annual\_income + 0.00002 \cdot annual\_income^2 - 223.57 \cdot have\_kids$$

expenditure is the annual spending on recreation in US dollars, annual_income is the annual income in US dollars, and have_kids is a dummy variable indicating families with children. Interpret the estimated coefficients. What additional statistics should be given in order to make sure that your interpretations make sense statistically? Write up your answer.

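A family with no annual income and no kids is predicted to spend $873 a year on recreation. Income enters the model both linearly and squared, so its effect depends on the income level: an extra dollar of income is associated with 0.0012 + 2 × 0.00002 × annual_income additional dollars of spending. That is $0.0012 (0.12 cents) at zero income, and it grows as income rises. have_kids is a dummy variable, so -223.57 is not a per-child effect: holding income fixed, families with children are predicted to spend $223.57 less per year than families without.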
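Because income enters the model both linearly and squared, its effect on spending depends on the income level. The following is a minimal sketch of that arithmetic using the coefficients quoted above; the function names and the sample incomes are illustrative assumptions, not part of the original exercise.

```python
# Coefficients copied from the estimated model in the exercise.
INTERCEPT = 873.0
B_INCOME = 0.0012
B_INCOME_SQ = 0.00002
B_KIDS = -223.57


def predicted_expenditure(annual_income, have_kids):
    """Predicted annual recreation spending in US dollars (have_kids is 0 or 1)."""
    return (INTERCEPT
            + B_INCOME * annual_income
            + B_INCOME_SQ * annual_income ** 2
            + B_KIDS * have_kids)


def marginal_effect_of_income(annual_income):
    """Derivative of predicted spending with respect to income."""
    return B_INCOME + 2 * B_INCOME_SQ * annual_income


# Sample incomes are arbitrary illustration values.
for income in (0, 10_000, 50_000):
    print(income, marginal_effect_of_income(income))

# The dummy shifts the prediction by the same amount at every income level.
print(predicted_expenditure(50_000, 1) - predicted_expenditure(50_000, 0))
```

The last line shows why the have_kids coefficient reads as a level shift between the two groups rather than a slope.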
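To judge whether these interpretations hold up statistically, the standard errors, t-statistics, and p-values (or confidence intervals) of each coefficient should be provided, along with R-squared and the F-statistic for the overall fit. It would also be useful to check the normality of the errors, autocorrelation of the errors, correlation between the features (multicollinearity, which is likely here because annual_income and its square move together), and whether the variance of the errors stays constant across observations (homoscedasticity).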

First, load the dataset from the weatherinszeged table from Thinkful's database.

In [2]:
import numpy as np
import pandas as pd
from sklearn import linear_model
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sqlalchemy import create_engine

pd.options.display.float_format = '{:.3f}'.format

import warnings
warnings.filterwarnings(action="ignore")

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'weatherinszeged'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

df = pd.read_sql_query('select * from weatherinszeged',con=engine)

engine.dispose()

Build a linear regression model where your target variable is the difference between the apparenttemperature and the temperature. As explanatory variables, use humidity and windspeed. 
Now, estimate your model using OLS. Are the estimated coefficients statistically significant? Are the signs of the estimated coefficients in line with your previous expectations? Interpret the estimated coefficients. What are the relations between the target and the explanatory variables?

In [3]:
X = df[['humidity', 'windspeed']]
Y = df.apparenttemperature - df.temperature
X = sm.add_constant(X)
lrm = sm.OLS(Y, X).fit()
lrm.summary()

Dep. Variable:,y,R-squared:,0.288
Model:,OLS,Adj. R-squared:,0.288
Method:,Least Squares,F-statistic:,19490.0
Date:,"Sat, 17 Aug 2019",Prob (F-statistic):,0.0
Time:,10:39:44,Log-Likelihood:,-170460.0
No. Observations:,96453,AIC:,340900.0
Df Residuals:,96450,BIC:,340900.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,2.4381,0.021,115.948,0.000,2.397,2.479
humidity,-3.0292,0.024,-126.479,0.000,-3.076,-2.982
windspeed,-0.1193,0.001,-176.164,0.000,-0.121,-0.118

Omnibus:,3935.747,Durbin-Watson:,0.267
Prob(Omnibus):,0.0,Jarque-Bera (JB):,4613.311
Skew:,-0.478,Prob(JB):,0.0
Kurtosis:,3.484,Cond. No.,88.1


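All three coefficients are statistically significant, with p-values reported as 0.000, and the model explains about 29% of the variance in the target (R-squared of 0.288). The constant says that at zero humidity and zero windspeed the apparent temperature is 2.4381 degrees above the actual temperature. Each one-unit increase in humidity lowers the difference between the apparent and actual temperatures by 3.0292 degrees, and each one-unit increase in windspeed lowers it by another 0.1193 degrees, holding the other variable fixed. So the baseline is for it to feel hotter than it is, while humidity and windspeed both pull the apparent temperature down. That makes sense.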
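To make the coefficients concrete, here is a small sketch that plugs them into the fitted equation. The function name and the input values are illustrative assumptions; humidity and windspeed are taken in whatever units the weatherinszeged table stores them.

```python
# Coefficients copied from the OLS summary above.
CONST = 2.4381
B_HUMIDITY = -3.0292
B_WINDSPEED = -0.1193


def predicted_gap(humidity, windspeed):
    """Predicted apparenttemperature minus temperature, in degrees."""
    return CONST + B_HUMIDITY * humidity + B_WINDSPEED * windspeed


# Baseline: no humidity, no wind.
print(predicted_gap(0, 0))

# Raising humidity by one unit, wind held fixed, shifts the gap by B_HUMIDITY.
print(predicted_gap(1, 0) - predicted_gap(0, 0))

# An arbitrary illustrative point.
print(predicted_gap(0.5, 10))
```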

Next, add the interaction of humidity and windspeed to the model above and estimate the model using OLS.

In [4]:
df['humXwind'] = df.humidity * df.windspeed
X = df[['humidity', 'windspeed', 'humXwind']]
Y = df.apparenttemperature - df.temperature
X = sm.add_constant(X)
lrm = sm.OLS(Y, X).fit()
lrm.summary()

Dep. Variable:,y,R-squared:,0.341
Model:,OLS,Adj. R-squared:,0.341
Method:,Least Squares,F-statistic:,16660.0
Date:,"Sat, 17 Aug 2019",Prob (F-statistic):,0.0
Time:,11:19:21,Log-Likelihood:,-166690.0
No. Observations:,96453,AIC:,333400.0
Df Residuals:,96449,BIC:,333400.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,0.0839,0.033,2.511,0.012,0.018,0.149
humidity,0.1775,0.043,4.133,0.000,0.093,0.262
windspeed,0.0905,0.002,36.797,0.000,0.086,0.095
humXwind,-0.2971,0.003,-88.470,0.000,-0.304,-0.291

Omnibus:,4849.937,Durbin-Watson:,0.265
Prob(Omnibus):,0.0,Jarque-Bera (JB):,9295.404
Skew:,-0.378,Prob(JB):,0.0
Kurtosis:,4.32,Cond. No.,193.0


Are the coefficients statistically significant? Did the signs of the estimated coefficients for humidity and windspeed change? Interpret the estimated coefficients.

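Yes, all four coefficients are statistically significant at the 5% level (the constant has a p-value of 0.012; the others are reported as 0.000), and the signs on humidity and windspeed flipped from negative to positive. The negative interaction term is what explains both of their temperature-reducing effects in the first model. The effect of humidity is now 0.1775 - 0.2971 × windspeed per unit of humidity, so without wind, humidity raises the difference between apparent and actual temperature by 0.1775 degrees per unit, and the effect turns negative as windspeed grows. Likewise, the effect of windspeed is 0.0905 - 0.2971 × humidity per unit: at zero humidity it raises the difference by 0.0905 degrees, and higher humidity makes it negative. R-squared also rose from 0.288 to 0.341.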
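With an interaction term, the effect of each variable depends on the level of the other. A minimal sketch of that calculation with the coefficients from the summary above; the function names and the break-even variables are illustrative assumptions.

```python
# Coefficients copied from the interaction model's OLS summary.
B_HUMIDITY = 0.1775
B_WINDSPEED = 0.0905
B_INTERACTION = -0.2971


def humidity_effect(windspeed):
    """Change in the temperature gap per unit of humidity at a given windspeed."""
    return B_HUMIDITY + B_INTERACTION * windspeed


def windspeed_effect(humidity):
    """Change in the temperature gap per unit of windspeed at a given humidity."""
    return B_WINDSPEED + B_INTERACTION * humidity


# Levels at which each marginal effect crosses zero and turns negative.
windspeed_break_even = -B_HUMIDITY / B_INTERACTION
humidity_break_even = -B_WINDSPEED / B_INTERACTION

print(round(windspeed_break_even, 3), round(humidity_break_even, 3))
print(humidity_effect(0), humidity_effect(10))
```

Above the break-even levels both effects are negative, which is consistent with the negative signs the first model reported when the interaction was left out.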

Load the houseprices data from Thinkful's database.

In [11]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
df = pd.read_sql_query('select * from houseprices',con=engine)
engine.dispose()

Run your house prices model again and interpret the results. Which features are statistically significant, and which are not?

In [12]:
df.drop(['id'], axis=1, inplace=True)

## Fill continuous variable null values with zero
for column in ['masvnrarea', 'lotfrontage', 'garagecars']:
    df[column] = df[column].fillna(0)
    
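## Replace missing or non-numeric garage years with the placeholder year 1980.
## errors='coerce' turns both NaN and the string 'None' into NaN before filling.
df['garageyrblt'] = pd.to_numeric(df['garageyrblt'], errors='coerce').fillna(1980)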

## Fill all remaining null values with the string 'None'
df = df.fillna('None')

In [15]:
df2 = df[['yearbuilt', 'yearremodadd', 'masvnrarea', 'bsmtfinsf1',
'fireplaces', 'garagecars', 'wooddecksf', 'openporchsf',
'totalbsmtsf', 'firstflrsf', 'secondflrsf', 'fullbath', 'saleprice']]

Y = df2.saleprice
X = df2.drop(columns = ['saleprice'])
X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()
results.summary()

Dep. Variable:,saleprice,R-squared:,0.753
Model:,OLS,Adj. R-squared:,0.751
Method:,Least Squares,F-statistic:,368.4
Date:,"Sat, 17 Aug 2019",Prob (F-statistic):,0.0
Time:,11:44:25,Log-Likelihood:,-17522.0
No. Observations:,1460,AIC:,35070.0
Df Residuals:,1447,BIC:,35140.0
Df Model:,12,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.827e+06,1.26e+05,-14.518,0.000,-2.07e+06,-1.58e+06
yearbuilt,321.3320,52.028,6.176,0.000,219.274,423.390
yearremodadd,607.2316,65.261,9.305,0.000,479.215,735.248
masvnrarea,38.1261,6.636,5.746,0.000,25.110,51.142
bsmtfinsf1,13.4850,2.766,4.876,0.000,8.060,18.910
fireplaces,1.214e+04,1887.454,6.434,0.000,8440.829,1.58e+04
garagecars,1.788e+04,1875.061,9.535,0.000,1.42e+04,2.16e+04
wooddecksf,28.8622,8.822,3.272,0.001,11.557,46.168
openporchsf,23.8654,16.944,1.409,0.159,-9.371,57.102

Omnibus:,668.695,Durbin-Watson:,1.948
Prob(Omnibus):,0.0,Jarque-Bera (JB):,90737.056
Skew:,-1.086,Prob(JB):,0.0
Kurtosis:,41.56,Cond. No.,398000.0


Now, exclude the insignificant features from your model. Did anything change?
Interpret the statistically significant coefficients by quantifying their relations with the house prices. Which features have a more prominent effect on house prices?
Do the results sound reasonable to you? If not, try to explain the potential reasons.

In [16]:
df2 = df[['yearbuilt', 'yearremodadd', 'masvnrarea', 'bsmtfinsf1', 'fireplaces',
          'garagecars', 'wooddecksf', 'totalbsmtsf', 'firstflrsf', 'secondflrsf', 'saleprice']]

Y = df2.saleprice
X = df2.drop(columns = ['saleprice'])
X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()
results.summary()

Dep. Variable:,saleprice,R-squared:,0.753
Model:,OLS,Adj. R-squared:,0.751
Method:,Least Squares,F-statistic:,441.9
Date:,"Sat, 17 Aug 2019",Prob (F-statistic):,0.0
Time:,18:29:43,Log-Likelihood:,-17523.0
No. Observations:,1460,AIC:,35070.0
Df Residuals:,1449,BIC:,35130.0
Df Model:,10,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.849e+06,1.16e+05,-15.876,0.000,-2.08e+06,-1.62e+06
yearbuilt,324.4480,49.089,6.609,0.000,228.155,420.741
yearremodadd,615.1727,64.677,9.511,0.000,488.303,742.043
masvnrarea,37.6681,6.619,5.691,0.000,24.684,50.652
bsmtfinsf1,13.4715,2.725,4.943,0.000,8.126,18.817
fireplaces,1.22e+04,1881.785,6.483,0.000,8507.451,1.59e+04
garagecars,1.788e+04,1873.607,9.541,0.000,1.42e+04,2.16e+04
wooddecksf,28.1103,8.804,3.193,0.001,10.841,45.380
totalbsmtsf,28.2655,4.509,6.269,0.000,19.421,37.110

Omnibus:,661.787,Durbin-Watson:,1.949
Prob(Omnibus):,0.0,Jarque-Bera (JB):,88432.588
Skew:,-1.067,Prob(JB):,0.0
Kurtosis:,41.067,Cond. No.,368000.0


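Pretty much nothing changed: after dropping openporchsf and fullbath, R-squared stays at 0.753 and the remaining coefficients barely move. Every variable shown has a positive, statistically significant effect on the price of the house. The constant is remarkably low (about -1.85 million dollars), but it is the predicted price for a house with every feature at zero, including a build year of 0, so it is an extrapolation far outside the data rather than a meaningful price. The values of many coefficients are lower than expected. For instance, for every square foot added, the price goes up nearly 60 dollars, which is far less than the cost to add a square foot. Others, like the year built (about $324 per year) or remodeled (about $615 per year), are shockingly low considering the improvements that time has brought to construction techniques. On the other hand, fireplaces seem to add quite a lot of value (about $12,200 each), as does the size of the garage (about $17,900 per car of capacity). Fireplace quantity and garage size might reflect the quality of the house in ways not captured by the other variables, or the buyers are illogical and easily impressed by not particularly useful additions.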
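One way to read the very low constant is to look at it together with the two year coefficients, since every house in the data has year values far from zero. The sketch below uses the rounded coefficients from the second summary table; the function name and the sample years are illustrative assumptions, and the result ignores all the other features.

```python
# Coefficients copied (as rounded) from the second OLS summary.
CONST = -1.849e6
B_YEARBUILT = 324.4480
B_YEARREMODADD = 615.1727


def year_baseline(yearbuilt, yearremodadd):
    """Constant plus the two year terms, with every other feature set to zero."""
    return CONST + B_YEARBUILT * yearbuilt + B_YEARREMODADD * yearremodadd


# Contribution of the year terms alone for a house built and remodeled in 2000.
year_terms_2000 = B_YEARBUILT * 2000 + B_YEARREMODADD * 2000
print(year_terms_2000)

# The year terms largely cancel the constant for realistic years.
print(year_baseline(2000, 2000))
```

Seen this way, the constant mostly offsets the year terms, so its size says little about house prices on its own. Subtracting a reference year from both year columns before fitting would be one way to get a more readable intercept.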