
### 1. Interpretation and significance

Suppose that we would like to know how much families in the US are spending on recreation annually. We've estimated the following model:

$$ expenditure = 873 + 0.0012annual\_income + 0.00002annual\_income^2 - 223.57have\_kids $$

*expenditure* is the annual spending on recreation in US dollars, *annual_income* is the annual income in US dollars, and *have_kids* is a dummy variable indicating the families with children. Interpret the estimated coefficients. What additional statistics should be given in order to make sure that your interpretations make sense statistically? Write up your answer.

---

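The model reports no measures of statistical significance for the coefficients. To judge whether the interpretations below hold statistically, we would need each coefficient's standard error, t-statistic, and p-value (or confidence interval). For this exercise, we will assume all coefficients are statistically significant.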

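Coefficient interpretation

Intercept (bias term) = 873: the predicted annual recreation spending, in US dollars, for a family with no children and zero income.

The equation is quadratic in annual income, so the effect of an extra dollar of income is not constant: the contribution of the squared term grows as annual income increases.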

Families with kids spend, on average, 223.57 US dollars less per year on recreation than families without children at the same income. Income contributes 0.0012 times annual income plus 0.00002 times annual income squared to the predicted expenditure. Because the relation is quadratic, the effect of one additional dollar of income is approximately 0.0012 + 2 × 0.00002 × annual income (the derivative of the income terms), so each extra dollar raises recreation spending by more at higher incomes.

If we were to graph this, we would get two upward-curving lines (kids/no kids), with the no-kids curve running a constant 223.57 above the with-kids curve and recreation expenditure (Y) rising with annual income (X).
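The interpretation above can be checked numerically. The following is a minimal sketch that uses only the coefficients from the estimated equation. It evaluates the predicted expenditure and the marginal effect of income, taken here as the derivative 0.0012 + 2 × 0.00002 × income. The function names and the sample income levels are illustrative choices, not part of the original exercise.

```python
# Coefficients taken from the estimated model in the exercise.
INTERCEPT = 873.0
B_INCOME = 0.0012
B_INCOME_SQ = 0.00002
B_KIDS = -223.57


def predicted_expenditure(annual_income, have_kids):
    """Predicted annual recreation spending in US dollars."""
    return (INTERCEPT
            + B_INCOME * annual_income
            + B_INCOME_SQ * annual_income ** 2
            + B_KIDS * have_kids)


def marginal_effect_of_income(annual_income):
    """Derivative of predicted expenditure with respect to income."""
    return B_INCOME + 2 * B_INCOME_SQ * annual_income


# Hypothetical income levels, chosen only for illustration.
for income in (20_000, 50_000, 100_000):
    gap = predicted_expenditure(income, 0) - predicted_expenditure(income, 1)
    print(income, round(marginal_effect_of_income(income), 4), round(gap, 2))
```

The gap between the two groups is the same at every income level because *have_kids* enters the model only as an intercept shift, while the marginal effect of income grows linearly with income.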

---

### 2. Weather model

In this exercise, you'll work with the historical temperature data from the previous checkpoint. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* First, load the dataset from the **weatherinszeged** table from Thinkful's database.
* Build a linear regression model where your target variable is the difference between the *apparenttemperature* and the *temperature*. As explanatory variables, use *humidity* and *windspeed*. Now, estimate your model using OLS. Are the estimated coefficients statistically significant? Are the signs of the estimated coefficients in line with your previous expectations? Interpret the estimated coefficients. What are the relations between the target and the explanatory variables? 
* Next, include the interaction of *humidity* and *windspeed* to the model above and estimate the model using OLS. Are the coefficients statistically significant? Did the signs of the estimated coefficients for *humidity* and *windspeed* change? Interpret the estimated coefficients.

In [0]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine
import statsmodels.api as sm

import warnings
warnings.filterwarnings('ignore')

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'weatherinszeged'

In [0]:
engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
weather_df = pd.read_sql_query('select * from weatherinszeged',con=engine)

engine.dispose()

In [0]:
#create target variable
Y = weather_df['apparenttemperature'] - weather_df['temperature']
#create explanatory variables
X = weather_df[['humidity', 'windspeed']]

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.288 |
| Model: | OLS | Adj. R-squared: | 0.288 |
| Method: | Least Squares | F-statistic: | 19490.0 |
| Date: | Mon, 11 Nov 2019 | Prob (F-statistic): | 0.0 |
| Time: | 11:57:30 | Log-Likelihood: | -170460.0 |
| No. Observations: | 96453 | AIC: | 340900.0 |
| Df Residuals: | 96450 | BIC: | 340900.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.4381 | 0.021 | 115.948 | 0.000 | 2.397 | 2.479 |
| humidity | -3.0292 | 0.024 | -126.479 | 0.000 | -3.076 | -2.982 |
| windspeed | -0.1193 | 0.001 | -176.164 | 0.000 | -0.121 | -0.118 |

| | | | |
|---|---|---|---|
| Omnibus: | 3935.747 | Durbin-Watson: | 0.267 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 4613.311 |
| Skew: | -0.478 | Prob(JB): | 0.0 |
| Kurtosis: | 3.484 | Cond. No. | 88.1 |


- All estimated coefficients are statistically significant: their p-values are below 0.05.
- The humidity coefficient is negative, which runs against our expectation that the target would increase as humidity increases. The windspeed coefficient is also negative, as we would expect.
- According to this model, a one-unit increase in humidity is associated with a 3.0292-degree decrease in our target (apparent temperature minus temperature), holding windspeed constant. A one-unit increase in windspeed is associated with a 0.1193-degree decrease in the target, holding humidity constant.
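The significance claim can be reproduced approximately from the coefficient table alone. The sketch below recomputes each t-statistic as coef / std err and builds a 95% confidence interval using the normal critical value 1.96, which is an assumption that seems reasonable with roughly 96,000 observations. The helper name `t_and_interval` is hypothetical, and because the summary rounds the standard errors to three decimals, the results only approximate the reported values.

```python
# (coef, std err) pairs copied from the first OLS summary above.
# The standard errors are rounded in the summary, so the recomputed
# values only approximate the reported t-statistics and intervals.
estimates = {
    "const": (2.4381, 0.021),
    "humidity": (-3.0292, 0.024),
    "windspeed": (-0.1193, 0.001),
}

Z_95 = 1.96  # normal approximation to the t critical value (assumption)


def t_and_interval(coef, std_err, z=Z_95):
    """Return the t-statistic and an approximate 95% confidence interval."""
    t_stat = coef / std_err
    return t_stat, (coef - z * std_err, coef + z * std_err)


for name, (coef, std_err) in estimates.items():
    t_stat, (low, high) = t_and_interval(coef, std_err)
    # A coefficient is significant at the 5% level when its interval excludes zero.
    excludes_zero = low > 0 or high < 0
    print(f"{name}: t = {t_stat:.1f}, CI = [{low:.3f}, {high:.3f}], "
          f"excludes zero: {excludes_zero}")
```

A confidence interval that excludes zero corresponds to a p-value below 0.05, which is the criterion used in the first bullet above.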

In [0]:
#create interaction
weather_df['interaction'] = weather_df.humidity * weather_df.windspeed

#create target variable
Y = weather_df['apparenttemperature'] - weather_df['temperature']
#create explanatory variables
X = weather_df[['humidity', 'windspeed', 'interaction']]

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.341 |
| Model: | OLS | Adj. R-squared: | 0.341 |
| Method: | Least Squares | F-statistic: | 16660.0 |
| Date: | Mon, 11 Nov 2019 | Prob (F-statistic): | 0.0 |
| Time: | 12:08:44 | Log-Likelihood: | -166690.0 |
| No. Observations: | 96453 | AIC: | 333400.0 |
| Df Residuals: | 96449 | BIC: | 333400.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.0839 | 0.033 | 2.511 | 0.012 | 0.018 | 0.149 |
| humidity | 0.1775 | 0.043 | 4.133 | 0.000 | 0.093 | 0.262 |
| windspeed | 0.0905 | 0.002 | 36.797 | 0.000 | 0.086 | 0.095 |
| interaction | -0.2971 | 0.003 | -88.470 | 0.000 | -0.304 | -0.291 |

| | | | |
|---|---|---|---|
| Omnibus: | 4849.937 | Durbin-Watson: | 0.265 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 9295.404 |
| Skew: | -0.378 | Prob(JB): | 0.0 |
| Kurtosis: | 4.32 | Cond. No. | 193.0 |


- All coefficients are statistically significant: their p-values are below 0.05.
- The signs of the humidity and windspeed coefficients both changed, from negative to positive.
- In this model, the humidity coefficient (0.1775) is the effect of a one-unit increase in humidity only when windspeed is zero, and the windspeed coefficient (0.0905) is the effect of a one-unit increase in windspeed only when humidity is zero.
- The interaction coefficient is -0.2971. Given a humidity level, a one-unit increase in windspeed changes the target by 0.0905 - 0.2971 × humidity. Given a windspeed, a one-unit increase in humidity changes the target by 0.1775 - 0.2971 × windspeed. In other words, the effect of each variable on the target depends on the level of the other: higher windspeed makes the effect of humidity more negative, and higher humidity makes the effect of windspeed more negative.
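The conditional effects described above can be worked out directly from the reported coefficients. This is a small sketch with hypothetical function names, and the humidity and windspeed values passed in are illustrative inputs rather than values drawn from the data.

```python
# Coefficients reported in the interaction-model summary above.
B_HUMIDITY = 0.1775
B_WINDSPEED = 0.0905
B_INTERACTION = -0.2971


def windspeed_effect(humidity):
    """Change in the target per one-unit increase in windspeed, at a given humidity."""
    return B_WINDSPEED + B_INTERACTION * humidity


def humidity_effect(windspeed):
    """Change in the target per one-unit increase in humidity, at a given windspeed."""
    return B_HUMIDITY + B_INTERACTION * windspeed


# Humidity level at which the windspeed effect switches from positive to negative.
break_even_humidity = -B_WINDSPEED / B_INTERACTION

# Illustrative inputs only.
for humidity in (0.0, 0.5, 1.0):
    print("humidity", humidity, "-> windspeed effect", round(windspeed_effect(humidity), 4))
for windspeed in (0.0, 10.0):
    print("windspeed", windspeed, "-> humidity effect", round(humidity_effect(windspeed), 4))
print("break-even humidity", round(break_even_humidity, 4))
```

Although both main-effect coefficients are positive, the effect of each variable turns negative once the other is sufficiently large, which is consistent with the negative signs found in the model without the interaction.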

---

###  3. House prices model

In this exercise, you'll interpret your house prices model. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Run your house prices model again and interpret the results. Which features are statistically significant, and which are not?
* Now, exclude the insignificant features from your model. Did anything change?
* Interpret the statistically significant coefficients by quantifying their relations with the house prices. Which features have a more prominent effect on house prices?
* Do the results sound reasonable to you? If not, try to explain the potential reasons.

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine
import statsmodels.api as sm

import warnings
warnings.filterwarnings('ignore')

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

In [0]:
engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
houses_df = pd.read_sql_query('select * from houseprices',con=engine)

engine.dispose()

In [0]:
# One-hot encode the categorical variables and append the dummy columns to the dataframe

houses_df = pd.concat([houses_df,pd.get_dummies(houses_df.mszoning, prefix="mszoning", drop_first=True)], axis=1)
houses_df = pd.concat([houses_df,pd.get_dummies(houses_df.street, prefix="street", drop_first=True)], axis=1)
# NOTE: the next line encodes `street` again under the "kitchenqual" prefix, so the
# kitchenqual_* columns duplicate the street_* columns. Encoding kitchen quality would
# require houses_df.kitchenqual here and in cat_column_names2 below.
houses_df = pd.concat([houses_df,pd.get_dummies(houses_df.street, prefix="kitchenqual", drop_first=True)], axis=1)

# Collect the names of the dummy columns so they can be added to the feature set
cat_column_names = list(pd.get_dummies(houses_df.mszoning, prefix="mszoning", drop_first=True).columns)
cat_column_names = cat_column_names + list(pd.get_dummies(houses_df.street, prefix="street", drop_first=True).columns)
cat_column_names2 = cat_column_names + list(pd.get_dummies(houses_df.street, prefix="kitchenqual", drop_first=True).columns)

In [0]:
# Y is the target variable
Y = houses_df['saleprice']
# X is the feature set: five numeric features plus the dummy columns
X = houses_df[['overallqual', 'grlivarea', 'garagecars', 'garagearea', 'totalbsmtsf']  + cat_column_names2]

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | saleprice | R-squared: | 0.769 |
| Model: | OLS | Adj. R-squared: | 0.767 |
| Method: | Least Squares | F-statistic: | 482.0 |
| Date: | Mon, 11 Nov 2019 | Prob (F-statistic): | 0.0 |
| Time: | 12:24:39 | Log-Likelihood: | -17475.0 |
| No. Observations: | 1460 | AIC: | 34970.0 |
| Df Residuals: | 1449 | BIC: | 35030.0 |
| Df Model: | 10 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.173e+05 | 1.8e+04 | -6.502 | 0.000 | -1.53e+05 | -8.19e+04 |
| overallqual | 2.333e+04 | 1088.506 | 21.430 | 0.000 | 2.12e+04 | 2.55e+04 |
| grlivarea | 45.6344 | 2.468 | 18.494 | 0.000 | 40.794 | 50.475 |
| garagecars | 1.345e+04 | 2990.453 | 4.498 | 0.000 | 7584.056 | 1.93e+04 |
| garagearea | 16.4082 | 10.402 | 1.577 | 0.115 | -3.997 | 36.813 |
| totalbsmtsf | 28.3816 | 2.931 | 9.684 | 0.000 | 22.633 | 34.131 |
| mszoning_FV | 2.509e+04 | 1.37e+04 | 1.833 | 0.067 | -1761.679 | 5.19e+04 |
| mszoning_RH | 1.342e+04 | 1.58e+04 | 0.847 | 0.397 | -1.77e+04 | 4.45e+04 |
| mszoning_RL | 2.857e+04 | 1.27e+04 | 2.246 | 0.025 | 3612.782 | 5.35e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 415.883 | Durbin-Watson: | 1.979 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 41281.526 |
| Skew: | -0.115 | Prob(JB): | 0.0 |
| Kurtosis: | 29.049 | Cond. No. | 1.24e+21 |


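- Significant features (p-value below 0.05) include: overallqual, grlivarea, garagecars, totalbsmtsf, and mszoning_RL. Of the features shown, garagearea, mszoning_FV, and mszoning_RH are not significant.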
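The selection of significant features can be expressed as a simple filter on the p-values. The sketch below copies the `P>|t|` column from the coefficient table above (constant omitted) and applies a 0.05 threshold. The dictionary and variable names are illustrative.

```python
# P>|t| values copied from the coefficient table above (constant omitted).
p_values = {
    "overallqual": 0.000,
    "grlivarea": 0.000,
    "garagecars": 0.000,
    "garagearea": 0.115,
    "totalbsmtsf": 0.000,
    "mszoning_FV": 0.067,
    "mszoning_RH": 0.397,
    "mszoning_RL": 0.025,
}

ALPHA = 0.05  # conventional 5% significance level

# Split the features by whether their p-value falls below the threshold.
significant = [name for name, p in p_values.items() if p < ALPHA]
not_significant = [name for name, p in p_values.items() if p >= ALPHA]

print("significant:", significant)
print("not significant:", not_significant)
```

With a fitted statsmodels result, the same filter could be applied to `results.pvalues` instead of a hand-copied dictionary.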

In [0]:
# Y is the target variable
Y = houses_df['saleprice']
# X is the feature set, restricted to the statistically significant features
X = houses_df[['overallqual', 'grlivarea', 'garagecars', 'totalbsmtsf', 'mszoning_RL']]

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | saleprice | R-squared: | 0.767 |
| Model: | OLS | Adj. R-squared: | 0.766 |
| Method: | Least Squares | F-statistic: | 956.8 |
| Date: | Mon, 11 Nov 2019 | Prob (F-statistic): | 0.0 |
| Time: | 12:28:23 | Log-Likelihood: | -17481.0 |
| No. Observations: | 1460 | AIC: | 34970.0 |
| Df Residuals: | 1454 | BIC: | 35010.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.083e+05 | 4804.236 | -22.540 | 0.000 | -1.18e+05 | -9.89e+04 |
| overallqual | 2.396e+04 | 1060.549 | 22.588 | 0.000 | 2.19e+04 | 2.6e+04 |
| grlivarea | 45.4093 | 2.452 | 18.517 | 0.000 | 40.599 | 50.220 |
| garagecars | 1.763e+04 | 1731.766 | 10.183 | 0.000 | 1.42e+04 | 2.1e+04 |
| totalbsmtsf | 28.8729 | 2.862 | 10.088 | 0.000 | 23.259 | 34.487 |
| mszoning_RL | 1.596e+04 | 2558.589 | 6.238 | 0.000 | 1.09e+04 | 2.1e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 402.656 | Durbin-Watson: | 1.979 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 35429.68 |
| Skew: | -0.08 | Prob(JB): | 0.0 |
| Kurtosis: | 27.133 | Cond. No. | 9530.0 |


After excluding insignificant features we can interpret the model as follows:

- A one-point increase in overallqual is associated with a 23960 increase in sale price, holding the other features constant.
- Each additional square foot of grlivarea (above-ground living area) is associated with a 45.4093 increase in sale price.
- Each additional car of garage capacity (garagecars) is associated with a 17630 increase in sale price.
- Each additional square foot of totalbsmtsf (total basement area) is associated with a 28.8729 increase in sale price.
- Houses zoned mszoning_RL sell for 15960 more, on average, than houses in the other zoning categories.

According to our model, overallqual, garagecars and mszoning_RL have the largest coefficients, although the coefficients are not directly comparable because the features are measured in different units (a quality point, a car space, a zoning indicator, a square foot). These results sound reasonable, as the overall quality of the build material and finish (overallqual) should have a large impact on sale price. The higher the quality of the house, the more it should sell for. The size of the garage was a little surprising to me, but it makes sense that it has a big impact on the sale of a house. I just wasn't expecting it to be that big. The mszoning_RL variable makes sense, as low-density residential areas are probably in high demand. I know I would rather have a big lot than sit on top of my neighbors.
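Because the features are measured in different units, one way to compare them is to multiply each coefficient by a change that is meaningful for that feature. The sketch below does this with the coefficients from the reduced model. The increments (for example, 100 square feet of living area) are hypothetical choices made only for illustration, and a different choice would change the ranking.

```python
# Coefficients from the reduced model summary above.
coefficients = {
    "overallqual": 23960.0,
    "grlivarea": 45.4093,
    "garagecars": 17630.0,
    "totalbsmtsf": 28.8729,
    "mszoning_RL": 15960.0,
}

# Hypothetical increments, chosen only for illustration:
# one quality point, 100 sq ft of living area, one garage car space,
# 100 sq ft of basement, and a switch of the zoning dummy from 0 to 1.
increments = {
    "overallqual": 1,
    "grlivarea": 100,
    "garagecars": 1,
    "totalbsmtsf": 100,
    "mszoning_RL": 1,
}

# Predicted change in sale price for each illustrative increment.
effects = {name: coefficients[name] * increments[name] for name in coefficients}

# Feature names ordered from the largest to the smallest predicted change.
ranked = sorted(effects, key=effects.get, reverse=True)
for name in ranked:
    print(f"{name}: {effects[name]:.2f}")
```

Under these particular increments, overallqual, garagecars and mszoning_RL still come out ahead of the two area features, in line with the discussion above.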