# Introduction

### Imports

In [6]:
from statsmodels.tsa.stattools import acf
import statsmodels.api as sm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sqlalchemy import create_engine
%matplotlib inline

# Options for pandas
pd.options.display.max_columns = 150
pd.options.display.max_rows = 150


# Notes

F-Test:
    - Purpose: Determine whether the model's features explain variance in the outcome, or whether the outcome is explained just as well without them
    - High-level approach: Take the drop in unexplained variance (residual sum of squares) from the reduced model to the full model, divide it by the difference between the two models' degrees of freedom, and compare the result to the full model's unexplained variance per degree of freedom
    - Reduced model: A model that includes none of the features (bias only), so all of the variance in the outcome is left unexplained
    - Full model's parameters (for a single-feature model; in general, the bias plus one coefficient per feature):
$$p_F = 2 \quad (\text{bias and coefficient})$$
    - Reduced model's parameters:
$$p_R = 1 \quad (\text{bias})$$
    - Degrees of freedom of the full model ($n$ = number of data points):
$$df_F = n - p_F$$
    - Degrees of freedom of the reduced model:
$$df_R = n - p_R$$
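
As a rough illustration of the notes above, the F statistic can be written as $F = \frac{(SSE_R - SSE_F)/(df_R - df_F)}{SSE_F/df_F}$, where $SSE$ is each model's residual sum of squares. The sketch below computes it with NumPy on synthetic data. The data, the seed, and the helper name `f_statistic` are assumptions made for illustration and are not part of the assignment.

```python
import numpy as np

def f_statistic(X, y):
    """F statistic comparing a full OLS model against a bias-only model."""
    n = len(y)
    # Full model: bias column plus the features
    X_full = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X_full, y, rcond=None)
    sse_full = np.sum((y - X_full @ beta) ** 2)
    # Reduced model: bias only, so every prediction is the mean of y
    sse_reduced = np.sum((y - y.mean()) ** 2)
    df_full = n - X_full.shape[1]  # df_F = n - p_F
    df_reduced = n - 1             # df_R = n - p_R
    return ((sse_reduced - sse_full) / (df_reduced - df_full)) / (sse_full / df_full)

# Synthetic single-feature data (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 + 2.0 * x + rng.normal(size=200)

f_value = f_statistic(x.reshape(-1, 1), y)
print(f_value)
```

A large F value means the features remove far more unexplained variance than would be expected by chance, which is what a near-zero `Prob (F-statistic)` in a statsmodels summary reflects.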
 
    

# Weather model

For this assignment, you'll revisit the historical temperature dataset. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* First, load the dataset from the **weatherinszeged** table from Thinkful's database.
* Like in the previous checkpoint, build a linear regression model where your target variable is the difference between the *apparenttemperature* and the *temperature*. As explanatory variables, use *humidity* and *windspeed*. Now, estimate your model using OLS. What are the R-squared and adjusted R-squared values? Do you think they are satisfactory? Why? 
* Next, include the interaction of *humidity* and *windspeed* to the model above and estimate the model using OLS. Now, what is the R-squared of this model? Does this model improve upon the previous one? 
* Add *visibility* as an additional explanatory variable to the first model and estimate it. Did R-squared increase? What about adjusted R-squared? Compare the differences put on the table by the interaction term and the *visibility* in terms of the improvement in the adjusted R-squared. Which one is more useful?
* Choose the best one from the three models above with respect to their AIC and BIC scores. Validate your choice by discussing your justification with your mentor.

In [7]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'weatherinszeged'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

weather_df = pd.read_sql_query('select * from weatherinszeged',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()

In [8]:
model = LinearRegression()
weather_df['target_temp'] = weather_df['apparenttemperature'] - weather_df['temperature']
X = weather_df[['humidity', 'windspeed']]
Y = weather_df['target_temp']
model.fit(X,Y)

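# NOTE: add_constant returns a new DataFrame and the result is discarded here,
# so the OLS fit below has no intercept (the summary shows no `const` row).
# Assign the result (X = sm.add_constant(X)) to include an intercept.
sm.add_constant(X)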

pred = sm.OLS(Y,X).fit()
pred.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | target_temp | R-squared: | 0.425 |
| Model: | OLS | Adj. R-squared: | 0.425 |
| Method: | Least Squares | F-statistic: | 35700.0 |
| Date: | Wed, 17 Jul 2019 | Prob (F-statistic): | 0.0 |
| Time: | 12:18:50 | Log-Likelihood: | -176750.0 |
| No. Observations: | 96453 | AIC: | 353500.0 |
| Df Residuals: | 96451 | BIC: | 353500.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| humidity | -0.4873 | 0.010 | -47.338 | 0.000 | -0.507 | -0.467 |
| windspeed | -0.0772 | 0.001 | -126.510 | 0.000 | -0.078 | -0.076 |

| | | | |
|---|---|---|---|
| Omnibus: | 9577.682 | Durbin-Watson: | 0.228 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 12669.324 |
| Skew: | -0.867 | Prob(JB): | 0.0 |
| Kurtosis: | 3.378 | Cond. No. | 27.2 |


The R^2 and adjusted R^2 values are identical at 0.425, which is a fairly weak fit: the model explains roughly 43 percent of the variation in the target. The summary has no `const` row, so this fit has no intercept, and statsmodels reports the uncentered R^2 in that case. The p-value for the F-test is effectively 0, however, which indicates that humidity and windspeed are jointly significant in explaining the target variable.
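
To see why the two scores coincide here, it may help to compute the adjustment directly. This is a minimal sketch using the textbook adjusted R^2 formula, which assumes a model with an intercept; statsmodels handles the degrees of freedom slightly differently for a fit without one. The function name `adjusted_r2` and the 20-row comparison sample are hypothetical.

```python
def adjusted_r2(r2, n, p):
    """Textbook adjusted R^2 for n observations and p features (intercept assumed)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Values read from the summary above: R^2 = 0.425, n = 96453, two features
adj_large_n = adjusted_r2(0.425, 96453, 2)

# Same R^2 with a hypothetical sample of only 20 rows, for contrast
adj_small_n = adjusted_r2(0.425, 20, 2)

print(round(adj_large_n, 3), round(adj_small_n, 3))
```

With almost 100,000 rows and only two features, the penalty factor is so close to 1 that the adjustment disappears at three decimal places. The penalty only becomes visible when the number of features is large relative to the number of rows.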

In [11]:
model = LinearRegression()

weather_df['wind_humid_interaction'] = weather_df['humidity'] * weather_df['windspeed']

X = weather_df[['humidity', 'windspeed', 'wind_humid_interaction']]
Y = weather_df['target_temp']
model.fit(X,Y)

# NOTE: the result of add_constant is not assigned, so this fit has no intercept
sm.add_constant(X)

pred = sm.OLS(Y,X).fit()
pred.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | target_temp | R-squared: | 0.533 |
| Model: | OLS | Adj. R-squared: | 0.533 |
| Method: | Least Squares | F-statistic: | 36770.0 |
| Date: | Wed, 17 Jul 2019 | Prob (F-statistic): | 0.0 |
| Time: | 13:05:55 | Log-Likelihood: | -166700.0 |
| No. Observations: | 96453 | AIC: | 333400.0 |
| Df Residuals: | 96450 | BIC: | 333400.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| humidity | 0.2820 | 0.011 | 26.590 | 0.000 | 0.261 | 0.303 |
| windspeed | 0.0958 | 0.001 | 74.776 | 0.000 | 0.093 | 0.098 |
| wind_humid_interaction | -0.3038 | 0.002 | -149.513 | 0.000 | -0.308 | -0.300 |

| | | | |
|---|---|---|---|
| Omnibus: | 4919.327 | Durbin-Watson: | 0.265 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 9471.445 |
| Skew: | -0.381 | Prob(JB): | 0.0 |
| Kurtosis: | 4.333 | Cond. No. | 38.0 |


Both R^2 and adjusted R^2 increased by about 11 percentage points, from 0.425 to 0.533, so adding the interaction term improves on the previous model.
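
One way to read the interaction coefficients is that the effect of humidity on the target is no longer a single number: it depends on the wind speed. The sketch below plugs in the coefficients from the summary above. The function name `humidity_effect` and the wind speeds in the loop are illustrative assumptions.

```python
# Coefficients copied from the interaction model's summary above
B_HUMIDITY = 0.2820
B_WINDSPEED = 0.0958
B_INTERACTION = -0.3038

def humidity_effect(windspeed):
    """Change in the predicted target per unit of humidity at a given wind speed."""
    return B_HUMIDITY + B_INTERACTION * windspeed

# Illustrative wind speeds, in whatever unit the dataset uses
for ws in (0, 5, 10, 20):
    print(ws, round(humidity_effect(ws), 3))
```

The humidity effect is slightly positive in still air and turns increasingly negative as wind speed rises, which a model without the interaction term cannot represent.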

In [12]:
model = LinearRegression()

X = weather_df[['humidity', 'windspeed', 'visibility']]
Y = weather_df['target_temp']
model.fit(X,Y)

# NOTE: the result of add_constant is not assigned, so this fit has no intercept
sm.add_constant(X)

pred = sm.OLS(Y,X).fit()
pred.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | target_temp | R-squared: | 0.49 |
| Model: | OLS | Adj. R-squared: | 0.49 |
| Method: | Least Squares | F-statistic: | 30940.0 |
| Date: | Wed, 17 Jul 2019 | Prob (F-statistic): | 0.0 |
| Time: | 13:06:19 | Log-Likelihood: | -170960.0 |
| No. Observations: | 96453 | AIC: | 341900.0 |
| Df Residuals: | 96450 | BIC: | 341900.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| humidity | -1.3488 | 0.012 | -108.590 | 0.000 | -1.373 | -1.324 |
| windspeed | -0.1052 | 0.001 | -167.634 | 0.000 | -0.106 | -0.104 |
| visibility | 0.0976 | 0.001 | 110.936 | 0.000 | 0.096 | 0.099 |

| | | | |
|---|---|---|---|
| Omnibus: | 5476.521 | Durbin-Watson: | 0.283 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 6619.177 |
| Skew: | -0.587 | Prob(JB): | 0.0 |
| Kurtosis: | 3.519 | Cond. No. | 43.9 |


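The humidity/windspeed interaction model performed better than the visibility model. Adding visibility raised both R^2 and adjusted R^2 from 0.425 to 0.490, while adding the interaction term raised them to 0.533, about four percentage points higher. The interaction term is therefore the more useful addition.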

Lower AIC and BIC scores are better. Of the three models, the humidity/windspeed interaction model has the lowest AIC and BIC (both about 333,400, versus 341,900 for the visibility model and 353,500 for the first model), so it is the best of the three.
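
For reference, both criteria come from the log-likelihood plus a penalty on the number of estimated parameters $k$: $AIC = 2k - 2\ln L$ and $BIC = k\ln n - 2\ln L$. The sketch below recomputes them from the log-likelihoods shown in the three summaries. Those values are rounded in the output, so the results should match the reported scores only to the displayed precision. The model labels are made up for this example.

```python
import math

def aic(log_likelihood, k):
    """Akaike information criterion for k estimated parameters."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * log_likelihood

N_OBS = 96453

# (log-likelihood, number of estimated coefficients) read from the summaries above
models = {
    "humidity + windspeed": (-176750.0, 2),
    "with interaction": (-166700.0, 3),
    "with visibility": (-170960.0, 3),
}

scores = {name: (aic(ll, k), bic(ll, k, N_OBS)) for name, (ll, k) in models.items()}
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])

for name, (a, b) in scores.items():
    print(f"{name:22s} AIC={a:.0f} BIC={b:.0f}")
print(best_by_aic, best_by_bic)
```

Because the log-likelihoods differ by thousands while the penalty terms differ by a few units, the ranking here is driven almost entirely by fit rather than by model size.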

# House prices model

In this exercise, you'll work on your house prices model. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Run your house prices model again and assess the goodness of fit of your model using F-test, R-squared, adjusted R-squared, AIC and BIC.
* Do you think your model is satisfactory? If so, why?
* In order to improve the goodness of fit of your model, try different model specifications by adding or removing some variables. 
* For each model you try, get the goodness of fit metrics and compare your models with each other. Which model is the best and why?

In [13]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
housing_df = pd.read_sql_query('select * from houseprices',con=engine)

engine.dispose()


In [26]:
feature_list = ['overallqual', 'yearbuilt', 'yearremodadd', 'totalbsmtsf', 'grlivarea', 'garagearea']

model = LinearRegression()

X = housing_df[feature_list]
Y = housing_df['saleprice']

model.fit(X,Y)

print(model.coef_)
X = sm.add_constant(X)
results = sm.OLS(Y,X).fit()

results.summary()

[19666.6743967    251.24645113   283.11642118    28.56583563
    50.74507536    45.97229764]


| | | | |
|---|---|---|---|
| Dep. Variable: | saleprice | R-squared: | 0.77 |
| Model: | OLS | Adj. R-squared: | 0.769 |
| Method: | Least Squares | F-statistic: | 810.0 |
| Date: | Wed, 17 Jul 2019 | Prob (F-statistic): | 0.0 |
| Time: | 13:29:01 | Log-Likelihood: | -17472.0 |
| No. Observations: | 1460 | AIC: | 34960.0 |
| Df Residuals: | 1453 | BIC: | 34990.0 |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.125e+06 | 1.2e+05 | -9.344 | 0.000 | -1.36e+06 | -8.89e+05 |
| overallqual | 1.967e+04 | 1178.777 | 16.684 | 0.000 | 1.74e+04 | 2.2e+04 |
| yearbuilt | 251.2465 | 47.287 | 5.313 | 0.000 | 158.489 | 344.004 |
| yearremodadd | 283.1164 | 63.670 | 4.447 | 0.000 | 158.222 | 408.011 |
| totalbsmtsf | 28.5658 | 2.862 | 9.980 | 0.000 | 22.951 | 34.180 |
| grlivarea | 50.7451 | 2.553 | 19.877 | 0.000 | 45.737 | 55.753 |
| garagearea | 45.9723 | 6.148 | 7.478 | 0.000 | 33.913 | 58.032 |

| | | | |
|---|---|---|---|
| Omnibus: | 513.186 | Durbin-Watson: | 1.977 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 67391.62 |
| Skew: | -0.544 | Prob(JB): | 0.0 |
| Kurtosis: | 36.266 | Cond. No. | 410000.0 |


The model's R^2 (0.770) and adjusted R^2 (0.769) are reasonable but could be improved. Since the model already has a fair number of parameters, overfitting may become an issue if more features are added solely to raise the R^2 score.

The F-test (F-statistic of 810 with a p-value of effectively 0) indicates that the explanatory features jointly explain variance in the sale price.
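
As a quick sanity check, the overall F statistic of an OLS model with an intercept can be recovered from its R^2 alone: $F = \frac{R^2/k}{(1-R^2)/(n-k-1)}$ for $k$ features and $n$ observations. The sketch below applies this to the rounded values in the summary above, so it reproduces the reported F-statistic only approximately. The function name `f_from_r2` is hypothetical.

```python
def f_from_r2(r2, n, k):
    """Overall F statistic for an OLS model with an intercept and k features."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# Values read from the summary above: R^2 = 0.770, n = 1460, six features
f_value = f_from_r2(0.770, 1460, 6)
print(round(f_value, 1))
```

The result lands within rounding distance of the reported F-statistic, which shows that the F-test and R^2 are two views of the same fit rather than independent pieces of evidence.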

The AIC and BIC scores are large in absolute terms, but they are only meaningful relative to other models fit to the same data. To try to reduce them, I experimented with removing two or three features from the model. Surprisingly, both the AIC and BIC scores increased.

The first model was decent with regard to its R^2 and F-test results, but it would be interesting to see whether adding more features increases the R^2 significantly.

In [31]:
# add ['firstflrsf', 'fullbath', 'totrmsabvgrd', 'garagecars'] to the feature list

feature_list = ['overallqual',
 'yearbuilt',
 'yearremodadd',
 'totalbsmtsf',
 'firstflrsf',
 'grlivarea',
 'fullbath',
 'totrmsabvgrd',
 'garagecars',
 'garagearea']

model = LinearRegression()

X = housing_df[feature_list]
Y = housing_df['saleprice']

model.fit(X,Y)

print(model.coef_)
X = sm.add_constant(X)
results = sm.OLS(Y,X).fit()

results.summary()

[ 1.96045898e+04  2.68240707e+02  2.96481161e+02  1.98650991e+01
  1.41737355e+01  5.12971178e+01 -6.79087146e+03  3.31050771e+01
  1.04179010e+04  1.49475334e+01]


| | | | |
|---|---|---|---|
| Dep. Variable: | saleprice | R-squared: | 0.774 |
| Model: | OLS | Adj. R-squared: | 0.772 |
| Method: | Least Squares | F-statistic: | 495.4 |
| Date: | Wed, 17 Jul 2019 | Prob (F-statistic): | 0.0 |
| Time: | 13:36:45 | Log-Likelihood: | -17459.0 |
| No. Observations: | 1460 | AIC: | 34940.0 |
| Df Residuals: | 1449 | BIC: | 35000.0 |
| Df Model: | 10 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.186e+06 | 1.29e+05 | -9.187 | 0.000 | -1.44e+06 | -9.33e+05 |
| overallqual | 1.96e+04 | 1190.159 | 16.472 | 0.000 | 1.73e+04 | 2.19e+04 |
| yearbuilt | 268.2407 | 50.346 | 5.328 | 0.000 | 169.481 | 367.000 |
| yearremodadd | 296.4812 | 63.635 | 4.659 | 0.000 | 171.655 | 421.307 |
| totalbsmtsf | 19.8651 | 4.295 | 4.625 | 0.000 | 11.439 | 28.291 |
| firstflrsf | 14.1737 | 4.930 | 2.875 | 0.004 | 4.504 | 23.844 |
| grlivarea | 51.2971 | 4.233 | 12.119 | 0.000 | 42.994 | 59.600 |
| fullbath | -6790.8715 | 2682.369 | -2.532 | 0.011 | -1.21e+04 | -1529.130 |
| totrmsabvgrd | 33.1051 | 1119.061 | 0.030 | 0.976 | -2162.048 | 2228.258 |

| | | | |
|---|---|---|---|
| Omnibus: | 477.814 | Durbin-Watson: | 1.983 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 58906.279 |
| Skew: | -0.412 | Prob(JB): | 0.0 |
| Kurtosis: | 34.107 | Cond. No. | 469000.0 |


Adding the four features above increased the R^2 score by only 0.004 (from 0.770 to 0.774) and the adjusted R^2 by 0.003. That gain is too small to justify the extra features. It would be interesting, though, to take the first model, remove some of its features, and see whether that harms the R^2 score dramatically.

In [60]:
# remove 'yearremodadd', 'garagearea', 'yearbuilt'

feature_list = ['overallqual', 'totalbsmtsf', 'grlivarea']

model = LinearRegression()

X = housing_df[feature_list]
Y = housing_df['saleprice']

model.fit(X,Y)

print(model.coef_)
X = sm.add_constant(X)
results = sm.OLS(Y,X).fit()

results.summary()

[28046.71532769    36.61501735    49.45262987]


| | | | |
|---|---|---|---|
| Dep. Variable: | saleprice | R-squared: | 0.742 |
| Model: | OLS | Adj. R-squared: | 0.742 |
| Method: | Least Squares | F-statistic: | 1396.0 |
| Date: | Wed, 17 Jul 2019 | Prob (F-statistic): | 0.0 |
| Time: | 13:44:13 | Log-Likelihood: | -17555.0 |
| No. Observations: | 1460 | AIC: | 35120.0 |
| Df Residuals: | 1456 | BIC: | 35140.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.038e+05 | 4794.565 | -21.651 | 0.000 | -1.13e+05 | -9.44e+04 |
| overallqual | 2.805e+04 | 1023.743 | 27.396 | 0.000 | 2.6e+04 | 3.01e+04 |
| totalbsmtsf | 36.6150 | 2.918 | 12.548 | 0.000 | 30.891 | 42.339 |
| grlivarea | 49.4526 | 2.551 | 19.388 | 0.000 | 44.449 | 54.456 |

| | | | |
|---|---|---|---|
| Omnibus: | 525.531 | Durbin-Watson: | 1.968 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 45766.824 |
| Skew: | -0.726 | Prob(JB): | 0.0 |
| Kurtosis: | 30.39 | Cond. No. | 8890.0 |


After experimenting with removing different features from the original list, I found that removing the combination above ('yearremodadd', 'garagearea', 'yearbuilt') produced the smallest drop in the R^2 score. Removing those three features lowered the R^2 by only about three percentage points (from 0.770 to 0.742). That is a good trade for a simpler model that is cheaper to compute.

Looking at all three models above, the second model has a marginally higher R^2 (0.774 versus 0.770), but the first model (original features) is the best overall: it has the lowest BIC and nearly the same adjusted R^2 with four fewer features. In a hypothetical situation where the model is used on a larger scale, however, the reduced-feature model may be more computationally efficient, since it uses half as many features and its R^2 is only about three percentage points lower than the first model's.
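
To make the comparison easier to scan, the goodness-of-fit metrics from the three house price summaries can be collected side by side. This is a small sketch in plain Python; the model labels are made up for the example, and the numbers are copied from the summaries above at their displayed precision.

```python
# Metrics copied from the three house price summaries above
models = {
    "original (6 features)": {"adj_r2": 0.769, "aic": 34960.0, "bic": 34990.0},
    "extended (10 features)": {"adj_r2": 0.772, "aic": 34940.0, "bic": 35000.0},
    "reduced (3 features)": {"adj_r2": 0.742, "aic": 35120.0, "bic": 35140.0},
}

# Higher adjusted R^2 is better; lower AIC and BIC are better
best_adj_r2 = max(models, key=lambda name: models[name]["adj_r2"])
best_aic = min(models, key=lambda name: models[name]["aic"])
best_bic = min(models, key=lambda name: models[name]["bic"])

for name, m in models.items():
    print(f"{name:24s} adj R^2={m['adj_r2']:.3f} AIC={m['aic']:.0f} BIC={m['bic']:.0f}")
print("best adj R^2:", best_adj_r2)
print("best AIC:", best_aic)
print("best BIC:", best_bic)
```

The criteria do not all agree: BIC penalizes extra parameters more heavily than AIC, so it can favor a smaller model even when a larger one fits slightly better.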