# 19.04 Interpreting Estimated Coefficients
## Assignment 02 Weather Model

In this exercise, you'll work with the historical temperature data from the previous checkpoint. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* First, load the dataset from the **weatherinszeged** table from Thinkful's database.
* Build a linear regression model where your target variable is the difference between the *apparenttemperature* and the *temperature*. As explanatory variables, use *humidity* and *windspeed*. Now, estimate your model using OLS. Are the estimated coefficients statistically significant? Are the signs of the estimated coefficients in line with your previous expectations? Interpret the estimated coefficients. What are the relations between the target and the explanatory variables? 
* Next, add the interaction of *humidity* and *windspeed* to the model above and estimate the model using OLS. Are the coefficients statistically significant? Did the signs of the estimated coefficients for *humidity* and *windspeed* change? Interpret the estimated coefficients.

### Load the dataset

In [8]:
import warnings

import numpy as np 
import pandas as pd 
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt 
import seaborn as sns

from sklearn import linear_model
from sqlalchemy import create_engine 
from sqlalchemy.engine.url import URL 
from scipy.stats import bartlett
from scipy.stats import levene
from scipy.stats import pearsonr
from scipy.stats import jarque_bera
from scipy.stats import normaltest
from statsmodels.tsa.stattools import acf
import statsmodels.api as sm 

warnings.filterwarnings(action="ignore")

kagle = dict(
    drivername = "postgresql",
    username = "dsbc_student",
    password = "7*.8G9QH21",
    host = "142.93.121.174",
    port = "5432",
    database = "weatherinszeged"
)

In [2]:
# URL.create() is the supported constructor in SQLAlchemy 1.4 and later
engine = create_engine(URL.create(**kagle), echo=True)

weather_raw = pd.read_sql_query("SELECT * FROM weatherinszeged", con=engine)

engine.dispose()

2019-12-31 22:12:58,500 INFO sqlalchemy.engine.base.Engine select version()
2019-12-31 22:12:58,508 INFO sqlalchemy.engine.base.Engine {}
2019-12-31 22:12:58,611 INFO sqlalchemy.engine.base.Engine select current_schema()
2019-12-31 22:12:58,613 INFO sqlalchemy.engine.base.Engine {}
2019-12-31 22:12:58,711 INFO sqlalchemy.engine.base.Engine SELECT CAST('test plain returns' AS VARCHAR(60)) AS anon_1
2019-12-31 22:12:58,712 INFO sqlalchemy.engine.base.Engine {}
2019-12-31 22:12:58,763 INFO sqlalchemy.engine.base.Engine SELECT CAST('test unicode returns' AS VARCHAR(60)) AS anon_1
2019-12-31 22:12:58,764 INFO sqlalchemy.engine.base.Engine {}
2019-12-31 22:12:58,817 INFO sqlalchemy.engine.base.Engine show standard_conforming_strings
2019-12-31 22:12:58,817 INFO sqlalchemy.engine.base.Engine {}
2019-12-31 22:12:58,920 INFO sqlalchemy.engine.base.Engine SELECT * FROM weatherinszeged
2019-12-31 22:12:58,922 INFO sqlalchemy.engine.base.Engine {}


### Build the Model

Build a linear regression model where your target variable is the difference between the apparenttemperature and the temperature. As explanatory variables, use humidity and windspeed. Now, estimate your model using OLS. Are the estimated coefficients statistically significant? Are the signs of the estimated coefficients in line with your previous expectations? Interpret the estimated coefficients. What are the relations between the target and the explanatory variables?

In [5]:
# Create a copy of the raw dataframe to work with
weather_df = weather_raw.copy()

# Create a new variable "temp_diff" that measures the difference between the apparent temperature and the temperature
weather_df["temp_diff"] = weather_df["apparenttemperature"] - weather_df["temperature"]

In [9]:
# Y is the target variable, in this case "temp_diff"
Y = weather_df["temp_diff"]

# X is the feature set
X = weather_df[["humidity", "windspeed"]]

# Add a constant to the model
X = sm.add_constant(X)

# Fit an OLS model using statsmodels
results = sm.OLS(Y,X).fit()

# Print the summary results
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              temp_diff   R-squared:                       0.288
Model:                            OLS   Adj. R-squared:                  0.288
Method:                 Least Squares   F-statistic:                 1.949e+04
Date:                Tue, 31 Dec 2019   Prob (F-statistic):               0.00
Time:                        22:26:48   Log-Likelihood:            -1.7046e+05
No. Observations:               96453   AIC:                         3.409e+05
Df Residuals:                   96450   BIC:                         3.409e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.4381      0.021    115.948      0.0

_Are the estimated coefficients statistically significant?_

The resulting model is expressed as: 
$$ temp\_diff = 2.4381 - 3.0292 (humidity) - 0.1193 (windspeed) $$

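Both humidity and windspeed are statistically significant: their p-values are reported as 0.000, well below the usual 0.05 threshold. The coefficient of humidity is -3.0292, so holding windspeed constant, a one-unit increase in humidity is associated with a 3.0292 degree decrease in temp_diff. The coefficient of windspeed is -0.1193, so holding humidity constant, a one-unit increase in windspeed is associated with a 0.1193 degree decrease in temp_diff. In both cases the relation is negative: higher humidity and higher wind speed both lower the apparent temperature relative to the measured temperature.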
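To make the direction of that interpretation concrete, the fitted equation can be evaluated by hand. The sketch below is only an illustration: the coefficients are copied from the equation above, while the helper name `predict_temp_diff` and the input values (humidity of 0.5 and 0.6, windspeed of 10 and 11) are hypothetical and not taken from the dataset.

```python
# Coefficients copied from the fitted equation above
INTERCEPT = 2.4381
B_HUMIDITY = -3.0292
B_WINDSPEED = -0.1193


def predict_temp_diff(humidity, windspeed):
    """Predicted (apparenttemperature - temperature) from the fitted equation."""
    return INTERCEPT + B_HUMIDITY * humidity + B_WINDSPEED * windspeed


# Hypothetical example inputs, chosen only for illustration
base = predict_temp_diff(humidity=0.5, windspeed=10.0)
more_humid = predict_temp_diff(humidity=0.6, windspeed=10.0)
windier = predict_temp_diff(humidity=0.5, windspeed=11.0)

# Changing one explanatory variable while holding the other fixed
# moves the prediction by (coefficient * change in that variable).
print("Change for +0.1 humidity:", round(more_humid - base, 4))
print("Change for +1 windspeed: ", round(windier - base, 4))
```

The explanatory variable is the thing that changes, and the coefficient gives the resulting change in temp_diff, not the other way around.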

### Include an Interaction

Add the interaction of humidity and windspeed to the model above and estimate the model using OLS. Are the coefficients statistically significant? Did the signs of the estimated coefficients for humidity and windspeed change? Interpret the estimated coefficients.

In [12]:
# Y is the target variable
Y = weather_df["temp_diff"]

# Now, include the interaction of humidity and windspeed
weather_df["humidity_and_wind"] = weather_df["humidity"] * weather_df["windspeed"]

# X is the feature set
X = weather_df[["humidity", "windspeed", "humidity_and_wind"]]

# Add a constant to the model
X = sm.add_constant(X)

# Fit an OLS model
results = sm.OLS(Y,X).fit()

print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              temp_diff   R-squared:                       0.341
Model:                            OLS   Adj. R-squared:                  0.341
Method:                 Least Squares   F-statistic:                 1.666e+04
Date:                Tue, 31 Dec 2019   Prob (F-statistic):               0.00
Time:                        23:23:36   Log-Likelihood:            -1.6669e+05
No. Observations:               96453   AIC:                         3.334e+05
Df Residuals:                   96449   BIC:                         3.334e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                 0.0839      0.03

The resulting model can be expressed as: 
$$ temp\_diff = 0.0839 + 0.1775 (humidity) + 0.0905 (windspeed) -0.2971 (humidity\_and\_wind) $$

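The signs of the humidity and windspeed coefficients changed from negative to positive due to the inclusion of the interaction variable, "humidity_and_wind". With an interaction in the model, these two coefficients no longer describe standalone effects. The humidity coefficient, 0.1775, is the change in "temp_diff" for a one-unit increase in humidity when windspeed is zero, and the windspeed coefficient, 0.0905, is the change in "temp_diff" for a 1 km/h increase in windspeed when humidity is zero. The interaction coefficient, -0.2971, means the effect of each variable becomes more negative as the other one increases: the marginal effect of humidity on "temp_diff" is 0.1775 - 0.2971 (windspeed), and the marginal effect of windspeed is 0.0905 - 0.2971 (humidity).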
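Because the effect of each variable now depends on the level of the other, it can help to evaluate the marginal effects at a few points. This is a minimal sketch that assumes the coefficients in the equation above; the function names and the sample windspeed and humidity values are hypothetical and chosen only for illustration.

```python
# Coefficients copied from the interaction model above
B_HUMIDITY = 0.1775
B_WINDSPEED = 0.0905
B_INTERACTION = -0.2971


def humidity_effect(windspeed):
    """Change in temp_diff per one-unit increase in humidity at a given windspeed."""
    return B_HUMIDITY + B_INTERACTION * windspeed


def windspeed_effect(humidity):
    """Change in temp_diff per one-unit increase in windspeed at a given humidity."""
    return B_WINDSPEED + B_INTERACTION * humidity


# Hypothetical evaluation points
for w in (0.0, 5.0, 10.0):
    print(f"windspeed={w:>4}: humidity effect = {humidity_effect(w):.4f}")

for h in (0.0, 0.5, 1.0):
    print(f"humidity={h:>4}: windspeed effect = {windspeed_effect(h):.4f}")
```

Under these coefficients, both marginal effects are positive only when the other variable is near zero and turn negative as it grows, which is consistent with the negative coefficients in the model without the interaction.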