## Assignment 2

### Weather model

In this exercise, you'll work with the historical temperature data from the previous checkpoint. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* First, load the dataset from the **weatherinszeged** table from Thinkful's database.
* Build a linear regression model where your target variable is the difference between the *apparenttemperature* and the *temperature*. As explanatory variables, use *humidity* and *windspeed*. Now, estimate your model using OLS. Are the estimated coefficients statistically significant? Are the signs of the estimated coefficients in line with your previous expectations? Interpret the estimated coefficients. What are the relations between the target and the explanatory variables? 
* Next, add the interaction of *humidity* and *windspeed* to the model above and estimate the model using OLS. Are the coefficients statistically significant? Did the signs of the estimated coefficients for *humidity* and *windspeed* change? Interpret the estimated coefficients.

#### First, load the dataset from the weatherinszeged table from Thinkful's database.

In [20]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import linear_model
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sqlalchemy import create_engine

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

import warnings
warnings.filterwarnings(action="ignore")

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'weatherinszeged'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

temperature_df = pd.read_sql_query('select * from weatherinszeged',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()

temperature_df.head(5)

| | date | summary | preciptype | temperature | apparenttemperature | humidity | windspeed | windbearing | visibility | loudcover | pressure | dailysummary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006-03-31 22:00:00+00:00 | Partly Cloudy | rain | 9.472 | 7.389 | 0.89 | 14.12 | 251.0 | 15.826 | 0.0 | 1015.13 | Partly cloudy throughout the day. |
| 1 | 2006-03-31 23:00:00+00:00 | Partly Cloudy | rain | 9.356 | 7.228 | 0.86 | 14.265 | 259.0 | 15.826 | 0.0 | 1015.63 | Partly cloudy throughout the day. |
| 2 | 2006-04-01 00:00:00+00:00 | Mostly Cloudy | rain | 9.378 | 9.378 | 0.89 | 3.928 | 204.0 | 14.957 | 0.0 | 1015.94 | Partly cloudy throughout the day. |
| 3 | 2006-04-01 01:00:00+00:00 | Partly Cloudy | rain | 8.289 | 5.944 | 0.83 | 14.104 | 269.0 | 15.826 | 0.0 | 1016.41 | Partly cloudy throughout the day. |
| 4 | 2006-04-01 02:00:00+00:00 | Mostly Cloudy | rain | 8.756 | 6.978 | 0.83 | 11.045 | 259.0 | 15.826 | 0.0 | 1016.51 | Partly cloudy throughout the day. |


#### Build a linear regression model where your target variable is the difference between the apparenttemperature and the temperature. As explanatory variables, use humidity and windspeed. Now, estimate your model using OLS. 

- Are the estimated coefficients statistically significant? 
- Are the signs of the estimated coefficients in line with your previous expectations?
- Interpret the estimated coefficients. What are the relations between the target and the explanatory variables?

In [21]:
# Y is the target variable: temperature minus apparenttemperature
Y = temperature_df['temperature'] - temperature_df['apparenttemperature']

# X is the feature set: humidity and windspeed
X = temperature_df[['humidity','windspeed']]

# Add a constant column so that the model estimates an intercept
X = sm.add_constant(X)

# We fit an OLS model using statsmodels
results = sm.OLS(Y, X).fit()

# We print the summary results
print(results.summary())



                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.288
Model:                            OLS   Adj. R-squared:                  0.288
Method:                 Least Squares   F-statistic:                 1.949e+04
Date:                Mon, 12 Aug 2019   Prob (F-statistic):               0.00
Time:                        15:38:02   Log-Likelihood:            -1.7046e+05
No. Observations:               96453   AIC:                         3.409e+05
Df Residuals:                   96450   BIC:                         3.409e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -2.4381      0.021   -115.948      0.0
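Before reading the signs, note that the cell above computes the target as temperature minus apparenttemperature, while the task describes the difference between apparenttemperature and temperature. The following is a minimal sketch of what that choice does to the coefficients. It uses synthetic stand-in data rather than the Szeged data, and every name and number in it (`ols_coefficients`, the generated columns, the coefficients used to simulate `apparent`) is a hypothetical choice made for illustration.

```python
import numpy as np


def ols_coefficients(X, y):
    """Return least-squares coefficients for y ~ X (X already has a constant column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


# Synthetic stand-in data (hypothetical values, not the Szeged dataset).
rng = np.random.default_rng(0)
n = 500
humidity = rng.uniform(0.2, 1.0, n)
windspeed = rng.uniform(0.0, 30.0, n)
temperature = rng.normal(12.0, 8.0, n)
apparent = (temperature + 2.5 - 3.0 * humidity - 0.1 * windspeed
            + rng.normal(0.0, 0.5, n))

# Design matrix: constant, humidity, windspeed.
X = np.column_stack([np.ones(n), humidity, windspeed])

# Same regression with the two possible orders of subtraction.
coef_temp_minus_apparent = ols_coefficients(X, temperature - apparent)
coef_apparent_minus_temp = ols_coefficients(X, apparent - temperature)

print(coef_temp_minus_apparent)
print(coef_apparent_minus_temp)
```

The two coefficient vectors are exact negatives of each other. Swapping the order of subtraction therefore leaves magnitudes, standard errors, and p-values unchanged and only flips every sign, which matters when comparing the signs against prior expectations.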

- **Are the estimated coefficients statistically significant?** _Yes. Their p-values are reported as zero at the displayed precision, which is below any conventional significance level._


- **Are the signs of the estimated coefficients in line with your previous expectations?** _Yes_


- **Interpret the estimated coefficients. What are the relations between the target and the explanatory variables?**


The target here is temperature minus apparenttemperature, as computed in the code above.

Holding windspeed fixed, a 1 unit increase in humidity is associated with an increase of 3.0292 in our target.

Holding humidity fixed, a 1 unit increase in windspeed is associated with an increase of 0.1193 in our target.

Since the coefficients are statistically significant, humidity and windspeed each explain some of the variation in the outcome. Together they account for about 28.8% of its variance (R-squared = 0.288).
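A p-value shown as zero is a rounded value, not an exact zero. As a rough sketch, the two-sided p-value for the constant's t-statistic (-115.948 in the output above) can be approximated with the standard normal distribution. This is an assumption that seems reasonable with 96,450 residual degrees of freedom; statsmodels itself uses the t distribution. The function name is hypothetical.

```python
import math


def two_sided_p_value_normal(t_stat):
    """Two-sided p-value for a test statistic under a standard normal approximation."""
    # P(|Z| > |t|) = erfc(|t| / sqrt(2)) for a standard normal Z.
    return math.erfc(abs(t_stat) / math.sqrt(2.0))


# t-statistic of the constant, taken from the regression output above.
p_const = two_sided_p_value_normal(-115.948)

# For comparison: a statistic near the usual 5% threshold.
p_borderline = two_sided_p_value_normal(1.96)

print(p_const)       # underflows to 0.0 in double precision
print(p_borderline)  # close to 0.05
```

The true p-value for such a large t-statistic is positive but far smaller than a floating-point number can represent, so it is displayed as zero.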


#### Next, add the interaction of *humidity* and *windspeed* to the model above and estimate the model using OLS.

- Are the coefficients statistically significant? 
- Did the signs of the estimated coefficients for *humidity* and *windspeed* change? 
- Interpret the estimated coefficients

In [22]:
# Y is the target variable: temperature minus apparenttemperature
Y = temperature_df['temperature'] - temperature_df['apparenttemperature']

# Create the interaction term as the product of humidity and windspeed
temperature_df['interactionhumidity_windspeed'] = temperature_df['humidity'] * temperature_df['windspeed']

# X is the feature set: humidity, windspeed, and their interaction
X = temperature_df[['humidity','windspeed','interactionhumidity_windspeed']]

# Add a constant column so that the model estimates an intercept
X = sm.add_constant(X)

# We fit an OLS model using statsmodels
results = sm.OLS(Y, X).fit()

# We print the summary results
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.341
Model:                            OLS   Adj. R-squared:                  0.341
Method:                 Least Squares   F-statistic:                 1.666e+04
Date:                Mon, 12 Aug 2019   Prob (F-statistic):               0.00
Time:                        15:38:02   Log-Likelihood:            -1.6669e+05
No. Observations:               96453   AIC:                         3.334e+05
Df Residuals:                   96449   BIC:                         3.334e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                                    coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
const         

- **Are the coefficients statistically significant?** _Yes. Their p-values are reported as zero at the displayed precision._


- **Did the signs of the estimated coefficients for *humidity* and *windspeed* change?** _Yes. Both coefficients changed sign, from positive (3.0292 and 0.1193) to negative (-0.1775 and -0.0905)._


- **Interpret the estimated coefficients**

Because the model now contains an interaction term, the coefficients on humidity and windspeed no longer describe effects with all other features held fixed. Each one is the effect when the other variable equals zero. The target is temperature minus apparenttemperature, as computed in the code above.

When windspeed is zero, a 1 unit increase in humidity is associated with a decrease of 0.1775 in our target.

When humidity is zero, a 1 unit increase in windspeed is associated with a decrease of 0.0905 in our target.

The interaction coefficient of 0.2971 means that each 1 unit increase in windspeed raises the effect of humidity on the target by 0.2971, and likewise each 1 unit increase in humidity raises the effect of windspeed by 0.2971.
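With an interaction in the model, the effect of one variable depends on the level of the other. The sketch below combines the three coefficients quoted in the interpretation above (-0.1775, -0.0905, and 0.2971) into conditional effects. The function names and the evaluation points are hypothetical choices for illustration, and the target is assumed to be temperature minus apparenttemperature, as in the code cell.

```python
# Coefficients as quoted in the interpretation above.
COEF_HUMIDITY = -0.1775
COEF_WINDSPEED = -0.0905
COEF_INTERACTION = 0.2971


def humidity_effect(windspeed):
    """Change in the target per 1 unit of humidity at a given windspeed."""
    return COEF_HUMIDITY + COEF_INTERACTION * windspeed


def windspeed_effect(humidity):
    """Change in the target per 1 unit of windspeed at a given humidity."""
    return COEF_WINDSPEED + COEF_INTERACTION * humidity


# Effect of humidity in still air and at a windspeed of 10 (dataset units).
print(humidity_effect(0.0))
print(humidity_effect(10.0))

# Effect of windspeed at 50% humidity (humidity is recorded on a 0 to 1 scale).
print(windspeed_effect(0.5))

# Humidity level at which the windspeed effect changes sign.
print(-COEF_WINDSPEED / COEF_INTERACTION)
```

Read this way, the negative coefficients on humidity and windspeed apply only near zero wind or zero humidity. Over most of the observed range the interaction term dominates, and both conditional effects are positive.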








<font color=gray>
------------------------------------------------------------------------------

By: Wendy Navarrete

8/12/2019
</font>