## Weather Model
* First, load the dataset from the weatherinszeged table in Thinkful's database.
* As in the previous checkpoint, build a linear regression model where the target variable is the difference between apparenttemperature and temperature. Use humidity and windspeed as explanatory variables, and estimate the model using OLS. What are the R-squared and adjusted R-squared values? Do you think they are satisfactory? Why?
* Next, add the interaction of humidity and windspeed to the model above and estimate it using OLS. What is the R-squared of this model? Does it improve upon the previous one?
* Add visibility as an additional explanatory variable to the first model and estimate it. Did R-squared increase? What about adjusted R-squared? Compare the improvement in adjusted R-squared contributed by the interaction term with the improvement contributed by visibility. Which one is more useful?
* Choose the best of the three models above with respect to their AIC and BIC scores. Validate your choice by discussing your justification with your mentor.

## Load Data

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import linear_model
import statsmodels.formula.api as smf
from sqlalchemy import create_engine
import seaborn as sns
import statsmodels.api as sm

# Display preferences
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

import warnings
warnings.filterwarnings(action='ignore')

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'weatherinszeged'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

temperature_df = pd.read_sql_query('select * from weatherinszeged',con=engine)

# No need for an open connection, as we're only doing a single query
engine.dispose()

temperature_df.head(10)

| | date | summary | preciptype | temperature | apparenttemperature | humidity | windspeed | windbearing | visibility | loudcover | pressure | dailysummary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006-03-31 22:00:00+00:00 | Partly Cloudy | rain | 9.472 | 7.389 | 0.89 | 14.12 | 251.0 | 15.826 | 0.0 | 1015.13 | Partly cloudy throughout the day. |
| 1 | 2006-03-31 23:00:00+00:00 | Partly Cloudy | rain | 9.356 | 7.228 | 0.86 | 14.265 | 259.0 | 15.826 | 0.0 | 1015.63 | Partly cloudy throughout the day. |
| 2 | 2006-04-01 00:00:00+00:00 | Mostly Cloudy | rain | 9.378 | 9.378 | 0.89 | 3.928 | 204.0 | 14.957 | 0.0 | 1015.94 | Partly cloudy throughout the day. |
| 3 | 2006-04-01 01:00:00+00:00 | Partly Cloudy | rain | 8.289 | 5.944 | 0.83 | 14.104 | 269.0 | 15.826 | 0.0 | 1016.41 | Partly cloudy throughout the day. |
| 4 | 2006-04-01 02:00:00+00:00 | Mostly Cloudy | rain | 8.756 | 6.978 | 0.83 | 11.045 | 259.0 | 15.826 | 0.0 | 1016.51 | Partly cloudy throughout the day. |
| 5 | 2006-04-01 03:00:00+00:00 | Partly Cloudy | rain | 9.222 | 7.111 | 0.85 | 13.959 | 258.0 | 14.957 | 0.0 | 1016.66 | Partly cloudy throughout the day. |
| 6 | 2006-04-01 04:00:00+00:00 | Partly Cloudy | rain | 7.733 | 5.522 | 0.95 | 12.365 | 259.0 | 9.982 | 0.0 | 1016.72 | Partly cloudy throughout the day. |
| 7 | 2006-04-01 05:00:00+00:00 | Partly Cloudy | rain | 8.772 | 6.528 | 0.89 | 14.152 | 260.0 | 9.982 | 0.0 | 1016.84 | Partly cloudy throughout the day. |
| 8 | 2006-04-01 06:00:00+00:00 | Partly Cloudy | rain | 10.822 | 10.822 | 0.82 | 11.318 | 259.0 | 9.982 | 0.0 | 1017.37 | Partly cloudy throughout the day. |
| 9 | 2006-04-01 07:00:00+00:00 | Partly Cloudy | rain | 13.772 | 13.772 | 0.72 | 12.526 | 279.0 | 9.982 | 0.0 | 1017.22 | Partly cloudy throughout the day. |


## Build 1st Model

In [2]:
# Y is the target variable
Y = temperature_df['apparenttemperature'] - temperature_df['temperature']

# X is the feature set which includes
# humidity and windspeed
X = temperature_df[['humidity', 'windspeed']]

# Add a constant to the model as it's a best practice
# to do so every time!
X = sm.add_constant(X)

# Fit an OLS model using statsmodels
results = sm.OLS(Y, X).fit()

# Print the summary results
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.288
Model:                            OLS   Adj. R-squared:                  0.288
Method:                 Least Squares   F-statistic:                 1.949e+04
Date:                Tue, 31 Dec 2019   Prob (F-statistic):               0.00
Time:                        21:14:43   Log-Likelihood:            -1.7046e+05
No. Observations:               96453   AIC:                         3.409e+05
Df Residuals:                   96450   BIC:                         3.409e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.4381      0.021    115.948      0.0

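Both R-squared and adjusted R-squared are 0.288. They coincide at three decimal places because, with 96,453 observations and only two predictors, the adjustment penalty is negligible. This fit is not satisfactory: the model explains only 28.8% of the variance in the difference between apparent and actual temperature, leaving 71.2% unexplained.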
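To see why the two figures agree, here is a minimal sketch of the standard adjusted R-squared formula for a model with an intercept, applied to the values printed in the summary. The function name `adjusted_r_squared` is made up for this illustration, and the input R-squared is the rounded value from the output, so the result is approximate.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared for a model that includes an intercept."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Inputs read from the printed summary: R-squared (rounded to 3 decimals),
# No. Observations, and Df Model.
adj = adjusted_r_squared(0.288, 96453, 2)
print(f"{adj:.3f}")  # prints 0.288
```

With this many observations the ratio (n - 1) / (n - p - 1) is almost exactly 1, so adding one or two predictors barely changes the penalty.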

## Build 2nd Model

In [3]:
# Y is the target variable
Y = temperature_df['apparenttemperature'] - temperature_df['temperature']

# This is the interaction between humidity and windspeed
temperature_df["humidity_windspeed"] = temperature_df['humidity'] * temperature_df['windspeed']

# X is the feature set, which includes humidity, windspeed,
# and their interaction
X = temperature_df[['humidity', 'windspeed', 'humidity_windspeed']]

# Add a constant to the model as it's a best practice
# to do so every time!
X = sm.add_constant(X)

# Fit an OLS model using statsmodels
results = sm.OLS(Y, X).fit()

# Print the summary results
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.341
Model:                            OLS   Adj. R-squared:                  0.341
Method:                 Least Squares   F-statistic:                 1.666e+04
Date:                Tue, 31 Dec 2019   Prob (F-statistic):               0.00
Time:                        21:18:32   Log-Likelihood:            -1.6669e+05
No. Observations:               96453   AIC:                         3.334e+05
Df Residuals:                   96449   BIC:                         3.334e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                  0.0839      0

Both R-squared and adjusted R-squared are 0.341. Adding the interaction term improves on the first model, raising the explained variance from 28.8% to 34.1%. The gain is modest, however, and the model still leaves 65.9% of the variance unexplained.
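For reference, the second model can be written out as follows. The symbols $\beta_0, \dots, \beta_3$ and $\varepsilon$ are notation introduced here, not taken from the notebook, and no fitted values are substituted because the coefficient table in the printed output is cut off.

```latex
\begin{aligned}
y &= \beta_0 + \beta_1\,\mathrm{humidity} + \beta_2\,\mathrm{windspeed}
     + \beta_3\,(\mathrm{humidity} \times \mathrm{windspeed}) + \varepsilon \\
\frac{\partial y}{\partial\,\mathrm{windspeed}} &= \beta_2 + \beta_3\,\mathrm{humidity}
\end{aligned}
```

The second line is what the interaction term buys: the effect of windspeed on the temperature difference is no longer a single constant but depends on the humidity level (and vice versa).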

## Build 3rd Model

In [4]:
# Y is the target variable
Y = temperature_df['apparenttemperature'] - temperature_df['temperature']

# X is the feature set, which includes humidity, windspeed,
# and visibility
X = temperature_df[['humidity', 'windspeed', 'visibility']]

# Add a constant to the model as it's a best practice
# to do so every time!
X = sm.add_constant(X)

# Fit an OLS model using statsmodels
results = sm.OLS(Y, X).fit()

# Print the summary results
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.304
Model:                            OLS   Adj. R-squared:                  0.303
Method:                 Least Squares   F-statistic:                 1.401e+04
Date:                Tue, 31 Dec 2019   Prob (F-statistic):               0.00
Time:                        21:21:07   Log-Likelihood:            -1.6938e+05
No. Observations:               96453   AIC:                         3.388e+05
Df Residuals:                   96449   BIC:                         3.388e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.5756      0.028     56.605      0.0

R-squared increased from 0.288 to 0.304, and adjusted R-squared increased from 0.288 to 0.303. The gain in adjusted R-squared from adding visibility (0.015) is much smaller than the gain from adding the interaction term in the second model (0.053), so the interaction term is the more useful addition. The second model's R-squared is 0.037 higher than the third's, meaning it explains about 3.7 percentage points more of the variance.

AIC and BIC values for the three models:
1. AIC=3.409e+05; BIC=3.409e+05
2. AIC=3.334e+05; BIC=3.334e+05
3. AIC=3.388e+05; BIC=3.388e+05

Model 2 has the lowest AIC and the lowest BIC, so it is the best of the three models by both criteria. This agrees with the adjusted R-squared comparison above.
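As a cross-check, the ranking can be reproduced from the log-likelihoods printed in the three summaries. This is a sketch using the standard definitions AIC = 2k - 2·LL and BIC = k·ln(n) - 2·LL, where k counts the estimated coefficients including the constant. The helper names `aic` and `bic` are made up for this illustration, and the log-likelihoods are the rounded values shown in the output, so the scores are approximate.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2*LL."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: k*ln(n) - 2*LL."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


N_OBS = 96453  # No. Observations in every summary

# (log-likelihood as printed, number of coefficients = Df Model + 1 for the constant)
models = {
    "model 1": (-1.7046e5, 3),
    "model 2": (-1.6669e5, 4),
    "model 3": (-1.6938e5, 4),
}

scores = {name: (aic(ll, k), bic(ll, k, N_OBS)) for name, (ll, k) in models.items()}
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])
print(best_by_aic, best_by_bic)  # prints: model 2 model 2
```

AIC and BIC look identical in the printed summaries only because they are shown to four significant figures. The BIC penalty per parameter, ln(96453), is about 11.5 against 2 for AIC, which is tiny next to scores in the hundreds of thousands.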