# 19.05 Evaluating Performance
## Assignment

### 1. Weather model

For this assignment, you'll revisit the historical temperature dataset. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* First, load the dataset from the **weatherinszeged** table from Thinkful's database.
* Like in the previous checkpoint, build a linear regression model where your target variable is the difference between the *apparenttemperature* and the *temperature*. As explanatory variables, use *humidity* and *windspeed*. Now, estimate your model using OLS. What are the R-squared and adjusted R-squared values? Do you think they are satisfactory? Why? 
* Next, include the interaction of *humidity* and *windspeed* to the model above and estimate the model using OLS. Now, what is the R-squared of this model? Does this model improve upon the previous one? 
* Add *visibility* as an additional explanatory variable to the first model and estimate it. Did R-squared increase? What about adjusted R-squared? Compare the differences put on the table by the interaction term and the *visibility* in terms of the improvement in the adjusted R-squared. Which one is more useful?
* Choose the best one from the three models above with respect to their AIC and BIC scores. Validate your choice by discussing your justification with your mentor.

In [2]:
import warnings

import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd
import statsmodels.api as sm

from sqlalchemy import create_engine
from sqlalchemy.engine.url import URL

pd.options.display.float_format = "{:3f}".format

warnings.filterwarnings(action="ignore")

kagle = dict(
    drivername = "postgresql",
    username = "dsbc_student",
    password = "7*.8G9QH21",
    host = "142.93.121.174",
    port = "5432",
    database = "weatherinszeged"
)

In [3]:
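# URL.create is the public constructor in SQLAlchemy 1.4+; older releases call URL(**kagle) directly
engine = create_engine(URL.create(**kagle), echo=True)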

weather_raw = pd.read_sql_query("SELECT * FROM weatherinszeged", con=engine)

engine.dispose()

2020-01-06 19:58:09,778 INFO sqlalchemy.engine.base.Engine select version()
2020-01-06 19:58:09,785 INFO sqlalchemy.engine.base.Engine {}
2020-01-06 19:58:09,890 INFO sqlalchemy.engine.base.Engine select current_schema()
2020-01-06 19:58:09,892 INFO sqlalchemy.engine.base.Engine {}
2020-01-06 19:58:09,994 INFO sqlalchemy.engine.base.Engine SELECT CAST('test plain returns' AS VARCHAR(60)) AS anon_1
2020-01-06 19:58:09,995 INFO sqlalchemy.engine.base.Engine {}
2020-01-06 19:58:10,062 INFO sqlalchemy.engine.base.Engine SELECT CAST('test unicode returns' AS VARCHAR(60)) AS anon_1
2020-01-06 19:58:10,064 INFO sqlalchemy.engine.base.Engine {}
2020-01-06 19:58:10,126 INFO sqlalchemy.engine.base.Engine show standard_conforming_strings
2020-01-06 19:58:10,128 INFO sqlalchemy.engine.base.Engine {}
2020-01-06 19:58:10,237 INFO sqlalchemy.engine.base.Engine SELECT * FROM weatherinszeged
2020-01-06 19:58:10,240 INFO sqlalchemy.engine.base.Engine {}


In [5]:
weather_raw.describe()

Unnamed: 0,temperature,apparenttemperature,humidity,windspeed,windbearing,visibility,loudcover,pressure
count,96453.0,96453.0,96453.0,96453.0,96453.0,96453.0,96453.0,96453.0
mean,11.932678,10.855029,0.734899,10.81064,187.509232,10.347325,0.0,1003.235956
std,9.551546,10.696847,0.195473,6.913571,107.383428,4.192123,0.0,116.969906
min,-21.822222,-27.716667,0.0,0.0,0.0,0.0,0.0,0.0
25%,4.688889,2.311111,0.6,5.8282,116.0,8.3398,0.0,1011.9
50%,12.0,12.0,0.78,9.9659,180.0,10.0464,0.0,1016.45
75%,18.838889,18.838889,0.89,14.1358,290.0,14.812,0.0,1021.09
max,39.905556,39.344444,1.0,63.8526,359.0,16.1,0.0,1046.38


In [6]:
weather_raw.describe(include=["O"])

Unnamed: 0,summary,preciptype,dailysummary
count,96453,96453,96453
unique,27,3,214
top,Partly Cloudy,rain,Mostly cloudy throughout the day.
freq,31733,85224,20085


In [7]:
# Create a copy of the raw DataFrame to work with
weather_df = weather_raw.copy()

# Create a new variable "temp_diff" that measures the difference between the apparent temperature and the temperature
weather_df["temp_diff"] = weather_df["apparenttemperature"] - weather_df["temperature"]

In [8]:
# Y is the target variable, in this case "temp_diff"
Y = weather_df["temp_diff"]

# X is the feature set
X = weather_df[["humidity", "windspeed"]]

# Add a constant to the model
X = sm.add_constant(X)

# Fit an OLS model using statsmodels
results = sm.OLS(Y,X).fit()

# Print the summary results
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              temp_diff   R-squared:                       0.288
Model:                            OLS   Adj. R-squared:                  0.288
Method:                 Least Squares   F-statistic:                 1.949e+04
Date:                Mon, 06 Jan 2020   Prob (F-statistic):               0.00
Time:                        20:08:04   Log-Likelihood:            -1.7046e+05
No. Observations:               96453   AIC:                         3.409e+05
Df Residuals:                   96450   BIC:                         3.409e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.4381      0.021    115.948      0.0

#### _What are the R-squared and adjusted R-squared values? Do you think they are satisfactory? Why?_

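The R-squared and adjusted R-squared values are both $0.288$, so the model explains 28.8% of the variance in the temp_diff variable. The two values agree because, with 96,453 observations and only two explanatory variables, the adjustment penalty is negligible. Roughly 71% of the variance remains unexplained, so these values are arguably not satisfactory on their own.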
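
To see why the two values match, here is a minimal sketch of the standard adjusted R-squared formula, $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, applied to the figures reported in the summary above. The helper name `adjusted_r_squared` is hypothetical, and the input R-squared is the rounded value from the summary, so the result is only approximate.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept and n_predictors regressors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


n_obs = 96453       # "No. Observations" from the summary
r2_rounded = 0.288  # rounded R-squared from the summary (assumption: exact value unknown)

adj_r2 = adjusted_r_squared(r2_rounded, n_obs, n_predictors=2)
penalty = r2_rounded - adj_r2

print(f"adjusted R-squared: {adj_r2:.6f}")
print(f"penalty for two predictors: {penalty:.2e}")
```

With this many observations the penalty is far below the three decimals shown in the summary, which is why R-squared and adjusted R-squared print as the same number.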

#### _Next, include the interaction of *humidity* and *windspeed* to the model above and estimate the model using OLS._

In [9]:
# Y is the target variable
Y = weather_df["temp_diff"]

# Now, include the interaction of humidity and windspeed
weather_df["humidity_and_wind"] = weather_df["humidity"] * weather_df["windspeed"]

# X is the feature set
X = weather_df[["humidity", "windspeed", "humidity_and_wind"]]

# Add a constant to the model
X = sm.add_constant(X)

# Fit an OLS model
results = sm.OLS(Y,X).fit()

print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              temp_diff   R-squared:                       0.341
Model:                            OLS   Adj. R-squared:                  0.341
Method:                 Least Squares   F-statistic:                 1.666e+04
Date:                Mon, 06 Jan 2020   Prob (F-statistic):               0.00
Time:                        20:14:05   Log-Likelihood:            -1.6669e+05
No. Observations:               96453   AIC:                         3.334e+05
Df Residuals:                   96449   BIC:                         3.334e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                 0.0839      0.03

#### _Now, what is the R-squared of this model? Does this model improve upon the previous one?_
The R-squared and adjusted R-squared values are both $0.341$, so the model explains 34.1% of the variance in the temp_diff variable. Adding the interaction between humidity and windspeed raised R-squared from $0.288$ to $0.341$, a gain of about 5.3 percentage points, so this model improves on the previous one.
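
The effect of an interaction term can be illustrated on synthetic data. The sketch below does not use the Szeged dataset: the variables, coefficients, and the `r_squared` helper are all made up for illustration, and the fit uses NumPy's least-squares solver rather than statsmodels. It only shows that when the true effect of one variable depends on another, adding their product as a regressor raises R-squared.

```python
import numpy as np


def r_squared(y, X):
    """Fit OLS by least squares and return R-squared (X must include a constant column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(0)
n = 1000
humidity = rng.uniform(0.0, 1.0, n)    # synthetic stand-in, not the real column
windspeed = rng.uniform(0.0, 30.0, n)  # synthetic stand-in, not the real column

# Synthetic target: the wind effect grows with humidity (made-up coefficients)
y = -1.0 * humidity - 0.3 * humidity * windspeed + rng.normal(0.0, 1.0, n)

const = np.ones(n)
X_main = np.column_stack([const, humidity, windspeed])
X_inter = np.column_stack([const, humidity, windspeed, humidity * windspeed])

r2_main = r_squared(y, X_main)
r2_inter = r_squared(y, X_inter)

print(f"main effects only: {r2_main:.3f}")
print(f"with interaction:  {r2_inter:.3f}")
```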

#### _Next, add *visibility* as an additional explanatory variable to the first model and estimate it._

In [10]:
# Y is the target variable
Y = weather_df["temp_diff"]

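# Note: this model keeps the interaction term from the second model and adds visibility on top
# (the term is recomputed here, although the column already exists from the previous cell)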
weather_df["humidity_and_wind"] = weather_df["humidity"] * weather_df["windspeed"]

# X is the feature set
X = weather_df[["humidity", "windspeed", "humidity_and_wind", "visibility"]]

# Add a constant to the model
X = sm.add_constant(X)

# Fit an OLS model
results = sm.OLS(Y,X).fit()

print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              temp_diff   R-squared:                       0.364
Model:                            OLS   Adj. R-squared:                  0.363
Method:                 Least Squares   F-statistic:                 1.377e+04
Date:                Mon, 06 Jan 2020   Prob (F-statistic):               0.00
Time:                        20:23:23   Log-Likelihood:            -1.6504e+05
No. Observations:               96453   AIC:                         3.301e+05
Df Residuals:                   96448   BIC:                         3.301e+05
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                -1.1006      0.03

#### _Did R-squared increase? What about adjusted R-squared? Compare the differences put on the table by the interaction term and the *visibility* in terms of the improvement in the adjusted R-squared. Which one is more useful?_

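Both R-squared and adjusted R-squared increased, to $0.364$ and $0.363$ respectively. Note that the model fitted above keeps the humidity_and_wind term, so the gains are measured against the second model: R-squared rose by 2.3 percentage points and adjusted R-squared by 2.2 percentage points. This is a gain, but a smaller one than the roughly 5.3 percentage points contributed by the humidity_and_wind variable, so the interaction term is the more useful addition.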

#### _Choose the best one from the three models above with respect to their AIC and BIC scores. Validate your choice by discussing your justification with your mentor._

| | Model 1 | Model 2 | Model 3 |
| :-----: | :-----: | :-----: | :-----: |
| AIC | 3.409e+05 | 3.334e+05 | 3.301e+05 |
| BIC | 3.409e+05 | 3.334e+05 | 3.301e+05 |

The third model has the lowest AIC and BIC scores, so it is the best of the three by both criteria.
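
As a rough check, both criteria can be recomputed from the log-likelihoods printed in the three summaries, using $\mathrm{AIC} = 2k - 2\ln L$ and $\mathrm{BIC} = k\ln n - 2\ln L$, where $k$ counts the estimated coefficients including the constant. This is a sketch under the assumption that the rounded log-likelihoods from the summaries are close enough; the helper names `aic` and `bic` are hypothetical, and the results will differ slightly from the statsmodels output because of that rounding.

```python
import math


def aic(llf, k):
    """Akaike information criterion from log-likelihood llf and k parameters."""
    return 2 * k - 2 * llf


def bic(llf, k, n_obs):
    """Bayesian information criterion from log-likelihood llf, k parameters, n_obs rows."""
    return k * math.log(n_obs) - 2 * llf


n_obs = 96453  # "No. Observations" from the summaries

# (rounded log-likelihood from the summary, number of coefficients incl. constant)
models = {
    "Model 1": (-1.7046e5, 3),
    "Model 2": (-1.6669e5, 4),
    "Model 3": (-1.6504e5, 5),
}

scores = {
    name: (aic(llf, k), bic(llf, k, n_obs))
    for name, (llf, k) in models.items()
}

for name, (model_aic, model_bic) in scores.items():
    print(f"{name}: AIC = {model_aic:.4e}, BIC = {model_bic:.4e}")

best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])
print("lowest AIC:", best_by_aic)
print("lowest BIC:", best_by_bic)
```

Lower is better for both criteria. BIC penalizes each extra coefficient by $\ln n \approx 11.5$ here, compared with 2 for AIC, but the log-likelihood gains between these models are in the thousands, so both criteria rank the models the same way.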