# Evaluate Weather Models Performances

This is part one of two for the evaluating performance assignment.
1. [Evaluate Weather Models Performances](https://github.com/philbowman212/Thinkful_repo/blob/master/assignments/3_supervised_learning/regression_problems/eval_temp_perf.ipynb)
2. [Evaluate House Price Models Performances](https://github.com/philbowman212/Thinkful_repo/blob/master/assignments/3_supervised_learning/regression_problems/eval_hp_perf.ipynb)

### Weather model

For this assignment, you'll revisit the historical temperature dataset. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* First, load the dataset from the **weatherinszeged** table from Thinkful's database.
* Like in the previous checkpoint, build a linear regression model where your target variable is the difference between the *apparenttemperature* and the *temperature*. As explanatory variables, use *humidity* and *windspeed*. Now, estimate your model using OLS. What are the R-squared and adjusted R-squared values? Do you think they are satisfactory? Why? 
* Next, add the interaction of *humidity* and *windspeed* to the model above and estimate the model using OLS. Now, what is the R-squared of this model? Does this model improve upon the previous one?
* Add *visibility* as an additional explanatory variable to the first model and estimate it. Did R-squared increase? What about adjusted R-squared? Compare the contributions of the interaction term and *visibility* in terms of the improvement in adjusted R-squared. Which one is more useful?
* Choose the best one from the three models above with respect to their AIC and BIC scores. Validate your choice by discussing your justification with your mentor.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

In [2]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'weatherinszeged'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

df = pd.read_sql_query('select * from weatherinszeged',con=engine)

engine.dispose()

In [3]:
data = pd.DataFrame()
data['target'] = df.apparenttemperature.copy() - df.temperature.copy()
data['humidity'] = df.humidity.copy()
data['windspeed'] = df.windspeed.copy()

Detection and removal of outliers...

In [4]:
def outliers_std(data, columns, thresh=2):
    """Return index labels of rows more than `thresh` standard deviations from the mean in any column."""
    outlier_indexes = set()
    for col in columns:
        mean = data[col].mean()
        sd = data[col].std()  # pandas default: sample standard deviation (ddof=1)
        is_outlier = (data[col] > mean + thresh*sd) | (data[col] < mean - thresh*sd)
        outlier_indexes.update(data[is_outlier].index)
    return list(outlier_indexes)

In [5]:
def outliers_iqr(data, columns, thresh=1.5):
    """Return index labels of rows on or beyond the IQR fences in any column."""
    outlier_indexes = set()
    for col in columns:
        q25, q75 = np.percentile(data[col], [25, 75])
        iqr = q75 - q25
        upper_lim = q75 + (iqr*thresh)
        lower_lim = q25 - (iqr*thresh)
        # Values exactly on a fence are flagged too (>= and <=).
        is_outlier = (data[col] >= upper_lim) | (data[col] <= lower_lim)
        outlier_indexes.update(data[is_outlier].index)
    return list(outlier_indexes)

In [6]:
len(outliers_std(data, data.columns)), len(outliers_iqr(data, data.columns))

(12375, 4185)

Go with the IQR detection method: it flags 4,185 rows instead of 12,375, so less information is lost.
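
To see why the two rules can disagree, here is a minimal sketch on a made-up sample. It uses only the standard library rather than pandas, and the helper names `flag_std` and `flag_iqr` are hypothetical, not the notebook's functions. It also assumes that `statistics.quantiles(..., method="inclusive")` interpolates quartiles the same way `np.percentile` does by default.

```python
import statistics

def flag_std(values, thresh=2):
    """Values more than `thresh` sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if v > mean + thresh * sd or v < mean - thresh * sd]

def flag_iqr(values, thresh=1.5):
    """Values on or beyond the fences q25 - thresh*IQR and q75 + thresh*IQR."""
    q25, _, q75 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q75 - q25
    return [v for v in values if v >= q75 + thresh * iqr or v <= q25 - thresh * iqr]

# Made-up sample: a tight middle with two moderately distant end points.
sample = [-3.9, -2, -2, -1, -1, -1, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3.9]

print(flag_std(sample))  # the two end points fall outside mean +/- 2 sd
print(flag_iqr(sample))  # the IQR fences sit at -4 and 4, so nothing is flagged
```

For roughly bell-shaped data, a 2-standard-deviation cutoff flags close to 5% of a column, while 1.5 x IQR fences flag under 1%, which is consistent with the larger count returned by `outliers_std`.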

In [7]:
data.drop(outliers_iqr(data, data.columns), inplace=True)

Create model...

In [8]:
import statsmodels.api as sm

In [9]:
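def OLS_sum(data):
    """Fit OLS of the first column on the remaining columns and print the summary."""
    target = data.iloc[:, 0]
    features = data.iloc[:, 1:]
    # sm.OLS does not add an intercept on its own, so add the constant column explicitly.
    sm_data = sm.add_constant(features)
    results = sm.OLS(target, sm_data).fit()
    print(results.summary())

OLS_sum(data)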

                            OLS Regression Results                            
Dep. Variable:                 target   R-squared:                       0.253
Model:                            OLS   Adj. R-squared:                  0.253
Method:                 Least Squares   F-statistic:                 1.566e+04
Date:                Mon, 11 Nov 2019   Prob (F-statistic):               0.00
Time:                        11:17:28   Log-Likelihood:            -1.5491e+05
No. Observations:               92268   AIC:                         3.098e+05
Df Residuals:                   92265   BIC:                         3.099e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.2733      0.020    112.379      0.0

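The R-squared and adjusted R-squared values are both 0.253, which means the model explains only 25.3% of the variance in the target. The two values coincide because, with 92,268 observations and only two regressors, the adjustment is negligible. This is not satisfactory: roughly three quarters of the variance is left unexplained, so the model has weak predictive power.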
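
As a quick check, adjusted R-squared can be recomputed from the summary with the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below plugs in the rounded values printed above (R² = 0.253, n = 92268, p = 2); the function name `adjusted_r2` is only for illustration.

```python
def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R-squared for a model with an intercept and `n_features` regressors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

adj = adjusted_r2(0.253, n_obs=92268, n_features=2)
print(round(adj, 3))  # 0.253
print(0.253 - adj)    # the penalty for the two regressors, far below the rounding precision
```

With this many observations, the penalty only becomes visible once a model carries a very large number of regressors.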

Update model with interaction between humidity and windspeed...

In [10]:
data = pd.DataFrame()
data['target'] = df.apparenttemperature.copy() - df.temperature.copy()
data['humidity'] = df.humidity.copy()
data['windspeed'] = df.windspeed.copy()
data['ws_hum_rel'] = df.humidity.copy() * df.windspeed.copy()

Clean again...

In [11]:
data.drop(outliers_iqr(data, data.columns), inplace=True)

In [12]:
OLS_sum(data)

                            OLS Regression Results                            
Dep. Variable:                 target   R-squared:                       0.311
Model:                            OLS   Adj. R-squared:                  0.311
Method:                 Least Squares   F-statistic:                 1.374e+04
Date:                Mon, 11 Nov 2019   Prob (F-statistic):               0.00
Time:                        11:17:28   Log-Likelihood:            -1.4890e+05
No. Observations:               91457   AIC:                         2.978e+05
Df Residuals:                   91453   BIC:                         2.979e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.2743      0.034     -8.064      0.0

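The model does improve: R-squared rises from 0.253 to 0.311, and the AIC and BIC both drop (2.978e+05 and 2.979e+05, versus 3.098e+05 and 3.099e+05). The F-statistic is lower here, however. Note that this model was fitted on 91,457 observations rather than 92,268, because the outlier filter also ran on the interaction column, so the AIC and BIC comparison is only approximate.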
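
An interaction term can also be specified through the statsmodels formula interface instead of building the product column by hand. The following is a minimal sketch on synthetic data, not the Szeged table: the coefficients, noise level, and random seed are invented, and only the column names mirror the notebook.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the weather data (all numbers are invented).
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "humidity": rng.uniform(0, 1, n),
    "windspeed": rng.uniform(0, 30, n),
})
toy["target"] = (
    -3.0 * toy["humidity"]
    - 0.1 * toy["windspeed"]
    + 0.3 * toy["humidity"] * toy["windspeed"]
    + rng.normal(0, 1, n)
)

# In a formula, "a * b" expands to a + b + a:b, where a:b is the interaction term.
main_only = smf.ols("target ~ humidity + windspeed", data=toy).fit()
with_inter = smf.ols("target ~ humidity * windspeed", data=toy).fit()

print(with_inter.params.index.tolist())
print(main_only.rsquared_adj, with_inter.rsquared_adj)
print(main_only.aic, with_inter.aic)
```

Because both fits use the same rows here, their AIC and BIC values are directly comparable.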

Update model with visibility...

In [13]:
data = pd.DataFrame()
data['target'] = df.apparenttemperature.copy() - df.temperature.copy()
data['humidity'] = df.humidity.copy()
data['windspeed'] = df.windspeed.copy()
data['visibility'] = df.visibility.copy()

In [14]:
data.drop(outliers_iqr(data, data.columns), inplace=True)

In [15]:
OLS_sum(data)

                            OLS Regression Results                            
Dep. Variable:                 target   R-squared:                       0.272
Model:                            OLS   Adj. R-squared:                  0.272
Method:                 Least Squares   F-statistic:                 1.150e+04
Date:                Mon, 11 Nov 2019   Prob (F-statistic):               0.00
Time:                        11:17:29   Log-Likelihood:            -1.5375e+05
No. Observations:               92268   AIC:                         3.075e+05
Df Residuals:                   92264   BIC:                         3.075e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.4375      0.026     54.551      0.0

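Based on the R-squared values, this model (0.272) is better than the original (0.253) but worse than the interaction model (0.311). Adjusted R-squared increased by the same amount as R-squared. The model also has a lower F-statistic than the original, and it shows only a slight improvement in AIC and BIC (3.075e+05 for both, versus 3.098e+05 and 3.099e+05).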

Based on adjusted R-squared alone, the interaction between humidity and windspeed has more explanatory power for the outcome variable than visibility does: it raises adjusted R-squared by 0.058 (0.253 to 0.311), compared with 0.019 (0.253 to 0.272) for visibility.

Model selection: Of the three models above, the second (with the interaction term) appears to be the best on every measure except the F-statistic: it has the highest R-squared and adjusted R-squared and the lowest AIC and BIC. However, comparing the second and third models using the F-statistic is not useful, because neither model contains all the features of the other (they are not nested). We also haven't checked the errors for normality; if they are not normal, the F-statistics could be unreliable regardless.
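
The AIC and BIC figures in the three summaries can be reproduced from the reported log-likelihoods with the usual definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k counts the estimated coefficients including the constant. Below is a small sketch using the rounded values printed in the summaries, so the results agree only to about four significant figures. The dictionary layout and function names are illustrative.

```python
import math

def aic(log_likelihood, k):
    """Akaike information criterion for k estimated coefficients."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n_obs):
    """Bayesian information criterion for k coefficients and n_obs rows."""
    return k * math.log(n_obs) - 2 * log_likelihood

# (log-likelihood, number of coefficients incl. constant, observations)
models = {
    "base": (-1.5491e5, 3, 92268),
    "interaction": (-1.4890e5, 4, 91457),
    "visibility": (-1.5375e5, 4, 92268),
}

scores = {name: (aic(ll, k), bic(ll, k, n)) for name, (ll, k, n) in models.items()}
for name, (a, b) in scores.items():
    print(f"{name:12s} AIC={a:.4e} BIC={b:.4e}")

# Lower is better for both criteria.
best = min(scores, key=lambda name: scores[name][0])
print(best)
```

The interaction model has the lowest score on both criteria, though it was fitted on 811 fewer rows than the other two, so the comparison is approximate.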