In [None]:
import numpy as np
import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt

In [None]:
df_all = pd.read_csv('BechleLUR_2006_allmodelbuildingdata.csv')

In [None]:
# Print one column name per line
for col in df_all.columns:
    print(col)

In [None]:
df_all.head()

Question: Which variables would be a bad idea to include in the model?  That is, before we try to do any model selection, what's obviously out?

Now let's build some models.

First, let's try regressing observed NO2 against the satellite measurements (you'll do this in HW).

In [None]:
lm = linear_model.LinearRegression(fit_intercept=True)

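Very important note: scikit-learn expects the predictors as a 2-D object (a DataFrame or a 2-D array) with one column per predictor. If you have only one predictor and you pass in a pandas Series, which is 1-D, scikit-learn raises an error. Selecting with a list of column names, as in the next cell, keeps the result a DataFrame.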
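To see the difference between a Series and a one-column DataFrame without touching the real data, here is a minimal sketch on synthetic data. The frame `demo` and its columns `x` and `y` are made up for illustration and are not part of the BechleLUR dataset.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Synthetic stand-in data (hypothetical names, not the BechleLUR file)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x": rng.normal(size=50)})
demo["y"] = 2.0 * demo["x"] + 1.0 + rng.normal(scale=0.1, size=50)

lm_demo = linear_model.LinearRegression(fit_intercept=True)

# demo["x"] is a 1-D Series, so scikit-learn refuses it as a predictor matrix
try:
    lm_demo.fit(demo["x"], demo["y"])
    series_failed = False
except ValueError as err:
    series_failed = True
    print("Series was rejected:", type(err).__name__)

# Each of these is 2-D with a single column
X_frame = demo[["x"]]                           # list of names -> DataFrame
X_loc = demo.loc[:, ["x"]]                      # same idea with .loc
X_array = demo["x"].to_numpy().reshape(-1, 1)   # plain 2-D NumPy array

print(X_frame.shape, X_loc.shape, X_array.shape)

# The 2-D version fits without complaint
lm_demo.fit(X_frame, demo["y"])
print("slope is", lm_demo.coef_[0])
print("intercept is", lm_demo.intercept_)
```

The output (`y`) can stay a Series; only the predictors need to be 2-D.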

In [None]:
predictors = df_all.loc[:,['WRF+DOMINO']]
output = df_all['Observed_NO2_ppb']
lm.fit(predictors, output)

In [None]:
print('slope is',lm.coef_[0])
print('intercept is',lm.intercept_)

In [None]:
y_hat = lm.predict(predictors)

In [None]:
plt.scatter(output,y_hat)
plt.xlabel('Observed NO2')
plt.ylabel('Predicted NO2')

In [None]:
plt.scatter(predictors, output-y_hat)
plt.xlabel('Satellite NO2')
plt.ylabel('Residual')

Now let's look at the regression results using a different package called statsmodels

In [None]:
import statsmodels.api as sm
from scipy import stats

There is a nice feature in statsmodels that allows you to add a constant to a dataframe:

In [None]:
predictors_const = sm.add_constant(predictors)
predictors_const.head()

In [None]:
est = sm.OLS(output, predictors_const)
est_fit = est.fit()
print(est_fit.summary())

Now let's try estimating a model with **all** the predictors included:

In [None]:
predictors_all = df_all.loc[:,'WRF+DOMINO':'total_14000']
predictors_all_const = sm.add_constant(predictors_all)
est_all = sm.OLS(output, predictors_all_const)
est_all_fit = est_all.fit()
print(est_all_fit.summary())

Now let's look at what happens if we drop some of the predictors

In [None]:
predictors_less = df_all.loc[:,'WRF+DOMINO':'Resident_100']
predictors_less_const = sm.add_constant(predictors_less)
est_less = sm.OLS(output, predictors_less_const)
est_less_fit = est_less.fit()
print(est_less_fit.summary())

Some things to note:
1. AIC is lower for \*\_less.  \*\_all does a better job of reducing squared error, but it gets penalized more for having more variables.
2. Look at the p-values and confidence intervals.  You can see that p-values are big (>0.05) exactly when the 95% confidence intervals span zero.  
3. Resident_100 was insignificant when included with lots of other Resident_\* variables, but on its own it is strongly significant.  This pattern is common when predictors are strongly correlated with each other.