## Nonlinear Features

October 2, 2018

Duncan Callaway

This notebook explores whether adding nonlinear transformations of predictors improves model fit, as measured by AIC (lower is better).

In [None]:
import numpy as np
import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

In [None]:
# Show up to 150 columns when displaying a data frame, so none are hidden.
pd.set_option('display.max_columns', 150)

In [None]:
df_all = pd.read_csv('BechleLUR_2006_allmodelbuildingdata.csv')

In [None]:
df_all.head()

In [None]:
X = df_all.loc[:,['WRF+DOMINO']]
Y = df_all["Observed_NO2_ppb"]

In [None]:
X_const = sm.add_constant(X)
X_const.head()

In [None]:
est = sm.OLS(Y, X_const)
result_simple = est.fit()
result_simple.aic

Now let's try estimating a model with **all** the predictors included:

In [None]:
X_all = df_all.loc[:,'WRF+DOMINO':'total_14000']
X_all_const = sm.add_constant(X_all)
est_all = sm.OLS(Y, X_all_const)
result_all = est_all.fit()
result_all.aic

And now a model that is close to (but not exactly the same as) Novotny's:

In [None]:
X_base = df_all[['WRF+DOMINO', 'Impervious_6000', 'Major_800', 'total_100', 'Major_100', 'Major_200', 'Elevation_truncated_km', 'Distance_to_coast_km']]

X_base_const = sm.add_constant(X_base)
est_base = sm.OLS(Y, X_base_const)
results_base = est_base.fit()
results_base.aic

Let's call that the 'base' AIC.  
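
For reference, the AIC can be reproduced by hand. The sketch below assumes the usual Gaussian log-likelihood for OLS and counts only the regression coefficients (intercept included) as parameters, which I believe matches what statsmodels reports as `.aic`. The function name `ols_aic` and the synthetic data are hypothetical, for illustration only.

```python
import numpy as np

def ols_aic(X, y):
    """AIC of a Gaussian OLS fit. X must already contain a constant column."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    # Least-squares coefficients and the sum of squared residuals.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ssr = np.sum((y - X @ beta) ** 2)
    # Maximized log-likelihood: -(n/2) * (ln(2*pi) + ln(SSR/n) + 1)
    llf = -0.5 * n * (np.log(2 * np.pi) + np.log(ssr / n) + 1)
    # AIC = 2k - 2 ln(L)
    return 2 * k - 2 * llf

# Synthetic check: a model with the true predictor should beat intercept-only.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=200)
X_const_only = np.ones((200, 1))
X_with_x = np.column_stack([np.ones(200), x])
print(ols_aic(X_const_only, y), ols_aic(X_with_x, y))
```

In this notebook, `ols_aic(X_base_const, Y)` should then come out close to `results_base.aic`. Because each extra coefficient adds 2 to the AIC, a new predictor only lowers it if the fit improves enough to offset that penalty.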

What if we add a nonlinear predictor?  

One predictor Novotny reports as missing is traffic volume; road density is used instead.  

As a rough stand-in, what if we add a variable that is people per km of road?

In [None]:
# People per km of road: Population_800 divided by total_800.
to_add = df_all['Population_800'] / df_all['total_800']
# Division by zero yields inf (or NaN for 0/0); set both cases to 0.
to_add = to_add.replace([np.inf, -np.inf], np.nan).fillna(0)
X_base_popperroad = X_base.assign(pop_per_road_800=to_add.values)

X_base_popperroad_const = sm.add_constant(X_base_popperroad)
est_base_popperroad = sm.OLS(Y, X_base_popperroad_const)
results_base_popperroad = est_base_popperroad.fit()
results_base_popperroad.aic
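
The zero-filling above matters whenever a site has no road length in the denominator. A minimal sketch with made-up numbers (the column names come from the data set, the values do not) shows what the division produces and how both cases can be cleaned in one step:

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; not taken from the Bechle data.
toy = pd.DataFrame({
    "Population_800": [1200.0, 0.0, 300.0],
    "total_800": [4.0, 0.0, 0.0],
})

ratio = toy["Population_800"] / toy["total_800"]
# Row 0: 1200 / 4 = 300; row 1: 0 / 0 = NaN; row 2: 300 / 0 = inf

# Replace inf with NaN first, then fill every NaN with 0.
cleaned = ratio.replace([np.inf, -np.inf], np.nan).fillna(0)
print(cleaned.tolist())  # [300.0, 0.0, 0.0]
```

Setting these rows to 0 gives a site with no roads the same value as a site with roads but no people; whether that is the right choice is a modeling judgment.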

They also don't include population in the model.  I tried a few versions of population, starting with a linear term.  But what about this one: population^(1/4)?

In [None]:
to_add = pd.Series(df_all.loc[:,'Population_800']**(0.25))
X_base_pop4root = X_base.assign(pop_4root = to_add.values)

X_base_pop4root_const = sm.add_constant(X_base_pop4root)
est_base_pop4root = sm.OLS(Y, X_base_pop4root_const)
results_base_pop4root = est_base_pop4root.fit()
results_base_pop4root.aic
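
To see what the fourth root does to the predictor, here is a small sketch on made-up population values (not from the data set). Large values are compressed far more than small ones, so a handful of very dense sites may pull on the fit less than they would with a linear term.

```python
import numpy as np

# Hypothetical population counts spanning several orders of magnitude.
pop = np.array([0.0, 1.0, 16.0, 10000.0])
pop_4root = pop ** 0.25

# Print each value next to its transform to show the compression.
for p, r in zip(pop, pop_4root):
    print(f"population {p:>8.0f} -> fourth root {r:.2f}")
```

Unlike the log, the fourth root is defined at zero, so no special handling of empty sites is needed.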

And now how about taking the log?

In [None]:
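m = df_all.loc[:,'Population_800']
# log(0) is undefined, so substitute 1 where the population is not positive:
# log(1) = 0, which gives those rows a value of 0 without triggering a warning.
to_add = pd.Series(np.log(m.where(m > 0, 1)))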

X_base_poplog = X_base.assign(pop_log = to_add.values)

X_base_poplog_const = sm.add_constant(X_base_poplog)
est_base_poplog = sm.OLS(Y, X_base_poplog_const)
results_base_poplog = est_base_poplog.fit()
results_base_poplog.aic
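
One side effect of the guard above is worth noting: because non-positive populations are mapped to 0 and log(1) is also 0, a site with no people and a site with one person get the same value. A sketch with made-up values (the helper name `guarded_log` is hypothetical):

```python
import numpy as np

def guarded_log(x):
    """Natural log where x > 0, and 0 elsewhere, without evaluating log(0)."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    # `where=` restricts the computation to positive entries; the rest keep
    # the 0 already stored in `out`.
    np.log(x, out=out, where=x > 0)
    return out

# Hypothetical population counts; the first two map to the same value.
pop = np.array([0.0, 1.0, 16.0, 10000.0])
print(guarded_log(pop))
```

If that distinction matters, `np.log1p` (the log of 1 + x) is a common alternative, though it is a different transformation from the one used here.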