# Data Modeling

Now that we've cleaned and explored our data, we can start modeling it.

In [121]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

import sklearn as sk
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, KFold, cross_val_score, TimeSeriesSplit
from sklearn.linear_model import Lasso, ElasticNet, LinearRegression
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor
from fbprophet import Prophet
import tensorflow as tf

First, let's load all of our cleaned .csv files.

In [122]:
df = pd.read_csv('../data/clean/full/dengue_features_train.csv')
df_labels = pd.read_csv('../data/clean/full/dengue_labels_train.csv')  # labels file; filename assumed from the features file's naming
sj_features = pd.read_csv('../data/clean/sj/sj_train_features.csv')
sj_labels = pd.read_csv('../data/clean/sj/sj_train_labels.csv')

iq_features = pd.read_csv('../data/clean/iq/iq_train_features.csv')
iq_labels = pd.read_csv('../data/clean/iq/iq_train_labels.csv')

Now, we can start playing around with algorithms and their hyperparameters.

In [123]:
lasso = Lasso()
enet = ElasticNet()
reg = LinearRegression()

lasso_params = {
    'alpha':[1, 5, 10, 100],
}

enet_params = {
    'alpha':[.1, 1, 5, 10, 100],
    'l1_ratio':[.1, .5, .9],
}

for estimator, params in zip([lasso, enet, reg], [lasso_params, enet_params, {}]):
    grid_search = GridSearchCV(
        estimator=estimator,
        param_grid=params,
        scoring='neg_mean_absolute_error',
        cv=5,
        n_jobs=-1,
    )
    grid_search.fit(sj_features.drop(['city', 'week_start_date'], axis=1), sj_labels['total_cases'])
    print(sj_features['city'].unique()[0], grid_search.best_score_, grid_search.best_estimator_)
    

sj -27.60013422946121 Lasso(alpha=100, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)
sj -27.288571587403965 ElasticNet(alpha=100, copy_X=True, fit_intercept=True, l1_ratio=0.1,
           max_iter=1000, normalize=False, positive=False, precompute=False,
           random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
sj -30.038683611654715 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)


As we can see, the mean absolute errors for Lasso, ElasticNet, and Linear Regression come out to roughly 28, 27, and 30 respectively (the printed scores are negative because the search maximizes `neg_mean_absolute_error`). We can also see that both penalized estimators select the largest penalty in their grid (`alpha=100`), which suggests that a higher penalty leads to a better score. This, however, is misleading. By visualizing the results we can see that the models are actually just regressing towards the mean of the `total_cases` column.
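To make the "regressing towards the mean" effect concrete, here is a minimal sketch on synthetic data (not the dengue features; the data, the random seed, and the names `heavy_lasso` and `mean_pred` are all assumptions for illustration). With a penalty much larger than the signal, Lasso drops every feature and its predictions coincide with a constant mean predictor.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in data (NOT the dengue features): one informative column plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 30 + 2 * X[:, 0] + rng.normal(size=200)

# A penalty far larger than the signal shrinks every coefficient to zero,
# leaving only the intercept, which equals the mean of y.
heavy_lasso = Lasso(alpha=100).fit(X, y)
lasso_pred = heavy_lasso.predict(X)

# Constant baseline that always predicts the training mean.
mean_pred = DummyRegressor(strategy='mean').fit(X, y).predict(X)

print(heavy_lasso.coef_)
print(np.allclose(lasso_pred, mean_pred))
print(mean_absolute_error(y, lasso_pred))
```

Comparing a model's MAE against a `DummyRegressor` like this is a cheap way to check whether a "better" cross-validation score reflects anything beyond predicting the average.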

# More than a Baseline

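Now that we have simple, interpretable models as our baseline, we can start to increase the complexity. To counter our models' tendency to pull toward the mean, we'll use `RandomizedSearchCV` rather than an exhaustive grid search. We start by scoring Prophet with a `TimeSeriesSplit`, so that each fold trains only on earlier weeks.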
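As a rough sketch of what a randomized search might look like, the example below samples hyperparameters from distributions instead of walking a fixed grid. It runs on synthetic data and uses scikit-learn's `GradientBoostingRegressor` as a stand-in for `XGBRegressor`; the parameter ranges, `n_iter`, and fold count are illustrative assumptions, not values from this notebook.

```python
import numpy as np
from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Synthetic ordered data standing in for the real features and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 20 + 5 * np.sin(X[:, 0]) + rng.normal(size=300)

# Distributions to sample from, rather than a fixed grid (ranges are illustrative).
param_distributions = {
    'n_estimators': randint(50, 200),
    'max_depth': randint(2, 6),
    'learning_rate': uniform(0.01, 0.2),  # uniform on [0.01, 0.21]
}

search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                            # number of sampled configurations
    scoring='neg_mean_absolute_error',
    cv=TimeSeriesSplit(n_splits=3),      # folds respect time order
    random_state=0,
)
search.fit(X, y)

# Negate the score to read it as a mean absolute error.
print(-search.best_score_, search.best_params_)
```

Sampling a fixed number of configurations keeps the cost predictable as more hyperparameters are added, which a full grid does not.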

In [124]:
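xgb = XGBRegressor()  # instantiated here but not used in this cell

# Build two separate frames; a chained assignment would bind both names to one object.
sj_xgb_X = pd.DataFrame()
sj_xgb_y = pd.DataFrame()

# Prophet expects the date column to be named 'ds' and the target 'y'.
sj_xgb_X['ds'] = sj_features['week_start_date']
sj_xgb_X['y'] = sj_labels['total_cases']
sj_xgb_y['y'] = sj_labels['total_cases']

# Expanding-window splits: each fold trains on the past and tests on the weeks that follow.
tscv = TimeSeriesSplit(n_splits=5)
scores = []
for train_index, test_index in tscv.split(sj_xgb_X):
    prophet = Prophet(
        growth='linear',
        yearly_seasonality=10,
        weekly_seasonality=False,
        daily_seasonality=False,
        seasonality_mode='additive',
    )
    X_train, X_test = sj_xgb_X.iloc[train_index], sj_xgb_X.iloc[test_index]
    prophet.fit(X_train)
    p = prophet.predict(X_test)
    # mean_absolute_error takes (y_true, y_pred).
    scores.append(mean_absolute_error(X_test['y'], p['yhat']))

np.mean(scores)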

55.332693268328946
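The score above is averaged over `TimeSeriesSplit` folds, which behave differently from the `cv=5` folds used in the grid search. The sketch below prints the fold indices for twelve made-up ordered rows (the `weeks` array is an assumption for illustration) to show that each fold trains on a growing prefix and tests on the rows immediately after it.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve ordered observations standing in for weekly rows.
weeks = np.arange(12)

folds = list(TimeSeriesSplit(n_splits=5).split(weeks))
for train_index, test_index in folds:
    # Training rows always precede test rows, so no future data leaks in.
    print('train:', train_index, 'test:', test_index)
```

Because the earliest folds train on only a small slice of the history, this may be one reason the Prophet MAE is not directly comparable with the baseline scores.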