# Data Modeling

Now that we've cleaned and explored our data, we can start working on modeling it. 

In [53]:
import pandas as pd
import numpy as np

import sklearn as sk
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, KFold, cross_val_score
from sklearn.linear_model import Lasso, ElasticNet, LinearRegression
from xgboost import XGBRegressor
from fbprophet import Prophet
import tensorflow as tf

First, let's load all of our cleaned .csv files.

In [42]:
df = pd.read_csv('../data/clean/full/dengue_features_train.csv')
df_labels = pd.read_csv('../data/clean/full/dengue_labels_train.csv')  # labels file; name assumed to mirror the features file
sj_features = pd.read_csv('../data/clean/sj/sj_train_features.csv')
sj_labels = pd.read_csv('../data/clean/sj/sj_train_labels.csv')

iq_features = pd.read_csv('../data/clean/iq/iq_train_features.csv')
iq_labels = pd.read_csv('../data/clean/iq/iq_train_labels.csv')
sj_features.head()

Unnamed: 0,city,year,weekofyear,week_start_date,ndvi_ne,ndvi_nw,ndvi_se,ndvi_sw,reanalysis_air_temp_k,reanalysis_avg_temp_k,...,reanalysis_max_air_temp_k,reanalysis_precip_amt_kg_per_m2,reanalysis_relative_humidity_percent,reanalysis_sat_precip_amt_mm,reanalysis_tdtr_k,station_avg_temp_c,station_diur_temp_rng_c,station_max_temp_c,station_min_temp_c,station_precip_mm
0,sj,1990,18,1990-04-30,0.1226,0.103725,0.198483,0.177617,297.572857,297.742857,...,299.8,32.0,73.365714,12.42,2.628571,25.442857,6.9,29.4,20.0,16.0
1,sj,1990,19,1990-05-07,0.1699,0.142175,0.162357,0.155486,298.211429,298.442857,...,300.9,17.94,77.368571,22.82,2.371429,26.714286,6.371429,31.7,22.2,8.6
2,sj,1990,20,1990-05-14,0.03225,0.172967,0.1572,0.170843,298.781429,298.878571,...,300.5,26.1,82.052857,34.54,2.3,26.714286,6.485714,32.2,22.8,41.4
3,sj,1990,21,1990-05-21,0.128633,0.245067,0.227557,0.235886,298.987143,299.228571,...,301.4,13.9,80.337143,15.36,2.428571,27.471429,6.771429,33.3,23.3,4.0
4,sj,1990,22,1990-05-28,0.1962,0.2622,0.2512,0.24734,299.518571,299.664286,...,301.9,12.2,80.46,7.52,3.014286,28.942857,9.371429,35.0,23.9,5.8


Now we can start experimenting with algorithms and their hyperparameters.

In [57]:
lasso = Lasso()
enet = ElasticNet()
reg = LinearRegression()

lasso_params = {
    'alpha':[1, 5, 10, 100],
}

enet_params = {
    'alpha':[.1, 1, 5, 10, 100],
    'l1_ratio':[.1, .5, .9],
}

for estimator, params in zip([lasso, enet, reg], [lasso_params, enet_params, {}]):
    grid_search = GridSearchCV(
        estimator=estimator,
        param_grid=params,
        scoring='neg_mean_absolute_error',
        cv=5,
    )
    grid_search.fit(sj_features.drop(['city', 'week_start_date'], axis=1), sj_labels['total_cases'])
    print(grid_search.best_score_, grid_search.best_estimator_)


-27.60013422946121 Lasso(alpha=100, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)
-27.288571587403965 ElasticNet(alpha=100, copy_X=True, fit_intercept=True, l1_ratio=0.1,
           max_iter=1000, normalize=False, positive=False, precompute=False,
           random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
-30.038683611654573 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)


As we can see, the mean absolute errors for Lasso, ElasticNet, and Linear Regression come out to roughly 28, 27, and 30 respectively (scikit-learn reports the negated MAE, so scores closer to zero are better). Both penalized estimators also pick `alpha=100`, the largest penalty in the grid, which suggests that a heavier penalty leads to a better score. This, however, is misleading. By visualizing the results we can see that the heavily penalized models are actually just regressing towards the mean of the `total_cases` column.
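
One way to check this kind of collapse without a plot is to compare a heavily penalized Lasso against a model that always predicts the training mean. The sketch below uses synthetic data as a stand-in: the feature matrix, the Poisson-distributed target, and the random seed are all assumptions for illustration, not the dengue data. It only shows the mechanism: when `alpha` is large relative to the signal in the features, every coefficient is shrunk to zero and the prediction is just the intercept.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Lasso

# Synthetic stand-in data (an assumption, not the dengue set):
# a skewed count target that depends only weakly on one feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = rng.poisson(lam=np.exp(1.0 + 0.3 * X[:, 0])).astype(float)

# Same penalty strength the grid search selected above.
lasso = Lasso(alpha=100).fit(X, y)

# Reference model that ignores the features and predicts mean(y).
mean_model = DummyRegressor(strategy="mean").fit(X, y)

lasso_pred = lasso.predict(X)
mean_pred = mean_model.predict(X)

# How many coefficients survived the penalty?
n_nonzero = int(np.count_nonzero(lasso.coef_))
# Largest difference between the Lasso and the mean-only predictions.
gap_to_mean = float(np.max(np.abs(lasso_pred - mean_pred)))

print("non-zero coefficients:", n_nonzero)
print("max |lasso - mean| prediction gap:", gap_to_mean)
```

Running the same comparison on the real feature matrix and `total_cases` would show how close the selected models are to a mean-only predictor. If few or no coefficients survive, the grid search score mostly reflects the mean baseline rather than anything learned from the features.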

# More than a Baseline

Now that we have simple, interpretable models as our baseline, we can start to increase the complexity.

In [58]:
prophet = Prophet()
xgb = XGBRegressor()