# DengAI: Predicting Disease Spread

Description: Using environmental data collected by various U.S. Federal Government agencies—from the Centers for Disease Control and Prevention to the National Oceanic and Atmospheric Administration in the U.S. Department of Commerce—can you predict the number of dengue fever cases reported each week in San Juan, Puerto Rico and Iquitos, Peru?

First, we'll take a look at some of the features of the training data.

In [1]:
import pandas as pd # Data handling 
import numpy as np

import tensorflow as tf # Neural networks  
from tensorflow import keras
from tensorflow.keras import layers

import plotly.graph_objects as go # Visualization
import plotly.express as px

import sklearn as sk # Stats / ML
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.model_selection import cross_val_score, train_test_split, KFold, GridSearchCV, TimeSeriesSplit
from sklearn import linear_model
from sklearn import preprocessing
from sklearn import impute

from xgboost import XGBRegressor # Efficient Gradient Boosting

In [2]:
df = pd.read_csv("data/dengue_features_train.csv")
df_labels = pd.read_csv("data/dengue_labels_train.csv")
df_features_test = pd.read_csv("data/dengue_features_test.csv")
submission_format = pd.read_csv("data/submission_format.csv")

display(df.head(5))

Unnamed: 0,city,year,weekofyear,week_start_date,ndvi_ne,ndvi_nw,ndvi_se,ndvi_sw,precipitation_amt_mm,reanalysis_air_temp_k,...,reanalysis_precip_amt_kg_per_m2,reanalysis_relative_humidity_percent,reanalysis_sat_precip_amt_mm,reanalysis_specific_humidity_g_per_kg,reanalysis_tdtr_k,station_avg_temp_c,station_diur_temp_rng_c,station_max_temp_c,station_min_temp_c,station_precip_mm
0,sj,1990,18,1990-04-30,0.1226,0.103725,0.198483,0.177617,12.42,297.572857,...,32.0,73.365714,12.42,14.012857,2.628571,25.442857,6.9,29.4,20.0,16.0
1,sj,1990,19,1990-05-07,0.1699,0.142175,0.162357,0.155486,22.82,298.211429,...,17.94,77.368571,22.82,15.372857,2.371429,26.714286,6.371429,31.7,22.2,8.6
2,sj,1990,20,1990-05-14,0.03225,0.172967,0.1572,0.170843,34.54,298.781429,...,26.1,82.052857,34.54,16.848571,2.3,26.714286,6.485714,32.2,22.8,41.4
3,sj,1990,21,1990-05-21,0.128633,0.245067,0.227557,0.235886,15.36,298.987143,...,13.9,80.337143,15.36,16.672857,2.428571,27.471429,6.771429,33.3,23.3,4.0
4,sj,1990,22,1990-05-28,0.1962,0.2622,0.2512,0.24734,7.52,299.518571,...,12.2,80.46,7.52,17.21,3.014286,28.942857,9.371429,35.0,23.9,5.8


Now, let's take a look at some visualizations of the data. The correlation matrix uses the Pearson correlation coefficient. More information can be found in the [documentation for `.corr()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html).

In [3]:
df['total_cases'] = df_labels['total_cases'] #append labels to DataFrame for visualization

# Separate data for San Juan
sj_train_features = df.loc[df['city'] == 'sj']
sj_train_labels = sj_train_features['total_cases'].to_frame()

# Separate data for Iquitos
iq_train_features = df.loc[df['city'] == 'iq']
iq_train_labels = iq_train_features['total_cases'].to_frame()
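
A Pearson correlation matrix like the one described above can be produced with `DataFrame.corr`. The sketch below is illustrative only: it uses three precipitation columns from the five rows printed earlier, and the 0.9 cutoff and the names `sample`, `pairs`, and `high` are assumptions made here rather than part of the original analysis.

```python
import numpy as np
import pandas as pd

# Three precipitation columns from the first five training rows shown above
sample = pd.DataFrame({
    'precipitation_amt_mm': [12.42, 22.82, 34.54, 15.36, 7.52],
    'reanalysis_sat_precip_amt_mm': [12.42, 22.82, 34.54, 15.36, 7.52],
    'station_precip_mm': [16.0, 8.6, 41.4, 4.0, 5.8],
})

corr = sample.corr(method='pearson')

# Keep each pair once by masking everything except the upper triangle
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Pairs above an (assumed) cutoff are candidates for dropping one column
high = pairs[pairs.abs() > 0.9]
print(high)
```

On these five rows the two satellite precipitation columns are identical, so they are the only pair above the cutoff.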

From what we can see, the data is fairly Gaussian. The DewPoint data is slightly skewed, but we'll ignore that for now. Next, let's explore how clean the data is.

As we can see, the percentage of rows with some NaN value is pretty high, so it wouldn't be a great idea to just remove them entirely. Instead, let's try to fill those missing values with some meaningful data.
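
The share of rows containing at least one NaN can be computed as in this minimal sketch. The `toy` frame is made up for illustration: the column names are borrowed from the dataset, but the NaN placements are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with two missing values
toy = pd.DataFrame({
    'ndvi_ne': [0.1226, np.nan, 0.03225, 0.128633],
    'station_avg_temp_c': [25.442857, 26.714286, np.nan, 27.471429],
    'station_precip_mm': [16.0, 8.6, 41.4, 4.0],
})

# True for every row that has a NaN in any column
rows_with_nan = toy.isnull().any(axis=1)

# The mean of a boolean Series is the fraction of True values
pct_rows_with_nan = 100 * rows_with_nan.mean()

# Per-column counts show where the gaps are concentrated
nan_per_column = toy.isnull().sum()
print(pct_rows_with_nan)
print(nan_per_column)
```

Applied to the real training frame, the same two expressions would use `df` in place of `toy`.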

I initially tested filling with the mean value, but that increased bias significantly. Let's try using scikit-learn's IterativeImputer instead (KNNImputer is left in the code as a commented-out alternative). The documentation and explanation can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer).

In [4]:
df = pd.get_dummies(df, columns=['city'], drop_first=True)

df_cp = df.copy()
imp = impute.IterativeImputer() 
# imp = impute.KNNImputer()

imputed = imp.fit_transform(df_cp[[i for i in df.columns if i != 'week_start_date']])

imputed_features = pd.DataFrame(imputed, index=df_cp.index, columns=[i for i in df_cp.columns if i != 'week_start_date'])
imputed_features['week_start_date'] = df_cp['week_start_date'] # restore the date column left out of imputation

df = imputed_features.copy()

# Separate data for San Juan
sj_train_features = df.loc[df['city_sj'] == 1]
sj_train_labels = sj_train_features['total_cases'].to_frame()

# Separate data for Iquitos
iq_train_features = df.loc[df['city_sj'] == 0]
iq_train_labels = iq_train_features['total_cases'].to_frame()

# sj_train_features.drop('city_sj', axis=1, inplace=True)
# iq_train_features.drop('city_sj', axis=1, inplace=True)

Now, let's split the dataset by city and drop some highly correlated variables (by the Pearson correlation coefficient).

In [7]:
#Remove some highly correlated columns
correlated_cols = ['precipitation_amt_mm', 'reanalysis_specific_humidity_g_per_kg', 'reanalysis_min_air_temp_k']
df.drop(correlated_cols, axis=1, inplace=True)
# The per-city frames are slices of df, so the in-place drops below trigger pandas' copy-of-a-slice warning
sj_train_features.drop(correlated_cols, axis=1, inplace=True)
iq_train_features.drop(correlated_cols, axis=1, inplace=True)

print(df.isnull().sum()) #should be zero for all columns

year                                    0
weekofyear                              0
ndvi_ne                                 0
ndvi_nw                                 0
ndvi_se                                 0
ndvi_sw                                 0
reanalysis_air_temp_k                   0
reanalysis_avg_temp_k                   0
reanalysis_dew_point_temp_k             0
reanalysis_max_air_temp_k               0
reanalysis_precip_amt_kg_per_m2         0
reanalysis_relative_humidity_percent    0
reanalysis_sat_precip_amt_mm            0
reanalysis_tdtr_k                       0
station_avg_temp_c                      0
station_diur_temp_rng_c                 0
station_max_temp_c                      0
station_min_temp_c                      0
station_precip_mm                       0
total_cases                             0
city_sj                                 0
week_start_date                         0
dtype: int64




A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



## Gradient Boosting

Now, let's construct a gradient boosted tree model using XGBoost. For hyperparameter optimization, let's start with GridSearchCV.

In [11]:
xgb = XGBRegressor(objective='reg:squarederror') # Try with smooth loss first

sj_features_fit = sj_train_features.copy()
iq_features_fit = iq_train_features.copy()

sj_features_fit.drop(['week_start_date', 'total_cases'], axis=1, inplace=True)
iq_features_fit.drop(['week_start_date', 'total_cases'], axis=1, inplace=True)

In [12]:
param = {
    'max_depth':[i for i in range(3,6)], # depth of tree
    'eta':[0.001, 0.01], # Step size shrinkage used in update to prevent overfitting
    'gamma':[0, 1], # Minimum loss reduction required to make a further partition on a leaf node of the tree
    'n_estimators':[9, 10, 20], # Number of trees 
    'lambda':[.001, .01], # l2 regularization coefficient
    'alpha':[.01] # l1 regularization coefficient
}

grid_search = GridSearchCV(
    estimator = xgb,
    param_grid = param,
    scoring = 'neg_mean_absolute_error',
    verbose = 1,
    cv = 10 #get close to LOOCV
)

Now, let's fit the model on this grid search. This might take a while since GridSearchCV is what is called an "**exhaustive search**". This means it tests every *combination* of the hyperparameter values, so we have 3\*2\*2\*3\*2\*1 = 72 candidate models, each fit once per fold under 10-fold cross-validation, for 720 fits in total. It's a good thing XGBoost is incredibly efficient, else this process could take a long time.
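
The number of candidates in an exhaustive grid is the product of the lengths of the value lists, and the number of fits is that count times the number of folds. The following is a small sketch using only the standard library; `param` repeats the grid defined above, and `n_folds`, `candidates`, `n_candidates`, and `n_fits` are names introduced here for illustration.

```python
from itertools import product

# Same grid as the `param` dict passed to GridSearchCV above
param = {
    'max_depth': [3, 4, 5],
    'eta': [0.001, 0.01],
    'gamma': [0, 1],
    'n_estimators': [9, 10, 20],
    'lambda': [0.001, 0.01],
    'alpha': [0.01],
}
n_folds = 10  # matches cv=10

# Every combination of one value per hyperparameter
candidates = list(product(*param.values()))
n_candidates = len(candidates)

# Each candidate is fit once per cross-validation fold
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)
```

This agrees with the "72 candidates, totalling 720 fits" line that GridSearchCV prints when it runs.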

In [13]:
df_c = df.copy()
df_c.drop(['week_start_date', 'total_cases'], axis=1, inplace=True) #drop string representation of time for training

grid_search = grid_search.fit(df_c, df_labels['total_cases'].values)

Fitting 10 folds for each of 72 candidates, totalling 720 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 720 out of 720 | elapsed:   27.3s finished


In [15]:
#Could be data snooping bias
best_est = grid_search.best_estimator_
chart_actual_and_pred(df, df_c, df_labels, best_est, 'lines', 'Total (MAE of {})'.format(abs(grid_search.best_score_)))

chart_actual_and_pred(sj_train_features, sj_features_fit, sj_train_labels, best_est, 'lines', 'SJ')
# chart_actual_and_pred(df, df_c, df_labels, best_est, 'lines', 'Total (MAE of {})'.format(abs(grid_search.best_score_)))

For this first test, we trained the model on the entirety of the dataset. Now, let's try fitting it to each city specifically and see if it results in a lower bias. 

In [16]:
grid_search_sj = grid_search.fit(sj_features_fit, sj_train_labels['total_cases'].values)
grid_search_iq = grid_search.fit(iq_features_fit, iq_train_labels['total_cases'].values)

Fitting 10 folds for each of 72 candidates, totalling 720 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 720 out of 720 | elapsed:   16.7s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 10 folds for each of 72 candidates, totalling 720 fits


[Parallel(n_jobs=1)]: Done 720 out of 720 | elapsed:   11.2s finished


In [17]:
sj_pred = grid_search_sj.predict(sj_features_fit)
print(max(sj_pred))
# chart_actual_and_pred(sj_train_features, sj_features_fit, sj_train_labels, grid_search_sj.best_estimator_, 'lines', 'SJ (MAE {})'.format(grid_search_sj.best_score_))
# chart_actual_and_pred(iq_train_features, iq_features_fit, iq_train_labels, grid_search_iq.best_estimator_, 'lines', 'IQ (MAE {})'.format(grid_search_iq.best_score_))

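# These scores are the same because GridSearchCV.fit returns the estimator itself:
# grid_search_sj and grid_search_iq are both the same object as grid_search, last fit on the Iquitos data.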
print(grid_search_sj.best_score_)
print(grid_search_iq.best_score_)

14.892132
-6.0778205465811945
-6.0778205465811945
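
Identical scores like these are what one would expect when two names refer to the same fitted object, since scikit-learn's `fit` returns `self`. The sketch below shows the aliasing and one possible way around it with `sklearn.base.clone`. The synthetic arrays and the `Ridge` model are stand-ins chosen to keep the example small; they are not the notebook's data or its XGBoost model.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two cities' features and labels
X_a, y_a = rng.normal(size=(60, 3)), rng.normal(size=60)
X_b, y_b = rng.normal(size=(40, 3)), rng.normal(size=40)

search = GridSearchCV(
    Ridge(),
    {'alpha': [0.1, 1.0]},
    scoring='neg_mean_absolute_error',
    cv=5,
)

# fit() returns the estimator itself, so both names point at one object
aliased_a = search.fit(X_a, y_a)
aliased_b = search.fit(X_b, y_b)
same_object = aliased_a is aliased_b  # the second fit overwrote the first

# clone() gives an unfitted copy with the same parameters, one per dataset
search_a = clone(search).fit(X_a, y_a)
search_b = clone(search).fit(X_b, y_b)
independent = search_a is not search_b

print(same_object, independent)
print(search_a.best_score_, search_b.best_score_)
```

With separate clones, each city keeps its own `best_estimator_` and `best_score_`, and predictions for one city no longer come from a model refit on the other.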
