<a href="https://colab.research.google.com/github/gshreya5/colab/blob/main/DengAI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🦟 DengAI: Predicting Disease Spread
HOSTED BY DRIVENDATA

**Can you predict local epidemics of dengue fever?**

Dengue fever is a mosquito-borne disease that occurs in tropical and sub-tropical parts of the world. In mild cases, symptoms are similar to the flu: fever, rash, and muscle and joint pain. In severe cases, dengue fever can cause severe bleeding, low blood pressure, and even death.

Because it is carried by mosquitoes, the transmission dynamics of dengue are related to climate variables such as temperature and precipitation. Although the relationship to climate is complex, a growing number of scientists argue that climate change is likely to produce distributional shifts that will have significant public health implications worldwide.

**GOAL**: An understanding of the relationship between climate and dengue dynamics can improve research initiatives and resource allocation to help fight life-threatening pandemics.


# Import Libraries


In [152]:
!pip install lazypredict --quiet

In [260]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error as MAE

import lazypredict
from lazypredict.Supervised import LazyClassifier
from lazypredict.Supervised import LazyRegressor

import xgboost as xgb

from sklearn.model_selection import RandomizedSearchCV,GridSearchCV
from sklearn.linear_model import HuberRegressor

%matplotlib inline

#  [Load Datasets](https://www.drivendata.org/competitions/44/dengai-predicting-disease-spread/page/82/)

In [251]:
train_labels = pd.read_csv('https://drivendata-prod.s3.amazonaws.com/data/44/public/dengue_labels_train.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIARVBOBDCYQTZTLQOS%2F20230307%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230307T161138Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=abdc7ff1905a9930e60d3bf0a759a2ec5133aa04e487da35bca0a3358ab43662')
train_features = pd.read_csv('https://drivendata-prod.s3.amazonaws.com/data/44/public/dengue_features_train.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIARVBOBDCYQTZTLQOS%2F20230307%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230307T161446Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=15956f08257281ae26bd3ed8f89b67b5c763c44dc28458b59c7575e9298e98bd')

test_features = pd.read_csv('https://drivendata-prod.s3.amazonaws.com/data/44/public/dengue_features_test.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIARVBOBDCYQTZTLQOS%2F20230307%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230307T161446Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=be4d51a059c3609b80d0a274deb4b379b56c30323c6fc1684474b29dc2d83a9c')

Before merging the training features and labels, let's check that their key columns (`city`, `year`, `weekofyear`) match row for row.

In [252]:
train_features[['city','year','weekofyear']].equals(train_labels[['city','year','weekofyear']])

True
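Because the key columns match row for row, assigning the labels by position is safe here. A more defensive alternative is to merge on the keys and let pandas verify the one-to-one relationship. The sketch below uses small made-up frames (`features` and `labels` are hypothetical stand-ins for `train_features` and `train_labels`).

```python
import pandas as pd

# Toy stand-ins for train_features and train_labels (values are made up).
features = pd.DataFrame({
    "city": ["sj", "sj", "iq"],
    "year": [1990, 1990, 2000],
    "weekofyear": [18, 19, 26],
    "precipitation_amt_mm": [12.4, 22.8, 25.4],
})
labels = pd.DataFrame({
    "city": ["sj", "sj", "iq"],
    "year": [1990, 1990, 2000],
    "weekofyear": [18, 19, 26],
    "total_cases": [4, 5, 0],
})

keys = ["city", "year", "weekofyear"]
# validate="one_to_one" raises a MergeError if a key is duplicated on either side.
merged = features.merge(labels, on=keys, how="left", validate="one_to_one")
print(merged)
```

With a merge, a reordered or duplicated row in either file would surface as an error or a missing label rather than a silently misaligned target.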

In [266]:
df = train_features.copy()
df['train'] = 1  # flag training rows
df['totCases'] = train_labels['total_cases'].astype(int)

test_features['train'] = 0  # flag rows to predict
test_features['totCases'] = np.nan
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([df, test_features])
df['week_start_date'] = pd.to_datetime(df['week_start_date'])

In [267]:
df.reset_index(inplace = True, drop = True)

Split the data by city, San Juan (`sj`) and Iquitos (`iq`), because the two probably exhibit different patterns.

In [268]:
sj = df[df.city =='sj'].copy()
del sj['city']
sj.set_index(['week_start_date'],inplace = True)

iq = df[df.city =='iq'].copy()
del iq['city']
iq.set_index(['week_start_date'],inplace = True)
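The two near-identical blocks above could also be written once with `groupby`. This is only a sketch on a made-up frame; the dictionary name `by_city` is an assumption introduced for illustration.

```python
import pandas as pd

# Toy frame with the same key columns as `df` (values are made up).
df = pd.DataFrame({
    "city": ["sj", "iq", "sj", "iq"],
    "week_start_date": pd.to_datetime(
        ["1990-04-30", "2000-07-01", "1990-05-07", "2000-07-08"]),
    "totCases": [4.0, 0.0, 5.0, 0.0],
})

# One frame per city, indexed by week_start_date, with the city column dropped.
by_city = {
    city: group.drop(columns="city").set_index("week_start_date")
    for city, group in df.groupby("city")
}
sj, iq = by_city["sj"], by_city["iq"]
```

This keeps the per-city preprocessing identical by construction, which matters once more cities or more steps are added.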

# Explore Dataset


In [234]:
df.shape, sj.shape, iq.shape

((1872, 26), (1196, 24), (676, 24))

In [265]:
sj.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1196 entries, 1990-04-30 to 2013-04-23
Data columns (total 24 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   year                                   1196 non-null   int64  
 1   weekofyear                             1196 non-null   int64  
 2   ndvi_ne                                962 non-null    float64
 3   ndvi_nw                                1136 non-null   float64
 4   ndvi_se                                1176 non-null   float64
 5   ndvi_sw                                1176 non-null   float64
 6   precipitation_amt_mm                   1185 non-null   float64
 7   reanalysis_air_temp_k                  1188 non-null   float64
 8   reanalysis_avg_temp_k                  1188 non-null   float64
 9   reanalysis_dew_point_temp_k            1188 non-null   float64
 10  reanalysis_max_air_temp_k              1188 non-null  

Some columns have missing values; `ndvi_ne`, for example, has only 962 non-null entries out of 1196.

Open question: would time-based features help?
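One possible way to create time-based features is a cyclical encoding of `weekofyear` plus the calendar month from the date index. The sketch below runs on a made-up frame; the column names `week_sin`, `week_cos` and `month` are assumptions, and the period of 52 is an approximation because the data also contains a week 53.

```python
import numpy as np
import pandas as pd

# Toy frame with the same index/column convention as `sj` (values are made up).
idx = pd.to_datetime(["1990-04-30", "1990-05-07", "1990-12-24"])
demo = pd.DataFrame({"weekofyear": [18, 19, 52]}, index=idx)
demo.index.name = "week_start_date"

# Cyclical encoding: week 52 ends up next to week 1 instead of 51 units away.
angle = 2 * np.pi * demo["weekofyear"] / 52
demo["week_sin"] = np.sin(angle)
demo["week_cos"] = np.cos(angle)

# Calendar month taken from the DatetimeIndex.
demo["month"] = demo.index.month
```

Tree-based models can split on the raw week number, so the encoding is likely to matter more for the linear models tried later.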

In [109]:
sj.describe()

Unnamed: 0,year,weekofyear,ndvi_ne,ndvi_nw,ndvi_se,ndvi_sw,precipitation_amt_mm,reanalysis_air_temp_k,reanalysis_avg_temp_k,reanalysis_dew_point_temp_k,reanalysis_max_air_temp_k,reanalysis_min_air_temp_k,reanalysis_precip_amt_kg_per_m2,reanalysis_relative_humidity_percent,reanalysis_sat_precip_amt_mm,reanalysis_specific_humidity_g_per_kg,reanalysis_tdtr_k,station_avg_temp_c,station_diur_temp_rng_c,station_max_temp_c,station_min_temp_c,station_precip_mm,train,totCases
count,1196.0,1196.0,962.0,1136.0,1176.0,1176.0,1185.0,1188.0,1188.0,1188.0,1188.0,1188.0,1188.0,1188.0,1185.0,1188.0,1188.0,1188.0,1188.0,1188.0,1188.0,1188.0,1196.0,936.0
mean,2001.326923,26.503344,0.050453,0.060731,0.177523,0.163152,33.52227,299.227588,299.334698,295.155665,301.441751,297.36069,29.010556,78.488882,33.52227,16.596055,2.531638,27.064298,6.625998,31.62298,22.71069,28.398485,0.782609,34.180556
std,6.652597,15.020404,0.114467,0.092158,0.059411,0.056092,42.274077,1.251881,1.234097,1.592483,1.273053,1.313495,35.489485,3.394443,42.274077,1.589038,0.495944,1.420081,0.843749,1.714663,1.518206,31.091872,0.412644,51.381372
min,1990.0,1.0,-0.4634,-0.4561,-0.015533,-0.063457,0.0,295.938571,296.114286,289.642857,297.8,292.6,0.0,64.92,0.0,11.715714,1.357143,22.842857,4.042857,26.7,17.8,0.0,0.0,0.0
25%,1996.0,13.75,-0.009308,0.007946,0.137757,0.125214,0.76,298.209643,298.319643,293.883929,300.4,296.3,9.475,76.214286,0.76,15.256071,2.171429,25.871429,6.053571,30.6,21.7,6.9,1.0,9.0
50%,2001.0,26.5,0.051125,0.05904,0.176014,0.162707,19.67,299.352857,299.439286,295.492143,301.6,297.5,20.0,78.620714,19.67,16.884286,2.471429,27.278571,6.6,31.7,22.8,18.6,1.0,19.0
75%,2007.0,39.25,0.1071,0.107513,0.213646,0.200418,49.81,300.294286,300.364286,296.478571,302.4,298.4,35.725,80.863214,49.81,17.93,2.814286,28.228571,7.157143,32.8,23.9,38.9,1.0,37.0
max,2013.0,53.0,0.5004,0.649,0.393129,0.38142,390.6,302.2,302.164286,297.795714,304.3,299.9,570.5,87.575714,390.6,19.44,4.428571,30.271429,9.914286,35.6,26.7,305.9,1.0,461.0


In [16]:
pd.options.display.max_columns = 100

In [110]:
sj.corr()[['totCases']].sort_values(by='totCases',ascending=False)

| Feature | Correlation with totCases |
|---|---|
| totCases | 1.0 |
| weekofyear | 0.287134 |
| reanalysis_specific_humidity_g_per_kg | 0.207947 |
| reanalysis_dew_point_temp_k | 0.203774 |
| station_avg_temp_c | 0.196617 |
| reanalysis_max_air_temp_k | 0.194532 |
| station_max_temp_c | 0.189901 |
| reanalysis_min_air_temp_k | 0.187943 |
| reanalysis_air_temp_k | 0.181917 |
| station_min_temp_c | 0.177012 |


In [111]:
iq.corr()[['totCases']].sort_values(by='totCases',ascending=False)

| Feature | Correlation with totCases |
|---|---|
| totCases | 1.0 |
| reanalysis_specific_humidity_g_per_kg | 0.236476 |
| reanalysis_dew_point_temp_k | 0.230401 |
| reanalysis_min_air_temp_k | 0.214514 |
| station_min_temp_c | 0.211702 |
| year | 0.179451 |
| reanalysis_relative_humidity_percent | 0.130083 |
| station_avg_temp_c | 0.11307 |
| reanalysis_precip_amt_kg_per_m2 | 0.101171 |
| reanalysis_air_temp_k | 0.097098 |


`reanalysis_specific_humidity_g_per_kg` and `reanalysis_dew_point_temp_k` are the climate features most strongly correlated with total cases in both cities, probably because the mosquitoes that carry dengue thrive in humid conditions. The correlations are weak, though (roughly 0.20 to 0.24).

Total cases also tend to rise with the temperature variables, probably because higher temperatures make humid conditions more likely.

# Feature Engineering


## Imputing Null Values


In [116]:
sj.isna().sum()

year                                       0
weekofyear                                 0
precipitation_amt_mm                      11
reanalysis_air_temp_k                      8
reanalysis_avg_temp_k                      8
reanalysis_dew_point_temp_k                8
reanalysis_max_air_temp_k                  8
reanalysis_min_air_temp_k                  8
reanalysis_precip_amt_kg_per_m2            8
reanalysis_relative_humidity_percent       8
reanalysis_sat_precip_amt_mm              11
reanalysis_specific_humidity_g_per_kg      8
reanalysis_tdtr_k                          8
station_avg_temp_c                         8
station_diur_temp_rng_c                    8
station_max_temp_c                         8
station_min_temp_c                         8
station_precip_mm                          8
train                                      0
totCases                                 260
dtype: int64

Let's drop the satellite vegetation index columns (`ndvi_*`): none of them appears among the strongest correlations with total cases shown above.

In [269]:
sj.drop(columns=['ndvi_ne','ndvi_se','ndvi_sw','ndvi_nw'],inplace=True)
iq.drop(columns=['ndvi_ne','ndvi_se','ndvi_sw','ndvi_nw'],inplace=True)

Let's fill the remaining missing values with each column's mean, computed separately for each city.

In [126]:
iq.describe()

Unnamed: 0,year,weekofyear,precipitation_amt_mm,reanalysis_air_temp_k,reanalysis_avg_temp_k,reanalysis_dew_point_temp_k,reanalysis_max_air_temp_k,reanalysis_min_air_temp_k,reanalysis_precip_amt_kg_per_m2,reanalysis_relative_humidity_percent,reanalysis_sat_precip_amt_mm,reanalysis_specific_humidity_g_per_kg,reanalysis_tdtr_k,station_avg_temp_c,station_diur_temp_rng_c,station_max_temp_c,station_min_temp_c,station_precip_mm,train,totCases
count,676.0,676.0,672.0,672.0,672.0,672.0,672.0,672.0,672.0,672.0,672.0,672.0,672.0,629.0,629.0,661.0,661.0,657.0,676.0,520.0
mean,2006.5,26.464497,62.778333,297.844165,299.111214,295.513157,307.057887,292.832143,61.092024,88.863576,62.778333,17.123563,9.233355,27.53329,10.606971,33.994554,21.172466,55.928615,0.769231,7.565385
std,3.777712,14.99245,34.557077,1.155995,1.326678,1.37875,2.33006,1.641664,51.483816,7.385234,34.557077,1.410123,2.381186,0.889584,1.531324,1.321145,1.272756,58.689751,0.421637,10.765478
min,2000.0,1.0,0.0,294.554286,294.892857,290.088571,300.0,286.2,0.0,57.787143,0.0,12.111429,3.714286,21.4,5.2,29.6,14.2,0.0,0.0,0.0
25%,2003.0,13.75,38.995,297.0925,298.205357,294.6275,305.2,291.9,26.125,84.686786,38.995,16.155,7.4,27.0,9.533333,33.2,20.6,15.0,1.0,1.0
50%,2006.5,26.0,58.655,297.815,299.071429,295.875714,306.9,293.0,49.34,91.210714,58.655,17.45,9.021429,27.6,10.633333,34.0,21.3,38.8,1.0,5.0
75%,2010.0,39.0,83.7575,298.568929,300.071429,296.543214,308.7,294.1,79.375,94.595357,83.7575,18.176786,11.0,28.1,11.666667,34.9,22.0,77.0,1.0,9.0
max,2013.0,53.0,210.83,301.935714,303.328571,298.45,314.1,296.0,362.03,98.61,210.83,20.461429,16.028571,30.8,15.8,42.2,24.2,543.3,1.0,116.0


In [128]:
empty_cols = ['precipitation_amt_mm','reanalysis_air_temp_k','reanalysis_avg_temp_k','reanalysis_dew_point_temp_k','reanalysis_max_air_temp_k','reanalysis_min_air_temp_k','reanalysis_precip_amt_kg_per_m2','reanalysis_relative_humidity_percent','reanalysis_sat_precip_amt_mm','reanalysis_specific_humidity_g_per_kg','reanalysis_tdtr_k','station_avg_temp_c','station_diur_temp_rng_c','station_max_temp_c','station_min_temp_c','station_precip_mm']

In [270]:
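for col in empty_cols:
  # assign back instead of inplace=True on a column selection, which may not update the frame in newer pandas
  sj[col] = sj[col].fillna(sj[col].mean())
  iq[col] = iq[col].fillna(iq[col].mean())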
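The per-column loop above can also be expressed as one vectorized call, since `fillna` accepts a Series of fill values and aligns it on column names. A minimal sketch on a made-up frame (`demo` and `cols` are hypothetical names):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "station_avg_temp_c": [25.0, np.nan, 27.0],
    "station_precip_mm": [10.0, 20.0, np.nan],
    "totCases": [4.0, np.nan, 6.0],   # target column: must stay untouched
})
cols = ["station_avg_temp_c", "station_precip_mm"]

# demo[cols].mean() is a Series indexed by column name; fillna aligns on it,
# so each column is filled with its own mean.
demo[cols] = demo[cols].fillna(demo[cols].mean())
```

Restricting the call to `cols` matters: the missing `totCases` values mark the rows to predict and should not be imputed.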


In [131]:
sj.isna().sum()

year                                       0
weekofyear                                 0
precipitation_amt_mm                       0
reanalysis_air_temp_k                      0
reanalysis_avg_temp_k                      0
reanalysis_dew_point_temp_k                0
reanalysis_max_air_temp_k                  0
reanalysis_min_air_temp_k                  0
reanalysis_precip_amt_kg_per_m2            0
reanalysis_relative_humidity_percent       0
reanalysis_sat_precip_amt_mm               0
reanalysis_specific_humidity_g_per_kg      0
reanalysis_tdtr_k                          0
station_avg_temp_c                         0
station_diur_temp_rng_c                    0
station_max_temp_c                         0
station_min_temp_c                         0
station_precip_mm                          0
train                                      0
totCases                                 260
dtype: int64

## Standard Scaler

In [271]:
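scaler = StandardScaler()
# scale climate features only; leave target, flag and calendar columns as-is
scaler_cols = list(sj.drop(columns=['totCases','train','year','weekofyear']).columns)
sj[scaler_cols] = scaler.fit_transform(sj[scaler_cols])
# fit_transform refits the scaler, so each city gets its own mean and variance
iq[scaler_cols] = scaler.fit_transform(iq[scaler_cols])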
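The cell above fits the scaler on training and submission rows together. A common alternative, sketched here on made-up data and not what this notebook does, is to fit on the training rows only and then apply the same transform to every row (`demo`, `cols` and `is_train` are hypothetical names):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

demo = pd.DataFrame({
    "station_avg_temp_c": [24.0, 26.0, 28.0, 30.0],
    "train": [1, 1, 1, 0],   # last row plays the role of a submission row
})
cols = ["station_avg_temp_c"]
is_train = demo["train"] == 1

# Learn mean and standard deviation from the training rows only...
scaler = StandardScaler().fit(demo.loc[is_train, cols])
# ...then apply that same transform to training and submission rows alike.
demo[cols] = scaler.transform(demo[cols])
```

This keeps statistics of the rows to be predicted out of the preprocessing step.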

## Split data into features and targets

In [272]:
sj_targets = sj[sj.train==1][['totCases']].astype(int).copy()
sj_features = sj[sj.train==1].drop(columns=['totCases','train']).copy()

iq_targets = iq[iq.train==1][['totCases']].astype(int).copy()
iq_features = iq[iq.train==1].drop(columns=['totCases','train']).copy()

submission_test_sj = sj[sj.train==0].drop(columns=['totCases','train']).copy()
submission_test_iq = iq[iq.train==0].drop(columns=['totCases','train']).copy()

# Split data into train and test

In [157]:
def split(train_x,train_y):
  return train_test_split(train_x,train_y,test_size = .30,random_state= 45)
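`train_test_split` shuffles rows, so weeks from the held-out period are mixed into the training set. Since the data is a weekly time series, a chronological hold-out may give a more realistic error estimate. The helper below is a hypothetical alternative (`chronological_split` is not part of the notebook) and assumes the rows are already sorted by date:

```python
import pandas as pd

def chronological_split(features, targets, test_size=0.30):
    """Hold out the last `test_size` fraction of rows (assumes rows sorted by date)."""
    cut = int(round(len(features) * (1 - test_size)))
    return (features.iloc[:cut], features.iloc[cut:],
            targets.iloc[:cut], targets.iloc[cut:])

# Toy weekly data (values are made up).
idx = pd.date_range("1990-04-30", periods=10, freq="7D")
X = pd.DataFrame({"weekofyear": range(18, 28)}, index=idx)
y = pd.DataFrame({"totCases": range(10)}, index=idx)

x_train, x_test, y_train, y_test = chronological_split(X, y)
```

The return order mirrors `train_test_split`, so it could be swapped into `split` without touching the downstream cells.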

In [276]:
x_train,x_test,y_train,y_test = split(sj_features,sj_targets)

In [189]:
clf = LazyRegressor(verbose = 0,
                     ignore_warnings = True,
                     custom_metric = MAE,
)
models, predictions = clf.fit(x_train, x_test, y_train, y_test)
models

100%|██████████| 42/42 [00:28<00:00,  1.46it/s]


| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken | mean_absolute_error |
|---|---|---|---|---|---|
| LGBMRegressor | 0.6 | 0.63 | 31.23 | 0.18 | 17.22 |
| HistGradientBoostingRegressor | 0.6 | 0.62 | 31.35 | 0.45 | 17.58 |
| GradientBoostingRegressor | 0.55 | 0.58 | 33.34 | 0.53 | 17.73 |
| ExtraTreesRegressor | 0.53 | 0.56 | 33.92 | 0.89 | 17.6 |
| RandomForestRegressor | 0.45 | 0.48 | 36.82 | 1.53 | 17.66 |
| AdaBoostRegressor | 0.39 | 0.43 | 38.59 | 1.43 | 29.12 |
| BaggingRegressor | 0.39 | 0.43 | 38.76 | 0.38 | 18.43 |
| XGBRegressor | 0.37 | 0.41 | 39.24 | 1.82 | 16.42 |
| DecisionTreeRegressor | 0.26 | 0.31 | 42.47 | 0.12 | 18.71 |
| ExtraTreeRegressor | 0.23 | 0.28 | 43.36 | 0.08 | 19.89 |


In [277]:
gbm = xgb.XGBRegressor()
reg_cv = GridSearchCV(gbm, {"colsample_bytree":[1.0],"min_child_weight":[1.0,1.2]
                            ,'max_depth': [3,4,6], 'n_estimators': [500,1000]}, verbose=1)
reg_cv.fit(x_train,y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


In [193]:
reg_cv.best_params_

{'colsample_bytree': 1.0,
 'max_depth': 4,
 'min_child_weight': 1.2,
 'n_estimators': 1000}
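The grid search above relies on `GridSearchCV`'s default scoring, which for regressors is the estimator's own `score` method (R²), while the models here are compared by MAE. Passing `scoring="neg_mean_absolute_error"` makes the search optimize MAE directly. The sketch below uses synthetic data and scikit-learn's `GradientBoostingRegressor` as a stand-in for `XGBRegressor`; the grid values are arbitrary.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=120)

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    {"max_depth": [2, 3], "n_estimators": [20, 40]},
    scoring="neg_mean_absolute_error",  # scores are negated MAE: closer to 0 is better
    cv=3,
)
search.fit(X, y)

best_mae = -search.best_score_  # flip the sign to recover the cross-validated MAE
print(search.best_params_, best_mae)
```

The selected parameters can differ from those chosen under R², so the best parameters reported above would not necessarily carry over.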

Train an `XGBRegressor` with the best parameters found by the grid search.

In [278]:
gbm = xgb.XGBRegressor(**reg_cv.best_params_)
gbm.fit(x_train,y_train)

In [196]:
predictions = gbm.predict(x_test)

In [199]:
MAE(y_test, predictions)

14.41036495348949
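For reference, mean absolute error is the average of the absolute differences between true and predicted values, MAE = (1/n) Σ |yᵢ − ŷᵢ|. A minimal sketch with made-up numbers:

```python
import numpy as np

y_true = np.array([3, 0, 10, 7])
y_pred = np.array([2.5, 1.0, 8.0, 7.0])

# Absolute errors are 0.5, 1.0, 2.0 and 0.0; their mean is the MAE.
mae = np.mean(np.abs(y_true - y_pred))
print(mae)  # 0.875
```

An MAE of about 14.4 therefore means the San Juan predictions on the held-out rows are off by roughly 14 cases per week on average.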

In [279]:
submission = submission_test_sj[['year', 'weekofyear']].reset_index(drop=True).copy()
submission['total_cases'] = gbm.predict(submission_test_sj)
submission[['city']] = 'sj'

In [286]:
submission

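|  | year | weekofyear | total_cases | city |
|---|---|---|---|---|
| 0 | 2008 | 18 | 13 | sj |
| 1 | 2008 | 19 | 5 | sj |
| 2 | 2008 | 20 | 19 | sj |
| 3 | 2008 | 21 | 24 | sj |
| 4 | 2008 | 22 | 10 | sj |
| ... | ... | ... | ... | ... |
| 255 | 2013 | 13 | 0 | sj |
| 256 | 2013 | 14 | 14 | sj |
| 257 | 2013 | 15 | 8 | sj |
| 258 | 2013 | 16 | -2 | sj |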
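The preview above contains a negative prediction (-2 for week 16 of 2013), which cannot be a valid case count. One simple fix, sketched here with made-up values and not applied in this notebook, is to clip predictions at zero before rounding:

```python
import numpy as np

raw = np.array([13.2, 4.6, -2.3, 0.4])   # hypothetical raw model outputs

# Clip negatives to 0, then round to the nearest whole number of cases.
cases = np.clip(raw, 0, None).round().astype(int)
print(cases)  # [13  5  0  0]
```

Since true counts are never negative, clipping can only bring such predictions closer to the truth, so it cannot increase the MAE.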


Repeat the same steps for Iquitos (`iq`).

In [211]:
x_train,x_test,y_train,y_test = split(iq_features,iq_targets)

In [212]:
clf = LazyRegressor(verbose = 0,
                     ignore_warnings = True,
                     custom_metric = MAE,
)
models, predictions = clf.fit(x_train, x_test, y_train, y_test)
models

100%|██████████| 42/42 [00:10<00:00,  3.99it/s]


| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken | mean_absolute_error |
|---|---|---|---|---|---|
| ExtraTreesRegressor | 0.06 | 0.17 | 6.47 | 0.43 | 4.92 |
| LassoLarsIC | -0.01 | 0.11 | 6.72 | 0.02 | 5.34 |
| LassoCV | -0.01 | 0.1 | 6.73 | 0.45 | 5.33 |
| LassoLarsCV | -0.01 | 0.1 | 6.74 | 0.07 | 5.33 |
| LarsCV | -0.01 | 0.1 | 6.74 | 0.08 | 5.33 |
| OrthogonalMatchingPursuit | -0.02 | 0.1 | 6.75 | 0.02 | 5.24 |
| ElasticNetCV | -0.02 | 0.1 | 6.75 | 0.2 | 5.37 |
| ElasticNet | -0.02 | 0.1 | 6.76 | 0.02 | 5.37 |
| Lasso | -0.03 | 0.09 | 6.78 | 0.04 | 5.36 |
| LassoLars | -0.03 | 0.09 | 6.78 | 0.04 | 5.36 |


Fit a `HuberRegressor`, a linear model that is robust to outliers:

In [215]:
# pass the target as a 1-D array, which is what scikit-learn expects
huber = HuberRegressor().fit(x_train, y_train.values.ravel())

In [216]:
MAE(y_test, huber.predict(x_test))

4.694724645343658

In [287]:
submission_iq = submission_test_iq[['year', 'weekofyear']].reset_index(drop=True).copy()
submission_iq['total_cases'] = huber.predict(submission_test_iq)
submission_iq[['city']] = 'iq'

In [288]:
submission = pd.concat([submission, submission_iq])  # DataFrame.append was removed in pandas 2.0

In [289]:
submission=submission[['city','year', 'weekofyear', 'total_cases']]

In [290]:
submission[['total_cases']] = submission[['total_cases']].round().astype(int)

In [291]:
submission.to_csv("submission.csv", index=False)
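Before uploading, a quick sanity check of the file's structure can catch formatting mistakes. The checks below are assumptions drawn from what this notebook produces (column order, integer counts, two city codes), not an official specification, and `check_submission` is a hypothetical helper:

```python
import pandas as pd

def check_submission(sub):
    """Assumed format checks; raises AssertionError on the first violation."""
    expected = ["city", "year", "weekofyear", "total_cases"]
    assert list(sub.columns) == expected, "unexpected columns or order"
    assert sub.notna().all().all(), "missing values present"
    assert pd.api.types.is_integer_dtype(sub["total_cases"]), "total_cases must be integer"
    assert (sub["total_cases"] >= 0).all(), "negative case counts"
    assert set(sub["city"]) <= {"sj", "iq"}, "unknown city code"
    return True

# Toy submission (values are made up).
demo = pd.DataFrame({
    "city": ["sj", "iq"],
    "year": [2008, 2010],
    "weekofyear": [18, 26],
    "total_cases": [13, 2],
})
ok = check_submission(demo)
```

Run against this notebook's `submission` as it stands, the non-negativity check would fail wherever a rounded prediction is below zero.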

In [292]:
submission.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 416 entries, 0 to 155
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   city         416 non-null    object
 1   year         416 non-null    int64 
 2   weekofyear   416 non-null    int64 
 3   total_cases  416 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 16.2+ KB


Submission score: 26.7524 (the competition metric is mean absolute error, so lower is better)
