# <p><center>Predictions of Local Epidemics of Dengue Fever</p></center>

Dengue fever is a mosquito-borne illness that mainly occurs in tropical and sub-tropical environments. According to the World Health Organization, cases of Dengue fever have increased dramatically in the last few decades and half of the world's population is currently at risk. Dengue fever can develop into the potentially lethal severe dengue, which is the leading cause of death among children in some countries.

<img src="img/location.jpg" width="600"></img>
<center><i>Figure 1.<cite><a href="http://www.nathnac.org/pro/factsheets/images/clip_image002_015.jpg"><i> Dengue fever risk map 2008.</i></a> from the World Health Organization.</cite> Dengue fever primarily occurs between 0 to 30 degrees latitiude.</i></center>

This data analysis will look at two cities in Peru, San Juan and Iquitos, and predict the total cases of dengue fever for the year and week of year for each city. The dataset includes data from NOAA's GHCN daily climate data weather station measurements, PERSIANN satellite precipitation measurements on a 0.25x0.25 degree scale, NOAA's NCEP Climate Forecast System Reanalysis measurements on a 0.5x0.5 degree scale, and satellite vegetation - NOAA's CDR Normalized Difference Vegetation Index (NDVI) on 0.5x0.5 degree scale measurements.


Significant variables we will look at in the dataset include:

<ul>
    <li>Temperature (air, dewpoint)</li>
    <li>Precipitation</li>
    <li>Humidity</li>
    <li>How "green" an area is (the NDVI)</li>
    <li>Day-Week-Month</li>
    <li>Location</li>
    
</ul>

<img src="img/weather.png" width="600"></img><center>Figure 2.<cite><a href="https://www.researchgate.net/figure/Two-most-occurring-Aedes-mosquitoes-amount-of-rainfall-temperature-and-relative_fig2_317867099"><i>Mosquito occurence in Nigeria compared with rainfall, temperature, and relative humidity</i></a> from Reasearch Gate.</cite></center>

Because we are making the assumption that an increase in mosquitos is positively correlated in an increase in Dengue cases, it is helpful to understand what factors lead to a mosquito increase. This graph shows mosquito occurence compared to average rainfall, temperature, and relative humidity in Nigeria from March 2015 to February 2016. Being that the variables included in the graph are also in our dataset we can gain a better estimation on what variables will be important in our analysis.

<img src="img/NDVI.png" width="450"></img><center>Figure 3.<cite><a href="https://geonetcast.wordpress.com/2017/08/29/python-can-do-anything-3-ndvi-from-suomi-npp-in-gnc-a/"><i> NDVI satellite image of Peru from July 2017</i></a> from GEONETcast.</cite></center>


In [4]:
# Import libraries
from sklearn.tree import DecisionTreeClassifier  
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

ImportError: C extension: 'iNaT' from 'pandas._libs.tslib' (unknown location) not built. If you want to import pandas from the source directory, you may need to run 'python setup.py build_ext --inplace --force' to build the C extensions first.

In [None]:
#read in csvs
f_test = pd.read_csv("dengue_features_test.csv")
f_train = pd.read_csv("dengue_features_train.csv")
l_train = pd.read_csv("dengue_labels_train.csv")
test = f_test
test['cases'] = l_train['total_cases']

In [None]:
#Create a correlation heat map
hmap = f_test
hmap['cases'] = l_train['total_cases']
plt.figure(figsize=(14, 12))
sns.heatmap(hmap.drop(['city','week_start_date'],axis=1).corr(), annot=True, annot_kws={"size": 8})

In [1]:
sp_y = hmap['cases']
sp = hmap.drop(['city','week_start_date','year','cases', 'weekofyear'], axis = 1)
fig = plt.figure(1, figsize = [20,20])

columns = 4 
rows = 5

for i in range(1, columns*rows):
    fig.add_subplot(rows, columns, i)
    plt.xlabel(sp.columns[i])
    plt.ylabel('Number of Cases')
    plt.scatter(sp[sp.columns[i]], sp_y)


plt.tight_layout()
plt.show()
len(sp.columns)


NameError: name 'np' is not defined

In [133]:
#replace NaN with previous value
X = f_test
X['cases'] = l_train['total_cases']
X.fillna(method='ffill', inplace=True)

In [134]:
#using linear regression, talk about why....

In [135]:
y = X['cases']

In [136]:
#remove columns relating to time and city
X = X.drop(['city','week_start_date','year','cases', 'weekofyear'], axis = 1)

In [143]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)
y_pred = regr.predict(X_test)

In [149]:
print(reg.coef_)
print(reg.intercept_)
print(mean_squared_error(y_test, y_pred))
print(r2_score(y_test, y_pred))

[ 5.34372863e+01 -1.20082887e+02  7.20650716e+01 -2.76026689e+01
 -3.58675207e-02  3.98942137e+00  1.72326698e+01  5.74159907e+00
  4.08492234e+00 -3.42567651e+00 -5.12615860e-02  2.89080802e+00
 -3.58675207e-02 -1.81196846e+01 -1.19607808e+01  7.81968823e+00
  5.57399079e+00 -2.20942158e+00  3.95418066e+00 -7.38079529e-02]
-8369.251088240251
3190.7842956317704
0.1456178435605029


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

ValueError: Classification metrics can't handle a mix of continuous and multiclass targets