# West Nile Virus in Chicago: Analysis & Predictive Modeling
---

## Background
---

West Nile virus (WNV) is the leading cause of mosquito-borne disease in the United States. It first emerged in the United States in the New York metropolitan area in 1999. The virus, which can be transmitted to humans by the bite of an infected mosquito, has since spread quickly across the country. 

While only around 20% of people who become infected with the virus develop symptoms (ranging from mild symptoms like a persistent fever to serious neurological illnesses that can result in death), the cost of medical treatment can be high. According to the Centers for Disease Control and Prevention, no vaccine or specific antiviral treatments are available. As such, prevention of the disease relies largely on managing mosquitos through various control tactics. 

In 2002, the first human cases of WNV were reported in Chicago. By 2004, the City of Chicago and the Chicago Department of Public Health (CDPH) established a comprehensive surveillance and control program that is still in effect today. Every week from late spring through the fall, mosquitos in traps across the city are tested for the virus. The results of these tests influence when and where the city will spray airborne pesticides to control adult mosquito populations.

<b> Additional Sources: </b>

1) Centers for Disease Control and Prevention <br>
 (https://www.cdc.gov/westnile/index.html) <br>
2) West Nile Virus: An Historical Overview <br>
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3111838/) <br>
3) Economic Cost Analysis of West Nile Virus Outbreak, Sacramento County, California, USA, 2005 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3322011/#R6)


## Problem Statement
---

As a team of data scientists from the Chicago Department of Public Health (CDPH), we are tasked with building a model that can help accurately predict when and where different species of mosquitos will test positive for WNV, using weather, location, testing, and spraying data. 

The model should help the City of Chicago and CDPH more efficiently and effectively allocate resources towards preventing transmission of this potentially deadly virus.

## Executive Summary
---

This project aims to provide an aggregated analysis of weather, spray and trap data in the city of Chicago for the period 2007 to 2013, for the purpose of constructing a predictive model for forecasting occurrences of West Nile Virus.

Overall, the production model did a decent job of identifying the most predictive features, and it can be used to help the city identify potential outbreaks. As it has the fewest false negatives among the models tested, it is the most likely to detect areas with WNV presence. Combined with vigilant monitoring on the ground and public education on preventing mosquito breeding, we believe this could significantly decrease the number of mosquitos as well as the presence of WNV.

Should resources permit, we can also do a study on reducing the number of reservoir hosts (dead birds) within the city to effectively kill the source of the WNV.

**Cost-Benefit Analysis**<br>


According to a study on the 2005 outbreak of WNV disease in Sacramento County, California, treatment costs for patients were approximately USD 2,140,409, and total costs including productivity loss were approximately USD 2,844,338 (across 46 patients). This amounts to roughly USD 46,500 per person in medical costs and USD 15,300 per person in productivity loss. Based on the average number of cases in Chicago over the past 3 years (roughly 50), the total expected loss amounts to approximately USD 3.1 million. 

On the other hand, spray procedures based on bi-weekly spraying of traps and hotspots during the breeding season (assumed to be 6 months) are estimated to cost roughly USD 706,320 (USD 155,520 to spray traps + USD 550,800 for sprays in the city).

Overall, the estimated costs of medical treatment and productivity loss far exceed the spray costs. This reinforces the project's focus on reducing false negatives: the medical and human costs of the virus spreading are much higher than any additional expenditure on vector control incurred by acting on false positives.


Notes:
Trap spray costs are based on an estimated cost of USD 72 per trap for 90 traps <br>
City spray costs are based on a cost of USD 1.70/acre covering roughly 54,000 acres (84 square miles) <br>
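
The arithmetic behind these figures can be checked with a short script. This is a minimal sketch that uses only the numbers quoted in this section; the implied number of spray rounds is obtained by division and is an inference, not a figure stated in the sources.

```python
# Inputs quoted in the cost-benefit analysis above
patients = 46
medical_total = 2_140_409      # USD, treatment costs
overall_total = 2_844_338      # USD, treatment + productivity loss
expected_cases = 50            # rough yearly average for Chicago

medical_per_patient = medical_total / patients
productivity_per_patient = (overall_total - medical_total) / patients
expected_loss = expected_cases * overall_total / patients

# Spray costs from the notes above
trap_cost_per_round = 72 * 90          # USD 72 per trap, 90 traps
city_cost_per_round = 1.70 * 54_000    # USD 1.70 per acre, 54,000 acres
trap_total = 155_520
city_total = 550_800
spray_total = trap_total + city_total

# Inferred, not stated in the text: rounds implied by the totals
implied_trap_rounds = trap_total / trap_cost_per_round
implied_city_rounds = city_total / city_cost_per_round

print(f"Medical cost per patient:      USD {medical_per_patient:,.0f}")
print(f"Productivity loss per patient: USD {productivity_per_patient:,.0f}")
print(f"Expected yearly loss:          USD {expected_loss:,.0f}")
print(f"Total spray cost:              USD {spray_total:,.0f}")
print(f"Loss-to-spray-cost ratio:      {expected_loss / spray_total:.1f}x")
```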

Sources:
1) Economic Cost Analysis of West Nile Virus Outbreak, Sacramento County, California, USA, 2005 <br>
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3322011/#R6) <br>
2) Healthy Chicago Data Brief West Nile Virus <br>
(https://www.chicago.gov/content/dam/city/depts/cdph/food_env/general/West_Nile_Virus/WNV_2018databrief_FINALJan102019.pdf) <br>
3) Comparison of the Efficiency and Cost of West Nile Virus Surveillance Methods in California <br>
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4340646/) <br>
4) City to Spray Insecticide Thursday to Kill Mosquitoes <br>
(https://www.chicago.gov/city/en/depts/cdph/provdrs/healthy_communities/news/2020/august/city-to-spray-insecticide-thursday-to-kill-mosquitoes0.html#:~:text=CHICAGO%20%2D%20The%20Chicago%20Department%20of,Thursday%2C%20August%2027%2C%202020) <br>
5) Cost of Spray <br>
(https://www.forestrydistributing.com/aqua-zenivex-e20-ulv-insecticide-zeocon) <br>
6) Mosquito Control Resources for Professionals <br>
(https://www.centralmosquitocontrol.com/resources/calculator) <br>

**Train set**<br>
In the training set, duplicate entries were found and dropped. Duplicates were identified by filtering for entries with fewer than 50 mosquitoes and then checking for duplicated rows.
As the duplicate entries made up less than 2% of the dataset, they were dropped. In addition, the Number of Mosquitoes feature is capped at 50 per entry, which caused inconsistencies during Exploratory Data Analysis. 
This was resolved by summing the counts across entries that share the same date, trap and species. 

The month of August showed the largest increase in the number of mosquitoes as well as West Nile Virus (WNV) presence across the years. 
This suggests some correlation between the surge in the number of mosquitoes and WNV presence. 

The dominant WNV-carrying species are `CULEX PIPIENS` and `CULEX RESTUANS`; the other species show little or no presence of WNV.

**Test set**<br>
No cleaning was done on the test set as there were no null or duplicate values. 

There is an additional `UNSPECIFIED CULEX` species in the test set under the Species feature. It was added as a one-hot encoded dummy variable in the training set to be consistent with the test set.
There were also unique traps found in the test set that were not present in the training set. 

**Spray Set**<br>
In the Spray Dataset, spraying was only conducted in two years (2011 & 2013) across 10 days (twice in 2011 and the rest in 2013). The dataset has 500+ missing values in the time column, which were initially filled with the modal spray time; the column was subsequently dropped as time does not help with the analysis. The dataset also contains 500+ duplicated records with the same date, time, latitude and longitude, which could indicate a data input error, so the duplicates were dropped.

**Weather Set**<br>
Cleaning the weather dataset involved splitting it by the 2 available stations; missing values in one station were imputed from the other station, with the dataset-wide mean difference for the affected features accounted for. Weather effects, originally encoded in the `CodeSum` column, were one-hot encoded into individual effect columns to facilitate numerical analysis. We then performed Exploratory Data Analysis (EDA) using the cleaned weather dataset. As part of the EDA, we plotted histograms of the weather features for a single year, 2013, as it was the year with the highest incidence of WNV.

In the feature engineering phase, the train and weather datasets were combined by taking the weather effects of the nearer station. Historical weather columns were then engineered so that each trap record carries rolling 7/14/21-day aggregated weather data, as we believed that preceding weather might influence mosquito breeding. For the spray dataset, we engineered rolling 7/14-day spray-count columns at several arbitrarily chosen distances, indicating how many sprays were conducted near a given trap within the preceding period.
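
The rolling weather features can be sketched as follows. This is an illustration under assumptions, not the project's feature-engineering code: the toy `Tavg` values, the `Tavg_roll{N}` column names and the choice of the mean as the aggregate are all hypothetical. A time-based window (`"7D"`) is used so that gaps in the date range do not silently stretch the window.

```python
import pandas as pd

# Toy daily weather for one station (values are illustrative only)
weather = pd.DataFrame({
    "Date": pd.date_range("2013-07-01", periods=10, freq="D"),
    "Tavg": [70, 72, 74, 76, 78, 80, 82, 84, 86, 88],
})
weather = weather.set_index("Date").sort_index()

# Rolling aggregates over the trailing 7, 14 and 21 calendar days.
# Each window covers the current day and the days before it.
for window in (7, 14, 21):
    weather[f"Tavg_roll{window}"] = (
        weather["Tavg"].rolling(f"{window}D").mean()
    )

print(weather)
```

Merging these columns onto trap records by date would then give every observation its preceding-weather context.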

**Exploratory Data Analysis**<br>
In the Exploratory Data Analysis phase, combined bar/line plots were made to study weather trends in relation to mosquito count and WNV occurrences. Periods of high temperature, usually occurring in July/August, coincided with high mosquito counts and subsequently high WNV occurrences. Inconsistent trends were found when mosquito count and WNV occurrence were plotted against precipitation. In the week-by-week analysis of daylight duration, decreasing daylight seems to lead to a decrease in mosquito breeding. Plotting trap data against spray count revealed an inconclusive effect of spraying. This was attributed to insufficient data (10 dates of spraying across 2 of the 4 available years of trap data) as well as the unclear impact of spraying on following-month mosquito counts. The heatmap of correlation scores with respect to WNV occurrence generally shows weak correlation across the board, likely due to the imbalanced dataset (close to 400 WNV-positive observations vs over 7000 WNV-negative observations). Boxplots revealed that preceding periods of higher temperature, and possibly low precipitation, may contribute to an increase in occurrences of WNV.

**Modeling**<br>
In the Modeling phase, we picked WNV-positive as our target class and settled on <b>sensitivity</b> as our key evaluation metric. This follows from our priority of reducing false negatives (traps predicted as WNV-negative that are in reality WNV-positive). While false positives would technically drive down the overall accuracy score, especially when the majority class is WNV-negative, accuracy obscures the reality that the cost of treating patients with WNV illness far surpasses the cost of mosquito surveillance measures. 
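
To make the point concrete, the sketch below computes sensitivity, specificity and accuracy from confusion-matrix counts. The helper name `confusion_rates` is hypothetical, and the class sizes (400 positive, 7000 negative) are round numbers chosen to mirror the imbalance described in the EDA, not the exact dataset counts.

```python
def confusion_rates(tp, fn, tn, fp):
    """Return sensitivity, specificity and accuracy from confusion counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # recall on WNV-positive
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # recall on WNV-negative
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
    }

# A model that always predicts WNV-negative on 400 positives / 7000 negatives
always_negative = confusion_rates(tp=0, fn=400, tn=7000, fp=0)
print(always_negative)
```

Such a model scores about 0.95 on accuracy while detecting no WNV-positive traps at all, which is why accuracy alone is a poor guide here.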

As the classes are heavily imbalanced, we used the Synthetic Minority Oversampling Technique (SMOTE) to oversample the positive class (WNV-present). Using SMOTE and a grid search to find the optimal hyperparameters, we ran 10 models: Logistic Regression, K-Nearest Neighbors, Decision Tree, Bagging Tree, Gradient Boosting, Extra Trees, Random Forest, AdaBoost, SVC and XGBoost. We recorded train and test scores, ROC-AUC scores, specificity, sensitivity and F1 scores for each of the 10 models. 
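
The idea behind SMOTE can be illustrated with a few lines of NumPy. This is a simplified sketch of the interpolation step only, written for this explanation; it is not a substitute for a library implementation, and the function name `smote_sketch` and its parameters are hypothetical. Each synthetic point is placed at a random position on the segment between a minority-class sample and one of its nearest minority-class neighbours.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by neighbour interpolation."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)          # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a base sample and one of its k neighbours for each new point
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]

    # Interpolate at a random fraction along the connecting segment
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Four minority samples on the corners of the unit square
X_min = [[0, 0], [1, 0], [0, 1], [1, 1]]
synthetic = smote_sketch(X_min, n_new=6, k=2)
print(synthetic)
```

Oversampling of this kind should be applied to the training folds only, so that synthetic points never leak into the data used for evaluation.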

Although the Gradient Boosting and XGBoost models performed best in terms of accuracy (0.91 and 0.92 respectively), both had rather low sensitivity scores (0.12 and 0.14 respectively), i.e. they did not do a good job of predicting the positive class (presence of WNV). In this regard, the Logistic Regression and AdaBoost classifiers came out on top, with sensitivity scores of 0.84 and 0.73 and ROC-AUC scores of 0.78 each.
Ultimately, Logistic Regression was chosen as the production model as it had the highest sensitivity score (0.81) and ROC-AUC score (0.77) amongst all the models.

## Content
---
- [Data Cleaning Summary](../code/01_Data_Cleaning.ipynb) **Current Notebook**
- [Train-test Investigation](../code/01a_Train_Test_Investigation.ipynb)
- [Weather Investigation](../code/01b_Weather_Investigation.ipynb)
- [Spray Investigation](../code/01c_Spray_Investigation.ipynb)
- [Feature Engineering](../code/02a_Feature_Engineering.ipynb)
- [Combined EDA and Preprocessing](../code/02b_Combined_EDA_and_Preprocessing.ipynb)
- [Model Tuning & Conclusion](../code/03_Model_Tuning_&_Conclusion.ipynb)

In [None]:
import pandas as pd
import numpy as np

## Train-test Investigation
---

**Summary of train**

In the train data set, duplicate entries were found and dropped. To qualify as a duplicate, an entry had to have fewer than 50 mosquitoes and be an exact duplicate of another row. As they made up less than 2% of the dataset, they were dropped. Traps with more than 50 mosquitoes created outlier values (counts are capped at 50 per entry); this was resolved by grouping by 'Date', 'Trap' and 'Species' and then summing 'NumMosquitos'. There are no null values in the training set. There is very little correlation between 'NumMosquitos' and 'WnvPresent'. The addresses with the highest WNV presence are W OHARE AIRPORT, S DOTY AVE and N OAK PARK AVE. 

WNV occurs most often in August, followed by a large dip in September. The most commonly occurring species are CULEX PIPIENS and CULEX RESTUANS. Based on their coordinates, the areas with WNV presence fall between longitudes -87.80 and -87.735. Most of the time, around 25 mosquitoes were caught by the traps. The number of mosquitoes increased from May to a peak in August and declined from there onward. July, August and September appear to be the months in which mosquitos carry WNV. The number of mosquitos captured declined over the years, but there was a sudden increase in 2013. 2007 and 2013 are the years in which the highest numbers of mosquitoes were caught.
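
The de-duplication and grouping steps described above can be sketched as follows. The toy rows are made up for illustration, and taking the maximum of `WnvPresent` within each group is an assumption (the text only states that `NumMosquitos` was summed); the output column name `NumMosquitos_sum` follows the cleaned train file shown later in this notebook.

```python
import pandas as pd

# Toy trap records (illustrative values only)
train = pd.DataFrame({
    "Date": ["2007-08-01"] * 5,
    "Trap": ["T002", "T002", "T002", "T007", "T007"],
    "Species": ["CULEX PIPIENS"] * 5,
    "NumMosquitos": [50, 50, 12, 3, 3],
    "WnvPresent": [0, 1, 0, 0, 0],
})

# Step 1: drop exact duplicate rows, but only among entries below the cap of 50.
# Capped entries are kept because repeated rows of 50 can be genuine overflow.
below_cap = train["NumMosquitos"] < 50
is_duplicate = below_cap & train.duplicated(keep="first")
deduped = train[~is_duplicate]

# Step 2: collapse the capped entries by summing counts per date/trap/species.
# Assumption: a group counts as WNV-positive if any of its entries is positive.
grouped = deduped.groupby(["Date", "Trap", "Species"], as_index=False).agg(
    NumMosquitos_sum=("NumMosquitos", "sum"),
    WnvPresent=("WnvPresent", "max"),
)

print(grouped)
```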


**Summary of test**

No duplicates or null values were found in the test set. There is an additional `UNSPECIFIED CULEX` species in the test set under the Species feature. It will be added as a dummy variable in the training set to be consistent with the test set. There are also some traps that were found in the training set but not in the test set.
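
One way to keep the dummy columns consistent between the two sets is to encode each set separately and then reindex both to the union of columns. This is a minimal sketch with made-up rows; the `Species_` prefix is an assumption about the naming, not taken from the project code.

```python
import pandas as pd

# Toy frames: the test set contains a species the training set never sees
train = pd.DataFrame({"Species": ["CULEX PIPIENS", "CULEX RESTUANS"]})
test = pd.DataFrame({"Species": ["CULEX PIPIENS", "UNSPECIFIED CULEX"]})

train_dummies = pd.get_dummies(train["Species"], prefix="Species", dtype=int)
test_dummies = pd.get_dummies(test["Species"], prefix="Species", dtype=int)

# Align both frames to the same, sorted set of columns; missing ones become 0
all_columns = sorted(set(train_dummies.columns) | set(test_dummies.columns))
train_dummies = train_dummies.reindex(columns=all_columns, fill_value=0)
test_dummies = test_dummies.reindex(columns=all_columns, fill_value=0)

print(train_dummies.columns.tolist())
```

With identical columns in identical order, a model fitted on the training features can score the test features without a shape mismatch.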

In [None]:
# Read CSV
train_cleaned = pd.read_csv('../assets/cleaned_train.csv')
test_cleaned = pd.read_csv('../assets/test.csv')

In [None]:
# DataFrame of train CSV
train_cleaned.head()

Unnamed: 0,Date,Species,Block,Street,Trap,Latitude,Longitude,WnvPresent,NumMosquitos_sum
0,2007-05-29,CULEX PIPIENS/RESTUANS,41,N OAK PARK AVE,T002,41.95469,-87.800991,0,1.0
1,2007-05-29,CULEX RESTUANS,41,N OAK PARK AVE,T002,41.95469,-87.800991,0,1.0
2,2007-05-29,CULEX RESTUANS,62,N MANDELL AVE,T007,41.994991,-87.769279,0,1.0
3,2007-05-29,CULEX PIPIENS/RESTUANS,79,W FOSTER AVE,T015,41.974089,-87.824812,0,1.0
4,2007-05-29,CULEX RESTUANS,79,W FOSTER AVE,T015,41.974089,-87.824812,0,4.0


In [None]:
train_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7370 entries, 0 to 7369
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Date              7370 non-null   object 
 1   Species           7370 non-null   object 
 2   Block             7370 non-null   int64  
 3   Street            7370 non-null   object 
 4   Trap              7370 non-null   object 
 5   Latitude          7370 non-null   float64
 6   Longitude         7370 non-null   float64
 7   WnvPresent        7370 non-null   int64  
 8   NumMosquitos_sum  7370 non-null   float64
dtypes: float64(3), int64(2), object(4)
memory usage: 518.3+ KB


In [None]:
# DataFrame of test CSV
test_cleaned.head()

Unnamed: 0,Id,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy
0,1,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS/RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
1,2,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
2,3,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
3,4,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX SALINARIUS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
4,5,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX TERRITANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9


## Weather Investigation
---

Findings from cleaning the weather data are as follows:

1) No duplicate dates for both stations.

2) First date and last date for both stations are the same.

3) There are gaps in the date-range of both stations, i.e. weather info not collected at certain times of the year.

4) `Tavg` investigation revealed no problem with station-1, but 11 records in station-2 with 'M' value. The 11 `Tavg` records in station-2 were patched by taking the mean of `Tmax` and `Tmin`, then rounding the result. After patching, `Tavg` in both stations was successfully converted to *int64*.

5) `Depart` investigation revealed no problem with station-1, but all (1472) records in station-2 with 'M' value. The 30-year average temperatures for both stations are assumed to be very similar, so `Depart` values from station-1 were copied over to the `Depart` column in station-2.

6) `WetBulb` investigation revealed 3 'M' values in station-1 and 1 'M' value in station-2. The average difference between the stations is 0.5 deg F, i.e. station-2 wet-bulb is 0.5 deg F higher than station-1 on average. Histogram and boxplot analysis of the differences in wet-bulb temperatures between the 2 stations reveals that they are very close to being the same, so we patched missing values in one station with the value from the other station, with the average difference included in the calculation. Fortunately, there were no instances where wet-bulb temperatures were missing from both stations on the same day.

7) `Heat` investigation revealed no problem with station-1, but 11 'M' values in station-2. No significant difference, though station-1 is 0.4 deg F higher than station-2. Bad records in station-2 were patched in the same manner as `WetBulb`.

8) `Cool` investigation revealed no problem with station-1, but 11 'M' values in station-2. No significant difference, though station-1 is 0.8 deg F lower than station-2. Bad records in station-2 were patched in the same manner as `WetBulb`.

9) Station-1 sunrise timings have no abnormalities, but sunset timings contain invalid values such as 1660, 1760 and 1860, which were replaced with 1700, 1800 and 1900 respectively. A helper function was used to create 2 new columns, `Sunrise_datetime` and `Sunset_datetime`, for station-1; both new columns were then replicated to station-2.

10) `Depth`, `Water1` and `SnowFall` displayed almost no variation, so they were dropped from datasets of both stations.

11) For `PrecipTotal`, the trace precipitation amount *T* was replaced with 0.001, justified by the definition found in the NOAA document. After replacement, station-1 `PrecipTotal` values had no more problems while station-2 had 2 values marked as 'M'. The mean difference is 0.0, so missing values in station-2 were filled with values from station-1. Both columns were converted to float64.

12) `StnPressure` investigation revealed 2 missing values for both stations. The average difference in station pressure between both stations is 0.06. We manually patch #424 in station-1 and #43 in station-2. For #1205 in both stations, we fill with the respective `StnPressure` mean values from both stations.

13) For `SeaLevel`, there were 5 missing values in station-1 and 4 missing values in station-2. No significant difference, though station-1 is 0.01 Hg higher than station-2. Bad records in both stations were patched in the same manner as `WetBulb`.

14) For `AvgSpeed`, there were 3 missing values in station-2. No significant difference, though station-2 was 0.02 mph higher than station-1. Bad records in station-2 were patched in the same manner as `WetBulb`.

15) `CodeSum` was one-hot encoded, and the dummified columns were synchronized across both stations.
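
The cross-station patching used for `WetBulb` and the columns handled "in the same manner" can be sketched as below. The helper name `fill_from_other_station` and the sample readings are hypothetical; the sketch assumes the two series are aligned by date and that 'M' marks a missing value, as in the findings above.

```python
import pandas as pd

def fill_from_other_station(stn1, stn2):
    """Fill gaps in each station from the other, offset by the mean difference."""
    # Mean of (station-2 minus station-1) over dates where both are present
    mean_diff = (stn2 - stn1).mean()
    stn1_filled = stn1.fillna(stn2 - mean_diff)
    stn2_filled = stn2.fillna(stn1 + mean_diff)
    return stn1_filled, stn2_filled

# Toy wet-bulb readings, aligned by date; 'M' marks a missing value
raw_stn1 = pd.Series(["56", "M", "48", "50"])
raw_stn2 = pd.Series(["57", "47", "M", "51"])

# Convert to numbers, turning 'M' into NaN
stn1 = pd.to_numeric(raw_stn1, errors="coerce")
stn2 = pd.to_numeric(raw_stn2, errors="coerce")

stn1_filled, stn2_filled = fill_from_other_station(stn1, stn2)
print(stn1_filled.tolist(), stn2_filled.tolist())
```

This only works when a value is never missing from both stations on the same date, which the findings above confirm for wet-bulb temperature.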

In [None]:
weather_cleaned = pd.read_csv('../assets/weather_cleaned.csv')
weather_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2944 entries, 0 to 2943
Data columns (total 34 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Station           2944 non-null   int64  
 1   Date              2944 non-null   object 
 2   Tmax              2944 non-null   int64  
 3   Tmin              2944 non-null   int64  
 4   Tavg              2944 non-null   int64  
 5   Depart            2944 non-null   int64  
 6   DewPoint          2944 non-null   int64  
 7   WetBulb           2944 non-null   int64  
 8   Heat              2944 non-null   int64  
 9   Cool              2944 non-null   int64  
 10  PrecipTotal       2944 non-null   float64
 11  StnPressure       2944 non-null   float64
 12  SeaLevel          2944 non-null   float64
 13  ResultSpeed       2944 non-null   float64
 14  ResultDir         2944 non-null   int64  
 15  AvgSpeed          2944 non-null   float64
 16  Sunrise_datetime  2944 non-null   object 


In [None]:
weather_cleaned.head(3)

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,...,CodeSum_GR,CodeSum_HZ,CodeSum_MIFG,CodeSum_RA,CodeSum_SN,CodeSum_SQ,CodeSum_TS,CodeSum_TSRA,CodeSum_VCFG,CodeSum_VCTS
0,1,2007-05-01 00:00:00,83,50,67,14,51,56,0,2,...,0,0,0,0,0,0,0,0,0,0
1,1,2007-05-02 00:00:00,59,42,51,-3,42,47,14,0,...,0,0,0,0,0,0,0,0,0,0
2,1,2007-05-03 00:00:00,66,46,56,2,40,48,9,0,...,0,0,0,0,0,0,0,0,0,0


## Spray Investigation
---

**Summary of findings** <br>
In the Spray Dataset, spraying was only conducted in two years (2011 & 2013) across 10 days (twice in 2011 and the rest in 2013). The dataset has 500+ missing values in the time column, which were initially filled with the modal spray time; the column was subsequently dropped as time does not help with the analysis. The dataset also contains 500+ duplicated records with the same date, time, latitude and longitude, which could indicate a data input error, so the duplicates were dropped.<br>
One focus was 2013, when there was a surge in WNV cases and multiple sprays were conducted in an effort to reduce the number of mosquitoes.

To aid in analyzing the effectiveness of the spray, three columns were created (7, 14 and 21 days after spray) for comparison against the number of mosquitoes captured in the traps. If the number of mosquitoes fell after a spraying, it would suggest that the spray is effective against adult mosquitoes. This will be further analysed during the EDA process.

Note: For coding of spray, please refer to "Spray_investigation" notebook.
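
For orientation, the three date-offset columns can be produced with a short loop. This is a sketch rather than the code from the referenced notebook; the single sample row uses a spray date visible in the output below, and the column names are copied as they appear in the cleaned spray file.

```python
import pandas as pd

# One sample spray record (date taken from the cleaned spray output)
spray = pd.DataFrame({"Date": ["2011-08-29"]})
spray["Date"] = pd.to_datetime(spray["Date"])

# Add one column per look-ahead horizon: 7, 14 and 21 days after the spray
for days in (7, 14, 21):
    spray[f"{days}Daysfterspray"] = spray["Date"] + pd.Timedelta(days=days)

print(spray)
```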

In [None]:
spray_cleaned = pd.read_csv('../assets/spray_cleaned.csv')
spray_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14294 entries, 0 to 14293
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Date             14294 non-null  object 
 1   Time             14294 non-null  object 
 2   Latitude         14294 non-null  float64
 3   Longitude        14294 non-null  float64
 4   7Daysfterspray   14294 non-null  object 
 5   14Daysfterspray  14294 non-null  object 
 6   21Daysfterspray  14294 non-null  object 
dtypes: float64(2), object(5)
memory usage: 781.8+ KB


In [None]:
# Dataframe
spray_cleaned.head()

Unnamed: 0,Date,Time,Latitude,Longitude,7Daysfterspray,14Daysfterspray,21Daysfterspray
0,2011-08-29,6:56:58 PM,42.391623,-88.089163,2011-09-05,2011-09-12,2011-09-19
1,2011-08-29,6:57:08 PM,42.391348,-88.089163,2011-09-05,2011-09-12,2011-09-19
2,2011-08-29,6:57:18 PM,42.391022,-88.089157,2011-09-05,2011-09-12,2011-09-19
3,2011-08-29,6:57:28 PM,42.390637,-88.089158,2011-09-05,2011-09-12,2011-09-19
4,2011-08-29,6:57:38 PM,42.39041,-88.088858,2011-09-05,2011-09-12,2011-09-19
