# Project 4 - Predicting West Nile Virus (Kaggle Challenge)

## Background

Every year from late May to early October, public health workers in Chicago set up mosquito traps scattered across the city. Every week from Monday through Wednesday, these traps collect mosquitos, and the mosquitos are tested for the presence of West Nile virus before the end of the week. The test results include the number of mosquitos, the mosquito species, and whether or not West Nile virus is present in the cohort.

Main dataset

These test results are organized so that when the number of mosquitos exceeds 50, the record is split into additional records (additional rows in the dataset), such that the number of mosquitos in any one row is capped at 50.
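
To recover the total catch for a trap, date, and species, the split rows can be summed back together. The following is a minimal sketch on made-up rows (the values are illustrative and not taken from the dataset), assuming the column names listed in the data dictionary further down:

```python
import pandas as pd

# Illustrative rows only: trap T002 caught 120 mosquitos of one species,
# recorded as three rows capped at 50 each; trap T007 has a single row
records = pd.DataFrame({
    "Date": ["2007-08-01"] * 4,
    "Trap": ["T002", "T002", "T002", "T007"],
    "Species": ["CULEX PIPIENS"] * 4,
    "NumMosquitos": [50, 50, 20, 7],
    "WnvPresent": [0, 1, 0, 0],
})

# Sum the counts and mark the group positive if any of its split rows was positive
totals = records.groupby(["Date", "Trap", "Species"], as_index=False).agg(
    NumMosquitos=("NumMosquitos", "sum"),
    WnvPresent=("WnvPresent", "max"),
)
print(totals)
```

Taking the maximum of `WnvPresent` is one possible choice; whether a split record should count as one positive observation or several is a modelling decision.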

The locations of the traps are described by the block number and street name. For your convenience, we have mapped these attributes into Longitude and Latitude in the dataset. Please note that these are derived locations. For example, Block=79 and Street="W FOSTER AVE" give us an approximate address of "7900 W FOSTER AVE, Chicago, IL", which translates to (41.974089,-87.824812) on the map.
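
The address derivation in the example above can be sketched as a small helper. The function name `approximate_address` is hypothetical, and the rule (block number times 100, followed by the street name) is inferred from the single example given here:

```python
def approximate_address(block: int, street: str, city: str = "Chicago, IL") -> str:
    """Build the approximate street address of a trap from Block and Street."""
    # The block number stands for hundreds, so block 79 becomes house number 7900
    return f"{block * 100} {street.strip()}, {city}"


print(approximate_address(79, "W FOSTER AVE"))  # 7900 W FOSTER AVE, Chicago, IL
```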

Some traps are "satellite traps". These are traps that are set up near (usually within 6 blocks) an established trap to enhance surveillance efforts. Satellite traps are postfixed with letters. For example, T220A is a satellite trap to T220. 
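
Because satellite traps differ from their established trap only by the trailing letters, the parent trap ID can be recovered by stripping that suffix. A minimal sketch, with hypothetical helper names `parent_trap` and `is_satellite`, assuming IDs follow the `T220` / `T220A` pattern described above:

```python
import re


def parent_trap(trap_id: str) -> str:
    """Return the established trap ID by dropping any trailing letter suffix."""
    return re.sub(r"[A-Z]+$", "", trap_id)


def is_satellite(trap_id: str) -> bool:
    """A trap is treated as a satellite if stripping the suffix changes its ID."""
    return parent_trap(trap_id) != trap_id


print(parent_trap("T220A"), is_satellite("T220A"))  # T220 True
print(parent_trap("T220"), is_satellite("T220"))    # T220 False
```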

Please note that not all the locations are tested at all times. Also, records exist only when a particular species of mosquitos is found at a certain trap at a certain time. In the test set, we ask you for all combinations/permutations of possible predictions and are only scoring the observed ones.

Spray Data

The City of Chicago also does spraying to kill mosquitos. You are given the GIS data for their spray efforts in 2011 and 2013. Spraying can reduce the number of mosquitos in the area, and therefore might eliminate the appearance of West Nile virus. 

## Problem statement

In view of the recent West Nile virus epidemic in Chicago (the Windy City) and its effect on the city's population, we aim to build a classification model that predicts the probability of West Nile virus occurrence at various locations of interest. These predictions could be used to guide the deployment of pesticides in the interest of public health and safety.

The model will be built using mosquito population data collected by the surveillance and control system set up by the Department of Public Health.

In addition, a cost-benefit analysis will be conducted on the use of pesticides as a response to the epidemic.


## Executive Summary

## Data Dictionary of downloaded files

Spray data: GIS data of spraying efforts in 2011 and 2013

|Feature|Python data Type|Description|
|---|---|---|
|**Date**|*String*|Date of spray|
|**Time**|*String*|Time of spray|
|**Latitude**|*float*|Latitude of spray location|
|**Longitude**|*float*|Longitude of spray location|



Weather data: Weather data from 2007 to 2014

|Feature|Python data Type|Description|
|---|---|---|
|**Station**|*Integer*|Station ID|
|**Date**|*String*|Date of the weather data|
|**Tmax**|*Integer*|Max temperature in Fahrenheit|
|**Tmin**|*Integer*|Min temperature in Fahrenheit|
|**Tavg**|*Integer*|Average temperature in Fahrenheit|
|**Depart**|*Integer*|Temperature departure from normal|
|**DewPoint**|*Integer*|Average Dew Point in Fahrenheit|
|**WetBulb**|*Integer*|Average Wet Bulb temperature in Fahrenheit|
|**Heat**|*Integer*|Heating degree days: degrees by which the average temperature (Tavg) falls below base 65 deg Fahrenheit, 0 when Tavg >= 65 (season begins with July)|
|**Cool**|*Integer*|Cooling degree days: degrees by which the average temperature (Tavg) exceeds base 65 deg Fahrenheit, 0 when Tavg <= 65|
|**Sunrise**|*String*|Calculated Sunrise timing in 24H format|
|**Sunset**|*String*|Calculated Sunset timing in 24H format|
|**CodeSum**|*String*|Weather Type represented in codes|
|**Depth**|*Integer*|Snow Depth in inches|
|**Water1**|*Integer*|Amount of water equivalent from melted Snow|
|**SnowFall**|*Float*|SnowFall in precipitation|
|**PrecipTotal**|*Float*|Water precipitation|
|**StnPressure**|*Float*|Average Station Pressure|
|**SeaLevel**|*Float*|Average Sea Level Pressure|
|**ResultSpeed**|*Float*|Resultant Wind Speed|
|**ResultDir**|*Integer*|Resultant Wind Direction in Degrees|
|**AvgSpeed**|*Float*|Average Wind Speed|
              
Training data: The training set consists of data from 2007, 2009, 2011, and 2013.

|Feature|Python data Type|Description|
|---|---|---|
|**Date**|*String*|Date on which the WNV test was performed|
|**Address**|*String*|Approximate address of the location of trap. |
|**Species**|*String*|Named species of mosquitoes|
|**Block**|*Integer*|Block number of address for the location of the trap|
|**Street**|*String*|Street name|
|**Trap**|*String*|Trap ID|
|**AddressNumberAndStreet**|*String*|Approximate address returned from GeoCoder|
|**Latitude**|*Float*|Latitude returned from Geocoder|
|**Longitude**|*Float*|Longitude returned from Geocoder|
|**AddressAccuracy**|*Integer*|Accuracy returned from GeoCoder|
|**NumMosquitos**|*Integer*|Number of mosquitoes caught in this trap|
|**WnvPresent**|*Integer*|Whether West Nile Virus was present in these mosquitos. 1 means WNV is present, and 0 means not present.|

Testing data: Dataset to predict the test results for 2008, 2010, 2012, and 2014.

|Feature|Python data Type|Description|
|---|---|---|
|**Id**|*Integer*|ID of the record|
|**Date**|*String*|Date on which the WNV test was performed|
|**Address**|*String*|Approximate address of the location of trap. |
|**Species**|*String*|Species of mosquitoes|
|**Block**|*Integer*|Block number of address for the location of the trap|
|**Street**|*String*|Street name|
|**Trap**|*String*|Trap ID|
|**AddressNumberAndStreet**|*String*|Approximate address returned from GeoCoder|
|**Latitude**|*Float*|Latitude returned from Geocoder|
|**Longitude**|*Float*|Longitude returned from Geocoder|
|**AddressAccuracy**|*Integer*|Accuracy returned from GeoCoder|

# Import libraries

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import csv files
Weather data is handled by Rifqi

In [2]:
spray_df = pd.read_csv("../../data/spray.csv")
train_df = pd.read_csv("../../data/train.csv")
test_df = pd.read_csv("../../data/test.csv")

# Data cleaning

## Spray data

In [3]:
spray_df.head()

| |Date|Time|Latitude|Longitude|
|---|---|---|---|---|
|0|2011-08-29|6:56:58 PM|42.391623|-88.089163|
|1|2011-08-29|6:57:08 PM|42.391348|-88.089163|
|2|2011-08-29|6:57:18 PM|42.391022|-88.089157|
|3|2011-08-29|6:57:28 PM|42.390637|-88.089158|
|4|2011-08-29|6:57:38 PM|42.39041|-88.088858|


In [4]:
spray_df.shape

(14835, 4)

In [5]:
spray_df.dtypes

Date          object
Time          object
Latitude     float64
Longitude    float64
dtype: object

In [6]:
#Check which column has null
spray_df.isnull().sum()

Date           0
Time         584
Latitude       0
Longitude      0
dtype: int64

In [7]:
#Drop duplicates
spray_df.drop_duplicates(inplace = True)

In [8]:
spray_df[spray_df['Time'].isnull()]["Date"].value_counts()

2011-09-07    584
Name: Date, dtype: int64

In [9]:
# Drop time since only one date is affected
spray_df.drop(["Time"], axis = 1, inplace = True)

In [10]:
# Convert date to datetime object
spray_df["Date"] = pd.to_datetime(spray_df["Date"], format = "%Y-%m-%d")
spray_df.dtypes

Date         datetime64[ns]
Latitude            float64
Longitude           float64
dtype: object

In [11]:
# Convert column names to lowercase 
spray_df.columns = [col.lower() for col in spray_df.columns]
spray_df.head()

| |date|latitude|longitude|
|---|---|---|---|
|0|2011-08-29|42.391623|-88.089163|
|1|2011-08-29|42.391348|-88.089163|
|2|2011-08-29|42.391022|-88.089157|
|3|2011-08-29|42.390637|-88.089158|
|4|2011-08-29|42.39041|-88.088858|


In [49]:
spray_df.to_csv("../../data/spray_clean.csv", index = False)

## Read weather data
Adapted from Rifqi's code

In [5]:
weather = pd.read_csv('../../data/weather.csv')

In [6]:
weather

| |Station|Date|Tmax|Tmin|Tavg|Depart|DewPoint|WetBulb|Heat|Cool|...|CodeSum|Depth|Water1|SnowFall|PrecipTotal|StnPressure|SeaLevel|ResultSpeed|ResultDir|AvgSpeed|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|1|2007-05-01|83|50|67|14|51|56|0|2|...| |0|M|0.0|0.00|29.10|29.82|1.7|27|9.2|
|1|2|2007-05-01|84|52|68|M|51|57|0|3|...| |M|M|M|0.00|29.18|29.82|2.7|25|9.6|
|2|1|2007-05-02|59|42|51|-3|42|47|14|0|...|BR|0|M|0.0|0.00|29.38|30.09|13.0|4|13.4|
|3|2|2007-05-02|60|43|52|M|42|47|13|0|...|BR HZ|M|M|M|0.00|29.44|30.08|13.3|2|13.4|
|4|1|2007-05-03|66|46|56|2|40|48|9|0|...| |0|M|0.0|0.00|29.39|30.12|11.7|7|11.9|
|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|
|2939|2|2014-10-29|49|40|45|M|34|42|20|0|...| |M|M|M|0.00|29.42|30.07|8.5|29|9.0|
|2940|1|2014-10-30|51|32|42|-4|34|40|23|0|...| |0|M|0.0|0.00|29.34|30.09|5.1|24|5.5|
|2941|2|2014-10-30|53|37|45|M|35|42|20|0|...|RA|M|M|M|T|29.41|30.10|5.9|23|6.5|
|2942|1|2014-10-31|47|33|40|-6|25|33|25|0|...|RA SN|0|M|0.1|0.03|29.49|30.20|22.6|34|22.9|


In [None]:
# Check for missing values
weather.isin(['M', '-', '  T']).sum().sort_values(ascending=False)

Weather has two observations per day - one per weather station. Explore missing values to see if they are only missing from one of the stations.
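
One way to see at a glance whether missing values are confined to one station is to count the missing markers per station and column. A minimal sketch on made-up rows (illustrative values only), using the same `'M'`, `'-'` and `'  T'` markers as the check above:

```python
import pandas as pd

# Illustrative rows only: two stations, 'M' marks a missing reading
sample = pd.DataFrame({
    "Station": [1, 2, 1, 2],
    "Depart": ["14", "M", "-3", "M"],
    "Depth": ["0", "M", "0", "M"],
    "Tavg": ["67", "68", "51", "M"],
})

# Flag the markers, then count them per station for every column
markers = ["M", "-", "  T"]
is_marked = sample.drop(columns="Station").isin(markers)
missing_by_station = is_marked.groupby(sample["Station"]).sum()
print(missing_by_station)
```

Columns whose counts are zero for station 1 but non-zero for station 2 are candidates for imputation from the other station's same-day reading.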

In [None]:
weather.groupby('Station')['Water1'].value_counts()

In [None]:
weather.groupby('Station')['Depth'].value_counts()

In [None]:
weather.groupby('Station')['SnowFall'].value_counts()

Drop `Water1` as all values are missing. Drop `Depth` and `SnowFall` as well: even the recorded values are mostly 0, so they do not provide meaningful information.

In [None]:
weather.drop(columns=['Water1', 'Depth', 'SnowFall'], inplace=True)

In [None]:
# Impute missing sunrise & sunset values with observations from other station
def impute_sun(row):
    if row['Sunrise'] == '-':
        row['Sunrise'] = weather.loc[
            (weather['Date'] == row['Date']) & 
            (weather['Station'] == 1), 
            'Sunrise'
        ].values[0]
        row['Sunset'] = weather.loc[
            (weather['Date'] == row['Date']) & 
            (weather['Station'] == 1), 
            'Sunset'
        ].values[0]
    return row

In [None]:
weather = weather.apply(impute_sun, axis=1)

In [None]:
weather.loc[weather['Tavg'].isin(['M'])]

In [None]:
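# Impute Tavg with mean of Tmax & Tmin
def impute_tavg(row):
    if row['Tavg'] == 'M':
        row['Tavg'] = round((row['Tmax'] + row['Tmin']) / 2)
    return row

# Apply the imputation; later cells convert Tavg with int() and need no 'M' left
weather = weather.apply(impute_tavg, axis=1)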

In [None]:
# Impute station 2's Depart from station 1's departure from its 30 year normal,
# shifted by the difference in average temperature between the two stations
def impute_depart(row):
    if row['Station'] == 2:
        # Same-day observation from station 1
        stn1 = weather.loc[
            (weather['Date'] == row['Date']) & 
            (weather['Station'] == 1)
        ].iloc[0]
        # Difference between avg temp of two stations
        diff = int(row['Tavg']) - int(stn1['Tavg'])
        # Impute with station 1's readings plus difference
        row['Depart'] = int(stn1['Depart']) + diff
    return row

In [None]:
weather = weather.apply(impute_depart, axis=1)

In [None]:
weather.loc[weather['Heat'].isin(['M'])]

In [None]:
# Impute Heat & Cool with departure from base 65 degree temp
def impute_heat_cool(row):
    if row['Heat'] == 'M' or row['Cool'] == 'M':
        diff = 65 - int(row['Tavg'])
        if diff < 0: 
            # Warmer than base: Cool is the positive excess over 65
            row['Heat'] = 0
            row['Cool'] = -diff
        elif diff > 0:
            # Colder than base: Heat is the shortfall below 65
            row['Heat'] = diff
            row['Cool'] = 0
        else:
            row['Heat'] = row['Cool'] = 0
    return row

In [None]:
weather = weather.apply(impute_heat_cool, axis=1)
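
As a quick check of the base 65 arithmetic, the two quantities can be computed for the first rows of the weather table shown earlier. This is a standalone sketch; `degree_days` is a hypothetical helper and not part of the cleaning pipeline:

```python
def degree_days(tavg: float, base: float = 65) -> tuple:
    """Return (heat, cool) relative to the base temperature."""
    heat = max(base - tavg, 0)  # degrees below the base, never negative
    cool = max(tavg - base, 0)  # degrees above the base, never negative
    return heat, cool


print(degree_days(67))  # (0, 2), as in the 2007-05-01 station 1 row
print(degree_days(51))  # (14, 0), as in the 2007-05-02 station 1 row
```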

In [None]:
weather.loc[weather['StnPressure'].isin(['M'])]

In [None]:
# Impute StnPressure by interpolating from previous & next day values
for index, row in weather.loc[weather['StnPressure'].isin(['M'])].iterrows():
    inter = (float(weather.iloc[(index - 2)]['StnPressure']) + \
             float(weather.iloc[(index + 2)]['StnPressure'])) / 2
    weather.at[index, 'StnPressure'] = round(inter, 2)

In [None]:
weather.loc[weather['SeaLevel'].isin(['M'])]

In [None]:
# Impute SeaLevel by interpolating from previous & next day values
for index, row in weather.loc[weather['SeaLevel'].isin(['M'])].iterrows():
    inter = (float(weather.iloc[(index - 2)]['SeaLevel']) + \
             float(weather.iloc[(index + 2)]['SeaLevel'])) / 2
    weather.at[index, 'SeaLevel'] = round(inter, 2)
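
The two loops above rely on rows alternating between stations, so that `index - 2` and `index + 2` are the same station's previous and next day. An alternative sketch that does not depend on row order converts the markers to `NaN` and interpolates within each station. The rows below are made up for illustration, and this is a swapped-in technique rather than the approach used in this notebook:

```python
import pandas as pd

# Illustrative rows only: stations alternate, 'M' marks a missing reading
sample = pd.DataFrame({
    "Station": [1, 2, 1, 2, 1, 2],
    "StnPressure": ["29.10", "29.18", "M", "29.44", "29.40", "29.46"],
})

# Turn 'M' into NaN, then interpolate linearly within each station
pressure = pd.to_numeric(sample["StnPressure"], errors="coerce")
sample["StnPressure"] = (
    pressure.groupby(sample["Station"])
    .transform(lambda s: s.interpolate(limit_direction="both"))
    .round(2)
)
print(sample)
```

With evenly spaced daily rows, linear interpolation across a single gap gives the same midpoint as averaging the previous and next day.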

In [None]:
# Change trace values for PrecipTotal to 0.01
weather['PrecipTotal'] = weather['PrecipTotal'].map(lambda x: 0.01 if x == '  T' else x)

In [None]:
# Impute remaining missing values with observations from other station
def impute_remain(row):
    if row['WetBulb'] == 'M':
        if row['Station'] == 1:
            row['WetBulb'] = weather.loc[
                (weather['Date'] == row['Date']) & 
                (weather['Station'] == 2), 
                'WetBulb'
            ].values[0]
        else:
            row['WetBulb'] = weather.loc[
                (weather['Date'] == row['Date']) & 
                (weather['Station'] == 1), 
                'WetBulb'
            ].values[0]
        
    if row['AvgSpeed'] == 'M':
        row['AvgSpeed'] = weather.loc[
            (weather['Date'] == row['Date']) & 
            (weather['Station'] == 1), 
            'AvgSpeed'
        ].values[0]
        
    if row['PrecipTotal'] == 'M':
        row['PrecipTotal'] = weather.loc[
            (weather['Date'] == row['Date']) & 
            (weather['Station'] == 1), 
            'PrecipTotal'
        ].values[0]
    
    return row

In [None]:
weather = weather.apply(impute_remain, axis=1)

`CodeSum` column has some inconsistencies - some of the codes are joined together. Will need to separate them and remove duplicate codes within the same observation.

In [None]:
weather['CodeSum'].unique()

In [None]:
# Create set of codes for each observation
weather['CodeSum'] = weather['CodeSum'].map(lambda x: set(x.split()))

In [None]:
# Create function to split conjoined codes into two separate codes
def code_split(row):
    new_set = set()
    for code in row:
        if len(code) > 3:
            new_set.add(code[:2])
            new_set.add(code[2:])
        else:
            new_set.add(code)
    return new_set

In [None]:
weather['CodeSum'] = weather['CodeSum'].map(code_split)
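
To illustrate what the two steps above do to a single observation, here is a self-contained sketch. `split_codes` is a stand-in that mirrors `code_split` but starts from the raw string, and the input string is made up for illustration:

```python
def split_codes(raw: str) -> set:
    """Turn a raw CodeSum string into a set of individual weather codes."""
    codes = set()
    for code in raw.split():
        if len(code) > 3:
            # Conjoined pair such as 'TSRA': split into 'TS' and 'RA'
            codes.add(code[:2])
            codes.add(code[2:])
        else:
            codes.add(code)
    return codes


# 'RA' appears twice after splitting, but the set keeps it once
print(sorted(split_codes("TSRA RA BR")))  # ['BR', 'RA', 'TS']
```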

Need to fix data types of some columns

In [None]:
# Convert object columns to numeric values
fix_cols = weather.columns[weather.dtypes.eq('object')].drop(['Date', 'CodeSum'])
weather[fix_cols] = weather[fix_cols].apply(pd.to_numeric)

In [None]:
# Convert Date column to datetime format
weather['Date'] = pd.to_datetime(weather['Date'], format="%Y-%m-%d")

In [None]:
weather.dtypes # All good

In [None]:
# Replace values in station 1 observations with combined values
for index, row in weather.iterrows():
    if index % 2 == 0:
        # Take union of sets for CodeSum
        codes = weather.iloc[index]['CodeSum'].union(weather.iloc[index + 1]['CodeSum'])
        weather.at[index, 'CodeSum'] = codes
        
        # Take average of numerical features of both stations
        for col in weather.columns.drop(['Station', 'Date', 'CodeSum']):
            avg = round((weather.iloc[index][col] + weather.iloc[index + 1][col]) / 2, 2)
            weather.at[index, col] = avg

# Drop observations for station 2
weather.drop(weather.loc[weather['Station'] == 2].index, inplace=True)

# Drop station column & reset index
weather.drop(columns='Station', inplace=True)
weather.reset_index(drop=True, inplace=True)
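
The row-by-row loop above assumes that even index positions are station 1 and that each is followed by its station 2 counterpart. A grouped alternative that relies only on the `Date` column is sketched below, on made-up rows with a single numeric column; it illustrates the idea and is not the code used to produce the cleaned file:

```python
import pandas as pd

# Illustrative rows only: two stations per date, already converted to numeric
sample = pd.DataFrame({
    "Station": [1, 2, 1, 2],
    "Date": pd.to_datetime(["2007-05-01", "2007-05-01", "2007-05-02", "2007-05-02"]),
    "Tavg": [67, 68, 51, 52],
    "CodeSum": [set(), {"BR"}, {"BR"}, {"BR", "HZ"}],
})

# Average the numeric readings of both stations for each date
combined = sample.groupby("Date", as_index=False)["Tavg"].mean()

# Union of the weather codes reported by either station on each date
codes = {
    date: set().union(*group)
    for date, group in sample.groupby("Date")["CodeSum"]
}
combined["CodeSum"] = [codes[date] for date in combined["Date"]]
print(combined)
```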

In [None]:
# Lowercase columns
weather.columns = [col.lower() for col in weather.columns]

In [None]:
# Export cleaned weather dataset
weather.to_csv('../../data/weather_clean.csv', index=False)

## Read training data
Question to explore: how many mosquito species are recorded per trap per day?

In [35]:
train_df = pd.read_csv("../../data/train.csv")

In [36]:
train_df.head()

| |Date|Address|Species|Block|Street|Trap|AddressNumberAndStreet|Latitude|Longitude|AddressAccuracy|NumMosquitos|WnvPresent|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|2007-05-29|4100 North Oak Park Avenue, Chicago, IL 60634,...|CULEX PIPIENS/RESTUANS|41|N OAK PARK AVE|T002|4100 N OAK PARK AVE, Chicago, IL|41.95469|-87.800991|9|1|0|
|1|2007-05-29|4100 North Oak Park Avenue, Chicago, IL 60634,...|CULEX RESTUANS|41|N OAK PARK AVE|T002|4100 N OAK PARK AVE, Chicago, IL|41.95469|-87.800991|9|1|0|
|2|2007-05-29|6200 North Mandell Avenue, Chicago, IL 60646, USA|CULEX RESTUANS|62|N MANDELL AVE|T007|6200 N MANDELL AVE, Chicago, IL|41.994991|-87.769279|9|1|0|
|3|2007-05-29|7900 West Foster Avenue, Chicago, IL 60656, USA|CULEX PIPIENS/RESTUANS|79|W FOSTER AVE|T015|7900 W FOSTER AVE, Chicago, IL|41.974089|-87.824812|8|1|0|
|4|2007-05-29|7900 West Foster Avenue, Chicago, IL 60656, USA|CULEX RESTUANS|79|W FOSTER AVE|T015|7900 W FOSTER AVE, Chicago, IL|41.974089|-87.824812|8|4|0|


In [37]:
# Check size of train data and names of columns
print(train_df.shape)
train_df.columns

(10506, 12)


Index(['Date', 'Address', 'Species', 'Block', 'Street', 'Trap',
       'AddressNumberAndStreet', 'Latitude', 'Longitude', 'AddressAccuracy',
       'NumMosquitos', 'WnvPresent'],
      dtype='object')

In [38]:
train_df.dtypes

Date                       object
Address                    object
Species                    object
Block                       int64
Street                     object
Trap                       object
AddressNumberAndStreet     object
Latitude                  float64
Longitude                 float64
AddressAccuracy             int64
NumMosquitos                int64
WnvPresent                  int64
dtype: object

In [39]:
# Check for nulls
train_df.isnull().sum()

Date                      0
Address                   0
Species                   0
Block                     0
Street                    0
Trap                      0
AddressNumberAndStreet    0
Latitude                  0
Longitude                 0
AddressAccuracy           0
NumMosquitos              0
WnvPresent                0
dtype: int64

In [40]:
# Convert date to datetime object
train_df["Date"] = pd.to_datetime(train_df["Date"], format = "%Y-%m-%d")
train_df.dtypes

Date                      datetime64[ns]
Address                           object
Species                           object
Block                              int64
Street                            object
Trap                              object
AddressNumberAndStreet            object
Latitude                         float64
Longitude                        float64
AddressAccuracy                    int64
NumMosquitos                       int64
WnvPresent                         int64
dtype: object

In [41]:
# Convert column names to lowercase 
train_df.columns = [col.lower() for col in train_df.columns]
train_df.head()

| |date|address|species|block|street|trap|addressnumberandstreet|latitude|longitude|addressaccuracy|nummosquitos|wnvpresent|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|2007-05-29|4100 North Oak Park Avenue, Chicago, IL 60634,...|CULEX PIPIENS/RESTUANS|41|N OAK PARK AVE|T002|4100 N OAK PARK AVE, Chicago, IL|41.95469|-87.800991|9|1|0|
|1|2007-05-29|4100 North Oak Park Avenue, Chicago, IL 60634,...|CULEX RESTUANS|41|N OAK PARK AVE|T002|4100 N OAK PARK AVE, Chicago, IL|41.95469|-87.800991|9|1|0|
|2|2007-05-29|6200 North Mandell Avenue, Chicago, IL 60646, USA|CULEX RESTUANS|62|N MANDELL AVE|T007|6200 N MANDELL AVE, Chicago, IL|41.994991|-87.769279|9|1|0|
|3|2007-05-29|7900 West Foster Avenue, Chicago, IL 60656, USA|CULEX PIPIENS/RESTUANS|79|W FOSTER AVE|T015|7900 W FOSTER AVE, Chicago, IL|41.974089|-87.824812|8|1|0|
|4|2007-05-29|7900 West Foster Avenue, Chicago, IL 60656, USA|CULEX RESTUANS|79|W FOSTER AVE|T015|7900 W FOSTER AVE, Chicago, IL|41.974089|-87.824812|8|4|0|
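
The question noted at the top of this section, how many species are recorded per trap per day, can be answered with a grouped distinct count. A minimal sketch that rebuilds the date, trap, and species values of the five rows shown above so that it runs on its own; `n_species` is a column name chosen here for illustration:

```python
import pandas as pd

# Date, trap and species of the five rows displayed above
sample = pd.DataFrame({
    "date": pd.to_datetime(["2007-05-29"] * 5),
    "trap": ["T002", "T002", "T007", "T015", "T015"],
    "species": [
        "CULEX PIPIENS/RESTUANS", "CULEX RESTUANS", "CULEX RESTUANS",
        "CULEX PIPIENS/RESTUANS", "CULEX RESTUANS",
    ],
})

# Count distinct species for every (date, trap) pair
species_per_trap_day = (
    sample.groupby(["date", "trap"])["species"]
    .nunique()
    .rename("n_species")
    .reset_index()
)
print(species_per_trap_day)
```

On the full training set the same call would be applied to `train_df` after the column names have been lowercased.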
