# Project 4 West-Nile Virus Prediction

Project Members Include:
- Bao Fionna
- Lim Yu Zheng
- Munish

# Problem Statement
Due to the recent epidemic of West Nile Virus in Chicago, the Disease And Treatment Agency of City of Chicago needs to know when and where different species of mosquitos will test positive for West Nile virus.

A model that predicts outbreaks of West Nile virus in mosquitos from existing data (weather, location, testing, and spraying data) will enable effective allocation of pesticide-spraying resources to prevent virus transmission.

# Business Background


[West Nile virus (WNV)](https://www.cdc.gov/westnile/index.html) cases occur during mosquito season, which starts in the summer and continues through fall. There are no vaccines to prevent or medications to treat WNV in people.
West Nile virus is most commonly spread to humans through [infected mosquitos](https://www.cdc.gov/westnile/transmission/index.html). Mosquitoes become infected when they feed on infected birds. Infected mosquitoes spread West Nile virus to people and other animals by biting them. West Nile virus has been detected in over 300 species of [dead birds](https://www.cdc.gov/westnile/dead-birds/index.html). Although some infected birds, especially crows and jays, frequently die of infection, most birds survive.

- 80% of the people infected with West Nile virus do not develop any [symptoms](https://www.cdc.gov/westnile/symptoms/index.html)
- Around 20% of people who become infected with the virus develop symptoms ranging from a persistent fever to serious neurological illnesses that can result in death.

The first human cases of West Nile virus were reported in Chicago in  [2002](https://experts.illinois.edu/en/publications/west-nile-virus-in-the-greater-chicago-area-a-geographic-examinat). By 2004, the City of Chicago and the Chicago Department of Public Health (CDPH) had established a comprehensive surveillance and control program that is still in effect today.

Every week from late spring through the fall, mosquitos in traps across the city are tested for the virus. The results of these tests influence when and where the city will spray airborne pesticides to control adult mosquito populations. This method is resource-intensive, and CDPH is looking for an alternative way to decide where to spray.

# Executive Summary

- We are engaged by the Chicago Department of Public Health to study the epidemic of West Nile Virus in Chicago using the three datasets collected by the department's surveillance and control systems.
    - The datasets are:
        - Weather, 
        - Train, 
        - Spray Data
    - As per initial analysis on the datasets provided : 
        - The Weather DataSet has 2944 Observations and 22 Features, 
        - The Train DataSet has 10506 Observations and 12 Features, 
        - The Spray DataSet has 14835 Observations and 4 Features.
    - We built 8 classification models and evaluated the models. 
        - The top model was used to identify whether a location at a certain point in time has the West Nile Virus
    - We used the test dataset to evaluate the top model's predictive capability.
    
- EDA 
    -  Data Cleaning : We cleaned the data in the datasets 
         - For Weather dataset, we did the cleaning by filling Null Values 
             - either with another observation (data collected from the other station)
             - or using mean/median of the entire dataset
         - For the Train dataset, we dropped redundant address columns (Street, Block and AddressNumberAndStreet)
         - For the Spray dataset, we filled null values with the respective time.
             
    - Exploration on cleaned data
        - We plotted time series lineplot 
            - To look at time trend regarding the weather and 
            - How it affects the observations of West Nile Virus. 
        - We plotted countplot 
            - To look at the relationship between each of the features with respect to the West Nile Virus.
        - We plotted a heatmap to determine 
            - The hotspots where mosquitoes are mostly found and 
            - Where West Nile Virus is found as well.

- Feature Engineering
    - We added new features based on EDA to improve the model.
    - We observed that the data was imbalanced for West Nile Virus
         - about 94.76% (0)
         - and 5.24% (1)
    - Such a heavily imbalanced dataset is not ideal for classification problems (baseline accuracy is 94.76%), so we balanced the dataset using SMOTE.
    - We used PCA for Feature Engineering
- Model and Evaluation : 
    - We built a few models:
        - Decision Tree, Extra Trees, Random Forest
        - Logistic Regression
        - Ada Boost, Gradient Boost
        - SVM
    - We recommend 3 models to address the business problem in the short, medium and long term
        - Logistic Regression model (with a score of 71.32% on test data) for the short term
        - AdaBoost model (with a score of 70.21% on test data), to be improved and used for the medium term
        - Gradient Boost model (with a score of 70.51% on test data), to be improved and used for the long term

- Next Actions 
     - For subsequent iterations to improve this model, 
         - We can explore improving Feature Reduction (Drop more features that may not have as much significance)
             - Or 
         - We can collect larger and more balanced dataset 
         - And tune our model to improve its classification accuracy based on either or both of the above
      - After improving the model accuracy, 
          - We can expand the usage to other States and 
          - Explore a subreddit-based auto-warning system to crowdsource data
              - Data from the subreddit would warn users of possible West Nile Virus occurrences
              - Agencies would be able to tackle the problem early.

The following Data Science Process was carried out.

- Problem Statement
- Data Collection
- Data Cleaning & EDA
- Preprocessing & Modeling
- Evaluation and Model Recommendation
- Cost benefit analysis and Conclusion

### Contents:
- [Import libraries](#Import-libraries)
- [Load Data](#Load-Data)
- [Clean Train and Test Dataset](#Clean-Train-and-Test-Dataset)
- [Clean Weather Dataset](#Clean-Weather-Dataset)
- [Clean Spray Dataset](#Clean-Spray-Dataset)

# Import libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
%matplotlib inline 
sns.set_style('whitegrid')

#settings to see all columns / rows
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


# Load Data

In [18]:
df_train = pd.read_csv('../assets/train.csv')
print(f'Loaded Training dataset which has {df_train.shape[0]} \
 observations({df_train[df_train.duplicated(list(df_train.columns))].shape[0]} duplicated) \
for {df_train.shape[1]} features')


df_test = pd.read_csv('../assets/test.csv')
print(f'Loaded Testing  dataset which has {df_test.shape[0]} \
observations({df_test[df_test.duplicated(list(df_test.columns))].shape[0]}   duplicated) \
for {df_test.shape[1]} features')

df_weather = pd.read_csv('../assets/weather.csv')
print(f'Loaded Weather  dataset which has {df_weather.shape[0]} \
  observations({df_weather[df_weather.duplicated(list(df_weather.columns))].shape[0]}   duplicated) \
for {df_weather.shape[1]} features')

df_spray = pd.read_csv('../assets/spray.csv')
print(f'Loaded Spray    dataset which has {df_spray.shape[0]} \
 observations({df_spray[df_spray.duplicated(list(df_spray.columns))].shape[0]} duplicated) \
for {df_spray.shape[1]} features')

# df_sample = pd.read_csv('../assets/sampleSubmission.csv')
# print(f'Sample Submission dataset of {df_sample.shape[0]} observations and {df_sample.shape[1]} features loaded')

# df_traps = pd.read_csv('../assets/train.csv')[['Date', 'Trap','Longitude', 'Latitude', 'WnvPresent']]
# print(f'Traps dataset of {df_traps.shape[0]} observations and {df_traps.shape[1]} features derived from training')

Loaded Training dataset which has 10506  observations(813 duplicated) for 12 features
Loaded Testing  dataset which has 116293 observations(0   duplicated) for 11 features
Loaded Weather  dataset which has 2944   observations(0   duplicated) for 22 features
Loaded Spray    dataset which has 14835  observations(541 duplicated) for 4 features
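
The four loading statements above repeat the same reporting pattern. As a minimal sketch, the report could be wrapped in a small helper. The name `describe_load` and the toy frame are hypothetical and are not part of the notebook; `DataFrame.duplicated()` with no arguments compares all columns, which is what the cell above does by passing the full column list.

```python
import pandas as pd


def describe_load(name: str, df: pd.DataFrame) -> str:
    """Return a one-line summary of row, duplicate and column counts."""
    n_rows, n_cols = df.shape
    # duplicated() marks every repeat of an earlier, fully identical row
    n_duplicated = int(df.duplicated().sum())
    return (
        f"Loaded {name} dataset which has {n_rows} observations "
        f"({n_duplicated} duplicated) for {n_cols} features"
    )


# Toy data for illustration only: the second row repeats the first
toy = pd.DataFrame({"Trap": ["T002", "T002", "T007"], "NumMosquitos": [1, 1, 4]})
print(describe_load("Toy", toy))
```

With the real files, the same helper could be called once per dataset, for example `describe_load("Training", df_train)`.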


# Clean Train and Test Dataset


## Train Dataset Checking and Cleaning


### Train dataset - data check


In [4]:
df_train.head(2)

Unnamed: 0,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy,NumMosquitos,WnvPresent
0,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS/RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0
1,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0


In [5]:
df_train.describe()

Unnamed: 0,Block,Latitude,Longitude,AddressAccuracy,NumMosquitos,WnvPresent
count,10506.0,10506.0,10506.0,10506.0,10506.0,10506.0
mean,35.687797,41.841139,-87.699908,7.819532,12.853512,0.052446
std,24.339468,0.112742,0.096514,1.452921,16.133816,0.222936
min,10.0,41.644612,-87.930995,3.0,1.0,0.0
25%,12.0,41.732984,-87.76007,8.0,2.0,0.0
50%,33.0,41.846283,-87.694991,8.0,5.0,0.0
75%,52.0,41.95469,-87.627796,9.0,17.0,0.0
max,98.0,42.01743,-87.531635,9.0,50.0,1.0
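
Because WnvPresent is coded 0/1, its mean in the summary above is the share of positive rows. The sketch below works backwards from the two figures printed by `describe()` (10506 rows, mean 0.052446) to the class counts and to the accuracy a model would get by always predicting "no virus". The variable names are ours; the inputs are copied from the output above.

```python
# Figures copied from the describe() output above
n_rows = 10506
positive_share = 0.052446  # mean of the 0/1 WnvPresent column

# Recover the class counts from the share
n_positive = round(n_rows * positive_share)
n_negative = n_rows - n_positive

# Accuracy of a trivial model that always predicts the majority class (0)
baseline_accuracy = n_negative / n_rows

print(f"{n_positive} positive rows, {n_negative} negative rows")
print(f"baseline accuracy: {baseline_accuracy:.2%}")
```

Any classifier trained on this data needs to be judged against that baseline rather than against 50%.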


In [6]:
# Check shape of Train dataset
df_train.shape

(10506, 12)

In [7]:
# Check for Null Values and datatypes of Train dataset
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10506 entries, 0 to 10505
Data columns (total 12 columns):
Date                      10506 non-null object
Address                   10506 non-null object
Species                   10506 non-null object
Block                     10506 non-null int64
Street                    10506 non-null object
Trap                      10506 non-null object
AddressNumberAndStreet    10506 non-null object
Latitude                  10506 non-null float64
Longitude                 10506 non-null float64
AddressAccuracy           10506 non-null int64
NumMosquitos              10506 non-null int64
WnvPresent                10506 non-null int64
dtypes: float64(2), int64(4), object(6)
memory usage: 985.1+ KB


<h2><center> Data Dictionary : Train Dataset </center></h2>

|Feature|Type|Number of Null|Description|
|---|---|---|---|
|Date|object|0|date that the WNV test is performed|
|Address|object|0|approximate address of the location of trap. This is used to send to the GeoCoder|
|Species|object|0|the species of mosquitos|
|Block|int|0|block number of address|
|Street|object|0|street name|
|Trap|object|0|Id of the trap|
|AddressNumberAndStreet|object|0|approximate address returned from GeoCoder|
|Latitude|float|0|Latitude returned from GeoCoder|
|Longitude|float|0|Longitude returned from GeoCoder|
|AddressAccuracy|int|0|accuracy returned from GeoCoder|
|NumMosquitos|int|0|number of mosquitoes caught in this trap|
|WnvPresent|int|0|whether West Nile Virus was present in these mosquitos. 1 means WNV is present, and 0 means not present|

Train DataSet has no null values so nothing needs to be done for this step.

There is a total of 10506 observations and 12 columns.

### Train dataset - Data Cleaning

#### Dropping Street, Block and AddressNumberAndStreet Columns

Address and AddressNumberAndStreet convey the same information: the location of the trap. AddressNumberAndStreet is in turn a combination of Block and Street, so Address, AddressNumberAndStreet, Block and Street are redundant with one another. We therefore drop Block, Street and AddressNumberAndStreet, and keep Address alongside Latitude and Longitude to identify the location of each mosquito trap.
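
The claim that AddressNumberAndStreet can be rebuilt from Block and Street can be checked before dropping anything. The sketch below uses only the two rows displayed by `df_train.head(2)` above. It assumes the house number is `Block * 100` and the suffix is `", Chicago, IL"`, both inferred from those displayed rows. The raw CSV may differ in whitespace, so both sides are whitespace-normalised first. `normalise_spaces` is a hypothetical helper name.

```python
import pandas as pd

# Values as displayed by df_train.head(2); not read from the real file
sample = pd.DataFrame({
    "Block": [41, 41],
    "Street": ["N OAK PARK AVE", "N OAK PARK AVE"],
    "AddressNumberAndStreet": [
        "4100 N OAK PARK AVE, Chicago, IL",
        "4100 N OAK PARK AVE, Chicago, IL",
    ],
})


def normalise_spaces(text: pd.Series) -> pd.Series:
    """Collapse runs of whitespace and trim both ends."""
    return text.str.split().str.join(" ")


# Assumption: house number = Block * 100, fixed city/state suffix
rebuilt = (sample["Block"] * 100).astype(str) + " " + sample["Street"] + ", Chicago, IL"
matches = normalise_spaces(rebuilt) == normalise_spaces(sample["AddressNumberAndStreet"])

print(matches.all())
```

On the full training set, any row where `matches` is False would be worth inspecting before the columns are dropped.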

In [19]:
# Drop Columns for Train Dataset
df_train.drop(columns = ['Street','Block','AddressNumberAndStreet'], inplace=True)
df_train.head()

Unnamed: 0,Date,Address,Species,Trap,Latitude,Longitude,AddressAccuracy,NumMosquitos,WnvPresent
0,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS/RESTUANS,T002,41.95469,-87.800991,9,1,0
1,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX RESTUANS,T002,41.95469,-87.800991,9,1,0
2,2007-05-29,"6200 North Mandell Avenue, Chicago, IL 60646, USA",CULEX RESTUANS,T007,41.994991,-87.769279,9,1,0
3,2007-05-29,"7900 West Foster Avenue, Chicago, IL 60656, USA",CULEX PIPIENS/RESTUANS,T015,41.974089,-87.824812,8,1,0
4,2007-05-29,"7900 West Foster Avenue, Chicago, IL 60656, USA",CULEX RESTUANS,T015,41.974089,-87.824812,8,4,0


#### Change Date column Datatype to Date-Time

In [20]:
# Change Train Datatype to date time
df_train['Date'] = pd.to_datetime(df_train['Date'])

In [21]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10506 entries, 0 to 10505
Data columns (total 9 columns):
Date               10506 non-null datetime64[ns]
Address            10506 non-null object
Species            10506 non-null object
Trap               10506 non-null object
Latitude           10506 non-null float64
Longitude          10506 non-null float64
AddressAccuracy    10506 non-null int64
NumMosquitos       10506 non-null int64
WnvPresent         10506 non-null int64
dtypes: datetime64[ns](1), float64(2), int64(3), object(3)
memory usage: 738.8+ KB


## Test Dataset Checking and Cleaning


### Test dataset - data check

In [22]:
df_test.head(2)

Unnamed: 0,Id,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy
0,1,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS/RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
1,2,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9


In [23]:
df_test.describe()

Unnamed: 0,Id,Block,Latitude,Longitude,AddressAccuracy
count,116293.0,116293.0,116293.0,116293.0,116293.0
mean,58147.0,41.1311,41.849389,-87.693658,7.954357
std,33571.041765,24.864726,0.106593,0.080699,1.252733
min,1.0,10.0,41.644612,-87.930995,3.0
25%,29074.0,18.0,41.753411,-87.750938,8.0
50%,58147.0,39.0,41.862292,-87.694991,8.0
75%,87220.0,61.0,41.951866,-87.64886,9.0
max,116293.0,98.0,42.01743,-87.531635,9.0


In [24]:
# Check Shape of Test dataset
df_test.shape

(116293, 11)

In [25]:
# Check for Null Values and datatypes of Test dataset
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116293 entries, 0 to 116292
Data columns (total 11 columns):
Id                        116293 non-null int64
Date                      116293 non-null object
Address                   116293 non-null object
Species                   116293 non-null object
Block                     116293 non-null int64
Street                    116293 non-null object
Trap                      116293 non-null object
AddressNumberAndStreet    116293 non-null object
Latitude                  116293 non-null float64
Longitude                 116293 non-null float64
AddressAccuracy           116293 non-null int64
dtypes: float64(2), int64(3), object(6)
memory usage: 9.8+ MB


<h2><center> Data Dictionary : Test Dataset </center></h2>

|Feature|Type|Number of Null|Description|
|---|---|---|---|
|Id|int|0|Id of each row|
|Date|object|0|date that the WNV test is performed|
|Address|object|0|approximate address of the location of trap. This is used to send to the GeoCoder|
|Species|object|0|the species of mosquitos|
|Block|int|0|block number of address|
|Street|object|0|street name|
|Trap|object|0|Id of the trap|
|AddressNumberAndStreet|object|0|approximate address returned from GeoCoder|
|Latitude|float|0|Latitude returned from GeoCoder|
|Longitude|float|0|Longitude returned from GeoCoder|
|AddressAccuracy|int|0|accuracy returned from GeoCoder|


Test DataSet has no null values so nothing needs to be done for this step.

There is a total of 116293 observations and 11 columns. The Test dataset does not contain the NumMosquitos and WnvPresent columns, but it has an additional Id column.
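
The column differences between the two datasets can be confirmed with a set comparison. This sketch uses the column names listed by `df_train.info()` and `df_test.info()` above; the variable names are ours.

```python
# Column names copied from the info() outputs above
train_columns = [
    "Date", "Address", "Species", "Block", "Street", "Trap",
    "AddressNumberAndStreet", "Latitude", "Longitude",
    "AddressAccuracy", "NumMosquitos", "WnvPresent",
]
test_columns = [
    "Id", "Date", "Address", "Species", "Block", "Street", "Trap",
    "AddressNumberAndStreet", "Latitude", "Longitude", "AddressAccuracy",
]

# Set differences show which columns exist on only one side
only_in_train = sorted(set(train_columns) - set(test_columns))
only_in_test = sorted(set(test_columns) - set(train_columns))

print("Only in train:", only_in_train)
print("Only in test:", only_in_test)
```

With the loaded frames, `df_train.columns` and `df_test.columns` can be passed to `set()` in the same way.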

### Test dataset - Data Cleaning

#### Dropping Street, Block and AddressNumberAndStreet Columns

As with the Train dataset, Address and AddressNumberAndStreet convey the same information: the location of the trap. AddressNumberAndStreet is in turn a combination of Block and Street, so Address, AddressNumberAndStreet, Block and Street are redundant with one another. We therefore drop Block, Street and AddressNumberAndStreet, and keep Address alongside Latitude and Longitude to identify the location of each mosquito trap.

In [26]:
# Drop Columns for Test Dataset
df_test.drop(columns = ['Street','Block','AddressNumberAndStreet'], inplace=True)
df_test.head()

Unnamed: 0,Id,Date,Address,Species,Trap,Latitude,Longitude,AddressAccuracy
0,1,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS/RESTUANS,T002,41.95469,-87.800991,9
1,2,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX RESTUANS,T002,41.95469,-87.800991,9
2,3,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS,T002,41.95469,-87.800991,9
3,4,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX SALINARIUS,T002,41.95469,-87.800991,9
4,5,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX TERRITANS,T002,41.95469,-87.800991,9


#### Change Date column Datatype to Date-Time

In [27]:
# Change Test Datatype to date time
df_test['Date'] = pd.to_datetime(df_test['Date'])

In [28]:
#verify that the datatypes are correct
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116293 entries, 0 to 116292
Data columns (total 8 columns):
Id                 116293 non-null int64
Date               116293 non-null datetime64[ns]
Address            116293 non-null object
Species            116293 non-null object
Trap               116293 non-null object
Latitude           116293 non-null float64
Longitude          116293 non-null float64
AddressAccuracy    116293 non-null int64
dtypes: datetime64[ns](1), float64(2), int64(2), object(3)
memory usage: 7.1+ MB


## Train/Test - Final Check and Save Data Frame

In [29]:
# Save cleaned Train and Test datasets for further analysis. Pickle files are used because the original source files are CSV.
df_train.to_pickle('../assets/train_clean.pkl')
df_test.to_pickle('../assets/test_clean.pkl')

# df_train.to_csv('../assets/clean/train_clean.csv', index = False)
# df_test.to_csv('../assets/clean/test_clean.csv', index = False)
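
One practical effect of saving to pickle rather than CSV is that the converted `Date` dtype survives the round trip. This is our observation, not a reason stated in the notebook. The sketch below writes a toy frame to a temporary directory in both formats and compares what comes back. The file names are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with an already converted datetime column
frame = pd.DataFrame({
    "Date": pd.to_datetime(["2007-05-29", "2007-06-05"]),
    "WnvPresent": [0, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    pickle_path = Path(tmp) / "demo.pkl"
    csv_path = Path(tmp) / "demo.csv"
    frame.to_pickle(pickle_path)
    frame.to_csv(csv_path, index=False)
    from_pickle = pd.read_pickle(pickle_path)
    from_csv = pd.read_csv(csv_path)  # no parse_dates, so Date comes back as text

print("pickle:", from_pickle["Date"].dtype)
print("csv:   ", from_csv["Date"].dtype)
```

If CSV output is preferred, passing `parse_dates=["Date"]` to `pd.read_csv` restores the dtype on load.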

# Clean Weather Dataset

## Weather dataset - Data Check

In [30]:
df_weather.describe()

Unnamed: 0,Station,Tmax,Tmin,DewPoint,ResultSpeed,ResultDir
count,2944.0,2944.0,2944.0,2944.0,2944.0,2944.0
mean,1.5,76.166101,57.810462,53.45788,6.960666,17.494905
std,0.500085,11.46197,10.381939,10.675181,3.587527,10.063609
min,1.0,41.0,29.0,22.0,0.1,1.0
25%,1.0,69.0,50.0,46.0,4.3,7.0
50%,1.5,78.0,59.0,54.0,6.4,19.0
75%,2.0,85.0,66.0,62.0,9.2,25.0
max,2.0,104.0,83.0,75.0,24.1,36.0


In [31]:
df_weather.head()

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,Sunrise,Sunset,CodeSum,Depth,Water1,SnowFall,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
0,1,2007-05-01,83,50,67,14,51,56,0,2,0448,1849,,0,M,0.0,0.0,29.1,29.82,1.7,27,9.2
1,2,2007-05-01,84,52,68,M,51,57,0,3,-,-,,M,M,M,0.0,29.18,29.82,2.7,25,9.6
2,1,2007-05-02,59,42,51,-3,42,47,14,0,0447,1850,BR,0,M,0.0,0.0,29.38,30.09,13.0,4,13.4
3,2,2007-05-02,60,43,52,M,42,47,13,0,-,-,BR HZ,M,M,M,0.0,29.44,30.08,13.3,2,13.4
4,1,2007-05-03,66,46,56,2,40,48,9,0,0446,1851,,0,M,0.0,0.0,29.39,30.12,11.7,7,11.9


In [32]:
df_weather.shape

(2944, 22)

In [33]:
#Check for obvious missing data (Null values)
df_weather.isnull().sum()

Station        0
Date           0
Tmax           0
Tmin           0
Tavg           0
Depart         0
DewPoint       0
WetBulb        0
Heat           0
Cool           0
Sunrise        0
Sunset         0
CodeSum        0
Depth          0
Water1         0
SnowFall       0
PrecipTotal    0
StnPressure    0
SeaLevel       0
ResultSpeed    0
ResultDir      0
AvgSpeed       0
dtype: int64

There are no null values in the weather dataset. Let's check whether there are really no missing values.

In [34]:
df_weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2944 entries, 0 to 2943
Data columns (total 22 columns):
Station        2944 non-null int64
Date           2944 non-null object
Tmax           2944 non-null int64
Tmin           2944 non-null int64
Tavg           2944 non-null object
Depart         2944 non-null object
DewPoint       2944 non-null int64
WetBulb        2944 non-null object
Heat           2944 non-null object
Cool           2944 non-null object
Sunrise        2944 non-null object
Sunset         2944 non-null object
CodeSum        2944 non-null object
Depth          2944 non-null object
Water1         2944 non-null object
SnowFall       2944 non-null object
PrecipTotal    2944 non-null object
StnPressure    2944 non-null object
SeaLevel       2944 non-null object
ResultSpeed    2944 non-null float64
ResultDir      2944 non-null int64
AvgSpeed       2944 non-null object
dtypes: float64(1), int64(5), object(16)
memory usage: 506.1+ KB


We see here that several columns that should be numerical are stored as objects (strings), e.g. Tavg, Depart, WetBulb, Heat, Cool, Sunrise and Sunset, among others. This is possibly due to alphabetical values in these columns, which needs further investigation.

In [35]:
#count cells that are exactly 'M' (missing) or 'T' (trace) in each column
for c in df_weather.columns:
    print(c, df_weather[df_weather[c].isin(['M', 'T'])][c].count())

Station 0
Date 0
Tmax 0
Tmin 0
Tavg 11
Depart 1472
DewPoint 0
WetBulb 4
Heat 11
Cool 11
Sunrise 0
Sunset 0
CodeSum 0
Depth 1472
Water1 2944
SnowFall 1472
PrecipTotal 2
StnPressure 4
SeaLevel 9
ResultSpeed 0
ResultDir 0
AvgSpeed 3


Although there are no null values in the weather dataset, quite a number of columns contain 'M' or 'T', which mean the data is missing or a trace amount. We will address the missing data in the following sections.
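
An exact match on `'T'` does not catch values stored with padding. The later cleaning cells replace `'  T'` (two leading spaces), which suggests traces are stored that way in this file. The sketch below contrasts an exact match with a match after stripping whitespace. The toy values are illustrative, not taken from the real file.

```python
import pandas as pd

# Toy columns; '  T' mimics the padded trace marker replaced later in the notebook
toy = pd.DataFrame({
    "PrecipTotal": ["0.00", "  T", "M", "0.12"],
    "SnowFall": ["0.0", "M", "  T", "0.0"],
})

markers = ["M", "T"]

# Exact match: padded '  T' is not counted
exact_counts = {col: int(toy[col].isin(markers).sum()) for col in toy.columns}

# Strip whitespace first so padded markers are counted too
stripped_counts = {
    col: int(toy[col].str.strip().isin(markers).sum()) for col in toy.columns
}

print("exact:   ", exact_counts)
print("stripped:", stripped_counts)
```

If the stripped counts exceed the exact counts on the real data, the difference is the number of padded markers.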

<h2><center> Data Dictionary : Weather Dataset </center></h2>

|Feature|Type|Number of M/T|Description|
|---|---|---|---|
|Station|int|0|Station Number|
|Date|object|0|Date of data Recorded|
|Tmax|int|0|Maximum Temperature of the Day|
|Tmin|int|0|Minimum Temperature of the Day|
|Tavg|object|11|Average Temperature of the Day|
|Depart|object|1472|Difference in temperature for the day against the normal temperature for the same day for the past 30 years.|
|DewPoint|int|0|The dew point is the temperature at which air is saturated with water vapor, which is the gaseous state of water|
|WetBulb|object|4|Wet bulb temperature tells you the lowest temperature that can be reached by evaporating water into the air|
|Heat|object|11|Heating degree days: 65F minus the day's average temperature when the average is below 65F, otherwise 0|
|Cool|object|11|Cooling degree days: the day's average temperature minus 65F when the average is above 65F, otherwise 0|
|Sunrise|object|0|Time of Sunrise|
|Sunset|object|0|Time of Sunset|
|CodeSum|object|0|Code Sum indicates the weather phenomena for the day|
|Depth|object|1472|Depth contains information related to snow|
|Water1|object|2944|Water1 contains information related to snow|
|SnowFall|object|1472|SnowFall contains information related to snow|
|PrecipTotal|object|2|The height of the wet mark indicates the amount of collected precipitation|
|StnPressure|object|4|StnPressure indicates the average station pressure|
|SeaLevel|object|9|SeaLevel indicates the average sea level pressure|
|ResultSpeed|float|0|Windspeed|
|ResultDir|int|0|Direction of Wind|
|AvgSpeed|object|3|Average Wind Speed|

## Weather - Data Cleaning

### Average Temperature

Tavg, the average temperature, has 11 missing values. The average temperature is the mean of the day's maximum and minimum temperatures, so we fill the missing values on this principle.

In [37]:
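#replace 'M' with (Tmax + Tmin)/2; .loc writes the result back to df_weather on the missing rows only
tavg_missing = df_weather['Tavg'] == 'M'
df_weather.loc[tavg_missing, 'Tavg'] = (df_weather.loc[tavg_missing, 'Tmax'] + df_weather.loc[tavg_missing, 'Tmin'])/2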

### Depart Temperature

Depart is the difference in temperature for the day against the normal temperature for the same day for the past 30 years. From our data checking, all the missing values are from Station 2. We shall fill each missing value with Station 1's Depart for the same day.

In [38]:
#replace 'M' with NaN, then forward-fill from the previous row (Station 1's value for the same day)
df_weather['Depart'] = df_weather['Depart'].replace('M',np.nan).ffill()
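
Forward-filling relies on the rows being ordered so that each Station 2 row directly follows the Station 1 row for the same date. As a sketch of an alternative that does not depend on row order, the fill can be done per date. The toy frame below mirrors the first four rows shown by `df_weather.head()`; this is an illustration, not the method used in the notebook.

```python
import pandas as pd

# Toy rows mirroring df_weather.head(): Station 2 has 'M' for Depart
toy = pd.DataFrame({
    "Station": [1, 2, 1, 2],
    "Date": ["2007-05-01", "2007-05-01", "2007-05-02", "2007-05-02"],
    "Depart": ["14", "M", "-3", "M"],
})

# 'M' cannot be parsed as a number, so errors="coerce" turns it into NaN
depart = pd.to_numeric(toy["Depart"], errors="coerce")

# Within each date, take the first non-null value (the reporting station's)
toy["Depart"] = depart.groupby(toy["Date"]).transform("first").astype(int)

print(toy)
```

The same pattern would apply to Sunrise and Sunset, which are also only reported by Station 1.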

### Wet Bulb Temperature

Wet bulb temperature tells you the lowest temperature that can be reached by evaporating water into the air, and we have 4 missing values. Going through the data, we can determine that the difference between the two stations' wet bulb temperatures is minimal, in the range of 0 to 3. As such, we will fill each missing wet bulb temperature with the other station's value for the same day.

In [39]:
#fill with the other station’s value
for i,index in enumerate(df_weather[df_weather['WetBulb']=='M'].index):
    print(i,index)
    if df_weather.loc[index,'Station'] == 1:
        df_weather.loc[index,'WetBulb'] = df_weather.loc[index+1,'WetBulb']
    else:
        df_weather.loc[index,'WetBulb'] = df_weather.loc[index-1,'WetBulb']

0 848
1 2410
2 2412
3 2415


### Heating and Cooling Day

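Heating and cooling degree days are determined by comparing the day's average temperature with 65F. If the average is below 65F, the difference is recorded as heating degree days (Heat); if it is above 65F, the difference is recorded as cooling degree days (Cool). This matches the populated rows shown earlier, e.g. Tavg 67 has Heat 0 and Cool 2, and Tavg 51 has Heat 14 and Cool 0. We will fill in the missing values based on this principle.

In [40]:
for i,index in enumerate(df_weather[df_weather['Heat']=='M'].index):
    print(i,index)
    result = df_weather.loc[index, 'Tavg'] - 65
    if result > 0:
        #average above 65F: cooling degree days
        df_weather.loc[index, 'Cool'] = int(result)
        df_weather.loc[index, 'Heat'] = int(0)
    else:
        #average at or below 65F: heating degree days
        df_weather.loc[index, 'Heat'] = int(abs(result))
        df_weather.loc[index, 'Cool'] = int(0)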

0 7
1 505
2 675
3 1637
4 2067
5 2211
6 2501
7 2511
8 2525
9 2579
10 2811
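
As a worked check of the degree-day rule against a 65F base, the small function below reproduces the Heat and Cool values of rows already populated in `df_weather.head()`. The function name `degree_days` is ours.

```python
BASE_TEMPERATURE_F = 65


def degree_days(tavg: float) -> tuple:
    """Return (heat, cool) degree days for a daily average temperature in F."""
    heat = max(BASE_TEMPERATURE_F - tavg, 0)  # how far below the base
    cool = max(tavg - BASE_TEMPERATURE_F, 0)  # how far above the base
    return heat, cool


# Tavg values taken from the first rows of df_weather.head()
for tavg in (67, 51, 56):
    heat, cool = degree_days(tavg)
    print(f"Tavg={tavg}: Heat={heat}, Cool={cool}")
```

At most one of the two values is non-zero for any day, and both are zero when the average is exactly 65F.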


### Sunrise and Sunset

There are 1472 missing values in each of Sunrise and Sunset, indicated by '-', and they are all from Station 2. Sunrise and sunset times are the same for the whole of Chicago, so we will fill the missing values with the values from Station 1.

In [41]:
#replace '-' with NaN, then forward-fill from the previous row (Station 1's value for the same day)
df_weather['Sunrise'] = df_weather['Sunrise'].replace('-',np.nan).ffill()

df_weather['Sunset'] = df_weather['Sunset'].replace('-',np.nan).ffill()

### Change Date, Tavg, Depart, WetBulb, Heat and Cool to Correct Data Type

In [42]:
#convert 'Date' to date datatype
df_weather['Date'] = pd.to_datetime(df_weather['Date']) 

#convert 'Tavg' to int
df_weather['Tavg'] = df_weather['Tavg'].astype(int)

#convert 'Depart' to int
df_weather['Depart'] = df_weather['Depart'].astype(int)

#convert 'WetBulb' to int
df_weather['WetBulb'] = df_weather['WetBulb'].astype(int)

#convert 'Heat','Cool' to int
df_weather['Heat'] = df_weather['Heat'].astype(int)
df_weather['Cool'] = df_weather['Cool'].astype(int)

### Code Sum

Code Sum indicates the weather phenomena for the day. As indicated in Kaggle's data description, a blank CodeSum means there were no signs of any special weather phenomena. We will replace the blanks with 'No Sign'.

In [43]:
#Replace empty values as 'No Sign'
df_weather['CodeSum'] = df_weather['CodeSum'].replace(to_replace = ' ', value = 'No Sign')

### Depth, Water1 and SnowFall

From the data information, Depth, Water1 and SnowFall contain information related to snow. As our data was collected during a period when Chicago does not experience snowfall, we will fill the missing data with 0 and convert the columns to numeric.

In [44]:
df_weather['Depth'] = df_weather['Depth'].replace('M', 0)
df_weather['Depth'] = pd.to_numeric(df_weather['Depth'])

In [45]:
df_weather['Water1'] = df_weather['Water1'].replace('M', 0)
df_weather['Water1'] = pd.to_numeric(df_weather['Water1'])

In [46]:
df_weather['SnowFall'] = df_weather['SnowFall'].replace('M', 0)
df_weather['SnowFall'] = df_weather['SnowFall'].replace('  T', 0)
df_weather['SnowFall'] = pd.to_numeric(df_weather['SnowFall'])

### Precipitation

Section 5.1.1 of the [National Weather Service Instruction on the NOAA website](https://www.nws.noaa.gov/directives/sym/pd01013002curr.pdf) explains how precipitation is recorded. Every day, the COOP observer places a rain stick into the SRG and performs a visual inspection for a wet mark. The height of the wet mark indicates the amount of collected precipitation. If the wet mark is less than 0.01 inch, the observer reports "T" for a trace of precipitation. As such, we will replace 'T' and 'M' with 0.00.

In [47]:
# replacing T and M value with 0
df_weather['PrecipTotal'] = df_weather['PrecipTotal'].replace('  T', 0.00)
df_weather['PrecipTotal'] = df_weather['PrecipTotal'].replace('M', 0.00)
df_weather['PrecipTotal'] = pd.to_numeric(df_weather['PrecipTotal'])
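
The same conversion can be sketched on a toy series without depending on the exact padding of the trace marker. Stripping whitespace first means `'T'`, `' T'` and `'  T'` are all handled by one mapping. The toy values are illustrative, and the `is_trace` flag is an optional extra of ours, not something the notebook keeps.

```python
import pandas as pd

# Toy values: a dry day, a padded trace, a missing reading, a wet day
precip = pd.Series(["0.00", "  T", "M", "0.13"])

stripped = precip.str.strip()

# Optional: remember which days had a trace before it is flattened to 0
is_trace = stripped.eq("T")

# Map both markers to zero, then convert the whole column to float
numeric = pd.to_numeric(stripped.replace({"T": "0.00", "M": "0.00"}))

print(numeric.tolist())
print(is_trace.tolist())
```

Treating 'M' as 0.00 follows the notebook's choice; with only 2 such rows the effect on the column is small.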

### Station Pressure

StnPressure indicates the average station pressure.

In [48]:
df_weather['StnPressure'].unique()

array(['29.10', '29.18', '29.38', '29.44', '29.39', '29.46', '29.31',
       '29.36', '29.40', '29.57', '29.62', '29.29', '29.21', '29.28',
       '29.20', '29.26', '29.33', '29.49', '29.54', '29.55', '29.23',
       '29.13', '29.19', '29.53', '29.60', '29.34', '29.41', '29.47',
       '29.51', '29.42', '29.43', '29.25', '29.03', '28.82', '28.87',
       '28.88', '29.16', '29.07', '28.84', '28.91', '29.24', 'M', '29.30',
       '29.12', '29.45', '29.56', '29.32', '29.05', '29.11', '29.06',
       '29.22', '29.08', '29.14', '29.37', '29.35', '29.15', '29.17',
       '29.48', '29.52', '29.27', '29.50', '28.59', '28.67', '28.75',
       '29.02', '29.79', '29.86', '29.63', '29.70', '28.95', '29.01',
       '28.79', '28.85', '28.97', '28.89', '28.94', '28.93', '28.98',
       '28.96', '29.00', '29.66', '29.09', '28.90', '29.04', '29.59',
       '29.65', '29.58', '29.61', '29.64', '29.71', '29.67', '28.80',
       '28.73', '29.68', '28.74', '28.55', '28.63', '28.92', '28.99',
       '28.81',

In [49]:
print(df_weather[df_weather.StnPressure.isin(['M', 'T'])]['StnPressure'].count())
df_weather[df_weather.StnPressure.isin(['M', 'T'])]['StnPressure']

4


87      M
848     M
2410    M
2411    M
Name: StnPressure, dtype: object

We have 4 missing values in StnPressure. The station pressure fluctuates little and there are no outliers. As such, we will fill 'M' with the mean of the series.

In [52]:
# Replace missing values in 'StnPressure' with null values before filling with the mean
df_weather['StnPressure'] = df_weather['StnPressure'].replace(to_replace='M', value=np.nan)

#fill null values with mean value
df_weather['StnPressure']=df_weather['StnPressure'].fillna(round(df_weather['StnPressure'].astype(float).mean(),2))
df_weather['StnPressure']=pd.to_numeric(df_weather['StnPressure'])
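
A compact variant of the mean fill converts to numeric first, so that the marker becomes NaN and the mean is computed on floats directly. The sketch below uses four toy values, three of which appear in the `unique()` output above. It illustrates the technique and is not the notebook's exact cell.

```python
import pandas as pd

# Toy series with one missing marker
pressure = pd.Series(["29.10", "29.18", "M", "29.38"])

# errors="coerce" turns anything that is not a number into NaN
numeric = pd.to_numeric(pressure, errors="coerce")

# mean() skips NaN by default; round to the two decimals used in the data
fill_value = round(numeric.mean(), 2)
filled = numeric.fillna(fill_value)

print(fill_value)
print(filled.tolist())
```

The same three lines would apply to SeaLevel, which is filled in the same way.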

### Sea Level

We will now take a look at SeaLevel, which indicates the average sea level pressure.

In [53]:
#check for unique values of SeaLevel column
df_weather['SeaLevel'].unique()

array(['29.82', '30.09', '30.08', '30.12', '30.05', '30.04', '30.10',
       '30.29', '30.28', '30.03', '30.02', '29.94', '29.93', '29.92',
       '29.91', '30.20', '30.19', '30.24', '29.97', '29.98', '29.84',
       '29.83', '30.27', '30.25', '30.26', '30.11', '30.06', '30.23',
       '30.15', '30.14', '30.00', '29.99', '29.90', '29.77', '29.76',
       '29.56', '29.54', '29.52', '29.51', '29.79', '29.78', '29.81',
       '29.55', '29.85', '30.07', '30.16', 'M', '29.96', '29.95', '30.13',
       '30.21', '30.22', '29.88', '30.01', '29.80', '29.89', '29.74',
       '29.87', '29.86', '30.18', '30.17', '29.34', '29.44', '29.45',
       '29.71', '29.72', '30.52', '30.53', '30.40', '30.41', '29.67',
       '29.53', '29.69', '29.61', '29.64', '29.63', '29.66', '29.70',
       '30.34', '30.33', '29.62', '29.60', '29.75', '29.68', '29.73',
       '30.31', '30.30', '30.32', '30.37', '30.39', '29.59', '29.65',
       '30.35', '30.36', '29.48', '30.38', '29.50', '29.25', '29.23',
       '29.46',

In [54]:
print(df_weather[df_weather.SeaLevel.isin(['M', 'T'])]['SeaLevel'].count())
df_weather[df_weather.SeaLevel.isin(['M', 'T'])]['SeaLevel']

9


87      M
832     M
994     M
1732    M
1745    M
1756    M
2067    M
2090    M
2743    M
Name: SeaLevel, dtype: object

SeaLevel has 9 missing values. As with StnPressure, sea level pressure fluctuates little and shows no outliers, so we will fill the 'M' entries with the mean of the series.

In [55]:
# Replace missing values ('M') in 'SeaLevel' with null values before filling with the mean
df_weather['SeaLevel'] = df_weather['SeaLevel'].replace('M', np.nan)

#fill null values with mean value
df_weather['SeaLevel'] = df_weather['SeaLevel'].fillna(round(df_weather['SeaLevel'].astype(float).mean(),2))
df_weather['SeaLevel'] = pd.to_numeric(df_weather['SeaLevel'])

### Average Speed

In [56]:
print(df_weather[df_weather.AvgSpeed.isin(['M', 'T'])]['AvgSpeed'].count())
df_weather[df_weather.AvgSpeed.isin(['M', 'T'])]['AvgSpeed']

3


87      M
1745    M
2067    M
Name: AvgSpeed, dtype: object

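AvgSpeed has 3 missing values. Going through the data, the difference between the two stations' AvgSpeed is minimal, in the range of 0 to 2. As such, we will fill each missing average wind speed with the value recorded by the other station for the same day. The rows alternate between Station 1 and Station 2 for each date, so a forward fill copies the value from the same-day record directly above.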

In [57]:
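# Replace 'M' with null values, then forward fill from the previous row (the other station, same day)
df_weather['AvgSpeed'] = df_weather['AvgSpeed'].replace('M', np.nan)
df_weather['AvgSpeed'] = df_weather['AvgSpeed'].ffill()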

In [58]:
# changing the string type to numeric type
df_weather['AvgSpeed']=pd.to_numeric(df_weather['AvgSpeed'])
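
The forward fill above relies on the row order of the data frame. A more explicit variant, sketched below on made-up data, groups by `Date` so that a missing value can only ever be filled from the other station on the same day. This is an alternative illustration, not the method used in this notebook.

```python
import pandas as pd

# Made-up two-day sample with one missing Station 2 reading
weather = pd.DataFrame({
    "Station": [1, 2, 1, 2],
    "Date": pd.to_datetime(["2007-05-01", "2007-05-01", "2007-05-02", "2007-05-02"]),
    "AvgSpeed": ["9.2", "M", "13.4", "13.4"],
})

# Blank out the 'M' code and convert the column to floats
weather["AvgSpeed"] = pd.to_numeric(weather["AvgSpeed"].mask(weather["AvgSpeed"] == "M"))

# Fill within each date only, in either direction, so values never leak across days
weather = weather.sort_values(["Date", "Station"])
weather["AvgSpeed"] = (
    weather.groupby("Date")["AvgSpeed"].transform(lambda s: s.ffill().bfill())
)
print(weather)
```

If both stations were missing a value on the same day, this version would leave the nulls in place rather than borrowing a reading from a different date.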

## Weather - Final Check and Save Data Frame

In [59]:
df_weather.dtypes

Station                 int64
Date           datetime64[ns]
Tmax                    int64
Tmin                    int64
Tavg                    int32
Depart                  int32
DewPoint                int64
WetBulb                 int32
Heat                    int32
Cool                    int32
Sunrise                object
Sunset                 object
CodeSum                object
Depth                   int64
Water1                  int64
SnowFall              float64
PrecipTotal           float64
StnPressure           float64
SeaLevel              float64
ResultSpeed           float64
ResultDir               int64
AvgSpeed              float64
dtype: object

In [60]:
df_weather.shape

(2944, 22)

In [61]:
df_weather.head(20)

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,Sunrise,Sunset,CodeSum,Depth,Water1,SnowFall,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
0,1,2007-05-01,83,50,67,14,51,56,0,2,448,1849,No Sign,0,0,0.0,0.0,29.1,29.82,1.7,27,9.2
1,2,2007-05-01,84,52,68,14,51,57,0,3,448,1849,No Sign,0,0,0.0,0.0,29.18,29.82,2.7,25,9.6
2,1,2007-05-02,59,42,51,-3,42,47,14,0,447,1850,BR,0,0,0.0,0.0,29.38,30.09,13.0,4,13.4
3,2,2007-05-02,60,43,52,-3,42,47,13,0,447,1850,BR HZ,0,0,0.0,0.0,29.44,30.08,13.3,2,13.4
4,1,2007-05-03,66,46,56,2,40,48,9,0,446,1851,No Sign,0,0,0.0,0.0,29.39,30.12,11.7,7,11.9
5,2,2007-05-03,67,48,58,2,40,50,7,0,446,1851,HZ,0,0,0.0,0.0,29.46,30.12,12.9,6,13.2
6,1,2007-05-04,66,49,58,4,41,50,7,0,444,1852,RA,0,0,0.0,0.0,29.31,30.05,10.4,8,10.8
7,2,2007-05-04,78,51,64,4,42,50,0,0,444,1852,No Sign,0,0,0.0,0.0,29.36,30.04,10.1,7,10.4
8,1,2007-05-05,66,53,60,5,38,49,5,0,443,1853,No Sign,0,0,0.0,0.0,29.4,30.1,11.7,7,12.0
9,2,2007-05-05,66,54,60,5,39,50,5,0,443,1853,No Sign,0,0,0.0,0.0,29.46,30.09,11.2,7,11.5


In [62]:
# Save the cleaned weather dataset for further analysis. Pickle files are used as the original source files are CSV.
df_weather.to_pickle('../assets/weather_clean.pkl')


# df_weather.to_csv('../assets/clean/weather_clean.csv', index = False)
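
One practical benefit of pickle over CSV here is that the cleaned data types survive the round trip. This is our assumption about why it is convenient, not something the source data requires. The sketch below uses a tiny made-up frame and a temporary directory instead of the project's `../assets` folder.

```python
import os
import tempfile

import pandas as pd

# Made-up frame with a datetime column, standing in for the cleaned weather data
frame = pd.DataFrame({
    "Date": pd.to_datetime(["2007-05-01", "2007-05-02"]),
    "StnPressure": [29.10, 29.38],
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "weather_clean.pkl")
    csv_path = os.path.join(tmp, "weather_clean.csv")

    frame.to_pickle(pkl_path)
    frame.to_csv(csv_path, index=False)

    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)  # no parse_dates, so Date is read back as text

print(from_pickle.dtypes)
print(from_csv.dtypes)
```

Reading the CSV back would need `parse_dates=['Date']` to restore the datetime column, whereas the pickle restores it as saved.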

# Clean Spray Dataset

## Spray dataset - Check Data

In [63]:
df_spray.head()

Unnamed: 0,Date,Time,Latitude,Longitude
0,2011-08-29,6:56:58 PM,42.391623,-88.089163
1,2011-08-29,6:57:08 PM,42.391348,-88.089163
2,2011-08-29,6:57:18 PM,42.391022,-88.089157
3,2011-08-29,6:57:28 PM,42.390637,-88.089158
4,2011-08-29,6:57:38 PM,42.39041,-88.088858


In [64]:
#check data types  in spray dataset
df_spray.dtypes

Date          object
Time          object
Latitude     float64
Longitude    float64
dtype: object

In [65]:
#check null values in spray dataset
df_spray.isnull().sum()

Date           0
Time         584
Latitude       0
Longitude      0
dtype: int64

<h2><center> Data Dictionary : Spray Dataset </center></h2>

|Feature|Type|Number of Null|Description|
|---|---|---|---|
|Date|object|0|Date when Spraying is conducted|
|Time|object|584|Time when Spraying is conducted|
|Latitude|float|0|Latitude of Spray Locations|
|Longitude|float|0|Longitude of Spray Locations|

- Time has 584 null values and is stored in AM/PM format. We shall explore this column further to determine how best to handle the null values, and then convert it to 24-hour format.
- Date has the object data type. We shall convert it to datetime format.

## Spray - Data Cleaning

### Time

In [66]:
df_spray[df_spray['Time'].isnull()]

Unnamed: 0,Date,Time,Latitude,Longitude
1030,2011-09-07,,41.987092,-87.794286
1031,2011-09-07,,41.98762,-87.794382
1032,2011-09-07,,41.988004,-87.794574
1033,2011-09-07,,41.988292,-87.795486
1034,2011-09-07,,41.9881,-87.796014
1035,2011-09-07,,41.986372,-87.794862
1036,2011-09-07,,41.986228,-87.795582
1037,2011-09-07,,41.984836,-87.793998
1038,2011-09-07,,41.984836,-87.79467
1039,2011-09-07,,41.984884,-87.795198


All the null values for Time are from the same date, and they appear to be in sequential rows. We shall look at the row just before and the row just after the null values to get more insight.

In [67]:
df_spray.iloc[[1029, 1614]]

Unnamed: 0,Date,Time,Latitude,Longitude
1029,2011-09-07,7:44:32 PM,41.98646,-87.794225
1614,2011-09-07,7:46:30 PM,41.973465,-87.827643
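
The observation that the null times sit in one unbroken block on a single date can also be checked programmatically rather than by eye. The sketch below uses a made-up five-row frame, and it assumes a default integer index like the one `df_spray` has.

```python
import pandas as pd

# Made-up sample: three null times between two known times on the same date
spray = pd.DataFrame({
    "Date": ["2011-09-07"] * 5,
    "Time": ["7:44:32 PM", None, None, None, "7:46:30 PM"],
})

null_mask = spray["Time"].isnull()

# Which dates do the null times fall on?
null_dates = spray.loc[null_mask, "Date"].unique()

# Do the null rows form one unbroken run of index positions?
null_idx = spray.index[null_mask]
is_contiguous = (null_idx.max() - null_idx.min() + 1) == len(null_idx)

# Known times immediately before and after the null block
before = spray.loc[null_idx.min() - 1, "Time"]
after = spray.loc[null_idx.max() + 1, "Time"]

print(null_dates, is_contiguous, before, after)
```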


The entries before and after the null values are also from the same date. The time just before the null values is 7:44:32 PM and the time just after them is 7:46:30 PM. Looking at other rows, we can see that times for the same date run in increasing order down the rows. As such, we will fill the null values with 7:45:00 PM, which falls between the two, and then convert the time to 24-hour format.

In [68]:
# Fill only the Time column, so no other column can be affected
df_spray['Time'] = df_spray['Time'].fillna('7:45:00 PM')

In [69]:
df_spray['Time'] = pd.to_datetime(df_spray['Time'],format= '%I:%M:%S %p').dt.time
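
As a quick check of the format string, the fill and the 24-hour conversion can be tried on a small made-up series first. In the `strptime` format, `%I` is the 12-hour clock hour and `%p` is the AM/PM marker.

```python
import datetime

import pandas as pd

# Made-up sample: one null time between the two known times
times = pd.Series(["7:44:32 PM", None, "7:46:30 PM"])

# Fill the gap with the chosen value, then parse the 12-hour strings
times = times.fillna("7:45:00 PM")
times_24h = pd.to_datetime(times, format="%I:%M:%S %p").dt.time

print(times_24h.tolist())
```

The result holds `datetime.time` objects, which is why the Time column is reported with the object data type after conversion.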

### Date

In [70]:
df_spray['Date'] = pd.to_datetime(df_spray['Date'])

In [71]:
df_spray.head()

Unnamed: 0,Date,Time,Latitude,Longitude
0,2011-08-29,18:56:58,42.391623,-88.089163
1,2011-08-29,18:57:08,42.391348,-88.089163
2,2011-08-29,18:57:18,42.391022,-88.089157
3,2011-08-29,18:57:28,42.390637,-88.089158
4,2011-08-29,18:57:38,42.39041,-88.088858


## Spray - Final Check and Save Data Frame

After completing the cleaning, we shall check our columns and save the cleaned data frame as a pickle file.

In [72]:
df_spray.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14835 entries, 0 to 14834
Data columns (total 4 columns):
Date         14835 non-null datetime64[ns]
Time         14835 non-null object
Latitude     14835 non-null float64
Longitude    14835 non-null float64
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 463.7+ KB


In [74]:
# Save the cleaned spray dataset for further analysis. Pickle files are used as the original source files are CSV.
df_spray.to_pickle('../assets/spray_clean.pkl')


# df_spray.to_csv('../assets/clean/spray_clean.csv', index = False)

- This is the first of the 4 notebooks for the project.
    - For EDA and feature engineering, please refer to 2_eda_feature_engineering.ipynb
    - For modelling, please refer to 3_modelling and eval.ipynb
    - For cost-benefit analysis, please refer to 4_cost_benefit_analysis.ipynb