# Project 4: Predict West Nile Virus
### Section 1: Introduction & Data Preprocessing

## Problem Statement

1. As employees of the Disease And Treatment Agency, a division of Societal Cures In Epidemiology and New Creative Engineering (DATA-SCIENCE), we are tasked with better understanding the mosquito population and advising on interventions that are both beneficial and cost-effective for the city.


2. Through this project, we hope to:
- Identify the features that matter most for predicting the presence of West Nile Virus (for example, by ranking the coefficients of each feature in a logistic regression model).
- Predict the probability of West Nile Virus by location, so that decision makers can plan pesticide deployment across the city effectively and thereby reduce cost.
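
The coefficient-ranking idea in the first goal can be sketched as follows. This is a minimal illustration on synthetic data, assuming scikit-learn is available; the feature names (`feat_strong`, `feat_weak`, `feat_noise`) are hypothetical and are not columns from the project datasets. The features are standardized first, because coefficient magnitudes are only comparable when the features share a scale.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: hypothetical features with a known relationship to the target
rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    'feat_strong': rng.normal(size=n),
    'feat_weak': rng.normal(size=n),
    'feat_noise': rng.normal(size=n),
})
logit = 2.0 * X['feat_strong'] + 0.5 * X['feat_weak']
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Standardize so that coefficient magnitudes can be compared across features
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(X_scaled, y)

# Rank features by absolute coefficient, largest first
ranking = pd.Series(model.coef_[0], index=X.columns)
ranking = ranking.sort_values(key=abs, ascending=False)
print(ranking)
```

Coefficient size is only a rough guide to importance: correlated features can share or trade weight between them, so the ranking is best read alongside other checks.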

## Import Libraries

In [1]:
import pandas as pd

## Load Data

In [2]:
# Load datasets
train = pd.read_csv('../data/train.csv')
weather = pd.read_csv('../data/weather.csv')
spray = pd.read_csv('../data/spray.csv')
test = pd.read_csv('../data/test.csv')
# train['id'] = train.index
# train.shape

In [3]:
# Check column names, types and null values
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10506 entries, 0 to 10505
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Date                    10506 non-null  object 
 1   Address                 10506 non-null  object 
 2   Species                 10506 non-null  object 
 3   Block                   10506 non-null  int64  
 4   Street                  10506 non-null  object 
 5   Trap                    10506 non-null  object 
 6   AddressNumberAndStreet  10506 non-null  object 
 7   Latitude                10506 non-null  float64
 8   Longitude               10506 non-null  float64
 9   AddressAccuracy         10506 non-null  int64  
 10  NumMosquitos            10506 non-null  int64  
 11  WnvPresent              10506 non-null  int64  
dtypes: float64(2), int64(4), object(6)
memory usage: 985.1+ KB


In [4]:
# Check column names, types and null values
weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2944 entries, 0 to 2943
Data columns (total 22 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Station      2944 non-null   int64  
 1   Date         2944 non-null   object 
 2   Tmax         2944 non-null   int64  
 3   Tmin         2944 non-null   int64  
 4   Tavg         2944 non-null   object 
 5   Depart       2944 non-null   object 
 6   DewPoint     2944 non-null   int64  
 7   WetBulb      2944 non-null   object 
 8   Heat         2944 non-null   object 
 9   Cool         2944 non-null   object 
 10  Sunrise      2944 non-null   object 
 11  Sunset       2944 non-null   object 
 12  CodeSum      2944 non-null   object 
 13  Depth        2944 non-null   object 
 14  Water1       2944 non-null   object 
 15  SnowFall     2944 non-null   object 
 16  PrecipTotal  2944 non-null   object 
 17  StnPressure  2944 non-null   object 
 18  SeaLevel     2944 non-null   object 
 19  Result
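
Most weather columns load as `object` rather than a numeric type, which usually means some cells hold non-numeric text. The sketch below shows one way to find and coerce such values. The placeholder strings `'M'` and `'T'` and the toy values are assumptions for illustration only, so the real file should be inspected before relying on them.

```python
import pandas as pd

# Toy stand-in for weather.csv; values are illustrative, not real readings
toy_weather = pd.DataFrame({
    'Tavg': ['67', 'M', '71'],
    'PrecipTotal': ['0.00', '  T', '0.13'],
})

for col in ['Tavg', 'PrecipTotal']:
    # Unparseable entries become NaN instead of raising an error
    numeric = pd.to_numeric(toy_weather[col], errors='coerce')
    # Entries that failed to parse reveal which placeholders the column uses
    bad_values = toy_weather.loc[numeric.isna(), col].unique()
    print(col, 'non-numeric values:', list(bad_values))
    toy_weather[col] = numeric
```

Coercing to `NaN` keeps the rows, so how to fill or drop them can be decided later.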

In [5]:
# Check column names, types and null values
spray.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14835 entries, 0 to 14834
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       14835 non-null  object 
 1   Time       14251 non-null  object 
 2   Latitude   14835 non-null  float64
 3   Longitude  14835 non-null  float64
dtypes: float64(2), object(2)
memory usage: 463.7+ KB


In [6]:
# Check column names, types and null values
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116293 entries, 0 to 116292
Data columns (total 11 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   Id                      116293 non-null  int64  
 1   Date                    116293 non-null  object 
 2   Address                 116293 non-null  object 
 3   Species                 116293 non-null  object 
 4   Block                   116293 non-null  int64  
 5   Street                  116293 non-null  object 
 6   Trap                    116293 non-null  object 
 7   AddressNumberAndStreet  116293 non-null  object 
 8   Latitude                116293 non-null  float64
 9   Longitude               116293 non-null  float64
 10  AddressAccuracy         116293 non-null  int64  
dtypes: float64(2), int64(3), object(6)
memory usage: 9.8+ MB


In [7]:
# Rename all column names to lower case in each dataset
for df in (train, test, weather, spray):
    df.columns = df.columns.str.lower()

In [None]:
# Convert date column to datetime type 
train['date'] = pd.to_datetime(train['date'])
test['date'] = pd.to_datetime(test['date'])
weather['date'] = pd.to_datetime(weather['date'])
spray['date'] = pd.to_datetime(spray['date'])

In [None]:
# Combine train and test datasets with column to indicate dataset.
traintest = pd.concat([train,test], axis=0, keys=('train','test'))
traintest = traintest.reset_index() \
                        .rename(columns={'level_0':'dataset'}) \
                        .drop(columns=['level_1','id'])
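
To make the effect of `keys` and `reset_index` concrete, here is a small sketch on toy frames. The frames `toy_train` and `toy_test` and their values are hypothetical; only the chain of calls mirrors the cell above. It also shows one way to split the combined frame back apart later.

```python
import pandas as pd

# Hypothetical miniature versions of the train and test sets
toy_train = pd.DataFrame({'trap': ['T001', 'T002'], 'wnvpresent': [0, 1]})
toy_test = pd.DataFrame({'id': [1, 2, 3], 'trap': ['T001', 'T003', 'T004']})

# keys= adds an outer index level naming the source of each row
combined = pd.concat([toy_train, toy_test], axis=0, keys=('train', 'test'))

# reset_index turns the two index levels into columns level_0 and level_1
combined = (
    combined.reset_index()
    .rename(columns={'level_0': 'dataset'})
    .drop(columns=['level_1', 'id'])
)
print(combined)

# Recover each part by filtering on the indicator column
train_part = combined[combined['dataset'] == 'train']
test_part = combined[combined['dataset'] == 'test'].drop(columns=['wnvpresent'])
```

Columns that exist in only one source, such as the target in the training rows, are filled with `NaN` for the other source.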

In [None]:
# Add year, month, ISO week and day-of-week features (vectorized via .dt)
traintest['year'] = traintest['date'].dt.year
traintest['month'] = traintest['date'].dt.month
traintest['week'] = traintest['date'].dt.isocalendar().week.astype(int)
traintest['dayofweek'] = traintest['date'].dt.dayofweek

In [None]:
# Add year, month, ISO week and day-of-week features to the weather data
weather['year'] = weather['date'].dt.year
weather['month'] = weather['date'].dt.month
weather['week'] = weather['date'].dt.isocalendar().week.astype(int)
weather['dayofweek'] = weather['date'].dt.dayofweek

In [None]:
# Add year, month, ISO week and day-of-week features to the spray data
spray['year'] = spray['date'].dt.year
spray['month'] = spray['date'].dt.month
spray['week'] = spray['date'].dt.isocalendar().week.astype(int)
spray['dayofweek'] = spray['date'].dt.dayofweek
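
Since the same four features are added to three datasets, the step could be factored into one helper. The function name `add_date_features` and the toy dates below are hypothetical; this is a sketch of one possible refactoring, not the notebook's own code.

```python
import pandas as pd

def add_date_features(df, date_col='date'):
    """Return a copy of df with year, month, ISO week and day-of-week columns."""
    out = df.copy()
    dates = out[date_col].dt
    out['year'] = dates.year
    out['month'] = dates.month
    out['week'] = dates.isocalendar().week.astype(int)  # ISO 8601 week number
    out['dayofweek'] = dates.dayofweek  # Monday=0 ... Sunday=6
    return out

# Toy frame with two illustrative dates
toy = pd.DataFrame({'date': pd.to_datetime(['2007-05-29', '2007-06-05'])})
toy = add_date_features(toy)
print(toy)
```

Under this sketch, each dataset would be updated with a call such as `weather = add_date_features(weather)`.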

### Export to CSV

In [None]:
# Save processed data to CSV
traintest.to_csv('../data/data_traintest.csv')
weather.to_csv('../data/data_weather.csv')
spray.to_csv('../data/data_spray.csv')
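
Because `to_csv` writes the row index by default, a later notebook that reads these files back needs to account for the extra leading column, and the `date` column comes back as text unless it is parsed again. The sketch below demonstrates this round trip with an in-memory buffer; the toy frame is hypothetical.

```python
import io
import pandas as pd

# Toy frame standing in for one of the processed datasets
toy = pd.DataFrame({
    'date': pd.to_datetime(['2007-05-29', '2007-06-05']),
    'year': [2007, 2007],
})

# Write to an in-memory buffer with the default settings (index included)
buffer = io.StringIO()
toy.to_csv(buffer)

# A plain read turns the saved index into an 'Unnamed: 0' column
buffer.seek(0)
naive = pd.read_csv(buffer)
print(naive.columns.tolist())

# index_col=0 restores the index; parse_dates restores the datetime dtype
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0, parse_dates=['date'])
print(restored.dtypes)
```

Passing `index=False` when saving is an alternative, provided downstream code does not expect the index column.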