# Project 4: Predict West Nile Virus
### Section 4. Model Exploration

## Problem Statement

1. As employees of the Disease And Treatment Agency, a division of Societal Cures In Epidemiology and New Creative Engineering (DATA-SCIENCE), we are tasked with better understanding the mosquito population and advising on interventions that are both beneficial and cost-effective for the city.


2. Through this project, we hope to:
- Identify the features that matter most for predicting the presence of West Nile Virus (for example, by ranking the coefficients of each feature in a logistic regression model)
- Predict the probability of West Nile Virus by location, so that decision makers can target pesticide deployment across the city and thereby reduce cost.
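
The coefficient-ranking idea in the first goal can be sketched as follows. This is a minimal illustration on synthetic data, not the project's model: the feature names `strong_signal`, `weak_signal` and `noise` are hypothetical, and the features are standardised first on the assumption that coefficient magnitudes are only comparable when the inputs share a scale.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical feature names, not the project's columns)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "strong_signal": rng.normal(size=500),
    "weak_signal": rng.normal(size=500),
    "noise": rng.normal(size=500),
})
# The target depends strongly on one feature, weakly on another, not at all on the third
logits = 2.0 * X_demo["strong_signal"] + 0.5 * X_demo["weak_signal"]
y_demo = (logits + rng.logistic(size=500) > 0).astype(int)

# Standardise so that coefficient magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(X_demo)
model = LogisticRegression().fit(X_scaled, y_demo)

# Rank features by the absolute size of their coefficients
coef_ranking = (
    pd.Series(model.coef_[0], index=X_demo.columns)
    .sort_values(key=abs, ascending=False)
)
print(coef_ranking)
```

Ranking by absolute value keeps strongly negative features near the top, since a large negative coefficient is as informative as a large positive one.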

## Import Libraries

In [27]:
#!pip install shapely
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from shapely import geometry
from shapely.geometry import Point, Polygon
import geopandas as gpd
from datetime import timedelta
import math

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

## Load Data

In [11]:
# Load datasets
df = pd.read_csv('../data/combined.csv', index_col='Unnamed: 0')

In [19]:
# Split into train and test (kaggle) data 
train = df[df['dataset']=='train']
test = df[df['dataset']=='test']
print(train.shape)
print(test.shape)

(8304, 38)
(44550, 38)


In [21]:
train.describe()

| | latitude | longitude | wnvpresent | year | month | week | dayofweek | nummosquitos | dist_s1 | dist_s2 | ... | wetbulb | heat | cool | sunrise | depth | sealevel | resultspeed | resultdir | avgspeed | is_spray |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8304.0 | 8304.0 | 8304.0 | 8304.0 | 8304.0 | 8304.0 | 8304.0 | 8304.0 | 8304.0 | 8304.0 | ... | 8304.0 | 8304.0 | 8304.0 | 8304.0 | 8304.0 | 8304.0 | 8304.0 | 8304.0 | 8304.0 | 8304.0 |
| mean | 41.8458 | -87.696229 | 0.055034 | 2009.742293 | 7.70183 | 31.746869 | 2.666787 | 16.181358 | 0.290053 | 0.146744 | ... | 64.267943 | 1.050819 | 8.14475 | 469.584778 | 0.0 | 29.965943 | 5.998434 | 17.842245 | 7.64381 | 0.008911 |
| std | 0.106658 | 0.08444 | 0.22806 | 2.345157 | 1.10454 | 4.697907 | 1.392025 | 69.756992 | 0.11393 | 0.060568 | ... | 6.911066 | 2.960102 | 5.686859 | 46.592967 | 0.0 | 0.119905 | 2.860682 | 9.433945 | 3.191881 | 0.093984 |
| min | 41.644612 | -87.930995 | 0.0 | 2007.0 | 5.0 | 22.0 | 0.0 | 1.0 | 0.037292 | 0.007815 | ... | 47.0 | 0.0 | 0.0 | 416.0 | 0.0 | 29.59 | 0.1 | 1.0 | 2.1 | 0.0 |
| 25% | 41.750498 | -87.752411 | 0.0 | 2007.0 | 7.0 | 28.0 | 2.0 | 2.0 | 0.208237 | 0.112582 | ... | 60.0 | 0.0 | 4.0 | 427.0 | 0.0 | 29.89 | 3.9 | 8.0 | 5.8 | 0.0 |
| 50% | 41.862292 | -87.696269 | 0.0 | 2009.0 | 8.0 | 32.0 | 3.0 | 4.0 | 0.282471 | 0.15118 | ... | 65.0 | 0.0 | 8.0 | 451.0 | 0.0 | 29.97 | 5.8 | 19.0 | 7.1 | 0.0 |
| 75% | 41.947227 | -87.648064 | 0.0 | 2011.0 | 9.0 | 35.0 | 4.0 | 12.0 | 0.385369 | 0.190375 | ... | 70.0 | 0.0 | 13.0 | 518.0 | 0.0 | 30.05 | 7.8 | 25.0 | 9.4 | 0.0 |
| max | 42.01743 | -87.531635 | 1.0 | 2013.0 | 10.0 | 41.0 | 4.0 | 2206.0 | 0.518433 | 0.24844 | ... | 76.0 | 15.0 | 22.0 | 557.0 | 0.0 | 30.33 | 15.4 | 36.0 | 29.2 | 1.0 |


In [31]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8304 entries, 0 to 8303
Data columns (total 38 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   dataset       8304 non-null   object 
 1   date          8304 non-null   object 
 2   species       8304 non-null   object 
 3   trap          8304 non-null   object 
 4   latitude      8304 non-null   float64
 5   longitude     8304 non-null   float64
 6   wnvpresent    8304 non-null   float64
 7   year          8304 non-null   int64  
 8   month         8304 non-null   int64  
 9   week          8304 non-null   int64  
 10  dayofweek     8304 non-null   int64  
 11  nummosquitos  8304 non-null   float64
 12  geometry      8304 non-null   object 
 13  dist_s1       8304 non-null   float64
 14  dist_s2       8304 non-null   float64
 15  nearest_stat  8304 non-null   int64  
 16  station       8304 non-null   int64  
 17  tmax          8304 non-null   int64  
 18  tmin          8304 non-null 

## Feature Engineering

In [32]:
# Categorical variables
non_numeric_features = [
    'species',
    'trap',
]

In [34]:
# Create a new dataframe to hold dummies for all categorical features, plus the target (wnvpresent)
dummies_df = train[non_numeric_features]
dummies_df = pd.get_dummies(dummies_df, drop_first=True)
dummy_plus_wnvpresent = pd.concat(objs = [dummies_df, train[['wnvpresent']]] , axis = 1)

In [36]:
# Rank dummy features by their correlation with wnvpresent (looking for correlation >0.4 or <-0.4)
corr_wnvpresent = dummy_plus_wnvpresent.corr().sort_values('wnvpresent', ascending=False)
corr_wnvpresent['wnvpresent']

wnvpresent                1.000000
trap_T900                 0.072840
trap_T003                 0.035874
trap_T225                 0.032690
trap_T086                 0.032537
                            ...   
trap_T043                -0.020760
trap_T148                -0.021435
trap_T017                -0.022883
trap_T046                -0.027178
species_CULEX RESTUANS   -0.098416
Name: wnvpresent, Length: 138, dtype: float64
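
In the output above, the visible correlations range from about -0.10 to 0.07, so no dummy feature comes close to the ±0.4 cut-off. A threshold filter of that kind could be sketched as below. The frame is synthetic and the column names `dummy_related` and `dummy_unrelated` are hypothetical; only the 0.4 threshold is taken from the cell comment.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for dummy_plus_wnvpresent (values are made up)
rng = np.random.default_rng(1)
target = rng.integers(0, 2, size=1000)
demo = pd.DataFrame({
    "wnvpresent": target,
    # Agrees with the target about 80% of the time
    "dummy_related": (target + (rng.random(1000) < 0.2)) % 2,
    # Independent of the target
    "dummy_unrelated": rng.integers(0, 2, size=1000),
})

# Correlation of each dummy with the target, excluding the target itself
corr_with_target = demo.corr()["wnvpresent"].drop("wnvpresent")

# Keep only features whose absolute correlation exceeds the cut-off
threshold = 0.4
selected = corr_with_target[corr_with_target.abs() > threshold]
print(selected)
```

Filtering on the absolute value applies the same cut-off to positive and negative correlations in one step.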

## Preparing Train-Test (Kaggle) Data and Further Split Train Data into Train and Holdout

In [23]:
# Split train data into X (all features except wnvpresent) and y (wnvpresent)
features = [col for col in train.columns if col != 'wnvpresent']
X = train[features]
y = train['wnvpresent']

In [25]:
y.value_counts(normalize = True)

0.0    0.944966
1.0    0.055034
Name: wnvpresent, dtype: float64

y is highly imbalanced: only about 5.5% of the data points have West Nile Virus present. It is therefore important to stratify the split so that the train and holdout sets keep about the same proportion of positive and negative cases.

In [28]:
# Further split train data into train and holdout data
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, 
    y,
    stratify = y,
    random_state=42
)
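
One way to confirm that stratification preserved the class balance is to compare normalised value counts across the full, train and holdout targets. The sketch below uses a synthetic target with roughly the same imbalance as `wnvpresent`; the `_demo` names and the single `feature` column are placeholders, not the project's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in target with roughly 5.5% positives (values are made up)
rng = np.random.default_rng(42)
y_demo = pd.Series((rng.random(8304) < 0.055).astype(float), name="wnvpresent")
X_demo = pd.DataFrame({"feature": rng.normal(size=len(y_demo))})

# Same call pattern as the cell above: stratify on the target
X_tr, X_ho, y_tr, y_ho = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

# Class proportions side by side; the three columns should be nearly identical
split_check = pd.DataFrame({
    "full": y_demo.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "holdout": y_ho.value_counts(normalize=True),
})
print(split_check)
```

With the default `test_size` of 0.25, the holdout set receives a quarter of the rows.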

## Model Exploration

### Logistic Regression

In [30]:
# Instantiate model
logreg = LogisticRegression()

# Fit model
logreg.fit(X_train, y_train)

print(f'Logistic Regression Intercept: {logreg.intercept_}')
print(f'Logistic Regression Coefficient: {logreg.coef_}')

ValueError: could not convert string to float: 'train'
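
The error arises because `X` still contains string columns (for example `dataset`, whose value `'train'` appears in the message), and `LogisticRegression` expects numeric input. One possible way forward, sketched here on a small synthetic frame rather than the real data, is to drop the constant `dataset` label, scale the numeric columns, and one-hot encode the categorical ones inside a `Pipeline`. The choice of which columns to keep and the `max_iter` value are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic frame mimicking the mix of dtypes in `train` (values are made up)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "dataset": ["train"] * n,  # string column of the kind that triggered the error
    "species": rng.choice(["CULEX PIPIENS", "CULEX RESTUANS"], size=n),
    "latitude": rng.normal(41.85, 0.1, size=n),
    "tmax": rng.integers(60, 95, size=n),
    "wnvpresent": rng.integers(0, 2, size=n).astype(float),
})

# Drop the target and the constant `dataset` label, which carries no signal
X_demo = demo.drop(columns=["wnvpresent", "dataset"])
y_demo = demo["wnvpresent"]

numeric_cols = X_demo.select_dtypes(include="number").columns.tolist()
categorical_cols = ["species"]

# Scale numeric columns and one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("logreg", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_demo, y_demo)

print(pipe.named_steps["logreg"].intercept_)
print(pipe.named_steps["logreg"].coef_.shape)
```

Keeping the preprocessing inside the pipeline means the scaler and encoder are fitted on the training rows only and then reused unchanged on the holdout data.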