# Project 4: Predict West Nile Virus
### Section 5. Model Exploration

## Problem Statement

1. As employees of the Disease And Treatment Agency, division of Societal Cures In Epidemiology and New Creative Engineering (DATA-SCIENCE), we are tasked with better understanding the mosquito population and advising on interventions that are both beneficial and cost-effective for the city.


2. Through this project, we hope to:
- Identify the features that are most important for predicting the presence of West Nile Virus (for example, by ranking the coefficients of a logistic regression model; the coefficients are only comparable across features when the features share a common scale)
- Predict the probability of West Nile Virus by location, so that decision makers can plan pesticide deployment across the city effectively and reduce cost.
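
To make the first goal concrete, here is a minimal sketch of ranking features by logistic regression coefficients. It runs on synthetic stand-in data, not the project dataset: the column names `signal`, `noise_small_scale` and `noise_large_scale` are hypothetical, and standardizing inside a `Pipeline` is an assumption about how one might make the coefficients comparable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical columns): only 'signal' drives the label
rng = np.random.default_rng(42)
n = 2000
X_demo = pd.DataFrame({
    'signal': rng.normal(0, 1, n),
    'noise_small_scale': rng.normal(0, 0.01, n),
    'noise_large_scale': rng.normal(0, 100, n),
})
y_demo = (X_demo['signal'] + rng.normal(0, 0.5, n) > 0).astype(int)

# Standardize first so coefficient magnitudes are on a common scale
pipe = Pipeline([
    ('ss', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_demo, y_demo)

# Rank features by absolute coefficient, largest first
coefs = pd.Series(pipe.named_steps['logreg'].coef_[0], index=X_demo.columns)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

The sign of each coefficient gives the direction of the association, while the absolute value on standardized inputs gives a rough ordering of importance. Strongly correlated features can split importance between them, so the ranking should be read as indicative only.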

## Import Libraries

In [1]:
#!pip install shapely
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# from shapely import geometry
# from shapely.geometry import Point, Polygon
# import geopandas as gpd
# from datetime import timedelta
# import math

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, plot_confusion_matrix

## Load Data

In [2]:
# Load datasets
df = pd.read_csv('../data/final_df.csv', index_col='Unnamed: 0')

In [3]:
# Split into train and test (kaggle) data 
train = df[df['dataset']=='train'].copy()
test = df[df['dataset']=='test'].copy()
print(train.shape)
print(test.shape)

(8304, 252)
(43035, 252)


In [4]:
train.drop(columns='dataset', inplace=True)
test.drop(columns='dataset', inplace=True)

In [5]:
train.describe()

Unnamed: 0,latitude,longitude,nummosquitos,tmax,tmin,tavg,depart,dewpoint,wetbulb,heat,...,codesum_TSRA BR HZ VCTS,codesum_TSRA FG+ BR HZ,codesum_TSRA RA,codesum_TSRA RA BR,codesum_TSRA RA BR HZ,codesum_TSRA RA BR HZ VCTS,codesum_TSRA RA BR VCTS,codesum_TSRA RA VCTS,codesum_VCTS,wnvpresent
count,8304.0,8304.0,8304.0,8304.0,8304.0,8304.0,8304.0,8304.0,8304.0,8304.0,...,8304.0,8304.0,8304.0,8304.0,8304.0,8304.0,8304.0,8304.0,8304.0,8304.0
mean,41.8458,-87.696229,16.095255,81.248434,62.443401,72.093931,2.591402,59.334056,64.267943,1.050819,...,0.006142,0.0,0.029383,0.037211,0.0,0.0,0.010597,0.0,0.003974,0.055034
std,0.106658,0.08444,69.585928,8.402787,7.802554,7.63033,6.624498,7.977426,6.911066,2.960102,...,0.078132,0.0,0.168889,0.18929,0.0,0.0,0.102402,0.0,0.062918,0.22806
min,41.644612,-87.930995,1.0,57.0,41.0,50.0,-12.0,38.0,47.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,41.750498,-87.752411,2.0,78.0,58.0,69.0,-2.0,54.0,60.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,41.862292,-87.696269,4.0,83.0,64.0,73.0,4.0,59.0,65.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,41.947227,-87.648064,12.0,87.0,69.0,78.0,7.0,67.0,70.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,42.01743,-87.531635,2206.0,97.0,79.0,87.0,20.0,73.0,76.0,15.0,...,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0


In [6]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8304 entries, 0 to 8303
Columns: 251 entries, latitude to wnvpresent
dtypes: bool(10), float64(14), int64(227)
memory usage: 15.4 MB


## Preparing Train-Test (Kaggle) Data and Further Split Train Data into Train and Holdout

In [7]:
# Split train data into X (all features except wnvpresent) and y (wnvpresent)
features = [col for col in train.columns if col != 'wnvpresent']
X = train[features]
y = train['wnvpresent']

In [8]:
y.value_counts(normalize = True)

0.0    0.944966
1.0    0.055034
Name: wnvpresent, dtype: float64

In [9]:
X.columns[X.isna().any()].tolist()

[]

y is highly imbalanced: only about 5.5% of the data points have West Nile Virus present. It is therefore important to stratify the split so that the train and holdout sets keep about the same proportion of positive and negative cases.

In [10]:
# Further split train data into train and holdout data
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, 
    y,
    stratify = y,
    random_state=42
)
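
One way to confirm that a stratified split preserves the class balance is to compare the positive rate in each part. The sketch below uses synthetic labels with roughly the imbalance reported above (the names `X_demo`, `y_demo`, `train_rate` and `holdout_rate` are illustrative); the same check could be applied to `y_train` and `y_holdout`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels: 5.5% positives, similar to the imbalance in wnvpresent
y_demo = pd.Series(np.r_[np.zeros(9450), np.ones(550)])
X_demo = pd.DataFrame({'x': np.arange(len(y_demo))})

# Same call pattern as above: stratify on the target, default 25% holdout
X_tr, X_ho, y_tr, y_ho = train_test_split(
    X_demo,
    y_demo,
    stratify=y_demo,
    random_state=42
)

# With a 0/1 target, the mean is the share of positive cases
train_rate = y_tr.mean()
holdout_rate = y_ho.mean()
print(f'Positive rate in train:   {train_rate:.4f}')
print(f'Positive rate in holdout: {holdout_rate:.4f}')
```

Without `stratify`, a random split of such a rare class can leave the holdout set with noticeably more or fewer positives than the training set, which makes holdout metrics harder to interpret.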

## Model Exploration

### Logistic Regression

In [11]:
# Instantiate model
logreg = LogisticRegression()

# Fit model
logreg.fit(X_train, y_train)

print(f'Logistic Regression Intercept: {logreg.intercept_}')
print(f'Logistic Regression Coefficient: {logreg.coef_}')

Logistic Regression Intercept: [-0.00125884]
Logistic Regression Coefficient: [[-4.19546654e-02  9.02612790e-02  9.63238974e-04  8.81910256e-02
   1.49713508e-01  1.15150931e-01 -3.77111141e-01 -4.64546654e-02
   5.52076336e-03 -1.10248842e-01  8.66833948e-02  2.44824701e-02
  -1.26854412e-02 -3.67544860e-02 -3.01340733e-02 -3.62938586e-02
   2.89128182e-02 -5.29547165e-04 -9.25824459e-03  1.24804375e-02
  -2.52663245e-02 -3.34939074e-02 -3.18396111e-02 -8.99427200e-03
  -1.16856386e-02 -9.28506257e-03  1.63652358e-02 -4.84739105e-04
   0.00000000e+00 -3.19777847e-04  3.62399785e-02 -3.19777847e-04
  -2.89882454e-02 -2.01029396e-02  0.00000000e+00 -7.09132351e-02
   0.00000000e+00 -5.64048644e-02  0.00000000e+00  1.47307849e-01
   0.00000000e+00 -3.96720566e-03 -5.50782068e-03 -9.34708902e-03
  -1.29997848e-02 -1.78333479e-03 -2.51750881e-02  1.80591855e-03
  -1.91776801e-02  1.29948280e-02  3.35021335e-02  2.61030148e-02
   2.64881559e-02  5.25294378e-03  2.73969178e-02  1.67094878e-0

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
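
The warning above says the solver stopped at its iteration limit, so the printed coefficients may not be the converged solution. Following the two remedies the message itself suggests, a possible sketch is to standardize the features and raise `max_iter`. The data here is synthetic with deliberately mismatched feature scales, and `max_iter=1000` is an arbitrary illustrative value, not a tuned setting.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (hypothetical stand-ins)
rng = np.random.default_rng(0)
n = 1000
X_demo = np.column_stack([
    rng.normal(80, 8, n),        # temperature-like values
    rng.normal(-87.7, 0.08, n),  # longitude-like values
    rng.poisson(16, n),          # count-like values
])
y_demo = (X_demo[:, 0] + rng.normal(0, 8, n) > 85).astype(int)

# Scale inside the pipeline so the scaler is fitted on training data only
pipe = Pipeline([
    ('ss', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=1000)),
])

# Turn a ConvergenceWarning into an error so a failure cannot pass silently
with warnings.catch_warnings():
    warnings.simplefilter('error', category=ConvergenceWarning)
    pipe.fit(X_demo, y_demo)

n_iter = pipe.named_steps['logreg'].n_iter_[0]
print(f'Solver iterations used: {n_iter}')
```

Keeping the scaler inside the pipeline also means the same transformation is applied automatically when predicting on the holdout or Kaggle test data.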
