# Predicting West Nile Virus in Chicago
## Cleaning, feature engineering, and EDA


* [1. Introduction and Imports](#intro)
* [2. Cleaning](#cleaning)
    * [2.1 Weather](#weather_c)
        * [2.1.1 Correcting Data Types](#w_missing)
        * [2.1.2 Snow Fall](#snow_fall)
        * [2.1.3 Depth](#depth)
        * [2.1.4 Water1](#water1)
        * [2.1.5 PrecipTotal](#precip)
        * [2.1.6 Filling missing numerical data](#w_miss)
        * [2.1.7 Weather codes](#codesum)
        * [2.1.8 Two weather stations](#two_w)
    * [2.2 Mosquito and Spray data](#mosquito_and_spray)
        * [2.2.1 Spray data](#spray)
        * [2.2.2 Traps](#trap)
        * [2.2.3 Uniting the two](#ms)
    * [2.3 EDA and Feature Engineering](#EDA)
        * [2.3.1 Distributions and Pairplots](#dist)
        * [2.3.2 Species](#species)
        * [2.3.3 WNV comparisons](#grouped)

## Introduction

According to the CDC, West Nile Virus (WNV) is the leading cause of mosquito-borne disease in the US. The Chicago area reported 6 cases of WNV in the summer of 2020. Although this figure is small, the disease is dangerous: around 1 in 150 people who become infected develop a severe, sometimes fatal, illness. Mosquitos, beyond being a nuisance, are a public health concern, and it’s important for densely populated urban areas to control the mosquito population.

The primary method of mosquito control is to spray insecticide over large areas of land. Along with environmental costs, there are significant costs and inconveniences associated with controlling mosquito populations. The city of Chicago Department of Public Health treats 40,000 water basins each year with larvicide and monitors 83 traps around the city each week for mosquitos with WNV. It’s costly both in terms of time and resources, and yet there are cases of WNV reported every year. 

This notebook is centered on the wrangling, cleaning, and feature engineering of three data sets in order to predict the presence of WNV in Chicago and surrounding communities. The three data sets are:

1. historical data on mosquito spraying with insecticide
2. historical weather data from two monitoring stations
3. mosquito trap data from the City of Chicago Department of Public Health, which details if and when mosquitos were trapped, and whether or not they carried WNV

<a id='intro'></a>

In [1]:
#importing relevant packages

import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
import pandas.core.algorithms as algos
from pandas import Series
import re
import traceback
import string
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

sns.set_style('whitegrid')

In [2]:
#loading the three data sets

sp = pd.read_csv('./data/spray.csv.zip') #spray data
df= pd.read_csv('./data/train.csv.zip') #mosquito data
w = pd.read_csv('./data/weather.csv.zip') # weather data


<a id='cleaning'></a>

## 2. Cleaning

<a id='weather_c'></a>

## 2.1 Weather Cleaning


In [3]:
w.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2944 entries, 0 to 2943
Data columns (total 22 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Station      2944 non-null   int64  
 1   Date         2944 non-null   object 
 2   Tmax         2944 non-null   int64  
 3   Tmin         2944 non-null   int64  
 4   Tavg         2944 non-null   object 
 5   Depart       2944 non-null   object 
 6   DewPoint     2944 non-null   int64  
 7   WetBulb      2944 non-null   object 
 8   Heat         2944 non-null   object 
 9   Cool         2944 non-null   object 
 10  Sunrise      2944 non-null   object 
 11  Sunset       2944 non-null   object 
 12  CodeSum      2944 non-null   object 
 13  Depth        2944 non-null   object 
 14  Water1       2944 non-null   object 
 15  SnowFall     2944 non-null   object 
 16  PrecipTotal  2944 non-null   object 
 17  StnPressure  2944 non-null   object 
 18  SeaLevel     2944 non-null   object 
 19  Result

In [4]:
w.head()

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,...,CodeSum,Depth,Water1,SnowFall,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed
0,1,2007-05-01,83,50,67,14,51,56,0,2,...,,0,M,0.0,0.0,29.1,29.82,1.7,27,9.2
1,2,2007-05-01,84,52,68,M,51,57,0,3,...,,M,M,M,0.0,29.18,29.82,2.7,25,9.6
2,1,2007-05-02,59,42,51,-3,42,47,14,0,...,BR,0,M,0.0,0.0,29.38,30.09,13.0,4,13.4
3,2,2007-05-02,60,43,52,M,42,47,13,0,...,BR HZ,M,M,M,0.0,29.44,30.08,13.3,2,13.4
4,1,2007-05-03,66,46,56,2,40,48,9,0,...,,0,M,0.0,0.0,29.39,30.12,11.7,7,11.9


In [5]:
w.Water1.value_counts()

M    2944
Name: Water1, dtype: int64

### 2.1.1 Correcting Data Types

<a id='w_missing'></a>

To represent the numerical data correctly, I first replace 'M', 'T', and '-', and then cast the relevant columns to floats.

'M' indicates missing data. I replace 'M' with np.nan in order to handle missing numerical data more easily.

'T' indicates 'trace', for example a snow flurry that does not stick. 'T' is only relevant for precipitation, where I replace it with 0.01. For snowfall, where 'T' is also present, there are only 13 non-zero observations, too sparse to draw any conclusions from.

'-' appears only in Sunrise and Sunset, which are reported for only one of the two stations, so every other row is missing. I fill each missing sunrise and sunset value from the following row, and forward fill anything still missing at the end of the table.
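As an alternative sketch, the missing-value markers could be declared when the file is read, so that numeric columns arrive as floats straight away. The example below uses a small made-up CSV rather than the real weather file, and the choice of `na_values` is an assumption for illustration; it is not what the notebook does in the next cell.

```python
import io

import pandas as pd

# Made-up rows in the same shape as weather.csv (values are illustrative only).
raw = io.StringIO(
    "Station,Date,Tavg,PrecipTotal,Sunrise\n"
    "1,2007-05-01,67,0.00,0448\n"
    "2,2007-05-01,M,  T,-\n"
    "1,2007-05-02,51,0.13,0447\n"
)

# Treat 'M' (missing) and '-' (not reported) as NaN while parsing.
toy = pd.read_csv(raw, na_values=["M", "-"])

# 'T' (trace) still has to be mapped to a small number by hand.
toy["PrecipTotal"] = (
    toy["PrecipTotal"].astype(str).str.strip().replace("T", "0.01").astype(float)
)

# Fill the gaps in Sunrise from the next row, then from the previous one.
toy["Sunrise"] = toy["Sunrise"].bfill().ffill()

print(toy.dtypes)
print(toy)
```

Declaring the markers up front avoids a separate pass over the whole frame, at the cost of treating every 'M' or '-' in any column as missing.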



In [6]:
w = w.replace('M',np.nan)
w = w.replace(['T','  T'],0.01)
#'-' marks rows with no sunrise/sunset; fill from the next row, then forward fill any left at the end
sun_cols = ['Sunrise','Sunset']
w[sun_cols] = w[sun_cols].replace('-',np.nan).bfill().ffill()

w.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2944 entries, 0 to 2943
Data columns (total 22 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Station      2944 non-null   int64  
 1   Date         2944 non-null   object 
 2   Tmax         2944 non-null   int64  
 3   Tmin         2944 non-null   int64  
 4   Tavg         2933 non-null   object 
 5   Depart       1472 non-null   object 
 6   DewPoint     2944 non-null   int64  
 7   WetBulb      2940 non-null   object 
 8   Heat         2933 non-null   object 
 9   Cool         2933 non-null   object 
 10  Sunrise      2944 non-null   object 
 11  Sunset       2944 non-null   object 
 12  CodeSum      2944 non-null   object 
 13  Depth        1472 non-null   object 
 14  Water1       0 non-null      float64
 15  SnowFall     1472 non-null   object 
 16  PrecipTotal  2942 non-null   object 
 17  StnPressure  2940 non-null   object 
 18  SeaLevel     2935 non-null   object 
 19  Result

In [7]:
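as_float = ['Tavg','Depart','WetBulb','Heat','Cool','Depth','StnPressure','SeaLevel','AvgSpeed','PrecipTotal',
           'Sunrise','Sunset']

#assigning by column name so the columns actually take the float dtype
w[as_float] = w[as_float].astype('float')
    
#quickly engineering a feature: change in sunrise and sunset from the previous row
#note: the times are HHMM clock values, so a change across the hour (e.g. 1900 to 1859) shows up as -41, not -1
w['sunrise_diff'] = w['Sunrise'].diff().fillna(0)
w['sunset_diff'] = w['Sunset'].diff().fillna(0)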


### 2.1.2 Snow Fall
<a id='snow_fall'></a>

Snow fall is almost entirely 0, except for 12 observations of trace snow fall and 1 of 0.1 inches. For this reason, I drop the column.

In [8]:
w.SnowFall.value_counts()

0.0     1459
0.01      12
0.1        1
Name: SnowFall, dtype: int64

In [9]:
w = w.drop('SnowFall',axis=1)

### 2.1.3 Depth
<a id='depth'></a>
Depth is 0 wherever it is recorded, so it is dropped.

In [10]:
w.Depth.value_counts()

0.0    1472
Name: Depth, dtype: int64

In [11]:
w = w.drop('Depth',axis=1)

### 2.1.4 Water1
<a id='water1'></a>
Water1 is entirely missing (every value is 'M'), so it is dropped.

In [12]:
w['Water1'].value_counts()

Series([], Name: Water1, dtype: int64)

In [13]:
w = w.drop('Water1',axis=1)

### 2.1.5 PrecipTotal
<a id='precip'></a>

I assume that missing values for precipitation are simply days on which nothing was recorded because there was no precipitation. I have not found any indication that too much rain (overfilling the measuring device) would have caused a missing value to be entered.

For this reason, I fill missing values with 0.

In [14]:
w['PrecipTotal'] = w['PrecipTotal'].fillna(0)

### 2.1.6 Filling missing numerical data

<a id='w_miss'></a>

There are few columns with a significant amount of missing data.

Depart is only measured at one of the two weather stations each day, which is why half of its values are missing. This is handled by using the value of Depart from station 1 for station 2 as well. The weather at the two stations is not different enough to suggest that separate values of Depart are necessary.

The other columns with missing data do not follow an obvious trend. My assumption is that the missing values are random and could be due to chance or errors with the measurement devices. I will fill the missing values with a forward fill. Because the rows alternate between the two stations, a forward fill takes the most recent reading from the row above, either the other station on the same day or the previous day, which is a good enough approximation for the few values that are missing.
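To make the effect of a plain forward fill on interleaved station rows concrete, here is a minimal sketch on made-up numbers. It also shows a per-station variant using `groupby('Station')`, which is an alternative for comparison and not the approach taken in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows that alternate between station 1 and station 2, as in the weather data.
toy = pd.DataFrame({
    "Station": [1, 2, 1, 2],
    "Date": ["2007-05-01", "2007-05-01", "2007-05-02", "2007-05-02"],
    "Depart": [14.0, np.nan, -3.0, np.nan],
    "WetBulb": [56.0, 57.0, np.nan, 47.0],
})

# Plain forward fill: each gap takes the value from the row directly above.
plain = toy.ffill()

# Per-station forward fill: each gap takes that station's own previous reading.
per_station = toy.copy()
per_station["WetBulb"] = toy.groupby("Station")["WetBulb"].ffill()

print(plain)
print(per_station)
```

With the plain fill, station 2 inherits station 1's Depart for the same day, which is the behaviour described above. The missing station 1 WetBulb is filled from station 2's reading the day before, whereas the per-station variant fills it from station 1's own previous reading.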

In [15]:
w.isna().sum()

Station            0
Date               0
Tmax               0
Tmin               0
Tavg              11
Depart          1472
DewPoint           0
WetBulb            4
Heat              11
Cool              11
Sunrise            0
Sunset             0
CodeSum            0
PrecipTotal        0
StnPressure        4
SeaLevel           9
ResultSpeed        0
ResultDir          0
AvgSpeed           3
sunrise_diff       0
sunset_diff        0
dtype: int64

In [16]:
w[w.Tavg.isna()]

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,...,Sunset,CodeSum,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,sunrise_diff,sunset_diff
7,2,2007-05-04,78,51,,,42,50.0,,,...,1853.0,,0.0,29.36,30.04,10.1,7,10.4,-1.0,1.0
505,2,2008-07-08,86,46,,,68,71.0,,,...,1929.0,TS RA,0.28,29.16,29.8,7.4,24,8.3,1.0,0.0
675,2,2008-10-01,62,46,,,41,47.0,,,...,1732.0,,0.0,29.3,29.96,10.9,33,11.0,1.0,-2.0
1637,2,2011-07-22,100,71,,,70,74.0,,,...,1920.0,TS TSRA BR,0.14,29.23,29.86,3.8,10,8.2,1.0,-1.0
2067,2,2012-08-22,84,72,,,51,61.0,,,...,1842.0,,0.0,29.39,,4.7,19,,2.0,-1.0
2211,2,2013-05-02,71,42,,,39,45.0,,,...,1851.0,,0.0,29.51,30.17,15.8,2,16.1,-1.0,1.0
2501,2,2013-09-24,91,52,,,48,54.0,,,...,1744.0,,0.0,29.33,30.0,5.8,9,7.7,1.0,-2.0
2511,2,2013-09-29,84,53,,,48,54.0,,,...,1735.0,RA BR,0.22,29.36,30.01,6.3,36,7.8,1.0,-2.0
2525,2,2013-10-06,76,48,,,44,50.0,,,...,1724.0,RA DZ BR,0.06,29.1,29.76,10.1,25,10.6,1.0,-1.0
2579,2,2014-05-02,80,47,,,43,47.0,,,...,1851.0,RA,0.04,29.1,29.79,10.7,23,11.9,-1.0,1.0


In [17]:
w[w.StnPressure.isna()]

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,...,Sunset,CodeSum,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,sunrise_diff,sunset_diff
87,2,2007-06-13,86,68,77.0,,53,62.0,0.0,12.0,...,1928.0,,0.0,,,7.0,5,,0.0,1.0
848,1,2009-06-26,86,69,78.0,7.0,60,,0.0,13.0,...,1931.0,,0.0,,29.85,6.4,4,8.2,0.0,0.0
2410,1,2013-08-10,81,64,73.0,0.0,57,,0.0,8.0,...,1900.0,,0.0,,30.08,5.3,5,6.5,0.0,0.0
2411,2,2013-08-10,81,68,75.0,,55,63.0,0.0,10.0,...,1859.0,,0.0,,30.07,6.0,6,7.4,1.0,-41.0


In [18]:
w[w.AvgSpeed.isna()]

Unnamed: 0,Station,Date,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,...,Sunset,CodeSum,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,sunrise_diff,sunset_diff
87,2,2007-06-13,86,68,77.0,,53,62.0,0.0,12.0,...,1928.0,,0.0,,,7.0,5,,0.0,1.0
1745,2,2011-09-14,60,48,54.0,,45,51.0,11.0,0.0,...,1803.0,RA BR HZ FU,0.01,29.47,,6.0,32,,1.0,-2.0
2067,2,2012-08-22,84,72,,,51,61.0,,,...,1842.0,,0.0,29.39,,4.7,19,,2.0,-1.0


In [19]:
w = w.ffill()

In [20]:
w.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2944 entries, 0 to 2943
Data columns (total 21 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Station       2944 non-null   int64  
 1   Date          2944 non-null   object 
 2   Tmax          2944 non-null   int64  
 3   Tmin          2944 non-null   int64  
 4   Tavg          2944 non-null   float64
 5   Depart        2944 non-null   float64
 6   DewPoint      2944 non-null   int64  
 7   WetBulb       2944 non-null   float64
 8   Heat          2944 non-null   float64
 9   Cool          2944 non-null   float64
 10  Sunrise       2944 non-null   float64
 11  Sunset        2944 non-null   float64
 12  CodeSum       2944 non-null   object 
 13  PrecipTotal   2944 non-null   float64
 14  StnPressure   2944 non-null   float64
 15  SeaLevel      2944 non-null   float64
 16  ResultSpeed   2944 non-null   float64
 17  ResultDir     2944 non-null   int64  
 18  AvgSpeed      2944 non-null 

### 2.1.7 Codesum
<a id='codesum'></a>

CodeSum lists all of the weather codes observed on a particular day, separated by spaces. I break these apart into separate columns, assigning 1 if the code occurred and 0 otherwise, and keep only the codes that appear in more than 1% of observations.

In [21]:
#breaking out CodeSum into separate indicator columns
#getting a list of all unique codes in the weather set
codes = []
for code in w.CodeSum.unique():
    for c in code.split(' '):
        codes.append(c)
#dropping the empty string left behind by blank entries
codes = [c for c in pd.Series(codes).unique() if c]

#for each code, creating a column where value is 1 if it occurred, 0 if not, then dropping CodeSum
#note: `c in s` is a substring test, so 'RA' also matches 'TSRA'
for c in codes:
    col = [1 if c in s else 0 for s in w.CodeSum]
    
    #keeping only codes present in more than 1% of observations
    if np.sum(col)>(len(col)/100): 
        w[c] = col

w = w.drop('CodeSum',axis=1)
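Because the `c in s` test in the cell above is a substring test, a code such as 'RA' is also counted on days that only report 'TSRA'. A minimal sketch of a whole-code alternative follows. The helper name `code_indicators` and its `min_share` parameter are hypothetical, and the toy series is made up; whether exact matching would change the columns kept in the real data has not been checked here.

```python
import pandas as pd


def code_indicators(code_sum, min_share=0.01):
    """One 0/1 column per weather code, matching whole codes only.

    `code_indicators` and `min_share` are illustrative names, not part of the notebook.
    """
    # Split each entry on whitespace; blank entries become empty lists.
    tokens = code_sum.fillna("").str.split()
    all_codes = sorted({c for row in tokens for c in row})
    flags = pd.DataFrame(
        {c: [int(c in row) for row in tokens] for c in all_codes},
        index=code_sum.index,
    )
    # Keep codes seen in more than `min_share` of the rows.
    return flags.loc[:, flags.mean() > min_share]


toy = pd.Series(["", "TSRA BR", "RA", "TS RA"])
flags = code_indicators(toy, min_share=0.0)
print(flags)
```

Here the 'TSRA BR' day is flagged for TSRA and BR only, not for RA or TS.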


In [22]:
w.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2944 entries, 0 to 2943
Data columns (total 27 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Station       2944 non-null   int64  
 1   Date          2944 non-null   object 
 2   Tmax          2944 non-null   int64  
 3   Tmin          2944 non-null   int64  
 4   Tavg          2944 non-null   float64
 5   Depart        2944 non-null   float64
 6   DewPoint      2944 non-null   int64  
 7   WetBulb       2944 non-null   float64
 8   Heat          2944 non-null   float64
 9   Cool          2944 non-null   float64
 10  Sunrise       2944 non-null   float64
 11  Sunset        2944 non-null   float64
 12  PrecipTotal   2944 non-null   float64
 13  StnPressure   2944 non-null   float64
 14  SeaLevel      2944 non-null   float64
 15  ResultSpeed   2944 non-null   float64
 16  ResultDir     2944 non-null   int64  
 17  AvgSpeed      2944 non-null   float64
 18  sunrise_diff  2944 non-null 

### 2.1.8 Two weather stations
<a id='two_w'></a>

There are two weather observations for each day, one for each weather station. Here I separate the stations and re-merge on date. This yields one observation for each day, which is necessary to merge with the mosquito data.

I resample and compare sample means for each variable. Although there are statistically significant differences, they are not likely to be practically significant. For example, Tavg differs between the groups by only about 1.2 degrees F, less than the margin of error on most home thermometers.

Because the groups are close, but not identical, I will average the two to use for analysis. The difference is not large enough to justify using the stations separately, and one or the other would be culled later on due to collinearity.
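The distinction between statistical and practical significance can be illustrated with a paired comparison on synthetic data. This is a sketch under stated assumptions, not the resampling procedure used in this notebook: the temperatures are simulated, the 1.2 degree offset is built in by hand, and `scipy.stats.ttest_rel` is used because the two stations observe the same days.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)

# Synthetic daily temperatures: both stations see the same weather,
# station 2 reads about 1.2 degrees F warmer, plus independent sensor noise.
daily = rng.normal(70, 10, size=200)
station_1 = daily + rng.normal(0, 1, size=200)
station_2 = daily + 1.2 + rng.normal(0, 1, size=200)

# Per-day difference: its mean is the effect size, in degrees F.
diff = station_1 - station_2
t_stat, p_value = ttest_rel(station_1, station_2)

print(f"mean difference: {diff.mean():.2f} F")
print(f"std of difference: {diff.std(ddof=1):.2f} F")
print(f"paired t = {t_stat:.1f}, p = {p_value:.2g}")
```

The p-value is tiny even though the mean difference is only about a degree, which is the situation described above: the test detects the offset, but the offset itself is small.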

In [23]:
#breaking two weather stations out based on station and merging together in one 
w_1 = w[w['Station']==1].drop('Station',axis=1)
w_2 = w[w['Station']==2].drop('Station',axis=1)

In [24]:
def resample(df):
    """For each column, draw len(column) sample means, each from 100 values sampled with replacement."""
    res = {}
    for col, ser in df.items():
        res[col] = [np.mean(np.random.choice(ser,size=100)) for i in range(len(ser))]
    return pd.DataFrame(res)


w_1_sam = resample(w_1.drop('Date',axis=1))
w_2_sam = resample(w_2.drop('Date',axis=1))

In [25]:
from scipy.stats import ttest_ind

for col in w_1_sam.columns:
    t, p = ttest_ind(w_1_sam[col],w_2_sam[col])
    if p < 0.1**(len(w_1_sam.columns)):
        print('for {}, t is {}, and p is {}'.format(col,t,p))

for Tmin, t is -56.62076913856423, and p is 0.0
for Tavg, t is -31.02886440802983, and p is 4.020404591911731e-183
for WetBulb, t is -14.378232502709048, and p is 2.356703255293739e-45
for Heat, t is 20.42288964763542, and p is 8.318272821389424e-87
for Cool, t is -38.86456329038449, and p is 4.848281468526567e-267
for StnPressure, t is -106.00949942103098, and p is 0.0
for SeaLevel, t is 16.583358523714292, and p is 4.086777296551439e-59
for HZ, t is -34.14104296815195, and p is 1.6759063617316336e-215
for DZ, t is -16.362692182669402, and p is 1.15351602755834e-57


In [26]:
w_diff = w_1_sam - w_2_sam
w_diff.describe()

Unnamed: 0,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,Sunrise,Sunset,...,AvgSpeed,sunrise_diff,sunset_diff,BR,HZ,RA,TSRA,TS,DZ,FG
count,1472.0,1472.0,1472.0,1472.0,1472.0,1472.0,1472.0,1472.0,1472.0,1472.0,...,1472.0,1472.0,1472.0,1472.0,1472.0,1472.0,1472.0,1472.0,1472.0,1472.0
mean,-0.394246,-2.161936,-1.174592,-0.017486,-0.248845,-0.499062,0.452846,-0.865808,-0.36964,0.151399,...,-0.023436,-0.11377,0.181963,0.005931,-0.035652,0.009341,-0.005333,-0.002969,-0.012385,0.003804
std,1.626396,1.477347,1.465883,0.978568,1.570987,1.367528,0.866751,0.814678,9.155917,12.512646,...,0.455558,1.268245,1.520403,0.063181,0.040527,0.066435,0.043432,0.04485,0.029131,0.016701
min,-5.84,-7.25,-6.28,-3.19,-4.97,-5.05,-2.33,-3.63,-26.28,-48.96,...,-1.639,-3.02,-6.01,-0.19,-0.18,-0.19,-0.15,-0.17,-0.12,-0.08
25%,-1.56,-3.13,-2.1225,-0.65,-1.34,-1.4225,-0.11,-1.44,-6.265,-8.45,...,-0.32825,-0.97,-0.88,-0.04,-0.06,-0.03,-0.03,-0.03,-0.03,-0.01
50%,-0.355,-2.17,-1.18,0.0,-0.205,-0.505,0.46,-0.875,-0.155,0.105,...,-0.044,-0.515,0.645,0.01,-0.04,0.01,-0.01,0.0,-0.01,0.0
75%,0.75,-1.13,-0.2075,0.64,0.79,0.4625,1.01,-0.32,5.65,8.8575,...,0.28175,0.82,1.25,0.05,-0.01,0.05,0.02,0.03,0.01,0.01
max,5.72,2.23,3.8,3.64,5.31,4.3,4.08,2.06,28.95,40.84,...,1.479,6.13,3.46,0.19,0.09,0.24,0.18,0.17,0.09,0.06


In [27]:
#averaging the two weather sets
w_avg = (w_1.set_index('Date') + w_2.set_index('Date')) / 2
w_avg = w_avg.reset_index()
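The averaging step relies on pandas aligning the two frames on their `Date` index before adding them. A small sketch with made-up values shows what that alignment does, including two side effects worth keeping in mind: a date present in only one frame becomes NaN, and 0/1 indicator columns can average to 0.5.

```python
import pandas as pd

# Made-up readings; station b is missing 2007-05-02 and has an extra 2007-05-03.
a = pd.DataFrame({"Date": ["2007-05-01", "2007-05-02"], "Tavg": [67.0, 51.0], "RA": [0, 1]})
b = pd.DataFrame({"Date": ["2007-05-01", "2007-05-03"], "Tavg": [68.0, 58.0], "RA": [1, 1]})

# Addition aligns rows by the Date index, not by position.
avg = (a.set_index("Date") + b.set_index("Date")) / 2
print(avg)
```

In the weather data both stations have a row for every date, so the NaN case should not arise there, but the 0.5 values for weather-code indicators do.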

In [28]:
w_avg['Date'] = pd.to_datetime(w_avg['Date'])

## 2.2 Mosquito and Spray Data
<a id='mosquito_and_spray'></a>


### 2.2.1 Spray Data Exploration
<a id ='spray'></a>

In [29]:
sp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14835 entries, 0 to 14834
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       14835 non-null  object 
 1   Time       14251 non-null  object 
 2   Latitude   14835 non-null  float64
 3   Longitude  14835 non-null  float64
dtypes: float64(2), object(2)
memory usage: 463.7+ KB


In [30]:
sp.head()

Unnamed: 0,Date,Time,Latitude,Longitude
0,2011-08-29,6:56:58 PM,42.391623,-88.089163
1,2011-08-29,6:57:08 PM,42.391348,-88.089163
2,2011-08-29,6:57:18 PM,42.391022,-88.089157
3,2011-08-29,6:57:28 PM,42.390637,-88.089158
4,2011-08-29,6:57:38 PM,42.39041,-88.088858


### 2.2.2 Mosquito trap data
<a id='trap'></a>


In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10506 entries, 0 to 10505
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Date                    10506 non-null  object 
 1   Address                 10506 non-null  object 
 2   Species                 10506 non-null  object 
 3   Block                   10506 non-null  int64  
 4   Street                  10506 non-null  object 
 5   Trap                    10506 non-null  object 
 6   AddressNumberAndStreet  10506 non-null  object 
 7   Latitude                10506 non-null  float64
 8   Longitude               10506 non-null  float64
 9   AddressAccuracy         10506 non-null  int64  
 10  NumMosquitos            10506 non-null  int64  
 11  WnvPresent              10506 non-null  int64  
dtypes: float64(2), int64(4), object(6)
memory usage: 985.1+ KB


In [32]:
df.head()

Unnamed: 0,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy,NumMosquitos,WnvPresent
0,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS/RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0
1,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0
2,2007-05-29,"6200 North Mandell Avenue, Chicago, IL 60646, USA",CULEX RESTUANS,62,N MANDELL AVE,T007,"6200 N MANDELL AVE, Chicago, IL",41.994991,-87.769279,9,1,0
3,2007-05-29,"7900 West Foster Avenue, Chicago, IL 60656, USA",CULEX PIPIENS/RESTUANS,79,W FOSTER AVE,T015,"7900 W FOSTER AVE, Chicago, IL",41.974089,-87.824812,8,1,0
4,2007-05-29,"7900 West Foster Avenue, Chicago, IL 60656, USA",CULEX RESTUANS,79,W FOSTER AVE,T015,"7900 W FOSTER AVE, Chicago, IL",41.974089,-87.824812,8,4,0


### 2.2.3 Integrating spray and mosquito data
<a id='ms'></a>

Insecticide sprays, unlike weather, are highly location dependent. A particular spray affects a small area and can last for a varying amount of time depending on the type of insecticide used, the amount sprayed, the total area covered, and other factors. These factors are not included in the data, so to integrate the spray data, I look at different time intervals and different location proximities.

After this calculation I merge the data sets and drop a few redundant columns.

In [33]:
#integrating spray data into mosquito data

mi_per_deg_lat = 364000/5280 #364000 ft per degree lat
mi_per_deg_long = 288200/5280 #288200 ft per degree long

#changing date columns to pandas datetime for easy handling
sp['Date'] = pd.to_datetime(sp['Date'])
df['Date'] = pd.to_datetime(df['Date'])


def sprayed(dist, time, traps, sprays):
    """Returns whether or not each mosquito trap location was sprayed within a certain distance and time frame. 
    Distance is in miles; time is 0 for the same calendar year, 1 for the same month, and 2 for the same day.
    traps is the trap dataset and sprays is the spray dataset."""
    s = []
    #for each trap, find the distances to all sprays within the timeframe; if the minimum is below 
    # the threshold, then it was sprayed during that time period
    period = {0:'Y',1:'M',2:'D'}
    
    for i,r  in traps.iterrows():
       
        #creating a mask to select relevant spray locations based on date
        mask = sprays['Date'].dt.to_period(period[time]) == r['Date'].to_period(period[time]) 
        
        #skipping this trap if there are no sprays during the right window
        if mask.sum() == 0:
            s.append(0)
            continue
        spray = sprays[mask]
        
        #finding euclidean distance based on lat/long converted to miles
        lat_d = (spray['Latitude']-r['Latitude']) * mi_per_deg_lat
        long_d = (spray['Longitude']-r['Longitude'])* mi_per_deg_long
        d = np.sqrt(lat_d**2 + long_d**2)
        
        #if the closest spray in the time period is within the cutoff distance, assign 1, otherwise 0
        if d.min() <= dist:
            s.append(1) 
        else: 
            s.append(0)

    return s
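The fixed `mi_per_deg_long` factor above can be sanity-checked against latitude, since a degree of longitude shrinks with the cosine of the latitude. The sketch below uses a spherical approximation and a hypothetical helper `miles_per_degree`; the latitude of 41.88 is a rough value for Chicago chosen for illustration.

```python
import math

FT_PER_DEG_LAT = 364000  # same figure the notebook uses
FT_PER_MILE = 5280


def miles_per_degree(lat_deg):
    """Approximate miles per degree of latitude and of longitude at `lat_deg`.

    Hypothetical helper for illustration; uses a spherical approximation.
    """
    mi_lat = FT_PER_DEG_LAT / FT_PER_MILE
    mi_long = mi_lat * math.cos(math.radians(lat_deg))
    return mi_lat, mi_long


mi_lat, mi_long = miles_per_degree(41.88)  # roughly Chicago's latitude
print(round(mi_lat, 1), round(mi_long, 1))
print(round(288200 / 5280, 1))  # the fixed longitude factor used in the notebook
```

Under this approximation a degree of longitude near Chicago is roughly 51 miles, a few percent below the fixed 288200 ft (about 54.6 miles), so east-west distances computed with the fixed factor come out slightly long. For a 1 mile cutoff the effect is small, but it is worth knowing about when tightening the threshold.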

In [34]:
df['spray_year'] = sprayed(1,0,df,sp)
df['spray_month'] = sprayed(1,1,df,sp)
df['spray_day'] = sprayed(1,2,df,sp)

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10506 entries, 0 to 10505
Data columns (total 15 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Date                    10506 non-null  datetime64[ns]
 1   Address                 10506 non-null  object        
 2   Species                 10506 non-null  object        
 3   Block                   10506 non-null  int64         
 4   Street                  10506 non-null  object        
 5   Trap                    10506 non-null  object        
 6   AddressNumberAndStreet  10506 non-null  object        
 7   Latitude                10506 non-null  float64       
 8   Longitude               10506 non-null  float64       
 9   AddressAccuracy         10506 non-null  int64         
 10  NumMosquitos            10506 non-null  int64         
 11  WnvPresent              10506 non-null  int64         
 12  spray_year              10506 non-null  int64 

In [36]:
df['spray_day'].sum()

60

In [37]:
df['spray_month'].sum()

402

In [38]:
df['spray_year'].sum()

1184
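The `time` argument of `sprayed` compares calendar periods rather than elapsed time, so it also counts sprays that happened after the trap was checked, and it misses sprays just before a period boundary. A minimal sketch of the difference, using two made-up dates and an assumed 30 day window:

```python
import pandas as pd

# Made-up dates one day apart, on either side of a month boundary.
spray_date = pd.Timestamp("2011-07-31")
trap_date = pd.Timestamp("2011-08-01")

# Calendar comparison, as in `sprayed`: same month or not.
same_month = spray_date.to_period("M") == trap_date.to_period("M")

# Elapsed-time comparison: sprayed at most 30 days before the trap check.
days_since_spray = (trap_date - spray_date).days
within_30_days = 0 <= days_since_spray <= 30

print(same_month, days_since_spray, within_30_days)
```

A trailing window of this kind only looks backwards in time, which may matter if the spray features are later used for prediction.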

In [39]:
df = df.drop(['Address','Block','Street','AddressNumberAndStreet','AddressAccuracy'],axis=1)

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10506 entries, 0 to 10505
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Date          10506 non-null  datetime64[ns]
 1   Species       10506 non-null  object        
 2   Trap          10506 non-null  object        
 3   Latitude      10506 non-null  float64       
 4   Longitude     10506 non-null  float64       
 5   NumMosquitos  10506 non-null  int64         
 6   WnvPresent    10506 non-null  int64         
 7   spray_year    10506 non-null  int64         
 8   spray_month   10506 non-null  int64         
 9   spray_day     10506 non-null  int64         
dtypes: datetime64[ns](1), float64(2), int64(5), object(2)
memory usage: 820.9+ KB


In [46]:
df_avg = df.merge(w_avg,on='Date')

In [47]:
df_avg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10506 entries, 0 to 10505
Data columns (total 35 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Date          10506 non-null  datetime64[ns]
 1   Species       10506 non-null  object        
 2   Trap          10506 non-null  object        
 3   Latitude      10506 non-null  float64       
 4   Longitude     10506 non-null  float64       
 5   NumMosquitos  10506 non-null  int64         
 6   WnvPresent    10506 non-null  int64         
 7   spray_year    10506 non-null  int64         
 8   spray_month   10506 non-null  int64         
 9   spray_day     10506 non-null  int64         
 10  Tmax          10506 non-null  float64       
 11  Tmin          10506 non-null  float64       
 12  Tavg          10506 non-null  float64       
 13  Depart        10506 non-null  float64       
 14  DewPoint      10506 non-null  float64       
 15  WetBulb       10506 non-null  float6

## 2.3 EDA and Feature Engineering
<a id='EDA'></a>

### 2.3.1 Distributions and Pairplots
<a id='dist'></a>


In [48]:
df_num = df_avg.select_dtypes(include='number')
df_num.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10506 entries, 0 to 10505
Data columns (total 32 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Latitude      10506 non-null  float64
 1   Longitude     10506 non-null  float64
 2   NumMosquitos  10506 non-null  int64  
 3   WnvPresent    10506 non-null  int64  
 4   spray_year    10506 non-null  int64  
 5   spray_month   10506 non-null  int64  
 6   spray_day     10506 non-null  int64  
 7   Tmax          10506 non-null  float64
 8   Tmin          10506 non-null  float64
 9   Tavg          10506 non-null  float64
 10  Depart        10506 non-null  float64
 11  DewPoint      10506 non-null  float64
 12  WetBulb       10506 non-null  float64
 13  Heat          10506 non-null  float64
 14  Cool          10506 non-null  float64
 15  Sunrise       10506 non-null  float64
 16  Sunset        10506 non-null  float64
 17  PrecipTotal   10506 non-null  float64
 18  StnPressure   10506 non-nu

In [None]:
vars1 = ['NumMosquitos','spray_month','Tavg','Depart','DewPoint','PrecipTotal','StnPressure','SeaLevel','AvgSpeed']

def p_plot(df,v,file_name):
    #pairplot builds its own figure, so the returned grid is what gets saved
    g = sns.pairplot(df,vars=v,corner=True, kind ='reg')
    g.savefig(file_name,dpi=60)
p_plot(df_num, vars1, './plots/pairplot.png')

### 2.3.2 Species 
<a id='species'></a>

WNV can only be carried by two mosquito species, Culex pipiens and Culex restuans. Unfortunately the data is highly class biased, and the vast majority of mosquitos are of the two species that can carry WNV. I create a boolean variable with 1 as a positive indication for these species and 0 otherwise.

In [None]:
df.groupby('Species').describe().T

In [None]:
sns.barplot(data=df,y='Species',x='WnvPresent')


In [None]:
sns.barplot(y=df.Species.value_counts().index, x=df.Species.value_counts())

In [None]:
df['cul_or_pip'] = [1 if s in ['CULEX PIPIENS/RESTUANS', 'CULEX RESTUANS', 'CULEX PIPIENS'] else 0 for s in df.Species]
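The same indicator can be sketched with `Series.isin`, which avoids the Python-level loop. The toy frame below is made up, and 'OTHER SPECIES' is a placeholder label rather than a value from the trap data.

```python
import pandas as pd

carriers = ["CULEX PIPIENS/RESTUANS", "CULEX RESTUANS", "CULEX PIPIENS"]

# Made-up rows; 'OTHER SPECIES' stands in for any non-carrier label.
toy = pd.DataFrame({"Species": ["CULEX PIPIENS", "OTHER SPECIES", "CULEX RESTUANS"]})

# True/False membership test, cast to 1/0.
toy["cul_or_pip"] = toy["Species"].isin(carriers).astype(int)
print(toy)
```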


### 2.3.3 WNV comparisons
<a id='grouped'></a>

For the numerical data, I create violin plots to look at any differences in the distributions when WNV is present or not.

In [None]:
sns.violinplot(data=df_avg,x='WnvPresent',y='Tavg')

In [None]:
df_avg.WnvPresent.corr(df_avg.Tavg)

It appears as though temperatures trend higher when WNV is present. The relationship is not strong and has a low correlation.

In [None]:
sns.violinplot(data=df_avg,x='WnvPresent',y='PrecipTotal')

The precipitation plot does not show a clear difference, despite the association between rain and mosquitos.

In [None]:
#rolling total precipitation on a few different time scales
windows = {'three_day_rain':3, 'week_rain':7, 'two_week_rain':14, 'month_rain':28}
for name, n in windows.items():
    w_avg[name] = w_avg['PrecipTotal'].rolling(n, min_periods=1).sum()

#and for the separate weather stations
#w_all is not defined in the cells above; assuming it is the two stations merged on Date (columns suffixed _x and _y)
w_all = w_1.merge(w_2, on='Date')
w_all['Date'] = pd.to_datetime(w_all['Date'])
for suffix in ['_x','_y']:
    for name, n in windows.items():
        w_all[name + suffix] = w_all['PrecipTotal' + suffix].rolling(n, min_periods=1).sum()
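The rolling totals above count rows, not calendar days. The 1472 daily rows span 2007 to 2014, so the dates cannot be contiguous, and a row-count window can reach back across a gap. A minimal sketch on made-up rainfall values compares a row-count window with a time-based window; the time-based variant is an alternative for illustration and requires a sorted datetime index.

```python
import numpy as np
import pandas as pd

# Made-up rainfall with a long gap between the second and third dates.
dates = pd.to_datetime(["2007-10-30", "2007-10-31", "2008-05-01"])
rain = pd.Series([1.0, 0.5, 0.0], index=dates)

# Row-count window: the last three rows, whatever their dates.
by_rows = rain.rolling(3, min_periods=1).sum()

# Time-based window: only rows within the last three calendar days.
by_days = rain.rolling("3D").sum()

print(pd.DataFrame({"by_rows": by_rows, "by_days": by_days}))
```

On the first date after the gap, the row-count total still carries the rain from the previous autumn, while the time-based total does not.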

In [None]:
df_avg = df.merge(w_avg.reset_index(),on='Date')
df_all = df.merge(w_all.reset_index(),on='Date')

In [None]:
sns.violinplot(data=df_avg,x='WnvPresent',y='three_day_rain')

In [None]:
sns.violinplot(data=df_avg,x='WnvPresent',y='week_rain')

In [None]:
sns.violinplot(data=df_avg,x='WnvPresent',y='two_week_rain')

In [None]:
sns.violinplot(data=df_avg,x='WnvPresent',y='month_rain')

In [None]:
subset = df_avg[['WnvPresent','PrecipTotal','three_day_rain','week_rain','two_week_rain',
                'month_rain','DewPoint','Tavg']]
sns.heatmap(subset.corr())

In [None]:
corr = df_avg.corr(numeric_only=True)
corr['WnvPresent']



In [None]:
sns.violinplot(data=df_avg,x='WnvPresent',y='spray_month')

In [None]:
sns.violinplot(data=df_avg,x='WnvPresent',y='Sunrise')

In [None]:
sns.violinplot(data=df_avg,x='WnvPresent',y='SeaLevel')

In [None]:
sns.violinplot(data=df_avg,x='WnvPresent',y='AvgSpeed')

In [None]:
sns.violinplot(data=df_avg,x='WnvPresent',y='DewPoint')

In [None]:
df_avg['T_spread'] = df_avg.Tmax - df_avg.Tmin
#1 when sunset moved later by more than sunrise did, i.e. the day got longer
df_avg['longer_days'] = (df_avg.sunset_diff - df_avg.sunrise_diff > 0).astype(int)


In [None]:
df_avg.info()

### Correlations for lagged weather variables

Here I explore how the correlation with WNV changes based on the lag in each variable. For each variable, I save the index (lag or span) that produced the best correlation.

In [None]:
w_avg['longer_days'] = (w_avg.sunset_diff - w_avg.sunrise_diff > 0).astype(int)
w_avg['T_spread'] = w_avg.Tmax - w_avg.Tmin
w_avg['rainy_and_hot'] = pd.qcut(w_avg.Tmax,5,labels=range(0,5)).astype('float') + pd.cut(
    w_avg.PrecipTotal,5,labels=range(0,5)).astype('float')


In [None]:
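#lagged weather: rolling and exponentially weighted means over n rows
def lagged(df,n):
    return df.rolling(n).mean().dropna()

def exp_lag(df,n):
    return df.ewm(span=n).mean().dropna()

#assumption: the helpers need Date as the index so that only numeric columns are averaged
w_by_date = w_avg.set_index('Date') if 'Date' in w_avg.columns else w_avg
w_exp = {s:exp_lag(w_by_date,s) for s in np.arange(2,60)}
w_lag = {s:lagged(w_by_date,s) for s in np.arange(2,60)}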

In [None]:
def lag_corr(weather, data=df):
    # best_corr maps each column to [best lag, correlation with WnvPresent at that lag]
    best_corr = {c:[0,0] for c in data.merge(weather[2].reset_index(),on='Date').columns}
    for lag,w_df in weather.items():
        d = data.merge(w_df.reset_index(),on='Date')
        # keep the lag with the largest absolute correlation for each column
        for i, corr in d.corr()['WnvPresent'].items():
            if np.abs(best_corr[i][1]) < np.abs(corr):
                best_corr[i][1] = corr
                best_corr[i][0] = lag
        
    return best_corr

In [None]:
lag_df = pd.DataFrame(lag_corr(w_lag,df)).T
ewm_df = pd.DataFrame(lag_corr(w_exp,df)).T
ewm_df

In [None]:
diff = np.abs(lag_df) - np.abs(ewm_df)
# if 1, lagged approach is best, if -1, exp lag is best
def pos_neg(x):
    if x < 0:
        return -1
    elif x == 0:
        return 0
    elif x > 0:
        return 1

choice = diff.loc[:,1].apply(pos_neg)
choice

In [None]:
df_best = df

for col in w_lag[2].columns:
    c = choice[col]
    if c == -1:
        lag = ewm_df.loc[col][0]
        best_col = w_exp[lag][col].reset_index()
        df_best = df_best.merge(best_col, on='Date')
        cols = list(df_best.columns)
        cols[-1] = col + ' ' + str(lag) + ' ewm'
        df_best.columns = cols
        
    elif c == 1:
        lag = lag_df.loc[col][0]
        best_col = w_lag[lag][col].reset_index()
        df_best = df_best.merge(best_col, on='Date')
        cols = list(df_best.columns)
        cols[-1]= col + ' ' + str(lag) + ' lag'
        df_best.columns = cols

In [None]:
df_best.corr()['WnvPresent']

In [None]:
sns.violinplot(data=df_best,x='WnvPresent',y='HZ 10.0 lag')

In [None]:
sns.violinplot(data=df_best,x='WnvPresent',y='longer_days 58.0 lag')

In [None]:
sns.violinplot(data=df_best,x='WnvPresent',y='spray_month')

In [None]:
sns.violinplot(data=df_best,x='WnvPresent',y='PrecipTotal 59.0 lag')

In [None]:
sns.violinplot(data=df_best,x='WnvPresent',y='DewPoint 59.0 ewm')

In [None]:
sns.violinplot(data=df_best,x='WnvPresent',y='Heat 59.0 ewm')

In [None]:
sns.violinplot(data=df_best,x='WnvPresent',y='Tmax 59.0 lag')

In [None]:
sns.histplot(data=df_best,x='Tmax 59.0 lag')

In [None]:
sns.violinplot(data=df_best,x='WnvPresent',y='Sunrise 2.0 ewm')

In [None]:
sns.violinplot(data=df_best,x='WnvPresent',y='ResultSpeed 32.0 lag')

In [None]:
fig = plt.figure(figsize=(20,20))

sns.heatmap(df_best.corr())

In [None]:
df_best.info()

In [None]:
max_bin = 20
force_bin = 3

def mono_bin(Y, X, n = max_bin):
    
    np.seterr(divide='ignore')
    
    df1 = pd.DataFrame({"X": X, "Y": Y})
    justmiss = df1[['X','Y']][df1.X.isnull()]
    notmiss = df1[['X','Y']][df1.X.notnull()]
    r = 0
    while np.abs(r) < 1:
        try:
            d1 = pd.DataFrame({"X": notmiss.X, "Y": notmiss.Y, "Bucket": pd.qcut(notmiss.X, n)})
            d2 = d1.groupby('Bucket', as_index=True)
            r, p = stats.spearmanr(d2.mean().X, d2.mean().Y)
            n = n - 1 
        except Exception as e:
            n = n - 1

    if len(d2) == 1:
        n = force_bin         
        bins = algos.quantile(notmiss.X, np.linspace(0, 1, n))
        if len(np.unique(bins)) == 2:
            bins = np.insert(bins, 0, 1)
            bins[1] = bins[1]-(bins[1]/2)
        d1 = pd.DataFrame({"X": notmiss.X, "Y": notmiss.Y, "Bucket": pd.cut(notmiss.X, np.unique(bins),include_lowest=True)}) 
        d2 = d1.groupby('Bucket', as_index=True)
    
    d3 = pd.DataFrame({},index=[])
    d3["MIN_VALUE"] = d2.min().X
    d3["MAX_VALUE"] = d2.max().X
    d3["COUNT"] = d2.count().Y
    d3["EVENT"] = d2.sum().Y
    d3["NONEVENT"] = d2.count().Y - d2.sum().Y
    d3=d3.reset_index(drop=True)
    
    if len(justmiss.index) > 0:
        d4 = pd.DataFrame({'MIN_VALUE':np.nan},index=[0])
        d4["MAX_VALUE"] = np.nan
        d4["COUNT"] = justmiss.count().Y
        d4["EVENT"] = justmiss.sum().Y
        d4["NONEVENT"] = justmiss.count().Y - justmiss.sum().Y
        d3 = pd.concat([d3, d4], ignore_index=True)
    
    d3["EVENT_RATE"] = np.divide(d3.EVENT,d3.COUNT)
    d3["NON_EVENT_RATE"] = np.divide(d3.NONEVENT,d3.COUNT)
    d3["DIST_EVENT"] = np.divide(d3.EVENT,d3.sum().EVENT)
    d3["DIST_NON_EVENT"] = np.divide(d3.NONEVENT,d3.sum().NONEVENT)
    d3["WOE"] = np.log(np.divide(d3.DIST_EVENT,d3.DIST_NON_EVENT))
    d3["IV"] = (d3.DIST_EVENT-d3.DIST_NON_EVENT)*d3.WOE
    d3["VAR_NAME"] = "VAR"
    d3 = d3[['VAR_NAME','MIN_VALUE', 'MAX_VALUE', 'COUNT', 'EVENT', 'EVENT_RATE', 'NONEVENT', 'NON_EVENT_RATE', 'DIST_EVENT','DIST_NON_EVENT','WOE', 'IV']]       
    d3 = d3.replace([np.inf, -np.inf], 0)
    d3.IV = d3.IV.sum()
    
    return(d3)

def char_bin(Y, X):
        
    df1 = pd.DataFrame({"X": X, "Y": Y})
    justmiss = df1[['X','Y']][df1.X.isnull()]
    notmiss = df1[['X','Y']][df1.X.notnull()]    
    df2 = notmiss.groupby('X',as_index=True)
    
    d3 = pd.DataFrame({},index=[])
    d3["COUNT"] = df2.count().Y
    d3["MIN_VALUE"] = df2.sum().Y.index
    d3["MAX_VALUE"] = d3["MIN_VALUE"]
    d3["EVENT"] = df2.sum().Y
    d3["NONEVENT"] = df2.count().Y - df2.sum().Y
    
    if len(justmiss.index) > 0:
        d4 = pd.DataFrame({'MIN_VALUE':np.nan},index=[0])
        d4["MAX_VALUE"] = np.nan
        d4["COUNT"] = justmiss.count().Y
        d4["EVENT"] = justmiss.sum().Y
        d4["NONEVENT"] = justmiss.count().Y - justmiss.sum().Y
        d3 = pd.concat([d3, d4], ignore_index=True)
    
    
    d3["EVENT_RATE"] = np.divide(d3.EVENT,d3.COUNT)
    d3["NON_EVENT_RATE"] = np.divide(d3.NONEVENT,d3.COUNT)
    d3["DIST_EVENT"] = np.divide(d3.EVENT,d3.sum().EVENT)
    d3["DIST_NON_EVENT"] = np.divide(d3.NONEVENT,d3.sum().NONEVENT)
    d3["WOE"] = np.log(np.divide(d3.DIST_EVENT,d3.DIST_NON_EVENT))
    
    d3["IV"] = (d3.DIST_EVENT-d3.DIST_NON_EVENT)*np.log(d3.DIST_EVENT/d3.DIST_NON_EVENT)
    d3["VAR_NAME"] = "VAR"
    d3 = d3[['VAR_NAME','MIN_VALUE', 'MAX_VALUE', 'COUNT', 'EVENT', 'EVENT_RATE', 'NONEVENT', 'NON_EVENT_RATE', 'DIST_EVENT','DIST_NON_EVENT','WOE', 'IV']]      
    d3 = d3.replace([np.inf, -np.inf], 0)
    d3.IV = d3.IV.sum()
    d3 = d3.reset_index(drop=True)
    
    return(d3)

def data_vars(df1, target):
    
    stack = traceback.extract_stack()
    filename, lineno, function_name, code = stack[-2]
    vars_name = re.compile(r'\((.*?)\).*$').search(code).groups()[0]
    final = (re.findall(r"[\w']+", vars_name))[-1]
    
    x = df1.dtypes.index
    count = -1
    
    for i in x:
        if i.upper() not in (final.upper()):
            if np.issubdtype(df1[i], np.number) and len(pd.Series.unique(df1[i])) > 2:
                conv = mono_bin(target, df1[i])
                conv["VAR_NAME"] = i
                count = count + 1
            else:
                conv = char_bin(target, df1[i])
                conv["VAR_NAME"] = i            
                count = count + 1
                
            if count == 0:
                iv_df = conv
            else:
                iv_df = pd.concat([iv_df, conv], ignore_index=True)
    
    iv = pd.DataFrame({'IV':iv_df.groupby('VAR_NAME').IV.max()})
    iv = iv.reset_index()
    return(iv_df,iv)

In [None]:
df_ivs, IV = data_vars(df_best.drop('WnvPresent',axis=1),df_best['WnvPresent'])
to_drop = IV[(IV.IV > 0.8) | (IV.IV < 0.02)].drop(29)
df_iv = df_best.drop(to_drop.VAR_NAME, axis=1)
df_iv = df_iv.drop(['spray_year','spray_day','Trap','Species'],axis=1)

In [None]:
IV

In [None]:
features = ['spray_month','Tavg 59.0 ewm','Depart 3.0 lag','PrecipTotal 59.0 lag','longer_days 58.0 lag',
           'ResultSpeed 32.0 lag','HZ 10.0 lag','BR 48.0 lag','T_spread 56.0 lag']
df_X = df_iv[features]

<h2>Conclusions / Notes</h2>

1. WNV is only present in CULEX PIPIENS/RESTUANS. I should combine the mosquito species data into a single yes-or-no column on species.

2. Lagging variables is incredibly important. For each variable, I selected the lag that correlated the variable most strongly with WnvPresent. I use a mixed approach of exponentially weighted and uniformly weighted moving averages.
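
The first note could be sketched as follows, assuming the trap data has `Species` and `WnvPresent` columns. The rows, the `carrier_species` list, and the new column name are made up for illustration and would need to be checked against the real data.

```python
import pandas as pd

# Hypothetical rows standing in for the trap data; only the column names
# `Species` and `WnvPresent` are taken from the notebook.
traps = pd.DataFrame({
    "Species": ["CULEX PIPIENS/RESTUANS", "CULEX TERRITANS",
                "CULEX PIPIENS/RESTUANS", "CULEX SALINARIUS"],
    "WnvPresent": [1, 0, 0, 0],
})

# Species treated as possible carriers (assumption based on the note above).
carrier_species = ["CULEX PIPIENS/RESTUANS"]

# 1 if the trapped species is in the carrier list, 0 otherwise.
traps["carrier_species"] = traps["Species"].isin(carrier_species).astype(int)

# WNV rate per group, as a quick check that the indicator separates the classes.
rate = traps.groupby("carrier_species")["WnvPresent"].mean()
print(rate)
```

A single binary column like this keeps the species signal without adding one dummy column per species.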

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
def iterate_vif(df, vif_threshold=10, max_vif=11):
    """Drop the feature with the highest VIF until every VIF is at or below vif_threshold."""
    count = 0
    while max_vif > vif_threshold:
        count += 1
        print("Iteration # " + str(count))
        vif = pd.DataFrame()
        vif["VIFactor"] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
        vif["features"] = df.columns

        max_vif = np.round(vif['VIFactor'].max(), 1)
        if max_vif > vif_threshold:
            # first feature whose VIF equals the current maximum
            worst = vif.loc[vif['VIFactor'].idxmax(), 'features']
            print('Removing %s with VIF of %f' % (worst, max_vif))
            df = df.drop(worst, axis=1)
        else:
            print('Complete')
            return df, np.round(vif.sort_values('VIFactor'), 1)

In [None]:
X = df_X
y = df_iv.WnvPresent

In [None]:
X2, vif = iterate_vif(X.select_dtypes(include='number'))

In [None]:
X2.info()

In [None]:
vif

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X2.to_pickle('./data/X2.pkl')
y.to_pickle('./data/y.pkl')

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X2,y, test_size = 0.2, random_state = 42)

In [None]:
rf = RandomForestClassifier()

params = {'n_estimators': np.arange(750,2000,50),'max_depth': np.arange(3,8)}

rf_rand = RandomizedSearchCV(rf,param_distributions=params,cv=5, n_jobs=-1, scoring = 'roc_auc',n_iter=10,
                            random_state = 42, verbose=10)

In [None]:
rf_rand.fit(X_train,y_train)

In [None]:
rf_rand.best_score_

In [None]:
from sklearn.metrics import roc_auc_score

y_pred = rf_rand.best_estimator_.predict_proba(X_test)

roc_auc_score(y_test,y_pred[:,1])


In [None]:
r = rf_rand.best_estimator_
imp = r.feature_importances_

fig = plt.figure(figsize=(10,5))

i = pd.DataFrame({'feature':X_test.columns,'importance':imp}).sort_values('importance',ascending=False)

sns.barplot(data=i,y='feature',x='importance',orient='h',color='gray')

In [None]:
from sklearn.metrics import roc_curve

def roc_plot(e,x,y):
    """e is estimator, x is data and y is true value"""

    #predicting from model
    y_pred = e.predict_proba(x)[:,1]

    #finding curve
    fpr, tpr, t = roc_curve(y, y_pred)
    sns.lineplot(x=fpr,y=tpr)
    plt.xlabel('false positive rate')
    plt.ylabel('true positive rate')
    plt.title('Receiver Operating Characteristic Curve')
    return pd.DataFrame({'fpr':fpr,'tpr':tpr,'threshold':t})

from sklearn.metrics import roc_auc_score


def test_comp(e):
    #predicting probability from test set
    y_pred = e.predict_proba(X_test)[:,1]
    
    
    r = np.round(roc_auc_score(y_test,y_pred),3)
    
    print('Evaluating the classifier on the test set, area under the roc curve is: ' +  
           '{}'.format(r))

test_comp(rf_rand.best_estimator_)



In [None]:
est = rf_rand.best_estimator_
c = roc_plot(est, X_train, y_train)