# DSI Project 4: West Nile Virus Prediction

## Problem Statement:

Due to the recent epidemic of West Nile Virus in Chicago, we need to deploy pesticides throughout the city cost-effectively (as pesticides are expensive) and safely, without endangering public health.<br>
Given weather, location, testing, and spraying data, the task is to predict where and when different species of mosquitos will test positive for West Nile virus. Mosquitos in traps across the city are tested for the virus. The results of these tests influence when and where the city will spray airborne pesticides to control adult mosquito populations. Success is evaluated on area under the ROC curve between the predicted probability that West Nile Virus is present and the observed outcomes.
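
To make the evaluation metric concrete, here is a minimal sketch of what the area under the ROC curve measures: the probability that a randomly chosen positive observation receives a higher predicted probability than a randomly chosen negative one. The `roc_auc` helper and the toy labels and scores are illustrative assumptions, not part of the competition data. The notebook itself relies on scikit-learn's `roc_auc` scoring.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities, for illustration only
y_true = [0, 0, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.05, 0.80]

auc = roc_auc(y_true, y_score)
print(round(auc, 4))
```

Because only the ranking of the predictions matters, the metric is insensitive to how well calibrated the predicted probabilities are.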


## Executive Summary
---
### Contents:
- [Data Description](#Data-Description)
- [Data Cleaning & EDA](#Data-Cleaning-&-EDA)
- [Feature Engineering](#Feature-Engineering)
- [Model Evaluation & Kaggle Prediction Scoring](#Model-Evaluation-&-Kaggle-Prediction-Scoring)
- [Conclusion](#Conclusion)

## Data Description

Every year from late May to early October, public health workers in Chicago set up mosquito traps scattered across the city. Every week from Monday through Wednesday, these traps collect mosquitos, and the mosquitos are tested for the presence of West Nile virus before the end of the week. The test results include the number of mosquitos, the mosquito species, and whether or not West Nile virus is present in the cohort.

There are in total 4 datasets provided:
- Training Dataset (May-Oct 2007, May-Oct 2009, Jun-Sep 2011, Jun-Sep 2013) 
- Test Dataset (Jun-Sep 2008, Jun-Oct 2010, Jun-Sep 2012, Jun-Oct 2014)
- Weather Dataset (May-Oct 2007-2014)
- Spray Dataset (Aug-Sep 2011, Jul-Sep 2013)

## Data Cleaning & EDA

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import roc_auc_score,confusion_matrix,classification_report
from imblearn.over_sampling import SMOTE
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,ExtraTreesClassifier, AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

%matplotlib inline

pd.set_option('display.max_rows', 100) # to look at more rows of data later
pd.set_option('display.max_columns', 100) # to expand columns view so that all can be seen later

The Weather data has already been cleaned; for details of the cleaning process, refer to [this notebook](./weather_cleaned.ipynb). <br>
The cleaned csv is available [here](../dataset/weather_final.csv).

In [2]:
# Load dataset
train_df = pd.read_csv('../dataset/train.csv')
test_df = pd.read_csv('../dataset/test.csv')
weather_df = pd.read_csv('../dataset/weather_final.csv')
spray_df=pd.read_csv('../dataset/spray.csv')

In [3]:
# Print shape of dataset
print(train_df.shape)
print(test_df.shape)

(10506, 12)
(116293, 11)


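The test dataset is much larger than the training dataset: 116,293 rows (about 92% of all rows) versus 10,506 rows (about 8%). Note that this is a difference in dataset size, not an imbalance between the target classes.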

In [4]:
# Print columns
print(train_df.columns)
print(test_df.columns)

Index(['Date', 'Address', 'Species', 'Block', 'Street', 'Trap',
       'AddressNumberAndStreet', 'Latitude', 'Longitude', 'AddressAccuracy',
       'NumMosquitos', 'WnvPresent'],
      dtype='object')
Index(['Id', 'Date', 'Address', 'Species', 'Block', 'Street', 'Trap',
       'AddressNumberAndStreet', 'Latitude', 'Longitude', 'AddressAccuracy'],
      dtype='object')


In [5]:
train_df.head()

Unnamed: 0,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy,NumMosquitos,WnvPresent
0,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS/RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0
1,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0
2,2007-05-29,"6200 North Mandell Avenue, Chicago, IL 60646, USA",CULEX RESTUANS,62,N MANDELL AVE,T007,"6200 N MANDELL AVE, Chicago, IL",41.994991,-87.769279,9,1,0
3,2007-05-29,"7900 West Foster Avenue, Chicago, IL 60656, USA",CULEX PIPIENS/RESTUANS,79,W FOSTER AVE,T015,"7900 W FOSTER AVE, Chicago, IL",41.974089,-87.824812,8,1,0
4,2007-05-29,"7900 West Foster Avenue, Chicago, IL 60656, USA",CULEX RESTUANS,79,W FOSTER AVE,T015,"7900 W FOSTER AVE, Chicago, IL",41.974089,-87.824812,8,4,0


In [6]:
test_df.head()

Unnamed: 0,Id,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy
0,1,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS/RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
1,2,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
2,3,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
3,4,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX SALINARIUS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9
4,5,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX TERRITANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9


In [7]:
print(train_df[train_df.duplicated()].count())

Date                      813
Address                   813
Species                   813
Block                     813
Street                    813
Trap                      813
AddressNumberAndStreet    813
Latitude                  813
Longitude                 813
AddressAccuracy           813
NumMosquitos              813
WnvPresent                813
dtype: int64


There are 813 duplicate observations in the train dataset. This is because the test results are organized so that when the number of mosquitos exceeds 50, the surplus is split into another row. In short, the number of mosquitos in each row is capped at 50.
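
As an alternative to dropping the split rows, they could be recombined into a single observation per trap, date and species. The sketch below assumes a small hypothetical frame (the trap IDs, date and counts are made up) and sums `NumMosquitos` while keeping `WnvPresent` as 1 if any of the split rows tested positive. This is not what the notebook does; it only illustrates how the cap of 50 produces several rows for one catch.

```python
import pandas as pd

# Hypothetical rows: trap T115 caught 120 mosquitos, stored as 50 + 50 + 20
rows = pd.DataFrame({
    'Date': ['2007-08-01'] * 4,
    'Trap': ['T115', 'T115', 'T115', 'T002'],
    'Species': ['CULEX PIPIENS'] * 4,
    'NumMosquitos': [50, 50, 20, 7],
    'WnvPresent': [0, 1, 0, 0],
})

# One row per trap observation: total count, and positive if any split row is positive
combined = (rows.groupby(['Date', 'Trap', 'Species'], as_index=False)
                .agg(NumMosquitos=('NumMosquitos', 'sum'),
                     WnvPresent=('WnvPresent', 'max')))
print(combined)
```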

In [8]:
print(test_df[test_df.duplicated()].count())

Id                        0
Date                      0
Address                   0
Species                   0
Block                     0
Street                    0
Trap                      0
AddressNumberAndStreet    0
Latitude                  0
Longitude                 0
AddressAccuracy           0
dtype: int64


There are no duplicate observations in the test dataset, as it has no 'NumMosquitos' column.

In [9]:
train_df.groupby(by=['Date','Address','Species','WnvPresent']).sum()

Date,Address,Species,WnvPresent,Block,Latitude,Longitude,AddressAccuracy,NumMosquitos
2007-05-29,"1100 Roosevelt Road, Chicago, IL 60608, USA",CULEX PIPIENS/RESTUANS,0,11,41.867108,-87.654224,8,1
2007-05-29,"1100 Roosevelt Road, Chicago, IL 60608, USA",CULEX RESTUANS,0,11,41.867108,-87.654224,8,2
2007-05-29,"1100 South Peoria Street, Chicago, IL 60608, USA",CULEX RESTUANS,0,11,41.862292,-87.648860,8,1
2007-05-29,"1100 West Chicago Avenue, Chicago, IL 60642, USA",CULEX RESTUANS,0,11,41.896282,-87.655232,8,1
2007-05-29,"1500 North Long Avenue, Chicago, IL 60651, USA",CULEX RESTUANS,0,15,41.907645,-87.760886,8,1
2007-05-29,"1500 West Webster Avenue, Chicago, IL 60614, USA",CULEX RESTUANS,0,15,41.921600,-87.666455,8,2
2007-05-29,"1700 West 95th Street, Chicago, IL 60643, USA",CULEX RESTUANS,0,17,41.720848,-87.666014,9,3
2007-05-29,"2100 North Stave Street, Chicago, IL 60647, USA",CULEX PIPIENS/RESTUANS,0,21,41.919343,-87.694259,8,1
2007-05-29,"2200 North Cannon Drive, Chicago, IL 60614, USA",CULEX PIPIENS/RESTUANS,0,22,41.921965,-87.632085,8,2
2007-05-29,"2200 North Cannon Drive, Chicago, IL 60614, USA",CULEX RESTUANS,0,22,41.921965,-87.632085,8,3


We define a unique trap observation by grouping on 'Date', 'Address', 'Species' and 'WnvPresent'.<br>
We are not using 'NumMosquitos' as a feature, as it is not available in the test dataset.<br>There are in total 8610 unique trap observations in the train dataset.

In [10]:
# Drop duplicates
train_df.drop_duplicates(subset=['Date','Address','Species','Trap','Block','WnvPresent'],inplace=True)
train_df.reset_index(inplace=True)

In [11]:
train_df.shape

(8610, 13)

In [12]:
# Check which mozzies spread WNV
train_df[train_df['WnvPresent'] == 1]['Species'].unique()

array(['CULEX PIPIENS/RESTUANS', 'CULEX PIPIENS', 'CULEX RESTUANS'],
      dtype=object)

From train dataset it can be observed that WNV is only present in the following 3 species of Mosquitos:
- CULEX PIPIENS/RESTUANS
- CULEX PIPIENS
- CULEX RESTUANS

In [13]:
# Check if there's overlap
train_df[train_df['WnvPresent'] == 0]['Species'].unique()

array(['CULEX PIPIENS/RESTUANS', 'CULEX RESTUANS', 'CULEX PIPIENS',
       'CULEX SALINARIUS', 'CULEX TERRITANS', 'CULEX TARSALIS',
       'CULEX ERRATICUS'], dtype=object)

However, the presence of any single species does not mandate the presence of WNV. 

In [14]:
# One-hot encode mozzies that spread WNV
train_species = pd.get_dummies(train_df['Species'])[['CULEX PIPIENS/RESTUANS','CULEX PIPIENS','CULEX RESTUANS']]
test_species = pd.get_dummies(test_df['Species'])[['CULEX PIPIENS/RESTUANS','CULEX PIPIENS','CULEX RESTUANS']]

Based on the earlier observation on the train dataset, only 'CULEX PIPIENS/RESTUANS', 'CULEX PIPIENS' and 'CULEX RESTUANS' are found to carry WNV.<br>
We one-hot encode these 3 species; any species that does not fall into one of the 3 categories is encoded as all zeroes (Other Species).
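
The encoding described above can be sketched as a small helper. Selecting columns directly from `pd.get_dummies` raises a `KeyError` if one of the three species happens to be absent from a dataset, so this sketch uses `reindex` to add an all-zero column instead. The `encode_species` helper and the sample series are illustrative assumptions rather than the notebook's code.

```python
import pandas as pd

WNV_SPECIES = ['CULEX PIPIENS/RESTUANS', 'CULEX PIPIENS', 'CULEX RESTUANS']

def encode_species(species):
    """One-hot encode the three carrier species; every other species becomes all zeros."""
    dummies = pd.get_dummies(species)
    # reindex keeps only the three columns and fills a missing species with zeros
    return dummies.reindex(columns=WNV_SPECIES, fill_value=0).astype(int)

# Hypothetical sample: 'CULEX PIPIENS/RESTUANS' is absent, 'CULEX TERRITANS' is an "other" species
sample = pd.Series(['CULEX PIPIENS', 'CULEX TERRITANS', 'CULEX RESTUANS'])
encoded = encode_species(sample)
print(encoded)
```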

In [15]:
train_df = pd.concat([train_df,train_species],axis=1,sort=False)
test_df = pd.concat([test_df,test_species],axis=1,sort=False)

The Weather dataset contains 2 weather stations that record weather conditions daily.<br>

We need to identify which weather station (1 or 2) is closest to each trap, and join the weather data based on the date of observation.

Distance is calculated as the Euclidean distance between the coordinates of the weather stations and the traps.

In [16]:
# Calculate euclidean distance of each trap from both weather stations and determine which station is nearest
# This is calculated using pythagoras theorem on the latitude/longitude values (in degrees)
# Station 1: CHICAGO O'HARE INTERNATIONAL AIRPORT Lat: 41.995 Lon: -87.933 Elev: 662 ft. above sea level
# Station 2: CHICAGO MIDWAY INTL ARPT Lat: 41.786 Lon: -87.752 Elev: 612 ft. above sea level
train_df['diststat1'] = np.sqrt((train_df['Latitude'] - 41.995) ** 2 + (train_df['Longitude'] - (-87.933)) ** 2)
train_df['diststat2'] = np.sqrt((train_df['Latitude'] - 41.786) ** 2 + (train_df['Longitude'] - (-87.752)) ** 2)
train_df['Station'] = [2 if train_df['diststat1'][i] > train_df['diststat2'][i] else 1 for i in range(train_df.shape[0])]
train_df.head(2)


Unnamed: 0,index,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy,NumMosquitos,WnvPresent,CULEX PIPIENS/RESTUANS,CULEX PIPIENS,CULEX RESTUANS,diststat1,diststat2,Station
0,0,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS/RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0,1,0,0,0.138026,0.17566,1
1,1,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0,0,0,1,0.138026,0.17566,1
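
The Euclidean distance above is computed on raw degrees, where one degree of longitude covers less ground than one degree of latitude at Chicago's latitude. As a cross-check, a great-circle (haversine) distance in kilometres can be sketched as below. The station coordinates are taken from the comments in the cell above; the helper names and the mean Earth radius of 6371 km are assumptions of this sketch, and it is a different technique from the one the notebook uses.

```python
from math import asin, cos, radians, sin, sqrt

# Station coordinates (latitude, longitude) as listed in the notebook comments
STATIONS = {1: (41.995, -87.933), 2: (41.786, -87.752)}
EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def nearest_station(lat, lon):
    """Return the key of the station with the smallest haversine distance."""
    return min(STATIONS, key=lambda s: haversine_km(lat, lon, *STATIONS[s]))

# Trap T002 from the first rows of the train dataset
nearest = nearest_station(41.95469, -87.800991)
dist_to_station1 = haversine_km(41.95469, -87.800991, *STATIONS[1])
print(nearest, round(dist_to_station1, 1))
```

For trap T002 both approaches pick station 1, but traps lying roughly midway between the two stations could be assigned differently.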


In [17]:
# Apply for test data set

test_df['diststat1'] = np.sqrt((test_df['Latitude'] - 41.995) ** 2 + (test_df['Longitude'] - (-87.933)) ** 2)
test_df['diststat2'] = np.sqrt((test_df['Latitude'] - 41.786) ** 2 + (test_df['Longitude'] - (-87.752)) ** 2)
test_df['Station'] = [2 if test_df['diststat1'][i] > test_df['diststat2'][i] else 1 for i in range(test_df.shape[0])]
test_df.head()

Unnamed: 0,Id,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy,CULEX PIPIENS/RESTUANS,CULEX PIPIENS,CULEX RESTUANS,diststat1,diststat2,Station
0,1,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS/RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0,0,0.138026,0.17566,1
1,2,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,0,0,1,0.138026,0.17566,1
2,3,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,0,1,0,0.138026,0.17566,1
3,4,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX SALINARIUS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,0,0,0,0.138026,0.17566,1
4,5,2008-06-11,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX TERRITANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,0,0,0,0.138026,0.17566,1


## Feature Engineering

In [18]:
# Feature engineer to add a new column 'dateofyear', which is the day within a year.
train_df['dateofyear'] = pd.to_datetime(train_df['Date'], format='%Y-%m-%d').dt.dayofyear
test_df['dateofyear'] = pd.to_datetime(test_df['Date'], format='%Y-%m-%d').dt.dayofyear

In [19]:
# Merged weather and train/test to one dataframe
train_weather_df = pd.merge(train_df,weather_df,on=['Station','Date'])
train_weather_df.drop(axis=1,columns=['index'],inplace=True)
train_weather_df.head()
train_weather_df.to_csv('../dataset/train_weather_csv')

In [20]:
test_weather_df = pd.merge(test_df,weather_df,on=['Station','Date'])
# test_weather_df.drop(axis=1,columns=['index'],inplace=True)
test_weather_df.head()
# Save the merged test data (the original cell wrote train_weather_df to this file by mistake)
test_weather_df.to_csv('../dataset/test_weather_csv')
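
`pd.merge` performs an inner join by default, so any trap observation without a matching weather record would be silently dropped. A minimal sketch of how to guard against this with the `how`, `validate` and `indicator` parameters follows; the two small frames and their values are hypothetical.

```python
import pandas as pd

# Hypothetical trap observations and weather records
traps = pd.DataFrame({
    'Id': [1, 2, 3],
    'Station': [1, 2, 1],
    'Date': ['2008-06-11', '2008-06-11', '2008-06-12'],
})
weather = pd.DataFrame({
    'Station': [1, 2],
    'Date': ['2008-06-11', '2008-06-11'],
    'Tmax': [86, 88],
})

# how='left' keeps every trap row; validate raises if weather has duplicate (Station, Date) keys
merged = pd.merge(traps, weather, on=['Station', 'Date'],
                  how='left', validate='many_to_one', indicator=True)

# Rows flagged 'left_only' have no weather record for their station and date
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['Id', 'Station', 'Date']])
```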

In [21]:
train_weather_df.dtypes

Date                       object
Address                    object
Species                    object
Block                       int64
Street                     object
Trap                       object
AddressNumberAndStreet     object
Latitude                  float64
Longitude                 float64
AddressAccuracy             int64
NumMosquitos                int64
WnvPresent                  int64
CULEX PIPIENS/RESTUANS      uint8
CULEX PIPIENS               uint8
CULEX RESTUANS              uint8
diststat1                 float64
diststat2                 float64
Station                     int64
dateofyear                  int64
Tmax                        int64
Tmin                        int64
Tavg                      float64
Depart                    float64
DewPoint                    int64
WetBulb                   float64
Heat                      float64
Cool                      float64
Sunrise                     int64
Sunset                      int64
CodeSum       

In [22]:
# Rearrange Column Name
col_at_end=['WnvPresent']
train_weather_df=train_weather_df[[c for c in train_weather_df if c not in col_at_end]+
                                [c for c in col_at_end]]
print(train_weather_df.shape)
train_weather_df.head(2)

(8610, 38)


Unnamed: 0,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy,NumMosquitos,CULEX PIPIENS/RESTUANS,CULEX PIPIENS,CULEX RESTUANS,diststat1,diststat2,Station,dateofyear,Tmax,Tmin,Tavg,Depart,DewPoint,WetBulb,Heat,Cool,Sunrise,Sunset,CodeSum,PrecipTotal,StnPressure,SeaLevel,ResultSpeed,ResultDir,AvgSpeed,Year,Month,WnvPresent
0,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS/RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,1,0,0,0.138026,0.17566,1,149,88,60,74.0,10.0,58,65.0,0.0,9.0,421,1917,BR HZ,0.0,29.39,30.11,5.8,18,6.5,2007,5,0
1,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0,0,1,0.138026,0.17566,1,149,88,60,74.0,10.0,58,65.0,0.0,9.0,421,1917,BR HZ,0.0,29.39,30.11,5.8,18,6.5,2007,5,0


In [23]:
feat = ['dateofyear','Latitude','Longitude','AddressAccuracy','CULEX PIPIENS/RESTUANS','CULEX PIPIENS','CULEX RESTUANS','Heat','Cool','WetBulb','PrecipTotal','Sunrise','Sunset','Tmin','Tmax']

X_subset = train_weather_df[feat]
y = train_weather_df['WnvPresent']
X_kaggle_subset = test_weather_df[feat]

Applying polynomial features to get a sense of interaction terms.

In [24]:
poly = PolynomialFeatures(interaction_only=True, include_bias=False)
X_train_poly = poly.fit_transform(X_subset)
# Reuse the transformer fitted on the training data instead of refitting on the Kaggle data
X_kaggle_poly = poly.transform(X_kaggle_subset)

In [25]:
poly_train = pd.DataFrame(X_train_poly, columns = poly.get_feature_names(X_subset.columns),index=train_weather_df.index)
poly_kaggle = pd.DataFrame(X_kaggle_poly, columns = poly.get_feature_names(X_kaggle_subset.columns),index=test_weather_df.index)

In [26]:
poly_train.columns

Index(['dateofyear', 'Latitude', 'Longitude', 'AddressAccuracy',
       'CULEX PIPIENS/RESTUANS', 'CULEX PIPIENS', 'CULEX RESTUANS', 'Heat',
       'Cool', 'WetBulb',
       ...
       'PrecipTotal Sunrise', 'PrecipTotal Sunset', 'PrecipTotal Tmin',
       'PrecipTotal Tmax', 'Sunrise Sunset', 'Sunrise Tmin', 'Sunrise Tmax',
       'Sunset Tmin', 'Sunset Tmax', 'Tmin Tmax'],
      dtype='object', length=120)
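
The length of 120 can be checked by hand: with `interaction_only=True` and the default degree of 2, the output consists of the 15 original features plus one product term for each of the 105 distinct pairs. The sketch below rebuilds the names with `itertools.combinations`. It assumes the space-separated naming seen in the output above and does not call scikit-learn.

```python
from itertools import combinations

feat = ['dateofyear', 'Latitude', 'Longitude', 'AddressAccuracy',
        'CULEX PIPIENS/RESTUANS', 'CULEX PIPIENS', 'CULEX RESTUANS',
        'Heat', 'Cool', 'WetBulb', 'PrecipTotal', 'Sunrise', 'Sunset', 'Tmin', 'Tmax']

# One interaction term per unordered pair of distinct features
pairs = [f'{a} {b}' for a, b in combinations(feat, 2)]
names = feat + pairs

print(len(feat), len(pairs), len(names))  # 15 105 120
```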

In [27]:
# feature_list = ['dateofyear',
#                 'Latitude', 
#                 'Longitude',
#                 'CULEX PIPIENS/RESTUANS',
#                 'CULEX PIPIENS',
#                 'CULEX RESTUANS',
#                 'dateofyear CULEX PIPIENS/RESTUANS',
#                 'dateofyear CULEX PIPIENS',
#                 'dateofyear CULEX RESTUANS',
#                 'Latitude CULEX PIPIENS/RESTUANS',
#                 'Latitude CULEX PIPIENS',
#                 'Latitude CULEX RESTUANS',
#                 'Longitude CULEX PIPIENS/RESTUANS', 
#                 'Longitude CULEX PIPIENS',
#                 'Longitude CULEX RESTUANS',
#                'Heat','Cool','WetBulb','PrecipTotal','Sunrise','Sunset','Tmin','Tmax']

feature_list = ['dateofyear CULEX PIPIENS/RESTUANS',
               'dateofyear CULEX PIPIENS',
               'dateofyear CULEX RESTUANS',
               'CULEX PIPIENS/RESTUANS Sunrise',
               'CULEX PIPIENS Sunrise',
               'CULEX RESTUANS Sunrise',
               'Longitude CULEX PIPIENS/RESTUANS',
               'Longitude CULEX PIPIENS',
               'Longitude CULEX RESTUANS',
              'WetBulb Sunrise','Sunrise']

X = poly_train[feature_list]
X_kaggle = poly_kaggle[feature_list]

## Model Evaluation & Kaggle Prediction Scoring

We first create functions that will help us with our modelling later.

In [49]:
model_dict = {
    'ss': StandardScaler(),
    'lr': LogisticRegression(solver='lbfgs'),
    'nb': MultinomialNB(),
    'knn': KNeighborsClassifier(),
    'dt': DecisionTreeClassifier(),
    'rf': RandomForestClassifier(random_state=42),
    'et': ExtraTreesClassifier(),
    'ada_dt': AdaBoostClassifier(random_state=42),
    'ada_rf': AdaBoostClassifier(base_estimator=RandomForestClassifier(random_state=42),random_state=42),
    'gboost': GradientBoostingClassifier()
}

model_full = {
    'ss': 'Standard Scaler',
    'lr': 'Logistic Regression',
    'knn': 'KNearestNeighbor',
    'nb': 'Multinomial NB',
    'dt': 'Decision Tree',
    'rf': 'Random Forest',
    'et': 'Extra Tree',
    'ada_dt': 'AdaBoost - Decision Tree',
    'ada_rf': 'AdaBoost - Random Forest',
    'gboost': 'Gradient Boosting Classifier'
}

param_dict = {    
    'knn': {
        'knn__n_neighbors': [2,3,4,5]
    },
    'lr': {
        'lr__max_iter': [100,200]
    },
    'nb': {},
    'dt': {
        'dt__max_depth': [5,7],
        'dt__min_samples_split': [10,15],
        'dt__min_samples_leaf': [3,4]
    },
    'rf': {
        'rf__n_estimators': [500,1000,2000],
        'rf__min_samples_split': [2,3],
        'rf__max_depth': [2,3],
        'rf__min_samples_leaf': [3,4]
        
    },
    'et': {
        'et__n_estimators': [1000,2000],
        'et__min_samples_split': [2,3],
    },
    'ada_dt': {
        'ada_dt__n_estimators': [50,100,200],
        'ada_dt__learning_rate': [0.9, 1]
    },
    'ada_rf': {
        'ada_rf__n_estimators': [50,100,200],
        'ada_rf__learning_rate': [0.9, 1],
        'ada_rf__base_estimator__max_depth': [3], 
        'ada_rf__base_estimator__min_samples_leaf': [4], 
        'ada_rf__base_estimator__min_samples_split': [2], 
        'ada_rf__base_estimator__n_estimators': [1000]
    },
    'gboost': {
        'gboost__n_estimators': [50,100],
        'gboost__max_depth': [2,3,4],
        'gboost__learning_rate': [0.1, 0.5, 1]
    }
}

def prepare_pipeline(list_of_models):
    """
    Prepare pipeline of models to be used for modelling
    
    Parameters
    ----------
    list_of_models: list[str]
        List of models to be included for pipeline
    
    Returns
    -------
    Pipeline
        Pipeline of models to be run
    """
    pipe_list = [(i,model_dict[i]) for i in list_of_models]
    return Pipeline(pipe_list)

def add_params(name,pipe_dict):
    """
    Add parameters for GridSearch
    
    Parameters
    ----------
    name: str
        Name of model/vectorization method to have params added.
    pipe_dict: Dictionary
        Dictionary that contains parameters to be added into GridSearch
    
    Returns
    -------
    Dictionary
        Dictionary that contains parameters to be added for GridSearch
    """
    params = param_dict[name]
    for k,v in params.items():
        pipe_dict[k] = v
    return pipe_dict

def grid_search(model,train_data=X,train_target=y):
    """
    Initialize and run GridSearch, then print the scores
    
    Parameters
    ----------
    model: str
        Key of the classification model to use. Note classification model has to be contained in model_dict.
        
    train_data: DataFrame
        Features to be used; split 75/25 into a training set and a held-out test set
    
    train_target: Series
        Target values aligned with train_data
    
    Returns
    -------
    None
        Prints the best cross-validated ROC AUC ('Train Score'), the ROC AUC on the
        held-out split ('Test Score') and the best parameters found
    """
    X_train, X_test, y_train,y_test = train_test_split(train_data,train_target,test_size=0.25,stratify=train_target,random_state=42)
    pipe_params = {}
    pipe_params = add_params(model,pipe_params)
    pipe = prepare_pipeline(['ss',model])
    gs = GridSearchCV(pipe,param_grid=pipe_params,cv=3,n_jobs=-1,scoring='roc_auc')
    gs.fit(X_train,y_train)
    print(f'Using {model_full[model]}:')
    print(f'Train Score: {round(gs.best_score_,4)}')
    print(f'Test Score: {round(gs.score(X_test,y_test),4)}')
    print(f'Using the following parameters: {gs.best_params_}')



In [29]:
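## Function to fit full data and predict kaggle target, store as csv
def predict_kaggle(model,output,X=X,y=y,X_kaggle=X_kaggle):
    """
    Fit model on the full training data and write the predicted probability of
    WnvPresent for the Kaggle data to '../KaggleSubmission/<output>.csv'.
    Note: unlike grid_search, no StandardScaler is applied here.
    """
    model.fit(X,y)
    pred = model.predict_proba(X_kaggle)[:,1]
    pred_df = pd.DataFrame({'Id':test_weather_df['Id'],'WnvPresent': pred})
    pred_df.to_csv('../KaggleSubmission/'+output+'.csv',index=False)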

We will run the following models and get the train-test scores + Kaggle scores as well.

1. Random Forest
2. Logistic Regression
3. AdaBoost with Decision Trees
4. Gradient Boosting
5. AdaBoost with Random Forest

In [50]:
grid_search('rf')

Using Random Forest:
Train Score: 0.8003
Test Score: 0.8166
Using the following parameters: {'rf__max_depth': 3, 'rf__min_samples_leaf': 4, 'rf__min_samples_split': 2, 'rf__n_estimators': 2000}


In [51]:
predict_kaggle(RandomForestClassifier(n_estimators=2000,min_samples_leaf=4,min_samples_split=2,max_depth=3,random_state=42),'rf_prediction')

In [32]:
grid_search('lr')

Using Logistic Regression:
Train Score: 0.7265
Test Score: 0.711
Using the following parameters: {'lr__max_iter': 100}


In [33]:
predict_kaggle(LogisticRegression(solver='lbfgs',max_iter=100),'lr_prediction')



In [52]:
grid_search('ada_dt')

Using AdaBoost - Decision Tree:
Train Score: 0.8144
Test Score: 0.8444
Using the following parameters: {'ada_dt__learning_rate': 1, 'ada_dt__n_estimators': 100}


In [53]:
predict_kaggle(AdaBoostClassifier(learning_rate=1,n_estimators=100,random_state=42),'ada_prediction')

In [36]:
grid_search('gboost')

Using Gradient Boosting Classifier:
Train Score: 0.8175
Test Score: 0.838
Using the following parameters: {'gboost__learning_rate': 0.1, 'gboost__max_depth': 3, 'gboost__n_estimators': 50}


In [37]:
predict_kaggle(GradientBoostingClassifier(n_estimators=50,max_depth=3,learning_rate=0.1),'gboost_prediction')

In [38]:
grid_search('ada_rf')

In [39]:
predict_kaggle(AdaBoostClassifier(base_estimator=RandomForestClassifier(max_depth=3,min_samples_leaf=4,min_samples_split=2,n_estimators=1000,random_state=42),learning_rate=0.9,n_estimators=50),'ada_rf_prediction')

In [40]:
## Trying with SMOTE (again)

# SMOTE is already imported in the first cell
# Resample the minority class. You can change the strategy to 'auto' if you are not sure.
sm = SMOTE(sampling_strategy='minority', random_state=7)
# Generate the oversampled data (fit_resample replaces the older fit_sample)
oversampled_trainX, oversampled_trainY = sm.fit_resample(X,y)
oversampled_train = pd.concat([pd.DataFrame(oversampled_trainY), pd.DataFrame(oversampled_trainX)], axis=1)
col = X.columns
oversampled_train.columns = col.insert(0,'WnvPresent')

In [41]:
oversampled_train.shape

(16306, 12)
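
The row count after resampling also reveals the original class balance. Assuming SMOTE leaves the majority class untouched and synthesizes minority samples until both classes are the same size, the numbers of negative and positive observations in the 8610 training rows can be derived as follows. This is a small arithmetic sketch, not output from the notebook.

```python
n_original = 8610    # training rows after dropping duplicates
n_resampled = 16306  # rows after SMOTE

# Both classes are equal in size after resampling, so half the rows are the untouched majority
n_negative = n_resampled // 2
n_positive = n_original - n_negative
positive_rate = n_positive / n_original

print(n_negative, n_positive, round(positive_rate, 4))  # 8153 457 0.0531
```

Only about 5% of the original observations are positive, so roughly 7,700 of the rows in the resampled data are synthetic.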

In [42]:
grid_search('lr',train_data=oversampled_train[col],train_target=oversampled_train['WnvPresent'])

Using Logistic Regression:
Train Score: 0.7222
Test Score: 0.7304
Using the following parameters: {'lr__max_iter': 100}




In [43]:
predict_kaggle(LogisticRegression(max_iter=200),'lr_smote_prediction',X=oversampled_train[col],y=oversampled_train['WnvPresent'])



In [44]:
grid_search('ada_dt',train_data=oversampled_train[col],train_target=oversampled_train['WnvPresent'])

Using AdaBoost - Decision Tree:
Train Score: 0.8937
Test Score: 0.9029
Using the following parameters: {'ada_dt__learning_rate': 1, 'ada_dt__n_estimators': 200}


In [45]:
predict_kaggle(AdaBoostClassifier(learning_rate=1,n_estimators=200),'ada_prediction_smote',X=oversampled_train[col],y=oversampled_train['WnvPresent'])

In [46]:
grid_search('gboost',train_data=oversampled_train[col],train_target=oversampled_train['WnvPresent'])

Using Gradient Boosting Classifier:
Train Score: 0.9604
Test Score: 0.9662
Using the following parameters: {'gboost__learning_rate': 1, 'gboost__max_depth': 4, 'gboost__n_estimators': 100}


In [47]:
# learning_rate set to 1 to match the best parameters reported by the grid search above
predict_kaggle(GradientBoostingClassifier(n_estimators=100,max_depth=4,learning_rate=1),'gboost_prediction_smote',X=oversampled_train[col],y=oversampled_train['WnvPresent'])

## Conclusion