<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 4: Predicting Presence of West Nile Virus <br>
**Notebook 4a: Modelling (PyCaret)**

## TABLE OF CONTENTS

**1a. EDA on Training Dataset** <br>
**1b. EDA on Weather Dataset** <br>
**1c. EDA on Spray Dataset** <br>
**2. Data Preprocessing I** <br>
**3. Data Preprocessing II** <br>
**4a. Modelling (PyCaret) (This Notebook)** <br>
- [01. Get Data](#01.-Get-Data) <br>
- [02. Compress Data](#02.-Compress-Data) <br>
- [03. Run PyCaret!](#03.-Run-PyCaret!) <br>
- [04. Model(s) Selection](#04.-Model(s)-Selection)

**4b. Modelling** <br>
**5. Cost Benefit Analysis** <br>
**6. Conclusion & Recommendations** <br>

In [15]:
#!pip install --pre pycaret;
from pycaret.classification import *

In [16]:
## Import libraries
from tqdm import tqdm
import pandas as pd
import numpy as np

# For splitting the data into train and test sets
from sklearn.model_selection import train_test_split

# For specific functions 
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# For metrics to assess model
from sklearn.metrics import accuracy_score,f1_score,roc_auc_score,recall_score,precision_score

## 01. Get Data

In [17]:
def df_getter(rolling_days, drop_codesum = False):
    
    # load in data for the requested rolling window
    train = pd.read_csv('./assets/Modelling_Data/train_r' + str(rolling_days) + '.csv', index_col=0)
    
    # drop Date Column
    train = train.drop(columns = ['Date'])
    
    # drop CodeSum Columns
    if drop_codesum:
        train = train.drop(columns = ['MIFG','TS','SQ','GR','VCFG','FG+','SN','FG',
                                      'VCTS','BCFG','BR','RA','FU','DZ','TSRA','HZ'])
    
    # Create X and y (reset both indexes so they stay aligned)
    X = train.drop(columns = ['WnvPresent']).reset_index(drop = True)
    y = train['WnvPresent'].reset_index(drop = True)
    
    # Dummify Columns
    X = pd.get_dummies(X, columns=['Species', 'Trap'], drop_first = True)
    
    # Train Test Split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 88, stratify = y)
    
    # Standard Scale weather variables
    weather_cols = ['Tavg', 'PrecipTotal','StnPressure','ResultDir','AvgSpeed','Sunlight']
    ct = ColumnTransformer([("sc", StandardScaler(), weather_cols)],
                           remainder = 'passthrough')

    X_train_sc = ct.fit_transform(X_train)
    X_test_sc = ct.transform(X_test)
    
    # Convert to dataframe. ColumnTransformer outputs the scaled columns first,
    # followed by the passthrough columns, so the names are rebuilt in that order.
    out_cols = weather_cols + [c for c in X_train.columns if c not in weather_cols]
    X_train_sc = pd.DataFrame(X_train_sc, columns=out_cols)
    X_test_sc = pd.DataFrame(X_test_sc, columns=out_cols)
    
    return X_train_sc, X_test_sc, y_train, y_test

In [18]:
X_train_sc, X_test_sc, y_train, y_test = df_getter(20)
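
A note on column order: `ColumnTransformer` with `remainder='passthrough'` places the transformed columns first and the untouched columns after them. The sketch below uses a small made-up frame (the column names and values are hypothetical, not taken from the project data) to show how the output column names can be rebuilt in the order the transformer actually produces.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame in which the scaled column is NOT the first column.
toy = pd.DataFrame({
    "Species_X": [0, 1, 0, 1],
    "Tavg": [60.0, 70.0, 80.0, 90.0],
    "Trap_T002": [1, 0, 0, 1],
})

scaled_cols = ["Tavg"]
ct = ColumnTransformer(
    [("sc", StandardScaler(), scaled_cols)],
    remainder="passthrough",
)
scaled = ct.fit_transform(toy)

# Output order: transformed columns first, then the passthrough remainder.
ordered_cols = scaled_cols + [c for c in toy.columns if c not in scaled_cols]
toy_sc = pd.DataFrame(scaled, columns=ordered_cols, index=toy.index)

print(list(toy.columns))     # order of the input frame
print(list(toy_sc.columns))  # order produced by the transformer
```

If the input frame already has the scaled columns first, both orders coincide and reusing the original column names is harmless; otherwise the labels would be shifted.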

## 02. Compress Data

In [19]:
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in tqdm(df.columns):
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df

In [20]:
X_train_sc_compressed = reduce_mem_usage(X_train_sc)

Memory usage of dataframe is 8.65 MB


100%|███████████████████████████████████████| 161/161 [00:00<00:00, 1613.51it/s]

Memory usage after optimization is: 2.16 MB
Decreased by 75.0%
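
The 75% reduction is consistent with every column being downcast from 64-bit to 16-bit floats (8 bytes to 2 bytes per value). That saving comes at a cost in precision, and the strict `<` / `>` comparisons in `reduce_mem_usage` also push a column whose extreme value sits exactly on a type boundary into the next larger type. The snippet below is a small illustration of both points using made-up values; it is not part of the original pipeline.

```python
import numpy as np

# Hypothetical feature value, chosen because it is not exactly representable.
value = 0.1
as_f16 = np.float16(value)
as_f32 = np.float32(value)

# Machine epsilon: float16 keeps roughly 3 decimal digits, float32 roughly 7.
print(np.finfo(np.float16).eps)
print(np.finfo(np.float32).eps)

err_f16 = abs(float(as_f16) - value)
err_f32 = abs(float(as_f32) - value)
print(err_f16, err_f32)

# Boundary check: a column whose maximum is exactly 127 fails the strict test
# used in reduce_mem_usage, although the value does fit in int8.
col_max = 127
fits_strict = col_max < np.iinfo(np.int8).max
fits_inclusive = col_max <= np.iinfo(np.int8).max
print(fits_strict, fits_inclusive)
```

For standardised weather features the float16 rounding error is small, but it is worth keeping in mind when comparing results against a run on uncompressed data.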





In [21]:
## PyCaret requires the target 'y' to be in the same DataFrame as the features (X_train)

In [22]:
# Recombine with y_train
y_train.reset_index(drop=True, inplace = True)
pycaret_df = pd.concat([X_train_sc_compressed,y_train],axis = 1)

# Rename columns
pycaret_df.columns = (list(X_train_sc.columns) + ['WnvPresent'])
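
The index reset before the concatenation matters because `pd.concat(..., axis=1)` aligns rows on their index labels rather than on position. The following sketch, with hypothetical toy data, shows what happens when the labels of the features and the target differ, as they do after a shuffled train/test split.

```python
import pandas as pd

# Features with a fresh 0..n-1 index (as after building a new DataFrame).
X = pd.DataFrame({"a": [10, 20, 30]})

# Target that still carries its shuffled original index (hypothetical labels).
y = pd.Series([1, 0, 1], index=[7, 3, 5], name="target")

# Without a reset, concat takes the union of both indexes and fills with NaN.
misaligned = pd.concat([X, y], axis=1)

# With a reset, rows are paired by position as intended.
aligned = pd.concat([X, y.reset_index(drop=True)], axis=1)

print(misaligned.shape, aligned.shape)
```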

In [23]:
pycaret_df.head()

| | Tavg | PrecipTotal | StnPressure | ResultDir | AvgSpeed | Sunlight | MIFG | TS | SQ | GR | ... | Trap_T231 | Trap_T232 | Trap_T233 | Trap_T235 | Trap_T236 | Trap_T237 | Trap_T238 | Trap_T900 | Trap_T903 | WnvPresent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.822754 | 2.994141 | -0.439941 | -0.037933 | -0.944336 | -0.369385 | 0.0 | 0.290771 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | -0.876953 | -0.675781 | 0.556641 | -0.008759 | 1.103516 | -1.99707 | 0.0 | 0.049988 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 2 | 0.314453 | 1.56543 | 0.771484 | -0.062195 | -0.750977 | -0.979004 | 0.0 | 0.19397 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 3 | -1.31543 | -0.182373 | 1.863281 | -2.976562 | -1.994141 | -1.37207 | 0.039032 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 4 | 2.197266 | -0.273438 | -0.630371 | 0.199951 | -0.614258 | 0.394287 | 0.0 | 0.294678 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |


## 03. Run PyCaret!

In [25]:
%%time

## Setup pycaret models scan

my_pie_carrot = setup(data = pycaret_df,
                      target = 'WnvPresent',
                      normalize = False,
                      fold = 5,
                      #use_gpu=True,
                      numeric_features=list(pycaret_df.columns[:-1]),
                      imputation_type='iterative',
                      n_jobs= -1,
                      session_id = 42,
                      preprocess = False
                     )

| | Description | Value |
|---|---|---|
| 0 | session_id | 42 |
| 1 | Target | WnvPresent |
| 2 | Target Type | Binary |
| 3 | Label Encoded | 0: 0, 1: 1 |
| 4 | Original Data | (7039, 162) |
| 5 | Missing Values | False |
| 6 | Numeric Features | 161 |
| 7 | Categorical Features | 0 |
| 8 | Transformed Train Set | (4927, 161) |
| 9 | Transformed Test Set | (2112, 161) |


CPU times: user 411 ms, sys: 44.3 ms, total: 455 ms
Wall time: 4.13 s


In [26]:
%%time

## Perform pycaret models scan
my_pie_carrot_results = compare_models()

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| lr | Logistic Regression | 0.9454 | 0.7731 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.038 |
| ridge | Ridge Classifier | 0.9454 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.028 |
| qda | Quadratic Discriminant Analysis | 0.9454 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.036 |
| ada | Ada Boost Classifier | 0.9432 | 0.7717 | 0.0335 | 0.4344 | 0.0604 | 0.0506 | 0.1008 | 0.084 |
| gbc | Gradient Boosting Classifier | 0.9432 | 0.792 | 0.0448 | 0.346 | 0.0785 | 0.067 | 0.1067 | 0.244 |
| svm | SVM - Linear Kernel | 0.9424 | 0.0 | 0.0074 | 0.0211 | 0.011 | 0.0054 | 0.0062 | 0.036 |
| knn | K Neighbors Classifier | 0.942 | 0.7045 | 0.0669 | 0.3297 | 0.1102 | 0.0943 | 0.1275 | 0.068 |
| et | Extra Trees Classifier | 0.9407 | 0.6889 | 0.0781 | 0.3271 | 0.1259 | 0.1069 | 0.1375 | 0.088 |
| rf | Random Forest Classifier | 0.9393 | 0.7375 | 0.0931 | 0.3214 | 0.1432 | 0.1215 | 0.1476 | 0.094 |
| lightgbm | Light Gradient Boosting Machine | 0.9379 | 0.8027 | 0.1192 | 0.3193 | 0.1728 | 0.1475 | 0.1675 | 0.052 |


CPU times: user 2.78 s, sys: 974 ms, total: 3.75 s
Wall time: 5.95 s


  _warn_prf(average, modifier, msg_start, len(result))

In [27]:
# Export csv of model evaluation
pycaret_summary = pull()
pycaret_summary.to_csv('./assets/Pycaret/pycaret_summary.csv')

## 04. Model(s) Selection

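Examining the cross-validated results above, we select the random forest (`rf`), logistic regression (`lr`) and gradient boosting (`gbc`) models and tune each of them for AUC. Accuracy alone is not informative here: several models score 0.9454 with a recall of 0.0, which is consistent with a model that never flags the virus on a heavily imbalanced target.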
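
The effect of class imbalance on the metrics in the comparison table can be reproduced with a toy example. The sketch below assumes a hypothetical label vector with roughly the same positive rate (about 5.5%) and scores a "model" that always predicts the negative class; the data are made up purely for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Hypothetical labels: 55 positives and 945 negatives.
y_true = np.array([1] * 55 + [0] * 945)

# A baseline that never predicts the virus.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)

# Constant scores carry no ranking information, so the AUC is 0.5.
auc = roc_auc_score(y_true, y_pred.astype(float))

print(acc, rec, auc)
```

High accuracy with zero recall is therefore what a trivial baseline achieves, which is why a ranking metric such as AUC is a more useful target when tuning on this data.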

In [35]:
rf = create_model('rf')
tuned_rf = tune_model(rf, n_iter = 10, optimize='AUC', choose_better=True)

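| | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.9452 | 0.809 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.9452 | 0.8158 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.9462 | 0.8225 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.9452 | 0.8149 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.9452 | 0.8063 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Mean | 0.9454 | 0.8137 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| SD | 0.0004 | 0.0057 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |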


In [36]:
lr = create_model('lr')
tuned_lr = tune_model(lr, n_iter = 10, optimize='AUC', choose_better=True)

| | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.9452 | 0.7489 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.9462 | 0.7702 | 0.0185 | 1.0 | 0.0364 | 0.0344 | 0.1324 |
| 2 | 0.9472 | 0.8399 | 0.0189 | 1.0 | 0.037 | 0.0351 | 0.1337 |
| 3 | 0.9452 | 0.7521 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.9452 | 0.7752 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Mean | 0.9458 | 0.7773 | 0.0075 | 0.4 | 0.0147 | 0.0139 | 0.0532 |
| SD | 0.0008 | 0.0329 | 0.0092 | 0.4899 | 0.018 | 0.017 | 0.0652 |


In [37]:
gbc = create_model('gbc')
tuned_gbc = tune_model(gbc, n_iter = 10, optimize='AUC',choose_better=True)

| | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.9432 | 0.7966 | 0.037 | 0.3333 | 0.0667 | 0.0563 | 0.0958 |
| 1 | 0.9442 | 0.8142 | 0.0185 | 0.3333 | 0.0351 | 0.0295 | 0.0676 |
| 2 | 0.9482 | 0.8274 | 0.0566 | 0.75 | 0.1053 | 0.0985 | 0.197 |
| 3 | 0.9462 | 0.8039 | 0.037 | 0.6667 | 0.0702 | 0.0648 | 0.1486 |
| 4 | 0.9442 | 0.8067 | 0.037 | 0.4 | 0.0678 | 0.0591 | 0.1083 |
| Mean | 0.9452 | 0.8098 | 0.0372 | 0.4967 | 0.069 | 0.0616 | 0.1235 |
| SD | 0.0018 | 0.0105 | 0.012 | 0.1765 | 0.0222 | 0.0221 | 0.0451 |
