## Modeling the H1N1 and seasonal flu shots
I have two target variables:
1. H1N1 vaccination
2. seasonal vaccination

For now, both target variables will be predicted from the same underlying X features.
As most of my features are categorical or binary, my goal is to build 4 models and then stack them:
1. CatBoost on the original (not one-hot encoded) dataset
2. LightGBM on the original (not one-hot encoded) dataset, with the categorical features declared explicitly
3. XGBoost on the one-hot encoded dataset
4. LightGBM / CatBoost / sklearn GBM on the one-hot encoded dataset (whichever performs best)
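
The stacking step is not built yet, so here is only a minimal sketch of how I expect it to work when the base models are trained on different versions of the dataset. Everything in it is a stand-in: the data is synthetic, `X_view_a` / `X_view_b` are hypothetical placeholders for the original and one-hot encoded datasets, and two sklearn classifiers take the place of the four boosted models. The idea is to collect out-of-fold probabilities from each base model on the same folds and fit a simple meta-model on them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in data (assumption: NOT the flu survey data)
X_demo, y_demo = make_classification(n_samples=600, n_features=12,
                                     n_informative=6, random_state=0)

# Two hypothetical "views" of the same rows, standing in for the
# original and the one-hot encoded versions of the dataset
X_view_a = X_demo
X_view_b = np.hstack([X_demo, (X_demo > 0).astype(int)])

# Each base model is paired with the view it is trained on
base_models = [
    ('gbm_view_a', GradientBoostingClassifier(n_estimators=50, random_state=0), X_view_a),
    ('rf_view_b', RandomForestClassifier(n_estimators=50, random_state=0), X_view_b),
]

# Fixed seed, so every base model is evaluated on identical folds
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold probability of the positive class, one column per base model
oof_columns = [
    cross_val_predict(model, X_view, y_demo, cv=folds, method='predict_proba')[:, 1]
    for _, model, X_view in base_models
]
meta_features = np.column_stack(oof_columns)

# Meta-model fitted on the out-of-fold predictions
meta_model = LogisticRegression()
meta_model.fit(meta_features, y_demo)

# In-sample AUC of the meta-model (optimistic; a held-out set is needed for a fair score)
stacked_auc = roc_auc_score(y_demo, meta_model.predict_proba(meta_features)[:, 1])
print('Stacked AUC (in-sample): %1.4f' % stacked_auc)
```

Using out-of-fold predictions keeps the meta-model from seeing base-model outputs on rows those models were trained on, which is the main leakage risk in stacking.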

### First, I'll build models for H1N1

In [87]:
import pandas as pd
import numpy as np

from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

import time
import warnings
warnings.filterwarnings('ignore')

def evaluate_model(model_name, model, X, y):
    """Print AUC and log loss for a fitted classifier and return them as a one-row DataFrame."""

    # Predicted probability of the positive class
    predictions_probas = model.predict_proba(X)[:, 1]

    AUC = roc_auc_score(y, predictions_probas)
    LogLoss = log_loss(y, predictions_probas)

    print('AUC for', model_name, ': %1.4f' % AUC)
    print('LogLoss for', model_name, ': %1.3f' % LogLoss)

    metrics_table = pd.DataFrame({'AUC': [round(AUC, 4)], 'LogLoss': [round(LogLoss, 3)]}, index=[model_name])

    return metrics_table

#### CatBoost

In [81]:
h1n1_cat = pd.read_csv('../../data/h1n1_catboost.csv')

# Pass the axis by keyword: the positional form drop(labels, 1) was removed in pandas 2.0
X_catboost = h1n1_cat.drop(columns=['h1n1_vaccine'])
y_catboost = h1n1_cat['h1n1_vaccine'].copy()

print('Original shape:', h1n1_cat.shape)
print('X shape:', X_catboost.shape)
print('y shape:', y_catboost.shape)

Original shape: (26707, 34)
X shape: (26707, 33)
y shape: (26707,)


In [82]:
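# CatBoost categorical features must be integers or strings, not floats:
# cast float columns to nullable integers, then to object
for col in X_catboost.dtypes[X_catboost.dtypes == 'float64'].index.tolist():
    X_catboost[col] = X_catboost[col].astype(pd.Int64Dtype()).astype('O')

# Treat missing values as their own category
for col in X_catboost.columns.tolist():
    X_catboost[col] = X_catboost[col].fillna('None')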
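
To make the effect of the conversion above concrete, here is a small sketch on a toy DataFrame (the column name `flag` and its values are made up for illustration). It shows a float column with a missing value turning into integer categories plus an explicit `'None'` category.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical): a binary flag stored as float because of the NaN
toy = pd.DataFrame({'flag': [1.0, 0.0, np.nan]})

# float64 -> nullable Int64 -> object, so 1.0 becomes 1 and NaN becomes <NA>
toy['flag'] = toy['flag'].astype(pd.Int64Dtype()).astype('O')

# The missing value becomes its own category
toy['flag'] = toy['flag'].fillna('None')

print(toy['flag'].tolist())
```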

In [85]:
X_cat_train, X_cat_test, y_cat_train, y_cat_test = train_test_split(X_catboost, y_catboost, test_size = 0.2, random_state = 20202020)

In [None]:
start = time.time()
print("Started at", str(time.ctime(int(start))))

# Per the CatBoost docs, max_leaves applies only to the Lossguide grow policy and
# min_data_in_leaf only to Lossguide and Depthwise
cat_params = {'learning_rate': [0.1, 0.2, 0.01],
              'l2_leaf_reg': [0.5, 0.1, 1],
              'subsample': [0.75],
              'rsm': [2/3, 3/4],
              'max_depth': [7, 9, 11], # up to 16 (8 on gpu)
              'grow_policy': ['Lossguide', 'Depthwise', 'SymmetricTree'],
              'min_data_in_leaf': [17, 23, 29],
              'max_leaves': [15, 19, 23],
              'iterations': [100, 500]}

cat = CatBoostClassifier(random_state = 20202020, verbose = 0,
                         eval_metric = 'AUC:hints=skip_train~false', objective = 'Logloss',
                         cat_features = X_catboost.columns.tolist())

GRID_cat = GridSearchCV(cat, param_grid = cat_params, cv = 5, scoring = 'roc_auc', n_jobs = -1)

GRID_cat.fit(X_cat_train, y_cat_train)

print("Ended at", str(time.ctime(int(time.time()))))
print((time.time() - start) / 60, 'minutes')

Started at Thu Feb  4 14:37:21 2021
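
To gauge how expensive the search above is, the number of candidates can be counted before fitting. This is a small sketch using sklearn's `ParameterGrid` on a copy of the grid (`cat_params_demo` is just a local name for this example), assuming the same 5-fold cross-validation as in `GridSearchCV`.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the CatBoost grid from the search cell
cat_params_demo = {'learning_rate': [0.1, 0.2, 0.01],
                   'l2_leaf_reg': [0.5, 0.1, 1],
                   'subsample': [0.75],
                   'rsm': [2/3, 3/4],
                   'max_depth': [7, 9, 11],
                   'grow_policy': ['Lossguide', 'Depthwise', 'SymmetricTree'],
                   'min_data_in_leaf': [17, 23, 29],
                   'max_leaves': [15, 19, 23],
                   'iterations': [100, 500]}

n_candidates = len(ParameterGrid(cat_params_demo))  # every combination of the listed values
n_fits = n_candidates * 5                           # one fit per candidate per CV fold

print(n_candidates, 'candidates,', n_fits, 'fits')
```

This grid yields 2916 candidates, so 14580 CatBoost fits with `cv = 5`. If that turns out to be too slow, sklearn's `RandomizedSearchCV` with a fixed `n_iter` would be one way to sample the same space at a bounded cost.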


In [88]:
evaluate_model('CatBoost', GRID_cat.best_estimator_, X_cat_test, y_cat_test)

AUC for CatBoost : 0.8338
LogLoss for CatBoost : 0.478


| | AUC | LogLoss |
|---|---|---|
| CatBoost | 0.8338 | 0.478 |

