## Advanced Modelling of the Data

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# First, let's reload the dataset.
# pandas is imported as well because it is needed for the display max_rows option.
import pandas as pd
import cudf
import cudf as dd
import numpy as np
import gc
import matplotlib.pyplot as plt
from cuml.preprocessing.model_selection import train_test_split
from cuml.metrics import confusion_matrix, roc_auc_score
use_gpu=True
%matplotlib inline

## Read in the Features

In [3]:
# load the large feature tables built earlier and split off the target column
train = dd.read_parquet('data_eng/feats/train_feats.parquet')
test = dd.read_parquet('data_eng/feats/test_feats.parquet')
train_target = train['TARGET']
train = train.drop('TARGET', axis=1)

In [4]:
train.shape

(307511, 380)

In [33]:
# The classes are imbalanced, so we compute the negative-to-positive ratio
# and later pass it to XGBoost as scale_pos_weight.
ratio = (train_target == 0).sum() / (train_target == 1).sum()
ratio

11.387150050352467
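To make the imbalance more concrete, the sketch below recovers the class counts from the two numbers printed so far. It assumes only that `ratio` is negatives divided by positives and that the two classes add up to the 307511 rows reported by `train.shape`; the variable names are illustrative and not part of the notebook.

```python
# Values copied from the cell outputs above.
n_rows = 307511               # number of rows in train
ratio = 11.387150050352467    # negatives / positives

# n_neg = ratio * n_pos and n_neg + n_pos = n_rows, so n_pos = n_rows / (1 + ratio).
n_pos = round(n_rows / (1 + ratio))
n_neg = n_rows - n_pos
positive_rate = n_pos / n_rows

print(n_pos, n_neg, f"{positive_rate:.2%}")
```

The positive class is a small minority of the rows, which is why the ratio is passed to XGBoost as `scale_pos_weight` further down.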

## Cross Validating our XGB Model

We start with a default train/test split, holding out 30% of the rows.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(train, train_target, 
                                                    test_size=0.3, random_state=42)

In [7]:
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold

We use stratified KFold because the classes are heavily imbalanced. Stratified KFold makes sure that each fold contains the same proportion of target values (0/1).
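The idea of stratification can be sketched in a few lines of plain Python. This is a hand-rolled illustration on toy labels, not scikit-learn's implementation: the helper `stratified_folds` and the toy data are assumptions made for the example, and `StratifiedKFold` uses its own allocation logic.

```python
from collections import defaultdict

def stratified_folds(labels, n_splits=5):
    """Assign sample indices to folds class by class, in round-robin order."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        # Dealing each class out separately keeps its share equal across folds.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

# Toy labels: 1 positive in every 12 samples, roughly like this dataset.
labels = [1 if i % 12 == 0 else 0 for i in range(1200)]
folds = stratified_folds(labels, n_splits=5)

# Share of positives in each fold.
positive_rates = [sum(labels[i] for i in fold) / len(fold) for fold in folds]
print(positive_rates)
```

With a plain (unstratified) split, a rare class can end up over- or under-represented in some folds, which makes per-fold AUC values harder to compare.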

In [8]:
skf = StratifiedKFold()

In [23]:
cv_params = {'tree_method': 'gpu_hist', 'max_depth': 6,
             'learning_rate': 0.05, 'subsample': 0.5, 'objective': 'binary:logistic',
             'eval_metric': 'auc', 'scale_pos_weight': ratio,
             'gamma': 0.3}

We use stratified KFold to check whether different splits of the data produce consistent results. This matters because it shows whether we are overfitting to a particular subset of the data.

In [24]:
for i, (train_index, test_index) in enumerate(skf.split(X_train.index.to_arrow().tolist(), 
                                                        y_train.to_arrow().tolist())):
    print("Fold {0}".format(i))
    X_train_kf, X_valid_kf = X_train.iloc[train_index], X_train.iloc[test_index]
    y_train_kf, y_valid_kf = y_train.iloc[train_index], y_train.iloc[test_index]
    
    train_matrix_kf= xgb.DMatrix(X_train_kf, label=y_train_kf)
    val_matrix_kf = xgb.DMatrix(X_valid_kf, label=y_valid_kf)
    
    bst = xgb.train(params=cv_params, dtrain=train_matrix_kf, 
                evals=[(train_matrix_kf, 'train'), (val_matrix_kf, 'valid')], 
                num_boost_round=100, early_stopping_rounds=20, verbose_eval=20)

Fold 0
[0]	train-auc:0.70735	valid-auc:0.68573
[20]	train-auc:0.75794	valid-auc:0.72514
[40]	train-auc:0.77431	valid-auc:0.73567
[60]	train-auc:0.78762	valid-auc:0.74180
[80]	train-auc:0.79764	valid-auc:0.74578
[99]	train-auc:0.80578	valid-auc:0.74765
Fold 1
[0]	train-auc:0.70571	valid-auc:0.69247
[20]	train-auc:0.75680	valid-auc:0.73440
[40]	train-auc:0.77371	valid-auc:0.74497
[60]	train-auc:0.78644	valid-auc:0.75032
[80]	train-auc:0.79595	valid-auc:0.75201
[99]	train-auc:0.80295	valid-auc:0.75382
Fold 2
[0]	train-auc:0.70790	valid-auc:0.69130
[20]	train-auc:0.75737	valid-auc:0.73389
[40]	train-auc:0.77405	valid-auc:0.74464
[60]	train-auc:0.78607	valid-auc:0.74971
[80]	train-auc:0.79635	valid-auc:0.75321
[99]	train-auc:0.80420	valid-auc:0.75544
Fold 3
[0]	train-auc:0.70881	valid-auc:0.69364
[20]	train-auc:0.75679	valid-auc:0.72928
[40]	train-auc:0.77411	valid-auc:0.73986
[60]	train-auc:0.78747	valid-auc:0.74563
[80]	train-auc:0.79581	valid-auc:0.74798
[99]	train-auc:0.80423	valid-auc:
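One way to read the log above is to compare the final scores across folds. The sketch below is a minimal summary that uses only the folds whose last line is complete in the output (folds 0 to 2); the dictionary layout and variable names are assumptions made for the example.

```python
from statistics import mean

# Final (train-auc, valid-auc) at round 99, copied from the log above.
# Only folds with a complete final line are included.
final_auc = {
    0: (0.80578, 0.74765),
    1: (0.80295, 0.75382),
    2: (0.80420, 0.75544),
}

valid_scores = [valid for _, valid in final_auc.values()]
mean_valid = mean(valid_scores)
spread = max(valid_scores) - min(valid_scores)

# Gap between train and validation AUC, a rough indicator of overfitting.
gaps = {fold: train - valid for fold, (train, valid) in final_auc.items()}

print(f"mean valid-auc={mean_valid:.4f}, spread={spread:.4f}")
print({fold: round(gap, 4) for fold, gap in gaps.items()})
```

A small spread between folds suggests the result does not depend much on the particular split, while the train/validation gap hints at how much the model is fitting noise in the training folds.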

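Once the results look reasonably stable across folds, we train once on the full training split (`X_train`) and validate on the held-out split (`X_test`), using a lower learning rate and more boosting rounds.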

In [25]:
# copy the dict so that lowering the learning rate here does not also change cv_params
full_cv_params = dict(cv_params)
full_cv_params['learning_rate'] = cv_params['learning_rate'] / 10

In [26]:
train_matrix = xgb.DMatrix(X_train, label=y_train)
val_matrix = xgb.DMatrix(X_test, label=y_test)
    
final_bst = xgb.train(params=full_cv_params, dtrain=train_matrix, 
                evals=[(train_matrix, 'train'), (val_matrix, 'valid')], 
                num_boost_round=500, early_stopping_rounds=20, verbose_eval=50)

[0]	train-auc:0.71216	valid-auc:0.69115
[50]	train-auc:0.74007	valid-auc:0.71782
[100]	train-auc:0.74560	valid-auc:0.72171
[150]	train-auc:0.75016	valid-auc:0.72491
[200]	train-auc:0.75452	valid-auc:0.72795
[250]	train-auc:0.75881	valid-auc:0.73062
[300]	train-auc:0.76311	valid-auc:0.73342
[350]	train-auc:0.76689	valid-auc:0.73570
[400]	train-auc:0.77046	valid-auc:0.73779
[450]	train-auc:0.77398	valid-auc:0.73967
[499]	train-auc:0.77719	valid-auc:0.74139


In [27]:
adv_y_pred = final_bst.predict(val_matrix)
# convert to 0/1 based on threshold
adv_y_final = np.where(adv_y_pred>0.5, 1, 0)

## Final Assessments

In [28]:
confusion_matrix(y_test, adv_y_final.astype('int64'))


       0      1
0  59476  25351
1   2543   4883
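The confusion matrix can be turned into the usual summary metrics at the 0.5 threshold. The sketch below assumes that rows are true labels and columns are predicted labels (the scikit-learn convention, which is assumed to hold here); the row sums of 84827 and 7426 are consistent with the class imbalance seen earlier, which supports that reading.

```python
# Counts copied from the confusion matrix above.
# Assumption: rows are true labels, columns are predicted labels.
tn, fp = 59476, 25351
fn, tp = 2543, 4883

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

With `scale_pos_weight` set to the class ratio, the model trades precision for recall at this threshold, so the threshold itself is worth revisiting if false positives are costly.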


In [29]:
roc_auc_score(y_test, adv_y_pred)

0.7413902878761292