## Advanced Modelling of the Data

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# first, let's reload the dataset
# we need to import pandas because it is also needed for display max_rows
import pandas as pd
import cudf
import cudf as dd
import numpy as np
import gc
import matplotlib.pyplot as plt
from cuml.preprocessing.model_selection import train_test_split
from cuml.metrics import confusion_matrix, roc_auc_score
use_gpu=True
%matplotlib inline

## Read in the Features

In [3]:
## testing our previous big feats table
train = dd.read_parquet('data_eng/feats/train_feats.parquet')
test = dd.read_parquet('data_eng/feats/test_feats.parquet')
train_target = train['TARGET']
train = train.drop('TARGET', axis =1)

In [4]:
train.shape

(307511, 380)

In [5]:
# the classes are imbalanced, so we pass the negative-to-positive ratio to xgb as scale_pos_weight
ratio = (train_target == 0).sum()/ (train_target == 1).sum()
ratio

11.387150050352467
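The ratio above is the number of negative rows divided by the number of positive rows, which is the starting value the XGBoost documentation suggests for `scale_pos_weight`. A minimal sketch of the same calculation on a toy label array (the helper name `neg_pos_ratio` and the toy labels are illustrative, not part of the notebook):

```python
import numpy as np

def neg_pos_ratio(labels):
    """Return count(label == 0) / count(label == 1) for a binary label array."""
    labels = np.asarray(labels)
    n_pos = int((labels == 1).sum())
    n_neg = int((labels == 0).sum())
    if n_pos == 0:
        raise ValueError("no positive labels, ratio is undefined")
    return n_neg / n_pos

# Toy labels (hypothetical): 9 negatives and 1 positive
toy_labels = np.array([0] * 9 + [1])
toy_ratio = neg_pos_ratio(toy_labels)
print(toy_ratio)
```

By the same arithmetic, a ratio of about 11.4 corresponds to roughly 8% positive rows, since the positive share is 1 / (1 + ratio).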

## Cross Validating our XGB Model

Default train test split

In [6]:
X_train, X_test, y_train, y_test = train_test_split(train, train_target, 
                                                    test_size=0.3, random_state=42)

In [7]:
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold

We use stratified KFold because the classes are heavily imbalanced. Stratified KFold ensures that every fold contains the same proportion of target values (0/1).

In [8]:
skf = StratifiedKFold()
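To see what stratification does, here is a small sketch on synthetic labels (90 negatives, 10 positives, both made up for illustration). It uses scikit-learn's `StratifiedKFold` with explicit `n_splits`, `shuffle`, and `random_state` arguments, which are choices for this example rather than the settings used in the notebook:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels (hypothetical): 10% positives
y_demo = np.array([0] * 90 + [1] * 10)
# Feature values do not affect how StratifiedKFold splits, so zeros are enough
X_demo = np.zeros((len(y_demo), 1))

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_pos_rates = []
for train_idx, valid_idx in skf_demo.split(X_demo, y_demo):
    # Share of positives in the validation part of this fold
    fold_pos_rates.append(float(y_demo[valid_idx].mean()))

print(fold_pos_rates)
```

Each validation fold should carry the same 10% positive share as the full label array, which is the property we rely on when comparing AUC across folds.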

In [9]:
cv_params = {'tree_method': 'gpu_hist', 'max_depth': 6, 
        'learning_rate': 0.05, 'subsample':0.5, 'objective': 'binary:logistic',
         'eval_metric':'auc', 'scale_pos_weight': ratio,
         'gamma':0.3}

We use stratified KFold to check whether different splits of the data produce consistent results. This helps confirm that we aren't overfitting to a particular subset of the data.

In [16]:
for i, (train_index, test_index) in enumerate(skf.split(X_train.index.to_arrow().tolist(), 
                                                        y_train.to_arrow().tolist())):
    print("Fold {0}".format(i))
    X_train_kf, X_valid_kf = X_train.iloc[train_index], X_train.iloc[test_index]
    y_train_kf, y_valid_kf = y_train.iloc[train_index], y_train.iloc[test_index]
    
    train_matrix_kf= xgb.DMatrix(X_train_kf, label=y_train_kf)
    val_matrix_kf = xgb.DMatrix(X_valid_kf, label=y_valid_kf)
    
    bst = xgb.train(params=cv_params, dtrain=train_matrix_kf, 
                evals=[(train_matrix_kf, 'train'), (val_matrix_kf, 'valid')], 
                num_boost_round=100, early_stopping_rounds=20, verbose_eval=20)

Fold 0
[0]	train-auc:0.71138	valid-auc:0.69010
[20]	train-auc:0.74005	valid-auc:0.71538
[40]	train-auc:0.74308	valid-auc:0.71749
[60]	train-auc:0.74568	valid-auc:0.71993
[80]	train-auc:0.74777	valid-auc:0.72130
[99]	train-auc:0.74963	valid-auc:0.72295
Fold 1
[0]	train-auc:0.70649	valid-auc:0.69244
[20]	train-auc:0.73809	valid-auc:0.71940
[40]	train-auc:0.74022	valid-auc:0.72057
[60]	train-auc:0.74271	valid-auc:0.72174
[80]	train-auc:0.74562	valid-auc:0.72387
[99]	train-auc:0.74833	valid-auc:0.72566
Fold 2
[0]	train-auc:0.70639	valid-auc:0.70249
[20]	train-auc:0.73675	valid-auc:0.72853
[40]	train-auc:0.73934	valid-auc:0.73135
[60]	train-auc:0.74212	valid-auc:0.73332
[80]	train-auc:0.74516	valid-auc:0.73541
[99]	train-auc:0.74731	valid-auc:0.73678
Fold 3
[0]	train-auc:0.70936	valid-auc:0.69049
[20]	train-auc:0.73827	valid-auc:0.71424
[40]	train-auc:0.74187	valid-auc:0.71655
[60]	train-auc:0.74370	valid-auc:0.71758
[80]	train-auc:0.74642	valid-auc:0.71976
[99]	train-auc:0.74839	valid-auc:
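One way to put a number on "consistent" is to summarize the final validation AUC per fold. The sketch below uses only the round-99 values from folds 0 to 2 in the log above, since the log for the last printed fold is cut off; the variable names are illustrative:

```python
from statistics import mean, stdev

# Round-99 validation AUC copied from the complete fold logs above
final_valid_auc = [0.72295, 0.72566, 0.73678]

auc_mean = mean(final_valid_auc)
auc_std = stdev(final_valid_auc)  # sample standard deviation
auc_spread = max(final_valid_auc) - min(final_valid_auc)

print(f"mean={auc_mean:.5f} std={auc_std:.5f} spread={auc_spread:.5f}")
```

A small spread relative to the mean suggests the model behaves similarly across splits; what counts as "small" is a judgment call for the problem at hand.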

Once the tuning results look reasonably stable across folds, we train our "final" model on the full training split and validate it on the held-out 30%.

In [None]:
# copy the params so lowering the learning rate does not also change cv_params
full_cv_params = dict(cv_params)
# a common technique is to set the learning rate lower and 
# raise num_boost_round for the full train
full_cv_params['learning_rate'] = cv_params['learning_rate']/10
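Plain assignment binds a second name to the same dictionary, so changing the learning rate through one name changes it for both. A minimal sketch with made-up parameter values, showing how a shallow copy keeps the original settings intact:

```python
# Illustrative values only, not the notebook's full parameter set
base_params = {'learning_rate': 0.05, 'max_depth': 6}

# Assignment creates an alias: both names point at one dict
alias = base_params
alias['learning_rate'] = 0.005
aliased_base_lr = base_params['learning_rate']  # the original changed too

# Reset, then use a shallow copy instead
base_params['learning_rate'] = 0.05
copied = dict(base_params)
copied['learning_rate'] = base_params['learning_rate'] / 10

print(aliased_base_lr, base_params['learning_rate'], copied['learning_rate'])
```

A shallow copy is enough here because every value in the parameter dictionary is a number or a string.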

In [None]:
train_matrix = xgb.DMatrix(X_train, label=y_train)
val_matrix = xgb.DMatrix(X_test, label=y_test)
    
final_bst = xgb.train(params=full_cv_params, dtrain=train_matrix, 
                evals=[(train_matrix, 'train'), (val_matrix, 'valid')], 
                num_boost_round=1000, early_stopping_rounds=20, verbose_eval=50)

In [None]:
adv_y_pred = final_bst.predict(val_matrix)
# convert to 0/1 based on threshold
adv_y_final = np.where(adv_y_pred>0.5, 1, 0)
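The 0.5 cutoff is a choice rather than a fixed rule: raising it can only reduce the number of rows labelled positive. A small sketch on made-up scores (the helper `binarize` and the values are illustrative):

```python
import numpy as np

def binarize(scores, threshold=0.5):
    """Label a score 1 when it is strictly above the threshold, else 0."""
    return np.where(np.asarray(scores) > threshold, 1, 0)

# Hypothetical predicted probabilities
toy_scores = np.array([0.10, 0.40, 0.55, 0.90])

default_labels = binarize(toy_scores)               # cutoff 0.5
strict_labels = binarize(toy_scores, threshold=0.8)  # higher cutoff

print(default_labels, strict_labels)
```

Because the AUC is computed from the raw scores, it does not depend on this cutoff; only threshold-based summaries such as the confusion matrix do.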

## Final Assessments

In [14]:
confusion_matrix(y_test, adv_y_final.astype('int64'))

|   | 0 | 1 |
|---|---|---|
| 0 | 60214 | 24622 |
| 1 | 2550 | 4867 |
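The confusion matrix can be turned into a few summary rates. This sketch assumes rows are true labels and columns are predicted labels (the scikit-learn convention); if the layout is transposed, the precision and recall values swap:

```python
# Counts copied from the confusion matrix above.
# Assumption: rows = true label, columns = predicted label.
tn, fp = 60214, 24622
fn, tp = 2550, 4867

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"total={total} accuracy={accuracy:.4f} "
      f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Under that assumption, the model finds roughly two thirds of the positive cases, while about one in six rows it flags as positive is actually positive.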


In [15]:
roc_auc_score(y_test, adv_y_pred)

0.7492623925209045