# Embrace Randomness - in XGBoost

**IMPORTANT:** XGBoost is deterministic when the sampling parameters *subsample* and *colsample_by\** (*colsample_bytree*, *colsample_bylevel*, *colsample_bynode*) are left at their default value of 1.0.

Thus, running XGBoost with the default parameters will always return the same model (given the same training set as input). Even changing XGBoost's random state has no effect in this case - because it only comes into play when sampling is used.

However, to prevent overfitting it is common to test values below 1.0 for the sampling parameters (e.g. during hyperparameter optimization). Training then depends on the random seed: the same input data yields a different model for each seed, and the scores vary from run to run.
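
To make the role of the seed concrete, here is a minimal conceptual sketch of row subsampling written with NumPy only. The helper `sample_rows` is hypothetical and is not XGBoost's actual sampler; it only illustrates that a seed has nothing to act on when `subsample` is 1.0, and that it decides which rows are drawn once `subsample` is below 1.0.

```python
import numpy as np

def sample_rows(n_rows, subsample, seed):
    """Return the sorted row indices one boosting round would see (conceptual only)."""
    if subsample >= 1.0:
        # No sampling: every row is used and the seed is never consulted.
        return np.arange(n_rows)
    rng = np.random.default_rng(seed)
    k = int(round(subsample * n_rows))
    return np.sort(rng.choice(n_rows, size=k, replace=False))

n_rows = 398  # size of the training set used in this notebook

# subsample=1.0: same rows regardless of the seed
same_without_sampling = np.array_equal(sample_rows(n_rows, 1.0, seed=1),
                                       sample_rows(n_rows, 1.0, seed=2))
# subsample=0.8 with the same seed: same rows
same_with_fixed_seed = np.array_equal(sample_rows(n_rows, 0.8, seed=1),
                                      sample_rows(n_rows, 0.8, seed=1))
# subsample=0.8 with different seeds: different rows
same_with_other_seed = np.array_equal(sample_rows(n_rows, 0.8, seed=1),
                                      sample_rows(n_rows, 0.8, seed=2))

print(same_without_sampling, same_with_fixed_seed, same_with_other_seed)
```

Different rows per round lead to different trees, which is why the scores start to vary between seeds as soon as sampling is enabled.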

There are two ways to deal with the randomness:

1. Set *random_state* to a fixed value. This is good for reproducibility, but **not for production-ready models!** (TODO: add reference)
2. Train the model many times with cross validation and choose the model configuration with the highest mean score as the final model, i.e. the one that generalizes best on the given training and test data.
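
The second option can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: `repeated_score`, `select_best` and `toy_score` are hypothetical names, and `toy_score` is an arbitrary stand-in for "train with cross validation and return a score", not a real training run. Its numbers say nothing about which *subsample* value is actually best.

```python
import random
from statistics import mean

def repeated_score(score_fn, config, n_repeats, base_seed=0):
    """Call score_fn(config, seed) with n_repeats different seeds and collect the scores."""
    return [score_fn(config, base_seed + i) for i in range(n_repeats)]

def select_best(score_fn, configs, n_repeats=20):
    """Return (best_config, mean_scores); best_config has the highest mean score."""
    mean_scores = [mean(repeated_score(score_fn, c, n_repeats)) for c in configs]
    best_idx = max(range(len(configs)), key=mean_scores.__getitem__)
    return configs[best_idx], mean_scores

def toy_score(config, seed):
    """Arbitrary toy score: a config-dependent value plus seed-dependent noise."""
    rng = random.Random(seed)
    return 1.0 - abs(config["subsample"] - 0.8) + rng.uniform(-0.01, 0.01)

configs = [{"subsample": s} for s in (0.5, 0.8, 1.0)]
best, means = select_best(toy_score, configs)
print(best, [round(m, 3) for m in means])
```

Averaging over many seeds means that a single lucky or unlucky seed cannot decide which configuration is selected.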

## Overview

In this notebook, we will

* Train XGBoost with default parameters (deterministic models).
* Train XGBoost with resampling (random models).
* Train XGBoost with resampling and cross validation.

## Pre-requisites

* Load dataset (breast cancer: clean data, no missing values, no feature engineering necessary)
* Split into training and test data (70%/30%)

In [1]:
from time import time
import numpy as np
import pandas as pd
import sklearn.datasets as datasets
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
data = datasets.load_breast_cancer()

# IMPORTANT: switch target labels as malignant should be 1

X = pd.DataFrame(data['data'], columns=data['feature_names'])
y = 1-pd.Series(data['target'], name='target')

labels = data['target_names'][[1,0]]

# Split training and test data

In [3]:
from sklearn.model_selection import train_test_split

random_seed = 42

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=random_seed)
print("train: ", X_train.shape, ', test:', X_test.shape)

train:  (398, 30) , test: (171, 30)


# Train XGBoost Models

## Deterministic: XGBoost with default parameters

In [4]:
max_rounds = 100   # maximum number of boosting iterations
early_stop = 50    # stop if metric does not improve for X rounds (defined here, but not used by the helpers below)

In [5]:
# https://xgboost.readthedocs.io/en/latest/python/python_api.html
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn.metrics import roc_auc_score, accuracy_score

def train_xgb_sklearn(params, X_train, y_train, X_test, y_test, random_seed):
    '''Train and predict with Scikit-Learn XGBClassifier'''
    clf = XGBClassifier(n_estimators=max_rounds, **params, random_state=random_seed)
    clf.fit(X_train, y_train)
    return clf.predict_proba(X_test)[:,1]

def train_xgb_native(params, X_train, y_train, X_test, y_test, random_seed):
    '''Train and predict with the native XGBoost API'''
    params = {**params, 'seed': random_seed}
    train = xgb.DMatrix(X_train.values, label=y_train.values)
    test  = xgb.DMatrix(X_test.values, label=y_test.values)
    bst = xgb.train(params, train, max_rounds)
    return bst.predict(test)

def eval_metrics(y_true, y_hat):
    return {
        'roc': roc_auc_score(y_true, y_hat),
        'acc': accuracy_score(y_true, y_hat >= 0.5),
        'wrong': y_true[y_true != (y_hat >= 0.5)].index.values,
    }

In [6]:
def eval_n_times(n, xgb_train, params, verbose=True):
    print(f"RUNNING {n}-times '{xgb_train.__name__}' with {params}")
    results=[]
    for i in range(n):
        random_seed = np.random.randint(1000)
        y_hat = xgb_train(params, X_train, y_train, X_test, y_test, random_seed)
        metrics = {**eval_metrics(y_test, y_hat), 'seed':random_seed}
        results.append(metrics)
        if verbose:
            print("roc_auc={roc:.4f}, accuracy={acc:.4f}, wrong:{wrong}, seed:{seed}".format(**metrics))
    return results

In [7]:
# using default XGBoost parameters
params = {'objective':'binary:logistic'}

for fun in [train_xgb_sklearn, train_xgb_native]:
    eval_n_times(5, fun, params);

RUNNING 5-times 'train_xgb_sklearn' with {'objective': 'binary:logistic'}
roc_auc=0.9949, accuracy=0.9649, wrong:[205   5  86 193  73 385], seed:185
roc_auc=0.9949, accuracy=0.9649, wrong:[205   5  86 193  73 385], seed:27
roc_auc=0.9949, accuracy=0.9649, wrong:[205   5  86 193  73 385], seed:481
roc_auc=0.9949, accuracy=0.9649, wrong:[205   5  86 193  73 385], seed:995
roc_auc=0.9949, accuracy=0.9649, wrong:[205   5  86 193  73 385], seed:546
RUNNING 5-times 'train_xgb_native' with {'objective': 'binary:logistic'}
roc_auc=0.9952, accuracy=0.9825, wrong:[205  86  73], seed:582
roc_auc=0.9952, accuracy=0.9825, wrong:[205  86  73], seed:484
roc_auc=0.9952, accuracy=0.9825, wrong:[205  86  73], seed:795
roc_auc=0.9952, accuracy=0.9825, wrong:[205  86  73], seed:499
roc_auc=0.9952, accuracy=0.9825, wrong:[205  86  73], seed:22


## Random: with subsample parameter

In [8]:
# using subsample in XGBoost parameters
params = {
    'objective':'binary:logistic',
    'subsample' : 0.8,
}

for fun in [train_xgb_sklearn, train_xgb_native]:
    results = eval_n_times(10, fun, params);
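    # numeric_only=True skips the array-valued 'wrong' column, which cannot be averaged
    print("MEAN: roc_auc={roc:.4f}, accuracy={acc:.4f}".format(**pd.DataFrame(results).mean(numeric_only=True).to_dict()))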

RUNNING 10-times 'train_xgb_sklearn' with {'objective': 'binary:logistic', 'subsample': 0.8}
roc_auc=0.9968, accuracy=0.9708, wrong:[205  86 193  73 385], seed:732
roc_auc=0.9969, accuracy=0.9766, wrong:[205  86  73 385], seed:139
roc_auc=0.9956, accuracy=0.9766, wrong:[ 86 193  73 385], seed:533
roc_auc=0.9974, accuracy=0.9766, wrong:[205  86  73 385], seed:384
roc_auc=0.9961, accuracy=0.9708, wrong:[205  86 193  73 385], seed:387
roc_auc=0.9978, accuracy=0.9766, wrong:[ 86 193  73 385], seed:820
roc_auc=0.9966, accuracy=0.9766, wrong:[205  86  73 385], seed:758
roc_auc=0.9939, accuracy=0.9766, wrong:[205  86  73 385], seed:344
roc_auc=0.9956, accuracy=0.9766, wrong:[205  86  73 385], seed:593
roc_auc=0.9958, accuracy=0.9766, wrong:[205  86  73 385], seed:116
MEAN: roc_auc=0.9962, accuracy=0.9754
RUNNING 10-times 'train_xgb_native' with {'objective': 'binary:logistic', 'subsample': 0.8}
roc_auc=0.9962, accuracy=0.9825, wrong:[ 86  73 385], seed:469
roc_auc=0.9966, accuracy=0.9708, wro
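
The mean alone hides how much the score moves from seed to seed. As a small sketch, the ten `roc_auc` values printed above for `train_xgb_sklearn` can be summarized with NumPy; the helper name `summarize` is an assumption for illustration and is not part of the notebook's code.

```python
import numpy as np

# roc_auc values of the ten 'train_xgb_sklearn' runs printed above
roc_auc = np.array([0.9968, 0.9969, 0.9956, 0.9974, 0.9961,
                    0.9978, 0.9966, 0.9939, 0.9956, 0.9958])

def summarize(scores):
    """Return mean, sample standard deviation, minimum and maximum of the scores."""
    scores = np.asarray(scores, dtype=float)
    return {
        "mean": float(scores.mean()),
        "std": float(scores.std(ddof=1)),  # ddof=1: sample standard deviation
        "min": float(scores.min()),
        "max": float(scores.max()),
    }

summary = summarize(roc_auc)
print({k: round(v, 4) for k, v in summary.items()})
```

Reporting the range or standard deviation next to the mean shows whether a difference between two parameter settings is larger than the seed-to-seed noise.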

# Cross Validation

In [9]:
from sklearn.model_selection import StratifiedKFold

def train_xgb_cv_native(X_train, y_train, params, max_rounds, skb):
    #params = {**params, 'seed':random_seed}
    train = xgb.DMatrix(X_train.values, y_train.values)
    result = xgb.cv(params, train, max_rounds, folds=skb, metrics=['error','auc'])
    roc, err = result.iloc[-1][['test-auc-mean', 'test-error-mean']].values
    return {'roc':roc, 'acc':1-err}

def train_xgb_cv_custom(X_train, y_train, params, max_rounds, skb):
    fold_results=[]
    train = xgb.DMatrix(X_train.values, y_train.values)
    for i,s in enumerate(skb.split(X_train,y_train)):
        fold_train = train.slice(s[0])
        fold_test  = train.slice(s[1])
        bst = xgb.train(params, fold_train, max_rounds)
        y_hat = bst.predict(fold_test)
        metrics = eval_metrics(y_train.iloc[s[1]], y_hat)
        fold_results.append(metrics)
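    # numeric_only=True skips the array-valued 'wrong' column, which cannot be averaged
    return pd.DataFrame(fold_results).mean(numeric_only=True).to_dict()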

def eval_n_times_cv(X_train, y_train, n, xgb_cv, params, nfold=5, verbose=True):
    print(f"RUNNING {n}-times {nfold}-fold '{xgb_cv.__name__}' with {params}")
    results = []
    start = time()
    for i in range(n):
        random_seed = np.random.randint(1000)
        skb = StratifiedKFold(n_splits=nfold, shuffle=True, random_state=random_seed)
        metrics = xgb_cv(X_train, y_train, params, max_rounds, skb)
        results.append(metrics)
        # 'and' short-circuits, so verbose=0 no longer triggers a modulo by zero
        if verbose and i % verbose == 0:
            print(f"{i}:", "roc_auc={roc:.4f}, accuracy={acc:.4f}".format(**metrics))

    print("took %.1f seconds" % (time() - start))
    return results

In [10]:
results = eval_n_times_cv(X_train, y_train, 5, train_xgb_cv_native, params, verbose=1)
print("MEAN: roc_auc={roc:.4f}, accuracy={acc:.4f}".format(**pd.DataFrame(results).mean().to_dict()))

RUNNING 5-times 5-fold 'train_xgb_cv_native' with {'objective': 'binary:logistic', 'subsample': 0.8}
0: roc_auc=0.9908, accuracy=0.9649
1: roc_auc=0.9887, accuracy=0.9597
2: roc_auc=0.9887, accuracy=0.9597
3: roc_auc=0.9887, accuracy=0.9597
4: roc_auc=0.9887, accuracy=0.9597
took 4.2 seconds
MEAN: roc_auc=0.9891, accuracy=0.9607


In [11]:
results = eval_n_times_cv(X_train, y_train, 100, train_xgb_cv_custom, params, verbose=10)
print("MEAN: roc_auc={roc:.4f}, accuracy={acc:.4f}".format(**pd.DataFrame(results).mean().to_dict()))

RUNNING 100-times 5-fold 'train_xgb_cv_custom' with {'objective': 'binary:logistic', 'subsample': 0.8}
0: roc_auc=0.9898, accuracy=0.9647
10: roc_auc=0.9918, accuracy=0.9723
20: roc_auc=0.9918, accuracy=0.9573
30: roc_auc=0.9905, accuracy=0.9649
40: roc_auc=0.9909, accuracy=0.9598
50: roc_auc=0.9891, accuracy=0.9597
60: roc_auc=0.9887, accuracy=0.9573
70: roc_auc=0.9910, accuracy=0.9723
80: roc_auc=0.9911, accuracy=0.9648
90: roc_auc=0.9912, accuracy=0.9623
took 24.2 seconds
MEAN: roc_auc=0.9905, accuracy=0.9628
