# Embrace Randomness - in XGBoost

**IMPORTANT:** XGBoost is deterministic when the sampling parameters *subsample* and *colsample_by\** (*colsample_bytree*, *colsample_bylevel*, *colsample_bynode*) are left at their default value of 1.0.

Thus, running XGBoost with the default parameters always returns the same model (given the same training set as input). Even changing XGBoost's random state has no effect in this case, because the seed only comes into play when sampling is used.

However, to prevent overfitting it is common to test different values for the sampling parameters (e.g. during hyperparameter optimization). Once sampling is enabled, XGBoost generates different models for the same input data when trained repeatedly with different seeds.
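The role of the seed can be illustrated without XGBoost. The following is a minimal sketch and not XGBoost's internal sampling code: the hypothetical helper `sample_rows` draws a fraction of the row indices with a seeded generator. With a fraction of 1.0 every row is selected regardless of the seed, while with 0.8 the selection depends on it.

```python
import random

def sample_rows(n_rows, subsample, seed):
    """Return the sorted row indices picked for one round (illustrative only)."""
    rng = random.Random(seed)        # seeded generator, analogous to random_state
    k = round(n_rows * subsample)    # number of rows to keep
    return sorted(rng.sample(range(n_rows), k))

# subsample=1.0: all rows are used, so the seed cannot change the result
full_a = sample_rows(100, 1.0, seed=1)
full_b = sample_rows(100, 1.0, seed=2)

# subsample=0.8: the selection depends on the seed,
# and reusing a seed reproduces the same selection
part_a = sample_rows(100, 0.8, seed=1)
part_b = sample_rows(100, 0.8, seed=2)
part_a_again = sample_rows(100, 0.8, seed=1)

print("full sample identical across seeds:", full_a == full_b)
print("80% sample identical across seeds: ", part_a == part_b)
print("80% sample identical for same seed:", part_a == part_a_again)
```

The same pattern shows up in the experiments below: the seed has no visible effect until a sampling parameter drops below 1.0.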

There are two ways to deal with the randomness:

1. Set *random_state* to a fixed value. This is good for reproducibility, but **not for production-ready models!** (TODO: add reference)
2. Train the model many times with cross-validation and choose the model with the highest mean score as the final model, i.e. the model that generalizes best on the given training and test data.

## Overview

In this notebook, we will

* Train XGBoost with default parameters (deterministic model).
* Train XGBoost with resampling (random model).
* Train XGBoost with cross validation.

## Pre-requisites

* Load dataset (breast cancer - clean data, no missing values, no feature engineering necessary)
* Split into training and test data (70%/30%)

In [1]:
from time import time
import numpy as np
import pandas as pd
import sklearn.datasets as datasets
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
data = datasets.load_breast_cancer()

# IMPORTANT: flip the target labels so that malignant is 1 (the raw data uses 0 = malignant, 1 = benign)

X = pd.DataFrame(data['data'], columns=data['feature_names'])
y = 1-pd.Series(data['target'], name='target')

labels = data['target_names'][[1,0]]

# Split training and test data

In [3]:
from sklearn.model_selection import train_test_split

random_seed = 42

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=random_seed)
print("train: ", X_train.shape, ', test:', X_test.shape)

train:  (398, 30) , test: (171, 30)


# Train XGBoost Models

## Deterministic: XGBoost with default parameters

In [4]:
max_rounds = 100   # maximum number of boosting iterations
early_stop = 50    # stop if metric does not improve for X rounds

In [5]:
# https://xgboost.readthedocs.io/en/latest/python/python_api.html
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn.metrics import roc_auc_score, accuracy_score

def train_xgb_sklearn(params, X_train, y_train, X_test, y_test, random_seed):
    '''Train and predict with Scikit-Learn XGBClassifier'''
    clf = XGBClassifier(n_estimators=max_rounds, **params, random_state=random_seed)
    clf.fit(X_train, y_train)
    return clf.predict_proba(X_test)[:,1]

def train_xgb_native(params, X_train, y_train, X_test, y_test, random_seed):
    '''Train and predict with the native XGBoost API'''
    params = {**params, 'seed':random_seed}
    train = xgb.DMatrix(X_train.values, label=y_train.values)
    test  = xgb.DMatrix(X_test.values, label=y_test.values)
    bst = xgb.train(params, train, max_rounds)
    return bst.predict(test)   # predicted probabilities for binary:logistic

def eval_metrics(y_true, y_hat):
    '''Compute ROC AUC, accuracy (threshold 0.5) and the indices of misclassified samples'''
    y_pred = y_hat >= 0.5
    return {
        'roc': roc_auc_score(y_true, y_hat),
        'acc': accuracy_score(y_true, y_pred),
        'wrong': y_true[y_true != y_pred].index.values,
    }

In [6]:
def eval_n_times(n, xgb_train, params, verbose=True):
    print(f"RUNNING {n}-times '{xgb_train.__name__}' with {params}")
    results=[]
    for i in range(n):
        random_seed = np.random.randint(1000)
        y_hat = xgb_train(params, X_train, y_train, X_test, y_test, random_seed)
        metrics = {**eval_metrics(y_test, y_hat), 'seed':random_seed}
        results.append(metrics)
        if verbose:
            print("roc_auc={roc:.4f}, accuracy={acc:.4f}, wrong:{wrong}, seed:{seed}".format(**metrics))
    return results

In [7]:
# using default XGBoost parameters
params = {'objective':'binary:logistic'}

for fun in [train_xgb_sklearn, train_xgb_native]:
    eval_n_times(5, fun, params);

RUNNING 5-times 'train_xgb_sklearn' with {'objective': 'binary:logistic'}
roc_auc=0.9949, accuracy=0.9649, wrong:[205   5  86 193  73 385], seed:96
roc_auc=0.9949, accuracy=0.9649, wrong:[205   5  86 193  73 385], seed:936
roc_auc=0.9949, accuracy=0.9649, wrong:[205   5  86 193  73 385], seed:381
roc_auc=0.9949, accuracy=0.9649, wrong:[205   5  86 193  73 385], seed:311
roc_auc=0.9949, accuracy=0.9649, wrong:[205   5  86 193  73 385], seed:815
RUNNING 5-times 'train_xgb_native' with {'objective': 'binary:logistic'}
roc_auc=0.9952, accuracy=0.9825, wrong:[205  86  73], seed:37
roc_auc=0.9952, accuracy=0.9825, wrong:[205  86  73], seed:230
roc_auc=0.9952, accuracy=0.9825, wrong:[205  86  73], seed:215
roc_auc=0.9952, accuracy=0.9825, wrong:[205  86  73], seed:946
roc_auc=0.9952, accuracy=0.9825, wrong:[205  86  73], seed:989


## Random: with subsample parameter

In [8]:
# using subsample in XGBoost parameters
params = {
    'objective':'binary:logistic',
    'subsample' : 0.8,
}

for fun in [train_xgb_sklearn, train_xgb_native]:
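    results = eval_n_times(10, fun, params)
    # numeric_only=True skips the non-numeric 'wrong' column (arrays of misclassified indices)
    print("MEAN: roc_auc={roc:.4f}, accuracy={acc:.4f}".format(**pd.DataFrame(results).mean(numeric_only=True).to_dict()))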

RUNNING 10-times 'train_xgb_sklearn' with {'objective': 'binary:logistic', 'subsample': 0.8}
roc_auc=0.9962, accuracy=0.9766, wrong:[205  86  73 385], seed:886
roc_auc=0.9965, accuracy=0.9708, wrong:[205  86 193  73 385], seed:471
roc_auc=0.9953, accuracy=0.9766, wrong:[ 86 193  73 385], seed:971
roc_auc=0.9947, accuracy=0.9766, wrong:[205  86  73 385], seed:541
roc_auc=0.9950, accuracy=0.9825, wrong:[ 86  73 385], seed:173
roc_auc=0.9978, accuracy=0.9766, wrong:[ 86 193  73 385], seed:871
roc_auc=0.9961, accuracy=0.9766, wrong:[205  86  73 385], seed:266
roc_auc=0.9963, accuracy=0.9766, wrong:[ 86 193  73 385], seed:699
roc_auc=0.9963, accuracy=0.9766, wrong:[205  86  73 385], seed:729
roc_auc=0.9965, accuracy=0.9825, wrong:[ 86  73 385], seed:747
MEAN: roc_auc=0.9961, accuracy=0.9772
RUNNING 10-times 'train_xgb_native' with {'objective': 'binary:logistic', 'subsample': 0.8}
roc_auc=0.9947, accuracy=0.9825, wrong:[205  86  73], seed:594
roc_auc=0.9947, accuracy=0.9766, wrong:[205   5 
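
Besides the mean, the spread of the scores shows how much a single run depends on its seed. As a small sketch, the scores of the ten `train_xgb_sklearn` runs above are copied by hand into two lists and summarized with the standard library. The variable names are our own.

```python
from statistics import mean, stdev

# Scores of the ten 'train_xgb_sklearn' runs with subsample=0.8, copied from the output above
roc = [0.9962, 0.9965, 0.9953, 0.9947, 0.9950, 0.9978, 0.9961, 0.9963, 0.9963, 0.9965]
acc = [0.9766, 0.9708, 0.9766, 0.9766, 0.9825, 0.9766, 0.9766, 0.9766, 0.9766, 0.9825]

for name, scores in [("roc_auc", roc), ("accuracy", acc)]:
    # the sample standard deviation quantifies the seed-to-seed variation
    print(f"{name}: mean={mean(scores):.4f}, std={stdev(scores):.4f}, "
          f"min={min(scores):.4f}, max={max(scores):.4f}")
```

The means reproduce the `MEAN` line printed above. The minimum and maximum show the range a single, arbitrarily seeded run could have landed in.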

# Cross Validation

In [9]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_validate

# not possible to use early stopping with sklearn cross validate
def train_xgb_cv_sklearn(X_train, y_train, params, max_rounds, skb, random_seed):
    #skb = StratifiedKFold(n_splits=nfold, shuffle=True, random_state=random_seed)
    clf = XGBClassifier(n_estimators=max_rounds, **params, random_state=random_seed)
    results = cross_validate(clf, X_train, y_train, cv=skb, scoring=['roc_auc', 'accuracy'], n_jobs=4, return_train_score=False)
    roc, acc = pd.DataFrame(results).mean()[['test_roc_auc', 'test_accuracy']].values
    return {'roc':roc, 'acc':acc}

def train_xgb_cv_native(X_train, y_train, params, max_rounds, skb, random_seed):
    params = {**params, 'seed':random_seed}
    train = xgb.DMatrix(X_train.values, y_train.values)
    result = xgb.cv(params, train, max_rounds, folds=skb, metrics=['error','auc'])
    roc, err = result.iloc[-1][['test-auc-mean', 'test-error-mean']].values
    return {'roc':roc, 'acc':1-err}

def train_xgb_cv_custom(X_train, y_train, params, max_rounds, skb, random_seed):
    fold_results=[]
    params = {**params, 'seed':random_seed}
    train = xgb.DMatrix(X_train.values, y_train.values)
    for i,s in enumerate(skb.split(X_train,y_train)):
        fold_train = train.slice(s[0])
        fold_test  = train.slice(s[1])
        bst = xgb.train(params, fold_train, max_rounds)
        y_hat = bst.predict(fold_test)
        metrics = eval_metrics(y_train.iloc[s[1]], y_hat)
        fold_results.append(metrics)
    # numeric_only=True skips the non-numeric 'wrong' column returned by eval_metrics
    return pd.DataFrame(fold_results).mean(numeric_only=True).to_dict()

def eval_n_times_cv(X_train, y_train, n, xgb_cv, params, nfold=5, verbose=True):
    print(f"RUNNING {n}x {nfold}-fold '{xgb_cv.__name__}' with {params}")
    results = []
    start = time()
    rs = np.random.RandomState(42)
    for i in range(n):
        random_seed = rs.randint(1000)
        skb = StratifiedKFold(n_splits=nfold, shuffle=True, random_state=random_seed)
        metrics = xgb_cv(X_train, y_train, params, max_rounds, skb, random_seed)
        results.append(metrics)
        metrics = {**metrics, 'seed':random_seed}
        # print every `verbose`-th repetition; `and` short-circuits, so verbose=0 does not divide by zero
        if verbose and i % verbose == 0:
            print(f"{i}:", "roc_auc={roc:.4f}, accuracy={acc:.4f}, seed={seed}".format(**metrics))

    print("took %.1f seconds" % (time() - start))
    return results

In [10]:
results = eval_n_times_cv(X_train, y_train, 100, train_xgb_cv_sklearn, params, verbose=10)
print("MEAN: roc_auc={roc:.4f}, accuracy={acc:.4f}".format(**pd.DataFrame(results).mean().to_dict()))

RUNNING 100x 5-fold 'train_xgb_cv_sklearn' with {'objective': 'binary:logistic', 'subsample': 0.8}
0: roc_auc=0.9924, accuracy=0.9723, seed=102
10: roc_auc=0.9906, accuracy=0.9649, seed=466
20: roc_auc=0.9902, accuracy=0.9699, seed=661
30: roc_auc=0.9883, accuracy=0.9572, seed=276
40: roc_auc=0.9907, accuracy=0.9573, seed=58
50: roc_auc=0.9898, accuracy=0.9623, seed=957
60: roc_auc=0.9874, accuracy=0.9699, seed=646
70: roc_auc=0.9923, accuracy=0.9599, seed=776
80: roc_auc=0.9916, accuracy=0.9673, seed=508
90: roc_auc=0.9933, accuracy=0.9623, seed=1
took 32.6 seconds
MEAN: roc_auc=0.9904, accuracy=0.9605


In [11]:
results = eval_n_times_cv(X_train, y_train, 30, train_xgb_cv_native, params, verbose=10)
print("MEAN: roc_auc={roc:.4f}, accuracy={acc:.4f}".format(**pd.DataFrame(results).mean().to_dict()))

RUNNING 30x 5-fold 'train_xgb_cv_native' with {'objective': 'binary:logistic', 'subsample': 0.8}
0: roc_auc=0.9911, accuracy=0.9698, seed=102
10: roc_auc=0.9916, accuracy=0.9674, seed=466
20: roc_auc=0.9868, accuracy=0.9623, seed=661
took 30.3 seconds
MEAN: roc_auc=0.9907, accuracy=0.9637


In [12]:
results = eval_n_times_cv(X_train, y_train, 100, train_xgb_cv_custom, params, verbose=10)
print("MEAN: roc_auc={roc:.4f}, accuracy={acc:.4f}".format(**pd.DataFrame(results).mean().to_dict()))

RUNNING 100x 5-fold 'train_xgb_cv_custom' with {'objective': 'binary:logistic', 'subsample': 0.8}
0: roc_auc=0.9916, accuracy=0.9699, seed=102
10: roc_auc=0.9920, accuracy=0.9624, seed=466
20: roc_auc=0.9866, accuracy=0.9674, seed=661
30: roc_auc=0.9878, accuracy=0.9622, seed=276
40: roc_auc=0.9907, accuracy=0.9649, seed=58
50: roc_auc=0.9925, accuracy=0.9723, seed=957
60: roc_auc=0.9875, accuracy=0.9598, seed=646
70: roc_auc=0.9930, accuracy=0.9649, seed=776
80: roc_auc=0.9919, accuracy=0.9623, seed=508
90: roc_auc=0.9927, accuracy=0.9598, seed=1
took 32.0 seconds
MEAN: roc_auc=0.9910, accuracy=0.9622
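
To put the three results into perspective, we can compare the gap between the mean scores of the three variants with the spread between individual repetitions. This is only a rough sketch: the values are copied by hand from the outputs above, only every 10th repetition was printed, and the native variant ran 30 instead of 100 repetitions.

```python
# Mean ROC AUC reported by the three cross-validation variants above
mean_roc = {"sklearn": 0.9904, "native": 0.9907, "custom": 0.9910}

# The ten ROC AUC values printed for 'train_xgb_cv_custom' (every 10th repetition)
printed_roc = [0.9916, 0.9920, 0.9866, 0.9878, 0.9907,
               0.9925, 0.9875, 0.9930, 0.9919, 0.9927]

# difference between the best and the worst variant mean
gap_between_variants = max(mean_roc.values()) - min(mean_roc.values())
# difference between the best and the worst printed repetition
spread_between_seeds = max(printed_roc) - min(printed_roc)

print(f"gap between variants: {gap_between_variants:.4f}")
print(f"spread between seeds: {spread_between_seeds:.4f}")
```

If the spread between seeds is clearly larger than the gap between the variants, as it appears to be here, the three variants can be treated as equivalent for practical purposes.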


## With Hyperparameter Optimization

In [13]:
from sklearn.model_selection import GridSearchCV

def xgb_grid_search(space, X_train, y_train, params, max_rounds, skb, random_seed):
    clf = XGBClassifier(n_estimators=max_rounds, **params, random_state=random_seed)
    
    scoring = ['roc_auc']

    #print(f"xgb_cv '{key}', shape={X.shape}")
    start = time()

    grid_search = GridSearchCV(
        estimator=clf,
        param_grid=space,
        n_jobs=4,
        cv=skb,
        scoring=scoring,
        verbose=0,
        refit=scoring[0],
        return_train_score=False,
    )

    fit_gs = grid_search.fit(X_train, y_train)
    print(fit_gs.best_params_, fit_gs.best_score_)

search_space = {
    "max_depth": [3, 4, 5],
    "subsample": [0.8, 1.0],  # subsample of observations
    "colsample_bytree": [0.8, 1.0],  # subsample of features
    "learning_rate": [0.1, 0.05],
}

rs = np.random.RandomState(42)
for i in range(10):
    random_seed = rs.randint(1000)
    skb = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_seed)
    xgb_grid_search(search_space, X_train, y_train, params, max_rounds, skb, random_seed)

{'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 3, 'subsample': 1.0} 0.9926714029919713
{'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 5, 'subsample': 1.0} 0.9908126841102063
{'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 4, 'subsample': 1.0} 0.9888199618783575
{'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 5, 'subsample': 0.8} 0.9913342574943683
{'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 5, 'subsample': 0.8} 0.9931167330907411
{'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 4, 'subsample': 1.0} 0.9951383353549356
{'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 4, 'subsample': 0.8} 0.992939409692139
{'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 3, 'subsample': 0.8} 0.9926685149887368
{'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 4, 'subsample': 0.8} 0.9912233581701612
{'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 3, 'subsample': 0.8} 0.99255183
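
The winning combination changes from seed to seed, so a single grid search is not conclusive. One possible way to summarize the repetitions, sketched here with parameter sets copied by hand from the output above, is to count how often each parameter value was selected.

```python
from collections import Counter

# Best parameters of the ten grid searches, copied from the output above
best_params = [
    {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 3, 'subsample': 1.0},
    {'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 5, 'subsample': 1.0},
    {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 4, 'subsample': 1.0},
    {'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 5, 'subsample': 0.8},
    {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 5, 'subsample': 0.8},
    {'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 4, 'subsample': 1.0},
    {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 4, 'subsample': 0.8},
    {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 3, 'subsample': 0.8},
    {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 4, 'subsample': 0.8},
    {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 3, 'subsample': 0.8},
]

# Count how often each value of each parameter was part of the best combination
votes = {}
for best in best_params:
    for name, value in best.items():
        votes.setdefault(name, Counter())[value] += 1

for name, counter in votes.items():
    print(name, counter.most_common())
```

In these ten runs *learning_rate* is the only parameter with a unanimous choice. For the other parameters the counts only indicate a tendency, which is a reason to rely on many repetitions rather than on one search.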