In this notebook we tune an **XGBoost** model on the minimal data set, `X_train_0.csv` and `X_test_0.csv`, which is taken from `application_train.csv` and `application_test.csv` without merging any other tables. After a first round of tuning, we try some feature engineering. Finally, we tune the model again on the "best" feature set to produce the Kaggle submission file for scoring.

In [15]:
%load_ext autoreload
%autoreload 2

import warnings
warnings.filterwarnings('ignore')

import argparse
import os
import pandas as pd
import numpy as np

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

from xgboost import XGBClassifier

from _preprocessing import change_dtypes
from _preprocessing import deal_with_abnormal_days_employed

INP_DIR = "data/data_"
TUNING_DIR = "data/tuning_"

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [17]:
# This is useful for running non-interactively on a remote Linux cluster.
# parse_known_args() ignores the extra "-f <kernel file>" argument that Jupyter
# passes to the kernel, so the cell also runs inside a notebook.

parser = argparse.ArgumentParser()
parser.add_argument("--n_jobs", type=int, default=1)
args, _ = parser.parse_known_args()
n_jobs = args.n_jobs

## Load `X_train_0.csv` and `X_test_0.csv`

In [7]:
X_train_0 = pd.read_csv(os.path.join(INP_DIR, "X_train_0.csv"))
X_train_0 = change_dtypes(X_train_0)

y_train = pd.read_csv(os.path.join(INP_DIR, "y_train.csv"))
y_train = y_train["TARGET"]

X_test_0 = pd.read_csv(os.path.join(INP_DIR, "X_test_0.csv"))
X_test_0 = change_dtypes(X_test_0)

id_test = pd.read_csv(os.path.join(INP_DIR, "id_test.csv"))

print("X_train_0 shape:", X_train_0.shape)
print("y_train shape:", y_train.shape)
print("X_test_0 shape:", X_test_0.shape)
print("id_test shape:", id_test.shape)

Memory usage before changing types 284.45 MB
Memory usage after changing types 128.24 MB
Memory usage before changing types 45.09 MB
Memory usage after changing types 20.33 MB
X_train_0 shape: (307511, 120)
y_train shape: (307511,)
X_test_0 shape: (48744, 120)
id_test shape: (48744, 1)


## Data preprocessing 

In [9]:
# Modify abnormal value in DAYS_EMPLOYED
X_train_0_prep = deal_with_abnormal_days_employed(X_train_0)
X_test_0_prep = deal_with_abnormal_days_employed(X_test_0)

# One-hot encoding
X_train_0_prep = pd.get_dummies(X_train_0_prep)
X_test_0_prep = pd.get_dummies(X_test_0_prep)

print("X_train_0_prep shape", X_train_0_prep.shape)
print("X_test_0_prep shape", X_test_0_prep.shape)

X_train_0_prep shape (307511, 245)
X_test_0_prep shape (48744, 242)
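
The shapes above differ by three columns (245 versus 242), presumably because some category levels occur only in the training data. A model fitted on the training frame cannot score the test frame unless both have the same columns in the same order. One common remedy, shown here as a sketch on toy data rather than as the approach used later in this project, is `DataFrame.align` with an inner join on the columns. The toy frames and their values are made up for illustration.

```python
import pandas as pd

# Toy data: the level "XNA" appears only in the training frame
train = pd.DataFrame({"AMT": [1.0, 2.0, 3.0], "CODE_GENDER": ["F", "M", "XNA"]})
test = pd.DataFrame({"AMT": [4.0, 5.0], "CODE_GENDER": ["M", "F"]})

train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)
print("before align:", train_enc.shape, test_enc.shape)

# Keep only the columns present in both frames, in a shared order
train_aligned, test_aligned = train_enc.align(test_enc, join="inner", axis=1)
print("after align:", train_aligned.shape, test_aligned.shape)
```

An inner join drops the train-only columns. Using `join="left"` with `fill_value=0` would instead keep them and add all-zero columns to the test frame.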


`XGBoost` handles missing values natively, so we do not impute them. Because it is tree-based, feature scaling is not necessary either.

## Performance of default setting

In [12]:
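# With shuffle=False (the default) the folds are deterministic; recent scikit-learn
# versions raise an error if random_state is set without shuffle=True.
kfold = StratifiedKFold(n_splits=5)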
cv_scores = cross_val_score(XGBClassifier(), X_train_0_prep, y_train, scoring="roc_auc", cv=kfold)
print("CV AUC of XGBoost model: %0.5f +/- %0.5f" % (cv_scores.mean(), cv_scores.std()))

CV AUC of XGBoost model: 0.75116 +/- 0.00264
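
Stratified folds matter most when the target is imbalanced, because each fold then keeps roughly the same share of positives as the full data set. The class balance of `TARGET` is not shown in this notebook, so the 10% positive rate below is an assumption made purely for illustration. The sketch checks the positive rate in each test fold of a toy target.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced target: 10 positives out of 100 samples (assumed ratio)
y_demo = np.array([1] * 10 + [0] * 90)
X_demo = np.zeros((len(y_demo), 1))  # features do not influence the split

# shuffle=True is required for random_state to have any effect
kfold_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)

fold_positive_rates = [
    y_demo[test_idx].mean() for _, test_idx in kfold_demo.split(X_demo, y_demo)
]
print(fold_positive_rates)
```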


## Tuning `XGBoost`

In [None]:
fast_learning_rate = 0.3
slow_learning_rate = 0.01

early_stopping_begin_pickle = os.path.join(TUNING_DIR, "early_stopping_begin.pkl")
grid_search_steps_pickle = os.path.join(TUNING_DIR, "grid_search_steps.pkl")
early_stopping_end_pickle = os.path.join(TUNING_DIR, "early_stopping_end.pkl")

# NOTE: tune_n_estimators_w_early_stopping and grid_search_stepwise are project
# helpers that are not imported in the cells above; import them from the module
# that defines them before running this cell.

# construct the estimator
xgb = XGBClassifier(learning_rate=fast_learning_rate, subsample=0.8, colsample_bytree=0.8, n_jobs=n_jobs)

# tune n_estimators by early stopping using the fast learning rate
xgb = tune_n_estimators_w_early_stopping(xgb, X_train_0_prep, y_train,
                                   max_n_estimators=5000, eval_size=0.2,
                                   eval_metric="auc",
                                   early_stopping_rounds=50,
                                   random_state=123, pkl_out=early_stopping_begin_pickle)

# tune other hyper-parameters by grid search, one group at a time
step_1 = dict(max_depth=range(3, 11))

step_2 = dict(gamma=[0, 0.2, 0.4, 0.6, 0.8, 1.])

step_3 = dict(subsample=[0.6, 0.7, 0.8, 0.9, 1.0])
step_4 = dict(colsample_bytree=[0.6, 0.7, 0.8, 0.9, 1.0])
step_5 = dict(colsample_bylevel=[0.6, 0.7, 0.8, 0.9, 1.0])

step_6 = dict(reg_lambda=[1e-5, 1e-3, 1e-1, 1, 10, 100])

params_grid_steps = [step_1, step_2, step_3, step_4, step_5, step_6]
print("params_grid_steps", params_grid_steps)

results = grid_search_stepwise(xgb, X_train_0_prep, y_train, params_grid_steps,
                               scoring="roc_auc", cv=5,
                               random_state=456, pkl_out=grid_search_steps_pickle)

# tune n_estimators again by early stopping, this time using the slow learning rate
xgb = results["best_estimator"]
xgb.set_params(learning_rate=slow_learning_rate)

xgb = tune_n_estimators_w_early_stopping(xgb, X_train_0_prep, y_train,
                                   max_n_estimators=5000, eval_size=0.2,
                                   eval_metric="auc",
                                   early_stopping_rounds=50,
                                   random_state=123, pkl_out=early_stopping_end_pickle)
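
The helper `grid_search_stepwise` is not defined in this notebook, so its exact behaviour is unknown. Judging by `params_grid_steps`, it appears to tune one small grid at a time and carry the winning values into the next step. For the six grids above that would mean 8 + 6 + 5 + 5 + 5 + 6 = 35 candidates instead of the 36,000 combinations of a full grid. The following is a minimal sketch of that idea under this assumption. The name `stepwise_grid_search_sketch` is hypothetical, and a `DecisionTreeClassifier` on synthetic data stands in for XGBoost to keep the example small.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier


def stepwise_grid_search_sketch(estimator, X, y, grid_steps, scoring="roc_auc", cv=5):
    """Hypothetical stand-in: search each grid in turn, keeping earlier winners fixed."""
    best = clone(estimator)
    history = []
    for grid in grid_steps:
        search = GridSearchCV(best, param_grid=grid, scoring=scoring, cv=cv)
        search.fit(X, y)
        # carry the winning values of this step into the next one
        best = clone(best).set_params(**search.best_params_)
        history.append((search.best_params_, search.best_score_))
    return best, history


# Synthetic, mildly imbalanced data (not the Kaggle data)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, weights=[0.9, 0.1], random_state=0
)
cv_toy = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
steps_toy = [dict(max_depth=[2, 3, 4]), dict(min_samples_leaf=[1, 5, 10])]

best_tree, history = stepwise_grid_search_sketch(
    DecisionTreeClassifier(random_state=0), X_toy, y_toy, steps_toy, cv=cv_toy
)
for params, score in history:
    print(params, round(score, 4))
```

The trade-off is that a stepwise search can miss interactions between parameters tuned in different steps, which a full grid would capture at a much higher cost.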