In this notebook we tune an **XGBoost** model on the minimal data set, `X_train_0.csv` and `X_test_0.csv`, which are taken from `application_train.csv` and `application_test.csv` without merging with the other tables. We try some model tuning and feature engineering.

In [13]:
%load_ext autoreload
%autoreload 2

import warnings
warnings.filterwarnings('ignore')

import pickle
import os
import pandas as pd
import numpy as np

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

from xgboost import XGBClassifier

from _preprocessing import change_dtypes
from _preprocessing import deal_with_abnormal_days_employed
from _preprocessing import onehot_encoding
from _preprocessing import GeneralLabelEncoder

from _model_tunning import tune_n_estimators_w_early_stopping
from _model_tunning import grid_search_stepwise

INP_DIR = "data/data_"
TUNING_DIR = "data/tuning_"

N_JOBS = 1



## Load `X_train_0.csv` and `X_test_0.csv`

In [2]:
X_train_0 = pd.read_csv(os.path.join(INP_DIR, "X_train_0.csv"))
X_train_0 = change_dtypes(X_train_0)

y_train = pd.read_csv(os.path.join(INP_DIR, "y_train.csv"))
y_train = y_train["TARGET"]

X_test_0 = pd.read_csv(os.path.join(INP_DIR, "X_test_0.csv"))
X_test_0 = change_dtypes(X_test_0)

id_test = pd.read_csv(os.path.join(INP_DIR, "id_test.csv"))

print("X_train_0 shape:", X_train_0.shape)
print("y_train shape:", y_train.shape)
print("X_test_0 shape:", X_test_0.shape)
print("id_test shape:", id_test.shape)

Memory usage before changing types 284.45 MB
Memory usage after changing types 128.24 MB
Memory usage before changing types 45.09 MB
Memory usage after changing types 20.33 MB
X_train_0 shape: (307511, 120)
y_train shape: (307511,)
X_test_0 shape: (48744, 120)
id_test shape: (48744, 1)
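The memory figures printed above come from `change_dtypes`, whose implementation is not shown in this notebook. As a rough illustration of how such a reduction is usually achieved, here is a minimal sketch that assumes the helper downcasts numeric columns and converts string columns to categoricals. The function name `downcast_dtypes_sketch` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd


def downcast_dtypes_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for change_dtypes: shrink column dtypes."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_float_dtype(out[col]):
            # float64 -> float32 halves the memory of the column
            out[col] = out[col].astype(np.float32)
        elif pd.api.types.is_integer_dtype(out[col]):
            # pick the smallest integer type that can hold the values
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_string_dtype(out[col]) or pd.api.types.is_object_dtype(out[col]):
            # repeated strings are stored once, as category labels
            out[col] = out[col].astype("category")
    return out


def memory_mb(df: pd.DataFrame) -> float:
    """Total memory of the frame in MB, counting string contents."""
    return df.memory_usage(deep=True).sum() / 1024 ** 2


# Toy frame with made-up column names, only to exercise the function
demo = pd.DataFrame({
    "AMT": np.linspace(0.0, 1.0, 1000),
    "CNT": np.arange(1000) % 5,
    "TYPE": ["Cash loans", "Revolving loans"] * 500,
})
small = downcast_dtypes_sketch(demo)
print("before: %.3f MB, after: %.3f MB" % (memory_mb(demo), memory_mb(small)))
```

Casting to `float32` loses precision beyond roughly seven significant digits, which is usually acceptable for tree-based models but worth keeping in mind.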


## Data preprocessing 

In [3]:
# Modify abnormal value in DAYS_EMPLOYED
X_train_0_prep = deal_with_abnormal_days_employed(X_train_0)
X_test_0_prep = deal_with_abnormal_days_employed(X_test_0)

# One-hot encoding
X_train_0_ohe, X_test_0_ohe = onehot_encoding(X_train_0_prep, X_test_0_prep)

print("X_train_0_ohe shape", X_train_0_ohe.shape)
print("X_test_0_ohe shape", X_test_0_ohe.shape)

X_train_0_ohe shape (307511, 242)
X_test_0_ohe shape (48744, 242)
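Both encoded frames end up with the same 242 columns, which suggests that `onehot_encoding` aligns the test columns to the training columns. The helper itself is not shown here, so the following is only a sketch of one common way to do this with `pandas.get_dummies`. The name `onehot_encoding_sketch` and the toy data are hypothetical.

```python
import pandas as pd


def onehot_encoding_sketch(train: pd.DataFrame, test: pd.DataFrame):
    """Hypothetical stand-in for onehot_encoding."""
    train_ohe = pd.get_dummies(train)
    test_ohe = pd.get_dummies(test)
    # Keep exactly the training columns, in the same order, in both frames:
    # dummy columns missing on the test side are filled with 0 and
    # categories seen only in the test set are dropped.
    test_ohe = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)
    return train_ohe, test_ohe


toy_train = pd.DataFrame({"x": [1.0, 2.0, 3.0], "c": ["a", "b", "c"]})
toy_test = pd.DataFrame({"x": [4.0, 5.0], "c": ["a", "d"]})
toy_train_ohe, toy_test_ohe = onehot_encoding_sketch(toy_train, toy_test)
print(list(toy_train_ohe.columns))
print(list(toy_test_ohe.columns))
```

Aligning to the training columns means that the model always sees the feature layout it was fitted on, even when a category is absent from one of the two files.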


Since `XGBoost` can deal with missing values, we will not impute them. It is based on trees, so feature scaling is not necessary.
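The scaling argument can be checked on a small example. The sketch below uses a single scikit-learn `DecisionTreeClassifier` as a stand-in for a tree-based model (it is not XGBoost itself) and synthetic integer-valued data. It fits one tree on the raw features and one on the same features multiplied by a constant, then compares the predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, made up for this illustration
rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(500, 8)).astype(float)
y = (X[:, 0] + X[:, 1] > 100).astype(int)

# Rescale every feature; the ordering of values within a feature is unchanged
X_scaled = X * 1024.0

tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_scaled, y)

# A split depends only on how the samples are ordered along a feature,
# so both trees partition the samples in the same way.
same = np.array_equal(tree_raw.predict(X), tree_scaled.predict(X_scaled))
print("identical predictions:", same)
```

Only the split thresholds change with the scale of a feature, not the partition of the samples they induce, which is why standardization is normally skipped for tree ensembles.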

## Performance of default setting

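# random_state has no effect without shuffle=True, and recent scikit-learn versions raise an error for that combination
kfold = StratifiedKFold(n_splits=5)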
cv_scores = cross_val_score(XGBClassifier(n_jobs=N_JOBS), X_train_0_ohe, y_train, scoring="roc_auc", cv=kfold)
print("CV AUC of XGBoost model: %0.5f +/- %0.5f" % (cv_scores.mean(), cv_scores.std()))

Running this on a Linux machine with 8 CPUs gives:

`CV AUC of XGBoost model: 0.75119 +/- 0.00278`

## Tuning `XGBoost`

In [None]:
n_estimators = 250
pickle_out = os.path.join(TUNING_DIR, "tuning_0.pkl")

xgb = XGBClassifier(n_estimators=n_estimators, n_jobs=N_JOBS)

step_1 = dict(learning_rate = [0.01, 0.05, 0.1, 0.2, 0.3, 0.5])
step_2 = dict(max_depth = [2, 4, 6, 8, 10])
step_3 = dict(min_child_weight = [0, 1, 3, 5, 7, 9])

step_4 = dict(subsample=[0.6, 0.8, 1.0])
step_5 = dict(colsample_bytree=[0.6, 0.8, 1.0])

step_6 = dict(reg_lambda=[0, 1, 10, 100, 1000, 10000])
step_7 = dict(reg_alpha=[0, 1, 10, 100, 1000, 10000])


params_grid_steps = [step_1, step_2, step_3, step_4, step_5, step_6, step_7]
print("params_grid_steps:\n", params_grid_steps)

results = grid_search_stepwise(xgb, X_train_0_ohe, y_train, params_grid_steps, 
                               scoring="roc_auc", cv=5,
                               random_state=123, pkl_out=pickle_out)

This tuning was run on a remote Linux machine with 8 CPUs, and the pickled best estimator was then downloaded to the local laptop. **The best CV AUC is 0.7607**.

In [12]:
# Use a context manager so that the file handle is closed after loading
with open(os.path.join(TUNING_DIR, "tuning_0.pkl"), "rb") as f:
    results = pickle.load(f)
xgb = results["best_estimator"]
print("Best params:\n", xgb.get_params())
print("Best CV AUC:\n", results["best_scores"][-1])

Best params:
 {'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bynode': 1, 'colsample_bytree': 0.8, 'gamma': 0, 'learning_rate': 0.2, 'max_delta_step': 0, 'max_depth': 4, 'min_child_weight': 7, 'missing': nan, 'n_estimators': 250, 'n_jobs': 8, 'nthread': None, 'objective': 'binary:logistic', 'random_state': 0, 'reg_alpha': 10, 'reg_lambda': 100, 'scale_pos_weight': 1, 'seed': None, 'silent': None, 'subsample': 1.0, 'verbosity': 1}
Best CV AUC:
 0.7607209322734456
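The implementation of `grid_search_stepwise` is not shown in this notebook. Judging from how it is called and from the `best_estimator` and `best_scores` keys read back above, it appears to tune one parameter group at a time while keeping the earlier winners fixed. The sketch below illustrates that idea with scikit-learn's `GridSearchCV`. The function name `stepwise_grid_search_sketch`, the toy data, and the decision tree used as the estimator are assumptions made for the example.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier


def stepwise_grid_search_sketch(estimator, X, y, params_grid_steps, scoring, cv):
    """Tune one group of parameters at a time, keeping earlier winners fixed."""
    best_estimator = clone(estimator)
    best_scores = []
    for step in params_grid_steps:
        search = GridSearchCV(best_estimator, param_grid=step, scoring=scoring, cv=cv)
        search.fit(X, y)
        # Carry the winning values of this step into all later steps
        best_estimator = clone(best_estimator).set_params(**search.best_params_)
        best_scores.append(search.best_score_)
    return {"best_estimator": best_estimator, "best_scores": best_scores}


# Toy problem, only to exercise the function
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
cv_demo = StratifiedKFold(n_splits=3, shuffle=True, random_state=123)
steps_demo = [dict(max_depth=[2, 4, 6]), dict(min_samples_leaf=[1, 5, 20])]
demo_results = stepwise_grid_search_sketch(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo, steps_demo, "roc_auc", cv_demo
)
print(demo_results["best_estimator"].get_params()["max_depth"])
print(demo_results["best_scores"])
```

With the seven grids used above, a stepwise search evaluates 6 + 5 + 6 + 3 + 3 + 6 + 6 = 35 candidates, whereas a full grid over all combinations would need 6 × 5 × 6 × 3 × 3 × 6 × 6 = 58,320. The price is that interactions between parameters tuned in different steps are not explored.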


## Label encoding
A member of the first-place team said that using label encoding can boost the performance. Let's try that.

In [9]:
lbe = GeneralLabelEncoder()
lbe.fit(X_train_0_prep)  # fit on the training data only

X_train_0_lbe = lbe.transform(X_train_0_prep)
X_test_0_lbe = lbe.transform(X_test_0_prep)

print("X_train_0_lbe shape:", X_train_0_lbe.shape)
print("X_test_0_lbe shape:", X_test_0_lbe.shape)

X_train_0_lbe shape: (307511, 121)
X_test_0_lbe shape: (48744, 121)
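`GeneralLabelEncoder` is a project helper whose code is not included here. As an illustration of what label encoding does, the sketch below assumes that each non-numeric column is mapped to integer codes learned from the training data and that categories unseen during fitting, as well as missing values, are mapped to -1. The class name `LabelEncoderSketch` and the toy data are hypothetical.

```python
import pandas as pd


class LabelEncoderSketch:
    """Hypothetical stand-in for GeneralLabelEncoder."""

    def fit(self, df: pd.DataFrame):
        self.mappings_ = {}
        for col in df.columns:
            if not pd.api.types.is_numeric_dtype(df[col]):
                # Assign codes 0, 1, 2, ... in order of first appearance
                categories = pd.unique(df[col].dropna())
                self.mappings_[col] = {cat: code for code, cat in enumerate(categories)}
        return self

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()
        for col, mapping in self.mappings_.items():
            # Values without a learned code (unseen or missing) become -1
            out[col] = out[col].astype(object).map(mapping).fillna(-1).astype(int)
        return out


toy_train = pd.DataFrame({"x": [1.0, 2.0, 3.0], "c": ["a", "b", "a"]})
toy_test = pd.DataFrame({"x": [4.0, 5.0], "c": ["b", "z"]})

encoder = LabelEncoderSketch().fit(toy_train)
toy_train_lbe = encoder.transform(toy_train)
toy_test_lbe = encoder.transform(toy_test)
print(toy_train_lbe["c"].tolist(), toy_test_lbe["c"].tolist())
```

Unlike one-hot encoding, this keeps one column per categorical feature, so the encoded frame has far fewer columns than the one-hot version, at the cost of imposing an arbitrary order on the categories.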


## Performance of the model tuned above

In [10]:
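# random_state has no effect without shuffle=True, and recent scikit-learn versions raise an error for that combination
kfold = StratifiedKFold(n_splits=5)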
cv_scores = cross_val_score(xgb, X_train_0_lbe, y_train, scoring="roc_auc", cv=kfold)
print("CV AUC of XGBoost model: %0.5f +/- %0.5f" % (cv_scores.mean(), cv_scores.std()))

CV AUC of XGBoost model: 0.76036 +/- 0.00317


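The CV AUC of 0.76036 is slightly worse than the 0.76072 obtained with one-hot encoding, although the gap is well within the fold-to-fold standard deviation of 0.00317. The model was tuned on the one-hot-encoded features, so it may need to be re-tuned for these label-encoded features.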

## Tuning

In [None]:
pickle_out = os.path.join(TUNING_DIR, "tuning_0a.pkl")

step_1 = dict(learning_rate = [0.01, 0.05, 0.1, 0.2, 0.3, 0.5])
step_2 = dict(max_depth = [2, 4, 6, 8, 10])
step_3 = dict(min_child_weight = [0, 1, 3, 5, 7, 9])

step_4 = dict(subsample=[0.6, 0.8, 1.0])
step_5 = dict(colsample_bytree=[0.6, 0.8, 1.0])

step_6 = dict(reg_lambda=[0, 1, 10, 100, 1000, 10000])
step_7 = dict(reg_alpha=[0, 1, 10, 100, 1000, 10000])


params_grid_steps = [step_1, step_2, step_3, step_4, step_5, step_6, step_7]
print("params_grid_steps:\n", params_grid_steps)

results = grid_search_stepwise(xgb, X_train_0_lbe, y_train, params_grid_steps, 
                               scoring="roc_auc", cv=5,
                               random_state=123, pkl_out=pickle_out)

In [14]:
XGBClassifier().get_params()

{'base_score': 0.5,
 'booster': 'gbtree',
 'colsample_bylevel': 1,
 'colsample_bynode': 1,
 'colsample_bytree': 1,
 'gamma': 0,
 'learning_rate': 0.1,
 'max_delta_step': 0,
 'max_depth': 3,
 'min_child_weight': 1,
 'missing': None,
 'n_estimators': 100,
 'n_jobs': 1,
 'nthread': None,
 'objective': 'binary:logistic',
 'random_state': 0,
 'reg_alpha': 0,
 'reg_lambda': 1,
 'scale_pos_weight': 1,
 'seed': None,
 'silent': None,
 'subsample': 1,
 'verbosity': 1}