In this notebook we tune an **XGBoost** model on the minimal data sets `X_train_0.csv` and `X_test_0.csv`, which are derived from `application_train.csv` and `application_test.csv` without merging in any of the other tables. After a first round of model tuning, we try some feature engineering. Finally, we tune the model again on the "best" feature set and produce the Kaggle submission file for scoring.

In [10]:
%load_ext autoreload
%autoreload 2

import warnings
warnings.filterwarnings('ignore')

import os
import pandas as pd
import numpy as np

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

from xgboost import XGBClassifier

from _preprocessing import change_dtypes
from _preprocessing import deal_with_abnormal_days_employed

INP_DIR = "data/data_"
TUNING_DIR = "data/tuning_"

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Load `X_train_0.csv` and `X_test_0.csv`

In [7]:
X_train_0 = pd.read_csv(os.path.join(INP_DIR, "X_train_0.csv"))
X_train_0 = change_dtypes(X_train_0)

y_train = pd.read_csv(os.path.join(INP_DIR, "y_train.csv"))
y_train = y_train["TARGET"]

X_test_0 = pd.read_csv(os.path.join(INP_DIR, "X_test_0.csv"))
X_test_0 = change_dtypes(X_test_0)

id_test = pd.read_csv(os.path.join(INP_DIR, "id_test.csv"))

print("X_train_0 shape:", X_train_0.shape)
print("y_train shape:", y_train.shape)
print("X_test_0 shape:", X_test_0.shape)
print("id_test shape:", id_test.shape)

Memory usage before changing types 284.45 MB
Memory usage after changing types 128.24 MB
Memory usage before changing types 45.09 MB
Memory usage after changing types 20.33 MB
X_train_0 shape: (307511, 120)
y_train shape: (307511,)
X_test_0 shape: (48744, 120)
id_test shape: (48744, 1)
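
The memory figures printed above come from `change_dtypes`, whose source lives in `_preprocessing` and is not shown in this notebook. As a rough illustration of how such a helper could more than halve memory usage, here is a minimal sketch that assumes it simply downcasts numeric columns. The function names `downcast_numeric` and `memory_mb` and the toy frame are made up for this example and are not the author's implementation.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df):
    """Hypothetical stand-in for change_dtypes: shrink numeric column types."""
    out = df.copy()
    # float64 -> float32 halves the footprint (at the cost of some precision)
    for col in out.select_dtypes(include=["float64"]).columns:
        out[col] = out[col].astype(np.float32)
    # int64 -> smallest signed integer type that can hold the values
    for col in out.select_dtypes(include=["int64"]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    return out


def memory_mb(df):
    """Total memory used by the frame, in megabytes."""
    return df.memory_usage(deep=True).sum() / 1024 ** 2


# Toy frame; the column names are only illustrative
demo = pd.DataFrame({
    "AMT_CREDIT": np.linspace(0.0, 1e6, 10_000),                # float64
    "CNT_CHILDREN": np.arange(10_000, dtype=np.int64) % 5,      # int64
})
small = downcast_numeric(demo)

print("Memory usage before changing types %.2f MB" % memory_mb(demo))
print("Memory usage after changing types %.2f MB" % memory_mb(small))
print(small.dtypes)
```

Downcasting floats to 32 bits is usually harmless for tree models, but it is worth checking on columns that hold large identifiers or amounts where precision matters.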


## Data preprocessing 

In [9]:
# Modify abnormal value in DAYS_EMPLOYED
X_train_0_prep = deal_with_abnormal_days_employed(X_train_0)
X_test_0_prep = deal_with_abnormal_days_employed(X_test_0)

# One-hot encoding
X_train_0_prep = pd.get_dummies(X_train_0_prep)
X_test_0_prep = pd.get_dummies(X_test_0_prep)

print("X_train_0_prep shape", X_train_0_prep.shape)
print("X_test_0_prep shape", X_test_0_prep.shape)

X_train_0_prep shape (307511, 245)
X_test_0_prep shape (48744, 242)
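
Note that the two encoded frames have different widths (245 versus 242 columns), which means that some categories occur in only one of the two data sets. A model fitted on the training columns cannot score the test frame until the columns match. The notebook does not show how this is resolved, so the following is only one possible approach, sketched on toy data: `DataFrame.align` with `join="inner"` keeps the columns common to both frames.

```python
import pandas as pd

# Toy frames: the category "XNA" appears only in the training data
train = pd.DataFrame({"CODE_GENDER": ["F", "M", "XNA"], "AMT_CREDIT": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"CODE_GENDER": ["F", "M"], "AMT_CREDIT": [4.0, 5.0]})

train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)
print(train_enc.shape, test_enc.shape)  # the widths differ, as in the notebook

# Keep only the columns present in both frames, in the same order
train_aligned, test_aligned = train_enc.align(test_enc, join="inner", axis=1)
print(train_aligned.shape, test_aligned.shape)
```

An alternative is `join="left"` with `fill_value=0`, which keeps every training column and adds all-zero columns to the test frame for categories it never saw.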


`XGBoost` handles missing values natively, so we do not impute them. Because it is a tree-based model, feature scaling is not necessary either.
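
To see why scaling does not matter for trees, consider a single split: it depends only on the ordering of the feature values, and standardization preserves that ordering. The sketch below uses a hand-written Gini-based decision stump as a simplified stand-in (XGBoost itself uses a gradient-based gain, and `best_split_mask` is a name invented for this example) and checks that the chosen partition of the samples is the same before and after scaling.

```python
import numpy as np


def best_split_mask(x, y):
    """Return the 'goes left' mask of the best Gini split on one feature."""
    order = np.argsort(x, kind="stable")
    xs, ys = x[order], y[order]
    n = len(ys)
    best_impurity, best_threshold = np.inf, None
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # no threshold can separate equal values
        p_left, p_right = ys[:i].mean(), ys[i:].mean()
        # Weighted Gini impurity of the two children
        impurity = (i * 2 * p_left * (1 - p_left)
                    + (n - i) * 2 * p_right * (1 - p_right)) / n
        if impurity < best_impurity:
            best_impurity = impurity
            best_threshold = (xs[i - 1] + xs[i]) / 2
    return x < best_threshold


rng = np.random.default_rng(0)
x = rng.integers(0, 100, size=200).astype(float)
y = (x + rng.normal(0, 10, size=200) > 50).astype(int)

# Standardize the feature: the values change, the ordering does not
x_scaled = (x - x.mean()) / x.std()

mask_raw = best_split_mask(x, y)
mask_scaled = best_split_mask(x_scaled, y)
print("Same partition:", np.array_equal(mask_raw, mask_scaled))
```

The threshold value itself changes with the scale, but the samples sent to each side of the split do not, so the fitted tree makes the same predictions.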

## Performance of the default settings

In [None]:
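# shuffle=True is needed for random_state to take effect;
# recent scikit-learn versions raise a ValueError otherwise
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)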
cv_scores = cross_val_score(XGBClassifier(), X_train_0_prep, y_train, scoring="roc_auc", cv=kfold)
print("CV AUC of XGBoost model: %0.5f +/- %0.5f" % (cv_scores.mean(), cv_scores.std()))
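
The `roc_auc` score used above can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The following NumPy sketch implements that definition directly. It compares every positive with every negative, so it is meant only for illustration on small arrays, and `pairwise_auc` is a helper name made up for this example.

```python
import numpy as np


def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Matrix of score differences for every positive/negative pair
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))


auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75: three of the four pairs are ranked correctly
```

Because AUC depends only on the ranking of the scores, it is not affected by the class imbalance in the way that accuracy is, which makes it a reasonable metric when positives are rare.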