## Getting Started

With the necessary background out of the way, let's get started. For this notebook, we will work with a subset of the data: 16000 randomly sampled rows, of which 10000 are used for training. Hyperparameter tuning is extremely computationally expensive, and working with the full dataset in a Kaggle Kernel would not be feasible for more than a few search iterations. However, the same ideas that we implement here can be applied to the full dataset, and while this notebook is specifically aimed at the GBM, the methods can be applied to any machine learning model.

To "test" the tuning results, we will hold out 6000 rows of the labeled training data as a separate testing set. When we do hyperparameter tuning, it's crucial to **not tune the hyperparameters on the testing data**. We can use the testing data only a **single time**, when we evaluate the final model that has been tuned on the validation data. To actually test the methods from this notebook, we would need to train the best model on all of the training data, make predictions on the actual testing data, and then submit our answers to the competition.
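
As a rough sketch of this discipline, the snippet below splits row indices into a training set and a held-out testing set, then builds cross-validation folds from the training rows only, so that the testing rows never take part in tuning. The helper names `holdout_split` and `k_folds` are hypothetical and written just for illustration; the notebook itself relies on `train_test_split` and `lgb.cv` for these steps.

```python
import random


def holdout_split(n_rows, n_test, seed=0):
    """Shuffle row indices and split them into disjoint train and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


def k_folds(train_idx, n_folds):
    """Yield (fit_idx, valid_idx) pairs built from the training rows only.

    Every training row lands in exactly one validation fold.
    """
    for k in range(n_folds):
        valid_idx = train_idx[k::n_folds]
        valid_set = set(valid_idx)
        fit_idx = [i for i in train_idx if i not in valid_set]
        yield fit_idx, valid_idx


# Same sizes as in this notebook: 16000 sampled rows, 6000 held out for testing
train_idx, test_idx = holdout_split(n_rows=16000, n_test=6000)

# Hyperparameters are compared on these folds; test_idx is untouched until the end
folds = list(k_folds(train_idx, n_folds=5))
```

Each candidate set of hyperparameters would be scored on the validation folds, and only the single final model would be scored on `test_idx`.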

In [1]:
# Data manipulation
import pandas as pd
import numpy as np

# Modeling
import lightgbm as lgb

# Splitting data
from sklearn.model_selection import train_test_split

N_FOLDS = 5
MAX_EVALS = 5

In [23]:
features = pd.read_csv('../dataset/application_train.csv')
features = features.sample(16000, random_state=1)

In [24]:
features.columns

Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY',
       ...
       'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
       'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR',
       'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
       'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
       'AMT_REQ_CREDIT_BUREAU_YEAR'],
      dtype='object', length=122)

In [30]:
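# Extract the labels before the TARGET column is dropped
labels = np.array(features['TARGET'].astype(np.int32)).reshape(-1,)

# Only numeric features
features = features.select_dtypes('number')
features = features.drop(columns=['TARGET', 'SK_ID_CURR'])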

In [31]:
features.columns

Index(['CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY',
       'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH',
       'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH',
       ...
       'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
       'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR',
       'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
       'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
       'AMT_REQ_CREDIT_BUREAU_YEAR'],
      dtype='object', length=104)

In [33]:
# Labels are the TARGET column; they must be extracted before TARGET is dropped from `features`
labels

array([1, 0, 0, ..., 0, 0, 0])

In [34]:
labels.shape

(16000,)

In [35]:
# Split into training and testing data
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=6000)

In [36]:
train_features.shape, test_features.shape

((10000, 104), (6000, 104))

We use only the numeric features to reduce the number of dimensions, which helps speed up the hyperparameter search. Again, this is something we would not want to do on a real problem, but for demonstration purposes it lets us see the concepts in practice (rather than waiting days or months for the search to finish).

In [37]:
train_set = lgb.Dataset(data=train_features, label=train_labels)
test_set = lgb.Dataset(data=test_features, label=test_labels)

In [38]:
model = lgb.LGBMClassifier()

In [40]:
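# Start from the default hyperparameters of the scikit-learn style classifier
default_params = model.get_params()
# Drop n_estimators: the number of boosting rounds is set through lgb.cv instead
del default_params['n_estimators']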

In [59]:
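# Requires LightGBM < 4.0; newer releases replace early_stopping_rounds with callbacks=[lgb.early_stopping(100)]
cv_result = lgb.cv(default_params, train_set, num_boost_round=1000, early_stopping_rounds=100, metrics='auc', nfold=N_FOLDS)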

In [49]:
print(
    'The maximum validation ROC AUC was: {:.5f} with a standard deviation of {:.5f}.'.format(
        cv_result['auc-mean'][-1], cv_result['auc-stdv'][-1]))

The maximum validation ROC AUC was: 0.71407 with a standard deviation of 0.01947.


In [56]:
cv_result.keys()

dict_keys(['auc-mean', 'auc-stdv'])
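
The dictionary returned by `lgb.cv` holds one list per key, with one entry per boosting round. A minimal sketch of how we might pull the best round out of such a dictionary is shown below. The helper `summarize_cv` and the toy scores are hypothetical, made up purely for illustration and not results from this notebook; looking the keys up by suffix is an assumption meant to tolerate key names that differ between LightGBM versions.

```python
def summarize_cv(cv_result, metric="auc"):
    """Summarize a cross-validation result dictionary.

    Assumes the dictionary has one key ending in '<metric>-mean' and one
    ending in '<metric>-stdv', each mapping to a list with one value per
    boosting round, and that higher metric values are better.
    """
    mean_key = next(k for k in cv_result if k.endswith(f"{metric}-mean"))
    std_key = next(k for k in cv_result if k.endswith(f"{metric}-stdv"))
    means = cv_result[mean_key]

    # Index of the round with the highest mean validation score
    best_idx = max(range(len(means)), key=means.__getitem__)
    return {
        "best_round": best_idx + 1,  # rounds are counted from 1
        "best_mean": means[best_idx],
        "best_std": cv_result[std_key][best_idx],
        "n_rounds": len(means),
    }


# Toy values for illustration only
toy_result = {
    "auc-mean": [0.60, 0.65, 0.70, 0.69],
    "auc-stdv": [0.030, 0.020, 0.020, 0.025],
}
summary = summarize_cv(toy_result)
```

With the real `cv_result`, `summary["best_round"]` would be a natural choice for the number of estimators when training the final model.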