# Home Credit - Gradient Boosting
This model is based on [LightGBM](https://lightgbm.readthedocs.io). Some additional feature engineering is performed; for brevity, it lives in separate Python utility modules (imported below). These currently extract data from the other data sources, perform aggregations, encodings, etc., and then merge the results with the training and test data sets. The engineered data is then fed to the gradient boosting model. The data is split into cross-validation folds and a ROC AUC score is calculated for each fold.

In [1]:
import os, sys
import numpy as np
from matplotlib import pyplot as plt

import pandas as pd

from sklearn.metrics import roc_auc_score, precision_recall_curve, roc_curve, average_precision_score
from sklearn.model_selection import KFold
from lightgbm import LGBMClassifier

from pre_process import *
from lightgbm_utils import *

%matplotlib inline

In [2]:
# Init some useful dirs
current_dir = os.getcwd()
DATA_HOME_DIR = current_dir+'/../data/'

## Data

In [3]:
pd.options.display.max_columns = None

In [4]:
df_train_pre, df_test_pre, y = load_train_test_data(DATA_HOME_DIR) 

In [5]:
df_train_pre.shape

(307511, 121)

### Handle categoricals

In [6]:
df_train, df_test = load_data_dummies(df_train_pre, df_test_pre)

In [7]:
df_train.shape

(307511, 245)

In [8]:
df_test.shape

(48744, 245)
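Both frames come out with the same 245 columns, which is what a model trained on one and applied to the other needs. The following is a minimal sketch of one way to get aligned dummy columns, assuming `pd.get_dummies` applied to the concatenated frames. `one_hot_align` and the toy data are hypothetical and are not the contents of `load_data_dummies`.

```python
import pandas as pd


def one_hot_align(train: pd.DataFrame, test: pd.DataFrame):
    """One-hot encode categorical columns so train and test share the same columns.

    Hypothetical helper for illustration; not the notebook's load_data_dummies.
    """
    n_train = len(train)
    # Encode both frames together so that a category seen in only one of them
    # still produces a column in both.
    combined = pd.concat([train, test], axis=0, ignore_index=True)
    encoded = pd.get_dummies(combined)
    enc_train = encoded.iloc[:n_train].copy()
    enc_test = encoded.iloc[n_train:].reset_index(drop=True)
    return enc_train, enc_test


# Toy data: "Revolving" appears only in the training frame.
toy_train = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "CONTRACT": ["Cash", "Revolving", "Cash"]})
toy_test = pd.DataFrame({"SK_ID_CURR": [4, 5], "CONTRACT": ["Cash", "Cash"]})

enc_train, enc_test = one_hot_align(toy_train, toy_test)
print(enc_train.columns.tolist())
print(enc_train.shape, enc_test.shape)
```

Encoding the two frames separately would leave the test frame without a `CONTRACT_Revolving` column here, so the concatenate-then-split step is what keeps the shapes compatible.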

### Additional features

In [9]:
df_train, df_test = append_bureau_data(in_dir=DATA_HOME_DIR, df_train=df_train, df_test=df_test)

In [10]:
df_train, df_test = append_previous_applications(in_dir=DATA_HOME_DIR, df_train=df_train, df_test=df_test)

In [11]:
df_train, df_test = append_pos_data(in_dir=DATA_HOME_DIR, df_train=df_train, df_test=df_test)

In [12]:
df_train, df_test = append_credit_card_data(in_dir=DATA_HOME_DIR, df_train=df_train, df_test=df_test)

In [13]:
df_train, df_test = append_installments_data(in_dir=DATA_HOME_DIR, df_train=df_train, df_test=df_test)

In [14]:
df_train.shape

(307511, 504)
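The row count stays at 307511 while the column count grows, which is consistent with each secondary table being aggregated to one row per applicant and then left-merged on `SK_ID_CURR`. A minimal sketch of that pattern follows, under the assumption that simple numeric aggregates are used. `append_aggregates` and the toy frames are hypothetical, and the real `append_*` helpers may compute different features.

```python
import pandas as pd


def append_aggregates(df_main, df_child, key="SK_ID_CURR", prefix="CHILD"):
    """Aggregate numeric columns of df_child per key and left-merge onto df_main.

    Hypothetical helper for illustration; not one of the notebook's append_* functions.
    """
    numeric_cols = df_child.select_dtypes("number").columns.drop(key)
    agg = df_child.groupby(key)[list(numeric_cols)].agg(["mean", "max", "sum"])
    # Flatten the (column, statistic) MultiIndex into names like BUREAU_AMT_CREDIT_SUM_MEAN.
    agg.columns = [f"{prefix}_{col}_{stat.upper()}" for col, stat in agg.columns]
    # A left merge keeps every applicant; those without child rows get NaN.
    return df_main.merge(agg.reset_index(), on=key, how="left")


# Toy data: applicant 3 has no rows in the secondary table.
applications = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})
bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_CREDIT_SUM": [100.0, 300.0, 50.0],
})

merged = append_aggregates(applications, bureau, prefix="BUREAU")
print(merged)
```

The missing values introduced by the left merge do not need imputing for this model, since LightGBM handles `NaN` inputs natively.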

## Split data
Evaluate the algorithm with 5-fold cross-validation. `SK_ID_CURR` is an identifier, so it is excluded from the feature list.

In [15]:
feats = [f for f in df_train.columns if f not in ['SK_ID_CURR']]

In [16]:
folds = KFold(n_splits=5, shuffle=True, random_state=42)
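To illustrate what the `KFold` object yields, here is a small sketch on toy data (the array and the `*_toy` names are made up for the example). Each call to `split` produces index arrays for the training and validation parts of one fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(20).reshape(10, 2)  # 10 toy rows, 2 features

folds_toy = KFold(n_splits=5, shuffle=True, random_state=42)

val_seen = []
for n_fold, (trn_idx, val_idx) in enumerate(folds_toy.split(X_toy), start=1):
    # Each row lands in the validation part of exactly one fold.
    print(f"Fold {n_fold}: train rows {trn_idx.tolist()}, validation rows {val_idx.tolist()}")
    val_seen.extend(val_idx.tolist())
```

Fixing `random_state` makes the shuffled split reproducible, so fold scores from different runs can be compared.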

## The Model
Now run the LightGBM model over the cross-validation folds. First, set the hyperparameters.

TODO: Plug in Optunity here...

In [17]:
# The hyper parameters
EARLY_STOPPING_ROUNDS = 150
num_leaves = 35

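The `lightgbm_utils` module provides a utility function, `run_lightgbm_model`, that trains a LightGBM model over the cross-validation folds. See its docstring for details. Note that the call below passes 10 as its sixth positional argument, which matches the 10-round early-stopping window reported in the log, so `EARLY_STOPPING_ROUNDS` defined above is not used.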

In [None]:
run_lightgbm_model(df_train, df_test, y, folds, feats, 10, 
          num_leaves=num_leaves, save_model=True, file_prefix="m1_nl35", objective='binary', metric='binary_logloss')

Training until validation scores don't improve for 10 rounds.
Early stopping, best iteration is:
[3]	training's auc: 0.738472	valid_1's auc: 0.732281
Fold  1 AUC : 0.732281
Training until validation scores don't improve for 10 rounds.
[100]	training's auc: 0.782692	valid_1's auc: 0.762573
[200]	training's auc: 0.806206	valid_1's auc: 0.774109
[300]	training's auc: 0.821031	valid_1's auc: 0.777856
[400]	training's auc: 0.833314	valid_1's auc: 0.779838
Early stopping, best iteration is:
[489]	training's auc: 0.842198	valid_1's auc: 0.780698
Fold  2 AUC : 0.780698
Training until validation scores don't improve for 10 rounds.
[100]	training's auc: 0.784184	valid_1's auc: 0.7613
[200]	training's auc: 0.806982	valid_1's auc: 0.772695
[300]	training's auc: 0.82139	valid_1's auc: 0.776832
[400]	training's auc: 0.833503	valid_1's auc: 0.778731
[500]	training's auc: 0.843394	valid_1's auc: 0.779781
Early stopping, best iteration is:
[511]	training's auc: 0.84451	valid_1's auc: 0.779843
Fold  3 A

In [None]:
run_model()

### Submission

In [None]:
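# Copy the ID column so that adding TARGET does not trigger pandas' SettingWithCopyWarning
df_submission = df_test[['SK_ID_CURR']].copy()
# NOTE: fold_preds_test is not defined in this notebook; it is assumed to hold
# the test-set predictions produced by the model run above
df_submission['TARGET'] = fold_preds_test
df_submission.to_csv('lgbm_submission.csv', index=False)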