# Home Credit - Gradient Boosting
This model is based on [LightGBM](https://lightgbm.readthedocs.io), with some additional feature engineering. For brevity, the feature engineering lives in separate Python utility modules. These currently extract data from the other data sources, perform aggregations, encodings, etc., and then merge the results with the training and test data sets. The engineered data is then fed to the gradient boosting model. The data is split into cross-validation folds and a ROC AUC score is calculated.

In [1]:
import os, sys
import numpy as np
from matplotlib import pyplot as plt

import pandas as pd

from sklearn.metrics import roc_auc_score, precision_recall_curve, roc_curve, average_precision_score
from sklearn.model_selection import KFold
from lightgbm import LGBMClassifier

from pre_process import *
from lightgbm_utils import *
from sklearn.model_selection import train_test_split

%matplotlib inline

This means that in case of installing LightGBM from PyPI via the ``pip install lightgbm`` command, you don't need to install the gcc compiler anymore.
Instead of that, you need to install the OpenMP library, which is required for running LightGBM on the system with the Apple Clang compiler.
You can install the OpenMP library by the following command: ``brew install libomp``.


In [2]:
# Init some useful dirs
current_dir = os.getcwd()
DATA_HOME_DIR = current_dir+'/../data/'

## Data

In [3]:
pd.options.display.max_columns = None

In [4]:
df_train_pre, df_test_pre, y = load_train_test_data(DATA_HOME_DIR) 

FileNotFoundError: [Errno 2] File b'/Users/lsmith/Projects/kaggle-home-credit/src/../data//application_train.csv' does not exist: b'/Users/lsmith/Projects/kaggle-home-credit/src/../data//application_train.csv'

In [6]:
df_train_pre.shape

(307511, 121)

### Handle categoricals

In [7]:
df_train, df_test = load_data_dummies(df_train_pre, df_test_pre)

In [8]:
df_train.shape

(307511, 245)

In [9]:
df_test.shape

(48744, 245)
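
The matching column counts above (245 for both frames) are what you get when train and test are encoded against a shared set of categories. A minimal sketch of one way to achieve that with pandas follows. It does not show how `load_data_dummies` is actually implemented; `one_hot_aligned` and the toy columns are hypothetical names used only for illustration.

```python
import pandas as pd


def one_hot_aligned(train: pd.DataFrame, test: pd.DataFrame):
    """One-hot encode categorical columns so train and test share columns.

    Hypothetical helper for illustration; not the notebook's load_data_dummies.
    """
    # Encode both frames together so a category seen in only one frame
    # still produces a column in the other.
    combined = pd.concat([train, test], keys=["train", "test"])
    encoded = pd.get_dummies(combined)
    # Split back into the two original frames using the outer index level.
    return encoded.xs("train"), encoded.xs("test")


# Toy data: "Other" appears only in the test frame.
toy_train = pd.DataFrame({"SK_ID_CURR": [1, 2, 3],
                          "CONTRACT": ["Cash", "Revolving", "Cash"]})
toy_test = pd.DataFrame({"SK_ID_CURR": [4, 5],
                         "CONTRACT": ["Cash", "Other"]})

enc_train, enc_test = one_hot_aligned(toy_train, toy_test)
print(list(enc_train.columns))
print(enc_train.shape, enc_test.shape)
```

Encoding the frames separately would give the test frame a `CONTRACT_Other` column that the training frame lacks, which is why the sketch concatenates first.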

### Additional features

In [10]:
df_train, df_test = append_bureau_data(in_dir=DATA_HOME_DIR, df_train=df_train, df_test=df_test)

In [11]:
df_train, df_test = append_previous_applications(in_dir=DATA_HOME_DIR, df_train=df_train, df_test=df_test)

In [12]:
df_train, df_test = append_pos_data(in_dir=DATA_HOME_DIR, df_train=df_train, df_test=df_test)

In [13]:
df_train, df_test = append_credit_card_data(in_dir=DATA_HOME_DIR, df_train=df_train, df_test=df_test)

In [14]:
df_train, df_test = append_installments_data(in_dir=DATA_HOME_DIR, df_train=df_train, df_test=df_test)

In [15]:
df_train.shape

(307511, 504)

In [16]:
df_test.shape

(48744, 504)
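
Each `append_*` call above adds columns without changing the row count, which is consistent with aggregating a one-to-many table per applicant and left-joining the result. A rough sketch of that pattern is shown below. The helper `append_aggregates`, the chosen statistics and the toy values are assumptions for illustration; the real functions in the utility modules may work differently.

```python
import pandas as pd


def append_aggregates(df_main, df_child, key="SK_ID_CURR", prefix="CHILD"):
    """Aggregate a one-to-many table per key and left-join it onto df_main.

    Hypothetical helper; the notebook's append_* functions may differ.
    """
    numeric = df_child.select_dtypes("number").drop(columns=[key], errors="ignore")
    agg = numeric.groupby(df_child[key]).agg(["mean", "max", "sum"])
    # Flatten the (column, statistic) MultiIndex into single column names.
    agg.columns = [f"{prefix}_{col}_{stat.upper()}" for col, stat in agg.columns]
    # A left join keeps every applicant; those without child rows get NaN.
    return df_main.merge(agg.reset_index(), how="left", on=key)


# Toy data: applicant 3 has no rows in the child table.
applicants = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})
bureau = pd.DataFrame({"SK_ID_CURR": [1, 1, 2],
                       "AMT_CREDIT_SUM": [100.0, 300.0, 50.0]})

out = append_aggregates(applicants, bureau, prefix="BUREAU")
print(out)
```

LightGBM handles missing values natively, so the NaN entries produced by the left join can be passed to the model as they are.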

In [17]:
y.shape

(307511,)

In [18]:
y.sample(5)

33043     0
184601    1
72737     0
296454    0
222918    1
Name: TARGET, dtype: int64

# Split data
Run the algorithm using cross-validation folds.

In [19]:
feats = [f for f in df_train.columns if f not in ['SK_ID_CURR']]

In [20]:
folds = KFold(n_splits=5, shuffle=True, random_state=42)
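
As a small illustration of what the `KFold` object provides, the sketch below iterates over the splits of a toy array (the array is made up for the example). Each row lands in exactly one validation fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy rows, 2 features
folds = KFold(n_splits=5, shuffle=True, random_state=42)

seen = []  # validation row indices collected across all folds
for n_fold, (train_idx, valid_idx) in enumerate(folds.split(X), start=1):
    print(f"Fold {n_fold}: train={len(train_idx)} rows, valid={len(valid_idx)} rows")
    seen.extend(valid_idx.tolist())
```

`KFold` does not look at the labels. When the positive class is rare, `StratifiedKFold` from the same module is a possible alternative that keeps the class ratio similar in every fold.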

# The Model
Now run the LightGBM model over the cross-validation folds. First, the model's hyperparameters.

TODO: Plug in Optunity here...

In [25]:
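# The hyperparameters
EARLY_STOPPING_ROUNDS = 150
# NOTE: several keys below are LightGBM aliases of one another with different
# values (marked inline); it is safer to keep a single name per parameter.
args = {
    "n_estimators": 4000,       # alias of num_iterations (set to 1000 below)
    "learning_rate": 0.03,
    "num_leaves": 30,
    "colsample_bytree": 0.8,    # alias of feature_fraction (set to 1.0 below)
    "subsample": 0.9,           # alias of bagging_fraction (set to 1.0 below)
    "max_depth": 6,
    "max_bin": 255,
    "num_iterations": 1000,     # the training log below stops at round 1000
    "min_data_in_leaf": 20,
    "reg_alpha": 0.1,           # alias of lambda_l1 (set to 0.0 below)
    "reg_lambda": 0.1,          # alias of lambda_l2 (set to 0.0 below)
    "min_split_gain": 0.01,     # alias of min_gain_to_split (set to 0.0 below)
    "min_child_weight": 2,
    "silent": -1,
    "verbose": -1,
    "objective": "regression",  # TARGET is 0/1, so "binary" may be the intended objective
    "metric": "",
    "bagging_fraction": 1.0,
    "bagging_freq": 0,          # 0 disables bagging, so the row-sampling fractions have no effect
    "lambda_l1": 0.0,
    "lambda_l2": 0.0,
    "min_gain_to_split": 0.0,
    "feature_fraction": 1.0
}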
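
LightGBM accepts several names for the same parameter (for example `n_estimators` and `num_iterations`), and the dictionary above sets some of these pairs to different values. The sketch below is one way to spot such clashes before training. `find_alias_conflicts` is a hypothetical helper, not part of LightGBM, and `ALIAS_GROUPS` lists only the subset of documented aliases relevant here.

```python
# Alias groups from the LightGBM parameter documentation (subset only).
ALIAS_GROUPS = {
    "num_iterations": {"num_iterations", "n_estimators", "num_boost_round"},
    "feature_fraction": {"feature_fraction", "colsample_bytree"},
    "bagging_fraction": {"bagging_fraction", "subsample"},
    "lambda_l1": {"lambda_l1", "reg_alpha"},
    "lambda_l2": {"lambda_l2", "reg_lambda"},
    "min_gain_to_split": {"min_gain_to_split", "min_split_gain"},
}


def find_alias_conflicts(params):
    """Return {canonical_name: {alias: value}} for every alias group that is
    set more than once with differing values. Hypothetical helper."""
    conflicts = {}
    for canonical, names in ALIAS_GROUPS.items():
        given = {name: params[name] for name in names if name in params}
        if len(set(given.values())) > 1:
            conflicts[canonical] = given
    return conflicts


# The clashing entries copied from the hyperparameter dictionary above.
args_subset = {
    "n_estimators": 4000, "num_iterations": 1000,
    "colsample_bytree": 0.8, "feature_fraction": 1.0,
    "subsample": 0.9, "bagging_fraction": 1.0,
    "reg_alpha": 0.1, "lambda_l1": 0.0,
    "reg_lambda": 0.1, "lambda_l2": 0.0,
    "min_split_gain": 0.01, "min_gain_to_split": 0.0,
}
conflicts = find_alias_conflicts(args_subset)
for name, given in conflicts.items():
    print(name, given)
```

Keeping one spelling per parameter removes any doubt about which value the booster actually uses.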

The `lightgbm_utils` module provides a utility function, `run_lightgbm_model`, that runs LightGBM over the cross-validation folds. See that function's documentation for details.
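
For readers without access to the helper, the general shape of a cross-fold run can be sketched as follows. This is not the helper's actual code. To keep the example self-contained it swaps in scikit-learn's `GradientBoostingClassifier` and omits early stopping; since `LGBMClassifier` follows the same `fit` / `predict_proba` interface, it could be supplied through the `make_model` argument instead. The function name `cross_fold_predict` and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold


def cross_fold_predict(make_model, X, y, X_test, folds):
    """Fit one model per fold; return out-of-fold and averaged test scores."""
    oof = np.zeros(len(X))
    test_pred = np.zeros(len(X_test))
    for n_fold, (trn, val) in enumerate(folds.split(X, y), start=1):
        model = make_model()
        model.fit(X[trn], y[trn])
        # Score only the rows this fold's model did not train on.
        oof[val] = model.predict_proba(X[val])[:, 1]
        # Average the test-set predictions across folds.
        test_pred += model.predict_proba(X_test)[:, 1] / folds.get_n_splits()
        print(f"Fold {n_fold:2d} AUC : {roc_auc_score(y[val], oof[val]):.6f}")
    return oof, test_pred


# Synthetic binary problem in which the first feature carries the signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)
X_test = rng.normal(size=(50, 5))

folds = KFold(n_splits=5, shuffle=True, random_state=42)
oof, test_pred = cross_fold_predict(
    lambda: GradientBoostingClassifier(n_estimators=20, random_state=0),
    X, y, X_test, folds)
print(f"Full AUC : {roc_auc_score(y, oof):.6f}")
```

Because every out-of-fold prediction comes from a model that never saw that row, the AUC over `oof` gives an estimate of performance on unseen data.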

In [26]:
df_fold_preds_train, df_fold_preds_test, df_feature_importance = \
    run_lightgbm_model(df_train, df_test, y, folds, feats, early_stopping=EARLY_STOPPING_ROUNDS, args_dict=args)
                       #save_model=True, file_prefix="m1_nl35")



Training until validation scores don't improve for 150 rounds.
[100]	training's auc: 0.773597	valid_1's auc: 0.760179
[200]	training's auc: 0.793487	valid_1's auc: 0.771515
[300]	training's auc: 0.805815	valid_1's auc: 0.775637
[400]	training's auc: 0.815278	valid_1's auc: 0.777681
[500]	training's auc: 0.82355	valid_1's auc: 0.778545
[600]	training's auc: 0.830652	valid_1's auc: 0.779
[700]	training's auc: 0.837308	valid_1's auc: 0.779432
[800]	training's auc: 0.843495	valid_1's auc: 0.779714
[900]	training's auc: 0.849352	valid_1's auc: 0.779985
[1000]	training's auc: 0.854956	valid_1's auc: 0.77992
Did not meet early stopping. Best iteration is:
[1000]	training's auc: 0.854956	valid_1's auc: 0.77992
Fold  1 AUC : 0.779920




Training until validation scores don't improve for 150 rounds.


KeyboardInterrupt: 

In [None]:
run_model()

### Submission

In [None]:
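# Copy the slice to avoid pandas' SettingWithCopyWarning when adding TARGET
df_submission = df_test[['SK_ID_CURR']].copy()
# NOTE: `fold_preds_test` is not defined above; the model cell returns
# `df_fold_preds_test`, which may need reducing to one score per test row.
df_submission['TARGET'] = fold_preds_test
df_submission.to_csv('lgbm_submission.csv', index=False)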
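
One possible way to turn per-fold test predictions into the two-column submission is sketched below. It assumes a layout of one prediction column per fold, which may not match what `run_lightgbm_model` returns. `build_submission` and the toy values are hypothetical.

```python
import pandas as pd


def build_submission(ids, fold_preds):
    """Average per-fold predictions into one TARGET score per test row.

    Assumes one column per fold (an assumed layout, not confirmed by the
    notebook's helper).
    """
    if len(ids) != len(fold_preds):
        raise ValueError("ids and predictions must have the same number of rows")
    return pd.DataFrame({
        "SK_ID_CURR": ids.to_numpy(),
        "TARGET": fold_preds.mean(axis=1).to_numpy(),
    })


# Toy identifiers and per-fold scores.
ids = pd.Series([1, 2, 3], name="SK_ID_CURR")
fold_preds = pd.DataFrame({"fold_1": [0.10, 0.80, 0.30],
                           "fold_2": [0.20, 0.60, 0.50]})

submission = build_submission(ids, fold_preds)
print(submission.to_csv(index=False))
```

Checking the row count before writing the file catches a mismatch between the identifiers and the predictions early.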