# Home Credit - Gradient Boosting
This model is based on [LightGBM](https://lightgbm.readthedocs.io). Some additional feature engineering is performed; for brevity, it lives in separate Python utility modules (`pre_process` and `lightgbm_utils`). These currently extract the data from the other data sources, perform aggregations, encodings, etc., and then merge the results with the training and test data sets. The engineered data is then fed to the gradient boosting model. The data is split into cross-validation folds and a ROC AUC score is calculated.
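
The aggregate-then-merge step described above can be sketched as follows. This is a minimal illustration only, not the code in `pre_process`: the helper name `append_aggregates`, its `prefix` argument, and the toy column and values are assumptions; only the `SK_ID_CURR` key is taken from this notebook.

```python
import pandas as pd


def append_aggregates(df_main, df_other, key="SK_ID_CURR", prefix="OTHER"):
    """Hypothetical helper: aggregate df_other per key, then left-merge onto df_main."""
    # Keep numeric columns only and exclude the key itself from the aggregation.
    numeric = df_other.select_dtypes(include="number").drop(columns=[key], errors="ignore")
    # One row per key, with mean / max / sum of every numeric column.
    agg = numeric.groupby(df_other[key]).agg(["mean", "max", "sum"])
    # Flatten the (column, statistic) MultiIndex into names such as BUREAU_X_MEAN.
    agg.columns = [f"{prefix}_{col}_{stat}".upper() for col, stat in agg.columns]
    agg = agg.reset_index()
    # A left merge keeps every applicant, with NaN where no secondary rows exist.
    return df_main.merge(agg, on=key, how="left")


# Toy data (illustrative values, not taken from the competition files).
df_main = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})
df_bureau = pd.DataFrame(
    {"SK_ID_CURR": [1, 1, 2], "AMT_CREDIT_SUM": [100.0, 300.0, 50.0]}
)

df_out = append_aggregates(df_main, df_bureau, prefix="BUREAU")
print(df_out)
```

Applicants with no rows in the secondary table end up with missing values, which tree-based boosters such as LightGBM can generally consume without imputation.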

In [1]:
import os, sys
import numpy as np
from matplotlib import pyplot as plt

import pandas as pd

from sklearn.metrics import roc_auc_score, precision_recall_curve, roc_curve, average_precision_score
from sklearn.model_selection import KFold
from lightgbm import LGBMClassifier

from pre_process import *
from lightgbm_utils import *

%matplotlib inline

In [2]:
# Init some useful dirs
current_dir = os.getcwd()
DATA_HOME_DIR = current_dir+'/../data/'

## Data

In [3]:
pd.options.display.max_columns = None

In [4]:
df_train_pre, df_test_pre, y = load_train_test_data(DATA_HOME_DIR)

In [5]:
df_train_pre.shape

(307511, 121)

In [6]:
df_train, df_test = load_data_dummies(df_train_pre, df_test_pre)
df_train, df_test = append_poly_feature(in_dir=DATA_HOME_DIR, df_train=df_train, df_test=df_test)
df_train, df_test = append_bureau_data(in_dir=DATA_HOME_DIR, df_train=df_train, df_test=df_test)
df_train, df_test = append_previous_applications(in_dir=DATA_HOME_DIR, df_train=df_train, df_test=df_test)
df_train, df_test = append_pos_data(in_dir=DATA_HOME_DIR, df_train=df_train, df_test=df_test)
df_train, df_test = append_credit_card_data(in_dir=DATA_HOME_DIR, df_train=df_train, df_test=df_test)
df_train, df_test = append_installments_data(in_dir=DATA_HOME_DIR, df_train=df_train, df_test=df_test)

In [7]:
df_train.shape

(307511, 831)

In [8]:
df_test.shape

(48744, 831)

In [9]:
y.shape

(307511,)

## Split data
Run the algorithm using cross-validation folds.

In [10]:
feats = [f for f in df_train.columns if f not in ['SK_ID_CURR']]

In [18]:
folds = KFold(n_splits=5, shuffle=True)  # pass random_state=42 only when reproducible folds are needed for testing
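
As a rough sketch of what running a model across folds involves: each fold's model is trained on four fifths of the rows and scored on the held-out fifth, and the held-out (out-of-fold) predictions are pooled for an overall AUC. The example uses synthetic data and scikit-learn's `LogisticRegression` as a stand-in classifier. It is not the `run_lightgbm_model` implementation, and the names ending in `_toy` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# Synthetic binary classification data standing in for df_train / y.
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)
folds_toy = KFold(n_splits=5, shuffle=True, random_state=0)

oof_preds = np.zeros(len(y_toy))  # one out-of-fold prediction per row
fold_aucs = []

for n_fold, (trn_idx, val_idx) in enumerate(folds_toy.split(X_toy), start=1):
    # Stand-in model; the notebook itself uses LightGBM here.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_toy[trn_idx], y_toy[trn_idx])

    # Score only rows the model has not seen during training.
    oof_preds[val_idx] = clf.predict_proba(X_toy[val_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y_toy[val_idx], oof_preds[val_idx]))
    print(f"Fold {n_fold} AUC : {fold_aucs[-1]:.6f}")

# Overall score computed on the pooled out-of-fold predictions.
overall_auc = roc_auc_score(y_toy, oof_preds)
print(f"Overall AUC score {overall_auc:.6f}")
```

Because every row is predicted exactly once by a model that did not train on it, the pooled score is an estimate of performance on unseen data.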

## The Model
Now run the LightGBM model using the cross-validation folds. First, define the model's hyperparameters.

TODO: Plug in optunity here...

In [12]:
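# The hyperparameters
EARLY_STOPPING_ROUNDS = 250
args = {
    "n_estimators": 4000,        # alias of num_iterations, which is also set below
    "learning_rate": 0.03,
    "num_leaves": 30,
    "colsample_bytree": 0.8,     # alias of feature_fraction, which is also set below
    "subsample": 0.9,            # alias of bagging_fraction, which is also set below
    "max_depth": 6,
    "max_bin": 1024,
    "num_iterations": 1000,      # the training log stops at round 1000, so this value took effect
    "min_data_in_leaf": 20,
    "reg_alpha": 0.1,            # alias of lambda_l1, which is also set below
    "reg_lambda": 0.1,           # alias of lambda_l2, which is also set below
    "min_split_gain": 0.01,      # alias of min_gain_to_split, which is also set below
    "min_child_weight": 2,
    "silent": -1,
    "verbose": -1,
   # "objective": "regression",
   # "metric": "",
    "objective": "binary",
    "metric": "binary_log_loss", # LightGBM documents this metric as "binary_logloss"; the log reports auc
    "bagging_fraction": 0.9,
    "bagging_freq": 15,
    "lambda_l1": 0.0,
    "lambda_l2": 0.0,
    "min_gain_to_split": 0.0,
    "feature_fraction": 1.0
}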
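
Several keys in `args` are aliases of the same LightGBM parameter (for example `n_estimators` and `num_iterations`), and some of these pairs carry different values. A small check like the one below lists such conflicts. It is a sketch and not part of `lightgbm_utils`: the alias table is a hand-copied subset of the LightGBM parameter documentation, and `find_alias_conflicts` and `example_args` are hypothetical names.

```python
# Main parameter name -> aliases (subset relevant to this notebook).
ALIAS_GROUPS = {
    "num_iterations": ["n_estimators"],
    "feature_fraction": ["colsample_bytree"],
    "bagging_fraction": ["subsample"],
    "lambda_l1": ["reg_alpha"],
    "lambda_l2": ["reg_lambda"],
    "min_gain_to_split": ["min_split_gain"],
}


def find_alias_conflicts(params):
    """Return {main_name: {key: value}} for alias groups set to differing values."""
    conflicts = {}
    for main, aliases in ALIAS_GROUPS.items():
        present = {name: params[name] for name in [main, *aliases] if name in params}
        # A conflict exists only when the same parameter is given two different values.
        if len(set(present.values())) > 1:
            conflicts[main] = present
    return conflicts


# The alias-related entries of the args dictionary defined above.
example_args = {
    "n_estimators": 4000, "num_iterations": 1000,
    "colsample_bytree": 0.8, "feature_fraction": 1.0,
    "subsample": 0.9, "bagging_fraction": 0.9,
    "reg_alpha": 0.1, "lambda_l1": 0.0,
    "reg_lambda": 0.1, "lambda_l2": 0.0,
    "min_split_gain": 0.01, "min_gain_to_split": 0.0,
}

conflicts = find_alias_conflicts(example_args)
for main, values in conflicts.items():
    print(main, values)
```

Keeping a single name per parameter avoids having to reason about which of two conflicting values the library ends up using.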

The `lightgbm_utils` module provides a utility function, `run_lightgbm_model`, that trains and evaluates the LightGBM model across the cross-validation folds. See that function's documentation for details.

In [13]:
df_fold_preds_train, df_fold_preds_test, df_feature_importance = \
    run_lightgbm_model(df_train, df_test, y, folds, feats, early_stopping=EARLY_STOPPING_ROUNDS, args_dict=args)
                       #save_model=True, file_prefix="m1_nl35")



Training until validation scores don't improve for 250 rounds.
[100]	training's auc: 0.785565	valid_1's auc: 0.764575
[200]	training's auc: 0.808956	valid_1's auc: 0.77656
[300]	training's auc: 0.824786	valid_1's auc: 0.781211
[400]	training's auc: 0.837411	valid_1's auc: 0.783129
[500]	training's auc: 0.848305	valid_1's auc: 0.784292
[600]	training's auc: 0.858255	valid_1's auc: 0.784882
[700]	training's auc: 0.867474	valid_1's auc: 0.785221
[800]	training's auc: 0.875767	valid_1's auc: 0.785839
[900]	training's auc: 0.882787	valid_1's auc: 0.785732
[1000]	training's auc: 0.888979	valid_1's auc: 0.785764
Did not meet early stopping. Best iteration is:
[1000]	training's auc: 0.888979	valid_1's auc: 0.785764
Fold  1 AUC : 0.785764




Training until validation scores don't improve for 250 rounds.
[100]	training's auc: 0.785746	valid_1's auc: 0.767342
[200]	training's auc: 0.809581	valid_1's auc: 0.77807
[300]	training's auc: 0.825679	valid_1's auc: 0.781999
[400]	training's auc: 0.838473	valid_1's auc: 0.784082
[500]	training's auc: 0.850154	valid_1's auc: 0.785059
[600]	training's auc: 0.859744	valid_1's auc: 0.785903
[700]	training's auc: 0.868447	valid_1's auc: 0.786204
[800]	training's auc: 0.875588	valid_1's auc: 0.786564
[900]	training's auc: 0.882851	valid_1's auc: 0.78688
[1000]	training's auc: 0.889706	valid_1's auc: 0.786947
Did not meet early stopping. Best iteration is:
[1000]	training's auc: 0.889706	valid_1's auc: 0.786947
Fold  2 AUC : 0.786947




Training until validation scores don't improve for 250 rounds.
[100]	training's auc: 0.786586	valid_1's auc: 0.763858
[200]	training's auc: 0.810096	valid_1's auc: 0.775464
[300]	training's auc: 0.825705	valid_1's auc: 0.779768
[400]	training's auc: 0.838133	valid_1's auc: 0.781896
[500]	training's auc: 0.849128	valid_1's auc: 0.782914
[600]	training's auc: 0.858523	valid_1's auc: 0.783588
[700]	training's auc: 0.867028	valid_1's auc: 0.783994
[800]	training's auc: 0.875251	valid_1's auc: 0.784416
[900]	training's auc: 0.882202	valid_1's auc: 0.784165
[1000]	training's auc: 0.889305	valid_1's auc: 0.784331
Did not meet early stopping. Best iteration is:
[1000]	training's auc: 0.889305	valid_1's auc: 0.784331
Fold  3 AUC : 0.784331




Training until validation scores don't improve for 250 rounds.
[100]	training's auc: 0.785648	valid_1's auc: 0.76379
[200]	training's auc: 0.809027	valid_1's auc: 0.775583
[300]	training's auc: 0.824207	valid_1's auc: 0.780109
[400]	training's auc: 0.836938	valid_1's auc: 0.782852
[500]	training's auc: 0.848238	valid_1's auc: 0.784222
[600]	training's auc: 0.858148	valid_1's auc: 0.785483
[700]	training's auc: 0.866898	valid_1's auc: 0.786237
[800]	training's auc: 0.875078	valid_1's auc: 0.786634
[900]	training's auc: 0.882281	valid_1's auc: 0.786718
[1000]	training's auc: 0.889547	valid_1's auc: 0.787082
Did not meet early stopping. Best iteration is:
[1000]	training's auc: 0.889547	valid_1's auc: 0.787082
Fold  4 AUC : 0.787082




Training until validation scores don't improve for 250 rounds.
[100]	training's auc: 0.785321	valid_1's auc: 0.765754
[200]	training's auc: 0.809128	valid_1's auc: 0.778211
[300]	training's auc: 0.825089	valid_1's auc: 0.78268
[400]	training's auc: 0.837232	valid_1's auc: 0.784952
[500]	training's auc: 0.84837	valid_1's auc: 0.786159
[600]	training's auc: 0.858011	valid_1's auc: 0.787317
[700]	training's auc: 0.86689	valid_1's auc: 0.787714
[800]	training's auc: 0.873872	valid_1's auc: 0.788258
[900]	training's auc: 0.880798	valid_1's auc: 0.788474
[1000]	training's auc: 0.887541	valid_1's auc: 0.788715
Did not meet early stopping. Best iteration is:
[1000]	training's auc: 0.887541	valid_1's auc: 0.788715
Fold  5 AUC : 0.788715
Overall AUC score 0.786547
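
The per-fold scores printed above can be summarised with a few lines of arithmetic. The values are copied from the log. The reading given in the comments, that the overall score is computed on pooled out-of-fold predictions and can therefore differ slightly from the mean of the fold scores, is an assumption about `run_lightgbm_model`.

```python
from statistics import mean, stdev

# Validation AUC per fold, copied from the log above.
fold_aucs = [0.785764, 0.786947, 0.784331, 0.787082, 0.788715]
# Training AUC at the final iteration (round 1000) of each fold, also from the log.
final_train_aucs = [0.888979, 0.889706, 0.889305, 0.889547, 0.887541]

mean_auc = mean(fold_aucs)
std_auc = stdev(fold_aucs)  # sample standard deviation across folds

# Gap between training and validation AUC; a large gap suggests overfitting.
gaps = [train - valid for train, valid in zip(final_train_aucs, fold_aucs)]

print(f"Mean fold AUC        : {mean_auc:.6f}")
print(f"Std dev across folds : {std_auc:.6f}")
print(f"Mean train/valid gap : {mean(gaps):.6f}")
# Assumption: the reported overall score (0.786547) is computed on pooled
# out-of-fold predictions, so it need not equal the mean of the fold scores.
```

Every fold ran the full 1000 rounds without triggering early stopping, while the training AUC kept rising well above the validation AUC, so stronger regularisation may be worth trying before adding more rounds.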


In [None]:
args["num_leaves"]=64
args["max_depth"]=7
print(args)

In [None]:
df_fold_preds_train, df_fold_preds_test, df_feature_importance = \
    run_lightgbm_model(df_train, df_test, y, folds, feats, early_stopping=EARLY_STOPPING_ROUNDS, args_dict=args)
                       #save_model=True, file_prefix="m1_nl35")

In [16]:
#args["num_leaves"]=64
args["boosting"]="dart"
args["drop_rate"]=0.1
args["learning_rate"]=0.03
EARLY_STOPPING_ROUNDS=200
print(args)

{'n_estimators': 4000, 'learning_rate': 0.03, 'num_leaves': 30, 'colsample_bytree': 0.8, 'subsample': 0.9, 'max_depth': 6, 'max_bin': 1024, 'num_iterations': 1000, 'min_data_in_leaf': 20, 'reg_alpha': 0.1, 'reg_lambda': 0.1, 'min_split_gain': 0.01, 'min_child_weight': 2, 'silent': -1, 'verbose': -1, 'objective': 'binary', 'metric': 'binary_log_loss', 'bagging_fraction': 0.9, 'bagging_freq': 15, 'lambda_l1': 0.0, 'lambda_l2': 0.0, 'min_gain_to_split': 0.0, 'feature_fraction': 1.0, 'boosting': 'dart', 'drop_rate': 0.1}


In [None]:
df_fold_preds_train, df_fold_preds_test, df_feature_importance = \
    run_lightgbm_model(df_train, df_test, y, folds, feats, early_stopping=EARLY_STOPPING_ROUNDS, args_dict=args)
                       #save_model=True, file_prefix="m1_nl35")



Training until validation scores don't improve for 200 rounds.
[100]	training's auc: 0.761537	valid_1's auc: 0.751174
[200]	training's auc: 0.766409	valid_1's auc: 0.754385
[300]	training's auc: 0.774974	valid_1's auc: 0.760211
[400]	training's auc: 0.783584	valid_1's auc: 0.764976
[500]	training's auc: 0.793979	valid_1's auc: 0.77101
[600]	training's auc: 0.800346	valid_1's auc: 0.773916
[700]	training's auc: 0.806392	valid_1's auc: 0.77587


## Submission

In [15]:
# Copy the ID column so that adding TARGET does not write to a slice of df_test
# (avoids pandas' SettingWithCopyWarning).
df_submission = df_test[['SK_ID_CURR']].copy()
df_submission['TARGET'] = df_fold_preds_test
df_submission.to_csv('lgbm_submission3.csv', index=False)
