## Default Rate Estimation using LightGBM

### Introduction
`LightGBM` is a popular machine learning library in data science competitions and in industry, valued for its strong predictive performance and interpretability. In this notebook, we use it to build a binary classification model trained on the dataset of the [Tianchi Competition](https://tianchi.aliyun.com/competition/entrance/531830/information), building on the following work:
 * Feature Engineering: https://tianchi.aliyun.com/notebook-ai/detail?spm=5176.12586969.1002.6.3b30b135z4zdwX&postId=129321
 * Hyperparameter Tuning: https://tianchi.aliyun.com/notebook-ai/detail?spm=5176.12586969.1002.3.3b30b1352BkwCe&postId=129346
 
### Requirements
This notebook assumes that `../../../tianchi_loan/fg.ipynb` has already been run.

In [1]:
import pyspark
import yaml
import argparse
import subprocess
import lightgbm as lgb
import numpy as np
import pandas as pd
import warnings

from lightgbm import Booster, LGBMClassifier
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss

warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

### Read Dataset

In this section, we read the training dataset stored in `${MY_S3_BUCKET}/risk/tianchi/fg_train_data.csv`. You may need to substitute `${MY_S3_BUCKET}` with your own S3 bucket before running the **commented code** below.

In [2]:
# !aws s3 cp ${MY_S3_BUCKET}/risk/tianchi/fg_train_data.csv ../../../dataset/tianchi_loan/

In [3]:
train_data = pd.read_csv('../../../dataset/tianchi_loan/fg_train_data.csv')

In [4]:
train_data[:10]

Unnamed: 0,loanAmnt,term,interestRate,installment,grade,subGrade,employmentTitle,employmentLength,homeOwnership,annualIncome,verificationStatus,isDefault,purpose,postCode,regionCode,dti,delinquency_2years,ficoRangeLow,ficoRangeHigh,openAcc,pubRec,pubRecBankruptcies,revolBal,revolUtil,totalAcc,initialListStatus,applicationType,earliesCreditLine,title,policyCode,n0,n1,n2,n3,n4,n5,n6,n7,n8,n9,n10,n11,n12,n13,n14,issueDateDT,grade_target_mean,subGrade_target_mean,grade_to_mean_n0,grade_to_std_n0,grade_to_mean_n1,grade_to_std_n1,grade_to_mean_n2,grade_to_std_n2,grade_to_mean_n4,grade_to_std_n4,grade_to_mean_n5,grade_to_std_n5,grade_to_mean_n6,grade_to_std_n6,grade_to_mean_n7,grade_to_std_n7,grade_to_mean_n8,grade_to_std_n8,grade_to_mean_n9,grade_to_std_n9,grade_to_mean_n10,grade_to_std_n10,grade_to_mean_n11,grade_to_std_n11,grade_to_mean_n12,grade_to_std_n12,grade_to_mean_n13,grade_to_std_n13,grade_to_mean_n14,grade_to_std_n14
0,35000.0,5,19.52,917.97,5,21,161280,2,2,110000.0,2,1,1,43,32,17.05,0.0,730.0,734.0,7.0,0.0,0.0,24178.0,48.9,27.0,0,0,2001,1,1.0,0.0,2.0,2.0,2.0,4.0,9.0,8.0,4.0,12.0,2.0,7.0,0.0,0.0,0.0,2.0,2587,0.386234,0.380444,1.876011,3.992386,1.87462,4.053876,1.942294,4.023418,1.86916,3.948124,1.897562,4.055665,1.86576,4.017884,1.840872,4.074681,1.851544,4.040923,1.938318,4.024912,1.84221,4.108917,1.85281,4.009823,1.85281,4.009823,1.857394,4.005352,1.856379,3.991791
1,18000.0,5,18.49,461.9,4,16,89538,5,0,46000.0,2,0,0,64,18,27.83,0.0,700.0,704.0,13.0,0.0,0.0,15096.0,38.9,18.0,1,0,2002,5768,1.0,0.0,3.0,5.0,5.0,10.0,7.0,7.0,7.0,13.0,5.0,13.0,0.0,0.0,0.0,2.0,1888,0.304227,0.29819,1.500809,3.193909,1.502905,3.185919,1.504054,3.173189,1.567352,3.204484,1.511316,3.139166,1.515599,3.098975,1.500817,3.139721,1.517874,3.086106,1.50414,3.174194,1.484104,3.173687,1.482248,3.207858,1.482248,3.207858,1.485915,3.204282,1.485103,3.193433
2,12000.0,5,16.99,298.17,4,17,159367,8,0,74000.0,2,0,0,265,14,22.77,0.0,675.0,679.0,11.0,0.0,0.0,4606.0,51.8,27.0,0,0,2006,0,1.0,0.0,0.0,3.0,3.0,0.0,0.0,21.0,4.0,5.0,3.0,11.0,0.0,0.0,0.0,4.0,3044,0.304227,0.302541,1.500809,3.193909,1.360761,2.99819,1.532981,3.241462,1.273891,3.071276,1.162371,3.176718,1.480241,3.125317,1.472698,3.259745,1.406712,3.254085,1.530998,3.244609,1.50423,3.089208,1.482248,3.207858,1.482248,3.207858,1.485915,3.204282,1.315111,3.146801
3,2050.0,3,7.69,63.95,1,3,59830,9,0,35000.0,0,0,0,465,14,17.49,0.0,755.0,759.0,12.0,0.0,0.0,3111.0,8.5,23.0,0,0,2006,0,1.0,0.0,1.0,3.0,3.0,7.0,11.0,3.0,10.0,18.0,3.0,12.0,0.0,0.0,0.0,3.0,2679,0.059838,0.065532,0.375202,0.798477,0.368239,0.796491,0.383245,0.810366,0.380622,0.806605,0.384972,0.802575,0.368526,0.819126,0.369865,0.798404,0.377964,0.799464,0.38275,0.811152,0.370128,0.799459,0.370562,0.801965,0.370562,0.801965,0.371479,0.80107,0.344287,0.793451
4,11500.0,3,14.98,398.54,3,12,85242,1,1,30000.0,2,0,0,3,4,32.6,0.0,665.0,669.0,8.0,1.0,1.0,14021.0,59.7,33.0,1,0,1994,0,1.0,0.0,4.0,4.0,4.0,4.0,16.0,10.0,5.0,21.0,4.0,8.0,0.0,0.0,0.0,2.0,2406,0.224522,0.224686,1.125607,2.395431,1.113406,2.430896,1.133984,2.439745,1.121496,2.368874,1.19793,2.401168,1.120956,2.388727,1.106851,2.450979,1.144817,2.403154,1.133458,2.44134,1.104961,2.446307,1.111686,2.405894,1.111686,2.405894,1.114436,2.403211,1.113827,2.395075
5,12000.0,3,12.99,404.27,3,11,65718,5,2,60000.0,1,1,0,770,13,19.22,0.0,690.0,694.0,15.0,0.0,0.0,27176.0,46.0,21.0,1,0,1994,0,1.0,0.0,7.0,13.0,13.0,7.0,7.0,2.0,13.0,17.0,11.0,15.0,0.0,0.0,0.0,6.0,3257,0.224522,0.204005,1.125607,2.395431,1.085997,2.408741,0.984707,2.361605,1.141867,2.419815,1.133487,2.354374,1.100101,2.459716,1.119411,2.396658,1.136053,2.409156,1.011351,2.376224,1.124941,2.384061,1.111686,2.405894,1.111686,2.405894,1.114436,2.403211,0.92343,2.361914
6,24000.0,3,9.99,774.3,2,7,209276,10,0,150000.0,1,0,2,40,8,5.68,0.0,690.0,694.0,7.0,0.0,0.0,4334.0,68.8,25.0,0,0,1983,18780,1.0,1.0,1.0,3.0,3.0,2.0,7.0,7.0,6.0,17.0,3.0,7.0,0.0,0.0,0.0,2.0,2983,0.13121,0.128111,0.707941,1.635584,0.736477,1.592982,0.766491,1.620731,0.720818,1.621383,0.755658,1.569583,0.7578,1.549487,0.738697,1.62501,0.757368,1.606104,0.765499,1.622304,0.736884,1.643567,0.741124,1.603929,0.741124,1.603929,0.742958,1.602141,0.742552,1.596716
7,16000.0,3,7.91,500.72,1,4,8198,2,1,50000.0,0,0,4,76,8,38.95,0.0,710.0,714.0,9.0,0.0,0.0,19023.0,60.8,11.0,0,0,2011,16334,1.0,0.0,4.0,5.0,5.0,4.0,6.0,2.0,7.0,9.0,5.0,9.0,0.0,0.0,0.0,1.0,3136,0.059838,0.083522,0.375202,0.798477,0.371135,0.810299,0.376013,0.793297,0.373832,0.789625,0.368325,0.815212,0.3667,0.819905,0.375204,0.78493,0.364666,0.813245,0.376035,0.793549,0.368003,0.809138,0.370562,0.801965,0.370562,0.801965,0.371479,0.80107,0.395135,0.846111
8,6000.0,3,10.49,194.99,2,6,115263,2,0,77000.0,1,0,2,106,38,17.27,0.0,660.0,664.0,16.0,1.0,1.0,220.0,3.6,49.0,0,0,1996,18780,1.0,0.0,1.0,4.0,4.0,2.0,11.0,14.0,13.0,32.0,4.0,15.0,0.0,0.0,0.0,0.0,3533,0.13121,0.109461,0.750404,1.596954,0.736477,1.592982,0.755989,1.626497,0.720818,1.621383,0.769944,1.605151,0.739618,1.580526,0.746274,1.597772,0.788374,1.610142,0.755638,1.62756,0.749961,1.589374,0.741124,1.603929,0.741124,1.603929,0.742958,1.602141,0.846155,1.753293
9,10375.0,5,15.61,250.16,4,15,74728,9,0,58000.0,0,0,2,437,36,21.02,0.0,705.0,709.0,16.0,0.0,0.0,36609.0,61.1,33.0,0,0,2002,18780,1.0,0.0,3.0,4.0,4.0,5.0,6.0,14.0,13.0,14.0,4.0,16.0,0.0,0.0,0.0,2.0,2526,0.304227,0.279444,1.500809,3.193909,1.502905,3.185919,1.511979,3.252993,1.494754,3.218213,1.473298,3.26085,1.479236,3.161051,1.492548,3.195544,1.497336,3.234727,1.511277,3.25512,1.496655,3.146687,1.482248,3.207858,1.482248,3.207858,1.485915,3.204282,1.485103,3.193433


### Label and Features 
We assume that the training DataFrame contains only numerical features. The column named by `params['label']` is used as the label, and all other columns are used as features.

In [5]:
def load_config(path):
    params = dict()
    with open(path, 'r') as stream:
        params = yaml.load(stream, Loader=yaml.FullLoader)
    return params

params = load_config('../conf/spark_lgbm_dev.yaml')

In [6]:
label = params['label']
feature_cols = [x for x in train_data.columns if x not in [label]]

In [7]:
feature_cols

['loanAmnt',
 'term',
 'interestRate',
 'installment',
 'grade',
 'subGrade',
 'employmentTitle',
 'employmentLength',
 'homeOwnership',
 'annualIncome',
 'verificationStatus',
 'purpose',
 'postCode',
 'regionCode',
 'dti',
 'delinquency_2years',
 'ficoRangeLow',
 'ficoRangeHigh',
 'openAcc',
 'pubRec',
 'pubRecBankruptcies',
 'revolBal',
 'revolUtil',
 'totalAcc',
 'initialListStatus',
 'applicationType',
 'earliesCreditLine',
 'title',
 'policyCode',
 'n0',
 'n1',
 'n2',
 'n3',
 'n4',
 'n5',
 'n6',
 'n7',
 'n8',
 'n9',
 'n10',
 'n11',
 'n12',
 'n13',
 'n14',
 'issueDateDT',
 'grade_target_mean',
 'subGrade_target_mean',
 'grade_to_mean_n0',
 'grade_to_std_n0',
 'grade_to_mean_n1',
 'grade_to_std_n1',
 'grade_to_mean_n2',
 'grade_to_std_n2',
 'grade_to_mean_n4',
 'grade_to_std_n4',
 'grade_to_mean_n5',
 'grade_to_std_n5',
 'grade_to_mean_n6',
 'grade_to_std_n6',
 'grade_to_mean_n7',
 'grade_to_std_n7',
 'grade_to_mean_n8',
 'grade_to_std_n8',
 'grade_to_mean_n9',
 'grade_to_std_n9',


### Train and evaluation
In this section, we use LightGBM to build our binary classification model. The meaning of each model hyperparameter is documented at:
 * https://lightgbm.readthedocs.io/en/latest/Parameters.html

In [8]:
from sklearn.model_selection import train_test_split
train, valid = train_test_split(train_data, test_size=0.2, random_state=1)
X_train, y_train = train[feature_cols], train[label]
X_valid, y_valid = valid[feature_cols], valid[label]
train_matrix = lgb.Dataset(X_train, label=y_train)
valid_matrix = lgb.Dataset(X_valid, label=y_valid)

params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'min_child_weight': 5,
    'num_leaves': 2 ** 5,
    'lambda_l2': 10,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 4,
    'learning_rate': 0.1,
    'seed': 2022,
    # 'nthread' and 'n_jobs' are both aliases of 'num_threads'; LightGBM uses only one of them.
    'nthread': 28,
    'n_jobs':24,
    'silent': True,
    'verbose': -1,

}

model = lgb.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix], verbose_eval=200, early_stopping_rounds=200)
print("Feature Importance:\n", list(sorted(zip(X_train.columns, model.feature_importance("gain")), key=lambda x: x[1], reverse=True))[:30])     

Training until validation scores don't improve for 200 rounds
[200]	training's auc: 0.749044	valid_1's auc: 0.72963
[400]	training's auc: 0.764902	valid_1's auc: 0.730119
[600]	training's auc: 0.778805	valid_1's auc: 0.729919
Early stopping, best iteration is:
[408]	training's auc: 0.765464	valid_1's auc: 0.730148
Feature Importance:
 [('subGrade', 55731.91763496399), ('subGrade_target_mean', 25544.73084115982), ('issueDateDT', 23477.302619218826), ('grade_to_mean_n4', 20431.835428237915), ('grade_to_mean_n7', 17742.464669704437), ('term', 13944.866245031357), ('grade_to_std_n4', 11223.571625232697), ('dti', 11088.541720628738), ('annualIncome', 10994.476479291916), ('homeOwnership', 9251.973315000534), ('revolBal', 8729.287752866745), ('employmentTitle', 8060.54189157486), ('loanAmnt', 7468.738019227982), ('installment', 6830.989168405533), ('regionCode', 6447.618449687958), ('ficoRangeLow', 5539.447093009949), ('revolUtil', 5221.779930591583), ('earliesCreditLine', 4704.3636956214905
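
A note for newer LightGBM releases: the `verbose_eval` and `early_stopping_rounds` keyword arguments used in the call above were removed from `lgb.train` in LightGBM 4.0 in favour of callbacks. A minimal sketch of an equivalent call follows, assuming LightGBM 3.3 or later; the wrapper name `train_with_callbacks` and its argument names are hypothetical and not part of the original notebook.

```python
def train_with_callbacks(params, train_matrix, valid_matrix,
                         num_boost_round=50000, patience=200, log_period=200):
    """Train with early stopping expressed through callbacks.

    Hypothetical helper (not part of the original notebook); assumes
    LightGBM >= 3.3, where lgb.early_stopping and lgb.log_evaluation exist.
    """
    import lightgbm as lgb  # imported lazily so the helper can be defined anywhere

    return lgb.train(
        params,
        train_matrix,
        num_boost_round=num_boost_round,
        valid_sets=[train_matrix, valid_matrix],
        callbacks=[
            # Stop when the validation metric has not improved for `patience` rounds.
            lgb.early_stopping(stopping_rounds=patience),
            # Print the evaluation metrics every `log_period` rounds.
            lgb.log_evaluation(period=log_period),
        ],
    )
```

With the objects defined in this cell, the call would read `model = train_with_callbacks(params, train_matrix, valid_matrix)`; `model.best_iteration` is then available as before.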

In [9]:
pred_train = model.predict(X_train, num_iteration=model.best_iteration)     
pred_valid = model.predict(X_valid, num_iteration=model.best_iteration)            

print("train dataset auc:", roc_auc_score(y_train, pred_train))
print("validation dataset auc:", roc_auc_score(y_valid, pred_valid))

train dataset auc: 0.7654641469179302
validation dataset auc: 0.7301476655170198
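
For intuition about the metric reported above: the ROC AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes that quantity directly on a toy example. The function name `pairwise_auc` and the data are illustrative only, and the loop is quadratic in the sample count, so `roc_auc_score` remains the right tool for real data.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly here, hence 0.75.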


### Hyper Parameters Tuning
In this section, we use `BayesianOptimization` to tune the hyperparameters of the LightGBM model. In this demo, we fix `learning_rate` and `n_estimators` and search over combinations of the other hyperparameters. According to this [GitHub issue](https://github.com/fmfn/BayesianOptimization/issues/300), the scipy version should be `1.7`. Please refer to `../requirements.txt`.

In [10]:
X_train = train_data[feature_cols]
y_train = train_data[label]

In [11]:
from sklearn.model_selection import cross_val_score

def rf_cv_lgb(X_train,
              y_train,
              num_leaves, 
              max_depth,
              bagging_fraction, 
              feature_fraction, 
              bagging_freq, 
              min_data_in_leaf, 
              min_child_weight, 
              min_split_gain, 
              reg_lambda, 
              reg_alpha,
              cv=5,
              scoring='roc_auc',
              **kwargs):
    model_lgb = lgb.LGBMClassifier(
        boosting_type='gbdt',
        objective='binary',
        metric='auc',
        learning_rate=0.1,
        n_estimators=500,
        num_leaves=int(num_leaves),
        max_depth=int(max_depth),
        bagging_fraction=round(bagging_fraction, 2),
        feature_fraction=round(feature_fraction, 2),
        bagging_freq=int(bagging_freq),
        min_data_in_leaf=int(min_data_in_leaf),
        min_child_weight=min_child_weight,
        min_split_gain=min_split_gain,
        reg_lambda=reg_lambda,
        reg_alpha=reg_alpha,
        n_jobs=8)

    # Mean cross-validated score; this is the value the optimizer maximizes.
    val = cross_val_score(model_lgb, X_train, y_train, cv=cv, scoring=scoring).mean()

    return val

from functools import partial

from bayes_opt import BayesianOptimization
bayes_lgb = BayesianOptimization(
    partial(rf_cv_lgb, X_train=X_train, y_train=y_train), 
    {
        'num_leaves':(10, 200),
        'max_depth':(3, 20),
        'bagging_fraction':(0.5, 1.0),
        'feature_fraction':(0.5, 1.0),
        'bagging_freq':(0, 100),
        'min_data_in_leaf':(10,100),
        'min_child_weight':(0, 10),
        'min_split_gain':(0.0, 1.0),
        'reg_alpha':(0.0, 10),
        'reg_lambda':(0.0, 10),
    }
)

In [12]:
# We need to roll back to scipy==1.7, see:
# https://github.com/fmfn/BayesianOptimization/issues/300
bayes_lgb.maximize(n_iter=10)

|   iter    |  target   | baggin... | baggin... | featur... | max_depth | min_ch... | min_da... | min_sp... | num_le... | reg_alpha | reg_la... |
-------------------------------------------------------------------------------------------------------------------------------------------------
|  1        |  0.7311   |  0.7511   |  41.04    |  0.7437   |  12.45    |  7.707    |  42.82    |  0.7638   |  70.27    |  8.191    |  0.4402   |
|  2        |  0.7299   |  0.6489   |  92.2     |  0.9367   |  19.24    |  4.542    |  89.16    |  0.505    |  24.56    |  0.1941   |  2.123    |
|  3        |  0.7266   |  0.6861   |  36.82    |  0.8071   |  15.89    |  2.367    |  77.38    |  0.09882  |  107.3    |  2.223    |

In [13]:
bayes_lgb.max

{'target': 0.7318286646522262,
 'params': {'bagging_fraction': 0.841721472291185,
  'bagging_freq': 54.592298927681746,
  'feature_fraction': 0.6385920673933405,
  'max_depth': 12.04978395140528,
  'min_child_weight': 9.392496953196893,
  'min_data_in_leaf': 51.3017845966266,
  'min_split_gain': 0.47736324501182315,
  'num_leaves': 34.643935032031465,
  'reg_alpha': 6.3523692278351005,
  'reg_lambda': 8.733053317385387}}
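
`BayesianOptimization` searches a continuous space, so integer-valued parameters such as `num_leaves` come back as floats and have to be cast before they are passed to `lgb.train`. One possible helper for that conversion is sketched below. `to_lgb_params`, `INT_PARAMS` and `RENAMES` are hypothetical names, and unlike the later cells of this notebook the helper also forwards `max_depth`, `min_data_in_leaf` and `min_split_gain`, so results obtained with it would not necessarily match the logs shown here.

```python
# Parameters that LightGBM expects as integers.
INT_PARAMS = ("num_leaves", "max_depth", "bagging_freq", "min_data_in_leaf")
# Map the scikit-learn style names used in the search space to native LightGBM names.
RENAMES = {"reg_alpha": "lambda_l1", "reg_lambda": "lambda_l2"}


def to_lgb_params(best_params, learning_rate=0.1, seed=2022):
    """Turn the float-valued optimizer result into a dict for lgb.train."""
    params = {
        "boosting_type": "gbdt",
        "objective": "binary",
        "metric": "auc",
        "learning_rate": learning_rate,
        "seed": seed,
        "verbose": -1,
    }
    for name, value in best_params.items():
        if name in INT_PARAMS:
            value = int(value)  # truncate, as the search function does with int()
        params[RENAMES.get(name, name)] = value
    return params


example = to_lgb_params({"num_leaves": 34.64, "bagging_freq": 54.59, "reg_lambda": 8.73})
print(example["num_leaves"], example["bagging_freq"], example["lambda_l2"])  # 34 54 8.73
```

In this notebook the input would be `bayes_lgb.max['params']`.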

### K-Fold Cross validation

In [14]:
def cv_model(clf, train_x, train_y, clf_name, best_params):
    folds = 5
    seed = 2020
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    train = np.zeros(train_x.shape[0])
    cv_scores = []
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ Round {} ************************************'.format(str(i+1)))
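        trn_x, val_x = train_x.iloc[train_index], train_x.iloc[valid_index]
        # Index the labels by position as well, so the split does not rely on a default RangeIndex.
        trn_y, val_y = train_y.iloc[train_index], train_y.iloc[valid_index]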
        
        if clf_name == "lgb":
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)

            params = {
                'boosting_type': 'gbdt',
                'objective': 'binary',
                'metric': 'auc',
                'min_child_weight': best_params['min_child_weight'],
                'num_leaves': int(best_params['num_leaves']),
                'lambda_l1': best_params['reg_alpha'],
                'lambda_l2': best_params['reg_lambda'],
                'feature_fraction': best_params['feature_fraction'],
                'bagging_fraction': best_params['bagging_fraction'],
                'bagging_freq': int(best_params['bagging_freq']),
                'learning_rate': 0.1,
                'seed': 2022,
                'nthread': 28,
                'n_jobs':24,
                'silent': True,
                'verbose': -1,
                
            }

            model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix], verbose_eval=200, early_stopping_rounds=200)
            print("Feature Importance:\n", list(sorted(zip(trn_x.columns, model.feature_importance("gain")), key=lambda x: x[1], reverse=True))[:30])
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)            
        else:
            raise NotImplementedError('Unsupported classifier {}'.format(clf_name))
            
        train[valid_index] = val_pred
        cv_scores.append(roc_auc_score(val_y, val_pred))
    
    print("%s_score_train_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    return train

def lgb_model(X_train, y_train, best_params):
    lgb_train = cv_model(lgb, X_train, y_train, "lgb", best_params)
    return lgb_train

In [15]:
best_params = bayes_lgb.max['params']
lgb_train = lgb_model(X_train, y_train, best_params)

************************************ Round 1 ************************************
Training until validation scores don't improve for 200 rounds
[200]	training's auc: 0.748213	valid_1's auc: 0.730123
[400]	training's auc: 0.762271	valid_1's auc: 0.730685
Early stopping, best iteration is:
[330]	training's auc: 0.757732	valid_1's auc: 0.730799
Feature Importance:
 [('subGrade', 71254.78851079941), ('issueDateDT', 21611.031715273857), ('grade_to_std_n7', 18241.14588224888), ('grade_to_mean_n4', 16897.05391216278), ('term', 14653.579342603683), ('grade_to_mean_n7', 13450.159541726112), ('grade_to_std_n4', 11264.357944369316), ('annualIncome', 9850.464185237885), ('dti', 9529.163885712624), ('homeOwnership', 9188.58430826664), ('subGrade_target_mean', 8472.05193889141), ('loanAmnt', 7373.37351167202), ('revolBal', 6915.743562698364), ('employmentTitle', 6291.324475646019), ('regionCode', 5682.361752152443), ('installment', 5089.7564042806625), ('grade_to_std_n8', 4489.833872437477), ('ficoR
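
The array returned by `cv_model` holds out-of-fold predictions: every row is scored by the one model that did not see it during training. The bookkeeping can be illustrated without LightGBM. In the sketch below, `kfold_indices` and `mean_rate_model` are hypothetical stand-ins for `KFold.split` (without shuffling) and for the trained booster, and the labels are toy data.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, valid_idx) pairs over contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        valid_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, valid_idx
        start += size


def mean_rate_model(train_labels):
    """Stand-in 'model': predicts the training default rate for every row."""
    rate = sum(train_labels) / len(train_labels)
    return lambda rows: [rate] * len(rows)


labels = [0, 1, 0, 0, 1, 0]
oof = [None] * len(labels)
for train_idx, valid_idx in kfold_indices(len(labels), n_splits=3):
    predict = mean_rate_model([labels[i] for i in train_idx])
    for i, p in zip(valid_idx, predict(valid_idx)):
        oof[i] = p  # each row is filled exactly once, by a model that never saw it

print(oof)  # [0.25, 0.25, 0.5, 0.5, 0.25, 0.25]
```

Because no row is ever scored by a model trained on it, the out-of-fold vector can be evaluated against the labels without the optimism of a training-set score.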

### Retrain the model on the full training dataset
Ideally, this section would retrain the model on the full training dataset and evaluate it on a separate test dataset. However, since we do not have a labeled test dataset, we reuse the same train/validation split as before, this time with the tuned hyperparameters.

In [16]:
from sklearn.model_selection import train_test_split
train, valid = train_test_split(train_data, test_size=0.2, random_state=1)
X_train, y_train = train[feature_cols], train[label]
X_valid, y_valid = valid[feature_cols], valid[label]
train_matrix = lgb.Dataset(X_train, label=y_train)
valid_matrix = lgb.Dataset(X_valid, label=y_valid)

params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'min_child_weight': best_params['min_child_weight'],
    'num_leaves': int(best_params['num_leaves']),
    'lambda_l1': best_params['reg_alpha'],
    'lambda_l2': best_params['reg_lambda'],
    'feature_fraction': best_params['feature_fraction'],
    'bagging_fraction': best_params['bagging_fraction'],
    'bagging_freq': int(best_params['bagging_freq']),
    'learning_rate': 0.1,
    'seed': 2022,
    'nthread': 28,
    'n_jobs':24,
    'silent': True,
    'verbose': -1,

}

model = lgb.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix], verbose_eval=200, early_stopping_rounds=200)
print("Feature Importance:\n", list(sorted(zip(X_train.columns, model.feature_importance("gain")), key=lambda x: x[1], reverse=True))[:30])  

Training until validation scores don't improve for 200 rounds
[200]	training's auc: 0.748111	valid_1's auc: 0.729737
[400]	training's auc: 0.762032	valid_1's auc: 0.73047
Early stopping, best iteration is:
[325]	training's auc: 0.757318	valid_1's auc: 0.730558
Feature Importance:
 [('subGrade', 70057.59370136261), ('issueDateDT', 21630.305247068405), ('grade_to_mean_n4', 20378.347585082054), ('grade_to_std_n7', 20268.76363682747), ('grade_to_mean_n7', 16803.37327694893), ('term', 14263.790437221527), ('grade_to_std_n4', 11528.269649267197), ('annualIncome', 9617.149665117264), ('homeOwnership', 9208.273541092873), ('dti', 8944.923849582672), ('loanAmnt', 6901.981902122498), ('employmentTitle', 6674.215544462204), ('revolBal', 6415.410744309425), ('regionCode', 5876.525661706924), ('subGrade_target_mean', 5788.768405795097), ('installment', 5115.87585067749), ('ficoRangeLow', 4422.903689146042), ('grade_to_mean_n10', 4343.6329555511475), ('n2', 3731.807690858841), ('revolUtil', 3568.556

### Acknowledgement
Thanks to the Tianchi community for providing the loan default dataset and the corresponding risk management tutorials based on it.