# Default prediction for Lending Club data

## XGBoost

Authors : Iker Aguirre, Carlos Serrano

Date : 04/12/2020

XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately. Like a Random Forest it is an ensemble of decision trees, but the trees are built sequentially rather than independently: each new tree is fit to the errors left by the trees before it, so the cases that the current ensemble predicts poorly carry the most weight in the next round.
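
The idea of concentrating on previously mispredicted cases can be sketched with a toy example. This is a minimal illustration of gradient boosting for squared error using one-split "stumps", not XGBoost's actual algorithm (which adds regularisation, second-order gradient information and subsampling). The helper names `fit_stump`, `predict_stump` and `boost` and the toy data are assumptions made for this sketch.

```python
# Toy gradient boosting with decision stumps (illustration only).

def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current errors."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # what is still wrong
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, preds


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
base, stumps, preds = boost(xs, ys)
print("error before boosting:", mse(ys, [base] * len(ys)))
print("error after boosting: ", mse(ys, preds))
```

Each round shrinks the remaining error by a factor that depends on `learning_rate`, which is why a smaller learning rate usually needs more trees.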

In this notebook we only train the model (after hyperparameter tuning) and save it to a ".sav" file for later testing. Other metrics that evaluate the model's performance are covered in the "08_Conclusions" notebook.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

In [4]:
lc_data = pd.read_csv("lc_cleaned.csv")
lc_data.drop('Unnamed: 0', inplace=True, axis=1)
lc_data = lc_data.dropna()

In [5]:
lc_data

,funded_amnt,int_rate,annual_inc,loan_status,dti,delinq_2yrs,fico_range_low,inq_last_6mths,open_acc,pub_rec,...,num_tl_op_past_12m,pct_tl_nvr_dlq,pub_rec_bankruptcies,tax_liens,total_bal_ex_mort,home_ownership_cat,verification_status_cat,application_type_Joint App,hardship_flag_Y,debt_settlement_flag_Y
0,-0.090625,-0.9024,2.000000,0.0,-0.075856,0.0,1.000000,0.0,1.285714,1.0,...,0.0,-0.206897,1.0,0.0,1.523500,0.0,-0.5,0.0,0.0,0.0
1,-0.455208,1.8976,-0.340909,0.0,-0.966558,0.0,0.000000,0.0,-1.000000,0.0,...,-1.0,-2.011494,0.0,0.0,-0.777654,0.5,0.0,0.0,0.0,0.0
2,-0.430208,-0.9024,1.022727,0.0,-0.391517,0.0,0.714286,2.0,1.142857,0.0,...,0.5,0.287356,0.0,0.0,0.197300,-0.5,-0.5,0.0,0.0,0.0
3,-0.221875,0.4848,-0.295932,0.0,-0.637031,0.0,0.142857,0.0,0.571429,2.0,...,0.5,0.287356,2.0,0.0,-0.648005,0.5,0.0,0.0,0.0,0.0
4,0.111458,0.4848,0.227273,0.0,-0.575856,1.0,0.000000,0.0,-1.000000,0.0,...,0.5,-0.862069,0.0,0.0,-0.243986,-0.5,0.5,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
456681,-0.090625,0.4160,-0.900000,0.0,0.747145,0.0,-0.142857,0.0,-0.285714,0.0,...,-0.5,0.287356,0.0,0.0,-0.434451,0.5,0.5,0.0,0.0,0.0
456682,-0.055208,-0.1600,-0.045455,-1.0,0.470636,1.0,0.000000,0.0,0.285714,0.0,...,0.5,0.091954,0.0,0.0,1.816346,-0.5,0.5,0.0,0.0,0.0
456683,0.028125,0.4800,-0.681818,-1.0,1.058728,0.0,-0.142857,0.0,-0.285714,1.0,...,0.5,0.287356,1.0,0.0,-0.087141,0.5,0.5,0.0,0.0,0.0
456684,-0.055208,1.1200,-0.013636,-1.0,0.756117,1.0,0.285714,2.0,0.857143,0.0,...,0.0,-0.287356,0.0,0.0,0.498309,0.5,0.0,0.0,0.0,0.0


In [7]:
y = lc_data['loan_status']
X = lc_data.drop('loan_status', axis=1)
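
The data preview shows `loan_status` taking the values 0.0 and -1.0. As far as we know, recent XGBoost releases expect class labels numbered 0, 1, ..., k-1 and may reject other values, so it can be worth checking the target before training. The sketch below uses a hypothetical helper, `encode_labels`, and toy labels rather than the real column. It maps the sorted distinct values to consecutive integers. Which original value corresponds to a default is not stated in this notebook, so the mapping should be checked against the cleaning step.

```python
from collections import Counter


def encode_labels(labels):
    """Map the sorted distinct label values to 0, 1, ..., k-1."""
    classes = sorted(set(labels))
    mapping = {value: index for index, value in enumerate(classes)}
    return [mapping[value] for value in labels], mapping


# Toy labels using the two values visible in the preview (not the real data).
toy_labels = [0.0, 0.0, -1.0, 0.0, -1.0]
encoded, mapping = encode_labels(toy_labels)
counts = Counter(encoded)

print("mapping:", mapping)        # original value -> encoded class
print("class counts:", counts)    # quick look at the class balance
```

With a pandas Series the same check is usually done with `value_counts()`.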

In [8]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
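
The `stratify=y` argument keeps the class proportions of `loan_status` roughly equal in the training and test sets, which matters when one class is much rarer than the other. The sketch below illustrates the idea in plain Python. `stratified_split` is a hypothetical helper written for this example, not scikit-learn's implementation, and the labels are toy data.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_indices, test_indices), sampling test_size from each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy, imbalanced labels: 80 of class 0 and 20 of class 1.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)

train_share = Counter(labels[i] for i in train_idx)[1] / len(train_idx)
test_share = Counter(labels[i] for i in test_idx)[1] / len(test_idx)
print("share of class 1 in train:", train_share)
print("share of class 1 in test: ", test_share)
```

Both shares match the 20% share of the full toy sample, which is the property `stratify=y` is meant to provide.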

In [10]:
%%time

# Tune learning_rate and gamma with 3-fold cross-validation, scored by ROC AUC.
gs_xgb = GridSearchCV(
    estimator=xgb.XGBClassifier(n_estimators=500, subsample=0.75,
                                colsample_bytree=0.75, objective='binary:logistic',
                                scale_pos_weight=1, seed=42, nthread=5),
    param_grid={'learning_rate': np.arange(0.1, 1.0, 0.2), 'gamma': [0.5, 1, 1.5]},
    scoring='roc_auc',
    n_jobs=-1,
    cv=3)
gs_xgb.fit(x_train, y_train)



CPU times: user 15min 54s, sys: 2.42 s, total: 15min 57s
Wall time: 1h 31min 34s


GridSearchCV(cv=3,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=0.75, gamma=None,
                                     gpu_id=None, importance_type='gain',
                                     interaction_constraints=None,
                                     learning_rate=None, max_delta_step=None,
                                     max_depth=None, min_child_weight=None,
                                     missing=nan, monotone_constraints=None,
                                     n_estimators=500, n_jobs=None, nthread=5,
                                     num_parallel_tree=None, random_state=None,
                                     reg_alpha=None, reg_lambda=None,
                                     scale_pos_weight=1, seed=42,
                                     subsample=0.75, 
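
The long wall time is easier to understand once the size of the search is spelled out. The following back-of-the-envelope sketch counts the model fits implied by the grid above. It assumes the default `refit=True` (one extra fit on the full training set) and that every fit runs all 500 boosting rounds, since no early stopping is configured.

```python
from itertools import product

# Values produced by np.arange(0.1, 1.0, 0.2), rounded for readability.
learning_rates = [0.1, 0.3, 0.5, 0.7, 0.9]
gammas = [0.5, 1, 1.5]
cv_folds = 3
n_estimators = 500

candidates = list(product(learning_rates, gammas))
cv_fits = len(candidates) * cv_folds      # one fit per candidate per fold
total_fits = cv_fits + 1                  # plus the final refit (refit=True)
boosting_rounds = total_fits * n_estimators

print("candidate settings:", len(candidates))
print("cross-validation fits:", cv_fits)
print("total fits including refit:", total_fits)
print("total boosting rounds:", boosting_rounds)
```

Reducing `n_estimators`, the number of folds, or the grid itself shrinks this count proportionally.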

In [11]:
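# best_estimator_ has already been refit on the whole training set (GridSearchCV's
# default refit=True); calling fit again retrains the same configuration.
gs_xgb_optimum = gs_xgb.best_estimator_
gs_xgb_optimum.fit(x_train, y_train)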

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.75, gamma=1.5, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.1, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=500, n_jobs=5, nthread=5, num_parallel_tree=1,
              random_state=42, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              seed=42, subsample=0.75, tree_method='exact',
              validate_parameters=1, verbosity=None)

In [12]:
import pickle

def save_models(filename, model):
    """Serialize `model` to `filename` with pickle."""
    with open(filename, 'wb') as file:
        pickle.dump(model, file)

In [13]:
save_models('xgb_model2.sav', gs_xgb_optimum)
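
For the later tests the saved file has to be read back. A minimal sketch of the round trip follows. `load_model` is a hypothetical counterpart to `save_models`, and a plain dictionary stands in for the fitted classifier so the example runs without the training data. Pickle files should only be loaded from trusted sources, and a pickled model generally needs compatible library versions when it is loaded. XGBoost also offers its own `save_model` and `load_model` methods as an alternative.

```python
import os
import pickle
import tempfile


def save_models(filename, model):
    """Serialize `model` to `filename` with pickle (same helper as above)."""
    with open(filename, 'wb') as file:
        pickle.dump(model, file)


def load_model(filename):
    """Hypothetical counterpart: read a pickled object back from disk."""
    with open(filename, 'rb') as file:
        return pickle.load(file)


# A stand-in object; in the notebook this would be gs_xgb_optimum.
stand_in = {"name": "stand-in for a fitted model",
            "params": {"gamma": 1.5, "learning_rate": 0.1}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_model.sav")
    save_models(path, stand_in)
    restored = load_model(path)

print(restored == stand_in)
```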