# Credit Scoring Prediction -Model 1

## 1. Goal and Objective
The objective of this script is to create a model that successfully categorizes borrowers as good payers(those who pay on time), late payers and bad borrowers (those who default)
## 2. Model Description

XGBoost, stands for eXtreme Gradient Boosting and is a relatively new algorithm that implements  gradient boosting decision tree algorithm. Boosting is a technique that uses an ensemble( a collection) of models. New models are added to correct the errors of previous models. Gradient Boosting uses the gradient descent algorithm to minimize the residuals(loss) of prior models. Unlike other boosting algoritms, XGBoost is particularly fast and accurate. For more about the algorithm, please see: https://arxiv.org/pdf/1603.02754.pdf

## 3. Data Description


### 1. Load Data

In [319]:
%pylab inline

#Importations
from xgboost.sklearn import XGBClassifier
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from typing import Tuple
import pandas as pd
import json
from sklearn.externals import joblib
import pickle
import numpy as np

#Variables
random_state = 42
seed = 27
test_size = 0.33
reg_lambda = 2 #XG Boost's L2 regularization term on weights, increasing it makes the model more conservative.default=1
num_of_classes = 3

# load data and the features to be used in the classification
data = pd.read_csv('../data/loans_oct_dec_2017/combined.csv')




Populating the interactive namespace from numpy and matplotlib


`%matplotlib` prevents importing * from pylab and numpy
  "\n`%matplotlib` prevents importing * from pylab and numpy"


In [320]:
data.head()

Unnamed: 0,user_id,no_of_loans_completed_previously,start_date,principal_amount,no_of_days_paid_early,credit_score,data_score,date_joined,avg_daily_mpesa_txns_amount,bio_data,Unnamed: 10,education,Employment,marital,dependants,status
0,dda4da05-844d-4c4e-99a3-c58daf6f9d47,3,10/19/17,1000,1,661,575,7/11/17,262.17,"{""education"": {""level"": ""Secondary""}, ""employm...",,Secondary,Employed,married,2 - 5,complete
1,ce0f9932-3072-4672-97b1-4e7ecabcc71b,2,10/19/17,1000,1,588,740,7/14/17,2024.27,"{""education"": {""level"": ""College/University""},...",,College/University,Self employed,married,2 - 5,complete
2,61ad009e-ca38-4c92-bdcd-c6f47ee7ac04,0,10/19/17,1000,24,465,591,10/16/17,325.76,"{""education"": {""level"": ""College/University""},...",,College/University,Employed,married,1,complete
3,edcea65a-7fbc-4b25-9393-36a65032dcf3,3,10/19/17,1000,2,-1,778,6/26/17,10249.01,"{""education"": {""level"": ""College/University""},...",,College/University,Self employed,single,,complete
4,20f868bf-7006-4a0c-9a1a-aee8396e1d5b,0,10/19/17,1500,18,587,630,10/13/17,0.0,"{""education"": {""level"": ""College/University""},...",,College/University,Employed,married,2 - 5,complete


In [321]:
# Data Preparation

#Dummy Variable for married or single. 
s = pd.get_dummies(data['marital'])
data['married'] = s[' married '] #set Married =1 and save results as a new column

#Because dependants have more than 2 values, there is need for more than binary dummy variable encoding
#We therefore do label encoding

depend_as_cat = data['dependants'].astype('category') #1. We first convert the column into a category
data['depend']  = depend_as_cat.cat.codes #2. assign the encoded variable to a new column using the cat.codes

#Do the same for Status, Education and Employment columns.
data['educ'] = (data['education'].astype('category')).cat.codes 
data['employment'] = (data['Employment'].astype('category')).cat.codes 

status_as_cat =data['status'].astype('category') 
data['repayment'] = status_as_cat.cat.codes 

data.head()

Unnamed: 0,user_id,no_of_loans_completed_previously,start_date,principal_amount,no_of_days_paid_early,credit_score,data_score,date_joined,avg_daily_mpesa_txns_amount,bio_data,...,education,Employment,marital,dependants,status,married,depend,educ,employment,repayment
0,dda4da05-844d-4c4e-99a3-c58daf6f9d47,3,10/19/17,1000,1,661,575,7/11/17,262.17,"{""education"": {""level"": ""Secondary""}, ""employm...",...,Secondary,Employed,married,2 - 5,complete,1,0,3,0,0
1,ce0f9932-3072-4672-97b1-4e7ecabcc71b,2,10/19/17,1000,1,588,740,7/14/17,2024.27,"{""education"": {""level"": ""College/University""},...",...,College/University,Self employed,married,2 - 5,complete,1,0,0,1,0
2,61ad009e-ca38-4c92-bdcd-c6f47ee7ac04,0,10/19/17,1000,24,465,591,10/16/17,325.76,"{""education"": {""level"": ""College/University""},...",...,College/University,Employed,married,1,complete,1,3,0,0,0
3,edcea65a-7fbc-4b25-9393-36a65032dcf3,3,10/19/17,1000,2,-1,778,6/26/17,10249.01,"{""education"": {""level"": ""College/University""},...",...,College/University,Self employed,single,,complete,0,1,0,1,0
4,20f868bf-7006-4a0c-9a1a-aee8396e1d5b,0,10/19/17,1500,18,587,630,10/13/17,0.0,"{""education"": {""level"": ""College/University""},...",...,College/University,Employed,married,2 - 5,complete,1,0,0,0,0


### Train the XGBoost Model

In [322]:
#Code excerpts from:  https://github.com/berkeley-biosense

#Function that creates a new classifier
def fresh_clf () -> XGBClassifier:
    return XGBClassifier(
        objective= 'multi:softmax',
        seed=seed,
        reg_lambda=reg_lambda
        
    )

#Function that returns the classifier and the resulting dataframe
def xgb_cross_validate (
    X: np.array,
    y: np.array,
    nfold: int=7
) -> Tuple[XGBClassifier, pd.DataFrame]:
    # eval_metrics:
    # http://xgboost.readthedocs.io/en/latest//parameter.html
    metrics = ['merror','mlogloss' ]
    #metrics = ['error@0.1', 'auc']
#     metrics = [ 'auc' ]

    alg = fresh_clf()
    xgtrain = xgb.DMatrix(X,y)
    param = alg.get_xgb_params()
    param['n_classes'] = num_of_classes
    param['num_class'] = num_of_classes
    cvresults = xgb.cv(param,
                      xgtrain,
                      num_boost_round=alg.get_params()['n_estimators'],
                      nfold=nfold,
                      metrics=metrics,
                      early_stopping_rounds=100
                      )
    alg.set_params(n_estimators=cvresults.shape[0])
    alg.fit(X,y,eval_metric=metrics)
    return alg, cvresults

In [323]:
#Define the features
features  = ['no_of_loans_completed_previously', 'credit_score','avg_daily_mpesa_txns_amount','married','depend','educ']

#Define the input matrix and output/target 
X = data.filter(items = features)
y= data['repayment']

#Split X and Y into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size,shuffle=True, random_state=seed)


clf, cvres = xgb_cross_validate(X_train, y_train)
#clf.score(X_test, y_test)
m = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 68.81%


In [324]:
# # Another way to do it:
# model = XGBClassifier(objective = "multi:softmax", seed =27)
# model.fit(X_train, y_train)
# # make predictions for test data
# y_pred = model.predict(X_test)
# predictions = [value for value in y_pred]
# accuracy = accuracy_score(y_test, predictions)

### Suppose we didnt care if someone paid on time or not?
In this next step, we retrain the XGBoost model but considering those who completed loans on time, and those who were late as one category.

Essentially, we are assuming that we only care if the person defaults or not.

In [330]:
#Dummy Variable for default or not
status_dummy = pd.get_dummies(data['status'])

data['defaulted'] = status_dummy['default'] #Save this as a new column called defaulted
data.head()

Unnamed: 0,user_id,no_of_loans_completed_previously,start_date,principal_amount,no_of_days_paid_early,credit_score,data_score,date_joined,avg_daily_mpesa_txns_amount,bio_data,...,Employment,marital,dependants,status,married,depend,educ,employment,repayment,defaulted
0,dda4da05-844d-4c4e-99a3-c58daf6f9d47,3,10/19/17,1000,1,661,575,7/11/17,262.17,"{""education"": {""level"": ""Secondary""}, ""employm...",...,Employed,married,2 - 5,complete,1,0,3,0,0,0
1,ce0f9932-3072-4672-97b1-4e7ecabcc71b,2,10/19/17,1000,1,588,740,7/14/17,2024.27,"{""education"": {""level"": ""College/University""},...",...,Self employed,married,2 - 5,complete,1,0,0,1,0,0
2,61ad009e-ca38-4c92-bdcd-c6f47ee7ac04,0,10/19/17,1000,24,465,591,10/16/17,325.76,"{""education"": {""level"": ""College/University""},...",...,Employed,married,1,complete,1,3,0,0,0,0
3,edcea65a-7fbc-4b25-9393-36a65032dcf3,3,10/19/17,1000,2,-1,778,6/26/17,10249.01,"{""education"": {""level"": ""College/University""},...",...,Self employed,single,,complete,0,1,0,1,0,0
4,20f868bf-7006-4a0c-9a1a-aee8396e1d5b,0,10/19/17,1500,18,587,630,10/13/17,0.0,"{""education"": {""level"": ""College/University""},...",...,Employed,married,2 - 5,complete,1,0,0,0,0,0


### Retrain the XGBoost Model

In [326]:
# Make a few changes to the classifier, the evaluation metrics,y, remove num_classes etc.

#Function that creates a new classifier
def fresh_clf2 () -> XGBClassifier:
    return XGBClassifier(
        objective= 'binary:logistic',
        seed=seed,
        reg_lambda=reg_lambda
        
    )

#Function that returns the classifier and the resulting dataframe
def xgb_cross_validate2 (
    X: np.array,
    y: np.array,
    nfold: int=7
) -> Tuple[XGBClassifier, pd.DataFrame]:
    # eval_metrics:
    # http://xgboost.readthedocs.io/en/latest//parameter.html
    # metrics = ['merror','mlogloss' ]
    # metrics = ['error@0.1', 'auc']
    metrics = [ 'auc' ]
    alg = fresh_clf2()
    xgtrain = xgb.DMatrix(X,y)
    param = alg.get_xgb_params()
    cvresults = xgb.cv(param,
                      xgtrain,
                      num_boost_round=alg.get_params()['n_estimators'],
                      nfold=nfold,
                      metrics=metrics,
                      early_stopping_rounds=100
                      )
    alg.set_params(n_estimators=cvresults.shape[0])
    alg.fit(X,y,eval_metric=metrics)
    return alg, cvresults

#Change y
y_new= data['defaulted']

#Split X and Y into training set and test set
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y_new, test_size=test_size,shuffle=True, random_state=seed)


clf2, cvres2 = xgb_cross_validate2(X_train2, y_train2)
m_ = clf2.predict(X_test2)
c = clf2.score(X_test2, y_test2)
print("Accuracy: %.2f%%" % (c * 100.0))

Accuracy: 78.29%


The model seems to have made a small improvement. We save it for later

#### Model Persistence 

To save the model for later, we can use serialization(pickling) or we can use JobLib

In [327]:
joblib.dump(clf2, 'XGBoost_model.pkl')

['XGBoost_model.pkl']

In [339]:
loaded_clf = joblib.load('XGBoost_model.pkl')
# features  = ['no_of_loans_completed_previously', 'credit_score','avg_daily_mpesa_txns_amount','married','depend','educ']
real_world_X = [0,200,8,1,0,3] 

#Convert the list of features to a pandas dataframe
real_world_dataframe = pd.DataFrame(np.array(real_world_X).reshape(1,6), columns = features)

#Make prediction
new_case = loaded_clf.predict(real_world_dataframe)
new_case_prob = loaded_clf.predict_proba(new_case_df) #Calculate the probability of belonging to each class

new_case,new_case_prob #Prediction for new class, and probability of belonging to each class

(array([1], dtype=uint8), array([[ 0.69530076,  0.30469924]], dtype=float32))

In [340]:
# save the model to disk using serialization
filename = 'XGBoost_model.sav'
pickle.dump(clf2, open(filename, 'wb'))
 
# some time later...
 
# load the model from disk
loaded_model = pickle.load(open(filename, 'rb'))
result=loaded_model.predict(real_world_dataframe)


# For the non-binary case
#predicted_category =  list(status_as_cat.cat.categories)[result[0]]
#predicted_category_prob = new_case_prob.tolist()[0][result[0]]
#print("For this new case we predicted ", predicted_category," with a probability of: %.2f%%" % (predicted_category_prob * 100.0))

#For binary case 1=defaulted 0 means did not default
if result[0] == 1:
    default = True
    print("For this new case, we predict that the lender will default with a probability of: %.2f%%" % (new_case_prob.tolist()[0][0] * 100.0))
else:
    default= False
    print("For this new case, we predict that the lender will NOT default with a probability of: %.2f%%" % (new_case_prob.tolist()[0][0] * 100.0))



For this new case, we predict that the lender will default with a probability of: 69.53%
