<a href="https://colab.research.google.com/github/mnijhuis-dnb/open_source_workshop/blob/master/ML_workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We are going to use machine learning to develop a default model for credit card loans. We will do this by training a classification model that predicts whether or not a customer of a bank will default on their credit card loan. We will also inspect whether the model bases its predictions on the most logical features. As data we will use a public data set of credit card clients of a Taiwanese bank.

First we need to install some packages

In [None]:
!pip install shap
!pip install --upgrade xlrd

Next we will download the data

In [None]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls

We will import the packages we are going to use

In [None]:
import pandas as pd
import shap
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import minmax_scale

We set the max display of pandas to show more columns and rows

In [None]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

We read in the data

In [None]:
df = pd.read_excel('/content/default of credit card clients.xls', header=1, index_col=0).rename(columns={'default payment next month':'DEFAULTS'})

LIMIT_BAL: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.

SEX: Gender (1 = male; 2 = female).

EDUCATION: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).

MARRIAGE: Marital status (1 = married; 2 = single; 3 = others).

AGE: Age (year).

PAY_0, PAY_2 - PAY_6: History of past payment. The past monthly payment records (from April to September, 2005) are tracked as follows: PAY_0 = the repayment status in September, 2005; PAY_2 = the repayment status in August, 2005; . . .; PAY_6 = the repayment status in April, 2005. Note that the data set has no PAY_1 column. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.

BILL_AMT1-BILL_AMT6: Amount of bill statement (NT dollar). BILL_AMT1 = amount of bill statement in September, 2005; BILL_AMT2 = amount of bill statement in August, 2005; . . .; BILL_AMT6 = amount of bill statement in April, 2005.

PAY_AMT1-PAY_AMT6: Amount of previous payment (NT dollar). PAY_AMT1 = amount paid in September, 2005; PAY_AMT2 = amount paid in August, 2005; . . .;PAY_AMT6 = amount paid in April, 2005.

DEFAULTS: Default payment next month (1 = defaulted; 0 = not defaulted)

Have a look at the data

In [None]:
df

## Preprocessing

First we need to do some cleaning of the data

We want to see all the values in the EDUCATION column and how often they occur

In [None]:
df['EDUCATION'].value_counts()

We don't know what the education codes 0, 5 and 6 mean, so we have to do something with them. We can drop these rows

In [None]:
df[~df['EDUCATION'].isin([0,5,6])]

Or we can assign the codes we don't know to the "others" category (4)

In [None]:
df.loc[df['EDUCATION'].isin([0,5,6]),'EDUCATION'] = 4

**Determine if other data cleaning steps need to be taken and clean the data further if needed**
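One possible way to start this check is sketched below. It compares the codes in a column against the codes listed in the data description and counts missing values. The `toy` frame and its values are made up for illustration, so what you find in the real `df` may differ.

```python
import pandas as pd

# Toy stand-in for the credit card data (values are invented for illustration)
toy = pd.DataFrame({
    'MARRIAGE': [1, 2, 0, 3, 2, 1],
    'EDUCATION': [1, 2, 3, 4, 2, 1],
})

# Codes listed in the data description for each column
documented_codes = {'MARRIAGE': [1, 2, 3], 'EDUCATION': [1, 2, 3, 4]}

# Count, per column, the rows that hold a code missing from the description
undocumented_counts = {
    column: int((~toy[column].isin(codes)).sum())
    for column, codes in documented_codes.items()
}
print(undocumented_counts)

# Count missing values per column
missing_counts = toy.isna().sum()
print(missing_counts)
```

Running the same two checks on `df` shows which columns need the drop-or-recode treatment used for EDUCATION above.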

The next step is the conversion of categorical variables to numerical values

In [None]:
# first do the one-hot-encoding
education_dummies = pd.get_dummies(df['EDUCATION'])

# rename the columns to more understandable values
education_dummies = education_dummies.rename(columns={1:'GRADUATE_SCHOOL', 2:'UNIVERSITY', 3:'HIGH_SCHOOL',  4:'OTHERS_EDUCATION'})

# drop the old EDUCATION column
df = df.drop(columns='EDUCATION')

# combine the education dummies with the original data 
df = pd.merge(df, education_dummies, left_index=True, right_index=True)

**Determine if other columns have categorical variables and encode these columns**
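As a hedged sketch of one way to do this, `pd.get_dummies` can encode a column and drop the original in a single call through its `columns` argument. The example uses MARRIAGE on a made-up `toy` frame; the new column names are our own choice and not part of the data set.

```python
import pandas as pd

# Toy stand-in for the credit card data (values are invented for illustration)
toy = pd.DataFrame({'MARRIAGE': [1, 2, 3, 2], 'AGE': [24, 35, 51, 42]})

# Passing `columns=` encodes MARRIAGE and removes the original column in one step
encoded = pd.get_dummies(toy, columns=['MARRIAGE'], prefix='MARRIAGE')

# Optional: replace the numeric suffixes with the labels from the data description
# (these names are an illustrative choice)
encoded = encoded.rename(columns={
    'MARRIAGE_1': 'MARRIED',
    'MARRIAGE_2': 'SINGLE',
    'MARRIAGE_3': 'OTHERS_MARRIAGE',
})
print(encoded.columns.tolist())
```

This gives the same result as the separate dummy, drop and merge steps used for EDUCATION, only in fewer lines.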

The next step is to normalise the data

We apply min-max scaling to the age column

In [None]:
df['AGE'] = minmax_scale(df['AGE'])

**Determine if other columns should be normalised and normalise them. Do take into account that certain columns are related to each other**
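For columns that measure the same quantity in different months, such as the BILL_AMT columns, one option is to scale the whole group with a single shared minimum and maximum, so that amounts stay comparable across months. The sketch below shows this on a made-up `toy` frame with two of those columns; it is one possible approach, not the only valid one.

```python
import pandas as pd

# Toy stand-in with two related columns (values are invented for illustration)
toy = pd.DataFrame({
    'BILL_AMT1': [0.0, 500.0, 1000.0],
    'BILL_AMT2': [200.0, 400.0, 2000.0],
})
bill_columns = ['BILL_AMT1', 'BILL_AMT2']

# One minimum and one maximum for the whole group instead of one pair per column
group_min = toy[bill_columns].min().min()
group_max = toy[bill_columns].max().max()

# Min-max scale every column in the group with the shared range
scaled = toy.copy()
scaled[bill_columns] = (toy[bill_columns] - group_min) / (group_max - group_min)
print(scaled)
```

With per-column scaling, a bill of 1000 in BILL_AMT1 and a bill of 2000 in BILL_AMT2 would both become 1.0. With the shared range they become 0.5 and 1.0, which keeps their relative size intact.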

## Training the model

The first step is to split the data into a training set and a test set

In [None]:
x_data = df.drop(columns='DEFAULTS')
y_data = df['DEFAULTS']

In [None]:
train_x, test_x, train_y, test_y = train_test_split(x_data, y_data, random_state=1)

**Check whether the data is balanced enough to make a prediction and rebalance if needed**
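A minimal sketch of such a check is shown below, assuming a made-up target `toy_y` in which 20% of the customers default. It prints the share of each class and then uses the `stratify` argument of `train_test_split` so that the training and test set keep the same class shares. Stratifying does not rebalance the data; it only keeps the split representative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and target (invented for illustration): 4 defaults out of 20 rows
toy_x = pd.DataFrame({'feature': range(20)})
toy_y = pd.Series([1] * 4 + [0] * 16, name='DEFAULTS')

# Share of each class in the full data
class_share = toy_y.value_counts(normalize=True)
print(class_share)

# Stratified split: both parts keep the same share of defaults
tr_x, te_x, tr_y, te_y = train_test_split(
    toy_x, toy_y, test_size=0.25, stratify=toy_y, random_state=1
)
print(tr_y.mean(), te_y.mean())
```

If the classes turn out to be strongly imbalanced, one option to explore is the `class_weight='balanced'` setting of `RandomForestClassifier`.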

Now we can define the actual model. We are going to use a random forest model. In this model we have a few parameters we can adjust to impact the performance of the model. Below the parameters are explained in a bit more detail.

* ***max_depth*** : The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
* ***min_samples_split*** : The minimum number of samples required to split an internal node.
* ***min_samples_leaf*** : The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.
* ***max_leaf_nodes*** : Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined by their relative reduction in impurity. If None, the number of leaf nodes is unlimited.
* ***bootstrap*** : Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
* ***oob_score*** : Whether to use out-of-bag samples to estimate the generalization score. Only available if bootstrap=True.
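To get a feel for these parameters, the sketch below fits a small forest on synthetic data from `make_classification` (not the credit card data). It checks that no tree grows deeper than `max_depth` and reads the out-of-bag score, which requires `bootstrap=True`. The parameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only to illustrate the parameters
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(
    n_estimators=50,
    max_depth=4,          # no tree may grow deeper than 4 levels
    min_samples_leaf=5,   # every leaf must hold at least 5 training samples
    bootstrap=True,       # each tree is trained on a bootstrap sample
    oob_score=True,       # score each sample with the trees that did not see it
    random_state=2,
)
forest.fit(X, y)

# Depth of the deepest tree in the forest
deepest_tree = max(tree.get_depth() for tree in forest.estimators_)
print(deepest_tree)

# Accuracy estimated from the out-of-bag samples
print(forest.oob_score_)
```

The out-of-bag score gives an estimate of the generalisation accuracy without touching the test set, which can be handy when comparing parameter settings.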

In [None]:
model = RandomForestClassifier(random_state = 2, 
                               max_depth = 500, 
                               min_samples_split = 2, 
                               min_samples_leaf = 1, 
                               max_leaf_nodes = 10000, 
                               bootstrap = False,
                               oob_score = False)

After the model is defined we can train the model

In [None]:
model = model.fit(train_x.values, train_y.values)

Now that the model is trained, we can try to make predictions based on the test data

In [None]:
test_predict = model.predict(test_x.values)

With the predictions we can evaluate the performance of the model by making a confusion table

In [None]:
confusion_table = confusion_matrix(test_y, test_predict)
confusion_table

**Determine the accuracy, precision and recall of the results**
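As a sketch of how these three numbers follow from a confusion table, the example below uses ten made-up labels and predictions. For binary labels 0 and 1, `confusion_matrix(...).ravel()` returns the counts in the order true negatives, false positives, false negatives, true positives.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Made-up labels and predictions (1 = defaulted, 0 = not defaulted)
true_labels = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
predicted_labels = np.array([0, 0, 1, 0, 1, 0, 1, 0, 0, 0])

# Unpack the 2x2 confusion table
tn, fp, fn, tp = confusion_matrix(true_labels, predicted_labels).ravel()

# Compute the metrics by hand from the four counts
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
precision = tp / (tp + fp)                   # share of predicted defaults that are real
recall = tp / (tp + fn)                      # share of real defaults that are found
print(accuracy, precision, recall)

# The scikit-learn helpers give the same values
print(accuracy_score(true_labels, predicted_labels),
      precision_score(true_labels, predicted_labels),
      recall_score(true_labels, predicted_labels))
```

In the notebook you would pass `test_y` and `test_predict` instead of the made-up arrays.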

**Can you improve the quality of the forecast of the model by adjusting the parameters?**
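Besides adjusting the parameters by hand, a systematic option is a small grid search with cross-validation. The sketch below runs `GridSearchCV` on synthetic data; the grid values and the choice of recall as the metric are assumptions made for illustration, not recommendations for the credit card data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only to keep the example small and self-contained
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Candidate values to try for two of the parameters (arbitrary choices)
param_grid = {'max_depth': [2, 5], 'min_samples_leaf': [1, 10]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=2),
    param_grid,
    scoring='recall',  # optimise for finding the positive class
    cv=3,              # 3-fold cross-validation on the training data
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

In the notebook the search would be fitted on `train_x` and `train_y`, so that the test set stays untouched for the final evaluation.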

## Evaluation of the prediction

To further evaluate the prediction we will use the shap package to calculate the Shapley values and see which columns are most important to our predictions
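To make the idea of a Shapley value concrete, the sketch below computes exact Shapley values for a tiny hypothetical model with two features. It averages the marginal contribution of each feature over all orderings in which the features can be switched from a baseline value to their actual value. This illustrates the concept only: `toy_model`, `baseline` and `instance` are invented, and the shap package uses a more efficient, tree-specific algorithm.

```python
from itertools import permutations

def toy_model(features):
    """Hypothetical scoring function with an interaction between 'a' and 'b'."""
    return 2.0 * features['a'] + 1.0 * features['b'] + 3.0 * features['a'] * features['b']

# Reference input and the input we want to explain (both invented)
baseline = {'a': 0.0, 'b': 0.0}
instance = {'a': 1.0, 'b': 1.0}

def exact_shapley(model, instance, baseline):
    """Average each feature's marginal contribution over all feature orderings."""
    names = list(instance)
    orderings = list(permutations(names))
    contributions = {name: 0.0 for name in names}
    for ordering in orderings:
        current = dict(baseline)          # start from the baseline input
        previous_value = model(current)
        for name in ordering:
            current[name] = instance[name]  # switch this feature to its actual value
            new_value = model(current)
            contributions[name] += new_value - previous_value
            previous_value = new_value
    return {name: total / len(orderings) for name, total in contributions.items()}

shapley = exact_shapley(toy_model, instance, baseline)
print(shapley)
```

The values add up to the difference between the model output for the instance and for the baseline. The force plots further on visualise the same property: the contributions push the prediction away from the expected value.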

First we look at a specific prediction. We are going to predict the class probabilities for the row at position 54 of the test data (the 55th row, since positions start at 0)

In [None]:
row_number = 54
data_for_prediction = test_x.iloc[row_number]  
model.predict_proba(data_for_prediction.values.reshape(1, -1))

Now we can calculate the Shapley values for this row

In [None]:
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(data_for_prediction)
shap_values

These values don't say much on their own, so let's plot them

In [None]:
shap.initjs()
shap.force_plot(explainer.expected_value[1], shap_values[1], test_x.loc[test_x.index[row_number]])

**Can you find some interesting prediction reasons?**

We can also look at all the rows at the same time (this does take some time to calculate depending on the size of your model, so we only use a subset of the data)

In [None]:
shap.initjs()
explainer = shap.TreeExplainer(model)
subset = test_x.iloc[:1000,:]
shap_values = explainer.shap_values(subset)
shap.force_plot(explainer.expected_value[1], shap_values[1], subset)

Or in a simpler overview

In [None]:
shap.summary_plot(shap_values[1], subset)

**What can you say about the factors which are most important for the prediction of the class?**

You can also look at the interactions between two columns

In [None]:
shap.dependence_plot('LIMIT_BAL', shap_values[1], subset, interaction_index='PAY_0', x_jitter=1, dot_size=20)

**Can you find some interesting interactions?**