### Using Decision Tree for Loan Default Prediction

### What is our objective?
* To reliably predict whether a borrower will default on a loan payment, based on features such as salary and account balance.

### Getting to know the dataset!
The LoanDefault dataset contains historical data on borrowers, including those who defaulted, along with their financial background. It has the following features:
* Employed - Employment status of the borrower (1 - Employed | 0 - Unemployed).
* Bank Balance - Account balance of the borrower at the time of repayment / default.
* Annual Salary - Yearly income of the borrower at the time of repayment / default.
* Default - Target variable, indicating whether the borrower repaid the loaned amount within the stipulated time period (1 - Defaulted | 0 - Repaid).

### Approach
* This is a simple example of a dataset with class imbalance, since most people repay their loans without defaulting.
* So we explore the data to check for imbalance and handle it using resampling techniques.
* Explore the correlation between the features in the dataset.
* Split the preprocessed dataset into train and test sets.
* Train a DecisionTree (Classifier) using mlpack.
* Finally, we predict on the test set and use evaluation metrics such as Accuracy, F1-Score and ROC AUC to judge the performance of our model on unseen data.
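A quick way to check for class imbalance is to count the samples per class. The following is a minimal sketch in plain C++, not part of the notebook; the helper names `classCounts` and `imbalanceRatio` are hypothetical and only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Count how many samples carry each class label.
std::map<std::size_t, std::size_t> classCounts(const std::vector<std::size_t>& labels)
{
    std::map<std::size_t, std::size_t> counts;
    for (const std::size_t label : labels)
        ++counts[label];
    return counts;
}

// Ratio of the largest class to the smallest class.
// A value of 1.0 means the classes are perfectly balanced.
double imbalanceRatio(const std::vector<std::size_t>& labels)
{
    assert(!labels.empty());
    const std::map<std::size_t, std::size_t> counts = classCounts(labels);
    std::size_t smallest = labels.size();
    std::size_t largest = 0;
    for (const auto& entry : counts)
    {
        if (entry.second < smallest) smallest = entry.second;
        if (entry.second > largest) largest = entry.second;
    }
    return static_cast<double>(largest) / static_cast<double>(smallest);
}
```

A ratio far above 1 suggests that accuracy alone will be misleading, because a model that always predicts the majority class already scores well.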

#### NOTE: In this example we'll be implementing 4 parts, i.e. modelling on imbalanced, randomly oversampled, SMOTE-oversampled & undersampled data respectively.

In [1]:
// Import necessary library headers.
#include <mlpack/xeus-cling.hpp>
#include <mlpack/core.hpp>
#include <mlpack/core/data/split_data.hpp>
#include <mlpack/methods/decision_tree/decision_tree.hpp>
#include <mlpack/core/data/scaler_methods/standard_scaler.hpp>

In [2]:
#define WITHOUT_NUMPY 1
#include "matplotlibcpp.h"
#include "xwidgets/ximage.hpp"
#include "../utils/preprocess.hpp"
#include "../utils/plot.hpp"

namespace plt = matplotlibcpp;

In [None]:
using namespace mlpack;

In [None]:
using namespace mlpack::data;

In [None]:
using namespace mlpack::tree;

In [3]:
// Utility functions for evaluation metrics.
double accuracy(const arma::Row<size_t>& yPreds, const arma::Row<size_t>& yTrue)
{
    const size_t correct = arma::accu(yPreds == yTrue);
    return (double)correct / (double)yTrue.n_elem;
}

In [4]:
double precision(const size_t truePos, const size_t falsePos)
{
    return (double)truePos / (double)(truePos + falsePos);
}

In [5]:
double recall(const size_t truePos, const size_t falseNeg)
{
    return (double)truePos / (double)(truePos + falseNeg);
}

In [6]:
double f1score(const size_t truePos, const size_t falsePos, const size_t falseNeg)
{
    double prec = precision(truePos, falsePos);
    double rec = recall(truePos, falseNeg);
    return 2 * (prec * rec) / (prec + rec);
}

In [7]:
void classification_report(const arma::Row<size_t>& yPreds, const arma::Row<size_t>& yTrue)
{
    arma::Row<size_t> uniqs = arma::unique(yTrue);
    std::cout << std::setw(29) << "precision" << std::setw(15) << "recall" 
              << std::setw(15) << "f1-score" << std::setw(15) << "support" 
              << std::endl << std::endl;
    
    for (auto val : uniqs)
    {
        // Per-class counts, treating `val` as the positive class.
        size_t truePos = arma::accu(yTrue == val && yPreds == val);
        size_t falsePos = arma::accu(yPreds == val && yTrue != val);
        size_t falseNeg = arma::accu(yTrue == val && yPreds != val);

        // NOTE: the last column prints the true positive count of the class,
        // not the number of actual samples belonging to it.
        std::cout << std::setw(15) << val
                  << std::setw(12) << std::setprecision(2) << precision(truePos, falsePos)
                  << std::setw(16) << std::setprecision(2) << recall(truePos, falseNeg)
                  << std::setw(14) << std::setprecision(2) << f1score(truePos, falsePos, falseNeg)
                  << std::setw(16) << truePos
                  << std::endl;
    }
}

Drop the dataset header using sed. sed is a Unix utility that parses and transforms text.

In [11]:
!cat LoanDefault.csv | sed 1d > LoanDefault_trim.csv

### Loading the Data

In [12]:
// Load the header-free dataset into an armadillo matrix.
// mlpack stores data column-major: each column is one observation.
arma::mat loanData;
data::Load("LoanDefault_trim.csv", loanData);

In [13]:
// Inspect the first 6 examples in the dataset (columns 0 to 5, transposed for display).
std::cout << std::setw(12) << "Employed" << std::setw(15) << "Bank Balance" << std::setw(15) << "Annual Salary" 
          << std::setw(12) << "Defaulted" << std::endl;
std::cout << loanData.submat(0, 0, loanData.n_rows-1, 5).t() << std::endl;

    Employed   Bank Balance  Annual Salary   Defaulted
   1.0000e+00   8.7544e+03   5.3234e+05            0
            0   9.8062e+03   1.4527e+05            0
   1.0000e+00   1.2883e+04   3.8121e+05            0
   1.0000e+00   6.3510e+03   4.2845e+05            0
   1.0000e+00   9.4279e+03   4.6156e+05            0
            0   1.1035e+04   8.9899e+04            0



### Part 1 - Modelling using Imbalanced Dataset

In [14]:
// Visualize the distribution of target classes
countplot("LoanDefault.csv", "Defaulted?", "", "Part-1 Distribution of target class");
auto img = xw::image_from_file("Part-1 Distribution of target class.png").finalize();
img

In [15]:
countplot("LoanDefault.csv", "Defaulted?", "Employed", "Part-1 Distribution of target class & Employed");
auto img = xw::image_from_file("Part-1 Distribution of target class & Employed.png").finalize();
img

### Visualize Correlation

In [16]:
heatmap("LoanDefault.csv", "coolwarm", "Part-1 Correlation Heatmap", 1);
auto img = xw::image_from_file("Part-1 Correlation Heatmap.png").finalize();
img

In [17]:
// The last row holds the target labels: copy it out, then drop it from the features.
arma::Row<size_t> targets = arma::conv_to<arma::Row<size_t>>::from(loanData.row(loanData.n_rows - 1));
loanData.shed_row(loanData.n_rows-1);

### Train Test Split
The dataset has to be split into a training set and a test set. Here the dataset has 10000 observations and the test ratio is 25% of the total observations. This means the test set should have 25% * 10000 = 2500 observations and the training set should have the remaining 7500 observations.
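The arithmetic above can be sketched as a small helper. This is an illustration only, assuming the test size is the test ratio times the number of observations, rounded down, with the rest going to training; `SplitSizes` and `splitSizes` are hypothetical names, not mlpack functions.

```cpp
#include <cassert>
#include <cstddef>

// Sizes of the two partitions produced by a ratio-based split.
struct SplitSizes
{
    std::size_t train;
    std::size_t test;
};

// Assumption: the test set gets floor(numObservations * testRatio) samples
// and the training set gets everything else.
SplitSizes splitSizes(const std::size_t numObservations, const double testRatio)
{
    assert(testRatio >= 0.0 && testRatio <= 1.0);
    const std::size_t test =
        static_cast<std::size_t>(static_cast<double>(numObservations) * testRatio);
    return SplitSizes{numObservations - test, test};
}
```

For 10000 observations and a ratio of 0.25, `splitSizes(10000, 0.25)` gives 7500 training and 2500 test observations.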

In [18]:
// Split the dataset into train and test sets using mlpack.
arma::mat Xtrain, Xtest;
arma::Row<size_t> Ytrain, Ytest;
mlpack::data::Split(loanData, targets, Xtrain, Xtest, Ytrain, Ytest, 0.25);

In [20]:
// Create and train Decision Tree model using mlpack.
DecisionTree<> dt(Xtrain, Ytrain, 2);

In [21]:
// Classify the test set using trained model & get the probabilities.
arma::Row<size_t> output;
arma::mat probs;
dt.Classify(Xtest, output, probs);

### Evaluation metrics

* True Positive - The actual value was true & the model predicted true.
* False Positive - The actual value was false & the model predicted true, Type I error.
* True Negative - The actual value was false & the model predicted false.
* False Negative - The actual value was true & the model predicted false, Type II error.
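The four outcomes above can be counted directly from the true and predicted labels. Below is a minimal sketch using `std::vector` instead of armadillo rows; `ConfusionCounts` and `confusionCounts` are hypothetical names, and label 1 is assumed to be the positive (defaulted) class.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts of the four possible outcomes for a binary classifier.
struct ConfusionCounts
{
    std::size_t truePos = 0;
    std::size_t falsePos = 0;
    std::size_t trueNeg = 0;
    std::size_t falseNeg = 0;
};

// Compare predictions against the ground truth, treating `positive`
// as the positive class and every other label as negative.
ConfusionCounts confusionCounts(const std::vector<std::size_t>& yTrue,
                                const std::vector<std::size_t>& yPred,
                                const std::size_t positive = 1)
{
    assert(yTrue.size() == yPred.size());
    ConfusionCounts counts;
    for (std::size_t i = 0; i < yTrue.size(); ++i)
    {
        const bool actual = (yTrue[i] == positive);
        const bool predicted = (yPred[i] == positive);
        if (actual && predicted)        ++counts.truePos;
        else if (!actual && predicted)  ++counts.falsePos;  // Type I error
        else if (!actual && !predicted) ++counts.trueNeg;
        else                            ++counts.falseNeg;  // Type II error
    }
    return counts;
}
```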

`Accuracy`: is a metric that describes how the model performs across all classes. It is useful when all classes are of equal importance. It is calculated as the ratio of the number of correct predictions to the total number of predictions.

$$Accuracy = \frac{True_{positive} + True_{negative}}{True_{positive} + True_{negative} + False_{positive} + False_{negative}}$$

`Precision`: is calculated as the ratio of the number of positive samples correctly classified to the total number of samples classified as Positive. The precision measures how reliable the model is when it classifies a sample as Positive.

$$Precision = \frac{True_{positive}}{True_{positive} + False_{positive}}$$

`Recall`: is calculated as the ratio of the number of positive samples correctly classified as Positive to the total number of Positive samples. The recall measures the model's ability to detect Positive samples. The higher the recall, the more positive samples are detected.

$$Recall = \frac{True_{positive}}{True_{positive} + False_{negative}}$$

* The decision of whether to use precision or recall depends on the type of problem being solved.
* If the goal is to detect all positive samples, then use recall.
* Use precision when false positives are costly, i.e. when wrongly classifying a sample as Positive must be avoided.
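When both precision and recall matter, the F1-Score combines them as their harmonic mean, $F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$. The sketch below computes all three from the outcome counts. The function names are hypothetical, and returning 0 when a denominator is 0 is a convention chosen for this sketch, not something the notebook's own helpers do.

```cpp
#include <cassert>
#include <cstddef>

// Fraction of `part` within `whole`; returns 0 when `whole` is 0
// (a convention chosen for this sketch to avoid dividing by zero).
double fraction(const std::size_t part, const std::size_t whole)
{
    assert(part <= whole);
    return whole == 0 ? 0.0 : static_cast<double>(part) / static_cast<double>(whole);
}

// Precision: how many of the predicted positives are truly positive.
double precisionScore(const std::size_t truePos, const std::size_t falsePos)
{
    return fraction(truePos, truePos + falsePos);
}

// Recall: how many of the actual positives were detected.
double recallScore(const std::size_t truePos, const std::size_t falseNeg)
{
    return fraction(truePos, truePos + falseNeg);
}

// F1-Score: harmonic mean of precision and recall.
double f1Score(const std::size_t truePos, const std::size_t falsePos,
               const std::size_t falseNeg)
{
    const double p = precisionScore(truePos, falsePos);
    const double r = recallScore(truePos, falseNeg);
    return (p + r) == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
}
```

For example, with 8 true positives, 2 false positives and 8 false negatives, precision is 0.8, recall is 0.5 and the F1-Score is roughly 0.615, closer to the weaker of the two values.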

* The ROC curve plots the True Positive rate on the y axis against the False Positive rate on the x axis.
* The area under the ROC curve (ROC AUC) is the primary metric used here to determine whether the classifier is doing well; the higher the value, the better the model performs.
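The area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below uses that pairwise formulation. It is not how `plotRocAUC` in this notebook works; `rocAuc` is a hypothetical helper, and the quadratic loop is chosen for clarity, not speed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// ROC AUC via pairwise comparison: the fraction of (positive, negative)
// pairs in which the positive sample has the higher score.
// Label 1 is assumed to be the positive class.
double rocAuc(const std::vector<std::size_t>& yTrue,
              const std::vector<double>& scores)
{
    assert(yTrue.size() == scores.size());
    double wins = 0.0;
    std::size_t numPairs = 0;
    for (std::size_t i = 0; i < yTrue.size(); ++i)
    {
        if (yTrue[i] != 1) continue;          // i runs over positives
        for (std::size_t j = 0; j < yTrue.size(); ++j)
        {
            if (yTrue[j] == 1) continue;      // j runs over negatives
            ++numPairs;
            if (scores[i] > scores[j])       wins += 1.0;
            else if (scores[i] == scores[j]) wins += 0.5;
        }
    }
    // AUC is undefined unless both classes are present.
    assert(numPairs > 0);
    return wins / static_cast<double>(numPairs);
}
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.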

In [22]:
// Save the yTest and probabilities into csv for generating ROC AUC plot.
data::Save("probabilities.csv", probs);
data::Save("ytest.csv", Ytest);

In [23]:
// Model evaluation metrics.
std::cout <<  "Accuracy: " << accuracy(output, Ytest) << std::endl;
classification_report(output, Ytest);

Accuracy: 0.9636
                    precision         recall       f1-score        support

              0        0.98            0.99          0.98            2384
              1        0.42            0.31          0.35              25


In [24]:
plotRocAUC("ytest.csv", "probabilities.csv", "roc_auc");
auto img = xw::image_from_file("roc_auc.png").finalize();
img

### Part 2 - Modelling using Random Oversampling
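Random oversampling balances the classes by duplicating randomly chosen minority samples until the minority class is as large as the majority class. The sketch below illustrates the idea on sample indices. It is not the `resample` utility used in this notebook; `oversampleIndices` is a hypothetical helper that assumes binary labels.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Return the indices of a balanced dataset: every original index is kept and
// randomly chosen minority indices are repeated until both classes match.
// Assumes exactly two classes, one of which is `minorityLabel`.
std::vector<std::size_t> oversampleIndices(const std::vector<std::size_t>& labels,
                                           const std::size_t minorityLabel,
                                           const unsigned seed)
{
    std::vector<std::size_t> minority;
    std::vector<std::size_t> result;
    for (std::size_t i = 0; i < labels.size(); ++i)
    {
        result.push_back(i);
        if (labels[i] == minorityLabel)
            minority.push_back(i);
    }
    assert(!minority.empty());

    const std::size_t majorityCount = labels.size() - minority.size();
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, minority.size() - 1);

    // Sample minority indices with replacement until the classes are balanced.
    for (std::size_t n = minority.size(); n < majorityCount; ++n)
        result.push_back(minority[pick(rng)]);
    return result;
}
```

Because the duplicates are exact copies, oversampling before the train/test split can place copies of the same sample in both sets, which tends to inflate the test scores.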

In [25]:
resample("LoanDefault.csv", "Defaulted?", 0, 1, "oversample");

In [26]:
// Visualize the distribution of target classes
countplot("LoanDefault_oversampled.csv", "Defaulted?", "", "Part-2 Distribution of target class");
auto img = xw::image_from_file("Part-2 Distribution of target class.png").finalize();
img

In [27]:
!cat LoanDefault_oversampled.csv | sed 1d > LoanDefault_trim.csv

In [28]:
// Load the preprocessed dataset into armadillo matrix.
arma::mat loanData;
data::Load("LoanDefault_trim.csv", loanData);

In [29]:
heatmap("LoanDefault_oversampled.csv", "coolwarm", "Part-2 Correlation Heatmap", 1);
auto img = xw::image_from_file("Part-2 Correlation Heatmap.png").finalize();
img

In [30]:
arma::Row<size_t> targets = arma::conv_to<arma::Row<size_t>>::from(loanData.row(loanData.n_rows - 1));
loanData.shed_row(loanData.n_rows-1);

In [31]:
arma::mat Xtrain, Xtest;
arma::Row<size_t> Ytrain, Ytest;

In [32]:
mlpack::data::Split(loanData, targets, Xtrain, Xtest, Ytrain, Ytest, 0.25);

In [33]:
DecisionTree<> dt(Xtrain, Ytrain, 2);

In [52]:
arma::Row<size_t> output;
arma::mat probs;
dt.Classify(Xtest, output, probs);

In [35]:
data::Save("probabilities.csv", probs);
data::Save("ytest.csv", Ytest);

In [36]:
std::cout <<  "Accuracy: " << accuracy(output, Ytest) << std::endl;
classification_report(output, Ytest);

Accuracy: 0.96
                    precision         recall       f1-score        support

              0           1            0.93          0.96            2260
              1        0.93               1          0.96            2397


In [37]:
plotRocAUC("ytest.csv", "probabilities.csv", "roc_auc");
auto img = xw::image_from_file("roc_auc.png").finalize();
img

### Part 3 - Modelling using Synthetic Minority Oversampling Technique
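Instead of duplicating samples, SMOTE creates new minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below shows only that core step. It is not the `resample` utility used in this notebook; `nearestNeighbor` and `synthesize` are hypothetical helpers, and a full SMOTE implementation would pick randomly among the k nearest neighbours instead of always using the single closest one.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Index of the point closest to points[index] by squared Euclidean distance,
// excluding the point itself. All points must have the same dimension.
std::size_t nearestNeighbor(const std::vector<std::vector<double>>& points,
                            const std::size_t index)
{
    assert(points.size() >= 2 && index < points.size());
    std::size_t best = index;
    double bestDist = std::numeric_limits<double>::max();
    for (std::size_t j = 0; j < points.size(); ++j)
    {
        if (j == index) continue;
        double dist = 0.0;
        for (std::size_t d = 0; d < points[index].size(); ++d)
        {
            const double diff = points[index][d] - points[j][d];
            dist += diff * diff;
        }
        if (dist < bestDist)
        {
            bestDist = dist;
            best = j;
        }
    }
    return best;
}

// Synthetic sample on the segment between `sample` and `neighbor`:
// sample + gap * (neighbor - sample), with gap in [0, 1].
std::vector<double> synthesize(const std::vector<double>& sample,
                               const std::vector<double>& neighbor,
                               const double gap)
{
    assert(sample.size() == neighbor.size());
    assert(gap >= 0.0 && gap <= 1.0);
    std::vector<double> synthetic(sample.size());
    for (std::size_t d = 0; d < sample.size(); ++d)
        synthetic[d] = sample[d] + gap * (neighbor[d] - sample[d]);
    return synthetic;
}
```

In practice `gap` is drawn uniformly at random for each synthetic sample, so the new points spread along the segments between neighbouring minority samples.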

In [38]:
resample("LoanDefault.csv", "Defaulted?", 0, 1, "smote");

In [43]:
!sed -i "1iEmployed,Bank Balance,Annual Salary,Defaulted?" LoanDefault_smotesampled.csv

In [44]:
// Visualize the distribution of target classes
countplot("LoanDefault_smotesampled.csv", "Defaulted?", "", "Part-3 Distribution of target class");
auto img = xw::image_from_file("Part-3 Distribution of target class.png").finalize();
img

In [45]:
!cat LoanDefault_smotesampled.csv | sed 1d > LoanDefault_trim.csv

In [46]:
// Load the preprocessed dataset into armadillo matrix.
arma::mat loanData;
data::Load("LoanDefault_trim.csv", loanData);

In [47]:
heatmap("LoanDefault_smotesampled.csv", "coolwarm", "Part-3 Correlation Heatmap", 1);
auto img = xw::image_from_file("Part-3 Correlation Heatmap.png").finalize();
img

In [48]:
arma::Row<size_t> targets = arma::conv_to<arma::Row<size_t>>::from(loanData.row(loanData.n_rows - 1));
loanData.shed_row(loanData.n_rows-1);

In [49]:
arma::mat Xtrain, Xtest;
arma::Row<size_t> Ytrain, Ytest;

In [50]:
mlpack::data::Split(loanData, targets, Xtrain, Xtest, Ytrain, Ytest, 0.25);

In [51]:
DecisionTree<> dt(Xtrain, Ytrain, 2);

In [53]:
arma::Row<size_t> output;
arma::mat probs;
dt.Classify(Xtest, output, probs);

In [54]:
data::Save("probabilities.csv", probs);
data::Save("ytest.csv", Ytest);

In [55]:
std::cout <<  "Accuracy: " << accuracy(output, Ytest) << std::endl;
classification_report(output, Ytest);

Accuracy: 0.9
                    precision         recall       f1-score        support

              0        0.92            0.89           0.9            2165
              1        0.89            0.92           0.9            2202


In [56]:
plotRocAUC("ytest.csv", "probabilities.csv", "roc_auc");
auto img = xw::image_from_file("roc_auc.png").finalize();
img

### Part 4 - Modelling using Random Undersampling

Since the dataset is quite small, undersampling the majority class does not make much sense here. We still go through this part to get a sense of how our model performs with less data and its impact on learning.
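Random undersampling keeps every minority sample and only a random subset of the majority class, of the same size as the minority class. The sketch below illustrates this on sample indices. It is not the `resample` utility used in this notebook; `undersampleIndices` is a hypothetical helper that assumes binary labels.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Return the indices of a balanced dataset: all minority indices plus an
// equally sized random subset of the majority indices.
// Assumes exactly two classes, one of which is `majorityLabel`.
std::vector<std::size_t> undersampleIndices(const std::vector<std::size_t>& labels,
                                            const std::size_t majorityLabel,
                                            const unsigned seed)
{
    std::vector<std::size_t> majority;
    std::vector<std::size_t> kept;
    for (std::size_t i = 0; i < labels.size(); ++i)
    {
        if (labels[i] == majorityLabel)
            majority.push_back(i);
        else
            kept.push_back(i);
    }
    const std::size_t minorityCount = kept.size();
    assert(minorityCount <= majority.size());

    // Shuffle the majority indices and keep only the first minorityCount.
    std::mt19937 rng(seed);
    std::shuffle(majority.begin(), majority.end(), rng);
    kept.insert(kept.end(), majority.begin(),
                majority.begin() + static_cast<std::ptrdiff_t>(minorityCount));
    return kept;
}
```

The resulting dataset has only twice as many samples as the minority class, which is why so little data remains for training and testing in this part.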

In [57]:
resample("LoanDefault.csv", "Defaulted?", 0, 1, "undersample");

In [59]:
// Visualize the distribution of target classes
countplot("LoanDefault_undersampled.csv", "Defaulted?", "", "Part-4 Distribution of target class");
auto img = xw::image_from_file("Part-4 Distribution of target class.png").finalize();
img

In [60]:
!cat LoanDefault_undersampled.csv | sed 1d > LoanDefault_trim.csv

In [61]:
// Load the preprocessed dataset into armadillo matrix.
arma::mat loanData;
data::Load("LoanDefault_trim.csv", loanData);

In [62]:
heatmap("LoanDefault_undersampled.csv", "coolwarm", "Part-4 Correlation Heatmap", 1);
auto img = xw::image_from_file("Part-4 Correlation Heatmap.png").finalize();
img

In [63]:
arma::Row<size_t> targets = arma::conv_to<arma::Row<size_t>>::from(loanData.row(loanData.n_rows - 1));
loanData.shed_row(loanData.n_rows-1);

In [64]:
arma::mat Xtrain, Xtest;
arma::Row<size_t> Ytrain, Ytest;

In [65]:
mlpack::data::Split(loanData, targets, Xtrain, Xtest, Ytrain, Ytest, 0.25);

In [66]:
DecisionTree<> dt(Xtrain, Ytrain, 2);

In [67]:
arma::Row<size_t> output;
arma::mat probs;
dt.Classify(Xtest, output, probs);

In [68]:
data::Save("probabilities.csv", probs);
data::Save("ytest.csv", Ytest);

In [69]:
std::cout <<  "Accuracy: " << accuracy(output, Ytest) << std::endl;
classification_report(output, Ytest);

Accuracy: 0.9
                    precision         recall       f1-score        support

              0        0.88            0.91          0.89              70
              1        0.92            0.89           0.9              79


In [70]:
plotRocAUC("ytest.csv", "probabilities.csv", "roc_auc");
auto img = xw::image_from_file("roc_auc.png").finalize();
img
