In [None]:
# call functions and config modules 
exec(open("./functions/functions.py").read())
exec(open("./config/config.py").read())

# set style for notebook 
_set_css_style('./www/custom.css') 

# toggle code on/off button 
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')


# call data manipulation and plotting libraries 
import json
import pandas as pd 
import seaborn as sb
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import numpy as np
import warnings
import shap
import random
import plotly.graph_objects as go
from IPython.display import HTML
import plotly.io as pio; pio.renderers.default='notebook'
import plotly.offline as py
py.init_notebook_mode(connected=True)
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
import plotly.offline as offline
from datetime import date, time, datetime, timedelta
from plydata import define, if_else
from numpy import mean, absolute

# import libraries for machine learning
from statsmodels.formula.api import ols
import pycaret 
import sklearn 
import featuretools as ft

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

from sklearn.preprocessing import PowerTransformer
from sklearn.metrics import mean_squared_error 
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score 
from pylab import rcParams
from sklearn import metrics 
from sklearn.metrics import classification_report, accuracy_score, f1_score
from imblearn.over_sampling import SMOTE

from sklearn.preprocessing import scale
from sklearn.model_selection import cross_val_predict
from pycaret.classification import *

# turn off notebook warnings 
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

# turn off scientific notation 
pd.options.display.float_format = '{:.2f}'.format


# Data overview

Many people struggle to get loans because their credit histories are insufficient or non-existent. Unfortunately, this often affects the most marginalised part of the population. The data set used in this analysis records consumer default instances in the `target` variable. A default is defined as a customer who is more than 90 days late on a payment for a given loan, or who fails to repay the loan.

The borrower characteristics include quantitative metrics such as income, credit amount, value of goods purchased and days employed. They also include qualitative/categorical measures such as gender, education, home ownership, mobile ownership and number of children.

This analysis does not take into account financial inclusion for the unbanked population. In frontier markets, it is important to make sure that underserved populations have a positive loan experience and that everyone is given a fair assessment. This data does not include alternative data sources for predicting customers' repayment abilities.

Alternative data sources that could be incorporated into the model include:

* Macro-economic variables, such as inflation, GDP, unemployment, local FX rates and tax. This would require a timestamp to identify macro-economic conditions at the time of default.
* Recession/pandemic indicators, e.g. extreme macro-economic climates impacting the borrower's ability to repay their loans.
* Non-financial assets, such as land, livestock and agricultural machinery.
* Health-related measures or history.

# Import and interrogate data 

The feature set includes borrower measures of credit usage, income, annuities and value of goods purchased, as well as qualitative features such as education type, home ownership, birthday and gender. The response variable is a borrower default (`TARGET = 1`).

The **target** variable takes the value **1** if someone experiences payment difficulties or fails to repay a loan. 

In [None]:
# read in the credit risk data set
credit_risk= pd.read_csv("./data/credit_risk_data.csv")

# reset index to ID 
credit_risk = credit_risk.set_index('SK_ID_CURR')


In [None]:
# first look at the structure of the data
print('Data types in credit risk dataset as follows:' '\n')
credit_risk.info()

In [None]:
# explore data properties 
credit_risk.describe(percentiles=[.001, .1, .9, .99])

In [None]:
print('Null values in dataset are as follows:' '\n')
null = credit_risk.isnull().sum()
print(null[null>0])


There are some missing values that need to be imputed or removed from the data, namely in 'annuity amount', 'occupation type' and 'amount goods price'. There are also columns that don't contain useful information, such as ID.
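As a minimal sketch of the imputation option, the snippet below fills nulls on a toy frame rather than the real data. The values are made up, the helper name `impute_simple` is hypothetical, and the choice of median for numeric columns and most frequent value for categorical columns is an assumption rather than this notebook's method (the cells further down simply drop rows with nulls).

```python
import numpy as np
import pandas as pd

# toy data standing in for the real credit risk frame (values are made up)
toy = pd.DataFrame({
    "AMT_ANNUITY": [24700.5, np.nan, 6750.0, 29686.5],
    "AMT_GOODS_PRICE": [351000.0, 1129500.0, np.nan, 297000.0],
    "OCCUPATION_TYPE": ["Laborers", None, "Laborers", "Core staff"],
})

def impute_simple(df):
    """Fill numeric nulls with the column median and other nulls with the mode."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

imputed = impute_simple(toy)
print(imputed.isnull().sum().sum())  # total number of remaining nulls
```

Dropping rows is simpler but discards observations; imputing keeps them at the cost of introducing assumed values.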

In [None]:
colnames = list(credit_risk.columns.drop('TARGET')) # create list of predictor names
categorical_names = credit_risk.select_dtypes(include=[object, np.int64]).columns # create list of categorical names
continuous_names = credit_risk.select_dtypes(exclude=[object, np.int64]).columns # create list of continuous names

print("The categorical/discrete features include: ")
print(", ".join(list(map(str,categorical_names))))

print("\nThe continuous features include: ")
print(", ".join(list(map(str,continuous_names))))

In [None]:
# compute correlation matrix of feature set 
corr = credit_risk[[*continuous_names]].corr()

sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)

The above chart shows that the continuous variables are relatively correlated, e.g. a high credit amount correlates with a high goods price (correlation ≈ 0.9). This can be a problem in a classification context, though less so in a machine learning context. The difficulty is that, in practice, you need to explain the system’s behaviour, especially if it makes decisions. ML explainability is important so that intelligent technologies don’t inherit societal biases.
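A heatmap is easy to scan but hard to act on. One way to list the strongly correlated pairs explicitly is sketched below, using synthetic data in which goods price is built to track credit amount. The helper `high_corr_pairs`, the 0.8 threshold and the simulated values are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return (feature_a, feature_b, correlation) for pairs with |corr| >= threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            value = corr.iloc[i, j]
            if abs(value) >= threshold:
                pairs.append((cols[i], cols[j], float(value)))
    return pairs

# synthetic stand-in: goods price tracks credit amount closely, income does not
rng = np.random.default_rng(0)
credit = rng.normal(500_000, 100_000, size=1_000)
toy = pd.DataFrame({
    "AMT_CREDIT": credit,
    "AMT_GOODS_PRICE": 0.9 * credit + rng.normal(0, 20_000, size=1_000),
    "AMT_INCOME_TOTAL": rng.normal(150_000, 40_000, size=1_000),
})

pairs = high_corr_pairs(toy)
print(pairs)
```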

In [None]:
plot_data = credit_risk[['TARGET', *continuous_names]]
plot_data = plot_data.melt(id_vars="TARGET", 
                           var_name="name", 
                           value_name="value")

fig = px.box(plot_data,
             x="name",
             y="value", 
             color="TARGET",
             notched=True)

fig.update_layout(template= "simple_white", 
                  title="Target vs continuous variable box plot")

fig.show()



The chart above suggests there are some outliers in the data, most notably for **'total income amount'**; these will be either imputed or removed from the data. The chart also points to a few key relationships:

* **AMT_INCOME_TOTAL** - People with lower income are more likely to default;
* **AMT_GOODS_PRICE** - For consumer loans, the higher the price of the goods for which the loan is given, the less likely the borrower is to default;
* **AMT_ANNUITY** - Higher loan annuities tend to see lower default rates.
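The `remove_outliers` helper used in the next cell is defined in `functions.py` and is not shown here. A common way to implement this kind of step is the interquartile range (IQR) rule, sketched below. The function name `remove_outliers_iqr`, the multiplier `k=1.5` and the toy incomes are assumptions, not the notebook's actual implementation.

```python
import pandas as pd

def remove_outliers_iqr(data, column, k=1.5):
    """Drop rows whose value in `column` lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = data[column].quantile(0.25)
    q3 = data[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # note: rows with a null in `column` are dropped too, since between() is False for NaN
    return data[data[column].between(lower, upper)]

# made-up incomes with one extreme value
toy = pd.DataFrame({"AMT_INCOME_TOTAL": [90_000, 112_500, 135_000, 157_500, 180_000, 9_000_000]})
trimmed = remove_outliers_iqr(toy, "AMT_INCOME_TOTAL")
print(len(toy), len(trimmed))
```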


In [None]:
# remove outliers in AMT_INCOME_TOTAL
credit_risk = remove_outliers(data = credit_risk, column = 'AMT_INCOME_TOTAL')
credit_risk = remove_outliers(data = credit_risk, column = 'AMT_INCOME_TOTAL')

# remove null values 
credit_risk = credit_risk.dropna() 

In [None]:

plot_data = credit_risk[['TARGET', *continuous_names]]
plot_data = plot_data.melt(id_vars="TARGET", 
                           var_name="name", 
                           value_name="value")

fig = px.scatter(data_frame = plot_data,
              x = 'TARGET',
              y = 'value',
              color = 'name', 
              color_discrete_sequence= color_palette[1:10], 
              facet_col="name", 
              facet_col_wrap=2
             )

fig.update_yaxes(matches=None)
fig.update_layout(template= "simple_white", 
                  title="Target vs continuous variables",
                  xaxis = dict(
                      tickmode = 'array',
                      tickvals = [0,1],
                      ticktext = ['Zero', 'One']
                  )
                 )
fig.show()

## Target variable

The target variable contains information on consumer defaults, where 1 represents a default and 0 otherwise.  


In [None]:
# calculate default rates as a % of the data 
plot_data = credit_risk['TARGET'].value_counts().reset_index() 
# share of defaults = defaults / (non-defaults + defaults)
ratio = plot_data['TARGET'].iloc[1]/(plot_data['TARGET'].iloc[0] + plot_data['TARGET'].iloc[1])*100

plot_data['index'] = plot_data['index'].astype(str)

print("Defaults represent " + str(round(ratio,2)) + " % of the data" )


The dataset is imbalanced, meaning the target class has an uneven distribution of observations, i.e. `TARGET = 1` occurs less frequently than `TARGET = 0`. Imbalanced classification is primarily challenging due to the severely skewed class distribution. This may cause poor performance in machine learning models.

In [None]:
# plot count of default rates vs non-default rates

fig = px.bar(plot_data, 
             x='TARGET',
             y = 'index', 
             color_discrete_sequence=color_palette, 
             barmode="stack")
fig.update_layout(template= "simple_white", 
                  title="Count of default rates, default = 1", 
                  yaxis_title = "Default", 
                  xaxis_title = "Count")
fig.show()

In [None]:
# compute correlation of features with target variable
corr_with_target = credit_risk[colnames + ['TARGET']].corr().iloc[:-1, -1].sort_values(ascending=False)

# plot feature correlation vs target
fig = px.bar(x=corr_with_target.values, 
             y=corr_with_target.index, 
             color_discrete_sequence= color_palette)

fig.update_layout(template= "simple_white", 
                  title="Correlation with target variable", 
                  yaxis_title = "", 
                  xaxis_title = "Correlation")
fig.show()

## Categorical features

There are 10 categorical features in the data, and it is important to understand their relationship with the **target** variable.

In [None]:
# compute correlation matrix of feature set 
corr = credit_risk[[*categorical_names]].corr()

sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)

The categorical predictive features appear relatively uncorrelated. 

In [None]:
# plot target vs income type
make_bar_plot(credit_risk, 'NAME_INCOME_TYPE', title= 'Income sources vs target', xaxis='', yaxis='Count (%)')

In [None]:
# plot target vs education type
make_bar_plot(credit_risk, 'NAME_EDUCATION_TYPE', title= 'Education type vs target', xaxis='', yaxis='Count (%)')

In [None]:
# plot target vs occupation type
make_bar_plot(credit_risk, 'OCCUPATION_TYPE', title= 'Occupation type vs target', xaxis='', yaxis='Count (%)')

In [None]:
# plot target vs number of children
make_bar_plot(credit_risk, 'CNT_CHILDREN', title= 'Number of children vs target', xaxis='', yaxis='Count (%)')

In [None]:
# plot target vs gender
make_bar_plot(credit_risk, 'CODE_GENDER', title= 'Gender vs target', xaxis='', yaxis='Count (%)')


# Transform and encode data

Before applying machine learning methods, categorical variables need to be encoded into numerical form.
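The `MultiColumnLabelEncoder` used in the next cell comes from `functions.py` and is not shown. As a rough sketch of what such a helper might do, the snippet below applies scikit-learn's `LabelEncoder` column by column. The function name `label_encode_columns` and the toy values are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def label_encode_columns(df, columns):
    """Encode each listed column as integers; return the frame and the fitted encoders."""
    out = df.copy()
    encoders = {}
    for col in columns:
        encoder = LabelEncoder()
        out[col] = encoder.fit_transform(out[col].astype(str))
        encoders[col] = encoder  # kept so codes can be mapped back to labels
    return out, encoders

# made-up category values for illustration
toy = pd.DataFrame({
    "CODE_GENDER": ["F", "M", "F"],
    "NAME_EDUCATION_TYPE": ["Higher education", "Secondary", "Secondary"],
})
encoded, encoders = label_encode_columns(toy, toy.columns)
print(encoded)
```

Label encoding assigns integers in sorted label order, so the codes imply an ordering that the categories may not have. Tree-based models are usually tolerant of this; linear models may be better served by one-hot encoding.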

In [None]:
names_to_encode = credit_risk.select_dtypes(include=[object], exclude=[float]).columns # create list of object names
credit_risk[names_to_encode] = MultiColumnLabelEncoder(columns = names_to_encode).fit_transform(credit_risk[names_to_encode].astype(str))
credit_risk.tail()

Apply a Box-Cox transformer to the continuous features so that the resulting variables look more normally distributed. This will help reduce skew in the raw variables.
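To illustrate the effect, the sketch below applies the same `PowerTransformer` to simulated right-skewed data and compares the skewness before and after. The log-normal sample is an assumption standing in for an amount column. Note that Box-Cox requires strictly positive inputs.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
# log-normal draws mimic a right-skewed, strictly positive amount such as income
income = pd.DataFrame({"AMT_INCOME_TOTAL": rng.lognormal(mean=12, sigma=0.6, size=5_000)})

power = PowerTransformer(method="box-cox", standardize=True)
transformed = pd.DataFrame(power.fit_transform(income), columns=income.columns)

skew_before = income["AMT_INCOME_TOTAL"].skew()
skew_after = transformed["AMT_INCOME_TOTAL"].skew()
print(round(skew_before, 2), round(skew_after, 2))
```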

In [None]:
power = PowerTransformer(method='box-cox', standardize=True)
credit_risk[continuous_names] = power.fit_transform(credit_risk[continuous_names])

# Test and training datasets 

In [None]:
credit_risk['TARGET'] = credit_risk['TARGET'].astype(str)
X = credit_risk[colnames] # X value contains all the variables except labels
y = credit_risk['TARGET'] # these are the labels

# create training test split data sets, with test size of 30% of the data 

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3)
credit_train = pd.DataFrame(X_train).assign(TARGET = y_train)
credit_test = pd.DataFrame(X_test).assign(TARGET = y_test)
credit_test.head()

# Benchmark logistic model

First a logistic model is applied to the data to act as a baseline for performance evaluation. Logistic models are used to predict the probability that an observation falls into one of two categories; `TARGET = 1` or `TARGET = 0` based on the set of predictors (features). 
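The link between the predictors and the predicted probability can be made concrete. In the sketch below, on synthetic data rather than the credit data, the probability of class 1 is the sigmoid of a linear score, and this is checked against scikit-learn's own `predict_proba`. The variable names and the generated dataset are assumptions for illustration.

```python
import numpy as np
from scipy.special import expit  # the logistic (sigmoid) function
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# linear score: intercept + coefficients . features
linear_score = model.intercept_[0] + X_demo @ model.coef_[0]
# squashing the score through the sigmoid gives P(class 1 | features)
manual_prob = expit(linear_score)
sklearn_prob = model.predict_proba(X_demo)[:, 1]

print(np.allclose(manual_prob, sklearn_prob))
```

By default `predict` labels an observation as class 1 when this probability exceeds 0.5, which matters on imbalanced data where few observations reach that level.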

In [None]:
# set up and fit logistic regression model 
logit= LogisticRegression()
logit.fit(X_train, y_train)

# Predicting the model
pred_logit= logit.predict(X_test)


print("The accuracy of logit model is:", accuracy_score(y_test, pred_logit))
print(classification_report(y_test, pred_logit))


Note the accuracy is 91%; however, this is not a good measure due to the class imbalance in the data. The recall and precision scores are both 0% for `TARGET = 1`, i.e. the model is very bad at predicting true defaults, whilst the model is also misclassifying non-defaults as defaults.
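The sketch below shows why accuracy can look healthy while recall is zero. It uses a hypothetical population in which 9% of borrowers default (chosen to match the 91% accuracy above; the real class balance may differ) and a trivial "model" that never predicts a default.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 1,000 hypothetical borrowers, 9% of whom default
y_true = np.array([1] * 90 + [0] * 910)
# a "model" that never predicts a default
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print("accuracy :", accuracy)
print("precision:", precision)
print("recall   :", recall)
print("F1       :", f1)
```

This is why the later sections rank models by F1 score rather than accuracy.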

# Model fitting on raw data

In this section an array of classifiers is explored. Each model is fitted to the data using a ‘training’ sample. The ‘testing’ sample is used to evaluate each model’s performance in predicting credit defaults. The model fits can be seen in the Annex. Once the models have been fitted using the training data, each model’s performance can be evaluated when applied to unseen data (out-of-sample). The algorithms include logistic regression, K-nearest neighbours, random forest, decision tree, AdaBoost, naive Bayes and gradient boosting classifiers.

In [None]:
s = setup(
    fold=10, # 10 fold cross validation 
    data = credit_train, # training data 
    silent = True, 
    test_data = credit_test, # test data 
    target = 'TARGET', 
    session_id = 123)

# fit set of models based on the above configuration, sorted by F1 score
best = compare_models(include = ['lr', 'knn', 'rf', 'dt', 'ada', 'nb', 'gbc'], 
                      sort = 'F1')

The predictive performance on the imbalanced data is poor, with the naive Bayes model outperforming with an F1 score of just 17%. The accuracy statistics are misleading.

The decision tree classifier performs best in terms of F1 score and recall, i.e. the model is best at detecting defaults, whilst the random forest generated the best precision, i.e. it is best at detecting true defaults out of all cases flagged as defaults. That said, the model performance is poor and defaults are not being identified particularly well.

In [None]:
# create decision tree model, using 5-fold cross validation
dt_fit = create_model('dt', fold=5)

The following confusion matrix and AUC charts show the poor predictive performance of the decision tree model. 

In [None]:
# plot confusion matrix 
plot_model(dt_fit, plot='confusion_matrix',  plot_kwargs = {'percent' : True})

In [None]:
# plot AUC 
plot_model(dt_fit, plot='auc')

The following chart helps identify which features are the most important in predicting default rates under the decision tree model.

In [None]:
plot_model(dt_fit, 'feature')

The above chart shows that birthday, employment history, annuity, income and credit are all very important in predicting consumer default rates.

# Resampling

The credit risk dataset is heavily imbalanced, with a small percentage of observations in the data representing defaults. Re-sampling methods are explored here. Oversampling is used, which helps ensure no information from the original training set is lost, as all members of the minority and majority classes are retained. However, it is computationally costly as it increases the size of the training set.

Synthetic Minority Oversampling TEchnique (SMOTE) is used here as an oversampling method. SMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample at a point along that line.
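The interpolation step described above can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, not the `imblearn` implementation. The function name `smote_like_samples`, the neighbour count `k` and the toy points are assumptions.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = np.empty((n_new, minority.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(minority))                 # pick a minority example
        dist = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(dist)[1:k + 1]          # k nearest, excluding itself
        j = rng.choice(neighbours)                      # pick one neighbour
        gap = rng.random()                              # position along the line, in [0, 1)
        synthetic[row] = minority[i] + gap * (minority[j] - minority[i])
    return synthetic

# five made-up minority-class points in a two-feature space
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.7]])
new_points = smote_like_samples(minority, n_new=10)
print(new_points.shape)
```

Because each synthetic point lies on a segment between two existing minority points, it stays inside the region already occupied by the minority class.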

In [None]:
# oversample using smote to oversample the minority class.
# this synthesizes new samples from the existing ones.

oversample = SMOTE()

# define new training and test data using resampled data 
X_train_bal, y_train_bal = oversample.fit_resample(X_train, y_train)
credit_train_bal = pd.DataFrame(X_train_bal).assign(TARGET = y_train_bal)
credit_test = pd.DataFrame(X_test).assign(TARGET = y_test)

# calculate default rates as a % of the data in rebalanced data 
plot_data = credit_train_bal['TARGET'].value_counts().reset_index() 
ratio = plot_data['TARGET'].iloc[1]/(plot_data['TARGET'].iloc[0] + plot_data['TARGET'].iloc[1])*100
print("In the rebalanced data - defaults represent " + str(round(ratio,2)) + " % of the data" )

In [None]:
# using the rebalanced data, can now apply the same set of algorithms 
s = setup(
    fold=10, # 10 fold cross validation 
    data = credit_train_bal, 
    silent = True, 
    test_data = credit_test, 
    target = 'TARGET', 
    session_id = 123)

best = compare_models(include = ['lr', 'knn', 'rf', 'dt', 'ada', 'nb', 'gbc'], 
                      sort = 'F1')

Rebalancing the data significantly improves the model results: the F1 score is now 87% for the random forest.

In [None]:
rf_fit = create_model('rf', fold=10)

# Feature engineering 

In this section, feature engineering is considered in an attempt to improve the model's predictive performance. 

In [None]:
credit_risk['CREDIT_INCOME_RATIO'] = credit_risk['AMT_CREDIT']/credit_risk['AMT_INCOME_TOTAL']
credit_risk['GOODS_CREDIT_RATIO'] = credit_risk['AMT_GOODS_PRICE']/credit_risk['AMT_CREDIT']
credit_risk['GOODS_INCOME_RATIO'] = credit_risk['AMT_GOODS_PRICE']/credit_risk['AMT_INCOME_TOTAL']

# feature primitives to engineer additional model features 
default_agg_primitives =  ["sum", "std", "max", "skew", "min", "mean"]

# Make an entityset and add the data 
es = ft.EntitySet(id = 'credit_train')

es.entity_from_dataframe(entity_id = 'data', 
                         dataframe = credit_risk, 
                         make_index = True, 
                         index = 'index')

# Run deep feature synthesis with transformation primitives
feature_matrix, feature_defs = ft.dfs(entityset = es, 
                                      target_entity = 'data',
                                      agg_primitives=default_agg_primitives, # use default primitives defined above
                                      trans_primitives = ['add_numeric', 'multiply_numeric'])

print('%d total Features have been created' % len(feature_defs))

In [None]:
credit_features = pd.DataFrame(feature_matrix)

X = credit_features.drop('TARGET',axis=1) # X value contains all the variables except labels
y = credit_features['TARGET'] # these are the labels

# create training test split data sets, with test size of 30% of the data 

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3)
credit_features_train = pd.DataFrame(X_train).assign(TARGET = y_train)
credit_features_test = pd.DataFrame(X_test).assign(TARGET = y_test)

In [None]:
s = setup(
    fold=5, # 5 fold cross validation 
    data = credit_features_train, 
    silent = True, 
    test_data = credit_features_test, 
    target = 'TARGET', 
    session_id = 123)

best = compare_models(include = ['lr', 'knn', 'rf', 'dt', 'ada', 'nb', 'gbc'], 
                      sort = 'F1')

In [None]:
nb_fit_features = create_model('nb', cross_validation=False)

In [None]:
plot_model(nb_fit_features, plot = 'auc')

In [None]:
plot_model(nb_fit_features,  plot = 'confusion_matrix', plot_kwargs = {'percent' : True})

The feature engineering methods improve the model performance somewhat. The F1 statistic under the naive Bayes classifier is now 18.5%, with the recall notably improved at 81%. The model is still predicting a large number of false positives and false negatives.
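The F1 score and recall quoted above pin down the precision, since F1 is the harmonic mean of the two. The short calculation below solves for it, assuming both figures refer to the default class at the same decision threshold. The function name is hypothetical.

```python
def precision_from_f1_and_recall(f1, recall):
    """Solve F1 = 2PR / (P + R) for the precision P."""
    return f1 * recall / (2 * recall - f1)

implied_precision = precision_from_f1_and_recall(f1=0.185, recall=0.81)
print(round(implied_precision, 3))  # 0.104
```

Under that assumption, only about one in ten predicted defaults is a true default, which is consistent with the large number of false positives noted above.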

In [None]:
plot_model(nb_fit_features, plot="feature")

# Model hyperparameter tuning 

# Bias-variance trade off

When it comes to machine learning, there is a trade-off between model performance and cost. Basic models with few degrees of freedom (the logistic case) are often simple to calculate (low cost); however, they may lead to poorer model fits and performance (e.g. under-fitting when there is a non-linear relationship). On the other hand, sophisticated models such as ANNs can provide more accurate fits, as demonstrated above, but are computationally intensive (cross-validating an ANN takes more than 10 minutes to compute). In addition, complex models with a large number of parameters can lead to overfitting or be subject to a lot of variance (the bias-variance trade-off). The cross validation process used in this analysis can help calibrate a model’s fit, which in turn can improve predictive performance.
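As a rough sketch of how cross validation can be used to compare models of different complexity, the snippet below scores decision trees of increasing depth on synthetic data. Typically a very shallow tree underfits and a fully grown tree overfits, but which depth wins depends on the data. The dataset and the depths tried are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic data with some label noise (flip_y) so that overfitting is possible
X_demo, y_demo = make_classification(n_samples=1_000, n_features=10, n_informative=4,
                                     flip_y=0.1, random_state=0)

cv_scores = {}
for depth in (1, 4, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # mean F1 over 5 held-out folds estimates out-of-sample performance
    cv_scores[depth] = cross_val_score(tree, X_demo, y_demo, cv=5, scoring="f1").mean()
    print(depth, round(cv_scores[depth], 3))
```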

# Overfitting

# Scalability

Thus far, only the accuracy of the model has been considered. The scalability of the machine learning algorithm is also important when implementing such solutions. For example, algorithms such as neural networks are powerful but computationally intensive and slow to train. The credit default data consists of ~112,000 observations, which is a relatively modest size for a dataset. It is possible to run these algorithms on a local machine; however, given the ever-growing set of records, one day this may become a problem. Solutions such as PySpark and sparklyr can help overcome such challenges, along with other distributed processing systems used for big data workloads.

# Extensions 

The models explored in this exercise perform relatively well in predicting consumer credit defaults when accounting for preferences between Type I and Type II errors. Model performance could be further enhanced via the application of some of the following techniques:

1. **Regularisation**: This is a smoothing method that introduces a penalty term to the error function, acting on non-zero, and especially large, parameter values. Its effect is to reduce the complexity of a model, or to smooth it. This is useful when faced with an over-fitting model but also for feature selection, which can sometimes be problematic in neural networks.

2. **Bagging**: A popular method for improving an estimator’s properties. The idea is that, when aggregating (averaging) the outputs of many estimators, one reduces the variance of the full model. When over-fitting is an issue, as for tree models, bagging is likely to improve performance.

3. **Boosting**: Describes a set of methods which transform, or boost, a weak learner into a strong one (model). This means that one iteratively applies models to the data which individually have weak generalisation properties, but the final ensemble of models generalises well. 

4. **Simulated data**: This involves creating synthetic data of the minority class (`TARGET = 1`) to help rebalance the data. This was explored and improved model performance, but there are further re-sampling methods that could improve the predictive accuracy even further. The result could be better out-of-sample predictive performance, as the model has ‘seen’ more instances of what a default looks like and is therefore more likely to be able to identify future instances.
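Regularisation, the first item in the list above, can be illustrated with an L1-penalised logistic regression, where a stronger penalty pushes more coefficients to exactly zero and so doubles as feature selection. The sketch uses synthetic data with only 3 informative features out of 20. The dataset and the two values of `C` are assumptions, and this model was not fitted in the analysis above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=500, n_features=20, n_informative=3,
                                     n_redundant=0, random_state=0)

nonzero = {}
for C in (100.0, 0.01):  # in scikit-learn a smaller C means a stronger penalty
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X_demo, y_demo)
    nonzero[C] = int(np.count_nonzero(model.coef_))
    print(f"C={C}: {nonzero[C]} of {X_demo.shape[1]} coefficients are non-zero")
```

Bagging and boosting are already represented in the model comparison above through the random forest and the AdaBoost and gradient boosting classifiers respectively.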