# Micro Mortgages - Tree-based Models
- Author: Oliver Mueller
- Last update: 26.01.2024

## Initialize notebook
Load required packages. Set up workspace, e.g., set theme for plotting and initialize the random number generator.

In [None]:
import warnings
warnings.simplefilter('ignore')

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score, roc_curve, RocCurveDisplay, auc
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn import tree
#import graphviz as graphviz
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import HistGradientBoostingClassifier


In [None]:
plt.style.use('fivethirtyeight')

## Problem description

In India, there are about 20 million home loan (mortgage) aspirants
working in the informal sector:

- Monthly income between INR 20,000-25,000 (\$ 325-400)
- Typically no formal accounts and documents (e.g., tax returns, income proofs, bank statements)
- Often use services of money lenders with interest rates between 30 and 60% per annum

Providing mortgages to this group of customers requires assessing their
creditworthiness quickly and efficiently. Due to a lack of formal
documents and objective data, most financial institutions rely on
interview-based processes to decide on these loan requests.

Strengths of the current process:

-   Interview-based field assessment

-   Relaxation of document requirements

Weaknesses of the current process:

-   Costly (total transaction costs as high as 30% of loan volume)

-   Subjective judgments; depends on individual skills and motivations

-   Low reliability across branches and credit officers

-   Risk of corruption and fraud

## Load data

Load the data from a CSV file. It is split into training and test sets in the next section.

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/olivermueller/vhbprodok_datascience/main/micro_mortgages/data/micromortgage.csv')

In [None]:
data.head()

## Prepare data

In [None]:
data = data.drop(['ID'], axis=1)
data["Tier"] = data["Tier"].apply(lambda x: "T"+str(x))

In [None]:
X = data.drop("Decision", axis=1)
y = data["Decision"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
categorical_features = X_train.select_dtypes(include='object').columns
numerical_features = X_train.select_dtypes(exclude='object').columns

In [None]:
categorical_features

In [None]:
numerical_features

In [None]:
enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
enc.fit(X_train[categorical_features])

X_train_cat = enc.transform(X_train[categorical_features])
X_test_cat = enc.transform(X_test[categorical_features])

X_train_cat = pd.DataFrame(X_train_cat, columns=enc.get_feature_names_out(categorical_features))
X_test_cat = pd.DataFrame(X_test_cat, columns=enc.get_feature_names_out(categorical_features))

In [None]:
X_train_cat.head()

In [None]:
scaler = StandardScaler()
scaler.fit(X_train[numerical_features]) 

X_train_num = scaler.transform(X_train[numerical_features])
X_test_num = scaler.transform(X_test[numerical_features])

X_train_num = pd.DataFrame(X_train_num, columns=numerical_features)
X_test_num = pd.DataFrame(X_test_num, columns=numerical_features)

In [None]:
X_train_num.head()

In [None]:
X_train = pd.concat([X_train_num, X_train_cat], axis=1)
X_test = pd.concat([X_test_num, X_test_cat], axis=1)

## Classification and Regression Trees (CART)

First, we will grow a single CART tree.

In [None]:
model_cart = DecisionTreeClassifier(criterion='entropy')

In [None]:
model_cart.fit(X_train, y_train)

Let's check how well the tree performs on the training and test data.

In [None]:
# Training data
pred_label_train = model_cart.predict(X_train)
pred_proba_train = model_cart.predict_proba(X_train)[:,1]
acc_train = accuracy_score(y_train, pred_label_train)
auc_train = roc_auc_score(y_train, pred_proba_train)
print('ACC on training set:', round(acc_train, 2))
print('AUC on training set:', round(auc_train, 2))

print("===")

# Test data
pred_label_test = model_cart.predict(X_test)
pred_proba_test = model_cart.predict_proba(X_test)[:,1]
acc_test = accuracy_score(y_test, pred_label_test)
auc_test = roc_auc_score(y_test, pred_proba_test)
print('ACC on test set:', round(acc_test, 2))
print('AUC on test set:', round(auc_test, 2))
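
Since the same accuracy and AUC computation is applied to both data sets, it could be wrapped in a small helper. The sketch below is one possible way to do this; the function name `evaluate` is hypothetical, and the data is a synthetic stand-in generated with `make_classification`, not the mortgage data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate(model, X, y):
    """Return accuracy and ROC AUC of a fitted binary classifier on (X, y)."""
    pred_label = model.predict(X)
    pred_proba = model.predict_proba(X)[:, 1]  # probability of the positive class
    return {
        "acc": accuracy_score(y, pred_label),
        "auc": roc_auc_score(y, pred_proba),
    }


# Synthetic stand-in data (assumption); in the notebook, use X_train, X_test, y_train, y_test
X_demo, y_demo = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_tree = DecisionTreeClassifier(criterion="entropy", random_state=42).fit(X_tr, y_tr)

scores = {"train": evaluate(demo_tree, X_tr, y_tr), "test": evaluate(demo_tree, X_te, y_te)}
for split, result in scores.items():
    print(split, {name: round(value, 2) for name, value in result.items()})
```

An unrestricted tree typically fits the training data perfectly, so a large gap between training and test scores is a sign of overfitting.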

The advantage of a single tree is that it is easy to interpret and visualize. Let's have a look.

In [None]:
fn = model_cart.feature_names_in_
labels = model_cart.classes_
labels = [str(item) for item in labels]

tree.plot_tree(model_cart, feature_names=fn, class_names=labels, filled=True, proportion=True, rounded=True)
plt.show()

Wow, that's a big tree! It is hard to interpret and understand. Let's go back and try to grow a smaller tree by setting stopping rules (e.g., `max_depth`) or pruning the tree (`ccp_alpha`).
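
As a rough sketch of both options, the example below compares an unrestricted tree with a depth-limited tree and a cost-complexity-pruned tree. It uses synthetic stand-in data, and the values `max_depth=3` and `ccp_alpha=0.01` are illustrative assumptions, not tuned settings for the mortgage data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); in the notebook, use X_train, X_test, y_train, y_test
X_demo, y_demo = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Baseline: tree grown without any restriction
full_tree = DecisionTreeClassifier(criterion="entropy", random_state=42).fit(X_tr, y_tr)

# Stopping rule: do not grow the tree deeper than three levels
shallow_tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42).fit(X_tr, y_tr)

# Pruning: grow the full tree, then cut back subtrees whose gain does not justify their complexity
pruned_tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.01, random_state=42).fit(X_tr, y_tr)

for name, model in [("full", full_tree), ("max_depth=3", shallow_tree), ("ccp_alpha=0.01", pruned_tree)]:
    print(
        name,
        "| leaves:", model.get_n_leaves(),
        "| depth:", model.get_depth(),
        "| test ACC:", round(model.score(X_te, y_te), 2),
    )
```

Suitable values for `max_depth` and `ccp_alpha` depend on the data and could be chosen with cross-validation, as done for the ensemble models below.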

## Random Forest

Let's now try to grow a random forest. This time, we will tune the hyperparameters with a grid search and 5-fold cross-validation (`GridSearchCV` with `cv=5`) to find the best model.

In [None]:
def model_rf_cv():
    """Build a fresh 5-fold grid search over random forest hyperparameters."""
    return GridSearchCV(
        estimator=RandomForestClassifier(random_state=42),
        param_grid={
            'n_estimators': [100, 200, 300],
            'max_depth': [3, 5, None],
            'min_samples_leaf': [2, 5, 10],
        },
        cv=5,
        n_jobs=-1,
    )

In [None]:
model_rf_tuned = model_rf_cv()
model_rf_tuned.fit(X_train, y_train)
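
After fitting, a `GridSearchCV` object exposes the winning hyperparameter combination and the cross-validated scores of all candidates. The sketch below shows how these could be inspected. It uses synthetic stand-in data and a deliberately small grid (both assumptions, chosen to keep the example fast), so the numbers say nothing about the mortgage data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [25, 50], "max_depth": [3, None]},  # small illustrative grid
    cv=3,
)
search.fit(X_demo, y_demo)

# Best hyperparameter combination and its mean cross-validated score
print("Best parameters:", search.best_params_)
print("Best CV score:", round(search.best_score_, 3))

# All candidates, ranked by mean cross-validated score
cv_table = (
    pd.DataFrame(search.cv_results_)[["params", "mean_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(cv_table)

# With the default refit=True, the best model is refitted on all data passed to fit()
best_rf = search.best_estimator_
```

For classifiers, `GridSearchCV` ranks candidates by accuracy unless a `scoring` argument is given; passing `scoring='roc_auc'` would select by AUC instead.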

Make predictions on the test set and evaluate the model.

In [None]:
pred_label = model_rf_tuned.predict(X_test)
pred_proba = model_rf_tuned.predict_proba(X_test)[:,1]

In [None]:
print(classification_report(y_test, pred_label))

Plot the ROC curve and calculate the AUC.

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, pred_proba)
auc_score = auc(fpr, tpr)
display = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=auc_score, estimator_name='Random Forest')
display.plot()
plt.show()

## Boosted Trees

Finally, we will try boosted trees. We will use the fast `HistGradientBoostingClassifier` (Histogram-based Gradient Boosting Classification Tree) learner, whose implementation is inspired by LightGBM. Again, we will tune the hyperparameters with a grid search and 5-fold cross-validation to find the best model.

In [None]:
def model_boost_cv():
    """Build a fresh 5-fold grid search over boosting hyperparameters."""
    return GridSearchCV(
        estimator=HistGradientBoostingClassifier(random_state=42),
        param_grid={
            'max_iter': [100, 200, 300],
            'max_depth': [1, 3, 5, 10],
        },
        cv=5,
        n_jobs=-1,
    )

In [None]:
model_boost_tuned = model_boost_cv()
model_boost_tuned.fit(X_train, y_train)

In [None]:
pred_label = model_boost_tuned.predict(X_test)
pred_proba = model_boost_tuned.predict_proba(X_test)[:,1]

In [None]:
print(classification_report(y_test, pred_label))

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, pred_proba)
auc_score = auc(fpr, tpr)
display = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=auc_score, estimator_name='Gradient Boosting')
display.plot()
plt.show()