# ML Training Pipeline

Author: Marco Pellegrino<br>
Year: 2024

The overall project builds a simple model that predicts the probability of loan default from loan application data. These predictions help assess business risk and improve loan approval decisions.

In this notebook, a Decision Tree model is tuned, trained, and evaluated.

In [None]:
import pandas as pd
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.metrics import roc_curve, auc, log_loss, f1_score
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree, export_text
import time
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
import joblib

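# Import paths (explicit names instead of a wildcard import)
from config import (ML_MODEL_PATH, PATH_DATA_PREPROCESSED, PATH_PLOTS_AUC,
                    PATH_PLOTS_FEATURE_IMPORTANCE, PATH_RESULTS)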

# Load Data

In [None]:
# Import data
df = pd.read_csv(PATH_DATA_PREPROCESSED+'loan_application_data-preprocessed.csv', index_col=False)

In [None]:
df.info()

# Identify features requiring data cleaning

In [None]:
def get_na_info(df):
    """Return the count and percentage of missing values for each feature."""
    na_count = df.isna().sum()
    return pd.DataFrame({
        'NaN Count': na_count,
        'NaN Percentage (%)': (na_count / len(df) * 100).round(2)
    })

In [None]:
# Check how many null values there are
display(get_na_info(df))

- The `uc_risk_class` feature is missing for a large share of the records, so it is removed. With so few observed values, neither median imputation nor model-based imputation would be reliable. Moreover, since UC Risk Class is typically derived from data already present in the data frame, dropping `uc_risk_class` should not cause a significant loss of information. Fewer features also reduce model complexity, which may help generalization. The use of `uc_risk_class` can be reconsidered once more up-to-date client cases have been collected.

In [None]:
df = df.drop(columns=['uc_risk_class'])

- `company_rating`, `person_scoring`, and `incorporation_days` have very few missing values.
The rows with missing values in these features can be removed, because dropping so few rows has little impact on the dataset size and on the feature distributions, while improving overall data quality. This can be done before the train/test split because removing rows with missing values does not leak information into the test set.

In [None]:
df = df.dropna(subset=['company_rating', 'incorporation_days', 'person_scoring']).reset_index(drop=True)

- `net_turnover` has several missing values, which can be reconstructed by imputation. Candidate techniques include mode imputation, K-Nearest Neighbors (KNN) imputation, and Random Forest imputation. For simplicity, and because there was no time to tune KNN or other ML models, median imputation is used. The median is preferred over the mean because `net_turnover` is highly skewed; mean imputation is better suited to approximately normal, non-skewed data.
The following imputation techniques are best avoided for this feature:
  - Forward fill or backward fill: the order of the data is not meaningful.
  - Linear regression imputation: no linear relationship is observed (refer to the plot at the end of the notebook).
  - Deep learning imputation: given the sensitive financial nature of the feature, a transparent method is preferred.

The imputation statistics must be learned on the training data only, to avoid data leakage from the test set.
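
As a rough illustration of the two points above (median versus mean on skewed data, and learning the imputation value on training rows only), here is a minimal sketch. The numbers are made up for the example and are not taken from the loan dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy, right-skewed turnover values (illustrative only), with gaps
train = pd.DataFrame({'net_turnover': [10.0, 12.0, 15.0, np.nan, 1000.0]})
test = pd.DataFrame({'net_turnover': [np.nan, 20.0]})

# Learn the median on the training rows only ...
imputer = SimpleImputer(strategy='median')
imputer.fit(train)

# ... and reuse that same value to fill the gaps in the test rows
test_imputed = imputer.transform(test)

train_median = imputer.statistics_[0]       # median of the observed training values
train_mean = train['net_turnover'].mean()   # pulled upwards by the single large value
print(train_median, train_mean, test_imputed.ravel())
```

The single large value drags the mean far away from the bulk of the data, while the median stays close to the typical observations.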

In [None]:
# Check again how many null values there are
display(get_na_info(df))

# Split data between features and target

In [None]:
X = df.drop(['default'], axis=1)
y = df['default']

# Split data between Training and Test set

Use a stratified split so that the training and test sets keep the original target distribution.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
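
To see what `stratify` does, the sketch below splits a synthetic target with 10% positives (an assumption for the example, not the real default rate) and compares the positive rate in each part.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target with 10% positives (illustrative only)
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The positive rate is preserved in both parts of the split
print(y_demo.mean(), y_tr_demo.mean(), y_te_demo.mean())
```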

# Imbalanced target

Firstly, let's recall the imbalanced target, this time on the training set:

In [None]:
sns.set(style="whitegrid")

ax = sns.countplot(x=y_train, palette="Set3", hue=y_train, legend=False)
sns.set(font_scale=1.5)
ax.set_xlabel('Loan Default')
ax.set_ylabel('Frequency')
fig = plt.gcf()
fig.set_size_inches(10, 5)

# Adding percentage labels on each bar
total = len(y_train)
for p in ax.patches:
    percentage = '{:.1f}%'.format(100 * p.get_height() / total)
    # Use dedicated names so the target variable `y` is not overwritten
    x_pos = p.get_x() + p.get_width() / 2
    y_pos = p.get_height()
    ax.annotate(percentage, (x_pos, y_pos), ha='center', va='bottom')

plt.title('RAW Distribution of Loan Default')
plt.show()

There are three possible solutions:
- Under-sampling: rebalances the target by removing negative observations until both classes are equally represented. However, it sacrifices a lot of observations: the training set would shrink to around 40000 records.
- Over-sampling: rebalances the target by duplicating positive observations until both classes are equally represented. However, the repeated data points might lead to overfitting, and the larger training set leads to longer training times.
- Class weight: modifies the model's loss function so that errors on the under-represented positive class are penalized more heavily.

Class weight is used because:
- It does not change the size of the dataset
- It is less computationally expensive than over-sampling

In [None]:
sample_weight = compute_sample_weight(
    class_weight='balanced',
    y=y_train
)
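
To make the weighting concrete, the following sketch reproduces the `'balanced'` weights by hand on a toy target (three negatives, one positive; the values are illustrative). Each class receives the weight `n_samples / (n_classes * class_count)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

y_demo = np.array([0, 0, 0, 1])  # toy target: 3 negatives, 1 positive

weights = compute_sample_weight(class_weight='balanced', y=y_demo)

# Same values by hand: n_samples / (n_classes * count of the sample's class)
counts = np.bincount(y_demo)
manual = len(y_demo) / (len(counts) * counts[y_demo])

print(weights)
print(weights[y_demo == 0].sum(), weights[y_demo == 1].sum())
```

With these weights, both classes contribute the same total weight, which is the effect the class-weight approach relies on.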

# Hyperparameter tuning

Create a Decision Tree model

In [None]:
model = DecisionTreeClassifier()

Define hyperparameters to tune

In [None]:
param_grid = {
    'splitter': ['best', 'random'],
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

To avoid data leakage, median imputation for `net_turnover` is performed within each fold of the cross-validation search. To do so, a pipeline is created in which the imputation is performed first and the model is trained afterwards.

Add prefix to parameters so that the pipeline knows that the parameters apply to the model and not the imputer:

In [None]:
def modify_prefix_to_params(param_dict, prefix, action='add'):
    """
    Add or remove a prefix to/from each parameter in the parameter dictionary.

    Args:
    - param_dict (dict): A dictionary containing parameter names as keys and their respective values.
    - prefix (str): The prefix to be added to or removed from each parameter name.
    - action (str): 'add' to add the prefix, 'remove' to remove it. Default is 'add'.

    Returns:
    - dict: A new dictionary with the prefix added to or removed from each parameter name.
    """
    if action == 'add':
        return {f"{prefix}{param}": values for param, values in param_dict.items()}
    elif action == 'remove':
        prefix_length = len(prefix)
        return {param[prefix_length:]: values for param, values in param_dict.items() if param.startswith(prefix)}
    else:
        raise ValueError("Invalid action. Please specify 'add' or 'remove'.")

In [None]:
param_grid = modify_prefix_to_params(param_grid, 'model__', action='add')
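
The prefix follows scikit-learn's `<step name>__<parameter name>` convention for nested parameters. A minimal sketch with a throwaway two-step pipeline (the step name `imputer` is chosen only for this example):

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

demo_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('model', DecisionTreeClassifier()),
])

# Parameters of a step are addressed as '<step name>__<parameter name>'
demo_pipeline.set_params(model__max_depth=5)

print(demo_pipeline.named_steps['model'].max_depth)
print('model__max_depth' in demo_pipeline.get_params())
```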

Define the median imputer

In [None]:
median_imputer = make_column_transformer(
  (SimpleImputer(strategy='median'), ['net_turnover']),
  remainder='passthrough',
)

Build a two-step pipeline with the imputer and the model

In [None]:
pipeline = Pipeline([
    ('median_imputer', median_imputer), 
    ('model', model)
])

Define the hyperparameter tuning search

In [None]:
search_results = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions=param_grid,
    cv=10,
    scoring='neg_log_loss',
    n_iter=20,
)
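
For a sense of scale, the sketch below counts how many parameter combinations the grid contains and how many model fits the randomized search performs during cross-validation. The grid is copied from the cell above without the `model__` prefix; the final refit on the full training set is not counted.

```python
from sklearn.model_selection import ParameterGrid

demo_grid = {
    'splitter': ['best', 'random'],
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

n_combinations = len(ParameterGrid(demo_grid))  # 2 * 4 * 3 * 3
n_iter, cv = 20, 10
n_fits = n_iter * cv                            # sampled candidates x folds

print(n_combinations, n_fits)
```

The randomized search therefore evaluates 20 of the 72 possible combinations, which trades exhaustiveness for a shorter tuning time.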

#### Execute hyperparameter tuning search

In [None]:
start_time = time.time()

# Pass sample_weight to the model step so that positive and negative observations carry comparable total weight when fitting in each fold
search_results.fit(X_train, y_train, model__sample_weight=sample_weight)
end_time = time.time()

minutes, seconds = divmod(end_time - start_time, 60)
print(f'Time for hyperparameter tuning: {int(minutes):02d}:{seconds:05.2f} (mm:ss)')

In [None]:
best_params = search_results.best_params_

print('Best parameter set:')
param_grid_print = modify_prefix_to_params(best_params, 'model__', action='remove')
for param, value in param_grid_print.items():
    print(f'{param}: {value}')

# Model Training

Train on the whole training set with the best-performing parameter set found by the search. Because the pipeline is reused, the imputation is fitted again as well.

In [None]:
final_pipeline = pipeline.set_params(**best_params).fit(X_train, y_train, model__sample_weight=sample_weight)

# Model evaluation

Make predictions on the unseen test set

In [None]:
# Probability prediction on positive class
y_pred_prob = final_pipeline.predict_proba(X_test)[:, 1]

# Extract predicted class with threshold
threshold = 0.5
y_pred = (y_pred_prob > threshold).astype(int)

#### ROC curve

In [None]:
# Compute ROC curve and AUC score
fpr, tpr, _ = roc_curve(y_test, y_pred_prob)
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")

# Create the output directory if it does not exist yet
os.makedirs(PATH_PLOTS_AUC, exist_ok=True)
plt.savefig(PATH_PLOTS_AUC+"auc_roc_curve-DT.png")

plt.show()
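
As a small sanity check of the metric itself, the sketch below computes the area under the ROC curve on four toy predictions (illustrative values) and confirms that `auc(fpr, tpr)` agrees with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

y_true_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

fpr_demo, tpr_demo, _ = roc_curve(y_true_demo, scores_demo)
area_demo = auc(fpr_demo, tpr_demo)  # trapezoidal area under the ROC points

print(area_demo, roc_auc_score(y_true_demo, scores_demo))
```

An area of 0.5 corresponds to the dashed diagonal in the plot (random ranking), and 1.0 to a perfect ranking of positives above negatives.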

#### Other metrics

In [None]:
logloss = log_loss(y_test, y_pred_prob)

print("Logloss: %.2f" % logloss)
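
For reference, binary log loss is the average negative log-likelihood of the true labels under the predicted probabilities. The sketch below computes it by hand on toy values (illustrative only) and compares the result with `log_loss`.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true_demo = np.array([1, 0, 1, 0])
p_pos_demo = np.array([0.9, 0.2, 0.6, 0.4])  # predicted probability of class 1

# -mean( y * log(p) + (1 - y) * log(1 - p) )
manual_logloss = -np.mean(
    y_true_demo * np.log(p_pos_demo)
    + (1 - y_true_demo) * np.log(1 - p_pos_demo)
)

print(manual_logloss, log_loss(y_true_demo, p_pos_demo))
```

Lower is better; confident but wrong probabilities are penalized much more heavily than hesitant ones.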

In [None]:
f1_weighted = f1_score(y_test, y_pred, average='weighted')
f1_target_pos = f1_score(y_test, y_pred, average='binary', pos_label=1)
f1_target_neg = f1_score(y_test, y_pred, average='binary', pos_label=0)

print("Weighted F1 score: %.2f" % f1_weighted)
print("F1 score positive class: %.2f" % f1_target_pos)
print("F1 score negative class: %.2f" % f1_target_neg)
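
The weighted F1 score is the average of the per-class F1 scores, weighted by the number of true observations in each class. A minimal sketch on toy labels (illustrative values):

```python
import numpy as np
from sklearn.metrics import f1_score

y_true_demo = np.array([0, 0, 0, 0, 1, 1])
y_hat_demo = np.array([0, 0, 0, 1, 1, 0])

f1_neg_demo = f1_score(y_true_demo, y_hat_demo, pos_label=0)
f1_pos_demo = f1_score(y_true_demo, y_hat_demo, pos_label=1)

# Support = number of true observations per class
support = np.bincount(y_true_demo)
manual_weighted = (support[0] * f1_neg_demo + support[1] * f1_pos_demo) / support.sum()

print(manual_weighted, f1_score(y_true_demo, y_hat_demo, average='weighted'))
```

On an imbalanced target the weighted score is dominated by the majority class, which is why the per-class scores are reported alongside it.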

#### Save results

In [None]:
# Create the results directory and the sub-directories written to below
for subdir in ('all', 'fpr', 'tpr'):
    os.makedirs(os.path.join(PATH_RESULTS, subdir), exist_ok=True)

In [None]:
df_results = pd.DataFrame({'Model': ['Decision Tree'],
                           'ROC-AUC': [roc_auc],
                           'LogLoss': [logloss],
                           'F1 Weighted-averaged': [f1_weighted],
                           'F1 Default=1': [f1_target_pos],
                           'F1 Default=0': [f1_target_neg]})

# Save
df_results.to_csv(PATH_RESULTS+'all/evaluation-DT.csv', header=True, index=False)

In [None]:
fpr_df = pd.DataFrame({'Decision Tree': fpr})
tpr_df = pd.DataFrame({'Decision Tree': tpr})

# Save
fpr_df.to_csv(PATH_RESULTS+'fpr/evaluation_fpr-DT.csv', header=True, index=False)
tpr_df.to_csv(PATH_RESULTS+'tpr/evaluation_tpr-DT.csv', header=True, index=False)

# Save model locally

In [None]:
# Create the directory for the model dump if it does not exist yet
os.makedirs(ML_MODEL_PATH, exist_ok=True)

In [None]:
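# Extract the fitted model from the trained pipeline.
# Note: only the tree is saved; the imputation step is not part of the dump.
final_model = final_pipeline.named_steps['model']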

# Save
joblib.dump(final_model, ML_MODEL_PATH+'DT.pkl')
print("Model saved successfully.")
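
Loading the model back works the same way in reverse. The sketch below does a round trip with a tiny stand-in tree written to a temporary directory; the file name `DT_demo.pkl` and the toy data are assumptions for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny stand-in model trained on toy data (illustrative only)
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
tree_demo = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'DT_demo.pkl')
    joblib.dump(tree_demo, demo_path)
    restored = joblib.load(demo_path)

# The restored model reproduces the original predictions
same_predictions = np.array_equal(tree_demo.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

A pickled model should be loaded with the same scikit-learn version it was saved with, and only from trusted sources.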

# Extra: Model Feature Importance

In [None]:
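# Extract feature importance. The names come from the fitted column transformer,
# which outputs the imputed column first and the passthrough columns after it
feature_importance = final_model.feature_importances_
transformer = final_pipeline.named_steps['median_imputer']
feature_names = [name.split('__', 1)[-1] for name in transformer.get_feature_names_out()]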

# Create a DataFrame for better visualization
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importance
})

# Sort features by importance
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Plot the feature importance
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importance_df, palette='viridis', hue='Feature', dodge=False)
plt.title('Decision Tree - Feature Importance')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.tight_layout()

# Create the directory for the plot if it does not exist yet
os.makedirs(PATH_PLOTS_FEATURE_IMPORTANCE, exist_ok=True)

plt.savefig(PATH_PLOTS_FEATURE_IMPORTANCE+"feature_importance-DT.png")

plt.show()

# Extra: Inspect Tree Rules

In [None]:
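# Plot the decision tree, naming features in the column order seen by the model
transformer = final_pipeline.named_steps['median_imputer']
tree_feature_names = [name.split('__', 1)[-1] for name in transformer.get_feature_names_out()]
plt.figure(figsize=(20, 10))
plot_tree(final_model, filled=True, feature_names=tree_feature_names, class_names=['0', '1'])
os.makedirs('plots/models', exist_ok=True)
plt.savefig('plots/models/rules_decision_tree.png', dpi=300)  # high resolution for readability
plt.show()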

In [None]:
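# Text version, with feature names in the column order seen by the model
transformer = final_pipeline.named_steps['median_imputer']
rule_feature_names = [name.split('__', 1)[-1] for name in transformer.get_feature_names_out()]
tree_rules = export_text(final_model, feature_names=rule_feature_names, show_weights=True)
print(tree_rules)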