### 1. Introduction

Credit default prediction is central to managing risk in a consumer lending business. It allows lenders to optimize lending decisions, which leads to a better customer experience and sound business economics. Models already exist to help manage this risk, but it is possible to build better ones that outperform those currently in use.

In this notebook, I will apply my machine learning skills to predict credit default. Specifically, I will use the data set from Kaggle's "American Express - Default Prediction" competition to build a machine learning model. The training, validation, and testing datasets include time-series behavioral data and anonymized customer profile information.


The objective is to predict the probability that a customer will not pay back their credit card balance in the future, based on their monthly customer profile. The binary target variable is calculated by observing an 18-month performance window after the latest credit card statement. If the customer does not pay the amount due within 120 days of their latest statement date, it is considered a default event.


The dataset contains aggregated profile features for each customer at each statement date. Features are anonymized and normalized, and fall into the following general categories:

- D_* = Delinquency variables
- S_* = Spend variables
- P_* = Payment variables
- B_* = Balance variables
- R_* = Risk variables

The following features are categorical:

['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']
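
As a small illustration of the naming scheme above, the sketch below maps a column name to its category by prefix. The helper name `feature_category` and the `"Unknown"` fallback are my own assumptions for illustration; they are not part of the competition data or API.

```python
# Prefix-to-category mapping taken from the list above.
PREFIX_TO_CATEGORY = {
    "D": "Delinquency",
    "S": "Spend",
    "P": "Payment",
    "B": "Balance",
    "R": "Risk",
}

def feature_category(column: str) -> str:
    """Return the category of an anonymized feature such as 'D_39' (hypothetical helper)."""
    prefix = column.split("_", 1)[0]
    # Columns without a known prefix (for example 'customer_ID') fall back to "Unknown".
    return PREFIX_TO_CATEGORY.get(prefix, "Unknown")

example_columns = ["P_2", "D_39", "S_3", "B_4", "R_1"]
categories = {col: feature_category(col) for col in example_columns}
print(categories)
```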


My purpose in this notebook is to explore and model the time series component of the data.

### 2. Key Findings

- In this notebook, I trained a simple LSTM network on customers' time-series credit card data.

- I was only able to include a subset of the data in the notebook, because I ran into memory issues otherwise.

- The competition metric I got from this model was 0.635. To improve this result, I will try to solve the memory issue so that I can train the model on the whole dataset.

### 3. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from kerastuner.tuners import RandomSearch
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Dense

### 4. Load Dataset

The dataset of this competition has a considerable size. If I read the original csv files, the data barely fits into memory. That's why I read the data from @munumbutt's AMEX-Feather-Dataset. In this Feather file, the floating point precision has been reduced from 64 bit to 16 bit. And reading a Feather file is faster than reading a csv file because the Feather file format is binary.
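
To give a feel for what the reduced precision buys, here is a minimal sketch on synthetic data (not the competition data). It assumes only that the numerical columns start out as float64; the column names `f0` to `f4` are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a block of numerical features (synthetic data).
rng = np.random.default_rng(0)
df64 = pd.DataFrame(rng.random((1000, 5)), columns=[f"f{i}" for i in range(5)])

# Downcast every float64 column to float16.
df16 = df64.astype(np.float16)

bytes64 = df64.memory_usage(index=False).sum()
bytes16 = df16.memory_usage(index=False).sum()
print(bytes64, bytes16)

# The price of the smaller footprint is a rounding error in every value.
max_abs_error = (df64 - df16.astype(np.float64)).abs().to_numpy().max()
print(max_abs_error)
```

The memory footprint drops by a factor of four, while each value picks up a small rounding error. Whether that error matters depends on the model; I have not measured its effect here.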

There are 5.5 million rows of data in this dataset. Because I ran out of memory in the model training stage, I am only reading the first 900,000 rows.

I will focus on the numerical features. There are 11 categorical features, which I will drop.

In [None]:
train_data = pd.read_feather("../input/amexfeather/train_data.ftr").iloc[:900000,:]
categorical_features = ['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']

train_data = train_data.drop(columns = categorical_features)
# Parse the statement date; no explicit format is passed so that pandas can infer it.
train_data['Date Time'] = pd.to_datetime(train_data['S_2'])
train_data = train_data.drop(columns='S_2')

### 5. Preprocess the Data

80% of the customers have 13 statements. The remaining 20% have between 1 and 12 statements. For the purpose of time-series modelling, I will only include customers that have 13 statements.

In [None]:
train_data.customer_ID.value_counts().value_counts().sort_index(ascending=False)

In [None]:
train_data = train_data[train_data['customer_ID'].map(train_data['customer_ID'].value_counts()) == 13]
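
The filter above can be hard to read at first glance, so here is the same pattern on a tiny synthetic frame. The required count is 3 instead of 13 purely to keep the example small; the data values are made up.

```python
import pandas as pd

# Synthetic example: customer "a" has 3 statements, "b" has 2, "c" has 3.
toy = pd.DataFrame({
    "customer_ID": ["a", "a", "a", "b", "b", "c", "c", "c"],
    "P_2": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
})
required_statements = 3  # 13 in the notebook; 3 here to keep the example small

# value_counts() gives statements per customer; map() broadcasts that count to every row.
counts_per_row = toy["customer_ID"].map(toy["customer_ID"].value_counts())
complete = toy[counts_per_row == required_statements]
print(complete["customer_ID"].unique())
```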

A significant amount of data is missing. It is not reasonable to drop every column or row that has a missing value.

Neural networks cannot deal with missing values, so I need to impute them. I will sort the training data first by customer_ID and then by statement date. I will then fill the missing values by linear interpolation.

In [None]:
null= pd.DataFrame(train_data.isnull().sum(),columns=['number_of_nulls'])
null['percentage_of_null'] = round(((null['number_of_nulls']/len(train_data))*100) , 2)
null = null[null['number_of_nulls']>0]
null= null.sort_values(by='percentage_of_null',ascending=False)
null.head(10)

In [None]:
# 'S_2' was dropped after conversion, so sort on the parsed 'Date Time' column instead.
train_data = train_data.sort_values(['customer_ID', 'Date Time'])
train_data.interpolate(method='linear', inplace=True, limit_direction='both')
#train_data = train_data.fillna(0)
train_data.isnull().sum()
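
One thing to keep in mind is that `DataFrame.interpolate` works down each column without regard to customer boundaries, so a gap at the start or end of one customer's history can be filled from a neighbouring customer. The sketch below, on synthetic data, contrasts that with interpolating inside each customer via `groupby`. This is an alternative I am illustrating, not what the cell above does.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "customer_ID": ["a", "a", "a", "b", "b", "b"],
    "B_3": [1.0, np.nan, 3.0, np.nan, 10.0, 20.0],
})

# Interpolating the whole column lets customer "a" influence customer "b".
global_fill = toy["B_3"].interpolate(method="linear", limit_direction="both")

# Interpolating inside each customer keeps the histories separate.
per_customer_fill = toy.groupby("customer_ID")["B_3"].transform(
    lambda s: s.interpolate(method="linear", limit_direction="both")
)
print(global_fill.tolist())
print(per_customer_fill.tolist())
```

In the global version, the first value of customer "b" is pulled toward the last value of customer "a". In the grouped version it is filled from customer "b" alone.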

In [None]:
train_data.describe()

Let's visualize a few features as a function of time and see how they evolve.

We have 177 features. I will plot the top 10, selected according to the LightGBM feature importance I computed in another notebook.

In [None]:
# Most important features based on LGBM feature importance
top_features = ["P_2","D_39", "S_3","B_4", "D_43","D_42","B_3","B_5","D_46","D_49"]
titles = top_features

colors = ["blue",  "orange",  "green",  "red",  "purple",  "brown",  "pink",  "gray",  "olive",  "cyan"]

def show_raw_visualization(data):
    time_data = data['Date Time']  # 'S_2' was dropped after conversion to 'Date Time'
    fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(15, 20), dpi=80, facecolor="w", edgecolor="k")
    for i in range(len(top_features)):
        key = top_features[i]
        c = colors[i % (len(colors))]
        t_data = data[key].copy()  # copy so the caller's frame is not modified
        t_data.index = time_data
        ax = t_data.plot(
            ax=axes[i // 2, i % 2],
            color=c,
            title="{}".format(titles[i]),
            rot=25)
        ax.legend([titles[i]])
    plt.tight_layout()

show_raw_visualization(train_data)

### Standardize the Data

Each feature has a different range, which is not ideal for a neural network. In general, it is better to normalize the input values. Data normalization is a crucial preprocessing step when working with neural networks: it helps ensure stable and efficient training, faster convergence, and better generalization.

I will standardize the features before training the network by subtracting the mean and dividing by the standard deviation of each feature. This gives each feature zero mean and unit variance over the training rows; it does not confine the values to [0, 1].

In [None]:
def normalize(data, train_split):
    data_mean = data[:train_split].mean(axis=0)
    data_std = data[:train_split].std(axis=0)
    return (data - data_mean) / data_std
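
As a quick sanity check of what this helper does, the sketch below runs the same logic on synthetic NumPy data. The array sizes and the distribution are arbitrary assumptions for illustration.

```python
import numpy as np

def normalize(data, train_split):
    # Same logic as the notebook's helper: statistics come from the training rows only.
    data_mean = data[:train_split].mean(axis=0)
    data_std = data[:train_split].std(axis=0)
    return (data - data_mean) / data_std

rng = np.random.default_rng(0)
features = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic data
scaled = normalize(features, train_split=150)

train_part = scaled[:150]
print(train_part.mean(axis=0))  # close to 0 for every feature
print(train_part.std(axis=0))   # close to 1 for every feature
print(scaled.min(), scaled.max())
```

The training rows end up with mean 0 and standard deviation 1, while individual values fall well outside [0, 1]. Using only the training rows for the statistics keeps information from the validation rows out of the scaling.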

75% of the data will be used to train the model.

25% of the data will be used for validation.

In [None]:
train_data.index = train_data['Date Time']
split_fraction = 0.75
train_split = int(split_fraction * int(train_data.shape[0]))
train_split = train_split + 13 - train_split%13
train_data.iloc[:,1:-2] = normalize(train_data.iloc[:,1:-2].values, train_split)

tra_data = train_data.iloc[:train_split]  # keep all train_split rows so the split stays a multiple of 13
val_data = train_data.iloc[train_split:]
del train_data
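
The split arithmetic above moves the split point up to a multiple of 13, so that no customer's 13 statements are divided between training and validation. A minimal sketch of just that arithmetic follows; the function name `aligned_split` and the row counts are hypothetical.

```python
SEQUENCE_LENGTH = 13  # statements per customer

def aligned_split(n_rows: int, split_fraction: float = 0.75) -> int:
    """Mirror the notebook's arithmetic: move the raw split up to a multiple of 13."""
    raw_split = int(split_fraction * n_rows)
    return raw_split + SEQUENCE_LENGTH - raw_split % SEQUENCE_LENGTH

n_rows = 13 * 10  # 10 hypothetical customers
split = aligned_split(n_rows)
print(split, split % SEQUENCE_LENGTH)

# When the raw split is already a multiple of 13, the formula still adds one full customer.
print(aligned_split(13 * 100))
```

This only works as intended if the rows are sorted by customer and every customer has exactly 13 rows, which is what the earlier filtering step is for.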

In [None]:
x_train = tra_data.iloc[:,1:-2].values  # feature columns only; 1:-1 would also include 'target'
y_train = tra_data['target'].values
print(x_train.shape)
print(y_train.shape)

In [None]:
x_val = val_data.iloc[:,1:-2].values  # same feature slice as for training
y_val = val_data['target'].values
print(x_val.shape)
print(y_val.shape)

The timeseries_dataset_from_array function takes in a sequence of data-points gathered at equal intervals, along with time series parameters such as length of the sequences/windows, spacing between two sequence/windows, etc., to produce batches of sub-timeseries inputs and targets sampled from the main timeseries.

The training and validation datasets are built as follows.

In [None]:
dataset_train = keras.preprocessing.timeseries_dataset_from_array(
    x_train,
    y_train,
    sequence_length=13,
    sampling_rate=1,
    batch_size=512,
)


dataset_val = keras.preprocessing.timeseries_dataset_from_array(
    x_val,
    y_val,
    sequence_length=13,
    sampling_rate=1,
    batch_size=512,
)


for batch in dataset_train.take(1):
    inputs, targets = batch

print("Input shape:", inputs.numpy().shape)
print("Target shape:", targets.numpy().shape)
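
With the default `sequence_stride` of 1, a window starts at every row, so many windows span two customers. The NumPy sketch below shows the alternative of one non-overlapping window per customer. It assumes rows are sorted by customer with exactly 13 rows each; the toy sizes and values are made up.

```python
import numpy as np

SEQUENCE_LENGTH = 13
n_customers, n_features = 4, 3  # toy sizes

# Rows sorted by customer, 13 consecutive rows each (synthetic values).
x = np.arange(n_customers * SEQUENCE_LENGTH * n_features, dtype=np.float32)
x = x.reshape(n_customers * SEQUENCE_LENGTH, n_features)
y = np.repeat([0, 1, 0, 1], SEQUENCE_LENGTH)  # target is constant per customer

# One non-overlapping window per customer, with one label per window.
x_windows = x.reshape(n_customers, SEQUENCE_LENGTH, n_features)
y_windows = y[::SEQUENCE_LENGTH]

# Number of sliding windows a stride of 1 would produce, for comparison.
n_sliding = len(x) - SEQUENCE_LENGTH + 1
print(x_windows.shape, y_windows.shape, n_sliding)
```

In Keras, passing `sequence_stride=13` to `timeseries_dataset_from_array` should give the same per-customer windows. I have not rerun the notebook with that setting, so treat it as a suggestion rather than a tested change.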

### 6. Model Training

I will define a simple Long Short-Term Memory (LSTM) network using the Keras API.

LSTM (Long Short-Term Memory) is a type of recurrent neural network (RNN) architecture that is well-suited for learning from sequences of data. It is designed to overcome the limitations of traditional RNNs in capturing long-term dependencies in sequential data.

In sequential data, such as time series data, the order of the elements carries crucial information. However, traditional RNNs struggle to learn and retain information over long sequences because of the vanishing gradient problem: the gradients of the loss function tend to become very small, leading to slow or no learning. LSTMs mitigate this with a gated memory cell whose state is updated additively, which lets gradients flow across many time steps.
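
To make the gating idea concrete, here is a minimal NumPy sketch of a single LSTM time step in its textbook form. It is not Keras's internal implementation; the weight shapes, random initialization, and toy sizes are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (4*units, n_features), U: (4*units, units), b: (4*units,)."""
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0 * units:1 * units])   # input gate
    f = sigmoid(z[1 * units:2 * units])   # forget gate
    g = np.tanh(z[2 * units:3 * units])   # candidate cell state
    o = sigmoid(z[3 * units:4 * units])   # output gate
    c_t = f * c_prev + i * g              # additive cell-state update
    h_t = o * np.tanh(c_t)                # hidden state exposed to the next layer
    return h_t, c_t

rng = np.random.default_rng(0)
n_features, units, steps = 5, 4, 13
W = rng.normal(scale=0.1, size=(4 * units, n_features))
U = rng.normal(scale=0.1, size=(4 * units, units))
b = np.zeros(4 * units)

# Run the cell over a 13-step synthetic sequence, carrying h and c forward.
h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(steps, n_features)):
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)
```

The forget gate `f` scales the previous cell state and the input gate `i` decides how much new information is added, so the cell state can carry information across many steps.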

I will use binary_crossentropy as my loss function. Binary cross-entropy, also known as log loss, is a loss function used in binary classification tasks. It measures the performance of a classification model whose output is a probability value between 0 and 1.

I will use sigmoid as the activation of the output layer. The sigmoid activation function is appropriate for the output layer of a binary classification model, where it squashes the final output into the range (0, 1) so that it can be read as the probability of belonging to the positive class.
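
The two pieces fit together as in the following NumPy sketch. The logits and labels are made-up numbers, and the `eps` clipping is my own guard against `log(0)`, not necessarily what Keras uses internally.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, p, eps=1e-7):
    """Mean log loss; eps clipping avoids log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

logits = np.array([-2.0, 0.0, 3.0])   # hypothetical raw outputs of the last layer
y_true = np.array([0.0, 1.0, 1.0])

probs = sigmoid(logits)               # squashed into (0, 1)
loss = binary_cross_entropy(y_true, probs)
print(probs, loss)
```

Confident correct predictions contribute little to the loss, while the uncertain prediction of 0.5 for a true default contributes the most.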

Let's see how this simple network will perform for this dataset.

In [None]:
learning_rate=0.0001

inputs = keras.layers.Input(shape=(inputs.shape[1], inputs.shape[2]))
lstm_out = keras.layers.LSTM(32)(inputs)
outputs = keras.layers.Dense(1, activation="sigmoid")(lstm_out)


model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate), loss="binary_crossentropy")
model.summary()

I will use the ModelCheckpoint callback to regularly save checkpoints.

I will use the EarlyStopping callback to interrupt training when the validation loss is no longer improving.

In [None]:
path_checkpoint = "model_checkpoint.h5"
es_callback = keras.callbacks.EarlyStopping(monitor="val_loss", min_delta=0.00001, patience=5)

modelckpt_callback = keras.callbacks.ModelCheckpoint(
    monitor="val_loss",
    filepath=path_checkpoint,
    verbose=1,
    save_weights_only=True,
    save_best_only=True,
)

history = model.fit(
    dataset_train,
    epochs=50,
    validation_data=dataset_val,
    callbacks=[es_callback, modelckpt_callback],
)


Training stopped after 44 epochs because the validation loss was no longer improving.
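
The stopping rule can be sketched in plain Python as below. This is a simplified illustration of the patience logic with the same `patience` and `min_delta` values as above, not the Keras source code, and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=1e-5):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:   # counts as an improvement
            best = loss
            wait = 0
        else:
            wait += 1                 # one more epoch without improvement
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses: improvement for 4 epochs, then a plateau.
losses = [0.50, 0.45, 0.42, 0.41, 0.41, 0.42, 0.41, 0.43, 0.41, 0.41]
stop_epoch = early_stopping_epoch(losses)
print(stop_epoch)
```

In this toy run the best loss is reached at epoch 4, and training stops five non-improving epochs later.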

### Visualize Loss

I will visualize the loss with the function below. At some point the validation loss stops decreasing even though the training loss keeps going down, and that is when early stopping ended the training.

In [None]:
def visualize_loss(history, title):
    loss = history.history["loss"]
    val_loss = history.history["val_loss"]
    epochs = range(len(loss))
    plt.figure()
    plt.plot(epochs, loss, "b", label="Training loss")
    plt.plot(epochs, val_loss, "r", label="Validation loss")
    plt.title(title)
    plt.xlabel("Epochs")
    plt.ylabel("Loss")
    plt.legend()
    plt.show()


visualize_loss(history, "Training and Validation Loss")


### Prediction


In [None]:
y_pred = model.predict(dataset_val)

### Competition Metric

The evaluation metric, 𝑀, for this competition is the mean of two measures of rank ordering: Normalized Gini Coefficient, 𝐺, and default rate captured at 4%, 𝐷.

𝑀=0.5⋅(𝐺+𝐷)

The default rate captured at 4% is the percentage of the positive labels (defaults) captured within the highest-ranked 4% of the predictions, and represents a Sensitivity/Recall statistic.

For both of the sub-metrics 𝐺 and 𝐷, the negative labels are given a weight of 20 to adjust for downsampling.

This metric has a maximum value of 1.0.

This is the code for calculating this metric, provided by the competition.


In [None]:
def amex_metric(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:

    def top_four_percent_captured(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        df = (pd.concat([y_true, y_pred], axis='columns')
              .sort_values('prediction', ascending=False))
        df['weight'] = df['target'].apply(lambda x: 20 if x==0 else 1)
        four_pct_cutoff = int(0.04 * df['weight'].sum())
        df['weight_cumsum'] = df['weight'].cumsum()
        df_cutoff = df.loc[df['weight_cumsum'] <= four_pct_cutoff]
        return (df_cutoff['target'] == 1).sum() / (df['target'] == 1).sum()
        
    def weighted_gini(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        df = (pd.concat([y_true, y_pred], axis='columns')
              .sort_values('prediction', ascending=False))
        df['weight'] = df['target'].apply(lambda x: 20 if x==0 else 1)
        df['random'] = (df['weight'] / df['weight'].sum()).cumsum()
        total_pos = (df['target'] * df['weight']).sum()
        df['cum_pos_found'] = (df['target'] * df['weight']).cumsum()
        df['lorentz'] = df['cum_pos_found'] / total_pos
        df['gini'] = (df['lorentz'] - df['random']) * df['weight']
        return df['gini'].sum()

    def normalized_weighted_gini(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        y_true_pred = y_true.rename('prediction')
        return weighted_gini(y_true, y_pred) / weighted_gini(y_true, y_true_pred)

    g = normalized_weighted_gini(y_true, y_pred)
    d = top_four_percent_captured(y_true, y_pred)

    return 0.5 * (g + d)
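
To make the "default rate captured at 4%" term concrete, here is a compact NumPy sketch of that sub-metric on toy data. It follows the same steps as `top_four_percent_captured` above, but the function name `default_rate_captured`, its parameters, and the data are my own for illustration.

```python
import numpy as np

def default_rate_captured(y_true, y_score, top_fraction=0.04, negative_weight=20):
    """Share of all defaults found in the top-ranked `top_fraction` of total weight."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(y_score), kind="stable")  # riskiest first
    ranked = y_true[order]
    weights = np.where(ranked == 0, negative_weight, 1)
    cutoff = int(top_fraction * weights.sum())
    in_top = np.cumsum(weights) <= cutoff
    return ranked[in_top].sum() / ranked.sum()

# 5 defaults and 10 non-defaults: total weight 5 + 10 * 20 = 205, cutoff int(8.2) = 8.
y_true_toy = np.array([1, 1, 1, 0, 1, 1] + [0] * 9)
y_score_toy = np.linspace(1.0, 0.0, num=15)  # already sorted from riskiest to safest
d_toy = default_rate_captured(y_true_toy, y_score_toy)
print(d_toy)
```

Because each non-default carries a weight of 20, a single non-default ranked in fourth place uses up the rest of the 4% budget. Only the first three defaults are captured, giving 3 out of 5.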

In [None]:
# Each window's target is the label of its first row, so keep only the first len(y_pred) labels.
y_pred = pd.Series(y_pred.reshape(y_pred.shape[0])).rename('prediction')
y_true = pd.Series(y_val[:len(y_pred)]).rename('target')

In [None]:
amex_metric(y_true, y_pred)

### Insights:

- The competition metric I got from my neural network was 0.635. This is not bad, and it is better than random chance.

- There are ways to improve these results. My next step is to try the tf.data API to solve the memory issues. That will allow me to train the model on all of the training data.

- Another future step for this project would be to use Keras Tuner to tune hyperparameters such as the number of layers and the learning rate.