# Detecting payment card fraud with Temporian and TensorFlow Decision Forests

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google/temporian/blob/main/docs/src/tutorials/bank_fraud_detection_with_tfdf.ipynb)

Detection of fraud in online banking is critical for banks, businesses, and their consumers. The book "[Reproducible Machine Learning for Credit Card Fraud Detection](https://fraud-detection-handbook.github.io/fraud-detection-handbook/Foreword.html)" by Le Borgne et al. introduces the problem of payment card fraud and shows how fraud can be detected using machine learning. However, since banking transactions are sensitive and not widely available, the book uses a synthetic dataset for practical exercises.

This notebook uses the same dataset to show how to use Temporian and [TensorFlow Decision Forests](https://www.tensorflow.org/decision_forests) to detect fraud. Temporian is used for data preprocessing and augmentation, while TensorFlow Decision Forests is used for model training. Data augmentation is often critical for temporal data, and this notebook demonstrates how complex data augmentation can be performed with ease using Temporian.

The notebook is divided into three parts:

1. Download the dataset and load it as a Temporian EventSet.
1. Perform various types of augmentations and visualize the correlation between the augmented features and fraud target labels.
1. Train and evaluate a machine learning model to detect fraud using the augmented features.


*Note: This notebook assumes a basic understanding of Temporian. If you are not familiar with Temporian, we recommend that you read the [3 minutes guide to Temporian](https://temporian.readthedocs.io/en/latest/3_minutes) first.*



## Install and import dependencies


In [None]:
# For data preprocessing and augmentation
%pip install temporian -q

# For model training
%pip install tensorflow tensorflow_decision_forests -q

# To plot the results
%pip install seaborn -q

# To compute the ROC curve of the model
%pip install scikit-learn -q

In [None]:
import os
import datetime
import concurrent.futures

import pandas as pd
import temporian as tp
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow_decision_forests as tfdf
from sklearn import metrics as sk_metrics

The dataset consists of banking transactions sampled between April 1, 2018 and September 30, 2018. The transactions are stored in pickle (`.pkl`) files, one for each day. The transactions from April 1, 2018 to August 31, 2018 (inclusive) are used for training, while the transactions from September 1, 2018 to September 30, 2018 are used for evaluation.

In [None]:
start_date = datetime.date(2018, 4, 1)
end_date = datetime.date(2018, 9, 30)
train_test_split = datetime.datetime(2018, 9, 1)

# Note: You can reduce the end and train/test split dates to speed up the notebook execution.

# List the input files (one per day)
filenames = []
while start_date <= end_date:
    filenames.append(f"{start_date}")
    start_date += datetime.timedelta(days=1)
print(f"{len(filenames)} dates")

The dataset is downloaded and concatenated into a single Pandas DataFrame.

In [None]:
def load_date(filename):
    print(".",end="", flush=True)
    return pd.read_pickle(f"https://github.com/Fraud-Detection-Handbook/simulated-data-raw/raw/main/data/{filename}.pkl")

print("Downloading dataset",end="")
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    frames = executor.map(load_date, filenames)
dataset_pd = pd.concat(frames)
print("done")
print(f"Found {len(dataset_pd)} transactions")

dataset_pd

We only keep the following columns of interest:

- **TX_DATETIME**: The date and time of the transaction.
- **CUSTOMER_ID**: The unique identifier of the customer.
- **TERMINAL_ID**: The identifier of the terminal where the transaction was made.
- **TX_AMOUNT**: The amount of the transaction.
- **TX_FRAUD**: Whether the transaction is fraudulent (1) or not (0).

Our goal is to predict whether a transaction is fraudulent at the time of the transaction, using only the information from this transaction and previous transactions. The information about whether a transaction is fraudulent is not known at the time of the transaction. Instead, it is only known one week after the transaction. While this is too late to prevent the fraudulent transaction, it is available for future transactions.
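
The one-week label delay can be made concrete with a small plain-Python sketch. It does not use Temporian; the `known_labels` helper, the `LABEL_DELAY` constant, and the sample history are hypothetical names and data introduced only for illustration, and the delay is assumed to be exactly seven days.

```python
import datetime

# Assumption: a fraud label becomes available exactly one week after the transaction.
LABEL_DELAY = datetime.timedelta(weeks=1)

def known_labels(history, now):
    """Return the (timestamp, is_fraud) pairs whose label is already known at `now`."""
    return [(ts, fraud) for ts, fraud in history if ts + LABEL_DELAY <= now]

# Hypothetical transaction history of one customer.
history = [
    (datetime.datetime(2018, 4, 1, 10, 0), 0),
    (datetime.datetime(2018, 4, 5, 9, 30), 1),
    (datetime.datetime(2018, 4, 9, 18, 0), 0),
]

# When scoring a new transaction on April 10, only the April 1 label is usable.
now = datetime.datetime(2018, 4, 10, 12, 0)
visible = known_labels(history, now)
print(visible)
```

The fraud of April 5 only becomes visible on April 12, which is why the features built later in this notebook lag the label by one week.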


In [None]:
dataset_pd = dataset_pd[["TX_DATETIME", "CUSTOMER_ID", "TERMINAL_ID", "TX_AMOUNT", "TX_FRAUD"]]

Convert the Pandas DataFrame into a Temporian EventSet.

In [None]:
dataset_tp = tp.from_pandas(dataset_pd, timestamps="TX_DATETIME")

dataset_tp

We can plot the whole dataset, but the resulting plot will be very busy because all the transactions are currently grouped together. 

In [None]:
dataset_tp.plot()

Instead, we can plot the transactions of a single customer.

In [None]:
tp.add_index(dataset_tp, "CUSTOMER_ID").plot(indexes="3774")

# Same plot as:
# tp.filter(dataset_tp, tp.equal(dataset_tp["CUSTOMER_ID"], "3774")).plot()

After exploring the dataset, we want to compute some augmented features that may correlate with fraudulent activities. We will compute the following three features:

**Calendar features**: We will extract the hour of the day and the day of the week as individual features. This is because fraudulent transactions may be more likely to occur at specific times.

**Moving sum of fraud per customer**: For each customer, we will count the fraudulent transactions in the last 4 weeks. This is because a customer who starts to experience fraud (for example, because a card was stolen) may be more likely to see further fraudulent transactions in the future. However, since we only know after a week if a transaction is fraudulent, there will be a lag in this feature.

**Moving sum of fraud per terminal**: For each terminal, we will extract the number of fraudulent transactions in the last 4 weeks. This is because some fraudulent transactions may be caused by ATM skimmers. In this case, many transactions from the same terminal may be fraudulent. However, since we only know after a week if a transaction is fraudulent, there will be a lag in this feature as well.
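
To make the lagged moving sum concrete, here is a minimal plain-Python sketch of the idea for a single grouping key (a customer or a terminal). It is not how Temporian computes the feature. The function name `lagged_fraud_counts`, the sample data, and the window boundaries `(t - WINDOW, t]` are assumptions of this sketch.

```python
import datetime
from collections import defaultdict

LAG = datetime.timedelta(weeks=1)     # delay before a fraud label is known
WINDOW = datetime.timedelta(weeks=4)  # length of the moving window

def lagged_fraud_counts(transactions):
    """For each transaction, count the frauds of the same key whose label
    became known within the last WINDOW.

    `transactions` is a list of (timestamp, key, is_fraud) tuples.
    The window boundaries, (t - WINDOW, t], are an assumption of this sketch.
    """
    # For each key, collect the times at which fraud labels become known.
    known_at = defaultdict(list)
    for ts, key, is_fraud in transactions:
        if is_fraud:
            known_at[key].append(ts + LAG)

    counts = []
    for ts, key, _ in transactions:
        counts.append(sum(1 for k in known_at[key] if ts - WINDOW < k <= ts))
    return counts

d = datetime.datetime
# Hypothetical transactions on two terminals.
transactions = [
    (d(2018, 4, 1), "T1", 1),
    (d(2018, 4, 3), "T1", 0),   # the April 1 fraud is not known yet
    (d(2018, 4, 9), "T1", 0),   # the April 1 fraud has been known since April 8
    (d(2018, 4, 9), "T2", 0),   # different terminal, nothing known
    (d(2018, 5, 10), "T1", 0),  # the April 1 label has left the 4-week window
]
print(lagged_fraud_counts(transactions))  # [0, 0, 1, 0, 0]
```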

Data augmentation features often have parameters that need to be selected. For example, why look at the last 4 weeks instead of the last 8 weeks? In practice, you will want to compute the features with many different parameter values (e.g., 1 day, 2 days, 1 week, 2 weeks, 4 weeks, and 8 weeks). However, to keep this example simple, we will only use 4 weeks here.
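
As a rough illustration of such a parameter sweep, the plain-Python sketch below counts events over several window lengths at once. It is independent of Temporian; `moving_counts`, the feature names, and the sample timestamps are hypothetical, and the window is assumed to be `(t - window, t]`.

```python
import bisect
import datetime

def moving_counts(event_times, sample_times, windows):
    """Count events in (t - window, t] for every sample time t and every window.

    Returns a dict mapping a feature name to one count per sample time.
    """
    event_times = sorted(event_times)
    features = {}
    for name, window in windows.items():
        counts = []
        for t in sample_times:
            hi = bisect.bisect_right(event_times, t)           # events <= t
            lo = bisect.bisect_right(event_times, t - window)  # events <= t - window
            counts.append(hi - lo)
        features[f"moving_sum_frauds_{name}"] = counts
    return features

d = datetime.datetime
# Hypothetical times at which fraud labels became known for one customer.
fraud_known_times = [d(2018, 4, 2), d(2018, 4, 20), d(2018, 5, 1)]
# Hypothetical transaction times at which the features are evaluated.
sample_times = [d(2018, 4, 20, 12), d(2018, 5, 5)]

windows = {
    "1d": datetime.timedelta(days=1),
    "1w": datetime.timedelta(weeks=1),
    "4w": datetime.timedelta(weeks=4),
    "8w": datetime.timedelta(weeks=8),
}
features = moving_counts(fraud_known_times, sample_times, windows)
for name, values in features.items():
    print(name, values)
```

Each window length yields one candidate feature, and the model (or a feature selection step) can then decide which ones are useful.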


In [None]:
@tp.compile
def augment_transactions(transactions: tp.EventSetNode) -> tp.EventSetNode:
    print("TRANSACTIONS:\n", transactions.schema, sep = '')

    # Create a unique ID for each transaction.
    transaction_id = tp.rename(tp.enumerate(transactions), "transaction_id")
    transactions = tp.glue(transactions, transaction_id)

    # 1.
    # Hour of day and day of week of each transaction.
    calendar = tp.glue(
        tp.calendar_hour(transactions),
        tp.calendar_day_of_week(transactions),
        transactions["transaction_id"],
    )
    print("CALENDAR:\n", calendar.schema, sep = '')

    # 2.
    # Index the transactions per customer
    per_customer = tp.add_index(transactions, "CUSTOMER_ID")
    # Lag the fraud by 1 week
    lagged_fraud_per_customer = tp.lag(per_customer["TX_FRAUD"], tp.duration.weeks(1))
    # Moving sum of (lagged) fraudulent transactions over the last 4 weeks
    feature_per_customer = tp.moving_sum(lagged_fraud_per_customer, tp.duration.weeks(4), sampling=per_customer)
    # Rename the feature for book-keeping
    feature_per_customer = tp.rename(feature_per_customer, "per_customer.moving_sum_frauds")
    # Aggregate the newly computed feature with the other customer features.
    feature_per_customer = tp.glue(feature_per_customer, per_customer)
    # Print the schema, including the newly computed feature
    print("PER CUSTOMER:\n", feature_per_customer.schema, sep = '')

    # 3.
    # The moving sum of fraud per terminal is similar to the moving sum per customer.
    # Instead of indexing by customer, the dataset is indexed by terminal.
    per_terminal = tp.add_index(transactions, "TERMINAL_ID")
    lagged_fraud_per_terminal = tp.lag(per_terminal["TX_FRAUD"], tp.duration.weeks(1))
    feature_per_terminal = tp.moving_sum(lagged_fraud_per_terminal, tp.duration.weeks(4), sampling=per_terminal)
    feature_per_terminal = tp.rename(feature_per_terminal, "per_terminal.moving_sum_frauds")
    feature_per_terminal = tp.glue(feature_per_terminal, per_terminal)
    print("PER TERMINAL:\n", feature_per_terminal.schema, sep = '')

    # Join the per customer and per terminal features
    augmented_transactions = tp.join(
        tp.drop_index(feature_per_terminal),
        tp.drop_index(feature_per_customer)[["per_customer.moving_sum_frauds","transaction_id"]],
        on="transaction_id")
    
    # Join the calendar features
    augmented_transactions = tp.join(
        augmented_transactions,
        calendar[["calendar_hour", "calendar_day_of_week", "transaction_id"]],
        on="transaction_id")
    
    print("AUGMENTED TRANSACTIONS:\n", augmented_transactions.schema, sep = '')

    return augmented_transactions

# Compute the augmented features
augmented_dataset_tp = augment_transactions(dataset_tp)

Plot the augmented features for the selected customer.

In [None]:
tp.add_index(augmented_dataset_tp, "CUSTOMER_ID").plot(indexes="3774")

Save the Temporian program to compute the augmented transactions. We will not use this program again in this notebook, but in practice, this data augmentation stage should be included with the model.

A saved Temporian program can also be applied on a large dataset that does not fit in memory using the Beam backend.

In [None]:
tp.save(augment_transactions, inputs={"transactions":dataset_tp.schema}, path="/tmp/augment_transactions.tempo")

Convert the Temporian EventSet into a Pandas DataFrame and plot the relation between the augmented features and the label.

**Observations:** The features `per_terminal.moving_sum_frauds` and `per_customer.moving_sum_frauds` seem to discriminate between fraudulent and non-fraudulent transactions, while the calendar features are not discriminative.

In [None]:
# Temporian EventSet to Pandas DataFrame
augmented_dataset_pd = tp.to_pandas(augmented_dataset_tp)

fig, axs = plt.subplots(ncols=2, nrows=2, figsize=(10, 8))

sns.ecdfplot(data=augmented_dataset_pd, x="per_terminal.moving_sum_frauds", hue="TX_FRAUD", ax=axs[0,0])
sns.ecdfplot(data=augmented_dataset_pd, x="per_customer.moving_sum_frauds", hue="TX_FRAUD", ax=axs[1,0])
sns.ecdfplot(data=augmented_dataset_pd, x="calendar_hour", hue="TX_FRAUD", ax=axs[0,1])
sns.ecdfplot(data=augmented_dataset_pd, x="calendar_day_of_week", hue="TX_FRAUD", ax=axs[1,1])
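
To read these plots, it helps to recall what an empirical CDF measures: for a value `x`, the fraction of observations that are less than or equal to `x`. The plain-Python sketch below evaluates it on made-up feature values; `ecdf_at` and both lists are hypothetical and are not taken from the dataset.

```python
def ecdf_at(values, x):
    """Empirical CDF: fraction of `values` that are less than or equal to `x`."""
    return sum(v <= x for v in values) / len(values)

# Hypothetical values of a "moving sum of frauds" feature, split by label.
legit = [0, 0, 0, 0, 1, 0, 0, 2]
fraud = [0, 3, 5, 4, 1, 6]

# A large gap between the two curves at the same x suggests a discriminative feature.
print(ecdf_at(legit, 0))  # 0.75
print(ecdf_at(fraud, 0))
```

A feature whose two curves overlap almost everywhere, as the calendar features do here, carries little information about the label on its own.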

The next step is to split the dataset into a training and testing dataset.

One common approach is to use the `tp.timestamps` operator. This operator converts the timestamp of a transaction into a feature that can be compared to `train_test_split`.

In [None]:
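# Interpret the split date as UTC. Calling `.timestamp()` on a naive datetime
# would otherwise use the local time zone of the machine.
split_timestamp = train_test_split.replace(tzinfo=datetime.timezone.utc).timestamp()
is_train = tp.timestamps(augmented_dataset_tp) < split_timestamp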
is_test = tp.invert(is_train)

# Plot
is_train = tp.rename(is_train, "is_train")
is_test = tp.rename(is_test, "is_test")
tp.plot([is_train, is_test])
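
The comparison above boils down to comparing Unix timestamps. The following plain-Python sketch shows the same split on a few made-up transaction times, under the assumption that all dates are interpreted as UTC. The helper `to_unix_seconds_utc` and the sample times are hypothetical.

```python
import datetime

def to_unix_seconds_utc(dt):
    """Convert a naive datetime, interpreted as UTC, to Unix seconds."""
    return dt.replace(tzinfo=datetime.timezone.utc).timestamp()

split = to_unix_seconds_utc(datetime.datetime(2018, 9, 1))

# Hypothetical transaction times.
transaction_times = [
    datetime.datetime(2018, 8, 31, 23, 59, 59),
    datetime.datetime(2018, 9, 1, 0, 0, 0),
    datetime.datetime(2018, 9, 15, 8, 0, 0),
]

# A transaction is a training example if it happened strictly before the split.
is_train_flags = [to_unix_seconds_utc(t) < split for t in transaction_times]
is_test_flags = [not flag for flag in is_train_flags]

print(split)           # 1535760000.0
print(is_train_flags)  # [True, False, False]
```

Pinning the time zone explicitly keeps the split identical regardless of the machine on which the notebook runs.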

An alternative and equivalent solution is to create a demarcating event that separates the training and testing examples. We can then use the `tp.since_last` and `tp.isnan` functions to compute for each transaction whether the demarcating event has already been seen.

In [None]:
# Create a demarcating event.
train_test_switch_tp = tp.event_set(timestamps=[train_test_split])

# Plot
train_test_switch_tp.plot()

# All the transactions before the demarcating event are part of the training dataset (i.e. `is_train=True`) 
is_train = tp.isnan(tp.since_last(train_test_switch_tp, augmented_dataset_tp))
is_test = tp.invert(is_train)

# Plot
is_train = tp.rename(is_train, "is_train")
is_test = tp.rename(is_test, "is_test")
tp.plot([is_train, is_test])
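
The logic of this second approach can be sketched in plain Python. The `since_last` function below is a simplified stand-in written for this illustration, not Temporian's implementation; in particular, treating an event at exactly the sample time as already seen is an assumption, and the numeric timestamps are made up.

```python
import math

def since_last(event_times, sample_times):
    """Time elapsed since the most recent event at or before each sample time.

    Returns NaN for sample times that precede every event.
    """
    result = []
    for t in sample_times:
        past = [e for e in event_times if e <= t]
        result.append(t - max(past) if past else math.nan)
    return result

switch = [100.0]                           # hypothetical demarcating event (seconds)
transactions = [10.0, 99.0, 100.0, 250.0]  # hypothetical transaction times (seconds)

elapsed = since_last(switch, transactions)
# NaN means the demarcating event has not been seen yet: a training example.
is_train_flags = [math.isnan(x) for x in elapsed]
is_test_flags = [not flag for flag in is_train_flags]

print(elapsed)
print(is_train_flags)  # [True, True, False, False]
```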

We can now split the dataset into training and testing.

In [None]:
augmented_dataset_train_tp = tp.filter(augmented_dataset_tp, is_train)
augmented_dataset_test_tp = tp.filter(augmented_dataset_tp, is_test)

# Print the schema of the training dataset
augmented_dataset_train_tp.schema.features

We first convert the Temporian EventSets into Pandas DataFrames. Then, we use the `tfdf.keras.pd_dataframe_to_tf_dataset` function to convert these DataFrames into TensorFlow datasets that can be used by TensorFlow Decision Forests.

In [None]:
# Temporian EventSet to Pandas DataFrame
dataset_train_pd = tp.to_pandas(augmented_dataset_train_tp)
dataset_test_pd = tp.to_pandas(augmented_dataset_test_tp)

print(f"Train examples: {len(dataset_train_pd)}")
print(f"Test examples: {len(dataset_test_pd)}")

# Pandas DataFrame to TensorFlow Dataset
dataset_train_tf = tfdf.keras.pd_dataframe_to_tf_dataset(dataset_train_pd, label="TX_FRAUD")
dataset_test_tf = tfdf.keras.pd_dataframe_to_tf_dataset(dataset_test_pd, label="TX_FRAUD")

We can then train a TF-DF model.

In [None]:
model = tfdf.keras.GradientBoostedTreesModel(features=[tfdf.keras.FeatureUsage("per_customer.moving_sum_frauds"),
                                                       tfdf.keras.FeatureUsage("per_terminal.moving_sum_frauds"),
                                                       tfdf.keras.FeatureUsage("calendar_hour"),
                                                       tfdf.keras.FeatureUsage("calendar_day_of_week"),
                                                      ],
                                             exclude_non_specified_features=True)
model.fit(dataset_train_tf, verbose=2)

Finally, we plot the ROC (Receiver Operating Characteristic) curve and compute the AUC (Area Under the Curve).

In [None]:
# Predictions of the model
test_predictions = model.predict(dataset_test_tf, verbose=0)[:,0]

# The real fraud information
test_labels = dataset_test_pd["TX_FRAUD"].values

# Compute the ROC and AUC.
fpr, tpr, thresholds = sk_metrics.roc_curve(test_labels, test_predictions)
auc = sk_metrics.roc_auc_score(test_labels, test_predictions)

# Plot
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (AUC = %0.3f)'  % auc )
plt.show()

The augmented features we created are effective at identifying some types of fraud, as evidenced by the point (FPR=0.02, TPR=0.5) on the ROC curve.

However, for FPRs greater than 0.02, the TPR increases slowly, indicating that the remaining types of fraud are more difficult to detect. We need to conduct further analysis and create new features to improve our ability to detect these remaining frauds.
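
Reading a single operating point such as "TPR at FPR=0.02" off the ROC curve can also be done numerically. The plain-Python sketch below computes the best true positive rate achievable under a false positive rate budget. The function `tpr_at_fpr`, the labels, and the scores are made up for illustration and are unrelated to the model trained above.

```python
def tpr_at_fpr(labels, scores, max_fpr):
    """Highest true positive rate reachable with a false positive rate <= max_fpr."""
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos
    best_tpr = 0.0
    # Try every distinct score as a threshold (predict fraud if score >= threshold).
    for threshold in sorted(set(scores), reverse=True):
        true_pos = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        false_pos = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
        if false_pos / num_neg <= max_fpr:
            best_tpr = max(best_tpr, true_pos / num_pos)
    return best_tpr

# Hypothetical labels (1 = fraud) and model scores.
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
scores = [0.1, 0.2, 0.1, 0.3, 0.7, 0.9, 0.8, 0.2, 0.6, 0.05]

print(tpr_at_fpr(labels, scores, max_fpr=0.0))  # 0.5
print(tpr_at_fpr(labels, scores, max_fpr=0.2))  # 0.75
```

On the real test set, the same quantity could be derived from the `fpr` and `tpr` arrays returned by `sk_metrics.roc_curve`.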

Do you have any ideas for other features or feature augmentations that could improve the model's performance? For example, we could compute features per customer and per terminal, or we could create features related to transaction amount. These changes could help us reach an AUC of >0.88.
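
As one possible starting point for an amount-based feature, the plain-Python sketch below compares each transaction amount to the customer's recent average. This is only an idea to experiment with, not a feature from the handbook or from this notebook; `amount_vs_recent_mean`, the 4-week window, and the sample data are assumptions. In Temporian, a similar feature could presumably be built from a moving average per customer.

```python
import datetime
from collections import defaultdict

WINDOW = datetime.timedelta(weeks=4)  # assumed look-back period

def amount_vs_recent_mean(transactions):
    """Ratio between each amount and the customer's mean amount over the
    previous WINDOW, excluding the current transaction.

    `transactions` is a list of (timestamp, customer_id, amount) sorted by time.
    Amounts are assumed to be strictly positive. Returns None when the customer
    has no earlier transaction in the window.
    """
    history = defaultdict(list)  # customer_id -> [(timestamp, amount), ...]
    ratios = []
    for ts, customer, amount in transactions:
        recent = [a for t, a in history[customer] if ts - WINDOW < t < ts]
        ratios.append(amount / (sum(recent) / len(recent)) if recent else None)
        history[customer].append((ts, amount))
    return ratios

d = datetime.datetime
# Hypothetical transactions of two customers.
transactions = [
    (d(2018, 4, 1), "C1", 20.0),
    (d(2018, 4, 2), "C1", 40.0),
    (d(2018, 4, 3), "C2", 10.0),
    (d(2018, 4, 10), "C1", 300.0),  # ten times the recent average of C1
]
print(amount_vs_recent_mean(transactions))  # [None, 2.0, None, 10.0]
```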
