Copyright (c) 2023 Graphcore Ltd. All rights reserved.

# Training a GNN to do Fraud Detection on Graphcore IPUs with PyTorch Geometric

This notebook demonstrates using PyTorch Geometric on Graphcore IPUs to train a model for fraud detection using the [IEEE-CIS dataset](https://www.kaggle.com/competitions/ieee-fraud-detection/data). The approach is inspired by the [AWS Fraud Detection with GNNs](https://github.com/awslabs/realtime-fraud-detection-with-gnn-on-dgl) project. It frames the problem as a node classification task on a heterogeneous graph, in which each transaction node has a label indicating whether it is fraudulent or not.

|  Domain | Tasks | Model | Datasets | Workflow |   Number of IPUs   | Execution time |
|---------|-------|-------|----------|----------|--------------|--------------|
|   GNNs   |  Fraud detection using Node Classification  | RGCN | IEEE-CIS Fraud Detection | Training, evaluation | recommended: 16 (min: 4) | 40 min |

In this notebook, you will learn how to:

- Take the PyTorch Geometric heterogeneous graph data object created in the "Preprocessing a Tabular Dataset into a PyTorch Geometric Data Object suitable for Fraud Detection" `1_dataset_preprocessing.ipynb` notebook and prepare it for training.
- Select a model suitable for the task of predicting fraudulent transactions.
- Train the model on IPUs.
- Run validation on the trained model.

This notebook assumes some familiarity with PopTorch as well as PyTorch Geometric (PyG). For additional resources please consult:
* [PopTorch documentation](https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/index.html)
* [PopTorch examples and tutorials](https://docs.graphcore.ai/en/latest/examples.html#pytorch)
* [PyTorch Geometric documentation](https://pytorch-geometric.readthedocs.io/en/latest/)
* [PopTorch Geometric documentation](https://docs.graphcore.ai/projects/poptorch-geometric-user-guide/en/latest/index.html)

## Running on Paperspace

The Paperspace environment lets you run this notebook with no setup. To improve your experience, we preload datasets and pre-install packages. This can take a few minutes. If you experience errors immediately after starting a session, please try restarting the kernel before contacting support. If a problem persists or you want to give us feedback on the content of this notebook, please reach out through our community of developers using our [Slack channel](https://www.graphcore.ai/join-community) or raise a [GitHub issue](https://github.com/graphcore/examples).

In order to improve usability and support for future users, Graphcore would like to collect information about the
applications and code being run in this notebook. The following information will be anonymised before being sent to Graphcore:

- User progression through the notebook
- Notebook details: number of cells, code being run and the output of the cells
- Environment details

You can disable logging at any time by running `%unload_ext graphcore_cloud_tools.notebook_logging.gc_logger` from any cell.

## Dependencies and configuration

Install the dependencies the notebook needs.

In [None]:
%pip install -q -r requirements.txt
%load_ext graphcore_cloud_tools.notebook_logging.gc_logger

To improve your experience, read in some configuration related to the environment you are running the notebook in.

In [None]:
import os

number_of_ipus = int(os.getenv("NUM_AVAILABLE_IPU", 16))
pod_type = os.getenv("GRAPHCORE_POD_TYPE", "pod16")
executable_cache_dir = os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "/tmp/exe_cache/")
dataset_directory = os.getenv("DATASETS_DIR", ".") + "/ieee-fraud-detection"
checkpoint_directory = os.getenv("CHECKPOINT_DIR", ".")

## Loading the dataset

First we will load the dataset for training. We have already created a PyTorch Geometric dataset object containing the preprocessed dataset. To see the preprocessing steps, refer to the "Preprocessing a Tabular Dataset into a PyTorch Geometric Data Object suitable for Fraud Detection" `1_dataset_preprocessing.ipynb` notebook, or take a look at the dataset object directly in the `dataset.py` script.

In [None]:
from dataset import IeeeFraudDetectionDataset

dataset = IeeeFraudDetectionDataset(dataset_directory)
data = dataset[0]

Let's see what the dataset looks like:

In [None]:
data

You can see it is a single large heterogeneous graph. The node type we will train on is the `transaction` node, for which we have a label that indicates whether that transaction is fraudulent or not. We have a number of other node types detailing different properties of the transactions, with an edge going out from each transaction to each of the other node types. Full details about the structure of this data can be found in the "Preprocessing a Tabular Dataset into a PyTorch Geometric Data Object suitable for Fraud Detection" `1_dataset_preprocessing.ipynb` notebook.

## Preprocessing the dataset

Before training on this dataset, we will do some preprocessing.

As a first preprocessing step, we apply some transforms on the original graph dataset to:

 * Make the graph undirected, which will add a reverse edge type for every existing edge type.
 * Add self loops to all of the node types, which will add a self loop for each of the edge types.
 * Normalize all the node features.

In [None]:
import torch_geometric.transforms as T

data = T.ToUndirected()(data)
data = T.AddSelfLoops()(data)
data = T.NormalizeFeatures()(data)

data
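As a rough illustration of the first transform, the sketch below adds a reversed copy of every edge type to a plain dictionary of edges. It is not the PyG implementation: the `card` node type, the `to` relation name and the `rev_` prefix are assumptions chosen for the example, and edges are stored as plain Python lists rather than tensors.

```python
def add_reverse_edge_types(edge_index_dict):
    """Return a copy of the dict with a reversed edge type for every edge type."""
    result = dict(edge_index_dict)
    for (src_type, relation, dst_type), (sources, targets) in edge_index_dict.items():
        # Swap the endpoints and give the reversed relation its own name
        reverse_key = (dst_type, f"rev_{relation}", src_type)
        result[reverse_key] = (list(targets), list(sources))
    return result


# Hypothetical toy graph: three transactions connected to two cards
edges = {
    ("transaction", "to", "card"): ([0, 1, 2], [0, 0, 1]),
}
undirected = add_reverse_edge_types(edges)
print(sorted(undirected))
```

With reverse edge types in place, messages can flow from the other node types back to the transaction nodes, which is what lets the transactions gather information from their neighbours.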

Next, we will create the dataset splits. For this, we will simply use the final 20% of the nodes for validation.

In [None]:
import torch

num_nodes_train = int(0.8 * data["transaction"].num_nodes)
data["transaction"].train_mask = torch.zeros(data["transaction"].num_nodes, dtype=bool)
data["transaction"].train_mask[:num_nodes_train] = True
data["transaction"].val_mask = torch.zeros(data["transaction"].num_nodes, dtype=bool)
data["transaction"].val_mask[num_nodes_train:] = True

print(f"Number of training nodes: {data['transaction'].train_mask.sum()}")
print(f"Number of validation nodes: {data['transaction'].val_mask.sum()}")

Now let's understand how many transactions in the dataset are actually fraudulent.

In [None]:
num_fraud_train = data["transaction"].y[data["transaction"].train_mask].sum()
# Count only the nodes selected by each mask: len() of a mask is the total number of nodes
num_total_train = data["transaction"].train_mask.sum()
num_fraud_val = data["transaction"].y[data["transaction"].val_mask].sum()
num_total_val = data["transaction"].val_mask.sum()

In [None]:
# Fraction of fraudulent transactions in each split
percentage_fraud_train = num_fraud_train / num_total_train
percentage_fraud_val = num_fraud_val / num_total_val
print(f"{percentage_fraud_train = :%}")
print(f"{percentage_fraud_val = :%}")

We see that there are very few fraudulent transactions in the dataset. As this class imbalance is quite large, we could weight our loss to emphasise fraudulent transactions more than non-fraudulent ones. We can calculate a class weight which we will use later as follows:

In [None]:
# Use this to set a class weight
class_weight = (
    (num_total_train / (2 * (num_total_train - num_fraud_train))).item(),
    (num_total_train / (2 * num_fraud_train)).item(),
)
class_weight
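To make the class weight formula concrete, here is a small worked example in plain Python. The counts (1000 transactions, 40 of them fraudulent) are made up for illustration and are not taken from the dataset; the helper name `balanced_class_weights` is also hypothetical.

```python
def balanced_class_weights(num_total, num_positive, num_classes=2):
    """Weight each class by num_total / (num_classes * class_count)."""
    num_negative = num_total - num_positive
    return (
        num_total / (num_classes * num_negative),
        num_total / (num_classes * num_positive),
    )


# Hypothetical counts, for illustration only
weights = balanced_class_weights(num_total=1000, num_positive=40)
print(weights)
```

With these weights, both classes contribute the same total weight to the loss (960 negatives and 40 positives each sum to 500 here), so the rare class is not drowned out by the common one.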

Later we will see how this class imbalance affects how we track our results. 

## Data loading using sampling

As the graph we are using is large, we will need some form of sampling to train our model. We will use neighbour sampling. PyTorch Geometric provides a data loader for this, `torch_geometric.loader.NeighborLoader`. Let's create an instance of `NeighborLoader`:

In [None]:
batch_size = 64
num_layers = 2
num_neighbors = [11, 4]

In [None]:
from torch_geometric.loader import NeighborLoader

train_loader = NeighborLoader(
    data,
    num_neighbors=num_neighbors,
    batch_size=batch_size,
    input_nodes=("transaction", data["transaction"].train_mask),
)

In [None]:
next(iter(train_loader))
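The idea behind neighbour sampling can be sketched on a tiny homogeneous graph in plain Python. This is only a toy under simplifying assumptions (a single node type, an adjacency dictionary, a hypothetical helper called `sample_neighbourhood`); the real `NeighborLoader` handles heterogeneous graphs and works on tensors.

```python
import random


def sample_neighbourhood(adjacency, seed_nodes, num_neighbors, rng):
    """Sample up to num_neighbors[i] neighbours per node at hop i (-1 means all)."""
    sampled = list(seed_nodes)  # seed nodes always come first
    seen = set(seed_nodes)
    frontier = list(seed_nodes)
    for fanout in num_neighbors:
        next_frontier = []
        for node in frontier:
            neighbours = adjacency.get(node, [])
            k = len(neighbours) if fanout < 0 else min(fanout, len(neighbours))
            for neighbour in rng.sample(neighbours, k):
                if neighbour not in seen:
                    seen.add(neighbour)
                    sampled.append(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return sampled


# Hypothetical toy graph stored as an adjacency list
adjacency = {
    0: [1, 2, 3],
    1: [0, 4],
    2: [0, 5],
    3: [0, 6, 7],
    4: [1],
    5: [2],
    6: [3],
    7: [3],
}
rng = random.Random(0)
sizes = [len(sample_neighbourhood(adjacency, [seed], [2, 1], rng)) for seed in (0, 4)]
full = sample_neighbourhood(adjacency, [0], [-1, -1], rng)
print(sizes, sorted(full))
```

The number of sampled nodes depends on the seed node and on the random draws, which is why mini-batches from a neighbour sampler generally differ in size.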

When using this data loader, each mini-batch produced has a different shape, depending on how much of the original graph is sampled. The IPU uses ahead-of-time compilation, which requires each mini-batch to have the same size. To achieve this we can use a fixed-size version of this data loader provided in PopTorch Geometric: `poptorch_geometric.FixedSizeNeighborLoader`.

First, by sampling from the non-fixed-size data loader, we can get an idea of the fixed sizes we need to accommodate the mini-batches from our neighbour loader.

In [None]:
from poptorch_geometric import FixedSizeOptions

torch.manual_seed(42)

fixed_size_options = FixedSizeOptions.from_loader(train_loader)
fixed_size_options

Here we have an approximation for the number of nodes and edges, for each node and edge type, to accommodate the mini-batches produced by the neighbour loader. Now using this with `FixedSizeNeighborLoader`, the mini-batches will be padded to the sizes specified in `fixed_size_options`.

In [None]:
from poptorch_geometric.neighbor_loader import FixedSizeNeighborLoader

train_loader_ipu = FixedSizeNeighborLoader(
    data,
    num_neighbors=num_neighbors,
    fixed_size_options=fixed_size_options,
    batch_size=batch_size,
    input_nodes=("transaction", data["transaction"].train_mask),
)

Now, looking at the first sample you can see it has the dimensions from the `fixed_size_options` object.

In [None]:
sample = next(iter(train_loader_ipu))
sample
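The two steps described above, estimating a fixed size from sampled mini-batches and then padding every mini-batch to it, can be sketched in plain Python. This is a conceptual illustration only: the helper names are hypothetical, mini-batches are plain lists of node IDs, and the real loader pads nodes and edges per type.

```python
def estimate_fixed_size(batches):
    """Largest node count seen across the sampled mini-batches."""
    return max(len(batch) for batch in batches)


def pad_batch(batch, fixed_size, pad_value=0):
    """Pad a mini-batch to fixed_size and return it with a validity mask."""
    if len(batch) > fixed_size:
        raise ValueError("mini-batch is larger than the fixed size")
    padding = fixed_size - len(batch)
    padded = list(batch) + [pad_value] * padding
    mask = [True] * len(batch) + [False] * padding
    return padded, mask


# Hypothetical mini-batches of different sizes
batches = [[3, 7, 9], [4, 8], [1, 2, 5, 6]]
fixed_size = estimate_fixed_size(batches)
padded_batches = [pad_batch(batch, fixed_size) for batch in batches]
print(fixed_size, padded_batches)
```

Because the size is estimated from a finite number of samples, a later mini-batch could in principle be larger than the estimate, which is why the sketch raises an error in that case rather than truncating silently.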

## Picking the right model

In order to pick the right model, we should reflect on the task we are doing. For each transaction node we are attempting to predict whether a transaction is fraudulent or not. Each transaction has a number of features as well as being connected to other node types.

We will essentially need a [Relational Graph Convolution Network (R-GCN)](https://arxiv.org/abs/1703.06103), where each relation type will have its own set of weights. In this case, each relation type will have its own `SAGEConv` layer. Remember that we only have features for the transactions, so we will need to create some features for the other node types as well. We will train an embedding for each node type for this purpose, after which all the node types will have an embedding to use as an input to the message passing layers.

The exact model definition can be seen in `model.py`. Here we:
 * create the GNN we want to use in homogeneous form
 * transform the GNN to a heterogeneous GNN, such that we have a convolution layer for each of the edge types
 * wrap this heterogeneous GNN in a model that contains an embedding for all the non-transaction node types and a loss function such that we can use PopTorch and Graphcore IPUs to train this model.

In [None]:
from torch_geometric.nn import to_hetero

from model import GNN, Model


model = GNN(hidden_channels=64, num_layers=num_layers)
model = to_hetero(model, data.metadata(), aggr="sum")
model = Model(
    model,
    embedding_size=128,
    out_channels=2,
    node_types=data.node_types,
    num_nodes_per_type={
        node_type: data[node_type].num_nodes for node_type in data.node_types
    },
    class_weight=class_weight,
)
model
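To build some intuition for "one set of weights per relation" with sum aggregation across relations, here is a deliberately simplified sketch in plain Python. It is not the contents of `model.py`: features and weights are scalars, the node types, relation names and weight values are hypothetical, and the root-node term and non-linearity of a real layer are left out.

```python
def relational_layer(features, edges_by_relation, weights, target_type):
    """One toy heterogeneous layer producing scalar outputs for target_type nodes."""
    num_targets = len(features[target_type])
    outputs = [0.0] * num_targets
    for (src_type, relation, dst_type), edges in edges_by_relation.items():
        if dst_type != target_type:
            continue
        # Collect the incoming messages for each target node under this relation
        incoming = [[] for _ in range(num_targets)]
        for src, dst in edges:
            incoming[dst].append(features[src_type][src])
        for node, messages in enumerate(incoming):
            if messages:
                mean_message = sum(messages) / len(messages)
                # Each relation has its own weight; relations are summed together
                outputs[node] += weights[relation] * mean_message
    return outputs


# Hypothetical toy graph with two relations pointing at transaction nodes
features = {"transaction": [1.0, 2.0], "card": [10.0], "device": [4.0, 6.0]}
edges_by_relation = {
    ("card", "rev_a", "transaction"): [(0, 0), (0, 1)],
    ("device", "rev_b", "transaction"): [(0, 0), (1, 0)],
}
weights = {"rev_a": 0.5, "rev_b": 2.0}
outputs = relational_layer(features, edges_by_relation, weights, "transaction")
print(outputs)
```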

Using the first sample from the data loader, we can lazily initialize the modules in our model.

In [None]:
sample = next(iter(train_loader_ipu))

In [None]:
model.eval()
out_cpu = model(
    sample.x_dict,
    sample.edge_index_dict,
    batch_size=sample["transaction"].batch_size,
    n_id_dict=sample.n_id_dict,
    target=sample["transaction"].y,
    mask=sample["transaction"].train_mask,
)
out_cpu

As a sanity check, we will run inference using this sample on the IPU and verify that the IPU and CPU results match.

In [None]:
import poptorch

poptorch_options = poptorch.Options()
poptorch_options.enableExecutableCaching(executable_cache_dir)
inf_model = poptorch.inferenceModel(model, options=poptorch_options)

out_ipu = inf_model(
    sample.x_dict,
    sample.edge_index_dict,
    batch_size=sample["transaction"].batch_size,
    n_id_dict=sample.n_id_dict,
    target=sample["transaction"].y,
    mask=sample["transaction"].train_mask,
)

inf_model.detachFromDevice()

out_ipu

In [None]:
assert torch.allclose(out_cpu, out_ipu, rtol=1e-05, atol=1e-05)

All looks good.
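For reference, the tolerance rule applied by `torch.allclose` is that every element must satisfy `|a - b| <= atol + rtol * |b|`. A minimal plain-Python sketch of that rule, using a hypothetical helper `all_close` on flat lists of equal length, looks like this:

```python
def all_close(values, reference, rtol=1e-05, atol=1e-05):
    """True if every value is within atol + rtol * |reference| of its reference."""
    if len(values) != len(reference):
        raise ValueError("inputs must have the same length")
    return all(
        abs(value - ref) <= atol + rtol * abs(ref)
        for value, ref in zip(values, reference)
    )


close = all_close([1.0, 2.0], [1.000005, 2.0])
not_close = all_close([1.0], [1.001])
print(close, not_close)
```

Small differences between devices are expected from floating-point arithmetic, so a tolerance-based comparison is more appropriate here than exact equality.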

## Training the model

We are ready to start training our model. Let's specify the hyperparameters. Note that `class_weight` is set by hand here, replacing the value calculated earlier.

In [None]:
learning_rate = 0.01
weight_decay = 5e-5
num_layers = 2
embedding_size = 128
hidden_channels = 16
log_freq = 10
class_weight = (1.00, 5.00)

We will train for 50 epochs:

In [None]:
num_epochs = 50

Create the model:

In [None]:
model = GNN(hidden_channels=hidden_channels, num_layers=num_layers)
model = to_hetero(model, data.metadata(), aggr="sum")
model = Model(
    model,
    embedding_size=embedding_size,
    out_channels=2,
    node_types=data.node_types,
    num_nodes_per_type={
        node_type: data[node_type].num_nodes for node_type in data.node_types
    },
    class_weight=class_weight,
)

Get the first sample from the data loader to initialize the model lazily.

In [None]:
train_loader_ipu = FixedSizeNeighborLoader(
    data,
    num_neighbors=num_neighbors,
    fixed_size_options=fixed_size_options,
    batch_size=batch_size,
    input_nodes=("transaction", data["transaction"].train_mask),
    shuffle=True,
)

sample = next(iter(train_loader_ipu))

with torch.no_grad():  # Initialize lazy modules.
    out_cpu, loss = model(
        sample.x_dict,
        sample.edge_index_dict,
        batch_size=sample["transaction"].batch_size,
        n_id_dict=sample.n_id_dict,
        target=sample["transaction"].y,
    )

To accelerate training we will replicate the model over all the available IPUs (`number_of_ipus`, which defaults to 16) and increase the PopTorch `deviceIterations` option to reduce interactions between the host and IPUs.

In [None]:
import poptorch

replication_factor = number_of_ipus
device_iterations = 64

# Reduce the size of the global batch if it ends up being greater
# than the number of training transactions in the dataset
if (
    data["transaction"].train_mask.sum()
    < replication_factor * batch_size * device_iterations
):
    replication_factor = 1
    device_iterations = 1

poptorch_options = poptorch.Options()
poptorch_options.enableExecutableCaching(executable_cache_dir)
poptorch_options.replicationFactor(replication_factor)
poptorch_options.deviceIterations(device_iterations)
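The size check above is simple arithmetic: one host-to-device step consumes `batch_size * replication_factor * device_iterations` training nodes. The sketch below restates it in plain Python with hypothetical helper names; with the defaults used here (batch size 64, 16 replicas, 64 device iterations) a single step covers 65,536 nodes. The item counts passed in the example are made up.

```python
def global_batch_size(batch_size, replication_factor, device_iterations):
    """Number of seed nodes consumed by one host-to-device step."""
    return batch_size * replication_factor * device_iterations


def choose_execution_settings(num_items, batch_size, replication_factor, device_iterations):
    """Fall back to one replica and one iteration if the global batch exceeds the data."""
    if num_items < global_batch_size(batch_size, replication_factor, device_iterations):
        return 1, 1
    return replication_factor, device_iterations


default_global_batch = global_batch_size(64, 16, 64)
# Hypothetical dataset sizes, for illustration only
large_enough = choose_execution_settings(100_000, 64, 16, 64)
too_small = choose_execution_settings(50_000, 64, 16, 64)
print(default_global_batch, large_enough, too_small)
```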

Re-create the data loader with these options:

In [None]:
train_loader_ipu = FixedSizeNeighborLoader(
    data,
    num_neighbors=num_neighbors,
    fixed_size_options=fixed_size_options,
    batch_size=batch_size,
    input_nodes=("transaction", data["transaction"].train_mask),
    shuffle=True,
    options=poptorch_options,
)

Wrap the model in `poptorch.trainingModel` specifying the optimizer:

In [None]:
model.train()
optimizer = poptorch.optim.Adam(
    model.parameters(), lr=learning_rate, weight_decay=weight_decay
)
training_model = poptorch.trainingModel(
    model, optimizer=optimizer, options=poptorch_options
)

And finally create the training loop and begin training:

In [None]:
for epoch in range(num_epochs):
    total_examples = total_loss = 0
    for batch in train_loader_ipu:
        out, loss = training_model(
            batch.x_dict,
            batch.edge_index_dict,
            batch_size=sample["transaction"].batch_size,
            n_id_dict=batch.n_id_dict,
            target=batch["transaction"].y,
        )
        examples = (
            sample["transaction"].batch_size * replication_factor * device_iterations
        )
        total_examples += examples
        total_loss += float(loss.mean()) * examples

    if epoch % log_freq == 0:
        print(f"Epoch {epoch}, Loss: {total_loss / total_examples}")

Now that the model is trained we can detach it from the IPUs.

In [None]:
training_model.detachFromDevice()

Let's save the trained weights.

In [None]:
os.makedirs(checkpoint_directory, exist_ok=True)
torch.save(training_model.state_dict(), os.path.join(checkpoint_directory, "model.pt"))

## Validating the trained model

In order to validate the trained model, we must create a data loader that samples from the validation nodes.

In [None]:
poptorch_options = poptorch.Options()
poptorch_options.enableExecutableCaching(executable_cache_dir)

In [None]:
batch_size_val = 1
device_iterations = 64
replication_factor = number_of_ipus
# Reduce the size of the global batch if it ends up being greater
# than the number of validation transactions in the dataset
if (
    data["transaction"].val_mask.sum()
    < replication_factor * batch_size_val * device_iterations
):
    replication_factor = 1
    device_iterations = 1

poptorch_options.replicationFactor(replication_factor)
poptorch_options.deviceIterations(device_iterations)

For validation, we want to sample the full first-hop neighbourhood of the validation nodes, and up to 100 neighbours for each node in the second hop. Let's recreate the fixed-size options to ensure we allocate enough space for this:

In [None]:
num_neighbors = [-1, 100]

val_loader = NeighborLoader(
    data,
    num_neighbors=num_neighbors,
    batch_size=batch_size_val,
    input_nodes=("transaction", data["transaction"].val_mask),
)

fixed_size_options = FixedSizeOptions.from_loader(val_loader, sample_limit=100)
fixed_size_options

And create a fixed-size neighbour loader for validation:

In [None]:
val_loader_ipu = FixedSizeNeighborLoader(
    data,
    num_neighbors=num_neighbors,
    fixed_size_options=fixed_size_options,
    batch_size=batch_size_val,
    input_nodes=("transaction", data["transaction"].val_mask),
    options=poptorch_options,
)

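Let's now wrap the trained model in `poptorch.inferenceModel` and run a single epoch. For each padded mini-batch we keep only the output of the first transaction node, which is the sampled validation node as `batch_size_val` is 1. The strided slice in the loop does this by stepping through the outputs in blocks of `fixed_size_options.num_nodes["transaction"]`.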

In [None]:
model.eval()
inference_model = poptorch.inferenceModel(model, options=poptorch_options)

outs = []
labels = []

for batch in val_loader_ipu:
    out = inference_model(
        batch.x_dict,
        batch.edge_index_dict,
        batch_size=sample["transaction"].batch_size,
        n_id_dict=batch.n_id_dict,
    )
    outs.append(out[0 :: fixed_size_options.num_nodes["transaction"]])
    labels.append(
        batch["transaction"].y[0 :: fixed_size_options.num_nodes["transaction"]]
    )
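The strided slice used in the loop can be illustrated with a plain list. The sketch assumes that the outputs of several padded mini-batches are concatenated one after another and that the seed node comes first in each; the sizes and labels below are hypothetical.

```python
padded_nodes_per_batch = 4  # hypothetical fixed number of transaction nodes
num_mini_batches = 3

# Label each output row with the mini-batch and node position it came from
outputs = [
    f"batch{batch}_node{node}"
    for batch in range(num_mini_batches)
    for node in range(padded_nodes_per_batch)
]

# Keep every padded_nodes_per_batch-th row, starting from row 0
seed_outputs = outputs[0::padded_nodes_per_batch]
print(seed_outputs)
```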

Again, we will detach the model from IPUs.

In [None]:
inference_model.detachFromDevice()

## Analysing the results

In this section we will attempt to understand how our trained model performs on the validation nodes.

First, we will flatten the results of the validation.

In [None]:
result = torch.stack(outs)
result = result.flatten(start_dim=0, end_dim=1)
result.shape

In [None]:
y_true = torch.stack(labels)
y_true = y_true.flatten(start_dim=0, end_dim=1)
y_true.shape

We can make our predictions using a softmax function and checking if the second class probability (the fraudulent class) is greater than 0.5.

In [None]:
import torch.nn as nn

y_pred = nn.Softmax(dim=-1)(result)
y_pred = y_pred[:, -1]
y_pred = y_pred > 0.5
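For a single transaction, the same computation can be written out in plain Python. This is only an illustrative sketch with hypothetical logits and helper names; it mirrors the softmax followed by a threshold on the last class.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def is_fraud(logits, threshold=0.5):
    """Flag the transaction if the last class probability exceeds the threshold."""
    return softmax(logits)[-1] > threshold


# Hypothetical logits ordered as (non-fraudulent, fraudulent)
flagged = is_fraud([0.0, 1.0])
not_flagged = is_fraud([2.0, 0.0])
strict = is_fraud([0.0, 1.0], threshold=0.9)
print(flagged, not_flagged, strict)
```

With two classes and a threshold of 0.5, this is equivalent to checking which logit is larger. Raising the threshold makes the model more conservative about flagging fraud.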

And we can get the accuracy:

In [None]:
def accuracy(y_pred, y_true):
    correct = y_pred.eq(y_true).sum()
    return correct / len(y_pred)


accuracy(y_pred, y_true)
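When reading the accuracy, it helps to compare it against a trivial baseline. The sketch below uses made-up labels (35 fraudulent transactions out of 1000, not the real dataset) and a "model" that never predicts fraud.

```python
def simple_accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical labels: 1 marks a fraudulent transaction
toy_labels = [1] * 35 + [0] * 965
always_legitimate = [0] * len(toy_labels)

baseline_accuracy = simple_accuracy(always_legitimate, toy_labels)
frauds_caught = sum(p == 1 and y == 1 for p, y in zip(always_legitimate, toy_labels))
print(baseline_accuracy, frauds_caught)
```

The baseline scores a high accuracy while catching no fraud at all, so a high accuracy on its own says little about how well fraudulent transactions are detected.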

Unfortunately, accuracy is a poor metric for this problem as we have such a large class imbalance. Let's instead look at the confusion matrix:

In [None]:
def get_confusion_matrix(y_pred, y_true):
    y_pred = y_pred.bool()
    y_true = y_true.bool()
    true_positives = (y_pred * y_true).sum()
    false_positives = (y_pred * ~y_true).sum()
    true_negatives = (~y_pred * ~y_true).sum()
    false_negatives = (~y_pred * y_true).sum()
    return true_positives, false_positives, true_negatives, false_negatives


true_pos, false_pos, true_neg, false_neg = get_confusion_matrix(y_pred, y_true)
true_pos, false_pos, true_neg, false_neg

From these we get the true positive and false positive rates.

In [None]:
def get_rates(true_pos, false_pos, true_neg, false_neg):
    true_pos_rate = true_pos / (true_pos + false_neg)
    false_pos_rate = false_pos / (false_pos + true_neg)
    return true_pos_rate, false_pos_rate


get_rates(true_pos, false_pos, true_neg, false_neg)

Now by sweeping over the threshold used to deem a transaction fraudulent, we can get a ROC (receiver operating characteristic) curve.

In [None]:
import numpy as np

results = []
for threshold in np.arange(1.0, -0.1, -0.1):
    y_pred = nn.Softmax(dim=-1)(result)
    y_pred = y_pred[:, -1]
    y_pred = y_pred > threshold
    results.append((threshold, *get_rates(*get_confusion_matrix(y_pred, y_true))))

In [None]:
true_pos_rates = list(zip(*results))[1]
false_pos_rates = list(zip(*results))[2]

And plotting the ROC curve:

In [None]:
from matplotlib import pyplot as plt

fig, ax = plt.subplots()
ax.plot(false_pos_rates, true_pos_rates)
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
plt.grid(True)

A good metric is the area under this curve (AUC). Let's calculate that:

In [None]:
import numpy as np

auc = np.trapz(y=true_pos_rates, x=false_pos_rates)
auc
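The threshold sweep and the trapezoidal area can be checked by hand on a tiny example. The sketch below is plain Python with made-up scores and labels and hypothetical helper names; it follows the same steps as the cells above but is not a replacement for them.

```python
def roc_points(scores, labels, thresholds):
    """Return (false positive rate, true positive rate) for each threshold."""
    points = []
    for threshold in thresholds:
        predictions = [score > threshold for score in scores]
        tp = sum(p and y for p, y in zip(predictions, labels))
        fp = sum(p and not y for p, y in zip(predictions, labels))
        fn = sum((not p) and y for p, y in zip(predictions, labels))
        tn = sum((not p) and (not y) for p, y in zip(predictions, labels))
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical fraud scores and ground-truth labels
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [True, True, True, False, False, False]
thresholds = [1.0, 0.75, 0.5, 0.25, 0.0]  # descending, so the curve runs left to right

points = roc_points(scores, labels, thresholds)
area = trapezoid_area(points)
print(points, area)
```

In this toy example the area is 8/9, which matches the fraction of (fraudulent, non-fraudulent) pairs in which the fraudulent transaction receives the higher score. A coarse threshold grid only approximates the curve, so a finer sweep gives a more faithful area.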

This result is ok for a start, but could do with some improvement. As an extension you could try changing:
 * the model's layers - perhaps `SAGEConv` layers aren't the best for this use case,
 * the hyperparameters - does it help to train for more epochs, or maybe even fewer,
 * the class weight - are we putting enough weighting on the fraudulent nodes,
 * the dataset preprocessing - perhaps some features are more useful than others.

## Conclusion

In this notebook we have seen how to train a heterogeneous GNN model using PyTorch Geometric on Graphcore IPUs for a fraud detection task.

Specifically we have:

 - loaded the preprocessed PyTorch Geometric dataset,
 - done some additional preprocessing and generated the training and validation splits,
 - trained a model on the data using neighbour sampling,
 - validated our trained model by looking at the area under the ROC curve.

If you are interested in finding out more about this application, check out the "Preprocessing a Tabular Dataset into a PyTorch Geometric Data Object suitable for Fraud Detection" `1_dataset_preprocessing.ipynb` notebook. To find out more about PyTorch Geometric on IPUs in general, see our PyG tutorials in `learning-pytorch-geometric-on-ipus/`.