Copyright (c) 2023 Graphcore Ltd. All rights reserved.

# Preprocessing a Tabular Dataset into a PyTorch Geometric Data Object suitable for Fraud Detection

This notebook demonstrates how to preprocess a tabular fraud dataset, the [IEEE-CIS dataset](https://www.kaggle.com/competitions/ieee-fraud-detection/data), into a PyTorch Geometric (PyG) data object ready for use in the [Training a GNN to do Fraud Detection on Graphcore IPUs with PyTorch Geometric](2_training.ipynb) notebook. The approach is inspired by the [AWS Fraud Detection with GNNs](https://github.com/awslabs/realtime-fraud-detection-with-gnn-on-dgl) project, which frames the problem as a node classification task on a heterogeneous graph, where the transaction nodes carry a label indicating whether they are fraudulent.

In this notebook, you will learn how to:

- Turn tabular transaction data into a PyTorch Geometric heterogeneous dataset object suitable for use in the "Training a GNN to do Fraud Detection on Graphcore IPUs with PyTorch Geometric" `2_training.ipynb` notebook.

This notebook assumes some familiarity with PopTorch as well as PyTorch Geometric. For additional resources please consult:
* [PopTorch documentation](https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/index.html)
* [PopTorch examples and tutorials](https://docs.graphcore.ai/en/latest/examples.html#pytorch)
* [PyTorch Geometric documentation](https://pytorch-geometric.readthedocs.io/en/latest/)
* [PopTorch Geometric documentation](https://docs.graphcore.ai/projects/poptorch-geometric-user-guide/en/latest/index.html)

[![Join our Slack Community](https://img.shields.io/badge/Slack-Join%20Graphcore's%20Community-blue?style=flat-square&logo=slack)](https://www.graphcore.ai/join-community)

## Running on Paperspace

The Paperspace environment lets you run this notebook with no setup. To improve your experience, we preload datasets and pre-install packages. This can take a few minutes. If you experience errors immediately after starting a session, please try restarting the kernel before contacting support. If a problem persists or you want to give us feedback on the content of this notebook, please reach out through our community of developers using our [Slack channel](https://www.graphcore.ai/join-community) or raise a [GitHub issue](https://github.com/graphcore/examples).

In order to improve usability and support for future users, Graphcore would like to collect information about the
applications and code being run in this notebook. The following information will be anonymised before being sent to Graphcore:

- User progression through the notebook
- Notebook details: number of cells, code being run and the output of the cells
- Environment details

You can disable logging at any time by running `%unload_ext graphcore_cloud_tools.notebook_logging.gc_logger` from any cell.



## Dependencies and configuration

Install the dependencies the notebook needs.

In [None]:
%pip install -q -r requirements.txt
%load_ext examples_utils.notebook_logging.gc_logger

To improve your experience, we read in some configuration related to the environment in which you are running the notebook.

In [None]:
import os

number_of_ipus = int(os.getenv("NUM_AVAILABLE_IPU", 16))
pod_type = os.getenv("GRAPHCORE_POD_TYPE", "pod16")
executable_cache_dir = os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "/tmp/exe_cache/")
dataset_directory = os.getenv("DATASETS_DIR", ".")
checkpoint_directory = os.getenv("CHECKPOINT_DIR", ".")

Now let's get started.

## Loading tabular data into PyTorch Geometric

Many real world problems start with a tabular dataset. In this section, we will load a tabular dataset, preprocess it into a graph and put it into a PyTorch Geometric data object ready to be used to train a PyTorch Geometric model.

### Getting the dataset

First we need a tabular dataset for fraud detection. We will use the dataset from the [IEEE-CIS Fraud Detection](https://www.kaggle.com/c/ieee-fraud-detection/data) competition on Kaggle.

You will need to download the dataset from the [Kaggle competition website](https://www.kaggle.com/c/ieee-fraud-detection/data) and unpack it into a directory called `ieee-fraud-detection/raw` inside your dataset directory.

In [None]:
import os.path as osp
import pandas as pd

raw_dataset_path = osp.join(dataset_directory, "ieee-fraud-detection/raw")

dataset_raw_files = [
    "train_transaction.csv",
    "train_identity.csv",
    "test_transaction.csv",
    "test_identity.csv",
]

dataset_raw_paths = []
for file in dataset_raw_files:
    full_path = osp.join(raw_dataset_path, file)
    if not os.path.isfile(full_path):
        raise FileNotFoundError(
            f"Dataset at path {full_path} not found. Ensure the dataset"
            f" has been downloaded and unpacked into {raw_dataset_path}"
        )
    dataset_raw_paths.append(full_path)

train_transaction_df = pd.read_csv(dataset_raw_paths[0])
train_identity_df = pd.read_csv(dataset_raw_paths[1])
test_transaction_df = pd.read_csv(dataset_raw_paths[2])
test_identity_df = pd.read_csv(dataset_raw_paths[3])

We will concatenate the training and test datasets in order to make the PyTorch Geometric graph. Later we will define new dataset splits.

In [None]:
transaction_df = pd.concat([train_transaction_df, test_transaction_df], axis=0)
identity_df = pd.concat([train_identity_df, test_identity_df], axis=0)

So, we have two tables to work with:
 * `transaction_df` - properties about the transactions themselves, for example information about card used, or the billing address.
 * `identity_df` - identity information associated with the transactions, for example digital signature, or network connection information.

For more details on this data see the [Kaggle competition forum](https://www.kaggle.com/c/ieee-fraud-detection/discussion/101203) discussing this topic.

Let's take a look at the tables themselves:

In [None]:
transaction_df.head()

In [None]:
identity_df.head()

You may notice both tables have some `NaN` values. If the information wasn't available for that particular transaction, the value will be `NaN`.

As both tables share the `TransactionID` column, we merge them into one table.

In [None]:
transaction_df = pd.merge(transaction_df, identity_df, on="TransactionID")
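
Note that `pd.merge` performs an inner join by default, so only transactions with a matching row in the identity table survive the merge. The following is a minimal sketch on made-up toy tables (the values are illustrative and not taken from the dataset) comparing the default behaviour with `how="left"`:

```python
import pandas as pd

# Toy tables with made-up values, for illustration only
toy_transactions = pd.DataFrame(
    {"TransactionID": [1, 2, 3], "TransactionAmt": [10.0, 25.5, 7.0]}
)
toy_identity = pd.DataFrame(
    {"TransactionID": [1, 3], "DeviceType": ["mobile", "desktop"]}
)

# Default (inner) join: transaction 2 has no identity row and is dropped
inner = pd.merge(toy_transactions, toy_identity, on="TransactionID")

# Left join: every transaction is kept, missing identity values become NaN
left = pd.merge(toy_transactions, toy_identity, on="TransactionID", how="left")

print(len(inner), len(left))  # 2 3
```

Which join is appropriate depends on whether transactions without identity information should become nodes in the graph. This notebook uses the default.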

We then sort the transactions based on their datetime information. When we create the dataset splits we will use the datetime of the transactions to decide how to split the data.

In [None]:
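# sort_values returns a new DataFrame, so assign the result back.
# Resetting the index keeps row labels equal to row positions (0..N-1).
transaction_df = transaction_df.sort_values("TransactionDT").reset_index(drop=True)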
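
Ordering rows by `TransactionDT` is what makes a chronological split possible. As a rough sketch, assuming a hypothetical 80/10/10 split (the fractions and variable names here are illustrative assumptions, not the splits used for training), boolean masks over time-ordered rows could be built like this:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up timestamps, deliberately out of order
toy_df = pd.DataFrame(
    {"TransactionDT": [500, 100, 400, 200, 300, 900, 700, 800, 600, 1000]}
)
toy_df = toy_df.sort_values("TransactionDT").reset_index(drop=True)

# Hypothetical 80/10/10 split boundaries, computed with integer arithmetic
num_rows = len(toy_df)
train_end = num_rows * 8 // 10
val_end = num_rows * 9 // 10

# Earlier rows go to training, later rows to validation and test
positions = np.arange(num_rows)
train_mask = positions < train_end
val_mask = (positions >= train_end) & (positions < val_end)
test_mask = positions >= val_end

print(train_mask.sum(), val_mask.sum(), test_mask.sum())  # 8 1 1
```

Splitting by time rather than at random means the model is evaluated on transactions that occur after the ones it was trained on.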

In the interests of time, for this notebook we will only take the first 10000 samples. See `dataset.py` for the full dataset preprocessing.

In [None]:
transaction_df = transaction_df.head(10000)

### Preprocessing the dataset

We will frame this fraud detection task as a node classification problem. Each transaction in the table can be a distinct node, with a set of features and a label determining whether it is a fraudulent transaction or not. A transaction node will have some category features, like device type or device info, concatenated with some numerical features, like the transaction amount.

As well as transaction nodes, we can construct other node types based on some of the category columns in the table, for example `ProductCD`, which represents the product code, or `card1`, which represents some card information. Each transaction node will be connected to one node of each of the other node types, constructing a heterogeneous graph. For example, a transaction node will be connected to a single `ProductCD` node, a single `card1` node and to one node of each of the other node types. The columns we don't use to create new node types will be considered as category and numerical features of the transaction nodes themselves.

Now, let's preprocess the table, following the above method.

First, we filter out the transactions which don't have `isFraud` values:

In [None]:
transaction_df = transaction_df[transaction_df["isFraud"].notna()]
transaction_df
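
Rows that come from the test files have no `isFraud` column, so after concatenation their label is `NaN`. A small sketch on made-up data shows how the filter removes them. The index is reset afterwards so that rows are numbered 0..N-1, which is convenient when row positions are used as node indices:

```python
import pandas as pd

# Made-up stand-ins for the labelled training rows and unlabelled test rows
toy_train = pd.DataFrame({"TransactionID": [1, 2], "isFraud": [0, 1]})
toy_test = pd.DataFrame({"TransactionID": [3, 4]})

toy_all = pd.concat([toy_train, toy_test], axis=0)
# The test rows have NaN labels after concatenation
num_unlabelled = int(toy_all["isFraud"].isna().sum())

# Keep labelled rows only and renumber them 0..N-1
toy_labelled = toy_all[toy_all["isFraud"].notna()].reset_index(drop=True)

print(num_unlabelled, len(toy_labelled))  # 2 2
```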

#### Create the non-target node types

We want to create a heterogeneous graph where the node type we train on is the transaction node. The other node types, or non-target node types, are built from various category columns of the dataset. Specifically, the following columns will be the other node types:

In [None]:
non_target_node_types = [
    "card1",
    "card2",
    "card3",
    "card4",
    "card5",
    "card6",
    "ProductCD",
    "addr1",
    "addr2",
    "P_emaildomain",
    "R_emaildomain",
]

For each of these columns, we create a new node of that type and connect an edge from the transaction to that node type. If a node of that type with that category already exists, we just connect the edge from the transaction to the existing node.

In [None]:
import torch

get_cat_map = lambda vals: {val: idx for idx, val in enumerate(vals)}


def get_edge_list(df, identifier):
    # Find the unique categories for this node type
    unique_entries = df[identifier].drop_duplicates().dropna()
    # Create a map from category value to node index
    entry_map = get_cat_map(unique_entries)
    # Create edge list mapping transaction to node type
    edge_list = [[], []]

    # Iterate over the DataFrame passed in, using the row position
    # as the transaction node index
    for idx, node_type_val in enumerate(df[identifier]):
        # Don't create nodes for NaN values
        if pd.isna(node_type_val):
            continue
        edge_list[0].append(idx)
        edge_list[1].append(entry_map[node_type_val])
    return torch.tensor(edge_list, dtype=torch.long)
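
The same mapping can also be computed without a Python loop. The following is a sketch of an alternative, not the method used in this notebook: `pd.factorize` assigns an integer code to each distinct value in order of first appearance and marks missing values with `-1`. The toy column values are made up.

```python
import numpy as np
import pandas as pd

# Made-up column values; NaN marks a transaction with no value for this type
toy_df = pd.DataFrame({"card4": ["visa", "mastercard", np.nan, "visa"]})

# factorize assigns an integer code per distinct value, and -1 to NaN
codes, uniques = pd.factorize(toy_df["card4"])

# Keep only rows with a real value; the row position is the transaction index
has_value = codes >= 0
edge_index = np.stack([np.arange(len(toy_df))[has_value], codes[has_value]])

print(edge_index.tolist())  # [[0, 1, 3], [0, 1, 0]]
```

The first row holds transaction indices and the second holds the index of the node each transaction connects to. Transactions 0 and 3 share the same `visa` node, and transaction 2 has no edge. The array could then be converted with `torch.from_numpy` if a tensor is needed.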

In [None]:
edge_dict = {
    node_type: get_edge_list(transaction_df, node_type)
    for node_type in non_target_node_types
}
edge_dict

This defines the edge index for each edge type from the transaction nodes.
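
As a quick sanity check, the number of nodes of each non-target type should equal the number of distinct non-missing values in the corresponding column. A minimal sketch on made-up data, using `nunique`, which ignores `NaN` by default:

```python
import numpy as np
import pandas as pd

# Made-up values for two of the node-type columns
toy_df = pd.DataFrame(
    {
        "card4": ["visa", "mastercard", "visa", np.nan],
        "ProductCD": ["W", "W", "C", "W"],
    }
)

# nunique skips NaN, matching the rule that NaN values create no node
nodes_per_type = toy_df[["card4", "ProductCD"]].nunique()

print(nodes_per_type.to_dict())  # {'card4': 2, 'ProductCD': 2}
```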

Next we will create the features for the transaction nodes. The columns that we aren't using to create new node types will be transaction features. These columns have either category values or numeric values. We encode each category column as a one-hot tensor and concatenate the results. The numeric columns are gathered into a single tensor, which is then concatenated with the category features.

First we define the category columns:

In [None]:
target_cat_feat_cols = [
    "M1",
    "M2",
    "M3",
    "M4",
    "M5",
    "M6",
    "M7",
    "M8",
    "M9",
    "DeviceType",
    "DeviceInfo",
    "id_12",
    "id_13",
    "id_14",
    "id_15",
    "id_16",
    "id_17",
    "id_18",
    "id_19",
    "id_20",
    "id_21",
    "id_22",
    "id_23",
    "id_24",
    "id_25",
    "id_26",
    "id_27",
    "id_28",
    "id_29",
    "id_30",
    "id_31",
    "id_32",
    "id_33",
    "id_34",
    "id_35",
    "id_36",
    "id_37",
    "id_38",
]

We take the remaining columns as numeric features:

In [None]:
excl_cols = ["TransactionID", "isFraud", "TransactionDT"]

target_numeric_feat_cols = [
    column
    for column in transaction_df.columns
    if column not in non_target_node_types + excl_cols + target_cat_feat_cols
]
print(" ".join(target_numeric_feat_cols))

Create a dataframe of just these columns:

In [None]:
transaction_feat_df = transaction_df[
    target_numeric_feat_cols + target_cat_feat_cols
].copy()

Make any `NaN` values `0`:

In [None]:
transaction_feat_df = transaction_feat_df.fillna(0)

As mentioned, we will process the category columns into one-hot tensors and concatenate them.

In [None]:
import torch


def get_cat_feat(df, key):
    # Map each distinct category in the column to a one-hot position
    categories = set(df[key])
    mapping = {cat: i for i, cat in enumerate(categories)}

    x = torch.zeros((len(df), len(mapping)), dtype=torch.float32)
    # Use row positions so the result does not depend on the DataFrame index
    for i, val in enumerate(df[key]):
        x[i, mapping[val]] = 1
    return x


cat_features = [get_cat_feat(transaction_feat_df, key) for key in target_cat_feat_cols]
cat_feats = torch.cat(cat_features, dim=-1)
cat_feats[0]
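
For larger tables, the per-row loop can be replaced with array indexing. The following is a sketch of one possible alternative in NumPy, not the implementation used above: row `i` of an identity matrix is the one-hot vector for category `i`, so indexing the identity matrix with the integer codes yields the encoding in one step. The column values are made up.

```python
import numpy as np
import pandas as pd

# Made-up categorical column; 0 stands in for a missing value after fillna(0)
toy_col = pd.Series(["desktop", "mobile", 0, "mobile"])

# Integer code per row and the distinct categories in order of appearance
codes, categories = pd.factorize(toy_col)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories), dtype=np.float32)[codes]

print(one_hot.shape)  # (4, 3)
```

Because missing values were filled with `0` beforehand, "missing" becomes a category of its own with a dedicated one-hot position.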

Process the numeric features:

In [None]:
import numpy as np


def process_val(col, val):
    if pd.isna(val):
        return 0.0

    if col == "TransactionAmt":
        val = np.log10(val)
    return val


num_feats = [
    list(
        map(
            process_val,
            target_numeric_feat_cols,
            [row[feat] for feat in target_numeric_feat_cols],
        )
    )
    for _, row in transaction_feat_df.iterrows()
]
num_feats = torch.tensor(num_feats, dtype=torch.float32)
num_feats.shape
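
The same processing can be expressed column-wise. The following is a sketch on made-up values rather than the notebook's implementation. It assumes transaction amounts are strictly positive, because the base-10 logarithm of `0` is negative infinity.

```python
import numpy as np
import pandas as pd

# Made-up numeric columns; NaN marks a missing value
toy_df = pd.DataFrame(
    {"TransactionAmt": [10.0, 100.0, 1000.0], "C1": [1.0, np.nan, 3.0]}
)

# Replace missing values with 0, as in the row-wise version
numeric = toy_df.fillna(0.0)
# Compress the wide range of transaction amounts onto a log scale
numeric["TransactionAmt"] = np.log10(numeric["TransactionAmt"])

num_array = numeric.to_numpy(dtype=np.float32)
print(num_array.shape)  # (3, 2)
```

Taking the logarithm of `TransactionAmt` keeps very large amounts from dominating the other features.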

Finally, concatenate the category and numeric features together:

In [None]:
transaction_feats = torch.cat((cat_feats, num_feats), -1)
transaction_feats.shape

We now have all the pieces needed to create the dataset: the transaction features, and the edge indices for each edge type connecting transactions to the other node types.

### Creating a PyTorch Geometric dataset

Now we can put the transaction features and the edge indices into a PyTorch Geometric `HeteroData` object.

In [None]:
from torch_geometric.data import HeteroData

data = HeteroData()

Set the features and labels for the transaction nodes:

In [None]:
data["transaction"].num_nodes = len(transaction_df)
data["transaction"].x = transaction_feats
# Convert via NumPy so the result does not depend on the DataFrame index
data["transaction"].y = torch.tensor(
    transaction_df["isFraud"].to_numpy(), dtype=torch.long
)

Then, for each of the other node types, we create the nodes and the edges:

In [None]:
for node_type in non_target_node_types:
    edge_index = edge_dict[node_type]
    # Node indices are contiguous from 0, so the largest index gives the count
    num_nodes = int(edge_index[1].max()) + 1
    data["transaction", "to", node_type].edge_index = edge_index
    data[node_type].num_nodes = num_nodes
    # Create dummy features for the non-transaction node types
    data[node_type].x = torch.zeros((num_nodes, 1))

We can validate the data we have created:

In [None]:
assert data.validate()

Now let's see what the resulting graph looks like:

In [None]:
data

In [None]:
data.num_nodes

The graph looks as expected. There are a number of node types, but only the transaction nodes have labels. Each transaction is connected to at most one node of each of the other node types. An edge is omitted only where the corresponding column value is missing.

## Visualizing the graph

We can visualise the heterogeneous graph we have created from the tabular data.

Let's just select a fraction of the graph for visualizing:

In [None]:
from torch_geometric.transforms import RemoveIsolatedNodes

data = data.subgraph({"transaction": torch.arange(0, 3)})
data = RemoveIsolatedNodes()(data)
data

We can use NetworkX to visualise this graph:

In [None]:
import random

import networkx as nx
from matplotlib import pyplot as plt
from torch_geometric.utils import to_networkx

# Convert to homogeneous
data_homogeneous = data.to_homogeneous()
g = to_networkx(data_homogeneous)
# Use node types as colour map
colour_map = data_homogeneous.node_type

pos = nx.spring_layout(g)

# Split the nodes by node type and add some randomness to separate the nodes
for i in range(0, len(colour_map)):
    if colour_map[i] != 0:
        pos[i][0] += np.cos(colour_map[i] / 2) * 10 + random.randint(-1, 1)
        pos[i][1] += np.sin(colour_map[i] / 2) * 10 + random.randint(-1, 1)
    else:
        pos[i][0] += random.randint(-3, 3)
        pos[i][1] += random.randint(-3, 3)

nx.draw_networkx(g, pos=pos, node_color=colour_map * 40, cmap=plt.cm.tab20)
plt.show()

Nodes 0 - 2 represent the transaction nodes. As expected, each transaction node is connected out to the nodes of the other types, each represented with a different colour.

## Conclusion

In this notebook we have preprocessed a tabular dataset into a PyTorch Geometric `HeteroData` object, ready for training. Specifically we have:

 - Loaded the [IEEE-CIS Fraud Detection](https://www.kaggle.com/c/ieee-fraud-detection/data) dataset,
 - Created edge indices for each edge type from particular columns,
 - Created transaction features based on category and numeric columns,
 - Created a PyTorch Geometric `HeteroData` object containing these features and edges,
 - Visualised the resulting graph.

To preprocess and cache the entire dataset use the `dataset.py` script.

This dataset is used for training a GNN as shown in the "Training a GNN to do Fraud Detection on Graphcore IPUs with PyTorch Geometric" `2_training.ipynb` notebook.