Copyright (c) 2023 Graphcore Ltd. All rights reserved.

# Training a GNN to do Fraud Detection on Graphcore IPUs using your own dataset with PyTorch Geometric

TODO: Everything in this section

TODO: Update links:

[![Run on Gradient](../../gradient-badge.svg)](https://console.paperspace.com/github/<runtime-repo>?machine=Free-IPU-POD4&container=<dockerhub-image>&file=<path-to-file-in-repo>)  [![Join our Slack Community](https://img.shields.io/badge/Slack-Join%20Graphcore's%20Community-blue?style=flat-square&logo=slack)](https://www.graphcore.ai/join-community)

>
> We aim to keep our notebook app demos focused on what the user is trying to
> do. To help you do this correctly, please read [our user personas](https://graphcore.atlassian.net/wiki/spaces/PM/pages/3157131517/Notebook+personas#Ellie%3A-The-Data-Scientist%2C-Business-Analysis%2C-Consultant),
> and when in doubt ask yourself "does that person care about this?".
> To support that, the first paragraph will contain all the key information, to
> help users rapidly identify if this is the right notebook for them to go
> through, based on:
>
> - The task/business problem they are trying to solve,
> - The features that are used (Focus on big picture Deep learning features - e.g.
>  Distributed training, not I/O overlap).
>
> To achieve this, each notebook should start with the following 3 paragraphs
> (detailed in the next three comments):
>
> - a table highlighting what we are going to do
> - a very short intro (3-5 sentences)
> - clear "steps to resolution" (bullet points stating what the user will have
>    to do to tackle their problem on the IPU - these need to reflect the notebook,
>    and be as simple as possible)
> - links to additional related resources.
>

|  Domain | Tasks | Model | Datasets | Workflow |   Number of IPUs   | Execution time |
|---------|-------|-------|----------|----------|--------------|--------------|
|   GNNs   |  Fraud detection  | ? | ? | Training, evaluation | recommended: 16XX (min: 4X) | 20Xmn (X1h20mn)   |

>
>
> Start with a short introduction to the notebook. [suggested 3-5 sentences]
>
> This intro should focus on the problem you are fixing, and not on any IPU specific
> or framework specific features. The mindset is that anything that is non-standard
> is a barrier to entry, and will risk the user giving up.
>
> This short introduction should be followed by a clear bullet point summary of
> the steps of the demo. Each outcome should be of the form:
> - what the user will do (active verb) [and (optionally) how they do
>   it]. Jargon, if any, goes to the end of the bullet point.

In this demo, you will learn how to:

- Turn tabular transaction data into a PyTorch Geometric dataset
- Select a model suitable for the task of predicting fraudulent transactions
- Train the model on Graphcore IPUs
- Run validation on the trained model

This notebook assumes some familiarity with PopTorch as well as PyTorch Geometric (PyG). For additional resources please consult:
* [PopTorch Documentation](https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/index.html)
* [PopTorch Examples and Tutorials](https://docs.graphcore.ai/en/latest/examples.html#pytorch)
* [PyTorch Geometric](https://pytorch-geometric.readthedocs.io/en/latest/)
* [PopTorch Geometric Documentation](https://docs.graphcore.ai/projects/poptorch-geometric-user-guide/en/latest/index.html)


In [3]:
# Automatically reload imported Python modules when their files change.
# This must run before the first import.
%load_ext autoreload
%autoreload 2
# TODO: remove at the end of notebook development

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [4]:
# TODO: Add gc-logger?

## Environment setup

[![Run on Gradient](../../gradient-badge.svg)](TODO)

The best way to try this demo is on Paperspace Gradient's cloud IPUs. To run it on other hardware,
make sure that the Poplar SDK is enabled and that the latest PopTorch Geometric is installed.

In [5]:
%pip install -q -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


To improve your experience, we read some configuration from the environment that the notebook is running in.

In [10]:
import os

number_of_ipus = int(os.getenv("NUM_AVAILABLE_IPU", 16))
pod_type = os.getenv("GRAPHCORE_POD_TYPE", "pod16")
executable_cache_dir = os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "/tmp/exe_cache/")

# TODO Remove default
dataset_directory = os.getenv("DATASETS_DIR", ".")
checkpoint_directory = os.getenv("CHECKPOINT_DIR")

## Loading tabular data into PyTorch Geometric

### Getting the dataset

TODO: Using https://www.kaggle.com/c/ieee-fraud-detection/data

TODO: Run a script to download and tidy data?

In [12]:
import os.path as osp
import pandas as pd

raw_dataset_path = osp.join(dataset_directory, "raw")

train_transaction_path = osp.join(raw_dataset_path, "train_transaction.csv")
train_identity_path = osp.join(raw_dataset_path, "train_identity.csv")
train_transaction_df = pd.read_csv(train_transaction_path)
train_identity_df = pd.read_csv(train_identity_path)

test_transaction_path = osp.join(raw_dataset_path, "test_transaction.csv")
test_identity_path = osp.join(raw_dataset_path, "test_identity.csv")
test_transaction_df = pd.read_csv(test_transaction_path)
test_identity_df = pd.read_csv(test_identity_path)

In [13]:
transaction_df = pd.concat([train_transaction_df, test_transaction_df], axis=0)
identity_df = pd.concat([train_identity_df, test_identity_df], axis=0)

In [14]:
transaction_df.head()

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,V330,V331,V332,V333,V334,V335,V336,V337,V338,V339
0,2987000,0.0,86400,68.5,W,13926,,150.0,discover,142.0,...,,,,,,,,,,
1,2987001,0.0,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,...,,,,,,,,,,
2,2987002,0.0,86469,59.0,W,4663,490.0,150.0,visa,166.0,...,,,,,,,,,,
3,2987003,0.0,86499,50.0,W,18132,567.0,150.0,mastercard,117.0,...,,,,,,,,,,
4,2987004,0.0,86506,50.0,H,4497,514.0,150.0,mastercard,102.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
identity_df.head()

Unnamed: 0,TransactionID,id_01,id_02,id_03,id_04,id_05,id_06,id_07,id_08,id_09,...,id-29,id-30,id-31,id-32,id-33,id-34,id-35,id-36,id-37,id-38
0,2987004,0.0,70787.0,,,,,,,,...,,,,,,,,,,
1,2987008,-5.0,98945.0,,,0.0,-5.0,,,,...,,,,,,,,,,
2,2987010,-5.0,191631.0,0.0,0.0,0.0,0.0,,,0.0,...,,,,,,,,,,
3,2987011,-5.0,221832.0,,,0.0,-6.0,,,,...,,,,,,,,,,
4,2987016,0.0,7460.0,0.0,0.0,1.0,0.0,,,0.0,...,,,,,,,,,,


In [16]:
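# Sort by datetime; later we will use this to make a training and validation split.
# sort_values returns a new DataFrame, so assign the result to keep the ordering.
transaction_df = transaction_df.sort_values("TransactionDT")
transaction_df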

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,V330,V331,V332,V333,V334,V335,V336,V337,V338,V339
0,2987000,0.0,86400,68.500,W,13926,,150.0,discover,142.0,...,,,,,,,,,,
1,2987001,0.0,86401,29.000,W,2755,404.0,150.0,mastercard,102.0,...,,,,,,,,,,
2,2987002,0.0,86469,59.000,W,4663,490.0,150.0,visa,166.0,...,,,,,,,,,,
3,2987003,0.0,86499,50.000,W,18132,567.0,150.0,mastercard,117.0,...,,,,,,,,,,
4,2987004,0.0,86506,50.000,H,4497,514.0,150.0,mastercard,102.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
506686,4170235,,34214279,94.679,C,13832,375.0,185.0,mastercard,224.0,...,,,,,,,,,,
506687,4170236,,34214287,12.173,C,3154,408.0,185.0,mastercard,224.0,...,,,,,,,,,,
506688,4170237,,34214326,49.000,W,16661,490.0,150.0,visa,226.0,...,,,,,,,,,,
506689,4170238,,34214337,202.000,W,16621,516.0,150.0,mastercard,224.0,...,,,,,,,,,,
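
The comment in the cell above mentions a later training and validation split based on time. A minimal sketch of one way to do that, assuming the earliest transactions are used for training and the most recent for validation (the helper name `chronological_masks` and the default `train_fraction` are hypothetical, not part of this notebook):

```python
import pandas as pd
import torch


def chronological_masks(timestamps: pd.Series, train_fraction: float = 0.8):
    """Return boolean (train, validation) masks split by timestamp order."""
    # Positions of the rows ordered from oldest to newest
    order = timestamps.to_numpy().argsort(kind="stable")
    n_train = int(len(order) * train_fraction)
    train_mask = torch.zeros(len(order), dtype=torch.bool)
    # The oldest n_train rows go to training, the rest to validation
    train_mask[torch.as_tensor(order[:n_train], dtype=torch.long)] = True
    return train_mask, ~train_mask


# Toy timestamps standing in for the TransactionDT column
toy_dt = pd.Series([30, 10, 20, 40])
train_mask, val_mask = chronological_masks(toy_dt, train_fraction=0.5)
print(train_mask.tolist(), val_mask.tolist())
```

Splitting by time rather than at random keeps transactions from the validation period out of training, which is closer to how a fraud model would be used in practice.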


In [17]:
transaction_df = pd.merge(transaction_df, identity_df, on="TransactionID")
transaction_df

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,id-29,id-30,id-31,id-32,id-33,id-34,id-35,id-36,id-37,id-38
0,2987004,0.0,86506,50.000,H,4497,514.0,150.0,mastercard,102.0,...,,,,,,,,,,
1,2987008,0.0,86535,15.000,H,2803,100.0,150.0,visa,226.0,...,,,,,,,,,,
2,2987010,0.0,86549,75.887,C,16496,352.0,117.0,mastercard,134.0,...,,,,,,,,,,
3,2987011,0.0,86555,16.495,C,4461,375.0,185.0,mastercard,224.0,...,,,,,,,,,,
4,2987016,0.0,86620,30.000,H,1790,555.0,150.0,visa,226.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
286135,4170230,,34214253,10.452,C,5812,408.0,185.0,mastercard,224.0,...,NotFound,,chrome 71.0 for android,,,,F,F,T,F
286136,4170233,,34214271,13.403,C,3154,408.0,185.0,mastercard,224.0,...,Found,,chrome 71.0 for android,,,,F,F,T,F
286137,4170234,,34214277,50.000,H,9002,453.0,150.0,visa,226.0,...,NotFound,iOS 10.3.3,mobile safari 10.0,32.0,1334x750,match_status:2,T,F,F,T
286138,4170236,,34214287,12.173,C,3154,408.0,185.0,mastercard,224.0,...,NotFound,,chrome 43.0 for android,,,,F,F,T,F
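
Note that `pd.merge` performs an inner join by default, so transactions with no matching row in the identity table are dropped, which would explain why the merged table is shorter than the concatenated one. The toy example below, with made-up values, shows the difference between the default and `how="left"`:

```python
import pandas as pd

# Toy tables: transaction 2 has no identity record
transactions = pd.DataFrame(
    {"TransactionID": [1, 2, 3], "TransactionAmt": [10.0, 20.0, 30.0]}
)
identities = pd.DataFrame(
    {"TransactionID": [1, 3], "DeviceType": ["mobile", "desktop"]}
)

# Default inner join: only transactions present in both tables survive
inner = pd.merge(transactions, identities, on="TransactionID")
# Left join: every transaction is kept, missing identity fields become NaN
left = pd.merge(transactions, identities, on="TransactionID", how="left")

print(len(inner), len(left))
```

Whether to keep the transactions without identity information is a modelling choice; this notebook uses the inner join.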


In the interest of time, we take only the first 10000 samples.

TODO: See dataset.py for the full dataset preprocessing.

In [18]:
transaction_df = transaction_df.head(10000)

In [19]:
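# Remove the transactions where isFraud is NaN, then renumber the rows from 0
# because later cells use the row index as the transaction node index
transaction_df = transaction_df[transaction_df["isFraud"].notna()].reset_index(drop=True)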
transaction_df

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,id-29,id-30,id-31,id-32,id-33,id-34,id-35,id-36,id-37,id-38
0,2987004,0.0,86506,50.000,H,4497,514.0,150.0,mastercard,102.0,...,,,,,,,,,,
1,2987008,0.0,86535,15.000,H,2803,100.0,150.0,visa,226.0,...,,,,,,,,,,
2,2987010,0.0,86549,75.887,C,16496,352.0,117.0,mastercard,134.0,...,,,,,,,,,,
3,2987011,0.0,86555,16.495,C,4461,375.0,185.0,mastercard,224.0,...,,,,,,,,,,
4,2987016,0.0,86620,30.000,H,1790,555.0,150.0,visa,226.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,3021377,0.0,850458,100.000,R,2616,327.0,150.0,discover,102.0,...,,,,,,,,,,
9996,3021379,0.0,850491,25.419,C,15885,545.0,185.0,visa,138.0,...,,,,,,,,,,
9997,3021380,0.0,850500,11.893,C,13832,375.0,185.0,mastercard,224.0,...,,,,,,,,,,
9998,3021381,0.0,850503,20.000,H,3552,555.0,150.0,visa,226.0,...,,,,,,,,,,


In [28]:
non_target_node_types = ["card1", "card2", "card3", "card4", "card5", "card6",
                         "ProductCD", "addr1", "addr2", "P_emaildomain", "R_emaildomain"]
target_cat_feat_cols = ["M1", "M2", "M3", "M4", "M5", "M6", "M7", "M8", "M9",
                        "DeviceType", "DeviceInfo", "id_12", "id_13", "id_14",
                        "id_15", "id_16", "id_17", "id_18", "id_19", "id_20",
                        "id_21", "id_22", "id_23", "id_24", "id_25", "id_26",
                        "id_27", "id_28", "id_29", "id_30", "id_31", "id_32",
                        "id_33", "id_34", "id_35", "id_36", "id_37", "id_38"]
excl_cols = ["TransactionID", "isFraud", "TransactionDT"]

In [29]:

target_numeric_feat_cols = [
    column for column in transaction_df.columns
    if column not in non_target_node_types + excl_cols + target_cat_feat_cols]
print(" ".join(target_numeric_feat_cols))

TransactionAmt dist1 dist2 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D13 D14 D15 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 V41 V42 V43 V44 V45 V46 V47 V48 V49 V50 V51 V52 V53 V54 V55 V56 V57 V58 V59 V60 V61 V62 V63 V64 V65 V66 V67 V68 V69 V70 V71 V72 V73 V74 V75 V76 V77 V78 V79 V80 V81 V82 V83 V84 V85 V86 V87 V88 V89 V90 V91 V92 V93 V94 V95 V96 V97 V98 V99 V100 V101 V102 V103 V104 V105 V106 V107 V108 V109 V110 V111 V112 V113 V114 V115 V116 V117 V118 V119 V120 V121 V122 V123 V124 V125 V126 V127 V128 V129 V130 V131 V132 V133 V134 V135 V136 V137 V138 V139 V140 V141 V142 V143 V144 V145 V146 V147 V148 V149 V150 V151 V152 V153 V154 V155 V156 V157 V158 V159 V160 V161 V162 V163 V164 V165 V166 V167 V168 V169 V170 V171 V172 V173 V174 V175 V176 V177 V178 V179 V180 V181 V182 V183 V184 V185 V186 V187 V188 V189 V190 V191 V192 V193 V194 V195 V196 V19

In [30]:
transaction_feat_df = transaction_df[target_numeric_feat_cols + target_cat_feat_cols].copy()

In [31]:
transaction_feat_df = transaction_feat_df.fillna(0)

In [33]:
import torch

# Process categorical transaction features

# TODO: From pyg
# TODO: Check this

def get_cat_feat(df, key):
    categories = set(
        row[key] for _, row in df.iterrows())
    mapping = {cat: i for i, cat in enumerate(categories)}

    x = torch.zeros(len(df), len(mapping))
    for i, row in df.iterrows():
        x[i, mapping[row[key]]] = 1
    return x

cat_features = [get_cat_feat(transaction_feat_df, key) for key in target_cat_feat_cols]
cat_feats = torch.cat(cat_features, dim=-1)
cat_feats[0]

tensor([1., 1., 1.,  ..., 0., 0., 1.])
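
Looping over rows with `iterrows` is slow on large tables. As a hedged alternative, the sketch below builds the same kind of one-hot encoding with `pd.factorize` and `torch.nn.functional.one_hot`. The helper name `one_hot_column` is hypothetical, and the sketch assumes the column has no missing values (in this notebook they have already been filled with 0). The column order follows first appearance, so it can differ from the set-based ordering used above.

```python
import pandas as pd
import torch
import torch.nn.functional as F


def one_hot_column(df: pd.DataFrame, key: str) -> torch.Tensor:
    """One-hot encode a single categorical column without a Python row loop."""
    # codes[i] is the integer category of row i; uniques lists the categories
    codes, uniques = pd.factorize(df[key])
    codes = torch.as_tensor(codes, dtype=torch.long)
    return F.one_hot(codes, num_classes=len(uniques)).float()


# Toy data standing in for a column such as DeviceType
toy = pd.DataFrame({"DeviceType": ["mobile", "desktop", "mobile"]})
encoded = one_hot_column(toy, "DeviceType")
print(encoded)
```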

In [34]:
# Process non-categorical transaction features

# TODO: Do something with transactions amounts np.log

def process_val(val):
    if pd.isna(val):
        return 0.0
    return val

num_feats = [
    list(map(process_val, [row[feat] for feat in target_numeric_feat_cols]))
    for _, row in transaction_feat_df.iterrows()
]
num_feats = torch.tensor(num_feats)
num_feats.shape

torch.Size([10000, 420])
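
The numeric features can also be converted in one vectorised step. The sketch below is one possible approach rather than this notebook's method: the helper `numeric_features` and its `log_columns` argument are hypothetical, and applying `log1p` to the transaction amount is only an assumption prompted by the TODO above.

```python
import numpy as np
import pandas as pd
import torch


def numeric_features(df: pd.DataFrame, columns, log_columns=()) -> torch.Tensor:
    """Build a float32 feature matrix, replacing missing values with 0."""
    values = df[list(columns)].fillna(0.0).astype("float32")
    for column in log_columns:
        # log1p compresses heavy-tailed amounts and maps 0 to 0
        values[column] = np.log1p(values[column])
    return torch.tensor(values.to_numpy())


# Toy data with one missing value
toy_num = pd.DataFrame(
    {"TransactionAmt": [0.0, np.e - 1.0], "C1": [1.0, np.nan]}
)
toy_feats = numeric_features(
    toy_num, ["TransactionAmt", "C1"], log_columns=["TransactionAmt"]
)
print(toy_feats.shape, toy_feats.dtype)
```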

In [36]:
import torch.nn.functional as F

transaction_feats = torch.cat((cat_feats, num_feats), -1)
transaction_feats.shape

torch.Size([10000, 1917])

In [59]:
# TODO: This takes ages
# TODO: Doesn't work if all data is used
# TODO: Tidy this -> make encoders

get_cat_map = lambda vals: {val: idx for idx, val in enumerate(vals)}

def get_edge_list(df, identifier):
    # Find number of unique categories for this node type
    unique_entries = df[identifier].drop_duplicates().dropna()
    # Create a map of category to value
    entry_map = get_cat_map(unique_entries)
    print(len(entry_map))
    # Create edge list mapping transaction to node type
    edge_list = [[], []]

    for idx, transaction in df.iterrows():
        node_type_val = transaction[identifier]
        # Don't create nodes for NaN values
        if pd.isna(node_type_val):
            continue
        edge_list[0].append(idx)
        edge_list[1].append(entry_map[node_type_val])
    return torch.tensor(edge_list, dtype=torch.long)


In [60]:
get_edge_list(transaction_df, "addr2")[1].max()

27


tensor(26)
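
The TODOs above note that the row loop is slow. A minimal sketch of a vectorised version follows, assuming that source nodes are identified by row position rather than by the DataFrame index. The helper name `build_edge_index` is hypothetical.

```python
import numpy as np
import pandas as pd
import torch


def build_edge_index(df: pd.DataFrame, column: str):
    """Return (edge_index, num_entity_nodes) linking rows to category nodes."""
    values = df[column]
    # Rows with a missing value get no edge
    mask = values.notna().to_numpy()
    # codes number the distinct categories from 0 in order of first appearance
    codes, uniques = pd.factorize(values[mask])
    src = torch.as_tensor(np.flatnonzero(mask), dtype=torch.long)
    dst = torch.as_tensor(codes, dtype=torch.long)
    return torch.stack([src, dst]), len(uniques)


# Toy data: the second transaction has no card type
toy_cards = pd.DataFrame({"card4": ["visa", None, "mastercard", "visa"]})
toy_edge_index, toy_num_nodes = build_edge_index(toy_cards, "card4")
print(toy_edge_index, toy_num_nodes)
```

Returning the number of categories alongside the edge index avoids having to recover the node count later from the largest destination index.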

In [38]:
edge_dict = {node_type: get_edge_list(transaction_df, node_type) for node_type in non_target_node_types}

### Creating a PyTorch Geometric dataset

TODO: Create PyTorch Geometric dataset from above - maybe move to a separate script

In [39]:
from torch_geometric.data import HeteroData

data = HeteroData()

In [41]:
data["transaction"].num_nodes = len(transaction_df)
data["transaction"].x = transaction_feats
# Convert the label column to a NumPy array before building the tensor
data["transaction"].y = torch.tensor(transaction_df["isFraud"].to_numpy(), dtype=torch.long)


In [42]:
for node_type in non_target_node_types:
    # Node count as a plain int: highest destination index plus one
    num_nodes = int(edge_dict[node_type][1].max()) + 1
    data["transaction", "to", node_type].edge_index = edge_dict[node_type]
    data[node_type].num_nodes = num_nodes
    # TODO: Shouldn't need this
    data[node_type].x = torch.zeros((num_nodes, 1))

In [44]:
assert data.validate()

In [45]:
data

HeteroData(
  transaction={
    num_nodes=10000,
    x=[10000, 1917],
    y=[10000]
  },
  card1={
    num_nodes=2088,
    x=[2088, 1]
  },
  card2={
    num_nodes=333,
    x=[333, 1]
  },
  card3={
    num_nodes=40,
    x=[40, 1]
  },
  card4={
    num_nodes=4,
    x=[4, 1]
  },
  card5={
    num_nodes=56,
    x=[56, 1]
  },
  card6={
    num_nodes=3,
    x=[3, 1]
  },
  ProductCD={
    num_nodes=4,
    x=[4, 1]
  },
  addr1={
    num_nodes=177,
    x=[177, 1]
  },
  addr2={
    num_nodes=27,
    x=[27, 1]
  },
  P_emaildomain={
    num_nodes=57,
    x=[57, 1]
  },
  R_emaildomain={
    num_nodes=54,
    x=[54, 1]
  },
  (transaction, to, card1)={ edge_index=[2, 10000] },
  (transaction, to, card2)={ edge_index=[2, 9958] },
  (transaction, to, card3)={ edge_index=[2, 9999] },
  (transaction, to, card4)={ edge_index=[2, 9997] },
  (transaction, to, card5)

In [46]:
# Total number of nodes across all node types, as a plain int
int(data.num_nodes)

12843

### Visualize

In [48]:
import networkx as nx
from matplotlib import pyplot as plt
from torch_geometric.utils import to_networkx

# Convert to homogeneous
data_homogeneous = data.to_homogeneous()
g = to_networkx(data_homogeneous)
# Use node types as colour map
colour_map = data_homogeneous.node_type

# TODO: This maybe?
## Get labels
#labels = {str(idx): val for idx, val in enumerate(data_homogeneous.y)}

# Plot the graph
nx.draw(g, node_color=colour_map, with_labels=True)
plt.show()

RuntimeError: repeats can not be negative

## Conclusion

> The conclusion to your demo should:
>
> - summarise the main steps that were taken in the demo making clear what
>  your user got to do (similar to steps at the start but more
>  specific, you can link a specific feature/method/class to achieving a specific
>  outcome). (short paragraph: 3-6 sentences)
> - provide resources to go further: these can be links to other tutorials, to
>  documentation, to code examples in the public_examples repo, tech notes, deployments,
>  etc... (2-4 suggestions)
>
> To point users to notebooks in the same runtime, tell them where the file is rather than giving a link. For example: please see our tutorial, `<folder_name>/<notebook_name>.ipynb`. With relative links, the Paperspace platform downloads the file locally if the machine is running and returns a 404 if it is not. Full path links open in a new window.