Copyright (c) 2023 Graphcore Ltd. All rights reserved.

# Training a GNN to do Fraud Detection on Graphcore IPUs using your own dataset with PyTorch Geometric

TODO: Everything in this section

TODO: Update links:

[![Run on Gradient](../../gradient-badge.svg)](https://console.paperspace.com/github/<runtime-repo>?machine=Free-IPU-POD4&container=<dockerhub-image>&file=<path-to-file-in-repo>)  [![Join our Slack Community](https://img.shields.io/badge/Slack-Join%20Graphcore's%20Community-blue?style=flat-square&logo=slack)](https://www.graphcore.ai/join-community)

>
> We aim to keep our notebook app demos focused on what the user is trying to
> do. To help you do this correctly, please read [our user personas](https://graphcore.atlassian.net/wiki/spaces/PM/pages/3157131517/Notebook+personas#Ellie%3A-The-Data-Scientist%2C-Business-Analysis%2C-Consultant),
> and when in doubt ask yourself "does that person care about this?".
> To support that, the first paragraph will contain all the key information to
> help users rapidly identify whether this is the right notebook for them to go
> through, based on:
>
> - The task/business problem they are trying to solve,
> - The features that are used (Focus on big picture Deep learning features - e.g.
>  Distributed training, not I/O overlap).
>
> To achieve this, each notebook should start with the following elements
> (detailed in the next three comments):
>
> - a table highlighting what we are going to do
> - a very short intro (3-5 sentences)
> - clear "steps to resolution" (bullet points stating what the user will have
>    to do to tackle their problem on the IPU - these need to reflect the notebook,
>    and be as simple as possible)
> - links to additional related resources.
>

|  Domain | Tasks | Model | Datasets | Workflow |   Number of IPUs   | Execution time |
|---------|-------|-------|----------|----------|--------------|--------------|
|   GNNs   |  Fraud detection  | ? | ? | Training, evaluation | recommended: 16XX (min: 4X) | 20Xmn (X1h20mn)   |

>
>
> Start with a short introduction to the notebook. [suggested 3-5 sentences]
>
> This intro should focus on the problem you are fixing, and not on any IPU specific
> or framework specific features. The mindset is that anything that is non-standard
> is a barrier to entry, and will risk the user giving up.
>
> This short introduction should be followed by a clear bullet point summary of
> the steps of the demo. Each outcome should be of the form:
> - what the user will do (active verb) [and (optionally) how they do
>   it]. Jargon, if any, goes to the end of the bullet point.

In this demo, you will learn how to:

- Turn tabular transaction data into a PyTorch Geometric dataset
- Select a model suitable for the task of predicting fraudulent transactions
- Train the model on Graphcore IPUs
- Run validation on the trained model

This notebook assumes some familiarity with PopTorch as well as PyTorch Geometric (PyG). For additional resources please consult:
* [PopTorch Documentation](https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/index.html)
* [PopTorch Examples and Tutorials](https://docs.graphcore.ai/en/latest/examples.html#pytorch)
* [PyTorch Geometric](https://pytorch-geometric.readthedocs.io/en/latest/)
* [PopTorch Geometric Documentation](https://docs.graphcore.ai/projects/poptorch-geometric-user-guide/en/latest/index.html)


In [1]:
# Make imported python modules automatically reload when the files are changed
# needs to be before the first import.
%load_ext autoreload
%autoreload 2
# TODO: remove at the end of notebook development

In [2]:
# TODO: Add gc-logger?

## Environment setup

[![Run on Gradient](../../gradient-badge.svg)](TODO)

The best way to run this demo is on Paperspace Gradient's cloud IPUs. To run it on other IPU hardware,
make sure that the Poplar SDK is enabled and the latest version of PopTorch Geometric is installed.

In [3]:
%pip install -q -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


To improve your experience, we read some configuration from the environment in which you are running the notebook.

In [4]:
import os

number_of_ipus = int(os.getenv("NUM_AVAILABLE_IPU", 16))
pod_type = os.getenv("GRAPHCORE_POD_TYPE", "pod16")
executable_cache_dir = os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "/tmp/exe_cache/")

# TODO Remove default
dataset_directory = os.getenv("DATASETS_DIR", "~")
checkpoint_directory = os.getenv("CHECKPOINT_DIR")
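
`CHECKPOINT_DIR` has no default in the cell above, so `checkpoint_directory` is `None` when the variable is unset, and `os.path.join` does not expand a leading `~`. The following is a minimal sketch of one way to resolve such settings; `resolve_dir` and its `env` argument are hypothetical names introduced for illustration and are not part of the notebook.

```python
import os
import os.path as osp


def resolve_dir(name, default=None, env=None):
    """Read a directory from an environment variable and return an absolute path.

    Hypothetical helper for illustration; not part of the notebook.
    """
    env = os.environ if env is None else env
    value = env.get(name, default)
    if value is None:
        raise ValueError(f"Set the {name} environment variable or pass a default")
    # Expand "~" and make the path absolute so later joins are unambiguous
    return osp.abspath(osp.expanduser(value))


# Example using an explicit mapping instead of the real environment
example_env = {"DATASETS_DIR": "~/datasets"}
datasets_dir = resolve_dir("DATASETS_DIR", default="~", env=example_env)
```

Failing early with a clear message is one design option; falling back to a temporary directory would be another.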

## Loading tabular data into PyTorch Geometric

### Getting the dataset

TODO: Using https://www.kaggle.com/c/ieee-fraud-detection/data

TODO: Run a script to download and tidy data?

In [5]:
import os.path as osp
import pandas as pd

raw_dataset_path = osp.join(dataset_directory, "ieee-fraud-detection", "raw")

train_transaction_path = osp.join(raw_dataset_path, "train_transaction.csv")
train_identity_path = osp.join(raw_dataset_path, "train_identity.csv")
train_transaction_df = pd.read_csv(train_transaction_path)
train_identity_df = pd.read_csv(train_identity_path)

test_transaction_path = osp.join(raw_dataset_path, "test_transaction.csv")
test_identity_path = osp.join(raw_dataset_path, "test_identity.csv")
test_transaction_df = pd.read_csv(test_transaction_path)
test_identity_df = pd.read_csv(test_identity_path)

In [44]:
transaction_df = pd.concat([train_transaction_df, test_transaction_df], axis=0)
identity_df = pd.concat([train_identity_df, test_identity_df], axis=0)

In [45]:
transaction_df.head()

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,V330,V331,V332,V333,V334,V335,V336,V337,V338,V339
0,2987000,0.0,86400,68.5,W,13926,,150.0,discover,142.0,...,,,,,,,,,,
1,2987001,0.0,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,...,,,,,,,,,,
2,2987002,0.0,86469,59.0,W,4663,490.0,150.0,visa,166.0,...,,,,,,,,,,
3,2987003,0.0,86499,50.0,W,18132,567.0,150.0,mastercard,117.0,...,,,,,,,,,,
4,2987004,0.0,86506,50.0,H,4497,514.0,150.0,mastercard,102.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [46]:
identity_df.head()

Unnamed: 0,TransactionID,id_01,id_02,id_03,id_04,id_05,id_06,id_07,id_08,id_09,...,id-29,id-30,id-31,id-32,id-33,id-34,id-35,id-36,id-37,id-38
0,2987004,0.0,70787.0,,,,,,,,...,,,,,,,,,,
1,2987008,-5.0,98945.0,,,0.0,-5.0,,,,...,,,,,,,,,,
2,2987010,-5.0,191631.0,0.0,0.0,0.0,0.0,,,0.0,...,,,,,,,,,,
3,2987011,-5.0,221832.0,,,0.0,-6.0,,,,...,,,,,,,,,,
4,2987016,0.0,7460.0,0.0,0.0,1.0,0.0,,,0.0,...,,,,,,,,,,
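
The header above contains both `id_01`-style and `id-29`-style columns, which suggests that the train and test identity files name these columns differently, so the concatenation keeps two separate sets of columns. A minimal sketch, on made-up toy frames, of harmonising the names before calling `pd.concat`:

```python
import pandas as pd

# Toy stand-ins for the train and test identity tables (values are made up)
train_identity = pd.DataFrame({"TransactionID": [1, 2], "id_01": [0.0, -5.0]})
test_identity = pd.DataFrame({"TransactionID": [3], "id-01": [-10.0]})

# Rename "id-XX" to "id_XX" so both tables share the same column names
test_identity = test_identity.rename(columns=lambda name: name.replace("id-", "id_"))

identity = pd.concat([train_identity, test_identity], axis=0, ignore_index=True)
```

With matching names the concatenated frame has a single `id_01` column and no NaN padding from mismatched headers.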


In [47]:
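# Sort by datetime; later we will use this order to make a training and validation split.
# `sort_values` returns a new DataFrame, so assign the result back before displaying it.
transaction_df = transaction_df.sort_values("TransactionDT")
transaction_df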

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,V330,V331,V332,V333,V334,V335,V336,V337,V338,V339
0,2987000,0.0,86400,68.500,W,13926,,150.0,discover,142.0,...,,,,,,,,,,
1,2987001,0.0,86401,29.000,W,2755,404.0,150.0,mastercard,102.0,...,,,,,,,,,,
2,2987002,0.0,86469,59.000,W,4663,490.0,150.0,visa,166.0,...,,,,,,,,,,
3,2987003,0.0,86499,50.000,W,18132,567.0,150.0,mastercard,117.0,...,,,,,,,,,,
4,2987004,0.0,86506,50.000,H,4497,514.0,150.0,mastercard,102.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
506686,4170235,,34214279,94.679,C,13832,375.0,185.0,mastercard,224.0,...,,,,,,,,,,
506687,4170236,,34214287,12.173,C,3154,408.0,185.0,mastercard,224.0,...,,,,,,,,,,
506688,4170237,,34214326,49.000,W,16661,490.0,150.0,visa,226.0,...,,,,,,,,,,
506689,4170238,,34214337,202.000,W,16621,516.0,150.0,mastercard,224.0,...,,,,,,,,,,


In [48]:
transaction_df = pd.merge(transaction_df, identity_df, on="TransactionID")
transaction_df

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,id-29,id-30,id-31,id-32,id-33,id-34,id-35,id-36,id-37,id-38
0,2987004,0.0,86506,50.000,H,4497,514.0,150.0,mastercard,102.0,...,,,,,,,,,,
1,2987008,0.0,86535,15.000,H,2803,100.0,150.0,visa,226.0,...,,,,,,,,,,
2,2987010,0.0,86549,75.887,C,16496,352.0,117.0,mastercard,134.0,...,,,,,,,,,,
3,2987011,0.0,86555,16.495,C,4461,375.0,185.0,mastercard,224.0,...,,,,,,,,,,
4,2987016,0.0,86620,30.000,H,1790,555.0,150.0,visa,226.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
286135,4170230,,34214253,10.452,C,5812,408.0,185.0,mastercard,224.0,...,NotFound,,chrome 71.0 for android,,,,F,F,T,F
286136,4170233,,34214271,13.403,C,3154,408.0,185.0,mastercard,224.0,...,Found,,chrome 71.0 for android,,,,F,F,T,F
286137,4170234,,34214277,50.000,H,9002,453.0,150.0,visa,226.0,...,NotFound,iOS 10.3.3,mobile safari 10.0,32.0,1334x750,match_status:2,T,F,F,T
286138,4170236,,34214287,12.173,C,3154,408.0,185.0,mastercard,224.0,...,NotFound,,chrome 43.0 for android,,,,F,F,T,F
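
`pd.merge` performs an inner join by default, so transactions without a matching identity row are dropped, which is consistent with the smaller row count shown above. A small sketch on made-up toy tables, comparing the default with a left join and counting the rows that have no identity record:

```python
import pandas as pd

# Toy tables (values are made up): transaction 1 has no identity record
transactions = pd.DataFrame({"TransactionID": [1, 2, 3],
                             "TransactionAmt": [10.0, 20.0, 30.0]})
identities = pd.DataFrame({"TransactionID": [2, 3],
                           "DeviceType": ["mobile", "desktop"]})

# Default behaviour: how="inner" keeps only IDs present in both tables
inner = pd.merge(transactions, identities, on="TransactionID")

# how="left" keeps every transaction; indicator=True adds a "_merge" column
left = pd.merge(transactions, identities, on="TransactionID",
                how="left", indicator=True)
num_without_identity = int((left["_merge"] == "left_only").sum())
```

Whether to keep transactions without identity information is a modelling choice; this sketch only shows how to measure what the inner join discards.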


In [49]:
# Remove the transactions where isFraud is NaN
transaction_df = transaction_df[transaction_df["isFraud"].notna()]
transaction_df

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,id-29,id-30,id-31,id-32,id-33,id-34,id-35,id-36,id-37,id-38
0,2987004,0.0,86506,50.000,H,4497,514.0,150.0,mastercard,102.0,...,,,,,,,,,,
1,2987008,0.0,86535,15.000,H,2803,100.0,150.0,visa,226.0,...,,,,,,,,,,
2,2987010,0.0,86549,75.887,C,16496,352.0,117.0,mastercard,134.0,...,,,,,,,,,,
3,2987011,0.0,86555,16.495,C,4461,375.0,185.0,mastercard,224.0,...,,,,,,,,,,
4,2987016,0.0,86620,30.000,H,1790,555.0,150.0,visa,226.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
144228,3577521,0.0,15810802,48.877,C,12019,305.0,106.0,mastercard,224.0,...,,,,,,,,,,
144229,3577526,1.0,15810876,250.000,R,1214,174.0,150.0,visa,226.0,...,,,,,,,,,,
144230,3577529,0.0,15810912,73.838,C,5096,555.0,185.0,mastercard,137.0,...,,,,,,,,,,
144231,3577531,0.0,15810935,400.000,R,6019,583.0,150.0,visa,226.0,...,,,,,,,,,,


In [50]:
# TODO: Remove this
transaction_df = transaction_df.head(1000)

In [51]:
non_target_node_types = ["card1", "card2", "card3", "card4", "card5", "card6",
                         "ProductCD", "addr1", "addr2", "P_emaildomain", "R_emaildomain"]

In [52]:
excl_cols = ["TransactionID", "isFraud", "TransactionDT"]
transaction_cat_features = ["M1", "M2", "M3", "M4", "M5", "M6", "M7", "M8", "M9",
                            "DeviceType", "DeviceInfo", "id_12", "id_13", "id_14",
                            "id_15", "id_16", "id_17", "id_18", "id_19", "id_20",
                            "id_21", "id_22", "id_23", "id_24", "id_25", "id_26",
                            "id_27", "id_28", "id_29", "id_30", "id_31", "id_32",
                            "id_33", "id_34", "id_35", "id_36", "id_37", "id_38"]
transaction_numeric_features = [column for column in transaction_df.columns
                                if column not in non_target_node_types + excl_cols + transaction_cat_features]

In [53]:
transaction_feat_df = transaction_df[transaction_numeric_features + transaction_cat_features].copy()

In [54]:
transaction_feat_df = transaction_feat_df.fillna(0)

In [55]:
import torch

# Process categorical transaction features

# TODO: From pyg
# TODO: Check this

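class CategoricalEncoder:
    """One-hot encode a single categorical column of a DataFrame."""

    def __init__(self, key):
        self.key = key

    def __call__(self, df):
        # Map every distinct value in the column to a column index
        categories = set(df[self.key])
        mapping = {cat: i for i, cat in enumerate(categories)}

        x = torch.zeros(len(df), len(mapping))
        # Use the row position rather than the index label, so the encoder
        # also works when the DataFrame index is not 0..len(df)-1
        for i, value in enumerate(df[self.key]):
            x[i, mapping[value]] = 1
        return x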


cat_encoders = [CategoricalEncoder(key) for key in transaction_cat_features]
cat_features = [cat_enc(transaction_feat_df) for cat_enc in cat_encoders]
node_feats = torch.cat(cat_features, dim=-1)
node_feats[0]

tensor([1., 1., 1., 1., 0., 0., 0., 1., 1., 1., 1., 1., 0., 0., 1., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0.,
        0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 
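
As a possible alternative to the row-by-row encoder, pandas can build the same kind of one-hot matrix in a single call. This is a sketch on made-up toy data, not the notebook's method; the resulting array could then be wrapped with `torch.from_numpy`.

```python
import pandas as pd

# Toy categorical features (values are made up)
frame = pd.DataFrame({"DeviceType": ["mobile", "desktop", "mobile"],
                      "M1": ["T", "F", "T"]})

# One column per (feature, category) pair, in sorted category order
encoded = pd.get_dummies(frame, columns=["DeviceType", "M1"], dtype="float32")
one_hot = encoded.to_numpy()
```

Unlike a `set`-based mapping, the column order here is deterministic, which makes the encoded features reproducible between runs.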

In [56]:
# Process non-categorical transaction features

# TODO: Do something with transactions amounts np.log

def process_val(val):
    if pd.isna(val):
        return 0.0
    return val

other_feats = [
    list(map(process_val, [row[feat] for feat in transaction_numeric_features]))
    for _, row in transaction_feat_df.iterrows()
]
other_feats = torch.tensor(other_feats)
other_feats.shape

torch.Size([1000, 420])
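
The TODO above suggests log-scaling the transaction amounts. A minimal sketch of one way to do this, assuming amounts are non-negative; `log_scale_amounts` is a hypothetical helper, and `np.log1p` is used instead of `np.log` so that an amount of zero maps to zero rather than negative infinity.

```python
import numpy as np


def log_scale_amounts(amounts):
    """Compress the range of transaction amounts with log(1 + x).

    Hypothetical helper for illustration; not part of the notebook.
    """
    amounts = np.asarray(amounts, dtype=np.float64)
    # Clip guards against negative values, which would be invalid for a log
    return np.log1p(np.clip(amounts, 0.0, None))


scaled = log_scale_amounts([0.0, 68.5, 29.0, 400.0])
```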

In [57]:
import torch.nn.functional as F

node_feats = torch.cat((node_feats, other_feats), -1)
node_feats.shape

torch.Size([1000, 963])

### Inspecting the dataset

TODO: Look at a few of the columns - call out important ones we will use for features and labels

TODO: Create validation / test split

NOTES: TODO: Remove

- Nodes are transaction IDs
-

### Creating a PyTorch Geometric dataset

TODO: Create PyTorch Geometric dataset from above - maybe move to a separate script

In [58]:
# TODO: This takes ages
# TODO: Doesn't work if all data is used
# TODO: Tidy this -> make encoders

def get_cat_map(vals):
    """Map each value to a consecutive integer ID."""
    return {val: idx for idx, val in enumerate(vals)}


def get_edge_list(df, identifier):
    # Find the unique, non-NaN categories for this node type
    unique_entries = df[identifier].drop_duplicates().dropna()
    # Create a map from category value to node index
    entry_map = get_cat_map(unique_entries)
    # Create edge list mapping transaction to node type
    edge_list = [[], []]

    # Iterate over `df` (not the global `transaction_df`) and use the row
    # position as the transaction node index
    for idx, node_type_val in enumerate(df[identifier]):
        # Don't create edges for NaN values
        if pd.isna(node_type_val):
            continue
        edge_list[0].append(idx)
        edge_list[1].append(entry_map[node_type_val])
    return torch.tensor(edge_list, dtype=torch.long)
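
The TODO above notes that building the edge lists is slow. One possible vectorised approach, sketched here with NumPy arrays on made-up toy data, uses `pd.factorize` to assign the category IDs; `build_edge_index` is a hypothetical name, and the result could be converted with `torch.from_numpy`.

```python
import numpy as np
import pandas as pd


def build_edge_index(df, identifier):
    """Return a [2, num_edges] array linking row positions to category IDs.

    Hypothetical helper for illustration; not part of the notebook.
    """
    values = df[identifier]
    keep = values.notna().to_numpy()
    # factorize assigns consecutive integer IDs in order of first appearance
    codes, uniques = pd.factorize(values[keep])
    # Row positions of the transactions that have a value for this column
    src = np.flatnonzero(keep)
    return np.stack([src, codes]).astype(np.int64), len(uniques)


toy = pd.DataFrame({"card4": ["visa", None, "mastercard", "visa"]})
edge_index, num_card4_nodes = build_edge_index(toy, "card4")
```

Rows with a missing value produce no edge, matching the behaviour of the loop-based version.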


In [59]:
edge_dict = {node_type: get_edge_list(transaction_df, node_type) for node_type in non_target_node_types}

In [60]:
from torch_geometric.data import HeteroData

data = HeteroData()

In [61]:
data["transaction"].num_nodes = len(transaction_df)
data["transaction"].x = node_feats
data["transaction"].y = torch.tensor(transaction_df["isFraud"].to_numpy(), dtype=torch.long)


In [62]:
for node_type in non_target_node_types:
    data["transaction", "to", node_type].edge_index = edge_dict[node_type]
    data[node_type].num_nodes = edge_dict[node_type][1].max() + 1
    # TODO: Shouldn't need this
    data[node_type].x = torch.zeros((edge_dict[node_type][1].max() + 1, 1))

In [63]:
data.validate()

True

In [64]:
data

HeteroData(
  transaction={
    num_nodes=1000,
    x=[1000, 963],
    y=[1000]
  },
  card1={
    num_nodes=415,
    x=[415, 1]
  },
  card2={
    num_nodes=132,
    x=[132, 1]
  },
  card3={
    num_nodes=16,
    x=[16, 1]
  },
  card4={
    num_nodes=4,
    x=[4, 1]
  },
  card5={
    num_nodes=35,
    x=[35, 1]
  },
  card6={
    num_nodes=2,
    x=[2, 1]
  },
  ProductCD={
    num_nodes=4,
    x=[4, 1]
  },
  addr1={
    num_nodes=51,
    x=[51, 1]
  },
  addr2={
    num_nodes=2,
    x=[2, 1]
  },
  P_emaildomain={
    num_nodes=32,
    x=[32, 1]
  },
  R_emaildomain={
    num_nodes=26,
    x=[26, 1]
  },
  (transaction, to, card1)={ edge_index=[2, 1000] },
  (transaction, to, card2)={ edge_index=[2, 992] },
  (transaction, to, card3)={ edge_index=[2, 1000] },
  (transaction, to, card4)={ edge_index=[2, 1000] },
  (transaction, to, card5)={ edge_i

In [65]:
data.num_nodes

tensor(1719)

### Visualize

In [66]:
#import networkx as nx
#from matplotlib import pyplot as plt
#from torch_geometric.utils import to_networkx
#
## Convert to homogeneous
#data_homogeneous = data.to_homogeneous()
#g = to_networkx(data_homogeneous)
## Use node types as colour map
#colour_map = data_homogeneous.node_type
#
## TODO: This maybe?
### Get labels
##labels = {str(idx): val for idx, val in enumerate(data_homogeneous.y)}
#
## Plot the graph
#nx.draw(g, node_color=colour_map, with_labels=True)
#plt.show()

## Preprocess

In [67]:
import torch_geometric.transforms as T

data = T.ToUndirected()(data)
data = T.AddSelfLoops()(data)
data = T.NormalizeFeatures()(data)

data

HeteroData(
  transaction={
    num_nodes=1000,
    x=[1000, 963],
    y=[1000]
  },
  card1={
    num_nodes=415,
    x=[415, 1]
  },
  card2={
    num_nodes=132,
    x=[132, 1]
  },
  card3={
    num_nodes=16,
    x=[16, 1]
  },
  card4={
    num_nodes=4,
    x=[4, 1]
  },
  card5={
    num_nodes=35,
    x=[35, 1]
  },
  card6={
    num_nodes=2,
    x=[2, 1]
  },
  ProductCD={
    num_nodes=4,
    x=[4, 1]
  },
  addr1={
    num_nodes=51,
    x=[51, 1]
  },
  addr2={
    num_nodes=2,
    x=[2, 1]
  },
  P_emaildomain={
    num_nodes=32,
    x=[32, 1]
  },
  R_emaildomain={
    num_nodes=26,
    x=[26, 1]
  },
  (transaction, to, card1)={ edge_index=[2, 1000] },
  (transaction, to, card2)={ edge_index=[2, 992] },
  (transaction, to, card3)={ edge_index=[2, 1000] },
  (transaction, to, card4)={ edge_index=[2, 1000] },
  (transaction, to, card5)={ edge_i

### Create dataset splits

TODO: The transactions are sorted by time above, so the most recent ones are selected as the validation and test sets.

In [68]:
num_nodes_train = int(0.8 * data["transaction"].num_nodes)
data["transaction"].train_mask = torch.zeros(data["transaction"].num_nodes, dtype=bool)
data["transaction"].train_mask[:num_nodes_train] = True
data["transaction"].val_mask = torch.zeros(data["transaction"].num_nodes, dtype=bool)
data["transaction"].val_mask[num_nodes_train:] = True

print(f"Number of training nodes: {data['transaction'].train_mask.sum()}")
print(f"Number of validation nodes: {data['transaction'].val_mask.sum()}")

Number of training nodes: 800
Number of validation nodes: 200
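
The split above relies on the rows already being in time order. A minimal sketch of a split that sorts explicitly and also reserves a test set; the 80/10/10 fractions, the toy timestamps and the name `chronological_masks` are assumptions made for illustration.

```python
import numpy as np


def chronological_masks(timestamps, train_frac=0.8, val_frac=0.1):
    """Build boolean train/val/test masks with the oldest samples in train.

    Hypothetical helper for illustration; not part of the notebook.
    """
    timestamps = np.asarray(timestamps)
    # Positions of the samples from oldest to newest
    order = np.argsort(timestamps, kind="stable")
    n = len(timestamps)
    n_train = int(train_frac * n)
    n_val = int(val_frac * n)
    masks = {name: np.zeros(n, dtype=bool) for name in ("train", "val", "test")}
    masks["train"][order[:n_train]] = True
    masks["val"][order[n_train:n_train + n_val]] = True
    masks["test"][order[n_train + n_val:]] = True
    return masks


# Toy timestamps in arbitrary order (values are made up)
masks = chronological_masks([50, 10, 40, 20, 30, 90, 60, 80, 70, 100])
```

Because the masks are indexed by original position, they can be attached to the node store without reordering the features.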


In [69]:
data

HeteroData(
  transaction={
    num_nodes=1000,
    x=[1000, 963],
    y=[1000],
    train_mask=[1000],
    val_mask=[1000]
  },
  card1={
    num_nodes=415,
    x=[415, 1]
  },
  card2={
    num_nodes=132,
    x=[132, 1]
  },
  card3={
    num_nodes=16,
    x=[16, 1]
  },
  card4={
    num_nodes=4,
    x=[4, 1]
  },
  card5={
    num_nodes=35,
    x=[35, 1]
  },
  card6={
    num_nodes=2,
    x=[2, 1]
  },
  ProductCD={
    num_nodes=4,
    x=[4, 1]
  },
  addr1={
    num_nodes=51,
    x=[51, 1]
  },
  addr2={
    num_nodes=2,
    x=[2, 1]
  },
  P_emaildomain={
    num_nodes=32,
    x=[32, 1]
  },
  R_emaildomain={
    num_nodes=26,
    x=[26, 1]
  },
  (transaction, to, card1)={ edge_index=[2, 1000] },
  (transaction, to, card2)={ edge_index=[2, 992] },
  (transaction, to, card3)={ edge_index=[2, 1000] },
  (transaction, to, card4)={ edge_index=[2, 1000] },

In [70]:
num_fraud_train = data["transaction"].y[data["transaction"].train_mask].sum()
# Count only the nodes inside each split; `len(mask)` would count every node
num_total_train = data["transaction"].train_mask.sum()
num_fraud_val = data["transaction"].y[data["transaction"].val_mask].sum()
num_total_val = data["transaction"].val_mask.sum()

In [71]:
# Fraction of fraudulent transactions in each split
percentage_fraud_train = num_fraud_train / num_total_train
percentage_fraud_val = num_fraud_val / num_total_val
print(f"{percentage_fraud_train = :%}")
print(f"{percentage_fraud_val = :%}")

percentage_fraud_train = 4.750000%
percentage_fraud_val = 5.500000%


In [72]:
# Use this to set a class weight
class_weight = (
    (num_total_train / (2 * (num_total_train - num_fraud_train))).item(),
    (num_total_train / (2 * num_fraud_train)).item())
print(f"class_weight = ({class_weight[0]:.4f}, {class_weight[1]:.4f})")

class_weight = (0.5249, 10.5263)
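
The class weights follow the common "balanced" heuristic, in which each class receives the weight N / (2 * N_c), so that both classes contribute equally to the total loss. A small worked sketch with made-up counts; `balanced_class_weights` is a hypothetical name.

```python
def balanced_class_weights(num_total, num_positive):
    """Return (negative_weight, positive_weight) for a two-class problem.

    Hypothetical helper for illustration; not part of the notebook.
    """
    if not 0 < num_positive < num_total:
        raise ValueError("Both classes must have at least one sample")
    num_negative = num_total - num_positive
    return (num_total / (2 * num_negative), num_total / (2 * num_positive))


# Made-up counts: 1000 samples, of which 50 are positive
weights = balanced_class_weights(num_total=1000, num_positive=50)

# Each class now carries half of the total weight
negative_mass = weights[0] * 950
positive_mass = weights[1] * 50
```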

## Dataloading

TODO: Is the graph so large that we need to do some sampling, for example neighbour sampling?

In [73]:
from torch_geometric.loader import NeighborLoader

batch_size = 128
num_layers = 2

train_loader = NeighborLoader(
    data,
    num_neighbors=[15] * num_layers,
    batch_size=batch_size,
    input_nodes=('transaction', data['transaction'].train_mask),
)



In [74]:
from poptorch_geometric import FixedSizeOptions

# TODO: Use this
# fixed_size_options = FixedSizeOptions.from_loader(train_loader)
fixed_size_options = FixedSizeOptions(
    num_nodes=2000,
    num_edges=1000,
)

fixed_size_options

<poptorch_geometric.fixed_size_options.FixedSizeOptions at 0x7f2660ecf550>
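
Fixed sizes matter because the IPU compiles a static graph, so every batch must have the same tensor shapes. The sketch below illustrates the general idea of padding a graph to a fixed number of nodes and edges with NumPy. It is an illustration under simplifying assumptions, not PopTorch Geometric's implementation, and `pad_graph` is a hypothetical name.

```python
import numpy as np


def pad_graph(x, edge_index, num_nodes, num_edges):
    """Pad node features and edges to fixed sizes and return a node mask.

    Hypothetical helper for illustration; not the library's implementation.
    """
    n, e = x.shape[0], edge_index.shape[1]
    # At least one spare node is needed to attach the padding edges to
    if n >= num_nodes or e > num_edges:
        raise ValueError("Graph does not fit in the requested fixed size")
    x_padded = np.zeros((num_nodes, x.shape[1]), dtype=x.dtype)
    x_padded[:n] = x
    # Padding edges are self-loops on the last padding node, so they
    # never touch a real node
    edge_padded = np.full((2, num_edges), num_nodes - 1, dtype=edge_index.dtype)
    edge_padded[:, :e] = edge_index
    # True for real nodes, False for padding
    nodes_mask = np.arange(num_nodes) < n
    return x_padded, edge_padded, nodes_mask


x = np.ones((3, 2), dtype=np.float32)
edge_index = np.array([[0, 1], [1, 2]], dtype=np.int64)
x_padded, edge_padded, nodes_mask = pad_graph(x, edge_index, num_nodes=5, num_edges=4)
```

A mask like this is what allows padded nodes to be excluded from the loss later on.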

In [75]:
from draft_fixed_size_neighbour_loader import FixedSizeNeighborLoader

train_loader_ipu = FixedSizeNeighborLoader(
    data,
    num_neighbors=[15] * num_layers,
    fixed_size_options=fixed_size_options,
    batch_size=batch_size,
    input_nodes=('transaction', data['transaction'].train_mask),
    exclude_keys=("batch_size",),
)

In [76]:
sample = next(iter(train_loader_ipu))
sample

HeteroDataBatch(
  graphs_mask=[2],
  num_nodes=24000,
  num_edges=22000,
  transaction={
    num_nodes=2000,
    x=[2000, 963],
    y=[2000],
    train_mask=[2000],
    val_mask=[2000],
    n_id=[2000],
    input_id=[1326],
    batch=[2000],
    ptr=[3],
    nodes_mask=[2000]
  },
  card1={
    num_nodes=2000,
    x=[2000, 1],
    n_id=[2000],
    batch=[2000],
    ptr=[3],
    nodes_mask=[2000]
  },
  card2={
    num_nodes=2000,
    x=[2000, 1],
    n_id=[2000],
    batch=[2000],
    ptr=[3],
    nodes_mask=[2000]
  },
  card3={
    num_nodes=2000,
    x=[2000, 1],
    n_id=[2000],
    batch=[2000],
    ptr=[3],
    nodes_mask=[2000]
  },
  card4={
    num_nodes=2000,
    x=[2000, 1],
    n_id=[2000],
    batch=[2000],
    ptr=[3],
    nodes_mask=[2000]
  },
  card5={
    num_nodes=2000,
    x=[2000, 1],
    n_id=[2000],
    batch=[2000],
    ptr=[3],
    nodes_mask=[2000]
  },
  card6={
    num_nodes=2000,
    x=[2000, 1],
    

## Picking the right model

TODO: Describe the task

TODO: Describe different relations type - want something that can do different relations (heterogeneous graph) - RGCN could be a sensible choice
    - but requires weights for every relation type - we have 11 relation types so might be ok

TODO: Try CompGCN
    - Only requires 3 weights - in, out, self loops
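
To make the trade-off in the notes above concrete, the sketch below counts only the weight-matrix parameters per layer for the two options: one matrix per relation type versus three shared matrices. The hidden size of 64 is an assumption, and biases, relation embeddings and any other parameters are deliberately ignored, so these are rough figures rather than exact model sizes.

```python
def weight_matrix_params(num_matrices, in_channels, out_channels):
    """Number of parameters in `num_matrices` dense weight matrices."""
    return num_matrices * in_channels * out_channels


hidden = 64  # assumed layer width, for illustration only

# One weight matrix per relation type (11 relation types, as noted above)
per_relation = weight_matrix_params(11, hidden, hidden)

# Three shared matrices: incoming edges, outgoing edges and self loops
shared = weight_matrix_params(3, hidden, hidden)
```

The per-relation count grows linearly with the number of relation types, while the shared count stays constant.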

In [83]:
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import Linear, SAGEConv, to_hetero

import poptorch

# TODO: Include num layers?
class GNN(torch.nn.Module):
    def __init__(self, hidden_channels):
        super().__init__()
        self.conv1 = SAGEConv((-1, -1), hidden_channels)
        self.conv2 = SAGEConv((-1, -1), hidden_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index)
        return x


class Model(torch.nn.Module):

    def __init__(self,
                 hetero_gnn,
                 embedding_size,
                 out_channels,
                 class_weight=None,
                 batch_size=None):
        super().__init__()
        self.hetero_gnn = hetero_gnn
        self.embedding = nn.ModuleDict({
            node_type: nn.Embedding(data[node_type].num_nodes, embedding_size)
            for node_type in data.node_types
            if node_type != "transaction"
        })
        self.linear = Linear(-1, out_channels)
        self.full_batch = (batch_size is None)
        self.batch_size = batch_size
        self.class_weight = class_weight

    def forward(self,
                x_dict,
                edge_index_dict,
                n_id_dict=None,
                target=None,
                mask=None):
        for _, _, node_type in edge_index_dict.keys():
            if node_type != "transaction":
                if self.full_batch:
                    x_dict[node_type] = self.embedding[node_type].weight
                else:
                    assert n_id_dict is not None, (
                        "If using a sampled batch, `n_id_dict` must"
                        " be provided.")
                    x_dict[node_type] = self.embedding[node_type](n_id_dict[node_type])

        x_dict = self.hetero_gnn(x_dict, edge_index_dict)
        out = self.linear(x_dict['transaction'])
        if self.training:
            if self.full_batch:
                loss = F.cross_entropy(out, target, reduction='none')

                if self.class_weight is not None:
                    class_weight = target * self.class_weight[1] + (1 - target) * self.class_weight[0]
                    class_weight *= mask
                    class_weight *= (mask.sum() / class_weight.sum())
                    loss *= class_weight

                loss *= mask
                loss = loss.sum() / mask.sum()
                loss = poptorch.identity_loss(loss, reduction='none')
            else:
                out = out[:self.batch_size]
                target = target[:self.batch_size]
                # TODO: Use nodes_mask here

                loss = F.cross_entropy(out, target, reduction='none')

                if self.class_weight is not None:
                    class_weight = target * self.class_weight[1] + (1 - target) * self.class_weight[0]
                    class_weight *= (self.batch_size / class_weight.sum())
                    loss *= class_weight
                loss = poptorch.identity_loss(loss, reduction='mean')
            return out, loss
        return out


model = GNN(hidden_channels=64)
model = to_hetero(model, data.metadata(), aggr="sum")
model = Model(model,
              embedding_size=128,
              out_channels=2,
              class_weight=class_weight)
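
The full-batch loss in `Model.forward` combines three steps: a per-node cross entropy, a per-node class weight that is rescaled so the weights sum to the number of masked nodes, and a masked mean. The NumPy sketch below reproduces those steps outside PopTorch so they can be checked on a small input; `masked_weighted_cross_entropy` is a hypothetical name and the toy values are made up.

```python
import numpy as np


def masked_weighted_cross_entropy(logits, target, mask, class_weight=None):
    """Masked mean of a class-weighted cross entropy (binary targets)."""
    logits = np.asarray(logits, dtype=np.float64)
    target = np.asarray(target)
    mask = np.asarray(mask, dtype=np.float64)
    # Numerically stable log-softmax over the class dimension
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(len(target)), target]
    if class_weight is not None:
        # Per-node weight, zeroed outside the mask
        weight = np.where(target == 1, class_weight[1], class_weight[0]) * mask
        # Rescale so the weights sum to the number of masked nodes
        weight *= mask.sum() / weight.sum()
        loss = loss * weight
    loss = loss * mask
    return loss.sum() / mask.sum()


logits = np.zeros((4, 2))  # uniform predictions: every node has loss ln(2)
target = [0, 1, 0, 1]
mask = [1, 1, 1, 0]        # the last node is excluded

unweighted = masked_weighted_cross_entropy(logits, target, mask)
weighted = masked_weighted_cross_entropy(logits, target, mask, class_weight=(0.5, 10.0))
```

With uniform predictions both values equal ln(2), which shows that the rescaling keeps the loss on the same scale whether or not class weights are used.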

In [84]:
with torch.no_grad():  # Initialize lazy modules.
    out_cpu, loss = model(data.x_dict,
                          data.edge_index_dict,
                          target=data["transaction"].y,
                          mask=data["transaction"].train_mask)
out_cpu

tensor([[ 0.1387, -0.7291],
        [-0.9554, -0.1080],
        [-1.7077,  0.2353],
        ...,
        [ 0.9946, -0.0627],
        [-0.1329, -1.5922],
        [ 0.1098, -2.2154]])

In [85]:
import poptorch

model.eval()
inf_model = poptorch.inferenceModel(model)
out_ipu = inf_model(data.x_dict,
                    data.edge_index_dict,
                    target=data["transaction"].y,
                    mask=data["transaction"].train_mask)
out_ipu

Graph compilation: 100%|██████████| 100/100 [01:09<00:00]


tensor([[ 0.1387, -0.7291],
        [-0.9554, -0.1080],
        [-1.7077,  0.2353],
        ...,
        [ 0.9946, -0.0627],
        [-0.1329, -1.5922],
        [ 0.1098, -2.2154]])

In [86]:
assert torch.allclose(out_cpu, out_ipu, rtol=1e-05, atol=1e-05)

## Training the model

TODO: Train the model in the normal way

In [87]:
learning_rate = 0.01
num_epochs = 20
embedding_size = 128
hidden_channels = 16
log_freq = 10

In [150]:
model = GNN(hidden_channels=hidden_channels)
model = to_hetero(model, data.metadata(), aggr="sum")
model = Model(model,
              embedding_size=embedding_size,
              #class_weight=class_weight,
              out_channels=2,
              batch_size=batch_size)

In [151]:
poptorch_options = poptorch.Options()
poptorch_options.enableExecutableCaching(executable_cache_dir)



In [152]:
train_loader_ipu = FixedSizeNeighborLoader(
    data,
    num_neighbors=[15] * num_layers,
    fixed_size_options=fixed_size_options,
    batch_size=batch_size,
    input_nodes=('transaction', data['transaction'].train_mask),
    exclude_keys=("batch_size",),
    options=poptorch_options
)



In [153]:
sample = next(iter(train_loader_ipu))

with torch.no_grad():  # Initialize lazy modules.
    out_cpu, loss = model(sample.x_dict,
                          sample.edge_index_dict,
                          n_id_dict=sample.n_id_dict,
                          target=sample["transaction"].y)

loss

tensor(1.0652)

In [154]:
model.train()
optimizer = poptorch.optim.Adam(model.parameters(), lr=learning_rate)
training_model = poptorch.trainingModel(model, optimizer=optimizer, options=poptorch_options)

for epoch in range(num_epochs):
    total_examples = total_loss = 0
    for batch in train_loader_ipu:
        out, loss = training_model(batch.x_dict,
                                   batch.edge_index_dict,
                                   n_id_dict=batch.n_id_dict,
                                   target=batch['transaction'].y)
        total_examples += batch_size
        total_loss += float(loss) * batch_size

    if epoch % log_freq == 0:
        print(f"Epoch {epoch}, Loss: {total_loss / total_examples}")

Graph compilation: 100%|██████████| 100/100 [10:13<00:00]


Epoch 0, Loss: 1.5060148388147354
Epoch 10, Loss: 0.0412857144450148


## Validating our trained model

TODO: Validate the model in the normal way

In [155]:
# TODO: Should I sample?

model.eval()
model.full_batch = True
inference_model = poptorch.inferenceModel(model, options=poptorch_options)

out = inference_model(data.x_dict,
                      data.edge_index_dict,
                      target=data['transaction'].y,
                      mask=data['transaction'].val_mask)

Graph compilation: 100%|██████████| 100/100 [00:05<00:00]


In [162]:
y_pred = nn.Softmax(dim=-1)(out)
y_pred = y_pred[:, -1]
y_pred = y_pred > 0.5
y_pred = y_pred[data['transaction'].val_mask]
y_true = data['transaction'].y[data['transaction'].val_mask]

In [163]:
def accuracy(y_pred, y_true):
    correct = y_pred.eq(y_true).sum()
    return correct / len(y_pred)

accuracy(y_pred, y_true)

tensor(0.8850)

In [164]:
def get_confusion_matrix(y_pred, y_true):
    y_pred = y_pred.bool()
    y_true = y_true.bool()
    true_positives = (y_pred * y_true).sum()
    false_positives = (y_pred * ~y_true).sum()
    true_negatives = (~y_pred * ~y_true).sum()
    false_negatives = (~y_pred * y_true).sum()
    return true_positives, false_positives, true_negatives, false_negatives

true_pos, false_pos, true_neg, false_neg = get_confusion_matrix(y_pred, y_true)
true_pos, false_pos, true_neg, false_neg

(tensor(0), tensor(12), tensor(177), tensor(11))

In [165]:
def get_rates(true_pos, false_pos, true_neg, false_neg):
    true_pos_rate = true_pos / (true_pos + false_neg)
    false_pos_rate = false_pos / (false_pos + true_neg)
    return true_pos_rate, false_pos_rate

get_rates(true_pos, false_pos, true_neg, false_neg)

(tensor(0.), tensor(0.0635))

In [166]:
def precision(true_pos, false_pos):
    return true_pos / (true_pos + false_pos)

def recall(true_pos, false_neg):
    return true_pos / (true_pos + false_neg)

precision(true_pos, false_pos), recall(true_pos, false_neg)

(tensor(0.), tensor(0.))
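
The accuracy of 0.885 looks reasonable in isolation, yet the confusion matrix shows that no fraudulent transaction was detected; predicting "not fraud" for every validation node would score 189 / 200 = 0.945. A plain-Python sketch that reports the metrics together and guards against division by zero; `classification_report` and `safe_divide` are hypothetical names.

```python
def safe_divide(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_report(true_pos, false_pos, true_neg, false_neg):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    total = true_pos + false_pos + true_neg + false_neg
    precision = safe_divide(true_pos, true_pos + false_pos)
    recall = safe_divide(true_pos, true_pos + false_neg)
    return {
        "accuracy": safe_divide(true_pos + true_neg, total),
        "precision": precision,
        "recall": recall,
        "f1": safe_divide(2 * precision * recall, precision + recall),
    }


# Confusion counts from the validation run above
report = classification_report(true_pos=0, false_pos=12, true_neg=177, false_neg=11)

# Baseline: accuracy of always predicting "not fraud" on the same split
baseline_accuracy = (12 + 177) / 200
```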

### With threshold

In [168]:
import numpy as np

results = []
# The fraud probabilities and labels do not depend on the threshold,
# so compute them once outside the loop
fraud_prob = nn.Softmax(dim=-1)(out)[:, -1][data['transaction'].val_mask]
y_true = data['transaction'].y[data['transaction'].val_mask]
for threshold in np.arange(0.0, 1.1, 0.1):
    y_pred = fraud_prob > threshold
    results.append((threshold, *get_rates(*get_confusion_matrix(y_pred, y_true))))

In [169]:
results

[(0.0, tensor(1.), tensor(1.)),
 (0.1, tensor(0.0909), tensor(0.0847)),
 (0.2, tensor(0.), tensor(0.0794)),
 (0.30000000000000004, tensor(0.), tensor(0.0794)),
 (0.4, tensor(0.), tensor(0.0741)),
 (0.5, tensor(0.), tensor(0.0635)),
 (0.6000000000000001, tensor(0.), tensor(0.0582)),
 (0.7000000000000001, tensor(0.), tensor(0.0582)),
 (0.8, tensor(0.), tensor(0.0529)),
 (0.9, tensor(0.), tensor(0.0476)),
 (1.0, tensor(0.), tensor(0.))]
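
The threshold sweep can be summarised as a single number by integrating the true-positive rate over the false-positive rate with the trapezoidal rule. The sketch below does this in plain Python using the rounded rates printed above; `roc_auc` is a hypothetical name, and with only eleven thresholds the result is a coarse estimate.

```python
def roc_auc(points):
    """Trapezoidal area under a curve given (false_pos_rate, true_pos_rate) pairs."""
    ordered = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# (false-positive rate, true-positive rate) pairs taken from the sweep above
sweep = [
    (1.0, 1.0), (0.0847, 0.0909), (0.0794, 0.0), (0.0794, 0.0),
    (0.0741, 0.0), (0.0635, 0.0), (0.0582, 0.0), (0.0582, 0.0),
    (0.0529, 0.0), (0.0476, 0.0), (0.0, 0.0),
]
auc = roc_auc(sweep)
```

The estimate comes out at roughly 0.5, which is what a random classifier would achieve and is consistent with the zero recall seen at most thresholds.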

## Explainability

TODO

## Conclusion

> The conclusion to your demo should:
>
> - summarise the main steps that were taken in the demo making clear what
>  your user got to do (similar to steps at the start but more
>  specific, you can link a specific feature/method/class to achieving a specific
>  outcome). (short paragraph: 3-6 sentences)
> - provide resources to go further: these can be links to other tutorials, to
>  documentation, to code examples in the public_examples repo, tech notes, deployments,
>  etc... (2-4 suggestions)
>
> To point users to notebooks in the same runtime, tell them where the file is rather than linking to it. For example: please see our tutorial, `<folder_name>/<notebook_name>.ipynb`. For relative links, the Paperspace platform downloads the file locally if the machine is running and returns a 404 error if it is not. For full-path links, a new window is opened.