# Fraud detection pipeline using Graph Neural Network

### Content
1. Introduction
2. Load transaction data
3. Graph construction
4. GraphSAGE training
5. Classification and Prediction
6. Evaluation
7. Conclusion

### 1. Introduction
This workflow shows an application of a graph neural network for fraud detection in a credit card transaction graph. We use a transaction dataset that includes three types of nodes: `transaction`, `client`, and `merchant`. We use `GraphSAGE` along with `XGBoost` to identify fraudulent transactions. Since the graph is heterogeneous, we employ HinSAGE, a heterogeneous implementation of GraphSAGE.

First, GraphSAGE is trained separately to produce embeddings of the transaction nodes; the embeddings are then fed to an `XGBoost` classifier to distinguish fraudulent from non-fraudulent transactions. During the inference stage, an embedding for a new transaction is computed with the trained GraphSAGE model and then fed to the XGBoost model to obtain the anomaly score.

### 2. Loading the Credit Card Transaction Data

In [1]:
%load_ext autoreload
%autoreload 2
import os
import dgl
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from model import HeteroRGCN
from model import HinSAGE
from sklearn.metrics import accuracy_score
from sklearn.metrics import auc
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from torchmetrics.functional import accuracy
from tqdm import trange
from xgboost import XGBClassifier
from training import (get_metrics, evaluate, init_loaders, build_fsi_graph,
                     map_node_id, prepare_data, save_model, train)


In [2]:
np.random.seed(1001)
torch.manual_seed(1001)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [73]:
device

device(type='cuda', index=0)

##### Load training and test datasets

In [3]:
# Point these paths at the training and validation CSV files in the datasets directory.
TRAINING_DATA = '../../datasets/training-data/fraud-detection-training-data.csv'
VALIDATION_DATA = '../../datasets/validation-data/fraud-detection-validation-data.csv'
train_data = pd.read_csv(TRAINING_DATA)
inductive_data = pd.read_csv(VALIDATION_DATA)

Since the training set is small, we augment it by replicating benign transactions from the original training samples. This increases the number of benign examples and reduces the proportion of fraudulent transactions, which resembles practical situations where fraud makes up only a small share of all transactions.

In [4]:
# Increase the number of samples by replicating benign transactions.
def augment_data(train_data=train_data, n=20):
    # Keep only the benign (non-fraud) rows; these are the rows that get replicated.
    non_fraud = train_data[train_data['fraud_label'] == 0]

    non_fraud = non_fraud.drop(['index'], axis=1)
    df_non_fraud = pd.concat([non_fraud for _ in range(n)])
    # Give the replicated rows fresh ids, starting at the hard-coded offset 1076.
    df_non_fraud.index = np.arange(1076, 1076 + df_non_fraud.shape[0])
    df_non_fraud['index'] = df_non_fraud.index

    return pd.concat((train_data, df_non_fraud))

In [5]:
train_data = augment_data(train_data, n=20)

The `train_data` variable stores the data that will be used to construct graphs on which the representation learners can train. 
The `inductive_data` will be used to test the inductive performance of our representation learning algorithms.

In [6]:
print('The distribution of fraud for the train data is:\n', train_data['fraud_label'].value_counts())
print('The distribution of fraud for the inductive data is:\n', inductive_data['fraud_label'].value_counts())

The distribution of fraud for the train data is:
 0    11865
1      188
Name: fraud_label, dtype: int64
The distribution of fraud for the inductive data is:
 0    244
1     21
Name: fraud_label, dtype: int64
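
The counts printed above are consistent with appending 20 extra copies of every benign row: 11865 = 565 × 21. The short sketch below reproduces that arithmetic. The original benign count of 565 is inferred from the output rather than read from the dataset, and `augmented_counts` is a hypothetical helper, not part of the `training` module.

```python
def augmented_counts(n_benign, n_fraud, n_copies):
    """Class counts after appending n_copies extra copies of every benign row."""
    benign_after = n_benign * (1 + n_copies)
    total_after = benign_after + n_fraud
    return benign_after, n_fraud, n_fraud / total_after

# 565 benign rows is an inferred value (11865 / 21); 188 and n=20 come from the notebook.
benign_after, fraud_after, fraud_ratio = augmented_counts(565, 188, 20)
ratio_before = 188 / (565 + 188)

print(benign_after, fraud_after)
print(f"fraud share before: {ratio_before:.4f}, after: {fraud_ratio:.4f}")
```

Under these assumptions, the fraud share drops from roughly a quarter of the training rows to under 2%.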


In [7]:
# Split the data into train and test sets and create node indices.
def prepare_data(df_train, df_test):
    
    train_idx_ = df_train.shape[0]
    df = pd.concat([df_train, df_test], axis=0)
    df['tran_id'] = df['index']

    meta_cols = ['tran_id', 'client_node', 'merchant_node']
    for col in meta_cols:
        map_node_id(df, col)

    train_idx = df['tran_id'][:train_idx_]
    test_idx = df['tran_id'][train_idx_:]

    df['index'] = df['tran_id']
    df.index = df['index']

    return (df.iloc[train_idx, :], df.iloc[test_idx, :], train_idx, test_idx, df['fraud_label'].values, df)

In [8]:
train_data, test_data, train_idx, inductive_idx, labels, df = prepare_data(train_data, inductive_data)

### 3. Construct transaction graph

Here, nodes, edges, and features are passed to the `build_fsi_graph` method. Note that client and merchant nodes are featureless; a node embedding is used as the feature for these nodes instead. All the relevant transaction data therefore resides at the transaction nodes.

In [9]:
meta_cols = ["client_node", "merchant_node", "fraud_label", "index", "tran_id"]

# Build graph
whole_graph, feature_tensors = build_fsi_graph(df, meta_cols)
train_graph, _ = build_fsi_graph(train_data, meta_cols)
whole_graph = whole_graph.to(device)

In [10]:
# Dataset to tensors
feature_tensors = feature_tensors.to(device)
train_idx = torch.from_numpy(train_idx.values).to(device)
inductive_idx = torch.from_numpy(inductive_idx.values).to(device)
labels = torch.LongTensor(labels).to(device)


In [11]:
# Show structure of training graph.
print(train_graph)

Graph(num_nodes={'client': 623, 'merchant': 388, 'transaction': 12053},
      num_edges={('client', 'buy', 'transaction'): 12053, ('merchant', 'sell', 'transaction'): 12053, ('transaction', 'bought', 'client'): 12053, ('transaction', 'issued', 'merchant'): 12053},
      metagraph=[('client', 'transaction', 'buy'), ('transaction', 'client', 'bought'), ('transaction', 'merchant', 'issued'), ('merchant', 'transaction', 'sell')])
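
The four relations in the printout can be illustrated with plain edge lists. The sketch below is a simplified assumption about what a helper such as `build_fsi_graph` has to produce, not its actual implementation: every transaction row yields one edge per relation, and client and merchant IDs are remapped to contiguous integers. `build_edge_lists` and the sample IDs are hypothetical.

```python
def build_edge_lists(rows):
    """Return per-relation (src, dst) index lists for a transaction table.

    rows: list of (client_id, merchant_id) tuples, one per transaction.
    Transaction i is simply row i; client and merchant IDs are remapped to
    contiguous integers in order of first appearance.
    """
    client_index, merchant_index = {}, {}
    clients, merchants, transactions = [], [], []
    for tran_idx, (client_id, merchant_id) in enumerate(rows):
        clients.append(client_index.setdefault(client_id, len(client_index)))
        merchants.append(merchant_index.setdefault(merchant_id, len(merchant_index)))
        transactions.append(tran_idx)
    return {
        ("client", "buy", "transaction"): (clients, transactions),
        ("transaction", "bought", "client"): (transactions, clients),
        ("merchant", "sell", "transaction"): (merchants, transactions),
        ("transaction", "issued", "merchant"): (transactions, merchants),
    }

edges = build_edge_lists([("c9", "m3"), ("c9", "m7"), ("c2", "m3")])
for relation, (src, dst) in edges.items():
    print(relation, list(zip(src, dst)))
```

A dictionary keyed by `(source type, relation, destination type)` is the shape that `dgl.heterograph` accepts, which is why each relation has exactly as many edges as there are transactions in the printout above.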


### 4. Train Heterogeneous GraphSAGE

HinSAGE, a heterogeneous graph implementation of the GraphSAGE framework, is trained with user-specified hyperparameters. The model trains several GraphSAGE models, one for each type of relationship between node types.
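
To make the relation-wise idea concrete, here is a minimal sketch of one aggregation step for a single transaction node. It is not the `HinSAGE` class imported from `model`: a real layer applies a learned weight matrix per relation and a non-linearity, whereas this sketch only sums per-relation means to show the data flow. All function names and feature values are hypothetical.

```python
def mean_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def aggregate_transaction(self_feat, neighbours_by_relation):
    """Add one mean vector per incoming relation to the node's own features."""
    out = list(self_feat)
    for neighbour_feats in neighbours_by_relation.values():
        if not neighbour_feats:
            continue  # a relation with no neighbours contributes nothing
        rel_mean = mean_vectors(neighbour_feats)
        out = [a + b for a, b in zip(out, rel_mean)]
    return out

neighbours = {
    "buy": [[1.0, 0.0]],   # embedding of the client behind the transaction
    "sell": [[0.0, 2.0]],  # embedding of the merchant behind the transaction
}
result = aggregate_transaction([0.5, 0.5], neighbours)
print(result)
```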

In [21]:
# Hyperparameters
target_node = "transaction"
epochs = 20
in_size, hidden_size, out_size, n_layers,\
    embedding_size = 111, 64, 2, 2, 1
batch_size = 100
hyperparameters = {"in_size": in_size, "hidden_size": hidden_size,
                   "out_size": out_size, "n_layers": n_layers,
                   "embedding_size": embedding_size,
                   "target_node": target_node,
                   "epoch": epochs}


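# Class weights for the loss: [fraud share, 1 - fraud share], so that
# errors on the rare fraud class (label 1) are penalized more heavily.
scale_pos_weight = train_data['fraud_label'].sum() / train_data.shape[0]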
scale_pos_weight = torch.tensor(
    [scale_pos_weight, 1-scale_pos_weight]).to(device)
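
The effect of these class weights can be checked by hand. The sketch below assumes the mean-reduction behaviour of `nn.CrossEntropyLoss` with a `weight` argument (a weighted sum of per-sample losses divided by the sum of the weights used) and works on probabilities rather than logits for brevity. The helper names are hypothetical; the counts 188 and 12053 come from the training set printed earlier.

```python
import math

def class_weights(n_fraud, n_total):
    """Mirror the notebook's weighting: [fraud share, 1 - fraud share]."""
    fraud_share = n_fraud / n_total
    return [fraud_share, 1.0 - fraud_share]

def weighted_cross_entropy(probs, targets, weights):
    """Weighted mean of -log p(target), normalised by the sum of the weights used."""
    weighted = [weights[t] * -math.log(p[t]) for p, t in zip(probs, targets)]
    return sum(weighted) / sum(weights[t] for t in targets)

weights = class_weights(188, 12053)
# One benign and one fraudulent sample, each given probability 0.4 for its true class.
probs = [[0.4, 0.6], [0.6, 0.4]]
targets = [0, 1]
loss = weighted_cross_entropy(probs, targets, weights)

print("class weights:", weights)
print("fraud/benign weight ratio:", weights[1] / weights[0])
print("weighted loss:", loss)
```

With these counts, an error on a fraudulent transaction weighs about 63 times as much as the same error on a benign one.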

In [22]:
# Dataloaders
train_loader, val_loader, test_loader = init_loaders(train_graph.to(
    device), train_idx, test_idx=inductive_idx,
    val_idx=inductive_idx, g_test=whole_graph, batch_size=batch_size)


# Set model variables
model = HinSAGE(train_graph, in_size, hidden_size, out_size, n_layers, embedding_size).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
loss_func = nn.CrossEntropyLoss(weight=scale_pos_weight.float())

In [23]:

for epoch in trange(epochs):

    train_acc, loss = train(
        model, loss_func, train_loader, labels, optimizer, feature_tensors,
        target_node, device=device)
    print(f"Epoch {epoch}/{epochs} | Train Accuracy: {train_acc} | Train Loss: {loss}")
    val_logits, val_seed, _ = evaluate(model, val_loader, feature_tensors, target_node, device=device)
    val_accuracy = accuracy(val_logits.argmax(1), labels.long()[val_seed].cpu(), "binary").item()
    val_auc = roc_auc_score(
        labels.long()[val_seed].cpu().numpy(),
        val_logits[:, 1].numpy(),
    )
    print(f"Validation Accuracy: {val_accuracy} auc {val_auc}")


Epoch 0/20 | Train Accuracy: 1.0 | Train Loss: 4.077046836914391
Validation Accuracy: 0.9207547307014465 auc 0.13992974238875877
Epoch 1/20 | Train Accuracy: 1.0 | Train Loss: 110.9858230000423
Validation Accuracy: 0.9207547307014465 auc 0.5852849336455894
Epoch 2/20 | Train Accuracy: 1.0 | Train Loss: 419.0077720507543
Validation Accuracy: 0.9207547307014465 auc 0.6083138173302107
Epoch 3/20 | Train Accuracy: 1.0 | Train Loss: 176.4732639742433
Validation Accuracy: 0.9169811606407166 auc 0.7976190476190476
Epoch 4/20 | Train Accuracy: 1.0 | Train Loss: 49.66766470632865
Validation Accuracy: 0.9245283007621765 auc 0.8080601092896175
Epoch 5/20 | Train Accuracy: 1.0 | Train Loss: 31.406425931840204
Validation Accuracy: 0.9283018708229065 auc 0.858216237314598
Epoch 6/20 | Train Accuracy: 1.0 | Train Loss: 24.368114110082388
Validation Accuracy: 0.9283018708229065 auc 0.8635831381733021
Epoch 7/20 | Train Accuracy: 1.0 | Train Loss: 17.363841364858672
Validation Accuracy: 0.9283018708229065 auc 0.8762685402029665
Epoch 8/20 | Train Accuracy: 1.0 | Train Loss: 16.201855568680912
Validation Accuracy: 0.9320755004882812 auc 0.8788056206088993
Epoch 9/20 | Train Accuracy: 1.0 | Train Loss: 15.001215729862452
Validation Accuracy: 0.9320755004882812 auc 0.8873926619828258
Epoch 10/20 | Train Accuracy: 1.0 | Train Loss: 14.861962082330137
Validation Accuracy: 0.9358490705490112 auc 0.8791959406713505
Epoch 11/20 | Train Accuracy: 1.0 | Train Loss: 13.089418702758849
Validation Accuracy: 0.9320755004882812 auc 0.8858313817330211
Epoch 12/20 | Train Accuracy: 1.0 | Train Loss: 12.216756469802931
Validation Accuracy: 0.9320755004882812 auc 0.9127634660421545
Epoch 13/20 | Train Accuracy: 1.0 | Train Loss: 12.858742844546214
Validation Accuracy: 0.9433962106704712 auc 0.9182279469164715
Epoch 14/20 | Train Accuracy: 1.0 | Train Loss: 11.10123936785385
Validation Accuracy: 0.9320755004882812 auc 0.911592505854801
Epoch 15/20 | Train Accuracy: 1.0 | Train Loss: 15.444379360007588
Validation Accuracy: 0.9207547307014465 auc 0.8721701795472288
Epoch 16/20 | Train Accuracy: 1.0 | Train Loss: 15.353719354665373
Validation Accuracy: 0.9169811606407166 auc 0.822599531615925
Epoch 17/20 | Train Accuracy: 1.0 | Train Loss: 15.88208947563544
Validation Accuracy: 0.9283018708229065 auc 0.8528493364558939
Epoch 18/20 | Train Accuracy: 1.0 | Train Loss: 12.539632054162212
Validation Accuracy: 0.9283018708229065 auc 0.9022248243559718
Epoch 19/20 | Train Accuracy: 1.0 | Train Loss: 13.172684742690763
Validation Accuracy: 0.9433962106704712 auc 0.9342310694769711

100%|██████████| 20/20 [00:47<00:00,  2.37s/it]

### 4.2 Inductive Step GraphSAGE

In this part, we compute inductive embeddings for new transactions. To extract the embeddings of the new transactions, we need to keep the indices of the original graph nodes along with those of the new transaction nodes. We therefore concatenate the test data frame to the train data frame and build a new graph that includes all nodes.

In [25]:
print(whole_graph)

Graph(num_nodes={'client': 861, 'merchant': 482, 'transaction': 12318},
      num_edges={('client', 'buy', 'transaction'): 12318, ('merchant', 'sell', 'transaction'): 12318, ('transaction', 'bought', 'client'): 12318, ('transaction', 'issued', 'merchant'): 12318},
      metagraph=[('client', 'transaction', 'buy'), ('transaction', 'client', 'bought'), ('transaction', 'merchant', 'issued'), ('merchant', 'transaction', 'sell')])
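
The transaction count of the whole graph can be cross-checked against the earlier outputs: 12053 training transactions plus the 244 + 21 inductive transactions give 12318 nodes. The sketch below shows the index bookkeeping this implies, assuming the new transactions simply follow the training rows in the concatenated table. Whether node IDs coincide with row positions depends on `map_node_id`, which is not shown here, and `split_indices` is a hypothetical helper.

```python
def split_indices(n_train, n_new):
    """Row positions of training and new transactions in the concatenated table."""
    train_positions = range(0, n_train)
    new_positions = range(n_train, n_train + n_new)
    return train_positions, new_positions

# 12053 training rows and 244 + 21 inductive rows, as printed earlier in the notebook.
train_pos, new_pos = split_indices(12053, 244 + 21)
total_nodes = len(train_pos) + len(new_pos)

print(total_nodes)             # transaction nodes expected in the whole graph
print(new_pos[0], new_pos[-1]) # first and last position of the new transactions
```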


The inductive step applies the previously learned (and optimized) aggregation functions of the trained HinSAGE model (`model`). The test data loader, which was built on the whole graph (`g_test=whole_graph`), supplies the new transaction nodes.

In [35]:
# Create embeddings
_, train_seeds, train_embedding = evaluate(model, train_loader, feature_tensors, target_node, device=device)
test_logits, test_seeds, test_embedding = evaluate(model, test_loader, feature_tensors, target_node, device=device)

# compute metrics
test_acc = accuracy(test_logits.argmax(dim=1), labels.long()[test_seeds].cpu(), "binary").item()
test_auc = roc_auc_score(labels.long()[test_seeds].cpu().numpy(), test_logits[:, 1].numpy())

metrics_result = pd.DataFrame()
print(f"Final Test Accuracy: {test_acc} auc {test_auc}")

#acc, f_1, precision, recall, roc_auc, pr_auc, average_precision, _, _ = get_metrics(
#    test_logits.numpy(), labels[test_seeds].cpu().numpy())

#print(f"Final Test Accuracy: {acc} auc {roc_auc}")


Final Test Accuracy: 0.9320755004882812 auc 0.889344262295082


### 5. Classification: predictions based on inductive embeddings

Now a selected classifier (XGBoost) can be trained on the training node embeddings and tested on the test node embeddings.

In [27]:
from xgboost import XGBClassifier


In [28]:
# Train XGBoost classifier on embedding vector
classifier = XGBClassifier(n_estimators=100)
classifier.fit(train_embedding.cpu().numpy(), labels[train_seeds].cpu().numpy())


If requested, the original transaction features are added to the generated embeddings. If these features are added, a baseline consisting of only these features (without embeddings) is included to analyze the net impact of embeddings on the predictive performance.

In [29]:
xgb_pred = classifier.predict_proba(test_embedding.cpu().numpy())


### 6. Evaluation

Given the highly imbalanced nature of the dataset, accuracy alone can be misleading, so we evaluate the results with several metrics: accuracy, F1, precision, recall, ROC AUC, PR AUC, and average precision.
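
A quick way to see why accuracy is not enough here: a classifier that labels all 265 inductive transactions as benign already scores 244/265 ≈ 0.9208, the same validation accuracy logged during the first training epochs, while detecting no fraud at all. The sketch below computes this baseline in plain Python; `binary_metrics` is a hypothetical helper and not the `get_metrics` function from the `training` module.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = fraud)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}

# Label distribution of the inductive data: 244 benign, 21 fraudulent.
y_true = [0] * 244 + [1] * 21
all_benign = [0] * len(y_true)  # baseline that never flags fraud
baseline = binary_metrics(y_true, all_benign)
print(baseline)
```

Recall and F1 are zero for this baseline even though its accuracy exceeds 92%, which is why ranking metrics such as ROC AUC and average precision are reported alongside accuracy.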

In [32]:
# Index the labels with test_seeds so that they follow the order of test_embedding.
acc, f_1, precision, recall, roc_auc, pr_auc, average_precision, _, _ = get_metrics(
    xgb_pred, labels[test_seeds].cpu().numpy(), name='HinSAGE_XGB')
print(f"Final Test Accuracy: {acc} auc {roc_auc}")

Final Test Accuracy: 0.9245283018867925 auc 0.9055425448868072


The results show that XGBoost trained on the GNN embeddings reaches a higher AUC on the test embeddings than the HinSAGE model's own predictions (0.906 versus 0.889), with a slightly lower accuracy (0.925 versus 0.932).

### 6.2 Save models

The GraphSAGE and XGBoost models can be saved in their respective formats using the `save_model` method. For inference, the GraphSAGE model is loaded as a PyTorch model, and the XGBoost model is loaded using the `cuml` *Forest Inference Library (FIL)*.

In [45]:
model_dir = "modelpath/"

save_model(train_graph, model, hyperparameters, classifier, model_dir)


In [40]:
!ls modelpath

graph.pkl  hyperparams.pkl  model.pt  xgb.pt
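
As an illustration of the `.pkl` artifacts, the sketch below round-trips the hyperparameter dictionary through `pickle` in a temporary directory. That `save_model` uses `pickle` for `hyperparams.pkl` is an assumption based only on the file extension; the function's actual implementation is not shown in this notebook.

```python
import os
import pickle
import tempfile

# Same values as the hyperparameters cell above.
hyperparameters = {"in_size": 111, "hidden_size": 64, "out_size": 2,
                   "n_layers": 2, "embedding_size": 1,
                   "target_node": "transaction", "epoch": 20}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "hyperparams.pkl")
    with open(path, "wb") as f:
        pickle.dump(hyperparameters, f)   # write the dictionary to disk
    with open(path, "rb") as f:
        restored = pickle.load(f)         # read it back for inference

print(restored == hyperparameters)
```

Storing the hyperparameters next to the weights matters because the model has to be rebuilt with the same layer sizes before its saved state can be loaded.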


In [None]:
## For inference we can load from file as follows. 
from training import load_model
# do inference on loaded model, as follows
# hinsage_model,  hyperparam, g = load_model(model_dir, device)

### 7. Conclusion

In this workflow, we show a hybrid approach that uses a graph neural network along with XGBoost for fraud detection on a credit card transaction network. For an optimized inference pipeline, refer to the `Morpheus` fraud detection inference pipeline.

### References
1. Van Belle, Rafaël, et al. "Inductive Graph Representation Learning for fraud detection." Expert Systems with Applications (2022): 116463.
2. https://stellargraph.readthedocs.io/en/stable/hinsage.html?highlight=hinsage
3. https://github.com/rapidsai/clx/blob/branch-0.20/examples/forest_inference/xgboost_training.ipynb