# Fraud detection pipeline using Graph Neural Network

### Content
1. Introduction
2. Load transaction data
3. Graph construction
4. GraphSAGE training
5. Classification and Prediction
6. Evaluation
7. Conclusion

### 1. Introduction
This workflow shows an application of a graph neural network to fraud detection in a credit card transaction graph. We use a transaction dataset that includes three types of nodes: `transaction`, `client`, and `merchant`. We use `GraphSAGE` along with `XGBoost` to identify fraudulent transactions. Since the graph is heterogeneous, we employ HinSAGE, a heterogeneous implementation of GraphSAGE.

First, GraphSAGE is trained separately to produce embeddings of the transaction nodes; these embeddings are then fed to an `XGBoost` classifier that separates fraudulent from non-fraudulent transactions. During the inference stage, an embedding for a new transaction is computed with the trained GraphSAGE model and then fed to the XGBoost model to obtain the anomaly scores.

### 2. Loading the Credit Card Transaction Data

In [1]:
%load_ext autoreload
%autoreload 2
import numpy as np
import matplotlib.pylab as plt
import os
import dgl
import torch
import torch.nn as nn
from model import HeteroRGCN
from model import HinSAGE
from model import prepare_data
from sklearn.metrics import accuracy_score
from sklearn.metrics import auc
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from torchmetrics.functional import accuracy
from tqdm import trange
from xgboost import XGBClassifier
from training import (get_metrics, evaluate, init_loaders, build_fsi_graph,
                   save_model, train)
import cudf

In [2]:
np.random.seed(1001)
torch.manual_seed(1001)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

##### Load training and test dataset

In [3]:
# Replace these paths with the training and validation CSV files from your dataset directory.
TRAINING_DATA = '../../datasets/training-data/fraud-detection-training-data.csv'
VALIDATION_DATA = '../../datasets/validation-data/fraud-detection-validation-data.csv'
train_data = cudf.read_csv(TRAINING_DATA)
inductive_data = cudf.read_csv(VALIDATION_DATA)

Since the training data contains few samples, we augment it by replicating benign transactions from the original training samples. This increases the number of benign examples and reduces the proportion of fraudulent transactions, which resembles practical situations where fraud makes up only a small share of all transactions.

In [4]:
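# Increase the number of samples by replicating benign transactions.
def augment_data(train_data, n=20):
    """Return train_data with n extra copies of its benign rows appended."""
    # Note: this drops the 'index' column of the caller's frame in place.
    train_data.drop(columns=['index'], inplace=True)
    non_fraud = train_data[train_data['fraud_label'] == 0]
    df_non_fraud = cudf.concat([non_fraud for _ in range(n)])
    df_train = cudf.concat([train_data, df_non_fraud])
    # The old index becomes the 'index' column, which is then
    # overwritten with the new contiguous row positions.
    df_train.reset_index(inplace=True)
    df_train['index'] = df_train.index

    return df_train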


In [5]:
train_data = augment_data(train_data, n=20)

The `train_data` variable stores the data that will be used to construct graphs on which the representation learners can train. 
The `inductive_data` will be used to test the inductive performance of our representation learning algorithms.

In [6]:
print('The distribution of fraud for the train data is:\n', train_data['fraud_label'].value_counts())
print('The distribution of fraud for the inductive data is:\n', inductive_data['fraud_label'].value_counts())

The distribution of fraud for the train data is:
 0    11865
1      188
Name: fraud_label, dtype: int32
The distribution of fraud for the inductive data is:
 0    244
1     21
Name: fraud_label, dtype: int32
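
The effect of the augmentation on the class balance can be checked with a little arithmetic. The sketch below uses the training counts printed above and assumes they result from a single call to `augment_data` with `n=20`, so that the benign count is 21 times the original benign count. The helper `fraud_fraction` is a hypothetical name used only for this illustration.

```python
def fraud_fraction(n_fraud, n_benign):
    """Share of fraudulent rows among all rows."""
    return n_fraud / (n_fraud + n_benign)


n = 20                      # number of extra benign copies (from the call above)
n_fraud = 188               # printed fraud count after augmentation
n_benign_augmented = 11865  # printed benign count after augmentation

# Assumption: the benign rows were replicated exactly once with n=20,
# so the original benign count is the augmented count divided by n + 1.
n_benign_original = n_benign_augmented // (n + 1)

before = fraud_fraction(n_fraud, n_benign_original)
after = fraud_fraction(n_fraud, n_benign_augmented)
print(f"fraud fraction before: {before:.4f}, after: {after:.4f}")
```

Under that assumption the fraud share drops from roughly 25% to about 1.6%.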


In [7]:
# train_data, test_data, train_index, test_index, labels, all_data
train_data, test_data, train_idx, inductive_idx, labels, df = prepare_data(train_data, inductive_data)

### 3. Construct transaction graph network

Here, nodes, edges, and features are passed to the `build_fsi_graph` method. Note that client and merchant nodes are featureless; a node embedding is used as the feature for these nodes instead. Therefore, all the relevant transaction data resides at the transaction nodes.
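
To make the graph structure concrete, the following sketch builds the four edge lists of such a heterogeneous graph from a few toy rows. It is not the implementation of `build_fsi_graph`, which is not shown in this notebook; the function `build_edge_lists` and the toy rows are hypothetical, and the relation names are taken from the `whole_graph` printout later in this notebook.

```python
# Toy rows: each transaction references one client and one merchant.
rows = [
    {"index": 0, "client_node": 0, "merchant_node": 0},
    {"index": 1, "client_node": 0, "merchant_node": 1},
    {"index": 2, "client_node": 1, "merchant_node": 1},
]


def build_edge_lists(rows):
    """Return one (source, destination) edge list per relation type."""
    edges = {
        ("client", "buy", "transaction"): [],
        ("transaction", "bought", "client"): [],
        ("merchant", "sell", "transaction"): [],
        ("transaction", "issued", "merchant"): [],
    }
    for row in rows:
        txn, client, merchant = row["index"], row["client_node"], row["merchant_node"]
        # Every transaction is linked to its client and merchant in both directions.
        edges[("client", "buy", "transaction")].append((client, txn))
        edges[("transaction", "bought", "client")].append((txn, client))
        edges[("merchant", "sell", "transaction")].append((merchant, txn))
        edges[("transaction", "issued", "merchant")].append((txn, merchant))
    return edges


edges = build_edge_lists(rows)
for relation, pairs in edges.items():
    print(relation, pairs)
```

Each transaction contributes exactly one edge per relation, so every edge list has as many entries as there are rows.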

In [8]:

meta_cols = ["client_node", "merchant_node", "index"]

# Build graph
whole_graph, feature_tensors = build_fsi_graph(df, meta_cols)
train_graph, _ = build_fsi_graph(train_data, meta_cols)

# Dataset
feature_tensors = feature_tensors.float()
train_idx = torch.from_dlpack(train_idx.values.toDlpack()).long()
inductive_idx = torch.from_dlpack(inductive_idx.values.toDlpack()).long()
labels = torch.from_dlpack(labels.toDlpack()).long()


### 4. Train Heterogeneous GraphSAGE

HinSAGE, a heterogeneous graph implementation of the GraphSAGE framework, is trained with user-specified hyperparameters. The model trains a separate GraphSAGE model for each type of relationship between the different node types.
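
The idea of one aggregator per relation type can be sketched in NumPy. This is a simplified illustration with mean aggregation and random weights, not the `HinSAGE` class imported from `model`, whose internals are not shown here and may differ. The function `hetero_sage_layer` and all sizes are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 5

# Features of one target transaction node (4 hypothetical features).
txn_feat = rng.normal(size=4)

# Neighbour features grouped by incoming relation; here a transaction
# has one client and one merchant neighbour with 3-dimensional embeddings.
neighbors = {
    "buy": rng.normal(size=(1, 3)),   # client -> transaction
    "sell": rng.normal(size=(1, 3)),  # merchant -> transaction
}

# One weight matrix for the node itself and one per relation type.
w_self = rng.normal(size=(hidden, 4))
w_rel = {rel: rng.normal(size=(hidden, 3)) for rel in neighbors}


def hetero_sage_layer(self_feat, neighbors_by_relation, w_self, w_rel):
    """Mean-aggregate neighbours per relation, project, sum, apply ReLU."""
    out = w_self @ self_feat
    for rel, feats in neighbors_by_relation.items():
        out = out + w_rel[rel] @ feats.mean(axis=0)
    return np.maximum(out, 0.0)


h = hetero_sage_layer(txn_feat, neighbors, w_self, w_rel)
print(h.shape)
```

Stacking `n_layers` such layers lets a transaction embedding incorporate information from nodes several hops away.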

In [9]:
# Hyperparameters
target_node = "transaction"
epochs = 20
in_size, hidden_size, out_size, n_layers, embedding_size = 111, 64, 2, 2, 1
batch_size = 100
hyperparameters = {
    "in_size": in_size,
    "hidden_size": hidden_size,
    "out_size": out_size,
    "n_layers": n_layers,
    "embedding_size": embedding_size,
    "target_node": target_node,
    "epoch": epochs
}

scale_pos_weight = (labels[train_idx].sum() / train_data.shape[0]).item()
scale_pos_weight = torch.FloatTensor([scale_pos_weight, 1 - scale_pos_weight]).to(device)
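
The weight vector computed above assigns the fraud fraction to class 0 and its complement to class 1, so errors on the rare fraud class dominate the loss. The NumPy sketch below reproduces the weighted mean that `nn.CrossEntropyLoss` computes when given a `weight` argument. The counts 188 and 12053 are taken from the printed training distribution and are an assumption about what `labels[train_idx]` contains; `weighted_cross_entropy` is a hypothetical helper.

```python
import numpy as np


def weighted_cross_entropy(logits, targets, class_weights):
    """Class-weighted cross entropy, normalised by the sum of sample weights."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(targets)), targets]
    sample_weights = class_weights[targets]
    return -(sample_weights * picked).sum() / sample_weights.sum()


# Assumed counts: 188 fraud rows out of 12053 training rows.
fraud_fraction = 188 / 12053
class_weights = np.array([fraud_fraction, 1 - fraud_fraction])

# Two samples that both receive a confident "benign" prediction:
# the first is benign (correct), the second is fraud (wrong).
logits = np.array([[2.0, 0.0], [2.0, 0.0]])
targets = np.array([0, 1])

weighted = weighted_cross_entropy(logits, targets, class_weights)
unweighted = weighted_cross_entropy(logits, targets, np.ones(2))
print(f"weighted: {weighted:.4f}, unweighted: {unweighted:.4f}")
```

The weighted loss is larger than the unweighted one because the misclassified fraud sample carries almost all of the weight.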

In [10]:
# Dataloaders
train_loader, val_loader, test_loader = init_loaders(
    train_graph.to(device), train_idx, test_idx=inductive_idx,
    val_idx=inductive_idx, g_test=whole_graph, batch_size=batch_size)


# Set model variables
model = HinSAGE(train_graph, in_size, hidden_size, out_size, n_layers, embedding_size).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
loss_func = nn.CrossEntropyLoss(weight=scale_pos_weight.float())

In [11]:

for epoch in trange(epochs):

    train_acc, loss = train(
        model, loss_func, train_loader, labels, optimizer, feature_tensors,
        target_node)
    print(f"Epoch {epoch}/{epochs} | Train Accuracy: {train_acc} | Train Loss: {loss}")
    val_logits, val_seed, _ = evaluate(model, val_loader, feature_tensors, target_node)
    val_accuracy = accuracy(val_logits.argmax(1), labels.long()[val_seed].cpu(), "binary").item()
    val_auc = roc_auc_score(
        labels.long()[val_seed].cpu().numpy(),
        val_logits[:, 1].numpy(),
    )
    print(f"Validation Accuracy: {val_accuracy} auc {val_auc}")


Epoch 0/20 | Train Accuracy: 1.0 | Train Loss: 8.364538975722887
Validation Accuracy: 0.7320754528045654 auc 0.1592130100188761
Epoch 1/20 | Train Accuracy: 1.0 | Train Loss: 112.15738137422963
Validation Accuracy: 0.7320754528045654 auc 0.5462465514737912
Epoch 2/20 | Train Accuracy: 1.0 | Train Loss: 525.2877785372972
Validation Accuracy: 0.7358490824699402 auc 0.8560331058516045
Epoch 3/20 | Train Accuracy: 1.0 | Train Loss: 200.01628349609354
Validation Accuracy: 0.7396226525306702 auc 0.8799186873820241
Epoch 4/20 | Train Accuracy: 1.0 | Train Loss: 90.46722278861125
Validation Accuracy: 0.7433962225914001 auc 0.8205314360389139
Epoch 5/20 | Train Accuracy: 1.0 | Train Loss: 57.87523431493901
Validation Accuracy: 0.7358490824699402 auc 0.8681573979962247
Epoch 6/20 | Train Accuracy: 1.0 | Train Loss: 24.51080822898075
Validation Accuracy: 0.7433962225914001 auc 0.9358211122404531
Epoch 7/20 | Train Accuracy: 1.0 | Train Loss: 19.741554783657193
Validation Accuracy: 0.7584905624389648 auc 0.9371279221722085
Epoch 8/20 | Train Accuracy: 1.0 | Train Loss: 15.446269849315286
Validation Accuracy: 0.7547169923782349 auc 0.9406127486568898
Epoch 9/20 | Train Accuracy: 1.0 | Train Loss: 14.801136838272214
Validation Accuracy: 0.7547169923782349 auc 0.941846958036881
Epoch 10/20 | Train Accuracy: 1.0 | Train Loss: 13.87586941383779
Validation Accuracy: 0.7547169923782349 auc 0.9493974154203572
Epoch 11/20 | Train Accuracy: 1.0 | Train Loss: 14.225337034091353
Validation Accuracy: 0.7584905624389648 auc 0.9449687817627413
Epoch 12/20 | Train Accuracy: 1.0 | Train Loss: 14.03758096601814
Validation Accuracy: 0.7584905624389648 auc 0.9457673878321475
Epoch 13/20 | Train Accuracy: 1.0 | Train Loss: 13.155211296398193
Validation Accuracy: 0.7622641324996948 auc 0.9478002032815449
Epoch 14/20 | Train Accuracy: 1.0 | Train Loss: 13.074136503273621
Validation Accuracy: 0.7622641324996948 auc 0.9480906054886018
Epoch 15/20 | Train Accuracy: 1.0 | Train Loss: 12.887006599456072
Validation Accuracy: 0.7622641324996948 auc 0.9485262087991868
Epoch 16/20 | Train Accuracy: 1.0 | Train Loss: 13.221301457379013
Validation Accuracy: 0.7584905624389648 auc 0.9499056192827066
Epoch 17/20 | Train Accuracy: 1.0 | Train Loss: 12.095282299211249
Validation Accuracy: 0.7660377621650696 auc 0.9544068534920864
Epoch 18/20 | Train Accuracy: 1.0 | Train Loss: 12.169320295681246
Validation Accuracy: 0.7660377621650696 auc 0.9448235806592131
Epoch 19/20 | Train Accuracy: 1.0 | Train Loss: 12.063936041318811
Validation Accuracy: 0.7698113322257996 auc 0.9520836358356324

100%|██████████| 20/20 [00:51<00:00,  2.58s/it]

### 4.2 Inductive Step GraphSAGE

In this part, we want to compute the inductive embedding of a new transaction. To extract the embedding of the new transactions, we need to keep indices of the original graph nodes along with the new transaction nodes. We need to concatenate the test data frame to the train data frame to create a new graph that includes all nodes.
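
The index bookkeeping described above can be sketched with plain Python lists. This is only an illustration of the idea; the actual logic lives in `prepare_data`, which is not shown in this notebook and may work differently. The function `combine_with_indices` and the toy rows are hypothetical.

```python
def combine_with_indices(train_rows, new_rows):
    """Concatenate rows and remember which positions belong to which split."""
    all_rows = list(train_rows) + list(new_rows)
    train_idx = list(range(len(train_rows)))
    inductive_idx = list(range(len(train_rows), len(all_rows)))
    return all_rows, train_idx, inductive_idx


train_rows = ["txn_a", "txn_b", "txn_c"]  # transactions seen during training
new_rows = ["txn_d", "txn_e"]             # new transactions to embed

all_rows, train_idx, inductive_idx = combine_with_indices(train_rows, new_rows)
print(train_idx, inductive_idx)
```

A graph built from `all_rows` contains every node, while `inductive_idx` selects only the new transactions whose embeddings are needed.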

In [12]:
print(whole_graph)

Graph(num_nodes={'client': 861, 'merchant': 482, 'transaction': 12053},
      num_edges={('client', 'buy', 'transaction'): 12318, ('merchant', 'sell', 'transaction'): 12318, ('transaction', 'bought', 'client'): 12318, ('transaction', 'issued', 'merchant'): 12318},
      metagraph=[('client', 'transaction', 'buy'), ('transaction', 'client', 'bought'), ('transaction', 'merchant', 'issued'), ('merchant', 'transaction', 'sell')])


The inductive step applies the previously learned (and optimized) aggregation functions of the trained HinSAGE model (`model`). The test data loader, which was built on the new graph (`whole_graph`, passed to `init_loaders` as `g_test`), supplies the new transaction nodes.

In [13]:
# Create embeddings
_, train_seeds, train_embedding = evaluate(model, train_loader, feature_tensors, target_node)
test_logits, test_seeds, test_embedding = evaluate(model, test_loader, feature_tensors, target_node)

# compute metrics
test_acc = accuracy(test_logits.argmax(dim=1), labels.long()[test_seeds].cpu(), "binary").item()
test_auc = roc_auc_score(labels.long()[test_seeds].cpu().numpy(), test_logits[:, 1].numpy())

print(f"Final Test Accuracy: {test_acc} auc {test_auc}")

#acc, f_1, precision, recall, roc_auc, pr_auc, average_precision, _, _ = get_metrics(
#    test_logits.numpy(), labels[test_seeds].cpu().numpy())

#print(f"Final Test Accuracy: {acc} auc {roc_auc}")


Final Test Accuracy: 0.7509434223175049 auc 0.9314650791346014


### 5. Classification: predictions based on inductive embeddings

Now a selected classifier (XGBoost) can be trained on the training node embeddings and tested on the test node embeddings.

In [14]:
from xgboost import XGBClassifier


In [15]:
# Train XGBoost classifier on embedding vector
classifier = XGBClassifier(n_estimators=100)
classifier.fit(train_embedding.cpu().numpy(), labels[train_seeds].cpu().numpy())


If requested, the original transaction features are added to the generated embeddings. If these features are added, a baseline consisting of only these features (without embeddings) is included to analyze the net impact of embeddings on the predictive performance.
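
The feature combination described above amounts to a column-wise concatenation. The sketch below uses random placeholder arrays with hypothetical sizes; the cells in this notebook train the classifier on the embeddings only, so this is an optional variation rather than what is executed here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, emb_dim, feat_dim = 6, 4, 3  # hypothetical sizes

embeddings = rng.normal(size=(n_samples, emb_dim))  # stand-in for GNN embeddings
features = rng.normal(size=(n_samples, feat_dim))   # stand-in for raw transaction features

# Embeddings plus original features: input for the combined classifier.
combined = np.hstack([embeddings, features])

# Baseline input: original features only, to measure what the embeddings add.
baseline = features

print(combined.shape, baseline.shape)
```

Training the same classifier once on `combined` and once on `baseline` isolates the contribution of the embeddings.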

In [16]:
xgb_pred = classifier.predict_proba(test_embedding.cpu().numpy())


### 6. Evaluation

Given the highly imbalanced nature of the dataset, we evaluate the results with several metrics. The `get_metrics` helper returns accuracy, F1, precision, recall, ROC AUC, PR AUC, and average precision.
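
To see why accuracy alone can mislead on imbalanced data, consider the inductive set printed earlier (244 benign, 21 fraudulent). The sketch below computes the accuracy of a trivial classifier that always predicts the majority class, and a pairwise ROC AUC on a small toy example. Both helpers are hypothetical illustrations and are not part of `get_metrics`.

```python
def majority_baseline_accuracy(n_negative, n_positive):
    """Accuracy obtained by always predicting the larger class."""
    return max(n_negative, n_positive) / (n_negative + n_positive)


def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


baseline_acc = majority_baseline_accuracy(244, 21)
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

print(f"majority baseline accuracy: {baseline_acc:.4f}")
print(f"toy ROC AUC: {toy_auc}")  # 0.75
```

Always predicting "benign" already yields about 0.92 accuracy on this set, which is why ranking metrics such as ROC AUC are reported alongside accuracy.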

In [17]:
acc, f_1, precision, recall, roc_auc, pr_auc, average_precision, _, _ = get_metrics(
    xgb_pred, labels[inductive_idx].cpu().numpy(), name='HinSAGE_XGB')
print(f"Final Test Accuracy: {acc} auc {roc_auc}")

Final Test Accuracy: 0.8377358490566038 auc 0.9036590678089155


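The results show that XGBoost trained on the GNN embeddings reaches a higher accuracy on the inductive set (0.838) than the HinSAGE classification head alone (0.751), although its ROC AUC is somewhat lower (0.904 versus 0.931).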

### 6.2 Save models

The GraphSAGE and XGBoost models can be saved in their respective formats using the `save_model` method. For inference, the GraphSAGE model is loaded as a PyTorch model, and the XGBoost model is loaded using the `cuml` *Forest Inference Library (FIL)*.

In [18]:
model_dir = "modelpath/"

save_model(train_graph, model, hyperparameters, classifier, model_dir)


In [19]:
!ls modelpath

graph.pkl  hyperparams.pkl  model.pt  xgb.pt


In [20]:
## For inference we can load from file as follows.
from training import load_model
# do inference on loaded model, as follows
# hinsage_model,  hyperparam, g = load_model(model_dir)

### 7. Conclusion

In this workflow, we show a hybrid approach that combines a graph neural network with XGBoost for fraud detection on a credit card transaction network. For a further optimized inference pipeline, refer to the `Morpheus` fraud detection inference pipeline.

### Reference
1. Van Belle, Rafaël, et al. "Inductive Graph Representation Learning for fraud detection." Expert Systems with Applications (2022): 116463.
2. https://stellargraph.readthedocs.io/en/stable/hinsage.html?highlight=hinsage
3. https://github.com/rapidsai/clx/blob/branch-0.20/examples/forest_inference/xgboost_training.ipynb