# Anti Money Laundering Detection with GNN node classification
### This notenook includes GNN model training and dataset implementation with PyG library. In this example, we used HI-Small_Trans.csv as our dataset for training and testing.  

In [2]:
import datetime
import os
from typing import Callable, Optional
import pandas as pd
from sklearn import preprocessing
import numpy as np
import torch

from torch_geometric.data import (
    Data,
    InMemoryDataset
)

pd.set_option('display.max_columns', None)
path = '/Users/owhy/Documents/Datasets/HI-Small_Trans_3.csv'
df = pd.read_csv(path)

  from .autonotebook import tqdm as notebook_tqdm


# Data visualization and possible feature engineering
Let's look into the dataset

In [3]:
print(df.head())

          Timestamp  From Bank    Account  To Bank  Account.1  \
0  2022/09/01 00:20         10  8000EBD30       10  8000EBD30   
1  2022/09/01 00:20       3208  8000F4580        1  8000F5340   
2  2022/09/01 00:00       3209  8000F4670     3209  8000F4670   
3  2022/09/01 00:02         12  8000F5030       12  8000F5030   
4  2022/09/01 00:06         10  8000F5200       10  8000F5200   

   Amount Received Receiving Currency  Amount Paid Payment Currency  \
0          3697.34          US Dollar      3697.34        US Dollar   
1             0.01          US Dollar         0.01        US Dollar   
2         14675.57          US Dollar     14675.57        US Dollar   
3          2806.97          US Dollar      2806.97        US Dollar   
4         36682.97          US Dollar     36682.97        US Dollar   

  Payment Format  Is Laundering  
0   Reinvestment              0  
1         Cheque              0  
2   Reinvestment              0  
3   Reinvestment              0  
4   Reinvest

After the viewing the dataframe, we suggest that we can extract all accounts from receiver and payer among all transcation for sorting the suspicious accounts. We can transform the whole dataset into node classification problem by considering accounts as nodes while transcation as edges.

The object columns should be encoded into classes with sklearn LabelEncoder.

In [4]:
print(df.dtypes)

Timestamp              object
From Bank               int64
Account                object
To Bank                 int64
Account.1              object
Amount Received       float64
Receiving Currency     object
Amount Paid           float64
Payment Currency       object
Payment Format         object
Is Laundering           int64
dtype: object


Check if there are any null values

In [5]:
print(df.isnull().sum())

Timestamp             0
From Bank             0
Account               0
To Bank               0
Account.1             0
Amount Received       0
Receiving Currency    0
Amount Paid           0
Payment Currency      0
Payment Format        0
Is Laundering         0
dtype: int64


There are two columns representing paid and received amount of each transcation, wondering if it is necessary to split the amount into two columns when they shared the same value, unless there are transcation fee/transcation between different currency. Let's find out 

In [6]:
print('Amount Received equals to Amount Paid:')
print(df['Amount Received'].equals(df['Amount Paid']))
print('Receiving Currency equals to Payment Currency:')
print(df['Receiving Currency'].equals(df['Payment Currency']))

Amount Received equals to Amount Paid:
False
Receiving Currency equals to Payment Currency:
False


It seens involved the transcations between different currency, let's print it out

In [7]:
not_equal1 = df.loc[~(df['Amount Received'] == df['Amount Paid'])]
not_equal2 = df.loc[~(df['Receiving Currency'] == df['Payment Currency'])]
print(not_equal1)
print('---------------------------------------------------------------------------')
print(not_equal2)

             Timestamp  From Bank    Account  To Bank  Account.1  \
1173  2022/09/01 00:22       1362  80030A870     1362  80030A870   

      Amount Received Receiving Currency  Amount Paid Payment Currency  \
1173            52.11               Euro        61.06        US Dollar   

     Payment Format  Is Laundering  
1173            ACH              0  
---------------------------------------------------------------------------
             Timestamp  From Bank    Account  To Bank  Account.1  \
1173  2022/09/01 00:22       1362  80030A870     1362  80030A870   

      Amount Received Receiving Currency  Amount Paid Payment Currency  \
1173            52.11               Euro        61.06        US Dollar   

     Payment Format  Is Laundering  
1173            ACH              0  


The size of two df shows that there are transcation fee and transcation between different currency, we cannot combine/drop the amount columns.

As we are going to encode the columns, we have to make sure that the classes of same attribute are aligned.
Let's check if the list of Receiving Currency and Payment Currency are the same

In [8]:
print(sorted(df['Receiving Currency'].unique()))
print(sorted(df['Payment Currency'].unique()))

['Bitcoin', 'Euro', 'US Dollar']
['Bitcoin', 'Euro', 'US Dollar']


# Data Preprocessing
### We will show the functions used in the PyG dataset first, dataset and model training will be provided in bottom section

In the data preprocessing, we perform below transformation:  
1. Transform the Timestamp with min max normalization.  
2. Create unique ID for each account by adding bank code with account number.  
3. Create receiving_df with the information of receiving accounts, received amount and currency
4. Create paying_df with the information of payer accounts, paid amount and currency
5. Create a list of currency used among all transactions
6. Label the 'Payment Format', 'Payment Currency', 'Receiving Currency' by classes with sklearn LabelEncoder


In [9]:
def df_label_encoder(df, columns):
        le = preprocessing.LabelEncoder()
        for i in columns:
            df[i] = le.fit_transform(df[i].astype(str))
        return df

def preprocess(df):
        df = df_label_encoder(df,['Payment Format', 'Payment Currency', 'Receiving Currency'])
        df['Timestamp'] = pd.to_datetime(df['Timestamp'])
        df['Timestamp'] = df['Timestamp'].apply(lambda x: x.value)
        df['Timestamp'] = (df['Timestamp']-df['Timestamp'].min())/(df['Timestamp'].max()-df['Timestamp'].min())

        df['Account'] = df['From Bank'].astype(str) + '_' + df['Account']
        df['Account.1'] = df['To Bank'].astype(str) + '_' + df['Account.1']
        df = df.sort_values(by=['Account'])
        receiving_df = df[['Account.1', 'Amount Received', 'Receiving Currency']]
        paying_df = df[['Account', 'Amount Paid', 'Payment Currency']]
        receiving_df = receiving_df.rename({'Account.1': 'Account'}, axis=1)
        currency_ls = sorted(df['Receiving Currency'].unique())

        return df, receiving_df, paying_df, currency_ls

Let's have a look of processed df

In [10]:
df, receiving_df, paying_df, currency_ls = preprocess(df = df)
print(df.head())

      Timestamp  From Bank         Account  To Bank        Account.1  \
67     0.172414       1047  1047_800416A40    15723  15723_8051B9F40   
3614   0.965517       1047  1047_800679ED0     1047   1047_800679ED0   
3702   0.620690       1047  1047_800683140     1047   1047_800683140   
3689   0.965517       1047  1047_8006920D0     1047   1047_8006920D0   
3812   0.655172       1047  1047_8006B2FA0     1047   1047_8006B2FA0   

      Amount Received  Receiving Currency  Amount Paid  Payment Currency  \
67             130.10                   2       130.10                 2   
3614        261353.23                   2    261353.23                 2   
3702         55345.22                   2     55345.22                 2   
3689        181317.26                   2    181317.26                 2   
3812         47062.42                   2     47062.42                 2   

      Payment Format  Is Laundering  
67                 4              0  
3614               5              

paying df and receiving df:

In [11]:
print(receiving_df.head())
print(paying_df.head())

              Account  Amount Received  Receiving Currency
67    15723_8051B9F40           130.10                   2
3614   1047_800679ED0        261353.23                   2
3702   1047_800683140         55345.22                   2
3689   1047_8006920D0        181317.26                   2
3812   1047_8006B2FA0         47062.42                   2
             Account  Amount Paid  Payment Currency
67    1047_800416A40       130.10                 2
3614  1047_800679ED0    261353.23                 2
3702  1047_800683140     55345.22                 2
3689  1047_8006920D0    181317.26                 2
3812  1047_8006B2FA0     47062.42                 2


currency_ls:

In [12]:
print(currency_ls)

[0, 1, 2]


We would like to extract all unique accounts from payer and receiver as node of our graph. It includes the unique account ID, Bank code and the label of 'Is Laundering'.  
In this section, we consider both payer and receiver involved in a illicit transaction as suspicious accounts, we will label both accounts with 'Is Laundering' == 1.

In [13]:
def get_all_account(df):
        ldf = df[['Account', 'From Bank']]
        rdf = df[['Account.1', 'To Bank']]
        suspicious = df[df['Is Laundering']==1]
        s1 = suspicious[['Account', 'Is Laundering']]
        s2 = suspicious[['Account.1', 'Is Laundering']]
        s2 = s2.rename({'Account.1': 'Account'}, axis=1)
        suspicious = pd.concat([s1, s2], join='outer')
        suspicious = suspicious.drop_duplicates()

        ldf = ldf.rename({'From Bank': 'Bank'}, axis=1)
        rdf = rdf.rename({'Account.1': 'Account', 'To Bank': 'Bank'}, axis=1)
        df = pd.concat([ldf, rdf], join='outer')
        df = df.drop_duplicates()

        df['Is Laundering'] = 0
        df.set_index('Account', inplace=True)
        df.update(suspicious.set_index('Account'))
        df = df.reset_index()
        return df

Take a look of the account list:

In [14]:
accounts = get_all_account(df)
print(accounts.head())

          Account  Bank  Is Laundering
0  1047_800416A40  1047              0
1  1047_800679ED0  1047              0
2  1047_800683140  1047              0
3  1047_8006920D0  1047              0
4  1047_8006B2FA0  1047              0


# Node features
For node features, we would like to aggregate the mean of paid and received amount with different types of currency as the new features of each node. 

In [15]:
def paid_currency_aggregate(currency_ls, paying_df, accounts):
        for i in currency_ls:
            temp = paying_df[paying_df['Payment Currency'] == i]
            accounts['avg paid '+str(i)] = temp['Amount Paid'].groupby(temp['Account']).transform('mean')
        return accounts

def received_currency_aggregate(currency_ls, receiving_df, accounts):
    for i in currency_ls:
        temp = receiving_df[receiving_df['Receiving Currency'] == i]
        accounts['avg received '+str(i)] = temp['Amount Received'].groupby(temp['Account']).transform('mean')
    accounts = accounts.fillna(0)
    return accounts

Now we can define the node attributes by the bank code and the mean of paid and received amount with different types of currency.

In [16]:
def get_node_attr(currency_ls, paying_df,receiving_df, accounts):
        node_df = paid_currency_aggregate(currency_ls, paying_df, accounts)
        node_df = received_currency_aggregate(currency_ls, receiving_df, node_df)
        node_label = torch.from_numpy(node_df['Is Laundering'].values).to(torch.float)
        node_df = node_df.drop(['Account', 'Is Laundering'], axis=1)
        node_df = df_label_encoder(node_df,['Bank'])
#         node_df = torch.from_numpy(node_df.values).to(torch.float)  # comment for visualization
        return node_df, node_label

Take a look of node_df:

In [17]:
node_df, node_label = get_node_attr(currency_ls, paying_df,receiving_df, accounts)
print(node_df.head())

   Bank  avg paid 0  avg paid 1  avg paid 2  avg received 0  avg received 1  \
0     3         0.0         0.0     3697.34             0.0             0.0   
1     3         0.0         0.0        0.01             0.0             0.0   
2     3         0.0         0.0    14675.57             0.0             0.0   
3     3         0.0         0.0     2806.97             0.0             0.0   
4     3         0.0         0.0    36682.97             0.0             0.0   

   avg received 2  
0         3697.34  
1            0.01  
2        14675.57  
3         2806.97  
4        36682.97  


# Edge features
In terms of edge features, we would like to conside each transcation as edges.  
For edge index, we replace all account with index and stack into a list with size of [2, num of transcation]  
For edge attributes, we used 'Timestamp', 'Amount Received', 'Receiving Currency', 'Amount Paid', 'Payment Currency' and 'Payment Format'


In [18]:
def get_edge_df(accounts, df):
        accounts = accounts.reset_index(drop=True)
        accounts['ID'] = accounts.index
        mapping_dict = dict(zip(accounts['Account'], accounts['ID']))
        df['From'] = df['Account'].map(mapping_dict)
        df['To'] = df['Account.1'].map(mapping_dict)
        df = df.drop(['Account', 'Account.1', 'From Bank', 'To Bank'], axis=1)

        edge_index = torch.stack([torch.from_numpy(df['From'].values), torch.from_numpy(df['To'].values)], dim=0)

        df = df.drop(['Is Laundering', 'From', 'To'], axis=1)

#         edge_attr = torch.from_numpy(df.values).to(torch.float)  # comment for visualization

        edge_attr = df  # for visualization
        return edge_attr, edge_index

edge_attr:

In [19]:
edge_attr, edge_index = get_edge_df(accounts, df)
print(edge_attr.head())

      Timestamp  Amount Received  Receiving Currency  Amount Paid  \
67     0.172414           130.10                   2       130.10   
3614   0.965517        261353.23                   2    261353.23   
3702   0.620690         55345.22                   2     55345.22   
3689   0.965517        181317.26                   2    181317.26   
3812   0.655172         47062.42                   2     47062.42   

      Payment Currency  Payment Format  
67                   2               4  
3614                 2               5  
3702                 2               5  
3689                 2               5  
3812                 2               5  


edge_index:

In [20]:
print(edge_index)

tensor([[   0,    1,    2,  ..., 3732, 3732, 3733],
        [3734,    1,    2,  ..., 3732, 3732, 4187]])


# Final code 
### Below we will show the final code for model.py, train.py and dataset.py

# Model Architecture
In this section, we used Graph Attention Networks as our backbone model.  
The model built with two GATConv layers followed by a linear layer with sigmoid outout for classification

In [21]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch_geometric.transforms as T
from torch_geometric.nn import GATConv, Linear

class GAT(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, heads):
        super().__init__()
        self.conv1 = GATConv(in_channels, hidden_channels, heads, dropout=0.6)
        self.conv2 = GATConv(hidden_channels * heads, int(hidden_channels/4), heads=1, concat=False, dropout=0.6)
        self.lin = Linear(int(hidden_channels/4), out_channels)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x, edge_index, edge_attr):
        x = F.dropout(x, p=0.6, training=self.training)
        x = F.elu(self.conv1(x, edge_index, edge_attr))
        x = F.dropout(x, p=0.6, training=self.training)
        x = F.elu(self.conv2(x, edge_index, edge_attr))
        x = self.lin(x)
        x = self.sigmoid(x)
        
        return x

# PyG InMemoryDataset
Finally we can build the dataset with above functions

In [22]:
class AMLtoGraph(InMemoryDataset):

    def __init__(self, root: str, edge_window_size: int = 10,
                 transform: Optional[Callable] = None,
                 pre_transform: Optional[Callable] = None):
        self.edge_window_size = edge_window_size
        super().__init__(root, transform, pre_transform)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self) -> str:
        return 'HI-Small_Trans_3.csv'

    @property
    def processed_file_names(self) -> str:
        return 'data.pt'

    @property
    def num_nodes(self) -> int:
        return self._data.edge_index.max().item() + 1

    def df_label_encoder(self, df, columns):
        le = preprocessing.LabelEncoder()
        for i in columns:
            df[i] = le.fit_transform(df[i].astype(str))
        return df


    def preprocess(self, df):
        df = self.df_label_encoder(df,['Payment Format', 'Payment Currency', 'Receiving Currency'])
        df['Timestamp'] = pd.to_datetime(df['Timestamp'])
        df['Timestamp'] = df['Timestamp'].apply(lambda x: x.value)
        df['Timestamp'] = (df['Timestamp']-df['Timestamp'].min())/(df['Timestamp'].max()-df['Timestamp'].min())

        df['Account'] = df['From Bank'].astype(str) + '_' + df['Account']
        df['Account.1'] = df['To Bank'].astype(str) + '_' + df['Account.1']
        df = df.sort_values(by=['Account'])
        receiving_df = df[['Account.1', 'Amount Received', 'Receiving Currency']]
        paying_df = df[['Account', 'Amount Paid', 'Payment Currency']]
        receiving_df = receiving_df.rename({'Account.1': 'Account'}, axis=1)
        currency_ls = sorted(df['Receiving Currency'].unique())

        return df, receiving_df, paying_df, currency_ls

    def get_all_account(self, df):
        ldf = df[['Account', 'From Bank']]
        rdf = df[['Account.1', 'To Bank']]
        suspicious = df[df['Is Laundering']==1]
        s1 = suspicious[['Account', 'Is Laundering']]
        s2 = suspicious[['Account.1', 'Is Laundering']]
        s2 = s2.rename({'Account.1': 'Account'}, axis=1)
        suspicious = pd.concat([s1, s2], join='outer')
        suspicious = suspicious.drop_duplicates()

        ldf = ldf.rename({'From Bank': 'Bank'}, axis=1)
        rdf = rdf.rename({'Account.1': 'Account', 'To Bank': 'Bank'}, axis=1)
        df = pd.concat([ldf, rdf], join='outer')
        df = df.drop_duplicates()

        df['Is Laundering'] = 0
        df.set_index('Account', inplace=True)
        df.update(suspicious.set_index('Account'))
        df = df.reset_index()
        return df
    
    def paid_currency_aggregate(self, currency_ls, paying_df, accounts):
        for i in currency_ls:
            temp = paying_df[paying_df['Payment Currency'] == i]
            accounts['avg paid '+str(i)] = temp['Amount Paid'].groupby(temp['Account']).transform('mean')
        return accounts

    def received_currency_aggregate(self, currency_ls, receiving_df, accounts):
        for i in currency_ls:
            temp = receiving_df[receiving_df['Receiving Currency'] == i]
            accounts['avg received '+str(i)] = temp['Amount Received'].groupby(temp['Account']).transform('mean')
        accounts = accounts.fillna(0)
        return accounts

    def get_edge_df(self, accounts, df):
        accounts = accounts.reset_index(drop=True)
        accounts['ID'] = accounts.index
        mapping_dict = dict(zip(accounts['Account'], accounts['ID']))
        df['From'] = df['Account'].map(mapping_dict)
        df['To'] = df['Account.1'].map(mapping_dict)
        df = df.drop(['Account', 'Account.1', 'From Bank', 'To Bank'], axis=1)

        edge_index = torch.stack([torch.from_numpy(df['From'].values), torch.from_numpy(df['To'].values)], dim=0)

        print(edge_index)

        df = df.drop(['Is Laundering', 'From', 'To'], axis=1)

        edge_attr = torch.from_numpy(df.values).to(torch.float)
        return edge_attr, edge_index

    def get_node_attr(self, currency_ls, paying_df,receiving_df, accounts):
        node_df = self.paid_currency_aggregate(currency_ls, paying_df, accounts)
        node_df = self.received_currency_aggregate(currency_ls, receiving_df, node_df)
        node_label = torch.from_numpy(node_df['Is Laundering'].values).to(torch.float)
        node_df = node_df.drop(['Account', 'Is Laundering'], axis=1)
        node_df = self.df_label_encoder(node_df,['Bank'])
        node_df = torch.from_numpy(node_df.values).to(torch.float)
        return node_df, node_label

    def process(self):
        df = pd.read_csv(self.raw_paths[0])
        df, receiving_df, paying_df, currency_ls = self.preprocess(df)
        accounts = self.get_all_account(df)
        node_attr, node_label = self.get_node_attr(currency_ls, paying_df,receiving_df, accounts)
        edge_attr, edge_index = self.get_edge_df(accounts, df)

        data = Data(x=node_attr,
                    edge_index=edge_index,
                    y=node_label,
                    edge_attr=edge_attr
                    )
        
        data_list = [data] 
        if self.pre_filter is not None:
            data_list = [d for d in data_list if self.pre_filter(d)]

        if self.pre_transform is not None:
            data_list = [self.pre_transform(d) for d in data_list]

        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])

# Model Training 
As we cannot create folder in kaggle, please follow the instructions in https://github.com/issacchan26/AntiMoneyLaunderingDetectionWithGNN before you start training 

In [23]:
dataset = AMLtoGraph('/Users/owhy/Documents/Datasets')
data = dataset[0]

In [24]:
print(data)


Data(x=[4188, 7], edge_index=[2, 4999], edge_attr=[4999, 6], y=[4188])


In [29]:
edge_index

tensor([[   0,    1,    2,  ..., 3732, 3732, 3733],
        [3734,    1,    2,  ..., 3732, 3732, 4187]])

In [28]:
data.edge_index

tensor([[   0,    1,    2,  ..., 3732, 3732, 3733],
        [3734,    1,    2,  ..., 3732, 3732, 4187]])

In [26]:
print(type(data))


<class 'torch_geometric.data.data.Data'>


In [103]:
data.num_features # TODO node features

7

In [112]:
model

GAT(
  (conv1): GATConv(7, 16, heads=8)
  (conv2): GATConv(128, 4, heads=1)
  (lin): Linear(4, 1, bias=True)
  (sigmoid): Sigmoid()
)

In [113]:
data.y 

tensor([0., 0., 0.,  ..., 0., 0., 0.])

In [97]:
import torch
import torch_geometric.transforms as T
from torch_geometric.loader import NeighborLoader

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
dataset = AMLtoGraph('/Users/owhy/Documents/Datasets')
data = dataset[0]
epoch = 20

model = GAT(in_channels=data.num_features, hidden_channels=16, out_channels=1, heads=8) # TODO node features
model = model.to(device)
criterion = torch.nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.0001)

split = T.RandomNodeSplit(split='train_rest', num_val=0.1, num_test=0)
data = split(data)

train_loader = loader = NeighborLoader(
    data,
    num_neighbors=[30] * 2,
    batch_size=256,
    input_nodes=data.train_mask,
)

test_loader = loader = NeighborLoader(
    data,
    num_neighbors=[30] * 2,
    batch_size=256,
    input_nodes=data.val_mask,
)

for i in range(epoch):
    total_loss = 0
    model.train()
    for data in train_loader:
        optimizer.zero_grad()
        data.to(device)
        pred = model(data.x, data.edge_index, data.edge_attr) # TODO nodes, adjacency matrix, edge attribute
        ground_truth = data.y # TODO True labels
        loss = criterion(pred, ground_truth.unsqueeze(1))
        loss.backward()
        optimizer.step()
        total_loss += float(loss)
    if epoch%10 == 0: # TODO once training is done --> evaluation
        print(f"Epoch: {i:03d}, Loss: {total_loss:.4f}")
        model.eval()
        acc = 0
        total = 0
        with torch.no_grad():
            for test_data in test_loader:
                test_data.to(device)
                pred = model(test_data.x, test_data.edge_index, test_data.edge_attr)
                ground_truth = test_data.y
                correct = (pred == ground_truth.unsqueeze(1)).sum().item()
                total += len(ground_truth)
                acc += correct
            acc = acc/total
            print('accuracy:', acc)

ImportError: 'NeighborSampler' requires either 'pyg-lib' or 'torch-sparse'

In [102]:
! pip3 show torch

Name: torch
Version: 2.2.2
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, typing-extensions
Required-by: torchaudio, torchvision


In [95]:
! pip3 install wheel




In [87]:
! pip3 install NeighborSampler

[31mERROR: Could not find a version that satisfies the requirement NeighborSampler (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for NeighborSampler[0m[31m
[0m

In [96]:
! pip3 install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-1.13.0+cpu.html


Looking in links: https://data.pyg.org/whl/torch-1.13.0+cpu.html
[31mERROR: Could not find a version that satisfies the requirement pyg_lib (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for pyg_lib[0m[31m
[0m

In [91]:
if (torch_geometric.typing.WITH_PYG_LIB and self.subgraph_type != SubgraphType.induced):


SyntaxError: incomplete input (2663714873.py, line 1)

In [92]:
! pip3 install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2


Collecting torchvision==0.17.2
  Downloading torchvision-0.17.2-cp312-cp312-macosx_11_0_arm64.whl.metadata (6.6 kB)
Collecting torchaudio==2.2.2
  Downloading torchaudio-2.2.2-cp312-cp312-macosx_11_0_arm64.whl.metadata (6.4 kB)
Downloading torchvision-0.17.2-cp312-cp312-macosx_11_0_arm64.whl (1.6 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.6/1.6 MB[0m [31m10.4 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hDownloading torchaudio-2.2.2-cp312-cp312-macosx_11_0_arm64.whl (1.8 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.8/1.8 MB[0m [31m11.0 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hInstalling collected packages: torchvision, torchaudio
Successfully installed torchaudio-2.2.2 torchvision-0.17.2


In [93]:
! pip3 install torch_geometric




# Future Work
In this notebook, we performed the node classification with GAT and the result accuracy looks satisfied.  
However, it may due to highly imbalance data of the dataset. It is suggested that balance the class of 1 and 0 in the data preprocessing. It is expected that the accuracy will dropped a little bit after balancing the data.  We will keep exploring to see if there are any other models give better performance, such as other traditional regression/classifier model.

## Reference
Some of the feature engineering of this repo are referenced to below papers, highly recommend to read:
1. [Weber, M., Domeniconi, G., Chen, J., Weidele, D. K. I., Bellei, C., Robinson, T., & Leiserson, C. E. (2019). Anti-money laundering in bitcoin: Experimenting with graph convolutional networks for financial forensics. arXiv preprint arXiv:1908.02591.](https://arxiv.org/pdf/1908.02591.pdf)
2. [Johannessen, F., & Jullum, M. (2023). Finding Money Launderers Using Heterogeneous Graph Neural Networks. arXiv preprint arXiv:2307.13499.](https://arxiv.org/pdf/2307.13499.pdf)