<a href="https://colab.research.google.com/github/kahram-y/first-repository/blob/master/AML_project/AML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
pip install pandas networkx torch torch-geometric matplotlib pyvis



In [2]:
import pandas as pd
import networkx as nx
import numpy as np

# 1. Load the data
from google.colab import drive
drive.mount('/content/drive')

# File paths
trans_path = '/content/drive/MyDrive/HI-Medium_Trans.csv'
accounts_path = '/content/drive/MyDrive/HI-Medium_accounts.csv'

# pandas option (show all columns)
pd.set_option('display.max_columns', None)

# Load the CSV files
df = pd.read_csv(trans_path)
acc_df = pd.read_csv(accounts_path)

# Sample the data for this exercise (the full dataset is very large)
df = df.head(50000)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
print(df.head())

          Timestamp  From Bank    Account  To Bank  Account.1  \
0  2022/09/01 00:17         20  800104D70       20  800104D70   
1  2022/09/01 00:02       3196  800107150     3196  800107150   
2  2022/09/01 00:17       1208  80010E430     1208  80010E430   
3  2022/09/01 00:03       1208  80010E650       20  80010E6F0   
4  2022/09/01 00:02       1208  80010E650       20  80010EA30   

   Amount Received Receiving Currency  Amount Paid Payment Currency  \
0          6794.63          US Dollar      6794.63        US Dollar   
1          7739.29          US Dollar      7739.29        US Dollar   
2          1880.23          US Dollar      1880.23        US Dollar   
3      73966883.00          US Dollar  73966883.00        US Dollar   
4      45868454.00          US Dollar  45868454.00        US Dollar   

  Payment Format  Is Laundering  
0   Reinvestment              0  
1   Reinvestment              0  
2   Reinvestment              0  
3         Cheque              0  
4         Ch

In [4]:
print(acc_df.head())

                Bank Name  Bank ID Account Number    Entity ID  \
0         China Bank #561    53267      817D00980  2AA1F24F180   
1       Spain Bank #18657   316997      808BB2280  2AA1EEB8540   
2    First Bank of Helena   339367      8505ED380  2AA206D7790   
3       Mexico Bank #3367  3148419      8363D4180  2AA2001B1A0   
4  Switzerland Bank #2372  3174937      842090C80  2AA20224CB0   

                   Entity Name  
0          Corporation #183669  
1          Partnership #193780  
2    Sole Proprietorship #4078  
3          Partnership #133577  
4  Sole Proprietorship #190823  


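**Graph Construction and Feature Extraction (Dual-view)**

We use NetworkX to build the network of connections between accounts, extract the centrality measures emphasized in material from the Korea Financial Telecommunications and Clearings Institute (KFTC), and combine them with the existing features.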

In [5]:
# 2. Build the graph (NetworkX)
G = nx.from_pandas_edgelist(df, source='Account', target='Account.1',
                             edge_attr=['Amount Paid', 'Payment Currency', 'Is Laundering'],
                             create_using=nx.DiGraph())

# 3. Feature extraction (graph features): compute centrality measures
print("Computing per-account centrality measures...")
degree_cent = nx.degree_centrality(G)
pagerank = nx.pagerank(G, alpha=0.85)

# Merge into a DataFrame for the dual-view setup
# Build the feature table keyed by the list of accounts (nodes)
nodes = list(G.nodes())
feature_df = pd.DataFrame(nodes, columns=['Account'])
feature_df['Degree_Cent'] = feature_df['Account'].map(degree_cent)
feature_df['PageRank'] = feature_df['Account'].map(pagerank)

# Handle missing values
feature_df.fillna(0, inplace=True)
print("Graph feature extraction complete (dual-view ready)")

Computing per-account centrality measures...
Graph feature extraction complete (dual-view ready)
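
To make the degree centrality feature concrete, here is a minimal sketch on a toy directed graph. It recomputes the value by hand, assuming the usual definition for directed graphs, (in-degree + out-degree) / (n - 1), and assuming there are no duplicate edges. The node names and the helper `degree_centrality_sketch` are hypothetical and not part of the notebook.

```python
from collections import Counter

def degree_centrality_sketch(edges):
    """(in-degree + out-degree) / (n - 1) for each node of a directed edge list."""
    degree = Counter()
    for src, dst in edges:
        degree[src] += 1  # outgoing edge
        degree[dst] += 1  # incoming edge
    n = len(degree)
    if n <= 1:
        # Assumption for this sketch: avoid dividing by zero on trivial graphs.
        return {node: 0.0 for node in degree}
    return {node: d / (n - 1) for node, d in degree.items()}

# Toy transfers: A pays B and C, B pays C, D pays A.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "A")]
cent = degree_centrality_sketch(toy_edges)
print(cent)
```

Account A touches every other account and gets the maximum score, while D, with a single outgoing transfer, gets the lowest.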


**Model Training (PyTorch Geometric - GraphSAGE)**

Based on the extracted features, we design a GraphSAGE model and train it.

In [6]:
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import SAGEConv

# Convert to a PyG Data object
node_map = {node: i for i, node in enumerate(nodes)}
edge_index = torch.tensor([[node_map[s], node_map[t]] for s, t in zip(df['Account'], df['Account.1'])], dtype=torch.long).t()

# Node features (dual-view: for simplicity, only the centrality measures are used here)
x = torch.tensor(feature_df[['Degree_Cent', 'PageRank']].values, dtype=torch.float)

# Create labels (converting the transactions' Is Laundering flag into node labels is simplified for this exercise, so all labels are 0)
# In practice, each account itself must be labeled as fraudulent or not
y = torch.zeros(len(nodes), dtype=torch.long)

data = Data(x=x, edge_index=edge_index, y=y)

# Design the GraphSAGE model
class SAGE(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = SAGEConv(in_channels, hidden_channels)
        self.conv2 = SAGEConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return x

model = SAGE(in_channels=2, hidden_channels=16, out_channels=2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Training loop
model.train()
for epoch in range(100):
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    # Compute the loss once real labels are available (this example only shows the training structure)
    # loss = F.cross_entropy(out, data.y)
    # loss.backward()
    # optimizer.step()
    if epoch % 20 == 0:
        print(f"Epoch {epoch}: training in progress...")

print("GraphSAGE model training complete")

Epoch 0: training in progress...
Epoch 20: training in progress...
Epoch 40: training in progress...
Epoch 60: training in progress...
Epoch 80: training in progress...
GraphSAGE model training complete


**Visualization (PyVis)**

We visualize the detected suspicious fund-flow network as an interactive graph.

In [7]:
from pyvis.network import Network

# Extract a small subgraph for visualization (focused on suspicious transactions)
suspicious_edges = df[df['Is Laundering'] == 1].head(100)
subG = nx.from_pandas_edgelist(suspicious_edges, source='Account', target='Account.1',
                                edge_attr=True, create_using=nx.DiGraph())

# PyVis visualization settings
net = Network(height="750px", width="100%", bgcolor="#222222", font_color="white", directed=True)

for n in subG.nodes():
    net.add_node(n, label=str(n), color='red' if n in suspicious_edges['Account'].values else 'blue')

for e in subG.edges(data=True):
    net.add_edge(e[0], e[1], title=f"Amount Paid: {e[2]['Amount Paid']}")

# Save the result as an HTML file
net.save_graph("aml_detection_result.html")
print("Visualization complete: check the 'aml_detection_result.html' file.")

Visualization complete: check the 'aml_detection_result.html' file.


# Anti Money Laundering Detection with GNN node classification
This notebook includes GNN model training and a dataset implementation with the PyG library. In this example, we use HI-Medium_Trans.csv as our dataset for training and testing; the exploratory cells below work on the first 50,000 rows loaded earlier.

In [8]:
import datetime
import os
from typing import Callable, Optional
import pandas as pd
from sklearn import preprocessing
import numpy as np
import torch

from torch_geometric.data import (
    Data,
    InMemoryDataset
)

# Data visualization and possible feature engineering
Let's look into the dataset

In [9]:
print(df.head())

          Timestamp  From Bank    Account  To Bank  Account.1  \
0  2022/09/01 00:17         20  800104D70       20  800104D70   
1  2022/09/01 00:02       3196  800107150     3196  800107150   
2  2022/09/01 00:17       1208  80010E430     1208  80010E430   
3  2022/09/01 00:03       1208  80010E650       20  80010E6F0   
4  2022/09/01 00:02       1208  80010E650       20  80010EA30   

   Amount Received Receiving Currency  Amount Paid Payment Currency  \
0          6794.63          US Dollar      6794.63        US Dollar   
1          7739.29          US Dollar      7739.29        US Dollar   
2          1880.23          US Dollar      1880.23        US Dollar   
3      73966883.00          US Dollar  73966883.00        US Dollar   
4      45868454.00          US Dollar  45868454.00        US Dollar   

  Payment Format  Is Laundering  
0   Reinvestment              0  
1   Reinvestment              0  
2   Reinvestment              0  
3         Cheque              0  
4         Ch

After viewing the dataframe, we propose extracting all accounts, both payers and receivers, from the transactions in order to identify the suspicious ones. We can turn the whole dataset into a node classification problem by treating accounts as nodes and transactions as edges.

The object columns should be encoded into classes with sklearn LabelEncoder.

In [10]:
print(df.dtypes)

Timestamp              object
From Bank               int64
Account                object
To Bank                 int64
Account.1              object
Amount Received       float64
Receiving Currency     object
Amount Paid           float64
Payment Currency       object
Payment Format         object
Is Laundering           int64
dtype: object


Check if there are any null values

In [11]:
print(df.isnull().sum())

Timestamp             0
From Bank             0
Account               0
To Bank               0
Account.1             0
Amount Received       0
Receiving Currency    0
Amount Paid           0
Payment Currency      0
Payment Format        0
Is Laundering         0
dtype: int64


There are two columns for the paid and received amount of each transaction. It is unclear whether both are needed when they share the same value, unless there are transaction fees or transactions between different currencies. Let's find out.

In [12]:
print('Amount Received equals to Amount Paid:')
print(df['Amount Received'].equals(df['Amount Paid']))
print('Receiving Currency equals to Payment Currency:')
print(df['Receiving Currency'].equals(df['Payment Currency']))

Amount Received equals to Amount Paid:
False
Receiving Currency equals to Payment Currency:
False


It seems that some transactions involve different currencies. Let's print them out.

In [13]:
not_equal1 = df.loc[~(df['Amount Received'] == df['Amount Paid'])]
not_equal2 = df.loc[~(df['Receiving Currency'] == df['Payment Currency'])]
print(not_equal1)
print('---------------------------------------------------------------------------')
print(not_equal2)

              Timestamp  From Bank    Account  To Bank  Account.1  \
268    2022/09/01 00:19       4011  8032D1A00     4011  8032D1A00   
282    2022/09/01 00:13       5991  80341B8B0     5991  80341B8B0   
1577   2022/09/01 00:18          0  800417500        0  800417500   
1926   2022/09/01 00:11       3504  800492D00     3504  800492D00   
5721   2022/09/01 00:29       1488  800C948C0     1488  800C948C0   
6071   2022/09/01 00:17       2597  800DD7610     2597  800DD7610   
15773  2022/09/01 00:29       1818  8021F4AE0     1818  8021F4AE0   
16371  2022/09/01 00:05       1601  802321380     1601  802321380   
21135  2022/09/01 00:05      34384  802B553D0    34384  802B553D0   
23296  2022/09/01 00:20        867  802EAE130      867  802EAE130   
28907  2022/09/01 00:06       5763  8037A4C20     5763  8037A4C20   
30171  2022/09/01 00:11       1922  8039627F0     1922  8039627F0   
32009  2022/09/01 00:24       3902  803C0AB30     3902  803C0AB30   
32587  2022/09/01 00:27       9043

The sizes of the two DataFrames show that some transactions carry a fee or convert between currencies, so we cannot merge or drop either amount column.

Since we are going to encode the columns, we must make sure that the classes of the same attribute are aligned. Let's check whether Receiving Currency and Payment Currency contain the same set of values.

In [14]:
print(sorted(df['Receiving Currency'].unique()))
print(sorted(df['Payment Currency'].unique()))

['Australian Dollar', 'Bitcoin', 'Canadian Dollar', 'Euro', 'Ruble', 'Rupee', 'Shekel', 'UK Pound', 'US Dollar', 'Yen', 'Yuan']
['Australian Dollar', 'Bitcoin', 'Canadian Dollar', 'Euro', 'Ruble', 'Rupee', 'Shekel', 'UK Pound', 'US Dollar', 'Yen', 'Yuan']
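
If each column is encoded independently, the two currency columns only receive identical codes when they contain exactly the same set of values, as they do here. A minimal sketch of a more defensive alternative, assuming we build one mapping from the union of both columns (the helper `build_shared_codes` and the toy values are hypothetical):

```python
def build_shared_codes(*columns):
    """Map every distinct value across the given columns to one integer code."""
    vocabulary = sorted(set().union(*columns))
    return {value: code for code, value in enumerate(vocabulary)}

# Toy columns whose value sets differ on purpose.
receiving = ["US Dollar", "Euro", "Yen"]
paying = ["US Dollar", "Bitcoin", "Euro"]

codes = build_shared_codes(receiving, paying)
receiving_codes = [codes[v] for v in receiving]
paying_codes = [codes[v] for v in paying]
print(codes)
print(receiving_codes, paying_codes)
```

With a shared mapping, 'US Dollar' gets the same code in both columns even when one column is missing some currencies.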


# Data Preprocessing
We first show the functions used in the PyG dataset; the dataset class and model training follow in the final sections.

In the data preprocessing, we perform the following transformations:

1. Normalize the Timestamp with min-max normalization.
2. Create a unique ID for each account by prefixing the account number with the bank code.
3. Create receiving_df with the receiving accounts, received amounts and currencies.
4. Create paying_df with the paying accounts, paid amounts and currencies.
5. Create a list of the currencies used across all transactions.
6. Encode 'Payment Format', 'Payment Currency' and 'Receiving Currency' as integer classes with sklearn LabelEncoder.
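
As a small worked example of steps 1 and 2, the sketch below normalizes three toy timestamps and builds one account ID. The values and the helpers `min_max_normalize` and `make_account_id` are hypothetical, and the guard for a constant column is an assumption of this sketch rather than something the notebook's code does.

```python
from datetime import datetime

def min_max_normalize(values):
    """Scale numbers linearly so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: map a constant column to zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def make_account_id(bank_code, account_number):
    """Prefix the account number with the bank code, e.g. 20 and '800104D70'."""
    return f"{bank_code}_{account_number}"

raw = ["2022/09/01 00:00", "2022/09/01 00:03", "2022/09/01 00:29"]
seconds = [datetime.strptime(t, "%Y/%m/%d %H:%M").timestamp() for t in raw]
scaled = min_max_normalize(seconds)
account_id = make_account_id(20, "800104D70")

print(scaled)
print(account_id)
```

Here the middle timestamp lies 3 minutes into a 29-minute window, so it scales to 3/29.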

In [15]:
def df_label_encoder(df, columns):
        le = preprocessing.LabelEncoder()
        for i in columns:
            df[i] = le.fit_transform(df[i].astype(str))
        return df

def preprocess(df):
        df = df_label_encoder(df,['Payment Format', 'Payment Currency', 'Receiving Currency'])
        df['Timestamp'] = pd.to_datetime(df['Timestamp'])
        df['Timestamp'] = df['Timestamp'].apply(lambda x: x.value)
        df['Timestamp'] = (df['Timestamp']-df['Timestamp'].min())/(df['Timestamp'].max()-df['Timestamp'].min())

        df['Account'] = df['From Bank'].astype(str) + '_' + df['Account']
        df['Account.1'] = df['To Bank'].astype(str) + '_' + df['Account.1']
        df = df.sort_values(by=['Account'])
        receiving_df = df[['Account.1', 'Amount Received', 'Receiving Currency']]
        paying_df = df[['Account', 'Amount Paid', 'Payment Currency']]
        receiving_df = receiving_df.rename({'Account.1': 'Account'}, axis=1)
        currency_ls = sorted(df['Receiving Currency'].unique())

        return df, receiving_df, paying_df, currency_ls

Let's have a look at the processed df

In [16]:
df, receiving_df, paying_df, currency_ls = preprocess(df = df)
print(df.head())

     Timestamp  From Bank      Account  To Bank        Account.1  \
470   0.103448          0  0_8000474C0        0      0_8000474C0   
471   0.551724          0  0_800047930        0      0_800047930   
295   0.482759          0  0_80006C140    27453  27453_80345B620   
474   0.413793          0  0_80006C140        0      0_80006C140   
473   0.000000          0  0_80006DFE0        0      0_80006DFE0   

     Amount Received  Receiving Currency  Amount Paid  Payment Currency  \
470            11.21                   8        11.21                 8   
471        169390.13                   8    169390.13                 8   
295         50000.00                   8     50000.00                 8   
474            23.48                   8        23.48                 8   
473            19.18                   8        19.18                 8   

     Payment Format  Is Laundering  
470               5              0  
471               5              0  
295               0          

paying df and receiving df:

In [17]:
print(receiving_df.head())
print(paying_df.head())

             Account  Amount Received  Receiving Currency
470      0_8000474C0            11.21                   8
471      0_800047930        169390.13                   8
295  27453_80345B620         50000.00                   8
474      0_80006C140            23.48                   8
473      0_80006DFE0            19.18                   8
         Account  Amount Paid  Payment Currency
470  0_8000474C0        11.21                 8
471  0_800047930    169390.13                 8
295  0_80006C140     50000.00                 8
474  0_80006C140        23.48                 8
473  0_80006DFE0        19.18                 8


currency_ls:

In [18]:
print(currency_ls)

[np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9), np.int64(10)]


We would like to extract all unique accounts from payers and receivers as the nodes of our graph. Each node includes the unique account ID, the bank code and the 'Is Laundering' label.
In this section, we treat both the payer and the receiver of an illicit transaction as suspicious accounts, and label both with 'Is Laundering' == 1.

In [19]:
def get_all_account(df):
        ldf = df[['Account', 'From Bank']]
        rdf = df[['Account.1', 'To Bank']]
        suspicious = df[df['Is Laundering']==1]
        s1 = suspicious[['Account', 'Is Laundering']]
        s2 = suspicious[['Account.1', 'Is Laundering']]
        s2 = s2.rename({'Account.1': 'Account'}, axis=1)
        suspicious = pd.concat([s1, s2], join='outer')
        suspicious = suspicious.drop_duplicates()

        ldf = ldf.rename({'From Bank': 'Bank'}, axis=1)
        rdf = rdf.rename({'Account.1': 'Account', 'To Bank': 'Bank'}, axis=1)
        df = pd.concat([ldf, rdf], join='outer')
        df = df.drop_duplicates()

        df['Is Laundering'] = 0
        df.set_index('Account', inplace=True)
        df.update(suspicious.set_index('Account'))
        df = df.reset_index()
        return df

Take a look at the account list:

In [20]:
accounts = get_all_account(df)
print(accounts.head())

       Account  Bank  Is Laundering
0  0_8000474C0     0              0
1  0_800047930     0              0
2  0_80006C140     0              0
3  0_80006DFE0     0              0
4  0_80006F150     0              0
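
The labeling rule can be illustrated without pandas. This is a minimal sketch on hypothetical toy transactions, assuming each transaction is a `(payer, receiver, is_laundering)` tuple; `label_accounts` is an illustrative helper, not the notebook's function.

```python
def label_accounts(transactions):
    """Return {account: 0 or 1}; 1 if the account appears in any flagged transaction."""
    labels = {}
    for payer, receiver, is_laundering in transactions:
        for account in (payer, receiver):
            # Once an account is flagged, later clean transactions do not reset it.
            labels[account] = max(labels.get(account, 0), is_laundering)
    return labels

toy = [
    ("20_A", "20_B", 0),
    ("20_B", "30_C", 1),  # illicit: both 20_B and 30_C become suspicious
    ("30_C", "30_D", 0),
]
labels = label_accounts(toy)
print(labels)
```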


# Node features
For node features, we aggregate the mean paid amount and the mean received amount per currency as new features of each node.

In [21]:
def paid_currency_aggregate(currency_ls, paying_df, accounts):
        for i in currency_ls:
            temp = paying_df[paying_df['Payment Currency'] == i]
            accounts['avg paid '+str(i)] = temp['Amount Paid'].groupby(temp['Account']).transform('mean')
        return accounts

def received_currency_aggregate(currency_ls, receiving_df, accounts):
    for i in currency_ls:
        temp = receiving_df[receiving_df['Receiving Currency'] == i]
        accounts['avg received '+str(i)] = temp['Amount Received'].groupby(temp['Account']).transform('mean')
    accounts = accounts.fillna(0)
    return accounts
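
One point worth checking in the two functions above: `transform('mean')` returns a Series indexed like the transaction rows, so assigning it to a column of `accounts` aligns on row labels rather than on the account ID. The sketch below contrasts that with a group-then-map lookup, assuming the intent is one mean per account. The toy frames and the name `mean_paid_by_account` are hypothetical.

```python
import pandas as pd

paying_df = pd.DataFrame({
    "Account": ["b", "a", "b"],
    "Amount Paid": [10.0, 4.0, 30.0],
})
accounts = pd.DataFrame({"Account": ["a", "b", "c"]})

# transform('mean') keeps the transaction index (0, 1, 2), not the account key.
misaligned = paying_df["Amount Paid"].groupby(paying_df["Account"]).transform("mean")

# One mean per account, indexed by account ID.
mean_paid_by_account = paying_df.groupby("Account")["Amount Paid"].mean()

# Look up each account's mean by key; accounts with no payments get 0.
accounts["avg paid"] = accounts["Account"].map(mean_paid_by_account).fillna(0.0)

print(misaligned.tolist())
print(accounts)
```

In the toy data, row 0 of `misaligned` holds the mean for account "b", which would be written onto account "a" if assigned by position in the index; the mapped column instead gives "a" its own mean.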

Now we can define the node attributes as the bank code plus the mean paid and received amounts per currency.

In [22]:
def get_node_attr(currency_ls, paying_df,receiving_df, accounts):
        node_df = paid_currency_aggregate(currency_ls, paying_df, accounts)
        node_df = received_currency_aggregate(currency_ls, receiving_df, node_df)
        node_label = torch.from_numpy(node_df['Is Laundering'].values).to(torch.float)
        node_df = node_df.drop(['Account', 'Is Laundering'], axis=1)
        node_df = df_label_encoder(node_df,['Bank'])
#         node_df = torch.from_numpy(node_df.values).to(torch.float)  # comment for visualization
        return node_df, node_label

Take a look at node_df:

In [23]:
node_df, node_label = get_node_attr(currency_ls, paying_df,receiving_df, accounts)
print(node_df.head())

   Bank  avg paid 0  avg paid 1  avg paid 2  avg paid 3  avg paid 4  \
0     0         0.0         0.0         0.0         0.0         0.0   
1     0         0.0         0.0         0.0         0.0         0.0   
2     0         0.0         0.0         0.0         0.0         0.0   
3     0         0.0         0.0         0.0         0.0         0.0   
4     0         0.0         0.0         0.0         0.0         0.0   

   avg paid 5  avg paid 6  avg paid 7    avg paid 8  avg paid 9  avg paid 10  \
0         0.0         0.0         0.0  6.794630e+03         0.0          0.0   
1         0.0         0.0         0.0  7.739290e+03         0.0          0.0   
2         0.0         0.0         0.0  9.439450e+02         0.0          0.0   
3         0.0         0.0         0.0  3.994511e+07         0.0          0.0   
4         0.0         0.0         0.0  3.994511e+07         0.0          0.0   

   avg received 0  avg received 1  avg received 2  avg received 3  \
0             0.0      

# Edge features
In terms of edge features, we treat each transaction as an edge.
For the edge index, we replace each account with its integer index and stack the result into a tensor of size [2, number of transactions].
For edge attributes, we use 'Timestamp', 'Amount Received', 'Receiving Currency', 'Amount Paid', 'Payment Currency' and 'Payment Format'.

In [24]:
def get_edge_df(accounts, df):
        accounts = accounts.reset_index(drop=True)
        accounts['ID'] = accounts.index
        mapping_dict = dict(zip(accounts['Account'], accounts['ID']))
        df['From'] = df['Account'].map(mapping_dict)
        df['To'] = df['Account.1'].map(mapping_dict)
        df = df.drop(['Account', 'Account.1', 'From Bank', 'To Bank'], axis=1)

        edge_index = torch.stack([torch.from_numpy(df['From'].values), torch.from_numpy(df['To'].values)], dim=0)

        df = df.drop(['Is Laundering', 'From', 'To'], axis=1)

#         edge_attr = torch.from_numpy(df.values).to(torch.float)  # comment for visualization

        edge_attr = df  # for visualization
        return edge_attr, edge_index

edge_attr:

In [25]:
edge_attr, edge_index = get_edge_df(accounts, df)
print(edge_attr.head())

     Timestamp  Amount Received  Receiving Currency  Amount Paid  \
470   0.103448            11.21                   8        11.21   
471   0.551724        169390.13                   8    169390.13   
295   0.482759         50000.00                   8     50000.00   
474   0.413793            23.48                   8        23.48   
473   0.000000            19.18                   8        19.18   

     Payment Currency  Payment Format  
470                 8               5  
471                 8               5  
295                 8               0  
474                 8               5  
473                 8               5  


edge_index:

In [26]:
print(edge_index)

tensor([[    0,     1,     2,  ..., 37365, 37366, 37367],
        [    0,     1, 22292,  ..., 37365, 37366, 37367]])
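
The layout of the edge index can be shown on a toy example. This is a minimal sketch using plain lists instead of tensors; the account names and the helper `build_edge_index` are hypothetical.

```python
def build_edge_index(accounts, transactions):
    """Return [sources, targets], two parallel lists of integer node IDs."""
    node_id = {account: i for i, account in enumerate(accounts)}
    sources = [node_id[payer] for payer, _ in transactions]
    targets = [node_id[receiver] for _, receiver in transactions]
    return [sources, targets]

accounts_toy = ["0_A", "0_B", "7_C"]
transactions_toy = [("0_A", "0_A"), ("0_B", "7_C"), ("7_C", "0_A")]

edge_index_toy = build_edge_index(accounts_toy, transactions_toy)
print(edge_index_toy)
```

Column k of the result describes transaction k: the first row holds the payer's node ID and the second row the receiver's, so a self-transfer appears as the same ID in both rows.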


# Final code
Below we will show the final code for model.py, train.py and dataset.py
# Model Architecture
In this section, we use a Graph Attention Network (GAT) as our backbone model.
The model consists of two GATConv layers followed by a linear layer with a sigmoid output for classification.

In [27]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch_geometric.transforms as T
from torch_geometric.nn import GATConv, Linear

class GAT(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, heads):
        super().__init__()
        self.conv1 = GATConv(in_channels, hidden_channels, heads, dropout=0.6)
        self.conv2 = GATConv(hidden_channels * heads, int(hidden_channels/4), heads=1, concat=False, dropout=0.6)
        self.lin = Linear(int(hidden_channels/4), out_channels)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x, edge_index, edge_attr):
        x = F.dropout(x, p=0.6, training=self.training)
        x = F.elu(self.conv1(x, edge_index, edge_attr))
        x = F.dropout(x, p=0.6, training=self.training)
        x = F.elu(self.conv2(x, edge_index, edge_attr))
        x = self.lin(x)
        x = self.sigmoid(x)

        return x
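
To follow how the feature width changes through this model, here is a small arithmetic sketch. It assumes the first GATConv concatenates its heads (the layer's default behaviour), and it assumes `in_channels=23` from the node feature table (bank code plus 11 paid and 11 received averages). The helper `gat_layer_widths` is hypothetical.

```python
def gat_layer_widths(in_channels, hidden_channels, out_channels, heads):
    """Feature width after each stage of the two-layer GAT plus linear head."""
    after_conv1 = hidden_channels * heads    # attention heads are concatenated
    after_conv2 = int(hidden_channels / 4)   # single head, concat=False
    after_lin = out_channels                 # one sigmoid score per node
    return [in_channels, after_conv1, after_conv2, after_lin]

widths = gat_layer_widths(in_channels=23, hidden_channels=16, out_channels=1, heads=8)
print(widths)
```

With `hidden_channels=16` and `heads=8`, as used in the training cell, each node goes from 23 features to 128, then 4, then a single probability.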

# PyG InMemoryDataset
Finally, we can build the dataset with the above functions

In [28]:
class AMLtoGraph(InMemoryDataset):

    def __init__(self, root: str, edge_window_size: int = 10,
                 transform: Optional[Callable] = None,
                 pre_transform: Optional[Callable] = None):
        self.edge_window_size = edge_window_size
        super().__init__(root, transform, pre_transform)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self) -> str:
        return 'HI-Medium_Trans.csv'

    @property
    def processed_file_names(self) -> str:
        return 'data.pt'

    @property
    def num_nodes(self) -> int:
        return self._data.edge_index.max().item() + 1

    def df_label_encoder(self, df, columns):
        le = preprocessing.LabelEncoder()
        for i in columns:
            df[i] = le.fit_transform(df[i].astype(str))
        return df


    def preprocess(self, df):
        df = self.df_label_encoder(df,['Payment Format', 'Payment Currency', 'Receiving Currency'])
        df['Timestamp'] = pd.to_datetime(df['Timestamp'])
        df['Timestamp'] = df['Timestamp'].apply(lambda x: x.value)
        df['Timestamp'] = (df['Timestamp']-df['Timestamp'].min())/(df['Timestamp'].max()-df['Timestamp'].min())

        df['Account'] = df['From Bank'].astype(str) + '_' + df['Account']
        df['Account.1'] = df['To Bank'].astype(str) + '_' + df['Account.1']
        df = df.sort_values(by=['Account'])
        receiving_df = df[['Account.1', 'Amount Received', 'Receiving Currency']]
        paying_df = df[['Account', 'Amount Paid', 'Payment Currency']]
        receiving_df = receiving_df.rename({'Account.1': 'Account'}, axis=1)
        currency_ls = sorted(df['Receiving Currency'].unique())

        return df, receiving_df, paying_df, currency_ls

    def get_all_account(self, df):
        ldf = df[['Account', 'From Bank']]
        rdf = df[['Account.1', 'To Bank']]
        suspicious = df[df['Is Laundering']==1]
        s1 = suspicious[['Account', 'Is Laundering']]
        s2 = suspicious[['Account.1', 'Is Laundering']]
        s2 = s2.rename({'Account.1': 'Account'}, axis=1)
        suspicious = pd.concat([s1, s2], join='outer')
        suspicious = suspicious.drop_duplicates()

        ldf = ldf.rename({'From Bank': 'Bank'}, axis=1)
        rdf = rdf.rename({'Account.1': 'Account', 'To Bank': 'Bank'}, axis=1)
        df = pd.concat([ldf, rdf], join='outer')
        df = df.drop_duplicates()

        df['Is Laundering'] = 0
        df.set_index('Account', inplace=True)
        df.update(suspicious.set_index('Account'))
        df = df.reset_index()
        return df

    def paid_currency_aggregate(self, currency_ls, paying_df, accounts):
        for i in currency_ls:
            temp = paying_df[paying_df['Payment Currency'] == i]
            # Mean paid amount per account, looked up by account ID
            mean_paid = temp.groupby('Account')['Amount Paid'].mean()
            accounts['avg paid '+str(i)] = accounts['Account'].map(mean_paid)
        return accounts

    def received_currency_aggregate(self, currency_ls, receiving_df, accounts):
        for i in currency_ls:
            temp = receiving_df[receiving_df['Receiving Currency'] == i]
            # Mean received amount per account, looked up by account ID
            mean_received = temp.groupby('Account')['Amount Received'].mean()
            accounts['avg received '+str(i)] = accounts['Account'].map(mean_received)
        # Accounts without transactions in a currency get 0
        accounts = accounts.fillna(0)
        return accounts

    def get_edge_df(self, accounts, df):
        accounts = accounts.reset_index(drop=True)
        accounts['ID'] = accounts.index
        mapping_dict = dict(zip(accounts['Account'], accounts['ID']))
        df['From'] = df['Account'].map(mapping_dict)
        df['To'] = df['Account.1'].map(mapping_dict)
        df = df.drop(['Account', 'Account.1', 'From Bank', 'To Bank'], axis=1)

        edge_index = torch.stack([torch.from_numpy(df['From'].values), torch.from_numpy(df['To'].values)], dim=0)

        df = df.drop(['Is Laundering', 'From', 'To'], axis=1)

        edge_attr = torch.from_numpy(df.values).to(torch.float)
        return edge_attr, edge_index

    def get_node_attr(self, currency_ls, paying_df,receiving_df, accounts):
        node_df = self.paid_currency_aggregate(currency_ls, paying_df, accounts)
        node_df = self.received_currency_aggregate(currency_ls, receiving_df, node_df)
        node_label = torch.from_numpy(node_df['Is Laundering'].values).to(torch.float)
        node_df = node_df.drop(['Account', 'Is Laundering'], axis=1)
        node_df = self.df_label_encoder(node_df,['Bank'])
        node_df = torch.from_numpy(node_df.values).to(torch.float)
        return node_df, node_label

    def process(self):
        df = pd.read_csv(self.raw_paths[0])
        df, receiving_df, paying_df, currency_ls = self.preprocess(df)
        accounts = self.get_all_account(df)
        node_attr, node_label = self.get_node_attr(currency_ls, paying_df,receiving_df, accounts)
        edge_attr, edge_index = self.get_edge_df(accounts, df)

        data = Data(x=node_attr,
                    edge_index=edge_index,
                    y=node_label,
                    edge_attr=edge_attr
                    )

        data_list = [data]
        if self.pre_filter is not None:
            data_list = [d for d in data_list if self.pre_filter(d)]

        if self.pre_transform is not None:
            data_list = [self.pre_transform(d) for d in data_list]

        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])

# Model Training
As we cannot create folders in Kaggle, please follow the instructions in https://github.com/issacchan26/AntiMoneyLaunderingDetectionWithGNN before you start training.

In [None]:
import torch
import torch_geometric.transforms as T
from torch_geometric.loader import NeighborLoader

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
import os
root_path = os.path.dirname(trans_path)
dataset = AMLtoGraph(root_path)
data = dataset[0]
epoch = 20

model = GAT(in_channels=data.num_features, hidden_channels=16, out_channels=1, heads=8)
model = model.to(device)
criterion = torch.nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.0001)

split = T.RandomNodeSplit(split='train_rest', num_val=0.1, num_test=0)
data = split(data)

train_loader = loader = NeighborLoader(
    data,
    num_neighbors=[30] * 2,
    batch_size=256,
    input_nodes=data.train_mask,
)

test_loader = loader = NeighborLoader(
    data,
    num_neighbors=[30] * 2,
    batch_size=256,
    input_nodes=data.val_mask,
)

for i in range(epoch):
    total_loss = 0
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        batch = batch.to(device)
        pred = model(batch.x, batch.edge_index, batch.edge_attr)
        ground_truth = batch.y
        loss = criterion(pred, ground_truth.unsqueeze(1))
        loss.backward()
        optimizer.step()
        total_loss += float(loss)
    # Report the loss and validation accuracy every 10 epochs
    if i % 10 == 0:
        print(f"Epoch: {i:03d}, Loss: {total_loss:.4f}")
        model.eval()
        acc = 0
        total = 0
        with torch.no_grad():
            for test_data in test_loader:
                test_data = test_data.to(device)
                pred = model(test_data.x, test_data.edge_index, test_data.edge_attr)
                ground_truth = test_data.y
                # Threshold the sigmoid output at 0.5 before comparing with the 0/1 labels
                pred_label = (pred > 0.5).float()
                correct = (pred_label == ground_truth.unsqueeze(1)).sum().item()
                total += len(ground_truth)
                acc += correct
            acc = acc/total
            print('accuracy:', acc)

Processing...


# Future Work
In this notebook, we performed node classification with GAT, and the resulting accuracy looks satisfactory.
However, this may be due to the highly imbalanced classes in the dataset. We suggest balancing classes 1 and 0 during data preprocessing; the accuracy is expected to drop slightly after balancing. We will keep exploring whether other models give better performance, such as traditional regression or classification models.
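
To illustrate why accuracy alone can mislead on imbalanced labels, here is a minimal sketch with made-up numbers: 2 laundering accounts out of 100 and a hypothetical model that scores every account as low risk. The helper `accuracy_and_recall` and the 0.5 threshold are assumptions of this sketch.

```python
def accuracy_and_recall(y_true, probs, threshold=0.5):
    """Threshold probabilities, then compute accuracy and recall on class 1."""
    y_pred = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall

y_true = [1] * 2 + [0] * 98   # 2% positive class
probs = [0.1] * 100           # model that never flags anything

accuracy, recall = accuracy_and_recall(y_true, probs)
print(accuracy, recall)
```

The accuracy is high even though no laundering account is found, so reporting recall (or precision and recall together) alongside accuracy gives a fairer picture.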

# Reference
Some of the feature engineering in this repo is based on the papers below, which we highly recommend reading:

1. Weber, M., Domeniconi, G., Chen, J., Weidele, D. K. I., Bellei, C., Robinson, T., & Leiserson, C. E. (2019). Anti-money laundering in bitcoin: Experimenting with graph convolutional networks for financial forensics. arXiv preprint arXiv:1908.02591.
2. Johannessen, F., & Jullum, M. (2023). Finding Money Launderers Using Heterogeneous Graph Neural Networks. arXiv preprint arXiv:2307.13499.