# Final project for Projects in ML and AI

For this final project, I am going to be using the elliptic dataset on Bitcoin trasaction data. The dataset is downloaded from this link: https://www.kaggle.com/datasets/ellipticco/elliptic-data-set


The citation for the dataset:

```
@article{weber2019anti,
  title={Anti-money laundering in bitcoin: Experimenting with graph convolutional networks for financial forensics},
  author={Weber, Mark and Domeniconi, Giacomo and Chen, Jie and Weidele, Daniel Karl I and Bellei, Claudio and Robinson, Tom and Leiserson, Charles E},
  journal={arXiv preprint arXiv:1908.02591},
  year={2019}
}

```

## Installing dependencies

In [1]:
# same deal for gdrive and kaggle
from google.colab import drive
drive.mount('/content/drive')

!rm -r ~/.kaggle
!mkdir ~/.kaggle
!cp /content/drive/MyDrive/.kaggle/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!pip install -q kaggle

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# install dataset and unzip

!rm -r dataset
!kaggle datasets download -d ellipticco/elliptic-data-set
!mkdir dataset
!unzip elliptic-data-set.zip -d dataset

elliptic-data-set.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  elliptic-data-set.zip
  inflating: dataset/elliptic_bitcoin_dataset/elliptic_txs_classes.csv  
  inflating: dataset/elliptic_bitcoin_dataset/elliptic_txs_edgelist.csv  
  inflating: dataset/elliptic_bitcoin_dataset/elliptic_txs_features.csv  


In [3]:
import torch
TORCH_VERSION = torch.__version__
!pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-{TORCH_VERSION}.html

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://data.pyg.org/whl/torch-1.12.1+cu113.html


## Let's first examine the data

In [4]:
import math
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

import matplotlib.pyplot as plt
import seaborn as sns

import torch
from torch import nn
from torch_geometric.nn import GATv2Conv,GATConv,GCNConv,Sequential
import torch.nn.functional as F

In [5]:
classes_df = pd.read_csv('dataset/elliptic_bitcoin_dataset/elliptic_txs_classes.csv')
edges_df = pd.read_csv('dataset/elliptic_bitcoin_dataset/elliptic_txs_edgelist.csv',)
# It looks like features are missing column headers, so we will reshape this
features_df = pd.read_csv('dataset/elliptic_bitcoin_dataset/elliptic_txs_features.csv',header=None,)

3 types of classes.. I will encode these into ints. 

Unknown = 0, 1 = 1, 2 = 2

1 is 'illicit' 2 is 'licit'

In [6]:
classes_df.head()

Unnamed: 0,txId,class
0,230425980,unknown
1,5530458,unknown
2,232022460,unknown
3,232438397,2
4,230460314,unknown


In [7]:
classes_df['class'] = classes_df['class'].replace('unknown',0).apply(int)

In [8]:
classes_df['class'].value_counts()

0    157205
2     42019
1      4545
Name: class, dtype: int64

The dataset description says they cannot provide the column names. But they mentioned that the data is captured over 49 sequences, so it looks like second column is the time sereis. We will treat the rest as vector features of the node, i.e. the transaction

In [9]:
features_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,157,158,159,160,161,162,163,164,165,166
0,230425980,1,-0.171469,-0.184668,-1.201369,-0.121970,-0.043875,-0.113002,-0.061584,-0.162097,...,-0.562153,-0.600999,1.461330,1.461369,0.018279,-0.087490,-0.131155,-0.097524,-0.120613,-0.119792
1,5530458,1,-0.171484,-0.184668,-1.201369,-0.121970,-0.043875,-0.113002,-0.061584,-0.162112,...,0.947382,0.673103,-0.979074,-0.978556,0.018279,-0.087490,-0.131155,-0.097524,-0.120613,-0.119792
2,232022460,1,-0.172107,-0.184668,-1.201369,-0.121970,-0.043875,-0.113002,-0.061584,-0.162749,...,0.670883,0.439728,-0.979074,-0.978556,-0.098889,-0.106715,-0.131155,-0.183671,-0.120613,-0.119792
3,232438397,1,0.163054,1.963790,-0.646376,12.409294,-0.063725,9.782742,12.414558,-0.163645,...,-0.577099,-0.613614,0.241128,0.241406,1.072793,0.085530,-0.131155,0.677799,-0.120613,-0.119792
4,230460314,1,1.011523,-0.081127,-1.201369,1.153668,0.333276,1.312656,-0.061584,-0.163523,...,-0.511871,-0.400422,0.517257,0.579382,0.018279,0.277775,0.326394,1.293750,0.178136,0.179117
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
203764,173077460,49,-0.145771,-0.163752,0.463609,-0.121970,-0.043875,-0.113002,-0.061584,-0.135803,...,-0.577099,-0.613614,0.241128,0.241406,0.018279,-0.087490,-0.131155,-0.097524,-0.120613,-0.119792
203765,158577750,49,-0.165920,-0.123607,1.018602,-0.121970,-0.043875,-0.113002,-0.061584,-0.156418,...,0.162722,0.010822,1.461330,1.461369,-0.098889,-0.087490,-0.084674,-0.140597,-1.760926,-1.760984
203766,158375402,49,-0.172014,-0.078182,1.018602,0.028105,-0.043875,0.054722,-0.061584,-0.163626,...,1.261246,1.985050,1.461330,1.461369,0.018279,-0.087490,-0.131155,-0.097524,-0.120613,-0.119792
203767,158654197,49,-0.172842,-0.176622,1.018602,-0.121970,-0.043875,-0.113002,-0.061584,-0.163501,...,-0.397749,-0.411776,1.461330,1.461369,-0.098889,-0.087490,-0.084674,-0.140597,1.519700,1.521399


In [10]:
timestamp = features_df[[0,1]]
timestamp.columns=['txId','timeStamp']

The txIds are nasty and we want clean indices for the GNN edgelist. So we will map the index to the id

In [11]:
INDEX_TO_ID = dict(features_df[0])
index2id = lambda x: INDEX_TO_ID[int(x)]
ID_TO_INDEX = {v:k for k,v in INDEX_TO_ID.items()}
id2index = lambda x: ID_TO_INDEX[int(x)]

Now looking at edge list

lots of nodes!

Since we have 49 timestamps, we will create 49 mini datasets of transactions. 

In [12]:
# idxs = torch.arange(len(features_df))
train_txs,test_idxs,train_y,_ = train_test_split(features_df[0],features_df[1],test_size = 0.1)
train_txs,valid_txs,_,_ = train_test_split(train_txs,train_y,test_size = 1/9)

mini_data = [
    edges_df[edges_df.txId1.isin(timestamp[timestamp.timeStamp == i+1].txId)]
    for i in range(timestamp.timeStamp.max())
]

In [13]:
for i,m in enumerate(mini_data): print(i,'--',len(m))

0 -- 9164
1 -- 5241
2 -- 8316
3 -- 8180
4 -- 8623
5 -- 5242
6 -- 7253
7 -- 5186
8 -- 5939
9 -- 8588
10 -- 4656
11 -- 2213
12 -- 4827
13 -- 2078
14 -- 3823
15 -- 3120
16 -- 3650
17 -- 2115
18 -- 3838
19 -- 4755
20 -- 3959
21 -- 7014
22 -- 4584
23 -- 5124
24 -- 2619
25 -- 2690
26 -- 1168
27 -- 1717
28 -- 4541
29 -- 2561
30 -- 3049
31 -- 4952
32 -- 3366
33 -- 2692
34 -- 6351
35 -- 7813
36 -- 3849
37 -- 3094
38 -- 2914
39 -- 5246
40 -- 6093
41 -- 8493
42 -- 5950
43 -- 5551
44 -- 6673
45 -- 3866
46 -- 5748
47 -- 3284
48 -- 2587


A few thousand edges for each slice, not bad

In [14]:
mini_data[0]

Unnamed: 0,txId1,txId2
0,230425980,5530458
1,232022460,232438397
2,230460314,230459870
3,230333930,230595899
4,232013274,232029206
...,...,...
9159,230437620,230439288
9160,203465969,5986851
9161,232051667,232051672
9162,232364495,34300577


## Define GNN classes

And also split up the data!

In [15]:
from torch import tensor
# type(mini_data[0].txId1.tolist()[0])

edges = [
    tensor(
        np.stack([
            md.txId1.apply(id2index).values,
            md.txId2.apply(id2index).values
        ])
    ).long().cuda()
    for md in mini_data
]

features = tensor(features_df.values[:,1:]).float().cuda()
labels = tensor(features_df[0].values).long().cuda()

In [16]:
idxs = torch.arange(len(labels)dtype=torch.unit8)
train_idxs,test_idxs,train_y,_ = train_test_split(idxs,labels,test_size = 0.1)
train_idxs,val_idxs,_,_ = train_test_split(train_idxs,train_y,test_size = 1/9)

# train_mask,val_mask,test_mask = torch.zeros(len(labels),dtype=bool).cuda(),torch.zeros(len(labels),dtype=bool).cuda(),torch.zeros(len(labels),dtype=bool).cuda()
# train_mask[train_idxs] = True
# val_mask[val_idxs] = True
# test_mask[test_idxs] = True

In [18]:
len(train_idxs),len(val_idxs),len(test_idxs)

(163015, 20377, 20377)

In [19]:

class GNN(nn.Module):
    def __init__(
        self,
        graph_layer_type = 'GAT',
        output_layer_type = 'lstm',
        n_layers=5, 
        hidden_channels=100, 
        heads=10, 
        dropout = 0.1,
        n_features=features.size(1),
        n_classes = 3,
    ):
        super().__init__()

        self.input = nn.Linear(
            n_features,
            hidden_channels
        )

        if graph_layer_type == 'GATv2':
            self.graph_layer = Sequential('x, edge_index',[
                # GAT comes with dropout and relu!
                (GATv2Conv(hidden_channels,hidden_channels//heads,heads,dropout=dropout), 'x, edge_index -> x')
                for _ in range(n_layers)
            ])
        if graph_layer_type == 'GAT':
            self.graph_layer = Sequential('x, edge_index',[
                # GAT comes with dropout and relu!
                (GATConv(hidden_channels,hidden_channels//heads,heads,dropout=dropout), 'x, edge_index -> x')
                for _ in range(n_layers)
            ])
        elif graph_layer_type == 'GCN':
            self.graph_layer = Sequential('x, edge_index',[
                Sequential(
                    (GCNConv(hidden_channels,hidden_channels), 'x, edge_index -> x'),
                    nn.Dropout(dropout),
                    nn.ReLU(True)
                )
                for _ in range(n_layers)
            ])

        self.output = nn.Linear(
                hidden_channels,n_classes
            )
        
    def forward(self,x,edge_index):
        return self.output(
                self.graph_layer(
                    self.input(x),
                    edge_index
                    )
                )

In [20]:
from torch.optim import Adam


def train(model,criterion,optimizer):
    losses = []
    for e in edges:
        model.train()
        optimizer.zero_grad()
        out = model(features,e)
        loss = criterion(out[train_idxs],labels[train_idxs])
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return sum(losses)


In [21]:
from tqdm.auto import tqdm
model = GNN().cuda()
criterion = nn.CrossEntropyLoss().cuda()  # Define loss criterion.
optimizer = Adam(model.parameters(), lr=0.01, weight_decay=5e-4) 

for _ in tqdm(range(100)):
    print(train(model,criterion,optimizer))


  0%|          | 0/100 [00:00<?, ?it/s]

RuntimeError: ignored

In [22]:
ll =  criterion(out[train_mask],labels[train_mask])

In [23]:
ll.backward()

RuntimeError: ignored

## Time to run

In [None]:
%load_ext tensorboard