### Notebook Overview: Model Step 3 – Subnetwork Development

This notebook constructs illicit-only subnetworks from the classified Bitcoin transaction dataset. It represents the third stage of the AML detection pipeline and operationalises the subgraph extraction methodology described in the thesis. The goal is to isolate self-contained clusters of illicit activity that can be further analysed and ranked.  

**Purpose**  
The notebook transforms the classified transaction data into directed graphs that reflect the movement of Bitcoin between transactions. Starting from each transaction predicted as illicit (a seed), a forward breadth-first search (BFS) identifies every transaction reachable downstream of that seed, producing a self-contained subnetwork that captures the flow of funds within that illicit cluster.  

**Key Steps**  
- Import transaction predictions and edgelist data from BigQuery.  
- Filter to include only transactions classified as illicit.  
- Use a directed graph representation (txn–txn) to model Bitcoin flows between transactions.  
- Perform a forward BFS from each illicit transaction to trace all connected downstream nodes.  
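- Deduplicate subnetworks whose node set is a subset of another, then merge subnetworks that share at least one node.  
- Assign unique subnetwork IDs and compute basic graph properties (e.g., node count, edge count, depth, seed count).  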
- Export subnetwork tables to BigQuery for ranking and visualisation in the next stage.  

This process produces a collection of reproducible subnetworks that represent individual clusters of illicit Bitcoin activity. By isolating and structuring these networks, subsequent ranking and visualisation can focus on the most influential or financially significant regions of the transaction graph.  
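
The forward BFS expansion described above can be sketched as follows. This is a minimal illustration on a toy edge list, not the code in `txn_subnetworks.py`: the function name `forward_bfs`, the `allowed` parameter, and the choice to restrict traversal to illicit nodes are assumptions made for the example (the restriction is inferred from the "illicit-only" description).

```python
from collections import defaultdict, deque


def forward_bfs(edges, seed, allowed):
    """Return {txn_id: hop} for every allowed node reachable downstream of seed.

    Hypothetical helper for illustration only.
    """
    # Directed adjacency list (src -> dst), keeping only edges whose
    # endpoints are both in the allowed (illicit) set.
    adjacency = defaultdict(list)
    for src, dst in edges:
        if src in allowed and dst in allowed:
            adjacency[src].append(dst)

    hops = {seed: 0}          # hop distance from the seed
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in hops:            # visit each node once
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    return hops


# Toy graph: 1 -> 2 -> 3, and 1 -> 4 -> 5 where node 4 is licit.
toy_edges = [(1, 2), (2, 3), (1, 4), (4, 5)]
illicit = {1, 2, 3, 5}

result = forward_bfs(toy_edges, seed=1, allowed=illicit)
print(result)  # {1: 0, 2: 1, 3: 2}
```

Node 5 is illicit but can only be reached through the licit node 4, so it is excluded from the subnetwork seeded at node 1 and would form its own single-node subnetwork.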

**Context and Attribution**  
This notebook forms part of the technical work developed in support of the research thesis titled:  
_“Detection, Ranking and Visualisation of Money Laundering Networks on the Bitcoin Blockchain”_  
by Jennifer Payne (RMIT University).  

GitHub Repository: [https://github.com/majorpayne-2021/rmit_master_thesis](https://github.com/majorpayne-2021/rmit_master_thesis)  
Elliptic++ Dataset Source: [https://github.com/git-disl/EllipticPlusPlus](https://github.com/git-disl/EllipticPlusPlus)


In [1]:
# Data cleaning and manipulation
import pandas as pd
import numpy as np
import math
import time

# GCP libraries
from pandas_gbq import to_gbq # write pandas df to a GCP BigQuery table
import gcsfs
import importlib.util
import os
import inspect

# Set up display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.float_format', lambda x: '%.4f' % x)

# Suppress FutureWarnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


--------------
##### Read in Txn Subnetwork Classes & Modules
--------------


In [2]:
# Define bucket and file path
bucket_name = "thesis_classes"
file_name = "txn_subnetworks.py"
gcs_path = f"gs://{bucket_name}/{file_name}"

# Initialize GCS filesystem
fs = gcsfs.GCSFileSystem()

# Local filename to save the script temporarily
local_file = f"/tmp/{file_name}"

# Download the file from GCS to local storage
fs.get(gcs_path, local_file)

# Dynamically import the module
module_name = "txn_subnetworks"
spec = importlib.util.spec_from_file_location(module_name, local_file)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

In [3]:
# Use inspect to list the classes visible in the module
# (this also picks up imported classes such as Line2D and defaultdict)
classes = [name for name, obj in inspect.getmembers(module, inspect.isclass)]

# Print results
print("Classes in module:")
for cls in classes:
    print(f"  - {cls}")


Classes in module:
  - Line2D
  - build_txn_subnetwork
  - combinations
  - defaultdict
  - reporting
  - visualise_subnetwork


In [4]:
# Instantiate the classes
build_network = module.build_txn_subnetwork()
build_report = module.reporting()

--------------
##### Read in Datasets
--------------


In [5]:
%%bigquery df_txn_edgelist
-- Get txn edgelist (the cell magic must be the first line of the cell)
select * from `extreme-torch-467913-m6.txn.txn_edgelist`;


In [6]:
df_txn_edgelist.head(1)

Unnamed: 0,txId1,txId2
0,36186840,1076


In [7]:
%%bigquery df_txn_pred
-- Get txn predictions (the cell magic must be the first line of the cell)
select * from `extreme-torch-467913-m6.txn.txn_pred_final`;


In [8]:
df_txn_pred.head(1)

Unnamed: 0,txId,Time step,class,class_label,pred_model,pred_model_threshold,pred_proba,pred_class,pred_class_label,final_class,final_class_label
0,230550390,1,2,Licit,Random Forest,0.4,0.0932,0,Licit,0,Licit


In [9]:
list_illicit_seeds = df_txn_pred[(df_txn_pred['final_class_label'] == 'Illicit')]['txId'].tolist()

In [10]:
len(list_illicit_seeds)

28542

--------------
##### Build Subnetworks for all Illicit Nodes
--------------


Build subnetwork

In [11]:
start_time = time.time()

# 1) Build naive (possibly overlapping) subnetworks per seed
nodes_all, edges_all = build_network.build_subnetworks_naive(
    edges_df=df_txn_edgelist,
    labels_df=df_txn_pred,
    seed_txns=list_illicit_seeds,
    progress=True,
    progress_every=1000,   # summary every 1000 seeds
)

end_time = time.time()
elapsed_minutes = (end_time - start_time) / 60
print(f"Function took {elapsed_minutes:.2f} minutes")

Commencing subnetwork development from seed nodes. seeds=28542  update_per_batch=1000
[  1000/28542] single node network: 844  |  multi-node network: 156  (cumulative single: 844, multi: 156)  largest_nodes=6 (seed=232359167)
[  2000/28542] single node network: 782  |  multi-node network: 218  (cumulative single: 1626, multi: 374)  largest_nodes=10 (seed=191191261)
[  3000/28542] single node network: 757  |  multi-node network: 243  (cumulative single: 2383, multi: 617)  largest_nodes=10 (seed=191191261)
[  4000/28542] single node network: 791  |  multi-node network: 209  (cumulative single: 3174, multi: 826)  largest_nodes=10 (seed=191191261)
[  5000/28542] single node network: 748  |  multi-node network: 252  (cumulative single: 3922, multi: 1078)  largest_nodes=13 (seed=235427528)
[  6000/28542] single node network: 764  |  multi-node network: 236  (cumulative single: 4686, multi: 1314)  largest_nodes=13 (seed=235427528)
[  7000/28542] single node network: 701  |  multi-node network

Deduplicate subnetworks that are a subset of a larger subnetwork

In [12]:
# 2) Deduplicate: drop any subnetwork whose node set is a subset of another
nodes_dedup, edges_dedup, dedup_report, id_map = build_network.deduplicate_subnetworks_by_node_subset(
    nodes_all, edges_all, relabel=False, progress=True
)


Dedup by node-subset: 28542 → 21020 subnetworks kept (7522 removed).
  - drop 44 (seed=91787015) ⊆ kept 131 (seed=230586841)
  - drop 45 (seed=232375025) ⊆ kept 148 (seed=230451709)
  - drop 48 (seed=231994234) ⊆ kept 92 (seed=231994227)
  - drop 51 (seed=233997270) ⊆ kept 121 (seed=232359167)
  - drop 54 (seed=2718109) ⊆ kept 69 (seed=230586765)
  - drop 59 (seed=232345695) ⊆ kept 89 (seed=232364856)
  - drop 62 (seed=230330969) ⊆ kept 87 (seed=91224338)
  - drop 63 (seed=231994241) ⊆ kept 92 (seed=231994227)
  - drop 67 (seed=233997266) ⊆ kept 121 (seed=232359167)
  - drop 72 (seed=230582115) ⊆ kept 60 (seed=231994199)
  - drop 75 (seed=37232608) ⊆ kept 87 (seed=91224338)
  - drop 76 (seed=60492142) ⊆ kept 124 (seed=231994189)
  - drop 77 (seed=230586846) ⊆ kept 131 (seed=230586841)
  - drop 85 (seed=231994103) ⊆ kept 79 (seed=231994093)
  - drop 88 (seed=230540205) ⊆ kept 121 (seed=232359167)
  - drop 90 (seed=62111535) ⊆ kept 89 (seed=232364856)
  - drop 91 (seed=230402461) ⊆ kept
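
The subset rule used in this step can be illustrated with plain Python sets. This is a sketch under assumptions, not the implementation of `deduplicate_subnetworks_by_node_subset`: the function name `drop_subset_subnetworks`, the input shape (a dict of subnetwork id to node set) and the tie-break for identical node sets (keep the smaller id) are all hypothetical.

```python
def drop_subset_subnetworks(node_sets):
    """Drop every subnetwork whose node set is contained in a kept one.

    node_sets: dict mapping subnetwork_id -> set of txn ids.
    Returns (kept_ids, dropped) where dropped maps a removed id to the
    id of the kept subnetwork that contains it.
    """
    # Visit larger subnetworks first so that any superset of the current
    # set has already been considered; ties are broken by smaller id.
    order = sorted(node_sets, key=lambda sid: (-len(node_sets[sid]), sid))

    kept = []
    dropped = {}
    for sid in order:
        nodes = node_sets[sid]
        # `<=` on sets is the subset test (equal sets also match).
        container = next((k for k in kept if nodes <= node_sets[k]), None)
        if container is None:
            kept.append(sid)
        else:
            dropped[sid] = container
    return sorted(kept), dropped


toy_sets = {
    0: {1, 2, 3},
    1: {2, 3},      # subset of 0
    2: {3},         # subset of 0
    3: {7},         # unrelated single node
    4: {1, 2, 3},   # identical to 0
}

kept_ids, dropped_map = drop_subset_subnetworks(toy_sets)
print(kept_ids)     # [0, 3]
print(dropped_map)  # {4: 0, 1: 0, 2: 0}
```

The pairwise comparison is quadratic in the number of subnetworks in the worst case, which is acceptable for a sketch but may need an index (for example, node to subnetwork ids) at the scale of tens of thousands of seeds.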

Merge networks that have overlapping nodes

In [13]:
txn_expanded, edges_expanded, txn_final, edges_final = build_network.merge_subnetworks_by_node_overlap(
    nodes_dedup, edges_dedup,
    min_shared_nodes=1,
    progress=True,
    print_unmerged=False,  # <- only merged groups printed
    collapse=True
)

Merge by node-overlap (≥1): 21020 → 17864 merged subnetworks.
  group 0: subnetwork_ids=[0, 1, 2, 32, 71, 73, 142] seeds=[3084073, 9907558, 39915271, 230658165, 230658519, 230659124, 230659444]
  group 310: subnetwork_ids=[310, 325] seeds=[310414790, 310415405]
  group 327: subnetwork_ids=[327, 386] seeds=[309930098, 310071274]
  group 623: subnetwork_ids=[623, 665] seeds=[86595203, 86720131]
  group 641: subnetwork_ids=[641, 651] seeds=[86714561, 86966335]
  group 724: subnetwork_ids=[724, 728, 749] seeds=[224655093, 224656953, 225054971]
  group 729: subnetwork_ids=[729, 765] seeds=[225054523, 225055156]
  group 787: subnetwork_ids=[787, 793] seeds=[91306418, 225715271]
  group 1014: subnetwork_ids=[1014, 1029, 1103, 1164, 1253, 1394, 1408, 1409, 1413, 1574, 1580] seeds=[94003168, 94123979, 94299895, 94370759, 94370914, 94370921, 94370929, 94370938, 94371207, 94371214, 94371216]
  group 1019: subnetwork_ids=[1019, 1292, 1347, 1403, 1500, 1502] seeds=[94051990, 94157300, 94184658, 941
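
Merging by node overlap with `min_shared_nodes=1` amounts to finding connected groups of subnetworks, where two subnetworks are linked if they share at least one transaction. One common way to do this is a union-find (disjoint-set) structure, sketched below. The function name `merge_by_node_overlap`, the input shape and the choice of the smallest member id as the group id are assumptions for illustration; the last of these merely matches the pattern visible in the printed groups.

```python
def merge_by_node_overlap(node_sets):
    """Group subnetworks that share at least one node.

    node_sets: dict mapping subnetwork_id -> set of txn ids.
    Returns a dict mapping subnetwork_id -> merged group id, where the
    group id is the smallest subnetwork id in the group.
    """
    parent = {sid: sid for sid in node_sets}

    def find(x):
        # Walk to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Attach the larger root to the smaller so the root of a
            # group is always its smallest member id.
            lo, hi = (ra, rb) if ra < rb else (rb, ra)
            parent[hi] = lo

    first_owner = {}  # txn id -> first subnetwork seen containing it
    for sid in sorted(node_sets):
        for node in node_sets[sid]:
            if node in first_owner:
                union(sid, first_owner[node])
            else:
                first_owner[node] = sid

    return {sid: find(sid) for sid in node_sets}


toy_sets = {
    0: {1, 2},
    1: {2, 3},   # shares node 2 with subnetwork 0
    2: {9},      # shares nothing
    3: {3, 4},   # shares node 3 with subnetwork 1
}

groups = merge_by_node_overlap(toy_sets)
print(groups)  # {0: 0, 1: 0, 2: 2, 3: 0}
```

Subnetworks 0 and 3 share no node directly but end up in the same group through subnetwork 1, which is the transitive behaviour a merge of this kind is expected to have.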

In [14]:
print(txn_expanded.shape)
txn_expanded.head(1)

(32910, 5)


Unnamed: 0,txn_id,hop,subnetwork_id,seed_txn,merged_subnetwork_id
0,3084073,0,0,3084073,0


In [15]:
print(txn_final.shape)
print(txn_final['merged_subnetwork_id'].nunique())
txn_final.head(1)

(28542, 4)
17864


Unnamed: 0,merged_subnetwork_id,txn_id,min_hop,seeds_in_group
0,0,3084073,0,"[3084073, 9907558, 39915271, 230658165, 230658..."


In [16]:
print(edges_expanded.shape)
edges_expanded.head(1)

(12063, 7)


Unnamed: 0,src_txn_id,dst_txn_id,src_txn_hop,dst_txn_hop,subnetwork_id,seed_txn,merged_subnetwork_id
0,3084073,230658142,0,1,0,3084073,0


In [17]:
print(edges_final.shape)
print(edges_final['merged_subnetwork_id'].nunique()) # <-- Only 3254 merged subnetworks contain at least 2 nodes (i.e. at least 1 edge).
edges_final.head(1)

(10877, 5)
3254


Unnamed: 0,merged_subnetwork_id,src_txn_id,dst_txn_id,min_src_hop,min_dst_hop
0,0,3084073,230658142,0,1


Build network summary table

In [18]:
summary_final = build_report.summarise_subnetworks(
    txn_expanded, edges_expanded,
    id_col="merged_subnetwork_id",   # <- merged group id from the overlap step
    sort_by="size"
)

In [19]:
print(len(summary_final))
summary_final.head()

17864


Unnamed: 0,merged_subnetwork_id,txn_ids,node_count,edge_count,depth,seeds,seed_count,linked_txn_count
0,9719,"[138779422, 274995025, 288474188, 306817114, 3...",186,201,9,"[306817114, 307096649, 308031465, 308871228, 3...",109,77
1,15775,"[10000476, 15254893, 17763829, 17942794, 21627...",183,182,182,[46085970],1,182
2,27202,"[3093780, 4377108, 7314214, 9025948, 15410565,...",157,206,104,"[218055898, 218056365]",2,155
3,11723,"[76314770, 153388422, 154828129, 157769334, 16...",126,128,5,"[76314770, 153388422, 154828129, 157769334, 16...",89,37
4,13075,"[37925917, 84313895, 84316588, 84321107, 84321...",121,154,6,"[84313895, 84316588, 84321107, 84321251, 84321...",67,54
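
For readers without access to `summarise_subnetworks`, the summary columns can be approximated with a pandas `groupby`. The sketch below uses toy data with the same column names as `txn_expanded` and `edges_expanded`. It assumes that `depth` is the maximum hop in the group and that `linked_txn_count` equals `node_count - seed_count`; the second assumption is consistent with the five rows displayed above, but neither is confirmed by the source.

```python
import pandas as pd

# Toy "expanded" tables: a node or edge may appear once per original
# subnetwork, so duplicates within a merged group are expected.
txn_toy = pd.DataFrame({
    "merged_subnetwork_id": [0, 0, 0, 0, 1],
    "txn_id":               [10, 11, 12, 11, 20],
    "hop":                  [0, 1, 2, 0, 0],
    "seed_txn":             [10, 10, 10, 11, 20],
})
edges_toy = pd.DataFrame({
    "merged_subnetwork_id": [0, 0, 0],
    "src_txn_id":           [10, 11, 11],
    "dst_txn_id":           [11, 12, 12],
})

# Node-level properties per merged subnetwork.
nodes = txn_toy.groupby("merged_subnetwork_id").agg(
    node_count=("txn_id", "nunique"),
    depth=("hop", "max"),              # assumed definition of depth
    seed_count=("seed_txn", "nunique"),
)

# Count distinct directed edges per merged subnetwork.
edge_count = (
    edges_toy
    .drop_duplicates(["merged_subnetwork_id", "src_txn_id", "dst_txn_id"])
    .groupby("merged_subnetwork_id")
    .size()
    .rename("edge_count")
)

summary = (
    nodes.join(edge_count)
    .fillna({"edge_count": 0})         # single-node groups have no edges
    .astype({"edge_count": int})
    .sort_values("node_count", ascending=False)
    .reset_index()
)
# Assumed relationship between the summary columns.
summary["linked_txn_count"] = summary["node_count"] - summary["seed_count"]

print(summary)
```

In the toy data, merged subnetwork 0 has three distinct nodes, two distinct edges, a depth of 2 and two seeds, while subnetwork 1 is a single seed with no edges.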


Check the distribution of subnetworks by node count in the dataset

In [20]:
# 1) Pull the node counts as integers
s = pd.to_numeric(summary_final["node_count"], errors="coerce").dropna().astype(int)
n = int(s.size)

# 2) Distribution table (ascending by node_count)
dist = (
    s.value_counts()
     .sort_index()
     .rename_axis("node_count")
     .reset_index(name="count")
)
dist["pct_population"] = (dist["count"] / n * 100).round(2)
dist["cum_count"] = dist["count"].cumsum()
dist["cum_pct_population"] = (dist["cum_count"] / n * 100).round(2)

dist

# About 82% of the merged subnetworks (14,610 of 17,864) consist of a single illicit node.

Unnamed: 0,node_count,count,pct_population,cum_count,cum_pct_population
0,1,14610,81.78,14610,81.78
1,2,2053,11.49,16663,93.28
2,3,488,2.73,17151,96.01
3,4,214,1.2,17365,97.21
4,5,120,0.67,17485,97.88
5,6,68,0.38,17553,98.26
6,7,44,0.25,17597,98.51
7,8,43,0.24,17640,98.75
8,9,26,0.15,17666,98.89
9,10,23,0.13,17689,99.02


--------------
##### Export Subnetwork Tables to BigQuery
--------------


In [21]:
# Define your project ID
project_id = 'extreme-torch-467913-m6'

In [22]:
# Save DataFrame to BigQuery
to_gbq(dataframe = txn_expanded, destination_table = 'networks.network_txn_expanded', project_id=project_id, if_exists='replace')
to_gbq(dataframe = txn_final, destination_table = 'networks.network_txn_final', project_id=project_id, if_exists='replace')
to_gbq(dataframe = edges_expanded, destination_table = 'networks.network_edges_expanded', project_id=project_id, if_exists='replace')
to_gbq(dataframe = edges_final, destination_table = 'networks.network_edges_final', project_id=project_id, if_exists='replace')
to_gbq(dataframe = summary_final, destination_table = 'networks.network_summary', project_id=project_id, if_exists='replace')



END