### Notebook Overview: Model Step 3 – Subnetwork Development

This notebook constructs illicit-only subnetworks from the classified Bitcoin transaction dataset. It represents the third stage of the AML detection pipeline and operationalises the subgraph extraction methodology described in the thesis. The goal is to isolate self-contained clusters of illicit activity that can be further analysed and ranked.  

**Purpose**  
The notebook transforms the classified transaction data into directed graph structures that reflect the movement of Bitcoin between transactions. Starting from transactions predicted as illicit, a forward breadth-first search (BFS) expansion identifies all reachable transactions, creating a self-contained subnetwork that captures the flow of funds within that illicit cluster.  

**Key Steps**  
- Import transaction predictions and edgelist data from BigQuery.  
- Filter to include only transactions classified as illicit.  
- Use a directed graph representation (txn–txn) to model Bitcoin flows between transactions.  
- Perform a forward BFS from each illicit transaction to trace all connected downstream nodes.  
- Assign unique Subnetwork IDs and compute basic graph properties (e.g., node count, edge count, depth, seed count).  
- Export subnetwork tables to BigQuery for ranking and visualisation in the next stage.  

This process produces a collection of reproducible subnetworks that represent individual clusters of illicit Bitcoin activity. By isolating and structuring these networks, subsequent ranking and visualisation can focus on the most influential or financially significant regions of the transaction graph.  
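
The forward BFS step described above can be illustrated with a minimal sketch. It is not the code in `txn_subnetworks.py`: the function name `forward_bfs`, the toy edge list, and the optional `allowed` filter (one possible way to keep a subnetwork illicit-only) are all assumptions made for illustration.

```python
from collections import defaultdict, deque


def forward_bfs(edges, seed, allowed=None):
    """Return {txn_id: hop} for transactions reachable downstream of `seed`.

    `allowed` is a hypothetical optional set of txn ids; when given,
    the expansion only visits transactions inside that set.
    """
    # Adjacency list for the directed txn -> txn graph
    successors = defaultdict(list)
    for src, dst in edges:
        successors[src].append(dst)

    hops = {seed: 0}          # hop distance from the seed
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        for nxt in successors.get(current, []):
            if nxt in hops:
                continue      # already visited at an equal or shorter hop
            if allowed is not None and nxt not in allowed:
                continue      # outside the permitted node set
            hops[nxt] = hops[current] + 1
            queue.append(nxt)
    return hops


# Toy directed edges (src_txn, dst_txn); txn 5 feeds into the seed and is
# therefore upstream, so a forward search never reaches it.
toy_edges = [(1, 2), (2, 3), (1, 4), (5, 1)]
all_reachable = forward_bfs(toy_edges, seed=1)
illicit_only = forward_bfs(toy_edges, seed=1, allowed={1, 2, 3})
print(all_reachable)
print(illicit_only)
```

Because the search only follows edges in the source-to-destination direction, each subnetwork contains the seed and whatever lies downstream of it, which is why different seeds can produce overlapping node sets that later need deduplication and merging.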

**Context and Attribution**  
This notebook forms part of the technical work developed in support of the research thesis titled:  
_“Detection, Ranking and Visualisation of Money Laundering Networks on the Bitcoin Blockchain”_  
by Jennifer Payne (RMIT University).  

GitHub Repository: [https://github.com/majorpayne-2021/rmit_master_thesis](https://github.com/majorpayne-2021/rmit_master_thesis)  
Elliptic++ Dataset Source: [https://github.com/git-disl/EllipticPlusPlus](https://github.com/git-disl/EllipticPlusPlus)


In [None]:
# Data cleaning and manipulation
import pandas as pd
import numpy as np
import math
import time

# GCP libraries
from pandas_gbq import to_gbq # write pandas df to a GCP BigQuery table
import gcsfs
import importlib.util
import os
import inspect

# Set up display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.float_format', lambda x: '%.4f' % x)

# Suppress FutureWarnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


--------------
##### Read in Txn Subnetwork Classes & Modules
--------------


In [None]:
# Define bucket and file path
bucket_name = "thesis_classes"
file_name = "txn_subnetworks.py"
gcs_path = f"gs://{bucket_name}/{file_name}"

# Initialize GCS filesystem
fs = gcsfs.GCSFileSystem()

# Local filename to save the script temporarily
local_file = f"/tmp/{file_name}"

# Download the file from GCS to local storage
fs.get(gcs_path, local_file)

# Dynamically import the module
module_name = "txn_subnetworks"
spec = importlib.util.spec_from_file_location(module_name, local_file)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

In [None]:
# Use inspect to list the classes visible in the module (this includes imported names such as Line2D and defaultdict)
classes = [name for name, obj in inspect.getmembers(module, inspect.isclass)]

# Print results
print("Classes in module:")
for cls in classes:
    print(f"  - {cls}")


Classes in module:
  - Line2D
  - build_txn_subnetwork
  - combinations
  - defaultdict
  - reporting
  - visualise_subnetwork


In [None]:
# Instantiate the classes
build_network = module.build_txn_subnetwork()
build_report = module.reporting()

--------------
##### Read in Datasets
--------------


In [None]:
%%bigquery df_txn_edgelist
-- Get txn edgelist (the cell magic must be the first line of the cell)
select * from `extreme-torch-467913-m6.txn.txn_edgelist`;

In [None]:
df_txn_edgelist.head(1)

Unnamed: 0,txId1,txId2
0,36186840,1076


In [None]:
%%bigquery df_txn_pred
-- Get txn predictions (the cell magic must be the first line of the cell)
select * from `extreme-torch-467913-m6.txn.txn_pred_final`;

In [None]:
df_txn_pred.head(1)

Unnamed: 0,txId,Time step,class,class_label,pred_model,pred_model_threshold,pred_proba,pred_class,pred_class_label,final_class,final_class_label
0,230550390,1,2,Licit,Random Forest,0.3,0.0932,0,Licit,0,Licit


In [None]:
list_illicit_seeds = df_txn_pred[(df_txn_pred['final_class_label'] == 'Illicit')]['txId'].tolist()

In [None]:
len(list_illicit_seeds)

49707

--------------
##### Build Subnetworks for all Illicit Nodes
--------------


Build subnetwork

In [None]:
start_time = time.time()

# 1) Build naive (possibly overlapping) subnetworks per seed
nodes_all, edges_all = build_network.build_subnetworks_naive(
    edges_df=df_txn_edgelist,
    labels_df=df_txn_pred,
    seed_txns=list_illicit_seeds,
    progress=True,
    progress_every=1000,   # print a summary every 1000 seeds
)

end_time = time.time()
elapsed_minutes = (end_time - start_time) / 60
print(f"Function took {elapsed_minutes:.2f} minutes")

Commencing subnetwork development from seed nodes. seeds=49707  update_per_batch=1000
[  1000/49707] single node network: 699  |  multi-node network: 301  (cumulative single: 699, multi: 301)  largest_nodes=9 (seed=15012776)
[  2000/49707] single node network: 756  |  multi-node network: 244  (cumulative single: 1455, multi: 545)  largest_nodes=9 (seed=15012776)
[  3000/49707] single node network: 716  |  multi-node network: 284  (cumulative single: 2171, multi: 829)  largest_nodes=23 (seed=226358329)
[  4000/49707] single node network: 682  |  multi-node network: 318  (cumulative single: 2853, multi: 1147)  largest_nodes=23 (seed=226358329)
[  5000/49707] single node network: 636  |  multi-node network: 364  (cumulative single: 3489, multi: 1511)  largest_nodes=23 (seed=226358329)
[  6000/49707] single node network: 581  |  multi-node network: 419  (cumulative single: 4070, multi: 1930)  largest_nodes=24 (seed=175050552)
[  7000/49707] single node network: 781  |  multi-node network: 

Deduplicate subnetworks whose node set is a subset of a larger subnetwork
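
As a rough illustration of the subset rule, the sketch below drops any subnetwork whose node set is contained in another one. It is a simplified stand-in, not the body of `deduplicate_subnetworks_by_node_subset`: the function name `drop_subset_subnetworks`, the dictionary input, and the tie-breaking rule for identical sets (lowest id wins) are assumptions.

```python
def drop_subset_subnetworks(node_sets):
    """node_sets: {subnetwork_id: set of txn ids}.

    Returns (kept_ids, dropped) where dropped maps each removed id to the
    kept subnetwork that contains it.
    """
    # Visit larger sets first, so a set can only be a subset of something
    # already kept; ties are broken by the smaller subnetwork id.
    order = sorted(node_sets, key=lambda sid: (-len(node_sets[sid]), sid))
    kept = []
    dropped = {}
    for sid in order:
        container = next(
            (k for k in kept if node_sets[sid] <= node_sets[k]), None
        )
        if container is None:
            kept.append(sid)
        else:
            dropped[sid] = container
    return sorted(kept), dropped


# Toy input: subnetworks 1 and 3 sit inside subnetwork 0, and 2 sits inside 4.
toy_sets = {0: {1, 2, 3}, 1: {2, 3}, 2: {4}, 3: {1, 2, 3}, 4: {3, 4}}
kept_ids, dropped_map = drop_subset_subnetworks(toy_sets)
print(kept_ids)
print(dropped_map)
```

Checking only against kept sets is sufficient because the subset relation is transitive: if a container was itself dropped, the set that absorbed it also contains the candidate. The comparison is quadratic in the number of subnetworks, so this sketch favours clarity over speed.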

In [None]:
# 2) Deduplicate: drop any subnetwork whose node set is a subset of another
nodes_dedup, edges_dedup, dedup_report, id_map = build_network.deduplicate_subnetworks_by_node_subset(
    nodes_all, edges_all, relabel=False, progress=True
)


Dedup by node-subset: 49707 → 32507 subnetworks kept (17200 removed).
  - drop 5 (seed=230472386) ⊆ kept 258 (seed=230472385)
  - drop 11 (seed=230543020) ⊆ kept 340 (seed=230543016)
  - drop 15 (seed=16843895) ⊆ kept 500 (seed=230456346)
  - drop 20 (seed=232038046) ⊆ kept 271 (seed=230464746)
  - drop 49 (seed=91787015) ⊆ kept 270 (seed=230586841)
  - drop 52 (seed=232375025) ⊆ kept 309 (seed=230451709)
  - drop 56 (seed=144657595) ⊆ kept 290 (seed=231994074)
  - drop 59 (seed=231994234) ⊆ kept 165 (seed=231994227)
  - drop 60 (seed=76867000) ⊆ kept 230 (seed=230645551)
  - drop 61 (seed=230456370) ⊆ kept 157 (seed=230521240)
  - drop 62 (seed=87889638) ⊆ kept 230 (seed=230645551)
  - drop 67 (seed=226715158) ⊆ kept 48 (seed=232008935)
  - drop 70 (seed=233997270) ⊆ kept 240 (seed=232359167)
  - drop 79 (seed=231994244) ⊆ kept 165 (seed=231994227)
  - drop 80 (seed=2718109) ⊆ kept 156 (seed=230330936)
  - drop 81 (seed=81270783) ⊆ kept 133 (seed=81273866)
  - drop 85 (seed=230789891

Merge networks that have overlapping nodes
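
One common way to merge subnetworks that share nodes is a union-find (disjoint-set) pass, sketched below for the `min_shared_nodes=1` case only. This is an assumed technique, not necessarily what `merge_subnetworks_by_node_overlap` does internally; the name `merge_by_overlap` and the toy data are hypothetical. The smallest subnetwork id is used as the group label, which matches the pattern in the log output (for example, group 45 contains ids 45 and 191).

```python
def merge_by_overlap(node_sets):
    """Group subnetworks that share at least one node, transitively.

    node_sets: {subnetwork_id: set of txn ids}.
    Returns {group_id: [subnetwork ids]}, group_id being the smallest member id.
    """
    parent = {sid: sid for sid in node_sets}

    def find(x):
        # Follow parent links to the root, halving the path as we go
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller id as the representative of the group
            parent[max(ra, rb)] = min(ra, rb)

    first_owner = {}  # txn id -> first subnetwork seen containing it
    for sid in sorted(node_sets):
        for node in node_sets[sid]:
            if node in first_owner:
                union(sid, first_owner[node])
            else:
                first_owner[node] = sid

    groups = {}
    for sid in sorted(node_sets):
        groups.setdefault(find(sid), []).append(sid)
    return groups


# Toy input: 0 and 1 share txn 2, 1 and 3 share txn 3, 4 and 5 share txn 8.
toy_sets = {0: {1, 2}, 1: {2, 3}, 2: {9}, 3: {3, 4}, 4: {7, 8}, 5: {8}}
merged_groups = merge_by_overlap(toy_sets)
print(merged_groups)
```

Subnetworks 0 and 3 share no node directly, yet they end up in the same group through subnetwork 1. A threshold above one shared node would need pairwise overlap counts instead of this single pass.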

In [None]:
txn_expanded, edges_expanded, txn_final, edges_final = build_network.merge_subnetworks_by_node_overlap(
    nodes_dedup, edges_dedup,
    min_shared_nodes=1,
    progress=True,
    print_unmerged=False,  # <- only merged groups printed
    collapse=True
)

Merge by node-overlap (≥1): 32507 → 26012 merged subnetworks.
  group 0: subnetwork_ids=[0, 1, 2, 32, 68, 73, 74, 88, 96, 105, 111, 119, 122, 135, 153, 164, 173, 186, 223, 251, 256, 257, 262, 274, 294, 295] seeds=[3084073, 9907558, 36385394, 39915271, 230331777, 230335060, 230336139, 230337202, 230340242, 230344662, 230598498, 230658163, 230658165, 230658285, 230658519, 230658520, 230658521, 230658527, 230659124, 230659441, 230659444, 230659446, 230659451, 230659453, 230659454, 230659458]
  group 45: subnetwork_ids=[45, 191] seeds=[232027324, 232629400]
  group 241: subnetwork_ids=[241, 469] seeds=[232658952, 232673081]
  group 917: subnetwork_ids=[917, 957] seeds=[309938432, 309970292]
  group 947: subnetwork_ids=[947, 974] seeds=[310414790, 310415405]
  group 976: subnetwork_ids=[976, 1181, 1186] seeds=[309930098, 309932224, 310142686]
  group 982: subnetwork_ids=[982, 1018, 1025] seeds=[309927449, 309970289, 312115193]
  group 1023: subnetwork_ids=[1023, 1190] seeds=[309935574, 3099

In [None]:
print(txn_expanded.shape)
txn_expanded.head(1)

(63800, 5)


Unnamed: 0,txn_id,hop,subnetwork_id,seed_txn,merged_subnetwork_id
0,3084073,0,0,3084073,0


In [None]:
print(txn_final.shape)
print(txn_final['merged_subnetwork_id'].nunique())
txn_final.head(1)

(49707, 4)
26012


Unnamed: 0,merged_subnetwork_id,txn_id,min_hop,seeds_in_group
0,0,3084073,0,"[3084073, 9907558, 36385394, 39915271, 2303317..."


In [None]:
print(edges_expanded.shape)
edges_expanded.head(1)

(31804, 7)


Unnamed: 0,src_txn_id,dst_txn_id,src_txn_hop,dst_txn_hop,subnetwork_id,seed_txn,merged_subnetwork_id
0,3084073,230658142,0,1,0,3084073,0


In [None]:
print(edges_final.shape)
print(edges_final['merged_subnetwork_id'].nunique()) # <-- Only 6123 networks have at least 2 nodes (i.e. at least 1 edge).
edges_final.head(1)

(24293, 5)
6123


Unnamed: 0,merged_subnetwork_id,src_txn_id,dst_txn_id,min_src_hop,min_dst_hop
0,0,3084073,230658142,0,1


Build network summary table
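
For orientation, a summary with the same column names could be derived from the expanded tables roughly as follows. This is a sketch under assumptions, not the implementation of `summarise_subnetworks`: `depth` is assumed to be the maximum hop, `seed_count` the number of distinct `seed_txn` values, and `linked_txn_count` the difference between `node_count` and `seed_count` (which agrees with the rows displayed in this notebook). The function name `summarise` and the toy frames are hypothetical.

```python
import pandas as pd


def summarise(nodes, edges, id_col="merged_subnetwork_id"):
    """Per-subnetwork counts from expanded node and edge tables."""
    node_stats = nodes.groupby(id_col).agg(
        node_count=("txn_id", "nunique"),   # distinct transactions
        depth=("hop", "max"),               # assumption: deepest hop reached
        seed_count=("seed_txn", "nunique"), # distinct seeds in the group
    )
    # The same edge can appear once per seed, so count distinct pairs
    edge_pairs = edges.drop_duplicates([id_col, "src_txn_id", "dst_txn_id"])
    edge_count = edge_pairs.groupby(id_col).size().rename("edge_count")

    out = node_stats.join(edge_count).fillna({"edge_count": 0})
    out["edge_count"] = out["edge_count"].astype(int)
    out["linked_txn_count"] = out["node_count"] - out["seed_count"]
    return out.sort_values("node_count", ascending=False).reset_index()


# Toy expanded tables: group 0 has two seeds (1 and 2), group 1 is a singleton.
toy_nodes = pd.DataFrame({
    "txn_id": [1, 2, 3, 2, 3, 9],
    "hop": [0, 1, 2, 0, 1, 0],
    "seed_txn": [1, 1, 1, 2, 2, 9],
    "merged_subnetwork_id": [0, 0, 0, 0, 0, 1],
})
toy_edges = pd.DataFrame({
    "src_txn_id": [1, 2, 2],
    "dst_txn_id": [2, 3, 3],
    "merged_subnetwork_id": [0, 0, 0],
})
toy_summary = summarise(toy_nodes, toy_edges)
print(toy_summary)
```

Single-node subnetworks have no rows in the edge table, so their `edge_count` is filled with zero after the join.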

In [None]:
summary_final = build_report.summarise_subnetworks(
    txn_expanded, edges_expanded,
    id_col="merged_subnetwork_id",   # <- merged group id from the overlap step
    sort_by="size"
)

In [None]:
print(len(summary_final))
summary_final.head()

26012


Unnamed: 0,merged_subnetwork_id,txn_ids,node_count,edge_count,depth,seeds,seed_count,linked_txn_count
0,17742,"[101586718, 130897027, 138779422, 162026612, 1...",394,438,15,"[130897027, 176149522, 298005149, 306112320, 3...",223,171
1,17750,"[19042162, 21210721, 22539188, 25321055, 42227...",370,4263,68,"[21210721, 42227626, 45587478, 58502892, 73240...",130,240
2,22362,"[25749848, 31657337, 42996770, 64534421, 80201...",314,357,6,"[25749848, 31657337, 42996770, 64534421, 80201...",228,86
3,31404,"[14140455, 17006093, 21398242, 28202941, 12142...",254,291,5,"[14140455, 17006093, 21398242, 28202941, 13925...",158,96
4,16356,"[10856574, 98902423, 98902477, 98961227, 98961...",225,244,6,"[10856574, 98902423, 98902477, 98961227, 98961...",146,79


Check the distribution of subnetworks by node count in the dataset

In [None]:
# 1) Pull the node counts as integers
s = pd.to_numeric(summary_final["node_count"], errors="coerce").dropna().astype(int)
n = int(s.size)

# 2) Distribution table (ascending by node_count)
dist = (
    s.value_counts()
     .sort_index()
     .rename_axis("node_count")
     .reset_index(name="count")
)
dist["pct_population"] = (dist["count"] / n * 100).round(2)
dist["cum_count"] = dist["count"].cumsum()
dist["cum_pct_population"] = (dist["cum_count"] / n * 100).round(2)

dist

# 76.46% of the merged subnetworks (19,889 of 26,012) consist of a single node.

Unnamed: 0,node_count,count,pct_population,cum_count,cum_pct_population
0,1,19889,76.46,19889,76.46
1,2,3552,13.66,23441,90.12
2,3,993,3.82,24434,93.93
3,4,467,1.8,24901,95.73
4,5,276,1.06,25177,96.79
5,6,158,0.61,25335,97.4
6,7,108,0.42,25443,97.81
7,8,78,0.3,25521,98.11
8,9,64,0.25,25585,98.36
9,10,46,0.18,25631,98.54


--------------
##### Export Subnetwork Tables to BigQuery
--------------


In [None]:
# Define your project ID
project_id = 'extreme-torch-467913-m6'

In [None]:
# Write each subnetwork table to BigQuery, replacing any existing table of the same name
to_gbq(dataframe = txn_expanded, destination_table = 'networks.network_txn_expanded', project_id=project_id, if_exists='replace')
to_gbq(dataframe = txn_final, destination_table = 'networks.network_txn_final', project_id=project_id, if_exists='replace')
to_gbq(dataframe = edges_expanded, destination_table = 'networks.network_edges_expanded', project_id=project_id, if_exists='replace')
to_gbq(dataframe = edges_final, destination_table = 'networks.network_edges_final', project_id=project_id, if_exists='replace')
to_gbq(dataframe = summary_final, destination_table = 'networks.network_summary', project_id=project_id, if_exists='replace')


END