### Notebook Overview: Illicit Subnetwork Analysis

This supplementary notebook analyses the structural characteristics of the illicit-only subnetworks generated in Model Step 3. It examines network-level metrics such as size, depth, connectivity, and transaction flow to understand the composition and variability of illicit Bitcoin activity across subnetworks.  

**Purpose**  
This notebook quantifies and summarises the key structural properties of the extracted subnetworks. By analysing attributes such as node count, edge count, and maximum path length (depth), it shows how illicit activity propagates within the Bitcoin transaction graph. The analysis helps identify subnetworks that are more complex, interconnected, or influential, which are key indicators of organised laundering activity.  

**Key Steps**  
- Import illicit-only subnetworks and associated node rankings from BigQuery.  
- Compute network-level statistics for each subnetwork, including node count, edge count, and density.  
- Calculate graph-theoretic measures such as average degree, clustering coefficient, and connected component size.  
- Determine subnetwork depth and the longest transaction path using traversal-based analysis.  
- Summarise distributions of subnetwork sizes and depths to identify outliers or unusually large networks.  
- Export subnetwork metrics and summary tables to BigQuery for further reference or visualisation.  
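
The network-level statistics named in the steps above can be sketched on a toy edge list. This is a minimal illustration, not the notebook's actual implementation: the function name `network_stats`, the `(source, destination)` edge format, and the toy data are assumptions, and the density formula assumes a directed graph without self-loops.

```python
def network_stats(edges):
    """Node count, edge count, density and average degree for a directed edge list."""
    nodes = set()
    unique_edges = set()
    for src, dst in edges:
        nodes.update((src, dst))
        unique_edges.add((src, dst))
    n, m = len(nodes), len(unique_edges)
    # Directed density: edges present / edges possible, i.e. m / (n * (n - 1))
    density = m / (n * (n - 1)) if n > 1 else 0.0
    # Average total degree (in + out): each edge adds one in-degree and one out-degree
    avg_degree = 2 * m / n if n else 0.0
    return {"node_count": n, "edge_count": m, "density": density, "avg_degree": avg_degree}


# Hypothetical toy subnetwork: a chain 1 -> 2 -> 3 plus a branch 2 -> 4
toy_edges = [(1, 2), (2, 3), (2, 4)]
stats = network_stats(toy_edges)
print(stats)
# {'node_count': 4, 'edge_count': 3, 'density': 0.25, 'avg_degree': 1.5}
```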

**Outcome**  
The analysis shows that most illicit subnetworks are small and shallow, representing isolated or short transaction chains: of the 26,012 deduplicated subnetworks, about 76% contain a single transaction and about 92% fall in the lowest depth group. A small subset (about 3% have more than five nodes and under 2% reach a depth of five or more) exhibits greater depth and connectivity, indicating potential aggregation or layering behaviour consistent with money laundering typologies. The resulting subnetwork metrics enable prioritisation of complex or high-value subnetworks for ranking and visualisation in later investigative stages.  

**Context and Attribution**  
This notebook forms part of the technical work developed in support of the research thesis titled:  
_“Detection, Ranking and Visualisation of Money Laundering Networks on the Bitcoin Blockchain”_  
by Jennifer Payne (RMIT University).  

GitHub Repository: [https://github.com/majorpayne-2021/rmit_master_thesis](https://github.com/majorpayne-2021/rmit_master_thesis)  
Elliptic++ Dataset Source: [https://github.com/git-disl/EllipticPlusPlus](https://github.com/git-disl/EllipticPlusPlus)


In [None]:
# Data cleaning and manipulation
import pandas as pd
import numpy as np
import math
import time

# GCP libraries
from pandas_gbq import to_gbq # write pandas df to a GCP BigQuery table
import gcsfs
import importlib.util
import os
import inspect

# Set up display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.float_format', lambda x: '%.4f' % x)

# Suppress FutureWarnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


--------------
##### Read in Datasets
--------------


In [None]:
%%bigquery df_nw_txn_expanded
-- Original subnetwork transactions (prior to deduplication)
select * from `extreme-torch-467913-m6.networks.network_txn_expanded`;

In [None]:
df_nw_txn_expanded.head(1)

Unnamed: 0,txn_id,hop,subnetwork_id,seed_txn,merged_subnetwork_id
0,3084073,0,0,3084073,0


In [None]:
%%bigquery df_nw_summary
-- Subnetwork summary table
select * from `extreme-torch-467913-m6.networks.network_summary`;

In [None]:
df_nw_summary.head(1)

Unnamed: 0,merged_subnetwork_id,txn_ids,node_count,edge_count,depth,seeds,seed_count,linked_txn_count
0,17742,"[101586718, 130897027, 138779422, 162026612, 1...",394,438,15,"[130897027, 176149522, 298005149, 306112320, 3...",223,171


In [None]:
%%bigquery df_nw_txn
-- Final transaction-to-subnetwork list
select * from `extreme-torch-467913-m6.networks.network_txn_final`;

In [None]:
df_nw_txn.head(1)

Unnamed: 0,merged_subnetwork_id,txn_id,min_hop,seeds_in_group
0,0,3084073,0,"[3084073, 9907558, 36385394, 39915271, 2303317..."


--------------
##### Txn Analysis
--------------


--------------
##### Network Analysis
--------------


Total networks (original vs. deduplicated)

In [None]:
df_nw_txn_expanded['subnetwork_id'].nunique()

32507

In [None]:
df_nw_txn_expanded['merged_subnetwork_id'].nunique()

26012

In [None]:
df_nw_summary['merged_subnetwork_id'].nunique()

26012

Number of Nodes by Subnetwork

In [None]:
# 1) Pull the node counts as integers
s = pd.to_numeric(df_nw_summary["node_count"], errors="coerce").dropna().astype(int)
n = int(s.size)

# 2) Define bins: 1, 2, 3, 4, 5, and more than 5 (the last bin, labelled "5+", holds networks with 6 or more nodes)
bins = [0, 1, 2, 3, 4, 5, float("inf")]   # 7 edges → 6 intervals
labels = ["1", "2", "3", "4", "5", "5+"]

# Group counts into bins
s_binned = pd.cut(s, bins=bins, labels=labels, right=True, include_lowest=True)

# 3) Distribution table
dist = (
    s_binned.value_counts()
    .sort_index()
    .rename_axis("Nodes in NW")
    .reset_index(name="Count of NW")
)
dist["% Population"] = (dist["Count of NW"] / n * 100).round(2)
dist["Cum Count"] = dist["Count of NW"].cumsum()
dist["Cum % Pop"] = (dist["Cum Count"] / n * 100).round(2)

# 4) Add totals row
totals = pd.DataFrame({
    "Nodes in NW": ["Total"],
    "Count of NW": [dist["Count of NW"].sum()],
    # Total % is computed from the counts so that rounding error does not accumulate
    "% Population": [round(dist["Count of NW"].sum() / n * 100, 2)],
    "Cum Count": [None],
    "Cum % Pop": [None]
})

dist = pd.concat([dist, totals], ignore_index=True)

dist


Unnamed: 0,Nodes in NW,Count of NW,% Population,Cum Count,Cum % Pop
0,1,19889,76.46,19889.0,76.46
1,2,3552,13.66,23441.0,90.12
2,3,993,3.82,24434.0,93.93
3,4,467,1.8,24901.0,95.73
4,5,276,1.06,25177.0,96.79
5,5+,835,3.21,26012.0,100.0
6,Total,26012,100.0,,
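
The node, depth, and seed distributions in this notebook all follow the same pattern (bin, count, percentage, cumulative totals), so the logic could be factored into one helper. The sketch below is one possible shape for it: the name `binned_distribution`, its parameters, and the toy values are hypothetical. Unlike the cell above, it labels the open-ended bin by its first value (for example `4+` when values 1 to 3 are counted individually).

```python
import pandas as pd


def binned_distribution(values, max_single, label_col="group"):
    """Count values 1..max_single individually and pool larger values into one open-ended bin."""
    s = pd.to_numeric(pd.Series(values), errors="coerce").dropna().astype(int)
    n = len(s)
    # Bin edges 0, 1, ..., max_single, inf give right-closed intervals (0, 1], (1, 2], ...
    bins = list(range(0, max_single + 1)) + [float("inf")]
    labels = [str(i) for i in range(1, max_single + 1)] + [f"{max_single + 1}+"]
    binned = pd.cut(s, bins=bins, labels=labels, right=True, include_lowest=True)
    out = (
        binned.value_counts()
        .sort_index()
        .rename_axis(label_col)
        .reset_index(name="network_count")
    )
    out["pct_population"] = (out["network_count"] / n * 100).round(2)
    out["cum_count"] = out["network_count"].cumsum()
    # Cumulative % is taken from cumulative counts, not from summing rounded percentages
    out["cum_pct_population"] = (out["cum_count"] / n * 100).round(2)
    return out


# Hypothetical toy input: six networks with these node counts
toy = binned_distribution([1, 1, 2, 3, 7, 9], max_single=3)
print(toy)
```

In the notebook, a call such as `binned_distribution(df_nw_summary["node_count"], max_single=5)` would be the assumed usage.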


Check whether each transaction belongs to exactly one subnetwork (i.e. the subnetworks are mutually exclusive).

In [None]:
# one row per (subnetwork, txn)
pairs = df_nw_txn[['merged_subnetwork_id','txn_id']].drop_duplicates()

# txn that appear in 2+ different subnetworks
conflicts = (pairs
             .groupby('txn_id')['merged_subnetwork_id']
             .nunique()
             .reset_index(name='num_subnets')
             .query('num_subnets > 1'))

has_conflicts = not conflicts.empty
print(f"Any conflicts? {has_conflicts}. # of conflicting txn: {len(conflicts)}")


Any conflicts? False. # of conflicting txn: 0


In [None]:
conflict_map = (pairs[pairs['txn_id'].isin(conflicts['txn_id'])]
                .groupby('txn_id')['merged_subnetwork_id']
                .agg(lambda x: sorted(set(x)))
                .rename('subnets')
                .reset_index())

print(conflict_map.head(20))


Empty DataFrame
Columns: [txn_id, subnets]
Index: []
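
Because the real data produces no conflicts, the check never shows what a positive result looks like. The toy frame below uses hypothetical IDs, with one transaction deliberately assigned to two subnetworks, to illustrate how the same `groupby` and `nunique` logic would flag it.

```python
import pandas as pd

# Hypothetical toy data: txn 30 is assigned to both subnetwork 0 and subnetwork 1
toy = pd.DataFrame({
    "merged_subnetwork_id": [0, 0, 1, 1, 1],
    "txn_id": [10, 30, 20, 30, 30],
})

# One row per (subnetwork, txn) pair, so repeated rows do not count as conflicts
toy_pairs = toy[["merged_subnetwork_id", "txn_id"]].drop_duplicates()

# Number of distinct subnetworks each transaction appears in
subnets_per_txn = toy_pairs.groupby("txn_id")["merged_subnetwork_id"].nunique()
conflicting_txns = subnets_per_txn[subnets_per_txn > 1].index.tolist()

print(conflicting_txns)  # [30]
```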


Subnetwork depth (distance from seed node to the furthest destination node)

In [None]:
# df_nw_summary was loaded from BigQuery above

# 1) Pull depths as integers
s = pd.to_numeric(df_nw_summary["depth"], errors="coerce").dropna().astype(int)
n = int(s.size)

# 2) Define bins: 1, 2, 3, 4, and 5+ (include_lowest=True places any depth of 0 in the "1" bin)
bins = [0, 1, 2, 3, 4, float("inf")]
labels = ["1", "2", "3", "4", "5+"]

# 3) Group into bins
s_binned = pd.cut(s, bins=bins, labels=labels, right=True, include_lowest=True)

# 4) Distribution table
depth_summary = (
    s_binned.value_counts()
    .sort_index()
    .rename_axis("depth_group")
    .reset_index(name="network_count")
)

# 5) Add percentages and cumulative stats
#    (cumulative % is derived from cumulative counts so rounding error does not accumulate)
depth_summary["pct_population"] = (depth_summary["network_count"] / n * 100).round(2)
depth_summary["cum_count"] = depth_summary["network_count"].cumsum()
depth_summary["cum_pct_population"] = (depth_summary["cum_count"] / n * 100).round(2)

# 6) Add totals row
totals = pd.DataFrame({
    "depth_group": ["Total"],
    "network_count": [depth_summary["network_count"].sum()],
    "pct_population": [round(depth_summary["network_count"].sum() / n * 100, 2)],
    "cum_count": [depth_summary["cum_count"].iloc[-1]],
    "cum_pct_population": [depth_summary["cum_pct_population"].iloc[-1]]
})

depth_summary = pd.concat([depth_summary, totals], ignore_index=True)

depth_summary


Unnamed: 0,depth_group,network_count,pct_population,cum_count,cum_pct_population
0,1,23865,91.75,23865,91.75
1,2,1057,4.06,24922,95.81
2,3,459,1.76,25381,97.57
3,4,223,0.86,25604,98.43
4,5+,408,1.57,26012,100.0
5,Total,26012,100.0,26012,100.0
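
The `depth` column is read from the summary table, so its computation is not shown here. As a rough illustration of a traversal-based depth, the sketch below runs a multi-source breadth-first search from the seed transactions and takes the largest hop count reached. The function name `max_hop_depth`, the `(source, destination)` edge format, and the toy data are assumptions, not the thesis implementation.

```python
from collections import deque


def max_hop_depth(edges, seeds):
    """Return (depth, hops): the largest shortest-path hop count from any seed, and the hop map."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    hops = {seed: 0 for seed in seeds}  # every seed sits at hop 0
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in hops:  # first visit in BFS is the minimum hop count
                hops[nxt] = hops[node] + 1
                queue.append(nxt)

    depth = max(hops.values()) if hops else 0
    return depth, hops


# Hypothetical toy subnetwork: 1 -> 2 -> 3 and 2 -> 4 -> 5, seeded at transaction 1
toy_edges = [(1, 2), (2, 3), (2, 4), (4, 5)]
depth, hops = max_hop_depth(toy_edges, seeds=[1])
print(depth)  # 3
print(hops)   # {1: 0, 2: 1, 3: 2, 4: 2, 5: 3}
```

Under this definition a subnetwork consisting of a single seed transaction has depth 0.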


Number of seeds per subnetwork

In [None]:
# df_nw_summary was loaded from BigQuery above

# 1) Pull seed counts as integers
s = pd.to_numeric(df_nw_summary["seed_count"], errors="coerce").dropna().astype(int)
n = int(s.size)

# 2) Define bins: 1, 2, 3, 4, and 5+
bins = [0, 1, 2, 3, 4, float("inf")]
labels = ["1", "2", "3", "4", "5+"]

# 3) Group into bins
s_binned = pd.cut(s, bins=bins, labels=labels, right=True, include_lowest=True)

# 4) Distribution table
seed_summary = (
    s_binned.value_counts()
    .sort_index()
    .rename_axis("seed_group")
    .reset_index(name="network_count")
)

# 5) Add percentages and cumulative stats
#    (cumulative % is derived from cumulative counts so rounding error does not accumulate)
seed_summary["pct_population"] = (seed_summary["network_count"] / n * 100).round(2)
seed_summary["cum_count"] = seed_summary["network_count"].cumsum()
seed_summary["cum_pct_population"] = (seed_summary["cum_count"] / n * 100).round(2)

# 6) Add totals row
totals = pd.DataFrame({
    "seed_group": ["Total"],
    "network_count": [seed_summary["network_count"].sum()],
    "pct_population": [round(seed_summary["network_count"].sum() / n * 100, 2)],
    "cum_count": [seed_summary["cum_count"].iloc[-1]],
    "cum_pct_population": [seed_summary["cum_pct_population"].iloc[-1]]
})

seed_summary = pd.concat([seed_summary, totals], ignore_index=True)

seed_summary


Unnamed: 0,seed_group,network_count,pct_population,cum_count,cum_pct_population
0,1,25062,96.35,25062,96.35
1,2,428,1.65,25490,97.99
2,3,161,0.62,25651,98.61
3,4,89,0.34,25740,98.95
4,5+,272,1.05,26012,100.0
5,Total,26012,100.0,26012,100.0


Matrix of node count vs. depth for subnetworks with >= 2 nodes.

In [None]:
import pandas as pd

# --- 1) Filter to networks with >= 2 txns
df2 = df_nw_summary.loc[df_nw_summary["node_count"] >= 2,
                        ["merged_subnetwork_id", "node_count", "depth"]].copy()

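# --- 2) Bin helpers
def bin_txn_count(x):
    """Bin a node count. After the >= 2 filter above, the "1–5" bin effectively holds 2–5 nodes."""
    x = int(x)
    if 1 <= x <= 5:
        return "1–5"
    elif 6 <= x <= 10:
        return "6–10"
    elif 11 <= x <= 15:
        return "11–15"
    elif 16 <= x <= 20:
        return "16–20"
    else:
        return "21+"

def bin_depth(x):
    """Bin a depth: 1 to 5 individually; "5+" holds depths greater than 5 (callers clamp depths below 1 up to 1)."""
    x = int(x)
    if 1 <= x <= 5:
        return str(x)
    else:
        return "5+"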

df2["txn_count_bin"] = df2["node_count"].map(bin_txn_count)
df2["depth_bin"] = df2["depth"].map(lambda d: bin_depth(max(1, int(d))))

# --- 3) Matrix: COUNT OF NETWORKS in each (txn_count_bin, depth_bin)
txn_categories = ["1–5", "6–10", "11–15", "16–20", "21+"]
depth_categories = ["1", "2", "3", "4", "5", "5+"]

matrix = pd.crosstab(
    index=df2["txn_count_bin"],
    columns=df2["depth_bin"]
).reindex(index=txn_categories, columns=depth_categories, fill_value=0)

matrix.index.name = "Txn Count (binned)"
matrix.columns.name = "Txn Depth (binned)"

# --- 4) Totals that will match the # of networks after filtering
total_networks = df2["merged_subnetwork_id"].nunique()
matrix_sum = int(matrix.values.sum())  # will equal total_networks since each network contributes exactly one cell

print(f"Unique subnetworks with ≥2 txns: {total_networks}")
print(f"Matrix cell sum (should match): {matrix_sum}")
matrix


Unique subnetworks with ≥2 txns: 6123
Matrix cell sum (should match): 6123


Txn Count (binned) / Txn Depth (binned),1,2,3,4,5,5+
1–5,3949,937,302,100,0,0
6–10,17,86,105,69,78,99
11–15,5,16,24,22,22,57
16–20,2,5,8,6,9,41
21+,3,13,20,26,18,84
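
The same matrix can also be built without per-row helper functions by binning both columns with `pd.cut` before cross-tabulating. The sketch below is an alternative under stated assumptions, not the cell's original method: the toy frame and its values are hypothetical, and the labels are renamed to `2-5` and `6+` to reflect what the bins actually contain after the >= 2 node filter.

```python
import pandas as pd

# Hypothetical toy summary using the same column names as df_nw_summary
toy_summary = pd.DataFrame({
    "node_count": [2, 3, 7, 12, 25, 6],
    "depth": [1, 2, 3, 6, 9, 0],
})

size_labels = ["2-5", "6-10", "11-15", "16-20", "21+"]
depth_labels = ["1", "2", "3", "4", "5", "6+"]

# Right-closed intervals: (1, 5], (5, 10], (10, 15], (15, 20], (20, inf)
size_bins = pd.cut(
    toy_summary["node_count"],
    bins=[1, 5, 10, 15, 20, float("inf")],
    labels=size_labels,
).astype(str)

# Depths below 1 are clamped to 1, mirroring max(1, int(d)) in the cell above
depth_bins = pd.cut(
    toy_summary["depth"].clip(lower=1),
    bins=[0, 1, 2, 3, 4, 5, float("inf")],
    labels=depth_labels,
).astype(str)

# Reindex so that empty bins still appear as zero-filled rows and columns
toy_matrix = pd.crosstab(size_bins, depth_bins).reindex(
    index=size_labels, columns=depth_labels, fill_value=0
)
print(toy_matrix)
```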
