--------------
**Build Sub-Networks of Illicit Networks**

Purpose of this notebook:
- Build sub-networks of illicit nodes based on a set of rules within the same timespan, then apply a ranking algorithm to indicate each node's importance within its sub-network.
- Each sub-network is built using a 'k-hop ego expansion' within a time window (in this case, the network timestamp). This is a 'seed expansion' network that respects edge direction and only collects illicit nodes.
- Each sub-network starts with an illicit node (from the classification output) and builds outwards, following the direction of the transactions. The goal is to find a community of connected individuals starting from a suspected illicit transaction.
- The algorithm traverses forward at the transaction level from an illicit transaction, using a txn-txn edge list to find all the illicit transactions directly connected to it. Illicit transactions are ranked in order of importance, based on txn value. From here, suspected illicit addresses (those connected to an illicit transaction) are provided to investigators.
- Note: for each background graph in a timestep, all nodes are connected. The goal is to partition the background graph into illicit sub-networks.

Process to build the sub-network:
1. Get a list of all illicit nodes.
2. Start with a txn node whose txn prediction is illicit.
3. Traverse the txn-txn edge list breadth-first and add each txn that is labelled illicit to the sub-network. Once an illicit node is added to a sub-network, remove it from the list of illicit nodes.
4. Stop traversing a path when the next node is licit.
5. Deduplicate: drop a sub-network if it is a subset of another sub-network.
6. Merge sub-networks that share a linking node, to reduce the number of sub-networks.
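
Steps 2 to 4 can be sketched as a small breadth-first search. This is a minimal illustration only, not the code in `txn_subnetworks.py`: the function name `expand_illicit_forward` and the toy edge list are assumptions, and the sketch does not remove visited nodes from the seed list.

```python
from collections import defaultdict, deque

def expand_illicit_forward(edges, illicit, seed):
    """Collect illicit txns reachable from `seed` by following edges forward.

    edges   : iterable of (src_txn, dst_txn) pairs (directed)
    illicit : set of txn ids predicted illicit (hypothetical input format)
    Returns {txn_id: hop}, with the seed at hop 0.
    """
    successors = defaultdict(list)
    for src, dst in edges:
        successors[src].append(dst)

    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        txn = queue.popleft()
        for nxt in successors.get(txn, ()):
            # A licit node ends the path; an already-visited node is skipped.
            if nxt in illicit and nxt not in hops:
                hops[nxt] = hops[txn] + 1
                queue.append(nxt)
    return hops

# Toy graph: txn 4 is licit, so txn 5 behind it is never reached;
# txn 6 is upstream of the seed, so forward traversal ignores it.
edges = [(1, 2), (2, 3), (1, 4), (4, 5), (6, 1)]
illicit = {1, 2, 3, 5, 6}
result = expand_illicit_forward(edges, illicit, seed=1)
print(result)
```

In the toy graph the sub-network for seed 1 contains txns 1, 2 and 3 only, which also shows the 'cut short' behaviour when a path passes through a licit node.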

Benefits of 1-hop forward seed expansion:
- Fast and clear: can be used at scale for large data.
- Low noise: avoids network explosion by limiting the scope of the sub-network.
- A breadth-first approach captures all layers of the network, whereas a depth-first approach produces a long but narrow network.

Disadvantages of a 1-hop forward seed expansion:
- The sub-network is 'cut short' if a txn passes through a licit node.
--------------

In [None]:
# Data cleaning and manipulation
import pandas as pd
import numpy as np
import math
import time

# GCP libraries
from pandas_gbq import to_gbq # write pandas df to a GCP BigQuery table
import gcsfs
import importlib.util
import os
import inspect

# Set up display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.float_format', lambda x: '%.4f' % x)

# Suppress FutureWarnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


--------------
##### Read in Txn Subnetwork Classes & Modules
--------------


In [None]:
# Define bucket and file path
bucket_name = "thesis_classes"
file_name = "txn_subnetworks.py"
gcs_path = f"gs://{bucket_name}/{file_name}"

# Initialize GCS filesystem
fs = gcsfs.GCSFileSystem()

# Local filename to save the script temporarily
local_file = f"/tmp/{file_name}"

# Download the file from GCS to local storage
fs.get(gcs_path, local_file)

# Dynamically import the module
module_name = "txn_subnetworks"
spec = importlib.util.spec_from_file_location(module_name, local_file)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

In [None]:
# Use inspect to list the classes exposed by the module
classes = [name for name, obj in inspect.getmembers(module, inspect.isclass)]

# Print results
print("Classes in module:")
for cls in classes:
    print(f"  - {cls}")


Classes in module:
  - build_txn_subnetwork
  - combinations
  - defaultdict
  - reporting


In [None]:
# Instantiate the classes
build_network = module.build_txn_subnetwork()
build_report = module.reporting()

--------------
##### Read in Datasets
--------------


In [None]:
%%bigquery df_txn_edgelist
-- Get txn edgelist (the cell magic must be the first line of the cell)
select * from `extreme-torch-467913-m6.txn.txn_edgelist`;

In [None]:
df_txn_edgelist.head(1)

Unnamed: 0,txId1,txId2
0,36186840,1076


In [None]:
%%bigquery df_txn_pred
-- Get txn prediction (the cell magic must be the first line of the cell)
select * from `extreme-torch-467913-m6.txn.txn_pred_final`;

In [None]:
df_txn_pred.head(1)

Unnamed: 0,txId,Time step,class,class_label,pred_model,pred_model_threshold,pred_proba,pred_class,pred_class_label,final_class,final_class_label
0,230393099,1,3,Unknown,Random Forest,0.3,0.0,0,Licit,0,Licit


In [None]:
list_illicit_seeds = df_txn_pred[(df_txn_pred['final_class_label'] == 'Illicit')]['txId'].tolist()

In [None]:
len(list_illicit_seeds)

12147

--------------
##### Build Subnetworks for all Illicit Nodes
--------------


Build subnetwork

In [None]:
start_time = time.time()

# 1) Build naive (possibly overlapping) subnetworks per seed
nodes_all, edges_all = build_network.build_subnetworks_naive(
    edges_df=df_txn_edgelist,
    labels_df=df_txn_pred,
    seed_txns=list_illicit_seeds,
    progress=True,
    progress_every=1000,   # summary every 1000 seeds
)

end_time = time.time()
elapsed_minutes = (end_time - start_time) / 60
print(f"Function took {elapsed_minutes:.2f} minutes")

Commencing subnetwork development from seed nodes. seeds=12147  update_per_batch=1000
[  1000/12147] single node network: 866  |  multi-node network: 134  (cumulative single: 866, multi: 134)  largest_nodes=7 (seed=94051990)
[  2000/12147] single node network: 882  |  multi-node network: 118  (cumulative single: 1748, multi: 252)  largest_nodes=7 (seed=94051990)
[  3000/12147] single node network: 840  |  multi-node network: 160  (cumulative single: 2588, multi: 412)  largest_nodes=7 (seed=94051990)
[  4000/12147] single node network: 786  |  multi-node network: 214  (cumulative single: 3374, multi: 626)  largest_nodes=7 (seed=94051990)
[  5000/12147] single node network: 719  |  multi-node network: 281  (cumulative single: 4093, multi: 907)  largest_nodes=8 (seed=208930302)
[  6000/12147] single node network: 647  |  multi-node network: 353  (cumulative single: 4740, multi: 1260)  largest_nodes=27 (seed=289228146)
[  7000/12147] single node network: 598  |  multi-node network: 402  (c

Deduplicate subnetworks which are a subset of a larger subnetwork
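
The subset rule can be sketched as follows. This is an illustrative outline under stated assumptions, not the body of `deduplicate_subnetworks_by_node_subset`: the function name `drop_subset_subnetworks` and the `{subnetwork_id: set_of_txn_ids}` input format are hypothetical, and the pairwise comparison is quadratic in the number of sub-networks.

```python
def drop_subset_subnetworks(node_sets):
    """Drop every sub-network whose node set is contained in another one.

    node_sets : {subnetwork_id: set of txn ids} (hypothetical input format)
    Returns (kept_ids, dropped), where dropped maps dropped_id -> kept_id.
    """
    # Visit larger sets first, so any superset of a set is seen before it.
    # Ties are broken by id, so of two identical sets the lower id is kept.
    order = sorted(node_sets, key=lambda sid: (-len(node_sets[sid]), sid))

    kept, dropped = [], {}
    for sid in order:
        nodes = node_sets[sid]
        container = next((k for k in kept if nodes <= node_sets[k]), None)
        if container is None:
            kept.append(sid)
        else:
            dropped[sid] = container
    return sorted(kept), dropped

node_sets = {0: {1, 2, 3}, 1: {2, 3}, 2: {4}, 3: {1, 2, 3}}
kept_ids, dropped = drop_subset_subnetworks(node_sets)
print(kept_ids, dropped)
```

Here sub-networks 1 and 3 are both contained in sub-network 0 and are dropped, while sub-network 2 shares no nodes with the others and is kept.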

In [None]:
# 2) Deduplicate: drop any subnetwork whose node set is a subset of another
nodes_dedup, edges_dedup, dedup_report, id_map = build_network.deduplicate_subnetworks_by_node_subset(
    nodes_all, edges_all, relabel=False, progress=True
)


Dedup by node-subset: 12147 → 10388 subnetworks kept (1759 removed).
  - drop 163 (seed=279424082) ⊆ kept 164 (seed=44995911)
  - drop 167 (seed=86761325) ⊆ kept 166 (seed=279004203)
  - drop 175 (seed=258961960) ⊆ kept 162 (seed=96258779)
  - drop 202 (seed=86706815) ⊆ kept 165 (seed=86706819)
  - drop 228 (seed=225689154) ⊆ kept 229 (seed=225689150)
  - drop 232 (seed=225689158) ⊆ kept 229 (seed=225689150)
  - drop 273 (seed=94372894) ⊆ kept 454 (seed=94123384)
  - drop 282 (seed=94520801) ⊆ kept 274 (seed=94299416)
  - drop 296 (seed=94687585) ⊆ kept 516 (seed=98629833)
  - drop 338 (seed=94121618) ⊆ kept 310 (seed=94121623)
  - drop 360 (seed=48243178) ⊆ kept 468 (seed=94372683)
  - drop 469 (seed=115106096) ⊆ kept 300 (seed=94123979)
  - drop 483 (seed=94654272) ⊆ kept 468 (seed=94372683)
  - drop 485 (seed=84370891) ⊆ kept 506 (seed=94051990)
  - drop 486 (seed=94372607) ⊆ kept 507 (seed=94003168)
  - drop 487 (seed=94373780) ⊆ kept 455 (seed=94299580)
  - drop 511 (seed=9465427

Merge networks that have overlapping nodes
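
Merging on a single shared node is equivalent to finding connected components over the sub-networks, which a union-find structure handles well. The sketch below is a minimal version of that idea, assuming a threshold of one shared node. The name `merge_by_shared_nodes`, the input format, and the choice of the smallest member id as the group id are assumptions rather than the behaviour of `merge_subnetworks_by_node_overlap`.

```python
from collections import defaultdict

def merge_by_shared_nodes(node_sets):
    """Merge sub-networks that share at least one node.

    node_sets : {subnetwork_id: set of txn ids} (hypothetical input format)
    Returns (groups, members): merged node sets and member ids per group id.
    """
    parent = {sid: sid for sid in node_sets}

    def find(x):
        # Follow parents to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller id as root so it becomes the group id.
            parent[max(ra, rb)] = min(ra, rb)

    first_owner = {}  # node -> first sub-network seen containing it
    for sid in sorted(node_sets):
        for node in node_sets[sid]:
            if node in first_owner:
                union(sid, first_owner[node])
            else:
                first_owner[node] = sid

    groups, members = defaultdict(set), defaultdict(list)
    for sid in sorted(node_sets):
        root = find(sid)
        groups[root] |= node_sets[sid]
        members[root].append(sid)
    return dict(groups), dict(members)

node_sets = {10: {1, 2}, 11: {2, 3}, 12: {4}, 13: {3, 5}}
groups, members = merge_by_shared_nodes(node_sets)
print(groups, members)
```

Sub-networks 10 and 13 share no node directly, but both overlap with 11, so all three end up in one group; this transitive behaviour is what lets a single linking node join several sub-networks.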

In [None]:
txn_expanded, edges_expanded, txn_final, edges_final = build_network.merge_subnetworks_by_node_overlap(
    nodes_dedup, edges_dedup,
    min_shared_nodes=1,
    progress=True,
    print_unmerged=False,  # <- only merged groups printed
    collapse=True
)

Merge by node-overlap (≥1): 10388 → 9588 merged subnetworks.
  group 300: subnetwork_ids=[300, 421, 498, 507, 527, 566, 569, 570, 571, 572, 587] seeds=[94003168, 94123979, 94299895, 94370759, 94370914, 94370920, 94370929, 94370938, 94371207, 94371211, 94371216]
  group 468: subnetwork_ids=[468, 524, 562] seeds=[94372683, 94653853, 94654057]
  group 503: subnetwork_ids=[503, 506, 536, 619] seeds=[94051990, 94184658, 94189129, 94189346]
  group 734: subnetwork_ids=[734, 776, 836, 838] seeds=[54504625, 189799399, 189800868, 190005875]
  group 739: subnetwork_ids=[739, 756, 760] seeds=[189984660, 190009101, 190009102]
  group 741: subnetwork_ids=[741, 814, 815] seeds=[189801215, 190005963, 190006176]
  group 750: subnetwork_ids=[750, 762, 824, 832, 835, 839, 845, 846, 848] seeds=[191107595, 191107780, 191108346, 191108947, 191108949, 191108954, 191108962, 191108972, 191111249]
  group 869: subnetwork_ids=[869, 930] seeds=[175047482, 175186988]
  group 872: subnetwork_ids=[872, 887, 940, 95

In [None]:
print(txn_expanded.shape)
txn_expanded.head(1)

(13069, 5)


Unnamed: 0,txn_id,hop,subnetwork_id,seed_txn,merged_subnetwork_id
0,62195631,0,0,62195631,0


In [None]:
print(txn_final.shape)
print(txn_final['merged_subnetwork_id'].nunique())
txn_final.head(1)

(12147, 4)
9588


Unnamed: 0,merged_subnetwork_id,txn_id,min_hop,seeds_in_group
0,0,62195631,0,[62195631]


In [None]:
print(edges_expanded.shape)
edges_expanded.head(1)

(2685, 7)


Unnamed: 0,src_txn_id,dst_txn_id,src_txn_hop,dst_txn_hop,subnetwork_id,seed_txn,merged_subnetwork_id
0,96258779,258961960,0,1,162,96258779,162


In [None]:
print(edges_final.shape)
print(edges_final['merged_subnetwork_id'].nunique()) # <-- Only 877 subnetworks have 2 or more nodes (i.e. at least 1 edge).
edges_final.head(1)

(2563, 5)
877


Unnamed: 0,merged_subnetwork_id,src_txn_id,dst_txn_id,min_src_hop,min_dst_hop
0,162,96258779,258961960,0,1


Build network summary table

In [None]:
summary_final = build_report.summarise_subnetworks(
    txn_expanded, edges_expanded,
    id_col="merged_subnetwork_id",   # <- merged group id from the overlap step
    sort_by="size"
)

In [None]:
print(len(summary_final))
summary_final.head()

9588


Unnamed: 0,merged_subnetwork_id,txn_ids,node_count,edge_count,depth,seeds,seed_count,linked_txn_count
0,6528,"[10000476, 15254893, 17763829, 17942794, 21627...",183,182,182,[46085970],1,182
1,7703,"[17280710, 68891403, 69064953, 72780249, 86300...",100,99,99,[355102256],1,99
2,6678,"[163653273, 163653275, 163654533, 163654535, 1...",68,67,2,"[163653275, 163654535, 163654536, 163654538, 1...",64,4
3,7777,"[355003589, 355003644, 355003977, 355004152, 3...",60,59,1,"[355003589, 355003644, 355003977, 355004152, 3...",58,2
4,5465,"[27001252, 57167035, 60868547, 69997426, 70014...",51,52,26,"[60868547, 292163396]",2,49


Check the distribution of subnetworks by node count in the dataset

In [None]:
# 1) Pull the node counts as integers
s = pd.to_numeric(summary_final["node_count"], errors="coerce").dropna().astype(int)
n = int(s.size)

# 2) Distribution table (ascending by node_count)
dist = (
    s.value_counts()
     .sort_index()
     .rename_axis("node_count")
     .reset_index(name="count")
)
dist["pct_population"] = (dist["count"] / n * 100).round(2)
dist["cum_count"] = dist["count"].cumsum()
dist["cum_pct_population"] = (dist["cum_count"] / n * 100).round(2)

dist

# About 91% of the subnetworks contain a single illicit node, 6% contain 2 nodes, and 3% contain 3 or more nodes.

Unnamed: 0,node_count,count,pct_population,cum_count,cum_pct_population
0,1,8711,90.85,8711,90.85
1,2,603,6.29,9314,97.14
2,3,111,1.16,9425,98.3
3,4,47,0.49,9472,98.79
4,5,19,0.2,9491,98.99
5,6,10,0.1,9501,99.09
6,7,7,0.07,9508,99.17
7,8,10,0.1,9518,99.27
8,9,7,0.07,9525,99.34
9,10,18,0.19,9543,99.53


--------------
##### Export Subnetwork Tables to BigQuery
--------------


In [None]:
# Define your project ID
project_id = 'extreme-torch-467913-m6'

In [None]:
# Save DataFrame to BigQuery
to_gbq(dataframe = txn_expanded, destination_table = 'networks.network_txn_expanded', project_id=project_id, if_exists='replace')
to_gbq(dataframe = txn_final, destination_table = 'networks.network_txn_final', project_id=project_id, if_exists='replace')
to_gbq(dataframe = edges_expanded, destination_table = 'networks.network_edges_expanded', project_id=project_id, if_exists='replace')
to_gbq(dataframe = edges_final, destination_table = 'networks.network_edges_final', project_id=project_id, if_exists='replace')
to_gbq(dataframe = summary_final, destination_table = 'networks.network_summary', project_id=project_id, if_exists='replace')

100%|██████████| 1/1 [00:00<00:00, 7182.03it/s]
100%|██████████| 1/1 [00:00<00:00, 8322.03it/s]
100%|██████████| 1/1 [00:00<00:00, 6512.89it/s]
100%|██████████| 1/1 [00:00<00:00, 7724.32it/s]
100%|██████████| 1/1 [00:00<00:00, 9776.93it/s]


END