# 3.1 Whale Detection (Value-Based + Centrality-Based)

In this part, we detect **whales** using two major categories of metrics:

1. **Value-based metrics**  
   - Total outgoing ETH value  
   - Total incoming ETH value  
   - Net flow  
   - Transaction counts (in / out / total)

2. **Centrality-based metrics**  
   - In-degree and out-degree  
   - PageRank  
   - HITS (hubs and authorities)

Combining these metrics allows us to identify:
- Addresses transferring large amounts of ETH (value whales)
- Addresses that play structurally important roles in the network (centrality whales)

The combined output is a unified whale label that will be used in the rest of Chapter 3.


## 1. Imports + Load Data

We load two datasets:

1. Clean ETH transaction data (`load_clean_transactions()`)
2. The heterogeneous graph `G` built in **2.3**

These datasets will be used to compute both value-based and graph-based whale metrics.
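As a side note on the graph file: `nx.write_gpickle` is not available in NetworkX 3.x, so the graph from **2.3** has to be written with the standard `pickle` module to match the loader used below. The following is a minimal round-trip sketch on a hypothetical toy graph (`toy_G` and the temporary file name are illustrative, and the real `G` may be a different graph class).

```python
import pickle
import tempfile
from pathlib import Path

import networkx as nx

# Hypothetical toy graph standing in for the heterogeneous graph G.
toy_G = nx.DiGraph()
toy_G.add_edge("0xaaa", "0xbbb", value=1.0)
toy_G.add_edge("0xbbb", "0xccc", value=2.5)

with tempfile.TemporaryDirectory() as tmp_dir:
    graph_path = Path(tmp_dir) / "toy_graph.gpickle"

    # Save with the standard library instead of nx.write_gpickle.
    with graph_path.open("wb") as f:
        pickle.dump(toy_G, f, protocol=pickle.HIGHEST_PROTOCOL)

    # Load it back the same way the notebook loads G.
    with graph_path.open("rb") as f:
        loaded_G = pickle.load(f)

print("Nodes:", loaded_G.number_of_nodes())
print("Edges:", loaded_G.number_of_edges())
```

Only unpickle files you produced yourself, since `pickle.load` can execute arbitrary code.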


In [7]:
import os
import sys

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
import pickle
from pathlib import Path

plt.rcParams["figure.figsize"] = (10, 6)
plt.rcParams["axes.grid"] = True

PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), "..", ".."))
sys.path.append(PROJECT_ROOT)

# print("PROJECT_ROOT:", PROJECT_ROOT)

from src.data.load_data import (
    load_clean_transactions,
)


In [8]:
# Load ETH transactions
tx = load_clean_transactions()
print("Transactions loaded:", len(tx))
tx.head()


Transactions loaded: 13268


Unnamed: 0,hash,from_address,to_address,block_number,value,block_timestamp
0,0xd8ec648861cf4de73f18f9a034623eeded1b26ec7246...,0xa9264494a92ced04747ac84fc9ca5a0b9549b491,0x835033bd90b943fa0d0f8e5382d9dc568d3fbd96,23772289,4.699994e+19,2025-11-11 00:00:11+00:00
1,0x5843a9e865f9b7222ddb376ea2869c50b389c3a0d858...,0xc0ffeebabe5d496b2dde509f9fa189c25cf29671,0xc0ffeebabe5d496b2dde509f9fa189c25cf29671,23772292,5.817089e+19,2025-11-11 00:00:47+00:00
2,0x131571aec26cd23b0134a97341acf9fb0b559b085b68...,0xe50008c1d110da8e56982f46a9188a292ee90a7b,0x1ab4973a48dc892cd9971ece8e01dcc7688f8f23,23772292,3.390013e+18,2025-11-11 00:00:47+00:00
3,0xa1b7caf05dd498111a40ffe269fefb2ae574dde53da0...,0xe40d548eb4fa4d9188fd21723f2fd377456c0876,0x28c6c06298d514db089934071355e5743bf21d60,23772292,7.999922e+18,2025-11-11 00:00:47+00:00
4,0xc1d8e4ffa9e7864d5a38f84aa4532308d411ba35f82e...,0x0eb1665de6473c624dcd087fdeee27418d65ed59,0xa03400e098f4421b34a3a44a1b4e571419517687,23772292,6.318854e+18,2025-11-11 00:00:47+00:00


In [9]:
# Load heterogeneous graph G from 2.3
HETERO_GRAPH_PATH = os.path.join(PROJECT_ROOT, "data", "processed", "heterogeneous_graph.gpickle")

if not os.path.exists(HETERO_GRAPH_PATH):
    raise FileNotFoundError(f"{HETERO_GRAPH_PATH} not found. Please run 2.3 and save G with pickle.dump().")

with Path(HETERO_GRAPH_PATH).open("rb") as f:
    G = pickle.load(f)

print("Loaded G")
print("Nodes:", G.number_of_nodes())
print("Edges:", G.number_of_edges())


Loaded G
Nodes: 26447
Edges: 30638


## 2. Value-Based Whale Metrics

We compute the following address-level metrics:

- Total outgoing value
- Total incoming value
- Net flow
- Number of outgoing and incoming transactions
- Total transaction count

These metrics define *value whales*: addresses handling large amounts of ETH.
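To make the metric definitions concrete, here is a small sketch on hypothetical toy transfers. It assumes the `value` column is denominated in wei (1 ETH = 10^18 wei), which the magnitudes in the sample rows printed earlier suggest; the addresses and amounts are made up.

```python
import pandas as pd

WEI_PER_ETH = 10**18  # assumption: `value` is stored in wei

# Hypothetical transfers: A -> B, A -> C, B -> C, C -> D
toy_tx = pd.DataFrame({
    "from_address": ["A", "A", "B", "C"],
    "to_address":   ["B", "C", "C", "D"],
    "value":        [2e18, 1e18, 5e17, 3e18],
})

toy_out = (
    toy_tx.groupby("from_address")["value"]
          .agg(["sum", "count"])
          .rename(columns={"sum": "total_out_value", "count": "n_out_tx"})
)
toy_in = (
    toy_tx.groupby("to_address")["value"]
          .agg(["sum", "count"])
          .rename(columns={"sum": "total_in_value", "count": "n_in_tx"})
)

# Outer join keeps addresses that only send (A) or only receive (D).
toy_value = toy_out.join(toy_in, how="outer").fillna(0.0)
toy_value["net_flow"] = toy_value["total_in_value"] - toy_value["total_out_value"]

# Convert the value columns from wei to ETH for readability.
value_cols = ["total_out_value", "total_in_value", "net_flow"]
toy_value_eth = toy_value[value_cols] / WEI_PER_ETH
print(toy_value_eth)
```

A positive `net_flow` marks a net receiver and a negative one a net sender; over a closed set of transfers the net flows sum to zero.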


In [11]:
# Outgoing stats
out_stats = (
    tx.groupby("from_address")["value"]
      .agg(["sum", "count"])
      .rename(columns={"sum": "total_out_value", "count": "n_out_tx"})
)

# Incoming stats
in_stats = (
    tx.groupby("to_address")["value"]
      .agg(["sum", "count"])
      .rename(columns={"sum": "total_in_value", "count": "n_in_tx"})
)

# Merge
addr_value = out_stats.join(in_stats, how="outer").fillna(0.0)

# Additional metrics
addr_value["n_total_tx"] = addr_value["n_out_tx"] + addr_value["n_in_tx"]
addr_value["net_flow"] = addr_value["total_in_value"] - addr_value["total_out_value"]

print("Addresses:", len(addr_value))
addr_value.head()


Addresses: 7796


Unnamed: 0,total_out_value,n_out_tx,total_in_value,n_in_tx,n_total_tx,net_flow
0x0000000000000068f116a894984e2db1123eb395,0.0,0.0,9.49143e+19,15.0,15.0,9.49143e+19
0x0000000000001ff3684f28c67538d4d072c22734,0.0,0.0,6.716182e+20,47.0,47.0,6.716182e+20
0x0000000000a39bb272e79075ade125fd351887ac,0.0,0.0,1.401e+20,23.0,23.0,1.401e+20
0x00000000219ab540356cbb839cbe05303d7705fa,0.0,0.0,1.853817e+22,347.0,347.0,1.853817e+22
0x00000047bb99ea4d791bb749d970de71ee0b1a34,0.0,0.0,1.306812e+20,15.0,15.0,1.306812e+20


## 3. Defining Value-Based Whales

We classify whales using percentile thresholds for:

- Total outgoing value
- Total incoming value

An address at or above the chosen percentile (99.9 by default) on either metric is flagged as a value whale. This captures the heaviest hitters in terms of ETH transfer volume.
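The following sketch shows how a percentile cutoff translates into a number of flagged addresses. It uses synthetic heavy-tailed totals, not the real data; `flag_top_percentile` is a hypothetical helper, not part of the project code.

```python
import numpy as np
import pandas as pd


def flag_top_percentile(values: pd.Series, pct: float) -> pd.Series:
    """Return a boolean mask for entries at or above the pct-th percentile."""
    threshold = np.percentile(values, pct)
    return values >= threshold


# Synthetic, heavy-tailed totals for 7,796 hypothetical addresses.
rng = np.random.default_rng(seed=0)
toy_totals = pd.Series(rng.lognormal(mean=0.0, sigma=3.0, size=7796))

toy_mask = flag_top_percentile(toy_totals, 99.9)
n_flagged = int(toy_mask.sum())
print("Flagged addresses:", n_flagged)
```

With 7,796 addresses, a 99.9th-percentile cutoff keeps roughly the top 0.1%, i.e. about 8 addresses per metric when the top values are distinct. Lowering the percentile widens the whale set accordingly.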


In [12]:
# Percentile thresholds (tunable)
out_pct = 99.9
in_pct = 99.9

out_th = np.percentile(addr_value["total_out_value"], out_pct)
in_th  = np.percentile(addr_value["total_in_value"],  in_pct)

print(f"{out_pct}th percentile (out):", out_th)
print(f"{in_pct}th percentile (in):",  in_th)


99.9th percentile (out): 2.502564455775112e+22
99.9th percentile (in): 3.258024096342926e+22


In [13]:
addr_value["is_out_whale"] = addr_value["total_out_value"] >= out_th
addr_value["is_in_whale"]  = addr_value["total_in_value"]  >= in_th

addr_value["is_whale_value"] = addr_value["is_out_whale"] | addr_value["is_in_whale"]

addr_value["is_whale_value"].value_counts()


is_whale_value
False    7785
True       11
Name: count, dtype: int64

## 4. Centrality-Based Whale Metrics

We now detect whales based on their structural importance in the heterogeneous graph.

We compute:

- In-degree and out-degree
- Total degree
- PageRank
- HITS (hubs and authorities)

These quantify how "influential" an address is, independent of transaction volume.
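A small sketch of the degree metrics on a hypothetical toy graph. It assumes a plain `nx.DiGraph`, where repeated transfers between the same pair collapse into a single edge; if `G` is actually a multigraph, each parallel edge would be counted separately.

```python
import networkx as nx
import pandas as pd

toy_G = nx.DiGraph()
toy_G.add_edge("A", "B")
toy_G.add_edge("A", "B")  # repeated transfer: still a single edge in a DiGraph
toy_G.add_edge("C", "B")
toy_G.add_edge("B", "D")

toy_degrees = pd.DataFrame({
    "in_degree": pd.Series(dict(toy_G.in_degree())),
    "out_degree": pd.Series(dict(toy_G.out_degree())),
    "degree": pd.Series(dict(toy_G.degree())),
}).fillna(0).astype(int)

print(toy_degrees)
```

Under that assumption, degree counts distinct counterparties rather than transactions, so it can be smaller than the transaction counts from the value-based metrics for the same address.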


In [14]:
# Degree metrics
deg_in = dict(G.in_degree())
deg_out = dict(G.out_degree())
deg_total = dict(G.degree())

centrality_df = pd.DataFrame({
    "in_degree": pd.Series(deg_in),
    "out_degree": pd.Series(deg_out),
    "degree": pd.Series(deg_total)
}).fillna(0).astype(int)

centrality_df.head()


Unnamed: 0,in_degree,out_degree,degree
0xd298c80f6e9e64a54b5a85b1733d76ee58837259,1,1,2
0xae9b92019f3e83d4451d48124f5abd8fc3124de6,1,2,3
0xd8193304176033a5f48976d1881bdd46d36c8523,1,1,2
0x0d8af920bb569f8a7d45485581dd989a7b14d390,1,0,1
0xf78abd170cff445fefe7248336b55e2c18906f00,1,0,1


## 5. PageRank

PageRank identifies nodes that receive flows from other important nodes.

High PageRank addresses often include:
- Exchanges
- Large liquidity nodes
- Contract hubs
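To illustrate the idea on a hypothetical toy graph: in the sketch below, `Y` and `A` both have in-degree 1, but `Y` is fed by the well-connected node `X`, so it ends up with a much higher PageRank. The graph and node names are made up for illustration.

```python
import networkx as nx

toy_G = nx.DiGraph()
toy_G.add_edges_from([
    ("A", "X"), ("B", "X"), ("C", "X"),  # X collects flows from three nodes
    ("X", "Y"),                          # Y receives only from X
    ("D", "A"),                          # A receives only from the peripheral node D
])

toy_pr = nx.pagerank(toy_G, alpha=0.85, max_iter=100)

for node, score in sorted(toy_pr.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{node}: {score:.4f} (in-degree {toy_G.in_degree(node)})")
```

The scores sum to 1 across all nodes, so each value can be read as a share of the total "importance" in the graph.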


In [15]:
print("Computing PageRank...")
pr = nx.pagerank(G, alpha=0.85, max_iter=100)
centrality_df["pagerank"] = pd.Series(pr)
# Assign the result back instead of calling fillna(..., inplace=True) on a column selection
centrality_df["pagerank"] = centrality_df["pagerank"].fillna(0)

centrality_df.sort_values("pagerank", ascending=False).head(10)


Computing PageRank...


Unnamed: 0,in_degree,out_degree,degree,pagerank
0xdac17f958d2ee523a2206206994597c13d831ec7,11228,0,11228,0.172156
0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48,9311,0,9311,0.14128
0xc02aaa39b223fe8d0a0e5c4f27ead9083c756cc2,987,0,987,0.016076
0x28c6c06298d514db089934071355e5743bf21d60,326,93,419,0.009756
0x00000000219ab540356cbb839cbe05303d7705fa,311,0,311,0.0062
0xa9ac43f5b5e38155a288d1a01d2cbc4478e14573,121,80,201,0.004697
0xf30ba13e4b04ce5dc4d254ae5fa95477800f0eb0,114,121,235,0.003048
0xa9d1e08c7793af67e9d92fe308d5697fb81d3e43,105,2,107,0.002595
0xa1abfa21f80ecf401bd41365adbb6fef6fefdf09,90,93,183,0.002467
0x2cff890f0378a11913b6129b2e97417a2c302680,63,19,82,0.002045


## 6. HITS (Hubs and Authorities)

HITS distinguishes:
- **Hubs**: nodes pointing to authoritative nodes  
- **Authorities**: nodes pointed to by many hubs
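A minimal sketch of the hub/authority split on a hypothetical toy graph. Because the numerical solver can return tiny negative scores that are effectively zero, the sketch also clips them at zero; that clipping step is a suggestion and an assumption that such values are round-off, not something the notebook itself does.

```python
import networkx as nx
import pandas as pd

toy_G = nx.DiGraph()
toy_G.add_edges_from([
    ("A", "X"), ("B", "X"), ("C", "X"),  # X is pointed to by three senders
    ("A", "Y"),                          # A also points to Y
])

toy_hubs, toy_auth = nx.hits(toy_G, max_iter=100, normalized=True)

# Treat tiny negative values as numerical noise (assumption) and clip them to 0.
hub_s = pd.Series(toy_hubs).clip(lower=0)
auth_s = pd.Series(toy_auth).clip(lower=0)

print("Top hub:", hub_s.idxmax())
print("Top authority:", auth_s.idxmax())
```

Here `A` is the strongest hub because it points to both receivers, and `X` is the strongest authority because every hub points to it.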


In [16]:
print("Computing HITS...")
hubs, auth = nx.hits(G, max_iter=100, normalized=True)

centrality_df["hub_score"] = pd.Series(hubs)
centrality_df["authority_score"] = pd.Series(auth)

centrality_df.head()


Computing HITS...


Unnamed: 0,in_degree,out_degree,degree,pagerank,hub_score,authority_score
0xd298c80f6e9e64a54b5a85b1733d76ee58837259,1,1,2,3.4e-05,3.938434e-07,0.0
0xae9b92019f3e83d4451d48124f5abd8fc3124de6,1,2,3,2.6e-05,6.076628e-05,4.592615e-05
0xd8193304176033a5f48976d1881bdd46d36c8523,1,1,2,4.1e-05,4.895364e-24,1.097104e-18
0x0d8af920bb569f8a7d45485581dd989a7b14d390,1,0,1,4.8e-05,0.0,-1.4232719999999998e-19
0xf78abd170cff445fefe7248336b55e2c18906f00,1,0,1,4.7e-05,0.0,-5.1244249999999995e-20


## 7. Combine Value-Based and Centrality-Based Whales

We merge:
- Value metrics (`addr_value`)
- Centrality metrics (`centrality_df`)

Then define **centrality whales** using percentile thresholds.

Finally, we create a **unified whale label**:

`is_whale = is_whale_value OR is_whale_centrality`


In [19]:
# Merge on address (index)
addr_all = addr_value.join(centrality_df, how="left").fillna(0)

print("Merged shape:", addr_all.shape)
addr_all.head()


Merged shape: (7796, 15)


Unnamed: 0,total_out_value,n_out_tx,total_in_value,n_in_tx,n_total_tx,net_flow,is_out_whale,is_in_whale,is_whale_value,in_degree,out_degree,degree,pagerank,hub_score,authority_score
0x0000000000000068f116a894984e2db1123eb395,0.0,0.0,9.49143e+19,15.0,15.0,9.49143e+19,False,False,False,14,0,14,0.000286,-0.0,2.2747760000000003e-17
0x0000000000001ff3684f28c67538d4d072c22734,0.0,0.0,6.716182e+20,47.0,47.0,6.716182e+20,False,False,False,40,0,40,0.000513,-0.0,0.000671533
0x0000000000a39bb272e79075ade125fd351887ac,0.0,0.0,1.401e+20,23.0,23.0,1.401e+20,False,False,False,20,0,20,0.000374,-0.0,8.615999e-06
0x00000000219ab540356cbb839cbe05303d7705fa,0.0,0.0,1.853817e+22,347.0,347.0,1.853817e+22,False,False,False,311,0,311,0.0062,-0.0,3.90484e-09
0x00000047bb99ea4d791bb749d970de71ee0b1a34,0.0,0.0,1.306812e+20,15.0,15.0,1.306812e+20,False,False,False,10,2,12,0.000195,3.6e-05,0.0002554334


## 8. Define Centrality-Based Whales

An address is flagged as a centrality whale if it is at or above the chosen percentile (99.9 by default) of any of:

- Degree
- PageRank
- Authority score


In [20]:
# Percentile cutoffs
deg_pct = 99.9
pr_pct = 99.9
auth_pct = 99.9

degree_th = np.percentile(addr_all["degree"], deg_pct)
pagerank_th = np.percentile(addr_all["pagerank"], pr_pct)
auth_th = np.percentile(addr_all["authority_score"], auth_pct)

addr_all["is_whale_degree"] = addr_all["degree"] >= degree_th
addr_all["is_whale_pagerank"] = addr_all["pagerank"] >= pagerank_th
addr_all["is_whale_authority"] = addr_all["authority_score"] >= auth_th

# Combine centrality whales
addr_all["is_whale_centrality"] = (
    addr_all["is_whale_degree"] |
    addr_all["is_whale_pagerank"] |
    addr_all["is_whale_authority"]
)

addr_all["is_whale_centrality"].value_counts()


is_whale_centrality
False    7782
True       14
Name: count, dtype: int64

## 9. Final Whale Label

We define the unified whale label:

`is_whale = is_whale_value OR is_whale_centrality`

In [21]:
addr_all["is_whale"] = addr_all["is_whale_value"] | addr_all["is_whale_centrality"]
addr_all["is_whale"].value_counts()


is_whale
False    7775
True       21
Name: count, dtype: int64
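Since the unified label is a union, the reported counts imply by inclusion-exclusion that 11 + 14 - 21 = 4 addresses are flagged by both criteria. The sketch below shows one way to check such an overlap, using a hypothetical toy label table rather than `addr_all`.

```python
import pandas as pd

# Hypothetical labels for five toy addresses.
toy_labels = pd.DataFrame(
    {
        "is_whale_value":      [True, True, False, False, False],
        "is_whale_centrality": [True, False, True, False, False],
    },
    index=["A", "B", "C", "D", "E"],
)

# Unified label: flagged by either criterion.
toy_labels["is_whale"] = toy_labels["is_whale_value"] | toy_labels["is_whale_centrality"]

# Addresses flagged by both criteria.
n_both = int((toy_labels["is_whale_value"] & toy_labels["is_whale_centrality"]).sum())

# 2x2 table of value label vs. centrality label.
overlap = pd.crosstab(toy_labels["is_whale_value"], toy_labels["is_whale_centrality"])

print(overlap)
print("Flagged by both:", n_both)
```

Running the same cross-tabulation on `addr_all` would show how many whales are value-only, centrality-only, or both.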

## 10. Save Whale Results for Downstream Analysis

The saved table will be used in:

- 4.2 Whale Ego Graphs
- 4.3 Whale Flow + Time Series Analysis
- 4.4 Whale Risk Insights

In [23]:
OUTPUT_DIR = os.path.join(PROJECT_ROOT, "data", "processed")
os.makedirs(OUTPUT_DIR, exist_ok=True)

OUTPUT_PATH = os.path.join(OUTPUT_DIR, "whale_detection_value_and_centrality.parquet")

addr_all.to_parquet(OUTPUT_PATH)
print("Saved whale metrics to:", OUTPUT_PATH)


Saved whale metrics to: /Users/dada/Developer/italy_proj/DataMining/EhereumNetworkAnalysis/data/processed/whale_detection_value_and_centrality.parquet


## 11. Summary

In this notebook, we:

- Computed value-based whale metrics  
- Computed several graph-based centrality metrics  
- Defined both value whales and centrality whales  
- Produced a unified whale label (`is_whale`)  
- Saved all results for use in later notebooks

This completes the whale detection stage.  
