Piridi et al. "SONAR: A Large-Scale Social Network Benchmark for Graph-Based Anomaly Detection." Submitted to SIGIR 2026.
SONAR (SOcial Network Anomaly Resource) is the largest publicly available heterogeneous graph benchmark for anomaly detection in social networks. Built from real X (formerly Twitter) data spanning 11 months of activity during the Indian Farmers' Protest, SONAR captures 3.8 million users, 3.6 million posts, and 7 relation types — enabling the first systematic evaluation of graph anomaly detectors at realistic social network scale.
Graph anomaly detection research is held back by benchmarks that are too small, too simple, and too homogeneous. Existing datasets top out at 1M users with a single relation type, while real social platforms have billions of users interacting through diverse mechanisms. No prior benchmark provides both large-scale authentic social network data and controlled anomaly ground truth at multiple granularities.
Comparison with existing benchmarks
| Dataset | Users | Relations | Heterogeneous | Anomaly Labels |
|---|---|---|---|---|
| Cresci-15 | 5,301 | 1 | | User only |
| TwiBot-20 | 229,580 | 1 | | User only |
| MGTAB | 410,199 | 4 | ✓ | User only |
| TwiBot-22 | 1,000,000 | 1 | | User only |
| SONAR-Large | 3,797,980 | 7 | ✓ | User + Post |
SONAR addresses four critical gaps:
- 3.8x larger scale than TwiBot-22 (3.8M vs 1M users), enabling evaluation at realistic social network sizes
- Rich multi-relational structure with 3 node types and 7 edge types capturing the full spectrum of X/Twitter interactions (posting, replying, quoting, mentioning, hashtag usage)
- Dual-granularity anomaly labels at both user and post level — the first social network benchmark to offer this — enabling fine-grained, multi-task evaluation
- Controlled anomaly injection using established PyGOD methods: structural anomalies (coordinated cliques simulating bot networks) and contextual anomalies (attribute perturbations) at a 5% rate
SONAR is available at three scales to support both rapid prototyping and scalability research:
| Variant | Users | Posts | Hashtags | Total Nodes | Edges | Anomalies |
|---|---|---|---|---|---|---|
| Small | 18,430 | 18,429 | 1 | 36,860 | 49,865 | 1,818 |
| Medium | 424,446 | 422,032 | 18 | 846,496 | 1,112,995 | 41,830 |
| Large | 3,797,980 | 3,611,869 | 152 | 7,410,001 | 10,204,721 | 365,861 |
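A quick sanity check (plain Python, with node and anomaly counts copied from the table above) confirms that the injected anomaly rate sits near the stated 5% in all three variants:

```python
# Total node / anomaly counts from the variants table above.
variants = {
    "small":  (36_860, 1_818),
    "medium": (846_496, 41_830),
    "large":  (7_410_001, 365_861),
}
ratios = {name: anomalies / nodes for name, (nodes, anomalies) in variants.items()}
for name, r in ratios.items():
    print(f"{name}: {r:.4f}")  # each ratio is ≈ 0.049
```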
The heterogeneous graph models the full X/Twitter interaction spectrum:
| Edge Type | Source | Target | Semantics |
|---|---|---|---|
| post_original | User | Post | User authors a post |
| post_quote | User | Post | User quotes a post |
| post_reply | User | Post | User replies to a post |
| quotes | Post | Post | Post quotes another post |
| replies | Post | Post | Post replies to another post |
| mentions | Post | User | Post mentions a user |
| contains | Post | Hashtag | Post contains a hashtag |
The figure below shows an example subgraph from SONAR illustrating the multi-relational structure with users (blue), tweets (green), and hashtags (purple):
| Node Type | Dim | Features |
|---|---|---|
| User | 4 | followers_count, following_count, listed_count, post_count |
| Post | 772 | repost_count, quote_count, like_count, post_type + 768-d Universal Sentence Encoder embedding |
| Hashtag | 1 | category label |
The homogeneous representation projects all nodes into a shared 16-dimensional feature space suitable for standard PyGOD detectors.
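The exact projection SONAR applies is not specified here; as an illustrative sketch, one simple way to map per-type features of different widths into a shared 16-dimensional space is a fixed random linear map per node type (raw dimensions taken from the feature table above, toy data):

```python
import numpy as np

rng = np.random.default_rng(42)
TARGET_DIM = 16  # shared feature space of the homogeneous variant

# Toy per-type feature matrices; raw dims match the table above.
raw = {
    "user":    rng.normal(size=(5, 4)),    # followers, following, listed, posts
    "tweet":   rng.normal(size=(5, 772)),  # counts + USE embedding
    "hashtag": rng.normal(size=(5, 1)),    # category label
}

# Hypothetical projection: one fixed, scaled random linear map per type.
proj = {t: rng.normal(size=(m.shape[1], TARGET_DIM)) / np.sqrt(m.shape[1])
        for t, m in raw.items()}
x = np.vstack([raw[t] @ proj[t] for t in ("user", "tweet", "hashtag")])
print(x.shape)  # → (15, 16)
```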
SONAR injects two complementary anomaly types at a 5% rate:
- Structural anomalies: Coordinated cliques where selected users are fully connected to selected posts, simulating bot networks that artificially amplify content
- Contextual anomalies: Attribute perturbations using Euclidean distance maximization, simulating accounts with suspicious engagement metrics that deviate from their structural neighborhood
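The two injection schemes can be sketched in a few lines of NumPy. This is a toy illustration, not SONAR's actual pipeline (which uses PyGOD's generators); all sizes here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bipartite graph: 200 users, 200 posts, 8-d features.
n_users, n_posts = 200, 200
x = rng.normal(size=(n_users + n_posts, 8))
labels = np.zeros(n_users + n_posts, dtype=int)

# Structural anomalies: fully connect a small set of users to a small
# set of posts, mimicking a bot network amplifying content.
bad_users = rng.choice(n_users, size=10, replace=False)
bad_posts = n_users + rng.choice(n_posts, size=10, replace=False)
clique_edges = [(u, p) for u in bad_users for p in bad_posts]
labels[bad_users] = 1
labels[bad_posts] = 1

# Contextual anomalies: replace each victim's features with the most
# Euclidean-distant of a set of randomly drawn candidate feature vectors.
victims = rng.choice(n_users + n_posts, size=20, replace=False)
for v in victims:
    candidates = rng.choice(n_users + n_posts, size=50, replace=False)
    dists = np.linalg.norm(x[candidates] - x[v], axis=1)
    x[v] = x[candidates[dists.argmax()]]
labels[victims] = 1

print(len(clique_edges), int(labels.sum()))  # 100 clique edges, ≤40 anomalous nodes
```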
```shell
# Install PyTorch first (see https://pytorch.org/get-started)
pip install torch
# Then install sonar-graph
pip install sonar-graph
```

For development (includes torch-sparse, torch-scatter, pytest, ruff, jupyter):

```shell
git clone https://github.com/hpiridi/sonar.git
cd sonar
pip install -e ".[dev]"
```

```python
from sonar import SONAR, dataset_summary, evaluate_detector

# Load small dataset (auto-downloaded, ~60MB)
dataset = SONAR(root="./data", name="small", anomalies=True)
data = dataset[0]
print(dataset_summary(data))
# {'type': 'homogeneous', 'num_nodes': 36860, 'num_edges': 49865,
#  'num_features': 16, 'num_anomalies': 1818, 'anomaly_ratio': 0.0493}

# Run a detector
from pygod.detector import DOMINANT
detector = DOMINANT(epoch=5, gpu=0)
detector.fit(data)
_, score = detector.predict(data, return_pred=True, return_score=True)

# Evaluate
print(evaluate_detector(data.y_outlier, score))
# {'roc_auc': 0.7384, 'average_precision': 0.0825, 'recall_at_k': 0.0286}
```

Load the heterogeneous variant to access the full multi-relational structure:

```python
dataset = SONAR(root="./data", name="small", anomalies=False,
                representation="heterogeneous")
data = dataset[0]
# HeteroData(user={x=[18430, 4]}, tweet={x=[18429, 772]}, hashtag={x=[1, 1]}, ...)
```

We benchmark 16 detectors spanning deep graph, classical graph, and non-graph approaches on SONAR-Small:
| Type | Detector | ROC-AUC | Avg Precision | Recall@k | Time (s) | Device |
|---|---|---|---|---|---|---|
| Deep Graph | AdONE | 0.8459 | 0.1672 | 0.0875 | 16.12 | GPU |
| | DONE | 0.8407 | 0.1599 | 0.0721 | 15.92 | GPU |
| | GCNAE (GAE) | 0.8025 | 0.1806 | 0.1518 | 0.80 | GPU |
| | DOMINANT | 0.7384 | 0.0825 | 0.0286 | 15.85 | GPU |
| | CONAD | 0.7375 | 0.0824 | 0.0292 | 24.84 | GPU |
| | AnomalyDAE | 0.6858 | 0.2569 | 0.3388 | 16.15 | GPU |
| | DMGD | 0.6366 | 0.0646 | 0.0237 | 140.81 | CPU |
| | ONE | 0.5705 | 0.1257 | 0.1430 | 17.79 | GPU |
| | CoLA | 0.3528 | 0.0544 | 0.1194 | 0.79 | GPU |
| | OCGNN | 0.2294 | 0.0315 | 0.0270 | 0.92 | GPU |
| Classical Graph | ANOMALOUS | 0.7997 | 0.4305 | 0.4455 | 11.76 | GPU |
| | Radar | 0.7997 | 0.4305 | 0.4455 | 207.45 | CPU |
| | SCAN | 0.7526 | 0.5223 | 0.5198 | 44.97 | GPU |
| Non-graph | IF | 0.6518 | 0.1381 | 0.1865 | 0.62 | CPU |
| | MLPAE | 0.5680 | 0.0875 | 0.1078 | 35.27 | CPU |
| | LOF | 0.4284 | 0.0589 | 0.0567 | 1.38 | CPU |
Note: PyGOD's `GAE` implements a GCN-based autoencoder (GCNAE), not the variational GAE from Kipf & Welling (2016). DMGD and Radar ran on CPU due to GPU OOM. Three detectors (GAAN, GADNR, GUIDE) are excluded due to OOM or version incompatibility.
- Deep graph methods lead on ranking but not precision: AdONE and DONE achieve the best ROC-AUC (84.59%, 84.07%), indicating strong overall separation between anomalous and normal nodes. However, their AP (16.72%, 15.99%) and Recall@k (8.75%, 7.21%) are far lower, revealing that deep autoencoders produce smooth, continuous anomaly scores that rank well in aggregate but fail to concentrate true anomalies at the top of the prediction list.
- Classical graph methods excel at precision: SCAN achieves the highest AP (52.23%) and Recall@k (51.98%) despite a lower ROC-AUC (75.26%). Its discrete structural clustering produces fewer but more precise predictions (933 outliers detected vs. AdONE's 3,686), making it more suitable for practical settings where analysts investigate top-k alerts. ANOMALOUS and Radar both reach ROC-AUC of 0.80 with AP of 43.05%, showing that classical graph-aware methods effectively capture both structural and contextual anomalies.
- ROC-AUC alone is misleading for anomaly detection: The divergence between ROC-AUC and AP/Recall@k across detectors highlights the importance of evaluating with multiple metrics. A detector with high ROC-AUC may still produce many false positives at any practical operating threshold, while a lower-ROC-AUC detector like SCAN can be far more actionable.
- Non-graph baselines provide context: Isolation Forest (ROC-AUC 0.65) and MLPAE (0.57) show that feature-only methods capture some signal, but graph-aware methods substantially outperform them, validating the importance of relational structure.
- Efficiency varies by over 300x: IF completes in 0.62 s while Radar requires 207.45 s (on CPU), highlighting significant runtime-accuracy trade-offs across method types.
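The three reported metrics are easy to reimplement. The sketch below is a plain-NumPy illustration written for clarity on a toy example (it ignores tied scores and is not guaranteed to match `evaluate_detector` exactly):

```python
import numpy as np

def roc_auc(y, s):
    # Probability that a random anomaly outscores a random normal (rank formula).
    order = np.argsort(s)
    ranks = np.empty(len(s))
    ranks[order] = np.arange(1, len(s) + 1)
    pos = y == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def average_precision(y, s):
    # Mean of precision values at the rank of each true anomaly.
    y_sorted = y[np.argsort(-s)]
    precision = np.cumsum(y_sorted) / np.arange(1, len(y) + 1)
    return (precision * y_sorted).sum() / y.sum()

def recall_at_k(y, s, k=None):
    # Fraction of true anomalies among the k highest-scored nodes.
    k = k or int(y.sum())  # default: k = number of true anomalies
    return y[np.argsort(-s)[:k]].sum() / y.sum()

y = np.array([0, 0, 1, 0, 1, 0])
s = np.array([0.1, 0.4, 0.9, 0.2, 0.3, 0.8])
print(roc_auc(y, s), average_precision(y, s), recall_at_k(y, s))
# → 0.75 0.75 0.5
```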
See `results/` for full JSON results.
Run a single detector:

```shell
uv run python run_detector.py --dataset-name small --algorithm DOMINANT --epoch 5
```

Run all detectors:

```shell
bash run_all.sh
```

Use a custom dataset:

```shell
uv run python run_detector.py --dataset path/to/graph.pickle --algorithm DOMINANT
```

Benchmark configurations (epoch, contamination, detector list) are documented in `benchmarks/configs/small.yaml`.
```
sonar/                        # Python package (pip install sonar-graph)
    dataset.py                # PyG InMemoryDataset loader with auto-download
    utils.py                  # evaluate_detector(), dataset_summary()
tests/                        # pytest suite (17 fast + 8 slow tests)
notebooks/
    quickstart.ipynb          # Load, explore, detect, evaluate
    benchmark_analysis.ipynb  # Reproduce paper tables and figures
results/                      # Pre-computed benchmark results (JSON)
benchmarks/configs/           # Hyperparameter configurations
scripts/                      # Data conversion utilities
run_detector.py               # CLI benchmark runner
run_all.sh                    # Run all detectors
```
| Variant | Access | Size |
|---|---|---|
| Small | Auto-downloaded via `SONAR` loader | ~60 MB |
| Medium | Contact authors (see below) | ~1.5 GB |
| Large | Contact authors (see below) | ~12 GB |
The medium and large datasets exceed GitHub's LFS file size limits, so they cannot be hosted on GitHub. To access them, please contact the authors:
- Hari Prasad Piridi — p20210102@hyderabad.bits-pilani.ac.in
- Dipanjan Chakraborty — dipanjan@hyderabad.bits-pilani.ac.in
Please include your affiliation and intended use.
```bibtex
@misc{piridi2026sonar,
  title  = {{SONAR}: A Large-Scale Social Network Benchmark for Graph-Based Anomaly Detection},
  author = {Piridi, Hari Prasad and Agarwal, Sheyril and Singh, Anirudh and
            Duddupudi, Sailesh and Yarramsetty, Sanjeeva Sai Preetham and
            Shyamendra, Pavan and Enaganti, Shreya and Ratra, Vastav and
            Upadhyay, Prajna Devi and Chandra, Priyank and Chakraborty, Dipanjan},
  note   = {Submitted to SIGIR 2026},
  year   = {2026}
}
```


