AD-GEN: Evidence-Preserving Generation of Validated ATT&CK-Aligned Narratives from Large-Scale Endpoint Telemetry
LLM-ready endpoint security dataset for SOC automation, ATT&CK reasoning, and instruction tuning
Modern endpoint telemetry datasets contain rich behavioral evidence but are often difficult to use directly for large language model (LLM) reasoning. Raw Sysmon logs are fragmented across individual events, contain sensitive identifiers, suffer from process identifier reuse, and provide limited behavioral context for security analysis.
AD-GEN addresses this gap by transforming large-scale Windows Sysmon telemetry from the COMISET corpus into process-centric, privacy-preserving, compressed, and validated ATT&CK-aligned narrative records. The framework reconstructs process lifecycles, preserves behavioral evidence, normalizes temporal information, and generates analyst-style annotations suitable for LLM-based cybersecurity applications.
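The process-centric regrouping described above can be illustrated with a minimal sketch. It assumes events are already parsed into dictionaries that carry Sysmon's `ProcessGuid`, `ProcessId`, and `UtcTime` fields, and it keys on `ProcessGuid` so that two processes reusing the same PID stay separate. The function name `group_by_process` and the sample events are hypothetical; AD-GEN's actual reconstruction logic is not shown in this README.

```python
from collections import defaultdict


def group_by_process(events):
    """Group parsed Sysmon events into per-process timelines.

    Keys on ProcessGuid rather than ProcessId, because Windows reuses PIDs
    over time. (Assumption: events are dicts using Sysmon field names.)
    """
    timelines = defaultdict(list)
    for event in events:
        timelines[event["ProcessGuid"]].append(event)
    # Order each timeline chronologically. UtcTime strings in the
    # "YYYY-MM-DD HH:MM:SS.mmm" layout sort correctly as plain text.
    for timeline in timelines.values():
        timeline.sort(key=lambda e: e["UtcTime"])
    return dict(timelines)


# Hypothetical events: PID 4120 is reused by two different processes.
events = [
    {"ProcessGuid": "{A-2}", "ProcessId": 4120, "UtcTime": "2024-01-01 10:05:00.000", "EventID": 1},
    {"ProcessGuid": "{A-1}", "ProcessId": 4120, "UtcTime": "2024-01-01 09:00:01.500", "EventID": 3},
    {"ProcessGuid": "{A-1}", "ProcessId": 4120, "UtcTime": "2024-01-01 09:00:00.000", "EventID": 1},
]
timelines = group_by_process(events)
```

Grouping on `ProcessId` alone would merge the two processes in this example into a single timeline, which is the failure mode that PID reuse causes.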
AD-GEN is designed for:
- LLM-based SOC automation
- ATT&CK-aware instruction tuning
- Threat hunting assistant development
- Endpoint behavior reasoning
- Security narrative generation
The resulting dataset enables security-focused language models to reason over structured behavioral narratives rather than isolated raw events, while maintaining alignment with the MITRE ATT&CK framework through automated validation.
Dataset statistics by pipeline stage:

| Metric | LAB | REAL | Total |
|---|---|---|---|
| Raw Sysmon Events | 49,914,325 | 202,304,790 | 252,219,115 |
| Post-Squash Events | 21,360,985 | 31,571,618 | 52,932,603 |
| Step 2 Narratives | 49,745 | 185,978 | 235,723 |
| Step 3 LLM Outputs | 50,671 | 190,109 | 240,780 |
| Step 4 Validated Outputs | 50,622 | 190,085 | 240,707 |
Event compression after squashing:

| Environment | Raw Events | Post-Squash Events | Compression Ratio | Event Reduction |
|---|---|---|---|---|
| LAB | 49,914,325 | 21,360,985 | 2.34× | 57.20% |
| REAL | 202,304,790 | 31,571,618 | 6.41× | 84.39% |
| Overall | 252,219,115 | 52,932,603 | 4.76× | 79.01% |
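The two derived columns follow directly from the event counts: the compression ratio is raw events divided by post-squash events, and the event reduction is the share of raw events removed. A short sketch that recomputes them from the counts in the table (the function name `squash_metrics` is illustrative only):

```python
def squash_metrics(raw_events: int, post_squash_events: int) -> tuple[float, float]:
    """Return (compression ratio, event reduction in percent)."""
    ratio = raw_events / post_squash_events
    reduction_pct = (1 - post_squash_events / raw_events) * 100
    return ratio, reduction_pct


# Counts copied from the table: (raw events, post-squash events).
counts = {
    "LAB": (49_914_325, 21_360_985),
    "REAL": (202_304_790, 31_571_618),
}
# The overall row is the column-wise sum of the two environments.
counts["Overall"] = tuple(sum(pair) for pair in zip(*counts.values()))

for name, (raw, squashed) in counts.items():
    ratio, reduction = squash_metrics(raw, squashed)
    print(f"{name}: {ratio:.2f}x compression, {reduction:.2f}% fewer events")
```

Rounded to two decimals, the results match the published ratios and reduction percentages for LAB, REAL, and Overall.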
Risk-level distribution:

| Risk Level | Count | Percentage |
|---|---|---|
| Low | 234,046 | 97.26% |
| Medium | 3,131 | 1.30% |
| High | 2,607 | 1.08% |
| Critical | 847 | 0.35% |
ATT&CK tactic frequency:

| Tactic ID | Tactic Name | Frequency |
|---|---|---|
| TA0004 | Privilege Escalation | 3,157 |
| TA0005 | Defense Evasion | 2,511 |
| TA0003 | Persistence | 2,451 |
| TA0002 | Execution | 1,763 |
| TA0007 | Discovery | 727 |
| TA0006 | Credential Access | 603 |
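Distributions like the risk-level and tactic tables above can be recomputed from the released records. The sketch below assumes each record carries `label.risk_level` and `label.mitre_tactics` as in the record format, and it counts a tactic at most once per record; whether the published frequencies were counted this way is not stated, so treat that choice as an assumption. The helper names are hypothetical.

```python
from collections import Counter


def label_distributions(records):
    """Count risk levels and ATT&CK tactic IDs across AD-GEN records."""
    risk_counts = Counter()
    tactic_counts = Counter()
    for record in records:
        label = record["label"]
        risk_counts[label["risk_level"]] += 1
        # Assumption: a tactic is counted once per record, even if repeated.
        tactic_counts.update(set(label["mitre_tactics"]))
    return risk_counts, tactic_counts


def as_percentages(counts):
    """Convert raw counts to percentages of their total, rounded to 2 places."""
    total = sum(counts.values())
    return {key: round(100 * value / total, 2) for key, value in counts.items()}


# Hypothetical records, reduced to the fields this sketch reads.
records = [
    {"label": {"risk_level": "Low", "mitre_tactics": []}},
    {"label": {"risk_level": "Low", "mitre_tactics": ["TA0007"]}},
    {"label": {"risk_level": "Medium", "mitre_tactics": ["TA0005", "TA0004"]}},
    {"label": {"risk_level": "High", "mitre_tactics": ["TA0005"]}},
]
risk_counts, tactic_counts = label_distributions(records)
risk_share = as_percentages(risk_counts)
```

Applying `as_percentages` to the published risk-level counts reproduces the percentage column of the table.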
Each record is stored as one JSON object per line (JSONL). The example below is pretty-printed for readability:

    {
      "sample_id": "ADGEN_0000001",
      "environment": "LAB",
      "source_dataset": "COMISET",
      "narrative": "...",
      "sysmon_hints": ["T1055"],
      "label": {
        "risk_level": "Medium",
        "mitre_tactics": ["TA0005"],
        "mitre_techniques": ["T1055"],
        "recommended_actions": [
          {
            "tool_name": "get_file_metadata",
            "parameters": {}
          }
        ],
        "summary": "Process access behavior consistent with process injection.",
        "analyst_rationale": "Evidence-based analyst reasoning.",
        "verdict": "suspicious",
        "label_source": "llm_validated"
      }
    }

The `tool_name` field in `recommended_actions` takes one of the following values:

- `check_threat_intel`
- `get_file_metadata`
- `query_registry`
- `get_network_flow`
- `terminate_process`
- `isolate_host`
- `no_action`
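A lightweight check of the record layout and action names listed above might look like the following sketch. The field names and tool names come from this README; the risk levels come from the risk-level table; the ID patterns reflect the usual ATT&CK identifier shapes (`TA####`, `T####`, `T####.###`). The function `validate_record` is hypothetical and is not the validator used in the AD-GEN pipeline.

```python
import re

ALLOWED_TOOLS = {
    "check_threat_intel", "get_file_metadata", "query_registry",
    "get_network_flow", "terminate_process", "isolate_host", "no_action",
}
RISK_LEVELS = {"Low", "Medium", "High", "Critical"}
TACTIC_RE = re.compile(r"TA\d{4}")              # e.g. TA0005
TECHNIQUE_RE = re.compile(r"T\d{4}(\.\d{3})?")  # e.g. T1055 or T1055.001
REQUIRED_FIELDS = ("sample_id", "environment", "source_dataset", "narrative", "label")


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty list means it passes)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    label = record.get("label", {})
    if label.get("risk_level") not in RISK_LEVELS:
        problems.append(f"unexpected risk_level: {label.get('risk_level')!r}")
    for tactic in label.get("mitre_tactics", []):
        if not TACTIC_RE.fullmatch(tactic):
            problems.append(f"malformed tactic ID: {tactic!r}")
    for technique in label.get("mitre_techniques", []):
        if not TECHNIQUE_RE.fullmatch(technique):
            problems.append(f"malformed technique ID: {technique!r}")
    for action in label.get("recommended_actions", []):
        if action.get("tool_name") not in ALLOWED_TOOLS:
            problems.append(f"unknown tool_name: {action.get('tool_name')!r}")
    return problems


# The example record from this README, trimmed to the fields that are checked.
example = {
    "sample_id": "ADGEN_0000001",
    "environment": "LAB",
    "source_dataset": "COMISET",
    "narrative": "...",
    "label": {
        "risk_level": "Medium",
        "mitre_tactics": ["TA0005"],
        "mitre_techniques": ["T1055"],
        "recommended_actions": [{"tool_name": "get_file_metadata", "parameters": {}}],
    },
}
problems = validate_record(example)
```

Returning a list of problems instead of raising on the first one makes it easier to summarize failure types across a whole file.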
Automated validation results:

| Metric | LAB | REAL |
|---|---|---|
| Parse Success | 100.00% | 100.00% |
| Schema Validity | 99.93% | 99.98% |
| Verdict Consistency | 95.98% | 98.64% |
| Unknown Tactics After Validation | 0.032% | 0.007% |
| Unknown Techniques After Validation | 0.041% | 0.013% |
| Invalid Actions | 0 | 0 |
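As an illustration of how a rate such as "Unknown Tactics After Validation" could be measured, the sketch below flags records whose label cites a tactic ID outside the MITRE ATT&CK Enterprise tactic set. The README does not state how AD-GEN defines "unknown" or whether the denominator is records or tactic mentions, so the per-record definition here is an assumption, and the ID set should be refreshed against the ATT&CK release in use.

```python
# ATT&CK Enterprise tactic IDs (Reconnaissance through Impact).
ENTERPRISE_TACTICS = {
    "TA0043", "TA0042", "TA0001", "TA0002", "TA0003", "TA0004", "TA0005",
    "TA0006", "TA0007", "TA0008", "TA0009", "TA0011", "TA0010", "TA0040",
}


def unknown_tactic_rate(records) -> float:
    """Percentage of records whose label cites at least one unrecognized tactic ID."""
    records = list(records)
    if not records:
        return 0.0
    flagged = sum(
        1
        for record in records
        if any(t not in ENTERPRISE_TACTICS for t in record["label"]["mitre_tactics"])
    )
    return 100 * flagged / len(records)


# Hypothetical labels: the third record cites an ID that is not a real tactic.
sample = [
    {"label": {"mitre_tactics": ["TA0005"]}},
    {"label": {"mitre_tactics": []}},
    {"label": {"mitre_tactics": ["TA0004", "TA9999"]}},
    {"label": {"mitre_tactics": ["TA0002", "TA0003"]}},
]
rate = unknown_tactic_rate(sample)
```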
Cross-model audit results:

| Metric | Value |
|---|---|
| Audit Samples | 300 |
| Auditor Models | 3 |
| GPT-5.5 Composite Score | 0.748 |
| Claude Opus 4.8 Composite Score | 0.748 |
| Minimum Evidence Support Score | >0.72 |
| Minimum ATT&CK Alignment Score | >0.72 |
Three independent frontier language models (GPT-5.5, Claude Opus 4.8, and Gemini 3.5 Flash) evaluated stratified samples drawn from both the LAB and REAL environments. The independent audits converged on nearly identical quality assessments, which supports the reliability and consistency of the validated ATT&CK-aligned labels.
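Drawing a stratified audit set can be sketched as follows. The README does not say which variables define the strata or how samples were allocated, so this example takes the stratum key as a parameter and draws an equal number of records per stratum; both choices, and the function name `stratified_sample`, are assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, per_stratum, seed=0):
    """Draw up to `per_stratum` records from each stratum defined by `key`."""
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    rng = random.Random(seed)  # fixed seed so the audit set is reproducible
    sample = []
    for name in sorted(strata):
        group = strata[name]
        # Small strata contribute everything they have.
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


# Hypothetical pool, stratified by (environment, risk level).
pool = [
    {"environment": env, "label": {"risk_level": risk}, "n": i}
    for env in ("LAB", "REAL")
    for risk in ("Low", "High")
    for i in range(20)
]
audit_set = stratified_sample(
    pool,
    key=lambda r: (r["environment"], r["label"]["risk_level"]),
    per_stratum=5,
)
```

Equal allocation over-represents rare strata such as High or Critical records relative to their share of the data, which is often the point of stratifying an audit.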
Repository layout:

    AD-GEN
    ├── README.md
    ├── LAB
    │   └── NEW_LAB.jsonl
    ├── REAL
    │   └── NEW_REAL.jsonl
    ├── docs
    │   └── pipeline.png
    └── Conversion
        ├── Conversion.py
        └── react_soc_prompt.txt
Clone the repository:

    git clone https://github.com/namhop88/AD-GEN.git
    cd AD-GEN

| File | Description |
|---|---|
| `LAB/NEW_LAB.jsonl` | AD-GEN processed records derived from the COMISET laboratory environment |
| `REAL/NEW_REAL.jsonl` | AD-GEN processed records derived from the COMISET real university network environment |
| `docs/pipeline.png` | Overview of the AD-GEN transformation pipeline |
| `Conversion/Conversion.py` | Utility for converting records to the release format |
| `Conversion/react_soc_prompt.txt` | ReAct-style SOC labeling prompt used during generation |
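Because the data files are JSONL, they can be streamed one record at a time instead of being loaded whole. A minimal reader, assuming UTF-8 files with one JSON object per line (the helper names `parse_jsonl` and `iter_records` are illustrative):

```python
import json


def parse_jsonl(lines):
    """Yield one dict per non-empty line; report the line number on bad JSON."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"line {line_number}: invalid JSON ({err.msg})") from err


def iter_records(path):
    """Stream records from a JSONL file without loading it into memory."""
    with open(path, encoding="utf-8") as handle:
        yield from parse_jsonl(handle)
```

From the repository root, `for record in iter_records("LAB/NEW_LAB.jsonl"): ...` would then iterate over the LAB records.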
AD-GEN is derived from the COMISET Windows endpoint telemetry corpus; it does not claim to be the original raw telemetry source.
AD-GEN labels are validated synthetic analyst labels, not human-adjudicated forensic ground truth.
The dataset is intended for research in instruction tuning, weak supervision, SOC assistant development, ATT&CK-aware reasoning, and endpoint narrative modeling. Additional expert review is recommended before operational use.
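For instruction tuning, each record can be turned into a prompt/response pair. The sketch below is one possible mapping, not the format produced by `Conversion/Conversion.py`: the prompt wording, the output keys (`id`, `prompt`, `response`), and the decision to drop `label_source` from the target are all assumptions made for illustration.

```python
import json


def to_instruction_pair(record: dict) -> dict:
    """Turn one AD-GEN record into a prompt/response pair for supervised tuning."""
    prompt = (
        "You are a SOC analyst. Review the endpoint narrative below and return a JSON "
        "assessment with risk_level, mitre_tactics, mitre_techniques, "
        "recommended_actions, summary, analyst_rationale, and verdict.\n\n"
        f"Narrative:\n{record['narrative']}"
    )
    # Drop provenance metadata that the model should not be asked to predict.
    target = {k: v for k, v in record["label"].items() if k != "label_source"}
    return {
        "id": record["sample_id"],
        "prompt": prompt,
        "response": json.dumps(target, ensure_ascii=False),
    }


# Hypothetical record with a shortened label.
pair = to_instruction_pair({
    "sample_id": "ADGEN_0000001",
    "narrative": "...",
    "label": {
        "risk_level": "Medium",
        "mitre_tactics": ["TA0005"],
        "verdict": "suspicious",
        "label_source": "llm_validated",
    },
})
```

Since the labels are validated synthetic annotations, pairs built this way are best treated as weak supervision.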
Citation (BibTeX):

    @article{nam2026adgen,
      title   = {AD-GEN: Evidence-Preserving Generation of Validated ATT\&CK-Aligned Narratives from Large-Scale Endpoint Telemetry},
      author  = {Dinh Phuong Nam and Nguyen Tan Cam},
      year    = {2026},
      journal = {Preprint}
    }

| Component | License |
|---|---|
| Source Code | MIT License |
| Dataset | CC BY-NC 4.0 |
AD-GEN is released for academic and defensive cybersecurity research only. It should not be used as the sole basis for operational security decisions without expert validation.
Dinh Phuong Nam
University of Information Technology (UIT), VNU-HCM
HUTECH University
GitHub: @namhop88
