Graph-Attentive Multi-Omics Fusion for Blood-Based PTSD Classification Across Military and Civilian Cohorts
OmicsGraphNet integrates six molecular feature blocks from whole blood across eight publicly available GEO cohorts (n = 1,173 samples) into a unified multi-omics GNN classifier for PTSD. The model chains:
- Per-modality β-VAE encoders (β = 4.0, z = 64) for robust latent compression of each feature block
- Multi-head cross-modal attention (4 heads) for learned modality fusion with soft per-modality gates
- GATv2 graph attention over the STRING v12 PPI knowledge graph (14,690 nodes, 191,324 edges)
- Multi-task decoder for simultaneous PTSD classification and PCL-5 severity regression
- Monte Carlo Dropout (T = 50 passes) for predictive uncertainty quantification
Evaluation uses six-fold Leave-One-Cohort-Out cross-validation (LOCO-CV): each fold holds out one entire cohort for testing, which makes it a stringent generalisation benchmark for blood-based PTSD biomarker discovery.
| Model | AUC (mean ± SD) | F1 | Sensitivity | Specificity |
|---|---|---|---|---|
| OmicsGraphNet | 0.603 ± 0.176 | 0.132 | 0.105 | 0.907 |
| ElasticNet | 0.541 ± 0.040 | 0.324 | 0.351 | 0.706 |
| Random Forest | 0.548 ± 0.051 | 0.041 | 0.031 | 0.970 |
| XGBoost | 0.528 ± 0.041 | 0.295 | 0.360 | 0.678 |
| ConcatMLP | 0.542 ± 0.043 | 0.403 | 0.546 | 0.477 |
Best fold: GSE164877 (WTC First Responders 2) — AUC = 0.897
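The table reports threshold-dependent metrics (F1, sensitivity, specificity) next to the threshold-free AUC. The following minimal sketch shows how the three are derived from confusion-matrix counts, and why a classifier with high specificity but low sensitivity ends up with a low F1. The counts are hypothetical and are not taken from any cohort in this repository.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Sensitivity, specificity, precision and F1 from confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # recall on PTSD cases
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # recall on controls
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical example: 100 cases, 100 controls, a conservative classifier
# that rarely predicts the positive class.
metrics = binary_metrics(tp=10, fp=9, tn=91, fn=90)
print(metrics)
```

Because F1 ignores true negatives, a specificity above 0.9 does not compensate for missing most cases.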
All data are publicly available from NCBI GEO. No patient-level data are included in this repository.
| GEO Accession | Cohort | n | Assay | LOCO Role |
|---|---|---|---|---|
| GSE97356 | WTC First Responders | 324 | RNA-seq | Fold 1 |
| GSE164877 | WTC First Responders 2 | 226 | RNA-seq | Fold 2 (best) |
| GSE63878 | Marines MRS | 96 | Microarray | Fold 3 |
| GSE81761 | Military Service Members | 66 | RNA-seq | Fold 4 |
| GSE109409 | Canadian Infantry | 85 | RNA-seq | Fold 5 |
| GSE64813 | US Service Members | 188 | RNA-seq | Fold 6 |
| GSE67663 | PTSD+MDD Comorbid | 184 | Microarray | Fixed holdout |
| GSE860 | ER Trauma Survivors | 33 | Microarray | Fixed civilian holdout |
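The LOCO roles in the table above can be sketched as a simple split generator. This is an illustration only: the function name `loco_splits` is hypothetical, and the sketch assumes that the two fixed holdout cohorts are never used for training, which the table suggests but does not state outright.

```python
from typing import Iterator, List, Tuple

# Cohorts that rotate through the test role (Folds 1-6 in the table).
LOCO_COHORTS = ["GSE97356", "GSE164877", "GSE63878",
                "GSE81761", "GSE109409", "GSE64813"]
# Cohorts reserved as fixed holdouts (assumed to be excluded from training).
FIXED_HOLDOUTS = ["GSE67663", "GSE860"]


def loco_splits(sample_cohorts: List[str]) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_cohort, train_idx, test_idx) for each LOCO fold."""
    for held_out in LOCO_COHORTS:
        train_idx = [i for i, c in enumerate(sample_cohorts)
                     if c in LOCO_COHORTS and c != held_out]
        test_idx = [i for i, c in enumerate(sample_cohorts) if c == held_out]
        yield held_out, train_idx, test_idx


# Toy example: one cohort label per sample, in feature-matrix row order.
toy_cohorts = ["GSE97356", "GSE97356", "GSE164877", "GSE63878", "GSE67663", "GSE860"]
folds = list(loco_splits(toy_cohorts))
for held_out, train_idx, test_idx in folds:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Splitting by cohort rather than by sample keeps platform and site effects of the test cohort out of training.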
```text
Input (537 features)            β-VAE Encoders   Cross-Modal Attn   GATv2 PPI Graph       Decoder
────────────────────            ──────────────   ────────────────   ───────────────       ───────
Block A: Transcriptomics (500)  VAE (z=64) ──┐
Block B: EWAS CpG proxies (12)  VAE (z=64) ──┤
Block C: Cell-type fracs (11)   VAE (z=64) ──┤   MHA (4 heads) ──►  GATv2 (2 layers) ──►  PTSD prob
Block D: PRS (1)                VAE (z=64) ──┤   Soft gates         14,690 nodes          PCL-5 score
Block E: TWAS brain genes (10)  VAE (z=64) ──┤   Query vector       191,324 edges
Block F: Clinical (3)           VAE (z=64) ──┘

Total parameters: 896,413
Loss: L_CE + β · KL + λ_r · MSE_recon + λ_s · MSE_PCL5
MC Dropout (p = 0.25, T = 50) for uncertainty quantification
```
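The composite loss above can be written out for a single sample as follows. This is a sketch under stated assumptions, not the code in `train_omicsgraphnet.py`: it assumes L_CE is binary cross-entropy, that KL is the usual closed-form divergence between a diagonal Gaussian posterior and a standard normal prior, and that only one encoder contributes (the full model would presumably sum over the six blocks). β = 4.0 comes from the text; the values of λ_r and λ_s are not given, so the defaults of 1.0 are placeholders.

```python
import math
from typing import Sequence


def gaussian_kl(mu: Sequence[float], logvar: Sequence[float]) -> float:
    """KL( N(mu, exp(logvar)) || N(0, I) ) for a diagonal Gaussian."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def binary_cross_entropy(p: float, y: int, eps: float = 1e-7) -> float:
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)


def total_loss(p_ptsd, y, mu, logvar, x_recon, x, pcl5_pred, pcl5_true,
               beta=4.0, lambda_r=1.0, lambda_s=1.0) -> float:
    """L_CE + beta * KL + lambda_r * MSE_recon + lambda_s * MSE_PCL5 (one sample)."""
    return (binary_cross_entropy(p_ptsd, y)
            + beta * gaussian_kl(mu, logvar)
            + lambda_r * mse(x_recon, x)
            + lambda_s * (pcl5_pred - pcl5_true) ** 2)


# Toy values: perfect reconstruction and severity, one latent unit off the prior.
example_loss = total_loss(p_ptsd=0.5, y=1, mu=[1.0, 0.0], logvar=[0.0, 0.0],
                          x_recon=[0.0, 0.0], x=[0.0, 0.0],
                          pcl5_pred=30.0, pcl5_true=30.0)
print(round(example_loss, 4))
```

With β = 4.0 the KL term dominates quickly, which is the intended pressure toward a compact latent code.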
```bash
# 1. Create environment
conda create -n deepomics python=3.11 -y
conda activate deepomics

# 2. Install PyTorch (CPU; for GPU see https://pytorch.org/get-started/locally/)
pip install torch==2.2.2 torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# 3. Install dependencies
pip install -r requirements.txt

# 4. Install PyTorch Geometric
pip install torch-scatter torch-sparse torch-cluster torch-geometric --no-build-isolation
```

No GPU required. Training one fold takes ~2 hours on an M2 MacBook Pro (CPU).
The checkpoints/ directory contains 11 trained models (~3.5 MB each). Run inference immediately after completing the data pipeline (Steps 1–6):
```bash
python predict.py \
    --checkpoint checkpoints/omicsgraphnet_fold2_GSE164877.pt \
    --features data/processed/feature_matrix.csv \
    --labels data/processed/sample_labels.csv \
    --ppi_edges data/processed/ppi_graph_edges.csv \
    --ppi_nodes data/processed/ppi_node_features.csv \
    --output predictions.csv
```

Output: sample_id · ptsd_prob · ptsd_pred · uncertainty_std · uncertainty_entropy · [true_label]
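The two uncertainty columns come from the T = 50 stochastic forward passes of MC Dropout. How `predict.py` aggregates them is not shown in this README; the sketch below assumes a common convention: the reported probability is the mean over passes, `uncertainty_std` is the standard deviation over passes, and `uncertainty_entropy` is the binary entropy of the mean probability. The function name `summarise_mc_passes` and the 0.5 threshold are hypothetical.

```python
import math
import statistics
from typing import Dict, Sequence


def summarise_mc_passes(probs: Sequence[float], threshold: float = 0.5) -> Dict[str, float]:
    """Collapse per-pass PTSD probabilities for one sample into summary columns."""
    mean_p = statistics.fmean(probs)
    std_p = statistics.pstdev(probs)          # spread across dropout masks
    eps = 1e-12
    p = min(max(mean_p, eps), 1.0 - eps)      # clamp to avoid log(0)
    entropy = -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))
    return {
        "ptsd_prob": mean_p,
        "ptsd_pred": int(mean_p >= threshold),
        "uncertainty_std": std_p,
        "uncertainty_entropy": entropy,
    }


# Toy input: 50 passes alternating between two probabilities.
passes = [0.6, 0.8] * 25
summary = summarise_mc_passes(passes)
print(summary)
```

A large standard deviation signals disagreement between dropout masks, while high entropy signals a mean prediction close to 0.5; the two can diverge.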
| Checkpoint | Test cohort | AUC | Use for |
|---|---|---|---|
| `omicsgraphnet_fold2_GSE164877.pt` | WTC FR 2 | 0.897 | Best overall |
| `omicsgraphnet_fold1_GSE97356.pt` | WTC FR 1 | ~0.60 | WTC-independent |
| `omicsgraphnet_fold5_GSE109409.pt` | Canadian infantry | ~0.55 | Military population |
```bash
# 1. Download all 8 GEO cohorts (~2 GB)
python data/download_geo.py

# 2. Extract and standardise PTSD labels
python data/fix_labels.py

# 3. Normalise, ComBat batch-correct, run SVA
python src/preprocessing/normalize_expression.py

# 4. NNLS cell-type deconvolution (Hwang 2025 atlas)
python src/preprocessing/run_deconvolution.py

# 5. Build 537-feature matrix (6 blocks)
python src/preprocessing/build_feature_matrix.py

# 6. Build STRING v12 PPI graph (auto-downloads ~500 MB)
python src/preprocessing/build_ppi_graph.py --download_string

# 7. LOCO-CV training: 6 folds, 200 epochs each
python src/training/train_omicsgraphnet.py --epochs 200 --cv loco

# 8. Integrated Gradients biomarker attribution
python src/evaluation/interpret.py \
    --checkpoint checkpoints/omicsgraphnet_fold2_GSE164877.pt

# 9. Statistical tests (DeLong AUC CI, McNemar, bootstrap)
python src/evaluation/statistical_tests.py

# 10. Generate figures
python make_figures.py
```

OmicsGraphNet can be applied to any binary phenotype with whole-blood transcriptomics.
Feature matrix format — CSV with samples as rows, 537 columns in block order:
```text
Block A (cols 0–499):   500 transcriptomics features (log2-TPM, ComBat-corrected)
Block B (cols 500–511): 12 EWAS CpG proxy genes
Block C (cols 512–522): 11 NNLS cell-type fractions
Block D (col 523):      1 PRS value (0 if unavailable)
Block E (cols 524–533): 10 TWAS PrediXcan z-scores
Block F (cols 534–536): 3 clinical (PCL-5, sex, platform)
```
See data/processed/feature_metadata.csv for the full column list. Unavailable blocks can be zero-filled — the cross-modal attention gate will down-weight them automatically.
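The block layout and the zero-fill convention can be sketched as below. The column ranges are taken from the layout above; the block keys and the helper `assemble_row` are hypothetical names for illustration and are not part of the repository.

```python
from typing import Dict, List, Sequence

# Half-open [start, stop) column ranges for the six blocks.
BLOCK_SLICES = {
    "A_transcriptomics": (0, 500),
    "B_ewas_proxies": (500, 512),
    "C_cell_fractions": (512, 523),
    "D_prs": (523, 524),
    "E_twas": (524, 534),
    "F_clinical": (534, 537),
}
N_FEATURES = 537


def assemble_row(blocks: Dict[str, Sequence[float]]) -> List[float]:
    """Build one 537-feature row; blocks that are not supplied stay zero-filled."""
    row = [0.0] * N_FEATURES
    for name, values in blocks.items():
        start, stop = BLOCK_SLICES[name]
        if len(values) != stop - start:
            raise ValueError(f"{name}: expected {stop - start} values, got {len(values)}")
        row[start:stop] = [float(v) for v in values]
    return row


# Toy sample with only transcriptomics and cell-type fractions available.
row = assemble_row({
    "A_transcriptomics": [1.0] * 500,
    "C_cell_fractions": [1.0 / 11] * 11,
})
print(len(row), row[523])
```

Validating block lengths before writing the CSV catches column misalignment early, which would otherwise silently shift features into the wrong encoder.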
Fine-tuning a checkpoint:
```python
import torch
from src.models.omics_graph_net import OmicsGraphNet

ckpt = torch.load("checkpoints/omicsgraphnet_fold2_GSE164877.pt", map_location="cpu", weights_only=False)
model = OmicsGraphNet(block_dims=ckpt["block_dims"])
model.load_state_dict(ckpt["model_state"])
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
# standard PyTorch training loop
```

```text
OmicsGraphNet/
├── data/
│   ├── download_geo.py                 # Download 8 GEO cohorts via GEOparse
│   ├── fix_labels.py                   # Cohort-specific PTSD label extraction
│   ├── processed/                      # Generated outputs (git-ignored; run pipeline to create)
│   └── external/
│       ├── ewas/                       # PTSD EWAS CpG target genes
│       ├── prs/                        # PGC Freeze 3 GWAS loci + PRSice2/LDPred2 templates
│       ├── ppi/                        # STRING v12 protein aliases
│       └── singlecell/                 # Hwang 2025 cell-type marker signatures
├── src/
│   ├── models/
│   │   ├── omics_graph_net.py          # OmicsGraphNet: β-VAE → Attention → GATv2
│   │   └── baselines.py                # ElasticNet, SVM-RFE, RF, XGBoost, ConcatMLP, MOLI, MOGONET
│   ├── preprocessing/
│   │   ├── normalize_expression.py     # VST + ComBat + SVA
│   │   ├── run_deconvolution.py        # NNLS cell-type deconvolution (11 types)
│   │   ├── build_feature_matrix.py     # Assembles 537-feature 6-block matrix
│   │   └── build_ppi_graph.py          # STRING v12 → PyG graph
│   ├── training/
│   │   └── train_omicsgraphnet.py      # LOCO-CV, AdamW, CosineAnnealingLR, early stopping
│   ├── evaluation/
│   │   ├── interpret.py                # Integrated Gradients + cross-modal attention
│   │   └── statistical_tests.py        # DeLong, McNemar, bootstrap, BH-FDR
│   └── visualization/
│       └── make_figures.py             # Figure utilities (Okabe-Ito palette, 600 DPI)
├── checkpoints/                        # 11 pre-trained models (~3.5 MB each)
├── predict.py                          # Inference with MC Dropout uncertainty
├── make_figures.py                     # Publication figures (Fig 1–3)
├── run_loco_cv_folds.sh                # Run all 6 LOCO folds sequentially
├── requirements.txt
├── CITATION.cff
├── LICENSE
└── README.md
```
Top features by Integrated Gradients (GSE164877 fold, AUC = 0.897):
| Rank | Feature | Block | Mean \|IG\| | Biological context |
|---|---|---|---|---|
| 1 | Neutrophil fraction | C — Cell-type | 6.3 × 10⁻⁵ | Elevated NLR in PTSD (innate immunity) |
| 2 | PV interneuron score | C — Cell-type | 4.8 × 10⁻⁵ | GABAergic interneuron proxy |
| 3 | NK cell fraction | C — Cell-type | 4.8 × 10⁻⁵ | NK functional deficit (glucocorticoid) |
| 4 | Astrocyte score | C — Cell-type | 3.8 × 10⁻⁵ | Neuroinflammatory proxy (cross-tissue) |
| 5 | IFI44 | A — Transcriptomics | 1.4 × 10⁻⁶ | ISG15 / type-I interferon pathway |
| 6 | RSAD2 | A — Transcriptomics | 1.2 × 10⁻⁶ | Viperin / type-I interferon pathway |
Full ranked list: results/interpretation/saliency_top_genes.csv
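For readers unfamiliar with Integrated Gradients, the sketch below applies the method to a toy logistic model with an analytic gradient. It is not the code in `interpret.py`, which operates on the full network; the weights, inputs, all-zero baseline and step count are hypothetical. The path integral is approximated with a midpoint Riemann sum, and the final check illustrates the completeness property: attributions sum to the difference between the model output at the input and at the baseline.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def model(x: Sequence[float], w: Sequence[float], bias: float) -> float:
    """Toy logistic model standing in for the PTSD probability output."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + bias)


def integrated_gradients(x: Sequence[float], baseline: Sequence[float],
                         w: Sequence[float], bias: float, steps: int = 200) -> List[float]:
    """IG_i = (x_i - x'_i) * mean over the straight-line path of dF/dx_i."""
    grad_sums = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b0 + alpha * (xi - b0) for xi, b0 in zip(x, baseline)]
        p = model(point, w, bias)
        for i, wi in enumerate(w):
            grad_sums[i] += p * (1.0 - p) * wi  # analytic gradient of the sigmoid
    return [(xi - b0) * g / steps for xi, b0, g in zip(x, baseline, grad_sums)]


weights, bias = [0.8, -0.3, 1.5], -0.2
x, baseline = [1.0, -2.0, 0.5], [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline, weights, bias)
gap = model(x, weights, bias) - model(baseline, weights, bias)
print(attributions, sum(attributions), gap)
```

Averaging the absolute attributions over samples would give a per-feature score of the kind reported in the Mean |IG| column.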
| Item | Status | Source |
|---|---|---|
| Polygenic risk scores (Block D) | Zero-imputed placeholder | dbGaP: phs000864, phs000560, phs002046 |
| EWAS methylation (Block B) | Expression-proxy (12 surrogates) | PGC PTSD EWAS working group |
| Hwang 2025 atlas (.h5ad) | Required for deconvolution | Zenodo: 10.5281/zenodo.15186498 |
| STRING v12 | Auto-downloaded | build_ppi_graph.py --download_string |
All analyses run on a standard macOS workstation (Apple M-series, 16 GB RAM). No HPC or GPU required. Seeds: torch.manual_seed(42), np.random.seed(42).
Versions: Python 3.11 · PyTorch 2.2.2 · PyG 2.7.0 · scikit-learn 1.3 · neuroCombat 0.9.1 · GEOparse 2.0.4
```bibtex
@article{roy2026omicsgraphnet,
  title  = {{OmicsGraphNet}: Graph-Attentive Multi-Omics Fusion for Blood-Based {PTSD}
            Classification Across Military and Civilian Cohorts},
  author = {Roy, Kushal Raj},
  year   = {2026}
}
```

MIT: see LICENSE