kroy3/OmicsGraphNet

OmicsGraphNet

Graph-Attentive Multi-Omics Fusion for Blood-Based PTSD Classification Across Military and Civilian Cohorts

License: MIT · Python 3.11 · PyTorch 2.2 · PyG 2.7


Overview

OmicsGraphNet integrates six molecular feature blocks from whole blood across eight publicly available GEO cohorts (n = 1,173 samples) into a unified multi-omics GNN classifier for PTSD. The model chains:

  1. Per-modality β-VAE encoders (β = 4.0, z = 64) for robust latent compression of each feature block
  2. Multi-head cross-modal attention (4 heads) for learned modality fusion with soft per-modality gates
  3. GATv2 graph attention over the STRING v12 PPI knowledge graph (14,690 nodes, 191,324 edges)
  4. Multi-task decoder for simultaneous PTSD classification and PCL-5 severity regression
  5. Monte Carlo Dropout (T = 50 passes) for predictive uncertainty quantification

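Evaluation uses six-fold Leave-One-Cohort-Out cross-validation (LOCO-CV): each fold holds out one entire cohort as the test set, which makes it a stringent generalisation benchmark for blood-based PTSD biomarker discovery. The two remaining cohorts serve as fixed holdouts (see Datasets).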
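
The LOCO-CV splitting logic can be sketched in a few lines. This is a minimal illustration rather than the repository's training code: the function name `loco_splits` and the toy cohort labels are assumptions, and only the hold-one-cohort-out idea is taken from the description above.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def loco_splits(cohorts: Sequence[str]) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_cohort, train_idx, test_idx), one fold per cohort.

    `cohorts` holds one cohort label per sample (hypothetical input format).
    """
    by_cohort: Dict[str, List[int]] = defaultdict(list)
    for i, cohort in enumerate(cohorts):
        by_cohort[cohort].append(i)

    for held_out in sorted(by_cohort):
        test_idx = by_cohort[held_out]
        # Every sample from every other cohort goes into the training set.
        train_idx = sorted(
            i for cohort, idx in by_cohort.items() if cohort != held_out for i in idx
        )
        yield held_out, train_idx, test_idx


# Toy example: five samples drawn from three cohorts.
sample_cohorts = ["GSE97356", "GSE97356", "GSE164877", "GSE63878", "GSE164877"]
folds = list(loco_splits(sample_cohorts))
for held_out, train_idx, test_idx in folds:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Because no sample from the held-out cohort is ever seen during training, platform and site effects specific to that cohort cannot leak into the fitted model.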


Key Results

Model           AUC (mean ± SD)   F1      Sensitivity   Specificity
─────────────   ───────────────   ─────   ───────────   ───────────
OmicsGraphNet   0.603 ± 0.176     0.132   0.105         0.907
ElasticNet      0.541 ± 0.040     0.324   0.351         0.706
Random Forest   0.548 ± 0.051     0.041   0.031         0.970
XGBoost         0.528 ± 0.041     0.295   0.360         0.678
ConcatMLP       0.542 ± 0.043     0.403   0.546         0.477

Best fold: GSE164877 (WTC First Responders 2) — AUC = 0.897


Datasets

All data are publicly available from NCBI GEO. No patient-level data are included in this repository.

GEO Accession   Cohort                     n     Assay        LOCO Role
─────────────   ────────────────────────   ───   ──────────   ──────────────────────
GSE97356        WTC First Responders       324   RNA-seq      Fold 1
GSE164877       WTC First Responders 2     226   RNA-seq      Fold 2 (best)
GSE63878        Marines MRS                96    Microarray   Fold 3
GSE81761        Military Service Members   66    RNA-seq      Fold 4
GSE109409       Canadian Infantry          85    RNA-seq      Fold 5
GSE64813        US Service Members         188   RNA-seq      Fold 6
GSE67663        PTSD+MDD Comorbid          184   Microarray   Fixed holdout
GSE860          ER Trauma Survivors        33    Microarray   Fixed civilian holdout

Architecture

Input (537 features)           β-VAE Encoders    Cross-Modal Attn    GATv2 PPI Graph    Decoder
──────────────────             ──────────────    ────────────────    ───────────────    ───────
Block A: Transcriptomics (500)  VAE (z=64) ──┐
Block B: EWAS CpG proxies  (12) VAE (z=64) ──┤
Block C: Cell-type fracs   (11) VAE (z=64) ──┤  MHA (4 heads) ──► GATv2 (2 layers) ──► PTSD prob
Block D: PRS                (1) VAE (z=64) ──┤  Soft gates         14,690 nodes          PCL-5 score
Block E: TWAS brain genes  (10) VAE (z=64) ──┤  Query vector       191,324 edges
Block F: Clinical           (3) VAE (z=64) ──┘

Total parameters: 896,413
Loss: L_CE  +  β · KL  +  λ_r · MSE_recon  +  λ_s · MSE_PCL5
MC Dropout  (p = 0.25,  T = 50)  for uncertainty quantification
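
To make the loss line concrete, here is a minimal NumPy sketch of the four terms for a single feature block and one mini-batch. Several details are assumptions and are not stated in this README: the values of `lambda_r` and `lambda_s`, the use of binary cross-entropy for L_CE, the closed-form KL divergence of a diagonal Gaussian posterior against a standard normal prior, and batch-mean reduction. Only β = 4.0 and the overall form of the sum come from the text above; with six encoders, the KL and reconstruction terms would presumably be accumulated over blocks.

```python
import numpy as np


def composite_loss(p_ptsd, y, mu, logvar, x, x_recon, pcl5_pred, pcl5_true,
                   beta=4.0, lambda_r=1.0, lambda_s=0.5):
    """Return (total, parts) for L_CE + beta*KL + lambda_r*MSE_recon + lambda_s*MSE_PCL5.

    beta = 4.0 is taken from the README; lambda_r and lambda_s are placeholder values.
    """
    p = np.clip(np.asarray(p_ptsd, dtype=float), 1e-7, 1 - 1e-7)
    y = np.asarray(y, dtype=float)

    # Binary cross-entropy on the PTSD probability (assumed form of L_CE).
    ce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    # KL( N(mu, exp(logvar)) || N(0, I) ), summed over latent dims, averaged over the batch.
    kl = np.mean(-0.5 * np.sum(1 + logvar - mu ** 2 - np.exp(logvar), axis=1))
    # Reconstruction error of the block and PCL-5 severity regression error.
    mse_recon = np.mean((np.asarray(x) - np.asarray(x_recon)) ** 2)
    mse_pcl5 = np.mean((np.asarray(pcl5_pred) - np.asarray(pcl5_true)) ** 2)

    parts = {"ce": float(ce), "kl": float(kl),
             "mse_recon": float(mse_recon), "mse_pcl5": float(mse_pcl5)}
    total = ce + beta * kl + lambda_r * mse_recon + lambda_s * mse_pcl5
    return float(total), parts


# Toy batch: 8 samples, a 12-feature block, 64 latent dimensions.
rng = np.random.default_rng(42)
n, d, z = 8, 12, 64
x = rng.normal(size=(n, d))
total, parts = composite_loss(
    p_ptsd=np.full(n, 0.5), y=rng.integers(0, 2, size=n),
    mu=np.zeros((n, z)), logvar=np.zeros((n, z)),
    x=x, x_recon=x, pcl5_pred=np.zeros(n), pcl5_true=np.zeros(n),
)
print(parts, total)
```

In this toy call the posterior equals the prior and the reconstruction is perfect, so only the cross-entropy term is non-zero. Shifting `mu` away from zero shows how quickly the β-weighted KL term comes to dominate the total.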

Installation

# 1. Create environment
conda create -n deepomics python=3.11 -y
conda activate deepomics

# 2. Install PyTorch (CPU; for GPU see https://pytorch.org/get-started/locally/)
pip install torch==2.2.2 torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# 3. Install dependencies
pip install -r requirements.txt

# 4. Install PyTorch Geometric
pip install torch-scatter torch-sparse torch-cluster torch-geometric --no-build-isolation

No GPU required. Training one fold takes ~2 hours on an M2 MacBook Pro (CPU).


Quick Start — Inference with Pre-trained Checkpoints

The checkpoints/ directory contains 11 trained models (~3.5 MB each). Inference needs no training, only the processed files produced by Steps 1–6 of the Full Pipeline below:

python predict.py \
  --checkpoint checkpoints/omicsgraphnet_fold2_GSE164877.pt \
  --features   data/processed/feature_matrix.csv \
  --labels     data/processed/sample_labels.csv \
  --ppi_edges  data/processed/ppi_graph_edges.csv \
  --ppi_nodes  data/processed/ppi_node_features.csv \
  --output     predictions.csv

Output: sample_id · ptsd_prob · ptsd_pred · uncertainty_std · uncertainty_entropy · [true_label]

Checkpoint                         Test cohort         AUC     Use for
────────────────────────────────   ─────────────────   ─────   ───────────────────
omicsgraphnet_fold2_GSE164877.pt   WTC FR 2            0.897   Best overall
omicsgraphnet_fold1_GSE97356.pt    WTC FR 1            ~0.60   WTC-independent
omicsgraphnet_fold5_GSE109409.pt   Canadian infantry   ~0.55   Military population
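
The uncertainty_std and uncertainty_entropy output columns come from MC Dropout. One common way to derive such columns from T stochastic forward passes is sketched below; whether predict.py uses exactly these definitions is an assumption, as are the 0.5 decision threshold, the function name `summarise_mc_passes`, and the simulated probabilities that stand in for real model outputs.

```python
import numpy as np


def summarise_mc_passes(probs, threshold=0.5):
    """Summarise an array of shape (T, n_samples) of dropout-enabled PTSD probabilities."""
    probs = np.asarray(probs, dtype=float)
    mean_p = probs.mean(axis=0)   # predictive mean over the T passes
    std_p = probs.std(axis=0)     # spread across passes
    p = np.clip(mean_p, 1e-7, 1 - 1e-7)
    # Binary entropy (in nats) of the predictive mean.
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return {
        "ptsd_prob": mean_p,
        "ptsd_pred": (mean_p >= threshold).astype(int),
        "uncertainty_std": std_p,
        "uncertainty_entropy": entropy,
    }


# Simulated stand-in for T = 50 passes over three samples:
# a confident case, an ambiguous one, and a confident control.
rng = np.random.default_rng(42)
simulated = np.clip(
    rng.normal(loc=[0.9, 0.5, 0.1], scale=[0.02, 0.15, 0.02], size=(50, 3)), 0.0, 1.0
)
summary = summarise_mc_passes(simulated)
for name, values in summary.items():
    print(name, np.round(values, 3))
```

The standard deviation captures disagreement between passes, while the entropy also grows when the mean prediction sits near 0.5, so the two columns can rank samples differently.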

Full Pipeline

# 1. Download all 8 GEO cohorts (~2 GB)
python data/download_geo.py

# 2. Extract and standardise PTSD labels
python data/fix_labels.py

# 3. Normalise, ComBat batch-correct, run SVA
python src/preprocessing/normalize_expression.py

# 4. NNLS cell-type deconvolution (Hwang 2025 atlas)
python src/preprocessing/run_deconvolution.py

# 5. Build 537-feature matrix (6 blocks)
python src/preprocessing/build_feature_matrix.py

# 6. Build STRING v12 PPI graph (auto-downloads ~500 MB)
python src/preprocessing/build_ppi_graph.py --download_string

# 7. LOCO-CV training — 6 folds, 200 epochs each
python src/training/train_omicsgraphnet.py --epochs 200 --cv loco

# 8. Integrated Gradients biomarker attribution
python src/evaluation/interpret.py \
       --checkpoint checkpoints/omicsgraphnet_fold2_GSE164877.pt

# 9. Statistical tests (DeLong AUC CI, McNemar, bootstrap)
python src/evaluation/statistical_tests.py

# 10. Generate figures
python make_figures.py
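
Step 9 lists a bootstrap among its statistical tests. As a rough illustration of what a percentile bootstrap for AUC involves, here is a self-contained NumPy sketch. It is not the code in statistical_tests.py: the function names, the 1,000 resamples, and the 95% interval are all assumptions.

```python
import numpy as np


def auc(y_true, y_score):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true == 1], y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=42):
    """Return (point_estimate, lower, upper) using a percentile bootstrap over samples."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    if y_true.min() == y_true.max():
        raise ValueError("AUC needs both classes present")
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)   # resample with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue                       # skip resamples that lost one class
        stats.append(auc(y_true[idx], y_score[idx]))
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return auc(y_true, y_score), float(lower), float(upper)


# Toy labels and scores for a single held-out cohort.
y = [0, 0, 1, 1, 0, 1, 0, 1]
s = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.55]
point, lower, upper = bootstrap_auc_ci(y, s)
print(f"AUC = {point:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

On small held-out cohorts the interval is typically wide, which is worth keeping in mind when comparing per-fold AUCs.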

Adapting to Your Own Data

OmicsGraphNet can be applied to any binary phenotype with whole-blood transcriptomics.

Feature matrix format — CSV with samples as rows, 537 columns in block order:

Block A (cols   0–499):  500 transcriptomics features (log2-TPM, ComBat-corrected)
Block B (cols 500–511):  12  EWAS CpG proxy genes
Block C (cols 512–522):  11  NNLS cell-type fractions
Block D (col  523    ):   1  PRS value (0 if unavailable)
Block E (cols 524–533):  10  TWAS PrediXcan z-scores
Block F (cols 534–536):   3  clinical (PCL-5, sex, platform)

See data/processed/feature_metadata.csv for the full column list. Unavailable blocks can be zero-filled — the cross-modal attention gate will down-weight them automatically.
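
The block layout above can be checked programmatically before a custom matrix is passed to the model. The sketch below uses only the column ranges listed in this section; the block key names and the helper `split_blocks` are hypothetical, and the random matrix is a stand-in for real data.

```python
import numpy as np

# Column ranges copied from the block layout above (end index exclusive).
BLOCKS = {
    "A_transcriptomics": slice(0, 500),
    "B_ewas_proxy": slice(500, 512),
    "C_cell_fractions": slice(512, 523),
    "D_prs": slice(523, 524),
    "E_twas": slice(524, 534),
    "F_clinical": slice(534, 537),
}
N_FEATURES = 537


def split_blocks(X):
    """Validate the matrix shape and return one array per feature block."""
    X = np.asarray(X, dtype=float)
    if X.ndim != 2 or X.shape[1] != N_FEATURES:
        raise ValueError(f"expected shape (n_samples, {N_FEATURES}), got {X.shape}")
    return {name: X[:, cols] for name, cols in BLOCKS.items()}


# Stand-in data: four samples with the PRS block unavailable, hence zero-filled.
X = np.random.default_rng(0).normal(size=(4, N_FEATURES))
X[:, BLOCKS["D_prs"]] = 0.0
blocks = split_blocks(X)
widths = [b.shape[1] for b in blocks.values()]
print(widths, sum(widths))
```

A shape check like this catches misordered or missing columns early, before they silently shift features into the wrong encoder.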

Fine-tuning a checkpoint:

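import torch
from src.models.omics_graph_net import OmicsGraphNet

# weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust
ckpt  = torch.load("checkpoints/omicsgraphnet_fold2_GSE164877.pt", map_location="cpu", weights_only=False)
# Rebuild the architecture from the stored block sizes, then restore the trained weights
model = OmicsGraphNet(block_dims=ckpt["block_dims"])
model.load_state_dict(ckpt["model_state"])

# A small learning rate keeps fine-tuning close to the pre-trained weights
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
# standard PyTorch training loop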

Project Structure

OmicsGraphNet/
├── data/
│   ├── download_geo.py          # Download 8 GEO cohorts via GEOparse
│   ├── fix_labels.py            # Cohort-specific PTSD label extraction
│   ├── processed/               # Generated outputs (git-ignored; run pipeline to create)
│   └── external/
│       ├── ewas/                # PTSD EWAS CpG target genes
│       ├── prs/                 # PGC Freeze 3 GWAS loci + PRSice2/LDPred2 templates
│       ├── ppi/                 # STRING v12 protein aliases
│       └── singlecell/          # Hwang 2025 cell-type marker signatures
├── src/
│   ├── models/
│   │   ├── omics_graph_net.py   # OmicsGraphNet: β-VAE → Attention → GATv2
│   │   └── baselines.py         # ElasticNet, SVM-RFE, RF, XGBoost, ConcatMLP, MOLI, MOGONET
│   ├── preprocessing/
│   │   ├── normalize_expression.py  # VST + ComBat + SVA
│   │   ├── run_deconvolution.py     # NNLS cell-type deconvolution (11 types)
│   │   ├── build_feature_matrix.py  # Assembles 537-feature 6-block matrix
│   │   └── build_ppi_graph.py       # STRING v12 → PyG graph
│   ├── training/
│   │   └── train_omicsgraphnet.py   # LOCO-CV, AdamW, CosineAnnealingLR, early stopping
│   ├── evaluation/
│   │   ├── interpret.py             # Integrated Gradients + cross-modal attention
│   │   └── statistical_tests.py     # DeLong, McNemar, bootstrap, BH-FDR
│   └── visualization/
│       └── make_figures.py          # Figure utilities (Okabe-Ito palette, 600 DPI)
├── checkpoints/                 # 11 pre-trained models (~3.5 MB each)
├── predict.py                   # Inference with MC Dropout uncertainty
├── make_figures.py              # Publication figures (Fig 1–3)
├── run_loco_cv_folds.sh         # Run all 6 LOCO folds sequentially
├── requirements.txt
├── CITATION.cff
├── LICENSE
└── README.md

Key Biomarkers

Top features by Integrated Gradients (GSE164877 fold, AUC = 0.897):

Rank   Feature                Block                 Mean |IG|    Biological context
────   ────────────────────   ───────────────────   ──────────   ──────────────────────────────────────
1      Neutrophil fraction    C (Cell-type)         6.3 × 10⁻⁵   Elevated NLR in PTSD (innate immunity)
2      PV interneuron score   C (Cell-type)         4.8 × 10⁻⁵   GABAergic interneuron proxy
3      NK cell fraction       C (Cell-type)         4.8 × 10⁻⁵   NK functional deficit (glucocorticoid)
4      Astrocyte score        C (Cell-type)         3.8 × 10⁻⁵   Neuroinflammatory proxy (cross-tissue)
5      IFI44                  A (Transcriptomics)   1.4 × 10⁻⁶   ISG15 / type-I interferon pathway
6      RSAD2                  A (Transcriptomics)   1.2 × 10⁻⁶   Viperin / type-I interferon pathway

Full ranked list: results/interpretation/saliency_top_genes.csv


Pending Data Access

Item                              Status                             Source
───────────────────────────────   ────────────────────────────────   ──────────────────────────────────────
Polygenic risk scores (Block D)   Zero-imputed placeholder           dbGaP: phs000864, phs000560, phs002046
EWAS methylation (Block B)        Expression-proxy (12 surrogates)   PGC PTSD EWAS working group
Hwang 2025 atlas (.h5ad)          Required for deconvolution         Zenodo: 10.5281/zenodo.15186498
STRING v12                        Auto-downloaded                    build_ppi_graph.py --download_string

Reproducibility

All analyses run on a standard macOS workstation (Apple M-series, 16 GB RAM). No HPC or GPU required. Seeds: torch.manual_seed(42), np.random.seed(42).

Versions: Python 3.11 · PyTorch 2.2.2 · PyG 2.7.0 · scikit-learn 1.3 · neuroCombat 0.9.1 · GEOparse 2.0.4


Citation

@article{roy2026omicsgraphnet,
  title   = {{OmicsGraphNet}: Graph-Attentive Multi-Omics Fusion for Blood-Based {PTSD}
             Classification Across Military and Civilian Cohorts},
  author  = {Roy, Kushal Raj},
  year    = {2026}
}

License

MIT — see LICENSE
