Skip to content

glbala87/scNetGWAS

Repository files navigation

scNetGWAS

A Python toolkit for integrating GWAS, single-cell RNA-seq, and STRING protein-protein interaction (PPI) networks to identify cell-type-specific disease modules.


Table of Contents


Overview

scNetGWAS builds weighted gene networks by combining three complementary data sources:

  1. GWAS gene-level p-values (e.g., from MAGMA) — identifies disease-associated genes
  2. Single-cell RNA-seq data (AnnData .h5ad) — provides cell-type resolution and co-expression structure
  3. STRING PPI networks — defines physical/functional protein interactions

The pipeline constructs cell-type-specific networks where:

  • Nodes are genes weighted by GWAS significance (-log10(p))
  • Edges are PPI interactions weighted by co-expression correlation and STRING confidence scores
  • Communities (disease modules) are detected via Louvain or other algorithms
  • Gene prioritization extends beyond GWAS hits using Random Walk with Restart

Key Capabilities

Category Features
Network Construction Cell-type stratification, p-value filtering, co-expression edge weighting, STRING confidence scoring
Analysis Community detection, centrality metrics (5 types), Random Walk with Restart, pathway enrichment, differential co-expression
Multi-Trait Fisher's method to combine multiple GWAS datasets
Validation Permutation-based robustness testing, leave-one-cell-type-out cross-validation
QC & Reproducibility Input QC report, reproducibility manifest with file hashes and package versions
Output 5 network formats, matplotlib plots, self-contained HTML report
Scalability Parallel co-expression, sparse matrix support, Snakemake batch workflow

Architecture

scnetgwas/
├── pipeline.py        # Main orchestration — runs the full pipeline
├── validation.py      # Input validation and QC reporting
├── coexpression.py    # Co-expression computation (standard, differential, parallel)
├── network.py         # Network construction, centrality, RWR, multi-GWAS merging
├── community.py       # Community detection, robustness testing, cross-validation
├── enrichment.py      # GO/pathway enrichment via gseapy
├── gene_mapping.py    # Gene ID harmonization (Ensembl ↔ Symbol ↔ Entrez)
├── export.py          # Multi-format export, visualization, HTML reports
└── cli.py             # Command-line interface

Installation

Basic Install

git clone https://github.com/glbala87/scNetGWAS.git
cd scNetGWAS
pip install -e .

With All Optional Dependencies

pip install -e ".[all]"

Optional Dependency Groups

pip install -e ".[enrichment]"   # gseapy for pathway enrichment
pip install -e ".[progress]"     # tqdm for progress bars
pip install -e ".[config]"       # pyyaml for YAML config files
pip install -e ".[dev]"          # pytest, pytest-cov, build, twine

Requirements

  • Python >= 3.9 (tested on 3.9, 3.10, 3.11, 3.12)
  • Core: numpy, pandas, networkx, scanpy, anndata, scipy, matplotlib

Input File Formats

1. GWAS File (CSV or TSV)

Gene-level association results (e.g., from MAGMA).

Column Type Description
gene string Gene symbol (or Ensembl/Entrez ID)
pval float Gene-level p-value
gene,pval
TP53,1.2e-8
BRCA1,0.002
EGFR,0.05
MYC,0.5

2. PPI File (CSV or TSV)

Protein-protein interaction network (e.g., from STRING).

Column Type Description
protein1 string First protein/gene
protein2 string Second protein/gene
combined_score int (optional) STRING confidence score (0-1000)
protein1,protein2,combined_score
TP53,BRCA1,900
TP53,EGFR,700
BRCA1,EGFR,500

3. AnnData File (.h5ad)

Single-cell RNA-seq data in AnnData format.

  • var_names: Gene identifiers (must overlap with GWAS and PPI gene IDs)
  • obs["cell_type"] (optional): Cell type annotations for cell-type-specific analysis
  • obs["condition"] (optional): Case/control labels for differential co-expression

4. Gene Mapping File (CSV, optional)

For harmonizing gene IDs across datasets.

Column Type Description
symbol string HGNC gene symbol
ensembl string Ensembl gene ID
entrez string Entrez gene ID
symbol,ensembl,entrez
TP53,ENSG00000141510,7157
BRCA1,ENSG00000012048,672

Quick Start

# Install
pip install -e .

# Run with minimum required inputs
scnetgwas \
  --gwas data/gwas_results.csv \
  --ppi data/string_ppi.csv \
  --adata data/scrna.h5ad \
  --outdir results/ \
  -v

# Open the HTML report
open results/report.html

Usage Examples

Basic Run

Builds one global network using all cells, detects communities, computes centrality, and generates outputs.

scnetgwas \
  --gwas data/gwas.csv \
  --ppi data/ppi.csv \
  --adata data/scrna.h5ad \
  --outdir results/ \
  --pval-threshold 0.05 \
  -v

Cell-Type-Specific Networks

Build separate networks for each cell type with cross-validation to assess stability.

scnetgwas \
  --gwas data/gwas.csv \
  --ppi data/ppi.csv \
  --adata data/scrna.h5ad \
  --outdir results/ \
  --cell-types Excitatory Inhibitory Microglia Astrocyte \
  --cross-validation \
  -v

Full Analysis Pipeline

Run every available analysis.

scnetgwas \
  --gwas data/gwas.csv \
  --ppi data/ppi.csv \
  --adata data/scrna.h5ad \
  --outdir results/ \
  --pval-threshold 0.05 \
  --min-edge-weight 0.1 \
  --cell-types Excitatory Inhibitory Microglia \
  --use-string-scores \
  --community-method louvain \
  --community-resolution 1.0 \
  --rwr \
  --enrichment \
  --gene-sets GO_Biological_Process_2023 \
  --differential \
  --condition-col condition \
  --case-label case \
  --control-label control \
  --robustness \
  --n-permutations 100 \
  --cross-validation \
  --n-jobs 4 \
  -v

Multi-Trait GWAS Integration

Combine multiple GWAS traits using Fisher's method for p-value combination.

scnetgwas \
  --gwas data/gwas_schizophrenia.csv \
  --multi-gwas data/gwas_bipolar.csv data/gwas_depression.csv \
  --ppi data/ppi.csv \
  --adata data/scrna.h5ad \
  --outdir results_multi/ \
  --pval-threshold 0.05 \
  -v

Gene ID Harmonization

Auto-detect and convert between Ensembl, HGNC symbols, and Entrez IDs.

scnetgwas \
  --gwas data/gwas_ensembl.csv \
  --ppi data/string_ensembl.csv \
  --adata data/scrna.h5ad \
  --outdir results/ \
  --gene-mapping data/gene_id_map.csv \
  --target-id-type symbol \
  -v

Using a Config File

Store parameters in a YAML or TOML file for reproducibility.

scnetgwas \
  --gwas data/gwas.csv \
  --ppi data/ppi.csv \
  --adata data/scrna.h5ad \
  --outdir results/ \
  --config example_config.yaml

Example example_config.yaml:

pval_threshold: 0.05
min_edge_weight: 0.1
community_method: louvain
community_resolution: 1.0
coexpression: true
centrality: true
rwr: true
enrichment: true
robustness: true
n_permutations: 100
plot: true
html_report: true
qc_report: true
# cell_types:
#   - Excitatory
#   - Inhibitory
#   - Microglia

Batch Processing with Snakemake

Process multiple GWAS traits in parallel using the included Snakefile.

1. Organize your GWAS files:

data/gwas/
├── schizophrenia.csv
├── bipolar.csv
└── depression.csv

2. Create a Snakemake config:

# snakemake_config.yaml
gwas_dir: data/gwas/
ppi: data/ppi.csv
adata: data/scrna.h5ad
outdir: results/
pval_threshold: 0.05
cell_types: ["Excitatory", "Inhibitory", "Microglia"]
enrichment: true
rwr: true

3. Run:

snakemake --configfile snakemake_config.yaml -j 4

This creates separate output directories per trait under results/.

Python API

Use individual modules programmatically for custom workflows.

import scanpy as sc
import pandas as pd
from scnetgwas import (
    run_pipeline,
    dry_run,
    build_network,
    compute_centrality,
    random_walk_restart,
    detect_communities,
    run_enrichment,
    generate_qc_report,
    validate_pipeline_params,
    differential_coexpression,
    harmonize_gene_ids,
)

# ── Option 0: Validate inputs before running ─────────────────
qc = dry_run(
    gwas_path="data/gwas.csv",
    ppi_path="data/ppi.csv",
    adata_path="data/scrna.h5ad",
    outdir="results/",
    pval_threshold=0.05,
    cell_types=["Excitatory", "Inhibitory"],
)
print(qc["warnings"])  # Check for issues before committing to full run

# ── Option 1: Run the full pipeline ──────────────────────────
stats = run_pipeline(
    gwas_path="data/gwas.csv",
    ppi_path="data/ppi.csv",
    adata_path="data/scrna.h5ad",
    outdir="results/",
    pval_threshold=0.05,
    cell_types=["Excitatory", "Inhibitory"],
    rwr=True,
    enrichment=True,
    centrality=True,
    robustness=True,
)
print(stats)

# ── Option 2: Use individual modules ─────────────────────────
gwas = pd.read_csv("data/gwas.csv")
ppi = pd.read_csv("data/ppi.csv")
adata = sc.read_h5ad("data/scrna.h5ad")

# Build network for a specific cell type
G = build_network(gwas, ppi, adata, cell_type="Microglia", pval_threshold=0.05)

# Detect communities
communities = detect_communities(G, method="louvain", resolution=1.0)

# Compute centrality metrics
centrality_df = compute_centrality(G)
print(centrality_df.head(10))

# Gene prioritization via Random Walk with Restart
rwr_scores = random_walk_restart(G, seed_genes=["TP53", "BRCA1"])

# Pathway enrichment per community
enrichment_results = run_enrichment(G, communities)

# Differential co-expression
diff = differential_coexpression(
    adata, list(G.nodes()),
    condition_col="condition", case_label="case", control_label="control"
)

CLI Reference

scnetgwas [OPTIONS]

Required Arguments

Flag Description
--gwas PATH GWAS gene-level results (CSV/TSV with gene and pval columns)
--ppi PATH PPI network (CSV/TSV with protein1 and protein2 columns)
--adata PATH AnnData .h5ad file with scRNA-seq data
--outdir PATH Output directory

Filtering

Flag Default Description
--pval-threshold FLOAT None (keep all) Keep GWAS genes with p-value <= threshold
--min-edge-weight FLOAT 0.0 Minimum co-expression weight to retain an edge

Cell-Type Analysis

Flag Default Description
--cell-types TYPE [TYPE ...] None (all cells) Build separate networks per cell type

Community Detection

Flag Default Description
--community-method {louvain,greedy_modularity,label_propagation} louvain Community detection algorithm
--community-resolution FLOAT 1.0 Resolution parameter for Louvain
--robustness off Permutation-based community significance testing
--n-permutations INT 100 Number of permutations for robustness
--cross-validation off Leave-one-cell-type-out cross-validation

Co-Expression

Flag Default Description
--no-coexpression off Skip co-expression computation
--differential off Compute case vs control differential co-expression
--condition-col STR condition Column in adata.obs for case/control
--case-label STR case Label for case group
--control-label STR control Label for control group

Network Analysis

Flag Default Description
--no-centrality off Skip centrality metric computation
--rwr off Run Random Walk with Restart for gene prioritization
--rwr-restart-prob FLOAT 0.15 RWR restart probability
--rwr-seed-genes GENE [GENE ...] None (uses GWAS weights) Seed genes for RWR
--use-string-scores off Use STRING confidence scores as edge weights

Multi-Trait Analysis

Flag Default Description
--multi-gwas PATH [PATH ...] None Additional GWAS files (combined via Fisher's method)

Pathway Enrichment

Flag Default Description
--enrichment off Run pathway enrichment on communities
--gene-sets STR GO_Biological_Process_2023 Gene set library for Enrichr

Gene ID Mapping

Flag Default Description
--gene-mapping PATH None Gene ID mapping table (CSV with ensembl/symbol/entrez)
--target-id-type {symbol,ensembl,entrez} symbol Target ID type for harmonization

Output Options

Flag Default Description
--no-plot off Skip network visualization
--no-html off Skip HTML report generation
--no-qc off Skip input QC report

Performance

Flag Default Description
--n-jobs INT 1 Parallel workers for co-expression computation

Validation

Flag Description
--dry-run Validate inputs and parameters, generate QC report, skip analysis

General

Flag Description
--config PATH YAML or TOML config file (keys validated against known parameters)
-v / --verbose Increase verbosity (-v INFO, -vv DEBUG)
-q / --quiet Suppress all output except errors

Features

1. Gene Ontology / Pathway Enrichment

Runs Enrichr enrichment analysis on each detected community module using gseapy. Falls back to a gene-list summary if gseapy is not installed.

2. Random Walk with Restart (RWR)

Propagates GWAS association signal through the PPI network to prioritize genes beyond genome-wide significance. Seed genes are weighted by -log10(p) by default, or specified manually.

3. Differential Co-Expression

Compares Spearman co-expression between case and control groups. Reports per-gene-pair case correlation, control correlation, and the difference — highlighting edges that rewire between conditions.

4. STRING Confidence Edge Scoring

When --use-string-scores is enabled and the PPI file includes a combined_score column, edge weights become a composite of co-expression and STRING confidence (normalized to 0-1).

5. Multi-Trait GWAS Integration

Accepts multiple GWAS files via --multi-gwas. P-values are combined across traits using Fisher's method (scipy.stats.combine_pvalues), enabling identification of pleiotropic disease modules.

6. Network Centrality Metrics

Computes five centrality measures per gene:

  • Degree — number of direct connections
  • Betweenness — frequency on shortest paths
  • Closeness — inverse average distance to all nodes
  • Eigenvector — connection to other well-connected nodes
  • PageRank — importance based on link structure

7. Parallel Co-Expression

For large gene sets, co-expression computation can be parallelized across --n-jobs workers using chunked thread pools.

8. Sparse Matrix Support

Handles large sparse AnnData matrices efficiently with chunked column-wise dense conversion, avoiding full memory materialization.

9. PPI Gene Pre-Filtering

Before computing co-expression, genes are pre-filtered to only those present in the PPI network, avoiding unnecessary computation on unconnected genes.

10. HTML Report

Generates a self-contained HTML report (report.html) with:

  • QC summary and warnings
  • Network statistics with metric cards
  • Embedded network visualizations
  • Centrality rankings
  • RWR gene prioritization scores
  • Pathway enrichment tables
  • Community robustness results
  • Cross-validation results

11. Reproducibility Manifest

Outputs run_metadata.json containing:

  • Timestamp
  • All pipeline parameters
  • SHA-256 hashes and sizes of input files
  • Package versions (numpy, pandas, networkx, scanpy, etc.)
  • Python version

12. Snakemake Workflow

Included Snakefile enables batch processing of multiple GWAS traits in parallel, with automatic result aggregation.

13. Progress Bars

Optional tqdm progress bars for long-running steps. Install with pip install tqdm or pip install -e ".[progress]".

14. Gene ID Mapping

Auto-detects gene ID types (Ensembl, HGNC symbols, Entrez) and harmonizes across datasets using a provided mapping table. Handles duplicates gracefully.

15. Input QC Report & Validation

Before analysis, validates all inputs and generates qc_report.json with:

  • Parameter bounds validation: p-value threshold (0–1), community resolution (> 0), RWR restart probability (0–1), edge weight (>= 0)
  • Cell type existence check: Verifies requested --cell-types exist in AnnData before starting
  • GWAS validation: p-value range (0–1), duplicate gene detection, missing value handling
  • PPI validation: STRING score range check (0–1000), duplicate edge warnings
  • Config key validation: Unknown keys in YAML/TOML configs are flagged and ignored
  • GWAS p-value distribution statistics
  • PPI edge counts and self-loop detection
  • scRNA cell counts per cell type
  • Gene overlap across all three datasets
  • Warnings for low overlap, missing annotations, or data issues

Use --dry-run to run validation and QC only, without the full analysis.

16. Community Robustness Testing

Permutation-based significance testing via degree-preserving edge swaps. Reports real modularity, null distribution statistics, z-score, and empirical p-value.

17. Leave-One-Cell-Type-Out Cross-Validation

Assesses community stability by holding out each cell type, rebuilding the network, and comparing community assignments via Jaccard similarity.


Output Files

After running, the output directory contains:

File Description
network_*.edgelist Edge list format
network_*.graphml Cytoscape-compatible GraphML
network_*.gexf Gephi-compatible GEXF
network_*.json JSON node-link format
network_*_adjacency.csv Adjacency matrix as CSV
network_*.png Network visualization (nodes colored by community, sized by GWAS weight)
centrality_*.csv Centrality metrics per gene (degree, betweenness, closeness, eigenvector, PageRank)
rwr_scores_*.csv Random Walk with Restart gene prioritization scores
diff_coexpr_*.csv Differential co-expression (case vs control per gene pair)
*_enrichment_comm*.csv Pathway enrichment results per community
*_enrichment_summary.json Top enrichment terms per community
qc_report.json Input QC report with overlap stats and warnings
summary_stats.json Network statistics (nodes, edges, density, hubs, communities)
run_metadata.json Reproducibility manifest (parameters, file hashes, versions)
cross_validation.json Leave-one-cell-type-out Jaccard similarities
report.html Self-contained HTML report with all results

Testing

# Install dev dependencies
pip install -e ".[dev]"

# Run all tests (65 tests)
pytest tests/ -v

# Run only the original pipeline tests
pytest tests/test_pipeline.py -v

# Run only the new feature tests
pytest tests/test_new_features.py -v

# Run with coverage report
pytest tests/ -v --cov=scnetgwas --cov-report=term-missing

CI Pipeline

GitHub Actions runs automatically on push/PR with:

  • Linting: flake8 for syntax errors and style checks
  • Multi-version testing: Python 3.9, 3.10, 3.11, 3.12
  • Coverage reporting: pytest-cov with term-missing output
  • Optional dependency testing: Separate job with all optional deps installed
  • Pip caching: Faster CI runs with dependency caching

Project Structure

scNetGWAS/
├── .github/
│   └── workflows/
│       └── ci.yml                 # GitHub Actions CI (lint, test matrix, coverage)
├── src/
│   └── scnetgwas/
│       ├── __init__.py            # Public API exports
│       ├── cli.py                 # Command-line interface
│       ├── pipeline.py            # Main pipeline orchestration
│       ├── validation.py          # Input validation & QC reporting
│       ├── coexpression.py        # Co-expression (standard, differential, parallel)
│       ├── network.py             # Network construction, centrality, RWR
│       ├── community.py           # Community detection, robustness, cross-validation
│       ├── enrichment.py          # Pathway enrichment via gseapy
│       ├── gene_mapping.py        # Gene ID harmonization
│       └── export.py              # Multi-format export, plots, HTML report
├── tests/
│   ├── test_smoke.py              # Import smoke test
│   ├── test_pipeline.py           # Core pipeline tests (28 tests)
│   └── test_new_features.py       # Enhancement tests (37 tests)
├── Snakefile                      # Snakemake batch workflow
├── example_config.yaml            # Example YAML configuration
├── pyproject.toml                 # Package configuration
├── README.md
├── LICENSE
└── .gitignore

Scalability & Resource Requirements

Memory Guidelines

Dataset Size Genes Cells Approx. RAM Notes
Small < 5,000 < 10,000 2–4 GB Runs on laptop
Medium 5,000–15,000 10,000–100,000 8–16 GB Use --n-jobs 4 for co-expression
Large 15,000–25,000 100,000–500,000 32–64 GB Consider --no-coexpression or cell-type subsets
Very large > 25,000 > 500,000 64+ GB Subsample cells, use sparse AnnData, limit cell types

Performance Tips

  • Co-expression is the bottleneck: O(n² genes) pairwise correlations. Use --n-jobs for parallelism or --no-coexpression if not needed.
  • Sparse matrices: The tool handles sparse .X in AnnData efficiently with chunked conversion for matrices > 10⁸ elements.
  • PPI pre-filtering: Only PPI-connected genes are included in co-expression, reducing computation significantly.
  • Cell-type subsetting: Processing one cell type at a time uses far less memory than all cells together.
  • Permutation testing: --robustness with --n-permutations 1000 multiplies community detection runtime proportionally — start with 100.

Dry-Run Validation

Use --dry-run to validate all inputs and parameters without running the full analysis:

scnetgwas --gwas gwas.csv --ppi ppi.csv --adata data.h5ad --outdir out/ --dry-run

This loads files, checks formats, validates parameter bounds, verifies cell types, and writes a QC report — useful before committing to a long run.


Troubleshooting

Common Errors

Error Cause Fix
GWAS file missing required columns: {'gene'} CSV column names don't match expected gene and pval Rename columns or check delimiter (CSV vs TSV auto-detected)
GWAS p-values must be between 0 and 1 P-values contain -log10 transformed values or z-scores Convert back to raw p-values before input
No cells found for cell_type='...' Requested cell type doesn't exist in adata.obs["cell_type"] Use --dry-run to list available cell types, check spelling
Cell types not found in AnnData --cell-types includes names not in the data Run --dry-run to see available types
No overlapping genes — returning empty network Gene IDs don't match across GWAS/PPI/scRNA-seq Use --gene-mapping with a mapping table to harmonize IDs
CRITICAL: No genes overlap across all three datasets Gene naming conventions differ (e.g., Ensembl vs Symbol) Check QC report, use --gene-mapping and --target-id-type
PyYAML is required to use YAML config files Optional dependency not installed pip install scnetgwas[config]
gseapy is required for enrichment Optional dependency not installed pip install scnetgwas[enrichment]
Unknown config keys (will be ignored): [...] Typo or unsupported key in config YAML/TOML Check key spelling against CLI reference
Eigenvector centrality did not converge Disconnected graph or numerical instability Non-fatal — zeros used as fallback; consider filtering edges

Debugging Tips

  1. Start with --dry-run: Validates inputs cheaply before long runs
  2. Check qc_report.json: Gene overlap stats reveal ID mismatches early
  3. Use -vv for DEBUG logging: Shows detailed step-by-step progress
  4. Small test first: Run on one cell type with --no-coexpression to verify the pipeline works, then scale up
  5. Memory issues: If you hit OOM, use --cell-types to process one type at a time, or increase --min-edge-weight to produce a sparser network

Dependencies

Core (installed automatically)

Package Version Purpose
numpy >= 1.23, < 3.0 Numerical computing
pandas >= 2.0, < 3.0 Data manipulation
networkx >= 3.0, < 4.0 Graph/network analysis
scanpy >= 1.9.0, < 2.0 Single-cell RNA-seq processing
anndata >= 0.9.0, < 1.0 Single-cell data structure
scipy >= 1.10, < 2.0 Scientific computing, statistics
matplotlib >= 3.5, < 4.0 Network visualization

Optional

Package Install Group Purpose
gseapy [enrichment] GO/pathway enrichment analysis
tqdm [progress] Progress bars
pyyaml [config] YAML config file support
snakemake (separate) Batch workflow execution

License

MIT License. See LICENSE for details.


Author

Bala SubramaniGitHub

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages