A Python toolkit for integrating GWAS, single-cell RNA-seq, and STRING protein-protein interaction (PPI) networks to identify cell-type-specific disease modules.
- Overview
- Architecture
- Installation
- Input File Formats
- Quick Start
- Usage Examples
- CLI Reference
- Features
- Output Files
- Testing
- Project Structure
- Scalability & Resource Requirements
- Troubleshooting
- Dependencies
- License
- Author
scNetGWAS builds weighted gene networks by combining three complementary data sources:
- GWAS gene-level p-values (e.g., from MAGMA) — identifies disease-associated genes
- Single-cell RNA-seq data (AnnData
.h5ad) — provides cell-type resolution and co-expression structure - STRING PPI networks — defines physical/functional protein interactions
The pipeline constructs cell-type-specific networks where:
- Nodes are genes weighted by GWAS significance (
-log10(p)) - Edges are PPI interactions weighted by co-expression correlation and STRING confidence scores
- Communities (disease modules) are detected via Louvain or other algorithms
- Gene prioritization extends beyond GWAS hits using Random Walk with Restart
| Category | Features |
|---|---|
| Network Construction | Cell-type stratification, p-value filtering, co-expression edge weighting, STRING confidence scoring |
| Analysis | Community detection, centrality metrics (5 types), Random Walk with Restart, pathway enrichment, differential co-expression |
| Multi-Trait | Fisher's method to combine multiple GWAS datasets |
| Validation | Permutation-based robustness testing, leave-one-cell-type-out cross-validation |
| QC & Reproducibility | Input QC report, reproducibility manifest with file hashes and package versions |
| Output | 5 network formats, matplotlib plots, self-contained HTML report |
| Scalability | Parallel co-expression, sparse matrix support, Snakemake batch workflow |
scnetgwas/
├── pipeline.py # Main orchestration — runs the full pipeline
├── validation.py # Input validation and QC reporting
├── coexpression.py # Co-expression computation (standard, differential, parallel)
├── network.py # Network construction, centrality, RWR, multi-GWAS merging
├── community.py # Community detection, robustness testing, cross-validation
├── enrichment.py # GO/pathway enrichment via gseapy
├── gene_mapping.py # Gene ID harmonization (Ensembl ↔ Symbol ↔ Entrez)
├── export.py # Multi-format export, visualization, HTML reports
└── cli.py # Command-line interface
git clone https://github.com/glbala87/scNetGWAS.git
cd scNetGWAS
pip install -e .pip install -e ".[all]"pip install -e ".[enrichment]" # gseapy for pathway enrichment
pip install -e ".[progress]" # tqdm for progress bars
pip install -e ".[config]" # pyyaml for YAML config files
pip install -e ".[dev]" # pytest, pytest-cov, build, twine- Python >= 3.9 (tested on 3.9, 3.10, 3.11, 3.12)
- Core: numpy, pandas, networkx, scanpy, anndata, scipy, matplotlib
Gene-level association results (e.g., from MAGMA).
| Column | Type | Description |
|---|---|---|
gene |
string | Gene symbol (or Ensembl/Entrez ID) |
pval |
float | Gene-level p-value |
gene,pval
TP53,1.2e-8
BRCA1,0.002
EGFR,0.05
MYC,0.5Protein-protein interaction network (e.g., from STRING).
| Column | Type | Description |
|---|---|---|
protein1 |
string | First protein/gene |
protein2 |
string | Second protein/gene |
combined_score |
int (optional) | STRING confidence score (0-1000) |
protein1,protein2,combined_score
TP53,BRCA1,900
TP53,EGFR,700
BRCA1,EGFR,500Single-cell RNA-seq data in AnnData format.
var_names: Gene identifiers (must overlap with GWAS and PPI gene IDs)obs["cell_type"](optional): Cell type annotations for cell-type-specific analysisobs["condition"](optional): Case/control labels for differential co-expression
For harmonizing gene IDs across datasets.
| Column | Type | Description |
|---|---|---|
symbol |
string | HGNC gene symbol |
ensembl |
string | Ensembl gene ID |
entrez |
string | Entrez gene ID |
symbol,ensembl,entrez
TP53,ENSG00000141510,7157
BRCA1,ENSG00000012048,672# Install
pip install -e .
# Run with minimum required inputs
scnetgwas \
--gwas data/gwas_results.csv \
--ppi data/string_ppi.csv \
--adata data/scrna.h5ad \
--outdir results/ \
-v
# Open the HTML report
open results/report.htmlBuilds one global network using all cells, detects communities, computes centrality, and generates outputs.
scnetgwas \
--gwas data/gwas.csv \
--ppi data/ppi.csv \
--adata data/scrna.h5ad \
--outdir results/ \
--pval-threshold 0.05 \
-vBuild separate networks for each cell type with cross-validation to assess stability.
scnetgwas \
--gwas data/gwas.csv \
--ppi data/ppi.csv \
--adata data/scrna.h5ad \
--outdir results/ \
--cell-types Excitatory Inhibitory Microglia Astrocyte \
--cross-validation \
-vRun every available analysis.
scnetgwas \
--gwas data/gwas.csv \
--ppi data/ppi.csv \
--adata data/scrna.h5ad \
--outdir results/ \
--pval-threshold 0.05 \
--min-edge-weight 0.1 \
--cell-types Excitatory Inhibitory Microglia \
--use-string-scores \
--community-method louvain \
--community-resolution 1.0 \
--rwr \
--enrichment \
--gene-sets GO_Biological_Process_2023 \
--differential \
--condition-col condition \
--case-label case \
--control-label control \
--robustness \
--n-permutations 100 \
--cross-validation \
--n-jobs 4 \
-vCombine multiple GWAS traits using Fisher's method for p-value combination.
scnetgwas \
--gwas data/gwas_schizophrenia.csv \
--multi-gwas data/gwas_bipolar.csv data/gwas_depression.csv \
--ppi data/ppi.csv \
--adata data/scrna.h5ad \
--outdir results_multi/ \
--pval-threshold 0.05 \
-vAuto-detect and convert between Ensembl, HGNC symbols, and Entrez IDs.
scnetgwas \
--gwas data/gwas_ensembl.csv \
--ppi data/string_ensembl.csv \
--adata data/scrna.h5ad \
--outdir results/ \
--gene-mapping data/gene_id_map.csv \
--target-id-type symbol \
-vStore parameters in a YAML or TOML file for reproducibility.
scnetgwas \
--gwas data/gwas.csv \
--ppi data/ppi.csv \
--adata data/scrna.h5ad \
--outdir results/ \
--config example_config.yamlExample example_config.yaml:
pval_threshold: 0.05
min_edge_weight: 0.1
community_method: louvain
community_resolution: 1.0
coexpression: true
centrality: true
rwr: true
enrichment: true
robustness: true
n_permutations: 100
plot: true
html_report: true
qc_report: true
# cell_types:
# - Excitatory
# - Inhibitory
# - MicrogliaProcess multiple GWAS traits in parallel using the included Snakefile.
1. Organize your GWAS files:
data/gwas/
├── schizophrenia.csv
├── bipolar.csv
└── depression.csv
2. Create a Snakemake config:
# snakemake_config.yaml
gwas_dir: data/gwas/
ppi: data/ppi.csv
adata: data/scrna.h5ad
outdir: results/
pval_threshold: 0.05
cell_types: ["Excitatory", "Inhibitory", "Microglia"]
enrichment: true
rwr: true3. Run:
snakemake --configfile snakemake_config.yaml -j 4This creates separate output directories per trait under results/.
Use individual modules programmatically for custom workflows.
import scanpy as sc
import pandas as pd
from scnetgwas import (
run_pipeline,
dry_run,
build_network,
compute_centrality,
random_walk_restart,
detect_communities,
run_enrichment,
generate_qc_report,
validate_pipeline_params,
differential_coexpression,
harmonize_gene_ids,
)
# ── Option 0: Validate inputs before running ─────────────────
qc = dry_run(
gwas_path="data/gwas.csv",
ppi_path="data/ppi.csv",
adata_path="data/scrna.h5ad",
outdir="results/",
pval_threshold=0.05,
cell_types=["Excitatory", "Inhibitory"],
)
print(qc["warnings"]) # Check for issues before committing to full run
# ── Option 1: Run the full pipeline ──────────────────────────
stats = run_pipeline(
gwas_path="data/gwas.csv",
ppi_path="data/ppi.csv",
adata_path="data/scrna.h5ad",
outdir="results/",
pval_threshold=0.05,
cell_types=["Excitatory", "Inhibitory"],
rwr=True,
enrichment=True,
centrality=True,
robustness=True,
)
print(stats)
# ── Option 2: Use individual modules ─────────────────────────
gwas = pd.read_csv("data/gwas.csv")
ppi = pd.read_csv("data/ppi.csv")
adata = sc.read_h5ad("data/scrna.h5ad")
# Build network for a specific cell type
G = build_network(gwas, ppi, adata, cell_type="Microglia", pval_threshold=0.05)
# Detect communities
communities = detect_communities(G, method="louvain", resolution=1.0)
# Compute centrality metrics
centrality_df = compute_centrality(G)
print(centrality_df.head(10))
# Gene prioritization via Random Walk with Restart
rwr_scores = random_walk_restart(G, seed_genes=["TP53", "BRCA1"])
# Pathway enrichment per community
enrichment_results = run_enrichment(G, communities)
# Differential co-expression
diff = differential_coexpression(
adata, list(G.nodes()),
condition_col="condition", case_label="case", control_label="control"
)scnetgwas [OPTIONS]
| Flag | Description |
|---|---|
--gwas PATH |
GWAS gene-level results (CSV/TSV with gene and pval columns) |
--ppi PATH |
PPI network (CSV/TSV with protein1 and protein2 columns) |
--adata PATH |
AnnData .h5ad file with scRNA-seq data |
--outdir PATH |
Output directory |
| Flag | Default | Description |
|---|---|---|
--pval-threshold FLOAT |
None (keep all) | Keep GWAS genes with p-value <= threshold |
--min-edge-weight FLOAT |
0.0 | Minimum co-expression weight to retain an edge |
| Flag | Default | Description |
|---|---|---|
--cell-types TYPE [TYPE ...] |
None (all cells) | Build separate networks per cell type |
| Flag | Default | Description |
|---|---|---|
--community-method {louvain,greedy_modularity,label_propagation} |
louvain | Community detection algorithm |
--community-resolution FLOAT |
1.0 | Resolution parameter for Louvain |
--robustness |
off | Permutation-based community significance testing |
--n-permutations INT |
100 | Number of permutations for robustness |
--cross-validation |
off | Leave-one-cell-type-out cross-validation |
| Flag | Default | Description |
|---|---|---|
--no-coexpression |
off | Skip co-expression computation |
--differential |
off | Compute case vs control differential co-expression |
--condition-col STR |
condition | Column in adata.obs for case/control |
--case-label STR |
case | Label for case group |
--control-label STR |
control | Label for control group |
| Flag | Default | Description |
|---|---|---|
--no-centrality |
off | Skip centrality metric computation |
--rwr |
off | Run Random Walk with Restart for gene prioritization |
--rwr-restart-prob FLOAT |
0.15 | RWR restart probability |
--rwr-seed-genes GENE [GENE ...] |
None (uses GWAS weights) | Seed genes for RWR |
--use-string-scores |
off | Use STRING confidence scores as edge weights |
| Flag | Default | Description |
|---|---|---|
--multi-gwas PATH [PATH ...] |
None | Additional GWAS files (combined via Fisher's method) |
| Flag | Default | Description |
|---|---|---|
--enrichment |
off | Run pathway enrichment on communities |
--gene-sets STR |
GO_Biological_Process_2023 | Gene set library for Enrichr |
| Flag | Default | Description |
|---|---|---|
--gene-mapping PATH |
None | Gene ID mapping table (CSV with ensembl/symbol/entrez) |
--target-id-type {symbol,ensembl,entrez} |
symbol | Target ID type for harmonization |
| Flag | Default | Description |
|---|---|---|
--no-plot |
off | Skip network visualization |
--no-html |
off | Skip HTML report generation |
--no-qc |
off | Skip input QC report |
| Flag | Default | Description |
|---|---|---|
--n-jobs INT |
1 | Parallel workers for co-expression computation |
| Flag | Description |
|---|---|
--dry-run |
Validate inputs and parameters, generate QC report, skip analysis |
| Flag | Description |
|---|---|
--config PATH |
YAML or TOML config file (keys validated against known parameters) |
-v / --verbose |
Increase verbosity (-v INFO, -vv DEBUG) |
-q / --quiet |
Suppress all output except errors |
Runs Enrichr enrichment analysis on each detected community module using gseapy. Falls back to a gene-list summary if gseapy is not installed.
Propagates GWAS association signal through the PPI network to prioritize genes beyond genome-wide significance. Seed genes are weighted by -log10(p) by default, or specified manually.
Compares Spearman co-expression between case and control groups. Reports per-gene-pair case correlation, control correlation, and the difference — highlighting edges that rewire between conditions.
When --use-string-scores is enabled and the PPI file includes a combined_score column, edge weights become a composite of co-expression and STRING confidence (normalized to 0-1).
Accepts multiple GWAS files via --multi-gwas. P-values are combined across traits using Fisher's method (scipy.stats.combine_pvalues), enabling identification of pleiotropic disease modules.
Computes five centrality measures per gene:
- Degree — number of direct connections
- Betweenness — frequency on shortest paths
- Closeness — inverse average distance to all nodes
- Eigenvector — connection to other well-connected nodes
- PageRank — importance based on link structure
For large gene sets, co-expression computation can be parallelized across --n-jobs workers using chunked thread pools.
Handles large sparse AnnData matrices efficiently with chunked column-wise dense conversion, avoiding full memory materialization.
Before computing co-expression, genes are pre-filtered to only those present in the PPI network, avoiding unnecessary computation on unconnected genes.
Generates a self-contained HTML report (report.html) with:
- QC summary and warnings
- Network statistics with metric cards
- Embedded network visualizations
- Centrality rankings
- RWR gene prioritization scores
- Pathway enrichment tables
- Community robustness results
- Cross-validation results
Outputs run_metadata.json containing:
- Timestamp
- All pipeline parameters
- SHA-256 hashes and sizes of input files
- Package versions (numpy, pandas, networkx, scanpy, etc.)
- Python version
Included Snakefile enables batch processing of multiple GWAS traits in parallel, with automatic result aggregation.
Optional tqdm progress bars for long-running steps. Install with pip install tqdm or pip install -e ".[progress]".
Auto-detects gene ID types (Ensembl, HGNC symbols, Entrez) and harmonizes across datasets using a provided mapping table. Handles duplicates gracefully.
Before analysis, validates all inputs and generates qc_report.json with:
- Parameter bounds validation: p-value threshold (0–1), community resolution (> 0), RWR restart probability (0–1), edge weight (>= 0)
- Cell type existence check: Verifies requested
--cell-typesexist in AnnData before starting - GWAS validation: p-value range (0–1), duplicate gene detection, missing value handling
- PPI validation: STRING score range check (0–1000), duplicate edge warnings
- Config key validation: Unknown keys in YAML/TOML configs are flagged and ignored
- GWAS p-value distribution statistics
- PPI edge counts and self-loop detection
- scRNA cell counts per cell type
- Gene overlap across all three datasets
- Warnings for low overlap, missing annotations, or data issues
Use --dry-run to run validation and QC only, without the full analysis.
Permutation-based significance testing via degree-preserving edge swaps. Reports real modularity, null distribution statistics, z-score, and empirical p-value.
Assesses community stability by holding out each cell type, rebuilding the network, and comparing community assignments via Jaccard similarity.
After running, the output directory contains:
| File | Description |
|---|---|
network_*.edgelist |
Edge list format |
network_*.graphml |
Cytoscape-compatible GraphML |
network_*.gexf |
Gephi-compatible GEXF |
network_*.json |
JSON node-link format |
network_*_adjacency.csv |
Adjacency matrix as CSV |
network_*.png |
Network visualization (nodes colored by community, sized by GWAS weight) |
centrality_*.csv |
Centrality metrics per gene (degree, betweenness, closeness, eigenvector, PageRank) |
rwr_scores_*.csv |
Random Walk with Restart gene prioritization scores |
diff_coexpr_*.csv |
Differential co-expression (case vs control per gene pair) |
*_enrichment_comm*.csv |
Pathway enrichment results per community |
*_enrichment_summary.json |
Top enrichment terms per community |
qc_report.json |
Input QC report with overlap stats and warnings |
summary_stats.json |
Network statistics (nodes, edges, density, hubs, communities) |
run_metadata.json |
Reproducibility manifest (parameters, file hashes, versions) |
cross_validation.json |
Leave-one-cell-type-out Jaccard similarities |
report.html |
Self-contained HTML report with all results |
# Install dev dependencies
pip install -e ".[dev]"
# Run all tests (65 tests)
pytest tests/ -v
# Run only the original pipeline tests
pytest tests/test_pipeline.py -v
# Run only the new feature tests
pytest tests/test_new_features.py -v
# Run with coverage report
pytest tests/ -v --cov=scnetgwas --cov-report=term-missingGitHub Actions runs automatically on push/PR with:
- Linting: flake8 for syntax errors and style checks
- Multi-version testing: Python 3.9, 3.10, 3.11, 3.12
- Coverage reporting: pytest-cov with term-missing output
- Optional dependency testing: Separate job with all optional deps installed
- Pip caching: Faster CI runs with dependency caching
scNetGWAS/
├── .github/
│ └── workflows/
│ └── ci.yml # GitHub Actions CI (lint, test matrix, coverage)
├── src/
│ └── scnetgwas/
│ ├── __init__.py # Public API exports
│ ├── cli.py # Command-line interface
│ ├── pipeline.py # Main pipeline orchestration
│ ├── validation.py # Input validation & QC reporting
│ ├── coexpression.py # Co-expression (standard, differential, parallel)
│ ├── network.py # Network construction, centrality, RWR
│ ├── community.py # Community detection, robustness, cross-validation
│ ├── enrichment.py # Pathway enrichment via gseapy
│ ├── gene_mapping.py # Gene ID harmonization
│ └── export.py # Multi-format export, plots, HTML report
├── tests/
│ ├── test_smoke.py # Import smoke test
│ ├── test_pipeline.py # Core pipeline tests (28 tests)
│ └── test_new_features.py # Enhancement tests (37 tests)
├── Snakefile # Snakemake batch workflow
├── example_config.yaml # Example YAML configuration
├── pyproject.toml # Package configuration
├── README.md
├── LICENSE
└── .gitignore
| Dataset Size | Genes | Cells | Approx. RAM | Notes |
|---|---|---|---|---|
| Small | < 5,000 | < 10,000 | 2–4 GB | Runs on laptop |
| Medium | 5,000–15,000 | 10,000–100,000 | 8–16 GB | Use --n-jobs 4 for co-expression |
| Large | 15,000–25,000 | 100,000–500,000 | 32–64 GB | Consider --no-coexpression or cell-type subsets |
| Very large | > 25,000 | > 500,000 | 64+ GB | Subsample cells, use sparse AnnData, limit cell types |
- Co-expression is the bottleneck: O(n² genes) pairwise correlations. Use
--n-jobsfor parallelism or--no-coexpressionif not needed. - Sparse matrices: The tool handles sparse
.Xin AnnData efficiently with chunked conversion for matrices > 10⁸ elements. - PPI pre-filtering: Only PPI-connected genes are included in co-expression, reducing computation significantly.
- Cell-type subsetting: Processing one cell type at a time uses far less memory than all cells together.
- Permutation testing:
--robustnesswith--n-permutations 1000multiplies community detection runtime proportionally — start with 100.
Use --dry-run to validate all inputs and parameters without running the full analysis:
scnetgwas --gwas gwas.csv --ppi ppi.csv --adata data.h5ad --outdir out/ --dry-runThis loads files, checks formats, validates parameter bounds, verifies cell types, and writes a QC report — useful before committing to a long run.
| Error | Cause | Fix |
|---|---|---|
GWAS file missing required columns: {'gene'} |
CSV column names don't match expected gene and pval |
Rename columns or check delimiter (CSV vs TSV auto-detected) |
GWAS p-values must be between 0 and 1 |
P-values contain -log10 transformed values or z-scores | Convert back to raw p-values before input |
No cells found for cell_type='...' |
Requested cell type doesn't exist in adata.obs["cell_type"] |
Use --dry-run to list available cell types, check spelling |
Cell types not found in AnnData |
--cell-types includes names not in the data |
Run --dry-run to see available types |
No overlapping genes — returning empty network |
Gene IDs don't match across GWAS/PPI/scRNA-seq | Use --gene-mapping with a mapping table to harmonize IDs |
CRITICAL: No genes overlap across all three datasets |
Gene naming conventions differ (e.g., Ensembl vs Symbol) | Check QC report, use --gene-mapping and --target-id-type |
PyYAML is required to use YAML config files |
Optional dependency not installed | pip install scnetgwas[config] |
gseapy is required for enrichment |
Optional dependency not installed | pip install scnetgwas[enrichment] |
Unknown config keys (will be ignored): [...] |
Typo or unsupported key in config YAML/TOML | Check key spelling against CLI reference |
Eigenvector centrality did not converge |
Disconnected graph or numerical instability | Non-fatal — zeros used as fallback; consider filtering edges |
- Start with
--dry-run: Validates inputs cheaply before long runs - Check
qc_report.json: Gene overlap stats reveal ID mismatches early - Use
-vvfor DEBUG logging: Shows detailed step-by-step progress - Small test first: Run on one cell type with
--no-coexpressionto verify the pipeline works, then scale up - Memory issues: If you hit OOM, use
--cell-typesto process one type at a time, or increase--min-edge-weightto produce a sparser network
| Package | Version | Purpose |
|---|---|---|
| numpy | >= 1.23, < 3.0 | Numerical computing |
| pandas | >= 2.0, < 3.0 | Data manipulation |
| networkx | >= 3.0, < 4.0 | Graph/network analysis |
| scanpy | >= 1.9.0, < 2.0 | Single-cell RNA-seq processing |
| anndata | >= 0.9.0, < 1.0 | Single-cell data structure |
| scipy | >= 1.10, < 2.0 | Scientific computing, statistics |
| matplotlib | >= 3.5, < 4.0 | Network visualization |
| Package | Install Group | Purpose |
|---|---|---|
| gseapy | [enrichment] |
GO/pathway enrichment analysis |
| tqdm | [progress] |
Progress bars |
| pyyaml | [config] |
YAML config file support |
| snakemake | (separate) | Batch workflow execution |
MIT License. See LICENSE for details.
Bala Subramani — GitHub