GWASEngine

Pure Python Genome-Wide Association Study Analysis Engine

Python 3.9+ License: MIT

GWASEngine is a complete GWAS analysis pipeline built entirely in Python using NumPy, SciPy, and scikit-learn — no PLINK, no R, no BOLT-LMM, no REGENIE. All algorithms are implemented from first principles.

Features

| Module | Description |
| --- | --- |
| Quality Control | Sample QC (call rate, heterozygosity, relatedness) + Variant QC (MAF, HWE, missingness) |
| Association Testing | Vectorized linear regression (quantitative) + Firth logistic regression (binary) |
| LD Structure | Pairwise r² computation, LD clumping, Gabriel block detection |
| Polygenic Risk Scores | C+T (Clumping + Thresholding) + LDpred2-inspired Bayesian shrinkage |
| Fine-Mapping | Wakefield Approximate Bayes Factors, 95% credible sets |
| LD Score Regression | SNP heritability (h²_SNP) estimation, confounding inflation correction |

Installation

pip install numpy scipy pandas scikit-learn plotly matplotlib
git clone https://github.com/junior1p/GWASEngine.git
cd GWASEngine

Quick Start

from gwasengine import run_gwas_engine

summary = run_gwas_engine(
    trait="quantitative",
    out_dir="gwas_output",
    n_samples=2000,
    n_snps=10000,
    h2_snp=0.3,
    n_causal=20,
    n_genetic_pcs=4,
    run_prs=True,
    run_finemapping=True,
    run_ldsc=True,
)

Or from the command line:

python -m gwasengine

Demo Output

The demo (python -m gwasengine) generates:

gwas_output/
├── gwas_analysis.html   # Interactive 6-panel dashboard
├── gwas_results.csv     # All SNP association statistics
├── lead_snps.csv        # Independent genome-wide significant loci
├── prs.csv              # Per-sample polygenic risk scores
├── finemapping.csv      # Fine-mapping PIPs at top locus
└── summary.json         # Pipeline summary metrics

Expected metrics (2000 samples, 10000 SNPs, h²_SNP=0.30, 20 causal):

  • ~5–15 genome-wide significant SNPs (p<5×10⁻⁸)
  • λ_GC ≈ 1.0–1.1 (well-controlled with PC correction)
  • h²_SNP estimate ≈ 0.25–0.35
  • PRS R² ≈ 0.05–0.15 at p<10⁻⁴ threshold
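
As a rough illustration of how the λ_GC metric above is conventionally defined, the sketch below computes the genomic inflation factor from a vector of p-values. The function name `genomic_inflation` is hypothetical and is not taken from the GWASEngine API; it shows only the standard median-based definition.

```python
import numpy as np
from scipy import stats

def genomic_inflation(pvalues):
    """Median observed 1-df chi-square divided by its expected median under the null."""
    p = np.asarray(pvalues, dtype=float)
    chi2_obs = stats.chi2.isf(p, df=1)            # map p-values back to chi-square statistics
    return np.median(chi2_obs) / stats.chi2.ppf(0.5, df=1)

# Under the null, p-values are uniform and lambda_GC should be close to 1.
rng = np.random.default_rng(0)
null_p = rng.uniform(size=10_000)
lam = genomic_inflation(null_p)
print(round(lam, 2))
```

Values well above 1 usually point to uncorrected stratification or other confounding, although a highly polygenic trait also inflates the median statistic.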

Interactive Dashboard

The dashboard is written to gwas_analysis.html and opens automatically. It has six panels:

  1. Manhattan Plot — -log₁₀(p) vs genomic position, GWS threshold line
  2. QQ Plot — Observed vs expected p-values, λ_GC annotation
  3. PRS Distribution — Histogram of polygenic risk scores
  4. Fine-Mapping PIPs — Posterior inclusion probabilities at top locus
  5. LDSC Results — Heritability estimates and intercept
  6. Top Hits — Bar chart of most significant associations
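
The posterior inclusion probabilities in panel 4 can be sketched from summary statistics alone. The code below is a minimal illustration, not the project's implementation: the function names and the prior effect variance `prior_var=0.04` are assumptions. It also assumes a single causal variant per locus, which is the setting used by Maller et al. (2012).

```python
import numpy as np

def wakefield_log_abf(beta, se, prior_var=0.04):
    """Log approximate Bayes factor in favour of association (Wakefield 2009).

    prior_var (W) is an assumed prior variance of the true effect size.
    """
    v = np.asarray(se, dtype=float) ** 2           # sampling variance V of each estimate
    z2 = np.asarray(beta, dtype=float) ** 2 / v    # squared z-score
    shrink = prior_var / (v + prior_var)           # W / (V + W)
    return 0.5 * (np.log1p(-shrink) + z2 * shrink)

def credible_set(log_abf, coverage=0.95):
    """PIPs under a single-causal-variant assumption, plus the smallest set reaching coverage."""
    w = np.exp(log_abf - np.max(log_abf))          # subtract the max for numerical stability
    pip = w / w.sum()
    order = np.argsort(pip)[::-1]                  # most probable SNP first
    k = int(np.searchsorted(np.cumsum(pip[order]), coverage)) + 1
    return pip, order[:k]

beta = np.array([0.02, 0.30, 0.05, 0.28])          # toy effect estimates at one locus
se = np.full(4, 0.05)
pip, cs = credible_set(wakefield_log_abf(beta, se))
print(np.round(pip, 3), cs)
```

In this toy locus the two SNPs with large z-scores share almost all of the posterior mass, so the 95% credible set contains both of them.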

Scientific Background

Why no PLINK?

Most GWAS tools require compiled binaries (PLINK 1.9, PLINK 2.0, REGENIE) or R/Bioconductor packages. GWASEngine is pure Python — installable anywhere pip works, no sudo/root required, runs on CPU.

Statistical Methods

  • Linear regression: Univariate per-SNP with covariate residualization. Each SNP is tested on its own, so n < m (more SNPs than samples) is not a problem; the residual degrees of freedom are reduced by the number of covariates.
  • Firth logistic regression: Newton-Raphson with penalty to handle separation in binary traits.
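  • LD clumping: Greedy clumping around the most significant SNPs; SNPs with r² ≥ 0.1 to an index SNP within a 500kb window are merged into its clump. Haplotype blocks are detected separately with the Gabriel method.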
  • Fine-mapping: Wakefield (2009) Approximate Bayes Factors, 95% credible sets.
  • LDSC: Bulik-Sullivan et al. (2015) — regress χ² on LD scores to separate polygenicity from confounding.
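
The per-SNP linear regression with covariate residualization can be written in a few vectorized lines. This is a sketch under stated assumptions, not the code in the repository: the function name `linear_gwas` and its signature are hypothetical, genotypes are assumed complete (no missing values), and every SNP is assumed polymorphic.

```python
import numpy as np
from scipy import stats

def linear_gwas(G, y, C):
    """Test each column of G (n x m) against y, adjusting for covariates C (n x k)."""
    n, m = G.shape
    X = np.column_stack([np.ones(n), C])           # intercept + covariates (e.g. PCs)
    Q, _ = np.linalg.qr(X)                         # orthonormal basis of the covariate space
    y_r = y - Q @ (Q.T @ y)                        # residualize the phenotype
    G_r = G - Q @ (Q.T @ G)                        # residualize every SNP at once
    gg = np.einsum("ij,ij->j", G_r, G_r)           # per-SNP sum of squares
    beta = (G_r.T @ y_r) / gg
    df = n - X.shape[1] - 1                        # covariates and the SNP each cost one df
    rss = y_r @ y_r - beta**2 * gg                 # residual sum of squares per SNP
    se = np.sqrt(rss / df / gg)
    p = 2 * stats.t.sf(np.abs(beta / se), df)
    return beta, se, p

rng = np.random.default_rng(1)
n, m = 200, 5
G = rng.binomial(2, 0.3, size=(n, m)).astype(float)
C = rng.normal(size=(n, 2))
y = 0.5 * G[:, 0] + C @ np.array([1.0, -1.0]) + rng.normal(size=n)
beta, se, p = linear_gwas(G, y, C)
```

By the Frisch-Waugh-Lovell theorem, the slope obtained this way equals the SNP coefficient from a full regression of y on the covariates and the SNP together, which is why one projection can be shared across all SNPs.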

Performance (2000 × 10,000, CPU)

| Step | Runtime |
| --- | --- |
| QC | ~2s |
| Linear GWAS | ~3s |
| LD clumping | ~5s |
| LDSC | ~15s |
| Full pipeline | ~30s |

API Reference

gwasengine.run_gwas_engine()

Full pipeline. Returns summary dict with all metrics.

gwasengine.generate_synthetic_gwas()

Generate realistic synthetic GWAS data with:

  • 3 ancestry clusters (population stratification)
  • LD blocks (Cholesky correlation structure)
  • Known causal SNPs with effect sizes drawn from N(0, h²/n_causal)
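
To make the effect-size model concrete, here is a minimal sketch of the phenotype step only. It treats h²/n_causal as the variance of the normal distribution and rescales the effects so that the realized genetic variance matches h². The function name `simulate_phenotype` is hypothetical, and the ancestry clusters and LD blocks listed above are deliberately left out.

```python
import numpy as np

def simulate_phenotype(G, h2=0.3, n_causal=20, seed=0):
    """Draw causal effects and build y = g + e with Var(g) = h2 and Var(e) about 1 - h2."""
    rng = np.random.default_rng(seed)
    n, m = G.shape
    Z = (G - G.mean(axis=0)) / G.std(axis=0)       # standardize; assumes no monomorphic SNPs
    causal = rng.choice(m, size=n_causal, replace=False)
    beta = np.zeros(m)
    beta[causal] = rng.normal(0.0, np.sqrt(h2 / n_causal), size=n_causal)
    g = Z @ beta
    g *= np.sqrt(h2 / g.var())                     # rescale so the genetic variance is exactly h2
    e = rng.normal(0.0, np.sqrt(1.0 - h2), size=n) # environmental noise
    return g + e, g, causal

rng = np.random.default_rng(42)
G = rng.binomial(2, 0.3, size=(500, 1000)).astype(float)   # independent SNPs, toy data
y, g, causal = simulate_phenotype(G)
print(round(g.var() / y.var(), 2))                 # realized heritability in this sample
```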

gwasengine.quality_control()

Sample + variant QC. Returns (data_qc, snp_mask, sample_mask, qc_report).
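
The variant-level filters (MAF, missingness, HWE) can be sketched as follows. This is an illustration only: the function name, the thresholds, and the use of a 1-df chi-square HWE test are assumptions, and missing genotypes are assumed to be encoded as NaN. The actual `quality_control()` signature and defaults may differ.

```python
import numpy as np
from scipy import stats

def variant_qc_mask(G, maf_min=0.01, miss_max=0.02, hwe_p_min=1e-6):
    """Boolean mask of SNPs (columns of G, coded 0/1/2 or NaN) that pass all three filters."""
    G = np.asarray(G, dtype=float)
    n_called = (~np.isnan(G)).sum(axis=0)
    miss = 1.0 - n_called / G.shape[0]             # per-SNP missingness
    n_safe = np.maximum(n_called, 1)               # avoid division by zero
    p_alt = np.nansum(G, axis=0) / (2 * n_safe)    # alternate allele frequency
    maf = np.minimum(p_alt, 1 - p_alt)
    # Observed vs expected genotype counts under Hardy-Weinberg equilibrium
    obs = np.stack([(G == k).sum(axis=0) for k in (0, 1, 2)]).astype(float)
    q = 1 - p_alt
    exp = n_safe * np.stack([q**2, 2 * p_alt * q, p_alt**2])
    with np.errstate(divide="ignore", invalid="ignore"):
        chi2 = np.nansum((obs - exp) ** 2 / exp, axis=0)
    hwe_p = stats.chi2.sf(chi2, df=1)
    return (maf >= maf_min) & (miss <= miss_max) & (hwe_p >= hwe_p_min)

good = np.repeat([0.0, 1.0, 2.0], [49, 42, 9])     # exactly in HWE at p = 0.3
mono = np.zeros(100)                               # monomorphic, fails MAF
all_het = np.ones(100)                             # all heterozygous, fails HWE
gappy = good.copy()
gappy[:10] = np.nan                                # 10% missing, fails missingness
mask = variant_qc_mask(np.column_stack([good, mono, all_het, gappy]))
print(mask)
```

An exact HWE test is generally preferred for rare variants; the chi-square approximation is used here only to keep the sketch short.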

Dependencies

numpy >= 1.24
scipy >= 1.10
pandas >= 1.5
scikit-learn >= 1.3
plotly >= 5.15
matplotlib >= 3.7

License

MIT License. See LICENSE.

References

  1. Bulik-Sullivan, B.K. et al. (2015). LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nature Genetics.
  2. Wakefield, J. (2009). Bayes factors for genome-wide association studies: comparison with P-values. Genetic Epidemiology.
  3. Price, A.L. et al. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics.
  4. Maller, J.B. et al. (2012). Bayesian refinement of association signals for 14 loci in 3 common diseases. Nature Genetics.
  5. Vilhjálmsson, B.J. et al. (2015). Modeling linkage disequilibrium increases accuracy of polygenic risk scores. American Journal of Human Genetics.
  6. Anderson, C.A. et al. (2010). Data quality control in genetic case-control association studies. Nature Protocols.
