Pure Python Genome-Wide Association Study Analysis Engine
GWASEngine is a complete GWAS analysis pipeline built entirely in Python using NumPy, SciPy, and scikit-learn — no PLINK, no R, no BOLT-LMM, no REGENIE. All algorithms are implemented from first principles.
| Module | Description |
|---|---|
| Quality Control | Sample QC (call rate, heterozygosity, relatedness) + Variant QC (MAF, HWE, missingness) |
| Association Testing | Vectorized linear regression (quantitative) + Firth logistic regression (binary) |
| LD Structure | Pairwise r² computation, LD clumping, Gabriel block detection |
| Polygenic Risk Scores | C+T (Clumping + Thresholding) + LDpred2-inspired Bayesian shrinkage |
| Fine-Mapping | Wakefield Approximate Bayes Factors, 95% credible sets |
| LD Score Regression | SNP heritability (h²_SNP) estimation, confounding inflation correction |
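To make the LD clumping and C+T rows of the table concrete, here is a minimal sketch of greedy p-value-ordered clumping followed by a clumped-and-thresholded score. The function names `ld_clump` and `ct_prs`, their parameters, and the single-chromosome assumption are illustrative only and are not taken from GWASEngine's API.

```python
import numpy as np

def ld_clump(G, pos, p, p_index=5e-8, r2_max=0.1, window=500_000):
    """Greedy clumping on one chromosome; returns indices of lead SNPs.

    G: (n_samples, n_snps) genotypes, pos: base-pair positions, p: p-values.
    All names and defaults here are hypothetical.
    """
    Z = (G - G.mean(axis=0)) / G.std(axis=0)      # standardized genotypes
    n = G.shape[0]
    available = np.ones(len(p), dtype=bool)
    leads = []
    for i in np.argsort(p):                       # most significant first
        if p[i] > p_index:
            break                                 # remaining SNPs cannot be index SNPs
        if not available[i]:
            continue                              # already absorbed by an earlier clump
        leads.append(i)
        near = np.flatnonzero(available & (np.abs(pos - pos[i]) <= window))
        r2 = (Z[:, near].T @ Z[:, i] / n) ** 2    # squared correlation with the index SNP
        available[near[r2 >= r2_max]] = False     # also removes i itself (r2 = 1)
    return np.array(leads, dtype=int)

def ct_prs(G, beta, leads):
    """C+T score: weighted allele count over the clumped lead SNPs."""
    return G[:, leads] @ beta[leads]

# Toy example: SNP 1 is in strong LD with SNP 0; SNPs 2 and 3 sit 3 Mb away.
rng = np.random.default_rng(0)
n = 500
g0 = rng.binomial(2, 0.3, n)
g1 = g0.copy()
g1[:20] = rng.binomial(2, 0.3, 20)                # perturb a few calls
G = np.column_stack([g0, g1, rng.binomial(2, 0.3, n), rng.binomial(2, 0.3, n)]).astype(float)
pos = np.array([1_000, 2_000, 3_000_000, 3_001_000])
p = np.array([1e-10, 1e-9, 1e-12, 0.5])
beta = np.array([0.2, 0.19, -0.3, 0.01])

leads = ld_clump(G, pos, p)
scores = ct_prs(G, beta, leads)
print(leads, scores.shape)
```

In this toy case SNP 1 is absorbed into the clump led by SNP 0, and SNP 3 is never considered because its p-value is above the index threshold.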
pip install numpy scipy pandas scikit-learn plotly matplotlib
git clone https://github.com/junior1p/GWASEngine.git
cd GWASEngine

from gwasengine import run_gwas_engine
summary = run_gwas_engine(
trait="quantitative",
out_dir="gwas_output",
n_samples=2000,
n_snps=10000,
h2_snp=0.3,
n_causal=20,
n_genetic_pcs=4,
run_prs=True,
run_finemapping=True,
run_ldsc=True,
)

Or from the command line:

python -m gwasengine

The demo (python -m gwasengine) generates:
gwas_output/
├── gwas_analysis.html # Interactive 6-panel dashboard
├── gwas_results.csv # All SNP association statistics
├── lead_snps.csv # Independent genome-wide significant loci
├── prs.csv # Per-sample polygenic risk scores
├── finemapping.csv # Fine-mapping PIPs at top locus
└── summary.json # Pipeline summary metrics
Expected metrics (2000 samples, 10000 SNPs, h²_SNP=0.30, 20 causal):
- ~5–15 genome-wide significant SNPs (p<5×10⁻⁸)
- λ_GC ≈ 1.0–1.1 (well-controlled with PC correction)
- h²_SNP estimate ≈ 0.25–0.35
- PRS R² ≈ 0.05–0.15 at p<10⁻⁴ threshold
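As a reference point for the λ_GC figure above: the genomic inflation factor is the median observed 1-df χ² statistic divided by its null median (about 0.455). A minimal sketch, assuming an array of 1-df association p-values; the function name `genomic_inflation` is hypothetical and not part of GWASEngine's documented API.

```python
import numpy as np
from scipy import stats

def genomic_inflation(p_values):
    """lambda_GC: median observed chi-square (1 df) over the null median."""
    p = np.asarray(p_values, dtype=float)
    chi2_obs = stats.chi2.isf(p, df=1)            # convert p-values back to chi-square
    return np.median(chi2_obs) / stats.chi2.ppf(0.5, df=1)

# Under the null, p-values are uniform and lambda_GC should be close to 1.
rng = np.random.default_rng(0)
null_p = rng.uniform(size=10_000)
lam = genomic_inflation(null_p)
print(round(lam, 3))
```

Values well above 1 suggest residual stratification or other confounding; polygenic signal can also raise λ_GC, which is why the pipeline reports the LDSC intercept as well.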
The dashboard opens automatically as gwas_analysis.html and has six panels:
- Manhattan Plot — -log₁₀(p) vs genomic position, GWS threshold line
- QQ Plot — Observed vs expected p-values, λ_GC annotation
- PRS Distribution — Histogram of polygenic risk scores
- Fine-Mapping PIPs — Posterior inclusion probabilities at top locus
- LDSC Results — Heritability estimates and intercept
- Top Hits — Bar chart of most significant associations
Most GWAS tools require compiled binaries (PLINK 1.9, PLINK 2.0, REGENIE) or R/Bioconductor packages. GWASEngine is pure Python — installable anywhere pip works, no sudo/root required, runs on CPU.
- Linear regression: Univariate per-SNP tests after residualizing on covariates. Each SNP is tested separately, so having more SNPs than samples (n < m) is not a problem; the residual degrees of freedom account for the covariates removed.
- Firth logistic regression: Newton-Raphson with penalty to handle separation in binary traits.
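- LD clumping: Greedy p-value-ordered clumping; SNPs with r² ≥ 0.1 to an index SNP within a 500kb window are assigned to its clump. Haplotype blocks are detected separately with the Gabriel method.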
- Fine-mapping: Wakefield (2009) Approximate Bayes Factors, 95% credible sets.
- LDSC: Bulik-Sullivan et al. (2015) — regress χ² on LD scores to separate polygenicity from confounding.
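Two of the methods above can be sketched in a few lines: per-SNP linear regression with covariate residualization, and Wakefield approximate Bayes factors turned into posterior inclusion probabilities and a 95% credible set. This is a sketch under stated assumptions, not GWASEngine's code: the function names are hypothetical, the prior variance of 0.04 is an assumed default, and the PIPs assume a single causal variant per locus (as in Maller et al. 2012).

```python
import numpy as np
from scipy import stats

def per_snp_linear_regression(G, y, C):
    """G: (n, m) genotypes, y: (n,) phenotype, C: (n, k) covariates incl. intercept."""
    n, k = C.shape
    Q, _ = np.linalg.qr(C)                        # orthonormal basis of covariate space
    y_res = y - Q @ (Q.T @ y)                     # residualize phenotype
    G_res = G - Q @ (Q.T @ G)                     # residualize every SNP at once
    gg = np.einsum("ij,ij->j", G_res, G_res)
    gy = G_res.T @ y_res
    beta = gy / gg
    dof = n - k - 1                               # covariates plus the SNP itself
    rss = y_res @ y_res - beta * gy               # residual sum of squares per SNP
    se = np.sqrt(rss / dof / gg)
    p = 2 * stats.t.sf(np.abs(beta / se), dof)
    return beta, se, p

def wakefield_pips(beta, se, prior_var=0.04):
    """PIPs from Wakefield ABFs (alternative vs null), one causal variant assumed."""
    z2 = (beta / se) ** 2
    shrink = prior_var / (se ** 2 + prior_var)    # W / (V + W)
    log_bf = 0.5 * np.log(1 - shrink) + 0.5 * z2 * shrink
    bf = np.exp(log_bf - log_bf.max())            # subtract max for numerical stability
    return bf / bf.sum()

def credible_set(pips, coverage=0.95):
    """Smallest set of SNPs whose PIPs sum to at least `coverage`."""
    order = np.argsort(pips)[::-1]
    n_keep = np.searchsorted(np.cumsum(pips[order]), coverage) + 1
    return order[:n_keep]

# Toy data: 50 independent SNPs, SNP 10 is causal, one non-genetic covariate.
rng = np.random.default_rng(1)
n, m = 500, 50
G = rng.binomial(2, 0.3, size=(n, m)).astype(float)
C = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = 0.5 * G[:, 10] + 0.3 * C[:, 1] + rng.standard_normal(n)

beta, se, p = per_snp_linear_regression(G, y, C)
pips = wakefield_pips(beta, se)
cs = credible_set(pips)
print(int(np.argmin(p)), cs)
```

Residualizing once with a QR factorization and then using column-wise dot products gives the same SNP coefficient as fitting a full regression per SNP, which is what makes the scan vectorizable.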
| Step | Runtime |
|---|---|
| QC | ~2s |
| Linear GWAS | ~3s |
| LD clumping | ~5s |
| LDSC | ~15s |
| Full pipeline | ~30s |
run_gwas_engine: Runs the full pipeline and returns a summary dict with all metrics.
Generate realistic synthetic GWAS data with:
- 3 ancestry clusters (population stratification)
- LD blocks (Cholesky correlation structure)
- Known causal SNPs with effect sizes drawn from N(0, h²/n_causal)
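The LD-block and causal-effect ingredients listed above can be illustrated as follows. This is a simplified sketch, not the package's generator: it omits the ancestry clusters, uses an assumed AR(1) correlation within each block, and the names `simulate_block` and `simulate_phenotype` are hypothetical.

```python
import numpy as np
from scipy import stats

def simulate_block(n_samples, n_snps, rho, maf, rng):
    """One LD block: threshold Cholesky-correlated normals into 0/1/2 genotypes."""
    idx = np.arange(n_snps)
    corr = rho ** np.abs(idx[:, None] - idx[None, :])   # AR(1) correlation matrix
    L = np.linalg.cholesky(corr)
    thresh = stats.norm.ppf(maf)                        # P(latent < thresh) = maf
    # Two independent haplotypes per sample, each a 0/1 allele vector.
    haps = [rng.standard_normal((n_samples, n_snps)) @ L.T < thresh for _ in range(2)]
    return haps[0].astype(np.int8) + haps[1].astype(np.int8)

def simulate_phenotype(G, h2, n_causal, rng):
    """Quantitative trait with causal effects drawn from N(0, h2 / n_causal)."""
    n, m = G.shape
    Z = (G - G.mean(axis=0)) / G.std(axis=0)            # standardized genotypes
    causal = rng.choice(m, size=n_causal, replace=False)
    beta = rng.normal(0.0, np.sqrt(h2 / n_causal), size=n_causal)
    genetic = Z[:, causal] @ beta
    noise = rng.normal(0.0, np.sqrt(1.0 - h2), size=n)  # environmental component
    return genetic + noise, causal, beta

rng = np.random.default_rng(42)
G = np.hstack([simulate_block(1000, 20, 0.9, 0.3, rng) for _ in range(10)])
y, causal, beta = simulate_phenotype(G, h2=0.3, n_causal=20, rng=rng)
print(G.shape, y.shape)
```

With standardized genotypes, effects of variance h²/n_causal give an expected genetic variance of about h², although the realized value varies from draw to draw and with LD among the causal SNPs.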
Sample + variant QC. Returns (data_qc, snp_mask, sample_mask, qc_report).
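A sketch of what such a QC step might look like, returning the same four-part tuple. The function name `basic_qc`, the thresholds, and the report keys are assumptions for illustration; heterozygosity and relatedness checks are left out for brevity.

```python
import numpy as np
from scipy import stats

def basic_qc(G, sample_cr=0.98, snp_cr=0.98, maf_min=0.01, hwe_p_min=1e-6):
    """G: (n_samples, n_snps) float array of 0/1/2 with np.nan for missing calls."""
    # Sample QC: drop samples with a low call rate.
    sample_mask = (~np.isnan(G)).mean(axis=1) >= sample_cr
    Gs = G[sample_mask]
    n_called = (~np.isnan(Gs)).sum(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        call_rate = n_called / Gs.shape[0]
        freq = np.nansum(Gs, axis=0) / (2 * n_called)   # alternate allele frequency
        maf = np.minimum(freq, 1 - freq)
        # Hardy-Weinberg chi-square test (1 df) on observed genotype counts.
        obs = np.stack([(Gs == g).sum(axis=0) for g in (0, 1, 2)])
        exp = n_called * np.stack([(1 - freq) ** 2, 2 * freq * (1 - freq), freq ** 2])
        chi2 = np.nansum((obs - exp) ** 2 / exp, axis=0)
    hwe_p = stats.chi2.sf(chi2, df=1)
    snp_mask = (call_rate >= snp_cr) & (maf >= maf_min) & (hwe_p >= hwe_p_min)
    report = {
        "samples_removed": int((~sample_mask).sum()),
        "snps_removed": int((~snp_mask).sum()),
    }
    return Gs[:, snp_mask], snp_mask, sample_mask, report

# Toy data with one bad sample and three bad SNPs.
rng = np.random.default_rng(7)
G = rng.binomial(2, 0.3, size=(200, 100)).astype(float)
G[:, 1] = 0            # monomorphic: fails MAF
G[::2, 2] = np.nan     # about half missing: fails variant call rate
G[:, 3] = 1            # all heterozygous: fails HWE
G[0, :] = np.nan       # sample with no calls: fails sample call rate

data_qc, snp_mask, sample_mask, qc_report = basic_qc(G)
print(data_qc.shape, qc_report)
```

Sample filtering runs first so that variant call rates and allele frequencies are computed only on retained samples.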
numpy >= 1.24
scipy >= 1.10
pandas >= 1.5
scikit-learn >= 1.3
plotly >= 5.15
matplotlib >= 3.7
MIT License. See LICENSE.
- Bulik-Sullivan, B.K. et al. (2015). LD Score regression distinguishes confounding from polygenicity. Nature Genetics.
- Wakefield, J. (2009). Bayes factors for genome-wide association studies. Genetic Epidemiology.
- Price, A.L. et al. (2006). Principal components analysis corrects for stratification. Nature Genetics.
- Maller, J.B. et al. (2012). Bayesian refinement of association signals. Nature Genetics.
- Vilhjálmsson, B.J. et al. (2015). Modeling LD increases accuracy of PRS. AJHG.
- Anderson, C.A. et al. (2010). Data quality control in GWAS. Nature Protocols.