GWASEngine

Pure Python Genome-Wide Association Study Analysis Engine

GWASEngine is a complete GWAS analysis pipeline built entirely in Python using NumPy, SciPy, and scikit-learn — no PLINK, no R, no BOLT-LMM, no REGENIE. All algorithms are implemented from first principles.

Features

Module	Description
Quality Control	Sample QC (call rate, heterozygosity, relatedness) + Variant QC (MAF, HWE, missingness)
Association Testing	Vectorized linear regression (quantitative) + Firth logistic regression (binary)
LD Structure	Pairwise r² computation, LD clumping, Gabriel block detection
Polygenic Risk Scores	C+T (Clumping + Thresholding) + LDpred2-inspired Bayesian shrinkage
Fine-Mapping	Wakefield Approximate Bayes Factors, 95% credible sets
LD Score Regression	SNP heritability (h²_SNP) estimation, confounding inflation correction

Installation

pip install numpy scipy pandas scikit-learn plotly matplotlib
git clone https://github.com/junior1p/GWASEngine.git
cd GWASEngine

Quick Start

from gwasengine import run_gwas_engine

summary = run_gwas_engine(
    trait="quantitative",
    out_dir="gwas_output",
    n_samples=2000,
    n_snps=10000,
    h2_snp=0.3,
    n_causal=20,
    n_genetic_pcs=4,
    run_prs=True,
    run_finemapping=True,
    run_ldsc=True,
)

Or from the command line:

python -m gwasengine

Demo Output

The demo (python -m gwasengine) generates:

gwas_output/
├── gwas_analysis.html   # Interactive 6-panel dashboard
├── gwas_results.csv     # All SNP association statistics
├── lead_snps.csv        # Independent genome-wide significant loci
├── prs.csv              # Per-sample polygenic risk scores
├── finemapping.csv      # Fine-mapping PIPs at top locus
└── summary.json         # Pipeline summary metrics

Expected metrics (2,000 samples, 10,000 SNPs, h²_SNP = 0.30, 20 causal SNPs):

~5–15 genome-wide significant SNPs (p<5×10⁻⁸)
λ_GC ≈ 1.0–1.1 (well-controlled with PC correction)
h²_SNP estimate ≈ 0.25–0.35
PRS R² ≈ 0.05–0.15 at p<10⁻⁴ threshold
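The λ_GC figure above is the genomic inflation factor: the median observed 1-df χ² statistic divided by its expected median under the null (about 0.455). A minimal sketch of that calculation, assuming only a vector of association p-values; the function name `genomic_inflation` is illustrative and is not part of the GWASEngine API:

```python
import numpy as np
from scipy import stats

def genomic_inflation(pvalues):
    """Median-based genomic inflation factor (lambda_GC) from p-values."""
    p = np.asarray(pvalues, dtype=float)
    # Drop missing or out-of-range values before converting.
    p = p[np.isfinite(p) & (p > 0) & (p <= 1)]
    # Convert each p-value back to a 1-df chi-square statistic.
    chi2 = stats.chi2.isf(p, df=1)
    # Divide by the null median of a 1-df chi-square (about 0.4549).
    return float(np.median(chi2) / stats.chi2.ppf(0.5, df=1))

# Under the null, p-values are uniform and lambda_GC should be close to 1.
rng = np.random.default_rng(0)
null_p = rng.uniform(size=10_000)
lam = genomic_inflation(null_p)
```

Values well above 1 suggest residual confounding (or strong polygenicity), which is why the estimate is usually read alongside the LDSC intercept.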

Interactive Dashboard

The dashboard is written to gwas_analysis.html and opens automatically. It has six panels:

Manhattan Plot — -log₁₀(p) vs genomic position, GWS threshold line
QQ Plot — Observed vs expected p-values, λ_GC annotation
PRS Distribution — Histogram of polygenic risk scores
Fine-Mapping PIPs — Posterior inclusion probabilities at top locus
LDSC Results — Heritability estimates and intercept
Top Hits — Bar chart of most significant associations

Scientific Background

Why no PLINK?

Most GWAS tools require compiled binaries (PLINK 1.9, PLINK 2.0, REGENIE) or R/Bioconductor packages. GWASEngine is pure Python — installable anywhere pip works, no sudo/root required, runs on CPU.

Statistical Methods

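Linear regression: Univariate per-SNP regression after residualizing the phenotype and genotypes on the covariates. Because each SNP is tested on its own, the number of SNPs may exceed the number of samples (m > n); the residual degrees of freedom are reduced by the number of covariates.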
Firth logistic regression: Newton-Raphson optimization of the Firth-penalized likelihood, which keeps effect estimates finite under complete or quasi-complete separation in binary traits.
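LD clumping: SNPs are processed in order of increasing p-value; within a 500 kb window of each lead SNP, SNPs with r² ≥ 0.1 are assigned to its clump, so the retained lead SNPs are approximately independent (r² < 0.1). LD blocks are detected separately with the Gabriel method.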
Fine-mapping: Wakefield (2009) Approximate Bayes Factors, 95% credible sets.
LDSC: Bulik-Sullivan et al. (2015) — regress χ² on LD scores to separate polygenicity from confounding.
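To make the fine-mapping step concrete, here is a minimal sketch of Wakefield Approximate Bayes Factors and a 95% credible set. It is not GWASEngine's own code, and it rests on several assumptions: the Bayes factor is written in the alternative-over-null orientation, the prior effect variance `prior_var=0.04` is an illustrative default, and posterior inclusion probabilities assume exactly one causal variant per locus (as in Maller et al., 2012). The function names are hypothetical.

```python
import numpy as np

def wakefield_log_abf(beta, se, prior_var=0.04):
    """Log approximate Bayes factor (alternative vs null) per SNP."""
    beta = np.asarray(beta, dtype=float)
    se = np.asarray(se, dtype=float)
    v = se ** 2                          # sampling variance of beta-hat
    z2 = (beta / se) ** 2                # squared z-score
    shrink = prior_var / (v + prior_var)
    # ABF = sqrt(V / (V + W)) * exp(z^2 / 2 * W / (V + W))
    return 0.5 * np.log(1.0 - shrink) + 0.5 * z2 * shrink

def credible_set(log_abf, coverage=0.95):
    """PIPs under a single-causal-variant model and the smallest covering set."""
    log_abf = np.asarray(log_abf, dtype=float)
    weights = np.exp(log_abf - log_abf.max())   # subtract max for stability
    pip = weights / weights.sum()
    order = np.argsort(pip)[::-1]               # most probable SNPs first
    cumulative = np.cumsum(pip[order])
    n_keep = int(np.searchsorted(cumulative, coverage) + 1)
    return pip, order[:n_keep]

# Toy locus: two strongly associated SNPs and two weak ones.
beta = np.array([0.02, 0.30, 0.05, 0.28])
se = np.full(4, 0.05)
pip, cs = credible_set(wakefield_log_abf(beta, se))
```

In this toy locus the two strong SNPs together carry nearly all of the posterior mass, so both enter the credible set while the weak SNPs are excluded.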

Performance (2,000 samples × 10,000 SNPs, CPU)

Step	Runtime
QC	~2s
Linear GWAS	~3s
LD clumping	~5s
LDSC	~15s
Full pipeline	~30s
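The linear GWAS step can be this fast because all per-SNP regressions reduce to a handful of matrix operations rather than a Python loop over SNPs. The sketch below illustrates the idea under stated assumptions: genotypes are a complete (no missing values) samples × SNPs array, covariates are residualized out of both phenotype and genotypes, and the function name `linear_gwas` is illustrative rather than GWASEngine's actual API.

```python
import numpy as np
from scipy import stats

def linear_gwas(genotypes, phenotype, covariates):
    """Per-SNP OLS effect, standard error and p-value, adjusting for covariates."""
    G = np.asarray(genotypes, dtype=float)       # shape (n_samples, n_snps)
    y = np.asarray(phenotype, dtype=float)       # shape (n_samples,)
    n = G.shape[0]
    C = np.column_stack([np.ones(n), covariates])  # intercept + covariates
    # Orthonormal basis of the covariate space, used to project it out.
    Q, _ = np.linalg.qr(C)
    y_res = y - Q @ (Q.T @ y)
    G_res = G - Q @ (Q.T @ G)
    gg = np.einsum("ij,ij->j", G_res, G_res)     # per-SNP sum of squares
    gy = G_res.T @ y_res                         # per-SNP cross-product
    beta = gy / gg
    dof = n - C.shape[1] - 1                     # covariates plus one SNP
    rss = y_res @ y_res - beta * gy              # residual sum of squares
    se = np.sqrt(rss / dof / gg)
    pval = 2.0 * stats.t.sf(np.abs(beta / se), dof)
    return beta, se, pval

# Simulated example: SNP 7 is causal, two covariates stand in for genetic PCs.
rng = np.random.default_rng(1)
n, m = 500, 50
G = rng.binomial(2, 0.3, size=(n, m)).astype(float)
pcs = rng.normal(size=(n, 2))
y = 0.5 * G[:, 7] + pcs @ np.array([0.3, -0.2]) + rng.normal(size=n)
beta, se, pval = linear_gwas(G, y, pcs)
```

Residualizing both sides first gives the same SNP coefficient as fitting the full model with covariates for each SNP, but the covariate projection is computed only once.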

API Reference

`gwasengine.run_gwas_engine()`

Full pipeline. Returns summary dict with all metrics.

`gwasengine.generate_synthetic_gwas()`

Generate realistic synthetic GWAS data with:

3 ancestry clusters (population stratification)
LD blocks (Cholesky correlation structure)
Known causal SNPs with effect sizes drawn from N(0, h²/n_causal)
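As an illustration of the last two ingredients, the sketch below builds LD blocks from a Cholesky factor and draws causal effects from N(0, h²/n_causal), reading the second argument as a variance. It is a simplified stand-in, not the code behind `generate_synthetic_gwas()`: the AR(1) correlation pattern, the single shared allele frequency, the omission of ancestry clusters, and the function names are all assumptions made for the example.

```python
import numpy as np
from scipy import stats

def simulate_ld_block(n_samples, n_snps, rho, maf, rng):
    """Genotypes (0/1/2) for one LD block with AR(1)-style correlation."""
    idx = np.arange(n_snps)
    corr = rho ** np.abs(idx[:, None] - idx[None, :])
    chol = np.linalg.cholesky(corr)            # corr = chol @ chol.T
    cutoff = stats.norm.isf(maf)               # P(latent > cutoff) = maf
    geno = np.zeros((n_samples, n_snps), dtype=np.int8)
    for _ in range(2):                         # two haplotypes per sample
        latent = rng.standard_normal((n_samples, n_snps)) @ chol.T
        geno += latent > cutoff
    return geno

def simulate_phenotype(geno, h2, n_causal, rng):
    """Quantitative trait with genetic variance h2 and total variance near 1."""
    n, m = geno.shape
    # Standardize genotypes (assumes no monomorphic SNPs).
    z = (geno - geno.mean(axis=0)) / geno.std(axis=0)
    causal = rng.choice(m, size=n_causal, replace=False)
    beta = np.zeros(m)
    beta[causal] = rng.normal(0.0, np.sqrt(h2 / n_causal), size=n_causal)
    # Rescale so the realized genetic variance equals h2 exactly.
    beta *= np.sqrt(h2 / np.var(z @ beta))
    y = z @ beta + rng.normal(0.0, np.sqrt(1.0 - h2), size=n)
    return y, causal, beta

rng = np.random.default_rng(42)
# Five independent blocks of 20 SNPs each.
geno = np.hstack([simulate_ld_block(1000, 20, rho=0.9, maf=0.3, rng=rng)
                  for _ in range(5)])
y, causal, beta = simulate_phenotype(geno, h2=0.3, n_causal=5, rng=rng)
```

Neighbouring SNPs within a block come out strongly correlated, while SNPs in different blocks are independent, which is the structure LD clumping and LD score regression rely on.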

`gwasengine.quality_control()`

Sample + variant QC. Returns (data_qc, snp_mask, sample_mask, qc_report).
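The filters behind these masks can be sketched as follows. This is a hedged illustration rather than the body of `quality_control()`: it assumes genotypes are coded 0/1/2 with NaN for missing calls, uses commonly quoted thresholds as illustrative defaults, tests HWE with a 1-df chi-square goodness-of-fit test, and leaves out heterozygosity and relatedness checks. The helper names `variant_qc` and `sample_qc` are hypothetical.

```python
import numpy as np
from scipy import stats

def variant_qc(G, maf_min=0.01, miss_max=0.02, hwe_p_min=1e-6):
    """Boolean keep-mask over SNPs plus the per-SNP statistics behind it."""
    G = np.asarray(G, dtype=float)             # (n_samples, n_snps), NaN = missing
    n_called = (~np.isnan(G)).sum(axis=0)
    missing = 1.0 - n_called / G.shape[0]
    obs = np.stack([(G == k).sum(axis=0) for k in (0, 1, 2)])  # genotype counts
    with np.errstate(divide="ignore", invalid="ignore"):
        p = (2 * obs[2] + obs[1]) / (2 * n_called)             # allele frequency
        maf = np.minimum(p, 1 - p)
        exp = np.stack([(1 - p) ** 2, 2 * p * (1 - p), p ** 2]) * n_called
        chi2 = np.nansum((obs - exp) ** 2 / exp, axis=0)       # HWE chi-square
    hwe_p = stats.chi2.sf(chi2, df=1)
    keep = (maf >= maf_min) & (missing <= miss_max) & (hwe_p >= hwe_p_min)
    return keep, {"maf": maf, "missing": missing, "hwe_p": hwe_p}

def sample_qc(G, call_rate_min=0.98):
    """Boolean keep-mask over samples based on genotype call rate."""
    call_rate = 1.0 - np.isnan(np.asarray(G, dtype=float)).mean(axis=1)
    return call_rate >= call_rate_min, call_rate

# Four toy SNPs: in HWE, monomorphic, 10% missing, and all-heterozygous.
good = np.repeat([0.0, 1.0, 2.0], [98, 84, 18])
gappy = good.copy()
gappy[:20] = np.nan
G = np.column_stack([good, np.zeros(200), gappy, np.ones(200)])
snp_keep, stats_by_snp = variant_qc(G)
sample_keep, call_rate = sample_qc(G)
```

Only the first SNP passes all three variant filters here; the 20 samples with a missing call fall below the call-rate threshold because the toy panel has just four SNPs.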

Dependencies

numpy >= 1.24
scipy >= 1.10
pandas >= 1.5
scikit-learn >= 1.3
plotly >= 5.15
matplotlib >= 3.7

License

MIT License. See LICENSE.

References

Bulik-Sullivan, B.K. et al. (2015). LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nature Genetics.
Wakefield, J. (2009). Bayes factors for genome-wide association studies: comparison with P-values. Genetic Epidemiology.
Price, A.L. et al. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics.
Maller, J.B. et al. (2012). Bayesian refinement of association signals for 14 loci in 3 common diseases. Nature Genetics.
Vilhjálmsson, B.J. et al. (2015). Modeling linkage disequilibrium increases accuracy of polygenic risk scores. American Journal of Human Genetics.
Anderson, C.A. et al. (2010). Data quality control in genetic case-control association studies. Nature Protocols.
