simplify missing data and span normalization API by andrewkern · Pull Request #32 · kr-colab/pg_gpu

andrewkern · 2026-04-07T17:09:29Z

Summary

Reduce missing_data from 4 modes to 2: 'include' (default) and 'exclude'
Replace span_normalize + span_denominator with a single span_normalize parameter that auto-detects the best denominator (accessible mask > n_total_sites > genomic span)
Move LD projection estimator to estimator='sigma_d2' on zns()/omega()
Unify dxy/da/pi_within_population to use the same span_normalize as pi/theta_w
Achaz FrequencySpectrum.theta() auto-detects span from source matrix

Bug fixes

WC FST: fix numerator/denominator mismatch inflating FST by 1/(1-miss_rate) under missing data
Achaz zeng_e: fix broken variance formula (canceling a2 terms gave NaN for all n)
dxy/da: fix crash in exclude mode (shape mismatch after site filtering)

Validation

53 simulation-based bias tests confirming include mode is unbiased under MCAR at 10-60% missingness for 17 statistics
495 unit tests pass
11/12 scikit-allel comparisons match on real Ag1000G 3L (71% missing)
Net: -81 lines across 26 files

48-test bias suite verifying include mode is unbiased under MCAR at 10-60% missingness for pi, theta_w, theta_h, theta_l, tajd, dxy, fst_hudson, da, and all Achaz estimators. Marked slow.

Fix Achaz zeng_e: the variance formula had canceling a2 terms, giving negative variance for all n. Replaced with neutrality_test(), which uses the general Achaz (2009) / Fu (1995) covariance framework.

Documents known biases: WC FST (inflates), seg_sites (deflates).
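The kind of check this suite performs can be sketched as follows. This is a minimal illustration and not the code in the PR: the helper names `pi_sum_include` and `relative_bias`, the `-1` missing-data code, and the simulation settings are all assumptions made for the example. It masks a fixed haplotype matrix completely at random and compares the per-site-averaged pi against the value from the unmasked data.

```python
import numpy as np

def pi_sum_include(hap, missing=-1):
    """Sum of per-site pi, using only the non-missing haplotypes at each site."""
    valid = hap != missing
    n = valid.sum(axis=0).astype(float)        # called haplotypes per site
    k = np.where(valid, hap, 0).sum(axis=0)    # derived-allele count per site
    ok = n >= 2                                # need at least one pair
    pi = np.zeros(hap.shape[1])
    pi[ok] = 2.0 * k[ok] * (n[ok] - k[ok]) / (n[ok] * (n[ok] - 1.0))
    return pi.sum()

def relative_bias(miss_rate, n_hap=40, n_sites=2000, n_reps=20, seed=0):
    """Relative bias of pi after MCAR masking, versus the unmasked data."""
    rng = np.random.default_rng(seed)
    freqs = rng.uniform(0.05, 0.95, size=n_sites)
    hap = (rng.random((n_hap, n_sites)) < freqs).astype(np.int8)
    truth = pi_sum_include(hap)
    estimates = []
    for _ in range(n_reps):
        masked = hap.copy()
        # MCAR: every genotype is dropped independently with the same rate
        masked[rng.random(hap.shape) < miss_rate] = -1
        estimates.append(pi_sum_include(masked))
    return (np.mean(estimates) - truth) / truth

for rate in (0.1, 0.3, 0.6):
    print(rate, relative_bias(rate))
```

The per-site estimator is a mean over pairs of called haplotypes, so a random subsample has the same expectation as the full sample; that is why the relative bias should stay near zero even at high missingness.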

Simplify missing_data to two options: 'include' (default) and 'exclude'. Simulation testing confirms per-site averaging (include mode) is unbiased under MCAR at 10-60% missingness for all major statistics.

- Remove pairwise branches from pi, theta_w, tajimas_d, dxy, fst_hudson, fst_weir_cockerham, fst_nei, da, pi_within_population
- Remove dead pairwise->include conversions from 11 functions
- Move LD projection to estimator='sigma_d2' param on zns/omega
- Rename _pairwise_pi_components to pi_components (public API)
- Delete test_pairwise_mode.py (543 lines)
- Clean all docstrings and comments
- Net: -730 lines
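To make the two modes concrete, here is a small sketch of how they could differ for pi. It is an illustration under assumptions, not the pg_gpu implementation: `per_site_pi` and `pi_sum` are hypothetical helpers, missing calls are coded as `-1`, and 'exclude' is read as "drop any site with a missing call" while 'include' is read as "use whatever haplotypes are called at each site".

```python
import numpy as np

MISSING = -1  # assumed missing-data code for this sketch

def per_site_pi(hap):
    """Per-site pi from the non-missing haplotypes at each site.

    hap: (n_haplotypes, n_sites) array of 0/1 alleles, MISSING for no call.
    """
    valid = hap != MISSING
    n = valid.sum(axis=0)                       # called haplotypes per site
    k = np.where(valid, hap, 0).sum(axis=0)     # derived-allele count per site
    pairs = n * (n - 1) / 2.0
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(n >= 2, k * (n - k) / pairs, 0.0)

def pi_sum(hap, missing_data="include"):
    if missing_data == "include":
        # keep every site, each with its own sample size
        return per_site_pi(hap).sum()
    if missing_data == "exclude":
        # keep only sites where every haplotype is called
        complete = (hap != MISSING).all(axis=0)
        return per_site_pi(hap[:, complete]).sum()
    raise ValueError(f"unknown missing_data mode: {missing_data!r}")

hap = np.array([
    [0, 1, 0],
    [1, MISSING, 0],
    [0, 0, 0],
    [1, 0, 1],
])
print(pi_sum(hap, "include"))  # all three sites contribute
print(pi_sum(hap, "exclude"))  # the middle site is dropped
```

In this toy matrix the middle site has one missing call, so 'include' scores it from the three called haplotypes while 'exclude' discards it entirely.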

Two modes (include/exclude), LD estimator parameter, span normalization, accessible masks, Achaz framework, component access. Remove all pairwise/project mode documentation and examples.

Allele frequencies were computed from all individually-valid haplotypes but divided by the count of complete diploid pairs, inflating frequencies by 1/(1-miss_rate). Now both allele counts and het are computed from complete diploid pairs only.

Before: +22% bias at 10% missing, +3674% at 60%
After: +0.2% at 10%, -1.3% at 60% (unbiased)
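The mismatch is easy to see on a single hand-made site. The sketch below is only an illustration of the numerator/denominator problem described above, not the library's FST code; the genotype layout and the `-1` missing code are assumptions.

```python
import numpy as np

MISSING = -1  # assumed missing-data code for this sketch

# One site, ten diploid individuals, two haplotypes each.
geno = np.array([
    [1, 0], [1, 1], [0, 0], [0, 1], [1, MISSING],
    [MISSING, 1], [0, 0], [1, 0], [MISSING, MISSING], [0, 0],
])

valid_hap = geno != MISSING
complete = valid_hap.all(axis=1)          # individuals with both haplotypes called

alt_all_valid = int(np.where(valid_hap, geno, 0).sum())  # counts half-called individuals too
alt_complete = int(geno[complete].sum())                 # complete diploids only
n_complete_hap = 2 * int(complete.sum())

# Mismatched: numerator from all valid haplotypes, denominator from complete pairs.
freq_mismatched = alt_all_valid / n_complete_hap
# Consistent: numerator and denominator both from complete pairs.
freq_consistent = alt_complete / n_complete_hap

print(freq_mismatched, freq_consistent)
```

Here the two half-called individuals add alternate alleles to the numerator without adding to the denominator, so the mismatched frequency comes out higher than the consistent one.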

get_span('auto') selects the best denominator: accessible mask > n_total_sites > genomic span > callable span. Explicit 'accessible' now errors if no mask set instead of silently falling back.
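The priority order can be sketched as a simple first-available lookup. This is a hypothetical stand-in, not the real get_span: the function name `resolve_span`, its keyword arguments, and every mode string other than 'auto' and 'accessible' are assumptions made for the example.

```python
def resolve_span(mode="auto", accessible_bases=None, n_total_sites=None,
                 genomic_span=None, callable_span=None):
    """Return (mode_used, denominator) following the priority described above."""
    # Ordered from most to least preferred.
    candidates = [
        ("accessible", accessible_bases),
        ("n_total_sites", n_total_sites),
        ("genomic", genomic_span),
        ("callable", callable_span),
    ]
    if mode == "auto":
        for name, value in candidates:
            if value is not None:
                return name, value
        raise ValueError("no span information available")
    lookup = dict(candidates)
    if mode not in lookup:
        raise ValueError(f"unknown span mode: {mode!r}")
    if lookup[mode] is None:
        # An explicit request fails loudly instead of falling back silently.
        raise ValueError(f"span mode {mode!r} requested but not available")
    return mode, lookup[mode]

print(resolve_span("auto", accessible_bases=800, n_total_sites=1000))
print(resolve_span("auto", genomic_span=5000))
```

Raising on an explicit but unavailable mode mirrors the change described above: a caller who asks for the accessible mask finds out immediately when none is set.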

Replace span_normalize + span_denominator with a single span_normalize parameter accepting True (auto-detect: accessible > n_total_sites > genomic span), False (raw sum), or explicit string mode.

- pi, theta_w, theta_h, theta_l: replace span_denominator with unified span_normalize, use _apply_span_normalize helper
- dxy, da: replace span_denominator: bool with span_normalize
- pi_within_population: delegate to diversity.pi()
- windowed_analysis: pass span_normalize through to all stats
- dxy default changes from per-valid-site to per-base (matching pi)

FrequencySpectrum.theta() now accepts span_normalize=True to auto-detect span from the source matrix, instead of requiring the caller to pass span explicitly.

andrewkern added 9 commits April 6, 2026 22:03

rewrite missing data docs for simplified interface

3d3d797

add auto mode to get_span for smart span normalization

a010be4

update Achaz theta() to use unified span_normalize

4670279

update docs for unified span_normalize

a361fda

fix debug scripts for simplified API

23c5d73

andrewkern force-pushed the refactor/simplify-missing-data branch from e8fafed to 23c5d73 on April 7, 2026 17:11

andrewkern merged commit 0da5b8f into main Apr 7, 2026
1 check passed

andrewkern mentioned this pull request Apr 8, 2026

add Gram matrix fast path for ZnS computation #25

Open

2 tasks

andrewkern deleted the refactor/simplify-missing-data branch April 16, 2026 14:15
