simplify missing data and span normalization API #32
Merged
andrewkern merged 9 commits into main, Apr 7, 2026
Add a 48-test bias suite verifying that include mode is unbiased under MCAR at 10-60% missingness for pi, theta_w, theta_h, theta_l, tajd, dxy, fst_hudson, da, and all Achaz estimators. Marked slow.

Fix achaz zeng_e: the variance formula had canceling a2 terms, giving negative variance for all n. Replaced with neutrality_test(), which uses the general Achaz (2009) / Fu (1995) covariance framework.

Documents known biases: WC FST (inflates), seg_sites (deflates).
Simplify missing_data to two options: 'include' (default) and 'exclude'. Simulation testing confirms per-site averaging (include mode) is unbiased under MCAR at 10-60% missingness for all major statistics.

- Remove pairwise branches from pi, theta_w, tajimas_d, dxy, fst_hudson, fst_weir_cockerham, fst_nei, da, pi_within_population
- Remove dead pairwise->include conversions from 11 functions
- Move LD projection to estimator='sigma_d2' param on zns/omega
- Rename _pairwise_pi_components to pi_components (public API)
- Delete test_pairwise_mode.py (543 lines)
- Clean all docstrings and comments
- Net: -730 lines
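The difference between the two remaining modes can be sketched with a small NumPy simulation. This is an illustration only, not the library's implementation: `per_site_pi`, `mean_pi`, and the `MISSING` sentinel are hypothetical names, and it assumes that 'include' averages per-site pi over the non-missing calls at each site while 'exclude' keeps only sites with no missing calls.

```python
import numpy as np

MISSING = -1  # assumed sentinel for a missing haplotype call


def per_site_pi(haps):
    """Per-site nucleotide diversity from the non-missing calls at each site.

    haps: (n_haplotypes, n_sites) int array of 0/1 alleles, MISSING for no call.
    Returns (pi, n_valid), each with one entry per site.
    """
    valid = haps != MISSING
    n = valid.sum(axis=0).astype(float)          # non-missing calls per site
    k = (haps == 1).sum(axis=0).astype(float)    # derived-allele count per site
    pi = np.zeros(haps.shape[1])
    ok = n >= 2                                  # need at least one pair
    pi[ok] = 2.0 * k[ok] * (n[ok] - k[ok]) / (n[ok] * (n[ok] - 1.0))
    return pi, n


def mean_pi(haps, missing_data="include"):
    """Average per-site pi under the two assumed missing-data modes."""
    pi, n = per_site_pi(haps)
    if missing_data == "include":
        keep = n >= 2                  # use every site with a comparable pair
    elif missing_data == "exclude":
        keep = n == haps.shape[0]      # use only fully called sites
    else:
        raise ValueError(f"unknown missing_data mode: {missing_data!r}")
    return float(pi[keep].mean()) if keep.any() else float("nan")


# Simulate unlinked sites, then mask 10% of calls completely at random (MCAR).
rng = np.random.default_rng(0)
n_hap, n_sites = 10, 2000
freqs = rng.uniform(0.05, 0.95, size=n_sites)
full = (rng.random((n_hap, n_sites)) < freqs).astype(int)
masked = full.copy()
masked[rng.random(full.shape) < 0.10] = MISSING

pi_full = mean_pi(full)
pi_include = mean_pi(masked, "include")
pi_exclude = mean_pi(masked, "exclude")
print(pi_full, pi_include, pi_exclude)
```

Under MCAR both estimates should land close to the complete-data value; 'exclude' simply discards every site that has at least one missing call, so it averages over far fewer sites.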
Two modes (include/exclude), LD estimator parameter, span normalization, accessible masks, Achaz framework, component access. Remove all pairwise/project mode documentation and examples.
Allele frequencies were computed from all individually-valid haplotypes but divided by the count of complete diploid pairs, inflating frequencies by 1/(1-miss_rate). Now both allele counts and het are computed from complete diploid pairs only.

Before: +22% bias at 10% missing, +3674% at 60%
After: +0.2% at 10%, -1.3% at 60% (unbiased)
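The mismatched-denominator effect described above can be reproduced in a few lines. This is a sketch under stated assumptions, not the patched function: it assumes each haplotype call goes missing independently with probability `miss`, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n_ind, n_sites = 200, 500
p_true, miss = 0.4, 0.3

# Diploid genotypes as (individual, haplotype, site); each call missing independently.
geno = (rng.random((n_ind, 2, n_sites)) < p_true).astype(int)
valid = rng.random((n_ind, 2, n_sites)) >= miss
complete = valid.all(axis=1)                 # both haplotypes called, per (individual, site)
n_complete_haps = 2 * complete.sum(axis=0)   # denominator: haplotypes in complete pairs

# Mismatched: numerator counts every valid haplotype, denominator counts complete pairs only.
alt_all_valid = (geno * valid).sum(axis=(0, 1))
p_mismatched = alt_all_valid / n_complete_haps

# Consistent: numerator and denominator both restricted to complete diploid pairs.
alt_complete = (geno * complete[:, None, :]).sum(axis=(0, 1))
p_consistent = alt_complete / n_complete_haps

inflation = p_mismatched.mean() / p_consistent.mean()
print(p_consistent.mean(), p_mismatched.mean(), inflation, 1.0 / (1.0 - miss))
```

In this setup a haplotype is valid with probability 1-miss but a pair is complete with probability (1-miss)^2, which is where the 1/(1-miss_rate) inflation factor comes from.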
get_span('auto') selects the best denominator: accessible mask >
n_total_sites > genomic span > callable span. Explicit 'accessible'
now errors if no mask set instead of silently falling back.
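The priority chain can be sketched as a small resolver. `resolve_span` and its keyword arguments are hypothetical stand-ins, not the signature of `get_span`; mode strings other than 'auto' and 'accessible' are also assumptions made for the example.

```python
def resolve_span(mode="auto", *, accessible=None, n_total_sites=None,
                 genomic=None, callable_span=None):
    """Pick a span denominator; returns (source_name, value).

    'auto' walks the candidates in priority order and takes the first one
    that is set. An explicit mode raises if its denominator is unavailable
    rather than silently falling back to another one.
    """
    # Dict order encodes the priority: accessible > n_total_sites > genomic > callable.
    candidates = {
        "accessible": accessible,
        "n_total_sites": n_total_sites,
        "genomic": genomic,
        "callable": callable_span,
    }
    if mode == "auto":
        for name, value in candidates.items():
            if value is not None:
                return name, value
        raise ValueError("no span denominator available")
    if mode not in candidates:
        raise ValueError(f"unknown span mode: {mode!r}")
    if candidates[mode] is None:
        raise ValueError(f"span mode {mode!r} requested but not available")
    return mode, candidates[mode]


# No accessible mask set: 'auto' falls through to n_total_sites.
print(resolve_span("auto", n_total_sites=1000, genomic=1200))
```

Raising on an explicit but unavailable mode keeps a missing mask from quietly changing the denominator of every downstream statistic.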
Replace span_normalize + span_denominator with a single span_normalize parameter accepting True (auto-detect: accessible > n_total_sites > genomic span), False (raw sum), or explicit string mode.

- pi, theta_w, theta_h, theta_l: replace span_denominator with unified span_normalize, use _apply_span_normalize helper
- dxy, da: replace span_denominator: bool with span_normalize
- pi_within_population: delegate to diversity.pi()
- windowed_analysis: pass span_normalize through to all stats
- dxy default changes from per-valid-site to per-base (matching pi)
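A minimal sketch of how a single tri-state parameter might be dispatched follows. It is not the `_apply_span_normalize` helper from the PR: the function name, the `spans` dictionary, and the string mode names are assumptions for illustration.

```python
def apply_span_normalize(raw_sum, span_normalize, spans):
    """Normalize a summed statistic according to a tri-state parameter.

    raw_sum:        statistic summed over sites
    span_normalize: True (auto-detect), False (raw sum), or a mode name
    spans:          dict of available denominators keyed by assumed mode name
    """
    auto_order = ("accessible", "n_total_sites", "genomic")
    if span_normalize is False:
        return raw_sum                          # caller wants the raw sum
    if span_normalize is True:
        for name in auto_order:                 # first available denominator wins
            if spans.get(name) is not None:
                return raw_sum / spans[name]
        raise ValueError("no span available for auto-detection")
    if isinstance(span_normalize, str):
        denom = spans.get(span_normalize)
        if denom is None:                       # explicit mode never falls back
            raise ValueError(f"span mode {span_normalize!r} is not available")
        return raw_sum / denom
    raise TypeError("span_normalize must be a bool or a mode string")


spans = {"accessible": None, "n_total_sites": 2000, "genomic": 2500}
print(apply_span_normalize(5.0, True, spans))       # per-base, via n_total_sites
print(apply_span_normalize(5.0, False, spans))      # raw sum
print(apply_span_normalize(5.0, "genomic", spans))  # explicit denominator
```

Checking `is True` and `is False` before the string branch keeps the three cases unambiguous and lets any other type fail loudly.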
FrequencySpectrum.theta() now accepts span_normalize=True to auto-detect span from the source matrix, instead of requiring the caller to pass span explicitly.
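For reference, per-base normalization of an SFS-based estimator amounts to dividing by the span. The sketch below shows this for Watterson's theta only; `watterson_theta` is a hypothetical standalone function, not `FrequencySpectrum.theta()`, and the span is passed in by hand rather than detected from a source matrix.

```python
import numpy as np


def watterson_theta(sfs, span=None):
    """Watterson's theta from an unfolded site frequency spectrum.

    sfs[i] is the number of sites with derived-allele count i, for i = 0..n.
    With span=None the raw (per-region) estimate is returned; otherwise the
    estimate is divided by span to give a per-base value.
    """
    sfs = np.asarray(sfs, dtype=float)
    n = sfs.size - 1                      # sample size in haplotypes
    seg_sites = sfs[1:n].sum()            # counts 1..n-1 are segregating
    a1 = (1.0 / np.arange(1, n)).sum()    # harmonic number a_1 = sum_{i=1}^{n-1} 1/i
    theta = seg_sites / a1
    return theta if span is None else theta / span


sfs = [990, 3, 2, 1, 4]                    # n = 4 haplotypes, 6 segregating sites
print(watterson_theta(sfs))                # raw estimate, 6 / (11/6)
print(watterson_theta(sfs, span=1000))     # per-base estimate
```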
Summary
- Simplify `missing_data` from 4 modes to 2: `'include'` (default) and `'exclude'`
- Replace `span_normalize` + `span_denominator` with a single `span_normalize` parameter that auto-detects the best denominator (accessible mask > n_total_sites > genomic span)
- Move LD projection to `estimator='sigma_d2'` on `zns()`/`omega()`
- dxy and da take the same `span_normalize` as pi/theta_w
- `FrequencySpectrum.theta()` auto-detects span from source matrix

Bug fixes
- `zeng_e`: fix broken variance formula (canceling a2 terms gave NaN for all n)

Validation
- `include` mode is unbiased under MCAR at 10-60% missingness for 17 statistics