simplify missing data and span normalization API #32

Merged
andrewkern merged 9 commits into main from refactor/simplify-missing-data
Apr 7, 2026

Conversation

@andrewkern
Member

Summary

  • Reduce missing_data from 4 modes to 2: 'include' (default) and 'exclude'
  • Replace span_normalize + span_denominator with a single span_normalize parameter that auto-detects the best denominator (accessible mask > n_total_sites > genomic span)
  • Move LD projection estimator to estimator='sigma_d2' on zns()/omega()
  • Unify dxy/da/pi_within_population to use the same span_normalize as pi/theta_w
  • Achaz FrequencySpectrum.theta() auto-detects span from source matrix
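
A minimal sketch of the denominator selection described in the summary, assuming a hypothetical standalone helper named `resolve_span` (not the library's actual function or signature). It follows the stated priority (accessible mask > n_total_sites > genomic span) and the rule from the commit notes that an explicit 'accessible' request errors when no mask is set:

```python
import numpy as np

def resolve_span(span_normalize=True, accessible_mask=None,
                 n_total_sites=None, genomic_span=None):
    """Hypothetical helper: return the denominator, or None for a raw sum."""
    if span_normalize is False:
        return None  # caller keeps the raw sum
    if span_normalize == "accessible":
        # Explicit mode: fail loudly instead of silently falling back.
        if accessible_mask is None:
            raise ValueError("span mode 'accessible' requires an accessible mask")
        return float(np.count_nonzero(accessible_mask))
    if span_normalize is not True:
        raise ValueError(f"unsupported span_normalize: {span_normalize!r}")
    # Auto-detect: take the most informative denominator that is available.
    if accessible_mask is not None:
        return float(np.count_nonzero(accessible_mask))
    if n_total_sites is not None:
        return float(n_total_sites)
    if genomic_span is not None:
        return float(genomic_span)
    raise ValueError("no span information available")

mask = np.array([True, True, False, True])  # 3 accessible bases
raw_pi_sum = 0.6                            # illustrative raw sum over sites
span = resolve_span(True, accessible_mask=mask, n_total_sites=10, genomic_span=100)
pi_per_base = raw_pi_sum / span             # divides by the 3 accessible bases
```

The real implementation may accept further string modes; only 'accessible' is modeled here because it is the one the PR names.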

Bug fixes

  • WC FST: fix numerator/denominator mismatch that inflated allele frequencies by 1/(1-miss_rate) under missing data, biasing FST upward
  • Achaz zeng_e: fix broken variance formula (canceling a2 terms gave a negative variance, and hence NaN, for all n)
  • dxy/da: fix crash in exclude mode (shape mismatch after site filtering)
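
The WC FST mismatch can be illustrated with a toy single-site example. This is a sketch only: the function names and the `-1` missing code are assumptions, not the library's code. If each haplotype is missing independently at rate m (an assumption of this sketch), valid haplotypes occur at rate (1-m) and complete diploid pairs at rate (1-m)^2, so counting alleles over the former while dividing by the latter inflates the frequency by 1/(1-m):

```python
import numpy as np

MISSING = -1  # assumed missing-data code for this sketch

def alt_freq_mismatched(hap_a, hap_b):
    """Buggy pattern: alleles counted over every valid haplotype,
    but divided by the allele count of complete diploid pairs."""
    valid_a, valid_b = hap_a != MISSING, hap_b != MISSING
    alt = np.sum(hap_a[valid_a] == 1) + np.sum(hap_b[valid_b] == 1)
    complete = valid_a & valid_b
    return alt / (2 * complete.sum())

def alt_freq_complete_pairs(hap_a, hap_b):
    """Fixed pattern: numerator and denominator both use complete pairs."""
    complete = (hap_a != MISSING) & (hap_b != MISSING)
    alt = np.sum(hap_a[complete] == 1) + np.sum(hap_b[complete] == 1)
    return alt / (2 * complete.sum())

# Four diploid individuals at one site; two have one missing haplotype.
hap_a = np.array([1, 1, 0, MISSING])
hap_b = np.array([1, MISSING, 0, 1])

buggy = alt_freq_mismatched(hap_a, hap_b)      # 4 alt alleles / 4 -> 1.0
fixed = alt_freq_complete_pairs(hap_a, hap_b)  # 2 alt alleles / 4 -> 0.5
```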

Validation

  • 53 simulation-based bias tests confirming include mode is unbiased under MCAR at 10-60% missingness for 17 statistics
  • 495 unit tests pass
  • 11/12 scikit-allel comparisons match on real Ag1000G 3L (71% missing)
  • Net: -81 lines across 26 files
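
The kind of bias check listed under Validation can be sketched as follows. This is not the PR's test suite: the simulation design, function names, and `-1` missing code are assumptions. It masks genotypes completely at random and compares per-site pi on the masked data with the value from the complete data:

```python
import numpy as np

MISSING = -1  # assumed missing-data code for this sketch

def mean_per_site_pi(hap):
    """Mean pairwise diversity per site, using each site's own valid sample size."""
    valid = hap != MISSING
    n = valid.sum(axis=0)          # valid haplotypes per site
    k = (hap == 1).sum(axis=0)     # derived-allele count per site
    ok = n >= 2                    # pi is undefined with fewer than 2 samples
    pi = 2.0 * k[ok] * (n[ok] - k[ok]) / (n[ok] * (n[ok] - 1))
    return float(pi.mean())

def relative_bias(miss_rate, n_hap=20, n_sites=20000, seed=0):
    """Relative difference between masked and complete-data estimates."""
    rng = np.random.default_rng(seed)
    freqs = rng.uniform(0.05, 0.95, size=n_sites)
    hap = (rng.random((n_hap, n_sites)) < freqs).astype(np.int8)
    truth = mean_per_site_pi(hap)
    masked = hap.copy()
    masked[rng.random(hap.shape) < miss_rate] = MISSING  # MCAR masking
    return (mean_per_site_pi(masked) - truth) / truth

biases = {m: relative_bias(m) for m in (0.1, 0.3, 0.6)}
```

Each value in `biases` should sit close to zero, because the average pairwise difference in a random subsample has the same expectation as in the full sample.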

48-test bias suite verifying include mode is unbiased under MCAR
at 10-60% missingness for pi, theta_w, theta_h, theta_l, tajd,
dxy, fst_hudson, da, and all Achaz estimators. Marked slow.

Fix achaz zeng_e: variance formula had canceling a2 terms giving
negative variance for all n. Replaced with neutrality_test() which
uses the general Achaz (2009) / Fu (1995) covariance framework.

Documents known biases: WC FST (inflates), seg_sites (deflates).

Simplify missing_data to two options: 'include' (default) and 'exclude'.

Simulation testing confirms per-site averaging (include mode) is unbiased
under MCAR at 10-60% missingness for all major statistics.
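
A toy illustration of the two modes, assuming 'include' means per-site averaging over each site's valid haplotypes and 'exclude' means dropping every site with any missing call. The function `pi_sum` and the `-1` missing code are hypothetical, not the library's API:

```python
import numpy as np

MISSING = -1  # assumed missing-data code for this sketch

def pi_sum(hap, missing_data="include"):
    """Raw (unnormalized) sum of per-site pi over a (n_haplotypes, n_sites) matrix."""
    if missing_data == "exclude":
        keep = (hap != MISSING).all(axis=0)  # drop sites with any missing call
        hap = hap[:, keep]
    elif missing_data != "include":
        raise ValueError(f"unsupported missing_data: {missing_data!r}")
    valid = hap != MISSING
    n = valid.sum(axis=0)                    # per-site valid sample size
    k = (hap == 1).sum(axis=0)               # per-site derived-allele count
    ok = n >= 2
    pi = 2.0 * k[ok] * (n[ok] - k[ok]) / (n[ok] * (n[ok] - 1))
    return float(pi.sum())

# 4 haplotypes x 3 sites; the middle site has one missing call.
hap = np.array([[0, 1, 0],
                [1, MISSING, 0],
                [1, 0, 0],
                [0, 0, 1]])

included = pi_sum(hap, "include")  # uses all 3 sites, n = 4, 3, 4
excluded = pi_sum(hap, "exclude")  # uses only the 2 complete sites
```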

- Remove pairwise branches from pi, theta_w, tajimas_d, dxy, fst_hudson,
  fst_weir_cockerham, fst_nei, da, pi_within_population
- Remove dead pairwise->include conversions from 11 functions
- Move LD projection to estimator='sigma_d2' param on zns/omega
- Rename _pairwise_pi_components to pi_components (public API)
- Delete test_pairwise_mode.py (543 lines)
- Clean all docstrings and comments
- Net: -730 lines

Two modes (include/exclude), LD estimator parameter, span
normalization, accessible masks, Achaz framework, component access.
Remove all pairwise/project mode documentation and examples.

Allele frequencies were computed from all individually-valid
haplotypes but divided by the count of complete diploid pairs,
inflating frequencies by 1/(1-miss_rate). Now both allele counts
and het are computed from complete diploid pairs only.

Before: +22% bias at 10% missing, +3674% at 60%
After:  +0.2% at 10%, -1.3% at 60% (unbiased)

get_span('auto') selects the best denominator: accessible mask >
n_total_sites > genomic span > callable span. Explicit 'accessible'
now errors if no mask is set instead of silently falling back.

Replace span_normalize + span_denominator with a single span_normalize
parameter accepting True (auto-detect: accessible > n_total_sites >
genomic span), False (raw sum), or explicit string mode.

- pi, theta_w, theta_h, theta_l: replace span_denominator with
  unified span_normalize, use _apply_span_normalize helper
- dxy, da: replace span_denominator: bool with span_normalize
- pi_within_population: delegate to diversity.pi()
- windowed_analysis: pass span_normalize through to all stats
- dxy default changes from per-valid-site to per-base (matching pi)

FrequencySpectrum.theta() now accepts span_normalize=True to
auto-detect span from the source matrix, instead of requiring
the caller to pass span explicitly.

@andrewkern force-pushed the refactor/simplify-missing-data branch from e8fafed to 23c5d73 on April 7, 2026 17:11
@andrewkern merged commit 0da5b8f into main on Apr 7, 2026
1 check passed
@andrewkern deleted the refactor/simplify-missing-data branch on April 16, 2026 14:15