Structural-zero detection for compositional feature tables, as a single fast CLI.
Equivalent to skbio.stats.composition.struc_zero.
Structural zeros are features systematically absent from a sample group — their observed counts are all (or nearly all) zero. ANCOM-BC and related differential-abundance methods detect them as a preprocessing step: a feature that is a structural zero in a group is automatically differentially abundant there and should be excluded from log-ratio analyses of that group. Run this on the raw table, before any pseudocount replacement.
rsomics-struc-zero --table table.tsv --metadata meta.tsv --grouping <col> [--neg-lb] [-o grid.tsv]
- table.tsv: feature table. Header row of feature IDs (corner cell ignored), then one sample_id value... line per sample. Empty / NA cells count as zero.
- meta.tsv: sample metadata. Sample ID in the first column, named covariate columns in the header.
- --grouping: the metadata column whose labels partition samples into groups.
- --neg-lb: flag a feature when the 95% Wald lower bound of its group prevalence is ≤ 0, not only when it is absent from every sample. Recommended for larger per-group sample sizes.
- --csv: comma-separated I/O instead of tab.
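As an illustration of the "Empty / NA cells count as zero" rule, here is a minimal sketch of parsing one sample line. The function names parse_cell and parse_sample_line are hypothetical and are not taken from this crate. The sketch assumes the missing-value token is the exact string NA, which may be narrower or wider than what the CLI accepts.

```rust
/// Hypothetical helper: parse one table cell.
/// Empty and "NA" cells are read as zero; anything else must parse as f64.
fn parse_cell(cell: &str) -> Option<f64> {
    let t = cell.trim();
    if t.is_empty() || t == "NA" {
        return Some(0.0);
    }
    t.parse::<f64>().ok()
}

/// Hypothetical helper: split one `sample_id value...` line on `delim`.
/// Returns None if any cell is neither empty, "NA", nor a number.
fn parse_sample_line(line: &str, delim: char) -> Option<(String, Vec<f64>)> {
    let mut fields = line.split(delim);
    let id = fields.next()?.to_string();
    let values = fields.map(parse_cell).collect::<Option<Vec<f64>>>()?;
    Some((id, values))
}
```

Passing '\t' or ',' as delim mirrors the default tab-separated input and the --csv variant respectively.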
Output is a boolean grid: a header of the (sorted) group labels, then one
feature<TAB>True/False... line per feature, True where the feature is a
structural zero in that group.
Features are independent, so -t parallelises the per-group scan across them.
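Because each feature's flag depends only on that feature's own values, the work splits cleanly into chunks of features. The sketch below shows one way this could look with std::thread::scope. It is an assumption for illustration, not the crate's actual threading code. The function name flag_features_parallel and the column-major layout are hypothetical, and only the plain "absent from every sample" rule is applied.

```rust
use std::thread;

/// Hypothetical sketch: `columns[j]` holds feature j's values for the
/// samples of one group. Each worker thread scans a disjoint chunk of
/// features and writes into the matching chunk of the output.
fn flag_features_parallel(columns: &[Vec<f64>], threads: usize) -> Vec<bool> {
    let threads = threads.max(1);
    // Ceiling division so every feature lands in some chunk.
    let chunk = ((columns.len() + threads - 1) / threads).max(1);
    let mut flags = vec![false; columns.len()];
    thread::scope(|s| {
        for (cols, out) in columns.chunks(chunk).zip(flags.chunks_mut(chunk)) {
            s.spawn(move || {
                for (col, flag) in cols.iter().zip(out.iter_mut()) {
                    // Structural zero: every value is zero (NaN counts as zero).
                    *flag = col.iter().all(|v| v.is_nan() || *v == 0.0);
                }
            });
        }
    });
    flags
}
```

The result does not depend on the thread count, since no state is shared between features.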
This crate is an independent Rust reimplementation of
skbio.stats.composition.struc_zero based on:
- Kaul, A., Mandal, S., Davidov, O., Peddada, S. D., "Analysis of Microbiome Data in the Presence of Excess Zeros", Frontiers in Microbiology 8:2114, 2017. DOI: 10.3389/fmicb.2017.02114
- Lin, H., Peddada, S. D., "Analysis of compositions of microbiomes with bias correction", Nature Communications 11:3514, 2020. DOI: 10.1038/s41467-020-17041-7
- The scikit-bio implementation (Modified BSD License), read and cited: for each
group it counts nonzero samples per feature to get a prevalence
p, optionally replaces it with the Wald lower bound p - 1.96·sqrt(p(1-p)/n), and flags the feature when p ≤ 0. NaN/empty cells are treated as zero.
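The rule described in the last item can be sketched in a few lines. This is a minimal illustration of the stated formula, not the crate's source. The function name struc_zero_group and the row-major input layout are hypothetical.

```rust
/// Hypothetical sketch: structural-zero flags for the samples of ONE group.
/// `rows` is sample-major: each inner Vec is one sample's values per feature.
/// NaN cells are treated as zero, as described above.
fn struc_zero_group(rows: &[Vec<f64>], neg_lb: bool) -> Vec<bool> {
    let n = rows.len() as f64;
    let n_features = rows.first().map_or(0, |r| r.len());
    (0..n_features)
        .map(|j| {
            // Number of samples in which feature j is present (nonzero).
            let present = rows
                .iter()
                .filter(|r| {
                    let v = r[j];
                    !v.is_nan() && v != 0.0
                })
                .count() as f64;
            // Prevalence of feature j within the group.
            let mut p = present / n;
            if neg_lb {
                // Replace p with its Wald lower bound.
                p -= 1.96 * (p * (1.0 - p) / n).sqrt();
            }
            p <= 0.0
        })
        .collect()
}
```

With four samples, a feature present in only one of them has p = 0.25 and a Wald lower bound of roughly -0.17, so it is flagged only when neg_lb is set. A feature present in all four has p = 1 and a zero standard error, so it is never flagged.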
The boolean grid matches scikit-bio exactly, since the computation is counting plus a
closed-form threshold with no RNG and no iteration. tests/compat.rs compares the output
against a committed golden file captured from skbio and against a live skbio.stats.composition.struc_zero run.
Group labels are sorted lexicographically as strings; scikit-bio sorts via
numpy.unique after a float-cast of the metadata, so a column of numeric-looking
labels orders numerically there. Pass non-numeric labels for identical column
order.
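The difference between the two orderings is easy to see on a small example. The sketch below contrasts a plain string sort with a sort after parsing labels as floats. The second function only imitates the numeric ordering described above and makes no claim about how numpy.unique is implemented. Both function names are hypothetical.

```rust
/// Hypothetical sketch: unique labels in lexicographic (string) order.
fn sorted_as_strings(labels: &[&str]) -> Vec<String> {
    let mut v: Vec<String> = labels.iter().map(|s| s.to_string()).collect();
    v.sort();
    v.dedup();
    v
}

/// Hypothetical sketch: unique labels in numeric order.
/// Returns None if any label does not parse as a float.
fn sorted_as_numbers(labels: &[&str]) -> Option<Vec<f64>> {
    let mut v: Vec<f64> = labels
        .iter()
        .map(|s| s.parse::<f64>().ok())
        .collect::<Option<Vec<f64>>>()?;
    v.sort_by(|a, b| a.total_cmp(b));
    v.dedup();
    Some(v)
}
```

For the labels 10, 2, 1 the string sort yields 1, 10, 2 while the numeric sort yields 1, 2, 10, so the grid columns would appear in a different order. Labels such as g01, g02, g10 avoid the mismatch because they never parse as numbers.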
License: MIT OR Apache-2.0. Upstream credit: scikit-bio https://scikit-bio.org/ (Modified BSD License).