BIO-ENV / BEST — find the subset of environmental variables whose standardized
Euclidean distances are maximally rank-correlated with a community distance
matrix (Clarke & Ainsworth 1993). Drop-in compatible with
skbio.stats.distance.bioenv.
Usage: rsomics-bioenv dm.tsv --env env.tsv [--columns a,b,c] [-o result.tsv]
dm.tsv is an lsmat-format community distance matrix (a blank top-left corner,
a tab-separated id header, then one id<TAB>values… row per sample). env.tsv
is a samples × variables table: a header <id-label><TAB>var1<TAB>var2…, then
one sampleid<TAB>v1<TAB>v2… row per sample. Env rows are reindexed onto the
distance-matrix ids, so the ids must match but need not be in the same order;
extra env rows are ignored. All variable values must be numeric.
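The reindexing step can be sketched as follows. This is a minimal illustration, not the crate's actual code: the function name `reindex_env` and its signature are hypothetical, and returning an error for an id missing from the env table is an assumption based on the requirement that the ids match.

```rust
use std::collections::HashMap;

/// Reorder env rows to follow the distance-matrix id order.
/// Hypothetical helper: env rows whose id is not in `dm_ids` are simply never
/// looked up (i.e. ignored); a dm id absent from the env table is an error.
/// Assumption: env ids are unique (with duplicates, the last row would win).
fn reindex_env(
    dm_ids: &[&str],
    env_ids: &[&str],
    env_rows: &[Vec<f64>],
) -> Result<Vec<Vec<f64>>, String> {
    // Map each env sample id to its row position.
    let lookup: HashMap<&str, usize> = env_ids
        .iter()
        .enumerate()
        .map(|(i, id)| (*id, i))
        .collect();

    // Walk the ids in distance-matrix order and pull the matching env row.
    dm_ids
        .iter()
        .map(|id| {
            lookup
                .get(id)
                .map(|&i| env_rows[i].clone())
                .ok_or_else(|| format!("sample id {id:?} missing from env table"))
        })
        .collect()
}
```

With distance-matrix ids `["s2", "s1"]` and env rows for `s1`, `s2`, `s3`, the sketch returns the `s2` row followed by the `s1` row and never touches `s3`.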
The variables are standardized first (centered, then divided by the sample
standard deviation). For each subset size from 1 to the number of variables,
every variable subset of that size is evaluated: pairwise Euclidean distances
between samples are computed from the subset's columns, and Spearman's ρ is
taken between those distances and the community distances, both read from the
upper triangle of their matrices. The subset with the highest ρ at each size is
reported. Output is a TSV with one row per subset size: size, the best
correlation, and the comma-joined vars.
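The standardization and distance steps can be sketched as follows. The function names and the row-major `Vec<Vec<f64>>` layout are illustrative assumptions, not the crate's internals. The sketch also assumes at least two samples and no constant column (a zero standard deviation would produce NaN).

```rust
/// Center each column and divide by its sample standard deviation
/// (n - 1 denominator). `table` holds one Vec per sample, one entry per variable.
/// Assumes n >= 2 and no constant columns.
fn standardize(table: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let n = table.len();
    let p = table.first().map_or(0, |row| row.len());
    let mut out = table.to_vec();
    for j in 0..p {
        let mean = table.iter().map(|r| r[j]).sum::<f64>() / n as f64;
        let var = table.iter().map(|r| (r[j] - mean).powi(2)).sum::<f64>()
            / (n as f64 - 1.0);
        let sd = var.sqrt();
        for (i, row) in out.iter_mut().enumerate() {
            row[j] = (table[i][j] - mean) / sd;
        }
    }
    out
}

/// Euclidean distances between all sample pairs (i < j), using only the
/// columns in `cols`, in row-major upper-triangle ("condensed") order.
fn condensed_euclidean(z: &[Vec<f64>], cols: &[usize]) -> Vec<f64> {
    let n = z.len();
    let mut d = Vec::with_capacity(n * n.saturating_sub(1) / 2);
    for i in 0..n {
        for j in (i + 1)..n {
            let sq: f64 = cols.iter().map(|&c| (z[i][c] - z[j][c]).powi(2)).sum();
            d.push(sq.sqrt());
        }
    }
    d
}
```

Standardizing once up front and then selecting columns per subset avoids recomputing means and standard deviations for every subset.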
This is an exhaustive search over all 2^p − 1 non-empty subsets of p variables, so runtime grows exponentially with the variable count; scikit-bio gives the same warning.
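The per-size exhaustive search can be sketched as follows. The names `combinations` and `best_per_size` are hypothetical, and the `score` closure stands in for the Spearman correlation of a subset; this shows the enumeration pattern under those assumptions, not the crate's code. The strict `>` comparison is what keeps the first subset on a tie.

```rust
/// All k-element subsets of 0..p, in lexicographic order.
/// Returns nothing for k == 0 or k > p (the search only uses sizes 1..=p).
fn combinations(p: usize, k: usize) -> Vec<Vec<usize>> {
    let mut out = Vec::new();
    if k == 0 || k > p {
        return out;
    }
    let mut idx: Vec<usize> = (0..k).collect();
    loop {
        out.push(idx.clone());
        // Find the rightmost position that has not reached its maximum value.
        let mut i = k;
        while i > 0 && idx[i - 1] == p - k + (i - 1) {
            i -= 1;
        }
        if i == 0 {
            return out; // every position is at its maximum: enumeration done
        }
        idx[i - 1] += 1;
        for j in i..k {
            idx[j] = idx[j - 1] + 1;
        }
    }
}

/// For each subset size, return (size, best score, best subset).
/// `score` is a placeholder for the correlation of a variable subset.
fn best_per_size<F: Fn(&[usize]) -> f64>(
    p: usize,
    score: F,
) -> Vec<(usize, f64, Vec<usize>)> {
    let mut results = Vec::with_capacity(p);
    for k in 1..=p {
        let mut best: Option<(f64, Vec<usize>)> = None;
        for subset in combinations(p, k) {
            let s = score(&subset);
            // Strictly greater: an equal later score never replaces the first.
            if best.as_ref().map_or(true, |(b, _)| s > *b) {
                best = Some((s, subset));
            }
        }
        if let Some((s, subset)) = best {
            results.push((k, s, subset));
        }
    }
    results
}
```

Summed over all sizes, `combinations` yields 2^p − 1 subsets, which is where the exponential cost comes from.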
This crate is an independent Rust reimplementation of
skbio.stats.distance.bioenv, informed by its BSD-3-licensed source (the
center-and-scale standardization with sample standard deviation, the
upper-triangle condensed Euclidean distances, the per-subset-size exhaustive
search, and the "first subset on a tie" rule matching vegan::bioenv) and by
the method's primary reference:
- Clarke, K. R. & Ainsworth, M. (1993). "A method of linking multivariate community structure to environmental variables." Marine Ecology Progress Series 92: 205–219. doi:10.3354/meps092205.
Spearman's ρ is computed as Pearson correlation on average-ranked distances,
matching scipy.stats.spearmanr; the community ranks are centered once and
reused across all subsets.
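Spearman's ρ as Pearson correlation on average ranks can be sketched as follows. The function names are illustrative, not the crate's API, and the sketch assumes inputs contain no NaN and are not constant. Splitting out `centered` shows how the community ranks could be centered once and reused for every subset.

```rust
/// 1-based average ranks: tied values share the mean of the positions they span.
fn average_ranks(values: &[f64]) -> Vec<f64> {
    let mut order: Vec<usize> = (0..values.len()).collect();
    order.sort_by(|&a, &b| values[a].total_cmp(&values[b]));
    let mut ranks = vec![0.0; values.len()];
    let mut start = 0;
    while start < order.len() {
        // Extend `end` over the run of values tied with order[start].
        let mut end = start + 1;
        while end < order.len() && values[order[end]] == values[order[start]] {
            end += 1;
        }
        // The run covers ranks start+1 ..= end; their mean is (start + 1 + end) / 2.
        let avg = (start + 1 + end) as f64 / 2.0;
        for &i in &order[start..end] {
            ranks[i] = avg;
        }
        start = end;
    }
    ranks
}

/// Subtract the mean from every element.
fn centered(x: &[f64]) -> Vec<f64> {
    let mean = x.iter().sum::<f64>() / x.len() as f64;
    x.iter().map(|v| v - mean).collect()
}

/// Pearson correlation of two vectors that are already centered.
fn pearson_centered(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    dot / (na * nb)
}

/// Spearman's rho: Pearson correlation of the average ranks.
fn spearman(x: &[f64], y: &[f64]) -> f64 {
    pearson_centered(&centered(&average_ranks(x)), &centered(&average_ranks(y)))
}
```

In a search loop, `centered(&average_ranks(community))` would be computed once outside the loop, and only the environmental distances would be re-ranked per subset.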
License: MIT OR Apache-2.0. Upstream credit: scikit-bio https://scikit-bio.org (BSD-3-Clause).