Downsample a single-cell count matrix without replacement — a Rust port of
scanpy pp.downsample_counts. Each cell is randomly thinned to at most
--counts-per-cell counts, or the whole matrix to --total-counts, value-exact
with scanpy given the same random_state.
rsomics-sc-downsample matrix.mtx --counts-per-cell 1000 --seed 0 -o out.mtx
rsomics-sc-downsample matrix.mtx --total-counts 1000000 --seed 42 -o out.mtx
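As a rough illustration of what the per-cell thinning does once a set of count indices has been drawn, the sketch below maps the drawn indices back to genes through a cumulative sum. The names `thin_cell` and `picks` are hypothetical and are not taken from this crate. The sketch assumes `picks` holds distinct indices in `0..total`, drawn elsewhere without replacement.

```rust
/// Thin one cell down to `picks.len()` counts.
///
/// `counts` holds the per-gene counts of a single cell. `picks` holds distinct
/// indices in `0..total` (total = sum of `counts`), in any order. Each index
/// identifies one individual count that is kept.
/// Hypothetical helper for illustration; panics if an index is >= total.
fn thin_cell(counts: &[u64], picks: &[u64]) -> Vec<u64> {
    // cumulative[g] = number of counts held by genes 0..=g
    let cumulative: Vec<u64> = counts
        .iter()
        .scan(0u64, |acc, &c| {
            *acc += c;
            Some(*acc)
        })
        .collect();

    // Sorting lets a single forward pass over the genes assign every pick.
    let mut sorted = picks.to_vec();
    sorted.sort_unstable();

    let mut out = vec![0u64; counts.len()];
    let mut gene = 0usize;
    for p in sorted {
        // Advance to the gene whose cumulative range contains index p.
        while p >= cumulative[gene] {
            gene += 1;
        }
        out[gene] += 1;
    }
    out
}
```

For example, with counts `[3, 0, 2, 5]` and picks `[0, 4, 9, 2]`, the two indices below 3 fall on the first gene, index 4 on the third, and index 9 on the fourth, giving `[2, 0, 1, 1]`. Since thinning is "at most" the target, a cell whose total is already at or below the target would presumably be passed through unchanged rather than thinned.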
Input and output are 10x MatrixMarket integer coordinate matrices (genes × cells);
.gz input is read transparently.
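For readers unfamiliar with the format, here is a minimal sketch of reading a MatrixMarket integer coordinate matrix from text: a `%%MatrixMarket` header line, optional `%` comment lines, a size line (`rows cols nnz`), then one 1-based `row col value` triplet per line. The `Coo` type and the `parse_mtx` and `three_fields` functions are hypothetical names for illustration, not this crate's API, and gzip decompression is left out.

```rust
/// Parsed coordinate matrix: dimensions plus 1-based (row, col, value) triplets.
/// Hypothetical type for illustration only.
struct Coo {
    rows: usize,
    cols: usize,
    entries: Vec<(usize, usize, u64)>,
}

/// Parse exactly the first three whitespace-separated integers on a line.
fn three_fields(line: &str) -> Result<[u64; 3], String> {
    let mut tokens = line.split_whitespace();
    let mut out = [0u64; 3];
    for slot in out.iter_mut() {
        let tok = tokens
            .next()
            .ok_or_else(|| format!("short line: {}", line))?;
        *slot = tok
            .parse::<u64>()
            .map_err(|e| format!("bad number {}: {}", tok, e))?;
    }
    Ok(out)
}

/// Parse uncompressed MatrixMarket text with an integer coordinate header.
fn parse_mtx(text: &str) -> Result<Coo, String> {
    let mut lines = text.lines();
    let header = lines.next().ok_or("empty input")?;
    if !header.starts_with("%%MatrixMarket matrix coordinate integer") {
        return Err(format!("unsupported header: {}", header));
    }

    // Skip comment lines (starting with '%') and blank lines.
    let mut body = lines.filter(|l| !l.starts_with('%') && !l.trim().is_empty());

    let [rows, cols, nnz] = three_fields(body.next().ok_or("missing size line")?)?;

    let mut entries = Vec::with_capacity(nnz as usize);
    for line in body {
        let [r, c, v] = three_fields(line)?;
        // Indices in the file are 1-based.
        if r == 0 || r > rows || c == 0 || c > cols {
            return Err(format!("index out of range: {}", line));
        }
        entries.push((r as usize, c as usize, v));
    }
    if entries.len() as u64 != nnz {
        return Err(format!("expected {} entries, found {}", nnz, entries.len()));
    }
    Ok(Coo {
        rows: rows as usize,
        cols: cols as usize,
        entries,
    })
}
```

With genes as rows and cells as columns, each triplet reads as (gene, cell, count), so grouping entries by the second field gives the per-cell count vectors.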
scanpy runs the per-cell draw inside a numba @njit kernel: np.random.seed(s)
then np.random.choice(total, target, replace=False). Inside numba that is the
legacy MT19937 (numba_rnd_init, single-int Knuth seeding — not numpy's
PCG64 default_rng), and choice(replace=False) is permutation(arange(total))[:target]
via a top-down Fisher–Yates over randint(0, i+1). This crate reproduces that
generator and draw bit-for-bit, so the thinned matrix is integer-identical to
scanpy's for the same seed.
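The two ingredients named above can be sketched as follows: a textbook MT19937 with the single-integer Knuth seeding, and a top-down Fisher-Yates shuffle whose first `target` elements form the sample. This is an illustration, not this crate's code. All names (`Mt19937`, `bounded`, `choose_without_replacement`) are hypothetical, and `bounded` is a plain masked-rejection draw chosen for simplicity. It is an assumption and may differ from numba's bounded-integer routine, so the sketch is not expected to reproduce scanpy's output.

```rust
const N: usize = 624;
const M: usize = 397;

/// Textbook MT19937 (32-bit Mersenne Twister). Hypothetical type for illustration.
struct Mt19937 {
    state: [u32; N],
    index: usize,
}

impl Mt19937 {
    /// Single-integer Knuth seeding (the reference `init_genrand` recurrence).
    fn new(seed: u32) -> Self {
        let mut state = [0u32; N];
        state[0] = seed;
        for i in 1..N {
            let prev = state[i - 1];
            state[i] = 1_812_433_253u32
                .wrapping_mul(prev ^ (prev >> 30))
                .wrapping_add(i as u32);
        }
        Mt19937 { state, index: N }
    }

    /// Regenerate all 624 state words in place.
    fn refill(&mut self) {
        for k in 0..N {
            let y = (self.state[k] & 0x8000_0000) | (self.state[(k + 1) % N] & 0x7fff_ffff);
            let mut next = self.state[(k + M) % N] ^ (y >> 1);
            if y & 1 != 0 {
                next ^= 0x9908_b0df;
            }
            self.state[k] = next;
        }
        self.index = 0;
    }

    /// Next tempered 32-bit output.
    fn next_u32(&mut self) -> u32 {
        if self.index >= N {
            self.refill();
        }
        let mut y = self.state[self.index];
        self.index += 1;
        y ^= y >> 11;
        y ^= (y << 7) & 0x9d2c_5680;
        y ^= (y << 15) & 0xefc6_0000;
        y ^= y >> 18;
        y
    }
}

/// Uniform integer in 0..bound by masked rejection.
/// ASSUMPTION: simplified stand-in, not numba's exact routine.
/// Requires 1 <= bound <= 2^32.
fn bounded(rng: &mut Mt19937, bound: usize) -> usize {
    let max = (bound - 1) as u32;
    // Smallest all-ones mask covering `max`.
    let mut mask = max;
    mask |= mask >> 1;
    mask |= mask >> 2;
    mask |= mask >> 4;
    mask |= mask >> 8;
    mask |= mask >> 16;
    loop {
        let v = rng.next_u32() & mask;
        if v <= max {
            return v as usize;
        }
    }
}

/// Shuffle 0..total with a top-down Fisher-Yates and keep the first `target` items.
/// `below(b)` must return a uniform integer in 0..b.
fn choose_without_replacement(
    total: usize,
    target: usize,
    mut below: impl FnMut(usize) -> usize,
) -> Vec<usize> {
    let mut pool: Vec<usize> = (0..total).collect();
    let mut i = total.saturating_sub(1);
    while i > 0 {
        let j = below(i + 1); // j in 0..=i
        pool.swap(i, j);
        i -= 1;
    }
    pool.truncate(target);
    pool
}
```

A draw would then look like `choose_without_replacement(total, target, |b| bounded(&mut rng, b))` with `rng` built by `Mt19937::new(seed)`. Passing the bounded draw as a closure keeps the shuffle independent of how the integer is produced, which is the part that has to match numba exactly for bit-for-bit agreement.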
On a 20000 × 4000 matrix (4.8M nnz), this crate runs 2.45× faster (wall and CPU time) and uses 5.6× less peak memory than scanpy, with byte-identical output. See the perf record for provenance.
This crate is an independent Rust reimplementation of scanpy.pp.downsample_counts
based on:
- The published method (Wolf, Angerer & Theis, SCANPY: large-scale single-cell gene expression data analysis, Genome Biology 2018, DOI 10.1186/s13059-017-1382-0).
- The scanpy source (scanpy/preprocessing/_simple.py) and numba's cpython/randomimpl.py + _random.c, read and cited to reproduce the exact RNG draw and per-cell thinning algorithm (scanpy is BSD-3-Clause, numba is BSD-2-Clause).
- Differential testing against the upstream binary.
License: MIT OR Apache-2.0. Upstream credit: scanpy (BSD-3-Clause), numba (BSD-2-Clause).