rsomics-sc-downsample

Downsample a single-cell count matrix without replacement — a Rust port of scanpy pp.downsample_counts. Each cell is randomly thinned to at most --counts-per-cell counts, or the whole matrix to --total-counts, value-exact with scanpy given the same random_state.

rsomics-sc-downsample matrix.mtx --counts-per-cell 1000 --seed 0 -o out.mtx
rsomics-sc-downsample matrix.mtx --total-counts 1000000 --seed 42 -o out.mtx
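The per-cell thinning step can be pictured with a small sketch. It assumes the scheme used by scanpy's downsampling: treat a cell's counts as `total` individual molecules laid end to end by gene, pick `target` distinct molecule indices, and tally which gene each picked index falls in. The function name `thin_cell` is hypothetical and is not taken from this crate. The sampled indices are passed in, so the sketch does not depend on any particular random generator.

```rust
/// Thin one cell's per-gene counts down to `sample.len()` counts.
///
/// `sample` must hold distinct molecule indices in `0..total`, where `total`
/// is the sum of `counts`. It is sorted in place so that a single forward
/// pass over the cumulative counts can assign every index to its gene.
fn thin_cell(counts: &[u32], sample: &mut [u64]) -> Vec<u32> {
    // cumulative[g] = counts[0] + ... + counts[g]
    let mut cumulative = Vec::with_capacity(counts.len());
    let mut running = 0u64;
    for &c in counts {
        running += u64::from(c);
        cumulative.push(running);
    }

    sample.sort_unstable();

    let mut thinned = vec![0u32; counts.len()];
    let mut gene = 0usize;
    for &idx in sample.iter() {
        assert!(idx < running, "sampled index out of range");
        // Advance to the first gene whose cumulative count exceeds idx.
        while idx >= cumulative[gene] {
            gene += 1;
        }
        thinned[gene] += 1;
    }
    thinned
}
```

For a cell with counts `[3, 0, 2]` and sampled indices `[4, 0, 2]`, indices 0 and 2 fall in the first gene and index 4 in the third, giving `[2, 0, 1]`.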

Input and output are 10x MatrixMarket integer coordinate matrices (genes × cells); .gz input is read transparently.
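As an illustration of the file format, here is a minimal sketch of reading a MatrixMarket integer coordinate matrix from a string using only the standard library. The names `CoordMatrix`, `parse_mtx` and `parse_fields` are hypothetical and do not describe this crate's internals. Gzip decompression is left out, and only the banner keywords `coordinate` and `integer` are checked.

```rust
/// One nonzero entry: 1-based row (gene), 1-based column (cell), count.
type Entry = (usize, usize, u64);

struct CoordMatrix {
    rows: usize,
    cols: usize,
    entries: Vec<Entry>,
}

/// Parse three whitespace-separated non-negative integers from one line.
fn parse_fields(line: &str) -> Result<[u64; 3], String> {
    let mut out = [0u64; 3];
    let mut tokens = line.split_whitespace();
    for slot in out.iter_mut() {
        let tok = tokens.next().ok_or_else(|| format!("short line: {line}"))?;
        *slot = tok.parse().map_err(|e| format!("bad integer {tok:?}: {e}"))?;
    }
    Ok(out)
}

fn parse_mtx(text: &str) -> Result<CoordMatrix, String> {
    let mut lines = text.lines();

    // The banner is case-insensitive, so compare in lowercase.
    let banner = lines.next().ok_or("empty input")?.to_ascii_lowercase();
    if !banner.starts_with("%%matrixmarket")
        || !banner.contains("coordinate")
        || !banner.contains("integer")
    {
        return Err(format!("unsupported banner: {banner}"));
    }

    // Skip comment lines (leading '%') and blank lines.
    let mut data = lines
        .map(str::trim)
        .filter(|l| !l.is_empty() && !l.starts_with('%'));

    // Size line: rows, columns, number of stored entries.
    let size = parse_fields(data.next().ok_or("missing size line")?)?;
    let (rows, cols, nnz) = (size[0] as usize, size[1] as usize, size[2] as usize);

    let mut entries = Vec::new();
    for line in data {
        let f = parse_fields(line)?;
        let (r, c) = (f[0] as usize, f[1] as usize);
        if r == 0 || r > rows || c == 0 || c > cols {
            return Err(format!("index out of bounds: {line}"));
        }
        entries.push((r, c, f[2]));
    }
    if entries.len() != nnz {
        return Err(format!("expected {nnz} entries, found {}", entries.len()));
    }
    Ok(CoordMatrix { rows, cols, entries })
}
```

Indices stay 1-based here, as they appear in the file. A real reader would likely convert them to 0-based and group entries by cell before thinning.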

Bit-exact RNG

scanpy runs the per-cell draw inside a numba @njit kernel: np.random.seed(s) then np.random.choice(total, target, replace=False). Inside numba that is the legacy MT19937 (numba_rnd_init, single-int Knuth seeding — not numpy's PCG64 default_rng), and choice(replace=False) is permutation(arange(total))[:target] via a top-down Fisher–Yates over randint(0, i+1). This crate reproduces that generator and draw bit-for-bit, so the thinned matrix is integer-identical to scanpy's for the same seed.
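The shape of that draw can be sketched as follows. The MT19937 core and its single-integer seeding follow the published reference algorithm. The bounded draw `below`, however, is an assumption: it is a generic rejection sampler standing in for numba's exact `randint` routine, so this sketch is not claimed to reproduce scanpy's output. All names (`Mt19937`, `below`, `sample_without_replacement`) are hypothetical and are not this crate's API.

```rust
const N: usize = 624;
const M: usize = 397;

struct Mt19937 {
    state: [u32; N],
    index: usize,
}

impl Mt19937 {
    /// Single-integer seeding (reference `init_genrand`, Knuth-style multiplier).
    fn from_seed(seed: u32) -> Self {
        let mut state = [0u32; N];
        state[0] = seed;
        for i in 1..N {
            let prev = state[i - 1];
            state[i] = 1_812_433_253u32
                .wrapping_mul(prev ^ (prev >> 30))
                .wrapping_add(i as u32);
        }
        Mt19937 { state, index: N }
    }

    /// Regenerate the whole state block in place.
    fn twist(&mut self) {
        for i in 0..N {
            let y = (self.state[i] & 0x8000_0000) | (self.state[(i + 1) % N] & 0x7fff_ffff);
            let mut next = self.state[(i + M) % N] ^ (y >> 1);
            if y & 1 != 0 {
                next ^= 0x9908_b0df;
            }
            self.state[i] = next;
        }
        self.index = 0;
    }

    fn next_u32(&mut self) -> u32 {
        if self.index >= N {
            self.twist();
        }
        let mut y = self.state[self.index];
        self.index += 1;
        // Tempering.
        y ^= y >> 11;
        y ^= (y << 7) & 0x9d2c_5680;
        y ^= (y << 15) & 0xefc6_0000;
        y ^= y >> 18;
        y
    }

    /// ASSUMPTION: uniform integer in `0..n` by rejection on the top bits.
    /// numba's actual bounded draw may consume the stream differently.
    fn below(&mut self, n: u32) -> u32 {
        assert!(n >= 2);
        let bits = 32 - (n - 1).leading_zeros();
        loop {
            let r = self.next_u32() >> (32 - bits);
            if r < n {
                return r;
            }
        }
    }
}

/// Shuffle `0..total` top-down (Fisher–Yates from the last index to the
/// second) and keep the first `target` elements of the permutation.
fn sample_without_replacement(rng: &mut Mt19937, total: u32, target: usize) -> Vec<u32> {
    let mut perm: Vec<u32> = (0..total).collect();
    let mut i = perm.len();
    while i > 1 {
        i -= 1;
        let j = rng.below(i as u32 + 1) as usize;
        perm.swap(i, j);
    }
    perm.truncate(target);
    perm
}
```

Calling `sample_without_replacement(&mut Mt19937::from_seed(0), total, target)` yields `target` distinct indices in `0..total`. Matching scanpy bit-for-bit additionally requires the bounded draw to consume the generator's output exactly as numba does, which this sketch does not attempt.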

Performance

2.45× faster (wall and CPU-time) and 5.6× less peak memory than scanpy on a 20000 × 4000 matrix (4.8M nnz), byte-identical output. See the perf record for provenance.

Origin

This crate is an independent Rust reimplementation of scanpy.pp.downsample_counts based on:

  • The published method (Wolf, Angerer & Theis, SCANPY: large-scale single-cell gene expression data analysis, Genome Biology 2018, DOI 10.1186/s13059-017-1382-0).
  • The scanpy source (scanpy/preprocessing/_simple.py, BSD-3-Clause) and numba's cpython/randomimpl.py + _random.c (BSD-2-Clause), read and cited to reproduce the exact RNG draw and per-cell thinning algorithm.
  • Differential testing against upstream scanpy.

License: MIT OR Apache-2.0. Upstream credit: scanpy (BSD-3-Clause), numba (BSD-2-Clause).
