v0.8.0 brings multi-GPU support, a full performance sweep across the solver pipeline, and a lot less dead code.
Highlights
Multi-GPU solving: kangaroo can now dispatch work across multiple GPUs simultaneously (#69). Been the most requested feature for a while - if you have more than one card, the solver finally uses them all.
GPU pipeline got significantly faster. Compute pipelines are cached across repeated solves now (#100) so you don't pay compilation cost every time. Lock scope during pipeline compilation was way too broad - narrowed it down (#101). Dispatch and DP readback run pipelined instead of sequential (#95). Together these cut GPU overhead substantially for repeated or batched solves.
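To illustrate the caching and lock-scope changes, here is a minimal sketch of the general pattern, not kangaroo's actual code. `Pipeline`, `compile`, and `PipelineCache` are hypothetical stand-ins: the lock is held only for the lookup and the final insert, while the expensive compile step runs unlocked.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a compiled GPU compute pipeline.
#[derive(Debug)]
pub struct Pipeline {
    pub source: String,
}

/// Hypothetical stand-in for the expensive compilation step.
fn compile(source: &str) -> Pipeline {
    Pipeline { source: source.to_owned() }
}

/// Cache keyed by a caller-chosen name (an assumption for this sketch).
#[derive(Default)]
pub struct PipelineCache {
    entries: Mutex<HashMap<String, Arc<Pipeline>>>,
}

impl PipelineCache {
    pub fn get_or_compile(&self, key: &str, source: &str) -> Arc<Pipeline> {
        // Fast path: the guard is a temporary, so the lock is released
        // at the end of this statement.
        let cached = self.entries.lock().unwrap().get(key).cloned();
        if let Some(hit) = cached {
            return hit;
        }

        // Slow path: compile without holding the lock, so other threads
        // can still read the cache in the meantime.
        let fresh = Arc::new(compile(source));

        // Re-lock only to publish. If another thread inserted the same key
        // first, keep its entry and drop ours.
        let mut entries = self.entries.lock().unwrap();
        entries.entry(key.to_owned()).or_insert(fresh).clone()
    }
}
```

The trade-off in this sketch is that two threads racing on the same key may both compile, but neither blocks the other for the duration of the compile.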
CPU hot paths got attention too. Post-jump affine was recomputing field inversions that don't change between walks - cached (#97). Hot loop for x-coordinate extraction was hitting the allocator on every iteration - gone (#91).
Solver startup for small ranges used to do full initialization even when the search space was tiny. Cut that overhead (#99). StoredDP distances are fixed-size arrays now instead of heap-allocated (#88), and walk + DP readback logic is tighter (#87).
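As a rough illustration of the fixed-size distance change, the sketch below stores a distance inline as 32 big-endian bytes instead of a heap-allocated buffer. The type name `InlineDp`, the 32-byte width, and the byte order are assumptions for the example, not the crate's actual layout.

```rust
/// Hypothetical DP record: the distance lives inline, so the struct is
/// `Copy` and creating one never touches the allocator.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct InlineDp {
    pub dist: [u8; 32],
}

impl InlineDp {
    /// Build from a big-endian byte slice, left-padding with zeros.
    /// Returns `None` if the slice does not fit in 32 bytes.
    pub fn from_be_slice(bytes: &[u8]) -> Option<Self> {
        if bytes.len() > 32 {
            return None;
        }
        let mut dist = [0u8; 32];
        dist[32 - bytes.len()..].copy_from_slice(bytes);
        Some(Self { dist })
    }
}
```

Compared with a `Vec<u8>` field, each record here is a flat 32 bytes with no pointer to chase, which tends to matter when many records sit in a table.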
GPU poll waits could hang indefinitely under certain device conditions - bounded now (#85). DP counter wasn't resetting between calibration probes, which gave wrong calibration numbers on repeated runs (#83). Provider bounds rejected exact-fit ranges that should've been valid (#76). Oversized U256 hex input panicked instead of returning an error (#75).
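The hex fix follows a common pattern: validate the length up front and return an error rather than indexing past a fixed buffer. A minimal sketch of that idea follows; `parse_u256_hex` and `HexError` are hypothetical names, not the crate's API.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum HexError {
    Empty,
    TooLong { digits: usize },
    InvalidDigit(char),
}

/// Parse up to 64 hex digits (optional `0x` prefix) into 32 big-endian bytes.
pub fn parse_u256_hex(input: &str) -> Result<[u8; 32], HexError> {
    let digits = input.strip_prefix("0x").unwrap_or(input);
    if digits.is_empty() {
        return Err(HexError::Empty);
    }
    // Reject oversized input before touching the output buffer.
    if digits.len() > 64 {
        return Err(HexError::TooLong { digits: digits.len() });
    }

    let mut out = [0u8; 32];
    // Walk from the least significant digit so short inputs are left-padded.
    for (i, c) in digits.chars().rev().enumerate() {
        let nibble = c.to_digit(16).ok_or(HexError::InvalidDigit(c))? as u8;
        let byte = 31 - i / 2;
        out[byte] |= if i % 2 == 0 { nibble } else { nibble << 4 };
    }
    Ok(out)
}
```

With this shape, a 65-digit input yields `Err(HexError::TooLong { digits: 65 })` and the caller decides how to report it.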
Swept dead code across the whole crate - dropped unused SharedResources, dead constructors, is_provider predicate, LE arithmetic helpers, DP mask helpers, stale dead_code allows, and unused dashmap/thiserror/futures deps (#84-#94). Net result: cleaner dependency tree and less surface area.
Upgrading
cargo install kangaroo
Changelog
Features
- gpu: Allow multiple devices (#69)
Performance
- gpu: Cache compute pipelines across repeated solves (#100)
- gpu: Narrow pipeline cache lock scope during compilation (#101)
- solver: Pipeline GPU dispatch and DP readback (#95)
- solver: Cut startup overhead for small-range solves (#99)
- solver: Tighten walk and DP readback (#87)
- cpu: Cache post-jump affine to skip redundant field inversions (#97)
- cpu: Avoid heap alloc in hot loop x-coordinate extraction (#91)
- dp_table: Inline `StoredDP` dist as fixed-size array (#88)
Fixes
- solver: Bound GPU poll waits to prevent indefinite hangs (#85)
- solver: Reset DP counter between calibration probes (#83)
- provider: Allow exact-fit search ranges for provider bounds (#76)
- crypto: Avoid panic on oversized U256 hex input (#75)
- cli: Make benchmark dispatch explicit (#77)
- ci: Gate crates publish with release checks (#82)
Refactors
- deps: Drop unused `dashmap`, `thiserror`, and `futures` (#84)
- solver: Drop unused `SharedResources` and dead constructors (#89)
- crypto: Drop dead utils, remove stale `dead_code` allows (#90)
- math: Drop dead LE arithmetic and DP mask helpers (#92)
- provider: Drop unused `is_provider` predicate (#93)
- dp_table: Drop dead `len` and `is_empty` methods (#94)
- dp_table: Use `Neg` trait instead of `Scalar::ZERO - x` (#96)
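
For readers unfamiliar with the last refactor, the sketch below shows why `-x` can replace `Scalar::ZERO - x`. The `Scalar` here is a toy type over a small prime modulus, assumed purely for illustration; the crate's real scalar type is a different, much larger field element.

```rust
use std::ops::{Neg, Sub};

/// Toy modulus for the sketch (an assumption, not the curve order).
const P: u64 = 97;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Scalar(u64);

impl Scalar {
    pub const ZERO: Scalar = Scalar(0);

    pub fn new(v: u64) -> Self {
        Scalar(v % P)
    }
}

impl Sub for Scalar {
    type Output = Scalar;
    fn sub(self, rhs: Scalar) -> Scalar {
        // Add P first so the subtraction cannot underflow.
        Scalar((self.0 + P - rhs.0) % P)
    }
}

impl Neg for Scalar {
    type Output = Scalar;
    fn neg(self) -> Scalar {
        // The trailing `% P` maps the negation of zero back to zero.
        Scalar((P - self.0) % P)
    }
}
```

Both spellings give the same value, but `-x` states the intent directly and skips constructing a zero operand.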