v0.8.0

@oritwoen oritwoen released this 01 Apr 14:52

v0.8.0 brings multi-GPU support, a full performance sweep across the solver pipeline, and a lot less dead code.

👀 Highlights

🎮 Multi-GPU solving: kangaroo can now dispatch work across multiple GPUs simultaneously (#69). Been the most requested feature for a while - if you have more than one card, the solver finally uses them all.

⚡ GPU pipeline got significantly faster. Compute pipelines are cached across repeated solves now (#100) so you don't pay compilation cost every time. Lock scope during pipeline compilation was way too broad - narrowed it down (#101). Dispatch and DP readback run pipelined instead of sequential (#95). Together these cut GPU overhead substantially for repeated or batched solves.
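To make the caching and lock-scope changes concrete, here is a minimal sketch of the general pattern, not the crate's actual code. `Pipeline`, `PipelineCache`, `get_or_compile` and `compile_count` are hypothetical names, and the "compile" step is a stand-in. The point is that the mutex is held only for the lookup and for publishing the result, never while compiling.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Stand-in for a compiled GPU compute pipeline (hypothetical type).
#[derive(Debug)]
pub struct Pipeline {
    pub source: String,
}

/// Caches compiled pipelines by key so repeated solves can reuse them.
#[derive(Default)]
pub struct PipelineCache {
    entries: Mutex<HashMap<String, Arc<Pipeline>>>,
    compiles: AtomicUsize,
}

impl PipelineCache {
    /// Stand-in for the expensive compilation step.
    fn compile(&self, source: &str) -> Pipeline {
        self.compiles.fetch_add(1, Ordering::Relaxed);
        Pipeline { source: source.to_owned() }
    }

    pub fn get_or_compile(&self, key: &str, source: &str) -> Arc<Pipeline> {
        // Fast path: the guard is a temporary, so the lock is released
        // at the end of this statement.
        let cached = self.entries.lock().unwrap().get(key).cloned();
        if let Some(hit) = cached {
            return hit;
        }
        // Slow path: compile with the lock released.
        let fresh = Arc::new(self.compile(source));
        // Re-lock only to publish. If another thread inserted first,
        // keep its entry and drop ours.
        let mut entries = self.entries.lock().unwrap();
        Arc::clone(entries.entry(key.to_owned()).or_insert(fresh))
    }

    /// Number of times the compile step actually ran.
    pub fn compile_count(&self) -> usize {
        self.compiles.load(Ordering::Relaxed)
    }
}
```

The trade-off in this sketch is that two threads missing the cache at the same time may both compile, and one result is thrown away. That is usually cheaper than making every other caller wait behind a lock held for the whole compilation.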

CPU hot paths got attention too. Post-jump affine was recomputing field inversions that don't change between walks - cached (#97). Hot loop for x-coordinate extraction was hitting the allocator on every iteration - gone (#91).
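As a rough illustration of the allocator fix, the sketch below contrasts an allocating x-coordinate accessor with one that returns a fixed-size array. `AffinePoint`, `x_bytes_alloc`, `x_bytes` and `count_low_byte_zero` are made-up names for this example, and the 32-byte big-endian layout is an assumption, not the crate's real representation.

```rust
/// Hypothetical affine point with 32-byte big-endian coordinates.
#[derive(Clone, Copy)]
pub struct AffinePoint {
    pub x: [u8; 32],
    pub y: [u8; 32],
}

/// Allocating version: builds a fresh Vec on every call.
pub fn x_bytes_alloc(p: &AffinePoint) -> Vec<u8> {
    p.x.to_vec()
}

/// Allocation-free version: the array is copied on the stack.
pub fn x_bytes(p: &AffinePoint) -> [u8; 32] {
    p.x
}

/// Example hot loop: counts points whose x-coordinate ends in a zero byte.
pub fn count_low_byte_zero(points: &[AffinePoint]) -> usize {
    let mut hits = 0;
    for p in points {
        let x = x_bytes(p); // no allocator call inside the loop
        if x[31] == 0 {
            hits += 1;
        }
    }
    hits
}
```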

Solver startup for small ranges used to do full initialization even when the search space was tiny. Cut that overhead (#99). StoredDP distances are fixed-size arrays now instead of heap-allocated (#88), and walk + DP readback logic is tighter (#87).

🩹 GPU poll waits could hang indefinitely under certain device conditions - bounded now (#85). DP counter wasn't resetting between calibration probes, which gave wrong calibration numbers on repeated runs (#83). Provider bounds rejected exact-fit ranges that should've been valid (#76). Oversized U256 hex input panicked instead of returning error (#75).
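For the U256 hex fix, the idea is to check the input length up front and return an error rather than indexing out of bounds. The sketch below shows one way that could look. `parse_u256_hex` and `HexError` are hypothetical names, and details such as accepting a `0x` prefix are assumptions made for the example.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum HexError {
    Empty,
    TooLong { digits: usize },
    InvalidDigit(char),
}

/// Parses a big-endian hex string (optional `0x` prefix) into 32 bytes.
/// Inputs with more than 64 hex digits yield an error instead of a panic.
pub fn parse_u256_hex(input: &str) -> Result<[u8; 32], HexError> {
    let digits = input.strip_prefix("0x").unwrap_or(input);
    if digits.is_empty() {
        return Err(HexError::Empty);
    }
    if digits.len() > 64 {
        return Err(HexError::TooLong { digits: digits.len() });
    }
    let mut out = [0u8; 32];
    // Walk from the least significant digit so short inputs are
    // left-padded with zeros.
    for (i, c) in digits.chars().rev().enumerate() {
        let nibble = c.to_digit(16).ok_or(HexError::InvalidDigit(c))? as u8;
        let byte = 31 - i / 2;
        out[byte] |= if i % 2 == 0 { nibble } else { nibble << 4 };
    }
    Ok(out)
}
```

Because the length check runs before the loop, the index `31 - i / 2` can never underflow, which is the kind of guard that turns a panic into a recoverable error.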

💅 Swept dead code across the whole crate - dropped unused SharedResources, dead constructors, is_provider predicate, LE arithmetic helpers, DP mask helpers, stale dead_code allows, and unused dashmap/thiserror/futures deps (#84-#94). Net result: cleaner dependency tree and less surface area.

✅ Upgrading

cargo install kangaroo

👉 Changelog

🚀 Features

  • gpu: Allow multiple devices (#69)

⚡ Performance

  • gpu: Cache compute pipelines across repeated solves (#100)
  • gpu: Narrow pipeline cache lock scope during compilation (#101)
  • solver: Pipeline GPU dispatch and DP readback (#95)
  • solver: Cut startup overhead for small-range solves (#99)
  • solver: Tighten walk and DP readback (#87)
  • cpu: Cache post-jump affine to skip redundant field inversions (#97)
  • cpu: Avoid heap alloc in hot loop x-coordinate extraction (#91)
  • dp_table: Inline StoredDP dist as fixed-size array (#88)

🩹 Fixes

  • solver: Bound GPU poll waits to prevent indefinite hangs (#85)
  • solver: Reset DP counter between calibration probes (#83)
  • provider: Allow exact-fit search ranges for provider bounds (#76)
  • crypto: Avoid panic on oversized U256 hex input (#75)
  • cli: Make benchmark dispatch explicit (#77)
  • ci: Gate crates publish with release checks (#82)
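The bounded poll wait (#85) follows a common pattern: poll against a deadline and give up with an error when it passes. Here is a generic sketch of that pattern using only the standard library. `poll_until` and `WaitTimedOut` are hypothetical names, and the closure stands in for whatever device check the solver actually performs.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Returned when the condition did not become true before the deadline.
#[derive(Debug, PartialEq, Eq)]
pub struct WaitTimedOut;

/// Polls `is_done` every `step` until it returns true, but never waits
/// much longer than `max_wait` in total.
pub fn poll_until(
    mut is_done: impl FnMut() -> bool,
    max_wait: Duration,
    step: Duration,
) -> Result<(), WaitTimedOut> {
    let deadline = Instant::now() + max_wait;
    loop {
        if is_done() {
            return Ok(());
        }
        if Instant::now() >= deadline {
            // Surface the stall to the caller instead of hanging forever.
            return Err(WaitTimedOut);
        }
        thread::sleep(step);
    }
}
```

The caller can then decide whether a timeout means retrying, falling back to another device, or aborting the solve.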

💅 Refactors

  • deps: Drop unused dashmap, thiserror, and futures (#84)
  • solver: Drop unused SharedResources and dead constructors (#89)
  • crypto: Drop dead utils, remove stale dead_code allows (#90)
  • math: Drop dead LE arithmetic and DP mask helpers (#92)
  • provider: Drop unused is_provider predicate (#93)
  • dp_table: Drop dead len and is_empty methods (#94)
  • dp_table: Use Neg trait instead of Scalar::ZERO - x (#96)
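The last refactor (#96) swaps `Scalar::ZERO - x` for unary negation. To show why the two are interchangeable, here is a toy sketch. The `Scalar` below is a made-up type over a small prime modulus `P`, not the crate's real curve scalar, and exists only so the example is self-contained.

```rust
use std::ops::{Neg, Sub};

/// Small prime modulus used only for this illustration.
pub const P: u64 = 1_000_000_007;

/// Toy scalar modulo `P`, standing in for a real curve scalar type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Scalar(u64);

impl Scalar {
    pub const ZERO: Scalar = Scalar(0);

    pub fn new(v: u64) -> Self {
        Scalar(v % P)
    }
}

impl Sub for Scalar {
    type Output = Scalar;
    fn sub(self, rhs: Scalar) -> Scalar {
        // Add P first so the subtraction cannot underflow.
        Scalar((self.0 + P - rhs.0) % P)
    }
}

impl Neg for Scalar {
    type Output = Scalar;
    fn neg(self) -> Scalar {
        // The final `% P` maps -0 back to 0.
        Scalar((P - self.0) % P)
    }
}
```

With `Neg` implemented, `-x` reads as what it is and gives the same value as `Scalar::ZERO - x`.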

❤️ Contributors