Release v0.6.0 · oritwoen/kangaroo

v0.6.0 is the next minor release.

👀 Highlights

⚡ 2× throughput. Optimized fe_inv with an addition chain cut multiplications from 248 to 15, and dedicated fe_square exploits symmetry to halve mul32 ops. 48-bit rate went from 8.84 M/s to 17.1 M/s on RX 6800S.

⚡ Field Inversion Addition Chain

Replaced naive square-and-multiply fe_inv with an addition chain based on Peter Dettman's work in libsecp256k1. Since fe_inv runs once per workgroup while 63 threads wait at a barrier, cutting 233 multiplications from this critical path hits wall-clock time directly.

94% fewer multiplications — 248 → 15, total ops 503 → 270 (#32)

Metric	Before	After
Multiplications	248	15
Total ops	503	270
48-bit rate	9.0 M/s	17.1 M/s
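
The counts above can be reproduced on the CPU. The following is a minimal sketch, not the project's WGSL shader: it assumes the field is the secp256k1 prime field and follows the chain structure used for field inversion in libsecp256k1 as best recalled. The names `OpCounter`, `inv_naive`, and `inv_chain` are hypothetical and exist only for this illustration.

```python
# Assumption: secp256k1 field prime; inversion is a^(P-2) mod P (Fermat).
P = 2**256 - 2**32 - 977


class OpCounter:
    """Counts field squarings and multiplications (hypothetical helper)."""

    def __init__(self):
        self.sqr = 0
        self.mul = 0

    def square(self, x, times=1):
        for _ in range(times):
            x = x * x % P
            self.sqr += 1
        return x

    def multiply(self, x, y):
        self.mul += 1
        return x * y % P


def inv_naive(a, ops):
    """Left-to-right square-and-multiply over the bits of P - 2."""
    bits = bin(P - 2)[2:]
    r = a  # accounts for the leading 1 bit
    for bit in bits[1:]:
        r = ops.square(r)
        if bit == "1":
            r = ops.multiply(r, a)
    return r


def inv_chain(a, ops):
    """Addition chain: xN denotes a^(2^N - 1), built from shorter runs of 1s."""
    x2 = ops.multiply(ops.square(a), a)
    x3 = ops.multiply(ops.square(x2), a)
    x6 = ops.multiply(ops.square(x3, 3), x3)
    x9 = ops.multiply(ops.square(x6, 3), x3)
    x11 = ops.multiply(ops.square(x9, 2), x2)
    x22 = ops.multiply(ops.square(x11, 11), x11)
    x44 = ops.multiply(ops.square(x22, 22), x22)
    x88 = ops.multiply(ops.square(x44, 44), x44)
    x176 = ops.multiply(ops.square(x88, 88), x88)
    x220 = ops.multiply(ops.square(x176, 44), x44)
    x223 = ops.multiply(ops.square(x220, 3), x3)
    # Assemble P - 2: 223 ones, a zero, 22 ones, then the bits 0000101101.
    t = ops.multiply(ops.square(x223, 23), x22)
    t = ops.multiply(ops.square(t, 5), a)
    t = ops.multiply(ops.square(t, 3), x2)
    return ops.multiply(ops.square(t, 2), a)


if __name__ == "__main__":
    a = 0x123456789ABCDEF0123456789ABCDEF
    naive, chain = OpCounter(), OpCounter()
    assert inv_naive(a, naive) == inv_chain(a, chain)
    print("naive:", naive.mul, "mul +", naive.sqr, "sqr")
    print("chain:", chain.mul, "mul +", chain.sqr, "sqr")
```

Both variants perform the same 255 squarings; the saving comes entirely from reusing the long runs of 1 bits in P - 2 instead of multiplying once per set bit.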

⚡ Schoolbook Squaring

Dedicated fe_square exploits a[i]*a[j] == a[j]*a[i] symmetry — 36 mul32 instead of 64. Uses individual variables to sidestep RADV array indexing issues.

44% fewer mul32 ops (#30)
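
The 36-versus-64 figure follows from counting limb products. Here is a minimal sketch, assuming 8 limbs of 32 bits each and leaving out modular reduction; the function names are hypothetical, and it uses a list where the shader reportedly uses individual variables.

```python
MASK32 = 0xFFFFFFFF
LIMBS = 8  # assumption: a 256-bit field element stored as 8 x 32-bit limbs


def to_limbs(x):
    """Split an integer into little-endian 32-bit limbs."""
    return [(x >> (32 * i)) & MASK32 for i in range(LIMBS)]


def from_columns(cols):
    """Recombine column sums; Python integers absorb the carries."""
    return sum(c << (32 * k) for k, c in enumerate(cols))


def mul_schoolbook(a, b):
    """Generic schoolbook product: every a[i] * b[j] is computed."""
    cols = [0] * (2 * LIMBS)
    mul32 = 0
    for i in range(LIMBS):
        for j in range(LIMBS):
            cols[i + j] += a[i] * b[j]
            mul32 += 1
    return from_columns(cols), mul32


def square_symmetric(a):
    """Squaring: compute each cross product a[i] * a[j] (i < j) once, then double it."""
    cols = [0] * (2 * LIMBS)
    mul32 = 0
    for i in range(LIMBS):
        cols[2 * i] += a[i] * a[i]  # 8 diagonal terms
        mul32 += 1
        for j in range(i + 1, LIMBS):
            cols[i + j] += 2 * (a[i] * a[j])  # 28 cross terms
            mul32 += 1
    return from_columns(cols), mul32


if __name__ == "__main__":
    x = 0xDEADBEEF_01234567_89ABCDEF_FEDCBA98_76543210_0F1E2D3C_4B5A6978_8796A5B4
    limbs = to_limbs(x)
    print(mul_schoolbook(limbs, limbs)[1], square_symmetric(limbs)[1])
```

The 8 diagonal terms plus 28 doubled cross terms give 36 products, which is where the 44% reduction (28 of 64 removed) comes from.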

🏗️ Double-Buffer Architecture

Each dispatch now needs a single GPU round-trip instead of a conditional second one. Two DP slots alternate between dispatches, preparing for future async CPU/GPU overlap. Batch inversion switched from a Blelloch scan to a simpler tree-based approach, which cut shared memory from 6 KB to 4 KB.

Merged encoder — DP count + buffer copy in one submission (#33)
Tree-based batch inversion — simpler structure, fewer barriers (#31)
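
For readers unfamiliar with tree-based batch inversion, the idea can be sketched on the CPU as follows. This is an illustration under assumptions, not the shader's code: it assumes the secp256k1 field prime, a power-of-two batch of nonzero values, and a flat array layout chosen for clarity. The actual shared-memory layout and barrier placement in the WGSL kernel are not described in these notes, and `batch_invert_tree` is a hypothetical name.

```python
# Assumption: secp256k1 field prime.
P = 2**256 - 2**32 - 977


def batch_invert_tree(values):
    """Invert every value mod P using one field inversion and a product tree."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two batch"

    # Up-sweep: tree[k] holds the product of the leaves below node k.
    tree = [1] * (2 * n)
    tree[n:] = [v % P for v in values]
    for k in range(n - 1, 0, -1):
        tree[k] = tree[2 * k] * tree[2 * k + 1] % P

    # The single expensive step: invert the product of all values.
    inv = [0] * (2 * n)
    inv[1] = pow(tree[1], P - 2, P)

    # Down-sweep: a child's inverse is the parent's inverse times its sibling.
    for k in range(1, n):
        inv[2 * k] = inv[k] * tree[2 * k + 1] % P
        inv[2 * k + 1] = inv[k] * tree[2 * k] % P
    return inv[n:]


if __name__ == "__main__":
    xs = [3, 5, 7, 11, 13, 17, 19, 23]
    for x, x_inv in zip(xs, batch_invert_tree(xs)):
        assert x * x_inv % P == 1
```

On a GPU, each tree level would map to one parallel step separated by a barrier, with the root inversion done by a single thread; that mapping is an inference from the notes above rather than something they spell out.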

📊 Auto-Save Benchmarks

--benchmark now saves results to BENCHMARKS.md automatically, with version tracking and improvement percentages.

Auto-save — no more manual copy-paste (#34)

✅ Upgrading

cargo install kangaroo

👉 Changelog

compare changes

⚡ Performance

gpu: Optimize fe_inv with addition chain — 2× speedup (#32)
gpu: Optimize fe_square with schoolbook squaring (#30)

🚀 Enhancements

benchmark: Auto-save results to BENCHMARKS.md (#34)

🩹 Fixes

gpu: Full DP mask evaluation in WGSL (#28)

💅 Refactors

gpu: Double-buffer DP slots and merge copy encoder (#33)
gpu: Replace Blelloch scan with tree-based batch inversion (#31)

📖 Documentation

benchmarks: RTX 5060 results (#26)
