v0.6.0
v0.6.0 is the next minor release.
Highlights
2× throughput. An addition chain in fe_inv cut multiplications from 248 to 15, and a dedicated fe_square exploits symmetry to cut mul32 ops from 64 to 36. The 48-bit rate went from 8.84 M/s to 17.1 M/s on an RX 6800S.
Field Inversion Addition Chain
Replaced naive square-and-multiply fe_inv with an addition chain based on Peter Dettman's work in libsecp256k1. Since fe_inv runs once per workgroup while 63 threads wait at a barrier, cutting 233 multiplications from this critical path hits wall-clock time directly.
- 94% fewer multiplications: 248 → 15, total ops 503 → 270 (#32)
| Metric | Before | After |
|---|---|---|
| Multiplications | 248 | 15 |
| Total ops | 503 | 270 |
| 48-bit rate | 9.0 M/s | 17.1 M/s |
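
The operation counts above can be reproduced on the host with a small Python sketch. It assumes fe_inv computes a^(p-2) mod p over the secp256k1 field prime (Fermat inversion). The chain follows the shape of the addition chain in older libsecp256k1 field code as best recalled here, so the project's WGSL version may differ in detail. `OpCounter`, `inv_naive`, and `inv_chain` are illustrative names, not identifiers from the project.

```python
# secp256k1 field prime (well-known constant, not taken from these notes).
P = 2**256 - 2**32 - 977


class OpCounter:
    """Hypothetical helper: modular square/multiply that counts operations."""

    def __init__(self):
        self.sqr = 0
        self.mul = 0

    def square(self, x, times=1):
        for _ in range(times):
            x = x * x % P
            self.sqr += 1
        return x

    def multiply(self, x, y):
        self.mul += 1
        return x * y % P


def inv_naive(a, ops):
    """Left-to-right square-and-multiply for a^(P-2)."""
    bits = bin(P - 2)[2:]
    result = a  # the leading 1 bit costs nothing
    for bit in bits[1:]:
        result = ops.square(result)
        if bit == "1":
            result = ops.multiply(result, a)
    return result


def inv_chain(a, ops):
    """Addition chain for a^(P-2); xN denotes a^(2^N - 1)."""
    x2 = ops.multiply(ops.square(a), a)
    x3 = ops.multiply(ops.square(x2), a)
    x6 = ops.multiply(ops.square(x3, 3), x3)
    x9 = ops.multiply(ops.square(x6, 3), x3)
    x11 = ops.multiply(ops.square(x9, 2), x2)
    x22 = ops.multiply(ops.square(x11, 11), x11)
    x44 = ops.multiply(ops.square(x22, 22), x22)
    x88 = ops.multiply(ops.square(x44, 44), x44)
    x176 = ops.multiply(ops.square(x88, 88), x88)
    x220 = ops.multiply(ops.square(x176, 44), x44)
    x223 = ops.multiply(ops.square(x220, 3), x3)
    # Tail of the exponent: the low bits of P-2 are not all ones.
    t = ops.multiply(ops.square(x223, 23), x22)
    t = ops.multiply(ops.square(t, 5), a)
    t = ops.multiply(ops.square(t, 3), x2)
    t = ops.multiply(ops.square(t, 2), a)
    return t


naive_ops, chain_ops = OpCounter(), OpCounter()
a = 0x1234567890ABCDEF
assert inv_naive(a, naive_ops) == inv_chain(a, chain_ops)
print(naive_ops.mul, naive_ops.sqr)  # multiplications, squarings (naive)
print(chain_ops.mul, chain_ops.sqr)  # multiplications, squarings (chain)
```

Both variables need the same 255 squarings. The saving comes entirely from reusing intermediate powers such as x22 and x44 instead of multiplying by a once per set bit of the exponent.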
Schoolbook Squaring
A dedicated fe_square exploits the a[i]*a[j] == a[j]*a[i] symmetry: 36 mul32 instead of 64. It uses individual variables rather than arrays to sidestep RADV array indexing issues.
- 44% fewer mul32 ops (#30)
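
The 64 vs 36 count can be illustrated with a short Python sketch. It assumes a field element is held as 8 little-endian 32-bit limbs, which matches 8 × 8 = 64 products for a general multiply. It computes only the unreduced 512-bit square and leaves out modular reduction. All function names are hypothetical, and the loops stand in for what the shader writes out with individual variables.

```python
MASK32 = 0xFFFFFFFF
LIMBS = 8  # assumed layout: 8 x 32-bit limbs, little-endian


def to_limbs(x):
    return [(x >> (32 * i)) & MASK32 for i in range(LIMBS)]


def from_limbs(limbs):
    return sum(limb << (32 * i) for i, limb in enumerate(limbs))


def propagate(columns):
    """Carry-propagate column sums into 32-bit limbs."""
    out, carry = [], 0
    for col in columns:
        col += carry
        out.append(col & MASK32)
        carry = col >> 32
    return out


def mul_limbs(a, b):
    """General schoolbook product: every a[i]*b[j] is computed."""
    cols, muls = [0] * (2 * LIMBS), 0
    for i in range(LIMBS):
        for j in range(LIMBS):
            cols[i + j] += a[i] * b[j]
            muls += 1
    return propagate(cols), muls


def square_limbs(a):
    """Squaring: compute each cross term a[i]*a[j] (i < j) once and double it."""
    cols, muls = [0] * (2 * LIMBS), 0
    for i in range(LIMBS):
        cols[2 * i] += a[i] * a[i]  # diagonal term
        muls += 1
        for j in range(i + 1, LIMBS):
            cols[i + j] += 2 * (a[i] * a[j])  # symmetric pair
            muls += 1
    return propagate(cols), muls


x = 2**256 - 2**32 - 978  # arbitrary 256-bit test value
sq, sq_muls = square_limbs(to_limbs(x))
_, mul_muls = mul_limbs(to_limbs(x), to_limbs(x))
assert from_limbs(sq) == x * x
print(mul_muls, sq_muls)  # limb products: general multiply vs dedicated square
```

With 8 limbs there are 8 diagonal terms and 28 distinct cross terms, which is where the 36 comes from.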
Double-Buffer Architecture
Each dispatch now needs a single GPU round-trip instead of a conditional second one. Dual DP slots alternate between dispatches, preparing for future async CPU/GPU overlap. Batch inversion switched from a Blelloch scan to a simpler tree-based approach (shared memory down from 6 KB to 4 KB).
- Merged encoder: DP count + buffer copy in one submission (#33)
- Tree-based batch inversion: simpler structure, fewer barriers (#31)
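
As a rough illustration of tree-based batch inversion, the Python sketch below builds a product tree, inverts only the root, and walks back down. This is one common formulation of Montgomery's trick and is an assumption about the general technique, not a transcription of the shader. `batch_invert_tree` is a hypothetical name, Python's built-in `pow` stands in for fe_inv, and the 64-element input mirrors a 64-thread workgroup.

```python
# secp256k1 field prime (well-known constant, not taken from these notes).
P = 2**256 - 2**32 - 977


def batch_invert_tree(values, p=P):
    """Invert every value mod p using a single modular inversion.

    Assumes len(values) is a power of two and no value is 0 mod p.
    """
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "sketch expects a power-of-two length"

    # Up-sweep: levels[0] holds the inputs, levels[-1] the product of all.
    levels = [list(values)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] * prev[i + 1] % p for i in range(0, len(prev), 2)])

    # The only inversion: the root of the product tree.
    inverses = [pow(levels[-1][0], p - 2, p)]

    # Down-sweep: inv(left) = inv(left*right) * right, and symmetrically.
    for level in reversed(levels[:-1]):
        nxt = [0] * len(level)
        for k, parent_inv in enumerate(inverses):
            left, right = level[2 * k], level[2 * k + 1]
            nxt[2 * k] = parent_inv * right % p
            nxt[2 * k + 1] = parent_inv * left % p
        inverses = nxt
    return inverses


values = [i + 2 for i in range(64)]
inverses = batch_invert_tree(values)
assert all(v * inv % P == 1 for v, inv in zip(values, inverses))
```

On a GPU, each tree level maps naturally to one step between barriers, and only the thread holding the root runs the inversion while the others wait.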
Auto-Save Benchmarks
--benchmark now saves results to BENCHMARKS.md automatically, with version tracking and improvement percentages.
- Auto-save: no more manual copy-paste (#34)
Upgrading
cargo install kangaroo
Changelog
Performance
- gpu: Optimize fe_inv with addition chain, 2× speedup (#32)
- gpu: Optimize fe_square with schoolbook squaring (#30)
Enhancements
- benchmark: Auto-save results to BENCHMARKS.md (#34)
Fixes
- gpu: Full DP mask evaluation in WGSL (#28)
Refactors
- gpu: Double-buffer DP slots and merge copy encoder (#33)
- gpu: Replace Blelloch scan with tree-based batch inversion (#31)
Documentation
- benchmarks: RTX 5060 results (#26)