v0.6.0

@oritwoen oritwoen released this 15 Feb 15:11
· 47 commits to main since this release

v0.6.0 is the next minor release.

👀 Highlights

⚡ 2× throughput. An addition chain in fe_inv cuts multiplications from 248 to 15, and a dedicated fe_square exploits symmetry to nearly halve mul32 ops (64 to 36). The 48-bit rate went from 8.84 M/s to 17.1 M/s on an RX 6800S.

⚡ Field Inversion Addition Chain

Replaced the naive square-and-multiply fe_inv with an addition chain based on Peter Dettman's work in libsecp256k1. Since fe_inv runs once per workgroup while the other 63 threads wait at a barrier, cutting 233 multiplications from this critical path reduces wall-clock time directly.

  • 94% fewer multiplications: 248 → 15, total ops 503 → 270 (#32)

| Metric | Before | After |
| --- | --- | --- |
| Multiplications | 248 | 15 |
| Total ops | 503 | 270 |
| 48-bit rate | 9.0 M/s | 17.1 M/s |
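
The operation counts above can be reproduced without any field arithmetic. The sketch below tracks only the exponent of `a`: a multiplication adds exponents and a squaring doubles one. The chain steps follow the libsecp256k1 field inversion as I recall it, so treat them as an assumption rather than a copy of this project's WGSL. `Counter`, `naive`, and `chain` are illustrative names, not project APIs.

```rust
/// Exponent of `a`, stored as four little-endian 64-bit limbs.
type Exp = [u64; 4];

/// secp256k1 field prime minus 2, i.e. 2^256 - 2^32 - 979.
const P_MINUS_2: Exp = [0xFFFF_FFFE_FFFF_FC2D, u64::MAX, u64::MAX, u64::MAX];

/// Counts the field operations a real fe_inv would perform.
#[derive(Default)]
struct Counter {
    muls: u32,
    sqrs: u32,
}

/// 256-bit addition; exponents here never exceed p - 2, so no overflow.
fn add(x: Exp, y: Exp) -> Exp {
    let mut out = [0u64; 4];
    let mut carry = 0u128;
    for i in 0..4 {
        let s = x[i] as u128 + y[i] as u128 + carry;
        out[i] = s as u64;
        carry = s >> 64;
    }
    out
}

impl Counter {
    /// Square `x` n times, then multiply by `y`: exponent becomes x * 2^n + y.
    fn step(&mut self, mut x: Exp, n: u32, y: Exp) -> Exp {
        for _ in 0..n {
            x = add(x, x);
            self.sqrs += 1;
        }
        self.muls += 1;
        add(x, y)
    }
}

/// Left-to-right square-and-multiply over the bits of p - 2.
fn naive(c: &mut Counter) -> Exp {
    let a: Exp = [1, 0, 0, 0];
    let mut r = a; // bit 255 is set, so start from a^1
    for bit in (0..255).rev() {
        r = add(r, r);
        c.sqrs += 1;
        if (P_MINUS_2[bit / 64] >> (bit % 64)) & 1 == 1 {
            r = add(r, a);
            c.muls += 1;
        }
    }
    r
}

/// Addition chain: xN denotes a^(2^N - 1), a run of N one-bits.
fn chain(c: &mut Counter) -> Exp {
    let a: Exp = [1, 0, 0, 0];
    let x2 = c.step(a, 1, a);
    let x3 = c.step(x2, 1, a);
    let x6 = c.step(x3, 3, x3);
    let x9 = c.step(x6, 3, x3);
    let x11 = c.step(x9, 2, x2);
    let x22 = c.step(x11, 11, x11);
    let x44 = c.step(x22, 22, x22);
    let x88 = c.step(x44, 44, x44);
    let x176 = c.step(x88, 88, x88);
    let x220 = c.step(x176, 44, x44);
    let x223 = c.step(x220, 3, x3);
    // Assemble the low 33 bits of p - 2: 0, 22 ones, then 0000101101.
    let t = c.step(x223, 23, x22);
    let t = c.step(t, 5, a);
    let t = c.step(t, 3, x2);
    c.step(t, 2, a)
}
```

Calling `naive` and `chain` with fresh counters should give the same exponent, p - 2, with 255 squarings each. Only the multiplication count differs (248 versus 15), which accounts for the 503 and 270 totals in the table.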

⚡ Schoolbook Squaring

Dedicated fe_square exploits the a[i]*a[j] == a[j]*a[i] symmetry: 36 mul32 instead of 64 (8 diagonal squares plus 28 cross products, each computed once and doubled). Uses individual variables to sidestep RADV array indexing issues.

  • 44% fewer mul32 ops (#30)
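
As a rough CPU-side illustration of the symmetry trick, the sketch below squares an eight-limb value two ways and counts limb multiplications. It assumes eight 32-bit limbs (implied by the 64 in the text) and uses wide host integers for the carries, which the WGSL shader cannot do. `mul_wide` and `square_wide` are hypothetical names, and arrays are used here for brevity where the shader uses individual variables.

```rust
/// 256-bit value as eight little-endian 32-bit limbs.
type Limbs = [u32; 8];

/// Generic schoolbook product: every a[i] meets every b[j], 64 multiplications.
fn mul_wide(a: &Limbs, b: &Limbs) -> ([u32; 16], u32) {
    let mut out = [0u32; 16];
    let mut muls = 0;
    for i in 0..8 {
        let mut carry = 0u64;
        for j in 0..8 {
            // Fits in u64: (2^32 - 1)^2 + 2 * (2^32 - 1) = 2^64 - 1.
            let t = a[i] as u64 * b[j] as u64 + out[i + j] as u64 + carry;
            muls += 1;
            out[i + j] = t as u32;
            carry = t >> 32;
        }
        out[i + 8] = carry as u32;
    }
    (out, muls)
}

/// Squaring: each cross product a[i]*a[j] with i < j is computed once and
/// doubled, and the 8 diagonal terms a[i]^2 are added: 28 + 8 = 36.
fn square_wide(a: &Limbs) -> ([u32; 16], u32) {
    let mut cols = [0u128; 16]; // per-column sums, wide enough to defer carries
    let mut muls = 0;
    for i in 0..8 {
        cols[2 * i] += a[i] as u128 * a[i] as u128;
        muls += 1;
        for j in (i + 1)..8 {
            cols[i + j] += 2 * (a[i] as u128 * a[j] as u128);
            muls += 1;
        }
    }
    // Propagate carries column by column into 32-bit limbs.
    let mut out = [0u32; 16];
    let mut carry = 0u128;
    for k in 0..16 {
        let t = cols[k] + carry;
        out[k] = t as u32;
        carry = t >> 32;
    }
    (out, muls)
}
```

For any input, `square_wide(&a)` should return the same 16 limbs as `mul_wide(&a, &a)`, with counts of 36 and 64 respectively.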

🏗️ Double-Buffer Architecture

A single GPU round-trip instead of a conditional second one. Dual DP slots alternate between dispatches, preparing for future async CPU/GPU overlap. Batch inversion switched from a Blelloch scan to a simpler tree-based approach (shared memory down from 6 KB to 4 KB).

  • Merged encoder: DP count + buffer copy in one submission (#33)
  • Tree-based batch inversion: simpler structure, fewer barriers (#31)
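
For readers unfamiliar with batch inversion, here is a minimal sequential sketch of one tree-based variant. It is not the project's WGSL kernel: it assumes a small stand-in prime (2^61 - 1) instead of the secp256k1 field, a power-of-two batch size, and hypothetical names (`mul`, `inv`, `batch_inv_tree`). On a GPU, each tree level would presumably be one parallel step between barriers.

```rust
/// Stand-in prime field modulus (a Mersenne prime), not the secp256k1 prime.
const P: u64 = (1 << 61) - 1;

fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % (P as u128)) as u64
}

/// Single inversion via Fermat's little theorem: a^(P - 2) mod P.
fn inv(a: u64) -> u64 {
    let (mut base, mut e, mut r) = (a, P - 2, 1u64);
    while e > 0 {
        if e & 1 == 1 {
            r = mul(r, base);
        }
        base = mul(base, base);
        e >>= 1;
    }
    r
}

/// Inverts every nonzero element of `xs` with one call to `inv`.
/// `xs.len()` must be a power of two.
fn batch_inv_tree(xs: &[u64]) -> Vec<u64> {
    let n = xs.len();
    assert!(n.is_power_of_two());

    // Up-sweep: heap-layout product tree, root at index 1, leaves at n..2n.
    let mut tree = vec![1u64; 2 * n];
    tree[n..].copy_from_slice(xs);
    for i in (1..n).rev() {
        tree[i] = mul(tree[2 * i], tree[2 * i + 1]);
    }

    // Down-sweep: a child's inverse is the parent's inverse times its sibling.
    let mut invs = vec![0u64; 2 * n];
    invs[1] = inv(tree[1]);
    for i in 1..n {
        invs[2 * i] = mul(invs[i], tree[2 * i + 1]);
        invs[2 * i + 1] = mul(invs[i], tree[2 * i]);
    }
    invs[n..].to_vec()
}
```

For a batch of n elements this costs one inversion plus 3(n - 1) multiplications across log2(n) levels in each direction. That is why the single fe_inv call sits on the critical path for the whole workgroup.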

📊 Auto-Save Benchmarks

--benchmark now saves results to BENCHMARKS.md automatically, with version tracking and improvement percentages.

  • Auto-save: no more manual copy-paste (#34)

✅ Upgrading

cargo install kangaroo

👉 Changelog

compare changes

⚡ Performance

  • gpu: Optimize fe_inv with addition chain, 2× speedup (#32)
  • gpu: Optimize fe_square with schoolbook squaring (#30)

🚀 Enhancements

  • benchmark: Auto-save results to BENCHMARKS.md (#34)

🩹 Fixes

  • gpu: Full DP mask evaluation in WGSL (#28)

💅 Refactors

  • gpu: Double-buffer DP slots and merge copy encoder (#33)
  • gpu: Replace Blelloch scan with tree-based batch inversion (#31)

📖 Documentation

  • benchmarks: RTX 5060 results (#26)

❤️ Contributors