Skip to content

feat(v2.1-rv64): switch to u16 limbs in keccak/sha2 airs#2802

Open
shuklaayush wants to merge 1 commit into
feat/memory-bus-u16-pr1from
feat/memory-bus-u16-pr3
Open

feat(v2.1-rv64): switch to u16 limbs in keccak/sha2 airs#2802
shuklaayush wants to merge 1 commit into
feat/memory-bus-u16-pr1from
feat/memory-bus-u16-pr3

Conversation

@shuklaayush
Copy link
Copy Markdown
Collaborator

@shuklaayush shuklaayush commented May 21, 2026

PR 3 of the memory-bus-u16 split, stacked on #2794 (disjoint from PR 2). Shrinks SHA-2 and Keccak chip trace columns that are only consumed as packed u16 cells through the memory bus.

Why

#2794 still stores SHA-2 main-chip state / message bytes and Keccak state / pointer columns byte-shaped, packing them at memory-bus call sites via pack_u8_block. Because these columns are never inspected byte-by-byte by the AIR (the constraints operate on packed u16 values, or on byte-XOR buffers that intentionally stay byte-shaped), they can be stored u16-shaped directly — halving the affected trace columns and removing one pack_u8_block per row.

This is the second of the four cell-narrowing PRs (after PR 2 for RV64 control-flow) and is disjoint from it: the only files in this PR live under extensions/sha2, extensions/keccak256, and crates/circuits/sha2-air, plus one helper addition in extensions/riscv/circuit/src/adapters/mod.rs.

What changes

SHA-2 section

Files: extensions/sha2/circuit/src/sha2_chips/{main_chip,block_hasher_chip}/, crates/circuits/sha2-air/src/{air,columns,trace}.rs, extensions/sha2/circuit/cuda/.

Main chip:

  • prev_state, new_state are u16 cells.
  • message_bytes reshapes to message_u16s (byte pairs packed at the call site).
  • Pointer limbs dst_ptr_limbs, state_ptr_limbs, input_ptr_limbs reshape to u16 where the AIR only consumes them as packed pointer cells; the high u16 cell gets the scaled 16-bit range check.

Block hasher:

  • prev_hash, final_hash are u16 cells.

sha2_bus:

  • MESSAGE_1 / MESSAGE_2 payloads are packed to u16 on both sender and receiver.

Stays byte-shaped (8-bit bounded by construction):

  • carry_a, carry_e, carry_or_buffer in sha2-air.

Keccak section

Files: extensions/keccak256/circuit/src/{keccakf_op,xorin}/, extensions/keccak256/circuit/cuda/.

keccakf_op:

  • preimage, postimage reshape to u16 cells.
  • buffer_ptr_limbs reshape to u16; pointer composition uses the scaled 16-bit bound.
  • keccakf_state_bus payload between keccakf_op and keccakf_perm is packed to u16.

xorin:

  • Pointer limbs u16-shaped (same treatment as keccakf_op).
  • Byte-XOR buffers (preimage_buffer_bytes, input_bytes, postimage_buffer_bytes) stay byte-shaped — those columns still carry byte-level XOR data and the AIR inspects them directly.

CUDA tracegen mirrors the new shapes; keccakf_op drops the bitwise_lookup parameter from its kernel and ABI signatures now that pointer bytes no longer need byte-level checks. The call chip still wires bitwise_lu.

Adapter helper (cross-cut)

File: extensions/riscv/circuit/src/adapters/mod.rs.

  • Add expand_to_rv64_block: zero-pads N u16 limbs to one RV64 bus block (BLOCK_FE_WIDTH cells). Used by keccakf_op and xorin for pointer reads against u16-celled RV64_REGISTER_AS.
  • BLOCK_FE_WIDTH imported into the adapters module to support the new helper.

The existing expand_to_rv64_register (8 u8 limbs, byte-shaped) is kept for byte-shaped chips and the legacy paths.

Migration notes

  • Chips passing SHA-2 main-chip state through the bus should hand the bus u16 cells directly (no pack_u8_block).
  • Pointer reads against u16-celled RV64_REGISTER_AS from chips storing only the low 32 bits of the register should use expand_to_rv64_block (BLOCK_FE_WIDTH cells, zero-padded) instead of expand_to_rv64_register (8 u8s).
  • Keccak pointer composition should use the scaled 16-bit bound on the high u16 cell; the previous byte-pair range check on bitwise_lookup_bus is removed for keccakf_op.

resolves int-7832, int-7833

@github-actions

This comment was marked as outdated.

@github-actions

This comment has been minimized.

@shuklaayush shuklaayush force-pushed the feat/memory-bus-u16-pr3 branch from d5e1353 to db27775 Compare May 21, 2026 15:34
@github-actions

This comment was marked as outdated.

@github-actions

This comment has been minimized.

@shuklaayush shuklaayush requested review from GunaDD and jonathanpwang and removed request for jonathanpwang May 21, 2026 15:46
@shuklaayush shuklaayush marked this pull request as ready for review May 21, 2026 15:47
@github-actions

This comment was marked as outdated.

@shuklaayush shuklaayush force-pushed the feat/memory-bus-u16-pr3 branch from db27775 to 8b0a2d8 Compare May 21, 2026 15:51
@github-actions

This comment was marked as outdated.

@github-actions

This comment has been minimized.

@shuklaayush shuklaayush force-pushed the feat/memory-bus-u16-pr3 branch from 8b0a2d8 to 7acbea7 Compare May 21, 2026 16:03
@github-actions

This comment was marked as outdated.

@github-actions

This comment has been minimized.

…lumns

PR 3 of the memory-bus-u16 split (stacked on PR 1, disjoint from PR 2).
Shrinks SHA-2 and Keccak chip trace columns that are only consumed as
packed u16 cells through the memory bus.

SHA-2:
- Main chip: prev_state / new_state are u16 cells; message_bytes becomes
  message_u16s; dst/state/input pointer limbs are u16-shaped.
- Block hasher: prev_hash / final_hash are u16 cells.
- sha2_bus MESSAGE_1 / MESSAGE_2 payloads packed to u16 on sender and
  receiver.
- CUDA tracegen + tests updated.
- Carry columns (carry_a, carry_e, carry_or_buffer in sha2-air) stay
  8-bit bounded — byte carries are still constrained as bytes.

Keccak:
- keccakf_op: preimage / postimage / buffer_ptr_limbs migrated to u16
  cells; pointer composition uses the scaled 16-bit bound.
- keccakf_state_bus payload between keccakf_op and keccakf_perm packed
  to u16.
- xorin: pointer limbs u16-shaped; the byte-XOR buffers
  (preimage_buffer_bytes, input_bytes, postimage_buffer_bytes) stay
  byte-shaped — those columns still carry byte-level XOR data.
- CUDA tracegen + tests updated; bitwise-lookup wiring removed from
  keccakf_op now that pointer bytes no longer need byte-level checks.

Adapters helper:
- Add expand_to_rv64_block in riscv/circuit::adapters: zero-pads `N`
  u16 limbs to one RV64 register bus block (BLOCK_FE_WIDTH cells).
  Used by keccakf_op and xorin for pointer reads against u16-celled
  RV64_REGISTER_AS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@shuklaayush shuklaayush force-pushed the feat/memory-bus-u16-pr3 branch from 7acbea7 to 337518b Compare May 21, 2026 16:27
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 21, 2026

Claude finished @shuklaayush's task in 2m 7s —— View job


Code Review: PR #2802 — u16 limbs in keccak/sha2 airs

  • Gather context and read PR diff
  • Review SHA-2 changes (main chip, block hasher, sha2-air)
  • Review Keccak changes (keccakf_op, xorin)
  • Review adapter helper (expand_to_rv64_block)
  • Review CUDA tracegen changes
  • Post final review

This is a clean, well-scoped PR. The mechanical transformation from byte-shaped to u16-shaped columns is consistent across all affected chips and their CUDA counterparts. The constraint logic is simplified throughout by removing the pack_u8_block indirection and composing u16 cells directly. Overall this looks good — the changes are correct and match the stated goals.

Positive observations

  1. Write offset fix (latent bug): The old eval_writes in sha2/main_chip/air.rs used SHA2_READ_SIZE for the write memory stride and STATE_READS for the loop bound. This PR correctly switches to SHA2_WRITE_SIZE and STATE_WRITES. Since both are currently 8, this was harmless, but it's a good fix.

  2. Simplified AIR constraints: Removing pack_u8_block calls and operating directly on u16 cells produces cleaner, easier-to-audit constraint code. The composecompose_rv64_u16_limbs helper avoids scattered bit-shift arithmetic.

  3. Consistent pointer handling: The scale_rv64_ptr_high_u16_expr / scale_rv64_ptr_high_u16_from_ptr abstraction is well-designed and used consistently across SHA-2 main chip, keccakf_op, and xorin. The switch from batched bitwise_lookup_bus.send_range (which had the awkward "must have even length" padding) to individual range_bus.range_check calls is cleaner.

  4. Good test coverage update: Tests are updated to wire the range_checker_chip alongside the existing bitwise_lookup_chip, including both CPU and CUDA test harnesses.

  5. CUDA / Rust parity: The CUDA kernels mirror the Rust changes precisely. The u32_to_le_u16_limbs and bytes_to_le_u16_limbs CUDA utilities are clean and match their Rust counterparts.

Minor issues / suggestions

  1. debug_assert upgraded to assert in online.rs (crates/vm/src/system/memory/online.rs:212): This changes a debug-only check to a release check. This is fine for safety but worth noting — if this was intentional for catching misuse of get_slice vs get_u8_slice with the new u16-celled memory, it's a reasonable change. Just confirming it's intended since it's a behavioral change in release builds.

  2. Block hasher message_u16 byte ordering (extensions/sha2/circuit/src/sha2_chips/block_hasher_chip/air.rs:60-68): The message_byte helper reverses byte order (C::WORD_U8S - i % C::WORD_U8S - 1) and then message_u16 packs byte[2k] + byte[2k+1] << 8 in LE order. This matches the main chip's set_arrayview_from_u16_le_bytes which uses u16::from_le_bytes. Correct, but the big-endian SHA word decomposition combined with the LE packing makes this subtle — a unit test or inline comment clarifying the expected byte↔u16 mapping for SHA message schedule columns could help prevent future regressions.

  3. set_arrayview_from_u16_le_bytes accepts &[u8] by reference but set_arrayview_from_u16_slice accepts impl IntoIterator<Item = u16> by value (crates/circuits/sha2-air/src/utils.rs:319-351): Minor API inconsistency. set_arrayview_from_u16_slice could take &[u16] or an impl IntoIterator<Item = u16> for consistency; currently it differs from set_arrayview_from_u8_slice in that one also takes a reference while the other takes owned u16. Not blocking, just a style observation.

  4. expand_to_rv64_block_from_slice (extensions/riscv/circuit/src/adapters/mod.rs:448-459): This is a pub function that panic-asserts limbs.len() <= BLOCK_FE_WIDTH at runtime. Since expand_to_rv64_block already has a compile-time assertion (const { assert!(N <= BLOCK_FE_WIDTH) }), the slice version is the one that could silently receive a wrong-length slice. This is fine for internal use but worth noting.

No blocking issues found

The constraint changes are algebraically sound — the u16 cell columns pass directly through the memory bus (which now expects BLOCK_FE_WIDTH = 4 u16 cells), and the SHA-2 / keccakf state buses receive u16 values without repacking. The pointer range checks use the correct scaling for u16 high cells.

@github-actions
Copy link
Copy Markdown

group app.proof_time_ms app.cycles leaf.proof_time_ms
fibonacci 1,570 4,000,051 435
keccak 13,828 14,365,133 2,190
sha2_bench 9,158 11,167,961 1,413
regex 1,473 4,090,656 355
ecrecover 474 112,210 263
pairing 605 592,827 256
kitchen_sink 2,223 1,979,971 410

Note: cells_used metrics omitted because CUDA tracegen does not expose unpadded trace heights.

Commit: 337518b

Benchmark Workflow

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant