feat(v2.1-rv64): switch to u16 limbs in keccak/sha2 airs by shuklaayush · Pull Request #2802 · openvm-org/openvm

shuklaayush · 2026-05-21T15:13:55Z

PR 3 of the memory-bus-u16 split, stacked on #2794 (disjoint from PR 2). Shrinks SHA-2 and Keccak chip trace columns that are only consumed as packed u16 cells through the memory bus.

Why

#2794 still stores SHA-2 main-chip state / message bytes and Keccak state / pointer columns byte-shaped, packing them at memory-bus call sites via pack_u8_block. Because these columns are never inspected byte-by-byte by the AIR (the constraints operate on packed u16 values, or on byte-XOR buffers that intentionally stay byte-shaped), they can be stored u16-shaped directly — halving the affected trace columns and removing one pack_u8_block per row.

This is the second of the four cell-narrowing PRs (after PR 2 for RV64 control-flow) and is disjoint from it: the only files in this PR live under extensions/sha2, extensions/keccak256, and crates/circuits/sha2-air, plus one helper addition in extensions/riscv/circuit/src/adapters/mod.rs.

What changes

SHA-2 section

Files: extensions/sha2/circuit/src/sha2_chips/{main_chip,block_hasher_chip}/, crates/circuits/sha2-air/src/{air,columns,trace}.rs, extensions/sha2/circuit/cuda/.

Main chip:

prev_state, new_state are u16 cells.
message_bytes reshapes to message_u16s (byte pairs packed at the call site).
Pointer limbs dst_ptr_limbs, state_ptr_limbs, input_ptr_limbs reshape to u16 where the AIR only consumes them as packed pointer cells; the high u16 cell gets the scaled 16-bit range check.

Block hasher:

prev_hash, final_hash are u16 cells.

sha2_bus:

MESSAGE_1 / MESSAGE_2 payloads are packed to u16 on both sender and receiver.

Stays byte-shaped (8-bit bounded by construction):

carry_a, carry_e, carry_or_buffer in sha2-air.

Keccak section

Files: extensions/keccak256/circuit/src/{keccakf_op,xorin}/, extensions/keccak256/circuit/cuda/.

keccakf_op:

preimage, postimage reshape to u16 cells.
buffer_ptr_limbs reshape to u16; pointer composition uses the scaled 16-bit bound.
keccakf_state_bus payload between keccakf_op and keccakf_perm is packed to u16.

xorin:

Pointer limbs u16-shaped (same treatment as keccakf_op).
Byte-XOR buffers (preimage_buffer_bytes, input_bytes, postimage_buffer_bytes) stay byte-shaped — those columns still carry byte-level XOR data and the AIR inspects them directly.

CUDA tracegen mirrors the new shapes; keccakf_op drops the bitwise_lookup parameter from its kernel and ABI signatures now that pointer bytes no longer need byte-level checks. The call chip still wires bitwise_lu.

Adapter helper (cross-cut)

File: extensions/riscv/circuit/src/adapters/mod.rs.

Add expand_to_rv64_block: zero-pads N u16 limbs to one RV64 bus block (BLOCK_FE_WIDTH cells). Used by keccakf_op and xorin for pointer reads against u16-celled RV64_REGISTER_AS.
BLOCK_FE_WIDTH imported into the adapters module to support the new helper.

The existing expand_to_rv64_register (8 u8 limbs, byte-shaped) is kept for byte-shaped chips and the legacy paths.

Migration notes

Chips passing SHA-2 main-chip state through the bus should hand the bus u16 cells directly (no pack_u8_block).
Pointer reads against u16-celled RV64_REGISTER_AS from chips storing only the low 32 bits of the register should use expand_to_rv64_block (BLOCK_FE_WIDTH cells, zero-padded) instead of expand_to_rv64_register (8 u8s).
Keccak pointer composition should use the scaled 16-bit bound on the high u16 cell; the previous byte-pair range check on bitwise_lookup_bus is removed for keccakf_op.

resolves int-7832, int-7833

…lumns PR 3 of the memory-bus-u16 split (stacked on PR 1, disjoint from PR 2). Shrinks SHA-2 and Keccak chip trace columns that are only consumed as packed u16 cells through the memory bus. SHA-2: - Main chip: prev_state / new_state are u16 cells; message_bytes becomes message_u16s; dst/state/input pointer limbs are u16-shaped. - Block hasher: prev_hash / final_hash are u16 cells. - sha2_bus MESSAGE_1 / MESSAGE_2 payloads packed to u16 on sender and receiver. - CUDA tracegen + tests updated. - Carry columns (carry_a, carry_e, carry_or_buffer in sha2-air) stay 8-bit bounded — byte carries are still constrained as bytes. Keccak: - keccakf_op: preimage / postimage / buffer_ptr_limbs migrated to u16 cells; pointer composition uses the scaled 16-bit bound. - keccakf_state_bus payload between keccakf_op and keccakf_perm packed to u16. - xorin: pointer limbs u16-shaped; the byte-XOR buffers (preimage_buffer_bytes, input_bytes, postimage_buffer_bytes) stay byte-shaped — those columns still carry byte-level XOR data. - CUDA tracegen + tests updated; bitwise-lookup wiring removed from keccakf_op now that pointer bytes no longer need byte-level checks. Adapters helper: - Add expand_to_rv64_block in riscv/circuit::adapters: zero-pads `N` u16 limbs to one RV64 register bus block (BLOCK_FE_WIDTH cells). Used by keccakf_op and xorin for pointer reads against u16-celled RV64_REGISTER_AS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-21T16:28:14Z

Claude finished @shuklaayush's task in 2m 7s —— View job

Code Review: PR #2802 — u16 limbs in keccak/sha2 airs

Gather context and read PR diff
Review SHA-2 changes (main chip, block hasher, sha2-air)
Review Keccak changes (keccakf_op, xorin)
Review adapter helper (expand_to_rv64_block)
Review CUDA tracegen changes
Post final review

This is a clean, well-scoped PR. The mechanical transformation from byte-shaped to u16-shaped columns is consistent across all affected chips and their CUDA counterparts. The constraint logic is simplified throughout by removing the pack_u8_block indirection and composing u16 cells directly. Overall this looks good — the changes are correct and match the stated goals.

Positive observations

Write offset fix (latent bug): The old eval_writes in sha2/main_chip/air.rs used SHA2_READ_SIZE for the write memory stride and STATE_READS for the loop bound. This PR correctly switches to SHA2_WRITE_SIZE and STATE_WRITES. Since both are currently 8, this was harmless, but it's a good fix.
Simplified AIR constraints: Removing pack_u8_block calls and operating directly on u16 cells produces cleaner, easier-to-audit constraint code. The compose → compose_rv64_u16_limbs helper avoids scattered bit-shift arithmetic.
Consistent pointer handling: The scale_rv64_ptr_high_u16_expr / scale_rv64_ptr_high_u16_from_ptr abstraction is well-designed and used consistently across SHA-2 main chip, keccakf_op, and xorin. The switch from batched bitwise_lookup_bus.send_range (which had the awkward "must have even length" padding) to individual range_bus.range_check calls is cleaner.
Good test coverage update: Tests are updated to wire the range_checker_chip alongside the existing bitwise_lookup_chip, including both CPU and CUDA test harnesses.
CUDA / Rust parity: The CUDA kernels mirror the Rust changes precisely. The u32_to_le_u16_limbs and bytes_to_le_u16_limbs CUDA utilities are clean and match their Rust counterparts.

Minor issues / suggestions

debug_assert upgraded to assert in online.rs (crates/vm/src/system/memory/online.rs:212): This changes a debug-only check to a release check. This is fine for safety but worth noting — if this was intentional for catching misuse of get_slice vs get_u8_slice with the new u16-celled memory, it's a reasonable change. Just confirming it's intended since it's a behavioral change in release builds.
Block hasher message_u16 byte ordering (extensions/sha2/circuit/src/sha2_chips/block_hasher_chip/air.rs:60-68): The message_byte helper reverses byte order (C::WORD_U8S - i % C::WORD_U8S - 1) and then message_u16 packs byte[2k] + byte[2k+1] << 8 in LE order. This matches the main chip's set_arrayview_from_u16_le_bytes which uses u16::from_le_bytes. Correct, but the big-endian SHA word decomposition combined with the LE packing makes this subtle — a unit test or inline comment clarifying the expected byte↔u16 mapping for SHA message schedule columns could help prevent future regressions.
set_arrayview_from_u16_le_bytes accepts &[u8] by reference but set_arrayview_from_u16_slice accepts impl IntoIterator<Item = u16> by value (crates/circuits/sha2-air/src/utils.rs:319-351): Minor API inconsistency. set_arrayview_from_u16_slice could take &[u16] or an impl IntoIterator<Item = u16> for consistency; currently it differs from set_arrayview_from_u8_slice in that one also takes a reference while the other takes owned u16. Not blocking, just a style observation.
expand_to_rv64_block_from_slice (extensions/riscv/circuit/src/adapters/mod.rs:448-459): This is a pub function that panic-asserts limbs.len() <= BLOCK_FE_WIDTH at runtime. Since expand_to_rv64_block already has a compile-time assertion (const { assert!(N <= BLOCK_FE_WIDTH) }), the slice version is the one that could silently receive a wrong-length slice. This is fine for internal use but worth noting.

No blocking issues found

The constraint changes are algebraically sound — the u16 cell columns pass directly through the memory bus (which now expects BLOCK_FE_WIDTH = 4 u16 cells), and the SHA-2 / keccakf state buses receive u16 values without repacking. The pointer range checks use the correct scaling for u16 high cells.

github-actions · 2026-05-21T16:41:42Z

group	app.proof_time_ms	app.cycles	leaf.proof_time_ms
fibonacci	1,570	4,000,051	435
keccak	13,828	14,365,133	2,190
sha2_bench	9,158	11,167,961	1,413
regex	1,473	4,090,656	355
ecrecover	474	112,210	263
pairing	605	592,827	256
kitchen_sink	2,223	1,979,971	410

Note: cells_used metrics omitted because CUDA tracegen does not expose unpadded trace heights.

Commit: 337518b

Benchmark Workflow