-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Machines communicating through shared memory #1055
Comments
This was referenced Feb 19, 2024
This was referenced Jul 1, 2024
github-merge-queue bot
pushed a commit
that referenced
this issue
Jul 8, 2024
Implements #1055 for the Poseidon machines. Pulled out of #1508. Specifically, this PR adds a new `PoseidonGLMemory` machine which receives 2 memory points and then: - Reads 24 32-Bit words and packs them into 12 field elements - Computes the Poseidon permutation (just like `PoseidonGL`) - For each of the 4 output field elements, it: - Invokes the `SplitGL` machine to get the canonical `u64` representation - Writes the 8 32-Bit words to memory at the provided memory pointer The read and write memory regions can even overlap! 🎉 This should simplify our RISC-V machine, as the syscall already expects two memory pointers. We can simply pass it to the machine directly. I started doing that in #1533, but I think it makes sense to wait until #1443 is merged. To test: ``` cargo run -r pil test_data/std/poseidon_gl_test.asm -o output -f --export-csv --prove-with estark-starky ``` I recommend reviewing the diff between `std/machines/hash/poseidon_gl.asm` and `std/machines/hash/poseidon_gl_memory.asm` ### Discussion The overhead of the memory read / write is quite high (18 extra witness columns, see [this comment](https://github.com/powdr-labs/powdr/blob/40bdca4368c3accccb753aa35ac1027ccb8def0e/std/machines/hash/poseidon_gl_memory.asm#L13-L23), mostly because we now need to have the input available in all rows (which previously was only the case for the outputs). If we had offsets other than 0 and 1, this could be avoided. Doing 24 parallel memory reads in the first row would *not* help, because we'd have to add 24 witness columns (instead of 2 now) to store the result of the memory operation. A few more notes: - With Vadcop, 18 extra witness columns in a secondary machine is *a lot* better than introducing more registers (either "regular" registers or assignment registers) in the main machine - As mentioned [here](https://github.com/powdr-labs/powdr/blob/40bdca4368c3accccb753aa35ac1027ccb8def0e/std/machines/hash/poseidon_gl_memory.asm#L111-L113), we could get rid of two permutations if either: - We were able to express explicitly that we want to call at most one operation in the current row, or - We had an optimizer that would be smart enough to batch the memory reads and writes. - We could also have just 1 read or write at a time (instead of 2), but we'd have to increase the block size from 31 to 32 and the implementation would be more complicated. - We could also store the full final state of the Poseidon permutation, instead of just the first 4 elements. This would need 8 more witness columns to make the entire output available in all rows. Then, one could use the machine to implement a Poseidon sponge, instead of. - Looking at the bootloader, maybe it makes sense to pass 3 input pointers instead of 1: One for the first 4 elements, one for the next 4, and one for the capacity (often just a constant). For example, when computing a Merkle root, you'd pass pointers for the two children hashes and a pointer to the capacity constant.
georgwiese
added a commit
that referenced
this issue
Jul 10, 2024
Implements #1055 for the Poseidon machines. Pulled out of #1508. Specifically, this PR adds a new `PoseidonGLMemory` machine which receives 2 memory points and then: - Reads 24 32-Bit words and packs them into 12 field elements - Computes the Poseidon permutation (just like `PoseidonGL`) - For each of the 4 output field elements, it: - Invokes the `SplitGL` machine to get the canonical `u64` representation - Writes the 8 32-Bit words to memory at the provided memory pointer The read and write memory regions can even overlap! 🎉 This should simplify our RISC-V machine, as the syscall already expects two memory pointers. We can simply pass it to the machine directly. I started doing that in #1533, but I think it makes sense to wait until #1443 is merged. To test: ``` cargo run -r pil test_data/std/poseidon_gl_test.asm -o output -f --export-csv --prove-with estark-starky ``` I recommend reviewing the diff between `std/machines/hash/poseidon_gl.asm` and `std/machines/hash/poseidon_gl_memory.asm` ### Discussion The overhead of the memory read / write is quite high (18 extra witness columns, see [this comment](https://github.com/powdr-labs/powdr/blob/40bdca4368c3accccb753aa35ac1027ccb8def0e/std/machines/hash/poseidon_gl_memory.asm#L13-L23), mostly because we now need to have the input available in all rows (which previously was only the case for the outputs). If we had offsets other than 0 and 1, this could be avoided. Doing 24 parallel memory reads in the first row would *not* help, because we'd have to add 24 witness columns (instead of 2 now) to store the result of the memory operation. A few more notes: - With Vadcop, 18 extra witness columns in a secondary machine is *a lot* better than introducing more registers (either "regular" registers or assignment registers) in the main machine - As mentioned [here](https://github.com/powdr-labs/powdr/blob/40bdca4368c3accccb753aa35ac1027ccb8def0e/std/machines/hash/poseidon_gl_memory.asm#L111-L113), we could get rid of two permutations if either: - We were able to express explicitly that we want to call at most one operation in the current row, or - We had an optimizer that would be smart enough to batch the memory reads and writes. - We could also have just 1 read or write at a time (instead of 2), but we'd have to increase the block size from 31 to 32 and the implementation would be more complicated. - We could also store the full final state of the Poseidon permutation, instead of just the first 4 elements. This would need 8 more witness columns to make the entire output available in all rows. Then, one could use the machine to implement a Poseidon sponge, instead of. - Looking at the bootloader, maybe it makes sense to pass 3 input pointers instead of 1: One for the first 4 elements, one for the next 4, and one for the capacity (often just a constant). For example, when computing a Merkle root, you'd pass pointers for the two children hashes and a pointer to the capacity constant.
This is done for Poseidon, but not for Arith. See #1508 for a prototype. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
When delegating operation like Elliptic Curve (EC) addition to a secondary machine, one problem is that the number of input + output words can be fairly large, which leads to a large number of needed assignment registers. In the case of EC add, there are 8 32-bit words per 256-bit coordinate, 2 coordinates per point, 2 input points and one output point ->$8 \times 2 \times (2 + 1) = 48$ assignment registers needed.
If secondary machines were able to read from and write to a shared read/write memory, we could instead pass pointers to memory. For example, for EC add we could pass 3 pointers, which works as long as each point is stored in memory in 16 consecutive words.
This is a sketch how this would look like:
Notes:
Main machine:
Secondary machine (e.g. Arithmetic Machine):
LATCH
column and columns for each input (the step and the three pointers) which are constrained to be constant within a block.cl
to hold the current cycle number.CLK[0] * 0 + CLK[1] * 1 + CLK[2] * 2 + ... + CLK[31] * 31
m
which serves a role similar to an assignment register. It holds the memory value retrieved in the current cycle.is_block_used {step, P1_addr + cl, m} is mload_arith {m_step, m_addr, m_value}
instr_ec_add { ... } is (arith.is_block_used * arith.LATCH) { ... }
.is_block_used
should then be constrained to be constant within a block.m
column, the input is distributed to where it's actually needed. In this case, there might be columnsP1.x0, ... P1.x16
which should hold the 16-bit limbs of the x-coordinate of P1, so there would be a constraint looking something like this:CLK[0] * (P1.x0 + 2**16 * P1.x1 - m) + CLK[1] * (P1.x2 + 2**16 * P1.x3 - m) + ... = 0
P1.x0, ..., P1.x15
columns are constrained to be constant within the block (this is already the case today). SoP1.x0
can be written in one cycle and used in another.m
could be an intermediate polynomial.m
.Memory machine:
mstore
andmstore_bootloader
).Possible extension: Allowing memory accesses at different time steps:
P1 := ec_add(P1, P2)
now.STEP
in the main machine by more than one in each row. Then, if a secondary machine is called, it could be using time steps "in between" two rows of the main machine.STEP
by 2 in each row, then the writing of P3 could happen at step + 1.The text was updated successfully, but these errors were encountered: