Machines communicating through shared memory #1055

georgwiese · 2024-02-14T09:52:10Z

When delegating operation like Elliptic Curve (EC) addition to a secondary machine, one problem is that the number of input + output words can be fairly large, which leads to a large number of needed assignment registers. In the case of EC add, there are 8 32-bit words per 256-bit coordinate, 2 coordinates per point, 2 input points and one output point -> $8 \times 2 \times (2 + 1) = 48$ assignment registers needed.

If secondary machines were able to read from and write to a shared read/write memory, we could instead pass pointers to memory. For example, for EC add we could pass 3 pointers, which works as long as each point is stored in memory in 16 consecutive words.

This is a sketch how this would look like:

Notes:

Main machine:

The main machine works much like it does today. It is responsible for storing the inputs to memory consecutively (which takes a 16 cycles per EC point, unless the input is already the output of a previous invocation of the arithmetic machine, as would be the case for an EC scalar multiplication algorithm).
The main machine passes pointers to the 2 input points and 1 output point and the current time step to the arithmetic machine.

Secondary machine (e.g. Arithmetic Machine):

As usual, it has a LATCH column and columns for each input (the step and the three pointers) which are constrained to be constant within a block.
There is also a column cl to hold the current cycle number.
- This can actually be an intermediate polynomial computed as CLK[0] * 0 + CLK[1] * 1 + CLK[2] * 2 + ... + CLK[31] * 31
There is a column m which serves a role similar to an assignment register. It holds the memory value retrieved in the current cycle.
The memory lookup is something like this:
is_block_used {step, P1_addr + cl, m} is mload_arith {m_step, m_addr, m_value}
- So from the point of view of the memory all memory accesses happen at the same time step. This is fine if all memory accesses are accessing different memory cells.
- For the permutation to work, we need to keep track of whether the block is used or not. I think this could be done by changing the call to the arithmetic machine to be a permutation as well and do something like
  instr_ec_add { ... } is (arith.is_block_used * arith.LATCH) { ... }.
- is_block_used should then be constrained to be constant within a block.
From the m column, the input is distributed to where it's actually needed. In this case, there might be columns P1.x0, ... P1.x16 which should hold the 16-bit limbs of the x-coordinate of P1, so there would be a constraint looking something like this:
CLK[0] * (P1.x0 + 2**16 * P1.x1 - m) + CLK[1] * (P1.x2 + 2**16 * P1.x3 - m) + ... = 0
- This works because the P1.x0, ..., P1.x15 columns are constrained to be constant within the block (this is already the case today). So P1.x0 can be written in one cycle and used in another.
- m could be an intermediate polynomial.
This can be generalized to reading the other input and writing the output. In this case:
- In cycles 0-15, we would read P1.
- In cycles 16-31, we would read P2.
- To write P3, we need another permutation argument that writes to memory. We're also out of cycles, so we'd need another "virtual" column (i.e., intermediate polynomial) similar to m.
  - Note that because all memory accesses happen on the same time step, you wouldn't be able to overwrite an input. See below on how we could allow for this.

Memory machine:

Because memory is connected to machines via a permutation, we need a permutation per machine and read / write operation.
The memory machine has to keep track of which permutation the entry belongs to (which needs one extra column per permutation). We already built a memory machine with two write permutations (normal mstore and mstore_bootloader).

Possible extension: Allowing memory accesses at different time steps:

As mentioned above, having all memory accesses at the same time step is a problem if they access the same memory cells. For example, we wouldn't be able to do something like P1 := ec_add(P1, P2) now.
This could be fixed by increasing the STEP in the main machine by more than one in each row. Then, if a secondary machine is called, it could be using time steps "in between" two rows of the main machine.
For example, in this case, if we increased STEP by 2 in each row, then the writing of P3 could happen at step + 1.
- I'm sure this would be a challenge for witness generation 🙃

The text was updated successfully, but these errors were encountered:

Implements #1055 for the Poseidon machines. Pulled out of #1508. Specifically, this PR adds a new `PoseidonGLMemory` machine which receives 2 memory points and then: - Reads 24 32-Bit words and packs them into 12 field elements - Computes the Poseidon permutation (just like `PoseidonGL`) - For each of the 4 output field elements, it: - Invokes the `SplitGL` machine to get the canonical `u64` representation - Writes the 8 32-Bit words to memory at the provided memory pointer The read and write memory regions can even overlap! 🎉 This should simplify our RISC-V machine, as the syscall already expects two memory pointers. We can simply pass it to the machine directly. I started doing that in #1533, but I think it makes sense to wait until #1443 is merged. To test: ``` cargo run -r pil test_data/std/poseidon_gl_test.asm -o output -f --export-csv --prove-with estark-starky ``` I recommend reviewing the diff between `std/machines/hash/poseidon_gl.asm` and `std/machines/hash/poseidon_gl_memory.asm` ### Discussion The overhead of the memory read / write is quite high (18 extra witness columns, see [this comment](https://github.com/powdr-labs/powdr/blob/40bdca4368c3accccb753aa35ac1027ccb8def0e/std/machines/hash/poseidon_gl_memory.asm#L13-L23), mostly because we now need to have the input available in all rows (which previously was only the case for the outputs). If we had offsets other than 0 and 1, this could be avoided. Doing 24 parallel memory reads in the first row would *not* help, because we'd have to add 24 witness columns (instead of 2 now) to store the result of the memory operation. A few more notes: - With Vadcop, 18 extra witness columns in a secondary machine is *a lot* better than introducing more registers (either "regular" registers or assignment registers) in the main machine - As mentioned [here](https://github.com/powdr-labs/powdr/blob/40bdca4368c3accccb753aa35ac1027ccb8def0e/std/machines/hash/poseidon_gl_memory.asm#L111-L113), we could get rid of two permutations if either: - We were able to express explicitly that we want to call at most one operation in the current row, or - We had an optimizer that would be smart enough to batch the memory reads and writes. - We could also have just 1 read or write at a time (instead of 2), but we'd have to increase the block size from 31 to 32 and the implementation would be more complicated. - We could also store the full final state of the Poseidon permutation, instead of just the first 4 elements. This would need 8 more witness columns to make the entire output available in all rows. Then, one could use the machine to implement a Poseidon sponge, instead of. - Looking at the bootloader, maybe it makes sense to pass 3 input pointers instead of 1: One for the first 4 elements, one for the next 4, and one for the capacity (often just a constant). For example, when computing a Merkle root, you'd pass pointers for the two children hashes and a pointer to the capacity constant.

georgwiese · 2024-09-03T08:34:05Z

This is done for Poseidon, but not for Arith. See #1508 for a prototype.

This was referenced Feb 19, 2024

[Prototype] Machines communicating through shared memory #1071

Closed

Powdr ASM: Allow machines to be linked via permutaions #1073

Closed

leonardoalt added the memory label Mar 14, 2024

georgwiese mentioned this issue Mar 26, 2024

ASM: Allow multiple machines to access the same machine instance #1200

Open

leonardoalt added the priority label Apr 4, 2024

georgwiese added this to the 2. Optimizations milestone Jul 1, 2024

This was referenced Jul 1, 2024

[WIP] Machines communicating through shared memory #1380

Closed

Allow passing machines as arguments #1493

Closed

[WIP] Machines communicating through shared memory #1508

Closed

georgwiese self-assigned this Jul 1, 2024

georgwiese mentioned this issue Jul 3, 2024

Add PoseidonGLMemory machine #1525

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Machines communicating through shared memory #1055

Machines communicating through shared memory #1055

georgwiese commented Feb 14, 2024 •

edited

Loading

georgwiese commented Sep 3, 2024

Machines communicating through shared memory #1055

Machines communicating through shared memory #1055

Comments

georgwiese commented Feb 14, 2024 • edited Loading

Notes:

georgwiese commented Sep 3, 2024

georgwiese commented Feb 14, 2024 •

edited

Loading