Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Machines communicating through shared memory #1055

Open
georgwiese opened this issue Feb 14, 2024 · 1 comment
Open

Machines communicating through shared memory #1055

georgwiese opened this issue Feb 14, 2024 · 1 comment

Comments

@georgwiese
Copy link
Collaborator

georgwiese commented Feb 14, 2024

When delegating operation like Elliptic Curve (EC) addition to a secondary machine, one problem is that the number of input + output words can be fairly large, which leads to a large number of needed assignment registers. In the case of EC add, there are 8 32-bit words per 256-bit coordinate, 2 coordinates per point, 2 input points and one output point -> $8 \times 2 \times (2 + 1) = 48$ assignment registers needed.

If secondary machines were able to read from and write to a shared read/write memory, we could instead pass pointers to memory. For example, for EC add we could pass 3 pointers, which works as long as each point is stored in memory in 16 consecutive words.

This is a sketch how this would look like:
Screenshot 2024-02-14 at 09 51 28

Notes:

Main machine:

  • The main machine works much like it does today. It is responsible for storing the inputs to memory consecutively (which takes a 16 cycles per EC point, unless the input is already the output of a previous invocation of the arithmetic machine, as would be the case for an EC scalar multiplication algorithm).
  • The main machine passes pointers to the 2 input points and 1 output point and the current time step to the arithmetic machine.

Secondary machine (e.g. Arithmetic Machine):

  • As usual, it has a LATCH column and columns for each input (the step and the three pointers) which are constrained to be constant within a block.
  • There is also a column cl to hold the current cycle number.
    • This can actually be an intermediate polynomial computed as CLK[0] * 0 + CLK[1] * 1 + CLK[2] * 2 + ... + CLK[31] * 31
  • There is a column m which serves a role similar to an assignment register. It holds the memory value retrieved in the current cycle.
  • The memory lookup is something like this:
    is_block_used {step, P1_addr + cl, m} is mload_arith {m_step, m_addr, m_value}
    • So from the point of view of the memory all memory accesses happen at the same time step. This is fine if all memory accesses are accessing different memory cells.
    • For the permutation to work, we need to keep track of whether the block is used or not. I think this could be done by changing the call to the arithmetic machine to be a permutation as well and do something like
      instr_ec_add { ... } is (arith.is_block_used * arith.LATCH) { ... }.
    • is_block_used should then be constrained to be constant within a block.
  • From the m column, the input is distributed to where it's actually needed. In this case, there might be columns P1.x0, ... P1.x16 which should hold the 16-bit limbs of the x-coordinate of P1, so there would be a constraint looking something like this:
    CLK[0] * (P1.x0 + 2**16 * P1.x1 - m) + CLK[1] * (P1.x2 + 2**16 * P1.x3 - m) + ... = 0
    • This works because the P1.x0, ..., P1.x15 columns are constrained to be constant within the block (this is already the case today). So P1.x0 can be written in one cycle and used in another.
    • m could be an intermediate polynomial.
  • This can be generalized to reading the other input and writing the output. In this case:
    • In cycles 0-15, we would read P1.
    • In cycles 16-31, we would read P2.
    • To write P3, we need another permutation argument that writes to memory. We're also out of cycles, so we'd need another "virtual" column (i.e., intermediate polynomial) similar to m.
      • Note that because all memory accesses happen on the same time step, you wouldn't be able to overwrite an input. See below on how we could allow for this.

Memory machine:

  • Because memory is connected to machines via a permutation, we need a permutation per machine and read / write operation.
  • The memory machine has to keep track of which permutation the entry belongs to (which needs one extra column per permutation). We already built a memory machine with two write permutations (normal mstore and mstore_bootloader).

Possible extension: Allowing memory accesses at different time steps:

  • As mentioned above, having all memory accesses at the same time step is a problem if they access the same memory cells. For example, we wouldn't be able to do something like P1 := ec_add(P1, P2) now.
  • This could be fixed by increasing the STEP in the main machine by more than one in each row. Then, if a secondary machine is called, it could be using time steps "in between" two rows of the main machine.
  • For example, in this case, if we increased STEP by 2 in each row, then the writing of P3 could happen at step + 1.
    • I'm sure this would be a challenge for witness generation 🙃
@georgwiese georgwiese added this to the 2. Optimizations milestone Jul 1, 2024
@georgwiese georgwiese self-assigned this Jul 1, 2024
github-merge-queue bot pushed a commit that referenced this issue Jul 8, 2024
Implements #1055 for the Poseidon machines. Pulled out of #1508.

Specifically, this PR adds a new `PoseidonGLMemory` machine which
receives 2 memory points and then:
- Reads 24 32-Bit words and packs them into 12 field elements
- Computes the Poseidon permutation (just like `PoseidonGL`)
- For each of the 4 output field elements, it:
- Invokes the `SplitGL` machine to get the canonical `u64`
representation
  - Writes the 8 32-Bit words to memory at the provided memory pointer

The read and write memory regions can even overlap! 🎉 

This should simplify our RISC-V machine, as the syscall already expects
two memory pointers. We can simply pass it to the machine directly.

I started doing that in #1533, but I think it makes sense to wait until
#1443 is merged.

To test:
```
cargo run -r pil test_data/std/poseidon_gl_test.asm -o output -f --export-csv --prove-with estark-starky
```

I recommend reviewing the diff between
`std/machines/hash/poseidon_gl.asm` and
`std/machines/hash/poseidon_gl_memory.asm`

### Discussion

The overhead of the memory read / write is quite high (18 extra witness
columns, see [this
comment](https://github.com/powdr-labs/powdr/blob/40bdca4368c3accccb753aa35ac1027ccb8def0e/std/machines/hash/poseidon_gl_memory.asm#L13-L23),
mostly because we now need to have the input available in all rows
(which previously was only the case for the outputs). If we had offsets
other than 0 and 1, this could be avoided. Doing 24 parallel memory
reads in the first row would *not* help, because we'd have to add 24
witness columns (instead of 2 now) to store the result of the memory
operation.

A few more notes:
- With Vadcop, 18 extra witness columns in a secondary machine is *a
lot* better than introducing more registers (either "regular" registers
or assignment registers) in the main machine
- As mentioned
[here](https://github.com/powdr-labs/powdr/blob/40bdca4368c3accccb753aa35ac1027ccb8def0e/std/machines/hash/poseidon_gl_memory.asm#L111-L113),
we could get rid of two permutations if either:
- We were able to express explicitly that we want to call at most one
operation in the current row, or
- We had an optimizer that would be smart enough to batch the memory
reads and writes.
- We could also have just 1 read or write at a time (instead of 2), but
we'd have to increase the block size from 31 to 32 and the
implementation would be more complicated.
- We could also store the full final state of the Poseidon permutation,
instead of just the first 4 elements. This would need 8 more witness
columns to make the entire output available in all rows. Then, one could
use the machine to implement a Poseidon sponge, instead of.
- Looking at the bootloader, maybe it makes sense to pass 3 input
pointers instead of 1: One for the first 4 elements, one for the next 4,
and one for the capacity (often just a constant). For example, when
computing a Merkle root, you'd pass pointers for the two children hashes
and a pointer to the capacity constant.
georgwiese added a commit that referenced this issue Jul 10, 2024
Implements #1055 for the Poseidon machines. Pulled out of #1508.

Specifically, this PR adds a new `PoseidonGLMemory` machine which
receives 2 memory points and then:
- Reads 24 32-Bit words and packs them into 12 field elements
- Computes the Poseidon permutation (just like `PoseidonGL`)
- For each of the 4 output field elements, it:
- Invokes the `SplitGL` machine to get the canonical `u64`
representation
  - Writes the 8 32-Bit words to memory at the provided memory pointer

The read and write memory regions can even overlap! 🎉 

This should simplify our RISC-V machine, as the syscall already expects
two memory pointers. We can simply pass it to the machine directly.

I started doing that in #1533, but I think it makes sense to wait until
#1443 is merged.

To test:
```
cargo run -r pil test_data/std/poseidon_gl_test.asm -o output -f --export-csv --prove-with estark-starky
```

I recommend reviewing the diff between
`std/machines/hash/poseidon_gl.asm` and
`std/machines/hash/poseidon_gl_memory.asm`

### Discussion

The overhead of the memory read / write is quite high (18 extra witness
columns, see [this
comment](https://github.com/powdr-labs/powdr/blob/40bdca4368c3accccb753aa35ac1027ccb8def0e/std/machines/hash/poseidon_gl_memory.asm#L13-L23),
mostly because we now need to have the input available in all rows
(which previously was only the case for the outputs). If we had offsets
other than 0 and 1, this could be avoided. Doing 24 parallel memory
reads in the first row would *not* help, because we'd have to add 24
witness columns (instead of 2 now) to store the result of the memory
operation.

A few more notes:
- With Vadcop, 18 extra witness columns in a secondary machine is *a
lot* better than introducing more registers (either "regular" registers
or assignment registers) in the main machine
- As mentioned
[here](https://github.com/powdr-labs/powdr/blob/40bdca4368c3accccb753aa35ac1027ccb8def0e/std/machines/hash/poseidon_gl_memory.asm#L111-L113),
we could get rid of two permutations if either:
- We were able to express explicitly that we want to call at most one
operation in the current row, or
- We had an optimizer that would be smart enough to batch the memory
reads and writes.
- We could also have just 1 read or write at a time (instead of 2), but
we'd have to increase the block size from 31 to 32 and the
implementation would be more complicated.
- We could also store the full final state of the Poseidon permutation,
instead of just the first 4 elements. This would need 8 more witness
columns to make the entire output available in all rows. Then, one could
use the machine to implement a Poseidon sponge, instead of.
- Looking at the bootloader, maybe it makes sense to pass 3 input
pointers instead of 1: One for the first 4 elements, one for the next 4,
and one for the capacity (often just a constant). For example, when
computing a Merkle root, you'd pass pointers for the two children hashes
and a pointer to the capacity constant.
@georgwiese
Copy link
Collaborator Author

This is done for Poseidon, but not for Arith. See #1508 for a prototype.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment