Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Registers in memory #1443

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

Registers in memory #1443

wants to merge 3 commits into from

Conversation

leonardoalt
Copy link
Member

@leonardoalt leonardoalt commented Jun 11, 2024

Fixes #1492

@leonardoalt leonardoalt changed the title Registers in memory rebase attempt Registers in memory Jun 13, 2024
@leonardoalt leonardoalt force-pushed the registers-in-memory-dev branch 2 times, most recently from 2284f46 to 2aedf98 Compare June 14, 2024 11:35
@leonardoalt leonardoalt changed the base branch from main to instr_link June 14, 2024 11:36
@leonardoalt leonardoalt force-pushed the registers-in-memory-dev branch 5 times, most recently from 48195ed to 8b67cba Compare June 18, 2024 16:29
Base automatically changed from instr_link to main June 18, 2024 18:00
@leonardoalt leonardoalt force-pushed the registers-in-memory-dev branch 5 times, most recently from 2a1b8f5 to db4fcbb Compare June 26, 2024 18:24
github-merge-queue bot pushed a commit that referenced this pull request Jul 8, 2024
Implements #1055 for the Poseidon machines. Pulled out of #1508.

Specifically, this PR adds a new `PoseidonGLMemory` machine which
receives 2 memory points and then:
- Reads 24 32-Bit words and packs them into 12 field elements
- Computes the Poseidon permutation (just like `PoseidonGL`)
- For each of the 4 output field elements, it:
- Invokes the `SplitGL` machine to get the canonical `u64`
representation
  - Writes the 8 32-Bit words to memory at the provided memory pointer

The read and write memory regions can even overlap! 🎉 

This should simplify our RISC-V machine, as the syscall already expects
two memory pointers. We can simply pass it to the machine directly.

I started doing that in #1533, but I think it makes sense to wait until
#1443 is merged.

To test:
```
cargo run -r pil test_data/std/poseidon_gl_test.asm -o output -f --export-csv --prove-with estark-starky
```

I recommend reviewing the diff between
`std/machines/hash/poseidon_gl.asm` and
`std/machines/hash/poseidon_gl_memory.asm`

### Discussion

The overhead of the memory read / write is quite high (18 extra witness
columns, see [this
comment](https://github.com/powdr-labs/powdr/blob/40bdca4368c3accccb753aa35ac1027ccb8def0e/std/machines/hash/poseidon_gl_memory.asm#L13-L23),
mostly because we now need to have the input available in all rows
(which previously was only the case for the outputs). If we had offsets
other than 0 and 1, this could be avoided. Doing 24 parallel memory
reads in the first row would *not* help, because we'd have to add 24
witness columns (instead of 2 now) to store the result of the memory
operation.

A few more notes:
- With Vadcop, 18 extra witness columns in a secondary machine is *a
lot* better than introducing more registers (either "regular" registers
or assignment registers) in the main machine
- As mentioned
[here](https://github.com/powdr-labs/powdr/blob/40bdca4368c3accccb753aa35ac1027ccb8def0e/std/machines/hash/poseidon_gl_memory.asm#L111-L113),
we could get rid of two permutations if either:
- We were able to express explicitly that we want to call at most one
operation in the current row, or
- We had an optimizer that would be smart enough to batch the memory
reads and writes.
- We could also have just 1 read or write at a time (instead of 2), but
we'd have to increase the block size from 31 to 32 and the
implementation would be more complicated.
- We could also store the full final state of the Poseidon permutation,
instead of just the first 4 elements. This would need 8 more witness
columns to make the entire output available in all rows. Then, one could
use the machine to implement a Poseidon sponge, instead of.
- Looking at the bootloader, maybe it makes sense to pass 3 input
pointers instead of 1: One for the first 4 elements, one for the next 4,
and one for the capacity (often just a constant). For example, when
computing a Merkle root, you'd pass pointers for the two children hashes
and a pointer to the capacity constant.
@lvella lvella self-assigned this Jul 8, 2024
@leonardoalt leonardoalt marked this pull request as ready for review July 9, 2024 08:47
Comment on lines +392 to +394
col witness XX, XX_inv, XXIsZero;
std::utils::force_bool(XXIsZero);
XXIsZero * XX = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could use a std tool function here.

col witness val3_col;
col witness val4_col;

col witness XX, XX_inv, XXIsZero;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could have a comment what these are used for

XXIsZero * XX = 0;

// HACK: This constraint cannot be active globally, because when
// XX is not constrained, witgen will try to set XX, XX_inv and XXIsZero
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could also use a hint

pc' = (1 - XXIsZero) * l + XXIsZero * (pc + 1)
}

instr branch_if_zero X, Y, Z, l: label
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can Z be a number?


// Skips Y instructions if X is zero
instr skip_if_zero X, Y { pc' = pc + 1 + (XIsZero * Y) }
//instr skip_if_zero X, Y { pc' = pc + 1 + (XIsZero * Y) }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//instr skip_if_zero X, Y { pc' = pc + 1 + (XIsZero * Y) }

link ~> val1_col = regs.mload(X, STEP)
link ~> val2_col = regs.mload(Y, STEP + 1)
{
(val1_col - val2_col + Z) + 2**32 - 1 = X_b1 + X_b2 * 0x100 + X_b3 * 0x10000 + X_b4 * 0x1000000 + wrap_bit * 2**32,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is too late, but why not use e.g. a for val1_col? I think the name is a bit unwieldy

instr is_positive X, Y, Z, W
link ~> val1_col = regs.mload(X, STEP)
link ~> val2_col = regs.mload(Y, STEP + 1)
link ~> regs.mstore(W, STEP + 2, val3_col)
Copy link
Member

@chriseth chriseth Jul 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
link ~> regs.mstore(W, STEP + 2, val3_col)
link ~> regs.mstore(W, STEP + 2, wrap_bit)

?

link ~> val2_col = regs.mload(Y, STEP + 1)
link ~> regs.mstore(W, STEP + 2, val3_col)
{
(val1_col - val2_col + Z) + 2**32 - 1 = X_b1 + X_b2 * 0x100 + X_b3 * 0x10000 + X_b4 * 0x1000000 + wrap_bit * 2**32,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could use a function for X_b1 + X_b2 * 0x100 + X_b3 * 0x10000 + X_b4 * 0x1000000

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or even the whole thing with wrap_bit (or pass wrap_bit in)

Comment on lines +509 to +514
link ~> val1_col = regs.mload(X, STEP)
link ~> regs.mstore(W, STEP + 2, val3_col)
{
XXIsZero = 1 - XX * XX_inv,
XX = val1_col,
val3_col = XXIsZero
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
link ~> val1_col = regs.mload(X, STEP)
link ~> regs.mstore(W, STEP + 2, val3_col)
{
XXIsZero = 1 - XX * XX_inv,
XX = val1_col,
val3_col = XXIsZero
link ~> XX = regs.mload(X, STEP)
link ~> regs.mstore(W, STEP + 2, XXIsZero)
{
XXIsZero = 1 - XX * XX_inv,

?

georgwiese added a commit that referenced this pull request Jul 10, 2024
Implements #1055 for the Poseidon machines. Pulled out of #1508.

Specifically, this PR adds a new `PoseidonGLMemory` machine which
receives 2 memory points and then:
- Reads 24 32-Bit words and packs them into 12 field elements
- Computes the Poseidon permutation (just like `PoseidonGL`)
- For each of the 4 output field elements, it:
- Invokes the `SplitGL` machine to get the canonical `u64`
representation
  - Writes the 8 32-Bit words to memory at the provided memory pointer

The read and write memory regions can even overlap! 🎉 

This should simplify our RISC-V machine, as the syscall already expects
two memory pointers. We can simply pass it to the machine directly.

I started doing that in #1533, but I think it makes sense to wait until
#1443 is merged.

To test:
```
cargo run -r pil test_data/std/poseidon_gl_test.asm -o output -f --export-csv --prove-with estark-starky
```

I recommend reviewing the diff between
`std/machines/hash/poseidon_gl.asm` and
`std/machines/hash/poseidon_gl_memory.asm`

### Discussion

The overhead of the memory read / write is quite high (18 extra witness
columns, see [this
comment](https://github.com/powdr-labs/powdr/blob/40bdca4368c3accccb753aa35ac1027ccb8def0e/std/machines/hash/poseidon_gl_memory.asm#L13-L23),
mostly because we now need to have the input available in all rows
(which previously was only the case for the outputs). If we had offsets
other than 0 and 1, this could be avoided. Doing 24 parallel memory
reads in the first row would *not* help, because we'd have to add 24
witness columns (instead of 2 now) to store the result of the memory
operation.

A few more notes:
- With Vadcop, 18 extra witness columns in a secondary machine is *a
lot* better than introducing more registers (either "regular" registers
or assignment registers) in the main machine
- As mentioned
[here](https://github.com/powdr-labs/powdr/blob/40bdca4368c3accccb753aa35ac1027ccb8def0e/std/machines/hash/poseidon_gl_memory.asm#L111-L113),
we could get rid of two permutations if either:
- We were able to express explicitly that we want to call at most one
operation in the current row, or
- We had an optimizer that would be smart enough to batch the memory
reads and writes.
- We could also have just 1 read or write at a time (instead of 2), but
we'd have to increase the block size from 31 to 32 and the
implementation would be more complicated.
- We could also store the full final state of the Poseidon permutation,
instead of just the first 4 elements. This would need 8 more witness
columns to make the entire output available in all rows. Then, one could
use the machine to implement a Poseidon sponge, instead of.
- Looking at the bootloader, maybe it makes sense to pass 3 input
pointers instead of 1: One for the first 4 elements, one for the next 4,
and one for the capacity (often just a constant). For example, when
computing a Merkle root, you'd pass pointers for the two children hashes
and a pointer to the capacity constant.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Registers in Memory
3 participants