
bench: Add design specs for ZK WASM benchmarking infrastructure #182

Closed
wants to merge 4 commits

Conversation

@mooori mooori commented Jan 5, 2024

Outlines the benchmarking infrastructure to be built for ZK WASM.

@aborg-dev aborg-dev mentioned this pull request Jan 9, 2024
@mooori mooori marked this pull request as ready for review January 9, 2024 16:45

mooori commented Jan 10, 2024

@krlosMata this PR adds the benchmarking infrastructure design doc which was discussed in the last sync meeting. (GitHub unfortunately only allows requesting reviews from members of the repo’s org, so I couldn’t request a review from you.)

@aborg-dev aborg-dev left a comment


I like how well this document summarizes the requirements that we have. I think it would still be useful to go a bit deeper into how we are going to achieve these goals. Ideally, it should be possible for someone to read this document and implement the benchmarking infrastructure without having to make any large design decisions.

I've left a few comments with examples of big questions I would still have if I tried to implement this design now.


## Different hardware

Some benchmarks of other ZK systems are run on different hardware. ZK WASM might run benchmarks on the hardware used by other systems as well to allow for a more complete comparison. This work is lower priority and rather unlikely to be realized in the first iteration of the benchmarking infrastructure.


Another reason why running benchmarks on different hardware doesn't make sense for us right now is that we're not directly measuring the proving time, but instead tracking a proxy for it, the number of VM cycles, and that number is independent of the hardware in use.


## Instrumentation

Instrumentation is based on logging zkASM instructions by calling JavaScript from zkASM. The indirection via JavaScript is required as `zkevm-proverjs` executes compiled PIL instead of zkASM. The code for zkASM instruction logging might be inserted into the zkASM by enabling a compilation flag or in a post-processing step.
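
For illustration only, a minimal TypeScript sketch of what a raw log entry and a JS logging helper could look like; the field names and function names are assumptions, not part of this design:

```typescript
// Hypothetical shape of one raw log entry; the field names are assumptions.
interface InstructionLogEntry {
  line: number;    // zkASM source line that emitted the instruction
  op: string;      // primitive opcode, e.g. "ADD" or "MSTORE"
  label: string;   // enclosing zkASM label, if any
  cycle: number;   // VM cycle counter at the time of execution
}

const instructionLog: InstructionLogEntry[] = [];

// Hypothetical JS helper called (via the JS indirection) once per executed instruction.
export function logInstruction(entry: InstructionLogEntry): void {
  instructionLog.push(entry);
}

// Dump the collected log, e.g. as JSON, for later aggregation and visualization.
export function dumpInstructionLog(): string {
  return JSON.stringify(instructionLog, null, 2);
}
```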


Let's try to expand this section with a bit more detail on:

  • What a sample raw instruction log would look like (e.g. which fields we need to track)
  • The JS helpers that we plan to add to facilitate this tracking (e.g. their signatures and what they are supposed to do)
  • Changes to ZKASM that are needed to facilitate this tracking

This all depends on the visualizations that we have in mind. We can start with simple tables, but eventually, I think we'll need to use something more visual like CPU flamegraphs to reason about large programs. So I would have that target in mind when answering the questions above.

mooori (Author) replied

Added a proposal for a visualization schema in 8c2bcf0. I would follow up with details on the other points once we have specified the visualization schema (as you mentioned, the other points depend on that).

cranelift/zkasm_data/benchmarks/Design.md
- zkASM instructions and the number of cycles required for their execution.
- The number of cycles required for benchmarks across different points in the git commit history.

## Relating zkASM instructions to cycles


What exactly are the instructions that we are going to count? Consider the following ZKASM line from ZKEVM ROM:

```
$               :ADD, MSTORE(SP++), JMP(readCode)
```

This is valid code that pipelines the execution of three different primitive instructions in sequence to use fewer cycles and intermediate registers. I would expect our codegen to use this feature heavily in the future.
How will we track this? Do we canonicalize this into a single compound instruction? Or do we track it as 3 separate instructions?

mooori (Author) replied

> Do we canonicalize this into a single compound instruction? Or do we track it as 3 separate instructions?

I think both might be interesting, depending on the context of the analysis for which benchmarking is used. Therefore I would suggest collecting data that enables both and providing a UI or CLI switch to choose whether an instruction list is accounted for as a single compound instruction or as separate instructions.
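
For illustration, a minimal TypeScript sketch of a record shape that would support both accountings; all names and fields here are hypothetical:

```typescript
// Hypothetical record for one executed zkASM source line.
interface LineRecord {
  sourceLine: string;  // e.g. "$ :ADD, MSTORE(SP++), JMP(readCode)"
  ops: string[];       // primitive instructions on that line, e.g. ["ADD", "MSTORE", "JMP"]
  cycles: number;      // cycles spent on this line
}

// Count occurrences either per compound line or per primitive instruction,
// selected by a (hypothetical) UI/CLI switch.
function countOps(records: LineRecord[], mode: "compound" | "separate"): Map<string, number> {
  const counts = new Map<string, number>();
  for (const r of records) {
    const keys = mode === "compound" ? [r.ops.join(", ")] : r.ops;
    for (const key of keys) {
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }
  }
  return counts;
}
```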


### Format

If feasible, visualizations should be SVG files, again taking inspiration from flame graphs. To avoid cluttering the graph, `<count>` is initially hidden and revealed for a rectangle on hovering.
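
One possible approach, purely as an illustrative sketch, is to emit such an SVG by hand and rely on the standard SVG `<title>` element, which browsers render as a hover tooltip; the row shape and layout below are assumptions:

```typescript
// Each rectangle carries its <count> as an SVG <title> child, which browsers
// show as a tooltip on hover, so the value stays hidden until needed.
interface Row {
  label: string;  // e.g. an MInst name
  count: string;  // formatted <count> tuple, e.g. "(num_cycles=120, num_reg_writes=40)"
}

function renderSvg(rows: Row[], rowHeight = 20, width = 400): string {
  const rects = rows
    .map(
      (r, i) => `
  <g>
    <title>${r.count}</title>
    <rect x="0" y="${i * rowHeight}" width="${width}" height="${rowHeight - 2}" fill="#cbe2f7"/>
    <text x="4" y="${i * rowHeight + 14}" font-family="monospace">${r.label}</text>
  </g>`
    )
    .join("");
  return `<svg xmlns="http://www.w3.org/2000/svg" width="${width}" height="${rows.length * rowHeight}">${rects}</svg>`;
}
```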


Do you have any thoughts on how we can produce such SVGs for the profile? Are there any existing tools we can reuse for this?

- For `zkasm_op_x` sums are taken separately for each `MInst` that emitted `zkasm_op_x`.
- The order of the tuple elements can be chosen via a UI or CLI flag. For example, `num_cycles` can be shown first, in which case `a = num_cycles`.
- Sorting is done by comparing the first element of the `<count>` tuples (see the sketch after this list).
- All `zkasm_op_*` rectangles have the same width.
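
A minimal sketch of this reordering and sorting step, assuming a hypothetical `<count>` shape with the two metrics mentioned in this document (`num_cycles`, `num_reg_writes`) and illustrative `MInst` names:

```typescript
// Hypothetical <count> tuple; the metric names come from this document but the
// concrete shape is an assumption.
interface Count {
  num_cycles: number;
  num_reg_writes: number;
}

// Reorder the graph by the metric chosen (via a hypothetical CLI flag) as the
// first tuple element `a`, sorting rows descending by it.
function sortRows(rows: [string, Count][], a: keyof Count): [string, Count][] {
  return [...rows].sort(([, x], [, y]) => y[a] - x[a]);
}

// Example: the same data yields different graphs for a = num_cycles vs a = num_reg_writes.
const rows: [string, Count][] = [
  ["MInst::AluRRR", { num_cycles: 120, num_reg_writes: 40 }],
  ["MInst::Load64", { num_cycles: 300, num_reg_writes: 20 }],
];
console.log(sortRows(rows, "num_reg_writes"));
```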


What was your motivation for this two-level breakdown?

The more I think about this, the more it seems that the further breakdown of `MInst::Instruction` into `zkasm_op_*` will not bring much value compared to the complexity it will introduce.
The lowering from a concrete `MInst` to ZKASM opcodes is fairly deterministic and can be seen in the ZKASM backend code. I think we'll all soon get an intuition for how much each instruction costs and won't need to dive into this detail when optimizing a benchmark like SHA256.
The only scenario where we would want to see this breakdown is when we optimize a specific instruction, but in that case it's enough to see the breakdown for that instruction only, not for all instructions in the program.

It might be useful to think end-to-end about the common optimization journeys and what information we need to be able to do them. Two that come to mind right now are:

  • Make a single benchmark faster
  • Make a particular instruction faster

mooori (Author) replied

**How to use flame graphs?**

I assume flame graphs are typically used to see where a program spends most of its time and then try to optimize that work. For instance, a web frontend might be slow because it spends a lot of time drawing colors on the screen. Further examination shows that it is mostly already-blue rectangles that are re-colored blue, which is unnecessary. So the frontend is optimized by drawing a color only if the area currently has another color.

The compiler receives wasm as input and generates zkASM. We also want to see where most of the time is spent and then optimize that work. However, our understanding of the work done by the program is not as deep as in the previous example, and therefore we cannot change what the program does (draw a color only if it differs from the area’s current color), at least in the general case, i.e. when the program is not a benchmark for which we control the original source code.

The wasm we receive as input has likely already been optimized by the compiler that produced it, since developers are not expected to write larger programs directly in wasm. This makes it even harder to figure out what the program does.

For these reasons I’m wondering whether (conventional) flame graphs contain a lot of information that we cannot utilize to optimize the emitted zkASM. In that case something simpler might already be sufficient. Or perhaps I’m missing something or misunderstanding something?

**Usage of Analysis benchmarking tools**

My assumption was that Analysis benchmarking tools will be used to identify:

  1. Hot `MInst`s for a given program and a breakdown of their opcodes and costs on the zkASM side. This helps to identify what to optimize, since even micro-optimizations of hot opcodes can have a significant impact overall.
  2. Opcode sequences that lend themselves to peephole optimizations. For this task I think it would be helpful to see which blocks/labels of a program are on the hot path, since optimizations here are likely to have a higher impact.

**Motivation for graph schema in 8c2bcf0**

The motivation was to help with 1. described above. Looking at WAT files like the sha256 benchmark doesn’t reveal which `MInst`s or zkASM opcodes are hot. The graph aggregates `<count>` values over the execution of the program and hence helps to identify hot `MInst`s and zkASM opcodes.

Having the breakdown of `MInst` to zkASM opcodes in the graph helps to make the information available in one place. Though, as you mentioned, the zkASM opcodes can be looked up in the backend code, and including them in the graph might not be worth the extra complexity.

**Another visualization: hot labels**

The visualization proposed above wouldn’t help with 2. An idea for a visualization that helps with identifying peephole optimization opportunities would be printing hot labels. Below is a rough sketch of how that might look. There could be CLI flags to determine how much data is included, e.g. to include only the X most frequent instructions per block to avoid cluttering the graph.

(image: labels_graph, a rough sketch of the hot-labels visualization)

This shows hot labels for a given program and would allow us to identify the labels where peephole optimizations might have the biggest impact on performance. It could also help with the “make a single benchmark faster” optimization journey.
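
A minimal TypeScript sketch of how hot labels could be extracted from the raw instruction log; the log entry shape, the use of cycles as the hotness metric, and the CLI knobs are assumptions:

```typescript
// Hypothetical raw log entry used for the hot-labels view.
interface LogEntry {
  label: string;   // enclosing zkASM label
  op: string;      // primitive opcode
  cycles: number;  // cycles spent on this instruction
}

// Return the topN labels by total cycles, each with its opsPerLabel most
// frequent opcodes (the "X most frequent instructions per block" CLI knob).
function hotLabels(log: LogEntry[], topN: number, opsPerLabel: number) {
  const byLabel = new Map<string, { cycles: number; ops: Map<string, number> }>();
  for (const e of log) {
    const bucket = byLabel.get(e.label) ?? { cycles: 0, ops: new Map<string, number>() };
    bucket.cycles += e.cycles;
    bucket.ops.set(e.op, (bucket.ops.get(e.op) ?? 0) + 1);
    byLabel.set(e.label, bucket);
  }
  return [...byLabel.entries()]
    .sort(([, a], [, b]) => b.cycles - a.cycles)
    .slice(0, topN)
    .map(([label, { cycles, ops }]) => ({
      label,
      cycles,
      topOps: [...ops.entries()].sort(([, a], [, b]) => b - a).slice(0, opsPerLabel),
    }));
}
```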

**Flame graphs via wasm profiling**

If the control flow of wasm and the zkASM it is compiled to are comparable, we might consider producing flame graphs by profiling wasm execution. Profiling is supported by wasmtime, and there are many libraries that generate flame graphs from standard profiling data (e.g. `flamegraph` for Rust). This procedure could be used for the benchmark optimization journey, where we know and control the Rust and wasm of the program that is compiled to zkASM.


> The compiler receives wasm as input and generates zkASM. We also want to see where most of the time is spent and then optimize that work. However, our understanding of the work done by the program is not as deep as in the previous example, and therefore we cannot change what the program does (draw a color only if it differs from the area’s current color), at least in the general case, i.e. when the program is not a benchmark for which we control the original source code.

Thanks for raising this very important point. When we build the tooling for performance optimization, we need to have a clear set of use cases we aim to address. For Stage 2, those use cases are:

  1. Support and optimize benchmarks equivalent to https://risc0.github.io/ghpages/dev/benchmarks/index.html for the purpose of fair comparison
  2. Support and optimize full WASM interpreter

In both cases, we do have control over the code that will be compiled to WASM and later to ZKASM, and I would expect us to change that code to yield better performance (like we already do for benchmarks by compiling them with `no_std`). We can also annotate this code with tracing statements (e.g. using the `tracing` crate) to make it easier to see the correspondence between the original and generated code.

In other words, we're not aiming to build tooling to optimize arbitrary ZKASM programs in the wild; instead, we focus on a selected subset of "pet" programs that are important to us.
Note that this will also be the case beyond Stage 2: in Stage 3 we will optimize specifically for NEAR Protocol code and NEAR host functions. Similarly, any users of ZKWASM will have a particular Rust program that they are trying to produce ZK proofs for (e.g. a NEAR light client); they will be willing to optimize this Rust program and will also benefit from such tooling.

In this light, I think we should focus on visualizations that allow us to work effectively with medium-to-large programs that are annotated/structured well enough for us to optimize them efficiently.


- No dependencies required as SVG files can be viewed in a web browser.
- They can be embedded in GitHub comments, markdown files and other documents.

aborg-dev commented Jan 12, 2024

I agree with these advantages, but I think it's also worth listing the limitations and how SVGs compare with alternative solutions (e.g. no built-in diffing unlike plain text, limited interactivity compared to HTML).
Then, based on our requirements for visualization, we can choose between these solutions.


Due to the sort order defined above, there are different graphs for different assignments of `a`. This allows developers to highlight different costs, for instance register writes if `a = num_reg_writes` or virtual machine cycles if `a = num_cycles`.

Once the data described in the previous section is available and has been used while working on the repository, opportunities for visualizations might be identified. They should help to make the information more easily digestible. Some examples of possible visualizations are:


Just to throw in one more visualization idea from the perf world: a trace profile visualizer like https://profiler.firefox.com/docs/#/

It would allow us to use an interactive profile explorer like https://share.firefox.dev/3OFPnAP that supports flamegraphs as long as we export the profile in a standard JSON format.

mooori commented Jan 19, 2024

@nagisa brought up the following points in today’s sync meeting:

**Wall clock time**

  • We might want to include wall clock time in the output of benchmarking tools.
  • Even if zkASM instrumentation does not affect VM cycles, its impact on wall clock time might still become an issue. If the benchmarking infrastructure is too slow, it might not be used.

**Additional things that could be measured**

  • Number of columns or rows.

**Approach to measuring the impact of register allocation**

  • Execute a program on machines with different numbers of registers and then compare:
    • cycles
    • wall clock time
  • Better performance with more registers implies that we can optimize generated zkASM by making register allocation more efficient.
  • Increasing the number of VM registers probably requires support from Polygon.
  • Decreasing the number of used registers could be achieved by modifying cranelift. However, since the number of registers is already small, we probably couldn’t use fewer than the current number.

mooori commented Feb 9, 2024

Work on this design doc has shown that there are open questions regarding what exactly we want to build. To answer that question, an MVP is being built:

Once conclusions from the MVP have been drawn, I suggest starting a new design doc. This might be easier to review than an almost complete rewrite of this one.

@mooori mooori closed this Feb 9, 2024