
Script compilation time #153

Closed
bobbinth opened this issue Mar 17, 2022 · 17 comments
Assignees
Labels
assembly Related to Miden assembly

Comments

@bobbinth
Contributor

It seems like compiling large scripts (e.g., sha256 in the standard library) takes quite a bit of time. This is especially apparent when running tests in debug mode (which is the default). For example, the sha256 test takes over 40 seconds to complete, and this is likely driven by the assembler. When running tests in release mode, the situation is better - i.e., under 1 second - but still slower than it should be.

We should investigate what the bottleneck in the assembler is and figure out how to address it. My guess is that a big part of the time is spent computing the program hash, but it would be good to confirm this.

@bobbinth bobbinth added the assembly Related to Miden assembly label Mar 17, 2022
@itzmeanjan
Contributor

My guess is that a big part of the time is spent computing the program hash, but it would be good to confirm this.

You mean while constructing the Merkelized Abstract Syntax Tree (MAST) of the assembly program?

@bobbinth
Contributor Author

Right. Overall, even for 3K instructions, building the MAST shouldn't require more than 100 or so invocations of the hash function - so, there must be some inefficiency somewhere.
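A back-of-envelope check of that estimate (a toy calculation, not the assembler's actual code; the packing constants of 9 ops per group and 8 groups per permutation are assumptions for illustration):

```rust
// Rough estimate of hash permutations needed to absorb a span's contents.
// Assumed (hypothetical) constants: ops pack 9 to a group, and one hash
// permutation absorbs 8 groups at a time.
fn estimated_permutations(num_ops: usize, ops_per_group: usize, groups_per_perm: usize) -> usize {
    let groups = (num_ops + ops_per_group - 1) / ops_per_group; // ceiling division
    (groups + groups_per_perm - 1) / groups_per_perm            // ceiling division
}
```

Under these assumptions, 3,000 instructions need only ~42 permutations for the span contents, plus one merge per internal MAST node, so the total should indeed be in the low hundreds at most.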

@itzmeanjan
Contributor

And we're using Rescue Prime as the hash function, right?

@bobbinth
Contributor Author

And we're using Rescue Prime as the hash function, right?

correct.

@willyrgf
Contributor

I would love to work on this issue if no one else is.

@bobbinth
Contributor Author

Awesome! I'll assign this to you. Thank you!

@willyrgf
Contributor

willyrgf commented May 17, 2022

@bobbinth - While debugging the code to understand where the most expensive part of it is, I found some things and have a few questions:

The most expensive part of the execution seems to be the first loop of combine_blocks(), more specifically the hashing of the batch groups inside batch_ops().

  • The cost of hasher::hash_elements(flatten_slice_elements(&batch_groups)) seems to be linear in the length of batch_groups (e.g., length 575 costs 485ms, length 337 costs 295ms). Is this what is expected?
  • Inside combine_blocks() we're calling batch_ops() for each span via last_span.append(span), and some calls to combine_blocks() receive 200 or 300 spans in the block list. This case is very expensive because of the linear cost per span mentioned above. Would a possible solution be to decompose the blocks into two types (span blocks and non-span blocks), run batch_ops() just once for all consecutive span blocks, and recompose the two types in the correct positions? Maybe only when the next block is also a span?
  • Do you think the difference from 40s (debug) to 1s (release) is a problem? Or is the release version spending 1s the problem? I think part of the difference comes from optimizations in the release build (e.g., debug assertions being disabled) - does that make sense?
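A toy model of the cost difference behind the second point (illustrative only, not the real batch_ops()): re-hashing the accumulated span after every append touches 1 + 2 + ... + n groups in total, i.e. O(n²) work, while concatenating all n spans first and hashing once touches only n groups.

```rust
// Toy cost model: "groups hashed" when appending n spans one at a time and
// re-hashing the accumulated span after each append, vs. concatenating all
// spans first and hashing the result once.
fn rehash_cost_incremental(num_spans: usize, groups_per_span: usize) -> usize {
    // after the i-th append, the accumulated span holds i * groups_per_span groups
    (1..=num_spans).map(|i| i * groups_per_span).sum()
}

fn rehash_cost_batched(num_spans: usize, groups_per_span: usize) -> usize {
    num_spans * groups_per_span
}
```

For 769 single-group spans, the incremental strategy hashes 296,065 groups versus 769 for the batched one, a ~385x difference.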

@bobbinth
Contributor Author

@willyrgf - thank you so much for looking into this! To answer your questions:

The cost of hasher::hash_elements(flatten_slice_elements(&batch_groups)) seems to be linear in the length of batch_groups (e.g., length 575 costs 485ms, length 337 costs 295ms). Is this what is expected?

Linear cost is to be expected: hash_elements() is linear in the number of elements to be hashed (and each batch group contains 8 elements). Regarding the exact numbers you show: are these in debug or release mode? If in release mode, these are quite surprising: hashing 1000 elements (or ~125 groups) shouldn't take more than 1 ms.
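For what it's worth, the two timings quoted above are themselves consistent with linear cost; the helper below is just arithmetic on the reported numbers:

```rust
// Cost per batch group implied by a reported (debug-mode) timing.
// If hash_elements() is linear, different runs should come out roughly equal.
fn ms_per_group(total_ms: f64, num_groups: f64) -> f64 {
    total_ms / num_groups
}
```

485ms / 575 groups ≈ 0.84 ms per group and 295ms / 337 groups ≈ 0.88 ms per group, so the per-group cost is roughly constant and the total scales linearly with the number of groups.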

some calls to combine_blocks() receive 200 or 300 spans in the block list

Do you mean 200 or 300 ops in a span, or 200 or 300 spans in a block? If the latter, it is rather surprising, but if the former, it is OK - some spans could have hundreds (or even thousands) of ops in them.

Or the release version expending 1s is a problem?

I think we should create a benchmark (e.g., using Criterion) for script compilation time on an example of a moderately complex program (e.g. sha256) and see how it performs. My expectation is that compiling such programs shouldn't take more than a few milliseconds in release mode (benchmarks are usually run in release mode). So, if we see results which are very different from this, we can try to see what can be optimized.
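A minimal std-only stand-in for what such a benchmark measures (in the repo this would use the criterion crate instead; this helper is only a sketch of the idea):

```rust
use std::time::{Duration, Instant};

// Run a closure `iters` times and report the mean wall-clock time per
// iteration. Criterion does this properly (warm-up, outlier rejection,
// statistics); this shows only the bare idea.
fn bench<F: FnMut()>(iters: u32, mut f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed() / iters
}
```

A real Criterion benchmark would instead wrap the assembler's compile call in `b.iter(|| ...)` and let the harness choose the iteration count.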

In terms of potential places to optimize, I think we'll get the biggest result by trying to minimize the number of times we need to call hash_elements(). With the current structure, we call it many times while we are building the program tree, and my suspicion is that many of these calls are unnecessary. An ideal approach would be to build the tree first and then compute the hashes of all blocks - but I'm not sure how to do this yet.

@willyrgf
Contributor

@bobbinth

are these in debug or release mode? If in release mode, these are quite surprising: hashing 1000 elements (or ~125 groups) shouldn't take more than 1 ms.

These are in debug mode.

Do you mean 200 or 300 ops in a span, or 200 or 300 spans in a block? If the latter, it is rather surprising, but if the former, it is OK - some spans could have hundreds (or even thousands) of ops in them.

I mean spans in the blocks received by combine_blocks() - e.g., blocks.len(): 769, spans_in_blocks: 769. In this case we're calling hash_elements() n times for n blocks.

I think we should create a benchmark (e.g., using Criterion)

I didn't know Criterion, this looks interesting. I'll work on it.

In terms of potential places to optimize, I think we'll get the biggest result by trying to minimize the number of times we need to call hash_elements()

Yeah, it looked the same to me. This seems acceptable (and easy to test): group all spans that appear in sequence and call hash_elements() only once per sequence. In cases like the one mentioned above, I think that will be fast.

@bobbinth
Contributor Author

I mean spans in the blocks received by combine_blocks() - e.g., blocks.len(): 769, spans_in_blocks: 769. In this case we're calling hash_elements() n times for n blocks.

Oh, interesting! I hadn't anticipated this happening - I'd be curious what's generating such a large number of blocks to be merged.

I didn't know Criterion, this looks interesting. I'll work on it.

Thank you! If you are interested, you can check out a couple of benchmark examples here.

@willyrgf
Contributor

willyrgf commented May 17, 2022

would be curious what's generating such a large number of blocks to be merged.

If I understood correctly, it's the export.hash.16.

@willyrgf
Contributor

willyrgf commented Jun 6, 2022

Hi @bobbinth - sorry about this, but I will need to focus my time on other things. Can you unassign me here? If I somehow find time and this is still available, I'll send a PR.

@bobbinth
Contributor Author

bobbinth commented Jun 6, 2022

No worries! Thank you @willyrgf

@0xkanekiken
Contributor

@bobbinth: I'd love to pick it up from here.

@0xkanekiken
Contributor

@bobbinth: On my machine with an AMD Ryzen 5 3550H running Pop!_OS 22.04 LTS, the sha256 compilation benchmark takes
time: [2.8630 s 2.8750 s 2.8874 s]

@0xkanekiken
Contributor

0xkanekiken commented Jun 16, 2022

@bobbinth - The benchmark of the new PR #250, run with the same configuration, is as follows:
time: [78.123 ms 78.921 ms 79.821 ms]

My findings are consistent with @willyrgf's. Earlier, we were iteratively applying hash_elements whenever we got a pair of SPAN blocks, and the entire process was taking a considerable amount of time. In PR #250 we keep track of all consecutive SPAN blocks occurring in a row and apply hash_elements only when there is a discontinuity.
I'm also checking the size of the sequence of blocks we send as input to combine_blocks: if it contains only one element (and it's a SPAN block), no further processing is done, as that would be repetitive. This alone resulted in a further 20% reduction in script compilation time.
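The run-merging described above can be sketched roughly like this (illustrative types and names, not the actual code from PR #250):

```rust
// Walk the block list, buffer consecutive SPAN blocks, and flush the buffer
// into a single combined span (i.e., a single hash_elements() call in the
// real assembler) only when a non-span block breaks the run.
#[derive(Clone, Debug, PartialEq)]
enum Block {
    Span(Vec<u8>),       // stand-in for a span's operations
    Other(&'static str), // stand-in for join/split/loop blocks
}

fn combine_spans(blocks: Vec<Block>) -> Vec<Block> {
    let mut out = Vec::new();
    let mut run: Vec<u8> = Vec::new();
    for block in blocks {
        match block {
            Block::Span(ops) => run.extend(ops), // keep buffering the run
            other => {
                if !run.is_empty() {
                    // one flush (one hash) per run of consecutive spans
                    out.push(Block::Span(std::mem::take(&mut run)));
                }
                out.push(other);
            }
        }
    }
    if !run.is_empty() {
        out.push(Block::Span(run));
    }
    out
}
```

With this shape, a sequence of n consecutive spans costs one hash instead of n, which is consistent with the ~36x improvement (2.87s to 79ms) in the benchmark above.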

In debug mode, sha256_2_to_1_hash, which initially took ~80 secs on my system, now completes in less than 3 secs.

@bobbinth
Contributor Author

Closed by #250
