
Script compilation time #153

Closed
bobbinth opened this issue Mar 17, 2022 · 17 comments
Assignees
Labels
assembly Related to Miden assembly

Comments

@bobbinth
Contributor

It seems like compiling large scripts (e.g., sha256 in the standard library) takes quite a bit of time. This is especially apparent when running tests in debug mode (which is the default). For example, the sha256 test takes over 40 seconds to complete, and this is likely driven by the assembler. When running tests in release mode, the situation is better - i.e., under 1 second - but still slower than it should be.

We should investigate what the bottleneck in the assembler is and figure out how to address it. My guess is that a big part of the time is spent computing the program hash, but it would be good to confirm this.

@bobbinth bobbinth added the assembly Related to Miden assembly label Mar 17, 2022
@itzmeanjan
Contributor

My guess is that a big part of the time is spent computing the program hash, but it would be good to confirm this.

You mean while constructing the Merkelized Abstract Syntax Tree (MAST) of the assembly program?

@bobbinth
Contributor Author

Right. Overall, even for 3K instructions, building the MAST shouldn't require more than 100 or so invocations of the hash function - so, there must be some inefficiency somewhere.
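A back-of-envelope check of that estimate (a toy calculation, not the assembler's actual code; the packing constants of 9 ops per group and 8 groups per permutation are assumptions for illustration):

```rust
// Rough estimate of hash permutations needed to absorb a span's contents.
// Assumed (hypothetical) constants: ops pack 9 to a group, and one hash
// permutation absorbs 8 groups at a time.
fn estimated_permutations(num_ops: usize, ops_per_group: usize, groups_per_perm: usize) -> usize {
    let groups = (num_ops + ops_per_group - 1) / ops_per_group; // ceiling division
    (groups + groups_per_perm - 1) / groups_per_perm            // ceiling division
}
```

Under these assumptions, 3,000 instructions need only ~42 permutations for the span contents, plus one merge per internal MAST node, so the total should indeed be in the low hundreds at most.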

@itzmeanjan
Contributor

And we're using Rescue Prime as the hash function, right?

@bobbinth
Contributor Author

And we're using Rescue Prime as the hash function, right?

correct.

@willyrgf
Contributor

I would love to work on this issue if no one else is.

@bobbinth
Contributor Author

Awesome! I'll assign this to you. Thank you!

@willyrgf
Contributor

willyrgf commented May 17, 2022

@bobbinth - While debugging the code to understand where the most expensive part of it is, I found some things and have a few questions:

The most expensive part of the execution seems to be the first loop of combine_blocks(), more specifically the hashing of the batch groups inside batch_ops().

  • The cost of hasher::hash_elements(flatten_slice_elements(&batch_groups)) seems to be linear in the length of batch_groups (e.g., length 575 costs 485ms, length 337 costs 295ms). Is this what is expected?
  • Inside combine_blocks() we're calling batch_ops() for each span via last_span.append(span), and some calls to combine_blocks() receive 200 or 300 spans in the block list. This case is very expensive because of the linear cost per span mentioned above. Would a possible solution be to decompose the blocks into two types (span blocks and non-span blocks), run batch_ops() just once for all consecutive span blocks, and recompose the two types in the correct positions? Maybe only when the next block is also a span?
  • Do you think the difference from 40s (debug) to 1s (release) is a problem? Or is the release version spending 1s the problem? I think part of the difference comes from optimizations in the release build (e.g., debug assertions being disabled) - does that make sense?
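A toy model of the cost difference behind the second point (illustrative only, not the real batch_ops()): re-hashing the accumulated span after every append touches 1 + 2 + ... + n groups in total, i.e. O(n²) work, while concatenating all n spans first and hashing once touches only n groups.

```rust
// Toy cost model: "groups hashed" when appending n spans one at a time and
// re-hashing the accumulated span after each append, vs. concatenating all
// spans first and hashing the result once.
fn rehash_cost_incremental(num_spans: usize, groups_per_span: usize) -> usize {
    // after the i-th append, the accumulated span holds i * groups_per_span groups
    (1..=num_spans).map(|i| i * groups_per_span).sum()
}

fn rehash_cost_batched(num_spans: usize, groups_per_span: usize) -> usize {
    num_spans * groups_per_span
}
```

For 769 single-group spans, the incremental strategy hashes 296,065 groups versus 769 for the batched one, a ~385x difference.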

@bobbinth
Contributor Author

@willyrgf - thank you so much for looking into this! To answer your questions:

The cost of hasher::hash_elements(flatten_slice_elements(&batch_groups)) seems to be linear in the length of batch_groups (e.g., length 575 costs 485ms, length 337 costs 295ms). Is this what is expected?

Linear cost is to be expected: hash_elements() is linear in the number of elements to be hashed (and each batch group contains 8 elements). Regarding the exact numbers you show: are these in debug or release mode? If in release mode, these are quite surprising: hashing 1000 elements (or ~125 groups) shouldn't take more than 1 ms.
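For what it's worth, the two timings quoted above are themselves consistent with linear cost; the helper below is just arithmetic on the reported numbers:

```rust
// Cost per batch group implied by a reported (debug-mode) timing.
// If hash_elements() is linear, different runs should come out roughly equal.
fn ms_per_group(total_ms: f64, num_groups: f64) -> f64 {
    total_ms / num_groups
}
```

485ms / 575 groups ≈ 0.84 ms per group and 295ms / 337 groups ≈ 0.88 ms per group, so the per-group cost is roughly constant and the total scales linearly with the number of groups.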

some calls to combine_blocks() receive 200 or 300 spans in the block list

Do you mean 200 or 300 ops in a span, or 200 or 300 spans in a block? If the latter, it is rather surprising, but if the former, it is OK - some spans could have hundreds (or even thousands) of ops in them.

Or the release version expending 1s is a problem?

I think we should create a benchmark (e.g., using Criterion) for script compilation time on an example of a moderately complex program (e.g. sha256) and see how it performs. My expectation is that compiling such programs shouldn't take more than a few milliseconds in release mode (benchmarks are usually run in release mode). So, if we see results which are very different from this, we can try to see what can be optimized.
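A minimal std-only stand-in for what such a benchmark measures (in the repo this would use the criterion crate instead; this helper is only a sketch of the idea):

```rust
use std::time::{Duration, Instant};

// Run a closure `iters` times and report the mean wall-clock time per
// iteration. Criterion does this properly (warm-up, outlier rejection,
// statistics); this shows only the bare idea.
fn bench<F: FnMut()>(iters: u32, mut f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed() / iters
}
```

A real Criterion benchmark would instead wrap the assembler's compile call in `b.iter(|| ...)` and let the harness choose the iteration count.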

In terms of potential places to optimize, I think we'll get the biggest result by trying to minimize the number of times we need to call hash_elements(). With the current structure, we call it many times while we are building the program tree, and my suspicion is that many of these calls are unnecessary. An ideal approach would be to build the tree first and then compute the hashes of all blocks - but I'm not sure how to do this yet.

@willyrgf
Contributor

@bobbinth

are these in debug or release mode? If in release mode, these are quite surprising: hashing 1000 elements (or ~125 groups) shouldn't take more than 1 ms.

These are in debug mode.

Do you mean 200 or 300 ops in a span, or 200 or 300 spans in a block? If the latter, it is rather surprising, but if the former, it is OK - some spans could have hundreds (or even thousands) of ops in them.

I mean spans in the blocks received by combine_blocks() - e.g., blocks.len(): 769, spans_in_blocks: 769. In this case we're calling hash_elements() n times for n blocks.

I think we should create a benchmark (e.g., using Criterion)

I didn't know Criterion, this looks interesting. I'll work on it.

In terms of potential places to optimize, I think we'll get the biggest result by trying to minimize the number of times we need to call hash_elements()

Yeah, it looked the same to me. This seems acceptable (and easy to test): group all spans that appear in sequence and call hash_elements() only once per sequence. In cases like the one mentioned above, I think that will be fast.

@bobbinth
Contributor Author

I mean spans in the blocks received by combine_blocks() - e.g., blocks.len(): 769, spans_in_blocks: 769. In this case we're calling hash_elements() n times for n blocks.

Oh, interesting! I hadn't anticipated this happening - I'd be curious what's generating such a large number of blocks to be merged.

I didn't know Criterion, this looks interesting. I'll work on it.

Thank you! If you are interested, you can check out a couple of benchmark examples here.

@willyrgf
Contributor

willyrgf commented May 17, 2022

would be curious what's generating such a large number of blocks to be merged.

If I understood correctly, it's the export.hash.16.

@willyrgf
Contributor

willyrgf commented Jun 6, 2022

Hi @bobbinth - sorry about this, but I will need to focus my time on other things. Can you unassign me here? If I somehow find time and this is still available, I'll send a PR.

@bobbinth
Contributor Author

bobbinth commented Jun 6, 2022

No worries! Thank you @willyrgf

@0xkanekiken
Contributor

@bobbinth: I'd love to pick it up from here.

@0xkanekiken
Contributor

@bobbinth: On my machine with an AMD Ryzen 5 3550H running Pop!_OS 22.04 LTS, the sha256 compilation benchmark takes
time: [2.8630 s 2.8750 s 2.8874 s]

@0xkanekiken
Contributor

0xkanekiken commented Jun 16, 2022

@bobbinth - The benchmark of the new PR #250, run with the same configuration, is as follows:
time: [78.123 ms 78.921 ms 79.821 ms]

My findings are consistent with @willyrgf's. Earlier, we were iteratively applying hash_elements whenever we got a pair of SPAN blocks, and the entire process was taking a considerable amount of time. In PR #250 we keep track of all consecutive SPAN blocks occurring in a row and apply hash_elements only when there is a discontinuity.
I'm also checking the size of the sequence of blocks we send as input to combine_blocks: if it contains only one element (and it's a SPAN block), no further processing is done, as that would be repetitive. This alone resulted in a further 20% reduction in script compilation time.
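The run-merging described above can be sketched roughly like this (illustrative types and names, not the actual code from PR #250):

```rust
// Walk the block list, buffer consecutive SPAN blocks, and flush the buffer
// into a single combined span (i.e., a single hash_elements() call in the
// real assembler) only when a non-span block breaks the run.
#[derive(Clone, Debug, PartialEq)]
enum Block {
    Span(Vec<u8>),       // stand-in for a span's operations
    Other(&'static str), // stand-in for join/split/loop blocks
}

fn combine_spans(blocks: Vec<Block>) -> Vec<Block> {
    let mut out = Vec::new();
    let mut run: Vec<u8> = Vec::new();
    for block in blocks {
        match block {
            Block::Span(ops) => run.extend(ops), // keep buffering the run
            other => {
                if !run.is_empty() {
                    // one flush (one hash) per run of consecutive spans
                    out.push(Block::Span(std::mem::take(&mut run)));
                }
                out.push(other);
            }
        }
    }
    if !run.is_empty() {
        out.push(Block::Span(run));
    }
    out
}
```

With this shape, a sequence of n consecutive spans costs one hash instead of n, which is consistent with the ~36x improvement (2.87s to 79ms) in the benchmark above.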

In debug mode, sha256_2_to_1_hash, which initially took ~80 secs on my system, now completes in less than 3 secs.

@bobbinth
Contributor Author

Closed by #250
