
Document and improve wasmer performance for contracts #10298

Closed
athei opened this issue Nov 17, 2021 · 20 comments
Labels
I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task. Z3-substantial Can be fixed by an experienced coder with a working knowledge of the codebase.

Comments

@athei
Member

athei commented Nov 17, 2021

Motivation

Currently, we use an embedded wasmi (compiled to wasm as part of the runtime) to execute contracts. We want to switch to the wasmer singlepass backend. However, in order to make an informed decision we need to document the status quo.

We have two benchmark approaches to assess the performance of contract execution:

Synthetic Benchmarks

This is a suite of runtime benchmarks that are used to assign a weight value to every host function available to contracts and to each supported wasm instruction. These benchmarks are defined in this file. Have a look at #10266 to find out how to run these benchmarks with the benchbot and what the current diff to wasmi is.

Insights from those benchmarks:

  • Host functions are roughly double the weight on wasmer (see, for example, the seal_caller benchmark)
  • There is a 50µs static overhead (see the call benchmark).
  • Instruction weights are much lower (roughly 1/8)

End to end benchmarks

The end to end benchmarks are also runtime benchmarks, but they are flagged as extra so they don't become part of the emitted weight file. They are just for manual execution and inspection.

Instead of using procedurally generated fixtures like the other benchmarks, they use real-world contracts written in ink! or solang. For now there are only two: benchmarking a transfer on an ink! ERC20 contract and on a solang ERC20 contract.

You can look into #10268 to see how to run these benchmarks. You will also find results from running those with wasmi and wasmer.

Looking at those benchmark results we can see that wasmer is 50% slower on the ink! example and 78% slower on the solang example.

Conclusion

Instruction performance of wasmer is better than wasmi (including compilation overhead of wasmer). However, in an end-to-end benchmark wasmi is still faster. We need to find out why this is and resolve the issue if possible.

Goal

The goal of this issue is to find out why those end-to-end benchmarks report results that are not in favor of wasmer. If the reason is some resolvable inefficiency in either wasmer or the wasmer integration, it should be fixed and the new results should be reported here. Those fixes should then be upstreamed to substrate and wasmer.

If the conclusion is that we cannot outperform interpretation for typical smart contract workloads, this is also a valid result. This needs to be backed up with additional end-to-end benchmarks, though. Maybe there are some workloads where it outperforms interpretation (loop heavy contracts) but not all. A solution where we offer two execution engines might be something to think about then.

Steps

  • Change the bench bot so that cargo features can be supplied (this allows running benchmarks for wasmer without changing any code)
  • Compare the time it takes to set up an instance vs. invoking it on wasmer vs. wasmi. We want to find out the static overhead (not depending on contract size) of creating an instance on wasmer vs. wasmi.
@athei athei added I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task. U1-asap No need to stop dead in your tracks, however issue should be addressed as soon as possible. labels Nov 17, 2021
@athei athei added this to To Do (Ordered by priority) in Smart Contracts via automation Nov 17, 2021
@athei athei moved this from To Do (Ordered by priority) to In Progress in Smart Contracts Dec 7, 2021
@athei athei moved this from In Progress to To Do (Ordered by priority) in Smart Contracts Feb 3, 2022
@0x7CFE
Contributor

0x7CFE commented Apr 28, 2022

Intermediate results so far.

tldr: wasmer is ~twice as slow at instantiating modules and ~twice as fast at executing calls when compared to wasmi.

The Setup

I used the tracing library to instrument and measure timings of various parts of Substrate during transaction dispatch, such as instance creation, execution, parameter marshalling, etc. See ca2b249
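
For reference, the instrumentation boils down to wrapping the interesting code in tracing spans, roughly like this (a minimal sketch; the span names, targets and function signatures are illustrative, not the exact ones from ca2b249):

```rust
use tracing::{info_span, instrument};

// Annotating a function with `#[instrument]` records a span covering its whole
// execution; the `target` lets it be enabled via `-lsandbox=trace`.
#[instrument(target = "sandbox", level = "trace", skip(wasm_code))]
fn instantiate(wasm_code: &[u8]) {
    // ... create the backend instance ...
}

// A span can also be opened manually around any region of interest.
fn invoke() {
    let span = info_span!(target: "sandbox", "invoke");
    let _guard = span.enter();
    // ... dispatch the exported function ...
}
```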

I used the flipper example as a testbed and isolated a single call to Flipper::flip to get an idea of how wasmi and wasmer compare. Of course, a proper comparison should aggregate data over several calls, but the gathered data was distinct enough to analyze as is.

I disabled the embedded executor, so both wasmi and wasmer were executed through the sandbox API from the Substrate client. That way the control flow is as close as possible between the two backends and we can directly compare the results.

I executed the node using the CLI below, but for the wasmi run I hacked client/executor/wasmtime/src/host.rs like this:

 	pub fn new(allocator: FreeingBumpHeapAllocator) -> Self {
 		HostState {
 			sandbox_store: SandboxStore(Some(Box::new(sandbox::Store::new(
-				sandbox::SandboxBackend::TryWasmer,
+				sandbox::SandboxBackend::Wasmi,
 			)))),

cargo run --release --features=wasmer-sandbox -- --dev --execution=wasm --no-hardware-benchmarks -lsandbox=trace -lwasmer=debug -lexec=debug --tracing-targets=wasmer=trace,sandbox=trace,inst=trace

I also tried to measure the whole lifetime of sp_sandbox::SandboxInstance, but that requires using tracing inside of the runtime. I tried using sp-sandbox there but failed to get any output, even with the with-tracing feature enabled.

The Results

These two files are part of the node log during the dispatch of the Flipper::flip transaction. Important lines are surrounded by newlines. Best viewed side by side.

https://gist.github.com/0x7CFE/b7db97b07620aca2fa860c2296604cd0#file-wasmi2-txt
https://gist.github.com/0x7CFE/b7db97b07620aca2fa860c2296604cd0#file-wasmer2-log

Note that the ERROR log level was chosen just for visibility. It does not affect the test in any other way.
Also note that tracing spans are nested, so if span A contains span B, B's log line appears before A's.

Result Interpretation

Here I show notable lines from the files above with my commentary. See the whole log for the full picture. To simplify reading I removed irrelevant info and formatted time data with spaces (1234 → 1 234). Wasmi goes first, wasmer next.

During the module initialization phase we clearly see that wasmi is way ahead of wasmer and is almost twice as fast. This is not surprising at all, since wasmer needs to perform the computationally expensive task of compiling wasm to native machine code, whereas wasmi just transpiles the instructions into a format that is better suited for immediate execution:

16:41:54.106 sc_tracing: sc_executor_common::sandbox::wasmi_backend: instantiate, time: 635 291, id: 13511348637925378, values: len=26132
16:32:39.108 sc_tracing: sc_executor_common::sandbox::wasmer_backend: instantiate, time: 1 889 049, id: 13511348637925378, values: len=26132
16:41:54.106 sc_tracing: sc_executor_common::sandbox: instantiate, time: 661 333, id: 20266748078981121, values: len=26132
16:32:39.108 sc_tracing: sc_executor_common::sandbox: instantiate, time: 1 924 451, id: 20266748078981121, values: len=26132

However, once the modules are initialized the situation is very much the reverse. Here's an example of handling a host function:

16:41:54.106  sc_tracing: sc_executor_common::sandbox::wasmi_backend: SandboxContext::invoke, time: 12 338, id: 524669906344476676, values: index=226
16:32:39.108 sc_tracing: sc_executor_common::sandbox::wasmer_backend: SandboxContext::invoke, time: 7 079, id: 524669906344476676, values: index=226
16:41:54.106 sc_tracing: sc_executor_common::sandbox::wasmi_backend: invoke_index, time: 40 192, id: 524669906344476675, values: index=0, args="RuntimeArgs([I32(142220)])"
16:32:39.108 sc_tracing: sc_executor_common::sandbox::wasmer_backend: wasmer::Function closure, time: 18 164, id: 524669906344476675, values: index=226

The topmost invoke (near the end of the log file) is also very much in favor of wasmer:

16:41:54.109 sc_tracing: sc_executor_common::sandbox::wasmi_backend: invoke, time: 2 803 069, id: 15763148451610626, values: state=1020704, export_name="call", args="[]"
16:32:39.109 sc_tracing: sc_executor_common::sandbox::wasmer_backend: invoke, time: 1 455 830, id: 15763148451610626, values: _state=1020704, export_name="call", args="[]"
16:41:54.109 sc_tracing: sc_executor_common::sandbox: invoke, time: 2 815 407, id: 24770347706351617, values: state=1020704, export_name="call", args="[]"
16:32:39.109 sc_tracing: sc_executor_common::sandbox: invoke, time: 1 464 897, id: 24770347706351617, values: state=1020704, export_name="call", args="[]"

Again, I should note that these measurements were based on the execution of a single transaction, so please take them with a grain of salt. Still, it contains a few dozen invocations and, I think, shows pretty convincingly that wasmer is good at executing stuff, but really falls behind during the instantiation phase.

So, what gives? Unfortunately, our current setup is pretty straightforward: we instantiate the module right before making the call:

// Instantiate the instance from the instrumented module code and invoke the contract entrypoint.
let mut runtime = Runtime::new(ext, input_data, memory);
let result = sp_sandbox::default_executor::Instance::new(&code, &imports, &mut runtime)
	.and_then(|mut instance| instance.invoke(function.identifier(), &[], &mut runtime));

So, even though wasmer is able to execute code fast, it makes no practical difference in the current setup. This is especially visible in the smart contract scenario, where the code is small and often contains only a tiny number of instructions.

Thoughts and Ideas

  1. Research why wasmer init takes so long. Is it really because of compilation or maybe something else?
  2. Think about instance caching. If we were able to cache instantiated wasmer modules we would be able to execute them lightning fast. We already have an instrumentation cache, maybe we can cache the wasmer instance as well? (See the rough sketch after this list.)
  3. Concurrent execution? Theoretically we could run both backends concurrently and use whichever comes first.
  4. Chunked execution/compiling akin to Loadable functions support wasmi-labs/wasmi#262? If wasmer compiled only what it actually needs, the module would be instantiated faster.
  5. Maybe wasmer is slow at its own initialization rather than at module instantiation? Could we preload or warm up its internals somehow?
  6. On-demand JIT compiling per function/module? Or even something weirder, like the heterogeneous approach in the HotSpot JVM (interpreter + compiler).
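
To make idea 2 a bit more concrete, here is a rough sketch of what a compiled-module cache keyed by code hash could look like (all names are hypothetical, and the sketch ignores the determinism and weight-accounting concerns discussed below):

```rust
use std::collections::HashMap;
use wasmer::Module;

/// Hypothetical cache of compiled wasmer modules, keyed by the hash of the
/// instrumented wasm blob.
struct ModuleCache {
    modules: HashMap<[u8; 32], Module>,
}

impl ModuleCache {
    /// Compilation (the expensive part) only happens on a cache miss; later
    /// instantiations of the same code reuse the already compiled module.
    fn get_or_compile(
        &mut self,
        code_hash: [u8; 32],
        compile: impl FnOnce() -> Module,
    ) -> &Module {
        self.modules.entry(code_hash).or_insert_with(compile)
    }
}
```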

Further Work

  1. Proper setup with aggregation of several calls.
  2. End-to-end measuring

CC @dvdplm, @athei

@dvdplm
Contributor

dvdplm commented Apr 28, 2022

I tried using sp-sandbox there but failed to get some output even with with-tracing feature enabled.

This makes me a bit sad. Is there an issue in substrate for this? I believe others have struggled with this as well in the past (cc @Imod7 perhaps?).

isolated a single call to Flipper::flip to get an idea of how wasmi and wasmer compare.

I don't know how representative this call is, but imo it'd be good to repeat the benchmark with a few other function calls just to ensure that the overall conclusion is not overly skewed by the specific code in flip.

Research why wasmer init takes so long. Is it really because of compilation or maybe something else?

This seems like the most pressing question. Is it easy to measure the actual compilation time vs compiler startup?

@0x7CFE
Contributor

0x7CFE commented Apr 28, 2022

I don't know how representative this call is, but imo it'd be good to repeat the benchmark with a few other function calls just to ensure that the overall conclusion is not overly skewed by the specific code in flip.

Well, it's the simplest contract call I can imagine, so I essentially wanted to draw a baseline here. It could easily happen that for the simplest of calls wasmi is faster because it exits early. However, what we see is that even on such a trivial method, wasmer still executes faster than wasmi. Also, don't get me wrong, even such a tiny method results in 77 calls to SandboxContext::invoke (mostly index 226), as counted with cat wasmer2.txt | grep SandboxContext::invoke | wc -l. So it's definitely not a noop.

On the other hand, we surely need to measure something like ERC20 for a real-world scenario, but I'm expecting it to be mostly the same: slower init, faster execution.

This seems like the most pressing question. Is it easy to measure the actual compilation time vs compiler startup?

That's a question on its own.

P.S.: I've added some more entries to the thoughts and ideas section.

@athei
Member Author

athei commented Apr 30, 2022

This suggests that we can rule out our theory that it is due to slow execution of host functions by wasmer. It is all about the startup, which was the most likely scenario anyway. Transformation into machine code takes time. I would still like to see this repeated with an ink! ERC20 transfer, though. Also, having this somehow integrated into substrate or some other repo so it is easily repeatable would be very nice. I don't want all your work to go to waste only for us to have to redo it the next time we need it. I think it would be nice to have this as some kind of regression test (it doesn't need to be completely automated).

Research why wasmer init takes so long. Is it really because of compilation or maybe something else?

I think that this is the next question that you should investigate. I guess this would mean profiling wasmer itself. All other thoughts about cutting down compile times are useless if we don't know whether that is actually the bottleneck.

So, what gives? Unfortunately, our current setup is pretty straightforward: we instantiate the module right before making the call:

There is no other point in time at which we could possibly instantiate the module. We don't know which code to instantiate until someone makes a call (very different from the runtime itself). We can't do some basic LRU cache of modules, if that is what you are thinking about: the cache can be gamed (warm/cold can be different for different validators). Also, good luck putting weights on that.

Think about instance caching. If we were able to cache instantiated wasmer modules we would be able to execute them lightning fast. We already have an instrumentation cache, maybe we can cache the wasmer instance as well?

Caching machine code is problematic because it is platform dependent and the cache lives within the runtime. If we could introduce an IR to wasmer singlepass that is trivially translatable to all supported machines, that could be a target for caching. Probably some basic register machine byte code. SSA is not possible because there is no O(1) transformation from wasm. On the other hand, not having an IR is kind of the point of a single pass compiler, in order to reduce compilation times by leaving out this step. But having two O(1) transformations could be worth it for the sake of supporting caching.

Chunked execution/compiling akin to wasmi-labs/wasmi#262? If wasmer compiled only what it actually needs, the module would be instantiated faster.

Having a dynamic recompiler that only compiles functions on entry (or even smaller chunks) could cut down on compile times, if that is really what makes it expensive. The context switch between interpretation and compilation could pessimise performance, though. You would still pass in the whole module. I think this is essentially the same as your HotSpot idea.

@0x7CFE
Contributor

0x7CFE commented May 4, 2022

Ok, so I've made some measurements that I'm now confident of.

Please note that I redid the tests using manual measurement, since I discovered that for some reason tracing introduced significant overhead, especially in the hot section of the code (invoke_index), where I tried to separately measure the time it takes to prepare, invoke and finalize the call. Overall the tracing output was quite unpredictable and not very reliable (with the exception of the #[instrument] macro), so I decided to ditch it.
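
The manual measurement is essentially just wrapping the interesting regions in std::time::Instant, roughly like this (a sketch only; the Backend trait and its methods are hypothetical placeholders for the wasmi/wasmer sandbox backends, not the actual code from the branch):

```rust
use std::time::Instant;

// Hypothetical stand-ins for the wasmi/wasmer sandbox backends.
trait Backend {
    type Instance: ContractInstance;
    fn instantiate(&mut self, wasm_code: &[u8]) -> Self::Instance;
}

trait ContractInstance {
    fn invoke(&mut self, export: &str);
}

fn timed_dispatch<B: Backend>(backend: &mut B, wasm_code: &[u8]) {
    // Measures module creation + imports + instantiation.
    let started = Instant::now();
    let mut instance = backend.instantiate(wasm_code);
    println!("instantiate took {:?}", started.elapsed());

    // Measures execution of the exported `call` function only.
    let started = Instant::now();
    instance.invoke("call");
    println!("call took {:?}", started.elapsed());
}
```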

Long story short: the performance is comparable, with the exception of module instantiation, which is considerably slower on Wasmer. Everything else is more or less on par.

The tests were performed on the following branch: https://github.com/paritytech/substrate/commits/dk-tracing-executor.

Here are links to recorded Wasmi and Wasmer sessions where I did

Here are the data grepped from the logs:

Calls to deploy

| wasmi | wasmer |
| --- | --- |
| 16.937696ms | 16.041995ms |

Calls to instantiate

| wasmi | wasmer |
| --- | --- |
| 892.447µs | 3.376242ms |
| 376.753µs | 1.215131ms |
| 858.383µs | 3.442349ms |
| 835.121µs | 3.695362ms |
| 813.823µs | 1.323876ms |
| 258.299µs | 1.348852ms |
| 276.585µs | 3.898192ms |

Calls to call

| wasmi | wasmer |
| --- | --- |
| 1.540576ms | 1.436548ms |
| 27.965549ms | 21.100177ms |
| 21.856403ms | 12.622645ms |
| 20.340723ms | 22.064872ms |
| 1.158249ms | 1.497089ms |
| 1.22862ms | 1.136338ms |

All things combined make me think that the bottleneck is not in the execution engine, but in the way we dispatch wasm calls. JIT shines when it's able to fully utilize modern super-scalar hardware, especially speculative execution and cache prefetch. In our case it's impossible to do so because execution is jumping back and forth between substrate core and wasm. Of course it's just my speculation, but I can't imagine other reasons why interpreter and compiler show similar results.

Interestingly, Wasmer is generally faster when doing invoke_index, but it's still not enough to compensate for the huge time loss during module instantiation.

If only we could cache instantiated modules then we'd probably be able to squeeze some performance out of Wasmer. Still, I don't think it's practical given that the performance win is marginal.

P.S.: I did another test session where I removed all logs from invoke_index, which is a pretty hot point in the code, but the results remained mostly the same.

CC @dvdplm @athei

@athei
Member Author

athei commented May 5, 2022

All things combined make me think that the bottleneck is not in the execution engine, but in the way we dispatch wasm calls.

I don't know how you came to this conclusion (assuming that by "wasm calls" you mean imported function calls). The values you measured for invoke_index even contradict this (wasmer isn't slower there). When I look at instantiate I see that wasmi pulls ahead, which is easily explained by the fact that instantiate (which just does some basic initialization) executes much less of the contract code than the call (which does an actual transfer). The ratio of executed code to compiled code is very much in favor of wasmi for instantiate.

JIT shines when it's able to fully utilize modern super-scalar hardware, especially speculative execution and cache prefetch. In our case it's impossible to do so because execution is jumping back and forth between substrate core and wasm.

Again, this doesn't necessarily follow from the numbers observed. Compilation performance can be the culprit.

Of course it's just my speculation, but I can't imagine other reasons why interpreter and compiler show similar results.

From what I am seeing here I think the most likely scenario is the obvious one: faster execution of the compiled code cannot recuperate the big up-front cost of code compilation. This is especially true if just a small part is executed (compare instantiate vs. call) because wasmer always compiles the whole module.

If only we could cache instantiated modules then we'd probably be able to squeeze some performance out of Wasmer. Still, I don't think it's practical given that the performance win is marginal.

If we assume that compilation is the culprit, then caching would make all the difference.

Thoughts and Ideas

You asked me to comment on them. I will do this here. But in general I think this issue should only be about assessing the status quo. All of those ideas are massive undertakings and shouldn't be tackled on a hunch. First, we need to know what is going on. Otherwise we could be optimizing at the completely wrong end. We need to verify our assumption.

  1. Yes, absolutely, please find that out. This is very useful information and it would be awesome if we could get our hands on it.
  2. Discussed this in my previous comment.
  3. Having concurrency at the runtime level is currently not supported by substrate. Also, it is very difficult to assign weights to that if we only "let things run". We can't just do something without knowing what its cost is. But in general, having some way to use interpreter and compiler for different use cases could make sense. Maybe just offload that to the contract authors to assess performance and select the one they want. That would make it deterministic. But I consider the other ideas much more promising.
  4. Discussed this in my previous comment.
  5. It would be awesome if you could find that out. Then we could use that API to warm it up within the client and keep it in memory. Could be related to memory allocation. I know this is a big topic for the wasmtime executor: we don't re-zero all the memory when we recycle instances but use memory mapping operations to zero it in constant time. However, we can't use the same approach here because we cannot cache instances, as those are tied to a fixed module. In contracts we don't know the modules a-priori as we do for the runtime. So in case we go down this route it needs to be done differently, e.g. caching only the linear memory and not the instances. However, linear memory is very small for contracts, so it is unlikely that this is the culprit. This all assumes that it is not compilation that is causing this.
  6. I consider this to be kind of the same thing as 4. We can't make it as clever as HotSpot because it all needs to be deterministic. We need a mapping from compiled code size in bytes to Weight, and a worst case estimation for that mapping.

@0x7CFE
Contributor

0x7CFE commented May 5, 2022

From what I am seeing here I think the most likely scenario is the obvious one: faster execution of the compiled code cannot recuperate the big up-front cost of code compilation. This is especially true if just a small part is executed (compare instantiate vs. call) because wasmer always compiles the whole module.

Yes, you are absolutely right. Maybe I failed to express my point, but my conclusions are mostly the same.

I don't know how you came to this conclusion (assuming that by "wasm calls" you mean imported function calls). The values you measured for invoke_index even contradict this (wasmer isn't slower there). When I look at instantiate I see that wasmi pulls ahead, which is easily explained by the fact that instantiate (which just does some basic initialization) executes much less of the contract code than the call (which does an actual transfer). The ratio of executed code to compiled code is very much in favor of wasmi for instantiate.

My understanding is that compiled code should be an order of magnitude faster which is definitely not the case as we see in the measurements. Yes, invoke_index is generally faster when executed by wasmer, but it's not that much faster. So I thought maybe something is preventing wasmer from executing at full speed. Something like the need to perform host function calls. We even read and write sandbox memory using host functions.

@athei
Member Author

athei commented May 5, 2022

My understanding is that compiled code should be an order of magnitude faster which is definitely not the case as we see in the measurements.

Correct me if I am wrong: your measurements always measure compilation + execution. This means that even if the execution is much faster, it could still be offset by some very slow compilation. So you can't conclude from that that the execution isn't a magnitude faster. For all we know it could be. As a matter of fact, that is suggested by the synthetic instruction benchmarks I posted in the top post.
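
As a rough illustration using the earlier flipper log excerpt: wasmi needed ~0.66 ms for instantiate plus ~2.82 ms for the top-level invoke (~3.5 ms end to end), while wasmer needed ~1.92 ms plus ~1.46 ms (~3.4 ms end to end). Even though wasmer's invoke is almost twice as fast there, the up-front compilation keeps the end-to-end totals roughly level.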

Yes, invoke_index is generally faster when executed by wasmer, but it's not that much faster.

Same speed is completely fine there assuming that the rest of the code executes much faster. Assuming that imported function calls are roughly the same that should still yield much better performance. But we can't observe any speedup and I still think that compilation costs are the most likely explanation: It is the most obvious and doesn't contradict any measurement.

So I thought maybe something is preventing wasmer from executing at full speed. Something like the need to perform host function calls. We even read and write sandbox memory using host functions.

I don't get why you think it should be that. I mean, it could be, but compilation seems more likely. You could test this assumption by writing two contract calls, where one does many more host function calls and the other many more arithmetic instructions, stack accesses, etc.
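
For illustration, such an A/B pair could look roughly like this as two ink! messages (a sketch only; the ink! version, the host function exercised in the loop and the loop bounds are assumptions):

```rust
#![cfg_attr(not(feature = "std"), no_std)]

use ink_lang as ink;

#[ink::contract]
mod ab_test {
    #[ink(storage)]
    pub struct AbTest {}

    impl AbTest {
        #[ink(constructor)]
        pub fn new() -> Self {
            Self {}
        }

        /// Dominated by host function calls: every iteration crosses the
        /// wasm <-> client boundary (block_number is a host function).
        #[ink(message)]
        pub fn host_heavy(&self, n: u32) -> u32 {
            let mut acc = 0u32;
            for _ in 0..n {
                acc = acc.wrapping_add(self.env().block_number());
            }
            acc
        }

        /// Dominated by plain wasm instructions: arithmetic and stack access
        /// only, no host calls inside the loop.
        #[ink(message)]
        pub fn compute_heavy(&self, n: u32) -> u32 {
            let mut acc = 1u32;
            for i in 0..n {
                acc = acc.wrapping_mul(31).wrapping_add(i);
            }
            acc
        }
    }
}
```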

@0x7CFE
Contributor

0x7CFE commented May 5, 2022

Correct me if I am wrong: your measurements always measure compilation + execution. This means that even if the execution is much faster, it could still be offset by some very slow compilation. So you can't conclude from that that the execution isn't a magnitude faster.

I could be wrong too, but my current understanding is that wasmer does the full compilation during module creation, which accepts the wasm bytecode and a Store object. There it does sanity checks and the actual compilation in wasmer::sys::module::Module::from_binary_unchecked. Theoretically something could be done during the first invocation in a JIT manner, but I doubt that's the case with Singlepass.

So, effectively, everything related to native code generation is done during module creation. Everything else deals with the already compiled native module.

I have done some quick checks and indeed, the data backs up my theory:

| module creation | module imports | module instantiation | total |
| --- | --- | --- | --- |
| 3.025098ms | 34.248µs | 85.161µs | 3.144507ms |
| 1.724811ms | 19.01µs | 40.881µs | 1.784702ms |
| 3.437267ms | 43.125µs | 95.834µs | 3.576226ms |
| 3.639528ms | 38.441µs | 87.024µs | 3.764993ms |
| 3.662501ms | 51.664µs | 92.641µs | 3.806806ms |
| 2.62689ms | 31.589µs | 80.124µs | 2.738603ms |
| 1.768464ms | 19.011µs | 46.197µs | 1.833672ms |

Please see the instantiate function. First, it creates the wasmer module by passing the wasm bytecode to it, then it populates the import tables to tie the modules together and expose the host functions. Finally, it instantiates the module and creates our BackendInstance.
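
For reference, the phases map roughly onto the wasmer API like this (a sketch assuming the wasmer 2.x Singlepass/Universal setup; exact types and signatures may differ between versions and from the actual sandbox backend code):

```rust
use wasmer::{imports, Instance, Module, Store};
use wasmer_compiler_singlepass::Singlepass;
use wasmer_engine_universal::Universal;

fn instantiate(wasm_code: &[u8]) -> anyhow::Result<Instance> {
    // Module creation: Singlepass compiles the whole wasm blob to native
    // code here. This is where most of the time is spent.
    let store = Store::new(&Universal::new(Singlepass::default()).engine());
    let module = Module::new(&store, wasm_code)?;

    // Populating the imports (host functions, memory) is cheap.
    let import_object = imports! {};

    // Instantiating the already compiled module is cheap as well.
    Ok(Instance::new(&module, &import_object)?)
}
```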

Note that the majority of the time is spent during module creation, where the compilation takes place. Also note that this procedure is pure in the sense that it does not depend on anything but the wasm blob. If only we could cache it, then all subsequent module instantiations would happen in the microsecond range. Even if we cached the compiled module for just one block, we would already be able to shave off some time. Even in my test scenario the module gets compiled 7 times.

@athei
Member Author

athei commented May 5, 2022

Okay, I got confused because I didn't understand what is shown in the tables. We cleared that up over Element:

  • deploy: calls the deploy exported function of a contract.
  • instantiate: calls the instantiate function of the wasmer backend (compiles the code and creates the instance). No contract code execution.
  • call: calls the call exported function of a contract; does an ERC20 transfer.

Given this information it is indeed curious why, even with all the code compiled, wasmer doesn't outperform wasmi by a huge margin. Compilation is already factored out.

Wasmer looks really bad: it takes a long time to create the module and then isn't any faster at execution.

But why would it be the host function calls? You measured them as roughly equal in time. I guess the A/B test I mentioned would be the next step: Comparing a call with many vs. few host function calls.

@0x7CFE
Contributor

0x7CFE commented May 6, 2022

But why would it be the host function calls? You measured them as roughly equal in time. I guess the A/B test I mentioned would be the next step: Comparing a call with many vs. few host function calls.

I ran some more tests.

The deploy call takes 16ms. I added together the time of all invoke_index calls that happen during the deploy and they sum to 13420.98µs ≈ 13.4ms. So the absolute majority of the time is spent in those 1600 host function calls.

In such a scenario, even if wasmer performed 1000 times faster than wasmi, it could only affect the remaining ≈2.5 ms. Even in the ideal case the deployment would take no less than 13.4ms + ε.

On the other hand, for other calls it's not that bad.

Overall I think that host function overhead explains why wasmer does not show a 10x speedup in all tests. Yet, that alone cannot explain why wasmer is sometimes slower even if we factor out the compilation.

@athei
Member Author

athei commented May 6, 2022

1600 host function calls for a single deploy of an ERC20 contract? That seems too much. Or is it a batch of calls?

Overall I think that host function overhead explains why wasmer does not show a 10x speedup in all tests. Yet, that alone cannot explain why wasmer is sometimes slower even if we factor out the compilation.

Agreed. It does explain why it isn't able to outperform it by a huge margin, but not why it is slower. Except if the host functions were much slower on wasmer. Maybe the slowness isn't correctly captured in your measurements because the context switch makes the following non-host-function instructions slower (cold caches due to a lot of code being shuffled around). I think you already mentioned this theory somewhere.

@0x7CFE
Contributor

0x7CFE commented May 6, 2022

1600 host function calls for a single deploy of an ERC20 contract? That seems too much. Or is it a batch of calls?

Unfortunately, not. It's for a single deployment. I also think that's a bit too much but I don't know the contract internals. Wasmi log shows the same story so it's definitely not Wasmer's fault.

Also, the call that takes 22.064872ms in the table above makes even more: 2244 host function calls, which take 8.4ms in total. So it's not specific to deploy alone.

Maybe the slowness isn't correctly captured in your measurements

Yes, I think that is the case, not to mention that the massive log output introduces delays of its own. I think we now need to aggregate data over many invocations, with and without host function calls, to say for sure.

@athei
Member Author

athei commented May 6, 2022

Unfortunately, not. It's for a single deployment. I also think that's a bit too much but I don't know the contract internals. Wasmi log shows the same story so it's definitely not Wasmer's fault.

I assume it is the gas metering then. It generates one host function call for every basic block. I think what we can learn from that is that adding disk caching to wasmer (with an IR language) + implementing paritytech/wasm-instrument#11 could make wasmer competitive.

@xgreenx
Contributor

xgreenx commented Jun 6, 2022

I assume it is the gas metering then. It generates one host function call for every basic block. I think what we can learn from that is that adding disk caching to wasmer (with an IR language) + implementing paritytech/wasm-instrument#11 could make wasmer competitive.

To confirm that assumption @0x7CFE can disable gas metering and do one more measurement.

Also, to avoid the influence of debug logs, you can measure the total time of the execution and the time to init the executor. The difference is the time of the call execution.

@0x7CFE
Contributor

0x7CFE commented Jun 7, 2022

To confirm that assumption @0x7CFE can disable gas metering and do one more measurement.

I no longer work on this.

@athei
Member Author

athei commented Jun 7, 2022

Yes, I also thought of this. Redoing it with gas metering disabled would be interesting to confirm the theory.

@athei athei unassigned 0x7CFE Jul 4, 2022
@athei athei added U2-some_time_soon Issue is worth doing soon. and removed U1-asap No need to stop dead in your tracks, however issue should be addressed as soon as possible. labels Jul 4, 2022
@athei
Member Author

athei commented Jul 4, 2022

No one is working on this anymore. De-prioritized because we are going with in-runtime wasmi for now, which will get some performance updates soon.

@athei athei removed the U2-some_time_soon Issue is worth doing soon. label Jul 28, 2022
@athei athei added the Z3-substantial Can be fixed by an experienced coder with a working knowledge of the codebase. label Jul 28, 2022
@bkchr
Member

bkchr commented Oct 18, 2022

With this PR here: #12501 we can probably close this?

@athei
Member Author

athei commented Oct 18, 2022

Yea. I don't think wasmer is happening any time soon.

@athei athei closed this as completed Oct 18, 2022