Enable the bulk memory operations WASM feature #36

Open
Tracked by #10707
koute opened this issue Sep 8, 2022 · 5 comments
Labels
I9-optimisation (An enhancement to provide better overall performance in terms of time-to-completion for a task.) · T1-FRAME (This PR/Issue is related to core FRAME, the framework.)

Comments

@koute (Contributor) commented Sep 8, 2022

(This is a subissue of paritytech/substrate#10707; I'm creating a new issue to focus on just the bulk memory operations.)

We should seriously consider enabling the bulk memory operations feature in our WASM executor and our runtimes. We have recently discovered that the contracts' benchmarks can currently spend up to 75% of their time inside the WASM calling memset; on top of that, since wasmtime doesn't cache-align loops, the performance of memset/memcpy/etc. can vary widely depending on how the instructions happen to be laid out in memory, becoming up to ~40% slower than the cache-aligned case.

I've done a quick test to compare the performance of the pallet_contracts/seal_return_per_kb benchmark as it stands today; here's its execution time in each case:

  • without bulk memory operations, loop not cache-aligned: 362598
  • without bulk memory operations, loop cache-aligned: 230293
  • with bulk memory operations: 113970

Not only does this cut the benchmark's execution time roughly in half, it should also prevent wasmtime's codegen roulette from regressing the performance.

Now, we could probably just optimize this on wasmi's side so that it doesn't preallocate and clear a 1MB buffer on each invocation (assuming this hasn't been done already; we're still using quite an old version of wasmi, and I haven't checked the newest version). Nevertheless, I think this has demonstrated that there's concrete value in enabling this extension - even if wasmi is fixed, something else could conceivably allocate large buffers and tank the performance, and then we'd be back to square one. There's also potential for this extension to speed things up in general, considering how widely memset/memcpy are used under the hood. I don't think it's worth holding back on this extension anymore; we should just pull the trigger and enable it. In the worst case it won't make any difference, in the best case it can significantly speed things up.
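To make the connection to bulk memory operations concrete, here's an illustrative Rust snippet (not the actual wasmi code path, just the general pattern): clearing or zero-allocating a large buffer compiles to an in-module memset-style loop today, whereas with -C target-feature=+bulk-memory the same code can lower to a single memory.fill instruction that the engine executes natively.

```rust
// Illustrative only: the kind of per-invocation buffer handling that ends up
// as a large memset inside the wasm module. Without bulk memory operations
// rustc/LLVM emit a byte/word loop for this; with `+bulk-memory` it can
// become a single `memory.fill`.
fn clear_scratch(buf: &mut [u8]) {
    buf.fill(0);
}

fn fresh_scratch() -> Vec<u8> {
    // 1 MiB of zeroed scratch space allocated on every call; the zeroing is
    // another memset-shaped hot spot.
    vec![0u8; 1024 * 1024]
}
```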

What needs to be done? (high level plan)

  1. Add support for the bulk memory ops to wasmi, wasmi-validation and wasm-instrument, if it hasn't been done yet.
  2. Enable the bulk feature of parity-wasm.
  3. Call config.wasm_bulk_memory(true) when initializing wasmtime (see the sketch after this list).
  4. Do a burn-in with the runtime compiled with -C target-feature=+bulk-memory.
  5. Release a new version of polkadot.
  6. Wait for a few releases. (Essentially the same as when introducing a new host function.)
  7. Permanently enable the -C target-feature=+bulk-memory flag when building runtimes.
  8. ???
  9. Profit.
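For reference, a minimal sketch of what step 3 amounts to, using the standard wasmtime embedding API (the exact integration point inside sc-executor-wasmtime is not spelled out here); steps 4/7 are noted in a comment:

```rust
use wasmtime::{Config, Engine};

// Sketch only: enable the bulk memory proposal on the wasmtime engine.
// The runtime side (steps 4 and 7) is a matter of building the runtime with
// RUSTFLAGS="-C target-feature=+bulk-memory" so rustc/LLVM actually emit
// memory.fill / memory.copy instead of in-module loops.
fn make_engine() -> Engine {
    let mut config = Config::new();
    config.wasm_bulk_memory(true);
    Engine::new(&config).expect("valid engine configuration")
}
```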

Anything else I'm missing?

cc @pepyakin @athei @Robbepop

koute added the I9-optimisation label on Sep 8, 2022
@Robbepop (Contributor) commented Sep 8, 2022

Now, we could probably just optimize this on wasmi's side so that it doesn't preallocate and clear a 1MB buffer on each invocation (assuming this hasn't been done already; we're still using quite an old version of wasmi, and I haven't checked the newest version).

The newer wasmi versions allow setting the initial and maximum value stack length in the Config, similar to how the Wasmtime Config works. Therefore this wasmi-specific problem should be resolved once we upgrade to wasmi 0.16.0 or above.
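For illustration, a rough sketch of that configuration assuming the wasmi 0.16+ API; the type and method names (StackLimits, set_stack_limits) as well as the concrete limit values are assumptions and should be checked against the docs of whichever wasmi version we upgrade to:

```rust
use wasmi::{Config, Engine, StackLimits};

// Rough sketch, not verified against a specific wasmi release: bound the
// value stack instead of preallocating a large buffer per invocation.
fn make_wasmi_engine() -> Engine {
    let limits = StackLimits::new(
        1024,        // initial value stack height (assumed unit: entries)
        1024 * 1024, // maximum value stack height (assumption)
        1024,        // maximum recursion depth (assumption)
    )
    .expect("stack limits are consistent");
    let mut config = Config::default();
    config.set_stack_limits(limits);
    Engine::new(&config)
}
```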

Add support for the bulk memory ops to wasmi, wasmi-validation and wasm-instrument, if it hasn't been done yet.

We probably do not need support for the Wasm bulk-memory proposal in wasmi, since it is only relevant if we want to keep wasmi as a Substrate runtime execution engine, and I honestly see no reason why we should keep it. In the past it was useful when Wasmtime was not as stable as it is nowadays (or maybe for some other reason). That said, support for the Wasm bulk-memory proposal is planned for wasmi, since certain smart contracts could potentially benefit from it.

@pepyakin (Contributor) commented Sep 8, 2022

Citing the parent issue:

  • It would be great to have data on how exactly performance is improved. This would help to evaluate how much priority it has for implementation and how important it is for runtime writers to upgrade.
  • wasmi does not support most of the newest features. wasmtime is the primary engine for now, but it would still be good to have a second engine.
  • The Polkadot side should be taken into account. How would PVF execution be migrated? Or would those features not hit PVF?

Point 1 is clear. Re point 2, I think either way is fine. (UPD: Robin beat me to it, and I agree; I think getting rid of the wasmi executor is on the table.)

The PVF is a bit more complicated.

We need to consider that if we unconditionally enable wasm_bulk_memory it will be enabled for PVFs. There are two problems with that:

There is a problem with the upgrade. If we just YOLO enable it, an adversary could take advantage of that. Unfortunately, until #917 is landed our hands are tied.

That implies that the executor configuration should allow disabling bulk mem ops. It will be disabled for PVF execution.
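A purely hypothetical sketch of what such a toggle could look like; the struct and field names below are made up for illustration and are not an existing sc-executor / PVF host API:

```rust
// Hypothetical sketch: where exactly this lives (sc-executor-wasmtime
// semantics, PVF host config, ...) is not decided in this issue.
struct ExecutionSemantics {
    // `true` for runtime execution, kept `false` for PVF execution until the
    // upgrade story (#917) allows enabling it there as well.
    wasm_bulk_memory: bool,
}

fn apply_semantics(semantics: &ExecutionSemantics, config: &mut wasmtime::Config) {
    config.wasm_bulk_memory(semantics.wasm_bulk_memory);
}
```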

Since the Cumulus PDK uses the same binary for the Runtime and PVF, the parachains won't be able to take advantage of bulk mem ops. Parachains are the overwhelming majority of the users of Substrate, so the impact would be limited until we upgrade PVF.

Then, the blocker for enabling bulk mem ops for PVFs is the question of metering, which is likely coming at least for PVFs. That means we have to squash that concern, which was touched on in the parent issue.

@athei (Member) commented Sep 8, 2022

(UPD: Robin beat me to it, and I agree; I think getting rid of the wasmi executor is on the table.)

For the foreseeable future we will run contracts with an in-runtime wasmi.

@pepyakin (Contributor) commented Sep 8, 2022

Tangential? The wasmi executor refers to sc-executor-wasmi and not the sandbox backend.

@ggwpez (Member) commented Sep 8, 2022

This should be sanity-checked with something outside of pallet benchmarks, but the seal_return_per_kb numbers sound great.
Probably historic import times; there is a benchmark block command to measure re-import times of old blocks.
I will do so once you put up an MR 😄

the-right-joyce transferred this issue from paritytech/substrate on Aug 24, 2023
the-right-joyce added the T1-FRAME label and removed the T1-runtime label on Aug 25, 2023