Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chain-spec building failure due to WASM allocation size #5419

Open
gpestana opened this issue Aug 20, 2024 · 17 comments · May be fixed by #6081
Open

chain-spec building failure due to WASM allocation size #5419

gpestana opened this issue Aug 20, 2024 · 17 comments · May be fixed by #6081
Labels
T0-node This PR/Issue is related to the topic “node”.

Comments

@gpestana
Copy link
Contributor

I'm trying to build a chain-spec with a large number of nominators and validators for the multi-block election tests but given the amount of stakers to add, the chain-spec generation fails with:

❯ ~/cargo_target/debug/staking-node build-spec --disable-default-bootnode > chain-specs/staking-max.spec --raw
2024-08-20 14:58:04 Building chain spec
2024-08-20 14:58:07 going to fail due to allocating 54354570
Error: Service(Other("wasm call error Requested allocation size is too large"))

Is there any way to increase the WASM allocation size at the spec building time?

This issue seems to be similar to paritytech/substrate#11132 but at the time of chain-spec building, not block production.

@gpestana gpestana added the T0-node This PR/Issue is related to the topic “node”. label Aug 20, 2024
@michalkucharczyk
Copy link
Contributor

michalkucharczyk commented Aug 26, 2024

Could you try this? Seems that this error is reported when the allocation exceeds this value:

/// The maximum number of bytes that can be allocated at one time.
// The maximum possible allocation size was chosen rather arbitrary, 32 MiB should be enough for
// everybody.
pub const MAX_POSSIBLE_ALLOCATION: u32 = 33554432; // 2^25 bytes, 32 MiB

@michalkucharczyk
Copy link
Contributor

michalkucharczyk commented Aug 26, 2024

from what I know It is not possible to change this value during chain spec building.

@gpestana
Copy link
Contributor Author

@michalkucharczyk thanks for the tip, it also requires a few other changes but in the end I got it working. The solution for this has been posted in the stackexchange for the record https://substrate.stackexchange.com/questions/11863/allocate-extra-wasm-memory-to-generate-large-chainspecs/11864

@michalkucharczyk
Copy link
Contributor

Thank you for writing this down.

@kianenigma
Copy link
Contributor

Should we address this at a more fundamental level? We can't expect everyone to change the code.

All WASM memory limits should ideall be removed in chain spec generation and OCW code path.

@kianenigma kianenigma reopened this Sep 3, 2024
@bkchr
Copy link
Member

bkchr commented Sep 3, 2024

What you want is this: polkadot-fellows/RFCs#4 and then to move the allocator inside of the runtime.

@gpestana
Copy link
Contributor Author

@bkchr we've been discussing about increasing the max number of validators that can be registered in the system and be part of the snapshot. From a few experiments locally, the offchain miner panics with failure to allocating memory due to the current limits when calculating the election. I wonder if we could double the MAX_POSSIBLE_ALLOCATION to 64MiB and increase the max number of mem pages in the current hardcoded limits.

I'm now trying to improve the code so require less memory allocation, in any case, if we need to increase the limits, are there any thing to consider/side effects from increasing those limits?

@bkchr
Copy link
Member

bkchr commented Sep 12, 2024

We can not just bump these limits. This changes the behavior of the allocator. If possible, I would not like to touch this at all before we move this to the runtime.

I think I proposed this already somewhere, but can we for now just not change the way we register them at genesis? Either pass them with two fields into the runtime or use append to add them.

@gpestana do you know where exactly the allocation is failing?

@kianenigma
Copy link
Contributor

I think I proposed this already somewhere, but can we for now just not change the way we register them at genesis? Either pass them with two fields into the runtime or use append to add them.

The current issue is about OCW code path actually, no longer genesis, but @gpestana knows better and can point out exactly where we run OOM.

It is kind of silly to have the OCW/Genesis code path, which are by no means "consensus-critical" be subject to the memory limits that are afaik only arguable in consensus code paths. I hope we can hack around it.

In the old substrate days, I made a PR that never got merged, but it did something like this:

On the client side, we attached a enum Execution { Consensus, Offchain } to every wasm invocation. If Execution::Offchain, then these memory limits where a lot higher, or non existent. Would something like that make senese?

@bkchr
Copy link
Member

bkchr commented Sep 12, 2024

On the client side, we attached a enum Execution { Consensus, Offchain } to every wasm invocation. If Execution::Offchain, then these memory limits where a lot higher, or non existent. Would something like that make senese?

We have this, but even in the old days you did not modify these constants.

@gpestana
Copy link
Contributor Author

As a way to be more lenient wrt to memory constrains in the offchain execution mode, I'm parameterising the const MAX_ALLOCATING_MEMORY. This would unlock a couple of things: 1. chainspec generation won't fail for large states (e.g. init the chain with a staking state similar to currently in Polkadot) and to be more lenient on offchain staking-miners and other offchain workloads that can hit the constant memory limits imposed currently by the client.

My current direction is to expose those params in the CLI and overwrite the constants if the call context is of type CallContext::Offchain. The new parameters could be piggybacked into the struct HeapAllocStrategy, which is already taken in by the WasmExecutor::with_instance.

wdyt about this approach?

@gpestana
Copy link
Contributor Author

PR to address this #5419

@bkchr
Copy link
Member

bkchr commented Oct 16, 2024

It is kind of silly to have the OCW/Genesis code path, which are by no means "consensus-critical" be subject to the memory limits that are afaik only arguable in consensus code paths. I hope we can hack around it.

Just because you don't know how the allocator works, doesn't mean that the current way is "silly". As I said, you can not just change these constants without knowing what they are doing or how the allocator works.

When you change the allocator as done in #6081, it means the moment there is a two gigabyte allocation, this memory is never "freed" again. You will only be able to reuse this space. Maybe you are lucky and what you are doing works, but I clearly don't want to depend on luck for this kind of change. But for example for a Vec it allocates probably in multiple of two steps and for a 2 gigabyte vector. This means that the total number of allocations is reduced quite a lot.

What you want is this: polkadot-fellows/RFCs#4 and then to move the allocator inside of the runtime.

What I said here is the solution. It would be a better use of time to try to push this forward, instead of trying to come up with hacks that we can not accept.

@kianenigma
Copy link
Contributor

When you change the allocator as done in #6081, it means the moment there is a two gigabyte allocation, this memory is never "freed" again.

If that WASM instance is only used to generate the genesis config, or a one-off offchain worker, and then it is thrown away, is this a real issue?

You are right to protest that I don't exactly know how these wasm executor/allocator stuff is working under the hood, and perhaps my use of the word silly was not good, so let me rephrase: The current situation seems "unnecessarily restrictive".

Offchain WASM instances should have higher memory limits, and ideally kept separate from all the ones that are used for onchain execution. The offchain ones might be prone to this flaw that you are talking about, but, so be it, they are not particularly important anyways 🤷

@bkchr
Copy link
Member

bkchr commented Oct 18, 2024

If that WASM instance is only used to generate the genesis config, or a one-off offchain worker, and then it is thrown away, is this a real issue?

For sure that will not lead to any problem with the on chain execution. However, depending on the access in the offchain worker, the offchain worker could die because it runs out of memory. And it will run out of memory faster with the proposed solution.

Maybe could someone explain me what we mean by "allowing more validators"? More validators can apply for getting nominated? Also don't we have the solution miner that runs totally independent of the on chain runtime?

@michalkucharczyk
Copy link
Contributor

dumb thought : we could only increase limit in particular binary (chain-spec-builder), while keeping old limit in node.

@bkchr
Copy link
Member

bkchr commented Oct 20, 2024

Yeah, that would be an acceptable solution for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T0-node This PR/Issue is related to the topic “node”.
Projects
None yet
4 participants