Remove unnecessary host functions allocator usage #4

Open

wants to merge 17 commits into base: main

Conversation

@tomaka tomaka (Contributor) commented Jul 4, 2023

I have followed the template in #2.


## Future Possibilities

The existence of the host-side allocator has become questionable over time. It is implemented in a very naive way, and for determinism and backwards compatibility reasons it needs to be implemented exactly identically in every client implementation. Furthermore, runtimes make substantial use of heap memory allocations, and each allocation needs to go twice through the runtime <-> host boundary (once for allocating and once for freeing). Moving the allocator to the runtime side, while it would increase the size of the runtime, would be a good idea. But before the host-side allocator can be deprecated, all the host functions that make use of it need to be updated to not use it.
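To make the round-trip cost concrete, here is a hedged sketch (not actual Substrate glue code; the pointer packing and the explicit `free` call are illustrative assumptions about what the runtime-side glue ends up doing) of a version-1 style host call from the runtime's point of view:

```rust
// Hedged sketch of the current (version 1) flow; host function names follow the Polkadot
// host spec, everything else is an assumption for illustration.
#[cfg(target_arch = "wasm32")]
extern "C" {
    // The host allocates a 16-byte buffer with *its* allocator and returns a pointer to it.
    fn ext_hashing_twox_128_version_1(data: i64) -> i32;
    // Freeing that buffer is another full runtime <-> host boundary crossing.
    fn ext_allocator_free_version_1(ptr: i32);
}

#[cfg(target_arch = "wasm32")]
fn twox_128_v1(data: &[u8]) -> [u8; 16] {
    // Assumed Substrate convention: pointer in the low 32 bits, length in the high 32 bits.
    let packed = (((data.len() as u64) << 32) | data.as_ptr() as u64) as i64;
    let mut out = [0u8; 16];
    unsafe {
        let ptr = ext_hashing_twox_128_version_1(packed); // crossing #1, plus a host-side allocation
        out.copy_from_slice(core::slice::from_raw_parts(ptr as *const u8, 16));
        ext_allocator_free_version_1(ptr);                // crossing #2, only to free the buffer
    }
    out
}
```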
Contributor

How do you move the allocator to the runtime side?

Do you just give it a big block of memory and let it split it up as desired?

@tomaka tomaka (Contributor Author) Jul 4, 2023

I don't really see the problem? The runtime could just do whatever it wants with its memory.

If the question is how to make sure that the memory usage is bounded, my personal opinion would be that at initialization the runtime does storage_get(":heappages") and puts the value in its memory allocator, and the allocator crashes if it reaches the limit.

An alternative way is to rely on the fact that the wasm memory needs to be grown by the wasm code with a memory.grow opcode, so the host can detect when memory usage is over a certain threshold.
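A minimal sketch of the first option, assuming the limit is set once at initialization (the `:heappages` lookup and the "crash on exhaustion" policy come from the comment above; the wrapper around the system allocator is purely illustrative, a real runtime would wrap e.g. dlmalloc):

```rust
// Runtime-side allocator with a byte budget.
use core::sync::atomic::{AtomicUsize, Ordering};
use std::alloc::{GlobalAlloc, Layout, System};

struct BoundedAlloc {
    limit: AtomicUsize, // set at initialization, e.g. from storage_get(":heappages") * 65_536
    used: AtomicUsize,
}

unsafe impl GlobalAlloc for BoundedAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let previous = self.used.fetch_add(layout.size(), Ordering::Relaxed);
        if previous + layout.size() > self.limit.load(Ordering::Relaxed) {
            // Budget exceeded: roll back and fail the allocation, i.e. the runtime "crashes".
            self.used.fetch_sub(layout.size(), Ordering::Relaxed);
            return core::ptr::null_mut();
        }
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        self.used.fetch_sub(layout.size(), Ordering::Relaxed);
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static ALLOC: BoundedAlloc = BoundedAlloc {
    limit: AtomicUsize::new(usize::MAX), // replaced with the real limit during initialization
    used: AtomicUsize::new(0),
};
```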

Contributor

Not an issue, I'm just not familiar enough with Wasm and was curious about how this would be done in practice

@tomaka tomaka (Contributor Author) Jul 4, 2023

Moving the allocator back into the runtime would make the runtime more similar to a normal program (as normal programs manage their memory themselves, with the exception of "the sbrk mechanism" which is mimicked by memory.grow in wasm).
The fact that the allocator is outside of the runtime is what requires additional code/hacks. So the answer on how to move the allocator back to the runtime is simply "remove the hacks".

@rphmeier rphmeier (Contributor) left a comment

Is my understanding of the RFC correct that it proposes to introduce the new functions but not to remove the old ones?

This seems valid, but after being accepted/implemented, the old versions should be deprecated and then removed entirely (after some significant period of time, to allow backwards compatibility)

@tomaka tomaka (Contributor Author) commented Jul 6, 2023

Is my understanding of the RFC correct that it proposes to introduce the new functions but not to remove the old ones?

Yes!

This seems valid, but after being accepted/implemented, the old versions should be deprecated and then removed entirely (after some significant period of time, to allow backwards compatibility)

They should be deprecated, yes, but removing them would mean that you could no longer sync older blocks.
I'm not necessarily against breaking the possibility to sync older blocks, but that's a big/controversial change not in scope of this RFC.

@rphmeier rphmeier mentioned this pull request Jul 6, 2023
@pepyakin commented

This is definitely a step in the right direction. I also agree that the C API approach makes sense under the current conditions. I am a bit worried, though, that this proposal doesn't provide any numbers.

I am thinking that in the future we might need to consider an overhaul of the API design, as if it were being designed in 2023. But this is outside of the scope of this proposal.

Just for linking purposes, here is a relevant discussion on Substrate: paritytech/substrate#11883

@burdges burdges commented Jul 28, 2023

Is there an obstacle to host functions actually doing borrowing, i.e. just passing pointers? We need not even copy memory in, say:

fn host_function<'a, 'b>(foo: &'a mut [u8], bar: &'b [u8; 32]) -> &'a mut [u8];

Also..

A hash interface like pub fn keccak_256(data: &[u8]) -> [u8; 32] is inherently inefficient for variable sized data because you must build some Vec<u8> before doing the hashing. Although RustCrypto/digest is imperfect, at least their traits avoid this.

We could have wasm types designed for usage directly in the host, like demanding u64 or SIMD alignment, and then impl the appropriate digest::{Update, Reset, FixedOutput*, ExtendableOutput*, XofReader} traits directly upon those.

#[derive(Clone)]
#[repr(align(8))]
pub(crate) struct Sha3State {
    state: [u8; 26 * 8],
    //  [u64; PLEN] with PLEN = 25 plus a usize=u64
    // https://github.com/RustCrypto/hashes/blob/master/sha3/src/state.rs#L9
}

impl Update for Sha3State {
    fn update(&mut self, data: &[u8]) {
        // Host call.  On the host side unsafely casts our Sha3State to sha3::Sha3State
    }
}
...

You'd then cargo patch in sp-sha3 over sha3, and then you have a faster hashing interface. It's more work to do this, however..
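For comparison, here is what the allocation-free pattern looks like with the existing RustCrypto traits (using the `sha3` crate directly; the host-backed state type sketched above would expose the same `Update`/`FixedOutput` surface):

```rust
// Streaming keccak-256 over several chunks without first concatenating them into a Vec<u8>.
use sha3::{Digest, Keccak256};

fn keccak_256_streaming<'a>(chunks: impl IntoIterator<Item = &'a [u8]>) -> [u8; 32] {
    let mut hasher = Keccak256::new();
    for chunk in chunks {
        hasher.update(chunk); // each chunk is fed directly into the hasher state
    }
    let mut out = [0u8; 32];
    out.copy_from_slice(&hasher.finalize());
    out
}

fn main() {
    // Hashing the chunks is equivalent to hashing their concatenation.
    let parts: [&[u8]; 3] = [b"poly", b"ka", b"dot"];
    assert_eq!(keccak_256_streaming(parts), keccak_256_streaming([b"polkadot".as_slice()]));
}
```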

@tomaka tomaka (Contributor Author) commented Jul 30, 2023

This is definitely a step in the right direction. I also agree that the C API approach makes sense under the current conditions. I am a bit worried, though, that this proposal doesn't provide any numbers.

I have opened this RFC with the desire of making things move. I would be more than happy if someone else took over this effort and provided numbers. Personally, it seems obvious to me that "allocating memory, writing a hash to it, and freeing said memory" is slower than "writing a hash to memory". And if it turned out that in practice it wasn't slower, I would search for the problem elsewhere.
Since nobody else is proposing these changes, you'll have to make do with me and without numbers.

Is there an obstacle to host functions actually doing borrowing aka just pass pointers? We need not even copy memory in say:

A hash interface like pub fn keccak_256(data: &[u8]) -> [u8; 32] is inherently inefficient for variable sized data because you must build some Vec before doing the hashing. Although RustCrypto/digest is imperfect, at least their traits avoid this.

This is very precisely what this RFC fixes.

(func $ext_hashing_twox_128_version_2
(param $data i64) (param $out i32))
(func $ext_hashing_twox_256_version_2
(param $data i64) (param $out i32))
@burdges burdges Jul 31, 2023

Are the twox hashes still being used anywhere? I'd thought we deprecated them since they're footguns.

Contributor

Yes, they are still used for hashing the key of a storage value, which is a static key.

@tomaka tomaka (Contributor Author) Nov 27, 2023

They are used in order to know where to store stuff. For example the pallet Foo stores stuff under the key xxhash64("Foo").
While this is out of scope of this RFC, this is also very inefficient (literally every single storage access calculates the xxhashes of the pallet and the item) and should be moved to compile time.

@burdges burdges Nov 27, 2023

We deprecated twox entirely though, no? twox_64 is xxhash64. twox_128 is simply two copies of xxhash64: https://github.com/paritytech/polkadot-sdk/blob/cfa19c37e60f094a2954ffa73b1da261c050f61f/substrate/primitives/core/hashing/src/lib.rs#L77

It's therefore easy to find collisions for both twox flavors, which yields two storage locations which your code believes to be different, but actually wind up being the same. It technically facilitates supply chain attacks against parachains, even if used only at compile time.

Cyan4973/xxHash#165 (comment)
Cyan4973/xxHash#54 (comment)

Contributor Author

Don't quote me on that, but it is likely that nowadays the only input to the twox hash functions are strings hardcoded in the source code (pallet names and storage key names).

Contributor

Don't quote me on that, but it is likely that nowadays the only input to the twox hash functions are strings hardcoded in the source code (pallet names and storage key names).

Yes this. I mean it is up to the user, but for maps the default is blake2.

should be moved to compile time.

This is also already done.

Comment

We'll avoid any pallet dependencies with bizarre hard coded storage names then I guess. lol

Comment

This is also already done.

We do not need host calls for twox hash then I guess.

@burdges burdges commented Jul 31, 2023

This is very precisely what this RFC fixes.

Just for the case where a fixed length buffer is being hashed. sp_io::Hashing inherently demands allocations when hashing a variable length buffer. I just pointed out that the variable length buffer interface looks like:

#[runtime_interface]
pub trait Hashing {
    // Existing fixed length buffer stuff

    fn keccak_256_update(state: &mut Sha3State, data: &[u8]);
    fn keccak_256_finalize(state: &Sha3State) -> [u8; 32];

    fn keccak_512_update(state: &mut Sha3State, data: &[u8]);
    fn keccak_512_finalize(state: &Sha3State) -> [u8; 64];

    fn sha2_256_update(state: &mut Sha2State, data: &[u8]);
    fn sha2_256_finalize(state: &Sha2State) -> [u8; 32];

    fn blake2_256_update(state: &mut Blake2State, data: &[u8]);
    fn blake2_256_finalize(state: &Blake2State) -> [u8; 32];
}

Anyways..

If we typically hash fixed length buffers, then maybe this variable length buffer allocation is best to just keep, as the above interface demands quite a few more calls.

But before the host-side allocator can be deprecated, all the host functions that make use of it need to be updated to not use it.

I see, we need roughly this for backwards compatibility anyways.

(param $key_type_id i32) (param $seed i64) (param $out i32) (return i32))
```

The behaviour of these functions is identical to their version 1 or version 2 equivalent. Instead of allocating a buffer, writing the output to it, and returning a pointer to it, the new version of these functions accepts an `out` parameter containing the memory location where the host writes the output. The output is always of a size known at compilation time.
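For illustration, a hedged guest-side sketch of what calling one of these `out`-parameter functions could look like (the extern declaration and the pointer/length packing are assumptions, not normative RFC text):

```rust
// Version-2 style call: the output buffer lives in the runtime, so there is no host-side
// allocation and no later `free` crossing.
#[cfg(target_arch = "wasm32")]
extern "C" {
    fn ext_hashing_twox_128_version_2(data: i64, out: i32);
}

#[cfg(target_arch = "wasm32")]
fn twox_128_v2(data: &[u8]) -> [u8; 16] {
    // Assumed Substrate convention: pointer in the low 32 bits, length in the high 32 bits.
    let packed = (((data.len() as u64) << 32) | data.as_ptr() as u64) as i64;
    let mut out = [0u8; 16]; // fixed size known at compile time; a stack buffer is enough
    unsafe { ext_hashing_twox_128_version_2(packed, out.as_mut_ptr() as i32) };
    out
}
```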
Contributor

The expected behavior (trap?) for an out-of-bounds memory access ought to be specified.

@tomaka tomaka (Contributor Author) Sep 5, 2023

You're actually (I guess semi-unintentionally) raising a good point.

If we take for example ext_crypto_ed25519_sign_version_2, on success the function returns 0 and writes to out, and on failure the function returns 1 and doesn't write anything. Should a trap happen if out is out of range even in the situation where 1 would be returned? The answer is important for determinism.

I've added some text mentioning that yes, the range of out must always be checked. This is a bit arbitrary, but I think that passing a completely invalid input has "higher priority" than a failure in what is being demanded from the function.
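In host-side pseudocode (hedged; the signature length, the memory slice, and the error plumbing are stand-ins for whatever the executor actually uses), the rule amounts to doing the bounds check before anything else:

```rust
// Sketch: validate `out` unconditionally so an invalid pointer traps deterministically,
// even in cases where the function would otherwise return the "failure" code 1.
fn ed25519_sign_v2(memory: &mut [u8], out_ptr: usize, key_available: bool) -> Result<i32, &'static str> {
    const SIG_LEN: usize = 64;
    // 1. Range check happens first, unconditionally.
    let out = out_ptr
        .checked_add(SIG_LEN)
        .and_then(|end| memory.get_mut(out_ptr..end))
        .ok_or("trap: `out` is outside of the virtual machine's memory")?;
    // 2. Only then is the "cannot sign" failure path taken.
    if !key_available {
        return Ok(1);
    }
    out.copy_from_slice(&[0u8; SIG_LEN]); // placeholder for the actual signature
    Ok(0)
}
```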

@yjhmelody commented

Hi, I want to know whether there is any plan to move the allocator inside of the wasm runtime.
It definitely would make the substrate runtime more generic.

@tomaka tomaka (Contributor Author) commented Sep 17, 2023

Same remark as #8 (comment)

I would like to propose this RFC, but I don't know how to proceed.

@bkchr bkchr (Contributor) commented Sep 21, 2023

To move forward with this pull request, I would like to see some basic benchmarks being done. @yjhmelody could you please help with this?

@bkchr bkchr (Contributor) commented Sep 21, 2023

I would also like to see this directly move the entire allocator to the runtime. Not sure why we should only do this halfway.

@tomaka tomaka (Contributor Author) commented Sep 21, 2023

I can expand the RFC to cover all the host functions.

When I opened this RFC (two and a half months ago), I actually expected a swift and unanimous approval. But if RFCs inherently need a long time, I might as well make it as big as possible.

@koute koute commented Sep 21, 2023

+1 for going all the way and moving the whole allocator inside of the runtime, so that for new blocks the allocator host functions are never called anymore.

@bkchr bkchr (Contributor) commented Sep 21, 2023

When I opened this RFC (two and a half months ago), I actually expected a swift and unanimous approval.

Yeah, sorry about that, but it should actually also not be "swift". Let's do it properly.

@tomaka tomaka (Contributor Author) commented Sep 22, 2023

I'm currently updating the RFC.

When it comes to ext_offchain_network_state_version_1, I'm going in the direction of deprecating the function due to its unclear semantics. It's not clear from the runtime's perspective whether the PeerId or addresses returned by this function can change over time or not (hint: they can change and this function is very racy).

The only place where this function is used is in the node-authorization pallet, in order to grab the PeerId of the local node and mark it as reserved (I don't really understand why you'd mark yourself as reserved, but whatever). While this is slightly off-topic, I also think that the mechanism of reserved peers should be deprecated due to unclear semantics.

For the sake of not breaking things, I'm going to add a ext_offchain_network_peer_id_version_1 function, but I do think that this function should not exist.

@tomaka tomaka (Contributor Author) commented Sep 22, 2023

I've updated the RFC.

As I've mentioned in the "unresolved questions" section, I expect the offchain worker functions to be slightly slower than right now, but this is IMO negligible since they are offchain.

To me, the biggest potential source of performance degradation comes from the deprecation of ext_storage_get_version_1, as the runtime might have to call ext_storage_read_version_2 multiple times instead of calling ext_storage_get_version_1 once.
In theory it should be possible to progressively read storage items (in a streaming way) whose size isn't known ahead of time, rather than loading them completely in one go, but I understand that this would be a pretty large change to the runtime code and I'm afraid that half-assing the change might lead to reading becoming slow.

The changes to ext_storage_next_key might also cause performance degradation if not implemented properly. This is also a moderately sized change to the runtime code (that will degrade performance if half-assed), but I have the impression that it shouldn't be too hard to do it properly.

@tomaka tomaka (Contributor Author) commented Oct 23, 2023

To move forward with this pull request, I would like to see some basic benchmarks being done. @yjhmelody could you please help with this?

Is something in progress?

While I agree that RFCs shouldn't necessarily be swift, if nothing is done then the ETA of this RFC is "infinity", and it would be great if it were finite. The longer we wait, the more other changes (such as #31) will conflict with this RFC.

(and the reason why I originally submitted a stripped-out version is that I anticipated this happening)

(param $key_type_id i32) (param $seed i64) (param $out i32) (return i32))
```

The behaviour of these functions is identical to their version 1 or version 2 counterparts. Instead of allocating a buffer, writing the output to it, and returning a pointer to it, the new version of these functions accepts an `out` parameter containing the memory location where the host writes the output. The output is always of a size known at compilation time. The runtime execution stops with an error if `out` is outside of the range of the memory of the virtual machine.
Comment

How does one actually implement a host function that borrows like this?

@tomaka tomaka (Contributor Author) Oct 23, 2023

This probably requires some changes to the Substrate procedural macro stuff (which I'm not familiar with).
This is why I mention in the "drawbacks" section that it might be difficult to implement, but I have no idea how much.

Comment

This probably requires some changes to the Substrate procedural macro stuff (which I'm not familiar with). This is why I mention in the "drawbacks" section that it might be difficult to implement, but I have no idea how much.

I don't remember off the top of my head whether the macro machinery supports this right now, but yeah, we can do this; not a problem.

Contributor

The macros have supported borrowing since the beginning; you can pass &mut [u8] quite easily.

@burdges burdges commented Oct 23, 2023

The more we wait, the more other changes .. will conflict with this RFC.

Agreed, our unstable host calls for elliptic curves pass Vec<u8>s too, for example. I managed to get them down to only one or two allocations each, and in some cases those are unavoidable, but.. They could be fixed with a PR now, given they're not stable or used anywhere. I just do not know how to pass a &[u8] to a host call without doing one allocation. This matters for halo2 usage.

@yjhmelody yjhmelody commented Oct 24, 2023

To move forward with this pull request, I would like to see some basic benchmarks being done. @yjhmelody could you please help with this?

Is something in progress?

While I agree that RFCs shouldn't necessarily be swift, if nothing is done then the ETA of this RFC is "infinity" and it would be great if it was finite. The more we wait, the more other changes (such as #31) will conflict with this RFC.

(and the reason why I originally submitted a stripped-out version is that I anticipated this happening)

Sorry, I hadn't seen this message before.
I personally think there is no need to do benchmarking; it would take a lot of effort. And there is no reason not to support the runtime's own allocator, which can greatly promote the substrate runtime ecosystem, and obviously it will not degrade performance very much unless you use a very poor memory allocator.

You can see that when various Ethereum ecosystems use wasm or other virtual machines, they all use a runtime allocator instead of a host one. This is very convenient in verification scenarios, because verification is just a pure function, and it helps support more ecosystem scenarios; it is a pity that substrate has not supported it.

Sorry again, I don't have the energy to push the related improvements forward at the moment.

@bkchr bkchr (Contributor) left a comment

Sorry for the long delay. I left some nitpicks, generally looking good and we could continue with this RFC from my side.

Comment on lines +274 to +277
(func $ext_input_size_version_1
(return i64))
(func $ext_input_read_version_1
(param $offset i64) (param $out i64))
Contributor

Why not just have the read function and follow the way the other read functions are implemented?

Contributor Author

I think that doing it this way (first getting the size, then getting the content) is much clearer and simpler, and more importantly avoids the risk of allocating a buffer that is too small and having to re-allocate it later and copy or throw away all the data that was already read.

When reading the storage, knowing the size of the storage value would require a database access, so if reading a storage value required two function calls we would need to access the database twice, which seems not great.

When reading the input, I assume that ext_input_size_version_1 just returns something like self.input.len(), so there's no problem. Also, you typically read the input only once.
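A hedged guest-side sketch of that scheme (the extern declarations mirror the WAT signatures quoted above; the pointer/length packing is an assumption):

```rust
#[cfg(target_arch = "wasm32")]
extern "C" {
    fn ext_input_size_version_1() -> i64;
    fn ext_input_read_version_1(offset: i64, out: i64);
}

#[cfg(target_arch = "wasm32")]
fn read_input() -> Vec<u8> {
    unsafe {
        // Presumably just `self.input.len()` on the host side, so this call is cheap.
        let len = ext_input_size_version_1() as usize;
        // One allocation of exactly the right size; no re-allocation, no thrown-away data.
        let mut buf = vec![0u8; len];
        let out = (((len as u64) << 32) | buf.as_mut_ptr() as u64) as i64;
        ext_input_read_version_1(0, out);
        buf
    }
}
```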

Contributor

I mean you could pass out with a zero length to only get the length of the value.


- `ext_input_size_version_1`/`ext_input_read_version_1` is inherently slower than obtaining a buffer with the entire data due to the two extra function calls and the extra copying. However, given that this only happens once per runtime call, the cost is expected to be negligible.

- The `ext_crypto_*_public_keys`, `ext_offchain_network_state`, and `ext_offchain_http_*` host functions are likely slightly slower than their deprecated counterparts, but given that they are used only in offchain workers this is acceptable.
Contributor

ext_offchain_network_state doesn't exist anymore?

Contributor Author

I mention it here: #4 (comment)

I think that this function should be deprecated entirely.

Contributor

Fine by me. Maybe it should be mentioned a little more obviously in the RFC.

@tomaka tomaka (Contributor Author) Nov 27, 2023

The runtime execution stops with an error if `value_out` is outside of the range of the memory of the virtual machine, even if the size of the buffer is 0 or if the amount of data to write would be 0 bytes.

```wat
(func $ext_offchain_network_peer_id_version_1
Contributor

There is no way to read the external addresses? (Not really sure we still need them)

Contributor Author

I also mention it here: #4 (comment) and I think it should be deprecated.

It's completely unclear to me how to define "an external address".

@tomaka tomaka (Contributor Author) commented Nov 27, 2023

I've run a quick small experiment and have tried to extract the allocator out of smoldot, in other words do to smoldot the opposite of what this RFC wants to do to the runtime, and then measure the size of smoldot compiled to Wasm.
Smoldot uses dlmalloc, which is the allocator used by default by the Rust standard library.

With the allocator outside of smoldot: 2,789.9kB
With the allocator inside of smoldot: 2,794.3kB

So basically the size of the code of the allocator is <5 kiB. And that's uncompressed. According to this page, compression seems to reduce the size by ~75%, so we're closer to 1kiB to 1.5kiB of overhead if we moved the allocator to the runtime.

@radkomih commented

Sorry if this isn't the right place for this, but I would like to share our insights on the topic:

I'm part of a team developing runtimes in Go, and we've had to modify the compiler to ensure compatibility with the external allocator. We've implemented a custom garbage collector that integrates seamlessly with this external allocator, but this is one of the reasons we're not using the standard Go compiler (another is related to the Wasm target).
For us, integrating the host allocator into the runtime is not solely a matter of speed, it's also about facilitating easier adoption and generalization for writing Polkadot runtimes in various languages.

+1 for this change to happen.

- The following function signature is now also accepted for runtime entry points: `(func (result i64))`.
- Runtimes no longer need to expose a constant named `__heap_base`.

All the host functions that are being superseded by new host functions are now considered deprecated and should no longer be used.

Comment

What exactly does deprecated mean here? Do you mean increasing the version number? If not, wouldn't an existing chain still need these host functions if it starts synchronizing from old blocks? So it seems like we always need to ensure these functions are available?

Contributor Author

It means what is written: they should no longer be used. Runtimes are what use host functions. Therefore, runtimes should no longer use these host functions.

Do you mean increasing the version number?

This text simply says that if the RFC adds foo_v2, then foo_v1 should no longer be used. You can see it as increasing the version number, yes, but given that there's no semver involved, it's better to think of foo_v1 and foo_v2 as two completely unrelated functions.


- It is unclear how replacing `ext_storage_get` with `ext_storage_read` and `ext_default_child_storage_get` with `ext_default_child_storage_read` will impact performance.

- It is unclear how the changes to `ext_storage_next_key` and `ext_default_child_storage_next_key` will impact performance.

Comment

For the variable-length host functions, is it really necessary to revise them?
In fact, the host always needs an allocator to support copying the entrypoint parameter.
So if the new version cannot improve performance, it will only increase the complexity of use.

Contributor Author

Can you define "variable-length host function" and "entrypoint parameter"?
If you're talking about the input to the runtime call, it's also tackled in this RFC.

@yjhmelody yjhmelody Dec 12, 2023

It's just what you are talking about: the input to the runtime call.
The current specification of the input to the runtime call needs the host to call an alloc function (defined on the host side or the runtime side).


```wat
(func $ext_storage_read_version_2
(param $key i64) (param $value_out i64) (param $offset i32) (result i64))
@koute koute Feb 23, 2024

result could be an i32 here; even with an i32 a return value of -1 is not ambiguous as you'll physically never be able to actually allocate a buffer of such size.

So, let me see if I got this right, is the idea here that:

  1. the guest prepares a buffer of n bytes,
  2. it calls this host function and passes a pointer to this buffer and its size,
  3. the host reads the storage and writes m bytes into the buffer, where m <= n and returns m,
  4. if m == n (so either the buffer was exactly the right size or we still have some data left) then the guest enlarges the buffer and calls the host function again with a bigger offset and repeats this until the whole value is read (and if the guest guesses wrong then this could be called many times),

correct?

If so maybe instead we could do this: return an i64 which would be a packed (i32, i32) (or a i64 with its upper bit reserved) where the first value is going to be whether the read succeeded, and the second value is going to be the actual size of the storage value, and have the algorithm go like this:

  1. the guest prepares a buffer of n bytes,
  2. it calls this host function and passes a pointer to this buffer and its size,
  3. let m be the size of the storage value; if m <= n then write it to the buffer and return (1, m)
  4. if m > n then return (0, m)
  5. on the guest side, if a 0 was returned then enlarge the buffer to the exact size needed and call the function again (which is guaranteed to succeed now), otherwise we're done

And also do not require that "if the pointer to the buffer is invalid then runtime execution stops even if length is 0".

This way we gain the following benefits:

  1. if the buffer is exactly the right size we will only call this function once (as opposed to twice with the current proposal),
  2. if the buffer is not big enough then we will only call this function twice (as opposed to twice or more with the current proposal),
  3. there's only going to be one host-guest memcpy (because if the buffer is not big enough then nothing will be copied) which is going to be faster/more efficient,
  4. the guest can choose whether it wants to try its luck guessing the initial size of the buffer (risk having to reallocate the buffer in the worst case but read the value in only one call in the best case), or it can call the host function with a size of "0" to ask for the right size and prepare a buffer which is of exactly the right size (always read the value in two calls, but won't ever need to reallocate the buffer)
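A hedged sketch of that protocol from the guest's side (names are hypothetical; `storage_read` stands in for the host call, and the bit layout of the packed return value is just one possible choice):

```rust
// Packed (success, size) return value; the host side would produce it with `pack`.
fn pack(success: bool, size: u32) -> i64 {
    (((success as u64) << 32) | u64::from(size)) as i64
}

fn unpack(packed: i64) -> (bool, u32) {
    let packed = packed as u64;
    ((packed >> 32) != 0, packed as u32)
}

// At-most-two-calls read loop described above.
fn read_value(storage_read: impl Fn(&mut [u8]) -> i64, size_guess: usize) -> Vec<u8> {
    let mut buf = vec![0u8; size_guess];
    let (fit, size) = unpack(storage_read(&mut buf));
    if fit {
        buf.truncate(size as usize); // the value fit into the guess; done after one call
        return buf;
    }
    // Too small: the host wrote nothing, so resize to the exact size and retry once.
    buf.resize(size as usize, 0);
    let (fit, _) = unpack(storage_read(&mut buf));
    debug_assert!(fit);
    buf
}
```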

Contributor Author

result could be an i32 here; even with an i32 a return value of -1 is not ambiguous as you'll physically never be able to actually allocate a buffer of such size.

I went with i64 because I'm not really a fan of "-1i32 is not ambiguous". If we ever switch to a 64-bit runtime, then it's no longer impossible for 4 GiB to be allocated. Sure, a 4 GiB storage item would be unusually big, but it's also not in the realm of the completely absurd.

correct?

Yes, but I think that your explanation builds upon a wrong assumption.

if m == n (so either the buffer was exactly the right size or we still have some data left) then the guest enlarges the buffer and calls the host function again with a bigger offset and repeats this until the whole value is read (and if the guest guesses wrong then this could be called many times),

The idea when I designed this function is that the runtime is supposed to know what the size of the storage item is. When you're for example reading a u32 from storage, you know it's always 4 bytes. When you're reading a Vec<something>, you can first read the length prefix in front of the Vec (which realistically shouldn't be more than 5 bytes) and then allocate a Vec of exactly the correct size.

Allocating a buffer whose size is equal to the size of the runtime storage item builds upon the assumption that you can't decode SCALE-encoded items in a streaming way. This assumption is true in practice now, but in the absolute it isn't.

To give an example, if you have a storage item whose format is Vec<Vec<u8>>, right now you have to allocate a big buffer that contains the entire storage item, then when decoding you copy the data within each inner Vec. It seems more optimal to me (but I could be wrong) to call storage_read once to get the size of the outer Vec, then twice for each inner Vec, and read the data of each inner Vec<u8> directly into its destination buffer.
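A hedged illustration of that streaming decode (the `storage_read` closure is a hypothetical wrapper over `ext_storage_read_version_2`, and the length prefixes are simplified to plain little-endian `u32`s instead of SCALE compact integers):

```rust
// Reads a Vec<Vec<u8>>-shaped storage item piece by piece: each inner Vec<u8> is read
// straight into its final buffer, with no intermediate buffer holding the whole item.
fn read_vec_of_vecs(storage_read: impl Fn(&mut [u8], u32) -> u32) -> Vec<Vec<u8>> {
    let mut offset = 0u32;
    let read_len = |offset: &mut u32| {
        let mut len = [0u8; 4];
        storage_read(&mut len, *offset);
        *offset += 4;
        u32::from_le_bytes(len)
    };

    let outer_len = read_len(&mut offset);
    let mut result = Vec::with_capacity(outer_len as usize);
    for _ in 0..outer_len {
        let inner_len = read_len(&mut offset);
        let mut inner = vec![0u8; inner_len as usize];
        storage_read(&mut inner, offset);
        offset += inner_len;
        result.push(inner);
    }
    result
}
```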

@koute koute commented Feb 23, 2024

This RFC is kind of dead, so unless anyone else steps up I volunteer to push it forward.

I will implement this in Substrate and report back; let's get this over the finish line.

@burdges burdges commented Feb 23, 2024

As discussed above, it appears twox is sufficiently deprecated to ignore here.

A polkavm-ish question: is there some minimum overhead for a host function call? If it's small enough, then like I described above..

One could open up the hash functions guts and provide host functions for digest::Update::update and digest::FixedOutput::finalize_into. We'd avoid doing so many allocations even inside the runtime this way.

@koute koute commented Feb 23, 2024

A polkavm-ish question: is there some minimum overhead for a host function call? If it's small enough, then like I described above..

One could open up the hash functions guts and provide host functions for digest::Update::update and digest::FixedOutput::finalize_into. We'd avoid doing so many allocations even inside the runtime this way.

For WASM VMs, depending on the implementation, hostcalls can be expensive (e.g. if the VM fiddles with the signal handlers when going through the boundary), and for PolkaVM specifically it does involve crossing a thread/process boundary (essentially similar to sending a message through a channel to another thread, but possibly a little cheaper than your average Rust async channel implementation due to the optimistic spinlocking I can do). So it's really hard to say; it depends on the ratio of work saved vs. the number of host calls.

Another alternative would be a vectored call, that is something like fn hashv(slices: &[*const u8], out_hash: &mut [u8]) which is sort of a hybrid where it only takes a single call but you can hash multiple disjoint chunks of memory as if they were next to each other.

Nevertheless, I'm not sure if we need this?

@burdges burdges commented Feb 23, 2024

These calls are purely for performance, and have no reason to transfer control to the host. We could have machine code which gets linked in directly, maybe? I guess that'd bring a different auditing problem..

Nevertheless, I'm not sure if we need this?

Nah. It helps avoid some performance footguns, but whatever.

IRTF standards where those footguns look likely, like hash-to-curve and HKDF, pay some small cost to make this not a problem, not that we support any of their hash functions anyways.

@yjhmelody commented

This RFC is kind of dead, so unless anyone else steps up I volunteer to push it forward.

I will implement this in Substrate and report back; let's get this over the finish line.

I'm not free recently, but if you implement it, I can help review it :)

github-merge-queue bot pushed a commit to paritytech/polkadot-sdk that referenced this pull request Mar 12, 2024
This PR adds a new PolkaVM-based executor to Substrate.

- The executor can now be used to actually run a PolkaVM-based runtime,
and successfully produces blocks.
- The executor is always compiled-in, but is disabled by default.
- The `SUBSTRATE_ENABLE_POLKAVM` environment variable must be set to `1`
to enable the executor, in which case the node will accept both WASM and
PolkaVM program blobs (otherwise it'll default to WASM-only). This is
deliberately undocumented and not explicitly exposed anywhere (e.g. in
the command line arguments, or in the API) to disincentivize anyone from
enabling it in production. If/when we'll move this into production usage
I'll remove the environment variable and do it "properly".
- I did not use our legacy runtime allocator for the PolkaVM executor,
so currently every allocation inside of the runtime will leak guest
memory until that particular instance is destroyed. The idea here is
that I will work on the polkadot-fellows/RFCs#4
which will remove the need for the legacy allocator under WASM, and that
will also allow us to use a proper non-leaking allocator under PolkaVM.
- I also did some minor cleanups of the WASM executor and deleted some
dead code.

No prdocs included since this is not intended to be an end-user feature,
but an unofficial experiment, and shouldn't affect any current
production user. Once this is production-ready a full Polkadot
Fellowship RFC will be necessary anyway.
dharjeezy pushed a commit to dharjeezy/polkadot-sdk that referenced this pull request Mar 24, 2024
bgallois pushed a commit to duniter/duniter-polkadot-sdk that referenced this pull request Mar 25, 2024