Enable reuse of mmaps between artifacts #125

Closed · wants to merge 7 commits from nagisa/mmap-reuse-proto

Conversation

nagisa (Collaborator) commented Apr 29, 2022:

This is a crude prototype which allows reusing the CodeMemory between
distinct Artifacts and Instances. This can reduce the overhead of
allocating mmaps with the kernel significantly. It turns out that the
kernel would still spend a lot of time if requested to change the
protection of a mapping from RX to RW.

This overhead is entirely mitigated by using HUGETLB, which is sadly
non-portable and is a major pain to enable (requires a kernel parameter).

With these two things in place the sys time goes down to almost zero
when loading contracts into memory, and the overall time of the
deserialize/10000 benchmark is reduced from ~2ms per iteration down to
1.25ms.
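Roughly, the reuse being prototyped looks like this (a minimal sketch assuming the libc crate; ReusableMap and its methods are hypothetical names, not this PR's code):

    use std::io;

    struct ReusableMap {
        ptr: *mut libc::c_void,
        len: usize,
    }

    impl ReusableMap {
        /// Map a fresh anonymous read-write region once, up front.
        fn new(len: usize) -> io::Result<Self> {
            let ptr = unsafe {
                libc::mmap(
                    std::ptr::null_mut(),
                    len,
                    libc::PROT_READ | libc::PROT_WRITE,
                    libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
                    -1,
                    0,
                )
            };
            if ptr == libc::MAP_FAILED {
                return Err(io::Error::last_os_error());
            }
            Ok(ReusableMap { ptr, len })
        }

        /// Flip to RX once the code is written; flipping back to RW for the
        /// next artifact is the RX -> RW transition the kernel spends a lot
        /// of time on, per the description above.
        fn set_protection(&mut self, prot: libc::c_int) -> io::Result<()> {
            if unsafe { libc::mprotect(self.ptr, self.len, prot) } != 0 {
                return Err(io::Error::last_os_error());
            }
            Ok(())
        }
    }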

@nagisa nagisa marked this pull request as draft April 29, 2022 13:56
@@ -109,12 +109,12 @@ impl CodeMemory {
             executable_section_result.push(s);
         }

-        self.start_of_nonexecutable_pages = bytes;
+        self.start_of_nonexecutable_pages = Mmap::round_up_to_page_size(bytes);
Contributor:

Should we zero-out the executable leftover?

nagisa (Collaborator, Author) commented May 2, 2022:

I don't think it is necessary to zero out this mmap at all. Due to wasm using a Harvard-style execution model, the leftover memory isn't going to be accessible to the contract. Unless we mess up our VM implementation, in which case we'd have worse problems (e.g. contracts getting access to a source of non-determinism, or to the private counterparts of keys stored in memory).

Contributor:

Agree that this isn't necessary, but it might still be good for defense in depth -- it just feels scary to have executable memory somewhere in the process whose contents are not explicitly set. Though, this is a very weakly held opinion. If we do enable huge pages, we could potentially need to zero out a significant chunk of memory...

Contributor:

I wonder if this makes the situation worse if someone pops us somewhere else. Like, what if there's an RCE in RocksDB? It seems not zeroing is no worse even in that case -- if the attacker can get a gadget via the non-zeroing of an old contract, they can probably get one via just an explicit contract?

        if total_len <= self.mmap.len() {
            self.unpublish()
        } else {
            self.mmap = Mmap::with_at_least(total_len).map_err(|e| e.to_string())?;
Contributor:

Would be cool to push this error to the caller completely, such that we don't have to worry about internal mmap failing.

@nagisa nagisa force-pushed the nagisa/mmap-reuse-proto branch 2 times, most recently from c1dce83 to cbf9d31 Compare June 28, 2022 13:58
@matklad matklad requested a review from Ekleog June 28, 2022 14:22
nagisa (Collaborator, Author) commented Jun 28, 2022:

For the time being this is still a prototype that uses RWX maps, but it won’t be too onerous to adopt any other mechanism. The primary value is that the user is now responsible for allocating a pool of memory maps ahead of time, and they continue being reused within the same engine, possibly resizing when necessary to fit a larger module (though this is something I’m somewhat on the fence about – it introduces a failure mode related to memory allocation deep inside UniversalEngine, and it makes memory usage growth unbounded).

906.06us is the baseline/best performance we can get when running this benchmark with 10k functions. This is the same deserialize/10000 benchmark mentioned in the PR description above.
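A rough shape of that pool idea (hypothetical names, with a Vec-backed stand-in for the real mmap-backed CodeMemory; not the API in this PR):

    use std::sync::Mutex;

    struct CodeMemory { buf: Vec<u8> } // stand-in for the real mmap-backed type

    impl CodeMemory {
        fn with_capacity(size: usize) -> Result<Self, String> {
            Ok(CodeMemory { buf: vec![0; size] })
        }
        fn ensure_capacity(&mut self, required: usize) -> Result<(), String> {
            if self.buf.len() < required {
                self.buf.resize(required, 0); // the resizing failure mode noted above
            }
            Ok(())
        }
    }

    struct CodeMemoryPool { free: Mutex<Vec<CodeMemory>> }

    impl CodeMemoryPool {
        /// The user allocates the pool ahead of time.
        fn preallocate(count: usize, size: usize) -> Result<Self, String> {
            let free = (0..count)
                .map(|_| CodeMemory::with_capacity(size))
                .collect::<Result<Vec<_>, _>>()?;
            Ok(CodeMemoryPool { free: Mutex::new(free) })
        }

        /// Pop a memory for a module, growing it if the module is larger.
        fn checkout(&self, required: usize) -> Result<CodeMemory, String> {
            match self.free.lock().unwrap().pop() {
                Some(mut mem) => { mem.ensure_capacity(required)?; Ok(mem) }
                None => CodeMemory::with_capacity(required),
            }
        }

        /// Return the memory so the next module reuses the same mapping.
        fn put_back(&self, mem: CodeMemory) {
            self.free.lock().unwrap().push(mem);
        }
    }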

nagisa (Collaborator, Author) commented Jun 28, 2022:

@Ekleog it would be a good time to review and think about the implications of the new API here.

nagisa (Collaborator, Author) commented Jun 30, 2022:

I think this is largely functionally complete. Compared to the current master we’re looking at a module load performance somewhere along the lines of:

many_functions/load/1   time:   [1.9852 us 1.9917 us 1.9970 us]
                        change: [-53.637% -49.569% -45.959%] (p = 0.00 < 0.05)
many_functions/load/10  time:   [3.1826 us 3.1921 us 3.2026 us]
                        change: [-33.674% -33.501% -33.328%] (p = 0.00 < 0.05)
many_functions/load/100 time:   [10.638 us 10.649 us 10.659 us]
                        change: [-74.704% -68.424% -60.145%] (p = 0.00 < 0.05)
many_functions/load/1000
                        time:   [72.230 us 72.269 us 72.342 us]
                        change: [-62.413% -58.517% -54.767%] (p = 0.00 < 0.05)
many_functions/load/10000
                        time:   [922.51 us 924.06 us 925.87 us]
                        change: [-48.213% -43.139% -39.995%] (p = 0.00 < 0.05)
many_locals/load/1000   time:   [3.1262 us 3.2095 us 3.3001 us]
                        change: [-58.665% -49.766% -43.309%] (p = 0.00 < 0.05)
many_locals/load/100000 time:   [11.248 us 11.267 us 11.292 us]
                        change: [-50.290% -49.910% -49.608%] (p = 0.00 < 0.05)
many_locals/load/10000000
                        time:   [91.316 us 91.800 us 92.441 us]
                        change: [-67.343% -63.812% -60.858%] (p = 0.00 < 0.05)

I’m thinking of delaying work on improving this further with memfd (which gives another significant boost) until after we get flat executable formats going.

These methods attempt to make it straightforward to load a module into a
store when the engine contained therein is already a dynamic object.
While there are some instances of that, in all the cases that matter for
us, obtaining a specific type of engine is going to be pretty
straightforward (`UniversalEngine` is the only supported engine right
now anyway…).
@nagisa nagisa changed the title proto: enable reuse of mmaps between artifacts Enable reuse of mmaps between artifacts Jun 30, 2022
@nagisa nagisa marked this pull request as ready for review June 30, 2022 14:41
Ekleog left a comment:

Overall LGTM for the idea! I still have a few concerns around unsafe here and there, though.

Also, I'm with @matklad that keeping an RX memory map hanging around with user-defined code in it is probably bad for defense-in-depth. Given we're going to turn that RX map into an RW map as soon as we get it out of the code memory pool, maybe it'd make sense to actually do the RW remap before putting it back into the code memory pool?

This way an attacker could control only an RW zone of memory outside of contract execution, which should be much harder to exploit (from, e.g., a networking-library ROP chain) than a permanently RX zone of memory.

It does mean, though, that the idea of CodeMemory keeping an RX/RW enum should actually happen, so that the RW remap doesn't happen twice for no good reason.

Does that make sense?
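The RX/RW bookkeeping this suggests could look roughly like this (a sketch with hypothetical names; Mmap here is a stub standing in for the real mapping type):

    enum Protection { ReadWrite, ReadExecute }

    struct Mmap; // stub for the real mapping type
    impl Mmap {
        fn protect_rw(&mut self) -> Result<(), String> { Ok(()) } // would call mprotect
        fn protect_rx(&mut self) -> Result<(), String> { Ok(()) } // would call mprotect
    }

    struct PooledCodeMemory {
        map: Mmap,
        protection: Protection,
    }

    impl PooledCodeMemory {
        /// Remap to RW when returning the memory to the pool, as suggested
        /// above, and record the state so checkout doesn't remap a second time.
        fn release_to_pool(&mut self) -> Result<(), String> {
            if let Protection::ReadExecute = self.protection {
                self.map.protect_rw()?;
                self.protection = Protection::ReadWrite;
            }
            Ok(())
        }

        /// Publish flips to RX exactly once.
        fn publish(&mut self) -> Result<(), String> {
            if let Protection::ReadWrite = self.protection {
                self.map.protect_rx()?;
                self.protection = Protection::ReadExecute;
            }
            Ok(())
        }
    }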

@@ -101,4 +97,41 @@ impl dyn Engine {
        None
    }
}

    /// Downcast a dynamic Executable object to a concrete implementation of the trait.
    pub fn downcast_arc<T: Engine + 'static>(self: Arc<Self>) -> Result<Arc<T>, Arc<Self>> {
Ekleog:

Not sure I understand, why would Arc::downcast not work here? AFAICT we could just add : Any to the definition of trait Engine, and we could get rid of this bit of unsafe code for free

nagisa (Collaborator, Author):

So the problem with Arc::downcast is that it is implemented for Arc<dyn Any + Send + Sync + 'static>. This notably doesn’t work for Arc<dyn Engine*> and there isn’t an easy way to go between the two, so you cannot actually call Arc::downcast on Arc<dyn Engine> AFAICT.

Ekleog:

Hmm I think having something like this should work?

trait Engine: Any + Send + Sync + 'static {
    // ...
    fn to_any(self: Arc<Self>) -> Arc<dyn Any + Send + Sync + 'static> {
        self
    }
}

And then use Arc::downcast(engine.to_any())?

Basically I'm feeling that if there's no way something like this would work, it might mean that this bit of unsafe code, however sound it looks, is actually wrong :/

Also I actually just noticed that Engine::type_id is a function implemented by a non-unsafe non-sealed trait, which means that the code is technically unsound because an engine could spoof a fake TypeId. That said tracing has similar unsoundness in its traits and it's not a big deal in practice, so it's probably not too bad.

nagisa (Collaborator, Author):

> Also I actually just noticed that Engine::type_id is a function implemented by a non-unsafe non-sealed trait, which means that the code is technically unsound because an engine could spoof a fake TypeId.

Implementers wouldn’t be able to access private::Internal, which is necessary to specify the argument type for the function.

> And then use Arc::downcast(engine.to_any())?

Implementing to_any would require an unstable feature (trait upcasting) or unsafe code along the lines of what is seen here, and it would end up returning dyn Any in its Err variant, which is not great in terms of API usability, but it would work.

Ekleog:

Hmm so for to_any(), here is what I was thinking of: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=7cedaa843727797e370656ec18459e0f (found by re-looking at the craziest trait object code I ever wrote, erlust)

I had forgotten that a default impl of the function wouldn't work, but that's probably not a problem for us. The advantage is that there's zero unsafe code here; the drawback, as you mentioned, is that it returns dyn Any instead of dyn Engine upon failure, but that's probably not too bad as it's all Arc anyway and we can just keep a clone of the Arc<dyn Engine> from before the to_any cast.

WDYT?
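For reference, the shape being discussed looks roughly like this (a sketch in the spirit of the playground link, not verbatim from it: the method is written per concrete impl rather than as a default body, since the Arc<Self> -> Arc<dyn Any> coercion needs a sized Self):

    use std::any::Any;
    use std::sync::Arc;

    trait Engine: Send + Sync + 'static {
        fn to_any(self: Arc<Self>) -> Arc<dyn Any + Send + Sync + 'static>;
    }

    struct UniversalEngine;

    impl Engine for UniversalEngine {
        fn to_any(self: Arc<Self>) -> Arc<dyn Any + Send + Sync + 'static> {
            self // coerces fine here: Self is the sized UniversalEngine
        }
    }

    fn downcast(engine: Arc<dyn Engine>) -> Option<Arc<UniversalEngine>> {
        // Keep a clone of the Arc<dyn Engine> beforehand if it's still needed
        // on failure, since the Err variant here is Arc<dyn Any>, not
        // Arc<dyn Engine>.
        engine.to_any().downcast::<UniversalEngine>().ok()
    }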


impl dyn Engine + Send + Sync {
    /// Downcast a dynamic Executable object to a concrete implementation of the trait.
    pub fn downcast_ref<T: Engine + 'static>(&self) -> Option<&T> {
Ekleog:

Same here

nagisa (Collaborator, Author):

Same story with dyn Any vs dyn T.

(Resolved review threads: lib/engine/src/engine.rs, lib/api/src/sys/module.rs, tests/compilers/serialize.rs, lib/engine-universal/src/code_memory.rs.)
            page_size,
        ) + data_sections.iter().fold(0, |acc, data| {
            round_up(acc + data.bytes.len(), DATA_SECTION_ALIGNMENT.into())
        });
Ekleog:

In a first draft of this comment I was fooled by how the code for this computation is constructed, but it turns out it's just doing additional, needless round_up alignments, meaning it only over-estimates and is correct.

Maybe add a comment saying that this is an over-estimation that does additional aligning at the end of each section type, so other readers of the code don't get surprised in the same way?

nagisa (Collaborator, Author):

FWIW I wanted to get rid of this entirely somehow, as I found it difficult to verify as well (this is pre-existing code). I think the most plausible approach would be to make `allocate` generic over the writer and call it twice: once with a dummy writer that would just collect the write sizes and spit out the final required size at the end, and the second time with a proper writer.

Ekleog:

SGTM!
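A sketch of that two-pass scheme (hypothetical trait and names, not this PR's code): `allocate` is generic over the writer, and the first pass merely counts, so the size estimate can't drift from the actual writes:

    trait SectionWriter {
        fn write(&mut self, bytes: &[u8]);
        fn align_to(&mut self, align: usize);
    }

    /// First pass: tallies what would be written and yields the required size.
    struct CountingWriter { len: usize }

    impl SectionWriter for CountingWriter {
        fn write(&mut self, bytes: &[u8]) { self.len += bytes.len(); }
        fn align_to(&mut self, align: usize) {
            self.len = (self.len + align - 1) / align * align;
        }
    }

    /// The single source of truth for the layout, used for both passes.
    fn allocate<W: SectionWriter>(w: &mut W, data_sections: &[&[u8]], align: usize) {
        for data in data_sections {
            w.write(data);
            w.align_to(align);
        }
    }

    fn required_size(data_sections: &[&[u8]], align: usize) -> usize {
        let mut counter = CountingWriter { len: 0 };
        allocate(&mut counter, data_sections, align);
        counter.len // exact, so no over-estimation comment is needed
    }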

(Resolved review thread: lib/engine-universal/src/engine.rs.)
@@ -30,7 +30,6 @@ mod code_memory;
 mod engine;
 mod executable;
 mod link;
-mod unwind;
Ekleog:

Does this change mean we can drop the eh_tables code handling in the previous commit too?

(Resolved review threads: lib/engine-universal/src/engine.rs, lib/engine-universal/src/code_memory.rs.)
    }

    /// Remap the offset into an absolute address within a read-execute mapping.
    pub fn executable_address(&self, offset: usize) -> *const u8 {
nagisa (Collaborator, Author) commented Jul 11, 2022:

I don’t know. Obtaining an address to the contents of the map is not inherently an unsafe operation; it is the publish/writer that ends up prodding the dragons. So I feel like it is more natural to have the unsafety there, rather than here. It’s an analogous situation to casting a mutable reference to a pointer vs dereferencing said pointer to make a mutable reference out of it.
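The analogy in code:

    fn example() {
        let mut x = 0u8;
        let p: *mut u8 = &mut x; // obtaining the address is safe, like executable_address()
        unsafe { *p = 1 };       // the unsafety lives at the use site, like publish/write
    }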

(Two further review threads on lib/engine-universal/src/code_memory.rs were marked resolved.)
Instead, users can obtain a concrete type of engine and use inherent
methods to compile with the specific engine.
For the memory maps themselves, for the time being, we’re using a scheme
that uses `SHARED` maps, which gives us a large chunk of the speed
benefit described in
https://kazlauskas.me/entries/fast-page-maps-for-jit. Ideally we would
use a memfd-based approach; however, that turns out to require some
significant reworking of how the compiled functions are referenced – for
example, we want to apply linking and relocation on the writable view of
the functions, but the function references will contain references to
the executable mapping right after the `allocate` call.

I imagine we won’t get to implementing an improvement here for a little
while, and definitely not before we implement flat executable file
representations.
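For concreteness, the deferred memfd scheme would look roughly like this (a sketch assuming the libc crate and Linux; dual_map is a hypothetical function, not code from this PR): one memfd mapped twice, so relocations are written through a RW view while calls go through a separate RX view, with no per-publish mprotect:

    use std::io;

    fn dual_map(len: usize) -> io::Result<(*mut u8, *const u8)> {
        unsafe {
            let fd = libc::memfd_create(b"code\0".as_ptr().cast(), 0);
            if fd < 0 {
                return Err(io::Error::last_os_error());
            }
            if libc::ftruncate(fd, len as libc::off_t) != 0 {
                return Err(io::Error::last_os_error());
            }
            // Writable view: linking and relocation happen through this one.
            let rw = libc::mmap(std::ptr::null_mut(), len,
                libc::PROT_READ | libc::PROT_WRITE, libc::MAP_SHARED, fd, 0);
            // Executable view: function references must point in here, which
            // is the reworking of references described above.
            let rx = libc::mmap(std::ptr::null_mut(), len,
                libc::PROT_READ | libc::PROT_EXEC, libc::MAP_SHARED, fd, 0);
            if rw == libc::MAP_FAILED || rx == libc::MAP_FAILED {
                return Err(io::Error::last_os_error());
            }
            Ok((rw as *mut u8, rx as *const u8))
        }
    }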
This appears to be no longer used in any meaningful way since we no
longer have any backends that emit uwtables in the first place.
This makes sure that there isn’t a region with a bunch of user-defined
code sitting in memory while the code is not being used in any way.
stale bot commented Jul 13, 2023:

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the 🏚 stale label Jul 13, 2023
@nagisa nagisa closed this Jul 19, 2023