Addition of Mimalloc & Snmalloc as allocators #358
Conversation
…ure and using global dropin for malloc
Needs documentation. Maybe also a test that a process running with mimalloc selected is actually allocating with mimalloc rather than the system or Rust allocator. I have no idea how to write such a test, but it seems to me to be critical that we're sure we never allocate with one allocator and free with a different one. In particular, I didn't see anywhere in the diff a place where the Python allocator is set (but maybe it's already set to the global Rust allocator and mimalloc piggybacks off that?)
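One way to get the kind of test asked for here is to wrap the global allocator in a counting shim and assert that the counter moves when code allocates. A minimal sketch in Rust, using `std::alloc::System` as a stand-in for mimalloc; the wrapper type, counter, and test are illustrative, not code from this PR:

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Wraps an inner allocator and counts allocation calls, so a test can
/// assert that allocations actually route through the installed allocator.
struct Counting<A>(A);

static ALLOCS: AtomicUsize = AtomicUsize::new(0);

unsafe impl<A: GlobalAlloc> GlobalAlloc for Counting<A> {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        self.0.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        self.0.dealloc(ptr, layout)
    }
}

// In a real build this would wrap the mimalloc allocator instead of System.
#[global_allocator]
static GLOBAL: Counting<System> = Counting(System);

fn main() {
    let before = ALLOCS.load(Ordering::Relaxed);
    let v: Vec<u8> = Vec::with_capacity(4096); // forces a heap allocation
    assert!(ALLOCS.load(Ordering::Relaxed) > before);
    drop(v);
}
```

This only proves allocations go through the Rust global allocator, not that mimalloc specifically backs it; but combined with the feature flag it catches the "allocate with one allocator, free with another" class of bug the comment worries about.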
@@ -17,6 +17,7 @@ links = "pythonXY"
# Update documentation in lib.rs when new dependencies are added.
Maybe I missed it if you already did it, but according to this comment, you need to update docs in lib.rs
That's a good point. I checked lib.rs but wasn't exactly sure what needed to change; I'll keep looking over the individual files to see what documentation is needed.
In terms of which allocator it is allocating with: I am setting the Rust global allocator to mimalloc, so the Rust allocator is mimalloc. This has been possible since Rust 1.28 [https://doc.rust-lang.org/edition-guide/rust-2018/platform-and-target-support/global-allocators.html]. Without the mimalloc feature enabled, the normal system allocator is used and mimalloc is not compiled. When it is enabled, I see it being compiled and set as the Rust allocator; I cloned the pathway that uses raw_allocator="rust" from that point onward. It would be nice to add a test to make sure it's enabled properly.
In short: it is the Rust raw allocator, with the Rust global allocator set to mimalloc beforehand. Anecdotally, I've noticed a significant boost in app startup speed.
pyembed/Cargo.toml
@@ -47,6 +48,7 @@ rusty-fork = "0.3"
[features]
default = ["build-mode-default"]
jemalloc = ["jemalloc-sys"]
gmimalloc = ["mimalloc"]
Why the g?
Cargo would not compile if the feature name was the same as the optional mimalloc crate (an optional dependency implicitly defines a feature of the same name), so I added a g to the name, in reference to it being set "globally", i.e. as the Rust global allocator.
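For context, the collision comes from Cargo's implicit features: an optional dependency automatically defines a feature with its own name, so an explicit feature entry cannot reuse it. An illustrative `Cargo.toml` fragment (not the exact file from this PR):

```toml
[dependencies]
# Optional dependency: this implicitly creates a feature also named
# "mimalloc", so an explicit feature cannot reuse that name.
mimalloc = { version = "0.1", optional = true }

[features]
jemalloc = ["jemalloc-sys"]
# Renamed with a leading "g" (for "global") to avoid the collision:
gmimalloc = ["mimalloc"]
```

Newer Cargo (1.60+) avoids the rename entirely with the namespaced syntax `mimalloc = ["dep:mimalloc"]`, but that was not available at the time of this PR.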
@wkschwartz Using mimalloc, my application starts within 7-16 seconds the first time it is run, followed by 1-1.5 seconds for each subsequent startup. Using the normal Windows malloc allocator, it took 20-26 seconds to start (measured in triplicate) and 1.5-2 seconds subsequently. Since some of my applications are quite large and Python startup speed is my pet peeve, this is quite advantageous for me. I learned about the allocator in the wasm forums, where its use seems to be growing. So setting the global allocator works in my use case, though maybe we should name the feature "custom" and point users to pyembed's Cargo.toml/pyalloc to swap in the allocator of their choice when the "custom" feature is enabled.
If we are not actually setting the allocator for Python as well, I think we should definitely add that, as shown in this example of using jemalloc with Python: [https://zapier.com/engineering/celery-python-jemalloc/]. I would consider a 30-40% reduction in memory quite significant, and mimalloc tends to show a further reduction in benchmarks. Being able to set the memory allocator to suit your needs for both the Rust app and Python could be a big benefit for PyOxidizer and end users, i.e. building an environment that runs scripts efficiently no matter where or by whom they are run. It would be a shame not to have a jemalloc alternative for Windows.
Thank you for the contribution.
I definitely support adding this as a feature!
First, I've taken a light break from PyOxidizer at the moment and am unsure when I'll have time to merge this PR; it depends on my personal schedule. Hopefully within the next 2-3 weeks.
I skimmed the patch and my biggest feedback is that you may want to consider installing mimalloc directly in Python instead of relying on the Rust global allocator fallback. The way this would work is you would install mimalloc as the Rust global allocator and you would install it separately as Python's raw allocator.
The reason you want this is that pyembed's Python raw allocator shim to Rust adds overhead and may not work exactly like the direct binding. Having all allocations go through Rust does theoretically have some advantages. But I don't think anybody relies on them in practice, so it is probably safe to not support this. If people really wanted it, we could expose config variants for whether mimalloc should go direct or via Rust. Or we could expose a separate config option to control Rust's global allocator and have the "rust" Python allocator use Rust's global allocator. I don't think this complexity is warranted at this time, though. But if you want to build it, I'll review it.
…but still working on cmake for gnu
Sounds good. I modified the commit and updated it to use mimalloc directly in Python. I also updated mimalloc to work on windows-msvc and windows-gnu. I know it sounds weird, but I'm downloading the msvc Python and building pyoxidizer-gnu around it to reduce C-runtime dependencies and use LTO.
I'm starting to think we should look into setting the global allocator for jemalloc and system as well when selecting which type to use. I found an interesting article discussing performance improvements from setting the Rust allocator to jemalloc after it was dropped as the default in 2018: https://www.christianfscott.com/making-rust-as-fast-as-go/
… due to removal of jemalloc as rust allocator in 2018
Based on the previous comment about jemalloc no longer being the Rust allocator after 2018 (unless the system already uses it), I added jemalloc global allocator settings and modified mimalloc to work on Linux (i.e. Ubuntu), so it is no longer just for Windows users.
…ursively deleting long paths, resulting in permission denied and requiring the user to delete the install folder manually; using a modified remove_dir_all on Windows corrects this while pointing to std::fs::remove_dir_all normally on gnu
…for pyembed windows, still need vcruntime.dll for python though,add configs for windows gnu and what to add when using mimalloc on windows-gnu only
Not passing all checks was bugging me, so I hope you don't mind the little maintenance I did.
…ching rust allocation calls and I wanted at least one that went through rust with performance between jemalloc and mimalloc
I added changing pymalloc or the PyObjectArenaAllocator when changing the allocator from system to any of the optional allocators available.
@indygreg and @wkschwartz I ran some small allocation tests and thought you might want to see the results. (The test snippet and the less/more verbose output tables did not survive here; the runs covered the system, Rust, mimalloc, and snmalloc allocators, plus threaded variants of the system, mimalloc, and snmalloc runs, all on Python 3.9.1 (default, Jan 3 2021, 22:36:18) [MSC v.1928 64 bit (AMD64)] on win32.)
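The snippet being tested did not survive the copy above. A hypothetical micro-benchmark in the same spirit, stressing the many-small-allocations case these allocators target (this is not the author's actual script):

```python
import time


def bench(n=200_000):
    """Time many small allocations: the workload pymalloc/mimalloc target."""
    start = time.perf_counter()
    objs = [[i] * 8 for i in range(n)]   # many small lists
    d = {i: str(i) for i in range(n)}    # many small dict entries and strings
    elapsed = time.perf_counter() - start
    del objs, d
    return elapsed


if __name__ == "__main__":
    print(f"allocation benchmark: {bench():.3f}s")
```

Run under each build (system, mimalloc, snmalloc) and compare the elapsed times; a threaded variant would run `bench` from several `threading.Thread`s to exercise contention.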
I started to look at this PR. The performance numbers look very promising and I'd very much like to get this merged!
I may or may not tidy this up locally and push it myself (preserving authorship information, of course). If I do this, I'll likely make a handful of changes:

- Remove `rayon` from the `pyembed` crate. The parallelism could be nice. However, we want to limit the crate dependencies of this run-time crate and I'm likely against adding `rayon` to that list.
- Remove `rayon` globally. It looks like a global search and replace was performed to make certain iterable operations parallel. Many of these don't need to be made parallel. And the change could be wrong if rayon doesn't preserve order during parallel iteration. I'm not opposed to making use of parallel iteration where it makes sense (outside crates used at run-time). But such a change should be made in its own commit with a performance justification to support the change.
- Revert a lot of fully qualified crate versions in `Cargo.toml` files. This may have been a rebase issue with these patches, I'm not sure. Going forward, we prefer simple `X` or `X.Y` version pinning to minimize the churn/effort required to update crates. The `Cargo.lock` file can ensure exact versions are pinned.
- Possibly undo some of the lifetime annotation additions. It isn't clear to me why the anonymous lifetime was added in a bunch of places. Even more perplexing is why this was done in a commit whose message was `update dependencies`. Is there more to this story?
Anyway, I'm continuing to reconcile this PR locally. If I run into blocking issues requiring your attention, I'll push a branch that you can rebase this off of. Give me a few hours...
I may have gotten a little carried away with trying to figure out why random tests were failing. Some dependencies needed to be updated to pass nightly, for example.

In terms of lifetimes, those are from clippy suggestions; apparently they will be needed in 2021, so I was trying to update everything that may be needed all at once. They were also clogging my command line so that I couldn't find the real clippy issues easily. There shouldn't be any patches anymore; I sent pull requests to any external crates I did change.

I do recommend keeping the remove_dir_all change: on Windows I kept getting errors that my install folder didn't have permission and the build would fail. Apparently std on Windows would only delete the exe on rebuilds and fail on the lib folder.

Yeah, I had just started on rayon; I was hoping to start implementing walkdir with it, but I'll push that one separately. I was curious what else we could enhance with Rust, like dependency and file search, etc.

For mimalloc I would recommend deleting the padding and removing the dynamic-tls feature; I've removed both locally. I can make any of the changes you listed and re-upload with the mimalloc change if you would like; just let me know so I'm not conflicting with any of your changes.

There is also a mimalloc heap allocator that I'm trying to figure out, but every time I try to delete the heap at once, Python crashes. The heap also needs to be in ctx or Python over-allocates, so I might send the heap version later once I study it more.
As a side note, I would appreciate learning why you don't like the lifetimes? I'm still new to them and they seem preferred in the Rust manual. Furthermore, I have an idea to reduce some of the compile time by switching out the macro-heavy lazy_static for once_cell, which is actually slated to be moved into the standard library soon: https://crates.io/crates/once_cell. I wanted to get your thoughts first on whether you preferred lazy_static for some reason?
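once_cell's API did eventually land in the standard library (as `std::sync::OnceLock`, with `LazyLock` following later), which makes the proposed swap easy to illustrate with no external crates. A sketch using illustrative names:

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

// A lazy_static!-style global:
//     lazy_static! { static ref DEFAULTS: HashMap<...> = { ... }; }
// becomes a plain function holding a OnceLock, with no macros involved:
fn defaults() -> &'static HashMap<&'static str, &'static str> {
    static DEFAULTS: OnceLock<HashMap<&'static str, &'static str>> = OnceLock::new();
    DEFAULTS.get_or_init(|| {
        let mut m = HashMap::new();
        m.insert("raw_allocator", "system"); // illustrative entry
        m
    })
}

fn main() {
    assert_eq!(defaults().get("raw_allocator"), Some(&"system"));
}
```

At the time of this PR, `once_cell` (the crate) provided the same pattern via `once_cell::sync::Lazy`; `OnceLock` is simply its std descendant.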
Thanks for the context. And no worries! The clippy warnings shouldn't have existed anyway. But they did, likely because the repo has sat dormant for several weeks while the Rust people keep pumping out new clippy lints.
I didn't know about `once_cell`.
Generally speaking, not much in PyOxidizer is performance sensitive. The biggest concern is run-time performance of built applications. There, the most important thing is reading the packed resources file and loading modules. But when I last measured this, I'm not convinced that throwing multiple threads at it will make it any better.
I'll let you know. I'm still trying to grok the patch.
This is good to know! There's likely a missing lifetime annotation / dependency on the heap allocator. I'll pay close attention to this when I look at the code.
Generally speaking, modern versions of Rust do a pretty good job of figuring out lifetimes for you, so lifetime annotations just aren't needed in many situations. Major exceptions to this are when you are defining/implementing traits and when materializing Rust objects from pointers. In simple cases where you pass a reference into a function and it returns a simple object (like a bool), you likely never need to define lifetimes. You should generally only need to define lifetimes when the returned value references something in an input argument. And even then, Rust is often smart enough to associate the lifetime of the output object with that of the input object it is referencing, so explicit lifetime annotations just aren't necessary. Note that …
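The elision rules described above can be shown in a few lines (illustrative functions, not code from the PR):

```rust
// Elision works: one input reference, output borrows from it.
// No annotation needed.
fn first_word(s: &str) -> &str {
    s.split_whitespace().next().unwrap_or("")
}

// Elision fails: two input references, so the compiler cannot tell
// which one the output borrows from; an explicit lifetime is required.
fn longer<'a>(a: &'a str, b: &'a str) -> &'a str {
    if a.len() >= b.len() { a } else { b }
}

// Returning an owned value (or a bool): lifetimes are never needed.
fn is_empty(s: &str) -> bool {
    s.is_empty()
}

fn main() {
    assert_eq!(first_word("hello world"), "hello");
    assert_eq!(longer("abcd", "xy"), "abcd");
    assert!(!is_empty("x"));
}
```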
I'm only using …
I've locally committed the code in this patch to add mimalloc as the raw allocator.
As part of looking at the entire state of this allocator code, I realized that the abstractions in pyembed were kinda misleading. From the Python C API perspective, you have a single `PyMemAllocatorEx` struct representing an allocator. And that can be installed for the `raw`, `mem`, and `obj` domains. Furthermore, the built-in `pymalloc` allocator can be customized to have a specific arena allocator function.

So the code in pyembed emphasizing "raw" is somewhat wrong, as a `PyMemAllocatorEx` can be installed in multiple domains.
Coinciding with your code, I'm refactoring a lot of the internal names to be more generic so we can install an allocator (backed by jemalloc, mimalloc, etc) in any domain as well as use one of these allocators for the pymalloc arena allocator.
I still have a ways to go. But the refactor looks promising!
I'm preserving your authorship on commits where you clearly contributed the code. I don't want your contribution to get lost here!
let padding = _ctx as *const _ as usize;
unsafe { mimallocffi::mi_malloc(size + padding) }
Why is `padding` used here and in other functions?
The padding was based on the author Victor Stinner's explanation of the custom memory allocators. Apparently ctx is used in Python for padding: 10 MB or kB for PyMemAllocatorEx and 2 MB or kB for pymalloc. However, my recent tests have shown this can max out the memory allocation, and I've decided not to use it in the non-heap allocator version. The heap allocator I'm working on works nicely with the padding.

So, as mentioned earlier, we can remove the padding here for non-heap mimalloc, which I've done locally.
Here is his explanation: https://www.python.org/dev/peps/pep-0445/
Again, I don't recommend the padding until we implement the mimalloc heap version.
OK. I was confused because we don't actually set `ctx` to non-NULL in this patch. Thanks for the explanation!
Actually, I tried that originally and don't recommend setting the PyMemAllocatorEx for all domains. It dramatically worsened the runtime. I believe this comes down to the number of allocations being made, and it is why pymalloc exists in the first place for small allocations.
However, this is also why I decided to make the PyObjectArenaAllocator use the same named raw allocator, so we don't mix different types of allocators; I wanted to add an option in the future to use the mimalloc heap allocator instead of the normal mimalloc allocator. Note that if you look at pymalloc's free, they actually change the call so that it would fail if raw tried to free a pymalloc block or vice versa.
Interesting. I thought some of these custom allocators specialized in many small allocations and might give pymalloc a run for its money. I think the bigger potential is around using a custom allocator as pymalloc's arena allocator. Regardless, I'd like to make all of this configurable so things are super flexible. Who knows: perhaps we enable people to perform random benchmarks with the flexibility and gain useful insights!
I misspoke: jemalloc and mimalloc perform much better than pymalloc. However, for some strange reason I kept getting better results setting pymalloc's arena allocator directly to mimalloc or jemalloc rather than overriding everything with PyMemAllocatorEx. I believe calls to the PyObjectArenaAllocator API may be smaller memory calls than those to raw. Otherwise, I'm not sure why overriding pymalloc itself is better than using PyMemAllocatorEx for everything, in which case pymalloc is no longer used and wouldn't need to be set.
So to clarify: my best results were obtained by setting PyMemAllocatorEx to mimalloc for PYMEM_DOMAIN_RAW and also setting pymalloc's arena allocator to mimalloc to take care of PYMEM_DOMAIN_MEM and PYMEM_DOMAIN_OBJ. You would think this should be the same as setting PyMemAllocatorEx for PYMEM_DOMAIN_RAW, PYMEM_DOMAIN_MEM, and PYMEM_DOMAIN_OBJ, but for some strange reason (probably my fatigue) it was not, lol.
It appears that `python3-sys` defines `PyObjectArenaAllocator.free` as `pub free: Option<extern "C" fn(ctx: *mut c_void, ptr: *mut c_void, size: size_t) -> ()>`. However, the Python docs have the signature as `void free(void *ctx, size_t size, void *ptr)`. (Note `size` and `ptr` having their order reversed.)

So we can implement code that compiles. But it will likely blow up at run-time unless we add some very ugly hacks to munge the arguments. We should fix this in `python3-sys`...
Yes, this threw me off as well; I ended up having to switch the pointer and size in pyalloc. What you are describing makes sense: the memory peaks around 80-90%, and jemalloc/mimalloc either do smaller frees or the system is freeing on its own to prevent going too high. I had thought it might be indexing the memory for future use and not releasing it yet, which I've seen these newer allocators tend to do, but that would now explain the heap crashing.
Agreed
Support for both allocators is implemented in the config and interpreter initialization layers. Only mimalloc has its Python raw allocator implemented: snmalloc will always fail if used. Part of #358.
I just pushed all my local commits to the … branch.

You'll notice that I also pushed the remnants of your branch to the … branch.

There are a few things that aren't yet implemented: …
I'm going to implement Starlark support for configuring the …
I put in an issue to rust-cpython to let the author know about the error as well. In the meantime, setting PyMemAllocatorEx for all three domains (raw, obj, mem) should override the PyObjectArenaAllocator completely, according to Victor Stinner, but I'm not sure why I saw a performance benefit going through the PyObjectArenaAllocator instead.
We now define a custom allocator backend in its own field and have separate fields controlling how it is used. This ensures we only have a single custom allocator defined rather than a mix of multiple allocators. We expose fields to enable installing custom allocators for the mem and obj domains as well as pymalloc's arena allocator (although the arena allocator is busted due to dgrunwald/rust-cpython#257). We change the name of the "system" backend to "default" because that is more appropriate. I also found a subtle issue with the detection of the default backend only checking for x86_64 on Windows: it should check for all Windows MSVC arches. Part of #358.
OK. Refactored allocator control has been pushed. We should now have full control over the allocator settings from Starlark / …
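A Starlark configuration along these lines would select the backend and the domains it is installed into. The field names below follow the refactor described in the commit message (a single backend field plus per-domain toggles) but are illustrative; the PyOxidizer docs are the authoritative reference:

```python
def make_exe(dist):
    config = dist.make_python_interpreter_config()
    # One backend for all custom-allocator duties
    # ("jemalloc", "mimalloc", "snmalloc", "rust", or "default"):
    config.allocator_backend = "mimalloc"
    # Which Python domains the backend is installed into:
    config.allocator_raw = True   # PYMEM_DOMAIN_RAW
    config.allocator_mem = True   # PYMEM_DOMAIN_MEM
    config.allocator_obj = True   # PYMEM_DOMAIN_OBJ
    ...
```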
Yes, setting …

Regarding the performance, my guess is …

Are you going to send a PR to …?

Once that lands upstream, I'm hesitant to use an unreleased …
This addresses some outstanding TODOs and allows us to enable this allocator! I haven't tested this thoroughly yet. But at least the test doesn't crash! Part of #358.
Apparently python3-sys is correct! See dgrunwald/rust-cpython#258. So we can enable this feature. As part of this, I fixed an unset pointer that was uncovered by the newly enabled tests. I guess the tests did their job! Part of #358.
OK. That leaves support for defining the global allocator as the only missing feature I can think of.

I'm not thrilled about the code for ensuring thread safety of allocation tracking. That likely imposes performance overhead, although we should benchmark. I'd love ideas for how to make that code better. It would also be nice to avoid the tracking overhead if an allocator doesn't need to be thread safe (I think arena allocators are guaranteed to have the GIL held?).
This is great news! Do you think it would be worth specifying allocator features? mimalloc has a couple of different ones, like dynamic-tls, secure, etc., that aren't actually required; the only required one is `extended`, which, the more I think about it, should probably be in build.rs so that it is used from the very beginning. My concern if we don't set it is that it may limit the allocations done by any Rust code, but this may not matter much for normal users.
I'll definitely look it over. Actually, mimalloc and snmalloc should already be considered thread safe, as they are built to run on multiple threads and track their own allocations. Here is the Microsoft paper on mimalloc describing how it shares heaps between threads safely and maps usage: https://www.microsoft.com/en-us/research/uploads/prod/2019/06/mimalloc-tr-v1.pdf. In the first paragraph they actually mention Python and how mimalloc would suit its needs as the allocator backend, doing reference counting on its own. snmalloc, on the other hand, uses message passing and tracks allocations so that blocks freed on other threads are returned to the owning thread. It excels at small-sized objects, which is why I thought it may be better as a pymalloc alternative: https://www.microsoft.com/en-us/research/uploads/prod/2020/04/snmalloc.pdf
Yes, I think it would be worth allowing those to be specified.
I was going to throw the global allocator into the auto-generated Rust crate providing the application. IMO it makes more sense to configure the global allocator on the crate providing an executable and not a library.
I should have been specific about the limitations. I'm not concerned about thread safety of calling the allocator functions: allocators are generally thread safe. The problem is that both the Rust allocator APIs and snmalloc's Rust bindings require the allocation's size to be passed when freeing, while Python's allocator APIs don't supply it, so we have to track each allocation's size ourselves, and it is that tracking which must be thread safe.

Alternatively, we could potentially track the allocation size internally in N bytes at the beginning or end of the allocated block. (Preferably the end, to ensure memory alignment.) However, I believe there are security implications to this, since a buffer underrun or overrun would corrupt our accounting, and this could potentially lead to badness if the previous size fed into the allocator APIs were incorrect. I worry that could somehow be exploited (e.g. you use a buffer overrun to lie about the previous allocation size, which leads to a UAF and RCE). I'm not an expert in this area, so I could be wrong. If we pursued this, I would likely reach out to some former coworkers who know this space much better than me to get their assessment.

The good news is that I'm pretty sure you can work around this by storing a checksum or some such to detect corruption. This is similar to how Python's debug hooks work. And I believe many memory allocators do things like this transparently as built-in security features. Again, I just don't know this space well enough.
Previously, use of jemalloc would make jemalloc the Rust global allocator via the `jemallocator` crate. With the recent introduction of support for mimalloc and snmalloc, we want to support a similar feature. This commit teaches the new project template to configure the Rust global allocator for all 3 supported custom allocators. We do this by introducing a new set of crate features for controlling the global allocator explicitly. The autogenerated main.rs has explicit feature checking and code for registering the global allocator. This is a bit more verbose. But I like being explicit and consistent across the allocators. We've also introduced tests that verify autogenerated projects with custom allocators compile correctly. There is room to introduce Starlark config options to control the global allocator settings. But these didn't exist before. So I think we don't need to implement that functionality as part of this commit. Both mimalloc and snmalloc use cmake as part of their crate build process. So we document that and skip tests on Windows since cmake isn't present in CI. We'll likely need more intelligent follow-ups here (like to probe for the executable in the test). Let's wait for things to break/complain before introducing more complexity. Part of #358.
Got it. I may be mistaken, but I believe the size tracking is actually because the snmalloc team wrote the API for export to Rust that the snmalloc-rs crate calls directly. This is in contrast to the mimalloc Rust crate team, which implemented mi_malloc/mi_free wrappers in Rust before converting to the size-tracked, Rust-global-allocator-compatible version. For example, this is the rust.cc that snmalloc wrote and snmalloc-sys calls: https://github.com/microsoft/snmalloc/blob/master/src/override/rust.cc

I won't ignore that the more direct route snmalloc takes may be better for the Rust global allocator, with fewer function wraps. However, I was hoping to start bringing the snmalloc malloc and free that don't require a size, similar to mimalloc's, over to the snmalloc-sys crate once I get a chance. Furthermore, I would expect the cmake requirement to be short-term: mimalloc's crate has a pull request to use the cc crate, and I may use cc as well when adding functionality to snmalloc-sys.
Very interesting!
This would be terrific! If the size argument isn't needed, we could remove the allocation tracking code and get performance on par with other allocators (modulo Rust, which will always be special). Ideally there is a C API that doesn't accept the argument. But sending in a special value for the size (e.g. …) might also work.
FYI: I am happy to take changes and additions to the snmalloc …
Hello @mjp41, thanks for the info. I completely agree that you exposed the optimal surface for the Rust global allocator. While I plan to use the global allocator with Rust projects embedding Python, I would also like to use it for Python's memory allocator directly, without Rust's size-tracking requirements, calling snmalloc's versions of malloc, realloc, calloc, and free. I'll see what I can do from my side first and put in an issue as you suggest if I need more support. Best regards!
I added mimalloc as an optional feature on Windows systems, since jemalloc is not really an option there. Since mimalloc was designed as a drop-in replacement for malloc, I decided the easiest route would be to make it the Rust global allocator when the feature is enabled.