
Addition of Mimalloc & Snmalloc as allocators #358

Closed · wants to merge 18 commits

Conversation

@ryancinsight (Contributor)

I added Mimalloc as an optional feature on Windows systems, since jemalloc is not really an option there. Since Mimalloc was designed as a drop-in replacement for malloc, I decided the easiest route would be to make it the Rust global allocator when the feature is enabled.
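
A minimal sketch of that approach, assuming the mimalloc crate's MiMalloc type and the feature name this PR settles on (gmimalloc); this is illustrative, not the exact code in the diff:

```rust
// Feature-gated Rust global allocator backed by mimalloc. When the
// feature is disabled, none of this compiles and the default system
// allocator is used.
#[cfg(feature = "gmimalloc")]
use mimalloc::MiMalloc;

#[cfg(feature = "gmimalloc")]
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;
```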

@wkschwartz (Contributor) left a comment

Needs documentation. Maybe also a test that a process running with mimalloc selected is actually allocating with mimalloc rather than the system or Rust allocator. I have no idea how to write such a test, but it seems to me to be critical that we're sure we never allocate with one allocator and free with a different one. In particular, I didn't see anywhere in the diff a place where the Python allocator is set (but maybe it's already set to the global Rust allocator and mimalloc piggybacks off that?)
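
One hedged sketch of what such a test might look like, assuming mimalloc has already been installed as Python's raw allocator and the interpreter is initialized. mi_is_in_heap_region comes from mimalloc.h; the extern declaration here is hand-written in case the -sys crate doesn't expose it:

```rust
// Hypothetical test, not code from this PR: allocate through CPython's
// raw allocator and ask mimalloc whether the pointer came from one of
// its heap regions.
use libc::c_void;

extern "C" {
    // From mimalloc.h: returns true if p points into a mimalloc heap region.
    fn mi_is_in_heap_region(p: *const c_void) -> bool;
}

#[test]
fn raw_allocator_is_mimalloc() {
    unsafe {
        let ptr = python3_sys::PyMem_RawMalloc(64);
        assert!(!ptr.is_null());
        assert!(mi_is_in_heap_region(ptr));
        python3_sys::PyMem_RawFree(ptr);
    }
}
```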

@@ -17,6 +17,7 @@ links = "pythonXY"
# Update documentation in lib.rs when new dependencies are added.
Contributor

Maybe I missed it if you already did it, but according to this comment, you need to update the docs in lib.rs.

Contributor (Author)

That's a good point. I checked out lib.rs but wasn't exactly sure what needed to be changed; I'll keep looking over the individual files to see what documentation is needed.

In terms of which allocator it is allocating to: I am setting the Rust global allocator to mimalloc, so the Rust allocator is mimalloc. This was enabled in Rust 1.28 (https://doc.rust-lang.org/edition-guide/rust-2018/platform-and-target-support/global-allocators.html). Without the mimalloc feature enabled, it uses the normal system allocation and mimalloc is not compiled. When the feature is enabled, I see it being compiled and it should be set as the Rust allocator; I have cloned the raw_allocator="rust" pathway from this point onward. It would be nice to add a test to make sure it's enabled properly, though.

In conclusion, it is the Rust raw allocator, except with the Rust global allocator set to mimalloc beforehand. Subjectively, I've noticed a significant boost in app startup speed.

@@ -47,6 +48,7 @@ rusty-fork = "0.3"
[features]
default = ["build-mode-default"]
jemalloc = ["jemalloc-sys"]
gmimalloc = ["mimalloc"]
Contributor

Why the g?

Contributor (Author)

Rust would not compile if the feature name was the same as the actual crate, so I added a g to the name, in reference to it being set "globally", i.e. as the Rust global allocator.

@ryancinsight (Contributor, Author)

@wkschwartz Using Mimalloc, my application starts within 7-16 seconds the first time it is run, followed by 1-1.5 seconds for each subsequent startup. Using the normal Windows malloc allocator it took 20-26 seconds to start (repeated in triplicate) and 1.5-2 seconds subsequently. Since some of my applications are quite large and Python startup speed is my pet peeve, this is quite advantageous to me. I learned about the allocator in the wasm forums, where its use seems to be growing. So setting the global allocator works in my use case, though maybe we should name the feature "custom" and point users to pyembed's Cargo.toml/pyalloc to swap out which allocator to use when the feature is enabled.

@ryancinsight (Contributor, Author)

If we are not actually setting the allocator for Python as well, I think we should definitely add that, as shown in this example of using jemalloc with Python: https://zapier.com/engineering/celery-python-jemalloc/. I would consider a 30-40% reduction in memory quite significant, and mimalloc tends to show a further reduction in benchmarks.

I believe being able to set the memory allocator to suit your needs, for both the Rust app and Python, could be a big benefit for PyOxidizer and end users, i.e. building an environment that runs scripts efficiently no matter where or by whom they are run. It would be a shame not to have a jemalloc alternative for Windows.

@indygreg (Owner) left a comment

Thank you for the contribution.

I definitely support adding this as a feature!

Foremost, I've taken a light break from PyOxidizer at the moment and am unsure when I'll have time to merge this PR. It highly depends on my personal schedule. Hopefully within the next 2-3 weeks.

I skimmed the patch and my biggest feedback is that you may want to consider installing mimalloc directly in Python instead of relying on the Rust global allocator fallback. The way this would work is you would install mimalloc as the Rust global allocator and you would install it separately as Python's raw allocator.

The reason you want this is that pyembed's Python raw allocator shim to Rust adds overhead and may not work exactly like the direct binding. Having all allocations go through Rust does theoretically have some advantages, but I don't think anybody relies on them in practice, so it is probably safe to not support this. If people really wanted it, we could expose config variants for whether mimalloc should go direct or via Rust. Or we could expose a separate config option to control Rust's global allocator and have the Rust Python allocator use Rust's global allocator. I don't think this complexity is warranted at this time, though. But if you want to build it, I'll review it.
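
For illustration, a hedged sketch of the "install it separately as Python's raw allocator" half, using python3-sys for the CPython bindings and libmimalloc-sys for the mi_* functions (the PR's actual code may differ):

```rust
use libc::{c_void, size_t};
use python3_sys::{PyMemAllocatorDomain, PyMemAllocatorEx, PyMem_SetAllocator};

// PEP 445 requires a request for zero bytes to return a distinct,
// freeable pointer, so bump zero sizes to one.
extern "C" fn raw_malloc(_ctx: *mut c_void, size: size_t) -> *mut c_void {
    unsafe { libmimalloc_sys::mi_malloc(size.max(1)) }
}

extern "C" fn raw_calloc(_ctx: *mut c_void, nelem: size_t, elsize: size_t) -> *mut c_void {
    unsafe { libmimalloc_sys::mi_calloc(nelem.max(1), elsize.max(1)) }
}

extern "C" fn raw_realloc(_ctx: *mut c_void, ptr: *mut c_void, new_size: size_t) -> *mut c_void {
    unsafe { libmimalloc_sys::mi_realloc(ptr, new_size.max(1)) }
}

extern "C" fn raw_free(_ctx: *mut c_void, ptr: *mut c_void) {
    unsafe { libmimalloc_sys::mi_free(ptr) }
}

// Must run before the interpreter is initialized.
pub unsafe fn install_mimalloc_raw_allocator() {
    let mut alloc = PyMemAllocatorEx {
        ctx: std::ptr::null_mut(),
        malloc: Some(raw_malloc),
        calloc: Some(raw_calloc),
        realloc: Some(raw_realloc),
        free: Some(raw_free),
    };
    // CPython copies the struct, so a stack value is fine here.
    PyMem_SetAllocator(PyMemAllocatorDomain::PYMEM_DOMAIN_RAW, &mut alloc);
}
```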

@ryancinsight (Contributor, Author)

Sounds good. I modified the commit and updated it to use mimalloc directly in Python. I also updated mimalloc to work on windows-msvc and windows-gnu. I know it sounds weird, but I'm downloading the MSVC Python and building pyoxidizer-gnu around it to reduce C-runtime dependencies and use LTO.

@ryancinsight (Contributor, Author)

I'm starting to think we should look into setting the global allocator for jemalloc and system as well when selecting which type to use. I found an interesting article discussing performance improvements from setting the Rust allocator to jemalloc after Rust dropped it as the default in 2018: https://www.christianfscott.com/making-rust-as-fast-as-go/

Ryan Clanton added 2 commits February 1, 2021 07:29
… due to removal of jemalloc as rust allocator in 2018
… due to removal of jemalloc as rust allocator in 2018
@ryancinsight changed the title from "Addition of Mimalloc as an allocator for windows systems" to "Addition of Mimalloc as an allocator" on Feb 1, 2021
@ryancinsight (Contributor, Author)

Based on the previous comment about jemalloc not being the Rust allocator after 2018 (unless the system is already using it), I added jemalloc global allocator settings and modified mimalloc to work on Linux (i.e. Ubuntu), so it is no longer just for Windows users.

Ryan Clanton added 8 commits February 1, 2021 14:39
…ursively deleting long paths, resulting in permission denied and requiring the user to delete the install folder manually; using a modified remove_dir_all on Windows corrects this while pointing to std::fs::remove_dir_all normally on gnu
…for pyembed windows, still need vcruntime.dll for python though; add configs for windows-gnu and what to add when using mimalloc on windows-gnu only
@ryancinsight (Contributor, Author)

Not passing all the checks was bugging me, so I hope you don't mind the little maintenance I did.

@ryancinsight changed the title from "Addition of Mimalloc as an allocator" to "Addition of Mimalloc as an allocator / update dependencies / fix build" on Feb 6, 2021
Ryan Clanton added 2 commits February 8, 2021 19:59
…ching rust allocation calls, and I wanted at least one that went through rust, with performance between jemalloc and mimalloc
@ryancinsight changed the title from "Addition of Mimalloc as an allocator / update dependencies / fix build" to "Addition of Mimalloc & Snmalloc as allocators" on Feb 11, 2021
@ryancinsight (Contributor, Author) commented Feb 15, 2021

I added code to change pymalloc's PyObjectArenaAllocator when changing the allocator from system to any of the available optional allocators.

@ryancinsight (Contributor, Author)

@indygreg and @wkschwartz

I ran some small allocation tests and thought you might want to see the results:

Here is the basic snippet being tested:
timeit.timeit('"-".join([str(n) for n in range(1000000)])', number=100)

Less verbose outcome:
Mimalloc outperformed system, Rust, and snmalloc, with snmalloc second in both single-threaded and multithreaded tests.
The largest impact was in multithreading, with a ~23-24% reduction in execution time; single-threaded saw a ~15% reduction.

More Verbose

# System
Python 3.9.1 (default, Jan 3 2021, 22:36:18) [MSC v.1928 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.

import timeit
timeit.timeit('"-".join([str(n) for n in range(1000000)])', number=100)
19.058086099999997
timeit.timeit('"-".join([str(n) for n in range(1000000)])', number=100)
19.9837284
timeit.timeit('"-".join([str(n) for n in range(1000000)])', number=100)
20.188063999999997
timeit.timeit('"-".join([str(n) for n in range(1000000)])', number=100)
20.38189779999999
timeit.timeit('"-".join([str(n) for n in range(1000000)])', number=100)
20.174934499999992
timeit.timeit('"-".join([str(n) for n in range(1000000)])', number=100)
20.620286699999994

# Rust

Python 3.9.1 (default, Jan 3 2021, 22:36:18) [MSC v.1928 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.

import timeit
timeit.timeit('"-".join([str(n) for n in range(1000000)])', number=100)
19.430539600000003
timeit.timeit('"-".join([str(n) for n in range(1000000)])', number=100)
20.025297000000002
timeit.timeit('"-".join([str(n) for n in range(1000000)])', number=100)
19.895522800000002
timeit.timeit('"-".join([str(n) for n in range(1000000)])', number=100)
19.582417399999997
timeit.timeit('"-".join([str(n) for n in range(1000000)])', number=100)
19.76773460000001
timeit.timeit('"-".join([str(n) for n in range(1000000)])', number=100)
19.623044499999992

# Mimalloc

Python 3.9.1 (default, Jan 3 2021, 22:36:18) [MSC v.1928 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.

import timeit
timeit.timeit('"-".join([str(n) for n in range(1000000)])', number=100)
16.899359500000003
timeit.timeit('"-".join([str(n) for n in range(1000000)])', number=100)
17.1238023
timeit.timeit('"-".join([str(n) for n in range(1000000)])', number=100)
17.984461100000004
timeit.timeit('"-".join([str(n) for n in range(1000000)])', number=100)
18.049280200000013
timeit.timeit('"-".join([str(n) for n in range(1000000)])', number=100)
18.143765099999996
timeit.timeit('"-".join([str(n) for n in range(1000000)])', number=100)
17.752012699999995
timeit.timeit('"-".join([str(n) for n in range(1000000)])', number=100)
17.25725270000001
timeit.timeit('"-".join([str(n) for n in range(1000000)])', number=100)
17.359214500000007

# Snmalloc

Python 3.9.1 (default, Jan 3 2021, 22:36:18) [MSC v.1928 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.

import timeit
timeit.timeit('"-".join([str(n) for n in range(1000000)])', number=100)
18.465406
timeit.timeit('"-".join([str(n) for n in range(1000000)])', number=100)
19.210176000000004
timeit.timeit('"-".join([str(n) for n in range(1000000)])', number=100)
19.650796600000007
timeit.timeit('"-".join([str(n) for n in range(1000000)])', number=100)
19.21021970000001
timeit.timeit('"-".join([str(n) for n in range(1000000)])', number=100)
19.399142600000005
timeit.timeit('"-".join([str(n) for n in range(1000000)])', number=100)
19.499576700000006

# System (threaded)
--- Started Worker_8
--- Started Worker_15
--- Started Worker_18
--- Started Worker_12
--- Started Worker_19
--- Started Worker_7
--- Started Worker_14
--- Started Worker_8
--- Started Worker_16
--- Started Worker_16
Worker_7 took 9.869464 secs to complete
Worker_8 took 17.5815548 secs to complete
--- Completed Worker_8
Worker_8 took 17.712012899999998 secs
Worker_8 took 19.2237744 secs to complete
Worker_12 took 24.329753200000003 secs to complete
Worker_16 took 28.6749941 secs to complete
Worker_16 took 28.9369972 secs to complete
Worker_18 took 30.520798499999998 secs to complete
Worker_15 took 31.068258999999998 secs to complete
--- Completed Worker_15
Worker_15 took 31.0817758 secs
--- Completed Worker_18
Worker_18 took 30.5261043 secs
--- Completed Worker_12
Worker_12 took 24.3358389 secs
Worker_14 took 30.6610221 secs to complete
Worker_19 took 31.2026022 secs to complete
--- Completed Worker_19
Worker_19 took 31.2030797 secs
--- Completed Worker_7
Worker_7 took 10.1182106 secs
--- Completed Worker_14
Worker_14 took 30.666786400000003 secs
--- Completed Worker_8
Worker_8 took 19.3505246 secs
--- Completed Worker_16
Worker_16 took 28.6806485 secs
--- Completed Worker_16
Worker_16 took 28.942495400000002 secs
Total execution time: 35.4335241 secs
Total no of threads: 10
Average time: 3.54335241
Decorated Threads total duration: 252.6174771
Decorated Average: 25.26174771
{'Average thread time': 1.0,
'Total No of threads': 10,
'Total sum': 206.22481570000002,

'Worker_12': 24.3358389,
'Worker_14': 30.666786400000003,
'Worker_15': 31.0817758,
'Worker_16': 28.942495400000002,
'Worker_18': 30.5261043,
'Worker_19': 31.2030797,
'Worker_7': 10.1182106,
'Worker_8': 19.3505246}

# Mimalloc (threaded)
--- Started Worker_8
--- Started Worker_15
--- Started Worker_18
--- Started Worker_12
--- Started Worker_19
--- Started Worker_7
--- Started Worker_14
--- Started Worker_8
--- Started Worker_16
--- Started Worker_16
Worker_8 took 10.847116000000002 secs to complete
--- Completed Worker_8
Worker_8 took 11.1443108 secs
Worker_8 took 13.0781086 secs to complete
Worker_7 took 13.5550753 secs to complete
Worker_12 took 18.3521441 secs to complete
Worker_14 took 21.2911739 secs to complete
Worker_16 took 22.24417 secs to complete
Worker_16 took 22.4501922 secs to complete
Worker_19 took 22.9591641 secs to complete
Worker_15 took 23.259871699999998 secs to complete
--- Completed Worker_15
Worker_15 took 23.2652892 secs
Worker_18 took 23.4381073 secs to complete
--- Completed Worker_18
Worker_18 took 23.4385529 secs
--- Completed Worker_12
Worker_12 took 18.364294599999997 secs
--- Completed Worker_19
Worker_19 took 23.0001614 secs
--- Completed Worker_7
Worker_7 took 13.56528 secs
--- Completed Worker_14
Worker_14 took 21.3074085 secs
--- Completed Worker_8
Worker_8 took 13.1012854 secs
--- Completed Worker_16
Worker_16 took 22.4564045 secs
--- Completed Worker_16
Worker_16 took 22.266403899999997 secs
Total execution time: 27.5347123 secs
Total no of threads: 10
Average time: 2.7534712299999997
Decorated Threads total duration: 191.9093912
Decorated Average: 19.19093912
{'Average thread time': 1.0,
'Total No of threads': 10,
'Total sum': 158.3086759,

'Worker_12': 18.364294599999997,
'Worker_14': 21.3074085,
'Worker_15': 23.2652892,
'Worker_16': 22.266403899999997,
'Worker_18': 23.4385529,
'Worker_19': 23.0001614,
'Worker_7': 13.56528,
'Worker_8': 13.1012854}

# Snmalloc (threaded)
--- Started Worker_8
--- Started Worker_15
--- Started Worker_18
--- Started Worker_12
--- Started Worker_19
--- Started Worker_7
--- Started Worker_14
--- Started Worker_8
--- Started Worker_16
--- Started Worker_16
Worker_12 took 15.2025633 secs to complete
Worker_8 took 15.532629799999999 secs to complete
Worker_8 took 16.9360441 secs to complete
--- Completed Worker_8
Worker_8 took 16.9487185 secs
Worker_7 took 16.5794989 secs to complete
Worker_16 took 20.3101881 secs to complete
Worker_15 took 23.9788049 secs to complete
--- Completed Worker_15
Worker_15 took 23.9845701 secs
Worker_14 took 24.1287856 secs to complete
Worker_18 took 25.187898500000003 secs to complete
--- Completed Worker_18
Worker_18 took 25.1937109 secs
--- Completed Worker_12
Worker_12 took 15.2302866 secs
Worker_16 took 24.8208617 secs to complete
Worker_19 took 25.636650200000002 secs to complete
--- Completed Worker_19
Worker_19 took 25.6369919 secs
--- Completed Worker_7
Worker_7 took 16.7232609 secs
--- Completed Worker_14
Worker_14 took 24.1445621 secs
--- Completed Worker_8
Worker_8 took 15.636324700000001 secs
--- Completed Worker_16
Worker_16 took 24.825643600000003 secs
--- Completed Worker_16
Worker_16 took 20.3480972 secs
Total execution time: 29.8453487 secs
Total no of threads: 10
Average time: 2.98453487
Decorated Threads total duration: 208.6721665
Decorated Average: 20.86721665
{'Average thread time': 1.0,
'Total No of threads': 10,
'Total sum': 166.8978044,

'Worker_12': 15.2302866,
'Worker_14': 24.1445621,
'Worker_15': 23.9845701,
'Worker_16': 20.3480972,
'Worker_18': 25.1937109,
'Worker_19': 25.6369919,
'Worker_7': 16.7232609,
'Worker_8': 15.636324700000001}

@indygreg (Owner) left a comment

I started to look at this PR. The performance numbers look very promising and I'd very much like to get this merged!

I may or may not tidy this up locally and push it myself (preserving authorship information, of course). If I do this, I'll likely make a handful of changes:

  • Remove rayon from pyembed crate. The parallelism could be nice. However, we want to limit the crate dependencies of this run-time crate and I'm likely against adding rayon to that list.
  • Remove rayon globally. It looks like a global search and replace was performed to make certain iterable operations parallel. Many of these don't need to be made parallel. And the change could be wrong if rayon doesn't preserve order during parallel iteration. I'm not opposed to making use of parallel iteration where it makes sense (outside crates used at run-time). But such a change should be made in its own commit with a performance justification to support the change.
  • Revert a lot of fully qualified crate versions in Cargo.toml files. This may have been a rebase issue with these patches, I'm not sure. Going forward, we prefer simple X or X.Y version pinning to minimize the churn/effort required to update crates. The Cargo.lock file can ensure exact versions are pinned.
  • Possibly undo some of the lifetime annotation additions. It isn't clear to me why the anonymous lifetime was added in a bunch of places. Even more perplexing is why this was done in a commit whose message was update dependencies. Is there more to this story?

Anyway, I'm continuing to reconcile this PR locally. If I run into blocking issues requiring your attention, I'll push a branch that you can rebase this off of. Give me a few hours...

@ryancinsight (Contributor, Author)

I may have gotten a little carried away with trying to figure out why random tests were failing; some dependencies needed to be updated to pass the nightly, for example. In terms of lifetimes, those are from clippy suggestions; apparently they will be needed in 2021, so I was trying to update everything that may be needed all at once. The warnings were also clogging my command line so that I couldn't find the real clippy issues easily. There shouldn't be any patches anymore; I sent pull requests to any external crates I did change.

I do recommend keeping remove_dir_all. On Windows I kept getting errors that my install folder didn't have permission and the build would fail; apparently std on Windows would only delete the exe on rebuilds and fail on the lib folder.

Yeah, I had just started on rayon; I was hoping to start implementing walkdir with it, but I'll push that one separately. I was curious what else we could enhance with Rust, like dependency and file search, etc.

For mimalloc I would recommend deleting the padding and removing the dynamic-tls feature; I've removed them from my local copy. I can make any of the changes you listed and re-upload with the mimalloc change if you would like? Just let me know so I'm not conflicting with any of your changes.

There is also a mimalloc heap allocator that I'm trying to figure out, but every time I try to delete the heap at once, Python crashes. The heap also needs to be the ctx or Python overallocates, so I might send the heap version later on once I study it more.

@ryancinsight (Contributor, Author)

As a side note, I would appreciate learning why you don't like the lifetimes? I'm still new to them and they seem preferred in the Rust manual.

Furthermore, I have an idea to reduce some of the compile time by switching out the macro-heavy lazy_static for once_cell, which is actually slated to be moved into the standard library soon: https://crates.io/crates/once_cell. I wanted to get your thoughts first on whether you preferred lazy_static for some reason?

@indygreg (Owner)

> I may have gotten a little carried away with trying to figure out why random tests were failing; some dependencies needed to be updated to pass the nightly, for example. In terms of lifetimes, those are from clippy suggestions; apparently they will be needed in 2021, so I was trying to update everything that may be needed all at once. The warnings were also clogging my command line so that I couldn't find the real clippy issues easily. There shouldn't be any patches anymore; I sent pull requests to any external crates I did change.

Thanks for the context. And no worries! The clippy warnings shouldn't have existed anyway. But they did, likely because the repo has sat dormant for several weeks while the Rust people keep pumping out new clippy lints.

> I do recommend keeping remove_dir_all. On Windows I kept getting errors that my install folder didn't have permission and the build would fail; apparently std on Windows would only delete the exe on rebuilds and fail on the lib folder.

I didn't know about remove_dir_all: I agree that it should be used wherever possible. However, tugger-file-manifest is used by pyembed and I'd prefer to not introduce the run-time dependency. I may look into a way to get it as a conditional dependency.

> Yeah, I had just started on rayon; I was hoping to start implementing walkdir with it, but I'll push that one separately. I was curious what else we could enhance with Rust, like dependency and file search, etc.

Generally speaking, not much in PyOxidizer is performance sensitive. The biggest concern is run-time performance of built applications. There, the most important thing is reading the packed resources file and loading modules. But when I last measured this, I'm not convinced that throwing multiple threads at it will make it any better.

> For mimalloc I would recommend deleting the padding and removing the dynamic-tls feature; I've removed them from my local copy. I can make any of the changes you listed and re-upload with the mimalloc change if you would like? Just let me know so I'm not conflicting with any of your changes.

I'll let you know. I'm still trying to grok the patch.

> There is also a mimalloc heap allocator that I'm trying to figure out, but every time I try to delete the heap at once, Python crashes. The heap also needs to be the ctx or Python overallocates, so I might send the heap version later on once I study it more.

This is good to know! There's likely a missing lifetime annotation / dependency on the heap allocator. I'll pay close attention to this when I look at the code.

> As a side note, I would appreciate learning why you don't like the lifetimes? I'm still new to them and they seem preferred in the Rust manual.

Generally speaking, modern versions of Rust do a pretty good job of figuring out lifetimes for you, so lifetime annotations just aren't needed in many situations. Major exceptions to this are when you are defining/implementing traits and when materializing Rust objects from pointers. In simple cases where you pass a reference into a function and it returns a simple object (like a bool), you likely never need to define lifetimes. You should generally only need to define lifetimes when the returned value references something in an input argument. And even then, Rust is often smart enough to associate the lifetime of the output object with that of the input object it is referencing, so explicit lifetime annotations just aren't necessary.

Note that pyembed is kinda special because of all the FFI code: it contains more lifetime annotations than a lot of common Rust code. This is because the Rust compiler can't reason about what the Python interpreter does with Rust objects cast to pointers.
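
To illustrate the elision point:

```rust
// No annotation needed: the return value borrows nothing.
fn is_empty(s: &str) -> bool {
    s.is_empty()
}

// Still no annotation needed: with a single input reference, Rust ties
// the output lifetime to it automatically.
fn first_word(s: &str) -> &str {
    s.split_whitespace().next().unwrap_or("")
}

// Annotation required: two input references, and the compiler cannot
// tell which one the output borrows from.
fn longer<'a>(a: &'a str, b: &'a str) -> &'a str {
    if a.len() >= b.len() { a } else { b }
}
```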

> Furthermore, I have an idea to reduce some of the compile time by switching out the macro-heavy lazy_static for once_cell, which is actually slated to be moved into the standard library soon: https://crates.io/crates/once_cell. I wanted to get your thoughts first on whether you preferred lazy_static for some reason?

I'm only using lazy_static because the standard library didn't have anything comparable. If there's a better way to do things in 2021, by all means send PRs! Keep in mind we still support Rust 1.45. Although the minimum version requirement can be increased if it makes hacking on the code base easier.

@indygreg (Owner) left a comment

I've locally committed the code in this patch to add mimalloc as the raw allocator.

As part of looking at the entire state of this allocator code, I realized that the abstractions in pyembed were kinda misleading. From the Python C API perspective, you have a single PyMemAllocatorEx struct representing an allocator. And that can be installed for the raw, mem, and obj domains. Furthermore, the built-in pymalloc allocator can be customized to have a specific arena allocator function.

So the code in pyembed emphasizing "raw" is somewhat wrong, as a PyMemAllocatorEx can be installed in multiple domains.

Coinciding with your code, I'm refactoring a lot of the internal names to be more generic so we can install an allocator (backed by jemalloc, mimalloc, etc) in any domain as well as use one of these allocators for the pymalloc arena allocator.
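
As a hedged sketch of the arena-allocator half, using python3-sys's PyObjectArenaAllocator binding (quoted later in this thread) and libmimalloc-sys; this is illustrative rather than the exact refactored code:

```rust
use libc::{c_void, size_t};
use python3_sys::{PyObjectArenaAllocator, PyObject_SetArenaAllocator};

extern "C" fn arena_alloc(_ctx: *mut c_void, size: size_t) -> *mut c_void {
    unsafe { libmimalloc_sys::mi_malloc(size) }
}

extern "C" fn arena_free(_ctx: *mut c_void, ptr: *mut c_void, _size: size_t) {
    // mimalloc does not need the size hint to free.
    unsafe { libmimalloc_sys::mi_free(ptr) }
}

// Must run before the interpreter is initialized.
pub unsafe fn install_mimalloc_arena_allocator() {
    let mut arena = PyObjectArenaAllocator {
        ctx: std::ptr::null_mut(),
        alloc: Some(arena_alloc),
        free: Some(arena_free),
    };
    // CPython copies the struct on set.
    PyObject_SetArenaAllocator(&mut arena);
}
```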

I still have a ways to go. But the refactor looks promising!

I'm preserving your authorship on commits where you clearly contributed the code. I don't want your contribution to get lost here!

Comment on lines +244 to +245
let padding = _ctx as *const _ as usize;
unsafe { mimallocffi::mi_malloc(size + padding) }
Owner

Why is padding used here and in other functions?

Contributor (Author)

The padding was based on the author Victor Stinner's explanation of the custom memory allocators. Apparently ctx is used in Python for padding (10 MB or kB for PyMemAllocatorEx and 2 MB or kB for pymalloc). However, my recent tests have shown this can max out the memory allocation, and I've decided not to use it in the non-heap allocator version. The heap allocator I'm working on works nicely with the padding.

So, as mentioned earlier, we can remove the padding here for the non-heap mimalloc, which I've done locally.

Contributor (Author)

Here is his explanation: https://www.python.org/dev/peps/pep-0445/

Again, I don't recommend the padding until we implement the mimalloc heap version.

Owner

OK. I was confused because we don't actually set ctx to non-NULL in this patch. Thanks for the explanation!

@ryancinsight (Contributor, Author)

Actually, I tried that originally and don't recommend setting the PyMemAllocatorEx for all domains. It dramatically worsened the runtime; I believe this comes down to the number of allocations being made, and it's why there is pymalloc in the first place for small allocations.

@ryancinsight (Contributor, Author)

> Actually, I tried that originally and don't recommend setting the PyMemAllocatorEx for all domains. It dramatically worsened the runtime; I believe this comes down to the number of allocations being made, and it's why there is pymalloc in the first place for small allocations.

However, this is also why I decided to make the PyObjectArenaAllocator use the same named raw allocator, so we don't mix different types of allocators; I wanted to add an option in the future to use the mimalloc heap allocator instead of the normal mimalloc allocator. Note that if you look at pymalloc's free, it actually changes the call so that it would fail if raw tried to free pymalloc memory or vice-versa.

@indygreg (Owner)

> Actually, I tried that originally and don't recommend setting the PyMemAllocatorEx for all domains. It dramatically worsened the runtime; I believe this comes down to the number of allocations being made, and it's why there is pymalloc in the first place for small allocations.

Interesting. I thought some of these custom allocators specialized in many small allocations and might give pymalloc a run for its money! Then again, pymalloc is optimized for Python's use cases, so I can see how it would outperform a generic allocator!

I think the bigger potential is around using a custom allocator as pymalloc's arena allocator.

Regardless, I'd like to make all of this configurable so things are super flexible. Who knows: perhaps we enable people to perform random benchmarks with the flexibility and gain useful insights!

@ryancinsight (Contributor, Author)

> Actually, I tried that originally and don't recommend setting the PyMemAllocatorEx for all domains. It dramatically worsened the runtime; I believe this comes down to the number of allocations being made, and it's why there is pymalloc in the first place for small allocations.

> Interesting. I thought some of these custom allocators specialized in many small allocations and might give pymalloc a run for its money! Then again, pymalloc is optimized for Python's use cases, so I can see how it would outperform a generic allocator!

> I think the bigger potential is around using a custom allocator as pymalloc's arena allocator.

> Regardless, I'd like to make all of this configurable so things are super flexible. Who knows: perhaps we enable people to perform random benchmarks with the flexibility and gain useful insights!

I misspoke: jemalloc and mimalloc perform much better than pymalloc. However, for some strange reason I kept getting better results setting pymalloc's arena directly to mimalloc or jemalloc rather than overriding everything with the PyMemAllocatorEx. I believe calls to the PyObjectArenaAllocator API may be for smaller amounts of memory than calls to the raw domain. Otherwise, I'm not sure why overriding pymalloc itself is better than replacing it entirely with the PyMemAllocatorEx, in which case pymalloc no longer needs setting because it is no longer used.

@ryancinsight (Contributor, Author)

So to clarify: my best results were obtained by setting the PyMemAllocatorEx to mimalloc for PYMEM_DOMAIN_RAW and also backing pymalloc with mimalloc to take care of PYMEM_DOMAIN_MEM and PYMEM_DOMAIN_OBJ.

You would think this should be the same as setting the PyMemAllocatorEx for PYMEM_DOMAIN_RAW, PYMEM_DOMAIN_OBJ, and PYMEM_DOMAIN_MEM, but for some strange reason (probably my fatigue) it was not, lol.

@indygreg (Owner) left a comment

It appears that python3-sys defines PyObjectArenaAllocator.free as pub free: Option<extern "C" fn(ctx: *mut c_void, ptr: *mut c_void, size: size_t) -> ()>. However, the Python docs have the signature as void free(void *ctx, size_t size, void *ptr). (Note size and ptr having their order reversed.)

So we can implement code that compiles. But it will likely blow up at run-time unless we add some very ugly hacks to munge the arguments. We should fix this in python3-sys...

@ryancinsight (Contributor, Author)

> It appears that python3-sys defines PyObjectArenaAllocator.free as pub free: Option<extern "C" fn(ctx: *mut c_void, ptr: *mut c_void, size: size_t) -> ()>. However, the Python docs have the signature as void free(void *ctx, size_t size, void *ptr). (Note size and ptr having their order reversed.)

Yes, this threw me off as well; I ended up having to switch the pointer and size in pyalloc, but what you are describing makes sense. The memory peaks around 80-90%, and jemalloc/mimalloc either do smaller frees or the system is freeing on its own to prevent going too high. I just thought it may be keeping the memory indexed for future use and not releasing it yet, which I've seen these newer allocators tend to do, but that would now make sense for the heap crashing.

> So we can implement code that compiles. But it will likely blow up at run-time unless we add some very ugly hacks to munge the arguments. We should fix this in python3-sys...

Agreed

indygreg pushed a commit that referenced this pull request Feb 26, 2021
Support for both allocators is implemented in the config and interpreter
initialization layers. Only mimalloc has its Python raw allocator
implemented: snmalloc will always fail if used.

Part of #358.
@indygreg (Owner)

I just pushed all my local commits to the main branch. This includes your code to implement support for mimalloc for the raw allocator.

You'll notice that pyalloc.rs has seen a bit of refactoring. I cleaned up the code after your patch to hopefully make it more readable.

I also pushed the remnants of your branch to the mimalloc branch. Diffing/rebasing against main will be difficult because of the refactoring. The last meaningful commit it can be diffed against is 8a32bcf2f491999f6db9e913a66c6503543f6d12.

There are a few things that aren't yet implemented:

  • No support for Rust global allocators. I wanted to do this separately from the Python allocator backend parts because it is different. It also feels like low-hanging fruit. TBH I don't think setting the Rust global allocator buys us much because not much Rust code runs. And the people who do run Rust code will have their own "driver" crate and can install their own global allocator. But I still think it could be useful to define.
  • snmalloc doesn't track the allocation size, so I haven't enabled its allocator backend. We'll need to do something like we've done for the Rust allocator so it can track allocation sizes. However, while hacking on this I realized the Rust allocator isn't thread safe. That's a serious bug! We can't copy/emulate Rust's allocation tracking code without fixing that.
  • Arena allocation isn't hooked up to the config. This is blocked on the aforementioned python3-sys bug.
  • We don't support setting the mem or object allocators. IMO we should do this. In the end state, we should have configuration fields for the raw (already exists), mem, and object allocators. We should also be able to configure the arena allocator. And we should be able to configure the Rust global allocator. TBH I'm unsure which combinations of settings should be allowed. But I think we should expose setting them all independently. And we should also consider how to configure debugging on the allocator. We may want to invent a single string field to control a preset list of supported allocator settings. Kinda like Python has settings like pymalloc and pymalloc_debug which are adjustable via the PYTHONMALLOC environment variable.

@indygreg (Owner)

I'm going to implement Starlark support for configuring the mem, obj, and arena allocators today. That way we'll have full control over allocator settings. This should enable some more experimentation. Although some features may be buggy for reasons recorded above.

@ryancinsight (Contributor, Author)

I filed an issue with rust-cpython to let the author know about the error as well. In the meantime, setting the PyMemAllocatorEx for all three domains (raw, obj, mem) should override the PyObjectArenaAllocator completely according to Victor Stinner, but I'm not sure why I saw a performance benefit going through the PyObjectArenaAllocator instead.

indygreg added a commit that referenced this pull request Feb 26, 2021
We now define a custom allocator backend in its own field and
have separate fields controlling how it is used. This ensures we
only have a single custom allocator defined rather than a mix of
multiple allocators.

We expose fields to enable installing custom allocators for the mem
and obj domains as well as pymalloc's arena allocator (although the
arena allocator is busted due to
dgrunwald/rust-cpython#257).

We change the name of the "system" backend to "default" because that
is more appropriate.

I also found a subtle issue with the detection of the default backend
only checking for x86_64 on Windows: it should check for all Windows
MSVC arches.

Part of #358.
@indygreg (Owner)

OK. Refactored allocator control has been pushed. We should now have full control over the allocator settings from Starlark / pyembed!

> In the meantime, setting the PyMemAllocatorEx for all three domains (raw, obj, mem) should override the PyObjectArenaAllocator completely according to Victor Stinner, but I'm not sure why I saw a performance benefit going through the PyObjectArenaAllocator instead.

Yes, setting obj and mem will replace pymalloc.

Regarding the performance, my guess is pymalloc is still better than a custom allocator for mem+obj because it is optimized for Python's allocation strategy. I think the sweet spot for custom allocators will be raw + pymalloc's arena allocator. But we won't know this until that python3-sys bug is fixed.

Are you going to send a PR to python3-sys to fix the API or should I do it?

Once that lands upstream, I'm hesitant to use an unreleased python3-sys in our Cargo.toml because that prevents us from publishing a release to crates.io (I'd like to get a PyOxidizer release out in the next few weeks and don't want to be blocked on python3-sys publishing a release). However, it would be trivial to use a custom python3-sys locally to test the performance impact.

indygreg added a commit that referenced this pull request Feb 26, 2021
This addresses some outstanding TODOs and allows us to enable this
allocator! I haven't tested this thoroughly yet. But at least the
test doesn't crash!

Part of #358.
indygreg added a commit that referenced this pull request Feb 26, 2021
Apparently python3-sys is correct! See
dgrunwald/rust-cpython#258. So we can enable
this feature.

As part of this, I fixed an unset pointer that was uncovered by the
newly enabled tests. I guess the tests did their job!

Part of #358.
@indygreg (Owner)

OK. snmalloc is now implemented (using allocation tracking) and the arena allocator support is enabled.

That leaves us with support for defining the global allocator as the only missing feature I can think of.

I'm not thrilled about the code for ensuring thread safety of allocation tracking. That likely imposes performance overhead. Although we should benchmark. I'd love ideas for how to make that code better. It would also be nice to avoid the tracking overhead if an allocator doesn't need to be thread safe (I think arena allocators are guaranteed to have the GIL held?).

@ryancinsight (Contributor, Author)

> OK. snmalloc is now implemented (using allocation tracking) and the arena allocator support is enabled.

This is great news. Do you think it would be worth specifying allocator features? mimalloc has a couple of different ones, like dynamic-tls, secure, etc., that aren't actually required; the only required one is extended.

> That leaves us with support for defining the global allocator as the only missing feature I can think of.

The more I think about it, that should probably go in build.rs so that it is used from the very beginning. My concern if we don't set it is that it may limit the allocations done by any Rust code, but this may not matter much for normal users.

> I'm not thrilled about the code for ensuring thread safety of allocation tracking. That likely imposes performance overhead. Although we should benchmark. I'd love ideas for how to make that code better. It would also be nice to avoid the tracking overhead if an allocator doesn't need to be thread safe (I think arena allocators are guaranteed to have the GIL held?)

I'll definitely look it over. Actually, mimalloc and snmalloc should already be considered thread safe, as they are built to run on multiple threads and track their own allocations. Here is the Microsoft paper on mimalloc describing how it safely shares allocations between threads: https://www.microsoft.com/en-us/research/uploads/prod/2019/06/mimalloc-tr-v1.pdf

The first paragraph actually mentions Python and how mimalloc would suit its needs for the allocator backend, since Python does reference counting on its own.

snmalloc, on the other hand, uses message passing: frees from other threads are queued back to the owning thread to deallocate. It excels at small objects, which is why I thought it may be better as a pymalloc alternative: https://www.microsoft.com/en-us/research/uploads/prod/2020/04/snmalloc.pdf
It also uses owned heaps, meaning each thread owns its own allocator. Again, allocator features may be a benefit to add, as there are 1 kB and 1 MB snmalloc variants relating to the size of individual allocations; it appears optimized for that range, with larger objects reducing the benefit.

@indygreg (Owner)

> This is great news. Do you think it would be worth specifying allocator features? mimalloc has a couple of different ones, like dynamic-tls, secure, etc., that aren't actually required; the only required one is extended.

Yes, I think it would be worth allowing those to be specified.

> That leaves us with support for defining the global allocator as the only missing feature I can think of.

> The more I think about it, that should probably go in build.rs so that it is used from the very beginning. My concern if we don't set it is that it may limit the allocations done by any Rust code, but this may not matter much for normal users.

I was going to throw the global allocator into the auto-generated Rust crate providing the application. IMO it makes more sense to configure the global allocator on the crate providing an executable and not a library.

> I'll definitely look it over. Actually, mimalloc and snmalloc should already be considered thread safe, as they are built to run on multiple threads and track their own allocations. Here is the Microsoft paper on mimalloc describing how it safely shares allocations between threads: https://www.microsoft.com/en-us/research/uploads/prod/2019/06/mimalloc-tr-v1.pdf

I should have been specific about the limitations. I'm not concerned about thread safety of calling the allocator functions: allocators are generally thread safe. The problem is that both the rust and snmalloc backends need to pass the original allocation size to the C API, and that size isn't always passed into the Python C memory APIs. So we need to track the allocations independently of Python. We're currently using a Rust HashMap<*const c_void, usize> behind the allocator's void* ctx for that, which isn't thread safe. That's the part that needs thread safety.

Alternatively, we could potentially track the allocation size internally in N bytes at the beginning or end of the allocated block. (Preferably the end to ensure memory alignment.) However, I believe there are security implications to this, since a buffer underrun or overrun would corrupt our accounting and this could potentially lead to badness if the previous size fed into the allocator APIs were incorrect. I worry that could somehow be exploited (e.g. you use a buffer overrun to lie about the previous allocation size which leads to a UAF and RCE). I'm not an expert in this area so I could be wrong. If we pursued this, I would likely reach out to some former coworkers who know this space much better than me to get their assessment. The good news is that I'm pretty sure you can work around this by storing a checksum or some such to detect corruption. This is similar to how Python's debug hooks work. And I believe many memory allocators do things like this transparently as built-in security features. Again, I just don't know this space well enough.
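
To make the first option concrete, a minimal sketch (illustrative names, not the actual pyembed internals) of wrapping the tracking map in a Mutex:

```rust
use std::collections::HashMap;
use std::os::raw::c_void;
use std::sync::Mutex;

// Size tracking keyed by pointer address; the Mutex makes concurrent
// allocate/free calls from multiple threads safe.
struct AllocationTracker {
    sizes: Mutex<HashMap<usize, usize>>,
}

impl AllocationTracker {
    fn new() -> Self {
        AllocationTracker { sizes: Mutex::new(HashMap::new()) }
    }

    fn record(&self, ptr: *const c_void, size: usize) {
        self.sizes.lock().unwrap().insert(ptr as usize, size);
    }

    fn take(&self, ptr: *const c_void) -> Option<usize> {
        self.sizes.lock().unwrap().remove(&(ptr as usize))
    }
}
```

A single lock on every allocation will contend under threads; sharding the map by pointer bits or using a concurrent map would be the obvious refinements.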

indygreg added a commit that referenced this pull request Feb 27, 2021
Previously, use of jemalloc would make jemalloc the Rust global
allocator via the `jemallocator` crate. With the recent introduction
of support for mimalloc and snmalloc, we want to support a similar
feature.

This commit teaches the new project template to configure the Rust
global allocator for all 3 supported custom allocators. We do this
by introducing a new set of crate features for controlling the global
allocator explicitly. The autogenerated main.rs has explicit feature
checking and code for registering the global allocator. This is a
bit more verbose. But I like being explicit and consistent across the
allocators.

We've also introduced tests that verify autogenerated projects with
custom allocators compile correctly.

There is room to introduce Starlark config options to control the
global allocator settings. But these didn't exist before. So I think
we don't need to implement that functionality as part of this commit.

Both mimalloc and snmalloc use cmake as part of their crate build
process. So we document that and skip tests on Windows since cmake
isn't present in CI. We'll likely need more intelligent follow-ups
here (like to probe for the executable in the test). Let's wait for
things to break/complain before introducing more complexity.

Part of #358.
@ryancinsight (Contributor, Author) commented Feb 27, 2021

Got it. I may be mistaken, but I believe the size tracking is actually there because the snmalloc team wrote the Rust-export API that the snmalloc-rs crate calls directly. This is in contrast to the mimalloc-rs (Rust crate) team, which actually implemented mi_malloc and mi_free in Rust before converting to the size-tracked, Rust-global-allocator-compatible version.

For example, this is the rust.cc that snmalloc wrote and that snmalloc-sys calls: https://github.com/microsoft/snmalloc/blob/master/src/override/rust.cc
It pretty much limits you to sized calls because it's trying to be compatible with Rust.

I won't deny that the more direct route snmalloc takes may be better for the Rust global allocator, with fewer function wrappers. However, I was hoping to bring the snmalloc malloc and free that don't require a size, similar to mimalloc's, over to the snmalloc-sys crate once I get a chance.
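
For context on why snmalloc's rust.cc is size-oriented: Rust's GlobalAlloc trait hands the full Layout (size and alignment) to dealloc, so a Rust-facing FFI surface never has to look sizes up. A minimal sketch, forwarding to the system allocator as a stand-in for snmalloc's entry points:

```rust
use std::alloc::{GlobalAlloc, Layout, System};

struct SizedAlloc;

unsafe impl GlobalAlloc for SizedAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // A real backend would forward layout.size() and layout.align()
        // to snmalloc's Rust entry points.
        System.alloc(layout)
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        // Unlike C's free(ptr), the size is always known here.
        System.dealloc(ptr, layout)
    }
}
```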

Furthermore, I would expect the cmake requirement to be short-term: mimalloc-rs has a pull request to use the cc crate, and I may use cc as well when adding functionality to snmalloc-sys.

@indygreg (Owner)

> Got it. I may be mistaken, but I believe the size tracking is actually there because the snmalloc team wrote the Rust-export API that the snmalloc-rs crate calls directly. This is in contrast to the mimalloc-rs (Rust crate) team, which actually implemented mi_malloc and mi_free in Rust before converting to the size-tracked, Rust-global-allocator-compatible version.

> For example, this is the rust.cc that snmalloc wrote and that snmalloc-sys calls: https://github.com/microsoft/snmalloc/blob/master/src/override/rust.cc
> It pretty much limits you to sized calls because it's trying to be compatible with Rust.

Very interesting!

> I won't deny that the more direct route snmalloc takes may be better for the Rust global allocator, with fewer function wrappers. However, I was hoping to bring the snmalloc malloc and free that don't require a size, similar to mimalloc's, over to the snmalloc-sys crate once I get a chance.

This would be terrific! If the size argument isn't needed, we could remove the allocation tracking code and get performance on par with the other allocators (modulo Rust, which will always be special). Ideally there is a C API that doesn't accept the argument. But sending in a special value for the size (e.g. usize::MAX) to indicate unknown would also work. I just want the API semantics strongly documented so we're not taking a risk by not sending in an appropriate size value.

@mjp41 commented Mar 1, 2021

FYI: I am happy to take changes and additions to the snmalloc rust.cc to support your use case. Please file an issue with the requirements. My aim was to expose the most optimal surface for the global allocator in Rust, but we can add more.

@ryancinsight (Contributor, Author)

> FYI: I am happy to take changes and additions to the snmalloc rust.cc to support your use case. Please file an issue with the requirements. My aim was to expose the most optimal surface for the global allocator in Rust, but we can add more.

Hello @mjp41, thanks for the info, and I completely agree that you exposed the optimal surface for the Rust global allocator. While I plan to use the global allocator with Rust projects embedding Python, I would also like to use snmalloc for Python's memory allocator directly, without Rust's size-tracking requirements, calling snmalloc's versions of malloc, realloc, calloc, and free. I'll see what I can do from my side first and file an issue as you suggest if I need more support. Best regards!
