Bad multithreaded performance of allocations #8101
Comments
Thanks for the report. As noted, the performance increase in the case above (with the patch applied) is due to removing pressure on atomic ops in the Numba runtime (NRT), and then also doing less work RE adding debug markers. I agree that #7887 is also likely related. I did some experiments a few weeks back to try out some ideas in relation to improving the performance in these sorts of situations. Steps were roughly:
From doing the above, it became very clear that the memory system statistics collection "got in the way" and prevented further optimisation. I manually removed it (similar to the patch above) and LLVM managed to optimise even further, but certain patterns in the way Numba currently generates code prevented it from optimising the
I think the following needs to happen in this order, which is also the approximate order of difficulty:
|
@stuartarchibald That makes a lot of sense to me, for what it is worth. While looking at this a bit more, I noticed that in We can trigger a segfault by requesting too much memory:
import numba
import numpy as np

n = numba.size_t.maxval // 8 // 2

# Raises a memory error as it should
# np.zeros(n)

@numba.njit
def foo():
    return np.zeros(n)

# Segfault
out = foo()
I can also open a separate issue for this, I guess it isn't really related to the |
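For readers following along, here is a rough worked calculation of how much memory the snippet above asks for. It is only a sketch: it assumes a 64-bit platform, so that numba.size_t.maxval equals 2**64 - 1, and it assumes the default float64 dtype of np.zeros (8 bytes per element). The names SIZE_T_MAX and FLOAT64_ITEMSIZE are made up for illustration.

```python
# Assumption: 64-bit size_t, i.e. numba.size_t.maxval == 2**64 - 1.
SIZE_T_MAX = 2**64 - 1
# Assumption: np.zeros(n) uses its default dtype, float64 (8 bytes per element).
FLOAT64_ITEMSIZE = 8

# Same expression as in the reproducer above.
n = SIZE_T_MAX // 8 // 2

# Total size of the requested buffer in bytes.
requested_bytes = n * FLOAT64_ITEMSIZE

print(n == 2**60 - 1)                # True
print(requested_bytes == 2**63 - 8)  # True
```

Under these assumptions the request is just under 2**63 bytes (about 8 EiB), which no allocator can satisfy, so the allocation is expected to fail.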
Great, thanks for taking a look.
Please could you put this into a new issue? We can discuss what the best thing to do is on there, many thanks! |
The alloc issue is now #8105 |
Many thanks for opening that. |
xref: #8156 which is discussing making the stats counters optional, off by default. |
xref: #8200 this makes it such that the |
xref: #8235 which implements making the stats counters off by default. |
xref: summary here #8156 (comment), once patches are merged, can potentially close this ticket? |
Fixed in #8235 |
When new arrays are allocated, wrapper functions in nrt.c around malloc are called.
Those wrappers increment an atomic counter (I guess for debugging/profiling purposes?),
but if we call those functions from several threads this can come with a serious performance
hit, because different cores write to the same memory location. In some real-world code
(that probably could be improved by avoiding some of the allocations) I see a performance
drop of ~30%.
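To make the pattern concrete, here is a toy Python model of an allocation wrapper that bumps a shared counter, with a flag that turns the counting off (the direction discussed in #8156 and #8235). This is not Numba's code or API: CountingAllocator, stats_enabled and run are hypothetical names, and a lock stands in for the atomic increment. Because of Python's GIL this sketch does not reproduce the slowdown; it only shows where every thread touches the same location and how a flag avoids that write.

```python
import threading


class CountingAllocator:
    """Toy stand-in for malloc wrappers that keep allocation statistics."""

    def __init__(self, stats_enabled=False):
        self.stats_enabled = stats_enabled
        self._lock = threading.Lock()  # models the atomic increment
        self.alloc_count = 0

    def allocate(self, nbytes):
        if self.stats_enabled:
            # Every thread writes to the same shared counter here.
            with self._lock:
                self.alloc_count += 1
        return bytearray(nbytes)


def run(allocator, n_threads=4, n_allocs=1000):
    """Allocate from several threads and return the recorded count."""
    def worker():
        for _ in range(n_allocs):
            allocator.allocate(16)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return allocator.alloc_count


counted = run(CountingAllocator(stats_enabled=True))
uncounted = run(CountingAllocator())  # stats off: no shared write at all
print(counted, uncounted)
```

With the flag off, the allocation path never touches shared state, which is the property the compiled wrappers need in order to scale across cores.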
To show the issue more clearly, here is an extreme example:
Timings:
Running on a Threadripper 1920X with 12 cores, and jemalloc. Since some cores
have no cache in common, this is probably close to the worst case; I guess this will
not look nearly as bad on most laptops.
To evaluate without the atomic counter, I use this patch for nrt.c:
Somewhat related: #7887