many functions slow for bigger-than-cache arrays on linux from numpy>=1.16.5 #15545
Comments
possibly related to #14322 somehow? |
@victor-shepardson That would be my first guess also. How are you installing numpy? It might be interesting to try a source install if that is not what you are doing, although that might degrade BLAS performance depending on the library, but that isn't what we are looking at here. |
Might look at the conversation in #14177. What does |
Although it is curious that |
I don't see this regression on my hardware (which is apparently much slower than yours).
Which is actually a bit faster than 1.16.4. So this is rather curious, and I'd like to know why your numbers can be so much larger. Sounds like something parallel, I wonder if this is a compiler issue. |
@charris thanks for the quick reply. We've been installing from conda; installing from pip shows the same behavior. I'll try building from source tomorrow. |
@victor-shepardson The options are settable, although I suspect you need to be root. I think there is probably some thrashing going on in order to account for the large drop in performance, but this is getting beyond my experience. @pentschev Thoughts? |
This is strange; I wouldn't expect to see any performance difference around CPU cache sizes.
This means there are 3 options possible, and the one inside square brackets is currently selected. I'm wondering if you could check whether there's a difference when you change that to `never`.
If hugepage is really the cause, then I would expect setting it to `never` to make the regression go away. Also tagging @hmaarrfk in case there are other ideas I'm not thinking of now. |
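A minimal sketch, not from the thread, of reading which option is currently selected; `thp_setting` is a hypothetical helper name, and the same parsing works for the `defrag` file discussed further down:

```python
# Sketch: report the active transparent-hugepage option.
# The kernel marks the selection with brackets, e.g. "always [madvise] never".

def thp_setting(path="/sys/kernel/mm/transparent_hugepage/enabled"):
    with open(path) as f:
        options = f.read().split()
    # The active option is the one wrapped in square brackets.
    active = next(o.strip("[]") for o in options if o.startswith("["))
    return active, [o.strip("[]") for o in options]

if __name__ == "__main__":
    print("enabled:", thp_setting())
    print("defrag: ", thp_setting("/sys/kernel/mm/transparent_hugepage/defrag"))
```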
Can you try to preallocate arrays? That made a big difference for me a while back. Even copying arrays is really slow without preallocation |
@charris @pentschev did not have time to try building from source or setting `transparent_hugepage` yet, but noticed the machine was swapping heavily while the slow code was running. |
That explains a lot. Could you check if there's no performance degradation when you're not swapping anymore? |
@victor-shepardson Thanks for the feedback. Keep us informed. |
I'm also running into some major performance degradation with the new hugepage support, on a dual-socket Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz with 768GB of RAM. I'm seeing this with numpy 1.18.1, and toggling /sys/kernel/mm/transparent_hugepage/enabled between `madvise` and `never` makes a large difference. It's unfortunately a rather complex benchmark (involving dask, multithreading and a large astronomical dataset) and I haven't yet reproduced anything similar with a simple microbenchmark. I'll update if I do find something, but for now just treat this as a data point that it's not just @victor-shepardson who has seen problems. Some hardware/system info:
Ubuntu 16.04, Python 3.5.2 (but have also seen issues with 3.8.1) |
In my case I think the issue may be fragmentation. In the slow case, perf top shows that at least half my CPU time is spent in pageblock_pfn_to_page, which is apparently related to defragmentation/compaction. As soon as I disabled defragmentation by writing `never` to /sys/kernel/mm/transparent_hugepage/defrag, the slowdown went away. |
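For reference, a hedged sketch of that change done from Python rather than the shell; it assumes root privileges and that `never` is an accepted value for the `defrag` file on this kernel:

```python
# Sketch: disable THP defragmentation system-wide (root required).
# Shell equivalent: echo never > /sys/kernel/mm/transparent_hugepage/defrag
DEFRAG = "/sys/kernel/mm/transparent_hugepage/defrag"

with open(DEFRAG) as f:
    print("before:", f.read().strip())

try:
    with open(DEFRAG, "w") as f:
        f.write("never")
except PermissionError:
    print("not running as root; setting unchanged")

with open(DEFRAG) as f:
    print("after: ", f.read().strip())
```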
I was able to produce a case where `madvise` is much slower than `never`. Warning: this uses a lot of memory. If you run it and you don't have hundreds of GBs of RAM you might thrash your swap quite badly.

```python
#!/usr/bin/env python3
import sys
import time
import concurrent.futures
import numpy as np

SIZE = 16 * 1024 * 1024
PASSES = 100
THREADS = 64

print(np.__version__, sys.version)
with open('/sys/kernel/mm/transparent_hugepage/enabled') as f:
    print(f.read())

def work(thread_id):
    arrs = []
    for i in range(PASSES):
        arrs.append(np.ones(SIZE, np.uint8))
    np.stack(arrs)

with concurrent.futures.ThreadPoolExecutor(THREADS) as pool:
    start = time.monotonic()
    list(pool.map(work, range(THREADS)))
    stop = time.monotonic()
print(stop - start)
```

With THP set to madvise:
With it set to never:
The top few lines of perf report when run with madvise:
|
@pentschev checked today with the same large dataset streaming through memory but swapping disabled and there is no degradation. |
Should we just revert these changes for now, or does it seem like we can find another solution? Or is this specific to certain systems; can we know in advance whether defragmentation will be a problem? |
@seberg I've been thinking the same. |
@victor-shepardson Just to be clear, you ran |
@charris yep, |
@victor-shepardson Apparently swappiness can be tuned. What do the following look like
Some discussion at https://haydenjames.io/linux-performance-almost-always-add-swap-space/. |
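A small sketch for inspecting those values on Linux; it assumes the tunables in question are `vm.swappiness` and `vm.vfs_cache_pressure`, the two mentioned in the reply further down:

```python
# Sketch: print the VM tunables discussed in this thread.
for name in ("swappiness", "vfs_cache_pressure"):
    with open("/proc/sys/vm/" + name) as f:
        print("vm." + name, "=", f.read().strip())
```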
I think it's related to the system/problem. To be honest, I'm -1 on reverting this; so far we seem to have 3 data points: |
I would rather see more degradation points than successful ones to revert anything. In particular if a system configuration solves the issue, then it may be a required tuning for the user, and even more than that, if that's something beneficial for users in general we should consider whether this can be integrated directly into NumPy rather than relying on OS configurations. |
I agree it would be a pity to revert this, since in general it seems to give some nice speedup. How practical would it be to give users a knob to turn it off e.g. an environment variable? I'm not sure if there is precedent for run-time configuration of numpy, and I guess discoverability might be an issue since a new user is unlikely to know that there is a problem.
Possibly - the link I sent mentions an improvement in 4.6, but on the other hand @victor-shepardson is running 5.0. |
Happy to not revert it! A knob with an env var is probably pretty simple to do, but I wonder if it's actually worth it due to discoverability. |
I'll find out if I can try a newer kernel on the same machine. It's a shared machine that people run long-running number-crunching on, so it might not be trivial to schedule time to even reboot it. |
@charris we had tried swappiness at 60, 10, and 0; we did not touch vfs_cache_pressure:
|
I was able to get the kernel updated on this machine, which fixed the slowdown. The new version is 4.13.
The benchmark I posted in this comment now outputs
i.e. using hugepages is now much faster instead of much slower. My actual application is also improved. Unfortunately this is a machine shared by multiple users for number-crunching, so I won't be able to spend time bisecting kernel versions to determine when the issue was fixed. |
But that is good news then! It seems that there were some improvements in the mainline kernel. From the link you posted in #15545 (comment), it seems that the fix may be torvalds/linux@7cf91a9, which was introduced in kernel 4.6. If someone were able to test that, it would be great; otherwise I would say this issue has been clarified and we're good to close it. Any objections? |
It's good for me, but I think all this proves is that my issue isn't the same as the original bug reporter's, because he's running Linux 5.0. |
@victor-shepardson Do you regard this problem as fixed for your usage case? |
@charris not sure. I noticed some other numpy-heavy code running slowly on Friday despite no swapping. I will let you know whether it's related once I can investigate. |
@pentschev I think Keith also suggested it. How annoying would it be to create an environment variable to modify the behaviour? By default it could attempt to check whether the kernel version is below 4.6 and, in that case, default to not using hugepages. |
This isn't a bad idea, but I think it would be nice to first be sure that kernel 4.6 is indeed the sweet spot; so far it seems a bit speculative. I will defer to you on whether we should verify that first or just go ahead and make the changes. If we do make the changes, it's probably important that this is documented, as it's not going to be an easy-to-track behavior anymore. Lastly, if someone would like to jump ahead and do the changes, please go ahead, I won't have the bandwidth for that until late March/early April. |
Just to make things more confused, a colleague of mine says he's previously seen performance issues with huge pages that have gone away after a reboot, presumably because it cleared up fragmentation. So it's possible that my upgrade to 4.13 only fixed things because it involved a reboot. So any experiments to determine behaviour of different kernel versions may need to take that into account. |
It could be that this is such a seldom used feature of the kernel that people haven't tested it much... |
By default this disables madvise hugepage on kernels before 4.6, since we expect that these typically see large performance regressions when using hugepages due to slow defragmentation code presumably fixed by: torvalds/linux@7cf91a9. This adds support to set the behaviour at startup time through the ``NUMPY_MADVISE_HUGEPAGE`` environment variable. Fixes numpygh-15545
To get this rolling: I have a PR gh-15769 to add an environment variable, which guesses to disable it on kernels before 4.6. There seemed to be a slight preference for not guessing (I am personally a bit in favor of guessing, because it seems to me that we lose practically nothing if we guess correctly most of the time). Just to get a few opinions. |
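A rough sketch of the kind of guess-plus-override logic being discussed; the kernel-version parsing and the convention that `NUMPY_MADVISE_HUGEPAGE=0` means "off" are illustrative assumptions here, not a description of the merged PR:

```python
# Sketch: decide whether to madvise hugepages, letting an environment
# variable override a kernel-version-based guess.
import os
import platform

def kernel_at_least(major, minor):
    # platform.release() looks like "5.0.0-37-generic" on Linux.
    parts = platform.release().split(".")
    try:
        return (int(parts[0]), int(parts[1])) >= (major, minor)
    except (ValueError, IndexError):
        return False  # be conservative if the version string is unusual

override = os.environ.get("NUMPY_MADVISE_HUGEPAGE")
if override is not None:
    use_hugepage = override != "0"        # assumed convention: "0" disables
else:
    use_hugepage = kernel_at_least(4, 6)  # guess: older kernels had slow defrag
print("madvise hugepage enabled:", use_hugepage)
```

From the shell this would look like `NUMPY_MADVISE_HUGEPAGE=0 python script.py` to force it off regardless of the kernel.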
I'm fine with the proposed solution @seberg , thanks for the work! |
Performance is badly degraded for many operations, often when array sizes similar to the L2/L3 cache size are involved. The effect is present for numpy>=1.16.5 on an Intel i9-9960X Ubuntu 18.04 system, and is not observed for any numpy version on an Intel i9-8950HK macOS Mojave system.
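An illustrative sketch (not the reporter's script; the array sizes and the multiply-and-sum operation are assumptions) of the kind of bigger-than-cache timing comparison being described:

```python
# Illustrative sketch (not the original reproducer): time a simple
# elementwise operation for arrays around and well above L2/L3 cache size.
import time
import numpy as np

def bench(n, repeats=20):
    a = np.random.rand(n)
    b = np.random.rand(n)
    start = time.perf_counter()
    for _ in range(repeats):
        (a * b).sum()
    return (time.perf_counter() - start) / repeats

for n in (1 << 18, 1 << 21, 1 << 24):  # ~2 MB, ~16 MB, ~128 MB of float64
    print("n = %9d: %.6f s per pass" % (n, bench(n)))
```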
Reproducing code example:
output with 1.16.4:
output with 1.16.5:
output with 1.18.1:
Python/Numpy version: results of `conda env export` (envs differ only in numpy and numpy-base versions)
OS/Hardware: Ubuntu 18.04.3 LTS (GNU/Linux 5.0.0-37-generic x86_64)