
many functions slow for bigger-than-cache arrays on linux from numpy>=1.16.5 #15545

Closed
victor-shepardson opened this issue Feb 10, 2020 · 37 comments · Fixed by #15769

@victor-shepardson

Performance is badly degraded for many operations, often when array sizes comparable to the L2/L3 cache size are involved. The effect is present for numpy>=1.16.5 on an Intel i9-9960X Ubuntu 18.04 system, and is not observed for any numpy version on an Intel i9-8950HK macOS Mojave system.

Reproducing code example:

conda create -n py37np164 python=3.7 numpy=1.16.4
conda create -n py37np165 python=3.7 numpy=1.16.5
conda create -n py37np181 python=3.7 numpy=1.18.1
import sys
import numpy as np
from timeit import Timer
print(np.__version__, sys.version)
# results sensitive to hardware cache size
n = 5
sizes = [2**20, 2**21, 2**22, 2**23]
stmts = [
    'np.zeros(({},))',
    'np.random.rand({})',
    'np.linspace(0,1,{})',
    'np.exp(np.zeros(({},)))',
]
for stmt in stmts:
    for size in sizes:
        s = stmt.format(size)
        print(s+':')
        t = Timer(s, globals=globals()).timeit(n)
        print(f'\t{size/n/t} elements/second')
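For reference, the same script can be run against each of the environments above with conda run (assuming it is saved as, say, bench.py; the filename is only illustrative):

conda run -n py37np164 python bench.py
conda run -n py37np165 python bench.py
conda run -n py37np181 python bench.py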

output with 1.16.4:

1.16.4 3.7.6 (default, Jan  8 2020, 19:59:22)
[GCC 7.3.0]
np.zeros((1048576,)):
        61421511.05126992 elements/second
np.zeros((2097152,)):
        88332565.33730087 elements/second
np.zeros((4194304,)):
        27011333457.509125 elements/second
np.zeros((8388608,)):
        55727273740.89583 elements/second
np.random.rand(1048576):
        5537466.835408629 elements/second
np.random.rand(2097152):
        4565838.5126173515 elements/second
np.random.rand(4194304):
        5056596.831366348 elements/second
np.random.rand(8388608):
        4877062.023493181 elements/second
np.linspace(0,1,1048576):
        4726193.511268751 elements/second
np.linspace(0,1,2097152):
        24889336.518190637 elements/second
np.linspace(0,1,4194304):
        10163866.009809423 elements/second
np.linspace(0,1,8388608):
        11765754.105458323 elements/second
np.exp(np.zeros((1048576,))):
        54532313.29674471 elements/second
np.exp(np.zeros((2097152,))):
        33393087.31574934 elements/second
np.exp(np.zeros((4194304,))):
        37155269.74395583 elements/second
np.exp(np.zeros((8388608,))):
        34553576.9698848 elements/second

output with 1.16.5:

1.16.5 3.7.6 (default, Jan  8 2020, 19:59:22)
[GCC 7.3.0]
np.zeros((1048576,)):
        60143502.91423442 elements/second
np.zeros((2097152,)):
        87341513.42070064 elements/second
np.zeros((4194304,)):
        19292941759.91131 elements/second
np.zeros((8388608,)):
        54397048327.81841 elements/second
np.random.rand(1048576):
        5473600.047150185 elements/second
np.random.rand(2097152):
        18752.041155728602 elements/second
np.random.rand(4194304):
        19324.038630861298 elements/second
np.random.rand(8388608):
        14506.20855024289 elements/second
np.linspace(0,1,1048576):
        3856487.073921443 elements/second
np.linspace(0,1,2097152):
        16523085.001294544 elements/second
np.linspace(0,1,4194304):
        11812.122234984105 elements/second
np.linspace(0,1,8388608):
        13185.01070840559 elements/second
np.exp(np.zeros((1048576,))):
        47433268.94274437 elements/second
np.exp(np.zeros((2097152,))):
        107295.34448171785 elements/second
np.exp(np.zeros((4194304,))):
        76042.71390730397 elements/second
np.exp(np.zeros((8388608,))):
        92983.00220850644 elements/second

output with 1.18.1:

1.18.1 3.7.6 (default, Jan  8 2020, 19:59:22)
[GCC 7.3.0]
np.zeros((1048576,)):
        50735117.70738101 elements/second
np.zeros((2097152,)):
        83684116.17091243 elements/second
np.zeros((4194304,)):
        21767035415.033813 elements/second
np.zeros((8388608,)):
        54589913512.271355 elements/second
np.random.rand(1048576):
        8093427.641236661 elements/second
np.random.rand(2097152):
        8188684.253048504 elements/second
np.random.rand(4194304):
        13750.96922032222 elements/second
np.random.rand(8388608):
        12809.948623963204 elements/second
np.linspace(0,1,1048576):
        28700369.02352331 elements/second
np.linspace(0,1,2097152):
        31411688.42840547 elements/second
np.linspace(0,1,4194304):
        12422.915392248138 elements/second
np.linspace(0,1,8388608):
        15747.353944479057 elements/second
np.exp(np.zeros((1048576,))):
        6914276.304259895 elements/second
np.exp(np.zeros((2097152,))):
        9039.82013534727 elements/second
np.exp(np.zeros((4194304,))):
        15507.785512087936 elements/second
np.exp(np.zeros((8388608,))):
        13965.641080531177 elements/second

Python/Numpy version

results of conda env export (envs differ only in numpy and numpy-base versions)

name: py37np165
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - blas=1.0=mkl
  - ca-certificates=2020.1.1=0
  - certifi=2019.11.28=py37_0
  - intel-openmp=2020.0=166
  - ld_impl_linux-64=2.33.1=h53a641e_7
  - libedit=3.1.20181209=hc058e9b_0
  - libffi=3.2.1=hd88cf55_4
  - libgcc-ng=9.1.0=hdf63c60_0
  - libgfortran-ng=7.3.0=hdf63c60_0
  - libstdcxx-ng=9.1.0=hdf63c60_0
  - mkl=2020.0=166
  - mkl-service=2.3.0=py37he904b0f_0
  - mkl_fft=1.0.15=py37ha843d7b_0
  - mkl_random=1.1.0=py37hd6b4f25_0
  - ncurses=6.1=he6710b0_1
  - numpy=1.16.5=py37h7e9f1db_0
  - numpy-base=1.16.5=py37hde5b4d6_0
  - openssl=1.1.1d=h7b6447c_3
  - pip=20.0.2=py37_1
  - python=3.7.6=h0371630_2
  - readline=7.0=h7b6447c_5
  - setuptools=45.1.0=py37_0
  - six=1.14.0=py37_0
  - sqlite=3.31.1=h7b6447c_0
  - tk=8.6.8=hbc83047_0
  - wheel=0.34.2=py37_0
  - xz=5.2.4=h14c3975_4
  - zlib=1.2.11=h7b6447c_3

OS/Hardware:

Ubuntu 18.04.3 LTS (GNU/Linux 5.0.0-37-generic x86_64)

Handle 0x0061, DMI type 4, 48 bytes
Processor Information
        Socket Designation: LGA 2066 R4
        Type: Central Processor
        Family: Xeon
        Manufacturer: Intel(R) Corporation
        ID: 54 06 05 00 FF FB EB BF
        Signature: Type 0, Family 6, Model 85, Stepping 4
        Flags:
                FPU (Floating-point unit on-chip)
                VME (Virtual mode extension)
                DE (Debugging extension)
                PSE (Page size extension)
                TSC (Time stamp counter)
                MSR (Model specific registers)
                PAE (Physical address extension)
                MCE (Machine check exception)
                CX8 (CMPXCHG8 instruction supported)
                APIC (On-chip APIC hardware supported)
                SEP (Fast system call)
                MTRR (Memory type range registers)
                PGE (Page global enable)
                MCA (Machine check architecture)
                CMOV (Conditional move instruction supported)
                PAT (Page attribute table)
                PSE-36 (36-bit page size extension)
                CLFSH (CLFLUSH instruction supported)
                DS (Debug store)
                ACPI (ACPI supported)
                MMX (MMX technology supported)
                FXSR (FXSAVE and FXSTOR instructions supported)
                SSE (Streaming SIMD extensions)
                SSE2 (Streaming SIMD extensions 2)
                SS (Self-snoop)
                HTT (Multi-threading)
                TM (Thermal monitor supported)
                PBE (Pending break enabled)
        Version: Intel(R) Core(TM) i9-9960X CPU @ 3.10GHz
        Voltage: 1.6 V
        External Clock: 100 MHz
        Max Speed: 4000 MHz
        Current Speed: 3100 MHz
        Status: Populated, Enabled
        Upgrade: Other
        L1 Cache Handle: 0x005E
        L2 Cache Handle: 0x005F
        L3 Cache Handle: 0x0060
        Serial Number: Not Specified
        Asset Tag: UNKNOWN
        Part Number: Not Specified
        Core Count: 16
        Core Enabled: 16
        Thread Count: 32
        Characteristics:
                64-bit capable
                Multi-Core
                Hardware Thread
                Execute Protection
                Enhanced Virtualization
                Power/Performance Control

Handle 0x005E, DMI type 7, 19 bytes
Cache Information
        Socket Designation: L1-Cache
        Configuration: Enabled, Not Socketed, Level 1
        Operational Mode: Write Back
        Location: Internal
        Installed Size: 1024 kB
        Maximum Size: 1024 kB
        Supported SRAM Types:
                Synchronous
        Installed SRAM Type: Synchronous
        Speed: Unknown
        Error Correction Type: Single-bit ECC
        System Type: Instruction
        Associativity: 8-way Set-associative

Handle 0x005F, DMI type 7, 19 bytes
Cache Information
        Socket Designation: L2-Cache
        Configuration: Enabled, Not Socketed, Level 2
        Operational Mode: Varies With Memory Address
        Location: Internal
        Installed Size: 16384 kB
        Maximum Size: 16384 kB
        Supported SRAM Types:
                Synchronous
        Installed SRAM Type: Synchronous
        Speed: Unknown
        Error Correction Type: Single-bit ECC
        System Type: Unified
        Associativity: 16-way Set-associative

Handle 0x0060, DMI type 7, 19 bytes
Cache Information
        Socket Designation: L3-Cache
        Configuration: Enabled, Not Socketed, Level 3
        Operational Mode: Varies With Memory Address
        Location: Internal
        Installed Size: 22528 kB
        Maximum Size: 22528 kB
        Supported SRAM Types:
                Synchronous
        Installed SRAM Type: Synchronous
        Speed: Unknown
        Error Correction Type: Single-bit ECC
        System Type: Unified
        Associativity: Fully Associative
@victor-shepardson
Author

possibly related to #14322 somehow?

@charris
Member

charris commented Feb 10, 2020

@victor-shepardson That would be my first guess also. How are you installing numpy? It might be interesting to try a source install if that is not what you are doing already. A source build might degrade BLAS performance depending on the library, but BLAS isn't what we are looking at here.

@charris
Member

charris commented Feb 10, 2020

You might look at the conversation in #14177. What does
cat /sys/kernel/mm/transparent_hugepage/enabled show?

@charris
Member

charris commented Feb 10, 2020

Although it is curious that zeros doesn't seem to be affected.

@charris
Member

charris commented Feb 10, 2020

I don't see this regression on my hardware (which is apparently much slower than yours).

1.18.2.dev0+e0660d4 3.7.6 (default, Jan 30 2020, 09:44:41) 
[GCC 9.2.1 20190827 (Red Hat 9.2.1-1)]
np.zeros((1048576,)):
	38158780.33276095 elements/second
np.zeros((2097152,)):
	75194525.31342381 elements/second
np.zeros((4194304,)):
	14156793846.309706 elements/second
np.zeros((8388608,)):
	45595218962.68746 elements/second
np.random.rand(1048576):
	5043581.73273908 elements/second
np.random.rand(2097152):
	5065115.222716178 elements/second
np.random.rand(4194304):
	4874188.548954194 elements/second
np.random.rand(8388608):
	4932568.4791796 elements/second
np.linspace(0,1,1048576):
	24176792.05426381 elements/second
np.linspace(0,1,2097152):
	17535581.49985825 elements/second
np.linspace(0,1,4194304):
	15133216.867388567 elements/second
np.linspace(0,1,8388608):
	15924277.461800016 elements/second
np.exp(np.zeros((1048576,))):
	6292804.736660751 elements/second
np.exp(np.zeros((2097152,))):
	5133047.037672353 elements/second
np.exp(np.zeros((4194304,))):
	5169529.959440129 elements/second
np.exp(np.zeros((8388608,))):
	5247382.237551451 elements/second

Which is actually a bit faster than 1.16.4. So this is rather curious, and I'd like to know why your numbers can be so much larger. It sounds like something parallel; I wonder if this is a compiler issue.

@victor-shepardson
Author

@charris thanks for the quick reply.

We've been installing from conda; installing from pip shows the same behavior. I'll try building from source tomorrow.

/sys/kernel/mm/transparent_hugepage/enabled reads always [madvise] never. I don't understand what that means; why would all three options be present?

@charris
Member

charris commented Feb 11, 2020

@victor-shepardson The options are settable, although I suspect you need to be root. I think there is probably some thrashing going on that accounts for the large drop in performance, but this is getting beyond my experience. @pentschev Thoughts?

@pentschev
Contributor

This is strange; I wouldn't expect to see any performance difference around CPU cache sizes.

/sys/kernel/mm/transparent_hugepage/enabled reads always [madvise] never. I don't understand what that means; why would all three options be present?

This means there are three possible options, and the one inside square brackets is currently selected. I'm wondering if you could check whether there's a difference when you change that to always and to never. To change it you need root access, as @charris mentioned, and you can run one of the two commands below (depending on whether you use sudo or not):

# as root
echo always > /sys/kernel/mm/transparent_hugepage/enabled

# with sudo
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

If hugepages are really the cause, then I would expect that setting it to never shows no change in performance for any of the NumPy versions. I'm not sure what to expect from always; I'm guessing we'll see the same performance as madvise.

Also tagging @hmaarrfk in case there are other ideas I'm not thinking of now.

@hmaarrfk
Contributor

Can you try preallocating arrays? That made a big difference for me a while back. Even copying arrays is really slow without preallocation.
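A minimal sketch of that kind of comparison (the sizes, names and out= usage here are illustrative, not taken from this report) would be to time a ufunc call that allocates a fresh output against one that writes into a preallocated buffer:

import numpy as np
from timeit import timeit

n = 2**23
src = np.zeros(n)
out = np.empty(n)  # allocated once, reused across calls

# allocates a new output array on every call
t_alloc = timeit(lambda: np.exp(src), number=5)

# writes into the preallocated buffer instead of allocating
t_prealloc = timeit(lambda: np.exp(src, out=out), number=5)

print(f'fresh allocation: {t_alloc:.3f}s, preallocated out=: {t_prealloc:.3f}s')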

@victor-shepardson
Author

victor-shepardson commented Feb 11, 2020

@charris @pentschev I did not have time to try building from source or setting transparent_hugepage/enabled today, but I discovered something else. The issue appears related to swap/disk cache. We had been fitting models with stochastic gradient descent on large datasets. Data were being streamed continuously from disk, which was keeping the disk cache and swap full. After killing these jobs, the issue disappeared. We think it was cache/swap specifically, because there was still plenty of available memory, disk bandwidth and idle CPU cores.
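(For anyone trying to reproduce this state: the commands below are standard Linux utilities for checking swap and page-cache pressure, not something specific to this report.)

free -h          # memory, swap and buff/cache totals
swapon --show    # active swap devices and how full they are
vmstat 1 5       # watch the swap-in/swap-out (si/so) columns for a few seconds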

@pentschev
Contributor

That explains a lot. Could you check whether the performance degradation is gone now that you're no longer swapping?

@charris
Member

charris commented Feb 12, 2020

@victor-shepardson Thanks for the feedback. Keep us informed.

@bmerry
Contributor

bmerry commented Feb 19, 2020

I'm also running into some major performance degradation with the new hugepage support, on a dual-socket Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz with 768GB of RAM. I'm seeing this with numpy 1.18.1, and toggling /sys/kernel/mm/transparent_hugepage/enabled between madvise and never causes a 5x variation in performance (never being 5x faster).

It's unfortunately a rather complex benchmark (involving dask, multithreading and a large astronomical dataset) and I haven't yet reproduced anything similar with a simple microbenchmark. I'll update if I do find something, but for now just treat this as a data point that it's not only @victor-shepardson who has seen problems.

Some hardware/system info:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                72
On-line CPU(s) list:   0-71
Thread(s) per core:    2
Core(s) per socket:    18
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
Stepping:              1
CPU MHz:               2101.000
CPU max MHz:           2101.0000
CPU min MHz:           1200.0000
BogoMIPS:              4201.64
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-17,36-53
NUMA node1 CPU(s):     18-35,54-71
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single intel_pt ssbd ibrs ibpb stibp kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
              total        used        free      shared  buff/cache   available
Mem:      792516400   108740544    80609196      204808   603166660   681153168
Swap:       8388604     5097540     3291064
Linux com06 4.4.0-128-generic #154-Ubuntu SMP Fri May 25 14:15:18 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Ubuntu 16.04, Python 3.5.2 (but have also seen issues with 3.8.1)

@bmerry
Contributor

bmerry commented Feb 19, 2020

In my case I think the issue may be fragmentation. In the slow case, perf top shows that at least half my CPU time is spent in pageblock_pfn_to_page, which is apparently related to defragmentation/compaction. As soon as I disabled defragmentation by writing never to /sys/kernel/mm/transparent_hugepage/defrag, the performance was good again (actually better than with THP disabled entirely). It seems that slow defragmentation is a known issue.
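For reference, a sketch of the commands involved (standard sysfs paths; writing to them requires root, and the values shown are only examples):

cat /sys/kernel/mm/transparent_hugepage/defrag
# disable THP defragmentation
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
# then watch whether pageblock_pfn_to_page still dominates
sudo perf top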

@bmerry
Contributor

bmerry commented Feb 19, 2020

I was able to produce a case where madvise is slower than never. It's probably rather system-specific: I had to fiddle with the constants a fair bit to demonstrate a slowdown, and it's still not as dramatic as I'm seeing in my actual application.

Warning: this uses a lot of memory. If you run it and you don't have hundreds of GBs of RAM you might thrash your swap quite badly.

#!/usr/bin/env python3
import sys
import time
import concurrent.futures
import numpy as np

SIZE = 16 * 1024 * 1024
PASSES = 100
THREADS = 64

print(np.__version__, sys.version)
with open('/sys/kernel/mm/transparent_hugepage/enabled') as f:
    print(f.read())

def work(thread_id):
    arrs = []
    for i in range(PASSES):
        arrs.append(np.ones(SIZE, np.uint8))
    np.stack(arrs)

with concurrent.futures.ThreadPoolExecutor(THREADS) as pool:
    start = time.monotonic()
    list(pool.map(work, range(THREADS)))
    stop = time.monotonic()
    print(stop - start)

With THP set to madvise:

1.18.1 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609]
always [madvise] never

52.80045520141721

With it set to never:

1.18.1 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609]
always madvise [never]

30.80044653173536

The top few lines of perf report when run with madvise:

Samples: 2M of event 'cycles:pp', Event count (approx.): 13463412745798419
Overhead  Command  Shared Object                                      Symbol
  23.78%  python   [kernel.kallsyms]                                  [k] pageblock_pfn_to_page
  21.88%  python   [kernel.kallsyms]                                  [k] clear_page_c_e
  12.93%  python   [kernel.kallsyms]                                  [k] native_queued_spin_lock_slowpath
   8.19%  python   [unknown]                                          [k] 0x00007f8a2d91baf8
   2.70%  python   [kernel.kallsyms]                                  [k] down_read_trylock
   2.05%  python   [kernel.kallsyms]                                  [k] handle_mm_fault
   1.91%  python   [kernel.kallsyms]                                  [k] get_pfnblock_flags_mask
   1.83%  python   [kernel.kallsyms]                                  [.] native_irq_return_iret
   1.81%  python   libc-2.23.so                                       [.] __memmove_avx_unaligned
   1.62%  python   [kernel.kallsyms]                                  [k] get_page_from_freelist
   1.57%  python   [kernel.kallsyms]                                  [k] __rmqueue.isra.80
   1.48%  python   [kernel.kallsyms]                                  [k] up_read
   0.90%  python   [kernel.kallsyms]                                  [k] try_charge
   0.80%  python   [kernel.kallsyms]                                  [k] smp_call_function_many
   0.78%  python   [kernel.kallsyms]                                  [k] compact_zone
   0.66%  python   [kernel.kallsyms]                                  [k] copy_page

@victor-shepardson
Author

@pentschev I checked today with the same large dataset streaming through memory but with swap disabled, and there is no degradation.

@seberg
Member

seberg commented Feb 19, 2020

Should we just revert these changes for now, or does it seem like we can find another solution? Or is this specific to certain systems: can we know in advance that defragmentation will be a problem?

@charris
Member

charris commented Feb 19, 2020

@seberg I've been thinking the same.

@charris
Member

charris commented Feb 19, 2020

@victor-shepardson Just to be clear, you ran swapoff? I wonder if these problems are kernel-version dependent.

@victor-shepardson
Author

@charris yep, swapoff -a
(screenshot of terminal output attached)

@charris
Member

charris commented Feb 19, 2020

@victor-shepardson Apparently swappiness can be tuned. What do the following look like?

charris@fc [numpy.git (master)]$ cat /proc/sys/vm/swappiness
60
charris@fc [numpy.git (master)]$ cat /proc/sys/vm/vfs_cache_pressure
100

Some discussion at https://haydenjames.io/linux-performance-almost-always-add-swap-space/.
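For reference, both values can be inspected and changed at runtime with sysctl (the value below is only an example, and the change does not persist across reboots unless added to /etc/sysctl.conf):

sysctl vm.swappiness vm.vfs_cache_pressure   # inspect current values
sudo sysctl -w vm.swappiness=10              # lower swappiness for this boot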

@pentschev
Contributor

I think it's related to the system/problem. To be honest, I'm -1 on reverting this; so far we seem to have 3 data points:

  1. @victor-shepardson : works well when it doesn't swap;
  2. mine : also works well in general, I never experienced degradation, only improvements instead;
  3. @bmerry : (unless I misunderstood) works well if defragmentation is disabled, and disabling defragmentation seems to be the workaround/solution for that particular system/problem.

I would rather see more degradation reports than successful ones before reverting anything. In particular, if a system configuration solves the issue, then it may simply be required tuning for the user. More than that, if it's something beneficial for users in general, we should consider whether it can be integrated directly into NumPy rather than relying on OS configuration.

@bmerry
Contributor

bmerry commented Feb 20, 2020

To be honest, I'm -1 on reverting this

I agree it would be a pity to revert this, since in general it seems to give a nice speedup. How practical would it be to give users a knob to turn it off, e.g. an environment variable? I'm not sure if there is precedent for run-time configuration of numpy, and I guess discoverability might be an issue, since a new user is unlikely to know that there is a problem.

I wonder if these problems are kernel version dependent.

Possibly: the link I sent mentions an improvement in 4.6, but on the other hand @victor-shepardson is running 5.0.

@seberg
Member

seberg commented Feb 20, 2020

Happy to not revert it! A knob with an env var is probably pretty simple to do, but I wonder if it's actually worth it, given the discoverability problem.
If it is something simple like needing a newer kernel, then maybe we can figure out parameters based on which NumPy can decide not to use it?

@bmerry
Contributor

bmerry commented Feb 20, 2020

I'll find out if I can try a newer kernel on the same machine. It's a shared machine that people run long-running number-crunching on, so it might not be trivial to schedule time to even reboot it.

@victor-shepardson
Author

victor-shepardson commented Feb 20, 2020

@charris We had tried swappiness at 60, 10 and 0, and did not touch vfs_cache_pressure:

victor@Ash ~> cat /proc/sys/vm/swappiness                                                  
0
victor@Ash ~> cat /proc/sys/vm/vfs_cache_pressure                                          
100

@bmerry
Contributor

bmerry commented Feb 21, 2020

I was able to get the kernel updated on this machine, which fixed the slowdown. The new version is

Linux com06 4.15.0-88-generic #88~16.04.1-Ubuntu SMP Wed Feb 12 04:19:15 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

The benchmark I posted in an earlier comment now outputs:

1.18.1 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609]
always [madvise] never

15.417192504000013

i.e. using hugepages is now much faster instead of much slower. My actual application is also improved.

Unfortunately this is a machine shared by multiple users for number-crunching, so I won't be able to spend time bisecting kernel versions to determine when the issue was fixed.

@pentschev
Contributor

But that is good news then! It seems that there were some improvements in the mainline kernel. From the link you posted in #15545 (comment), it seems the fix may be torvalds/linux@7cf91a9, which was introduced in kernel 4.6. If someone were able to test that, it would be great; otherwise I would say this issue has been clarified and we're good to close it. Any objections?

@bmerry
Contributor

bmerry commented Feb 21, 2020

It's good for me, but I think all this proves is that my issue isn't the same as the original bug reporter's, because he's running Linux 5.0.

@charris
Member

charris commented Mar 1, 2020

@victor-shepardson Do you regard this problem as fixed for your usage case?

@victor-shepardson
Author

victor-shepardson commented Mar 1, 2020 via email

@seberg
Member

seberg commented Mar 4, 2020

@pentschev I think Keith also suggested it. How annoying would it be to create an environment variable to modify the behaviour? By default it could check whether the kernel version is below 4.6 and, in that case, default to not using hugepages.
Such heuristics are not great, but if an env variable is a minor thing (and I assume it is), it seems OK as long as we guess correctly 90% of the time.
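A rough sketch of the kind of guess being described, purely for illustration (this is not the code that went into NumPy; the environment-variable name anticipates the one used in the commit referenced later in this thread):

import os
import platform

def use_madvise_hugepage():
    # An explicit setting always wins; "0" disables, anything else enables.
    env = os.environ.get('NUMPY_MADVISE_HUGEPAGE')
    if env is not None:
        return env != '0'
    # Otherwise guess from the kernel version, e.g. '5.0.0-37-generic'.
    try:
        major, minor = (int(x) for x in platform.release().split('.')[:2])
    except ValueError:
        return True  # unrecognised format: keep the current default
    return (major, minor) >= (4, 6)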

@pentschev
Contributor

This isn't a bad idea, but I think it would be nice to first be sure that kernel 4.6 is indeed the sweet spot; so far that seems a bit speculative. I will defer to you whether we should verify that first or just go ahead and make the changes. If we do make them, it's probably important that this is documented, as it's not going to be easy-to-track behavior anymore. Lastly, if someone would like to jump ahead and do the changes, please go ahead; I won't have the bandwidth for that until late March/early April.

@bmerry
Contributor

bmerry commented Mar 5, 2020

Just to make things more confusing, a colleague of mine says he has previously seen performance issues with huge pages that went away after a reboot, presumably because it cleared up fragmentation. So it's possible that my upgrade to 4.15 only fixed things because it involved a reboot, and any experiments to determine the behaviour of different kernel versions may need to take that into account.

@hmaarrfk
Contributor

hmaarrfk commented Mar 6, 2020

It could be that this is such a seldom used feature of the kernel that people haven't tested it much...

@charris charris removed this from the 1.18.2 release milestone Mar 9, 2020
seberg added a commit to seberg/numpy that referenced this issue Mar 17, 2020
By default this disables madvise hugepage on kernels before 4.6, since
we expect that these typically see large performance regressions when
using hugepages due to slow defragmentation code presumably fixed by:

torvalds/linux@7cf91a9

This adds support to set the behaviour at startup time through the
``NUMPY_MADVISE_HUGEPAGE`` environment variable.

Fixes numpygh-15545
@seberg
Member

seberg commented May 2, 2020

To get this rolling: I have a PR, gh-15769, that adds an environment variable and, by default, guesses to disable madvise hugepages on kernels before 4.6. There seemed to be a slight preference for not guessing (I am personally a bit in favor of guessing, because it seems to me that we lose practically nothing if we guess correctly most of the time).

Just to get a few opinions.
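For reference, with that PR the variable from the commit above would presumably be used like this (bench.py standing in for any NumPy workload):

NUMPY_MADVISE_HUGEPAGE=0 python bench.py   # force madvise hugepages off for this run
NUMPY_MADVISE_HUGEPAGE=1 python bench.py   # force them on, regardless of the kernel-version guess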

@pentschev
Contributor

I'm fine with the proposed solution, @seberg. Thanks for the work!
