
Bad multithreaded performance of allocations #8101

Closed
aseyboldt opened this issue May 23, 2022 · 10 comments
Labels: performance - run time (Performance issue occurring at run time)

Comments

@aseyboldt (Contributor) commented May 23, 2022

When new arrays are allocated, wrapper functions around malloc in nrt.c are called.
Those wrappers increment an atomic counter (I guess for debugging/profiling purposes?),
but if we call those functions from several threads, this can come with a serious
performance hit, because different cores write to the same memory location. In some
real-world code (which could probably be improved by avoiding some of the allocations)
I see a performance drop of ~30%.
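
To make the access pattern concrete, here is a minimal standalone C sketch
(illustrative only, not Numba's actual code) of many threads doing a
read-modify-write on one shared atomic counter, which is essentially what the
NRT wrappers do on every allocation:

/* contention.c -- build with: cc -O2 -pthread contention.c
   Every thread performs an atomic read-modify-write on the same counter,
   so the cache line holding it ping-pongs between cores. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_size_t stats_alloc;  /* shared by all threads */

static void *worker(void *arg)
{
    (void)arg;
    for (long i = 0; i < 10000000L; i++)
        atomic_fetch_add_explicit(&stats_alloc, 1, memory_order_relaxed);
    return NULL;
}

int main(void)
{
    pthread_t threads[8];
    for (int i = 0; i < 8; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < 8; i++)
        pthread_join(threads[i], NULL);
    printf("stats_alloc = %zu\n", (size_t)atomic_load(&stats_alloc));
    return 0;
}

Replacing the shared counter with per-thread counters (or dropping it, as in
the patch below) removes the cache-line ping-pong entirely.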

To show the issue more clearly, here is an extreme example:

import numba
import numpy as np
import threading

@numba.njit(nogil=True)  # release gil so that we actually run this in parallel
def foo(x):
    for _ in range(100_000_000):
        # allocate a new array each time
        x += np.zeros_like(x)
    return x

def run():
    x = np.array(1.)
    foo(x)

threads = [threading.Thread(target=run) for _ in range(20)]

for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

Timings:

  • With atomic inc: 1h total CPU time, 4min wall time
  • Without atomic inc (see patch below): 3min total CPU time, 20s wall time

This was run on a Threadripper 1920X with 12 cores, using jemalloc. Since some
cores have no cache in common, this is probably close to the worst case; I guess
it will not look nearly as bad on most laptops.

To evaluate the performance without the atomic counter, I used this patch for nrt.c:

diff --git a/numba/core/runtime/nrt.c b/numba/core/runtime/nrt.c
index 3a65c9b99..de3f3cfab 100644
--- a/numba/core/runtime/nrt.c
+++ b/numba/core/runtime/nrt.c
@@ -2,6 +2,7 @@
 #include <string.h> /* for memset */
 #include "nrt.h"
 #include "assert.h"
+#include <stdlib.h>
 
 #if !defined MIN
 #define MIN(a, b) ((a) < (b)) ? (a) : (b)
@@ -182,7 +183,7 @@ void NRT_MemInfo_init(NRT_MemInfo *mi,void *data, size_t size,
     mi->external_allocator = external_allocator;
     NRT_Debug(nrt_debug_print("NRT_MemInfo_init mi=%p external_allocator=%p\n", mi, external_allocator));
     /* Update stats */
-    TheMSys.atomic_inc(&TheMSys.stats_mi_alloc);
+    //TheMSys.atomic_inc(&TheMSys.stats_mi_alloc);
 }
 
 NRT_MemInfo *NRT_MemInfo_new(void *data, size_t size,
@@ -259,7 +260,7 @@ NRT_MemInfo* NRT_MemInfo_alloc_dtor_safe(size_t size, NRT_dtor_function dtor) {
     void *data = nrt_allocate_meminfo_and_data(size, &mi, NULL);
     /* Only fill up a couple cachelines with debug markers, to minimize
        overhead. */
-    memset(data, 0xCB, MIN(size, 256));
+    //memset(data, 0xCB, MIN(size, 256));
     NRT_Debug(nrt_debug_print("NRT_MemInfo_alloc_dtor_safe %p %zu\n", data, size));
     NRT_MemInfo_init(mi, data, size, nrt_internal_custom_dtor_safe, dtor, NULL);
     return mi;
@@ -297,7 +298,7 @@ NRT_MemInfo *NRT_MemInfo_alloc_safe_aligned(size_t size, unsigned align) {
     void *data = nrt_allocate_meminfo_and_data_align(size, align, &mi, NULL);
     /* Only fill up a couple cachelines with debug markers, to minimize
        overhead. */
-    memset(data, 0xCB, MIN(size, 256));
+    //memset(data, 0xCB, MIN(size, 256));
     NRT_Debug(nrt_debug_print("NRT_MemInfo_alloc_safe_aligned %p %zu\n",
                               data, size));
     NRT_MemInfo_init(mi, data, size, nrt_internal_dtor_safe, (void*)size, NULL);
@@ -321,7 +322,7 @@ void NRT_dealloc(NRT_MemInfo *mi) {
     NRT_Debug(nrt_debug_print("NRT_dealloc meminfo: %p external_allocator: %p\n", mi, mi->external_allocator));
     if (mi->external_allocator) {
         mi->external_allocator->free(mi, mi->external_allocator->opaque_data);
-        TheMSys.atomic_inc(&TheMSys.stats_free);
+        //TheMSys.atomic_inc(&TheMSys.stats_free);
     } else {
         NRT_Free(mi);
     }
@@ -329,7 +330,7 @@ void NRT_dealloc(NRT_MemInfo *mi) {
 
 void NRT_MemInfo_destroy(NRT_MemInfo *mi) {
     NRT_dealloc(mi);
-    TheMSys.atomic_inc(&TheMSys.stats_mi_free);
+    //TheMSys.atomic_inc(&TheMSys.stats_mi_free);
 }
 
 void NRT_MemInfo_acquire(NRT_MemInfo *mi) {
@@ -469,15 +470,17 @@ void* NRT_Allocate_External(size_t size, NRT_ExternalAllocator *allocator) {
         ptr = allocator->malloc(size, allocator->opaque_data);
         NRT_Debug(nrt_debug_print("NRT_Allocate_External custom bytes=%zu ptr=%p\n", size, ptr));
     } else {
-        ptr = TheMSys.allocator.malloc(size);
+        //ptr = TheMSys.allocator.malloc(size);
+	ptr = malloc(size);
         NRT_Debug(nrt_debug_print("NRT_Allocate_External bytes=%zu ptr=%p\n", size, ptr));
     }
-    TheMSys.atomic_inc(&TheMSys.stats_alloc);
+    //TheMSys.atomic_inc(&TheMSys.stats_alloc);
     return ptr;
 }
 
 void *NRT_Reallocate(void *ptr, size_t size) {
-    void *new_ptr = TheMSys.allocator.realloc(ptr, size);
+    //void *new_ptr = TheMSys.allocator.realloc(ptr, size);
+    void *new_ptr = realloc(ptr, size);
     NRT_Debug(nrt_debug_print("NRT_Reallocate bytes=%zu ptr=%p -> %p\n",
                               size, ptr, new_ptr));
     return new_ptr;
@@ -485,8 +488,9 @@ void *NRT_Reallocate(void *ptr, size_t size) {
 
 void NRT_Free(void *ptr) {
     NRT_Debug(nrt_debug_print("NRT_Free %p\n", ptr));
-    TheMSys.allocator.free(ptr);
-    TheMSys.atomic_inc(&TheMSys.stats_free);
+    free(ptr);
+    //TheMSys.allocator.free(ptr);
+    //TheMSys.atomic_inc(&TheMSys.stats_free);
 }
 
 /*

Somewhat related: #7887

@stuartarchibald (Contributor)

Thanks for the report. As noted, the performance increase in the case above (with the patch applied) comes from removing pressure on atomic ops in the Numba runtime (NRT), and also from doing less work adding the debug markers. I agree that #7887 is also likely related.

I did some experiments a few weeks back to try out some ideas in relation to improving the performance in these sorts of situations. Steps were roughly:

  1. Convert the NRT to use <stdatomic.h> as opposed to JIT compiling functions to do the atomic operations and injecting them in as function pointers (a rough sketch of this follows the list).
  2. Remove the memset that adds those debug markers.
  3. As a result of 1., it was then possible to compile the entire NRT to LLVM bitcode.
  4. As a result of 3., it was then possible to update the code generation in Numba to essentially inline everything.
  5. As a result of 4., LLVM managed to produce optimised LLVM IR in which, e.g., the raw calls to malloc were visible.
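
As a rough sketch of what step 1 might look like (the struct and function
names here are simplified stand-ins, not the real NRT_MemSys layout):

/* Hypothetical sketch: replace the JIT-compiled atomic helpers that are
   injected as function pointers with plain C11 atomics that the C compiler
   (and later LLVM, once the NRT is compiled to bitcode) can see through. */
#include <stdatomic.h>

typedef struct {
    atomic_size_t stats_alloc;
    atomic_size_t stats_free;
    atomic_size_t stats_mi_alloc;
    atomic_size_t stats_mi_free;
} MemSysStats;

static MemSysStats TheStats;

/* Instead of TheMSys.atomic_inc(&TheMSys.stats_mi_alloc): */
static inline void stats_inc(atomic_size_t *counter)
{
    atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
}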

From doing the above, it became very clear that the memory system statistics collection "got in the way" and prevented further optimisation. I manually removed it (similar to the patch above) and LLVM managed to optimise even further, but certain patterns in the way Numba currently generates code prevented it from optimising the malloc/free to use the stack instead.
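
To illustrate the kind of optimisation at stake (generic LLVM behaviour, not
Numba code): once a matched malloc/free pair is visible to the optimiser and
the pointer does not escape, the whole heap round-trip can be elided.

/* clang -O2 can reduce this function to returning a constant, because the
   allocation never escapes and is freed on every path. */
#include <stdlib.h>

int sum_small(void)
{
    int *buf = malloc(4 * sizeof(int));
    if (buf == NULL)
        return -1;
    int s = 0;
    for (int i = 0; i < 4; i++) {
        buf[i] = i;   /* all uses are local and visible */
        s += buf[i];
    }
    free(buf);        /* matched free: allocation can be removed */
    return s;
}

Opaque calls through the TheMSys function pointers (and the stats updates in
between) hide the malloc/free pair and block exactly this transformation.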

I think the following needs to happen in this order, which is also the approximate order of difficulty:

  1. Numba should make the switch to use <stdatomic.h>.
  2. The use of the TheMSys stats atomic operations needs to be configurable, off by default. They are not that useful during standard runtime, but they are for debugging (a sketch of one possible shape follows this list).
  3. The use of the debug markers when destructors are called also needs to be part of the configuration described in 2.
  4. Compiling the NRT as bitcode ahead of time and linking it in (with updates to force inlining) to get more optimisation opportunity.
  5. Investigate and fix code generation to take full advantage of the previous steps.
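
A possible shape for point 2 (hypothetical; the design that eventually lands
may differ) is a flag, off by default, checked before each counter update:

/* When stats are disabled, the hot path is a well-predicted branch and no
   shared cache line is touched. */
#include <stdatomic.h>
#include <stdbool.h>

static bool stats_enabled = false;  /* off by default, toggled via an API */

static inline void stats_inc_guarded(atomic_size_t *counter)
{
    if (stats_enabled)
        atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
}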

stuartarchibald added the performance - run time label (Performance issue occurring at run time) and removed the needtriage label on May 24, 2022
@aseyboldt (Contributor, Author)

@stuartarchibald That makes a lot of sense to me for what it is worth.

While looking at this a bit more, I noticed that in nrt.c Numba does not check
whether allocations succeeded (e.g., here). The C code also doesn't check that
the integer computation for the alignment doesn't overflow (here). It does seem
that arrayobj.py checks earlier that a signed int doesn't overflow, so as long
as align isn't too big the unsigned computation shouldn't overflow either, and
this might not be an issue in practice.
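
For illustration, a hedged sketch of the missing checks (checked_allocate is a
hypothetical helper, not the actual nrt.c code):

/* Guard the size arithmetic against unsigned wraparound, and report
   allocation failure to the caller instead of segfaulting later. */
#include <stdint.h>
#include <stdlib.h>

void *checked_allocate(size_t size, size_t extra)
{
    if (size > SIZE_MAX - extra)  /* size + extra would wrap */
        return NULL;
    void *p = malloc(size + extra);
    if (p == NULL)                /* malloc failed; propagate */
        return NULL;
    return p;
}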

We can trigger a segfault by requesting too much memory:

import numba
import numpy as np

n = numba.size_t.maxval // 8 // 2

# Raises a memory error as it should
# np.zeros(n)

@numba.njit
def foo():
    return np.zeros(n)

# Segfault
out = foo()

I can also open a separate issue for this; I guess it isn't really related to the performance issue.

@stuartarchibald (Contributor)

> @stuartarchibald That makes a lot of sense to me for what it is worth.

Great, thanks for taking a look.

> While looking at this a bit more, I noticed that in nrt.c Numba does not check whether allocations succeeded (e.g., here). The C code also doesn't check that the integer computation for the alignment doesn't overflow (here). It does seem that arrayobj.py checks earlier that a signed int doesn't overflow, so as long as align isn't too big the unsigned computation shouldn't overflow either, and this might not be an issue in practice.
>
> We can trigger a segfault by requesting too much memory:
>
> import numba
> import numpy as np
>
> n = numba.size_t.maxval // 8 // 2
>
> # Raises a memory error as it should
> # np.zeros(n)
>
> @numba.njit
> def foo():
>     return np.zeros(n)
>
> # Segfault
> out = foo()
>
> I can also open a separate issue for this; I guess it isn't really related to the performance issue.

Please could you put this into a new issue? We can discuss the best thing to do over there, many thanks!

@aseyboldt (Contributor, Author)

The alloc issue is now #8105

@stuartarchibald (Contributor)

> The alloc issue is now #8105

Many thanks for opening that.

@stuartarchibald (Contributor)

xref: #8156, which discusses making the stats counters optional, off by default.

@stuartarchibald (Contributor)

xref: #8200, which makes the memset calls that add debug markers off by default.

@stuartarchibald (Contributor)

xref: #8235, which implements turning the stats counters off by default.

@stuartarchibald (Contributor)

xref: summary here: #8156 (comment). Once the patches are merged, we can potentially close this ticket?

@aseyboldt (Contributor, Author)

Fixed in #8235
