
Bad multithreaded performance of allocations #8101

Closed
aseyboldt opened this issue May 23, 2022 · 10 comments
Labels: performance - run time (Performance issue occurring at run time)

Comments

@aseyboldt (Contributor) commented May 23, 2022

When new arrays are allocated, wrapper functions around malloc in nrt.c are called.
Those wrappers increment an atomic counter (I guess for debugging/profiling purposes?),
but if we call those functions from several threads, this can come with a serious
performance hit, because different cores write to the same memory location. In some
real-world code (which could probably be improved by avoiding some of the allocations)
I see a performance drop of ~30%.
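
To make the access pattern concrete, here is a minimal standalone C sketch
(illustrative only, not Numba's actual code) of many threads doing a
read-modify-write on one shared atomic counter, which is essentially what the
NRT wrappers do on every allocation:

/* contention.c -- build with: cc -O2 -pthread contention.c
   Every thread performs an atomic read-modify-write on the same counter,
   so the cache line holding it ping-pongs between cores. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_size_t stats_alloc;  /* shared by all threads */

static void *worker(void *arg)
{
    (void)arg;
    for (long i = 0; i < 10000000L; i++)
        atomic_fetch_add_explicit(&stats_alloc, 1, memory_order_relaxed);
    return NULL;
}

int main(void)
{
    pthread_t threads[8];
    for (int i = 0; i < 8; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < 8; i++)
        pthread_join(threads[i], NULL);
    printf("stats_alloc = %zu\n", (size_t)atomic_load(&stats_alloc));
    return 0;
}

Replacing the shared counter with per-thread counters (or dropping it, as in
the patch below) removes the cache-line ping-pong entirely.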

To show the issue more clearly, here is an extreme example:

import numba
import numpy as np
import threading

@numba.njit(nogil=True)  # release gil so that we actually run this in parallel
def foo(x):
    for _ in range(100_000_000):
        # allocate a new array each time
        x += np.zeros_like(x)
    return x

def run():
    x = np.array(1.)
    foo(x)

threads = [threading.Thread(target=run) for _ in range(20)]

for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

Timings:

  • With atomic inc: 1h total CPU time, 4min wall time
  • Without atomic inc (see patch below): 3min total CPU time, 20s wall time

This was run on a Threadripper 1920X with 12 cores, using jemalloc. Since some
cores have no cache in common, this is probably close to the worst case; I guess
it will not look nearly as bad on most laptops.

To evaluate the performance without the atomic counter, I used this patch for nrt.c:

diff --git a/numba/core/runtime/nrt.c b/numba/core/runtime/nrt.c
index 3a65c9b99..de3f3cfab 100644
--- a/numba/core/runtime/nrt.c
+++ b/numba/core/runtime/nrt.c
@@ -2,6 +2,7 @@
 #include <string.h> /* for memset */
 #include "nrt.h"
 #include "assert.h"
+#include <stdlib.h>
 
 #if !defined MIN
 #define MIN(a, b) ((a) < (b)) ? (a) : (b)
@@ -182,7 +183,7 @@ void NRT_MemInfo_init(NRT_MemInfo *mi,void *data, size_t size,
     mi->external_allocator = external_allocator;
     NRT_Debug(nrt_debug_print("NRT_MemInfo_init mi=%p external_allocator=%p\n", mi, external_allocator));
     /* Update stats */
-    TheMSys.atomic_inc(&TheMSys.stats_mi_alloc);
+    //TheMSys.atomic_inc(&TheMSys.stats_mi_alloc);
 }
 
 NRT_MemInfo *NRT_MemInfo_new(void *data, size_t size,
@@ -259,7 +260,7 @@ NRT_MemInfo* NRT_MemInfo_alloc_dtor_safe(size_t size, NRT_dtor_function dtor) {
     void *data = nrt_allocate_meminfo_and_data(size, &mi, NULL);
     /* Only fill up a couple cachelines with debug markers, to minimize
        overhead. */
-    memset(data, 0xCB, MIN(size, 256));
+    //memset(data, 0xCB, MIN(size, 256));
     NRT_Debug(nrt_debug_print("NRT_MemInfo_alloc_dtor_safe %p %zu\n", data, size));
     NRT_MemInfo_init(mi, data, size, nrt_internal_custom_dtor_safe, dtor, NULL);
     return mi;
@@ -297,7 +298,7 @@ NRT_MemInfo *NRT_MemInfo_alloc_safe_aligned(size_t size, unsigned align) {
     void *data = nrt_allocate_meminfo_and_data_align(size, align, &mi, NULL);
     /* Only fill up a couple cachelines with debug markers, to minimize
        overhead. */
-    memset(data, 0xCB, MIN(size, 256));
+    //memset(data, 0xCB, MIN(size, 256));
     NRT_Debug(nrt_debug_print("NRT_MemInfo_alloc_safe_aligned %p %zu\n",
                               data, size));
     NRT_MemInfo_init(mi, data, size, nrt_internal_dtor_safe, (void*)size, NULL);
@@ -321,7 +322,7 @@ void NRT_dealloc(NRT_MemInfo *mi) {
     NRT_Debug(nrt_debug_print("NRT_dealloc meminfo: %p external_allocator: %p\n", mi, mi->external_allocator));
     if (mi->external_allocator) {
         mi->external_allocator->free(mi, mi->external_allocator->opaque_data);
-        TheMSys.atomic_inc(&TheMSys.stats_free);
+        //TheMSys.atomic_inc(&TheMSys.stats_free);
     } else {
         NRT_Free(mi);
     }
@@ -329,7 +330,7 @@ void NRT_dealloc(NRT_MemInfo *mi) {
 
 void NRT_MemInfo_destroy(NRT_MemInfo *mi) {
     NRT_dealloc(mi);
-    TheMSys.atomic_inc(&TheMSys.stats_mi_free);
+    //TheMSys.atomic_inc(&TheMSys.stats_mi_free);
 }
 
 void NRT_MemInfo_acquire(NRT_MemInfo *mi) {
@@ -469,15 +470,17 @@ void* NRT_Allocate_External(size_t size, NRT_ExternalAllocator *allocator) {
         ptr = allocator->malloc(size, allocator->opaque_data);
         NRT_Debug(nrt_debug_print("NRT_Allocate_External custom bytes=%zu ptr=%p\n", size, ptr));
     } else {
-        ptr = TheMSys.allocator.malloc(size);
+        //ptr = TheMSys.allocator.malloc(size);
+	ptr = malloc(size);
         NRT_Debug(nrt_debug_print("NRT_Allocate_External bytes=%zu ptr=%p\n", size, ptr));
     }
-    TheMSys.atomic_inc(&TheMSys.stats_alloc);
+    //TheMSys.atomic_inc(&TheMSys.stats_alloc);
     return ptr;
 }
 
 void *NRT_Reallocate(void *ptr, size_t size) {
-    void *new_ptr = TheMSys.allocator.realloc(ptr, size);
+    //void *new_ptr = TheMSys.allocator.realloc(ptr, size);
+    void *new_ptr = realloc(ptr, size);
     NRT_Debug(nrt_debug_print("NRT_Reallocate bytes=%zu ptr=%p -> %p\n",
                               size, ptr, new_ptr));
     return new_ptr;
@@ -485,8 +488,9 @@ void *NRT_Reallocate(void *ptr, size_t size) {
 
 void NRT_Free(void *ptr) {
     NRT_Debug(nrt_debug_print("NRT_Free %p\n", ptr));
-    TheMSys.allocator.free(ptr);
-    TheMSys.atomic_inc(&TheMSys.stats_free);
+    free(ptr);
+    //TheMSys.allocator.free(ptr);
+    //TheMSys.atomic_inc(&TheMSys.stats_free);
 }
 
 /*

Somewhat related: #7887

@stuartarchibald (Contributor)

Thanks for the report. As noted, the performance increase in the case above (with the patch applied) comes from removing pressure on atomic ops in the Numba runtime (NRT), and also from doing less work adding the debug markers. I agree that #7887 is also likely related.

I did some experiments a few weeks back to try out some ideas in relation to improving the performance in these sorts of situations. Steps were roughly:

  1. Convert the NRT to use <stdatomic.h> as opposed to JIT compiling functions to do the atomic operations and injecting them in as function pointers (a rough sketch of this follows the list).
  2. Remove the memset that adds those debug markers.
  3. As a result of 1., it was then possible to compile the entire NRT to LLVM bitcode.
  4. As a result of 3., it was then possible to update the code generation in Numba to essentially inline everything.
  5. As a result of 4., LLVM managed to produce optimised LLVM IR in which, e.g., the raw calls to malloc were visible.
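
As a rough sketch of what step 1 might look like (the struct and function
names here are simplified stand-ins, not the real NRT_MemSys layout):

/* Hypothetical sketch: replace the JIT-compiled atomic helpers that are
   injected as function pointers with plain C11 atomics that the C compiler
   (and later LLVM, once the NRT is compiled to bitcode) can see through. */
#include <stdatomic.h>

typedef struct {
    atomic_size_t stats_alloc;
    atomic_size_t stats_free;
    atomic_size_t stats_mi_alloc;
    atomic_size_t stats_mi_free;
} MemSysStats;

static MemSysStats TheStats;

/* Instead of TheMSys.atomic_inc(&TheMSys.stats_mi_alloc): */
static inline void stats_inc(atomic_size_t *counter)
{
    atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
}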

From doing the above, it became very clear that the memory system statistics collection "got in the way" and prevented further optimisation. I manually removed it (similar to the patch above) and LLVM managed to optimise even further, but certain patterns in the way Numba currently generates code prevented it from optimising the malloc/free to use the stack instead.
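
To illustrate the kind of optimisation at stake (generic LLVM behaviour, not
Numba code): once a matched malloc/free pair is visible to the optimiser and
the pointer does not escape, the whole heap round-trip can be elided.

/* clang -O2 can reduce this function to returning a constant, because the
   allocation never escapes and is freed on every path. */
#include <stdlib.h>

int sum_small(void)
{
    int *buf = malloc(4 * sizeof(int));
    if (buf == NULL)
        return -1;
    int s = 0;
    for (int i = 0; i < 4; i++) {
        buf[i] = i;   /* all uses are local and visible */
        s += buf[i];
    }
    free(buf);        /* matched free: allocation can be removed */
    return s;
}

Opaque calls through the TheMSys function pointers (and the stats updates in
between) hide the malloc/free pair and block exactly this transformation.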

I think the following needs to happen in this order, which is also the approximate order of difficulty:

  1. Numba should make the switch to use <stdatomic.h>.
  2. The use of the TheMSys stats atomic operations needs to be configurable, off by default. They are not that useful during standard runtime, but they are for debugging (a sketch of one possible shape follows this list).
  3. The use of the debug markers when destructors are called also needs to be part of the configuration described in 2.
  4. Compiling the NRT as bitcode ahead of time and linking it in (with updates to force inlining) to get more optimisation opportunity.
  5. Investigate and fix code generation to take full advantage of the previous steps.
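
A possible shape for point 2 (hypothetical; the design that eventually lands
may differ) is a flag, off by default, checked before each counter update:

/* When stats are disabled, the hot path is a well-predicted branch and no
   shared cache line is touched. */
#include <stdatomic.h>
#include <stdbool.h>

static bool stats_enabled = false;  /* off by default, toggled via an API */

static inline void stats_inc_guarded(atomic_size_t *counter)
{
    if (stats_enabled)
        atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
}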

stuartarchibald added the performance - run time label (Performance issue occurring at run time) and removed the needtriage label on May 24, 2022
@aseyboldt (Contributor, Author)

@stuartarchibald That makes a lot of sense to me for what it is worth.

While looking at this a bit more, I noticed that in nrt.c Numba does not check
whether allocations succeeded (e.g., here). The C code also doesn't check that
the integer computation for the alignment doesn't overflow (here). It does seem
that arrayobj.py checks earlier that a signed int doesn't overflow, so as long
as align isn't too big the unsigned computation shouldn't overflow either, and
this might not be an issue in practice.
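
For illustration, a hedged sketch of the missing checks (checked_allocate is a
hypothetical helper, not the actual nrt.c code):

/* Guard the size arithmetic against unsigned wraparound, and report
   allocation failure to the caller instead of segfaulting later. */
#include <stdint.h>
#include <stdlib.h>

void *checked_allocate(size_t size, size_t extra)
{
    if (size > SIZE_MAX - extra)  /* size + extra would wrap */
        return NULL;
    void *p = malloc(size + extra);
    if (p == NULL)                /* malloc failed; propagate */
        return NULL;
    return p;
}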

We can trigger a segfault by requesting too much memory:

import numba
import numpy as np

n = numba.size_t.maxval // 8 // 2

# Raises a memory error as it should
# np.zeros(n)

@numba.njit
def foo():
    return np.zeros(n)

# Segfault
out = foo()

I can also open a separate issue for this; I guess it isn't really related to the performance issue.

@stuartarchibald (Contributor)

> @stuartarchibald That makes a lot of sense to me for what it is worth.

Great, thanks for taking a look.

> While looking at this a bit more, I noticed that in nrt.c Numba does not check whether allocations succeeded (e.g., here). The C code also doesn't check that the integer computation for the alignment doesn't overflow (here). It does seem that arrayobj.py checks earlier that a signed int doesn't overflow, so as long as align isn't too big the unsigned computation shouldn't overflow either, and this might not be an issue in practice.
>
> We can trigger a segfault by requesting too much memory:
>
> import numba
> import numpy as np
>
> n = numba.size_t.maxval // 8 // 2
>
> # Raises a memory error as it should
> # np.zeros(n)
>
> @numba.njit
> def foo():
>     return np.zeros(n)
>
> # Segfault
> out = foo()
>
> I can also open a separate issue for this; I guess it isn't really related to the performance issue.

Please could you put this into a new issue? We can discuss the best thing to do over there, many thanks!

@aseyboldt (Contributor, Author)

The alloc issue is now #8105

@stuartarchibald (Contributor)

> The alloc issue is now #8105

Many thanks for opening that.

@stuartarchibald (Contributor)

xref: #8156, which discusses making the stats counters optional, off by default.

@stuartarchibald (Contributor)

xref: #8200, which makes the memset calls that add debug markers off by default.

@stuartarchibald (Contributor)

xref: #8235, which implements turning the stats counters off by default.

@stuartarchibald (Contributor)

xref: summary here: #8156 (comment). Once the patches are merged, we can potentially close this ticket?

@aseyboldt (Contributor, Author)

Fixed in #8235
