esc edited this page Jun 8, 2022 · 1 revision

Numba Meeting: 2022-06-07

Attendees: Todd A. Anderson, Andre Masella, Enrico Guiraud, Graham Markall, Jim Pivarski, stuart, Vincenzo Eduardo Padulano, LI Da, Shannon Quinn, Siu Kwan Lam, brandon willard, Benjamin Graham, Kaustubh Chaudhari, Nickholas Riasanovsky, Luk, Guilherme Leobas

NOTE: All communication is subject to the Numba Code of Conduct.

Please refer to this calendar for the next meeting date.

0. Discussion

  • Numba 0.56.0
  • Creating and posting meeting agenda earlier
    • Siu: create the meeting document and publish it immediately after each meeting, then add topics throughout the week so folks can look at them before the next meeting
  • Multi-threaded, concurrent calls of some Numba-jitted cfuncs are limited in scaling by the Numba runtime (NRT)
    • An example function that shows the problem when called concurrently from multiple threads:
def count_muons(ptr):
    arr = numba.carray(ptr, 10)
    return np.count_nonzero((arr > 1.) & (np.abs(arr) < 7.) & (arr > 0.))

The full reproducer (will try to come up with something that does not depend on ROOT):

import numba
import numpy as np
import ROOT
from time import time

def count_muons_loop(ptr):
    arr = numba.carray(ptr, 10)
    count = 0
    for i in range(len(arr)):
        if arr[i] > 1. and abs(arr[i]) < 7. and arr[i] > 0.:
            count += 1
    return count

def count_muons(ptr):
    arr = numba.carray(ptr, 10)
    return np.count_nonzero((arr > 1.) & (np.abs(arr) < 7.) & (arr > 0.))


if __name__ == "__main__":
    loop_func = numba.cfunc(numba.int32(numba.types.CPointer(numba.float32)), nopython=True)(count_muons_loop)
    numpy_func = numba.cfunc(numba.int32(numba.types.CPointer(numba.float32)), nopython=True)(count_muons)

    ROOT.gInterpreter.Declare(f"""
auto *loopf = reinterpret_cast<int(*)(float*)>({loop_func.address});
auto *npf = reinterpret_cast<int(*)(float*)>({numpy_func.address});
""")

    ROOT.gInterpreter.Calc("""
ROOT::TThreadExecutor t;
float arr[10]{};
std::vector<float*> args(1'000'000'000, arr);
TStopwatch s;
s.Start();
auto res = t.Map(loopf, args);
s.Stop();
s.Print();

s.Reset();
s.Start();
res = t.Map(npf, args);
s.Stop();
s.Print();
    """)

And an example stacktrace of where threads get stuck:

#0  0x00007fffc8e11000 in nrt_atomic_add ()
#1  0x00007ffff6dc442d in NRT_MemInfo_destroy (mi=0x7fff900010e0) at numba/core/runtime/nrt.c:331
#2  NRT_MemInfo_call_dtor (mi=0x7fff900010e0) at numba/core/runtime/nrt.c:348
#3  0x00007fffc04e8566 in numba::np::arraymath::np_count_nonzero::_3clocals_3e::impl_244[abi:c8tJTIeFCjyCbUFRqqOAK_2f6h0phxApMogijRBAA_3d](Array<bool, 1, C, mutable, aligned>, omitted_28default_3dNone_29) ()
#4  0x00007fffc04e3b0c in __main__::count_muons_243[abi:c8tJTIeFCjyCbUFRqqOAK_2f6h0ogIRRJAjSSYRBFEjyYA](float32_2a) ()
#5  0x00007fffc04e3b82 in cfunc._ZN8__main__15count_muons_243B46c8tJTIeFCjyCbUFRqqOAK_2f6h0ogIRRJAjSSYRBFEjyYAE10float32_2a ()
#6  0x00007fffb064cb20 in ROOT::TThreadExecutor::MapImpl<int (*)(float*), float*, void>(int (*)(float*), std::vector<float*, std::allocator<float*> >&)::{lambda(unsigned int)#1}::operator()(unsigned int) const (this=0x55556051cfe0, i=<optimized out>) at /home/blue/ROOT/relwithdebinfo/cmake-build-foo/include/ROOT/TThreadExecutor.hxx:330
#7  0x00007fffc04fd043 in std::function<void (unsigned int)>::operator()(unsigned int) const (__args#0=<optimized out>, this=<optimized out>)
    at /usr/include/c++/11.2.0/bits/std_function.h:560
    • _nrt=False can be set as a compilation option to disable NRT reference counting
    • future changes that are being considered:
      • Remove unneeded atomic ops on internal stats of NRT
      • Inline NRT as LLVM for more aggressive optimization
    • turn off atomicity?
      • flag to turn off atomicity seems doable
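
To make the discussion above concrete, here is a minimal toy model, in pure Python, of why a NumPy-style body can touch shared counters on every call while a scalar loop does not. It is not Numba's NRT code. The names `SharedStats`, `ToyMemInfo`, `numpy_style_call`, `loop_style_call` and `run` are hypothetical, the lock only stands in for the serializing effect of an atomic add, and `n_temporaries=6` is an illustrative assumption (Numba may fuse element-wise expressions and allocate fewer temporaries).

```python
import threading


class SharedStats:
    """Hypothetical stand-in for process-wide runtime statistics counters."""

    def __init__(self):
        # The lock models the serialization an atomic add imposes on
        # threads that update the same memory location.
        self._lock = threading.Lock()
        self.mi_alloc = 0
        self.mi_free = 0

    def atomic_add(self, name):
        with self._lock:
            setattr(self, name, getattr(self, name) + 1)


class ToyMemInfo:
    """Hypothetical reference-counted owner of one temporary array."""

    def __init__(self, stats):
        self._stats = stats
        self.refct = 1
        stats.atomic_add("mi_alloc")

    def release(self):
        self.refct -= 1
        if self.refct == 0:
            # Destruction bumps a shared counter, mirroring the
            # NRT_MemInfo_destroy -> nrt_atomic_add frames in the trace.
            self._stats.atomic_add("mi_free")


def numpy_style_call(stats, n_temporaries=6):
    # Assumption: each intermediate array is allocated and then freed.
    for _ in range(n_temporaries):
        ToyMemInfo(stats).release()


def loop_style_call(stats):
    # A scalar loop allocates no temporary arrays, so no counters move.
    return None


def run(call, n_threads=4, calls_per_thread=1000):
    """Invoke `call` from several threads and return the shared counters."""
    stats = SharedStats()

    def worker():
        for _ in range(calls_per_thread):
            call(stats)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stats


if __name__ == "__main__":
    for call in (loop_style_call, numpy_style_call):
        stats = run(call)
        total = stats.mi_alloc + stats.mi_free
        print(call.__name__, total, "shared-counter updates")
```

With the defaults, the loop-style body performs no shared-counter updates, while the NumPy-style body performs 4 × 1000 × 6 × 2 = 48000 of them. The model only counts updates; it does not measure timing, since Python threads are serialized by the GIL anyway.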
@njit('void(string)', no_cfunc_wrapper=True)
def foo(a):
    # raise IndexError(a + ' world', a, 'test', a, 3)
    raise ValueError(a, IndexError)

1. New Issues

  • #8127 - No out of bounds check during advanced array indexing
  • #8128 - typed.List is not considered a types.Sequence while a reflected list is
  • #8131 - All slices of contiguous 2D+ arrays are assumed to be not contiguous (even when they would be)
  • #8132 - Record not recognized as a data type
  • #8135 - Slow compilation of function taking numpy structured arrays with many fields
    • Siu to produce a chrome trace profile for further discussion

llvmlite:

  • #850 - Better error support for creating custom types

Closed Issues

  • #8130 - NumbaIRAssumptionWarning: variable '_i8_impl_v4_cur_2' is not in scope

2. New PRs

  • #8122 - WIP: support register_jitable-ed function as njit function argument
  • #8123 - Fix CUDA print tests on Windows
  • #8124 - Add explicit checks to all allocators in the NRT.
  • #8125 - [DO NOT MERGE] Temp/pr8061
  • #8126 - Mark gufuncs as having mutable inputs
  • #8129 - FIXED :: No out of bounds check during advanced array indexing #8127
  • #8133 - Fix #8132. Regression in Record.make_c_struct for handling nestedarray
  • #8134 - Support non-constant exception values in JIT
  • #8136 - Fix some C++ 11 Issues
  • #8137 - CUDA: Fix #7806, Division by zero stops the kernel

llvmlite:

  • #849 - added type hints

Closed PRs

llvmlite:

  • #851 - adding the llvm_11_consecutive_registers.patch

3. Next Release: Version 0.56.0/0.39.0, RC June
