Bad interaction (hang) involving TBB and subprocess fork and maybe MKL OMP #9501

sklam · 2024-03-18T17:37:49Z

This is documenting an issue discovered during 0.59.1. By writing it in the issue tracker, hopefully it can be searchable in case similar problems occur in the future for us or others. The issue being described here is already fixed by b39bdf2. (Self reminder: cherrypick the fix to main and link this this issue)

The test_issue9490_non_det_ssa_problem testcase can hang on very specifically Python 3.9, NumPy 1.25 and Linux x86-64 at this line:

numba/numba/tests/parfor_iss9490_usecase.py

Line 71 in 147c641

np.testing.assert_allclose(got_beta, exp_beta)

GDB reveals output:

(gdb) info threads
  Id   Target Id                                    Frame
* 1    Thread 0x7ffff7c39280 (LWP 3954038) "python" __pthread_clockjoin_ex (threadid=140736834631424, thread_return=0x0,
    clockid=<optimized out>, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145
  2    Thread 0x7fffdc49c700 (LWP 3954042) "python" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  3    Thread 0x7fffdc09b700 (LWP 3954043) "python" 0x00007ffff7d3f89b in sched_yield () at ../sysdeps/unix/syscall-template.S:78
  4    Thread 0x7fffdbc9a700 (LWP 3954044) "python" 0x00007ffff7d3f89b in sched_yield () at ../sysdeps/unix/syscall-template.S:78
  5    Thread 0x7fffdb899700 (LWP 3954045) "python" 0x00007ffff7d3f89b in sched_yield () at ../sysdeps/unix/syscall-template.S:78
  6    Thread 0x7fffdb498700 (LWP 3954047) "python" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  7    Thread 0x7fffdac96700 (LWP 3954048) "python" 0x00007ffff7d3f89b in sched_yield () at ../sysdeps/unix/syscall-template.S:78
  8    Thread 0x7fffdb097700 (LWP 3954046) "python" 0x00007ffff7d3f89b in sched_yield () at ../sysdeps/unix/syscall-template.S:78
  12   Thread 0x7fffd9c92700 (LWP 3954054) "python" 0x00007ffff7d3f89b in sched_yield () at ../sysdeps/unix/syscall-template.S:78
  13   Thread 0x7fffd9490700 (LWP 3954053) "python" 0x00007ffff7d3f89b in sched_yield () at ../sysdeps/unix/syscall-template.S:78
  14   Thread 0x7fffda895700 (LWP 3954050) "python" 0x00007ffff7d3f89b in sched_yield () at ../sysdeps/unix/syscall-template.S:78
  15   Thread 0x7fffd908f700 (LWP 3954055) "python" 0x00007ffff5b4349a in __kmp_acquire_ticket_lock ()
   from /home/siu/miniconda3/envs/testenv_bad/lib/python3.9/site-packages/mkl/../../../libiomp5.so


Thread 1 "python" received signal SIGINT, Interrupt.
__pthread_clockjoin_ex (threadid=140736414611200, thread_return=0x0, clockid=<optimized out>, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145
145	pthread_join_common.c: No such file or directory.
(gdb) bt
#0  __pthread_clockjoin_ex (threadid=140736414611200, thread_return=0x0, clockid=<optimized out>, abstime=<optimized out>, block=<optimized out>)
    at pthread_join_common.c:145
#1  0x00007fffdcba294f in tbb::detail::r1::rml::internal::thread_monitor::join (
    handle=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>)
    at /usr/local/src/conda/tbb-2021.8.0/src/tbb/rml_thread_monitor.h:201
#2  tbb::detail::r1::rml::private_worker::release_handle (join=<optimized out>, handle=<optimized out>)
    at /usr/local/src/conda/tbb-2021.8.0/src/tbb/private_server.cpp:231
#3  tbb::detail::r1::rml::private_worker::start_shutdown (this=0x7fffdc7eb900) at /usr/local/src/conda/tbb-2021.8.0/src/tbb/private_server.cpp:256
#4  tbb::detail::r1::rml::private_worker::start_shutdown (this=0x7fffdc7eb900) at /usr/local/src/conda/tbb-2021.8.0/src/tbb/private_server.cpp:236
#5  tbb::detail::r1::rml::private_server::request_close_connection (this=0x7fffdcad3f80) at /usr/local/src/conda/tbb-2021.8.0/src/tbb/private_server.cpp:188
#6  0x00007fffdcb9c473 in tbb::detail::r1::market::release (this=0x7fffdcacf580, is_public=<optimized out>, blocking_terminate=<optimized out>)
    at /usr/local/src/conda/tbb-2021.8.0/src/tbb/market.cpp:214
#7  0x00007fffdcb9770a in tbb::detail::r1::finalize_impl (handle=...,
    handle@entry=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>)
    at /usr/local/src/conda/tbb-2021.8.0/src/tbb/governor.cpp:290
#8  tbb::detail::r1::finalize (mode=1, mode@entry=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, handle=...,
    handle@entry=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>)
    at /usr/local/src/conda/tbb-2021.8.0/src/tbb/governor.cpp:303
#9  tbb::detail::r1::finalize (handle=..., mode=1) at /usr/local/src/conda/tbb-2021.8.0/src/tbb/governor.cpp:298
#10 0x00007fffdcb7d1db in prepare_fork() ()
   from /home/siu/miniconda3/envs/testenv_bad/lib/python3.9/site-packages/numba/np/ufunc/tbbpool.cpython-39-x86_64-linux-gnu.so
#11 0x00007ffff7cd1c50 in __run_fork_handlers (who=<optimized out>, who@entry=atfork_run_prepare, do_locking=do_locking@entry=true) at register-atfork.c:123
#12 0x00007ffff7d20094 in __libc_fork () at ../sysdeps/nptl/fork.c:58
#13 0x00007ffff7fad47b in subprocess_fork_exec (self=<optimized out>, args=<optimized out>) at /usr/local/src/conda/python-3.9.18/Modules/_posixsubprocess.c:844
#14 0x00000000005073b0 in cfunction_call (func=0x7ffff6bce5e0, args=<optimized out>, kwargs=<optimized out>)
    at /usr/local/src/conda/python-3.9.18/Objects/methodobject.c:552

Important observations here:

Thread 1 is stuck at __pthread_clockjoin_ex as part of a prepare_fork call for the TBB backend. TBB is known to have problems during fork. This explains the hang. But it is surprising for numba.testing.assert_allclose would cause a fork.
Thread 14 is from OMP in MKL. It is not confirmed whether having OMP threadpool have an effect in this hang.

Using PDB to trace the hanging line, it is discovered that NumPy 1.25+ introduced this lines which call subprocess thus fork:

https://github.com/numpy/numpy/blob/maintenance/1.25.x/numpy/testing/_private/utils.py#L1239-L1253

Together with the lazy import by numba.testing.assert_allclose, it explains the use of fork.

As to why the test is not failing with NumPy 1.26:

first, NumPy 1.26 conda package does not support Python 3.9. Thus we only test Python 3.10+.
Python 3.10 has several changes to subprocess which may have averted bugs with fork; e.g. potential undefined behavior with subprocess using vfork() on Linux? python/cpython#91401

The [fix] (

numba/numba/tests/parfor_iss9490_usecase.py

Line 71 in 147c641

np.testing.assert_allclose(got_beta, exp_beta)

) is done by forcing one Numba thread thus one TBB thread. That perhaps avoided subprocess fork bug. I didn't do further analysis since the bug is fixed.

The text was updated successfully, but these errors were encountered:

sklam added discussion An issue requiring discussion bug - incorrect behavior Bugs: incorrect behavior no action required No action was needed to resolve. labels Mar 18, 2024

sklam mentioned this issue Mar 18, 2024

Patches for 0.59.1 #9480

Merged

3 tasks

sklam changed the title ~~Documenting issue~~ Bad interaction involving TBB and subprocess fork and maybe MKL OMP Mar 18, 2024

sklam changed the title ~~Bad interaction involving TBB and subprocess fork and maybe MKL OMP~~ Bad interaction (hang) involving TBB and subprocess fork and maybe MKL OMP Mar 18, 2024

sklam added a commit to sklam/numba that referenced this issue Mar 19, 2024

Closes numba#9501. Add link to bug description.

c3b3235

sklam added a commit to sklam/numba that referenced this issue Mar 19, 2024

Closes numba#9501. Add link to bug description.

fb7361a

sklam added a commit to sklam/numba that referenced this issue Mar 19, 2024

Closes numba#9501. Add link to bug description.

63164d5

sklam closed this as completed in 560cf3b Mar 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bad interaction (hang) involving TBB and subprocess fork and maybe MKL OMP #9501

Bad interaction (hang) involving TBB and subprocess fork and maybe MKL OMP #9501

sklam commented Mar 18, 2024 •

edited

Bad interaction (hang) involving TBB and subprocess fork and maybe MKL OMP #9501

Bad interaction (hang) involving TBB and subprocess fork and maybe MKL OMP #9501

Comments

sklam commented Mar 18, 2024 • edited

sklam commented Mar 18, 2024 •

edited