Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bad interaction (hang) involving TBB and subprocess fork and maybe MKL OMP #9501

Closed
sklam opened this issue Mar 18, 2024 · 0 comments
Closed
Labels
bug - incorrect behavior Bugs: incorrect behavior discussion An issue requiring discussion no action required No action was needed to resolve.

Comments

@sklam
Copy link
Member

sklam commented Mar 18, 2024

This is documenting an issue discovered during 0.59.1. By writing it in the issue tracker, hopefully it can be searchable in case similar problems occur in the future for us or others. The issue being described here is already fixed by b39bdf2. (Self reminder: cherrypick the fix to main and link this this issue)

The test_issue9490_non_det_ssa_problem testcase can hang on very specifically Python 3.9, NumPy 1.25 and Linux x86-64 at this line:

np.testing.assert_allclose(got_beta, exp_beta)

GDB reveals output:

(gdb) info threads
  Id   Target Id                                    Frame
* 1    Thread 0x7ffff7c39280 (LWP 3954038) "python" __pthread_clockjoin_ex (threadid=140736834631424, thread_return=0x0,
    clockid=<optimized out>, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145
  2    Thread 0x7fffdc49c700 (LWP 3954042) "python" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  3    Thread 0x7fffdc09b700 (LWP 3954043) "python" 0x00007ffff7d3f89b in sched_yield () at ../sysdeps/unix/syscall-template.S:78
  4    Thread 0x7fffdbc9a700 (LWP 3954044) "python" 0x00007ffff7d3f89b in sched_yield () at ../sysdeps/unix/syscall-template.S:78
  5    Thread 0x7fffdb899700 (LWP 3954045) "python" 0x00007ffff7d3f89b in sched_yield () at ../sysdeps/unix/syscall-template.S:78
  6    Thread 0x7fffdb498700 (LWP 3954047) "python" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  7    Thread 0x7fffdac96700 (LWP 3954048) "python" 0x00007ffff7d3f89b in sched_yield () at ../sysdeps/unix/syscall-template.S:78
  8    Thread 0x7fffdb097700 (LWP 3954046) "python" 0x00007ffff7d3f89b in sched_yield () at ../sysdeps/unix/syscall-template.S:78
  12   Thread 0x7fffd9c92700 (LWP 3954054) "python" 0x00007ffff7d3f89b in sched_yield () at ../sysdeps/unix/syscall-template.S:78
  13   Thread 0x7fffd9490700 (LWP 3954053) "python" 0x00007ffff7d3f89b in sched_yield () at ../sysdeps/unix/syscall-template.S:78
  14   Thread 0x7fffda895700 (LWP 3954050) "python" 0x00007ffff7d3f89b in sched_yield () at ../sysdeps/unix/syscall-template.S:78
  15   Thread 0x7fffd908f700 (LWP 3954055) "python" 0x00007ffff5b4349a in __kmp_acquire_ticket_lock ()
   from /home/siu/miniconda3/envs/testenv_bad/lib/python3.9/site-packages/mkl/../../../libiomp5.so

Thread 1 "python" received signal SIGINT, Interrupt.
__pthread_clockjoin_ex (threadid=140736414611200, thread_return=0x0, clockid=<optimized out>, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145
145	pthread_join_common.c: No such file or directory.
(gdb) bt
#0  __pthread_clockjoin_ex (threadid=140736414611200, thread_return=0x0, clockid=<optimized out>, abstime=<optimized out>, block=<optimized out>)
    at pthread_join_common.c:145
#1  0x00007fffdcba294f in tbb::detail::r1::rml::internal::thread_monitor::join (
    handle=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>)
    at /usr/local/src/conda/tbb-2021.8.0/src/tbb/rml_thread_monitor.h:201
#2  tbb::detail::r1::rml::private_worker::release_handle (join=<optimized out>, handle=<optimized out>)
    at /usr/local/src/conda/tbb-2021.8.0/src/tbb/private_server.cpp:231
#3  tbb::detail::r1::rml::private_worker::start_shutdown (this=0x7fffdc7eb900) at /usr/local/src/conda/tbb-2021.8.0/src/tbb/private_server.cpp:256
#4  tbb::detail::r1::rml::private_worker::start_shutdown (this=0x7fffdc7eb900) at /usr/local/src/conda/tbb-2021.8.0/src/tbb/private_server.cpp:236
#5  tbb::detail::r1::rml::private_server::request_close_connection (this=0x7fffdcad3f80) at /usr/local/src/conda/tbb-2021.8.0/src/tbb/private_server.cpp:188
#6  0x00007fffdcb9c473 in tbb::detail::r1::market::release (this=0x7fffdcacf580, is_public=<optimized out>, blocking_terminate=<optimized out>)
    at /usr/local/src/conda/tbb-2021.8.0/src/tbb/market.cpp:214
#7  0x00007fffdcb9770a in tbb::detail::r1::finalize_impl (handle=...,
    handle@entry=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>)
    at /usr/local/src/conda/tbb-2021.8.0/src/tbb/governor.cpp:290
#8  tbb::detail::r1::finalize (mode=1, mode@entry=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>, handle=...,
    handle@entry=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>)
    at /usr/local/src/conda/tbb-2021.8.0/src/tbb/governor.cpp:303
#9  tbb::detail::r1::finalize (handle=..., mode=1) at /usr/local/src/conda/tbb-2021.8.0/src/tbb/governor.cpp:298
#10 0x00007fffdcb7d1db in prepare_fork() ()
   from /home/siu/miniconda3/envs/testenv_bad/lib/python3.9/site-packages/numba/np/ufunc/tbbpool.cpython-39-x86_64-linux-gnu.so
#11 0x00007ffff7cd1c50 in __run_fork_handlers (who=<optimized out>, who@entry=atfork_run_prepare, do_locking=do_locking@entry=true) at register-atfork.c:123
#12 0x00007ffff7d20094 in __libc_fork () at ../sysdeps/nptl/fork.c:58
#13 0x00007ffff7fad47b in subprocess_fork_exec (self=<optimized out>, args=<optimized out>) at /usr/local/src/conda/python-3.9.18/Modules/_posixsubprocess.c:844
#14 0x00000000005073b0 in cfunction_call (func=0x7ffff6bce5e0, args=<optimized out>, kwargs=<optimized out>)
    at /usr/local/src/conda/python-3.9.18/Objects/methodobject.c:552

Important observations here:

  1. Thread 1 is stuck at __pthread_clockjoin_ex as part of a prepare_fork call for the TBB backend. TBB is known to have problems during fork. This explains the hang. But it is surprising for numba.testing.assert_allclose would cause a fork.
  2. Thread 14 is from OMP in MKL. It is not confirmed whether having OMP threadpool have an effect in this hang.

Using PDB to trace the hanging line, it is discovered that NumPy 1.25+ introduced this lines which call subprocess thus fork:

https://github.com/numpy/numpy/blob/maintenance/1.25.x/numpy/testing/_private/utils.py#L1239-L1253

Together with the lazy import by numba.testing.assert_allclose, it explains the use of fork.

As to why the test is not failing with NumPy 1.26:

The [fix] (

np.testing.assert_allclose(got_beta, exp_beta)
) is done by forcing one Numba thread thus one TBB thread. That perhaps avoided subprocess fork bug. I didn't do further analysis since the bug is fixed.

@sklam sklam added discussion An issue requiring discussion bug - incorrect behavior Bugs: incorrect behavior no action required No action was needed to resolve. labels Mar 18, 2024
@sklam sklam mentioned this issue Mar 18, 2024
3 tasks
@sklam sklam changed the title Documenting issue Bad interaction involving TBB and subprocess fork and maybe MKL OMP Mar 18, 2024
@sklam sklam changed the title Bad interaction involving TBB and subprocess fork and maybe MKL OMP Bad interaction (hang) involving TBB and subprocess fork and maybe MKL OMP Mar 18, 2024
sklam added a commit to sklam/numba that referenced this issue Mar 19, 2024
sklam added a commit to sklam/numba that referenced this issue Mar 19, 2024
sklam added a commit to sklam/numba that referenced this issue Mar 19, 2024
@sklam sklam closed this as completed in 560cf3b Mar 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug - incorrect behavior Bugs: incorrect behavior discussion An issue requiring discussion no action required No action was needed to resolve.
Projects
None yet
Development

No branches or pull requests

1 participant