
BUG: stack-overflow in OpenBLAS64 on macOS with embedding and importing numpy with pthreads #21799

Open
galbramc opened this issue Jun 19, 2022 · 16 comments


galbramc commented Jun 19, 2022

Describe the issue:

I am embedding CPython and executing code within threads. We get a "bus error" when importing numpy inside a pthread function, but importing numpy in the "main thread" works. I've observed this on both Intel and M1 macOS machines. I'm using numpy 1.22.4 installed with pip.

I compiled CPython 3.9.13 with the address sanitizer and found a stack overflow in dgetrf_parallel. Unfortunately I was not able to compile numpy myself with OpenBLAS to get the complete backtrace. Compiling numpy without OpenBLAS runs without any errors. I have not compiled with the address sanitizer on Linux to see whether I get the same error there, but I can if that would help.

I did look at the OpenBLAS source and there is this comment in lapack/getrf/getrf_parallel.c:

//In this case, the recursive getrf_parallel may overflow the stack.
//Instead, use malloc to alloc job_t.
#if MAX_CPU_NUMBER > GETRF_MEM_ALLOC_THRESHOLD
#define USE_ALLOC_HEAP
#endif

MAX_CPU_NUMBER is determined by the Makefiles based on the computer where OpenBLAS is compiled, and GETRF_MEM_ALLOC_THRESHOLD is defined as 80 on macOS in common.h. Maybe the solution here is to compile OpenBLAS so it always uses heap allocation in this routine?
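
A minimal sketch of that idea, assuming one is willing to patch lapack/getrf/getrf_parallel.c locally (this is only an illustration, not an upstream OpenBLAS build option), would be to defeat the threshold check so the heap path is always taken:

/* Hypothetical local patch to lapack/getrf/getrf_parallel.c: always take the
 * heap-allocation path so the job array never lives on a (possibly small)
 * thread stack.  Shown only to illustrate the suggestion above. */
#if 1 /* was: MAX_CPU_NUMBER > GETRF_MEM_ALLOC_THRESHOLD */
#define USE_ALLOC_HEAP
#endif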

Right now I am resorting to importing numpy in the main thread to avoid this issue. I've attached C code which reproduces the issue and the address sanitizer backtrace. Any advice on how to fix this would be greatly appreciated. Please let me know if there is any further information I can provide.

Reproduce the code example:

#include <Python.h>
#include <pthread.h>

void *thread_function(void *arg)
{
  PyThreadState *mainThreadState = (PyThreadState *)arg;
  PyThreadState *myThreadState = PyThreadState_New(mainThreadState->interp);

  PyEval_RestoreThread(myThreadState);
  
  printf("Thread import numpy\n");
  PyRun_SimpleString("import sys, numpy; print(numpy.__version__, sys.version)\n");
  printf("Thread import numpy done!\n");

  PyThreadState_Clear(myThreadState);
  PyThreadState_DeleteCurrent();
  
  return NULL;
}

int main(int argc, char** argv)
{
  int            stat;
  pthread_t      thread;
  pthread_attr_t attr;
  PyThreadState *mainThreadState = NULL;

  Py_InitializeEx(0);
  
  // Uncomment to avoid stack-overflow error
  //PyRun_SimpleString("try: import numpy\nexcept ImportError: pass\n");
  
  mainThreadState = PyEval_SaveThread();

  pthread_attr_init(&attr);
  stat = pthread_create(&thread, &attr, thread_function, mainThreadState);
  if (stat != 0) {
    printf(" Threading ERROR: %d (pthread_create)\n", stat);
    return 1;
  }
  stat = pthread_join(thread, NULL);
  if (stat != 0) {
    printf(" Threading ERROR: %d (pthread_join)\n", stat);
    return 1;
  }

  PyEval_RestoreThread(mainThreadState);
  Py_Finalize();

  return 0;
}
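
As a quick check (macOS-specific, and only a sketch, not part of the original report), the stack sizes that the main thread and a default-attribute worker thread actually get can be printed with the Darwin extension pthread_get_stacksize_np. On macOS, threads created with default pthread attributes typically get a much smaller stack (512 KiB) than the main thread (8 MiB), which would match the import only overflowing off the main thread:

#include <pthread.h>
#include <stdio.h>

static void *report(void *arg)
{
  (void)arg;
  /* Darwin extension: stack size of the calling thread. */
  printf("worker thread stack: %zu bytes\n",
         pthread_get_stacksize_np(pthread_self()));
  return NULL;
}

int main(void)
{
  pthread_t t;
  printf("main thread stack:   %zu bytes\n",
         pthread_get_stacksize_np(pthread_self()));
  pthread_create(&t, NULL, report, NULL);
  pthread_join(t, NULL);
  return 0;
}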

Error message:

Thread import numpy
AddressSanitizer:DEADLYSIGNAL
=================================================================
==83128==ERROR: AddressSanitizer: stack-overflow on address 0x700006499620 (pc 0x00010a42bf18 bp 0x70000651dc40 sp 0x700006499600 T1)
    #0 0x10a42bf18 in dgetrf_parallel+0x18 (libopenblas64_.0.dylib:x86_64+0x335f18)
    #1 0x10a11bf0b in dgesv_64_+0x19b (libopenblas64_.0.dylib:x86_64+0x25f0b)
    #2 0x105d9fd8e in DOUBLE_inv+0x3ae (_umath_linalg.cpython-39-darwin.so:x86_64+0x9d8e)
    #3 0x105783c0c in generic_wrapped_legacy_loop+0x1c (_multiarray_umath.cpython-39-darwin.so:x86_64+0x33dc0c)
    #4 0x10578aeb6 in ufunc_generic_fastcall+0x50e6 (_multiarray_umath.cpython-39-darwin.so:x86_64+0x344eb6)
    #5 0x10349a0fa in _PyObject_VectorcallTstate abstract.h:118
    #6 0x103496a89 in PyObject_Vectorcall abstract.h:127
    #7 0x103496bc7 in call_function ceval.c:5077
    #8 0x103491ab6 in _PyEval_EvalFrameDefault ceval.c:3537
    #9 0x1032e92ee in _PyEval_EvalFrame pycore_ceval.h:40
    #10 0x1032e74b8 in function_code_fastcall call.c:330
    #11 0x1032e6ed4 in _PyFunction_Vectorcall call.c:367
    #12 0x1032e68b7 in PyVectorcall_Call call.c:231
    #13 0x1032e6ae8 in _PyObject_Call call.c:266
    #14 0x1032e6be1 in PyObject_Call call.c:293
    #15 0x1054c78db in array_implement_array_function+0xdb (_multiarray_umath.cpython-39-darwin.so:x86_64+0x818db)
    #16 0x10335cacc in cfunction_call methodobject.c:552
    #17 0x1032e5f38 in _PyObject_MakeTpCall call.c:191
    #18 0x10349a0db in _PyObject_VectorcallTstate abstract.h:116
    #19 0x103496a89 in PyObject_Vectorcall abstract.h:127
    #20 0x103496bc7 in call_function ceval.c:5077
    #21 0x1034917aa in _PyEval_EvalFrameDefault ceval.c:3520
    #22 0x1034824ee in _PyEval_EvalFrame pycore_ceval.h:40
    #23 0x103498175 in _PyEval_EvalCode ceval.c:4329
    #24 0x1032e7333 in _PyFunction_Vectorcall call.c:396
    #25 0x10349a0fa in _PyObject_VectorcallTstate abstract.h:118
    #26 0x103496a89 in PyObject_Vectorcall abstract.h:127
    #27 0x103496bc7 in call_function ceval.c:5077
    #28 0x1034917aa in _PyEval_EvalFrameDefault ceval.c:3520
    #29 0x1034824ee in _PyEval_EvalFrame pycore_ceval.h:40
    #30 0x103498175 in _PyEval_EvalCode ceval.c:4329
    #31 0x1032e7333 in _PyFunction_Vectorcall call.c:396
    #32 0x1032e6970 in PyVectorcall_Call call.c:243
    #33 0x1032e6ae8 in _PyObject_Call call.c:266
    #34 0x1032e6be1 in PyObject_Call call.c:293
    #35 0x1054c78db in array_implement_array_function+0xdb (_multiarray_umath.cpython-39-darwin.so:x86_64+0x818db)
    #36 0x10335cacc in cfunction_call methodobject.c:552
    #37 0x1032e5f38 in _PyObject_MakeTpCall call.c:191
    #38 0x10349a0db in _PyObject_VectorcallTstate abstract.h:116
    #39 0x103496a89 in PyObject_Vectorcall abstract.h:127
    #40 0x103496bc7 in call_function ceval.c:5077
    #41 0x1034917aa in _PyEval_EvalFrameDefault ceval.c:3520
    #42 0x1034824ee in _PyEval_EvalFrame pycore_ceval.h:40
    #43 0x103498175 in _PyEval_EvalCode ceval.c:4329
    #44 0x1032e7333 in _PyFunction_Vectorcall call.c:396
    #45 0x10349a0fa in _PyObject_VectorcallTstate abstract.h:118
    #46 0x103496a89 in PyObject_Vectorcall abstract.h:127
    #47 0x103496bc7 in call_function ceval.c:5077
    #48 0x103491ab6 in _PyEval_EvalFrameDefault ceval.c:3537
    #49 0x1032e92ee in _PyEval_EvalFrame pycore_ceval.h:40
    #50 0x1032e74b8 in function_code_fastcall call.c:330
    #51 0x1032e6ed4 in _PyFunction_Vectorcall call.c:367
    #52 0x10349a0fa in _PyObject_VectorcallTstate abstract.h:118
    #53 0x103496a89 in PyObject_Vectorcall abstract.h:127
    #54 0x103496bc7 in call_function ceval.c:5077
    #55 0x1034917aa in _PyEval_EvalFrameDefault ceval.c:3520
    #56 0x1034824ee in _PyEval_EvalFrame pycore_ceval.h:40
    #57 0x103498175 in _PyEval_EvalCode ceval.c:4329
    #58 0x103498c65 in _PyEval_EvalCodeWithName ceval.c:4361
    #59 0x10348247a in PyEval_EvalCodeEx ceval.c:4377
    #60 0x103482369 in PyEval_EvalCode ceval.c:828
    #61 0x10347ecbb in builtin_exec_impl bltinmodule.c:1026
    #62 0x10347bbcb in builtin_exec bltinmodule.c.h:396
    #63 0x10335bdb5 in cfunction_vectorcall_FASTCALL methodobject.c:430
    #64 0x1032e68b7 in PyVectorcall_Call call.c:231
    #65 0x1032e6ae8 in _PyObject_Call call.c:266
    #66 0x1032e6be1 in PyObject_Call call.c:293
    #67 0x103496f0c in do_call_core ceval.c:5097
    #68 0x103491f8c in _PyEval_EvalFrameDefault ceval.c:3582
    #69 0x1034824ee in _PyEval_EvalFrame pycore_ceval.h:40
    #70 0x103498175 in _PyEval_EvalCode ceval.c:4329
    #71 0x1032e7333 in _PyFunction_Vectorcall call.c:396
    #72 0x10349a0fa in _PyObject_VectorcallTstate abstract.h:118
    #73 0x103496a89 in PyObject_Vectorcall abstract.h:127
    #74 0x103496bc7 in call_function ceval.c:5077
    #75 0x103491566 in _PyEval_EvalFrameDefault ceval.c:3489
    #76 0x1032e92ee in _PyEval_EvalFrame pycore_ceval.h:40
    #77 0x1032e74b8 in function_code_fastcall call.c:330
    #78 0x1032e6ed4 in _PyFunction_Vectorcall call.c:367
    #79 0x10349a0fa in _PyObject_VectorcallTstate abstract.h:118
    #80 0x103496a89 in PyObject_Vectorcall abstract.h:127
    #81 0x103496bc7 in call_function ceval.c:5077
    #82 0x1034915e5 in _PyEval_EvalFrameDefault ceval.c:3506
    #83 0x1032e92ee in _PyEval_EvalFrame pycore_ceval.h:40
    #84 0x1032e74b8 in function_code_fastcall call.c:330
    #85 0x1032e6ed4 in _PyFunction_Vectorcall call.c:367
    #86 0x10349a0fa in _PyObject_VectorcallTstate abstract.h:118
    #87 0x103496a89 in PyObject_Vectorcall abstract.h:127
    #88 0x103496bc7 in call_function ceval.c:5077
    #89 0x1034917aa in _PyEval_EvalFrameDefault ceval.c:3520
    #90 0x1032e92ee in _PyEval_EvalFrame pycore_ceval.h:40
    #91 0x1032e74b8 in function_code_fastcall call.c:330
    #92 0x1032e6ed4 in _PyFunction_Vectorcall call.c:367
    #93 0x10349a0fa in _PyObject_VectorcallTstate abstract.h:118
    #94 0x103496a89 in PyObject_Vectorcall abstract.h:127
    #95 0x103496bc7 in call_function ceval.c:5077
    #96 0x1034917aa in _PyEval_EvalFrameDefault ceval.c:3520
    #97 0x1032e92ee in _PyEval_EvalFrame pycore_ceval.h:40
    #98 0x1032e74b8 in function_code_fastcall call.c:330
    #99 0x1032e6ed4 in _PyFunction_Vectorcall call.c:367
    #100 0x1032e8aba in _PyObject_VectorcallTstate abstract.h:118
    #101 0x1032e8f00 in object_vacall call.c:792
    #102 0x1032e90dc in _PyObject_CallMethodIdObjArgs call.c:883
    #103 0x1034d9f8d in import_find_and_load import.c:1776
    #104 0x1034d912e in PyImport_ImportModuleLevelObject import.c:1877
    #105 0x103495a8f in import_name ceval.c:5198
    #106 0x10348eb10 in _PyEval_EvalFrameDefault ceval.c:3099
    #107 0x1034824ee in _PyEval_EvalFrame pycore_ceval.h:40
    #108 0x103498175 in _PyEval_EvalCode ceval.c:4329
    #109 0x103498c65 in _PyEval_EvalCodeWithName ceval.c:4361
    #110 0x10348247a in PyEval_EvalCodeEx ceval.c:4377
    #111 0x103482369 in PyEval_EvalCode ceval.c:828
    #112 0x1035058f4 in run_eval_code_obj pythonrun.c:1221
    #113 0x103503c5b in run_mod pythonrun.c:1242
    #114 0x103502f28 in PyRun_StringFlags pythonrun.c:1108
    #115 0x103502ded in PyRun_SimpleStringFlags pythonrun.c:497
    #116 0x102b6195b in thread_function test.c:13
    #117 0x7ff8016284e0 in _pthread_start+0x7c (libsystem_pthread.dylib:x86_64+0x64e0)
    #118 0x7ff801623f6a in thread_start+0xe (libsystem_pthread.dylib:x86_64+0x1f6a)

SUMMARY: AddressSanitizer: stack-overflow (libopenblas64_.0.dylib:x86_64+0x335f18) in dgetrf_parallel+0x18
Thread T1 created by T0 here:
    #0 0x10386f67c in wrap_pthread_create+0x5c (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x4267c)
    #1 0x102b61aba in main test.c:37
    #2 0x10610b51d in start+0x1cd (dyld:x86_64+0x551d)

==83128==ABORTING
Abort trap: 6

NumPy/Python version information:

1.22.4 3.9.13 (main, Jun 14 2022, 22:13:49)


Czaki commented Jul 26, 2023

It looks like this bug is still present.
The numpy from Anaconda works.


mattip commented Jul 27, 2023

@Czaki what version of NumPy did you try? Could you try again with one of the nightly releases from https://anaconda.org/scientific-python-nightly-wheels/numpy?

@galbramc (Author)

Thanks for looking into this. Let me know if I can help.


Czaki commented Jul 27, 2023

@mattip I have tested it with 1.24.3, 1.24.4, 1.25.0, and 1.25.1, and it crashes with every one.
When I install the wheel from the nightly builds there is no crash. I will check whether I can use it for now (it is marked as 2.0.0.dev0, so I'm not sure whether some breaking changes were introduced).

It also works with numpy 1.25.1 from conda (numpy 1.25.1 py311hb8f3215_0).

I'm able to reproduce this bug using my public project, so I could write reproduction instructions. I will also be happy to test any solution that may lead to a stable install from PyPI.


mattip commented Jul 27, 2023

When I install the wheel from the nightly builds there is no crash

The nightly is built with a newer OpenBLAS. It will be part of the next release.

Maybe the conda-forge build also uses a newer OpenBLAS (or Apple's Accelerate backend); I am not sure.


Czaki commented Jul 27, 2023

When can we expect the next release?


charris commented Jul 27, 2023

The nightly is built with a newer OpenBLAS. It will be part of the next release.

Does that need a backport?


mattip commented Jul 27, 2023

The original PR was #24199, which I think was backported to maintenance/1.25.x in #24243.


Czaki commented Aug 7, 2023

@mattip Is there a chance for a bugfix release?

Using the nightly no longer works; it fails with:

  File "/Users/grzegorzbokota/.pyenv/versions/partseg-3.11/lib/python3.11/site-packages/scipy/linalg/_decomp.py", line 22, in <module>
    from numpy import (array, isfinite, inexact, nonzero, iscomplexobj, cast,
ImportError: cannot import name 'cast' from 'numpy' (/Users/grzegorzbokota/.pyenv/versions/partseg-3.11/lib/python3.11/site-packages/numpy/__init__.py)


mattip commented Aug 7, 2023

That was commit 729d1f6, and git tag --contains 729d1f6d4 says it should be in 1.25.2, which was released July 31.


Czaki commented Aug 7, 2023

I have installed 1.25.2 and still got:

  * frame #0: 0x000000019fbe4e6c libsystem_pthread.dylib`___chkstk_darwin + 60
    frame #1: 0x000000013017262c libopenblas64_.0.dylib`dgetrf_parallel + 52
    frame #2: 0x000000013001a610 libopenblas64_.0.dylib`dgesv_64_ + 372
    frame #3: 0x0000000106291054 _umath_linalg.cpython-311-darwin.so`void inv<double>(char**, long const*, long const*, void*) + 652
    frame #4: 0x000000010653f2a8 _multiarray_umath.cpython-311-darwin.so`generic_wrapped_legacy_loop + 40

So the problem still exists.

When I install from source with pip install --no-binary numpy numpy==1.25.2 --force, it starts working.


mattip commented Aug 8, 2023

@martin-frbg, thoughts? @Czaki, are you loading NumPy in a worker thread and using the address sanitizer?

@martin-frbg

Any chance to see what the stack size is here, and perhaps to increase it (ulimit -s) to see if that is sufficient to get the program to run? I don't think there were any very recent changes in that area, and I'm more used to seeing this problem when Java code is involved (imposing its default 1 MB stack on everything that loads after it).
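
For the embedding reproducer at the top of this issue, one caveat (a sketch only, and macOS-oriented): the stack of a thread created with pthread_create and default attributes is fixed at creation time, so ulimit -s may not change what thread_function sees. Requesting a larger stack explicitly would look roughly like this, assuming a 16 MiB stack is enough:

  /* Hypothetical change to the reproducer's main(): request a 16 MiB stack
   * for the worker thread instead of relying on the small pthread default.
   * Whether this is sufficient for dgetrf_parallel has not been verified. */
  pthread_attr_t attr;
  pthread_attr_init(&attr);
  if (pthread_attr_setstacksize(&attr, (size_t)16 * 1024 * 1024) != 0) {
    printf(" Threading ERROR: pthread_attr_setstacksize failed\n");
    return 1;
  }
  stat = pthread_create(&thread, &attr, thread_function, mainThreadState);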


Czaki commented Aug 8, 2023

@mattip I load numpy in the main thread, then use a thread to perform some calculations. It crashes when calculating a matrix inversion. I do not know whether I am using the address sanitizer. How can I check?

@martin-frbg it reports EXC_BAD_ACCESS, which as I understand means access to non-allocated memory.
Increasing the stack from 8176 to 32000 does not change anything.

I do not know how long this problem has existed. I just got my first Apple ARM computer. My code works without problems on x86 (Linux, Windows, macOS) and on x86 through Rosetta.

My application is here: https://github.com/4DNucleome/PartSeg. It crashes on a simple start of PartSeg roi. There is no Java code.

@martin-frbg

Are we sure that this is the actual problem this ticket was originally opened for, and not something else completely unrelated to stack size?
Any chance there could be NaN values in the input to GETRF? And I assume one would need an input file with PartSeg roi, or is it actually failing on startup before it even gets the chance to load any data?


Czaki commented Aug 8, 2023

@martin-frbg You are right. I checked my setup, and I hit this error when running the code on a Python built in debug mode.

My current error is different and does not fit this issue. I will open it separately.
