Conversation


@pengwa pengwa commented Jul 8, 2021

Description: Fix segmentation fault for custom function.

In recent master, with onnxruntime_ENABLE_TRAINING_TORCH_INTEROP=ON, once the onnxruntime_training package is built and installed, running a program containing 'import onnxruntime' prints a 'segmentation fault' error to stdout. This bug seems to have been there for a while; I am not sure why we did not see it earlier.

The call stack looks like this:

>>>ninja --version
1.10.0.git.kitware.jobserver-1

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00005555556e26ff in tupledealloc (op=0x7ffd5a4113f0) at /tmp/build/80754af9/python_1553721932202/work/Objects/tupleobject.c:242
242     in /tmp/build/80754af9/python_1553721932202/work/Objects/tupleobject.c
(gdb) bt
#0  0x00005555556e26ff in tupledealloc (op=0x7ffd5a4113f0) at /tmp/build/80754af9/python_1553721932202/work/Objects/tupleobject.c:242
#1  code_dealloc () at /tmp/build/80754af9/python_1553721932202/work/Objects/codeobject.c:446
#2  0x00005555556e1448 in func_dealloc () at /tmp/build/80754af9/python_1553721932202/work/Objects/funcobject.c:532
#3  0x00007ffd5c0e86d1 in std::function<void (_object*)>::operator()(_object*) const (__args#0=<optimized out>, this=<optimized out>) at /usr/include/c++/7/bits/std_function.h:706
#4  std::unique_ptr<_object, std::function<void (_object*)> >::~unique_ptr() (this=0x7ffd5d029188 <onnxruntime::language_interop_ops::torch::OrtTorchFunctionPool::GetInstance()::instance_+40>,
    __in_chrg=<optimized out>) at /usr/include/c++/7/bits/unique_ptr.h:263
#5  onnxruntime::language_interop_ops::torch::OrtTorchFunctionPool::~OrtTorchFunctionPool (this=0x7ffd5d029160 <onnxruntime::language_interop_ops::torch::OrtTorchFunctionPool::GetInstance()::instance_>,
    __in_chrg=<optimized out>) at /home/pengwa/dev/onnxruntime/onnxruntime/core/language_interop_ops/torch/custom_function_register.h:16
#6  0x00007ffff7706161 in __run_exit_handlers (status=0, listp=0x7ffff7aae718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
#7  0x00007ffff770625a in __GI_exit (status=<optimized out>) at exit.c:139
#8  0x0000555555776a19 in Py_Exit (sts=0) at /tmp/build/80754af9/python_1553721932202/work/Python/pylifecycle.c:2282
#9  0x0000555555776ac7 in handle_system_exit () at /tmp/build/80754af9/python_1553721932202/work/Python/pythonrun.c:636
#10 0x0000555555776b62 in PyErr_PrintEx () at /tmp/build/80754af9/python_1553721932202/work/Python/pythonrun.c:646
#11 0x0000555555649869 in PyRun_SimpleFileExFlags (fp=<optimized out>, filename=<optimized out>, closeit=<optimized out>, flags=0x7fffffff4ce0)
    at /tmp/build/80754af9/python_1553721932202/work/Python/pythonrun.c:435
#12 0x000055555578ee5f in pymain_run_file (p_cf=0x7fffffff4ce0, filename=0x5555558c5920 L"/opt/conda/bin/ninja", fp=0x55555da35300) at /tmp/build/80754af9/python_1553721932202/work/Modules/main.c:427
#13 pymain_run_filename (cf=0x7fffffff4ce0, pymain=0x7fffffff4df0) at /tmp/build/80754af9/python_1553721932202/work/Modules/main.c:1627
#14 pymain_run_python (pymain=0x7fffffff4df0) at /tmp/build/80754af9/python_1553721932202/work/Modules/main.c:2877
#15 pymain_main () at /tmp/build/80754af9/python_1553721932202/work/Modules/main.c:3038
#16 0x000055555578ef7c in _Py_UnixMain () at /tmp/build/80754af9/python_1553721932202/work/Modules/main.c:3073
#17 0x00007ffff76e4bf7 in __libc_start_main (main=0x55555564aed0 <main>, argc=3, argv=0x7fffffff4f48, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffff4f38)
    at ../csu/libc-start.c:310
#18 0x0000555555734122 in _start () at ../sysdeps/x86_64/elf/start.S:103

The reason is that the static singleton instance of OrtTorchFunctionPool is destroyed only after the Python modules/functions it references have already been released.
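
To illustrate the failure mode, here is a minimal, hypothetical sketch (not the actual onnxruntime code): a function-local static singleton that owns Python references is destroyed by the C runtime's exit handlers, which run after the interpreter has already been finalized, so releasing those references touches dead objects.

#include <Python.h>

struct PoolSketch {                      // stands in for OrtTorchFunctionPool
  PyObject* registered_fn = nullptr;     // owned reference registered at runtime
  ~PoolSketch() { Py_XDECREF(registered_fn); }  // runs from __run_exit_handlers

  static PoolSketch& GetInstance() {
    static PoolSketch instance;          // destroyed only at process exit
    return instance;
  }
};

int main() {
  Py_Initialize();
  // Pretend a Python callable was registered with the pool.
  PoolSketch::GetInstance().registered_fn = PyUnicode_FromString("fake callable");
  Py_FinalizeEx();  // the interpreter and its objects are torn down here
  return 0;         // ~PoolSketch runs after this, releasing a dead object
}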

There are two ways to fix it.

  1. Unregister the Python functions explicitly, either in some C++ class destructor (for example ~TrainingAgent or ~InferenceSession) or in a Python module cleanup function (https://docs.python.org/3/library/atexit.html). This makes sure all registered functions are still valid when we release our references to them.
  2. Make OrtTorchFunctionPool not own references to the Python functions and use borrowed references only, so it does not need to release them in its destructor. But there is a problem handling the forward (.apply) and backward (.backward) functions: when we register them, we only have the autograd Function's pointer, so we need to get the pointers to .apply and .backward and increase their reference counts ourselves, otherwise their reference count is 0. That means we still have to release them in ~OrtTorchFunctionPool, and we still have the segmentation fault problem.

This PR is a fix following the 1st approach. Please ignore the first and second commits (approach 2 and the change reverting it).
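
As a rough illustration of what the 1st approach looks like (a sketch only, with hypothetical names such as PoolSketch and UnRegisterFunctions; see the diff for the actual change): the pool keeps its owned references, but exposes an explicit unregistration call that releases them while the interpreter is still alive, so the singleton's destructor no longer has to touch Python state from the exit handlers.

#include <Python.h>
#include <string>
#include <unordered_map>

class PoolSketch {
 public:
  static PoolSketch& GetInstance() {
    static PoolSketch instance;
    return instance;
  }

  void RegisterForward(const std::string& key, PyObject* fn) {
    Py_XINCREF(fn);                // the pool owns a strong reference
    forward_pool_[key] = fn;
  }

  // Called explicitly, e.g. from ~TrainingAgent / ~InferenceSession or a
  // Python-side atexit hook, while the interpreter is still initialized.
  void UnRegisterFunctions() {
    for (auto& kv : forward_pool_) Py_XDECREF(kv.second);
    forward_pool_.clear();
  }

  ~PoolSketch() = default;         // intentionally does not touch any PyObject

 private:
  std::unordered_map<std::string, PyObject*> forward_pool_;
};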

Motivation and Context

  • Why is this change required? What problem does it solve?
  • If it fixes an open issue, please link to the issue here.

@pengwa pengwa requested a review from a team as a code owner July 8, 2021 11:35

pengwa commented Jul 13, 2021

Thank you @baijumeswani and @tlh20 !!!

@pengwa pengwa merged commit 7db4fc8 into master Jul 13, 2021
@pengwa pengwa deleted the pengwa/fix_segmentfault branch July 13, 2021 10:01