Conversation


@pengwa pengwa commented Jul 8, 2021

Description: Fix segmentation fault for custom function.

In recent master, with onnxruntime_ENABLE_TRAINING_TORCH_INTEROP=ON, once the onnxruntime_training package is built and installed, running a program containing 'import onnxruntime' prints a 'segmentation fault' error to stdout. This bug seems to have been there for a while; I am not sure why we did not see it earlier.

The call stack looks like this:

>>>ninja --version
1.10.0.git.kitware.jobserver-1

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00005555556e26ff in tupledealloc (op=0x7ffd5a4113f0) at /tmp/build/80754af9/python_1553721932202/work/Objects/tupleobject.c:242
242     in /tmp/build/80754af9/python_1553721932202/work/Objects/tupleobject.c
(gdb) bt
#0  0x00005555556e26ff in tupledealloc (op=0x7ffd5a4113f0) at /tmp/build/80754af9/python_1553721932202/work/Objects/tupleobject.c:242
#1  code_dealloc () at /tmp/build/80754af9/python_1553721932202/work/Objects/codeobject.c:446
#2  0x00005555556e1448 in func_dealloc () at /tmp/build/80754af9/python_1553721932202/work/Objects/funcobject.c:532
#3  0x00007ffd5c0e86d1 in std::function<void (_object*)>::operator()(_object*) const (__args#0=<optimized out>, this=<optimized out>) at /usr/include/c++/7/bits/std_function.h:706
#4  std::unique_ptr<_object, std::function<void (_object*)> >::~unique_ptr() (this=0x7ffd5d029188 <onnxruntime::language_interop_ops::torch::OrtTorchFunctionPool::GetInstance()::instance_+40>,
    __in_chrg=<optimized out>) at /usr/include/c++/7/bits/unique_ptr.h:263
#5  onnxruntime::language_interop_ops::torch::OrtTorchFunctionPool::~OrtTorchFunctionPool (this=0x7ffd5d029160 <onnxruntime::language_interop_ops::torch::OrtTorchFunctionPool::GetInstance()::instance_>,
    __in_chrg=<optimized out>) at /home/pengwa/dev/onnxruntime/onnxruntime/core/language_interop_ops/torch/custom_function_register.h:16
#6  0x00007ffff7706161 in __run_exit_handlers (status=0, listp=0x7ffff7aae718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
#7  0x00007ffff770625a in __GI_exit (status=<optimized out>) at exit.c:139
#8  0x0000555555776a19 in Py_Exit (sts=0) at /tmp/build/80754af9/python_1553721932202/work/Python/pylifecycle.c:2282
#9  0x0000555555776ac7 in handle_system_exit () at /tmp/build/80754af9/python_1553721932202/work/Python/pythonrun.c:636
#10 0x0000555555776b62 in PyErr_PrintEx () at /tmp/build/80754af9/python_1553721932202/work/Python/pythonrun.c:646
#11 0x0000555555649869 in PyRun_SimpleFileExFlags (fp=<optimized out>, filename=<optimized out>, closeit=<optimized out>, flags=0x7fffffff4ce0)
    at /tmp/build/80754af9/python_1553721932202/work/Python/pythonrun.c:435
#12 0x000055555578ee5f in pymain_run_file (p_cf=0x7fffffff4ce0, filename=0x5555558c5920 L"/opt/conda/bin/ninja", fp=0x55555da35300) at /tmp/build/80754af9/python_1553721932202/work/Modules/main.c:427
#13 pymain_run_filename (cf=0x7fffffff4ce0, pymain=0x7fffffff4df0) at /tmp/build/80754af9/python_1553721932202/work/Modules/main.c:1627
#14 pymain_run_python (pymain=0x7fffffff4df0) at /tmp/build/80754af9/python_1553721932202/work/Modules/main.c:2877
#15 pymain_main () at /tmp/build/80754af9/python_1553721932202/work/Modules/main.c:3038
#16 0x000055555578ef7c in _Py_UnixMain () at /tmp/build/80754af9/python_1553721932202/work/Modules/main.c:3073
#17 0x00007ffff76e4bf7 in __libc_start_main (main=0x55555564aed0 <main>, argc=3, argv=0x7fffffff4f48, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffff4f38)
    at ../csu/libc-start.c:310
#18 0x0000555555734122 in _start () at ../sysdeps/x86_64/elf/start.S:103

The reason is that the static singleton instance of OrtTorchFunctionPool is destroyed only after the Python modules/functions it references have already been released.
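
To illustrate the failure mode, here is a minimal, hypothetical sketch (not the actual onnxruntime code): a function-local static singleton that owns Python references is destroyed by the C runtime's exit handlers, which run after the interpreter has already been finalized, so releasing those references touches dead objects.

#include <Python.h>

struct PoolSketch {                      // stands in for OrtTorchFunctionPool
  PyObject* registered_fn = nullptr;     // owned reference registered at runtime
  ~PoolSketch() { Py_XDECREF(registered_fn); }  // runs from __run_exit_handlers

  static PoolSketch& GetInstance() {
    static PoolSketch instance;          // destroyed only at process exit
    return instance;
  }
};

int main() {
  Py_Initialize();
  // Pretend a Python callable was registered with the pool.
  PoolSketch::GetInstance().registered_fn = PyUnicode_FromString("fake callable");
  Py_FinalizeEx();  // the interpreter and its objects are torn down here
  return 0;         // ~PoolSketch runs after this, releasing a dead object
}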

There are two ways to fix it.

  1. Unregister the Python functions explicitly, either in some C++ class destructor (for example ~TrainingAgent or ~InferenceSession) or in a Python module cleanup function (https://docs.python.org/3/library/atexit.html). This makes sure all registered functions are still valid when we release our references to them.
  2. Make OrtTorchFunctionPool not own references to the Python functions and use borrowed references only, so it does not need to release them in its destructor. But there is a problem handling the forward (.apply) and backward (.backward) functions: when we register them, we only have the autograd Function's pointer, so we need to get the pointers to .apply and .backward and increase their reference counts ourselves, otherwise their reference count is 0. That means we still have to release them in ~OrtTorchFunctionPool, and we still have the segmentation fault problem.

This PR is a fix following the 1st approach. Please ignore the first and second commits (approach 2 and the change reverting it).
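
As a rough illustration of what the 1st approach looks like (a sketch only, with hypothetical names such as PoolSketch and UnRegisterFunctions; see the diff for the actual change): the pool keeps its owned references, but exposes an explicit unregistration call that releases them while the interpreter is still alive, so the singleton's destructor no longer has to touch Python state from the exit handlers.

#include <Python.h>
#include <string>
#include <unordered_map>

class PoolSketch {
 public:
  static PoolSketch& GetInstance() {
    static PoolSketch instance;
    return instance;
  }

  void RegisterForward(const std::string& key, PyObject* fn) {
    Py_XINCREF(fn);                // the pool owns a strong reference
    forward_pool_[key] = fn;
  }

  // Called explicitly, e.g. from ~TrainingAgent / ~InferenceSession or a
  // Python-side atexit hook, while the interpreter is still initialized.
  void UnRegisterFunctions() {
    for (auto& kv : forward_pool_) Py_XDECREF(kv.second);
    forward_pool_.clear();
  }

  ~PoolSketch() = default;         // intentionally does not touch any PyObject

 private:
  std::unordered_map<std::string, PyObject*> forward_pool_;
};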

Motivation and Context

  • Why is this change required? What problem does it solve?
  • If it fixes an open issue, please link to the issue here.

@pengwa pengwa requested a review from a team as a code owner July 8, 2021 11:35

pengwa commented Jul 13, 2021

Thank you @baijumeswani and @tlh20 !!!

@pengwa pengwa merged commit 7db4fc8 into master Jul 13, 2021
@pengwa pengwa deleted the pengwa/fix_segmentfault branch July 13, 2021 10:01