segfault in python 1.31.0 #23796

Open

huguesalary opened this issue Aug 12, 2020 · 24 comments

huguesalary commented Aug 12, 2020

What version of gRPC and what language are you using?

Python 1.31.0.

What operating system (Linux, Windows,...) and version?

Linux 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64 x86_64 x86_64 GNU/Linux

What runtime / compiler are you using (e.g. python version or version of gcc)

Python 3.5.9

What did you do?

Upgraded to 1.31.0

What did you expect to see?

No segfault

What did you see instead?

We started seeing segfaults in both our celery workers and our uWSGI workers. For example, from the uWSGI logs:

!!! uWSGI process 20 got Segmentation Fault !!!
*** backtrace of 20 ***
uwsgi(uwsgi_backtrace+0x2a) [0x56079b42b2ea]
uwsgi(uwsgi_segfault+0x23) [0x56079b42b6d3]
/lib/x86_64-linux-gnu/libc.so.6(+0x3efd0) [0x7fa99c451fd0]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x270245) [0x7fa987cda245]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x267473) [0x7fa987cd1473]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x268212) [0x7fa987cd2212]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x16e6a4) [0x7fa987bd86a4]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x26cc84) [0x7fa987cd6c84]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x281fcb) [0x7fa987cebfcb]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x25ccc8) [0x7fa987cc6cc8]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fa99e5b36db]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fa99c534a3f]
*** end of backtrace ***
Tue Aug 11 09:44:16 2020 - uWSGI worker 3 screams: UAAAAAAH my master disconnected: i will kill myself !!!
Tue Aug 11 09:44:16 2020 - uWSGI worker 4 screams: UAAAAAAH my master disconnected: i will kill myself !!!
Segmentation fault (core dumped)

We tested 1.30.0 and did not observe the segfaults anymore.

When this happens with our celery workers, what seems to trigger the segfault is the worker being restarted after it has executed the maximum number of tasks specified by --maxtasksperchild. For example: celery -A the_app worker -Q a_queue --concurrency=1 --maxtasksperchild=100 -l info.

I can't currently provide cleaned-up code for you to reproduce with, but I believe any code making gRPC calls should trigger this after enough time.
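
For illustration, a minimal sketch of that pattern (hypothetical code with a placeholder target, not the reporter's application): gRPC state is created in the parent process, which is then forked the way celery's prefork pool forks its workers, and the children keep using gRPC.

import multiprocessing
import grpc

def worker():
    # Each forked child talks gRPC on top of state inherited from the parent.
    channel = grpc.insecure_channel("localhost:50051")  # placeholder target
    try:
        grpc.channel_ready_future(channel).result(timeout=2)
    except grpc.FutureTimeoutError:
        pass  # no server in this sketch; the fork pattern is the point
    channel.close()

if __name__ == "__main__":
    # The parent creates a channel before forking, e.g. for a startup check.
    parent_channel = grpc.insecure_channel("localhost:50051")
    workers = [multiprocessing.Process(target=worker) for _ in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()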

@huguesalary huguesalary changed the title segfault in 1.31.0 segfault in python 1.31.0 Aug 12, 2020

huguesalary commented Aug 12, 2020

Hopefully this will help; this is the backtrace generated by GDB when analyzing the core dump:

#0  0x00007f8dc329ad81 in grpc_core::ExecCtx::Run(grpc_core::DebugLocation const&, grpc_closure*, grpc_error*) ()
   from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#1  0x00007f8dc329e275 in grpc_core::LockfreeEvent::SetReady() () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#2  0x00007f8dc3295458 in pollable_process_events(grpc_pollset*, pollable*, bool) () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#3  0x00007f8dc3296212 in pollset_work () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#4  0x00007f8dc319c6a4 in run_poller () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#5  0x00007f8dc329ac84 in grpc_core::ExecCtx::Flush() () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#6  0x00007f8dc32affcb in timer_thread(void*) () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#7  0x00007f8dc328acc8 in grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::{lambda(void*)#1}::_FUN(void*) () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#8  0x00007f8dd65706db in start_thread (arg=0x7f8dc25a4700) at pthread_create.c:463
#9  0x00007f8dd68a9a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@apolcyn apolcyn assigned gnossen and unassigned nicolasnoble Aug 12, 2020

gnossen commented Aug 12, 2020

CC @veblush

huguesalary commented:

After re-running tests, this seems to also affect version 1.30.0, which makes me think something else is wrong.

A coworker of mine also noticed this error:

[2020-08-11 18:05:12,629: ERROR/MainProcess] Process 'Worker-4' pid:74 exited with 'signal 11 (SIGSEGV)'
I0811 18:05:13.150826638       6 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies

I've seen a few conversations discussing similar issues that seemed related to fork.

Also, here are some more stack traces:
(gdb) thread apply all bt

Thread 6 (Thread 0x7fffbb7cd700 (LWP 1536)):
#0  0x0000555555b3f2c0 in PyUnicode_Type ()
#1  0x00007fffe437f6c9 in grpc_core::ExecCtx::Flush() () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#2  0x00007fffe437b22d in pollset_work(grpc_pollset*, grpc_pollset_worker**, long) () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#3  0x00007fffe429384c in run_poller () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#4  0x00007fffe437f6c9 in grpc_core::ExecCtx::Flush() () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#5  0x00007fffe4393a20 in timer_thread(void*) () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#6  0x00007fffe437301b in grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::{lambda(void*)#1}::_FUN(void*) () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#7  0x00007ffff77cc6db in start_thread (arg=0x7fffbb7cd700) at pthread_create.c:463
#8  0x00007ffff7b05a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7fffe2eca700 (LWP 1532)):
#0  0x00007ffff77d2ed9 in futex_reltimed_wait_cancelable (private=<optimized out>, reltime=0x7fffe2ec9ce0, expected=0, futex_word=0x7fffe48d324c <g_cv_wait+44>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:142
#1  __pthread_cond_wait_common (abstime=0x7fffe2ec9d90, mutex=0x7fffe48d3260 <_ZL4g_mu>, cond=0x7fffe48d3220 <g_cv_wait>) at pthread_cond_wait.c:533
#2  __pthread_cond_timedwait (cond=0x7fffe48d3220 <g_cv_wait>, mutex=0x7fffe48d3260 <_ZL4g_mu>, abstime=0x7fffe2ec9d90) at pthread_cond_wait.c:667
#3  0x00007fffe4370c7a in gpr_cv_wait () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#4  0x00007fffe4393b37 in timer_thread(void*) () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#5  0x00007fffe437301b in grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::{lambda(void*)#1}::_FUN(void*) () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#6  0x00007ffff77cc6db in start_thread (arg=0x7fffe2eca700) at pthread_create.c:463
#7  0x00007ffff7b05a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7fffe36cb700 (LWP 1531)):
#0  0x00007ffff77d29f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555557b94630) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x555557b945d0, cond=0x555557b94608) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x555557b94608, mutex=0x555557b945d0) at pthread_cond_wait.c:655
#3  0x00007fffe4370c12 in gpr_cv_wait () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#4  0x00007fffe437ffcf in grpc_core::Executor::ThreadMain(void*) () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#5  0x00007fffe437301b in grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::{lambda(void*)#1}::_FUN(void*) () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#6  0x00007ffff77cc6db in start_thread (arg=0x7fffe36cb700) at pthread_create.c:463
#7  0x00007ffff7b05a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7fffe3ecc700 (LWP 1530)):
#0  0x00007ffff77d29f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555557b6fb30) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x555557b6fad0, cond=0x555557b6fb08) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x555557b6fb08, mutex=0x555557b6fad0) at pthread_cond_wait.c:655
#3  0x00007fffe4370c12 in gpr_cv_wait () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#4  0x00007fffe437ffcf in grpc_core::Executor::ThreadMain(void*) () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#5  0x00007fffe437301b in grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::{lambda(void*)#1}::_FUN(void*) () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#6  0x00007ffff77cc6db in start_thread (arg=0x7fffe3ecc700) at pthread_create.c:463
#7  0x00007ffff7b05a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7ffff7fe9740 (LWP 1521)):
#0  0x00007ffff7b05d67 in epoll_wait (epfd=9, events=0x555562ece670, maxevents=1023, timeout=1809) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x0000555555633f0a in ?? ()
#2  0x00005555556c5033 in PyCFunction_Call ()
#3  0x000055555566f0d0 in PyEval_EvalFrameEx ()
#4  0x000055555566f2f3 in PyEval_EvalFrameEx ()
#5  0x000055555566f2f3 in PyEval_EvalFrameEx ()
#6  0x00005555556ca3c2 in ?? ()
#7  0x000055555567157b in ?? ()
#8  0x00005555556c50b1 in PyCFunction_Call ()
#9  0x000055555566f0d0 in PyEval_EvalFrameEx ()
#10 0x000055555572e78f in ?? ()
#11 0x000055555572e853 in PyEval_EvalCodeEx ()
#12 0x00005555556c8c85 in ?? ()
#13 0x00005555555bbe9a in PyObject_Call ()
#14 0x0000555555669938 in PyEval_EvalFrameEx ()
#15 0x000055555566f2f3 in PyEval_EvalFrameEx ()
#16 0x000055555566f2f3 in PyEval_EvalFrameEx ()
#17 0x000055555566f2f3 in PyEval_EvalFrameEx ()
#18 0x000055555566f2f3 in PyEval_EvalFrameEx ()
#19 0x000055555566f2f3 in PyEval_EvalFrameEx ()
#20 0x000055555566f2f3 in PyEval_EvalFrameEx ()
#21 0x000055555572e78f in ?? ()
#22 0x000055555572e853 in PyEval_EvalCodeEx ()
#23 0x00005555556c8d75 in ?? ()
#24 0x00005555555bbe9a in PyObject_Call ()
#25 0x0000555555669938 in PyEval_EvalFrameEx ()
#26 0x000055555572e78f in ?? ()
#27 0x000055555572e853 in PyEval_EvalCodeEx ()
#28 0x00005555556c8d75 in ?? ()
#29 0x00005555555bbe9a in PyObject_Call ()
#30 0x00005555555d8c2c in ?? ()
#31 0x00005555555bbe9a in PyObject_Call ()
#32 0x000055555569ba0c in ?? ()
#33 0x00005555555bbe9a in PyObject_Call ()
#34 0x0000555555669938 in PyEval_EvalFrameEx ()
#35 0x000055555572e78f in ?? ()
#36 0x000055555566d969 in PyEval_EvalFrameEx ()
#37 0x000055555572e78f in ?? ()
#38 0x000055555566d969 in PyEval_EvalFrameEx ()
#39 0x000055555566f2f3 in PyEval_EvalFrameEx ()
#40 0x000055555572e78f in ?? ()
#41 0x000055555566d969 in PyEval_EvalFrameEx ()
#42 0x000055555572e78f in ?? ()
#43 0x000055555566d969 in PyEval_EvalFrameEx ()
#44 0x000055555572e78f in ?? ()
#45 0x000055555566d969 in PyEval_EvalFrameEx ()
#46 0x000055555566f2f3 in PyEval_EvalFrameEx ()
#47 0x000055555572e78f in ?? ()
#48 0x00005555556664cf in PyEval_EvalCode ()
#49 0x0000555555678dd0 in ?? ()
#50 0x0000555555680aa1 in PyRun_FileExFlags ()
#51 0x0000555555680c5d in PyRun_SimpleFileExFlags ()
#52 0x0000555555723424 in Py_Main ()
#53 0x00005555555b6b1d in main ()


gnossen commented Aug 12, 2020

@huguesalary The stack trace looks slightly different in your most recent comment. Which thread segfaulted in that case? Thread 6?


huguesalary commented Aug 12, 2020

I somehow managed to lose the trace and can't generate another one right now, but I believe thread 6 was the one that segfaulted in the trace I posted in #23796 (comment).

Now, I also had instances where the segfault that occurred looked exactly like thread 1.

huguesalary commented:

After deploying 1.30.0 on our production systems, the number of segfaults went down to 0.

So, although I was able to produce a segfault with 1.30.0 on my own machines, it does seem 1.30.0 is more stable than 1.31.0.


veblush commented Aug 18, 2020

Two things that might be relevant to this come to my mind:

  • New manylinux2014 artifacts were added starting with 1.31. You can try the manylinux2010 artifacts instead by installing with pip's --platform manylinux2010_x86_64 option (see the command sketch after this list).
  • TCP_USER_TIMEOUT auto-detection was added (Added TCP_USER_TIMEOUT auto-detection #23401). This shouldn't be invasive, but it has never been exercised from Python before, so chances are something isn't handled properly.
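
For reference, a command sketch for the first suggestion (the exact flags may differ slightly between pip versions): fetching the manylinux2010 wheel explicitly so it can be installed instead of the manylinux2014 one.

# Download the manylinux2010 wheel for grpcio 1.31.0 into the current directory.
pip download grpcio==1.31.0 --only-binary=:all: --platform manylinux2010_x86_64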

Another interesting thing is the fork error message. It shouldn't be shown, I think. Richard, isn't gRPC Python supposed to not use epollex at all?

sergei-iurchenko commented:

I have the same problem.
After upgrading the Python dependencies in our project, our Kubernetes pods running celery workers started restarting.
There were no logs, no tracebacks, no exceptions. We enabled the Python faulthandler (https://blog.richard.do/2018/03/18/how-to-debug-segmentation-fault-in-python/) and got different tracebacks, not connected with grpc, in different parts of the code. By reverting the recent changes in the working branch step by step, we found that the grpc package was the source of the segfaults. Downgrading to 1.30.0 solved the issue.
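
For anyone who wants to do the same, enabling the fault handler takes only a couple of lines (standard library; setting PYTHONFAULTHANDLER=1 in the worker's environment is equivalent):

# Dump the Python stack of every thread to stderr when the process receives
# a fatal signal such as SIGSEGV, instead of dying silently.
import faulthandler
faulthandler.enable()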


sergei-iurchenko commented Aug 19, 2020

Some examples
Fatal Python error: Segmentation fault

Thread 0x00007fb7bfb21740 (most recent call first):
  File "/usr/local/lib/python3.8/site-packages/kombu/utils/eventio.py", line 84 in poll
  File "/usr/local/lib/python3.8/site-packages/kombu/asynchronous/hub.py", line 308 in create_loop
  File "/usr/local/lib/python3.8/site-packages/celery/worker/loops.py", line 83 in asynloop
  File "/usr/local/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 599 in start
  File "/usr/local/lib/python3.8/site-packages/celery/bootsteps.py", line 119 in start
  File "/usr/local/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 318 in start
  File "/usr/local/lib/python3.8/site-packages/celery/bootsteps.py", line 369 in start
  File "/usr/local/lib/python3.8/site-packages/celery/bootsteps.py", line 119 in start
  File "/usr/local/lib/python3.8/site-packages/celery/worker/worker.py", line 208 in start
  File "/usr/local/lib/python3.8/site-packages/celery/bin/worker.py", line 259 in run
  File "/usr/local/lib/python3.8/site-packages/celery/bin/base.py", line 253 in __call__
  File "/usr/local/lib/python3.8/site-packages/celery/bin/worker.py", line 223 in run_from_argv
  File "/usr/local/lib/python3.8/site-packages/celery/bin/celery.py", line 415 in execute
  File "/usr/local/lib/python3.8/site-packages/celery/bin/celery.py", line 487 in handle_argv

Another one
Fatal Python error: Segmentation fault

Thread 0x00007f4dbbc50740 (most recent call first):
  File "/usr/local/lib/python3.8/site-packages/billiard/connection.py", line 422 in _recv
  File "/usr/local/lib/python3.8/site-packages/billiard/connection.py", line 456 in _recv_bytes
  File "/usr/local/lib/python3.8/site-packages/billiard/connection.py", line 243 in recv_bytes
  File "/usr/local/lib/python3.8/site-packages/billiard/queues.py", line 355 in get_payload
  File "/usr/local/lib/python3.8/site-packages/billiard/pool.py", line 445 in _recv
  File "/usr/local/lib/python3.8/site-packages/billiard/pool.py", line 473 in receive
  File "/usr/local/lib/python3.8/site-packages/billiard/pool.py", line 351 in workloop
  File "/usr/local/lib/python3.8/site-packages/sentry_sdk/integrations/celery.py", line 253 in sentry_workloop
  File "/usr/local/lib/python3.8/site-packages/billiard/pool.py", line 292 in __call__
  File "/usr/local/lib/python3.8/site-packages/billiard/process.py", line 114 in run
  File "/usr/local/lib/python3.8/site-packages/billiard/process.py", line 327 in _bootstrap
  File "/usr/local/lib/python3.8/site-packages/billiard/popen_fork.py", line 79 in _launch

I can provide additional info for debugging purposes, but I don't have a case that reproduces it 100% of the time. It is an intermittent error.


zdenulo commented Aug 28, 2020

We're using a Flask application with the Google Pub/Sub emulator and the Google Datastore emulator (both communicating via gRPC through client libraries) to run tests. With grpcio==1.31.0 we get segmentation fault errors, but fortunately with 1.30.0 it works fine. Not sure what the issue could be. We're on Python 3.7.5.


cgurnik commented Sep 17, 2020

@gnossen Is there anything we can provide to help make progress on this issue?


cgurnik commented Nov 3, 2020

Hi @gnossen, just checking in again. Is there any additional information that would be useful?


dmjef commented Nov 11, 2020

This is happening to me too with grpcio-1.33.2-cp36-cp36m-manylinux2014_x86_64 on Ubuntu 18.04.4 LTS (GNU/Linux 5.3.0-1030-aws x86_64). I get constant segfaults. Downgrading to 1.30.0 solves the problem.


exzhawk commented Nov 16, 2020

Same here. Python multiprocessing.Process workers died randomly, leaving a segfault message in dmesg like [6558262.243117] grpc_global_tim[42632]: segfault at 7f172819d038 ip 00007f1789caff59 sp 00007f1739bb6c40 error 4 in cygrpc.cpython-36m-x86_64-linux-gnu.so[7f1789a60000+5e8000]. Downgrading to 1.30.0 solves it.

Installing grpcio-1.33.2-cp36-cp36m-manylinux2010_x86_64.whl does not help; I still get the same error.

wireman27 commented:

We faced the exact same issue: multiprocessing with grpc leading to segmentation faults. We were using 1.34.0 in our case, but downgrading to 1.30.0 resolved the issue. We haven't tried 1.32.0 or 1.33.0 yet.


gnossen commented Dec 8, 2020

Sorry for the delay. This thread got buried in my inbox. @wireman27 @exzhawk @dmjef Can one of you please point me to your code and/or describe how to reproduce this issue?


cgurnik commented Dec 8, 2020

It looks like the default polling strategy changed from epoll1 in v1.30 to epollex in v1.31. Unfortunately, epollex doesn't have fork support, so it doesn't work with Celery ("Fork support is only compatible with the epoll1 and poll polling strategies"). We were able to work around this issue by setting GRPC_POLL_STRATEGY=epoll1 on our Celery workers.
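
For anyone applying the same workaround, a minimal sketch (assuming nothing imports grpc before this code runs; exporting the variable in the worker's environment, e.g. in the Dockerfile or the celery wrapper script, achieves the same thing):

# The poll strategy is read when the gRPC core initializes, so the variable
# must be set before grpc is imported anywhere in the process.
import os
os.environ.setdefault("GRPC_POLL_STRATEGY", "epoll1")

import grpc  # imported only after the poll strategy is pinned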

wireman27 commented:

@gnossen Reproducing might be difficult because we use google-cloud-translate to translate some text, and it's exactly the snippet making the translate request to Google that seems to cause the segmentation fault. (It works fine without the translation snippet.)

Important to note here is that we initialise 10 sub-processes with multiprocess.Process(), and each of these sub-processes makes translation requests. The output of these 10 sub-processes is then funnelled through a multiprocess.Queue() that is picked up by another multiprocess.Process(). Hope this helps.
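
For illustration, a rough sketch of that layout (hypothetical: the target address and the TranslateStub/Translate names are placeholders for the real google-cloud-translate client, and the standard-library multiprocessing module stands in here): ten child processes each create their own gRPC channel after the fork and push results into a queue drained by a separate process.

import multiprocessing as mp
import grpc

def translate_worker(out_queue):
    # Channel created inside the child, after the fork.
    channel = grpc.insecure_channel("translate.internal.example:443")  # placeholder
    # stub = TranslateStub(channel)            # placeholder generated stub
    # out_queue.put(stub.Translate(request))   # placeholder RPC
    channel.close()

def collector(in_queue, expected):
    for _ in range(expected):
        in_queue.get()  # drain results produced by the workers

if __name__ == "__main__":
    queue = mp.Queue()
    workers = [mp.Process(target=translate_worker, args=(queue,)) for _ in range(10)]
    sink = mp.Process(target=collector, args=(queue, 0))  # 0 because the RPCs above are stubbed out
    for p in workers + [sink]:
        p.start()
    for p in workers + [sink]:
        p.join()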


cpaulik commented Dec 15, 2020

We have a similar issue.

We are running processes with multiprocessing, using Google Cloud Pub/Sub. The code we are using is very similar to the synchronous_pull_with_lease_management function from the Pub/Sub client examples.

With all versions above 1.30 we get messages like these:

E1215 16:57:23.081484255    5127 ssl_transport_security.cc:514] Corruption detected.
E1215 16:57:23.081556271    5127 ssl_transport_security.cc:490] error:1e000065:Cipher functions:OPENSSL_internal:BAD_DECRYPT
E1215 16:57:23.081565033    5127 ssl_transport_security.cc:490] error:1000008b:SSL routines:OPENSSL_internal:DECRYPTION_FAILED_OR_BAD_RECORD_MAC
E1215 16:57:23.081570862    5127 secure_endpoint.cc:208]     Decryption error: TSI_DATA_CORRUPTED

and then segmentation faults.

But it seems the occurrence of the segmentation fault depends on the type of process we are running.
I've not yet been able to make a minimal example, and running it with a simple time.sleep in the worker function does not show the behavior.


TheBirdsNest commented Jan 4, 2021

This is starting to be a huge issue for my team; we make heavy use of grpcio and grpc-google-iam-v1.
Downgrading to grpcio==1.30.0 did not seem to help.

Does anyone else have this issue, or a workaround for using the GCP libraries?

google-api-core:
    1.24.0
google-api-python-client:
    1.12.8
google-auth:
    1.24.0
google-auth-httplib2:
    0.0.4
google-cloud-core:
    1.5.0
google-cloud-kms:
    2.2.0
google-cloud-logging:
    2.0.1
google-cloud-storage:
    1.35.0
google-crc32c:
    1.1.0
google-resumable-media:
    1.2.0
googleapis-common-protos:
    1.52.0
grpc-google-iam-v1:
    0.12.3
grpcio:
    1.30.0


jrmlhermitte commented Feb 1, 2021

We have had this issue too, and this thread was helpful for us. We have a workaround, so I thought I would provide details in the hope that it may help.

What we did, as suggested above, was to set GRPC_POLL_STRATEGY=epoll1 on our workers that use multiprocessing with fork on Linux-based images.
As a long-term strategy, we have been gradually moving away from fork in general (it was the cause of many other issues).
This happened on workers running Python 3.7.6.

We have also found that workers running on python 3.8.6 do not see this issue. However, we have seen errors like these instead:

2021-02-01 12:16:49.489 EST "Corruption detected."
2021-02-01 12:16:49.490 EST "error:1e000065:Cipher functions:OPENSSL_internal:BAD_DECRYPT"
2021-02-01 12:16:49.490 EST "error:1000008b:SSL routines:OPENSSL_internal:DECRYPTION_FAILED_OR_BAD_RECORD_MAC"
2021-02-01 12:16:49.490 EST " Decryption error: TSI_DATA_CORRUPTED"
2021-02-01 12:16:49.490 EST "SSL_write failed with error SSL_ERROR_SSL."

which is similar to what @cpaulik is seeing.

If I had to guess, I would naively suspect that both the parent and child processes are attempting to use the same socket connection (since forking gives the child a copy of the parent's file descriptors), resulting in data corruption. Not sure if this helps, or if this is the wrong direction.
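
If that guess is right, one way to sidestep the problem entirely (a sketch, assuming the workload can afford the extra child start-up cost) is the 'spawn' start method, so children begin with a fresh interpreter and never inherit the parent's gRPC sockets or background threads:

import multiprocessing as mp

def worker():
    import grpc  # imported fresh inside the spawned child
    channel = grpc.insecure_channel("localhost:50051")  # placeholder target
    channel.close()

if __name__ == "__main__":
    mp.set_start_method("spawn")
    procs = [mp.Process(target=worker) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()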

joshuarli added a commit to getsentry/sentry that referenced this issue Feb 9, 2021

oakal commented Apr 22, 2021

(quoting @jrmlhermitte's workaround above)

Thank you so much @jrmlhermitte, I've been trying to find a solution for this segfault issue for a very long time. Setting GRPC_POLL_STRATEGY=epoll1 worked like a charm.

leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this issue Jun 9, 2021
The current version of the Airflow image uses grpcio==1.31.0, which causes
segfaults: b/174948982. A dependency constraint was added to allow only
versions up to 1.30.0.

It hasn't yet been fixed upstream: grpc/grpc#23796

Change-Id: I62f3fbd75ff64dab6772a534424d06b178e67a42
GitOrigin-RevId: f0a17fad94c95fcb0794c8501ce579b391473c1e
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this issue Oct 7, 2021
kurtschelfthout added a commit to meadowdata/meadowflow that referenced this issue Jan 3, 2022
Enable all db and flow tests.
On linux, multiprocessing's default is fork, which causes gRPC to fail
because its default polling mechanism is epoll.
See grpc/grpc#23796
numbsafari commented:

Hi, 2022 checking in. It appears that gRPC doesn't work with any kind of pre-fork process model (e.g. uWSGI) unless you specify the environment variable noted above.

Perhaps a solution would be to add some documentation?


k4nar commented Nov 17, 2023

Note: the epollex poller was removed in gRPC 1.46.0, released in May 2022. So I don't think the workaround above of forcing the poller to epoll1 is still relevant today, as epoll1 is now the default.
