segfault in python 1.31.0 #23796

Open

huguesalary opened this issue Aug 12, 2020 · 24 comments

huguesalary commented Aug 12, 2020

What version of gRPC and what language are you using?

Python 1.31.0.

What operating system (Linux, Windows,...) and version?

Linux 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64 x86_64 x86_64 GNU/Linux

What runtime / compiler are you using (e.g. python version or version of gcc)

Python 3.5.9

What did you do?

Upgraded to 1.31.0

What did you expect to see?

No segfault

What did you see instead?

We started seeing segfaults in both our celery workers and our uWSGI workers. For example, from the uWSGI logs:

!!! uWSGI process 20 got Segmentation Fault !!!
*** backtrace of 20 ***
uwsgi(uwsgi_backtrace+0x2a) [0x56079b42b2ea]
uwsgi(uwsgi_segfault+0x23) [0x56079b42b6d3]
/lib/x86_64-linux-gnu/libc.so.6(+0x3efd0) [0x7fa99c451fd0]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x270245) [0x7fa987cda245]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x267473) [0x7fa987cd1473]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x268212) [0x7fa987cd2212]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x16e6a4) [0x7fa987bd86a4]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x26cc84) [0x7fa987cd6c84]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x281fcb) [0x7fa987cebfcb]
/usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so(+0x25ccc8) [0x7fa987cc6cc8]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fa99e5b36db]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fa99c534a3f]
*** end of backtrace ***
Tue Aug 11 09:44:16 2020 - uWSGI worker 3 screams: UAAAAAAH my master disconnected: i will kill myself !!!
Tue Aug 11 09:44:16 2020 - uWSGI worker 4 screams: UAAAAAAH my master disconnected: i will kill myself !!!
Segmentation fault (core dumped)

We tested 1.30.0 and did not observe the segfaults anymore.

When this happens with our celery workers, what seems to trigger the segfault is the worker being restarted after it has executed the maximum number of tasks specified by --maxtasksperchild. For example: celery -A the_app worker -Q a_queue --concurrency=1 --maxtasksperchild=100 -l info.

I can't currently provide cleaned-up code for you to reproduce with, but I believe any code making gRPC calls should trigger this after enough time.
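
For illustration, a minimal sketch of that pattern (hypothetical code with a placeholder target, not the reporter's application): gRPC state is created in the parent process, which is then forked the way celery's prefork pool forks its workers, and the children keep using gRPC.

import multiprocessing
import grpc

def worker():
    # Each forked child talks gRPC on top of state inherited from the parent.
    channel = grpc.insecure_channel("localhost:50051")  # placeholder target
    try:
        grpc.channel_ready_future(channel).result(timeout=2)
    except grpc.FutureTimeoutError:
        pass  # no server in this sketch; the fork pattern is the point
    channel.close()

if __name__ == "__main__":
    # The parent creates a channel before forking, e.g. for a startup check.
    parent_channel = grpc.insecure_channel("localhost:50051")
    workers = [multiprocessing.Process(target=worker) for _ in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()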

@huguesalary huguesalary changed the title segfault in 1.31.0 segfault in python 1.31.0 Aug 12, 2020

huguesalary commented Aug 12, 2020

Hopefully this will help; this is the backtrace generated by GDB when analyzing the core dump:

#0  0x00007f8dc329ad81 in grpc_core::ExecCtx::Run(grpc_core::DebugLocation const&, grpc_closure*, grpc_error*) ()
   from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#1  0x00007f8dc329e275 in grpc_core::LockfreeEvent::SetReady() () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#2  0x00007f8dc3295458 in pollable_process_events(grpc_pollset*, pollable*, bool) () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#3  0x00007f8dc3296212 in pollset_work () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#4  0x00007f8dc319c6a4 in run_poller () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#5  0x00007f8dc329ac84 in grpc_core::ExecCtx::Flush() () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#6  0x00007f8dc32affcb in timer_thread(void*) () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#7  0x00007f8dc328acc8 in grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::{lambda(void*)#1}::_FUN(void*) () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#8  0x00007f8dd65706db in start_thread (arg=0x7f8dc25a4700) at pthread_create.c:463
#9  0x00007f8dd68a9a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@apolcyn apolcyn assigned gnossen and unassigned nicolasnoble Aug 12, 2020

gnossen commented Aug 12, 2020

CC @veblush

huguesalary commented:

After re-running tests, this seems to also affect version 1.30.0, which makes me think something else is wrong.

A coworker of mine also noticed this error:

[2020-08-11 18:05:12,629: ERROR/MainProcess] Process 'Worker-4' pid:74 exited with 'signal 11 (SIGSEGV)'
I0811 18:05:13.150826638       6 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies

I've seen a few conversations discussing similar issues that seemed related to fork.

Also, here are some more stack traces:
(gdb) thread apply all bt

Thread 6 (Thread 0x7fffbb7cd700 (LWP 1536)):
#0  0x0000555555b3f2c0 in PyUnicode_Type ()
#1  0x00007fffe437f6c9 in grpc_core::ExecCtx::Flush() () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#2  0x00007fffe437b22d in pollset_work(grpc_pollset*, grpc_pollset_worker**, long) () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#3  0x00007fffe429384c in run_poller () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#4  0x00007fffe437f6c9 in grpc_core::ExecCtx::Flush() () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#5  0x00007fffe4393a20 in timer_thread(void*) () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#6  0x00007fffe437301b in grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::{lambda(void*)#1}::_FUN(void*) () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#7  0x00007ffff77cc6db in start_thread (arg=0x7fffbb7cd700) at pthread_create.c:463
#8  0x00007ffff7b05a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7fffe2eca700 (LWP 1532)):
#0  0x00007ffff77d2ed9 in futex_reltimed_wait_cancelable (private=<optimized out>, reltime=0x7fffe2ec9ce0, expected=0, futex_word=0x7fffe48d324c <g_cv_wait+44>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:142
#1  __pthread_cond_wait_common (abstime=0x7fffe2ec9d90, mutex=0x7fffe48d3260 <_ZL4g_mu>, cond=0x7fffe48d3220 <g_cv_wait>) at pthread_cond_wait.c:533
#2  __pthread_cond_timedwait (cond=0x7fffe48d3220 <g_cv_wait>, mutex=0x7fffe48d3260 <_ZL4g_mu>, abstime=0x7fffe2ec9d90) at pthread_cond_wait.c:667
#3  0x00007fffe4370c7a in gpr_cv_wait () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#4  0x00007fffe4393b37 in timer_thread(void*) () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#5  0x00007fffe437301b in grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::{lambda(void*)#1}::_FUN(void*) () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#6  0x00007ffff77cc6db in start_thread (arg=0x7fffe2eca700) at pthread_create.c:463
#7  0x00007ffff7b05a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7fffe36cb700 (LWP 1531)):
#0  0x00007ffff77d29f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555557b94630) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x555557b945d0, cond=0x555557b94608) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x555557b94608, mutex=0x555557b945d0) at pthread_cond_wait.c:655
#3  0x00007fffe4370c12 in gpr_cv_wait () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#4  0x00007fffe437ffcf in grpc_core::Executor::ThreadMain(void*) () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#5  0x00007fffe437301b in grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::{lambda(void*)#1}::_FUN(void*) () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#6  0x00007ffff77cc6db in start_thread (arg=0x7fffe36cb700) at pthread_create.c:463
#7  0x00007ffff7b05a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7fffe3ecc700 (LWP 1530)):
#0  0x00007ffff77d29f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555557b6fb30) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x555557b6fad0, cond=0x555557b6fb08) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x555557b6fb08, mutex=0x555557b6fad0) at pthread_cond_wait.c:655
#3  0x00007fffe4370c12 in gpr_cv_wait () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#4  0x00007fffe437ffcf in grpc_core::Executor::ThreadMain(void*) () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#5  0x00007fffe437301b in grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::{lambda(void*)#1}::_FUN(void*) () from /usr/local/lib/python3.5/dist-packages/grpc/_cython/cygrpc.cpython-35m-x86_64-linux-gnu.so
#6  0x00007ffff77cc6db in start_thread (arg=0x7fffe3ecc700) at pthread_create.c:463
#7  0x00007ffff7b05a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7ffff7fe9740 (LWP 1521)):
#0  0x00007ffff7b05d67 in epoll_wait (epfd=9, events=0x555562ece670, maxevents=1023, timeout=1809) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x0000555555633f0a in ?? ()
#2  0x00005555556c5033 in PyCFunction_Call ()
#3  0x000055555566f0d0 in PyEval_EvalFrameEx ()
#4  0x000055555566f2f3 in PyEval_EvalFrameEx ()
#5  0x000055555566f2f3 in PyEval_EvalFrameEx ()
#6  0x00005555556ca3c2 in ?? ()
#7  0x000055555567157b in ?? ()
#8  0x00005555556c50b1 in PyCFunction_Call ()
#9  0x000055555566f0d0 in PyEval_EvalFrameEx ()
#10 0x000055555572e78f in ?? ()
#11 0x000055555572e853 in PyEval_EvalCodeEx ()
#12 0x00005555556c8c85 in ?? ()
#13 0x00005555555bbe9a in PyObject_Call ()
#14 0x0000555555669938 in PyEval_EvalFrameEx ()
#15 0x000055555566f2f3 in PyEval_EvalFrameEx ()
#16 0x000055555566f2f3 in PyEval_EvalFrameEx ()
#17 0x000055555566f2f3 in PyEval_EvalFrameEx ()
#18 0x000055555566f2f3 in PyEval_EvalFrameEx ()
#19 0x000055555566f2f3 in PyEval_EvalFrameEx ()
#20 0x000055555566f2f3 in PyEval_EvalFrameEx ()
#21 0x000055555572e78f in ?? ()
#22 0x000055555572e853 in PyEval_EvalCodeEx ()
#23 0x00005555556c8d75 in ?? ()
#24 0x00005555555bbe9a in PyObject_Call ()
#25 0x0000555555669938 in PyEval_EvalFrameEx ()
#26 0x000055555572e78f in ?? ()
#27 0x000055555572e853 in PyEval_EvalCodeEx ()
#28 0x00005555556c8d75 in ?? ()
#29 0x00005555555bbe9a in PyObject_Call ()
#30 0x00005555555d8c2c in ?? ()
#31 0x00005555555bbe9a in PyObject_Call ()
#32 0x000055555569ba0c in ?? ()
#33 0x00005555555bbe9a in PyObject_Call ()
#34 0x0000555555669938 in PyEval_EvalFrameEx ()
#35 0x000055555572e78f in ?? ()
#36 0x000055555566d969 in PyEval_EvalFrameEx ()
#37 0x000055555572e78f in ?? ()
#38 0x000055555566d969 in PyEval_EvalFrameEx ()
#39 0x000055555566f2f3 in PyEval_EvalFrameEx ()
#40 0x000055555572e78f in ?? ()
#41 0x000055555566d969 in PyEval_EvalFrameEx ()
#42 0x000055555572e78f in ?? ()
#43 0x000055555566d969 in PyEval_EvalFrameEx ()
#44 0x000055555572e78f in ?? ()
#45 0x000055555566d969 in PyEval_EvalFrameEx ()
#46 0x000055555566f2f3 in PyEval_EvalFrameEx ()
#47 0x000055555572e78f in ?? ()
#48 0x00005555556664cf in PyEval_EvalCode ()
#49 0x0000555555678dd0 in ?? ()
#50 0x0000555555680aa1 in PyRun_FileExFlags ()
#51 0x0000555555680c5d in PyRun_SimpleFileExFlags ()
#52 0x0000555555723424 in Py_Main ()
#53 0x00005555555b6b1d in main ()


gnossen commented Aug 12, 2020

@huguesalary The stack trace looks slightly different in your most recent comment. Which thread segfaulted in that case? Thread 6?


huguesalary commented Aug 12, 2020

I somehow managed to lose the trace and can't generate another one right now, but I believe thread 6 was the one that segfaulted in the trace I posted in #23796 (comment).

Now, I also had instances where the segfault that occurred looked exactly like thread 1.

huguesalary commented:

After deploying 1.30.0 on our production systems, the number of segfaults went down to 0.

So, although I was able to produce a segfault with 1.30.0 on my own machines, it does seem 1.30.0 is more stable than 1.31.0.


veblush commented Aug 18, 2020

Two things that might be relevant to this come to my mind:

  • New manylinux2014 artifacts were added starting with 1.31. You can try the manylinux2010 artifacts instead by installing with pip's --platform manylinux2010_x86_64 option (see the command sketch after this list).
  • TCP_USER_TIMEOUT auto-detection was added (Added TCP_USER_TIMEOUT auto-detection #23401). This shouldn't be invasive, but it has never been exercised from Python before, so chances are something isn't handled properly.
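
For reference, a command sketch for the first suggestion (the exact flags may differ slightly between pip versions): fetching the manylinux2010 wheel explicitly so it can be installed instead of the manylinux2014 one.

# Download the manylinux2010 wheel for grpcio 1.31.0 into the current directory.
pip download grpcio==1.31.0 --only-binary=:all: --platform manylinux2010_x86_64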

Another interesting thing is the fork error message. It shouldn't be shown, I think. Richard, isn't gRPC Python supposed to not use epollex at all?

sergei-iurchenko commented:

I have the same problem.
After upgrading the Python dependencies in our project, our Kubernetes pods running celery workers started restarting.
There were no logs, no tracebacks, no exceptions. We enabled the Python faulthandler (https://blog.richard.do/2018/03/18/how-to-debug-segmentation-fault-in-python/) and got different tracebacks, not connected with grpc, in different parts of the code. By reverting the recent changes in the working branch step by step, we found that the grpc package was the source of the segfaults. Downgrading to 1.30.0 solved the issue.
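
For anyone who wants to do the same, enabling the fault handler takes only a couple of lines (standard library; setting PYTHONFAULTHANDLER=1 in the worker's environment is equivalent):

# Dump the Python stack of every thread to stderr when the process receives
# a fatal signal such as SIGSEGV, instead of dying silently.
import faulthandler
faulthandler.enable()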


sergei-iurchenko commented Aug 19, 2020

Some examples
Fatal Python error: Segmentation fault

Thread 0x00007fb7bfb21740 (most recent call first):
  File "/usr/local/lib/python3.8/site-packages/kombu/utils/eventio.py", line 84 in poll
  File "/usr/local/lib/python3.8/site-packages/kombu/asynchronous/hub.py", line 308 in create_loop
  File "/usr/local/lib/python3.8/site-packages/celery/worker/loops.py", line 83 in asynloop
  File "/usr/local/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 599 in start
  File "/usr/local/lib/python3.8/site-packages/celery/bootsteps.py", line 119 in start
  File "/usr/local/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 318 in start
  File "/usr/local/lib/python3.8/site-packages/celery/bootsteps.py", line 369 in start
  File "/usr/local/lib/python3.8/site-packages/celery/bootsteps.py", line 119 in start
  File "/usr/local/lib/python3.8/site-packages/celery/worker/worker.py", line 208 in start
  File "/usr/local/lib/python3.8/site-packages/celery/bin/worker.py", line 259 in run
  File "/usr/local/lib/python3.8/site-packages/celery/bin/base.py", line 253 in __call__
  File "/usr/local/lib/python3.8/site-packages/celery/bin/worker.py", line 223 in run_from_argv
  File "/usr/local/lib/python3.8/site-packages/celery/bin/celery.py", line 415 in execute
  File "/usr/local/lib/python3.8/site-packages/celery/bin/celery.py", line 487 in handle_argv

Another one
Fatal Python error: Segmentation fault

Thread 0x00007f4dbbc50740 (most recent call first):
  File "/usr/local/lib/python3.8/site-packages/billiard/connection.py", line 422 in _recv
  File "/usr/local/lib/python3.8/site-packages/billiard/connection.py", line 456 in _recv_bytes
  File "/usr/local/lib/python3.8/site-packages/billiard/connection.py", line 243 in recv_bytes
  File "/usr/local/lib/python3.8/site-packages/billiard/queues.py", line 355 in get_payload
  File "/usr/local/lib/python3.8/site-packages/billiard/pool.py", line 445 in _recv
  File "/usr/local/lib/python3.8/site-packages/billiard/pool.py", line 473 in receive
  File "/usr/local/lib/python3.8/site-packages/billiard/pool.py", line 351 in workloop
  File "/usr/local/lib/python3.8/site-packages/sentry_sdk/integrations/celery.py", line 253 in sentry_workloop
  File "/usr/local/lib/python3.8/site-packages/billiard/pool.py", line 292 in __call__
  File "/usr/local/lib/python3.8/site-packages/billiard/process.py", line 114 in run
  File "/usr/local/lib/python3.8/site-packages/billiard/process.py", line 327 in _bootstrap
  File "/usr/local/lib/python3.8/site-packages/billiard/popen_fork.py", line 79 in _launch

I can provide additional info for debugging purposes, but I don't have a case that reproduces it 100% of the time. It is an intermittent error.


zdenulo commented Aug 28, 2020

We're using a Flask application with the Google Pub/Sub emulator and the Google Datastore emulator (both communicating via gRPC through client libraries) to run tests. With grpcio==1.31.0 we get segmentation fault errors, but fortunately with 1.30.0 it works fine. Not sure what the issue could be. We're on Python 3.7.5.


cgurnik commented Sep 17, 2020

@gnossen Is there anything we can provide to help make progress on this issue?


cgurnik commented Nov 3, 2020

Hi @gnossen, just checking in again. Is there any additional information that would be useful?


dmjef commented Nov 11, 2020

This is happening to me too with grpcio-1.33.2-cp36-cp36m-manylinux2014_x86_64 on Ubuntu 18.04.4 LTS (GNU/Linux 5.3.0-1030-aws x86_64). I get constant segfaults. Downgrading to 1.30.0 solves the problem.


exzhawk commented Nov 16, 2020

Same here. Python multiprocessing.Process workers died randomly, leaving a segfault message in dmesg like [6558262.243117] grpc_global_tim[42632]: segfault at 7f172819d038 ip 00007f1789caff59 sp 00007f1739bb6c40 error 4 in cygrpc.cpython-36m-x86_64-linux-gnu.so[7f1789a60000+5e8000]. Downgrading to 1.30.0 solves it.

Installing grpcio-1.33.2-cp36-cp36m-manylinux2010_x86_64.whl does not help; I still get the same error.

wireman27 commented:

We faced the exact same issue: multiprocessing with grpc leading to segmentation faults. We were using 1.34.0 in our case, but downgrading to 1.30.0 resolved the issue. We haven't tried 1.32.0 or 1.33.0 yet.


gnossen commented Dec 8, 2020

Sorry for the delay. This thread got buried in my inbox. @wireman27 @exzhawk @dmjef Can one of you please point me to your code and/or describe how to reproduce this issue?


cgurnik commented Dec 8, 2020

It looks like the default polling strategy changed from epoll1 in v1.30 to epollex in v1.31. Unfortunately, epollex doesn't have fork support, so it doesn't work with Celery ("Fork support is only compatible with the epoll1 and poll polling strategies"). We were able to work around this issue by setting GRPC_POLL_STRATEGY=epoll1 on our Celery workers.
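
For anyone applying the same workaround, a minimal sketch (assuming nothing imports grpc before this code runs; exporting the variable in the worker's environment, e.g. in the Dockerfile or the celery wrapper script, achieves the same thing):

# The poll strategy is read when the gRPC core initializes, so the variable
# must be set before grpc is imported anywhere in the process.
import os
os.environ.setdefault("GRPC_POLL_STRATEGY", "epoll1")

import grpc  # imported only after the poll strategy is pinned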

wireman27 commented:

@gnossen Reproducing might be difficult because we use google-cloud-translate to translate some text, and it's exactly the snippet making the translate request to Google that seems to cause the segmentation fault. (It works fine without the translation snippet.)

Important to note here is that we initialise 10 sub-processes with multiprocess.Process(), and each of these sub-processes makes translation requests. The output of these 10 sub-processes is then funnelled through a multiprocess.Queue() that is picked up by another multiprocess.Process(). Hope this helps.
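
For illustration, a rough sketch of that layout (hypothetical: the target address and the TranslateStub/Translate names are placeholders for the real google-cloud-translate client, and the standard-library multiprocessing module stands in here): ten child processes each create their own gRPC channel after the fork and push results into a queue drained by a separate process.

import multiprocessing as mp
import grpc

def translate_worker(out_queue):
    # Channel created inside the child, after the fork.
    channel = grpc.insecure_channel("translate.internal.example:443")  # placeholder
    # stub = TranslateStub(channel)            # placeholder generated stub
    # out_queue.put(stub.Translate(request))   # placeholder RPC
    channel.close()

def collector(in_queue, expected):
    for _ in range(expected):
        in_queue.get()  # drain results produced by the workers

if __name__ == "__main__":
    queue = mp.Queue()
    workers = [mp.Process(target=translate_worker, args=(queue,)) for _ in range(10)]
    sink = mp.Process(target=collector, args=(queue, 0))  # 0 because the RPCs above are stubbed out
    for p in workers + [sink]:
        p.start()
    for p in workers + [sink]:
        p.join()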


cpaulik commented Dec 15, 2020

We have a similar issue.

We are running processes with multiprocessing, using Google Cloud Pub/Sub. The code we are using is very similar to the synchronous_pull_with_lease_management function from the Pub/Sub client examples.

With all versions above 1.30 we get messages like these:

E1215 16:57:23.081484255    5127 ssl_transport_security.cc:514] Corruption detected.
E1215 16:57:23.081556271    5127 ssl_transport_security.cc:490] error:1e000065:Cipher functions:OPENSSL_internal:BAD_DECRYPT
E1215 16:57:23.081565033    5127 ssl_transport_security.cc:490] error:1000008b:SSL routines:OPENSSL_internal:DECRYPTION_FAILED_OR_BAD_RECORD_MAC
E1215 16:57:23.081570862    5127 secure_endpoint.cc:208]     Decryption error: TSI_DATA_CORRUPTED

and then segmentation faults.

But it seems the occurrence of the segmentation fault depends on the type of process we are running.
I've not yet been able to make a minimal example, and running it with a simple time.sleep in the worker function does not show the behavior.


TheBirdsNest commented Jan 4, 2021

This is starting to be a huge issue for my team; we make heavy use of grpcio and grpc-google-iam-v1.
Downgrading to grpcio==1.30.0 did not seem to help.

Does anyone else have this issue, or a workaround for using the GCP libraries?

google-api-core:
    1.24.0
google-api-python-client:
    1.12.8
google-auth:
    1.24.0
google-auth-httplib2:
    0.0.4
google-cloud-core:
    1.5.0
google-cloud-kms:
    2.2.0
google-cloud-logging:
    2.0.1
google-cloud-storage:
    1.35.0
google-crc32c:
    1.1.0
google-resumable-media:
    1.2.0
googleapis-common-protos:
    1.52.0
grpc-google-iam-v1:
    0.12.3
grpcio:
    1.30.0


jrmlhermitte commented Feb 1, 2021

We have had this issue too, and this thread was helpful for us. We have a workaround, so I thought I would provide details in the hope that it may help.

What we did, as suggested above, was to set GRPC_POLL_STRATEGY=epoll1 on our workers that use multiprocessing with fork on Linux-based images.
As a long-term strategy, we have been gradually moving away from fork in general (it was the cause of many other issues).
This happened on workers running Python 3.7.6.

We have also found that workers running on python 3.8.6 do not see this issue. However, we have seen errors like these instead:

2021-02-01 12:16:49.489 EST "Corruption detected."
2021-02-01 12:16:49.490 EST "error:1e000065:Cipher functions:OPENSSL_internal:BAD_DECRYPT"
2021-02-01 12:16:49.490 EST "error:1000008b:SSL routines:OPENSSL_internal:DECRYPTION_FAILED_OR_BAD_RECORD_MAC"
2021-02-01 12:16:49.490 EST " Decryption error: TSI_DATA_CORRUPTED"
2021-02-01 12:16:49.490 EST "SSL_write failed with error SSL_ERROR_SSL."

which is similar to what @cpaulik is seeing.

If I had to guess, I would naively suspect that both the parent and child processes are attempting to use the same socket connection (since forking gives the child a copy of the parent's file descriptors), resulting in data corruption. Not sure if this helps, or if this is the wrong direction.
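
If that guess is right, one way to sidestep the problem entirely (a sketch, assuming the workload can afford the extra child start-up cost) is the 'spawn' start method, so children begin with a fresh interpreter and never inherit the parent's gRPC sockets or background threads:

import multiprocessing as mp

def worker():
    import grpc  # imported fresh inside the spawned child
    channel = grpc.insecure_channel("localhost:50051")  # placeholder target
    channel.close()

if __name__ == "__main__":
    mp.set_start_method("spawn")
    procs = [mp.Process(target=worker) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()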

joshuarli added a commit to getsentry/sentry that referenced this issue Feb 9, 2021

oakal commented Apr 22, 2021

(quoting @jrmlhermitte's workaround above)

Thank you so much @jrmlhermitte, I've been trying to find a solution for this segfault issue for a very long time. Setting GRPC_POLL_STRATEGY=epoll1 worked like a charm.

leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this issue Jun 9, 2021
The current version of the Airflow image uses grpcio==1.31.0, which causes
segfaults: b/174948982. A dependency constraint was added to allow only
versions up to 1.30.0.

It hasn't yet been fixed upstream: grpc/grpc#23796

Change-Id: I62f3fbd75ff64dab6772a534424d06b178e67a42
GitOrigin-RevId: f0a17fad94c95fcb0794c8501ce579b391473c1e
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this issue Oct 7, 2021
kurtschelfthout added a commit to meadowdata/meadowflow that referenced this issue Jan 3, 2022
Enable all db and flow tests.
On linux, multiprocessing's default is fork, which causes gRPC to fail
because its default polling mechanism is epoll.
See grpc/grpc#23796
numbsafari commented:

Hi, 2022 checking in. It appears that gRPC doesn't work with any kind of pre-fork process model (e.g. uWSGI) unless you specify the environment variable noted above.

Perhaps a solution would be to add some documentation?


k4nar commented Nov 17, 2023

Note: the epollex poller was removed in gRPC 1.46.0, released in May 2022. So I don't think the workaround above of forcing the poller to epoll1 is still relevant today, as epoll1 is now the default.
