Restarting process with USR2 doesn't always work #197
Can you break into a stuck worker with gdb and run "bt"? Perhaps I need to ungracefully kill workers that don't stop after a timeout...
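For reference, a minimal sketch in plain Ruby of the "kill after timeout" idea being floated here: send TERM, wait up to a deadline, then KILL whatever is left. This is not Puma's actual internals; the worker_pids argument and the 30-second default are purely illustrative.

```ruby
# Hedged sketch: force-kill workers that ignore a graceful TERM.
# `worker_pids` and the timeout are hypothetical, not Puma's real code.
def force_shutdown(worker_pids, timeout = 30)
  worker_pids.each do |pid|
    begin
      Process.kill(:TERM, pid)   # ask nicely first
    rescue Errno::ESRCH
      # worker already exited
    end
  end

  deadline = Time.now + timeout
  until worker_pids.empty? || Time.now > deadline
    # WNOHANG: returns nil while the child is still running, the pid once reaped
    worker_pids.reject! { |pid| Process.waitpid(pid, Process::WNOHANG) }
    sleep 0.2
  end

  worker_pids.each do |pid|
    begin
      Process.kill(:KILL, pid)   # the ungraceful part
      Process.waitpid(pid)
    rescue Errno::ESRCH, Errno::ECHILD
      # already gone
    end
  end
end
```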
I'm having this exact same issue as well. It also occurs on shutdown (SIGTERM to master). It's completely random. Sometimes all workers shut down, sometimes just one of them.
Here is a gdb backtrace of one of the workers that didn't shut down:

0x00007f2394cebd84 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0 0x00007f2394cebd84 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f23954121a9 in native_cond_wait (cond=, mutex=) at thread_pthread.c:307
#2 0x00007f2395415f90 in native_sleep (th=0x603520, timeout_tv=0x0) at thread_pthread.c:908
#3 0x00007f239541782b in sleep_forever (th=0x603520, deadlockable=1) at thread.c:855
#4 0x00007f239541789d in thread_join_sleep (arg=140733465907824) at thread.c:688
#5 0x00007f239531844b in rb_ensure (b_proc=0x7f2395417860 , data1=140733465907824, e_proc=0x7f2395411fa0 , data2=140733465907824) at eval.c:736
#6 0x00007f2395414078 in thread_join (target_th=0x2d9ad40, delay=) at thread.c:721
#7 0x00007f23954141a2 in thread_join_m (argc=0, argv=0x7f239429e208, self=) at thread.c:802
#8 0x00007f239540abe1 in vm_call_cfunc (me=0x719170, blockptr=0x0, recv=, num=0, reg_cfp=0x7f239439da90, th=) at vm_insnhelper.c:404
#9 vm_call_method (th=, cfp=0x7f239439da90, num=, blockptr=0x0, flag=, id=, me=0x719170, recv=58423080) at vm_insnhelper.c:534
#10 0x00007f2395401b52 in vm_exec_core (th=, initial=) at insns.def:1015
#11 0x00007f239540755d in vm_exec (th=0x603520) at vm.c:1220
#12 0x00007f239540d385 in invoke_block_from_c (cref=0x0, blockptr=0x0, argv=0x0, argc=0, self=10399800, block=0x7f239439dc18, th=0x603520) at vm.c:624
#13 vm_yield (th=0x603520, argv=0x0, argc=0) at vm.c:654
#14 rb_yield_0 (argv=0x0, argc=0) at vm_eval.c:740
#15 rb_yield (val=6) at vm_eval.c:747
#16 0x00007f239531820a in rb_protect (proc=0x7f239540cef0 , data=6, state=0x7fff103e581c) at eval.c:711
#17 0x00007f239537dd02 in rb_f_fork (obj=) at process.c:2792
#18 rb_f_fork (obj=) at process.c:2780
#19 0x00007f239540abe1 in vm_call_cfunc (me=0x6fab40, blockptr=0x7f239439dc18, recv=, num=0, reg_cfp=0x7f239439dbf0, th=) at vm_insnhelper.c:404
#20 vm_call_method (th=, cfp=0x7f239439dbf0, num=, blockptr=0x7f239439dc18, flag=, id=, me=0x6fab40, recv=10399800) at vm_insnhelper.c:534
#21 0x00007f2395401b52 in vm_exec_core (th=, initial=) at insns.def:1015
#22 0x00007f239540755d in vm_exec (th=0x603520) at vm.c:1220
#23 0x00007f239540d15e in invoke_block_from_c (cref=0x0, blockptr=0x0, argv=0x7fff103e5b08, argc=1, self=10399800, block=0x7f239439dd20, th=0x603520) at vm.c:624
#24 vm_yield (th=0x603520, argv=0x7fff103e5b08, argc=1) at vm.c:654
#25 rb_yield_0 (argv=0x7fff103e5b08, argc=1) at vm_eval.c:740
#26 rb_yield (val=3) at vm_eval.c:750
#27 0x00007f2395351891 in int_dotimes (num=7) at numeric.c:3290
#28 0x00007f239540abe1 in vm_call_cfunc (me=0x6978a0, blockptr=0x7f239439dd20, recv=, num=0, reg_cfp=0x7f239439dcf8, th=) at vm_insnhelper.c:404
#29 vm_call_method (th=, cfp=0x7f239439dcf8, num=, blockptr=0x7f239439dd20, flag=, id=, me=0x6978a0, recv=7) at vm_insnhelper.c:534
#30 0x00007f2395401b52 in vm_exec_core (th=, initial=) at insns.def:1015
#31 0x00007f239540755d in vm_exec (th=0x603520) at vm.c:1220
#32 0x00007f239540e7f8 in rb_iseq_eval (iseqval=17853160) at vm.c:1447
#33 0x00007f2395319d35 in rb_load_internal (fname=17855800, wrap=) at load.c:310
#34 0x00007f2395319e98 in rb_f_load (argc=, argv=) at load.c:383
#35 0x00007f239540abe1 in vm_call_cfunc (me=0x706da0, blockptr=0x0, recv=, num=1, reg_cfp=0x7f239439df08, th=) at vm_insnhelper.c:404
#36 vm_call_method (th=, cfp=0x7f239439df08, num=, blockptr=0x0, flag=, id=, me=0x706da0, recv=6719200) at vm_insnhelper.c:534
#37 0x00007f2395401b52 in vm_exec_core (th=, initial=) at insns.def:1015
#38 0x00007f239540755d in vm_exec (th=0x603520) at vm.c:1220
#39 0x00007f239540e8df in rb_iseq_eval_main (iseqval=17840040) at vm.c:1461
#40 0x00007f23953170b2 in ruby_exec_internal (n=0x11037a8) at eval.c:204
#41 0x00007f2395317b9d in ruby_exec_node (n=0x11037a8) at eval.c:251
#42 0x00007f239531974e in ruby_run_node (n=0x11037a8) at eval.c:244
#43 0x00000000004007db in main (argc=11, argv=0x7fff103e6508) at main.c:38

Here's the output of (gdb) info threads:

  Id  Target Id         Frame
  15  Thread 0x7f23958ef700 (LWP 4290) "ruby1.9.1" 0x00007f2394fea033 in select () from /lib/x86_64-linux-gnu/libc.so.6
  14  Thread 0x7f2391640700 (LWP 4291) "ruby1.9.1" 0x00007f2394cebd84 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  13  Thread 0x7f2390084700 (LWP 4705) "ruby1.9.1" 0x00007f2394cebd84 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  12  Thread 0x7f2386807700 (LWP 4706) "ruby1.9.1" 0x00007f2394cebd84 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  11  Thread 0x7f2386685700 (LWP 4707) "ruby1.9.1" 0x00007f2394cebd84 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  10  Thread 0x7f2386503700 (LWP 4708) "ruby1.9.1" 0x00007f2394cebd84 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  9   Thread 0x7f2386381700 (LWP 4709) "ruby1.9.1" 0x00007f2394cebd84 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  8   Thread 0x7f23861ff700 (LWP 4710) "ruby1.9.1" 0x00007f2394cee89c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
  7   Thread 0x7f238607d700 (LWP 4711) "ruby1.9.1" 0x00007f2394cebd84 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  6   Thread 0x7f2385efb700 (LWP 4712) "ruby1.9.1" 0x00007f2394cebd84 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  5   Thread 0x7f2385d79700 (LWP 4713) "ruby1.9.1" 0x00007f2394cebd84 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  4   Thread 0x7f2385bf7700 (LWP 4720) "ruby1.9.1" 0x00007f2394cebd84 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  3   Thread 0x7f2385a75700 (LWP 4727) "ruby1.9.1" 0x00007f2394cebd84 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  2   Thread 0x7f23859f4700 (LWP 4728) "SignalSender" 0x00007f2394cedfd0 in sem_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
* 1   Thread 0x7f23958e5700 (LWP 4288) "ruby1.9.1" 0x00007f2394cebd84 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0

I can reproduce this fairly easily if you need more info.
Yes, I need a way to reproduce this issue. Can you provide a way to do that?
@evanphx I seem to experience this when using this testing application: https://github.com/leifcr/capistrano-testing. The paths in the deployment script are horrible, as it is used for devel/testing. I am running puma under runit, but it seems to happen occasionally when running puma manually as well. I still haven't found a way to reproduce it consistently, but I do think a "kill after timeout x" is the best option, as that should take care of workers that are just hanging. Could this be triggered by db connections and/or logs not being closed properly? It seems to trigger more often if I am reloading the app (via curl) just as the phased restart is about to happen. It seems to behave the same for both restart and phased restart.
No activity in 7 months. Is this still an ongoing issue? Can you build a runnable script that reproduces the problem?
This could very well have been the Process.daemon issue. Let's close, and reopen if someone still has the issue.
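For context on how daemonizing can get in the way of signal-driven restarts, here is a tiny illustration, not Puma's code: Process.daemon forks and lets the original process exit, so a pidfile written beforehand points at a PID that no longer exists, and a USR2 sent to that PID goes nowhere. The pidfile path below is hypothetical.

```ruby
# Illustration only -- not Puma's implementation of daemon mode.
pidfile = "/tmp/example.pid"              # hypothetical path
File.write(pidfile, Process.pid.to_s)     # record the pre-daemonize PID

Process.daemon(true, true)                # fork + detach; keep cwd and stdio for the demo

puts "pidfile says #{File.read(pidfile).strip}, but this process is now #{Process.pid}"
# `kill -USR2 $(cat /tmp/example.pid)` here would target the exited parent,
# not the daemonized process that is actually serving requests.
```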
Hi all - We're starting to run Puma in production, and I think we're running into this bug. I've attached a backtrace, a thread-level backtrace, and a Ruby backtrace. They're unfortunately each from different processes, as we're running in cluster mode and the master started killing timed-out workers while I investigated. We're running ruby 1.9.3-p484 and puma 2.8.1. We've not yet found a reliable way to reproduce.
I just noticed this issue and mine might be similar: #649 |
Any updates on this issue, team? It's creating huge issues in our production environment.
@yaminil please open a new issue describing how the problem you are having can be reproduced. This bug is from 2013 and closed (with the last comment from 2015), so I wouldn't expect any updates here.
I deployed Puma to production a few days ago, and noticed that after some deploys we have to kill -9 the puma processes and start a new master.
Our setup is the following:
We're running Puma with
bundle exec puma -C /path/to/config -e production --dir /path/to/current -p 8888 --pidfile /path/to/pidfile -d 2>&1 >> /path/to/log
config is:
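(The config file contents weren't preserved in this thread. Purely for orientation, a clustered Puma config of that era would typically look something like the sketch below; every value is illustrative, not the reporter's actual settings.)

```ruby
# Illustrative example only -- not the reporter's real config.
workers 4            # cluster mode: one master, four forked workers
threads 8, 16        # min/max threads per worker

state_path "/path/to/puma.state"

on_worker_boot do
  # per-worker setup after fork, e.g. re-establishing database connections
end
```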
On some restarts with USR2, a few worker processes fail to close. I ran strace on them: some are looping in sched_yield, and some give this output:

I also observed strange behaviour on the restarts that seem to succeed. There are 'waves' of requests that show up in the log, then they stop, and after a few seconds I get ActiveRecord messages saying the db connection was established.
What would be the next steps for debugging this? I should probably have something in the on_restart hook, but I'm not sure what.
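For what it's worth, the usual recipe from that era (a general pattern, not a confirmed fix for this particular hang) was to drop database connections in on_restart and re-establish them per worker in on_worker_boot, which would also explain why the ActiveRecord "connection established" messages only appear once the workers finish booting:

```ruby
# Common ActiveRecord hooks for Puma cluster mode; a general pattern,
# not a verified fix for this specific issue.
on_restart do
  # Disconnect the master's pool before the USR2 re-exec.
  ActiveRecord::Base.connection_pool.disconnect! if defined?(ActiveRecord)
end

on_worker_boot do
  # Each forked worker opens its own connection.
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord)
end
```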