fork_worker
worker termination during phased-restart
#2786
Comments
Sounds like #2470. See #2470 (comment) for a workaround.
Yes, it might be related. But do you have an idea why we trigger the server restart for the base worker (Worker 0) only at the end, once all the servers boot? Line 89 in cec07ac

Mentioning @wjordan in case he has bandwidth.
Was able to replicate with a test on macOS:

    # Ran with an increased timeout as this takes ~65 seconds; the default is 45
    # TEST_CASE_TIMEOUT=80 bundle exec m test/test_integration_cluster.rb:290
    def test_fork_worker_phased_restart
      cli_server "test/rackup/hello.ru", config: <<~RUBY
        workers 13 # minimum workers I needed to replicate consistently
        fork_worker 0
      RUBY
      while (line = @server.gets).include?("phase: 0")
        p line
        # wait till all the workers are up before restarting
        break if line.include?("Worker 12")
      end
      Process.kill :USR1, @pid
      while (line = @server.gets)
        p line
        refute line.include?("Terminating timed out worker")
        break if line.include?("Worker 12")
      end
    end

Logs without the assertion:

"[18135] - Worker 0 (PID: 18138) booted in 0.0s, phase: 0\n"
"[18135] - Worker 1 (PID: 18139) booted in 0.0s, phase: 0\n"
"[18135] - Worker 2 (PID: 18140) booted in 0.0s, phase: 0\n"
"[18135] - Worker 3 (PID: 18141) booted in 0.01s, phase: 0\n"
"[18135] - Worker 4 (PID: 18142) booted in 0.01s, phase: 0\n"
"[18135] - Worker 5 (PID: 18143) booted in 0.01s, phase: 0\n"
"[18135] - Worker 6 (PID: 18144) booted in 0.01s, phase: 0\n"
"[18135] - Worker 7 (PID: 18145) booted in 0.01s, phase: 0\n"
"[18135] - Worker 8 (PID: 18146) booted in 0.01s, phase: 0\n"
"[18135] - Worker 9 (PID: 18147) booted in 0.01s, phase: 0\n"
"[18135] - Worker 10 (PID: 18148) booted in 0.01s, phase: 0\n"
"[18135] - Worker 11 (PID: 18149) booted in 0.01s, phase: 0\n"
"[18135] - Worker 12 (PID: 18150) booted in 0.01s, phase: 0\n"
"[18135] - Starting phased worker restart, phase: 1\n"
"[18135] + Changing to /Users/{redacted_directory_cause_this_is_the_internet}/puma\n"
"[18135] - Stopping 18138 for phased upgrade...\n"
"[18135] - TERM sent to 18138...\n"
"[18135] - Stopping 18138 for phased upgrade...\n"
"[18135] - Worker 0 (PID: 18161) booted in 0.01s, phase: 1\n"
"[18135] - Stopping 18139 for phased upgrade...\n"
"[18135] - TERM sent to 18139...\n"
"[18135] - Worker 1 (PID: 18169) booted in 0.0s, phase: 1\n"
"[18135] - Stopping 18140 for phased upgrade...\n"
"[18135] - TERM sent to 18140...\n"
"[18135] - Worker 2 (PID: 18176) booted in 0.01s, phase: 1\n"
"[18135] - Stopping 18141 for phased upgrade...\n"
"[18135] - TERM sent to 18141...\n"
"[18135] - Worker 3 (PID: 18181) booted in 0.0s, phase: 1\n"
"[18135] - Stopping 18142 for phased upgrade...\n"
"[18135] - TERM sent to 18142...\n"
"[18135] - Worker 4 (PID: 18188) booted in 0.0s, phase: 1\n"
"[18135] - Stopping 18143 for phased upgrade...\n"
"[18135] - TERM sent to 18143...\n"
"[18135] - Worker 5 (PID: 18195) booted in 0.0s, phase: 1\n"
"[18135] - Stopping 18144 for phased upgrade...\n"
"[18135] - TERM sent to 18144...\n"
"[18135] - Worker 6 (PID: 18203) booted in 0.01s, phase: 1\n"
"[18135] - Stopping 18145 for phased upgrade...\n"
"[18135] - TERM sent to 18145...\n"
"[18135] - Worker 7 (PID: 18208) booted in 0.0s, phase: 1\n"
"[18135] - Stopping 18146 for phased upgrade...\n"
"[18135] - TERM sent to 18146...\n"
"[18135] - Worker 8 (PID: 18215) booted in 0.01s, phase: 1\n"
"[18135] - Stopping 18147 for phased upgrade...\n"
"[18135] - TERM sent to 18147...\n"
"[18135] - Worker 9 (PID: 18221) booted in 0.0s, phase: 1\n"
"[18135] - Stopping 18148 for phased upgrade...\n"
"[18135] - TERM sent to 18148...\n"
"[18135] - Worker 10 (PID: 18229) booted in 0.0s, phase: 1\n"
"[18135] - Stopping 18149 for phased upgrade...\n"
"[18135] - TERM sent to 18149...\n"
"[18135] - Worker 11 (PID: 18234) booted in 0.01s, phase: 1\n"
"[18135] - Stopping 18150 for phased upgrade...\n"
"[18135] - TERM sent to 18150...\n"
"[18135] ! Terminating timed out worker (worker failed to check in within 60 seconds): 18161\n" # 👈🏽
"[18135] - Worker 0 (PID: 18244) booted in 0.01s, phase: 1\n"
"[18135] - Worker 12 (PID: 18245) booted in 0.02s, phase: 1\n"

I'll dive into the implementation when I get a chance.
* [Fix #2786] Worker 0 timing out during phased restart
* test/test_integration_cluster.rb - fixup for logging changes
* Add test which simulates worker 0 timeout
* Bump workers, reduce timeout for more realistic but also faster test
* Use server log consumption with less blocking in test

Co-authored-by: MSP-Greg <Greg.mpls@gmail.com>
Describe the bug
During phased-restart, while running with the fork_worker option, worker 0 stops receiving the ping message. While the puma process is running, it constantly runs timeout_workers, which checks whether workers are responding. Because worker 0 is not receiving the "p" (ping) message, it is not updating its @last_checkin value (WorkerHandle class), which leads to a "Terminating timed out worker..." error once the timeout (default: 60 seconds) is exceeded. The worker is killed and the loop starts once again, so puma ends up constantly restarting its workers.

Puma config:
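The check-in mechanism described above can be sketched roughly as follows. WorkerHandle, @last_checkin, and ping_timeout are names taken from the issue; WORKER_TIMEOUT, ping!, and timed_out? are invented here for illustration, so this is only the shape of the mechanism, not Puma's actual implementation:

```ruby
# Rough sketch of the master's check-in bookkeeping. Each worker handle
# records the time of its last "p" (ping) message; a periodic sweep
# terminates any worker whose check-in is older than the timeout.
class WorkerHandle
  WORKER_TIMEOUT = 60 # seconds; matches the default mentioned in the issue

  attr_reader :pid, :last_checkin

  def initialize(pid)
    @pid = pid
    @last_checkin = Time.now # set once at boot
  end

  # The master calls this whenever it reads a "p" message from the worker.
  def ping!
    @last_checkin = Time.now
  end

  # Deadline after which the worker is considered unresponsive.
  def ping_timeout
    @last_checkin + WORKER_TIMEOUT
  end

  def timed_out?(now = Time.now)
    now > ping_timeout
  end
end

w = WorkerHandle.new(18161)
w.ping!
puts w.timed_out?                # false: a fresh check-in
puts w.timed_out?(Time.now + 61) # true: no ping for >60s, master would TERM it
```

If the "p" message never reaches worker 0 during the phased restart, ping! is never called, so timed_out? eventually flips to true even though the worker is healthy.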
To Reproduce
It is quite easy to reproduce this using a "hello world" rack application. The only thing you need to do is run the application with a lot of workers; in my example, 32 was more than enough to trigger the behavior.

Spin up the puma process:
bundle exec puma -C config/puma.rb -e development --threads 4:16 --workers 32 --tag test --pidfile tmp/puma.pid --state tmp/puma.state --dir . --control-url auto hello.ru

Trigger the phased-restart:
pumactl -s tmp/puma.state phased-restart
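The issue's actual config/puma.rb was not posted, but since workers and threads are already set on the command line above, a minimal assumed config only needs the option that matters for this bug:

```ruby
# Hypothetical config/puma.rb for the reproduction above (the issue's real
# config file was not included). fork_worker enables the behavior under
# test; the argument 0 disables automatic reforking, as in the test earlier
# in this thread.
fork_worker 0
```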
Logs with additional puts
In puma/lib/puma/cluster.rb, line 454 in cec07ac, in the when "p" branch, add:

    p "worker: #{w.index}, pid: #{w.pid} received p msg!" if w.index == 0

[lib/puma/cluster.rb:448]

In puma/lib/puma/cluster.rb, line 516 in a2bcda4, add:

    p "timeout_workers: #{w.ping_timeout}, pid: #{w.pid}, booted: #{w.booted?}, index: #{w.index}, term: #{w.term?}" if w.index == 0

[lib/puma/cluster.rb:518]

The log looks like this:
It is easy to see that once the phased restart starts, the ping message is not sent to worker 0, and the ping_timeout value for worker 0 is not updated, even though it is healthy and responding.

Expected behavior
Once worker 0 is restarted, we should send the ping "p" message to its thread as well, and not wait until all the workers are up and running.
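That expected behavior can be modeled with a toy example, under the assumption that the master keeps a set of workers it pings on each cycle. All names here (pinged, checkins, :w0_old, :w0_new) are hypothetical; the real fix lives in lib/puma/cluster.rb:

```ruby
# Toy model of the expected behavior: when worker 0 is replaced during a
# phased restart, the replacement joins the pinged set immediately, so its
# check-in time keeps updating instead of going stale until the whole
# restart finishes.
require "set"

pinged   = Set[:w0_old, :w1]
checkins = { w0_old: Time.now - 30, w1: Time.now }

# Worker 0 is restarted into the new phase:
pinged.delete(:w0_old)
checkins.delete(:w0_old)
pinged << :w0_new
checkins[:w0_new] = Time.now # fresh check-in at boot

# On every ping cycle the replacement now checks in like any other worker,
# rather than waiting until every worker has cycled through the restart:
pinged.each { |w| checkins[w] = Time.now }

puts pinged.include?(:w0_new) # true
```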
Desktop: