
fork_worker worker termination during phased-restart #2786

Closed
vitiokss opened this issue Jan 4, 2022 · 4 comments · Fixed by #3225


vitiokss commented Jan 4, 2022

Describe the bug

During a phased restart with the fork_worker option enabled, worker 0 stops receiving the ping ("p") message. While the puma process is running, timeout_workers runs periodically to check that workers are responding. Because worker 0 no longer receives the "p" message, its @last_checkin value (in the WorkerHandle class) is never updated, so once the timeout is exceeded (default: 60 seconds) the master logs a Terminating timed out worker... error, kills the worker, and the loop starts again. We end up in a state where puma constantly restarts its workers.
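To make the failure mode concrete, here is a minimal sketch (not Puma's actual code) of the check-in bookkeeping described above: each ping updates @last_checkin, and the timeout check terminates any worker whose last check-in is older than the timeout. Class and method names mirror the ones mentioned in this report but the implementation is illustrative only.

```ruby
# Illustrative sketch of the @last_checkin / timeout interaction
# (hypothetical, not Puma's implementation).
class WorkerHandle
  attr_reader :index, :last_checkin

  def initialize(index, timeout: 60)
    @index = index
    @timeout = timeout
    @last_checkin = Time.now
  end

  # Called when the "p" (ping) message arrives from the worker.
  def ping!
    @last_checkin = Time.now
  end

  def ping_timeout
    @last_checkin + @timeout
  end

  # If no ping arrives, ping_timeout never moves forward and this
  # eventually becomes true, triggering "Terminating timed out worker".
  def timed_out?(now = Time.now)
    ping_timeout <= now
  end
end

w = WorkerHandle.new(0, timeout: 60)
w.timed_out?(Time.now + 61)  # => true: no ping for 61s, worker is killed
w.ping!
w.timed_out?(Time.now + 30)  # => false: a fresh check-in resets the clock
```

This is exactly the loop observed below: the worker is healthy, but because the "p" message never reaches it, @last_checkin goes stale and the master kills it.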

Puma config:

threads_count = ENV.fetch("RAILS_MAX_THREADS") { 5 }
threads threads_count, threads_count

port        ENV.fetch("PORT") { 3000 }

environment ENV.fetch("RAILS_ENV") { "development" }

preload_app! false

fork_worker 0

To Reproduce
It is quite easy to reproduce this with a "hello world" rack application. The only thing you need to do is run the application with a large number of workers; in my case 32 was more than enough to trigger the behavior.

Spin up the puma process:
bundle exec puma -C config/puma.rb -e development --threads 4:16 --workers 32 --tag test --pidfile tmp/puma.pid --state tmp/puma.state --dir . --control-url auto hello.ru

Trigger phased-restart:
pumactl -s tmp/puma.state phased-restart

Logs with additional puts statements

After w.ping!(result.sub(/^\d+/,'').chomp) in the when "p" branch [lib/puma/cluster.rb:448], I added:

p "worker: #{w.index}, pid: #{w.pid} received p msg!" if w.index == 0

After if !w.term? && w.ping_timeout <= Time.now [lib/puma/cluster.rb:518], I added:

p "timeout_workers: #{w.ping_timeout}, pid: #{w.pid}, booted: #{w.booted?}, index: #{w.index}, term: #{w.term?}" if w.index == 0

The log looks like this. It is easy to see that once the phased restart starts, the ping message is no longer sent to worker 0 and its ping_timeout value stops being updated, even though the worker is healthy and responding.

"timeout_workers: 2022-01-04 19:04:07 +0200, pid: 9827, booted: true, index: 0, term: false"
[9814] - Worker 31 (PID: 9858) booted in 0.07s, phase: 0
"timeout_workers: 2022-01-04 19:04:07 +0200, pid: 9827, booted: true, index: 0, term: false"
"worker: 0, pid: 9827 received p msg!"
"timeout_workers: 2022-01-04 19:04:07 +0200, pid: 9827, booted: true, index: 0, term: false"
"worker: 0, pid: 9827 received p msg!"
"timeout_workers: 2022-01-04 19:04:12 +0200, pid: 9827, booted: true, index: 0, term: false"
"worker: 0, pid: 9827 received p msg!"
"timeout_workers: 2022-01-04 19:04:17 +0200, pid: 9827, booted: true, index: 0, term: false"
"worker: 0, pid: 9827 received p msg!"
[9814] - Starting phased worker restart, phase: 1
[9814] + Changing to .
"timeout_workers: 2022-01-04 19:04:22 +0200, pid: 9827, booted: true, index: 0, term: false"
[9814] - Stopping 9827 for phased upgrade...
[9814] - TERM sent to 9827...
"timeout_workers: 2022-01-04 19:04:22 +0200, pid: 9827, booted: true, index: 0, term: true"
[9814] - Worker 0 (PID: 9878) booted in 0.02s, phase: 1
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Stopping 9828 for phased upgrade...
[9814] - TERM sent to 9828...
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Worker 1 (PID: 9891) booted in 0.0s, phase: 1
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Stopping 9829 for phased upgrade...
[9814] - TERM sent to 9829...
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Worker 2 (PID: 9892) booted in 0.0s, phase: 1
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Stopping 9830 for phased upgrade...
[9814] - TERM sent to 9830...
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Worker 3 (PID: 9894) booted in 0.0s, phase: 1
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Stopping 9831 for phased upgrade...
[9814] - TERM sent to 9831...
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Worker 4 (PID: 9895) booted in 0.0s, phase: 1
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
**[9814] - Stopping 9832 for phased upgrade...**
[9814] - TERM sent to 9832...
...

Expected behavior
Once worker 0 is restarted, we should send the ping "p" message to its thread as well, rather than waiting until all the workers are up and running.
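The expected behavior can be sketched as follows (hypothetical code, not Puma's): ping every booted worker on each supervision tick, independent of the other workers' boot progress, so a freshly restarted worker 0 checks in immediately instead of waiting for the whole phased restart to finish.

```ruby
# Hypothetical sketch of the expected supervision behavior: each booted
# worker is pinged on every tick, regardless of other workers' state.
workers = [
  { index: 0, booted: true,  last_checkin: Time.now - 70 },  # restarted worker 0
  { index: 1, booted: false, last_checkin: Time.now - 70 },  # sibling still booting
]

def supervise!(workers, now: Time.now)
  workers.each do |w|
    next unless w[:booted]    # workers still booting are not pinged
    w[:last_checkin] = now    # "p" sent and the check-in recorded
  end
end

supervise!(workers)
timed_out = workers.select { |w| w[:booted] && Time.now - w[:last_checkin] > 60 }
timed_out  # => [] — worker 0 has checked in; worker 1 is still booting, so exempt
```

Under this scheme the restarted worker 0 would never accumulate 60 seconds without a check-in, avoiding the spurious termination shown in the logs.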

Desktop:

  • OS: Mac
  • Puma Version: 5.5.2
Member

dentarg commented Jan 4, 2022

Sounds like #2470

See #2470 (comment) for a workaround

Author

vitiokss commented Jan 5, 2022

Yes, it might be related, but the workaround in this case might just hide other problems.

Do you have an idea why we trigger the server restart for the base worker (worker 0) only at the end, once all servers have booted?

@fork_writer << "0\n"
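As context for the question above, here is a heavily hedged, schematic illustration (not Puma's actual control loop) of how a single loop can starve worker 0's check-in: if re-fork requests are drained before the next ping is handled, the ping is postponed until the entire restart queue is empty.

```ruby
# Schematic only: a loop that drains all pending fork requests before
# sending the next ping postpones that ping behind every re-fork.
fork_requests = Queue.new
12.times { |i| fork_requests << i }  # 12 siblings waiting to be re-forked

processed = []
# Each pop stands in for a (slow) fork of a replacement worker.
processed << fork_requests.pop until fork_requests.empty?
ping_sent = true  # the ping only goes out after the queue has drained

processed.size  # => 12
```

Whether this matches Puma's real scheduling is an open question here; the fix that eventually closed this issue (#3225) is the authoritative answer.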

Member

Mentioning @wjordan in case he has bandwidth

Contributor

joshuay03 commented Sep 4, 2023

Was able to replicate with a test on macOS:

  # Ran with an increased timeout as this takes ~65 seconds, default is 45
  # TEST_CASE_TIMEOUT=80 bundle exec m test/test_integration_cluster.rb:290

  def test_fork_worker_phased_restart
    cli_server "test/rackup/hello.ru", config: <<~RUBY
      workers 13 # minimum workers I needed to replicate consistently
      fork_worker 0
    RUBY

    while (line = @server.gets).include?("phase: 0")
      p line
      # wait till all the workers are up before restarting
      break if line.include?("Worker 12")
    end

    Process.kill :USR1, @pid

    while (line = @server.gets)
      p line
      refute line.include?("Terminating timed out worker")
      break if line.include?("Worker 12")
    end
  end

Logs without the assertion:

"[18135] - Worker 0 (PID: 18138) booted in 0.0s, phase: 0\n"
"[18135] - Worker 1 (PID: 18139) booted in 0.0s, phase: 0\n"
"[18135] - Worker 2 (PID: 18140) booted in 0.0s, phase: 0\n"
"[18135] - Worker 3 (PID: 18141) booted in 0.01s, phase: 0\n"
"[18135] - Worker 4 (PID: 18142) booted in 0.01s, phase: 0\n"
"[18135] - Worker 5 (PID: 18143) booted in 0.01s, phase: 0\n"
"[18135] - Worker 6 (PID: 18144) booted in 0.01s, phase: 0\n"
"[18135] - Worker 7 (PID: 18145) booted in 0.01s, phase: 0\n"
"[18135] - Worker 8 (PID: 18146) booted in 0.01s, phase: 0\n"
"[18135] - Worker 9 (PID: 18147) booted in 0.01s, phase: 0\n"
"[18135] - Worker 10 (PID: 18148) booted in 0.01s, phase: 0\n"
"[18135] - Worker 11 (PID: 18149) booted in 0.01s, phase: 0\n"
"[18135] - Worker 12 (PID: 18150) booted in 0.01s, phase: 0\n"

"[18135] - Starting phased worker restart, phase: 1\n"
"[18135] + Changing to /Users/{redacted_directory_cause_this_is_the_internet}/puma\n"
"[18135] - Stopping 18138 for phased upgrade...\n"
"[18135] - TERM sent to 18138...\n"
"[18135] - Stopping 18138 for phased upgrade...\n"
"[18135] - Worker 0 (PID: 18161) booted in 0.01s, phase: 1\n"
"[18135] - Stopping 18139 for phased upgrade...\n"
"[18135] - TERM sent to 18139...\n"
"[18135] - Worker 1 (PID: 18169) booted in 0.0s, phase: 1\n"
"[18135] - Stopping 18140 for phased upgrade...\n"
"[18135] - TERM sent to 18140...\n"
"[18135] - Worker 2 (PID: 18176) booted in 0.01s, phase: 1\n"
"[18135] - Stopping 18141 for phased upgrade...\n"
"[18135] - TERM sent to 18141...\n"
"[18135] - Worker 3 (PID: 18181) booted in 0.0s, phase: 1\n"
"[18135] - Stopping 18142 for phased upgrade...\n"
"[18135] - TERM sent to 18142...\n"
"[18135] - Worker 4 (PID: 18188) booted in 0.0s, phase: 1\n"
"[18135] - Stopping 18143 for phased upgrade...\n"
"[18135] - TERM sent to 18143...\n"
"[18135] - Worker 5 (PID: 18195) booted in 0.0s, phase: 1\n"
"[18135] - Stopping 18144 for phased upgrade...\n"
"[18135] - TERM sent to 18144...\n"
"[18135] - Worker 6 (PID: 18203) booted in 0.01s, phase: 1\n"
"[18135] - Stopping 18145 for phased upgrade...\n"
"[18135] - TERM sent to 18145...\n"
"[18135] - Worker 7 (PID: 18208) booted in 0.0s, phase: 1\n"
"[18135] - Stopping 18146 for phased upgrade...\n"
"[18135] - TERM sent to 18146...\n"
"[18135] - Worker 8 (PID: 18215) booted in 0.01s, phase: 1\n"
"[18135] - Stopping 18147 for phased upgrade...\n"
"[18135] - TERM sent to 18147...\n"
"[18135] - Worker 9 (PID: 18221) booted in 0.0s, phase: 1\n"
"[18135] - Stopping 18148 for phased upgrade...\n"
"[18135] - TERM sent to 18148...\n"
"[18135] - Worker 10 (PID: 18229) booted in 0.0s, phase: 1\n"
"[18135] - Stopping 18149 for phased upgrade...\n"
"[18135] - TERM sent to 18149...\n"
"[18135] - Worker 11 (PID: 18234) booted in 0.01s, phase: 1\n"
"[18135] - Stopping 18150 for phased upgrade...\n"
"[18135] - TERM sent to 18150...\n"
"[18135] ! Terminating timed out worker (worker failed to check in within 60 seconds): 18161\n"  # 👈🏽
"[18135] - Worker 0 (PID: 18244) booted in 0.01s, phase: 1\n"
"[18135] - Worker 12 (PID: 18245) booted in 0.02s, phase: 1\n"

I'll dive into the implementation when I get a chance.

joshuay03 added a commit to joshuay03/puma that referenced this issue Sep 15, 2023
joshuay03 added a commit to joshuay03/puma that referenced this issue Sep 15, 2023
MSP-Greg pushed a commit to MSP-Greg/puma that referenced this issue Sep 26, 2023
joshuay03 added a commit to joshuay03/puma that referenced this issue Sep 26, 2023
nateberkopec pushed a commit that referenced this issue Sep 27, 2023
* [Fix #2786] Worker 0 timing out during phased restart

* test/test_integration_cluster.rb - fixup for logging changes

* Add test which simulates worker 0 timeout

* Bump workers, reduce timeout for more realistic but also faster test

* Use server log consumption with less blocking in test

Co-authored-by: MSP-Greg <Greg.mpls@gmail.com>

---------

Co-authored-by: MSP-Greg <Greg.mpls@gmail.com>