
fork_worker worker termination during phased-restart #2786

Closed
vitiokss opened this issue Jan 4, 2022 · 4 comments · Fixed by #3225


vitiokss commented Jan 4, 2022

Describe the bug

During a phased restart with the fork_worker option enabled, worker 0 stops receiving the ping ("p") message. While the puma process is running, timeout_workers runs periodically to check that workers are responding. Because worker 0 no longer receives the "p" message, its @last_checkin value (in the WorkerHandle class) is never updated, so once the timeout is exceeded (default: 60 seconds) the master logs a Terminating timed out worker... error, kills the worker, and the loop starts again. We end up in a state where puma constantly restarts its workers.
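To make the failure mode concrete, here is a minimal sketch (not Puma's actual code) of the check-in bookkeeping described above: each ping updates @last_checkin, and the timeout check terminates any worker whose last check-in is older than the timeout. Class and method names mirror the ones mentioned in this report but the implementation is illustrative only.

```ruby
# Illustrative sketch of the @last_checkin / timeout interaction
# (hypothetical, not Puma's implementation).
class WorkerHandle
  attr_reader :index, :last_checkin

  def initialize(index, timeout: 60)
    @index = index
    @timeout = timeout
    @last_checkin = Time.now
  end

  # Called when the "p" (ping) message arrives from the worker.
  def ping!
    @last_checkin = Time.now
  end

  def ping_timeout
    @last_checkin + @timeout
  end

  # If no ping arrives, ping_timeout never moves forward and this
  # eventually becomes true, triggering "Terminating timed out worker".
  def timed_out?(now = Time.now)
    ping_timeout <= now
  end
end

w = WorkerHandle.new(0, timeout: 60)
w.timed_out?(Time.now + 61)  # => true: no ping for 61s, worker is killed
w.ping!
w.timed_out?(Time.now + 30)  # => false: a fresh check-in resets the clock
```

This is exactly the loop observed below: the worker is healthy, but because the "p" message never reaches it, @last_checkin goes stale and the master kills it.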

Puma config:

threads_count = ENV.fetch("RAILS_MAX_THREADS") { 5 }
threads threads_count, threads_count

port        ENV.fetch("PORT") { 3000 }

environment ENV.fetch("RAILS_ENV") { "development" }

preload_app! false

fork_worker 0

To Reproduce
It is quite easy to reproduce this with a "hello world" rack application. The only thing you need to do is run the application with a large number of workers; in my case 32 was more than enough to trigger the behavior.

Spin up the puma process:
bundle exec puma -C config/puma.rb -e development --threads 4:16 --workers 32 --tag test --pidfile tmp/puma.pid --state tmp/puma.state --dir . --control-url auto hello.ru

Trigger phased-restart:
pumactl -s tmp/puma.state phased-restart

Logs with additional puts statements

After w.ping!(result.sub(/^\d+/,'').chomp) in the when "p" branch [lib/puma/cluster.rb:448], I added:

p "worker: #{w.index}, pid: #{w.pid} received p msg!" if w.index == 0

After if !w.term? && w.ping_timeout <= Time.now [lib/puma/cluster.rb:518], I added:

p "timeout_workers: #{w.ping_timeout}, pid: #{w.pid}, booted: #{w.booted?}, index: #{w.index}, term: #{w.term?}" if w.index == 0

The log looks like this. It is easy to see that once the phased restart starts, the ping message is no longer sent to worker 0 and its ping_timeout value stops being updated, even though the worker is healthy and responding.

"timeout_workers: 2022-01-04 19:04:07 +0200, pid: 9827, booted: true, index: 0, term: false"
[9814] - Worker 31 (PID: 9858) booted in 0.07s, phase: 0
"timeout_workers: 2022-01-04 19:04:07 +0200, pid: 9827, booted: true, index: 0, term: false"
"worker: 0, pid: 9827 received p msg!"
"timeout_workers: 2022-01-04 19:04:07 +0200, pid: 9827, booted: true, index: 0, term: false"
"worker: 0, pid: 9827 received p msg!"
"timeout_workers: 2022-01-04 19:04:12 +0200, pid: 9827, booted: true, index: 0, term: false"
"worker: 0, pid: 9827 received p msg!"
"timeout_workers: 2022-01-04 19:04:17 +0200, pid: 9827, booted: true, index: 0, term: false"
"worker: 0, pid: 9827 received p msg!"
[9814] - Starting phased worker restart, phase: 1
[9814] + Changing to .
"timeout_workers: 2022-01-04 19:04:22 +0200, pid: 9827, booted: true, index: 0, term: false"
[9814] - Stopping 9827 for phased upgrade...
[9814] - TERM sent to 9827...
"timeout_workers: 2022-01-04 19:04:22 +0200, pid: 9827, booted: true, index: 0, term: true"
[9814] - Worker 0 (PID: 9878) booted in 0.02s, phase: 1
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Stopping 9828 for phased upgrade...
[9814] - TERM sent to 9828...
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Worker 1 (PID: 9891) booted in 0.0s, phase: 1
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Stopping 9829 for phased upgrade...
[9814] - TERM sent to 9829...
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Worker 2 (PID: 9892) booted in 0.0s, phase: 1
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Stopping 9830 for phased upgrade...
[9814] - TERM sent to 9830...
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Worker 3 (PID: 9894) booted in 0.0s, phase: 1
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Stopping 9831 for phased upgrade...
[9814] - TERM sent to 9831...
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
[9814] - Worker 4 (PID: 9895) booted in 0.0s, phase: 1
"timeout_workers: 2022-01-04 19:04:25 +0200, pid: 9878, booted: true, index: 0, term: false"
**[9814] - Stopping 9832 for phased upgrade...**
[9814] - TERM sent to 9832...
...

Expected behavior
Once worker 0 is restarted, we should send the ping "p" message to its thread as well, rather than waiting until all the workers are up and running.
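The expected behavior can be sketched as follows (hypothetical code, not Puma's): ping every booted worker on each supervision tick, independent of the other workers' boot progress, so a freshly restarted worker 0 checks in immediately instead of waiting for the whole phased restart to finish.

```ruby
# Hypothetical sketch of the expected supervision behavior: each booted
# worker is pinged on every tick, regardless of other workers' state.
workers = [
  { index: 0, booted: true,  last_checkin: Time.now - 70 },  # restarted worker 0
  { index: 1, booted: false, last_checkin: Time.now - 70 },  # sibling still booting
]

def supervise!(workers, now: Time.now)
  workers.each do |w|
    next unless w[:booted]    # workers still booting are not pinged
    w[:last_checkin] = now    # "p" sent and the check-in recorded
  end
end

supervise!(workers)
timed_out = workers.select { |w| w[:booted] && Time.now - w[:last_checkin] > 60 }
timed_out  # => [] — worker 0 has checked in; worker 1 is still booting, so exempt
```

Under this scheme the restarted worker 0 would never accumulate 60 seconds without a check-in, avoiding the spurious termination shown in the logs.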

Desktop:

  • OS: Mac
  • Puma Version: 5.5.2
Member

dentarg commented Jan 4, 2022

Sounds like #2470

See #2470 (comment) for a workaround

Author

vitiokss commented Jan 5, 2022

Yes, it might be related, but the workaround in this case might just hide other problems.

Do you have an idea why we trigger the server restart for the base worker (worker 0) only at the end, once all servers have booted?

@fork_writer << "0\n"
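As context for the question above, here is a heavily hedged, schematic illustration (not Puma's actual control loop) of how a single loop can starve worker 0's check-in: if re-fork requests are drained before the next ping is handled, the ping is postponed until the entire restart queue is empty.

```ruby
# Schematic only: a loop that drains all pending fork requests before
# sending the next ping postpones that ping behind every re-fork.
fork_requests = Queue.new
12.times { |i| fork_requests << i }  # 12 siblings waiting to be re-forked

processed = []
# Each pop stands in for a (slow) fork of a replacement worker.
processed << fork_requests.pop until fork_requests.empty?
ping_sent = true  # the ping only goes out after the queue has drained

processed.size  # => 12
```

Whether this matches Puma's real scheduling is an open question here; the fix that eventually closed this issue (#3225) is the authoritative answer.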

Member

Mentioning @wjordan in case he has bandwidth

Contributor

joshuay03 commented Sep 4, 2023

Was able to replicate with a test on macOS:

  # Ran with an increased timeout as this takes ~65 seconds, default is 45
  # TEST_CASE_TIMEOUT=80 bundle exec m test/test_integration_cluster.rb:290

  def test_fork_worker_phased_restart
    cli_server "test/rackup/hello.ru", config: <<~RUBY
      workers 13 # minimum workers I needed to replicate consistently
      fork_worker 0
    RUBY

    while (line = @server.gets).include?("phase: 0")
      p line
      # wait till all the workers are up before restarting
      break if line.include?("Worker 12")
    end

    Process.kill :USR1, @pid

    while (line = @server.gets)
      p line
      refute line.include?("Terminating timed out worker")
      break if line.include?("Worker 12")
    end
  end

Logs without the assertion:

"[18135] - Worker 0 (PID: 18138) booted in 0.0s, phase: 0\n"
"[18135] - Worker 1 (PID: 18139) booted in 0.0s, phase: 0\n"
"[18135] - Worker 2 (PID: 18140) booted in 0.0s, phase: 0\n"
"[18135] - Worker 3 (PID: 18141) booted in 0.01s, phase: 0\n"
"[18135] - Worker 4 (PID: 18142) booted in 0.01s, phase: 0\n"
"[18135] - Worker 5 (PID: 18143) booted in 0.01s, phase: 0\n"
"[18135] - Worker 6 (PID: 18144) booted in 0.01s, phase: 0\n"
"[18135] - Worker 7 (PID: 18145) booted in 0.01s, phase: 0\n"
"[18135] - Worker 8 (PID: 18146) booted in 0.01s, phase: 0\n"
"[18135] - Worker 9 (PID: 18147) booted in 0.01s, phase: 0\n"
"[18135] - Worker 10 (PID: 18148) booted in 0.01s, phase: 0\n"
"[18135] - Worker 11 (PID: 18149) booted in 0.01s, phase: 0\n"
"[18135] - Worker 12 (PID: 18150) booted in 0.01s, phase: 0\n"

"[18135] - Starting phased worker restart, phase: 1\n"
"[18135] + Changing to /Users/{redacted_directory_cause_this_is_the_internet}/puma\n"
"[18135] - Stopping 18138 for phased upgrade...\n"
"[18135] - TERM sent to 18138...\n"
"[18135] - Stopping 18138 for phased upgrade...\n"
"[18135] - Worker 0 (PID: 18161) booted in 0.01s, phase: 1\n"
"[18135] - Stopping 18139 for phased upgrade...\n"
"[18135] - TERM sent to 18139...\n"
"[18135] - Worker 1 (PID: 18169) booted in 0.0s, phase: 1\n"
"[18135] - Stopping 18140 for phased upgrade...\n"
"[18135] - TERM sent to 18140...\n"
"[18135] - Worker 2 (PID: 18176) booted in 0.01s, phase: 1\n"
"[18135] - Stopping 18141 for phased upgrade...\n"
"[18135] - TERM sent to 18141...\n"
"[18135] - Worker 3 (PID: 18181) booted in 0.0s, phase: 1\n"
"[18135] - Stopping 18142 for phased upgrade...\n"
"[18135] - TERM sent to 18142...\n"
"[18135] - Worker 4 (PID: 18188) booted in 0.0s, phase: 1\n"
"[18135] - Stopping 18143 for phased upgrade...\n"
"[18135] - TERM sent to 18143...\n"
"[18135] - Worker 5 (PID: 18195) booted in 0.0s, phase: 1\n"
"[18135] - Stopping 18144 for phased upgrade...\n"
"[18135] - TERM sent to 18144...\n"
"[18135] - Worker 6 (PID: 18203) booted in 0.01s, phase: 1\n"
"[18135] - Stopping 18145 for phased upgrade...\n"
"[18135] - TERM sent to 18145...\n"
"[18135] - Worker 7 (PID: 18208) booted in 0.0s, phase: 1\n"
"[18135] - Stopping 18146 for phased upgrade...\n"
"[18135] - TERM sent to 18146...\n"
"[18135] - Worker 8 (PID: 18215) booted in 0.01s, phase: 1\n"
"[18135] - Stopping 18147 for phased upgrade...\n"
"[18135] - TERM sent to 18147...\n"
"[18135] - Worker 9 (PID: 18221) booted in 0.0s, phase: 1\n"
"[18135] - Stopping 18148 for phased upgrade...\n"
"[18135] - TERM sent to 18148...\n"
"[18135] - Worker 10 (PID: 18229) booted in 0.0s, phase: 1\n"
"[18135] - Stopping 18149 for phased upgrade...\n"
"[18135] - TERM sent to 18149...\n"
"[18135] - Worker 11 (PID: 18234) booted in 0.01s, phase: 1\n"
"[18135] - Stopping 18150 for phased upgrade...\n"
"[18135] - TERM sent to 18150...\n"
"[18135] ! Terminating timed out worker (worker failed to check in within 60 seconds): 18161\n"  # 👈🏽
"[18135] - Worker 0 (PID: 18244) booted in 0.01s, phase: 1\n"
"[18135] - Worker 12 (PID: 18245) booted in 0.02s, phase: 1\n"

I'll dive into the implementation when I get a chance.

joshuay03 added a commit to joshuay03/puma that referenced this issue Sep 15, 2023
joshuay03 added a commit to joshuay03/puma that referenced this issue Sep 15, 2023
MSP-Greg pushed a commit to MSP-Greg/puma that referenced this issue Sep 26, 2023
joshuay03 added a commit to joshuay03/puma that referenced this issue Sep 26, 2023
nateberkopec pushed a commit that referenced this issue Sep 27, 2023
* [Fix #2786] Worker 0 timing out during phased restart

* test/test_integration_cluster.rb - fixup for logging changes

* Add test which simulates worker 0 timeout

* Bump workers, reduce timeout for more realistic but also faster test

* Use server log consumption with less blocking in test

Co-authored-by: MSP-Greg <Greg.mpls@gmail.com>

---------

Co-authored-by: MSP-Greg <Greg.mpls@gmail.com>