Avoid dogpiling on start/hot restart #3236

pushcx · 2023-09-27T13:31:37Z

Starting one puma worker for my Rails app takes 5-7 seconds, and the prod server runs about 20. But if I start (or hot restart) 20 workers at a time, they dogpile the server and all take 90+ seconds to boot, leaving the site down the entire time.

I added a slightly janky on_worker_boot that calls sleep based on the worker index to stagger the load of starting workers. It works fine, but it also runs on a phased restart (USR1 signal) where it’s not needed. I can’t find a hook that runs on cold start and hot restart but not phased restart. I can’t see a straightforward way to access the Puma::Cluster instance to look at its @phased_restart, and the idea of digging around in ObjectSpace seems much jankier than appropriate.
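The staggering described above might look roughly like the following. This is a minimal sketch, not the code actually in use: the `stagger_delay` helper, its `batch_size` and `boot_seconds` parameters, and the idea of booting in CPU-sized batches are assumptions for illustration. It also assumes that `on_worker_boot` yields the worker index to its block.

```ruby
# Hypothetical helper (not part of Puma): seconds a worker should wait
# before loading the app, so that only `batch_size` workers boot at once.
def stagger_delay(index, batch_size: 4, boot_seconds: 7)
  (index / batch_size) * boot_seconds
end

# Possible use in config/puma.rb, assuming the hook yields the worker index:
#
#   on_worker_boot { |index| sleep stagger_delay(index) }

# Delays for 20 workers booted four at a time.
delays = (0...20).map { |index| stagger_delay(index) }
puts delays.uniq.inspect
puts "last worker starts loading after #{delays.max}s"
```

With these assumed numbers the last batch starts loading after 28 seconds. The hook still fires on every kind of restart, which is the limitation described above.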

So I have the suspicion that I’m going about this entirely the wrong way, but I don’t see another option. Any suggestions? I’d love to PR an addition to the docs to help anyone else with this problem. Thanks!


dentarg · 2023-09-27T14:58:08Z · Maintainer

> I added a slightly janky on_worker_boot that calls sleep based on the worker index to stagger the load of starting workers. It works fine, but it also runs on a phased restart (USR1 signal) where it’s not needed.

Maybe the information that it is (or isn't) a phased restart should be passed on to the hook – feel free to look into this and make a PR for it if it is possible :) Makes sense to me.

3 replies

pushcx Sep 27, 2023
Author

I guess I am also sort of implicitly asking: how are y’all not having this issue? Or is everyone else running their workers on multiple hosts with a load balancer in front, so their deployment sort of does a phased restart at a higher level, avoiding this problem?

dentarg Sep 27, 2023
Maintainer

Yeah, I'm not using phased restart anywhere myself. I have even suggested removing it from Puma: #3034

nateberkopec Sep 28, 2023
Maintainer

> Or is everyone else running their workers on multiple hosts with a load balancer in front

It's certainly more common now, with Heroku's preboot feature and increased complexity in deployment setups via Kubernetes.

MSP-Greg · 2023-09-27T16:24:14Z · Maintainer

In cluster.rb the Cluster#spawn_workers method:

puma/lib/puma/cluster.rb

Lines 84 to 86 in 252890c

    
      debug "Spawned worker: #{pid}"
      @workers << WorkerHandle.new(idx, pid, @phase, @options)
    end

If you add the following after line 85:

    unless @phased_restart
      sleep 0.5 until @workers.last.uptime > 5
      @workers.each { |w| w.boot! } # needed to not trigger boot timeout ? fix
    end

It will stagger worker creation. It needs more code, as it probably interferes with shutting down Puma, checks for whether workers booted, etc.

It could be an option like worker_restart_stagger or something like that. As to why others aren't having the issue, could be a lot of things...

EDIT: I set it to a long time, and requests were responded to. I also sent USR2 to it, and the workers staggered their start.

0 replies

nateberkopec · 2023-09-28T04:30:01Z · Maintainer

How many CPUs are on this server? I'm wondering why starting X workers at the same time is slower than starting 1. If you have X CPUs or more, this shouldn't be the case.

4 replies

pushcx Sep 29, 2023
Author

Four CPUs, 15-20 workers.

nateberkopec Oct 2, 2023
Maintainer

That would only be an appropriate configuration for a workload spending ~80% of its time waiting on I/O and not using Puma's thread pool.

Are you doing that much I/O? Do you have a thread pool configured? If so, with how many threads?
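The ~80% figure can be reproduced with some back-of-the-envelope arithmetic. The sketch below uses a deliberately simplified model, which is an assumption and not something Puma computes: each worker is single-threaded and alternates between using a CPU core and waiting on I/O. The helper names `implied_io_wait` and `workers_for` are hypothetical.

```ruby
# Simplified model (an assumption): each worker is single-threaded and
# alternates between using a CPU core and waiting on I/O.

# Fraction of time each worker must spend waiting on I/O for `workers`
# processes to share `cores` CPU cores without contending for them.
def implied_io_wait(cores, workers)
  1.0 - cores.to_f / workers
end

# Number of workers that keeps `cores` busy at a given I/O wait fraction.
def workers_for(cores, io_wait_fraction)
  (cores / (1.0 - io_wait_fraction)).round
end

puts implied_io_wait(4, 20).round(2) # 4 cores, 20 workers
puts implied_io_wait(4, 15).round(2) # 4 cores, 15 workers
puts workers_for(4, 0.8)
```

Under this model, 20 workers on 4 cores only fit if each worker spends about 80% of its time off the CPU. Anything more CPU-bound than that leaves workers queueing for cores.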

pushcx Oct 4, 2023
Author

We use the thread pool (5 per worker). Honestly, it’s mostly because I was recently forced to bump the VPS up to one with more RAM and I figured we might as well make use of it.

nateberkopec Oct 11, 2023
Maintainer

You're creating a footgun, though, for an overload situation: your Puma processes will ingest more work than you have CPU time to handle, causing requests to get very slow as they wait for the CPU to become available. This creates a difficult situation because your request queue times will not increase (at least not as quickly as they otherwise would): Puma is still starting to process requests, just processing them more slowly.

If you're using the thread pool, a 1:1 ratio of workers to CPU cores is best.
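In a Puma config file, that advice might translate to something like the fragment below. It is a sketch under stated assumptions: the file path is conventional, the thread count of 5 is the one mentioned earlier in this thread, and `Etc.nprocessors` reports the CPUs Ruby can see, which may differ from what a container or VPS plan actually allots.

```ruby
# config/puma.rb (sketch; adjust values for the actual host)
require "etc"

# One worker process per CPU core, following the 1:1 advice.
workers Etc.nprocessors

# Minimum and maximum threads per worker (5 per worker, as in this thread).
threads 5, 5
```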

Answer selected by dentarg
