Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Experiment: Fork from existing worker process to optimize copy-on-write performance #2099

Merged
merged 3 commits into from May 11, 2020

Conversation

@wjordan
Copy link
Contributor

wjordan commented Jan 4, 2020

This is still rather experimental, but I thought I'd share this work in progress to share preliminary results and spark discussion on whether this optimization seems at all interesting to anyone else to pursue further.

Description

This PR adds a config option (fork_worker) to spawn new worker processes by forking an existing worker process instead of the master process. It also adds a refork command (currently implemented via a URG signal) which re-forks all processes in the cluster from this original worker.

Because the existing worker has been serving live requests, forking new workers from this fully-loaded process can potentially reduce memory usage by further optimizing copy-on-write performance across processes, as a step beyond what the preload_app! directive accomplishes at initialization.

Demonstration

The ideal use-case for this technique involves a Rack application that loads a lot of persistent (non-garbage-collected) content into memory at runtime, instead of at initialization-time where preload_app! is able to load it before the initial fork. (Think of dynamically generated data or code that loads only after particular user inputs, or an in-memory object cache, etc..)

Here's a minimal Rack application which demonstrates the optimization:

# config.ru
run lambda { |_env|
  $big_object ||= Array.new(10_000_000) {Object.new}
  [200, {}, ["Hello world"]].tap{3.times{GC.start}}
}

# config/puma.rb
workers 8
state_path 'puma.state'
activate_control_app
daemonize
fork_worker

To test and measure 'real' memory usage, compare the total pss across all workers with a good amount of request traffic, before and after re-forking cluster workers using the added refork command.

# Total pss of all workers
function pss() {
  bundle exec pumactl stats | ruby -rjson -e '
    puts JSON.parse(ARGF.readlines.last)["worker_status"].map {|w| w["pid"]}.
      sum{|p| File.read("/proc/#{p}/smaps_rollup").match(/Pss:\s+(\d+)/)[1].to_i}
  '
}

$ bundle exec puma
$ wrk http://localhost:9292 -d 20 -H "Connection: close"
$ pss
> 4933788
$ bundle exec pumactl refork
$ wrk http://localhost:9292 -d 20 -H "Connection: close"
$ pss
> 1053686

Memory usage in this (obviously very contrived) example is reduced from ~5GB to ~1GB, or an ~80% reduction in total memory usage across 8 workers.

Obviously most real-world use-cases won't see such a dramatic difference, but this still might possibly be a useful optimization in some situations. I'll look forward to testing on some large Rack applications to see how useful this might be in practice.

Production demo

I ran a rough comparison test on a production Rails app running 40 workers on a single instance, measuring memory usage after serving traffic for 15 minutes after a restart (for 'Baseline'), then restarting workers and serving traffic for another 15 minutes (for 'Optimized'):

Baseline

Per-worker Usage (MB): 2170 Rss, 1530 Pss, 655 Shared

Optimized

Per-worker Usage (MB): 2168 Rss, 1177 Pss (-23%), 1041 Shared (+59%)

This demo shows a ~20% improvement in memory usage on this particular application. (Note this app is running Ruby 2.5, the compacting GC in Ruby 2.7 may enable even greater CoW efficiency.)

@nateberkopec nateberkopec added the perf label Feb 7, 2020
@@ -337,8 +351,7 @@ def stop
def stop_blocked
@status = :stop if @status == :run
wakeup!
@control.stop(true) if @control
Process.waitall
sleep 0.2 until @status == :halt

This comment has been minimized.

Copy link
@nateberkopec

nateberkopec Feb 7, 2020

Member

? Purpose?

This comment has been minimized.

Copy link
@wjordan

wjordan Feb 7, 2020

Author Contributor

Process.waitall no longer completes here, because during a re-fork one of the parent's children is the promoted process which doesn't exit but instead becomes the new parent.

Instead, the halt status is set after the promotion process is complete and all other worker-children have terminated.

@nateberkopec
Copy link
Member

nateberkopec commented Feb 7, 2020

Why did I never know about SIGURG? :headdesk:

I'm a little nervous about the way this restart is implemented, there are some random changes in here that I think will mess up restart state. That can be fixed with some test coverage ofc.

This introduces a full hard-restart, though, right? Who is going to be willing to trade memory savings for 60-120 seconds of hard downtime?

@@ -64,30 +63,19 @@ def start_control
control.min_threads = 0
control.max_threads = 1

case uri.scheme

This comment has been minimized.

Copy link
@nateberkopec

nateberkopec Feb 7, 2020

Member

Wait this is possible? Can we do this is a separate PR?

This comment has been minimized.

Copy link
@wjordan

wjordan Feb 7, 2020

Author Contributor

Extracted this change to #2111.

@@ -184,7 +184,6 @@ def run
when :exit
# nothing
end
@binder.close_unix_paths

This comment has been minimized.

Copy link
@nateberkopec

This comment has been minimized.

Copy link
@wjordan

wjordan Feb 7, 2020

Author Contributor

This reflects a bit of refactoring- this line caused the old parent to delete the Unix socket when shutting down during a promote operation, which in this PR needs to be skipped. The call here is made unnecessary by this change to Cluster#run, which includes a call to close_binder_listeners, which also cleans up any Unix sockets - but is skipped during promote.

Some of this refactoring could be extracted to a separate PR without the promote logic.

This comment has been minimized.

Copy link
@wjordan

wjordan Feb 8, 2020

Author Contributor

See #2112 for the extracted listener-closing refactoring part of this PR, which can be considered separately from the rest of the promote logic.

@wjordan
Copy link
Contributor Author

wjordan commented Feb 7, 2020

This introduces a full hard-restart, though, right? Who is going to be willing to trade memory savings for 60-120 seconds of hard downtime?

No, it just forks a new set of worker processes from one of the already-running workers, which is nearly instant, the app doesn't need to be reloaded from scratch at all.

@wjordan wjordan mentioned this pull request Feb 8, 2020
6 of 8 tasks complete
@wjordan wjordan force-pushed the wjordan:promote branch 2 times, most recently from 81624bd to 8872e88 Feb 11, 2020
@wjordan
Copy link
Contributor Author

wjordan commented Feb 11, 2020

I'm a little nervous about the way this restart is implemented, there are some random changes in here that I think will mess up restart state. That can be fixed with some test coverage ofc.

I completed a pass through to minimize the changes. It's now based off two smaller PRs (#2111, #2112) that have some related fixes extracted out, leaving the promote-specific logic in a single commit (8872e88) that's more concise. I've also added a simple integration-test ensuring the promote control command works properly on a cluster.

Copy link
Member

evanphx left a comment

I think the biggest problem with the approach is how someone would actually use it. I can't figure out how a production system would know when to issue the promote call. If we were going to do this, it would need to be automatic instead.


def promote
Process.kill "URG", @pid
Process.detach @pid

This comment has been minimized.

Copy link
@evanphx

evanphx Feb 12, 2020

Member

Process.detach should never be used. It's implementation is extremely bad as it spawns a Thread just to perform a wait in a loop.

@nateberkopec
Copy link
Member

nateberkopec commented Feb 12, 2020

I was thinking about testing this on CodeTriage with promotion after a few thousand requests

@wjordan wjordan force-pushed the wjordan:promote branch 2 times, most recently from 52f40e9 to 7c978e8 Feb 13, 2020
@wjordan
Copy link
Contributor Author

wjordan commented Feb 13, 2020

I did another pass on the implementation to simply further and make the diff smaller. Instead of a new 'promote' signal In addition to a new refork command now there's a new configuration option fork_worker that when enabled, causes new workers to be forked from worker-0 instead of from the master that join the existing cluster like normal.

This approach leverages the existing phased-restart logic to cycle workers one at a time for copy-on-write improvements with minimal impact to a live site.

@wjordan wjordan changed the title Experiment: Promote worker process to optimize copy-on-write performance Experiment: Fork from existing worker process to optimize copy-on-write performance Feb 13, 2020
@wjordan wjordan force-pushed the wjordan:promote branch from 7c978e8 to 873d089 Feb 21, 2020
@nateberkopec
Copy link
Member

nateberkopec commented Feb 25, 2020

@wjordan Are those 4 new commits you added to this branch required to make the original promote feature pass the tests?

If not, could you break all 4 of those into separate PRs?

@nateberkopec
Copy link
Member

nateberkopec commented Feb 25, 2020

So, we're also going to need a new "worker recycle" option to go with this feature. Essentially, after serving X requests, restart/refork all workers gradually.

This option should only be possible with reforking, and it should only happen once. This is so that people don't use it to create their own puma_worker_killer.

e.g. refork_after_request_count = 2000. I'll come up with a sensible default with some production testing on CodeTriage.

@wjordan
Copy link
Contributor Author

wjordan commented Feb 25, 2020

@wjordan Are those 4 new commits you added to this branch required to make the original promote feature pass the tests?
If not, could you break all 4 of those into separate PRs?

Those extra commits were from #2123 which I rebased this PR on, I thought some of that work may have fixed some of the test failures in this PR. I've rebased this work off master again and it looks like the tests are passing, so it might have been unrelated.

@nateberkopec nateberkopec added this to the 5.0.0 milestone Feb 27, 2020
@nateberkopec
Copy link
Member

nateberkopec commented Mar 20, 2020

Deploying to CodeTriage production to give it a shot.

As for automatic, I think what we should provide is probably an option to do this after a certain amount of requests have been processed (I'm not sure what # is required, but we'll see after testing in prod).

What I'm curious about is if production traffic might expose that these gains are temporary. CoW in Ruby is not great right now. I'm wondering if benchmarking against the same route over and over exaggerates the possible gain because the amount of object allocation is predictable and probably very low. Prod loads will let us know the answer.

@nateberkopec
Copy link
Member

nateberkopec commented Mar 20, 2020

So I tried it and worker 0 reforked? I think that's the opposite of what's supposed to happen?

@nateberkopec
Copy link
Member

nateberkopec commented Mar 20, 2020

Codetriage is also maybe not a great app to test this on because it has such low memory usage to begin with.

@nateberkopec
Copy link
Member

nateberkopec commented Mar 20, 2020

I think this is interesting enough that we can include it still as an "experiment" in 5.0 with the changes required as I described.

# `preload_app` is not required when this option is enabled.
#
# @note Cluster mode only.
def fork_worker(answer=true)

This comment has been minimized.

Copy link
@nateberkopec

nateberkopec Mar 20, 2020

Member

Should be false by default

@nateberkopec
Copy link
Member

nateberkopec commented Mar 21, 2020

Probably should also provide some guidance in docs about forking from a dirty state - connection pools, state, etc.

@wjordan
Copy link
Contributor Author

wjordan commented Apr 2, 2020

Thanks @nateberkopec for spending time testing out this branch! I'll try and find time soon to do another pass on this PR based on the feedback.

So I tried it and worker 0 reforked? I think that's the opposite of what's supposed to happen?

Yeah, worker 0 shouldn't reload itself when sent the refork command (URG signal), only when sent a normal phased_restart command (USR1 signal).

Here's the output I get when starting a test app and running pumactl refork in another terminal, this is what I'd expect to happen, a 'phased restart' reloading all workers except worker 0:

Puma log
[10074] Puma starting in cluster mode...
[10074] * Version 4.3.1 (ruby 2.6.0-p0), codename: Mysterious Traveller
[10074] * Min threads: 0, max threads: 16
[10074] * Environment: development
[10074] * Process workers: 8
[10074] * Phased restart available
[10074] * Listening on tcp://0.0.0.0:9292
[10074] Use Ctrl-C to stop
[10074] * Starting control server on unix:///tmp/puma-status-1585851969025-10074
[10074] - Worker 0 (pid: 10123) booted, phase: 0
[10074] - Worker 1 (pid: 10133) booted, phase: 0
[10074] - Worker 2 (pid: 10138) booted, phase: 0
[10074] - Worker 3 (pid: 10143) booted, phase: 0
[10074] - Worker 4 (pid: 10149) booted, phase: 0
[10074] - Worker 5 (pid: 10155) booted, phase: 0
[10074] - Worker 6 (pid: 10159) booted, phase: 0
[10074] - Worker 7 (pid: 10168) booted, phase: 0
[10074] - Starting phased worker restart, phase: 1
[10074] + Changing to [DIR]
[10074] - Stopping 10133 for phased upgrade...
[10074] - TERM sent to 10133...
[10074] - Worker 1 (pid: 10588) booted, phase: 1
[10074] - Stopping 10138 for phased upgrade...
[10074] - TERM sent to 10138...
[10074] - Worker 2 (pid: 10607) booted, phase: 1
[10074] - Stopping 10143 for phased upgrade...
[10074] - TERM sent to 10143...
[10074] - Worker 3 (pid: 10627) booted, phase: 1
[10074] - Stopping 10149 for phased upgrade...
[10074] - TERM sent to 10149...
[10074] - Worker 4 (pid: 10645) booted, phase: 1
[10074] - Stopping 10155 for phased upgrade...
[10074] - TERM sent to 10155...
[10074] - Worker 5 (pid: 10690) booted, phase: 1
[10074] - Stopping 10159 for phased upgrade...
[10074] - TERM sent to 10159...
[10074] - Worker 6 (pid: 10708) booted, phase: 1
[10074] - Stopping 10168 for phased upgrade...
[10074] - TERM sent to 10168...
[10074] - Worker 7 (pid: 10724) booted, phase: 1

What is the unexpected output you saw?

@nateberkopec
Copy link
Member

nateberkopec commented Apr 15, 2020

NVM my comment - I got the output you got.

@wjordan wjordan force-pushed the wjordan:promote branch 2 times, most recently from e10d26b to e5fe1dc May 1, 2020
@wjordan
Copy link
Contributor Author

wjordan commented May 2, 2020

Finished another pass on this feature. Changes:

  • Added some documentation for the new experimental feature
  • Added an argument to fork_workers to auto-refork once after worker-0 serves a specified number of requests (default 1000 for now).
  • Attempt to address issues with forking worker-0 from a dirty state. Instead of just forking as-is (which could leak open file-descriptors and undefined behavior if user application-code is running mid-fork), worker-0 will first shutdown its running Server object before forking, and start it up again only after the full (re)start is completed.
    • This has the slight (but acceptable for now?) drawback of n-1 worker availability during re-fork, and n-2 during a normal phased-restart (where it's normally n-1). I added a paragraph in the documentation covering this limitation.
Trigger refork for improved copy-on-write performance.
@wjordan wjordan force-pushed the wjordan:promote branch from e5fe1dc to 613ac8e May 2, 2020
wjordan added 2 commits May 2, 2020
Eliminate `instance variable @notify not initialized` warnings
@wjordan wjordan force-pushed the wjordan:promote branch from d819a6b to 0e2fec5 May 4, 2020
@nateberkopec nateberkopec merged commit 0e7d6fa into puma:master May 11, 2020
26 of 27 checks passed
26 of 27 checks passed
ubuntu 2.2
Details
build
Details
ubuntu 2.3
Details
ubuntu 2.4
Details
ubuntu 2.5
Details
ubuntu 2.6
Details
ubuntu 2.7
Details
ubuntu head
Details
ubuntu jruby
Details
ubuntu truffleruby-head
Details
macos 2.2
Details
macos 2.3
Details
macos 2.4 macos 2.4
Details
macos 2.5
Details
macos 2.6
Details
macos 2.7
Details
macos head
Details
macos jruby
Details
macos truffleruby-head
Details
windows 2.2
Details
windows 2.3
Details
windows 2.4
Details
windows 2.5
Details
windows 2.6
Details
windows 2.7
Details
windows mingw
Details
ubuntu jruby-head ubuntu jruby-head
Details
@nateberkopec
Copy link
Member

nateberkopec commented May 11, 2020

Thanks for your hard work here @wjordan - I'm really excited to test out this feature in the 5.0 beta.

@wjordan wjordan mentioned this pull request May 15, 2020
7 of 8 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.