[close #1577] Negative Backpressure Metric #1579
Conversation
This PR introduces the `pool_capacity` stat, which can be used as a “negative backpressure metric”. What does a “negative backpressure metric” mean? It means that when this number is low, we have a higher amount of backpressure. When it is zero, our worker has no ability to process additional requests. This information could be used to scale out by adding additional servers. When that happens, requests should be redistributed across the servers, and the “pool capacity” number for each individual server should go up.
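As a sketch of how a monitoring script might consume this stat: Puma's control server exposes stats as JSON, and in clustered mode each worker reports a `last_status` hash containing `pool_capacity`. The payload below is hypothetical (made-up values, trimmed to the relevant fields), but the field names follow this PR:

```ruby
require 'json'

# Hypothetical stats payload, roughly the shape Puma's control server
# returns in clustered mode (values are made up for illustration).
stats_json = '{"workers":2,"worker_status":[' \
             '{"last_status":{"backlog":0,"running":5,"pool_capacity":3}},' \
             '{"last_status":{"backlog":1,"running":5,"pool_capacity":0}}]}'

stats = JSON.parse(stats_json)

# Sum pool_capacity across workers: the total number of additional
# requests this instance can take on right now.
total_capacity = stats["worker_status"].sum do |w|
  w["last_status"]["pool_capacity"]
end

puts "total pool_capacity: #{total_capacity}" # low number => high backpressure
```

A value trending toward zero across all workers would be the signal to scale out.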
Related to puma/puma#1579
@schneems Is this ready to be merged?
Yep!
sweet! a few questions
I think this is the thing™ that "fixes" #1577. Though if you disagree, I'm happy to re-open. Ahh, I see I missed some of your comments over there, sorry about that. I had intended conversation about this specific interface to happen on this PR. I'm using this in production right now and it is providing good feedback.
That's already been asked; I can add that in as another metric. Seems worthwhile. Per worker it would be max-threads. It's a good idea to let the stat tell us that directly. While I'm doing that I can also expose the current number of threads per worker. The two values are related.
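With max-threads exposed alongside `pool_capacity`, a utilization percentage falls out of simple arithmetic. A minimal sketch, assuming both values are available per worker (the `utilization` helper itself is illustrative, not part of Puma):

```ruby
# Hypothetical helper: derive a utilization percentage from the two
# stats discussed above. A worker with max_threads 5 and pool_capacity 3
# has 2 threads busy, i.e. 40% utilized.
def utilization(max_threads:, pool_capacity:)
  busy = max_threads - pool_capacity
  (busy.to_f / max_threads * 100).round(1)
end

puts utilization(max_threads: 5, pool_capacity: 3) # 2 of 5 busy
puts utilization(max_threads: 5, pool_capacity: 0) # saturated worker
```

An autoscaler could then target a utilization band (say 50–80%) rather than reacting to raw capacity numbers.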
Going down to 0 in "pool capacity" is bad in the N:N config. There may be a race condition in the 0:N case where all your workers drop to zero (because you've not yet hit N threads), the stat fires and reports that you have 0 capacity, and then they all dynamically create a new thread, so the actual capacity would be 1 times the number of processes. I think this is mitigated by the fact that we would need to use a stream of this metric rather than just one value. I.e. even if it's showing 0 right now, under the same sustained load it would show a positive number the next time it reports. Even if we took some scaling action based on the 0 indicator, a future stream of positive numbers would perhaps indicate we could safely scale back down.
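The "use a stream, not one value" mitigation could look like this: only treat zero capacity as a scale-out signal when a whole window of recent samples is zero, so the transient 0:N race described above is ignored. This is an illustrative sketch (the `CapacityWindow` class and window size are assumptions, not anything in Puma):

```ruby
# Illustrative sketch: keep a rolling window of pool_capacity samples
# and only signal "scale out" when the entire window is zero, so a
# single transient zero reading does not trigger scaling.
class CapacityWindow
  def initialize(size)
    @size = size
    @samples = []
  end

  def record(capacity)
    @samples << capacity
    @samples.shift if @samples.length > @size
  end

  def scale_out?
    @samples.length == @size && @samples.all?(&:zero?)
  end
end

window = CapacityWindow.new(3)
[0, 1, 0].each { |c| window.record(c) }  # transient zero: no action
puts window.scale_out?                   # false
[0, 0, 0].each { |c| window.record(c) }  # sustained zeros
puts window.scale_out?                   # true
```

The same window, inverted (all samples positive), could drive the scale-down decision mentioned above.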
Yesss, thank you! Now we can publish a "Utilization" metric for sensible autoscaling, at least in the N:N configuration. 💯
Let me know what you think! The other major question to answer with metrics is "am I using a good number for my thread count?". I'm not totally sure about that one. My current best suggestion is to pick a magic number and live with it. Alternatively, adjust it up or down and note if the average response time increases or decreases. I guess I want some combination of a CPU metric (want to be close to 100% utilization) but also some kind of a "contention" metric. For example, you can certainly saturate an app with 1000 threads, but your contention will be through the roof. Maybe contention could be time spent idle per thread. I don't know if it's possible to get that from Ruby; it might require a patch to ruby/ruby and wouldn't be useful for at least a year until it gets released. I'm open to other thoughts/suggestions. Maybe we should open another issue on it for some brainstorming.
I've been thinking about this problem for years. Ideally one would like to try different thread counts under similar circumstances and see how performance and CPU load are affected (assuming that processes are scaled up to use available RAM, and the amount of RAM more threads take is not an issue). But that's hard to do because it's hard to get directly comparable traffic circumstances, and also cumbersome to schedule the experiment and directly compare results.

But... I just now had this idea: one could have, on the same machine, Puma processes with different thread counts. Then the behavior of each can be compared. Most important is probably simply throughput. If a process with 4 threads serves almost 2 times as many requests as a process with 2 threads, 4 threads is worth the DB connections. If a process with 8 threads serves only 20% more requests than a process with 4 threads, 8 threads is too many and a waste of DB connections (of course one would want to have enough incoming traffic to feel confident there was a sufficient test... and maybe having enough would require scaling down the number of processes for some period of time until there is a tad bit of backpressure).

Doing this would require that the logic which distributes requests to processes is very accurate in its determination of whether a process has available capacity. I don't know to what extent this is possible. If it is possible, then we can add features to Puma to facilitate this sort of experiment. We could even eventually create some sort of autoscaling.
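The throughput comparison above is just a ratio check, which could be automated once per-process request counts are available. A back-of-envelope sketch using the comment's example numbers (the helper and threshold are hypothetical, not a Puma feature):

```ruby
# Illustrative check: when doubling a process's thread count, is the
# throughput gain big enough to justify doubling its DB connections?
# gain_threshold is an arbitrary assumed cutoff for "worth it".
def worth_doubling?(reqs_before, reqs_after, gain_threshold: 1.5)
  (reqs_after.to_f / reqs_before) >= gain_threshold
end

# 2 -> 4 threads serving ~1.9x the requests: worth the connections.
puts worth_doubling?(100, 190)
# 4 -> 8 threads serving only ~1.2x the requests: a waste.
puts worth_doubling?(190, 228)
```

Running several processes side by side and feeding their counts through a check like this is essentially the experiment described above, minus the hard part: making request distribution fair enough that the counts are comparable.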
Interesting. That could be really cool. There are lots of apps that can't run that many workers. I also worry about large apps that might need to tightly control DB connection counts. I think this is a good experiment. We could maybe do this manually in an `on_worker_boot` block of an app, though we would need a way to determine the thread multiplier for each process. I'm not sure how to log and record throughput of each individual worker in that scenario.
Puma 3.12 does a good job of this AFAIK after a recent patch. You can read the docs about how this all works in #1576. Let me know if you've got any other ideas. The value of a brainstorming session usually comes from the number of ideas. With quantity comes quality.