
[close #1577] Negative Backpressure Metric #1579

Merged 3 commits into master on Jun 14, 2018
Conversation

schneems
Contributor

@schneems commented May 4, 2018

This PR introduces the `pool_capacity` stat, which can be used as a “negative backpressure metric”.

What does a “negative backpressure metric” mean? It means that when this number is low, we have a higher amount of backpressure. When it is zero, our worker has no ability to process additional requests. This information could be used to scale out by adding additional servers. When that happens, the requests should be re-distributed across the servers, and the “pool capacity” number for each individual server should go up.
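For a rough illustration (this snippet is not part of the diff; the exact JSON shape depends on the Puma version and on single vs. cluster mode, and the control app setup here is an assumption), the stat can be read from Puma's control app:

```ruby
require "net/http"
require "json"

# Assumed setup (not from this PR): a Puma control app, e.g. in the config:
#   activate_control_app "tcp://127.0.0.1:9293", auth_token: "secret"
# The JSON shape below (top-level in single mode, nested under
# worker_status/last_status in cluster mode) can vary between Puma versions.
def pool_capacities(host: "127.0.0.1", port: 9293, token: "secret")
  body  = Net::HTTP.get(URI("http://#{host}:#{port}/stats?token=#{token}"))
  stats = JSON.parse(body)

  if stats["worker_status"] # cluster mode: one entry per worker
    stats["worker_status"].map { |w| w.dig("last_status", "pool_capacity") }
  else                      # single mode
    [stats["pool_capacity"]]
  end
end

puts pool_capacities.inspect # e.g. [5, 5] when idle, [0, 0] when saturated
```

A value of 0 for a worker means that worker cannot take on additional requests right now.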
schneems added a commit to heroku/barnes that referenced this pull request May 4, 2018
@mimen

mimen commented Jun 14, 2018

@schneems Is this ready to be merged?

@schneems
Contributor Author

Yep!

schneems merged commit 5a7d884 into master on Jun 14, 2018
@jjb
Contributor

jjb commented Jun 14, 2018

Sweet! A few questions:

  • Where did this idea come from? Discussion in another ticket/channel/IRL?
  • Does this resolve the "we can't access a meaningful metric given the current design" conundrum discussed in Accurate Backpressure Metrics from Puma Server #1577, or is it more of a stopgap solution?
  • What is the highest value that this can be?
  • (related to above) How (if at all) is it affected by 0:N vs. N:N thread configuration?

@schneems
Contributor Author

Where did this idea come from? Discussion in another ticket/channel/IRL? Does this resolve the "we can't access a meaningful metric given the current design" conundrum discussed in #1577, or is it more of a stopgap solution?

I think this is the thing™ that “fixes” #1577. Though if you disagree, I'm happy to re-open.

Ahh, I see I missed some of your comments over there, sorry about that. I had intended conversation about this specific interface to happen on this PR. I'm using this in production right now and it is providing good feedback.

What is the highest value that this can be?

That's already been asked; I can add it in as another metric. Seems worthwhile. Per worker it would be max-threads. It's a good idea to let the stat tell us that directly. While I'm doing that I can also expose the current number of threads per worker; the two values are related.

(related to above) How (if at all) is it affected by 0:N vs. N:N thread configuration?

Going down to 0 in "pool capacity" is bad in the N:N config. There may be a race condition in the 0:N case where every worker's capacity drops to zero (because you've not yet hit N threads), the stat fires and reports that you have 0 capacity, and then each worker dynamically creates a new thread, so the capacity would actually be 1 times the number of processes.

I think this is mitigated by the fact that we would need to use a stream of this metric rather than just one value. I.e. even if it's showing 0 right now, under the same sustained load it would show a positive number the next time it reports. Even if we took some scaling action based on the 0 indicator, a future stream of positive numbers would perhaps indicate we could safely scale back down.
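To sketch what I mean by a stream (this reuses the hypothetical `pool_capacities` helper from the sketch in the description; the window and interval values are made up):

```ruby
# Sketch: act only on sustained zero capacity, not a single momentary 0,
# which mitigates the 0:N race described above. `pool_capacities` is the
# hypothetical helper from the earlier sketch; WINDOW/INTERVAL are made up.
WINDOW   = 6   # samples to keep
INTERVAL = 10  # seconds between samples
samples  = []

loop do
  samples << pool_capacities.sum       # total spare capacity across workers
  samples.shift while samples.size > WINDOW

  if samples.size == WINDOW && samples.all?(&:zero?)
    puts "sustained zero pool_capacity -- consider scaling out"
  end

  sleep INTERVAL
end
```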

@sj26
Contributor

sj26 commented Aug 14, 2018

Yesss, thank you!

Now we can publish a "Utilization" metric for sensible autoscaling, at least in the N:N configuration. 💯
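Something like this is what we have in mind (a sketch; here `max_threads` comes from our own Puma config, since the separate max-threads stat mentioned above wasn't available yet):

```ruby
# Sketch: derive a utilization percentage from pool_capacity. In the N:N
# configuration, busy threads per worker = max_threads - pool_capacity.
def utilization(pool_capacity:, max_threads:)
  busy = max_threads - pool_capacity
  (busy.to_f / max_threads * 100).round(1)
end

utilization(pool_capacity: 5, max_threads: 5) # => 0.0   (completely idle)
utilization(pool_capacity: 1, max_threads: 5) # => 80.0
utilization(pool_capacity: 0, max_threads: 5) # => 100.0 (no spare capacity)
```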

@schneems
Contributor Author

Let me know what you think!

The other major question to answer with metrics is "am I using a good number for my thread count?".

I'm not totally sure about that one. My current best suggestion is to pick a magic number and live with it. Alternatively, adjust it up or down and note whether the average response time increases or decreases.

I guess I want some combination of a CPU metric (we want to be close to 100% utilization) and some kind of a "contention" metric. For example, you can certainly saturate an app with 1000 threads, but your contention will be through the roof.

Maybe contention could be time spent idle per thread. I don't know if it's possible to get that from Ruby; it might require a patch to ruby/ruby, and it wouldn't be useful for at least a year until that gets released.
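The closest thing I can think of without a Ruby patch is a per-request proxy rather than true per-thread idle time (sketch only; it relies on `Process::CLOCK_THREAD_CPUTIME_ID`, which not every platform exposes):

```ruby
# Sketch: a rough contention proxy -- ratio of thread CPU time to wall time
# around a block of work. Requires Process::CLOCK_THREAD_CPUTIME_ID (Linux).
# A ratio near 0 means the thread spent most of its time waiting (IO, GVL,
# locks); near 1 means it was actually running on the CPU.
def cpu_to_wall_ratio
  wall_before = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  cpu_before  = Process.clock_gettime(Process::CLOCK_THREAD_CPUTIME_ID)
  yield
  wall = Process.clock_gettime(Process::CLOCK_MONOTONIC) - wall_before
  cpu  = Process.clock_gettime(Process::CLOCK_THREAD_CPUTIME_ID) - cpu_before
  wall.zero? ? 1.0 : (cpu / wall).round(3)
end

puts cpu_to_wall_ratio { sleep 0.25 }                  # mostly waiting, near 0
puts cpu_to_wall_ratio { 100_000.times { |i| i * i } } # mostly CPU, near 1
```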

I'm open to other thoughts/suggestions. Maybe we should open another issue on it for some brainstorming.

@jjb
Contributor

jjb commented Aug 14, 2018

I've been thinking about this problem for years. Ideally one would like to try different thread counts under similar circumstances and see how performance and CPU load are affected (assuming that processes are scaled up to use available ram, and the amount of ram more threads take is not an issue).

But that's hard to do because it's hard to get directly comparable traffic circumstances, and also cumbersome to schedule the experiment and directly compare results.

But... I just now had this idea: one could have, on the same machine, Puma processes with different thread counts. Then the behavior of each can be compared. Most important is probably simply throughput. If a process with 4 threads serves almost 2 times as many requests as a process with 2 threads, 4 threads is worth the DB connections. If a process with 8 threads serves only 20% more requests than a process with 4 threads, 8 threads is too many and a waste of DB connections (of course one would want to have enough incoming traffic to feel confident there was a sufficient test... and maybe having enough would require scaling down the number of processes for some period of time until there is a tad bit of backpressure).

Doing this would require that the logic which distributes requests to processes is very accurate in its determination of whether a process has available capacity. I don't know to what extent this is possible.

If it is possible, then we can add features to Puma to facilitate this sort of experiment.

We could even eventually create some sort of autoscaling.

@schneems
Contributor Author

Interesting. That could be really cool. There are lots of apps that can't run that many workers, though. I also worry about large apps that might need to tightly control DB connection counts.

I think this is a good experiment. We could maybe manually do this in an "on_worker_boot" block of an app, though we would need a way to determine the thread multiplier for each process.

I'm not sure how to log and record throughput of each individual worker in that scenario.
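One way to at least get the raw data (a sketch with a hypothetical Rack middleware; aggregating by PID would happen in whatever log or metrics pipeline you already use):

```ruby
# Sketch: a hypothetical Rack middleware that tags each request with the
# worker's PID and current live thread count so that per-worker throughput
# can be aggregated downstream (logs, StatsD, etc.).
class WorkerThroughputTagger
  def initialize(app)
    @app = app
  end

  def call(env)
    status, headers, body = @app.call(env)
    # Thread.list includes every live thread in the process, not only Puma's
    # request threads, so treat the count as an approximation.
    $stdout.puts "worker_pid=#{Process.pid} threads=#{Thread.list.size} path=#{env['PATH_INFO']}"
    [status, headers, body]
  end
end

# In config.ru (hypothetical):
#   use WorkerThroughputTagger
#   run MyApp
```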

Doing this would require that the logic which distributes requests to processes is very accurate in its determination of if a process has available capacity. I don't know to what extent this is possible.

Puma 3.12 does a good job of this AFAIK after a recent patch. You can read the docs about how this all works in #1576.

Let me know if you've got any other ideas. The value of a brainstorming session usually comes from the number of ideas. With quantity comes quality.
