
Accurate Backpressure Metrics from Puma Server #1577

Closed · schneems opened this issue May 1, 2018 · 11 comments

schneems (Contributor) commented May 1, 2018

TL;DR: The existing backlog metric is flawed as a backpressure mechanism and may soon be basically meaningless with #1563.

I want to be able to measure whether a Ruby web app is under-provisioned for resources. One way to accomplish this is by measuring the number of fully formed requests that a server is not able to process.

Currently, as a proxy, I'm using Puma's "backlog" (which is different from the TCP backlog).

Backlog Behavior

For a request to appear in the backlog metric it must go through this flow:

  • A request comes into the server; perhaps it is queued at the socket
  • A Puma::Server instance gets notified that there is capacity in its thread pool
  • The Puma::Server accepts the request from the socket and puts it into the Reactor
  • The Reactor ensures that the header and body are fully formed
  • The Reactor passes the request to the thread pool
  • At this point the @todo array size has been incremented by one

If there are requests in @todo that have not yet been popped off and worked on, they show up as the "backlog".

At first this looks like an ideal metric, except that Puma::Server meters requests based on thread pool capacity. In an ideal world it will never try to add a request when there is no capacity. There are cases where two slower clients might get their requests accepted around the same time and create a temporary backlog; otherwise the backlog metric should be zero.
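To make the metering concrete, here is a minimal sketch, using an invented TinyPool class rather than Puma's real ThreadPool, of a pool that only receives work when a thread is idle. It shows why @todo (the backlog) stays at or near zero except during the brief hand-off window:

```ruby
require "thread"

# A minimal sketch (NOT Puma's actual ThreadPool; names are invented) of a
# pool that is only handed work when a thread is idle. Because the server
# side checks capacity? before accepting a connection, the @todo array
# almost never holds more than the one item about to be picked up.
class TinyPool
  def initialize(size)
    @todo    = []                 # queued work, reported as the "backlog"
    @waiting = 0                  # idle threads parked on the condvar
    @mutex   = Mutex.new
    @resume  = ConditionVariable.new
    size.times { spawn_thread }
  end

  # The "backlog" stat: work added but not yet popped by a thread.
  def backlog
    @mutex.synchronize { @todo.size }
  end

  # Spare-capacity check; a metered server only accepts a new request
  # from the socket when this is true.
  def capacity?
    @mutex.synchronize { @waiting > 0 }
  end

  def <<(work)
    @mutex.synchronize do
      @todo << work
      @resume.signal
    end
  end

  private

  def spawn_thread
    Thread.new do
      loop do
        work = @mutex.synchronize do
          while @todo.empty?
            @waiting += 1
            @resume.wait(@mutex)
            @waiting -= 1
          end
          @todo.shift
        end
        work.call
      end
    end
  end
end
```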

Possible solutions

1) TCP backlog metric

We could measure the number of requests that are queued at the TCP socket and report on that.

Unfortunately, there doesn't seem to be an API for getting this data. We can see the maximum size of the TCP backlog, but not how many clients are actually connected and awaiting processing. Also, even if we could, we would risk weird conditions: for example, if someone is hitting us with a slow-client attack, there could be tons of connections in the backlog that will never become full requests ready to be processed by the server.

2) 1 Reactor, many servers

Change the architecture so that instead of one reactor for every process, we have one reactor on the parent process (or another sidecar process). It tries to buffer requests as fast as possible with no waiting, and it can pass the fully buffered requests along via some mechanism. Perhaps a Unix domain socket, though I don't know what that would look like for Windows and JRuby (though they can't run in clustered mode anyway).

If we did this, then the backpressure metric would be requests that have been fully buffered by the reactor but have not been accepted by a cluster worker or thread pool, assuming we can measure that somehow.
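For a sense of what that hand-off could look like in Ruby, here is a hedged sketch using file-descriptor passing over a Unix domain socket (UNIXSocket#send_io / #recv_io). This is one possible mechanism only, not something Puma implements today:

```ruby
require "socket"

# Hedged sketch of the hand-off idea only; Puma does not do this today.
# A reactor process could pass an already-buffered client socket to a
# worker process over a Unix domain socket using FD passing. In practice
# the socketpair would be created before forking the workers.
reactor_end, worker_end = UNIXSocket.pair

# Reactor side, once a request's headers and body are fully buffered:
#   reactor_end.send_io(client_socket)

# Worker side:
#   client_socket = worker_end.recv_io
#   # hand client_socket to the local thread pool; anything buffered by the
#   # reactor but not yet accepted here would be the backpressure metric
```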

3) Any other ideas?

???

cc/ @jjb

jjb (Contributor) commented May 2, 2018

backing up a little: if formerly we considered @todo.size to be the number to watch because it caused waiting to happen, can't we now consider @todo.size - @waiting to be that number to watch? And since we formerly didn't watch @waiting at all even though it had the same effect on this loop, and now @waiting's standalone role has been removed, maybe @todo.size - @waiting is in fact more meaningful than standalone @todo.size was before?

(but I'll admit I didn't yet think deeply about your request lifecycle description-- I'm first simply comparing what we were looking at before to what we can look at now in the same place)
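Stated as code, the metric jjb is describing might look roughly like this. It is a sketch only: the method name is made up, and it reaches into the pool's @todo and @waiting instance variables purely for illustration:

```ruby
# Hypothetical "effective backlog": work queued beyond the number of idle
# threads. If the pool only accepts work when a thread is waiting, this
# should hover around zero.
def effective_backlog(pool)
  todo    = pool.instance_variable_get(:@todo).size
  waiting = pool.instance_variable_get(:@waiting)
  [todo - waiting, 0].max
end
```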

noahgibbs (Contributor) commented

Some of this is clearly subtle enough that I'm not going to fully understand and reflect it here. So: take my comments with a grain of salt. And some of it I'm restating just to make sure I understand it.

I agree on approach 1 that there doesn't seem to be a good mechanism to get the information you want from TCP (pending connections without accepting them) or other highly-relevant metrics (wait time before acceptance, requests that were rejected due to queue size.) And while you could maybe reimplement enough to do that, ew, don't. There are some attempts in that direction (Goliath, ZeroMQ.) But I don't see them, or that approach, helping you.

Hm. So approach 2 would do more work in the parent process (hard to scale), use more memory, have more of a single point of failure, and then have to pass the full-size request from master process to worker. In return, it would have one obvious place to handle backpressure and slow requests.

Right now, based on #1563, it sounds like you have a choice of a hard-to-reason-about working algorithm, or a simpler, easier-to-explain method that doesn't work well. You tried switching to the simpler method, then reverted it. Am I understanding that correctly?

Later the parent process of approach #2 might be a good choice for Guilds - multiple threads with multiple VM locks. But of course those aren't available yet.

SamSaffron commented

What we do at Discourse with Unicorn/NGINX is this:

https://github.com/discourse/prometheus_exporter/blob/master/lib/prometheus_exporter/middleware.rb#L70-L76

We set the HTTP_X_REQUEST_START header in NGINX and then have a metric of how long requests wait between NGINX and Unicorn.
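The header approach boils down to something like this simplified sketch (the middleware class name here is made up; the linked middleware.rb is the real implementation):

```ruby
# NGINX is configured with something like:
#   proxy_set_header X-Request-Start "t=${msec}";
# and this Rack middleware measures how long the request sat between NGINX
# and the app server before a worker picked it up.
class QueueTimeMetric
  def initialize(app)
    @app = app
  end

  def call(env)
    if (header = env["HTTP_X_REQUEST_START"])
      started   = header.sub("t=", "").to_f          # seconds since epoch
      queue_sec = [Time.now.to_f - started, 0.0].max
      # report queue_sec to your metrics backend here
    end
    @app.call(env)
  end
end
```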

This allows us a quite extensive amount of excellent alerting... We also use raindrops (https://bogomips.org/raindrops/), which gives us slightly cruder metrics (N queued).
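The cruder raindrops numbers come from its Linux listener stats, roughly as in this sketch (the listener address is an example; consult the raindrops docs for the exact API on your version):

```ruby
require "raindrops"

# Linux-only: raindrops reports, per listening address, how many connections
# are sitting in the kernel accept queue ("queued") versus already accepted
# ("active").
stats = Raindrops::Linux.tcp_listener_stats(["0.0.0.0:3000"])
stats.each do |addr, s|
  puts "#{addr}: queued=#{s.queued} active=#{s.active}"
end
```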

We use this kind of information for amazingly awesome features like: https://github.com/discourse/discourse/blob/master/lib/middleware/anonymous_cache.rb#L122-L131

We will force clients into an anonymous mode (and to use the blazing fast anonymous cache) if stuff starts queueing on a single route for logged-in users. E.g.: someone posts "OH NO, OUTAGE AT EVE ONLINE" on Reddit and suddenly a herd of logged-in users floods us.

I don't know if you require NGINX in your architecture or not, but fronting stuff with it simplifies this so so much.

schneems (Contributor, Author) commented May 3, 2018

@jjb

can't we now consider @todo.size - @waiting to be that number to watch?

The issue is in the implementation. @todo only grows when there are threads that are not doing work. The only time @todo should be higher than zero is during the slight race condition where a request has been added to the @todo array but not yet picked up by an idle worker thread. With this patch, @todo should always be small: usually zero, sometimes one. If we use @todo - @waiting then we should always get zero.

Previously that metric should also have been close to zero, but there was essentially a bug: when the above-mentioned race condition happened, it could let, say, 6 requests in before trying to process them. So even before this patch, the metric wasn't actually all that useful, or at least wasn't doing what I thought it was.

@SamSaffron Thanks for the info, that's helpful. I also didn't know you used Unicorn. Is Discourse threadsafe?

I don't know if you require NGINX in your architecture or not, but fronting stuff with it simplifies this so so much.

That's a good question and something to look into. We do not require NGINX in front of Puma. We talked about using it as a reactor at the last RailsConf (Phoenix) but punted on it.

I'm curious if @FooBarWidget has any insights. Is there a way to measure this kind of "backpressure" in Passenger? Can you get any meaningful metrics out of NGINX directly?

noahgibbs (Contributor) commented

I hope Discourse is threadsafe... I'm using it configured with a lot of Puma threads...

schneems (Contributor, Author) commented May 3, 2018

Another idea: instead of measuring the capacity we don't have (which is what I'm trying to do with "backpressure"), we could measure the capacity that we do have. Basically, we could report idle time, and when all servers report zero (or low) idle time for a period, we know we are at capacity.

For example, if I know that I have 2 workers and 5 threads, and my service isn't getting hit much, then maybe it's telling me my idle capacity is 10. If I get hit with a New York Times mention (or whatever), then suddenly I see that my idle capacity is down to zero; if I then scale out by adding another server, I should get some slack re-introduced into my system.
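A hedged sketch of how that could be polled from the outside, assuming Puma's control server is enabled and that the stats payload exposes a per-worker spare-capacity figure. The field names used here ("pool_capacity", "worker_status", "last_status") are my assumption of the payload shape and vary by Puma version and mode:

```ruby
require "json"
require "net/http"

# Poll Puma's control server (started with --control-url and --control-token)
# and sum spare capacity across workers. Field names are assumptions and may
# differ between Puma versions.
def idle_capacity(control_url, token)
  stats = JSON.parse(Net::HTTP.get(URI("#{control_url}/stats?token=#{token}")))

  if stats["worker_status"]                                   # cluster mode
    stats["worker_status"].sum { |w| w.dig("last_status", "pool_capacity").to_i }
  else                                                        # single mode
    stats["pool_capacity"].to_i
  end
end

# Alert or scale out when this stays at or near zero for a sustained period.
```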

SamSaffron commented

I hope Discourse is threadsafe... I'm using it configured with a lot of Puma threads...

Yes, Discourse is very threadsafe; it has been since day 1.

jjb (Contributor) commented May 25, 2018

Puma::Server meters requests based on thread pool capacity. In an ideal world it will never try to add a request when there is no capacity.

Can you point me to the area of code where this happens? Does the Server decide which thread gets the request, or does the thread pool do that? What do you mean by "meter"?

If I were to naively design such a system myself, I would have:

  • a global queue of requests coming in (probably implemented with Queue)
  • a pool of threads which take requests off of the queue

Are there such structures in Puma? If not, is that for a particular reason?
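For reference, the naive design described above would look something like this in Ruby (a sketch; handle and the pool size are placeholders):

```ruby
require "thread"

# One global Queue of fully buffered requests and a fixed pool of threads
# popping from it. `handle` is a placeholder for real request handling.
requests = Queue.new

workers = 5.times.map do
  Thread.new do
    loop do
      request = requests.pop       # blocks until work is available
      handle(request)              # placeholder
    end
  end
end

# Acceptor side: requests << fully_buffered_request
# The backlog would then simply be requests.length.
```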

jjb (Contributor) commented May 25, 2018

Okay, I found this in ThreadPool:

Each thread in the pool has an internal loop where it pulls a request from the @todo array and processes it.

So I guess you are saying that the size of this queue doesn't matter and there is something further up which indicates lack of capacity. So the answers to my other questions above will lead me in the right direction. Thanks!

jjb (Contributor) commented Jun 4, 2018

A thought/realization I just had: in #1512, I have min threads set to 1 and max threads set to whatever (5 or 6 typically). From my understanding of things, the delta between running and max_threads is meaningful (even though that's not clear from my ticket, which only shows examples with max_threads of 1).

So if that's true, then the part of Puma that decides to spin threads up and down (when min and max are not identical) also "knows" whether capacity is available.

Is this interesting?

schneems added a commit that referenced this issue Jun 14, 2018
[close #1577] Negative Backpressure Metric
GeminPatel commented

Can someone help me understand: if Puma's backlog waits for the full request before putting it into the todo list, and a worker always pops a request before processing it, then how can "two slower clients get their requests accepted around the same time and create a temporary backlog"? Is it when writing back?
