Distributor: add per-tenant counters for total and received requests #2770
Conversation
Add per-tenant counters for total and received requests to provide visibility into what an appropriate value for a per-tenant request limit might be. Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Force-pushed from 4d717b7 to 41e9eda
pkg/distributor/distributor.go
Outdated
@@ -629,6 +643,7 @@ func (d *Distributor) wrapPushWithMiddlewares(next push.Func) push.Func {
	// To guarantee that, middleware functions will be called in reversed order, wrapping the
	// result from previous call.
	middlewares = append(middlewares, d.instanceLimitsMiddleware) // should run first
	middlewares = append(middlewares, d.prePushUserRequestMiddleware)
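The "reversed order" wrapping described in the diff's comment can be sketched in plain Go. This is a simplification, not Mimir's actual code: `pushFunc`, `middleware`, `wrapPush`, and `tracing` are all hypothetical names standing in for `push.Func` and the real middleware plumbing.

```go
package main

import "fmt"

// pushFunc is a hypothetical stand-in for Mimir's push.Func.
type pushFunc func(req string) error

type middleware func(next pushFunc) pushFunc

// wrapPush applies middlewares in reversed order, each wrapping the result of
// the previous call, so the FIRST middleware appended to the slice runs first.
func wrapPush(next pushFunc, middlewares []middleware) pushFunc {
	for i := len(middlewares) - 1; i >= 0; i-- {
		next = middlewares[i](next)
	}
	return next
}

// tracing returns a middleware that records its name when invoked.
func tracing(name string, order *[]string) middleware {
	return func(next pushFunc) pushFunc {
		return func(req string) error {
			*order = append(*order, name)
			return next(req)
		}
	}
}

// runOrder wires two named middlewares and returns their invocation order.
func runOrder() []string {
	var order []string
	mws := []middleware{
		tracing("instanceLimits", &order), // appended first => outermost, runs first
		tracing("perTenantCount", &order),
	}
	push := wrapPush(func(req string) error { return nil }, mws)
	_ = push("request")
	return order
}

func main() {
	fmt.Println(runOrder()) // [instanceLimits perTenantCount]
}
```

Because the loop wraps from the end of the slice backwards, appending `instanceLimitsMiddleware` first makes it the outermost wrapper, which is exactly why it can reject a request before the per-tenant counter middleware ever sees it.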
The instanceLimitsMiddleware could already reject some requests; shouldn't those be counted too?
I'm not sure. We don't increment per-tenant metrics for anything else before checking the instance limits.
The way the other per-tenant limits work is:
- The total metric is "up to this number of X may be accepted"
- The rejected metric is "this number of X was rejected or dropped for reason Y"
- The received metric is "this number made it all the way to a distributor and will be sent to an ingester"
I'm having trouble fitting the instance limits into that model. We'd increase the "up to this number of X may be accepted" metric but then they'd all be rejected due to something the user has no control over. We'd need to add a "system is overloaded" reason to the rejection reasons for the various validation metrics which doesn't make sense to me (since this is an operational issue, not a validation one).
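The three-counter model described above (total, rejected, received) can be sketched with plain maps. This is a hypothetical simplification of the real per-tenant Prometheus counters; the `tenantCounters` type and `observe` method are illustrative names, not Mimir's API.

```go
package main

import "fmt"

// tenantCounters is a hypothetical simplification of the per-tenant counters
// discussed above: total = "up to this many X may be accepted",
// rejected = "this many were rejected or dropped for some reason",
// received = "this many made it all the way through and will go to an ingester".
type tenantCounters struct {
	total    map[string]int
	rejected map[string]int
	received map[string]int
}

func newTenantCounters() *tenantCounters {
	return &tenantCounters{
		total:    map[string]int{},
		rejected: map[string]int{},
		received: map[string]int{},
	}
}

// observe records one request for a tenant. A rejected request never counts
// as received, so total == rejected + received holds per tenant.
func (c *tenantCounters) observe(tenant string, rejected bool) {
	c.total[tenant]++
	if rejected {
		c.rejected[tenant]++
		return
	}
	c.received[tenant]++
}

func main() {
	c := newTenantCounters()
	c.observe("tenant-a", false)
	c.observe("tenant-a", true)
	c.observe("tenant-a", false)
	fmt.Println(c.total["tenant-a"], c.rejected["tenant-a"], c.received["tenant-a"]) // 3 1 2
}
```

The invariant total = rejected + received is what breaks if instance-limit rejections were counted: total would grow with no matching per-tenant rejection reason.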
Additionally, hitting the instance limits results in 500 errors, which will be retried by the user. If we didn't run that middleware first, we'd potentially be double-counting user samples, exemplars, and requests. That matters because these counters exist to provide visibility into the dimensions we limit users on. It'd be confusing to increase the request count without a corresponding "rejection" metric when hitting instance limits.
pkg/distributor/distributor.go
Outdated
}

cleanupInDefer = false
d.incomingRequests.WithLabelValues(userID).Add(1)
This line is literally the only required change in business logic. Shall we include it in d.instanceLimitsMiddleware instead? (We can rename that middleware, if that bothers you.)
Yeah, it's the only line. Sure, I can add it to the instance limits middleware.
Updated in 1777b9c
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Metric names are not quite clear to me. In a month or two I will not remember the difference between cortex_distributor_received_requests_total and cortex_distributor_requests_in_total. Without checking the help text, I'd think that "cortex_distributor_received_requests_total" is actually ALL requests. Can we try to find better names?
Thinking out loud:
- "incoming" (all) vs "processed" (after initial checks)
- "incoming" (all) vs "accepted" (as in "accepted" by distributor, not necessarily ingester)
- "incoming" vs "handled"
(that's all variation on the same theme)
Yes, BUT these names were picked to match the other per-tenant metrics for samples, exemplars, and metadata. I'd rather not "fix" these ones and break the pattern used by the others.
Oh, I see. I missed that we have similarly named metrics for other stuff. Looking at those metrics, I think some of them are broken after HA deduplication was moved to a separate middleware. (ping @replay)
I'll open an issue to fix those metrics now that we've split various checks into middleware.
Can you add a unit test for the metric, please? It should be easy to add the metric check to the existing test cases.
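One common way to assert on a counter in a test is to compare its Prometheus text-exposition rendering against an expected string, which is roughly what client_golang's testutil.GatherAndCompare does. The sketch below reimplements just the rendering step with the standard library; `renderCounter` is a hypothetical helper, not part of any real API.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renderCounter produces a single Prometheus text-exposition line for a
// counter, e.g. `name{user="tenant-a"} 3`. Comparing rendered lines like
// this against expected strings is the style of assertion testutil uses.
func renderCounter(name string, labels map[string]string, value int) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // render labels in a deterministic order
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return fmt.Sprintf("%s{%s} %d", name, strings.Join(pairs, ","), value)
}

func main() {
	got := renderCounter("cortex_distributor_requests_in_total",
		map[string]string{"user": "tenant-a"}, 3)
	want := `cortex_distributor_requests_in_total{user="tenant-a"} 3`
	fmt.Println(got == want) // true
}
```

In a real test you would register the distributor's metrics with a test registry and diff the gathered output against an expected exposition snippet, one line per tenant label value.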
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
What this PR does
Add per-tenant counters for total and received requests to provide visibility
into what an appropriate value for a per-tenant request limit might be.
Which issue(s) this PR fixes or relates to
N/A
Checklist
- CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]