
Distributor: add per-tenant counters for total and received requests #2770

Merged: 2 commits from 56quarters/per-user-requests into main on Aug 19, 2022

Conversation

@56quarters 56quarters commented Aug 18, 2022

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

What this PR does

Add per-tenant counters for total and received requests to provide visibility
into what an appropriate value for a per-tenant request limit might be.
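
For reference, a minimal sketch of how two such counters could be registered with client_golang. The helper name and help strings below are illustrative rather than the PR's exact code; the metric names are the ones discussed later in this conversation.

package distributor

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// newPerTenantRequestCounters is a hypothetical helper: it registers one
// counter for every request that comes in and one for requests that survive
// the initial checks, both labelled by tenant ("user").
func newPerTenantRequestCounters(reg prometheus.Registerer) (requestsIn, receivedRequests *prometheus.CounterVec) {
	requestsIn = promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
		Name: "cortex_distributor_requests_in_total",
		Help: "The total number of requests that have come in to the distributor, per tenant.",
	}, []string{"user"})

	receivedRequests = promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
		Name: "cortex_distributor_received_requests_total",
		Help: "The number of requests that made it past initial checks and will be forwarded to ingesters, per tenant.",
	}, []string{"user"})

	return requestsIn, receivedRequests
}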

Which issue(s) this PR fixes or relates to

N/A

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@56quarters 56quarters marked this pull request as ready for review August 18, 2022 18:58
@@ -629,6 +643,7 @@ func (d *Distributor) wrapPushWithMiddlewares(next push.Func) push.Func {
// To guarantee that, middleware functions will be called in reversed order, wrapping the
// result from previous call.
middlewares = append(middlewares, d.instanceLimitsMiddleware) // should run first
middlewares = append(middlewares, d.prePushUserRequestMiddleware)
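
To illustrate the "reversed order" comment in this hunk, here is a minimal sketch of the wrapping pattern; pushFunc and writeRequest are stand-ins for the real push.Func signature, and the middleware names are hypothetical. Appending a middleware first makes it the outermost wrapper, so it runs first.

package main

import (
	"context"
	"fmt"
)

// Stand-in types; Mimir's real push.Func has a different signature.
type writeRequest struct{}
type pushFunc func(ctx context.Context, req *writeRequest) error
type middleware func(next pushFunc) pushFunc

// wrapPush applies the middlewares in reverse, so middlewares[0] ends up
// outermost and therefore runs first.
func wrapPush(final pushFunc, middlewares []middleware) pushFunc {
	next := final
	for i := len(middlewares) - 1; i >= 0; i-- {
		next = middlewares[i](next)
	}
	return next
}

// named builds a middleware that just logs its name before calling the next one.
func named(name string) middleware {
	return func(next pushFunc) pushFunc {
		return func(ctx context.Context, req *writeRequest) error {
			fmt.Println("enter", name)
			return next(ctx, req)
		}
	}
}

func main() {
	push := wrapPush(
		func(ctx context.Context, req *writeRequest) error { return nil },
		[]middleware{named("instanceLimits"), named("prePushUserRequest")},
	)
	// Prints "enter instanceLimits" before "enter prePushUserRequest".
	_ = push(context.Background(), &writeRequest{})
}
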
Contributor:

the instanceLimitsMiddleware could already reject some requests, shouldn't those be counted too?

@56quarters (author), Aug 18, 2022:

I'm not sure. We don't increment per-tenant metrics for anything else before checking the instance limits.

The way the other per-tenant limits work is:

  • The total metric is "up to this number of X may be accepted"
  • The rejected metric is "this number of X was rejected or dropped for reason Y"
  • The received metric is "this number made it all the way to a distributor and will be sent to an ingester"

I'm having trouble fitting the instance limits into that model. We'd increase the "up to this number of X may be accepted" metric but then they'd all be rejected due to something the user has no control over. We'd need to add a "system is overloaded" reason to the rejection reasons for the various validation metrics which doesn't make sense to me (since this is an operational issue, not a validation one).

@56quarters (author), Aug 18, 2022:

Additionally, hitting the instance limits results in 500 errors, which will be retried by the user. If we didn't run that middleware first, we'd potentially be double counting user samples/exemplars/requests. This is a problem since these counters exist to provide visibility into the dimensions that we're limiting users on. It'd be confusing to increase the count of requests without a corresponding "rejection" metric when hitting instance limits.

@56quarters 56quarters requested a review from replay August 19, 2022 13:26
}

cleanupInDefer = false
d.incomingRequests.WithLabelValues(userID).Add(1)
Member:

This line is literally the only required change in business logic. Shall we include it in d.instanceLimitsMiddleware instead? (We can rename that middleware, if that bothers you.)

@56quarters (author):

Yeah, it's the only line. Sure, I can add it to the instance limits middleware.

@56quarters (author):

Updated in 1777b9c
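
Roughly, the suggestion amounts to something like the sketch below. This is a guess at the shape of the change, not the actual code in 1777b9c: writeRequest, pushFunc, checkInstanceLimits, and incomingRequests are placeholders standing in for the real distributor code.

package distributor

import (
	"context"

	"github.com/grafana/dskit/tenant"
	"github.com/prometheus/client_golang/prometheus"
)

// Stand-in types for the sketch; not Mimir's real push.Func.
type writeRequest struct{}
type pushFunc func(ctx context.Context, req *writeRequest) error

type Distributor struct {
	incomingRequests *prometheus.CounterVec
}

// checkInstanceLimits stands in for the existing per-instance limit checks.
func (d *Distributor) checkInstanceLimits() error { return nil }

// instanceLimitsMiddleware (sketch) bumps the per-tenant counter only after
// the per-instance checks pass, so that requests rejected with a 500 (and
// later retried by the client) are not counted twice.
func (d *Distributor) instanceLimitsMiddleware(next pushFunc) pushFunc {
	return func(ctx context.Context, req *writeRequest) error {
		if err := d.checkInstanceLimits(); err != nil {
			// Rejected by an operational limit; no per-tenant accounting.
			return err
		}

		userID, err := tenant.TenantID(ctx)
		if err != nil {
			return err
		}
		d.incomingRequests.WithLabelValues(userID).Add(1)

		return next(ctx, req)
	}
}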

@pstibrany (Member) left a comment:

Metric names are not quite clear to me. In a month or two I will not remember the difference between cortex_distributor_received_requests_total and cortex_distributor_requests_in_total. Without checking the help text, I'd think that "cortex_distributor_received_requests_total" is actually ALL requests. Can we try to find better names?

Thinking out loud:

  • "incoming" (all) vs "processed" (after initial checks)
  • "incoming" (all) vs "accepted" (as in "accepted" by the distributor, not necessarily the ingester)
  • "incoming" vs "handled"

(those are all variations on the same theme)

@56quarters (author):

> Metric names are not quite clear to me. […] Can we try to find better names?

Yes, BUT these names were picked to match the other per-tenant metrics for samples, exemplars, and metadata. I'd rather not "fix" these ones and break the pattern used by the others.

@pstibrany (Member):

> Yes, BUT these names were picked to match the other per-tenant metrics for samples, exemplars, and metadata. I'd rather not "fix" these ones and break the pattern used by the others.

Oh, I see. I missed that we have similarly named metrics for other stuff.

Looking at those metrics, I think some of them are broken after HA deduplication has been moved to separate middleware. (ping @replay)

@56quarters (author):

> Looking at those metrics, I think some of them are broken after HA deduplication has been moved to separate middleware. (ping @replay)

I'll open an issue to fix those metrics now that we've split various checks into middleware.

@56quarters 56quarters merged commit 3dc50e2 into main Aug 19, 2022
@56quarters 56quarters deleted the 56quarters/per-user-requests branch August 19, 2022 15:22
@pracucci (Collaborator) left a comment:

Can you add a unit test on the metric, please? It should be easy to add the metric check to existing test cases.
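
For example, a standalone sketch along these lines using client_golang's testutil (the help string is illustrative, and a real test would exercise the distributor's push path instead of bumping the counter directly):

package distributor_test

import (
	"strings"
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func TestPerTenantRequestsInCounter(t *testing.T) {
	reg := prometheus.NewPedanticRegistry()

	// Stand-in for the counter registered by the distributor in this PR.
	requestsIn := prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "cortex_distributor_requests_in_total",
		Help: "The total number of requests that have come in to the distributor, per tenant.",
	}, []string{"user"})
	reg.MustRegister(requestsIn)

	// Simulate two requests from the same tenant.
	requestsIn.WithLabelValues("user-1").Add(1)
	requestsIn.WithLabelValues("user-1").Add(1)

	expected := `
# HELP cortex_distributor_requests_in_total The total number of requests that have come in to the distributor, per tenant.
# TYPE cortex_distributor_requests_in_total counter
cortex_distributor_requests_in_total{user="user-1"} 2
`
	if err := testutil.GatherAndCompare(reg, strings.NewReader(expected), "cortex_distributor_requests_in_total"); err != nil {
		t.Fatal(err)
	}
}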

@56quarters added two commits that referenced this pull request on Aug 19, 2022.
@56quarters 56quarters self-assigned this Aug 29, 2022