-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add initial implementation of per-query limits #8727
Conversation
Signed-off-by: Callum Styan <callumstyan@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for starting the work. I think there was some misunderstanding around the error messaging.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I think we are close. It's advice against introducing new limits in this pull request as this is scope creep. We should land this forest and add the other limits later.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Agreed. I removed the addition of the Sorry about the messy state for your last review, not sure what happened there. Lots of stuff that I had cleaned up locally somehow didn't make it into a commit. Maybe I messed up my |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. This looks really good. What's missing
- An integration test.
- A configuration flag to enable and disable this feature.
- Some clean up for the logger.
I'll see how far I come today.
589ad6c
to
15d57bc
Compare
15d57bc
to
d86fd4a
Compare
51b85a8
to
1a3f7f1
Compare
@cstyan I've added an integration test. There's a race condition with the modules. Sometimes the test works and sometimes it fails. |
4592927
to
0d69463
Compare
if err := mm.AddDependency(Server, QueryLimitsInterceptors); err != nil { | ||
return err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is using mm.AppDependency()
different to appending to deps
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deps doesn't already have an entry for Server
, not sure if that's intentional or not. We end up looping through deps
anyways and calling AddDependency
, so it's essentially the same as if we had a Server
entry in deps
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I didn't do it because Deos was missing it. That said I'm not sure we need the inceptor at all.
Signed-off-by: Callum Styan <callumstyan@gmail.com>
invalid json that we can't unmarshal) Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
**What this PR does / why we need it**: #8727 introduced per-request limits. These should be enforce for label queries as well. This change adds the required middleware to all label query endpoints. **Checklist** - [ ] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [ ] Documentation added - [x] Tests updated - [ ] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md`
**What this PR does / why we need it**: This PR implements two new per-tenant limits that are enforced on log and metric queries (both range and instant) when TSDB is used: - `max_query_bytes_read`: Refuse queries that would read more than the configured bytes here. Overall limit regardless of splitting/sharding. The goal is to refuse queries that would take too long. The default value of 0 disables this limit. - `max_querier_bytes_read`: Refuse queries in which any of their subqueries after splitting and sharding would read more than the configured bytes here. The goal is to avoid a querier from running a query that would load too much data in memory and can potentially get OOMed. The default value of 0 disables this limit. These new limits can be configured per tenant and per query (see #8727). The bytes a query would read are estimated through TSDB's index stats. Even though they are not exact, they are good enough to have a rough estimation of whether a query is too big to run or not. For more details on this refer to this discussion in the PR: #8670 (comment). Both limits are implemented in the frontend. Even though we considered implementing `max_querier_bytes_read` in the querier, this way, the limits for pre and post splitting/sharding queries are enforced close to each other on the same component. Moreover, this way we can reduce the number of index stats requests issued to the index gateways by reusing the stats gathered while sharding the query. With regard to how index stats requests are issued: - We parallelize index stats requests by splitting them into queries that span up to 24h since our indices are sharded by 24h periods. On top of that, this prevents a single index gateway from processing a single huge request like `{app=~".+"} for 30d`. - If sharding is enabled and the query is shardable, for `max_querier_bytes_read`, we re-use the stats requests issued by the sharding ware. Specifically, we look at the [bytesPerShard][1] to enforce this limit. Note that once we merge this PR and enable these limits, the load of index stats requests will increase substantially and we may discover bottlenecks in our index gateways and TSDB. After speaking with @owen-d, we think it should be fine as, if needed, we can scale up our index gateways and support caching index stats requests. Here's a demo of this working: <img width="1647" alt="image" src="https://user-images.githubusercontent.com/8354290/226918478-d4b6c2fd-de4d-478a-9c8b-e38fe148fa95.png"> <img width="1647" alt="image" src="https://user-images.githubusercontent.com/8354290/226918798-a71b1db8-ea68-4d00-933b-e5eb1524d240.png"> **Which issue(s) this PR fixes**: This PR addresses grafana/loki-private#674. **Special notes for your reviewer**: - @jeschkies has reviewed the changes related to query-time limits. - I've done some refactoring in this PR: - Extracted logic to get stats for a set of matches into a new function [getStatsForMatchers][2]. - Extracted the _Handler_ interface implementation for [queryrangebase.roundTripper][3] into a new type [queryrangebase.roundTripperHandler][4]. This is used to create the handler that skips the rest of configured middlewares when sending an index stat quests ([example][5]). **Checklist** - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [x] Documentation added - [x] Tests updated - [x] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` [1]: https://github.com/grafana/loki/blob/ff847305afaf7de5eb56436f3683773e88701075/pkg/querier/queryrange/shard_resolver.go#L179-L186 [2]: https://github.com/grafana/loki/blob/ff847305afaf7de5eb56436f3683773e88701075/pkg/querier/queryrange/shard_resolver.go#L72 [3]: https://github.com/grafana/loki/blob/3d2fff3a2d416a48a73346a53ba7499b0eeb67f7/pkg/querier/queryrange/queryrangebase/roundtrip.go#L124 [4]: https://github.com/grafana/loki/blob/3d2fff3a2d416a48a73346a53ba7499b0eeb67f7/pkg/querier/queryrange/queryrangebase/roundtrip.go#L163 [5]: https://github.com/grafana/loki/blob/f422e0a52b743a11209b8276510feb2ab8241486/pkg/querier/queryrange/roundtrip.go#L521
**What this PR does / why we need it**: At #8727 we introduced various limits that can now be configured at query time. We always compare the value of the limit configured at query time with the value set on the overrides for the tenant or the default if not configured (aka original); applying the most restrictive one. If the most restrictive is the original value or the limit is not configured at query-time, we print the following debug message: https://github.com/grafana/loki/blob/9e1846c47a1f3685fc540b7d03d285c7530da223/pkg/util/querylimits/limiter.go#L43 It will be printed many times: every time the query is not configured at query-time, as well as every time the original value is more restrictive. Moreover, this log message lacks some useful information such as the original value, the query-time value, the tenant ID and the limit name. This PR fixes this by removing the debug message above and printing a new debug message only if we return the query-time limit. This new log also prints the tenant ID, the limit name, as well as the original and the query-time limit values. **Which issue(s) this PR fixes**: Fixes #8932 **Special notes for your reviewer**: **Checklist** - [ ] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [ ] Documentation added - [ ] Tests updated - [ ] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md`
What this PR does / why we need it:
Sometimes we want to limit the impact of a single query by imposing limits that are stricter than the current tenant limit. E.g. the maximum query length could be seven days but based on the query or an admins decision a query should just have a maximum length of one day. This is where per-request limits come into play. They are passed via the
X-Loki-Query-Limit
header and extracted into the requests context.It is the responsibility of the operator or admin that the header is valid.
Which issue(s) this PR fixes:
Fixes #8762
Checklist
CONTRIBUTING.md
guide (required)CHANGELOG.md
updateddocs/sources/upgrading/_index.md