Add initial implementation of per-query limits #8727

cstyan · 2023-03-07T06:04:40Z

What this PR does / why we need it:
Sometimes we want to limit the impact of a single query by imposing limits that are stricter than the current tenant limit. For example, the maximum query length could be seven days, but based on the query or an admin's decision a particular query should have a maximum length of just one day. This is where per-request limits come into play. They are passed via the X-Loki-Query-Limit header and extracted into the request's context.

It is the responsibility of the operator or admin to ensure that the header is valid.
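
As a rough illustration of the flow described above, the sketch below reads the X-Loki-Query-Limit header, decodes it as JSON, and stores the result in the request's context. The `QueryLimits` struct, its single field, the context key, the helper names, and the 400 status code are assumptions made for this example and are not taken from `pkg/util/querylimits`.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// QueryLimits is a hypothetical, simplified limits struct. The real field
// names and types in pkg/util/querylimits may differ.
type QueryLimits struct {
	MaxQueryLengthSeconds int64 `json:"maxQueryLengthSeconds,omitempty"`
}

// HeaderName is the header named in the PR description.
const HeaderName = "X-Loki-Query-Limit"

// ctxKey is an unexported key type so other packages cannot collide with it.
type ctxKey struct{}

// parseLimits decodes the header value, assumed here to be JSON.
func parseLimits(v string) (*QueryLimits, error) {
	var l QueryLimits
	if err := json.Unmarshal([]byte(v), &l); err != nil {
		return nil, err
	}
	return &l, nil
}

// limitsFromContext returns the limits stored in ctx, or nil if none were set.
func limitsFromContext(ctx context.Context) *QueryLimits {
	l, _ := ctx.Value(ctxKey{}).(*QueryLimits)
	return l
}

// middleware extracts the header into the request's context and rejects
// requests whose header cannot be parsed. The status code is an assumption.
func middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if v := r.Header.Get(HeaderName); v != "" {
			l, err := parseLimits(v)
			if err != nil {
				http.Error(w, "invalid "+HeaderName+" header", http.StatusBadRequest)
				return
			}
			r = r.WithContext(context.WithValue(r.Context(), ctxKey{}, l))
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	h := middleware(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if l := limitsFromContext(r.Context()); l != nil {
			fmt.Fprintf(w, "max query length: %ds", l.MaxQueryLengthSeconds)
			return
		}
		fmt.Fprint(w, "tenant limits only")
	}))

	// Simulate a request that restricts the query length to one day.
	req := httptest.NewRequest(http.MethodGet, "/loki/api/v1/query_range", nil)
	req.Header.Set(HeaderName, `{"maxQueryLengthSeconds":86400}`)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	fmt.Println(rec.Code, rec.Body.String())
}
```

Requests without the header pass through untouched, so downstream code falls back to the tenant limits.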

Which issue(s) this PR fixes:
Fixes #8762

Checklist

Reviewed the CONTRIBUTING.md guide (required)
Documentation added
Tests updated
CHANGELOG.md updated
Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md

Signed-off-by: Callum Styan <callumstyan@gmail.com>

jeschkies

Thanks for starting the work. I think there was some misunderstanding around the error messaging.

pkg/util/querylimits/limiter.go

Signed-off-by: Callum Styan <callumstyan@gmail.com>

jeschkies

Thanks. I think we are close. I'd advise against introducing new limits in this pull request as this is scope creep. We should land this first and add the other limits later.

pkg/loki/loki.go

pkg/querier/querier.go

pkg/util/querylimits/propagation.go

pkg/validation/limits.go

pkg/validation/limits_test.go

Signed-off-by: Callum Styan <callumstyan@gmail.com>

cstyan · 2023-03-08T22:24:16Z

> Thanks. I think we are close. I'd advise against introducing new limits in this pull request as this is scope creep. We should land this first and add the other limits later.

Agreed. I removed the addition of the RequiredLabels.

Sorry about the messy state for your last review, not sure what happened there. Lots of stuff that I had cleaned up locally somehow didn't make it into a commit. Maybe I messed up my git commit add --patch last night.

jeschkies

Nice. This looks really good. What's missing:

- An integration test.
- A configuration flag to enable and disable this feature.
- Some cleanup for the logger.

I'll see how far I get today.

pkg/util/querylimits/grpc.go

jeschkies · 2023-03-09T11:35:08Z

@cstyan I've added an integration test. There's a race condition with the modules. Sometimes the test works and sometimes it fails.

chaudum · 2023-03-09T15:48:59Z

pkg/loki/loki.go

+		if err := mm.AddDependency(Server, QueryLimitsInterceptors); err != nil {
+			return err
+		}


How is using mm.AddDependency() different from appending to deps?

deps doesn't already have an entry for Server, not sure if that's intentional or not. We end up looping through deps anyway and calling AddDependency, so it's essentially the same as if we had a Server entry in deps.

Yes. I didn't do it because deps was missing it. That said, I'm not sure we need the interceptor at all.
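
The point made in this exchange can be sketched with a toy module manager: adding an entry to a static deps map and calling `AddDependency` directly end up in the same place, because the map is only ever consumed by a loop that calls `AddDependency`. The `manager` type, the `registerAll` helper, and the module names below are hypothetical stand-ins, not the module manager Loki actually uses.

```go
package main

import "fmt"

// manager is a toy stand-in for a module manager (hypothetical).
type manager struct {
	deps map[string][]string
}

func newManager() *manager {
	return &manager{deps: map[string][]string{}}
}

// AddDependency records that module name depends on each entry of dependsOn.
func (m *manager) AddDependency(name string, dependsOn ...string) error {
	if name == "" {
		return fmt.Errorf("empty module name")
	}
	m.deps[name] = append(m.deps[name], dependsOn...)
	return nil
}

// registerAll loops over a static deps map and calls AddDependency per entry.
func registerAll(m *manager, deps map[string][]string) error {
	for name, targets := range deps {
		if err := m.AddDependency(name, targets...); err != nil {
			return err
		}
	}
	return nil
}

// viaDepsMap declares the server dependency inside the static map.
func viaDepsMap() (*manager, error) {
	m := newManager()
	err := registerAll(m, map[string][]string{
		"querier": {"server"},
		"server":  {"query-limits-interceptors"},
	})
	return m, err
}

// viaExplicitCall leaves server out of the map and adds it with a direct call.
func viaExplicitCall() (*manager, error) {
	m := newManager()
	if err := registerAll(m, map[string][]string{"querier": {"server"}}); err != nil {
		return nil, err
	}
	if err := m.AddDependency("server", "query-limits-interceptors"); err != nil {
		return nil, err
	}
	return m, nil
}

func main() {
	a, _ := viaDepsMap()
	b, _ := viaExplicitCall()
	// Both managers hold the same dependency list for "server".
	fmt.Println(a.deps["server"], b.deps["server"])
}
```

The explicit call is mainly useful when the dependency should only exist under a condition, such as a feature flag, which a static map cannot express.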

pkg/util/querylimits/grpc_test.go

pkg/util/querylimits/limiter.go

pkg/util/querylimits/propagation.go


**What this PR does / why we need it**: #8727 introduced per-request limits. These should be enforced for label queries as well. This change adds the required middleware to all label query endpoints.

**Checklist**
- [ ] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**)
- [ ] Documentation added
- [x] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md`

@owen-d

**What this PR does / why we need it**: This PR implements two new per-tenant limits that are enforced on log and metric queries (both range and instant) when TSDB is used:

- `max_query_bytes_read`: Refuse queries that would read more than the number of bytes configured here. This is an overall limit, regardless of splitting/sharding. The goal is to refuse queries that would take too long. The default value of 0 disables this limit.
- `max_querier_bytes_read`: Refuse queries in which any subquery, after splitting and sharding, would read more than the number of bytes configured here. The goal is to prevent a querier from running a query that would load too much data into memory and could get OOMed. The default value of 0 disables this limit.

These new limits can be configured per tenant and per query (see #8727).

The bytes a query would read are estimated through TSDB's index stats. Even though they are not exact, they are good enough for a rough estimate of whether a query is too big to run. For more details, refer to this discussion in the PR: #8670 (comment).

Both limits are implemented in the frontend. We considered implementing `max_querier_bytes_read` in the querier, but keeping both in the frontend means the limits for pre- and post-splitting/sharding queries are enforced close to each other in the same component. Moreover, this way we can reduce the number of index stats requests issued to the index gateways by reusing the stats gathered while sharding the query.

With regard to how index stats requests are issued:

- We parallelize index stats requests by splitting them into queries that span up to 24h, since our indices are sharded by 24h periods. On top of that, this prevents a single index gateway from processing a single huge request like `{app=~".+"} for 30d`.
- If sharding is enabled and the query is shardable, for `max_querier_bytes_read` we re-use the stats requests issued by the sharding middleware. Specifically, we look at the [bytesPerShard][1] to enforce this limit.

Note that once we merge this PR and enable these limits, the load of index stats requests will increase substantially and we may discover bottlenecks in our index gateways and TSDB. After speaking with @owen-d, we think it should be fine as, if needed, we can scale up our index gateways and support caching index stats requests.

Here's a demo of this working:

<img width="1647" alt="image" src="https://user-images.githubusercontent.com/8354290/226918478-d4b6c2fd-de4d-478a-9c8b-e38fe148fa95.png">
<img width="1647" alt="image" src="https://user-images.githubusercontent.com/8354290/226918798-a71b1db8-ea68-4d00-933b-e5eb1524d240.png">

**Which issue(s) this PR fixes**: This PR addresses grafana/loki-private#674.

**Special notes for your reviewer**:

- @jeschkies has reviewed the changes related to query-time limits.
- I've done some refactoring in this PR:
  - Extracted logic to get stats for a set of matchers into a new function [getStatsForMatchers][2].
  - Extracted the _Handler_ interface implementation for [queryrangebase.roundTripper][3] into a new type [queryrangebase.roundTripperHandler][4]. This is used to create the handler that skips the rest of the configured middlewares when sending an index stats request ([example][5]).

**Checklist**
- [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**)
- [x] Documentation added
- [x] Tests updated
- [x] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md`

[1]: https://github.com/grafana/loki/blob/ff847305afaf7de5eb56436f3683773e88701075/pkg/querier/queryrange/shard_resolver.go#L179-L186
[2]: https://github.com/grafana/loki/blob/ff847305afaf7de5eb56436f3683773e88701075/pkg/querier/queryrange/shard_resolver.go#L72
[3]: https://github.com/grafana/loki/blob/3d2fff3a2d416a48a73346a53ba7499b0eeb67f7/pkg/querier/queryrange/queryrangebase/roundtrip.go#L124
[4]: https://github.com/grafana/loki/blob/3d2fff3a2d416a48a73346a53ba7499b0eeb67f7/pkg/querier/queryrange/queryrangebase/roundtrip.go#L163
[5]: https://github.com/grafana/loki/blob/f422e0a52b743a11209b8276510feb2ab8241486/pkg/querier/queryrange/roundtrip.go#L521
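
The description of the bytes-read limits can be sketched roughly as follows: a query's time range is cut at 24h boundaries so that each index stats request covers at most one index period, and the estimated bytes are then compared against the two limits. All names here (`interval`, `splitByDay`, `exceedsQueryLimit`, `exceedsQuerierLimit`, the demo timestamps and byte counts) are hypothetical, and aligning the cuts to UTC midnight is an assumption rather than something the PR states.

```go
package main

import (
	"fmt"
	"time"
)

// interval is a half-open time range [start, end).
type interval struct {
	start, end time.Time
}

// splitByDay cuts [start, end) at UTC day boundaries so that every chunk
// spans at most 24h. Time.Truncate rounds down relative to the zero time,
// which lines up with UTC midnight for a 24h step.
func splitByDay(start, end time.Time) []interval {
	var out []interval
	for cur := start; cur.Before(end); {
		next := cur.Truncate(24 * time.Hour).Add(24 * time.Hour)
		if next.After(end) {
			next = end
		}
		out = append(out, interval{cur, next})
		cur = next
	}
	return out
}

// exceedsQueryLimit mirrors the idea behind max_query_bytes_read: the sum of
// all estimates is compared to the limit. A limit of 0 disables the check.
func exceedsQueryLimit(estimates []uint64, limit uint64) bool {
	if limit == 0 {
		return false
	}
	var total uint64
	for _, b := range estimates {
		total += b
	}
	return total > limit
}

// exceedsQuerierLimit mirrors the idea behind max_querier_bytes_read: the
// query is refused if any single subquery exceeds the limit.
func exceedsQuerierLimit(perSubquery []uint64, limit uint64) bool {
	if limit == 0 {
		return false
	}
	for _, b := range perSubquery {
		if b > limit {
			return true
		}
	}
	return false
}

// Demo inputs, chosen only for illustration.
var (
	demoStart = time.Date(2023, 3, 20, 18, 0, 0, 0, time.UTC)
	demoEnd   = time.Date(2023, 3, 22, 6, 0, 0, 0, time.UTC)
)

func main() {
	for _, iv := range splitByDay(demoStart, demoEnd) {
		fmt.Println(iv.start.Format(time.RFC3339), "->", iv.end.Format(time.RFC3339))
	}
	estimates := []uint64{40 << 20, 300 << 20, 10 << 20} // made-up byte counts
	fmt.Println("over query limit:", exceedsQueryLimit(estimates, 256<<20))
	fmt.Println("over querier limit:", exceedsQuerierLimit(estimates, 256<<20))
}
```

In a real frontend the per-chunk estimates would come from index stats responses and the requests would be issued in parallel; this sketch only shows the splitting and the comparison.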

**What this PR does / why we need it**: In #8727 we introduced various limits that can now be configured at query time. We always compare the value of the limit configured at query time with the value set in the overrides for the tenant, or the default if not configured (aka original), and apply the most restrictive one. If the most restrictive is the original value, or the limit is not configured at query time, we print the following debug message:

https://github.com/grafana/loki/blob/9e1846c47a1f3685fc540b7d03d285c7530da223/pkg/util/querylimits/limiter.go#L43

It will be printed many times: every time the limit is not configured at query time, as well as every time the original value is more restrictive. Moreover, this log message lacks some useful information such as the original value, the query-time value, the tenant ID and the limit name.

This PR fixes this by removing the debug message above and printing a new debug message only if we return the query-time limit. This new log also prints the tenant ID, the limit name, as well as the original and the query-time limit values.

**Which issue(s) this PR fixes**: Fixes #8932

**Special notes for your reviewer**:

**Checklist**
- [ ] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**)
- [ ] Documentation added
- [ ] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md`
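
A minimal sketch of the selection and logging behaviour described above, assuming that a smaller value is always the more restrictive one and ignoring any special meaning of 0. The function name, the use of plain `int64` values, and the standard library `log` package are stand-ins for this example; the actual code in `pkg/util/querylimits/limiter.go` may look different.

```go
package main

import (
	"fmt"
	"log"
)

// mostRestrictive returns the smaller of the original (tenant or default)
// limit and the query-time limit. queryTime is nil when the limit was not
// configured at query time. A message is logged only when the query-time
// value is the one returned, so the common cases stay silent.
func mostRestrictive(tenantID, limitName string, original int64, queryTime *int64) int64 {
	if queryTime == nil || *queryTime >= original {
		return original
	}
	log.Printf("using query-time limit tenant=%s limit=%s original=%d query-time=%d",
		tenantID, limitName, original, *queryTime)
	return *queryTime
}

func main() {
	stricter := int64(86400)
	looser := int64(30 * 86400)

	// Logged: the query-time value wins.
	fmt.Println(mostRestrictive("tenant-a", "max_query_length", 7*86400, &stricter))
	// Not logged: the original value is more restrictive.
	fmt.Println(mostRestrictive("tenant-a", "max_query_length", 7*86400, &looser))
	// Not logged: nothing was configured at query time.
	fmt.Println(mostRestrictive("tenant-a", "max_query_length", 7*86400, nil))
}
```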

Add initial implementation of per-query limits

164f78a

Signed-off-by: Callum Styan <callumstyan@gmail.com>

pull-request-size bot added the size/XXL label Mar 7, 2023

jeschkies requested changes Mar 7, 2023

pkg/util/querylimits/limiter.go (outdated)

cstyan added 3 commits March 7, 2023 16:00

Back out incorrect error handling

e8cf011

Signed-off-by: Callum Styan <callumstyan@gmail.com>

add validation step for required labels

de65a32

Signed-off-by: Callum Styan <callumstyan@gmail.com>

clean up, remove query timeout

c6f53ae

Signed-off-by: Callum Styan <callumstyan@gmail.com>

pull-request-size bot added size/XL and removed size/XXL labels Mar 8, 2023

jeschkies requested changes Mar 8, 2023

Review cleanup

8ef5d6a

Signed-off-by: Callum Styan <callumstyan@gmail.com>

cstyan changed the title ~~WIP Add initial implementation of per-query limits~~ Add initial implementation of per-query limits Mar 9, 2023

cstyan marked this pull request as ready for review March 9, 2023 00:11

cstyan requested a review from a team as a code owner March 9, 2023 00:11

jeschkies reviewed Mar 9, 2023

pkg/util/querylimits/grpc.go

jeschkies force-pushed the loki-per-query-limits branch from 589ad6c to 15d57bc Compare March 9, 2023 10:08

Inject logger with component.

d86fd4a

jeschkies force-pushed the loki-per-query-limits branch from 15d57bc to d86fd4a Compare March 9, 2023 10:17

Enable per request limits with flag.

1a3f7f1

jeschkies force-pushed the loki-per-query-limits branch from 51b85a8 to 1a3f7f1 Compare March 9, 2023 10:48

jeschkies requested a review from JStickler as a code owner March 9, 2023 10:48

github-actions bot added the type/docs label Mar 9, 2023

WIP: integration test.

a4e8e80

jeschkies added 3 commits March 9, 2023 12:48

Log when limits are injected and extracted.

107b9e4

Actually add integration test.

15b72e3

Test limits with integration test.

0d69463

jeschkies force-pushed the loki-per-query-limits branch from 4592927 to 0d69463 Compare March 9, 2023 15:13

chaudum reviewed Mar 9, 2023

REmove _ = level.Debug.

a31a2fe

jeschkies added 3 commits March 9, 2023 20:24

Enable middleware when flag is set.

36c135f

Add changelog entry.

b8e6376

Test grpc without limits.

f523c16

jeschkies approved these changes Mar 9, 2023

jeschkies and others added 5 commits March 9, 2023 20:46

Enable per-request limits for Docker compose setup.

7342918

allow omitempty for the query limits struct

a55e58f

Signed-off-by: Callum Styan <callumstyan@gmail.com>

write error if middleware can't extract limits (this would be because of invalid json that we can't unmarshal)

1dc4b22

Signed-off-by: Callum Styan <callumstyan@gmail.com>

use logger instead of println, at debug level

b24b515

Signed-off-by: Callum Styan <callumstyan@gmail.com>

fix linting

ebcdf27

Signed-off-by: Callum Styan <callumstyan@gmail.com>

cstyan merged commit 5a85f66 into main Mar 10, 2023

cstyan deleted the loki-per-query-limits branch March 10, 2023 05:21

jeschkies added a commit to jeschkies/loki that referenced this pull request Mar 10, 2023

Remove print statements introduced by grafana#8727.

bc85538

jeschkies mentioned this pull request Mar 10, 2023

Impose per-request limit on label queries. #8777

Merged

5 tasks

jeschkies added a commit that referenced this pull request Mar 10, 2023

Remove print statements introduced by #8727. (#8775)

8eac851

jeschkies mentioned this pull request Mar 13, 2023

Ensure LimitsMiddleware is injected after per-request limits middleware. #8790

Closed

salvacorts mentioned this pull request Mar 22, 2023

Max bytes read limit #8670

Merged

5 tasks

This was referenced Mar 29, 2023

Usefulness of logging in query-time limits. #8932

Closed

Log when returning query-time limit #8938

Merged
