Add initial implementation of per-query limits #8727

Merged 20 commits on Mar 10, 2023

@cstyan cstyan commented Mar 7, 2023

What this PR does / why we need it:
Sometimes we want to limit the impact of a single query by imposing limits that are stricter than the current tenant limits. For example, the maximum query length could be seven days, but based on the query or an admin's decision, a given query should have a maximum length of only one day. This is where per-request limits come into play. They are passed via the X-Loki-Query-Limit header and extracted into the request's context.

It is the responsibility of the operator or admin to ensure that the header is valid.
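
As a rough illustration of the flow described above, the following sketch parses the X-Loki-Query-Limit header in an HTTP middleware and stores the result in the request context. Only the header name comes from this PR; the `queryLimits` struct, its `maxQueryLengthSeconds` JSON field, the helper names, and the choice to answer invalid JSON with a 400 are assumptions made for the example and may not match the merged code.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// headerName is the header named in the PR description.
const headerName = "X-Loki-Query-Limit"

// queryLimits is a hypothetical payload; the real field names may differ.
type queryLimits struct {
	// MaxQueryLengthSeconds caps the queried time range, in seconds.
	MaxQueryLengthSeconds int64 `json:"maxQueryLengthSeconds"`
}

// ctxKey is an unexported key type so other packages cannot collide with it.
type ctxKey struct{}

// injectLimits returns a context that carries the per-request limits.
func injectLimits(ctx context.Context, l *queryLimits) context.Context {
	return context.WithValue(ctx, ctxKey{}, l)
}

// extractLimits returns the limits stored in ctx, or nil if none were set.
func extractLimits(ctx context.Context) *queryLimits {
	l, _ := ctx.Value(ctxKey{}).(*queryLimits)
	return l
}

// limitsMiddleware parses the header, if present, and passes the limits on
// through the request context. Rejecting bad JSON is an assumption here.
func limitsMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		raw := r.Header.Get(headerName)
		if raw == "" {
			next.ServeHTTP(w, r)
			return
		}
		var l queryLimits
		if err := json.Unmarshal([]byte(raw), &l); err != nil {
			http.Error(w, "invalid "+headerName+" header", http.StatusBadRequest)
			return
		}
		next.ServeHTTP(w, r.WithContext(injectLimits(r.Context(), &l)))
	})
}

// runRequest sends one request through the middleware and reports the status
// code and the limits the downstream handler saw in its context.
func runRequest(headerValue string) (int, *queryLimits) {
	var seen *queryLimits
	h := limitsMiddleware(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		seen = extractLimits(r.Context())
	}))
	req := httptest.NewRequest(http.MethodGet, "/loki/api/v1/query_range", nil)
	if headerValue != "" {
		req.Header.Set(headerName, headerValue)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, seen
}

func main() {
	status, limits := runRequest(`{"maxQueryLengthSeconds":86400}`)
	fmt.Println(status, limits.MaxQueryLengthSeconds) // prints: 200 86400
}
```

A request without the header passes through untouched, so downstream code falls back to the tenant limits.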

Which issue(s) this PR fixes:
Fixes #8762

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md

Signed-off-by: Callum Styan <callumstyan@gmail.com>

@jeschkies jeschkies left a comment

Thanks for starting the work. I think there was some misunderstanding around the error messaging.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>

@jeschkies jeschkies left a comment

Thanks. I think we are close. I'd advise against introducing new limits in this pull request, as that is scope creep. We should land this first and add the other limits later.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
cstyan commented Mar 8, 2023

> Thanks. I think we are close. I'd advise against introducing new limits in this pull request, as that is scope creep. We should land this first and add the other limits later.

Agreed. I removed the addition of the RequiredLabels.

Sorry about the messy state for your last review, not sure what happened there. Lots of stuff that I had cleaned up locally somehow didn't make it into a commit. Maybe I messed up my git commit add --patch last night.

@cstyan cstyan changed the title WIP Add initial implementation of per-query limits Add initial implementation of per-query limits Mar 9, 2023
@cstyan cstyan marked this pull request as ready for review March 9, 2023 00:11
@cstyan cstyan requested a review from a team as a code owner March 9, 2023 00:11

@jeschkies jeschkies left a comment

Nice. This looks really good. What's missing:

  • An integration test.
  • A configuration flag to enable and disable this feature.
  • Some clean up for the logger.

I'll see how far I get today.

@github-actions github-actions bot added the type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories label Mar 9, 2023
@jeschkies
@cstyan I've added an integration test. There's a race condition with the modules. Sometimes the test works and sometimes it fails.

Comment on lines +680 to +682
if err := mm.AddDependency(Server, QueryLimitsInterceptors); err != nil {
	return err
}

Contributor

How is using mm.AddDependency() different from appending to deps?

Contributor Author

Deps doesn't already have an entry for Server, not sure if that's intentional or not. We end up looping through deps anyways and calling AddDependency, so it's essentially the same as if we had a Server entry in deps.

@jeschkies jeschkies Mar 9, 2023

Yes. I didn't do it because deps was missing it. That said, I'm not sure we need the interceptor at all.
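
To make the point in this thread concrete, here is a toy sketch of why an explicit `AddDependency` call and an extra entry in a dependency table end up equivalent when the table is just looped over. The `manager` type below is a stand-in written for this example, not the real module manager, and the string values of the module names as well as the `Querier` entry are placeholders.

```go
package main

import "fmt"

// Module names; the string values are placeholders for this example.
const (
	Server                  = "server"
	QueryLimitsInterceptors = "query-limits-interceptors"
	Querier                 = "querier"
)

// manager is a toy stand-in for a module manager that only records edges.
type manager struct {
	deps map[string][]string
}

func newManager() *manager {
	return &manager{deps: map[string][]string{}}
}

// AddDependency records that module name depends on the given modules.
func (m *manager) AddDependency(name string, dependsOn ...string) error {
	if name == "" {
		return fmt.Errorf("module name must not be empty")
	}
	m.deps[name] = append(m.deps[name], dependsOn...)
	return nil
}

// wire loops over a dependency table and registers every entry.
func wire(m *manager, table map[string][]string) error {
	for mod, targets := range table {
		if err := m.AddDependency(mod, targets...); err != nil {
			return err
		}
	}
	return nil
}

// viaExplicitCall keeps Server out of the table and adds it by hand.
func viaExplicitCall() ([]string, error) {
	m := newManager()
	if err := wire(m, map[string][]string{Querier: {Server}}); err != nil {
		return nil, err
	}
	if err := m.AddDependency(Server, QueryLimitsInterceptors); err != nil {
		return nil, err
	}
	return m.deps[Server], nil
}

// viaTable puts the Server entry into the table instead.
func viaTable() ([]string, error) {
	m := newManager()
	err := wire(m, map[string][]string{
		Querier: {Server},
		Server:  {QueryLimitsInterceptors},
	})
	return m.deps[Server], err
}

func main() {
	a, _ := viaExplicitCall()
	b, _ := viaTable()
	fmt.Println(a, b) // both list the interceptors module as a Server dependency
}
```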

jeschkies and others added 5 commits March 9, 2023 20:46
Signed-off-by: Callum Styan <callumstyan@gmail.com>
invalid json that we can't unmarshal)

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
@cstyan cstyan merged commit 5a85f66 into main Mar 10, 2023
@cstyan cstyan deleted the loki-per-query-limits branch March 10, 2023 05:21
jeschkies added a commit to jeschkies/loki that referenced this pull request Mar 10, 2023
jeschkies added a commit that referenced this pull request Mar 12, 2023
**What this PR does / why we need it**:
#8727 introduced per-request limits. These should be enforced for label
queries as well. This change adds the required middleware to all label
query endpoints.

**Checklist**
- [ ] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [ ] Documentation added
- [x] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`
@salvacorts salvacorts mentioned this pull request Mar 22, 2023
salvacorts added a commit that referenced this pull request Mar 23, 2023
**What this PR does / why we need it**:

This PR implements two new per-tenant limits that are enforced on log
and metric queries (both range and instant) when TSDB is used:

- `max_query_bytes_read`: Refuse queries that would read more than the
configured bytes here. Overall limit regardless of splitting/sharding.
The goal is to refuse queries that would take too long. The default
value of 0 disables this limit.

- `max_querier_bytes_read`: Refuse queries in which any of their
subqueries after splitting and sharding would read more than the
configured bytes here. The goal is to prevent a querier from running a
query that would load too much data into memory and potentially get
OOM-killed. The default value of 0 disables this limit.

These new limits can be configured per tenant and per query (see
#8727).
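
The way the two limits relate can be sketched as follows. This is a simplified illustration, not the frontend code: `checkBytesLimit` and `checkQuery` are hypothetical helpers, and the byte estimates are passed in as plain numbers rather than being fetched from the index stats.

```go
package main

import (
	"errors"
	"fmt"
)

// errTooManyBytes marks a query rejected for exceeding a bytes-read limit.
var errTooManyBytes = errors.New("query would read too many bytes")

// checkBytesLimit compares an estimate against a limit. A limit of 0
// disables the check, matching the documented default.
func checkBytesLimit(estimated, limit uint64) error {
	if limit == 0 {
		return nil
	}
	if estimated > limit {
		return fmt.Errorf("%w: estimated %d, limit %d", errTooManyBytes, estimated, limit)
	}
	return nil
}

// checkQuery applies max_query_bytes_read to the whole query and
// max_querier_bytes_read to every subquery after splitting and sharding.
func checkQuery(totalEstimate uint64, subqueryEstimates []uint64, maxQueryBytes, maxQuerierBytes uint64) error {
	if err := checkBytesLimit(totalEstimate, maxQueryBytes); err != nil {
		return err
	}
	for _, estimate := range subqueryEstimates {
		if err := checkBytesLimit(estimate, maxQuerierBytes); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	subqueries := []uint64{40, 60}
	fmt.Println(checkQuery(100, subqueries, 0, 0))  // both limits disabled
	fmt.Println(checkQuery(100, subqueries, 50, 0)) // whole query too large
	fmt.Println(checkQuery(100, subqueries, 0, 50)) // one subquery too large
}
```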

The bytes a query would read are estimated through TSDB's index stats.
Even though they are not exact, they are good enough to have a rough
estimation of whether a query is too big to run or not. For more details
on this refer to this discussion in the PR:
#8670 (comment).

Both limits are implemented in the frontend. Even though we considered
implementing `max_querier_bytes_read` in the querier, this way, the
limits for pre and post splitting/sharding queries are enforced close to
each other on the same component. Moreover, this way we can reduce the
number of index stats requests issued to the index gateways by reusing
the stats gathered while sharding the query.

With regard to how index stats requests are issued:
- We parallelize index stats requests by splitting them into queries
that span up to 24h since our indices are sharded by 24h periods. On top
of that, this prevents a single index gateway from processing a single
huge request like `{app=~".+"} for 30d`.
- If sharding is enabled and the query is shardable, for
`max_querier_bytes_read`, we re-use the stats requests issued by the
sharding middleware. Specifically, we look at the [bytesPerShard][1] to
enforce this limit.
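
The 24h splitting from the first bullet can be sketched roughly like this. The helper names are made up for the example, and the chunks simply start at the query's start time; whether the real code aligns them to index period boundaries is not stated here, so treat that as an open assumption.

```go
package main

import (
	"fmt"
	"time"
)

// interval is a half-open time range [start, end).
type interval struct {
	start, end time.Time
}

// splitByStep cuts [start, end) into consecutive chunks of at most step.
// Chunks are not aligned to day boundaries in this sketch.
func splitByStep(start, end time.Time, step time.Duration) []interval {
	if step <= 0 || !start.Before(end) {
		return []interval{{start: start, end: end}}
	}
	var out []interval
	for cur := start; cur.Before(end); cur = cur.Add(step) {
		next := cur.Add(step)
		if next.After(end) {
			next = end
		}
		out = append(out, interval{start: cur, end: next})
	}
	return out
}

// splitHours splits a range of totalHours, starting at the Unix epoch,
// into chunks of up to 24h. It exists only to keep the demo short.
func splitHours(totalHours int) []interval {
	start := time.Unix(0, 0).UTC()
	end := start.Add(time.Duration(totalHours) * time.Hour)
	return splitByStep(start, end, 24*time.Hour)
}

func main() {
	fmt.Println(len(splitHours(30 * 24))) // a 30d range yields 30 requests
	parts := splitHours(60)
	fmt.Println(len(parts), parts[2].end.Sub(parts[2].start)) // 3 12h0m0s
}
```

Each resulting interval could then be issued as its own index stats request, which is what allows the requests to run in parallel.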

Note that once we merge this PR and enable these limits, the load of
index stats requests will increase substantially and we may discover
bottlenecks in our index gateways and TSDB. After speaking with @owen-d,
we think it should be fine as, if needed, we can scale up our index
gateways and support caching index stats requests.

Here's a demo of this working:
<img width="1647" alt="image"
src="https://user-images.githubusercontent.com/8354290/226918478-d4b6c2fd-de4d-478a-9c8b-e38fe148fa95.png">

<img width="1647" alt="image"
src="https://user-images.githubusercontent.com/8354290/226918798-a71b1db8-ea68-4d00-933b-e5eb1524d240.png">


**Which issue(s) this PR fixes**:
This PR addresses grafana/loki-private#674.

**Special notes for your reviewer**:

- @jeschkies has reviewed the changes related to query-time limits.
- I've done some refactoring in this PR:
  - Extracted logic to get stats for a set of matchers into a new function
    [getStatsForMatchers][2].
  - Extracted the _Handler_ interface implementation for
    [queryrangebase.roundTripper][3] into a new type
    [queryrangebase.roundTripperHandler][4]. This is used to create the
    handler that skips the rest of the configured middlewares when sending
    an index stats request ([example][5]).

**Checklist**
- [x] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [x] Documentation added
- [x] Tests updated
- [x] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`


[1]:
https://github.com/grafana/loki/blob/ff847305afaf7de5eb56436f3683773e88701075/pkg/querier/queryrange/shard_resolver.go#L179-L186

[2]:
https://github.com/grafana/loki/blob/ff847305afaf7de5eb56436f3683773e88701075/pkg/querier/queryrange/shard_resolver.go#L72

[3]:
https://github.com/grafana/loki/blob/3d2fff3a2d416a48a73346a53ba7499b0eeb67f7/pkg/querier/queryrange/queryrangebase/roundtrip.go#L124

[4]:
https://github.com/grafana/loki/blob/3d2fff3a2d416a48a73346a53ba7499b0eeb67f7/pkg/querier/queryrange/queryrangebase/roundtrip.go#L163

[5]:
https://github.com/grafana/loki/blob/f422e0a52b743a11209b8276510feb2ab8241486/pkg/querier/queryrange/roundtrip.go#L521
salvacorts added a commit that referenced this pull request Mar 29, 2023
**What this PR does / why we need it**:
In #8727 we introduced various
limits that can now be configured at query time. We always compare the
value of the limit configured at query time with the value set in the
overrides for the tenant, or the default if not configured (aka
original), and apply the most restrictive one.

If the most restrictive is the original value or the limit is not
configured at query-time, we print the following debug message:

https://github.com/grafana/loki/blob/9e1846c47a1f3685fc540b7d03d285c7530da223/pkg/util/querylimits/limiter.go#L43

It will be printed many times: every time the limit is not configured at
query time, as well as every time the original value is more
restrictive. Moreover, this log message lacks some useful information
such as the original value, the query-time value, the tenant ID and the
limit name.

This PR fixes this by removing the debug message above and printing a
new debug message only if we return the query-time limit. This new log
also prints the tenant ID, the limit name, as well as the original and
the query-time limit values.
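
The behaviour described above can be sketched as follows. This is an illustration under assumptions, not the code in `limiter.go`: `resolveLimit` and `logFunc` are hypothetical names, a query-time value of 0 is taken to mean "not configured", and the log format is invented.

```go
package main

import (
	"fmt"
	"time"
)

// logFunc is a minimal stand-in for a debug logger.
type logFunc func(format string, args ...interface{})

// resolveLimit returns the more restrictive of the original (tenant or
// default) limit and the query-time limit. It logs only when the
// query-time limit is the one returned.
func resolveLimit(logf logFunc, tenantID, limitName string, original, queryTime time.Duration) time.Duration {
	// Assumption: a non-positive query-time value means "not configured".
	if queryTime <= 0 || queryTime >= original {
		return original
	}
	logf("using query-time limit tenant=%s limit=%s original=%s query_time=%s",
		tenantID, limitName, original, queryTime)
	return queryTime
}

func main() {
	logf := func(format string, args ...interface{}) {
		fmt.Printf(format+"\n", args...)
	}
	// The query-time limit is stricter, so it wins and one line is logged.
	fmt.Println(resolveLimit(logf, "tenant-a", "max_query_length", 7*24*time.Hour, 24*time.Hour))
	// No query-time limit configured: the original is returned silently.
	fmt.Println(resolveLimit(logf, "tenant-a", "max_query_length", 24*time.Hour, 0))
}
```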

**Which issue(s) this PR fixes**:
Fixes #8932

**Special notes for your reviewer**:

**Checklist**
- [ ] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [ ] Documentation added
- [ ] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`
Labels
size/XL type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories
Development

Successfully merging this pull request may close these issues.

Support per-request limits and propagate them to querier.
3 participants