Query-frontend: customisable query splitting for queries overlapping `query_ingester_within` window #11535

dannykopping · 2023-12-20T13:45:54Z

What this PR does / why we need it:
The config option query_ingesters_within defines the window during which logs could be present on ingesters, and as such queriers will send queries to ingesters instead.

split_queries_by_interval is defined to split queries into subqueries for increased parallelism.

Aggressive query splitting within the query_ingesters_within window can result in overloading ingesters with unnecessarily large numbers of subqueries, which perversely can impact writes.

query_ingesters_within is set to 3h by default. In Grafana Cloud Logs we set split_queries_by_interval as low as 15m (defaults to 1h), which would result in 3*60/15=12 requests. Every querier queries every ingester during this window, so that's 12 requests per ingester for every query whose time range includes the query_ingesters_within window (i.e. a query from now to now-7d would include the query_ingesters_within window, whereas a query from now-3h to now-7d would not).

However, we do want to split queries so an ingester won't have to handle a query for a full query_ingesters_within window - this could involve a large amount of data. To account for this, this PR introduces a new option split_ingester_queries_by_interval on the query-frontend; this setting is disabled by default.

Checklist

Reviewed the CONTRIBUTING.md guide (required)
Documentation added
Tests updated
CHANGELOG.md updated
- If the change is worth mentioning in the release notes, add add-to-release-notes label
Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR
If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

github-actions · 2023-12-20T13:49:25Z

Trivy scan found the following vulnerabilities:

HIGH, Target: docker.io/grafana/loki:main-3048e4a (alpine 3.18.4), Type: alpine openssl: Incorrect cipher key and IV length processing in libcrypto3 v3.1.3-r0. Fixed in v3.1.4-r0
HIGH, Target: docker.io/grafana/loki:main-3048e4a (alpine 3.18.4), Type: alpine openssl: Incorrect cipher key and IV length processing in libssl3 v3.1.3-r0. Fixed in v3.1.4-r0
To see more details on these vulnerabilities, and how/where to fix them, please run `docker build -t grafana/loki:main-3048e4a -f cmd/loki/Dockerfile .` followed by `trivy i grafana/loki:main-3048e4a` on your branch. If these were not introduced by your PR, please consider fixing them via a subsequent PR. Thanks!

ashwanthgoli · 2024-01-09T06:22:45Z

pkg/querier/queryrange/splitters.go

+	// for queries outside the `query_ingesters_within` window.
+	//
+	// The given factory is responsible for building the splits and appending to reqs.
+	newStart, newEnd := buildIngesterQuerySplitsAndRebound(execTime, s.limits, s.iqo, tenantIDs, lokiReq, factory, false)


for metric queries, this does not round the split boundaries to the step. is that intentional?

You're right, I can't use this function here; thanks! I'll fix this.

Great catch on this @ashwanthgoli! I've addressed it in b7dfa09, PTAL.

pkg/querier/queryrange/splitters.go

ashwanthgoli · 2024-01-09T08:46:42Z

something to be aware of: caching middlewares downstream expect the sub-queries to be split and aligned by split_queries_by_interval or other related configs based on the request type. This change breaks that assumption.

for ex: a cache key spanning an interval of 15m might store a larger 1h extent.
this may not affect the correctness for metric queries since we extract what we need, but it stores results that do not belong to the cache key.

dannykopping · 2024-01-10T09:05:14Z

something to be aware of: caching middlewares downstream expect the sub-queries to be split and aligned by split_queries_by_interval or other related configs based on the request type. This change breaks that assumption.

for ex: a cache key spanning an interval of 15m might store a larger 1h extent. this may not affect the correctness for metric queries since we extract what we need, but it stores results that do not belong to the cache key.

Hmm, what should we do about this? I guess we shouldn't be caching metric queries within query_ingesters_within, since the values can change anyway while logs are in the head block?

ashwanthgoli · 2024-01-11T06:26:11Z

pkg/querier/queryrange/splitters.go

+		s.buildMetricSplits(lokiReq.GetStep(), ingesterQueryInterval, start, end, factory)
+
+		// rebound after ingester queries have been split out
+		end = start


we should also subtract step from end to leave a step gap on the ingester split boundary, similar to what we do with each split.

loki/pkg/querier/queryrange/splitters.go

Line 155 in b7dfa09

// Round up to the step before the next interval boundary.

Implemented in 75ee015, good catch!

ashwanthgoli · 2024-01-11T06:52:04Z

@dannykopping should we also make the cache layer aware of the ingester splits and pick a different interval for the cache key based on the query range being served?

I guess we shouldn't be caching metric queries within query_ingesters_within since the values can change anyway while logs are in the head block?

problem with this is that the current defaults for max_cache_freshness are much smaller, sub 30m. this might have some performance impact

dannykopping · 2024-01-11T13:20:49Z

@dannykopping should we also make the cache layer aware of the ingester splits and pick a different interval for the cache key based on the query range being served?

I guess we shouldn't be caching metric queries within query_ingesters_within since the values can change anyway while logs are in the head block?

problem with this is that the current defaults for max_cache_freshness are much smaller, sub 30m. this might have some performance impact

Great idea! I'll do this in a follow-up PR.

ashwanthgoli

lgtm

vlad-diachenko · 2024-01-12T08:58:20Z

pkg/querier/queryrange/splitters.go

+
+func (s *metricQuerySplitter) alignStartEnd(step int64, start, end time.Time) (time.Time, time.Time) {
+	// step align start and end time of the query. Start time is rounded down and end time is rounded up.
+	stepNs := step * 1e6


fingers crossed that step is ms ;)

pkg/validation/limits.go

vlad-diachenko

looks awesome 🚀

pull-request-size bot added the size/XL label Dec 20, 2023

Danny Kopping added 4 commits December 28, 2023 13:40

Initial working implementation

09e0858

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>

Don't affect splitting if query_store_only is true

6db4c06

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>

Refactoring

c4b1871

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>

Fix up tests

628cace

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>

dannykopping force-pushed the dannykopping/max-ingester-splits branch 2 times, most recently from 11c9f8a to 98c3234 Compare January 3, 2024 12:33

Danny Kopping added 2 commits January 3, 2024 14:35

Move splitter to own type

0f6d6cd

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>

WIP: working prototype

83fac1f

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>

dannykopping force-pushed the dannykopping/max-ingester-splits branch 2 times, most recently from d5b2072 to 58d6af1 Compare January 4, 2024 12:16

Refactoring

3f7b74c

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>

dannykopping force-pushed the dannykopping/max-ingester-splits branch from 58d6af1 to 3f7b74c Compare January 4, 2024 12:32

Danny Kopping added 5 commits January 4, 2024 14:34

Merge remote-tracking branch 'upstream/main' into dannykopping/max-in…

41d21b1

…gester-splits

Refactor after merge

438c659

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>

Working impl, adding some tests

d58c1f7

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>

Minor refactoring to subtests

e4c3ecd

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>

Add more test cases

b8e3827

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>

pull-request-size bot added size/XXL and removed size/XL labels Jan 5, 2024

Danny Kopping added 2 commits January 8, 2024 11:53

Merge branch 'main' of github.com:grafana/loki into dannykopping/max-…

ea70a51

…ingester-splits

Splitting queries during ingester query windows specially

afd63dc

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>

github-actions bot added the type/docs label Jan 8, 2024

make doc, fixing linter issue

75980df

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>

dannykopping changed the title ~~[DRAFT] Limit query splitting for query_ingester_within window~~ Customisable query splitting for queries overlapping query_ingester_within window Jan 8, 2024

dannykopping changed the title ~~Customisable query splitting for queries overlapping query_ingester_within window~~ Query-frontend: customisable query splitting for queries overlapping query_ingester_within window Jan 8, 2024

CHANGELOG

ac70af3

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>

dannykopping force-pushed the dannykopping/max-ingester-splits branch from 3b5f3ba to ac70af3 Compare January 8, 2024 13:13

Checking for QueryStoreOnly

cb77f49

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>

dannykopping marked this pull request as ready for review January 8, 2024 13:44

dannykopping requested a review from a team as a code owner January 8, 2024 13:44

ashwanthgoli reviewed Jan 9, 2024

Danny Kopping added 2 commits January 9, 2024 15:02

Revert use of SplitGap to time.Millisecond

78b2c5a

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>

Align metric query splits with step

b7dfa09

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>

ashwanthgoli reviewed Jan 11, 2024

Danny Kopping added 2 commits January 11, 2024 14:32

Step-align end split from ingester query window

75ee015

Signed-off-by: Danny Kopping <danny.kopping@grafana.com>

Merge branch 'main' of github.com:grafana/loki into dannykopping/max-…

292130e

…ingester-splits

ashwanthgoli approved these changes Jan 11, 2024

dannykopping merged commit bcd0315 into grafana:main Jan 11, 2024
8 checks passed

dannykopping deleted the dannykopping/max-ingester-splits branch January 11, 2024 13:56

vlad-diachenko reviewed Jan 12, 2024

vlad-diachenko reviewed Jan 12, 2024

dannykopping mentioned this pull request Jan 15, 2024

Query-frontend: use the same query split interval for generated cache keys #11679

Merged

8 tasks
