Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query-frontend: customisable query splitting for queries overlapping query_ingester_within window #11535

Merged

Conversation

dannykopping
Copy link
Contributor

@dannykopping dannykopping commented Dec 20, 2023

What this PR does / why we need it:
The config option query_ingesters_within defines the window during which logs could be present on ingesters, and as such queriers will send queries to ingesters instead.

split_queries_by_interval is defined to split queries into subqueries for increased parallelism.

Aggressive query splitting within the query_ingesters_within window can result in overloading ingesters with unnecessarily large numbers of subqueries, which perversely can impact writes.

query_ingesters_within is set to 3h by default. In Grafana Cloud Logs we set split_queries_by_interval as low as 15m (defaults to 1h), which would result in result in 3*60/15=12 requests. Every querier queries every ingester during this window, so that's 12 requests per ingester per query which has the query_ingesters_within window in its time range (i.e. a query from now to now-7d would include the query_ingesters_within window as well, now-3h to now-7d would not).

However, we do want to split queries so an ingester won't have to handle a query for a full query_ingesters_within window - this could involve a large amount of data. To account for this, this PR introduces a new option split_ingester_queries_by_interval on the query-frontend; this setting is disabled by default.

image

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
    • If the change is worth mentioning in the release notes, add add-to-release-notes label
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

Copy link
Contributor

github-actions bot commented Dec 20, 2023

Trivy scan found the following vulnerabilities:

  • HIGH, Target: docker.io/grafana/loki:main-3048e4a (alpine 3.18.4), Type: alpine openssl: Incorrect cipher key and IV length processing in libcrypto3 v3.1.3-r0. Fixed in v3.1.4-r0
  • HIGH, Target: docker.io/grafana/loki:main-3048e4a (alpine 3.18.4), Type: alpine openssl: Incorrect cipher key and IV length processing in libssl3 v3.1.3-r0. Fixed in v3.1.4-r0
    \nTo see more details on these vulnerabilities, and how/where to fix them, please run docker build -t grafana/loki:main-3048e4a -f cmd/loki/Dockerfile .
    trivy i grafana/loki:main-3048e4a on your branch. If these were not introduced by your PR, please considering fixing them in via a subsequent PR. Thanks!

Danny Kopping added 4 commits December 28, 2023 13:40
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
@dannykopping dannykopping force-pushed the dannykopping/max-ingester-splits branch 2 times, most recently from 11c9f8a to 98c3234 Compare January 3, 2024 12:33
Danny Kopping added 2 commits January 3, 2024 14:35
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
@dannykopping dannykopping force-pushed the dannykopping/max-ingester-splits branch 2 times, most recently from d5b2072 to 58d6af1 Compare January 4, 2024 12:16
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
@dannykopping dannykopping force-pushed the dannykopping/max-ingester-splits branch from 58d6af1 to 3f7b74c Compare January 4, 2024 12:32
Danny Kopping added 5 commits January 4, 2024 14:34
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Danny Kopping added 2 commits January 8, 2024 11:53
@github-actions github-actions bot added the type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories label Jan 8, 2024
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
@dannykopping dannykopping changed the title [DRAFT] Limit query splitting for query_ingester_within window Customisable query splitting for queries overlapping query_ingester_within window Jan 8, 2024
@dannykopping dannykopping changed the title Customisable query splitting for queries overlapping query_ingester_within window Query-frontend: customisable query splitting for queries overlapping query_ingester_within window Jan 8, 2024
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
@dannykopping dannykopping force-pushed the dannykopping/max-ingester-splits branch from 3b5f3ba to ac70af3 Compare January 8, 2024 13:13
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
@dannykopping dannykopping marked this pull request as ready for review January 8, 2024 13:44
@dannykopping dannykopping requested a review from a team as a code owner January 8, 2024 13:44
// for queries outside the `query_ingesters_within` window.
//
// The given factory is responsible for building the splits and appending to reqs.
newStart, newEnd := buildIngesterQuerySplitsAndRebound(execTime, s.limits, s.iqo, tenantIDs, lokiReq, factory, false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for metric queries, this does not round the split boundaries to the step. is that intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I can't use this function here; thanks! I'll fix this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch on this @ashwanthgoli! I've addressed it in b7dfa09, PTAL.

@ashwanthgoli
Copy link
Contributor

something to be aware of: caching middlewares downstream expect the sub-queries to be split and aligned by split_queries_by_interval or other related configs based on the request type. This change breaks that assumption.

for ex: a cache key spanning an interval of 15m might store a larger 1h extent.
this may not affect the correctness for metric queries since we extract what we need, but it stores results that do not belong to the cache key.

Danny Kopping added 2 commits January 9, 2024 15:02
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
@dannykopping
Copy link
Contributor Author

something to be aware of: caching middlewares downstream expect the sub-queries to be split and aligned by split_queries_by_interval or other related configs based on the request type. This change breaks that assumption.

for ex: a cache key spanning an interval of 15m might store a larger 1h extent. this may not affect the correctness for metric queries since we extract what we need, but it stores results that do not belong to the cache key.

Hhmm, what should we do about this? I guess we shouldn't be caching metric queries within query_ingesters_within since the values can change anyway while logs are in the head block?

s.buildMetricSplits(lokiReq.GetStep(), ingesterQueryInterval, start, end, factory)

// rebound after ingester queries have been split out
end = start
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should also substract step from end to leave a step gap on the ingester split boundary similar to what we do with each split.

// Round up to the step before the next interval boundary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented in 75ee015, good catch!

@ashwanthgoli
Copy link
Contributor

@dannykopping should we also make the cache layer aware of the ingester splits and pick a different interval for the cache key based on the query range being served?

I guess we shouldn't be caching metric queries within query_ingesters_within since the values can change anyway while logs are in the head block?

problem with this is that the current defaults for max_cache_freshness are much smaller, sub 30m. this might have some performance impact

Danny Kopping added 2 commits January 11, 2024 14:32
Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
@dannykopping
Copy link
Contributor Author

@dannykopping should we also make the cache layer aware of the ingester splits and pick a different interval for the cache key based on the query range being served?

I guess we shouldn't be caching metric queries within query_ingesters_within since the values can change anyway while logs are in the head block?

problem with this is that the current defaults for max_cache_freshness are much smaller, sub 30m. this might have some performance impact

Great idea! I'll do this in a follow-up PR.

Copy link
Contributor

@ashwanthgoli ashwanthgoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@dannykopping dannykopping merged commit bcd0315 into grafana:main Jan 11, 2024
8 checks passed
@dannykopping dannykopping deleted the dannykopping/max-ingester-splits branch January 11, 2024 13:56

func (s *metricQuerySplitter) alignStartEnd(step int64, start, end time.Time) (time.Time, time.Time) {
// step align start and end time of the query. Start time is rounded down and end time is rounded up.
stepNs := step * 1e6
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fingers crossed that step is ms ;)

Copy link
Contributor

@vlad-diachenko vlad-diachenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks awesome 🚀

dannykopping pushed a commit that referenced this pull request Jan 18, 2024
… keys (#11679)

**What this PR does / why we need it**:
Follow up from #11535 (specifically from [this
thread](#11535 (comment))),
this PR modifies the results cache implementation to use the same
interval for generating cache keys.
dannykopping pushed a commit that referenced this pull request Jan 18, 2024
…`query_ingester_within` window (#11535)

**What this PR does / why we need it**:
The config option `query_ingesters_within` defines the window during
which logs _could_ be present on ingesters, and as such queriers will
send queries to ingesters instead.

`split_queries_by_interval` is defined to split queries into subqueries
for increased parallelism.

Aggressive query splitting within the `query_ingesters_within` window
can result in overloading ingesters with unnecessarily large numbers of
subqueries, which perversely can impact writes.

`query_ingesters_within` is set to 3h by default. In Grafana Cloud Logs
we set `split_queries_by_interval` as low as 15m (defaults to 1h), which
would result in result in 3*60/15=12 requests. Every querier queries
every ingester during this window, so that's 12 requests _per ingester
per query_ which has the `query_ingesters_within` window in its time
range _(i.e. a query from now to now-7d would include the
`query_ingesters_within` window as well, now-3h to now-7d would not)_.

However, we _do_ want to split queries so an ingester won't have to
handle a query for a full `query_ingesters_within` window - this could
involve a large amount of data. To account for this, this PR introduces
a new option `split_ingester_queries_by_interval` on the query-frontend;
this setting is disabled by default.

![image](https://github.com/grafana/loki/assets/373762/2e671bd8-9e8d-4bf3-addf-bebcfc25e8d7)
dannykopping pushed a commit that referenced this pull request Jan 18, 2024
… keys (#11679)

**What this PR does / why we need it**:
Follow up from #11535 (specifically from [this
thread](#11535 (comment))),
this PR modifies the results cache implementation to use the same
interval for generating cache keys.
rhnasc pushed a commit to inloco/loki that referenced this pull request Apr 12, 2024
…`query_ingester_within` window (grafana#11535)

**What this PR does / why we need it**:
The config option `query_ingesters_within` defines the window during
which logs _could_ be present on ingesters, and as such queriers will
send queries to ingesters instead.

`split_queries_by_interval` is defined to split queries into subqueries
for increased parallelism.

Aggressive query splitting within the `query_ingesters_within` window
can result in overloading ingesters with unnecessarily large numbers of
subqueries, which perversely can impact writes.

`query_ingesters_within` is set to 3h by default. In Grafana Cloud Logs
we set `split_queries_by_interval` as low as 15m (defaults to 1h), which
would result in result in 3*60/15=12 requests. Every querier queries
every ingester during this window, so that's 12 requests _per ingester
per query_ which has the `query_ingesters_within` window in its time
range _(i.e. a query from now to now-7d would include the
`query_ingesters_within` window as well, now-3h to now-7d would not)_.

However, we _do_ want to split queries so an ingester won't have to
handle a query for a full `query_ingesters_within` window - this could
involve a large amount of data. To account for this, this PR introduces
a new option `split_ingester_queries_by_interval` on the query-frontend;
this setting is disabled by default.


![image](https://github.com/grafana/loki/assets/373762/2e671bd8-9e8d-4bf3-addf-bebcfc25e8d7)
rhnasc pushed a commit to inloco/loki that referenced this pull request Apr 12, 2024
… keys (grafana#11679)

**What this PR does / why we need it**:
Follow up from grafana#11535 (specifically from [this
thread](grafana#11535 (comment))),
this PR modifies the results cache implementation to use the same
interval for generating cache keys.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size/XXL type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants