Loki: introduce an exponential backoff on failed query retries. #10959

Merged 2 commits into main from backoff-on-query-retry on Oct 19, 2023

Conversation

@slim-bean (Collaborator) commented Oct 18, 2023

What this PR does / why we need it:

Currently, any subquery that fails with a 5xx status code is retried by the query frontend immediately, and by default the frontend makes a total of 5 attempts to execute the subquery, ultimately failing the query if those attempts are exhausted.

We have noticed that subqueries can sometimes fail extremely quickly, mostly when queriers hit gRPC-related issues between components such as ingesters or index gateways.

In these cases a query can exhaust all of its retries in less than a second and fail back to the user.

Also, when a query puts heavy memory pressure on a pool of queriers and causes them to crash with OOMs, the lack of a backoff on retries makes this behavior quite aggressive and can cause disruption if you don't have enough queriers available.

This PR introduces a simple exponential backoff on retries of failed queries. This gives downstream gRPC issues time to resolve, and it slows down queries that cause heavy memory pressure and OOMs, slightly reducing their impact.
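
As a rough illustration of the behavior this change introduces, here is a minimal, self-contained sketch of a retry loop with exponential backoff between failed attempts. It is not Loki's implementation; the `doSubquery` function and the delay/retry values are hypothetical placeholders.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryWithBackoff retries fn up to maxRetries times, sleeping an
// exponentially growing delay (capped at maxDelay) between failed attempts.
func retryWithBackoff(ctx context.Context, maxRetries int, minDelay, maxDelay time.Duration, fn func(context.Context) error) error {
	delay := minDelay
	var lastErr error
	for try := 0; try < maxRetries; try++ {
		if lastErr = fn(ctx); lastErr == nil {
			return nil
		}
		// Wait before the next attempt, unless the caller has given up.
		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return ctx.Err()
		}
		// Double the delay for the next retry, up to the cap.
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return lastErr
}

func main() {
	// doSubquery is a hypothetical stand-in for executing a subquery.
	doSubquery := func(ctx context.Context) error { return errors.New("5xx from querier") }

	err := retryWithBackoff(context.Background(), 5, 250*time.Millisecond, 10*time.Second, doSubquery)
	fmt.Println("final result:", err)
}
```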

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
    • If the change is worth mentioning in the release notes, add the add-to-release-notes label
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR

Signed-off-by: Edward Welch <edward.welch@grafana.com>
@slim-bean requested a review from a team as a code owner on October 18, 2023 18:43
Signed-off-by: Edward Welch <edward.welch@grafana.com>
@dannykopping (Contributor) left a comment

LGTM, a test would be helpful

@@ -70,7 +86,8 @@ func (r retry) Do(ctx context.Context, req Request) (Response, error) {
 		httpResp, ok := httpgrpc.HTTPResponseFromError(err)
 		if !ok || httpResp.Code/100 == 5 {
 			lastErr = err
-			level.Error(util_log.WithContext(ctx, r.log)).Log("msg", "error processing request", "try", tries, "query", req.GetQuery(), "err", err)
+			level.Error(util_log.WithContext(ctx, r.log)).Log("msg", "error processing request", "try", tries, "query", req.GetQuery(), "retry_in", bk.NextDelay(), "err", err)
Contributor

NextDelay() 👍 nice touch
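
To make the `bk.NextDelay()` call in the diff above concrete, here is a minimal sketch of a backoff helper of that shape: it reports the upcoming delay (useful for the `retry_in` log field) and waits it out before the next attempt. The type and values below are illustrative, not Loki's actual backoff implementation or configuration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// backoff tracks the delay to apply before the next retry.
// The shape mirrors the bk.NextDelay() call above; names are illustrative.
type backoff struct {
	next, max time.Duration
}

// NextDelay reports the delay that will be used before the next attempt,
// which is handy for log fields such as "retry_in".
func (b *backoff) NextDelay() time.Duration { return b.next }

// Wait sleeps for the current delay (or returns early if ctx is cancelled),
// then doubles the delay, capped at the configured maximum.
func (b *backoff) Wait(ctx context.Context) {
	select {
	case <-time.After(b.next):
	case <-ctx.Done():
	}
	b.next *= 2
	if b.next > b.max {
		b.next = b.max
	}
}

func main() {
	ctx := context.Background()
	bk := &backoff{next: 100 * time.Millisecond, max: 2 * time.Second}

	for try := 0; try < 5; try++ {
		err := fmt.Errorf("simulated 5xx from querier") // stand-in for a failed subquery
		fmt.Printf("error processing request try=%d retry_in=%s err=%v\n", try, bk.NextDelay(), err)
		bk.Wait(ctx)
	}
}
```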

@slim-bean merged commit d66dd30 into main on Oct 19, 2023
4 checks passed
@slim-bean deleted the backoff-on-query-retry branch on October 19, 2023 13:32
slim-bean added a commit that referenced this pull request on Oct 25, 2023 (cherry picked from commit d66dd30)
slim-bean added a commit that referenced this pull request on Oct 25, 2023 (cherry picked from commit d66dd30)
rhnasc pushed a commit to inloco/loki that referenced this pull request on Apr 12, 2024