Loki: introduce an exponential backoff on failed query retries. #10959

Merged 2 commits into main from backoff-on-query-retry on Oct 19, 2023

Conversation

@slim-bean (Collaborator) commented Oct 18, 2023

What this PR does / why we need it:

Currently, any subquery that fails with a 5xx status code is retried by the query frontend immediately, and by default the frontend makes a total of 5 attempts to execute the subquery, ultimately failing the query if those attempts are exhausted.

We have noticed that subqueries can sometimes fail extremely quickly, mostly when queriers hit gRPC-related issues between components such as ingesters or index gateways.

In these cases a query can exhaust all of its retries in less than a second and fail back to the user.

Also, when a query puts heavy memory pressure on a pool of queriers and causes them to crash with OOMs, the lack of a backoff on retries makes this behavior quite aggressive and can cause disruption if you don't have enough queriers available.

This PR introduces a simple exponential backoff on retries of failed queries. This gives downstream gRPC issues time to resolve, and it slows down queries that cause heavy memory pressure and OOMs, slightly reducing their impact.
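
As a rough illustration of the behavior this change introduces, here is a minimal, self-contained sketch of a retry loop with exponential backoff between failed attempts. It is not Loki's implementation; the `doSubquery` function and the delay/retry values are hypothetical placeholders.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryWithBackoff retries fn up to maxRetries times, sleeping an
// exponentially growing delay (capped at maxDelay) between failed attempts.
func retryWithBackoff(ctx context.Context, maxRetries int, minDelay, maxDelay time.Duration, fn func(context.Context) error) error {
	delay := minDelay
	var lastErr error
	for try := 0; try < maxRetries; try++ {
		if lastErr = fn(ctx); lastErr == nil {
			return nil
		}
		// Wait before the next attempt, unless the caller has given up.
		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return ctx.Err()
		}
		// Double the delay for the next retry, up to the cap.
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return lastErr
}

func main() {
	// doSubquery is a hypothetical stand-in for executing a subquery.
	doSubquery := func(ctx context.Context) error { return errors.New("5xx from querier") }

	err := retryWithBackoff(context.Background(), 5, 250*time.Millisecond, 10*time.Second, doSubquery)
	fmt.Println("final result:", err)
}
```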

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
    • If the change is worth mentioning in the release notes, add the add-to-release-notes label
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR

Signed-off-by: Edward Welch <edward.welch@grafana.com>
@slim-bean requested a review from a team as a code owner on October 18, 2023 18:43
Signed-off-by: Edward Welch <edward.welch@grafana.com>
@dannykopping (Contributor) left a comment

LGTM, a test would be helpful

@@ -70,7 +86,8 @@ func (r retry) Do(ctx context.Context, req Request) (Response, error) {
 		httpResp, ok := httpgrpc.HTTPResponseFromError(err)
 		if !ok || httpResp.Code/100 == 5 {
 			lastErr = err
-			level.Error(util_log.WithContext(ctx, r.log)).Log("msg", "error processing request", "try", tries, "query", req.GetQuery(), "err", err)
+			level.Error(util_log.WithContext(ctx, r.log)).Log("msg", "error processing request", "try", tries, "query", req.GetQuery(), "retry_in", bk.NextDelay(), "err", err)
Contributor

NextDelay() 👍 nice touch
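
To make the `bk.NextDelay()` call in the diff above concrete, here is a minimal sketch of a backoff helper of that shape: it reports the upcoming delay (useful for the `retry_in` log field) and waits it out before the next attempt. The type and values below are illustrative, not Loki's actual backoff implementation or configuration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// backoff tracks the delay to apply before the next retry.
// The shape mirrors the bk.NextDelay() call above; names are illustrative.
type backoff struct {
	next, max time.Duration
}

// NextDelay reports the delay that will be used before the next attempt,
// which is handy for log fields such as "retry_in".
func (b *backoff) NextDelay() time.Duration { return b.next }

// Wait sleeps for the current delay (or returns early if ctx is cancelled),
// then doubles the delay, capped at the configured maximum.
func (b *backoff) Wait(ctx context.Context) {
	select {
	case <-time.After(b.next):
	case <-ctx.Done():
	}
	b.next *= 2
	if b.next > b.max {
		b.next = b.max
	}
}

func main() {
	ctx := context.Background()
	bk := &backoff{next: 100 * time.Millisecond, max: 2 * time.Second}

	for try := 0; try < 5; try++ {
		err := fmt.Errorf("simulated 5xx from querier") // stand-in for a failed subquery
		fmt.Printf("error processing request try=%d retry_in=%s err=%v\n", try, bk.NextDelay(), err)
		bk.Wait(ctx)
	}
}
```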

@slim-bean merged commit d66dd30 into main on Oct 19, 2023
4 checks passed
@slim-bean deleted the backoff-on-query-retry branch on October 19, 2023 13:32
slim-bean added a commit that referenced this pull request on Oct 25, 2023 (cherry picked from commit d66dd30)
slim-bean added a commit that referenced this pull request on Oct 25, 2023 (cherry picked from commit d66dd30)
rhnasc pushed a commit to inloco/loki that referenced this pull request on Apr 12, 2024