Loki: introduce an exponential backoff on failed query retries. (#10959)
**What this PR does / why we need it**:

Currently, any subquery that fails with a 5xx status code is retried by
the query frontend immediately. By default the frontend makes a total of
5 attempts to execute the subquery, and the query ultimately fails if
those attempts are exhausted.

We have noticed that subqueries can sometimes fail extremely quickly,
mostly when queriers hit gRPC-related issues between components such as
ingesters or index gateways.

In these cases a query can exhaust all of its retries in under a second
and the failure is returned to the user.

Likewise, when a query puts heavy memory pressure on a pool of queriers
and causes them to crash with OOMs, the lack of a backoff between
retries makes this behavior quite aggressive and can cause disruption if
you don't have enough queriers available.

This PR introduces a simple exponential backoff on retries of failed
queries. This gives the downstream gRPC issues time to resolve, and it
slows down queries that create heavy memory pressure and trigger OOMs,
reducing their impact somewhat. A sketch of the resulting delay schedule
is shown below.
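
The delay bounds below mirror the values hard-coded in the patch; the snippet itself is only an illustrative sketch of how dskit's `backoff` package produces that schedule, not code from this PR. dskit randomizes each wait within its current exponential range, so the printed delays are approximate:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/grafana/dskit/backoff"
)

func main() {
	// Same bounds the patch hard-codes in the retry middleware.
	bk := backoff.New(context.Background(), backoff.Config{
		MinBackoff: 250 * time.Millisecond,
		MaxBackoff: 5 * time.Second,
		MaxRetries: 0, // no limit here; the retry middleware bounds attempts itself
	})

	// With the default of 5 attempts, retries 1..4 each wait for the next
	// delay in the schedule, roughly doubling until MaxBackoff is reached.
	for try := 1; try < 5; try++ {
		fmt.Printf("retry %d: waiting ~%v\n", try, bk.NextDelay())
	}
}
```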

**Which issue(s) this PR fixes**:
Fixes #<issue number>

**Special notes for your reviewer**:

**Checklist**
- [ ] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [ ] Documentation added
- [ ] Tests updated
- [x] `CHANGELOG.md` updated
- [ ] If the change is worth mentioning in the release notes, add
`add-to-release-notes` label
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/setup/upgrade/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in
`production/helm/loki/Chart.yaml` and update
`production/helm/loki/CHANGELOG.md` and
`production/helm/loki/README.md`. [Example
PR](d10549e)

---------

Signed-off-by: Edward Welch <edward.welch@grafana.com>
(cherry picked from commit d66dd30)
slim-bean committed Oct 25, 2023
1 parent 09927db commit 2eff30b
Showing 2 changed files with 19 additions and 1 deletion.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -34,6 +34,7 @@
* [10366](https://github.com/grafana/loki/pull/10366) **shantanualsi** Upgrade thanos objstore, dskit and other modules
* [10451](https://github.com/grafana/loki/pull/10451) **shantanualsi** Upgrade thanos `objstore`
* [10814](https://github.com/grafana/loki/pull/10814) **shantanualsi,kaviraj** Upgrade prometheus to v0.47.1 and dskit
+* [10959](https://github.com/grafana/loki/pull/10959) **slim-bean** introduce a backoff wait on subquery retries.

#### Promtail

19 changes: 18 additions & 1 deletion pkg/querier/queryrange/queryrangebase/retry.go
@@ -2,9 +2,11 @@ package queryrangebase

import (
	"context"
+	"time"

	"github.com/go-kit/log"
	"github.com/go-kit/log/level"
+	"github.com/grafana/dskit/backoff"
	"github.com/grafana/dskit/httpgrpc"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
@@ -57,6 +59,20 @@ func (r retry) Do(ctx context.Context, req Request) (Response, error) {
	defer func() { r.metrics.retriesCount.Observe(float64(tries)) }()

	var lastErr error
+
+	// For the default of 5 tries
+	// try 0: no delay
+	// try 1: 250ms wait
+	// try 2: 500ms wait
+	// try 3: 1s wait
+	// try 4: 2s wait
+
+	cfg := backoff.Config{
+		MinBackoff: 250 * time.Millisecond,
+		MaxBackoff: 5 * time.Second,
+		MaxRetries: 0,
+	}
+	bk := backoff.New(ctx, cfg)
	for ; tries < r.maxRetries; tries++ {
		if ctx.Err() != nil {
			return nil, ctx.Err()
@@ -70,7 +86,8 @@ func (r retry) Do(ctx context.Context, req Request) (Response, error) {
		httpResp, ok := httpgrpc.HTTPResponseFromError(err)
		if !ok || httpResp.Code/100 == 5 {
			lastErr = err
-			level.Error(util_log.WithContext(ctx, r.log)).Log("msg", "error processing request", "try", tries, "query", req.GetQuery(), "err", err)
+			level.Error(util_log.WithContext(ctx, r.log)).Log("msg", "error processing request", "try", tries, "query", req.GetQuery(), "retry_in", bk.NextDelay(), "err", err)
+			bk.Wait()
			continue
		}
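
For readers outside the Loki codebase, here is a minimal self-contained sketch of the pattern the patch applies: a bounded retry loop that waits on a dskit backoff between failed attempts. The helper names (`doWithRetries`, `fn`) are illustrative only and are not Loki's actual retry middleware API.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"github.com/grafana/dskit/backoff"
)

// doWithRetries retries fn up to maxRetries times, sleeping with an
// exponential backoff between failed attempts (illustrative only).
func doWithRetries(ctx context.Context, maxRetries int, fn func(context.Context) error) error {
	bk := backoff.New(ctx, backoff.Config{
		MinBackoff: 250 * time.Millisecond,
		MaxBackoff: 5 * time.Second,
	})

	var lastErr error
	for try := 0; try < maxRetries; try++ {
		if ctx.Err() != nil {
			return ctx.Err()
		}
		if lastErr = fn(ctx); lastErr == nil {
			return nil
		}
		bk.Wait() // sleeps for the current delay, then grows it (capped at MaxBackoff)
	}
	return lastErr
}

func main() {
	err := doWithRetries(context.Background(), 5, func(context.Context) error {
		return errors.New("simulated 5xx from a querier")
	})
	fmt.Println("final error:", err)
}
```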

