Loki: introduce an exponential backoff on failed query retries. (#10959)
**What this PR does / why we need it**:

Currently, any subquery that fails with a 5xx status code is retried by
the query frontend immediately. By default the frontend makes a total of
5 attempts to execute the subquery, and the query ultimately fails if
those attempts are exhausted.

We have noticed that subqueries can sometimes fail extremely quickly,
mostly when queriers hit gRPC-related issues between components such as
ingesters or index gateways.

In these cases a query can exhaust all of its retries in under a second
and the failure is returned to the user.

Likewise, when a query puts heavy memory pressure on a pool of queriers
and causes them to crash with OOMs, the lack of a backoff between
retries makes this behavior quite aggressive and can cause disruption if
you don't have enough queriers available.

This PR introduces a simple exponential backoff on retries of failed
queries. This gives the downstream gRPC issues time to resolve, and it
slows down queries that create heavy memory pressure and trigger OOMs,
reducing their impact somewhat. A sketch of the resulting delay schedule
is shown below.
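
The delay bounds below mirror the values hard-coded in the patch; the snippet itself is only an illustrative sketch of how dskit's `backoff` package produces that schedule, not code from this PR. dskit randomizes each wait within its current exponential range, so the printed delays are approximate:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/grafana/dskit/backoff"
)

func main() {
	// Same bounds the patch hard-codes in the retry middleware.
	bk := backoff.New(context.Background(), backoff.Config{
		MinBackoff: 250 * time.Millisecond,
		MaxBackoff: 5 * time.Second,
		MaxRetries: 0, // no limit here; the retry middleware bounds attempts itself
	})

	// With the default of 5 attempts, retries 1..4 each wait for the next
	// delay in the schedule, roughly doubling until MaxBackoff is reached.
	for try := 1; try < 5; try++ {
		fmt.Printf("retry %d: waiting ~%v\n", try, bk.NextDelay())
	}
}
```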

**Which issue(s) this PR fixes**:
Fixes #<issue number>

**Special notes for your reviewer**:

**Checklist**
- [ ] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [ ] Documentation added
- [ ] Tests updated
- [x] `CHANGELOG.md` updated
- [ ] If the change is worth mentioning in the release notes, add
`add-to-release-notes` label
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/setup/upgrade/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in
`production/helm/loki/Chart.yaml` and update
`production/helm/loki/CHANGELOG.md` and
`production/helm/loki/README.md`. [Example
PR](d10549e)

---------

Signed-off-by: Edward Welch <edward.welch@grafana.com>
(cherry picked from commit d66dd30)
slim-bean committed Oct 25, 2023
1 parent 09927db commit 2eff30b
Showing 2 changed files with 19 additions and 1 deletion.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -34,6 +34,7 @@
* [10366](https://github.com/grafana/loki/pull/10366) **shantanualsi** Upgrade thanos objstore, dskit and other modules
* [10451](https://github.com/grafana/loki/pull/10451) **shantanualsi** Upgrade thanos `objstore`
* [10814](https://github.com/grafana/loki/pull/10814) **shantanualsi,kaviraj** Upgrade prometheus to v0.47.1 and dskit
+* [10959](https://github.com/grafana/loki/pull/10959) **slim-bean** introduce a backoff wait on subquery retries.

#### Promtail

19 changes: 18 additions & 1 deletion pkg/querier/queryrange/queryrangebase/retry.go
@@ -2,9 +2,11 @@ package queryrangebase

import (
	"context"
+	"time"

	"github.com/go-kit/log"
	"github.com/go-kit/log/level"
+	"github.com/grafana/dskit/backoff"
	"github.com/grafana/dskit/httpgrpc"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
@@ -57,6 +59,20 @@ func (r retry) Do(ctx context.Context, req Request) (Response, error) {
	defer func() { r.metrics.retriesCount.Observe(float64(tries)) }()

	var lastErr error
+
+	// For the default of 5 tries
+	// try 0: no delay
+	// try 1: 250ms wait
+	// try 2: 500ms wait
+	// try 3: 1s wait
+	// try 4: 2s wait
+
+	cfg := backoff.Config{
+		MinBackoff: 250 * time.Millisecond,
+		MaxBackoff: 5 * time.Second,
+		MaxRetries: 0,
+	}
+	bk := backoff.New(ctx, cfg)
	for ; tries < r.maxRetries; tries++ {
		if ctx.Err() != nil {
			return nil, ctx.Err()
@@ -70,7 +86,8 @@ func (r retry) Do(ctx context.Context, req Request) (Response, error) {
		httpResp, ok := httpgrpc.HTTPResponseFromError(err)
		if !ok || httpResp.Code/100 == 5 {
			lastErr = err
-			level.Error(util_log.WithContext(ctx, r.log)).Log("msg", "error processing request", "try", tries, "query", req.GetQuery(), "err", err)
+			level.Error(util_log.WithContext(ctx, r.log)).Log("msg", "error processing request", "try", tries, "query", req.GetQuery(), "retry_in", bk.NextDelay(), "err", err)
+			bk.Wait()
			continue
		}
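
For readers outside the Loki codebase, here is a minimal self-contained sketch of the pattern the patch applies: a bounded retry loop that waits on a dskit backoff between failed attempts. The helper names (`doWithRetries`, `fn`) are illustrative only and are not Loki's actual retry middleware API.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"github.com/grafana/dskit/backoff"
)

// doWithRetries retries fn up to maxRetries times, sleeping with an
// exponential backoff between failed attempts (illustrative only).
func doWithRetries(ctx context.Context, maxRetries int, fn func(context.Context) error) error {
	bk := backoff.New(ctx, backoff.Config{
		MinBackoff: 250 * time.Millisecond,
		MaxBackoff: 5 * time.Second,
	})

	var lastErr error
	for try := 0; try < maxRetries; try++ {
		if ctx.Err() != nil {
			return ctx.Err()
		}
		if lastErr = fn(ctx); lastErr == nil {
			return nil
		}
		bk.Wait() // sleeps for the current delay, then grows it (capped at MaxBackoff)
	}
	return lastErr
}

func main() {
	err := doWithRetries(context.Background(), 5, func(context.Context) error {
		return errors.New("simulated 5xx from a querier")
	})
	fmt.Println("final error:", err)
}
```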

