client-go: refactor retry logic for backoff, rate limiter and metric to be reused by Watch, Stream, and Do #108347

tkashem · 2022-02-25T05:49:28Z

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This is a pure refactor that consolidates the rate limiter, backoff, and metric logic into a single site so it can be reused by Watch, Stream, and Do.
There is no change in behavior: the unit test added in "client-go: add unit test to verify order of calls with retry" (#108262) still passes, which implies that the order of invocation is preserved.

Which issue(s) this PR fixes:

Fixes #108302

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

tkashem · 2022-02-25T05:51:26Z

/sig api-machinery
/assign @wojtek-t @aojea

This paves the way for us to add the retry metric. Please review at your convenience.

aojea · 2022-02-25T08:45:11Z

staging/src/k8s.io/client-go/rest/with_retry.go

+			//  we can merge these two sleeps:
+			//  BackOffManager.Sleep(max(backoffManager.CalculateBackoff(), retryAfter))
+			//  see https://github.com/kubernetes/kubernetes/issues/108302
+			request.backoff.Sleep(r.retryAfter.Wait)


should this sleep take into account the context?
It seems that the context can be cancelled while we are sleeping.

i think it should; maybe the BackoffManager interface predates context. we allow users to specify their own BackoffManager instance, so adding a context to Sleep would be a breaking change.

yeah, but at least we should have some place here that checks the context, or we'll retry with a context that was cancelled.

maybe at the end of the function we can return ctx.Err(), or check that it is not nil?

L127 func (r *withRetry) prepareForNextRetry(ctx context.Context, request *Request) error {
is doing the check for the context too

aojea · 2022-02-25T08:47:15Z

staging/src/k8s.io/client-go/rest/with_retry.go

+	defer func() {
+		// we are done with this attempt, start with a clean slate
+		r.retryAfter = nil
+	}()


question: why defer if there is only one exit path?

aojea · 2022-02-25T08:49:20Z

staging/src/k8s.io/client-go/rest/with_retry.go

 		}
+
+		// we always do a backoff sleep including the first try
+		request.backoff.Sleep(request.backoff.CalculateBackoff(url))


does it simplify something if we sum the values?
request.backoff.CalculateBackoff(url) + r.retryAfter.Wait

I think the suggestion from @wojtek-t was to use max(backoffManager.CalculateBackoff(), retryAfter). This will be done in a follow-up PR; this PR is refactor-only.
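
For illustration, the difference between summing the two waits and taking their maximum can be sketched like this. `mergedDelay` and `summedDelay` are hypothetical helper names and the durations are made-up inputs; none of this is part of the PR.

```go
package main

import (
	"fmt"
	"time"
)

// mergedDelay returns the single wait that would replace two consecutive
// sleeps: the larger of the backoff delay and the server's Retry-After.
func mergedDelay(backoff, retryAfter time.Duration) time.Duration {
	if retryAfter > backoff {
		return retryAfter
	}
	return backoff
}

// summedDelay is what two separate, consecutive sleeps add up to.
func summedDelay(backoff, retryAfter time.Duration) time.Duration {
	return backoff + retryAfter
}

func main() {
	backoff, retryAfter := 2*time.Second, 5*time.Second
	fmt.Println("sum:", summedDelay(backoff, retryAfter))
	fmt.Println("max:", mergedDelay(backoff, retryAfter))
}
```

With these inputs the sum is 7s while the maximum is 5s, so switching to max would change timing and is therefore not a pure refactor.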

aojea · 2022-02-25T08:59:34Z

staging/src/k8s.io/client-go/rest/with_retry.go

 }

-func (r *withRetry) BeforeNextRetry(ctx context.Context, backoff BackoffManager, retryAfter *RetryAfter, url string, body io.Reader) error {
+func (r *withRetry) PrepareForNextRetry(ctx context.Context, request *Request) error {


the pattern always seems to be

	if r.retry.IsNextRetry(req, resp, err, neverRetryError) {
		err := r.retry.PrepareForNextRetry(ctx, r)
		if err == nil {
			return false, nil

and PrepareForNextRetry does 3 additional checks; the new one just checks the output of IsNextRetry, i.e. whether r.retryAfter != nil.
Should we merge these 2 methods? Is PrepareForNextRetry really needed now?

that's a good suggestion, i combined these two; now Stream, Watch, and Do are easier to follow

aojea · 2022-02-25T09:03:05Z

staging/src/k8s.io/client-go/rest/with_retry.go

+		// we are done with this attempt, start with a clean slate
+		r.retryAfter = nil
+	}()
+	updateURLMetrics(ctx, request, resp, err)


I feel the metrics should not belong here, inside the retry logic

yes, i moved the metric out to its original place.

aojea · 2022-02-26T16:18:45Z

this looks really nice,

tkashem · 2022-02-27T18:36:23Z

/retest

wojtek-t

This is nice

staging/src/k8s.io/client-go/rest/with_retry.go

wojtek-t · 2022-03-01T07:43:39Z

staging/src/k8s.io/client-go/rest/with_retry.go

-	// if retry is set to true, retryAfter will contain the information
-	// regarding the next retry.
+	// IsNextRetry internally maintains the retry after
+	// parameters - retry reason, and wait duration associated


I don't understand this part of the comment - what parameters? Where are those?

yeah, it's confusing, i moved the comments below where it's more localized

wojtek-t · 2022-03-01T07:46:01Z

staging/src/k8s.io/client-go/rest/request.go

@@ -918,8 +866,7 @@ func (r *Request) request(ctx context.Context, fn func(*http.Request, *http.Resp
 				fn(req, resp)
 			}

-			var retry bool
-			retryAfter, retry = r.retry.NextRetry(req, resp, err, func(req *http.Request, err error) bool {
+			if retry := r.retry.IsNextRetry(ctx, r, req, resp, err, func(req *http.Request, err error) bool {


nit: can you define the input function explicitly (same as in line 605 for Watch())?

ping (this comment and the other)

as soon as I define an explicit func, TestDoRequestSuccess keeps failing, which is very strange. I will pursue it in a follow-up PR if that is okay.

i had a typo, it's fixed now.

jpbetz · 2022-03-01T21:13:32Z

/triage accepted
/cc @yliaog

tkashem · 2022-03-02T17:14:06Z

/retest

tkashem · 2022-03-02T19:40:28Z

@aojea @wojtek-t it's ready for another pass, please take a look.

wojtek-t · 2022-03-03T10:19:43Z

/lgtm
/approve

k8s-ci-robot · 2022-03-03T10:20:22Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tkashem, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~staging/src/k8s.io/client-go/OWNERS~~ [wojtek-t]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

aojea · 2022-03-03T11:22:58Z

/test pull-kubernetes-e2e-gce-ubuntu-containerd

Kubernetes e2e suite: [sig-cli] Kubectl client Simple pod should contain last line of the log

This reverts commit 52cd4d5, reversing changes made to 428ec84.

smarterclayton · 2022-03-30T17:45:40Z

staging/src/k8s.io/client-go/rest/with_retry.go

 	return nil
 }

+func (r *withRetry) After(ctx context.Context, request *Request, resp *http.Response, err error) {


Why are we updating backoff here? Why can't this be in Before?

we update the backoff after we get an answer tuple (response, err) from the server. If we store the answer tuple, then maybe we can update the backoff in Before and thus get rid of After. I can look into it when I do the follow-up refactor.

On the other hand, After can be a useful place to call metrics, update the backoff, and such. We can decide whether it makes sense to keep After when we do the follow-up refactor.

k8s-ci-robot requested review from smarterclayton and sttts February 25, 2022 05:49

k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 25, 2022

k8s-ci-robot assigned aojea and wojtek-t Feb 25, 2022

aojea reviewed Feb 25, 2022


tkashem force-pushed the refactor branch 3 times, most recently from 68c3b8e to b7ae62f Compare February 25, 2022 20:13

tkashem mentioned this pull request Feb 28, 2022

client-go: add a metric to count request retries #108396

Merged

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 1, 2022

wojtek-t reviewed Mar 1, 2022


tkashem force-pushed the refactor branch from b7ae62f to 10e1120 Compare March 1, 2022 14:55

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 1, 2022

k8s-ci-robot requested a review from yliaog March 1, 2022 21:13

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 1, 2022

tkashem force-pushed the refactor branch 3 times, most recently from edac4aa to 1553996 Compare March 2, 2022 16:33

tkashem force-pushed the refactor branch from 1553996 to d2f4d28 Compare March 2, 2022 19:29

client-go: refactor retry logic for backoff, rate limiter and metric

cecc563

tkashem force-pushed the refactor branch from d2f4d28 to cecc563 Compare March 2, 2022 19:39

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 3, 2022

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 3, 2022

k8s-ci-robot merged commit 52cd4d5 into kubernetes:master Mar 3, 2022

k8s-ci-robot added this to the v1.24 milestone Mar 3, 2022

MadhavJivrajani mentioned this pull request Mar 23, 2022

go1.18: Data race: client-go retry body reset #108906

Closed

tkashem added a commit to tkashem/kubernetes that referenced this pull request Mar 29, 2022

Revert "Merge pull request kubernetes#108347 from tkashem/refactor"

3e1751a

This reverts commit 52cd4d5, reversing changes made to 428ec84.

smarterclayton reviewed Mar 30, 2022

