Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client-go: refactor retry logic for backoff, rate limiter and metric to be reused by Watch, Stream, and Do #108347

Merged
merged 1 commit into from
Mar 3, 2022

Conversation

tkashem
Copy link
Contributor

@tkashem tkashem commented Feb 25, 2022

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #108302

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 25, 2022
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 25, 2022
@tkashem
Copy link
Contributor Author

tkashem commented Feb 25, 2022

/sig api-machinery
/assign @wojtek-t @aojea

This paves the way for us to add the retry metric, please review at your convenience

// we can merge these two sleeps:
// BackOffManager.Sleep(max(backoffManager.CalculateBackoff(), retryAfter))
// see https://github.com/kubernetes/kubernetes/issues/108302
request.backoff.Sleep(r.retryAfter.Wait)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this sleep take into account the context?
It seems that the context can be cancelled meanwhile we are sleeping

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it should, maybe the BackoffManager interface predates context. we allow users to specify their own BackoffManager instance, so adding context to Sleep would be a breaking change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, but at least we should have some place here that checks the context or we'll retry with a context that was cancelled.

maybe at the end of the function we can return ctx.Err() or check is not nil ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L127 func (r *withRetry) prepareForNextRetry(ctx context.Context, request *Request) error {
is doing the check for the context too

Comment on lines 230 to 226
defer func() {
// we are done with this attempt, start with a clean slate
r.retryAfter = nil
}()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: why defer if there are is only one exit path?

}

// we always do a backoff sleep including the first try
request.backoff.Sleep(request.backoff.CalculateBackoff(url))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it simplify something if we sum the values ?
request.backoff.CalculateBackoff(url) + r.retryAfter.Wait

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the suggestion from @wojtek-t was to max(backoffManager.CalculateBackoff(), retryAfter), this will be done in a follow up PR, this PR is refactor-only

}

func (r *withRetry) BeforeNextRetry(ctx context.Context, backoff BackoffManager, retryAfter *RetryAfter, url string, body io.Reader) error {
func (r *withRetry) PrepareForNextRetry(ctx context.Context, request *Request) error {
Copy link
Member

@aojea aojea Feb 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the pattern seems to be always

				if r.retry.IsNextRetry(req, resp, err, neverRetryError) {
					err := r.retry.PrepareForNextRetry(ctx, r)
					if err == nil {
						return false, nil
					

and this PrepareForNextRetry does 3 additional checks, the new one is just check the output of IsNextRetry that is set r.retryAfter != nil
Should we merge this 2 methods? is PrepareForNextRetry really needed now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a good suggestion, i combined these two, now Stream, Watch, and Do look more easy to follow

// we are done with this attempt, start with a clean slate
r.retryAfter = nil
}()
updateURLMetrics(ctx, request, resp, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel the metrics should not belong here, inside the retry logic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, i moved the metric out to its original place.

@tkashem tkashem force-pushed the refactor branch 3 times, most recently from 68c3b8e to b7ae62f Compare February 25, 2022 20:13
@aojea
Copy link
Member

aojea commented Feb 26, 2022

this looks really nice,

@tkashem
Copy link
Contributor Author

tkashem commented Feb 27, 2022

/retest

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 1, 2022
Copy link
Member

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice

// if retry is set to true, retryAfter will contain the information
// regarding the next retry.
// IsNextRetry internally maintains the retry after
// parameters - retry reason, and wait duration associated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this part of the comment - what parameters? Where those are?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, it's confusing, i moved the comments below where it's more localized

@@ -918,8 +866,7 @@ func (r *Request) request(ctx context.Context, fn func(*http.Request, *http.Resp
fn(req, resp)
}

var retry bool
retryAfter, retry = r.retry.NextRetry(req, resp, err, func(req *http.Request, err error) bool {
if retry := r.retry.IsNextRetry(ctx, r, req, resp, err, func(req *http.Request, err error) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you define the input function excplitly (same as in line 605 for Watch()) ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping (this comment and the other)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as soon as I define an explicit func, TestDoRequestSuccess keeps failing, it's very strange, I will pursue it in a follow up PR if that is okay.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i had a typo, it's fixed now.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 1, 2022
@jpbetz
Copy link
Contributor

jpbetz commented Mar 1, 2022

/triage accepted
/cc @yliaog

@k8s-ci-robot k8s-ci-robot requested a review from yliaog March 1, 2022 21:13
@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 1, 2022
@tkashem tkashem force-pushed the refactor branch 3 times, most recently from edac4aa to 1553996 Compare March 2, 2022 16:33
@tkashem
Copy link
Contributor Author

tkashem commented Mar 2, 2022

/retest

@tkashem
Copy link
Contributor Author

tkashem commented Mar 2, 2022

@aojea @wojtek-t it's ready for another pass, please take a look.

@wojtek-t
Copy link
Member

wojtek-t commented Mar 3, 2022

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 3, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tkashem, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 3, 2022
@aojea
Copy link
Member

aojea commented Mar 3, 2022

/test pull-kubernetes-e2e-gce-ubuntu-containerd

Kubernetes e2e suite: [sig-cli] Kubectl client Simple pod should contain last line of the log expand_less

@k8s-ci-robot k8s-ci-robot merged commit 52cd4d5 into kubernetes:master Mar 3, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Mar 3, 2022
tkashem added a commit to tkashem/kubernetes that referenced this pull request Mar 29, 2022
This reverts commit 52cd4d5, reversing
changes made to 428ec84.
return nil
}

func (r *withRetry) After(ctx context.Context, request *Request, resp *http.Response, err error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we updating backoff here? Why can't this be in Before?

Copy link
Contributor Author

@tkashem tkashem Mar 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we update the backoff after we get an answer tuple (response, err) from the server. If we store the answer tuple then maybe we can update back off in Before and thus get rid of After. I can look into it when I do the follow up refactor.

On the other hand, After can be a useful place to call metrics, update back-off and such. We can decide whether it makes sense to keep After when we do the follow-up refactor.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

client-go: inconsistencies among Do, Watch, and Stream
6 participants