
Update existing node lease with retry. #81663

Merged

Conversation

jfbai
Contributor

@jfbai jfbai commented Aug 20, 2019

What type of PR is this?

/kind bug

What this PR does / why we need it:

Retry updating the latest lease as soon as a failure happens, rather than falling straight back to backoffEnsureLease, which will issue a GET anyway.
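The change described above can be sketched as follows. The types and constants here (lease, controller, maxUpdateRetries) are simplified stand-ins for the real kubelet nodelease controller, not its actual code:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the real controller types; names follow the
// discussion in this PR but the signatures are simplified.
type lease struct{ holder string }

type controller struct {
	latestLease *lease
	updateCalls int
	failures    int // number of initial Update failures to simulate
}

const maxUpdateRetries = 5

// retryUpdateLease sketches the idea of the change: retry the PUT a bounded
// number of times before giving up, instead of falling straight back to the
// GET-then-PUT path (backoffEnsureLease) after a single failure.
func (c *controller) retryUpdateLease(base *lease) error {
	var err error
	for i := 0; i < maxUpdateRetries; i++ {
		if err = c.update(base); err == nil {
			c.latestLease = base
			return nil
		}
	}
	return fmt.Errorf("failed %d attempts to update lease: %w", maxUpdateRetries, err)
}

// update simulates the API call, failing c.failures times before succeeding.
func (c *controller) update(l *lease) error {
	c.updateCalls++
	if c.failures > 0 {
		c.failures--
		return errors.New("transient error")
	}
	return nil
}

func main() {
	c := &controller{failures: 2}
	err := c.retryUpdateLease(&lease{holder: "node-1"})
	fmt.Println(err == nil, c.updateCalls) // prints "true 3": succeeds on the 3rd attempt
}
```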

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

None.

Does this PR introduce a user-facing change?:

None

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

None

/sig node

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. sig/node Categorizes an issue or PR as relevant to SIG Node. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 20, 2019
@k8s-ci-robot
Contributor

Hi @jfbai. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jfbai
Contributor Author

jfbai commented Aug 20, 2019

/cc @alejandrox1
/cc @answer1991
/assign @wojtek-t

PTAL, thanks a lot. :)

@k8s-ci-robot
Contributor

@jfbai: GitHub didn't allow me to request PR reviews from the following users: answer1991.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @alejandrox1
/cc @answer1991
/assign @wojtek-t

PTAL, thanks a lot. :)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jfbai
Contributor Author

jfbai commented Aug 20, 2019

@answer1991 PTAL, thanks a lot. :)

return
} else {
Member

remove the else - the if branch returns anyway.

@wojtek-t
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 20, 2019
@wojtek-t wojtek-t added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Aug 20, 2019
@jfbai jfbai force-pushed the update-existing-node-lease-with-retry branch from 12e71b5 to 16b2951 Compare August 21, 2019 02:02
@jfbai
Contributor Author

jfbai commented Aug 21, 2019

/retest

@jfbai jfbai force-pushed the update-existing-node-lease-with-retry branch from 16b2951 to d5e0b66 Compare August 21, 2019 02:25
@answer1991
Contributor

answer1991 commented Aug 21, 2019

@jfbai

When I wrote my PR #81174, I thought seriously about whether we need to use retry update here. I think it's not necessary. If updating the lease fails, there are the following situations:

  1. Conflict (409): the cost is the same; retryUpdateLease will call backoffEnsureLease immediately:

func (c *controller) retryUpdateLease(base *coordinationv1.Lease) error {
	for i := 0; i < maxUpdateRetries; i++ {
		.....
		if apierrors.IsConflict(err) {
			base, _ = c.backoffEnsureLease()
			continue
		}

  2. NotFound (404): using retry would be worse; retrying maxUpdateRetries times is not necessary. Failing fast to GET/PUT would be better.
  3. 500, or failure to communicate with the API server: no need to retry the update; failing fast to GET/PUT would be better here, I think.

So, please think again if we need retry update here.
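The conflict-handling behavior in point 1 above can be illustrated with a toy model. errConflict, server, and the resourceVersion field are hypothetical stand-ins for the real apiserver semantics, not actual Kubernetes types:

```go
package main

import (
	"errors"
	"fmt"
)

const maxUpdateRetries = 5

var errConflict = errors.New("conflict (409)")

type lease struct{ resourceVersion int }

// A toy server that rejects updates whose resourceVersion is stale.
type server struct{ current lease }

func (s *server) update(l lease) (lease, error) {
	if l.resourceVersion != s.current.resourceVersion {
		return lease{}, errConflict
	}
	s.current.resourceVersion++
	return s.current, nil
}

// get stands in for backoffEnsureLease: fetch the latest copy.
func (s *server) get() lease { return s.current }

// retryUpdateLease mirrors the quoted loop: on 409 it re-fetches the lease
// and tries again, since retrying with the same stale base can never succeed.
func retryUpdateLease(s *server, base lease) (lease, error) {
	var err error
	for i := 0; i < maxUpdateRetries; i++ {
		var updated lease
		if updated, err = s.update(base); err == nil {
			return updated, nil
		}
		if errors.Is(err, errConflict) {
			base = s.get() // refresh, like backoffEnsureLease
			continue
		}
	}
	return lease{}, fmt.Errorf("failed after %d attempts: %w", maxUpdateRetries, err)
}

func main() {
	s := &server{current: lease{resourceVersion: 7}}
	got, err := retryUpdateLease(s, lease{resourceVersion: 3}) // stale base
	fmt.Println(err == nil, got.resourceVersion)               // prints "true 8"
}
```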

@jfbai jfbai force-pushed the update-existing-node-lease-with-retry branch from d5e0b66 to 7773662 Compare August 21, 2019 03:49
@wojtek-t
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 21, 2019
@wojtek-t
Member

@jfbai - this matches exactly what I wrote above :)

@jfbai
Contributor Author

jfbai commented Aug 21, 2019

@jfbai - this matches exactly what I wrote above :)

@wojtek-t Oops, I missed your comment when I was writing comment. :)

@jfbai
Contributor Author

jfbai commented Aug 21, 2019

According to the API Conventions, the Update (PUT) operation will create the API object if it does not exist, so I think there is no need to handle 404.

PUT - Update or create the resource with the given name with the JSON object provided by the client.

@answer1991
Contributor

answer1991 commented Aug 21, 2019

@jfbai @wojtek-t

When 5XX or 429 happens, there would be a chance to update the lease successfully by retrying again; it can help to reduce GET calls.

Actually, the clientset Update method will handle 429 automatically. We do not need to retry explicitly.

But this can potentially help with some transient errors (like connectivity issues or something like that).

If transient errors or 5xx errors occur, I think it's better to fall back to backoffEnsureLease immediately. With two or more Update calls, the cost is the same as backoffEnsureLease, while backoffEnsureLease uses exponentially increasing waits to prevent overloading, which retryUpdateLease does not.

@answer1991
Contributor

answer1991 commented Aug 21, 2019

@jfbai @wojtek-t

For 404 - we should probably change retryUpdateLease to do backoffEnsureLease for that case too - once we do that, this will also be no-op.

Actually, we have not made this change.

In this case, if updating with the latest lease fails once, just falling back to backoffEnsureLease would be better.

@answer1991
Contributor

Please think about my comments again.
If my reasoning is right, please add a comment above the code explaining why we do not use retry update here.

@answer1991
Contributor

According to the API Convention, Update (PUT) operation will create API object if it does not exist, so I think there is no need to check handle 404.

In my experience, if the API server cannot find the object, a client calling Update will receive a 404.

@jfbai
Contributor Author

jfbai commented Aug 21, 2019

@jfbai @wojtek-t

When 5XX or 429 happens, there would be a chance to update the lease successfully by retrying again; it can help to reduce GET calls.

Actually, the clientset Update method will handle 429 automatically. We do not need to retry explicitly.

But this can potentially help with some transient errors (like connectivity issues or something like that).

If transient errors or 5xx errors occur, I think it's better to fall back to backoffEnsureLease immediately. With two or more Update calls, the cost is the same as backoffEnsureLease, while backoffEnsureLease uses exponentially increasing waits to prevent overloading, which retryUpdateLease does not.


@answer1991 Thanks for your explanation, but the lease client is backed by a rest.Interface:

// leases implements LeaseInterface
type leases struct {
	client rest.Interface
	ns     string
}

and the Update call will eventually reach func (r *Request) request(fn func(*http.Request, *http.Response)) error,

and the retry logic there only automatically handles "connection reset" errors for GET operations:

			// "Connection reset by peer" is usually a transient error.
			// Thus in case of "GET" operations, we simply retry it.
			// We are not automatically retrying "write" operations, as
			// they are not idempotent.
			if !net.IsConnectionReset(err) || r.verb != "GET" {
				return err
			}

So, I think 429 would not be handled automatically. :)

@jfbai
Contributor Author

jfbai commented Aug 21, 2019

@jfbai @wojtek-t

When 5XX or 429 happens, there would be a chance to update the lease successfully by retrying again; it can help to reduce GET calls.

Actually, the clientset Update method will handle 429 automatically. We do not need to retry explicitly.

But this can potentially help with some transient errors (like connectivity issues or something like that).

If transient errors or 5xx errors occur, I think it's better to fall back to backoffEnsureLease immediately. With two or more Update calls, the cost is the same as backoffEnsureLease, while backoffEnsureLease uses exponentially increasing waits to prevent overloading, which retryUpdateLease does not.

@answer1991 retryUpdateLease will not cause overloading; it retries at most maxUpdateRetries times.
When transient or 5xx errors happen, it is better to retry as soon as possible; if the request still fails after maxUpdateRetries retries, it then falls back to backoffEnsureLease to ensure the lease.

@jfbai
Contributor Author

jfbai commented Aug 21, 2019

According to the API Conventions, the Update (PUT) operation will create the API object if it does not exist, so I think there is no need to handle 404.

In my experience, if the API server cannot find the object, a client calling Update will receive a 404.

Yes, you are right; Update (PUT) does not guarantee creating the object if it does not exist. It depends on the resource. :)

// Updater is an object that can update an instance of a RESTful object.
type Updater interface {
	// New returns an empty object that can be used with Update after request data has been put into it.
	// This object must be a pointer type for use with Codec.DecodeInto([]byte, runtime.Object)
	New() runtime.Object

	// Update finds a resource in the storage and updates it. Some implementations
	// may allow updates creates the object - they should set the created boolean
	// to true.
	Update(ctx context.Context, name string, objInfo UpdatedObjectInfo, createValidation ValidateObjectFunc, updateValidation ValidateObjectUpdateFunc, forceAllowCreate bool, options *metav1.UpdateOptions) (runtime.Object, bool, error)
}

Update:

My mistake; I had not looked into how the apiserver handles Update (PUT), but I did a simple test in my cluster, and Update (PUT) can create a new lease if it does not exist. Steps below:

  1. Modify ensureLease(): change Get() -> Update(), remove Create(), and build a kubelet binary.
  2. Deploy the new binary in my cluster.
  3. Remove the old lease.

Then I observed that the lease was created.
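The create-on-update behavior tested above can be modeled with a toy registry. The strategy type and allowCreateOnUpdate field here mirror, in name only, the per-resource decision made by the apiserver's update strategy (see the Updater interface quoted earlier); this is an assumption-laden sketch, not apiserver code:

```go
package main

import "fmt"

// Whether a PUT to a missing object creates it is decided per resource.
type strategy struct{ allowCreateOnUpdate bool }

type registry struct {
	strategy strategy
	objects  map[string]string
}

// update mimics PUT: if the object is missing, either create it (when the
// strategy allows create-on-update) or report a 404-style error.
func (r *registry) update(name, value string) (created bool, err error) {
	if _, ok := r.objects[name]; !ok {
		if !r.strategy.allowCreateOnUpdate {
			return false, fmt.Errorf("%s not found (404)", name)
		}
		created = true
	}
	r.objects[name] = value
	return created, nil
}

func main() {
	// Leases allow create-on-update, so PUT on a missing lease creates it.
	leases := &registry{strategy: strategy{allowCreateOnUpdate: true}, objects: map[string]string{}}
	created, err := leases.update("node-1", "lease")
	fmt.Println(created, err) // prints "true <nil>"

	// A resource that forbids it returns a 404-style error instead.
	strict := &registry{strategy: strategy{allowCreateOnUpdate: false}, objects: map[string]string{}}
	_, err = strict.update("node-1", "lease")
	fmt.Println(err != nil) // prints "true"
}
```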

@jfbai
Contributor Author

jfbai commented Aug 21, 2019

@wojtek-t Sorry for bugging you; we need your suggestions on this change.

@jfbai
Contributor Author

jfbai commented Aug 22, 2019

ping @wojtek-t to take another look, thanks a lot.

@wojtek-t
Member

So, I think 429 would not be handled automatically. :)

We actually handle 429 in client:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/rest/request.go#L1049

creating on Update operation

It's defined per resource type:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/registry/rest/update.go#L44
We do create on update for leases:
https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/coordination/lease/strategy.go#L64

Going back to the original question - I don't have a strong opinion either way. This PR doesn't do any harm, but it also doesn't gain us much (though it does make the code a bit more consistent).

I will let @answer1991 also reply before canceling the hold.

@jfbai
Contributor Author

jfbai commented Aug 22, 2019

So, I think 429 would not be handled automatically. :)

We actually handle 429 in client:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/rest/request.go#L1049

creating on Update operation

It's defined per resource type:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/registry/rest/update.go#L44
We do create on update for leases:
https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/coordination/lease/strategy.go#L64

Going back to the original question - I don't have a strong opinion either way. This PR doesn't do any harm, but it also doesn't gain us much (though it does make the code a bit more consistent).

I will let @answer1991 also reply before canceling the hold.

@wojtek-t Your detailed explanation is much appreciated; thanks a lot.

@@ -98,12 +98,10 @@ func (c *controller) sync() {
 	// If at some point other agents will also be frequently updating the Lease object, this
 	// can result in performance degradation, because we will end up with calling additional
 	// GET/PUT - at this point this whole "if" should be removed.
-	lease, err := c.leaseClient.Update(c.newLease(c.latestLease))
+	err := c.retryUpdateLease(c.newLease(c.latestLease))
Contributor
if err := c.retryUpdateLease(c.newLease(c.latestLease)); err == nil {
return
}

Contributor Author

@jfbai jfbai Aug 22, 2019

In this case, err would be out of scope for the klog call below.
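The scoping point made here can be demonstrated directly; update() is a hypothetical stand-in for the lease update call:

```go
package main

import (
	"errors"
	"fmt"
)

func update() error { return errors.New("boom") }

// With the suggested `if err := ...; err == nil` form, err is scoped to the
// if statement, so a log call after the if cannot include the error value.
func scoped() string {
	if err := update(); err == nil {
		return "ok"
	}
	// err is not in scope here; a klog.Errorf call at this point could not
	// reference it.
	return "failed (error value unavailable)"
}

// Declaring err in the outer scope keeps it visible for logging, which is
// why the merged code uses this shape.
func outer() string {
	err := update()
	if err == nil {
		return "ok"
	}
	return fmt.Sprintf("failed: %v", err) // err still visible for logging
}

func main() {
	fmt.Println(scoped())
	fmt.Println(outer()) // prints "failed: boom"
}
```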

Member

For logging purposes below - I agree.

Contributor

ok, LGTM

@answer1991
Contributor

@jfbai

I added comments; otherwise LGTM.

@jfbai
Contributor Author

jfbai commented Aug 22, 2019

@jfbai

I added comments; otherwise LGTM.

@answer1991 Thanks a lot. :)

@wojtek-t
Member

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 22, 2019
@answer1991
Contributor

I remembered why I did not use retry Update there...

There is another situation: 409. If the kubelet's cached Lease is updated by another agent or controller (for now, it seems no agent will do that), then retrying Update will always fail no matter how many times the kubelet retries. In this situation, retrying is worse.

So, if we decide to use retry Update to update the latest cached Lease, we need to handle the 409 response in retryUpdateLease and exit the retry loop immediately.

@answer1991
Contributor

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 22, 2019
@answer1991
Contributor

There is another situation: 409

My mistake; it is already handled.

@answer1991
Contributor

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 22, 2019
@jfbai
Contributor Author

jfbai commented Aug 23, 2019

/test pull-kubernetes-integration

@jfbai
Contributor Author

jfbai commented Aug 23, 2019

/test pull-kubernetes-bazel-test

@k8s-ci-robot k8s-ci-robot merged commit 10bd85a into kubernetes:master Aug 23, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.16 milestone Aug 23, 2019
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.