
Update existing node lease with retry. #81663

Merged

Conversation

@jfbai
Contributor

commented Aug 20, 2019

What type of PR is this?

/kind bug

What this PR does / why we need it:

Retry updating the latest lease early when a failure happens, rather than falling through to backoffEnsureLease, which will issue a GET anyway.
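For illustration, a minimal, self-contained sketch of the intended flow. The types and helpers below are toy stand-ins (not the actual kubelet controller or client-go code); retryUpdateLease and the ensure callback only model the shape of retryUpdateLease and backoffEnsureLease.

package main

import (
	"errors"
	"fmt"
)

const maxUpdateRetries = 5

type lease struct{ holder string }

type leaseClient interface {
	Update(l lease) (lease, error)
}

// retryUpdateLease tries the PUT a bounded number of times before giving up.
func retryUpdateLease(c leaseClient, base lease) error {
	var err error
	for i := 0; i < maxUpdateRetries; i++ {
		if _, err = c.Update(base); err == nil {
			return nil // PUT succeeded; no extra GET needed
		}
	}
	return err
}

// sync mirrors the shape of the change: retry with the cached lease first and
// only fall back to the slower GET/CREATE path if that fails.
func sync(c leaseClient, cached *lease, ensure func() lease) {
	if cached != nil {
		if err := retryUpdateLease(c, *cached); err == nil {
			return
		}
		fmt.Println("failed to update cached lease, falling back to ensure path")
	}
	base := ensure() // stands in for backoffEnsureLease (GET, or CREATE on 404)
	_ = retryUpdateLease(c, base)
}

type flakyClient struct{ failures int }

func (f *flakyClient) Update(l lease) (lease, error) {
	if f.failures > 0 {
		f.failures--
		return lease{}, errors.New("transient error")
	}
	return l, nil
}

func main() {
	cached := lease{holder: "node-1"}
	sync(&flakyClient{failures: 2}, &cached, func() lease { return cached })
	fmt.Println("lease updated")
}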

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

None.

Does this PR introduce a user-facing change?:

None

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

None

/sig node

@k8s-ci-robot

Contributor

commented Aug 20, 2019

Hi @jfbai. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jfbai

Contributor Author

commented Aug 20, 2019

/cc @alejandrox1
/cc @answer1991
/assign @wojtek-t

PTAL, thanks a lot. :)

@k8s-ci-robot

Contributor

commented Aug 20, 2019

@jfbai: GitHub didn't allow me to request PR reviews from the following users: answer1991.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @alejandrox1
/cc @answer1991
/assign @wojtek-t

PTAL, thanks a lot. :)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot requested a review from alejandrox1 Aug 20, 2019
@jfbai

Contributor Author

commented Aug 20, 2019

@answer1991 PTAL, thanks a lot. :)

return
} else {

@wojtek-t

wojtek-t Aug 20, 2019

Member

Remove the else - the if branch returns anyway.

@wojtek-t

Member

commented Aug 20, 2019

/ok-to-test

@jfbai jfbai force-pushed the jfbai:update-existing-node-lease-with-retry branch from 12e71b5 to 16b2951 Aug 21, 2019
@jfbai

Contributor Author

commented Aug 21, 2019

/retest

@jfbai jfbai force-pushed the jfbai:update-existing-node-lease-with-retry branch from 16b2951 to d5e0b66 Aug 21, 2019
@answer1991

Contributor

commented Aug 21, 2019

@jfbai

When I wrote my PR #81174, I seriously thought about whether we need a retried update here. I think it is not necessary. If updating the lease fails, there are the following situations:

  1. Conflict (409): the cost is the same; retryUpdateLease will call backoffEnsureLease immediately:
func (c *controller) retryUpdateLease(base *coordinationv1.Lease) error {
	for i := 0; i < maxUpdateRetries; i++ {
		.....
		if apierrors.IsConflict(err) {
			base, _ = c.backoffEnsureLease()
			continue
		}
  2. NotFound (404): retrying would be worse; retrying maxUpdateRetries times is not necessary. Failing fast to GET/PUT would be better.
  3. 500, or failure to communicate with the API server: no need to retry the update; failing fast to GET/PUT would be better here, I think.

So, please think again about whether we need the retried update here.

@jfbai jfbai force-pushed the jfbai:update-existing-node-lease-with-retry branch from d5e0b66 to 7773662 Aug 21, 2019
@wojtek-t

Member

commented Aug 21, 2019

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm label Aug 21, 2019
@wojtek-t

Member

commented Aug 21, 2019

@jfbai - this matches exactly what I wrote above :)

@jfbai

Contributor Author

commented Aug 21, 2019

@jfbai - this matches exactly what I wrote above :)

@wojtek-t Oops, I missed your comment while I was writing mine. :)

@jfbai

Contributor Author

commented Aug 21, 2019

According to the API Conventions, the Update (PUT) operation will create the API object if it does not exist, so I think there is no need to handle 404.

PUT /&lt;resourceNamePlural&gt;/&lt;name&gt; - Update or create the resource with the given name with the JSON object provided by the client.

@answer1991

Contributor

commented Aug 21, 2019

@jfbai @wojtek-t

When a 5XX or 429 happens, there is a chance to update the lease successfully by retrying, which can help reduce GET calls.

Actually, the clientset Update method will handle 429 automatically. We do not need to retry explicitly.

But this can potentially help with some transient errors (like connectivity issues or something like that).

For transient errors or 5xx errors, I think it's better to fall back to backoffEnsureLease immediately, because with two or more Update calls the cost is the same as backoffEnsureLease. backoffEnsureLease uses exponentially increasing waits to prevent overloading the apiserver, but retryUpdateLease does not.
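As a toy illustration of the cost argument (not the controller code; the helpers below are hypothetical): a bounded retry loop issues its attempts back to back, while a backoff-style loop doubles the wait between attempts, which is gentler on an unhealthy apiserver.

package main

import (
	"errors"
	"fmt"
	"time"
)

func alwaysFail() error { return errors.New("503 from apiserver") }

// retryImmediately models a fixed-count retry loop: up to n attempts with no delay.
func retryImmediately(n int, op func() error) error {
	var err error
	for i := 0; i < n; i++ {
		if err = op(); err == nil {
			return nil
		}
	}
	return err
}

// retryWithBackoff models the backoff-style path: the wait doubles after every
// failed attempt, spacing requests out instead of firing them back to back.
func retryWithBackoff(n int, initial time.Duration, op func() error) error {
	wait := initial
	var err error
	for i := 0; i < n; i++ {
		if err = op(); err == nil {
			return nil
		}
		time.Sleep(wait)
		wait *= 2
	}
	return err
}

func main() {
	fmt.Println(retryImmediately(3, alwaysFail))
	fmt.Println(retryWithBackoff(3, 10*time.Millisecond, alwaysFail))
}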

@answer1991

Contributor

commented Aug 21, 2019

@jfbai @wojtek-t

For 404 - we should probably change retryUpdateLease to do backoffEnsureLease for that case too - once we do that, this will also be a no-op.

Actually, we have not made this change yet.

In this case, once updating with the latest lease fails, just falling back to backoffEnsureLease would be better.

@answer1991

Contributor

commented Aug 21, 2019

Please think about my comments again.
If my thought is right, please add a comment above the code about why we do not use a retried update here.

@answer1991

Contributor

commented Aug 21, 2019

According to the API Conventions, the Update (PUT) operation will create the API object if it does not exist, so I think there is no need to handle 404.

In my experience, if the API server side cannot find the object, a client calling Update will receive a 404.

@jfbai

Contributor Author

commented Aug 21, 2019

@jfbai @wojtek-t

When a 5XX or 429 happens, there is a chance to update the lease successfully by retrying, which can help reduce GET calls.

Actually, the clientset Update method will handle 429 automatically. We do not need to retry explicitly.

But this can potentially help with some transient errors (like connectivity issues or something like that).

For transient errors or 5xx errors, I think it's better to fall back to backoffEnsureLease immediately, because with two or more Update calls the cost is the same as backoffEnsureLease. backoffEnsureLease uses exponentially increasing waits to prevent overloading the apiserver, but retryUpdateLease does not.

@answer1991 Thanks for your explanation, but the lease client is a rest.Interface,

// leases implements LeaseInterface
type leases struct {
	client rest.Interface
	ns     string
}

and the Update call will finally end up in func (r *Request) request(fn func(*http.Request, *http.Response)) error,

and the retry logic there only automatically handles connection-reset errors, and only for GET operations:

			// "Connection reset by peer" is usually a transient error.
			// Thus in case of "GET" operations, we simply retry it.
			// We are not automatically retrying "write" operations, as
			// they are not idempotent.
			if !net.IsConnectionReset(err) || r.verb != "GET" {
				return err
			}

So, I think 429 would not be handled automatically. :)

@jfbai

Contributor Author

commented Aug 21, 2019

@jfbai @wojtek-t

When a 5XX or 429 happens, there is a chance to update the lease successfully by retrying, which can help reduce GET calls.

Actually, the clientset Update method will handle 429 automatically. We do not need to retry explicitly.

But this can potentially help with some transient errors (like connectivity issues or something like that).

For transient errors or 5xx errors, I think it's better to fall back to backoffEnsureLease immediately, because with two or more Update calls the cost is the same as backoffEnsureLease. backoffEnsureLease uses exponentially increasing waits to prevent overloading the apiserver, but retryUpdateLease does not.

@answer1991 retryUpdateLease will not cause overloading; it retries at most maxUpdateRetries times.
When transient errors or 5xx errors happen, it is better to retry as soon as possible; if the request still fails after maxUpdateRetries retries, then fall back to backoffEnsureLease to ensure the lease.

@jfbai

Contributor Author

commented Aug 21, 2019

According to the API Conventions, the Update (PUT) operation will create the API object if it does not exist, so I think there is no need to handle 404.

In my experience, if the API server side cannot find the object, a client calling Update will receive a 404.

Yes, you are right, Update (PUT) does not guarantee creating the object if it does not exist; it depends. :)

// Updater is an object that can update an instance of a RESTful object.
type Updater interface {
	// New returns an empty object that can be used with Update after request data has been put into it.
	// This object must be a pointer type for use with Codec.DecodeInto([]byte, runtime.Object)
	New() runtime.Object

	// Update finds a resource in the storage and updates it. Some implementations
	// may allow updates creates the object - they should set the created boolean
	// to true.
	Update(ctx context.Context, name string, objInfo UpdatedObjectInfo, createValidation ValidateObjectFunc, updateValidation ValidateObjectUpdateFunc, forceAllowCreate bool, options *metav1.UpdateOptions) (runtime.Object, bool, error)
}

Update:

My mistake, I had not looked into how the apiserver handles Update (PUT), but I did a simple test in my cluster, and Update (PUT) can create a new lease if it does not exist. Steps below.

  1. Modify ensureLease(): change Get() to Update(), remove Create(), and build a kubelet binary.
  2. Deploy the new binary in my cluster.
  3. Remove the old lease.

Then I observed that the lease was created.

@jfbai

Contributor Author

commented Aug 21, 2019

@wojtek-t Sorry for bugging you; we need your suggestion on this change.

@jfbai

Contributor Author

commented Aug 22, 2019

ping @wojtek-t to take another look, thanks a lot.

@wojtek-t

Member

commented Aug 22, 2019

So, I think 429 would not be handled automatically. :)

We actually handle 429 in the client:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/rest/request.go#L1049

Regarding creating on the Update operation:

It's defined per resource type:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/registry/rest/update.go#L44
We do create on update for leases:
https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/coordination/lease/strategy.go#L64

Going back to the original question - I don't have a strong opinion either way. This PR doesn't do any harm, but it also doesn't gain us much (though the code is a bit more consistent).

I will let @answer1991 also reply before canceling the hold.
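For readers following the links above, a rough, self-contained sketch of the per-resource create-on-update switch being described. The types here are hypothetical toy stand-ins, not the real apiserver registry code: each resource's strategy decides whether a PUT to a missing object creates it or returns 404.

package main

import "fmt"

type updateStrategy interface {
	AllowCreateOnUpdate() bool
}

type leaseStrategy struct{}

// Leases opt in, so a PUT for a missing Lease creates it instead of returning 404.
func (leaseStrategy) AllowCreateOnUpdate() bool { return true }

type store struct {
	strategy updateStrategy
	objects  map[string]string
}

func (s *store) update(name, value string) (string, error) {
	if _, ok := s.objects[name]; !ok && !s.strategy.AllowCreateOnUpdate() {
		return "", fmt.Errorf("404: %q not found", name)
	}
	s.objects[name] = value // create-or-update
	return value, nil
}

func main() {
	s := &store{strategy: leaseStrategy{}, objects: map[string]string{}}
	v, err := s.update("node-1", "lease for node-1")
	fmt.Println(v, err) // the missing lease is created on update
}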

@jfbai

Contributor Author

commented Aug 22, 2019

So, I think 429 would not be handled automatically. :)

We actually handle 429 in the client:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/rest/request.go#L1049

Regarding creating on the Update operation:

It's defined per resource type:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/registry/rest/update.go#L44
We do create on update for leases:
https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/coordination/lease/strategy.go#L64

Going back to the original question - I don't have a strong opinion either way. This PR doesn't do any harm, but it also doesn't gain us much (though the code is a bit more consistent).

I will let @answer1991 also reply before canceling the hold.

@wojtek-t Thanks a lot for explaining the details; much appreciated.

@@ -98,12 +98,10 @@ func (c *controller) sync() {
 // If at some point other agents will also be frequently updating the Lease object, this
 // can result in performance degradation, because we will end up with calling additional
 // GET/PUT - at this point this whole "if" should be removed.
-lease, err := c.leaseClient.Update(c.newLease(c.latestLease))
+err := c.retryUpdateLease(c.newLease(c.latestLease))

@answer1991

answer1991 Aug 22, 2019

Contributor

if err := c.retryUpdateLease(c.newLease(c.latestLease)); err == nil {
return
}

@jfbai

jfbai Aug 22, 2019

Author Contributor

In this case, err would be out of scope for the klog call below.

@wojtek-t

wojtek-t Aug 22, 2019

Member

For logging purposes below - I agree.

@answer1991

answer1991 Aug 22, 2019

Contributor

ok, LGTM
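For context, a small self-contained illustration of the Go scoping point behind this exchange (hypothetical example, not the PR code): an err declared in the if header is only visible inside that if, so it would not be available to a later log call.

package main

import (
	"errors"
	"fmt"
)

func update() error { return errors.New("update failed") }

func main() {
	// Compact form: err is scoped to the if statement and gone afterwards.
	if err := update(); err == nil {
		return
	}
	// fmt.Printf("failed: %v\n", err) // would not compile: undefined err

	// Form kept in the PR: err stays in scope for the log line below.
	err := update()
	if err == nil {
		return
	}
	fmt.Printf("failed to update lease, error: %v\n", err)
}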

@answer1991

Contributor

commented Aug 22, 2019

@jfbai

I added comments; otherwise LGTM.

@jfbai

Contributor Author

commented Aug 22, 2019

@jfbai

I added comments; otherwise LGTM.

@answer1991 Thanks a lot. :)

@wojtek-t

Member

commented Aug 22, 2019

/hold cancel

@answer1991

Contributor

commented Aug 22, 2019

I remembered why I did not use a retried Update there...

There is another situation: 409. If the kubelet's cached Lease has been updated by another agent or controller (for now, it seems no agent will do that), then the retried Update will fail no matter how many times the kubelet retries. For this situation, retrying is worse.

So, we need to handle the 409 response in retryUpdateLease and exit the retry loop immediately if we decide to use a retried Update for the latest cached Lease.
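A rough sketch of the conflict handling being discussed, with toy types and a hypothetical in-memory "server" (not the actual retryUpdateLease): on a 409 the loop refreshes its base via a GET-like call instead of retrying the same stale object, which would keep conflicting forever.

package main

import (
	"errors"
	"fmt"
)

const maxUpdateRetries = 5

var errConflict = errors.New("409 conflict")

type lease struct{ resourceVersion int }

func retryUpdateLease(update func(lease) error, refresh func() lease, base lease) error {
	var err error
	for i := 0; i < maxUpdateRetries; i++ {
		if err = update(base); err == nil {
			return nil
		}
		if errors.Is(err, errConflict) {
			// Someone else updated the lease: re-GET it before retrying,
			// otherwise every further attempt would also conflict.
			base = refresh()
			continue
		}
	}
	return err
}

func main() {
	server := lease{resourceVersion: 7} // already bumped by another agent
	update := func(l lease) error {
		if l.resourceVersion != server.resourceVersion {
			return errConflict
		}
		server.resourceVersion++
		return nil
	}
	refresh := func() lease { return server }

	stale := lease{resourceVersion: 3}
	fmt.Println(retryUpdateLease(update, refresh, stale)) // <nil>: succeeds after refresh
}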

@answer1991

Contributor

commented Aug 22, 2019

/hold

@answer1991

Contributor

commented Aug 22, 2019

There is another situation: 409

My mistake, it is already handled.

@answer1991

Contributor

commented Aug 22, 2019

/hold cancel

@jfbai

Contributor Author

commented Aug 23, 2019

/test pull-kubernetes-integration

@jfbai

Contributor Author

commented Aug 23, 2019

/test pull-kubernetes-bazel-test

@k8s-ci-robot k8s-ci-robot merged commit 10bd85a into kubernetes:master Aug 23, 2019
23 checks passed

cla/linuxfoundation jfbai authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-csi-serial Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-gce-iscsi Skipped.
pull-kubernetes-e2e-gce-iscsi-serial Skipped.
pull-kubernetes-e2e-gce-storage-slow Skipped.
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-node-e2e-containerd Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
pull-publishing-bot-validate Skipped.
tide In merge pool.
@k8s-ci-robot k8s-ci-robot added this to the v1.16 milestone Aug 23, 2019