Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling OptimisticLockError in kubelet node lease controller #79341

Merged
merged 1 commit into from
Jun 27, 2019

Conversation

krzysied
Copy link
Contributor

What type of PR is this?
/kind bug

What this PR does / why we need it:
Updating node lease assumes that node lease hasn't changed between get and update calls. It sometimes happens that kubelet doesn't get ack from apiserver even though that node lease is correctly updated. Kubelet tries to re-update node lease, which fails due to the apiserver having never version of lease. This PR adds additional lease get when optimistic lock error happens.

This bug was described by @mborsz in #79096 (comment)

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 24, 2019
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 24, 2019
@krzysied
Copy link
Contributor Author

/priority important-soon
/assign @wojtek-t
/cc @mborsz

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 24, 2019
@k8s-ci-robot k8s-ci-robot requested a review from mborsz June 24, 2019 15:25
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jun 25, 2019
@krzysied
Copy link
Contributor Author

/test pull-kubernetes-e2e-gce-100-performance

patchCalls chan pair
}

func newFakeLeaseClient() *fakeLeaseClient {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I simplified the test. Reactor works perfectly.

@krzysied krzysied force-pushed the kubelet_lease_fix branch 4 times, most recently from fdebc07 to d490d9e Compare June 26, 2019 13:49
Copy link
Member

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to take a deeper look too and will have a couple more cosmetic comments.

for i := 0; i < maxUpdateRetries; i++ {
_, err := c.leaseClient.Update(c.newLease(base))
// OptimisticLockError requires getting the newer version of lease to proceed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move it to the bottm of the loop - then you don't need to define err, and obviously in the first iteration it will always be nil.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.
I wanted to avoid additional get call. I assume that code is more readable now?

for i := 0; i < maxUpdateRetries; i++ {
_, err := c.leaseClient.Update(c.newLease(base))
// OptimisticLockError requires getting the newer version of lease to proceed.
if err != nil && strings.Contains(err.Error(), registry.OptimisticLockErrorMsg) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't want to leak OptimisticLockErrorMsg here. Instead use:
apierrors.IsConflict() function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@wojtek-t
Copy link
Member

/assign @wangzhen127

@krzysied krzysied force-pushed the kubelet_lease_fix branch 2 times, most recently from e25b569 to 4dd8892 Compare June 26, 2019 19:05
pkg/kubelet/nodelease/controller.go Show resolved Hide resolved
UID: types.UID("foo-uid"),
},
}
noConnectionUpdateErr := apierrors.NewServerTimeout(schema.GroupResource{Group: "v1", Resource: "lease"}, "put", 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gr := schema.GroupResource{Group: "v1", Resource: "lease"}

and use it here and below (the lines will be shorter then).

optimistcLockUpdateErr := apierrors.NewConflict(schema.GroupResource{Group: "v1", Resource: "lease"}, "lease", fmt.Errorf("conflict"))
cases := []struct {
desc string
updateReactor func(action clienttesting.Action) (handled bool, ret runtime.Object, err error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove names of parameters in the here and below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}{
{
desc: "no errors",
updateReactor: func(action clienttesting.Action) (handled bool, ret runtime.Object, err error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove names of output params - they are not used anywhere

both here and in all following cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@krzysied
Copy link
Contributor Author

/test pull-kubernetes-bazel-test

leaseDurationSeconds: 10,
onRepeatedHeartbeatFailure: tc.onRepeatedHeartbeatFailure,
}
if err := c.retryUpdateLease(nil); fmt.Sprintf("%v", err) != fmt.Sprintf("%v", tc.expectErr) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need fmt.Sprintf?

Please fix so that we could simply compare errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a hacky way to compare errors - normally you use Error(), however this will not work with nil. To avoid nil checks etc, using Spritnf will cause to call Error() if err != nil, or just returns .

Assuming that we don't really need to compare errors, but rather check if error was returned or not, I can change ecpectErr to bool and verify of err != nil. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming that we don't really need to compare errors, but rather check if error was returned or not, I can change ecpectErr to bool and verify of err != nil. WDYT?

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@wojtek-t
Copy link
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 27, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: krzysied, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 27, 2019
@k8s-ci-robot k8s-ci-robot merged commit f208769 into kubernetes:master Jun 27, 2019
@krzysied krzysied deleted the kubelet_lease_fix branch June 27, 2019 12:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants