Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix rolling update daemonset bug in clock-skew scenario #77208

Merged
merged 1 commit into from
May 4, 2019

Conversation

DaiHao
Copy link
Contributor

@DaiHao DaiHao commented Apr 29, 2019

What type of PR is this?

/kind bug

What this PR does / why we need it:

Pod's Ready condition is managed by kubelet, which is related to node clock. When controller's clock is slower than node, we will found that IsPodAvailable function judges pod unavailable. Then daemonset lose the change to update its status.
In this commit, controller resync the DaemonSet after MinReadySeconds as a last line of defense to guard against clock-skew.

Related issue #41641

Which issue(s) this PR fixes:

Fixes #77203

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

enhance the daemonset sync logic in clock-skew scenario

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 29, 2019
@k8s-ci-robot
Copy link
Contributor

Hi @DaiHao. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 29, 2019
@DaiHao
Copy link
Contributor Author

DaiHao commented Apr 29, 2019

PTAL @resouer @Kargakis

@answer1991
Copy link
Contributor

it's a critical bug in our production environment.

@zhangxiaoyu-zidif PTAL

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 30, 2019
@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 30, 2019
@0xmichalis
Copy link
Contributor

/priority important-soon
/kind bug

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. kind/bug Categorizes issue or PR as related to a bug. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Apr 30, 2019
@0xmichalis
Copy link
Contributor

@kubernetes/sig-apps-pr-reviews needs an approval

@DaiHao
Copy link
Contributor Author

DaiHao commented Apr 30, 2019

/test pull-kubernetes-e2e-gce-device-plugin-gpu

@DaiHao DaiHao changed the title Enqueue controllers after minreadyseconds when all pods are ready Fix rolling update daemonset bug in clock-skew scenario Apr 30, 2019
@zhangxiaoyu-zidif
Copy link
Contributor

/lgtm

@zhangxiaoyu-zidif
Copy link
Contributor

ping @janetkuo

@@ -1198,6 +1198,10 @@ func (dsc *DaemonSetsController) updateDaemonSetStatus(ds *apps.DaemonSet, nodeL
return fmt.Errorf("error storing status for daemon set %#v: %v", ds, err)
}

// Resync the DaemonSet after MinReadySeconds as a last line of defense to guard against clock-skew.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would help to enhance this comment with a more detailed explanation of this issue. I am still not following how the nodes (on which pod is running)clock skew from controller nodes clock causes this ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the reason would be this line https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/daemon/daemon_controller.go#L1174 , the reason from what i understand is
1: kubelet marks the pod ready and changes LastTransitionTime
2: controller checks IsPodAvailable() and finds the pod unavailable since its diffing LastTransitionTime(set by kubelet) with time.Now which is controller time. Since controller is behind in clock, minreadySeconds is not satisfied and it marks unavailable even though minReadySeconds is satisfied

can one of you confirm if this is the right understanding of this issue @DaiHao @Kargakis ?
Also in this case when a pod is marked unavailable, why wont it be requeued ?

I think including all of this explanation will be helpful

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see here. #77208 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the reason would be this line https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/daemon/daemon_controller.go#L1174 , the reason from what i understand is
1: kubelet marks the pod ready and changes LastTransitionTime
2: controller checks IsPodAvailable() and finds the pod unavailable since its diffing LastTransitionTime(set by kubelet) with time.Now which is controller time. Since controller is behind in clock, minreadySeconds is not satisfied and it marks unavailable even though minReadySeconds is satisfied

can one of you confirm if this is the right understanding of this issue @DaiHao @Kargakis ?
Also in this case when a pod is marked unavailable, why wont it be requeued ?

I think including all of this explanation will be helpful

Your understanding is right.
Pod's status do not change in the sync loop, controller mark it unavailable only in daemonset's status, but which is equal with its last status, so daemonset also not requeue.
see here.

func storeDaemonSetStatus(dsClient unversionedapps.DaemonSetInterface, ds *apps.DaemonSet, desiredNumberScheduled, currentNumberScheduled, numberMisscheduled, numberReady, updatedNumberScheduled, numberAvailable, numberUnavailable int, updateObservedGen bool) error {
if int(ds.Status.DesiredNumberScheduled) == desiredNumberScheduled &&
int(ds.Status.CurrentNumberScheduled) == currentNumberScheduled &&
int(ds.Status.NumberMisscheduled) == numberMisscheduled &&
int(ds.Status.NumberReady) == numberReady &&
int(ds.Status.UpdatedNumberScheduled) == updatedNumberScheduled &&
int(ds.Status.NumberAvailable) == numberAvailable &&
int(ds.Status.NumberUnavailable) == numberUnavailable &&
ds.Status.ObservedGeneration >= ds.Generation {
return nil
}

@janetkuo
Copy link
Member

janetkuo commented May 4, 2019

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: DaiHao, janetkuo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 4, 2019
@k8s-ci-robot k8s-ci-robot merged commit e871241 into kubernetes:master May 4, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RollingUpdate daemonset got stucked
7 participants