Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix rolling update daemonset bug in clock-skew scenario #77208

Merged
merged 1 commit into from May 4, 2019

Conversation

@DaiHao
Copy link
Member

commented Apr 29, 2019

What type of PR is this?

/kind bug

What this PR does / why we need it:

Pod's Ready condition is managed by kubelet, which is related to node clock. When controller's clock is slower than node, we will found that IsPodAvailable function judges pod unavailable. Then daemonset lose the change to update its status.
In this commit, controller resync the DaemonSet after MinReadySeconds as a last line of defense to guard against clock-skew.

Related issue #41641

Which issue(s) this PR fixes:

Fixes #77203

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

enhance the daemonset sync logic in clock-skew scenario
@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented Apr 29, 2019

Hi @DaiHao. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@DaiHao

This comment has been minimized.

Copy link
Member Author

commented Apr 29, 2019

@answer1991

This comment has been minimized.

Copy link
Contributor

commented Apr 29, 2019

it's a critical bug in our production environment.

@zhangxiaoyu-zidif PTAL

@kargakis

This comment has been minimized.

Copy link
Member

commented Apr 30, 2019

/priority important-soon
/kind bug

@kargakis

This comment has been minimized.

Copy link
Member

commented Apr 30, 2019

@DaiHao

This comment has been minimized.

Copy link
Member Author

commented Apr 30, 2019

/test pull-kubernetes-e2e-gce-device-plugin-gpu

@DaiHao DaiHao changed the title Enqueue controllers after minreadyseconds when all pods are ready Fix rolling update daemonset bug in clock-skew scenario Apr 30, 2019

@zhangxiaoyu-zidif

This comment has been minimized.

Copy link
Member

commented Apr 30, 2019

/lgtm

@zhangxiaoyu-zidif

This comment has been minimized.

Copy link
Member

commented Apr 30, 2019

ping @janetkuo

@@ -1198,6 +1198,10 @@ func (dsc *DaemonSetsController) updateDaemonSetStatus(ds *apps.DaemonSet, nodeL
return fmt.Errorf("error storing status for daemon set %#v: %v", ds, err)
}

// Resync the DaemonSet after MinReadySeconds as a last line of defense to guard against clock-skew.

This comment has been minimized.

Copy link
@krmayankk

krmayankk May 1, 2019

Contributor

it would help to enhance this comment with a more detailed explanation of this issue. I am still not following how the nodes (on which pod is running)clock skew from controller nodes clock causes this ?

This comment has been minimized.

Copy link
@krmayankk

krmayankk May 1, 2019

Contributor

Seems like the reason would be this line https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/daemon/daemon_controller.go#L1174 , the reason from what i understand is
1: kubelet marks the pod ready and changes LastTransitionTime
2: controller checks IsPodAvailable() and finds the pod unavailable since its diffing LastTransitionTime(set by kubelet) with time.Now which is controller time. Since controller is behind in clock, minreadySeconds is not satisfied and it marks unavailable even though minReadySeconds is satisfied

can one of you confirm if this is the right understanding of this issue @DaiHao @kargakis ?
Also in this case when a pod is marked unavailable, why wont it be requeued ?

I think including all of this explanation will be helpful

This comment has been minimized.

Copy link
@DaiHao

DaiHao May 1, 2019

Author Member

see here. #77208 (comment)

This comment has been minimized.

Copy link
@DaiHao

DaiHao May 1, 2019

Author Member

Seems like the reason would be this line https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/daemon/daemon_controller.go#L1174 , the reason from what i understand is
1: kubelet marks the pod ready and changes LastTransitionTime
2: controller checks IsPodAvailable() and finds the pod unavailable since its diffing LastTransitionTime(set by kubelet) with time.Now which is controller time. Since controller is behind in clock, minreadySeconds is not satisfied and it marks unavailable even though minReadySeconds is satisfied

can one of you confirm if this is the right understanding of this issue @DaiHao @kargakis ?
Also in this case when a pod is marked unavailable, why wont it be requeued ?

I think including all of this explanation will be helpful

Your understanding is right.
Pod's status do not change in the sync loop, controller mark it unavailable only in daemonset's status, but which is equal with its last status, so daemonset also not requeue.
see here.

func storeDaemonSetStatus(dsClient unversionedapps.DaemonSetInterface, ds *apps.DaemonSet, desiredNumberScheduled, currentNumberScheduled, numberMisscheduled, numberReady, updatedNumberScheduled, numberAvailable, numberUnavailable int, updateObservedGen bool) error {
if int(ds.Status.DesiredNumberScheduled) == desiredNumberScheduled &&
int(ds.Status.CurrentNumberScheduled) == currentNumberScheduled &&
int(ds.Status.NumberMisscheduled) == numberMisscheduled &&
int(ds.Status.NumberReady) == numberReady &&
int(ds.Status.UpdatedNumberScheduled) == updatedNumberScheduled &&
int(ds.Status.NumberAvailable) == numberAvailable &&
int(ds.Status.NumberUnavailable) == numberUnavailable &&
ds.Status.ObservedGeneration >= ds.Generation {
return nil
}

@janetkuo

This comment has been minimized.

Copy link
Member

commented May 4, 2019

/approve

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented May 4, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: DaiHao, janetkuo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit e871241 into kubernetes:master May 4, 2019

20 checks passed

cla/linuxfoundation DaiHao authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
Details
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-100-performance Job succeeded.
Details
pull-kubernetes-e2e-gce-csi-serial Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-gce-storage-slow Skipped.
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
Details
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details
pull-publishing-bot-validate Skipped.
tide In merge pool.
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.