
Add backoff for DS's pod deletion to limit fighting with kubelet failing the pod repeatedly #65309

Merged

merged 2 commits into kubernetes:master from tnozicka:add-ds-recreate-backoff on Aug 15, 2018

Conversation

@tnozicka
Contributor

tnozicka commented Jun 21, 2018

What this PR does / why we need it:
Limits the consequences of the DS controller hot-looping and fighting with the kubelet.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #65240

Release note:

The DaemonSet controller now uses a backoff algorithm to avoid hot loops fighting with the kubelet over pod recreation when a particular DaemonSet is misconfigured.

TODO:

  • Export the backoff settings as args or constants
  • Add a test case

/cc @mfojtik
(Will add more folks when it's ready, to avoid spamming them.)
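
For readers skimming the diff: the core mechanism is the client-go flowcontrol backoff, where each key gets an exponentially growing delay before the next attempt is allowed. Below is a minimal, runnable sketch of that primitive only (the key format is illustrative, not the PR's exact code):

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/flowcontrol"
)

func main() {
	// Initial delay 1s, capped at 15m - the values this PR ends up wiring in.
	backoff := flowcontrol.NewBackOff(1*time.Second, 15*time.Minute)

	// Hypothetical per-(DaemonSet, node) key; the real key format lives in the PR.
	key := "ds-uid/observed-generation/node-1"

	for attempt := 0; attempt < 3; attempt++ {
		now := time.Now()
		if backoff.IsInBackOffSinceUpdate(key, now) {
			// A repeated failure inside the window is rate limited.
			fmt.Printf("attempt %d: still in backoff (current delay %v)\n", attempt, backoff.Get(key))
			continue
		}
		// Record the failure; the stored delay starts at 1s and doubles on each
		// later Next call made within the window, up to the 15m cap.
		backoff.Next(key, now)
		fmt.Printf("attempt %d: allowed (next delay %v)\n", attempt, backoff.Get(key))
	}
}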

@@ -1305,24 +1315,87 @@ func TestDaemonKillFailedPods(t *testing.T) {
{numFailedPods: 0, numNormalPods: 0, expectedCreates: 1, expectedDeletes: 0, expectedEvents: 0, test: "no pods (create 1)"},
{numFailedPods: 1, numNormalPods: 0, expectedCreates: 0, expectedDeletes: 1, expectedEvents: 1, test: "1 failed pod (kill 1), 0 normal pod (create 0; will create in the next sync)"},
{numFailedPods: 1, numNormalPods: 3, expectedCreates: 0, expectedDeletes: 3, expectedEvents: 1, test: "1 failed pod (kill 1), 3 normal pods (kill 2)"},
{numFailedPods: 2, numNormalPods: 1, expectedCreates: 0, expectedDeletes: 2, expectedEvents: 2, test: "2 failed pods (kill 2), 1 normal pod"},

@tnozicka (Contributor) • Jun 25, 2018

Having 2 failed DS pods for 1 DS got rate limited by the new functionality

@liggitt (Member) • Jun 27, 2018

does that make this test time-dependent? will that cause test flakes?

@tnozicka (Contributor) • Jun 28, 2018

In normal circumstances there is only 1 running pod for a particular DS on a node. Is there a case where it'd be legitimate to have 2 running (non-deleted) pods on the same node for the same DS? (Rolling update first deletes the old pods.) If so, for the flake to appear a node would have to evict both of those pods within the backoff interval.
But this applies even for the normal case of 1 pod per node for a DS - if the same node evicts a pod for the same DS twice within the backoff interval, the creation gets rate limited. If that happens in your cluster, it seems like you have a root cause to take care of, not a flake.

@liggitt (Member) • Jun 29, 2018

I meant that, before we were using a fake clock, if 50ms was exceeded during the test loop, this case might get hit. With a fake clock there should be no time-based flakiness.
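
For context on the fake-clock point: the flowcontrol package ships a test constructor that takes a fake clock, so a small backoff window only advances when the test steps it, regardless of how slowly the test loop runs. A rough, self-contained sketch (not the PR's actual test; the clock import path differs between Kubernetes versions, and the 50ms/500ms values and key are illustrative):

package daemon

import (
	"testing"
	"time"

	"k8s.io/apimachinery/pkg/util/clock"
	"k8s.io/client-go/util/flowcontrol"
)

func TestFailedPodsBackoffWithFakeClock(t *testing.T) {
	fakeClock := clock.NewFakeClock(time.Now())
	backoff := flowcontrol.NewFakeBackOff(50*time.Millisecond, 500*time.Millisecond, fakeClock)

	key := "test-ds/0/node-0" // hypothetical backoff key

	// Record one failure; the key is now in backoff for 50ms of *fake* time.
	backoff.Next(key, fakeClock.Now())
	if !backoff.IsInBackOffSinceUpdate(key, fakeClock.Now()) {
		t.Fatal("expected key to be in backoff right after a failure")
	}

	// Wall-clock time spent by the test loop is irrelevant; only stepping the
	// fake clock moves the backoff window forward.
	fakeClock.Step(100 * time.Millisecond)
	if backoff.IsInBackOffSinceUpdate(key, fakeClock.Now()) {
		t.Fatal("expected the backoff to have expired after stepping the fake clock")
	}
}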

}
for _, test := range tests {
t.Logf("test case: %s\n", test.test)
for _, strategy := range updateStrategies() {
t.Run(test.test, func(t *testing.T) {

@tnozicka (Contributor) • Jun 25, 2018

Just switching to a subtest, the diff got messed up

@tnozicka tnozicka changed the title from [WIP] - Add backoff for DS's pod deletion to limit fighting with kubelet failing the pod repeatedly to Add backoff for DS's pod deletion to limit fighting with kubelet failing the pod repeatedly Jun 26, 2018

@tnozicka (Contributor) commented Jun 26, 2018 (comment minimized)

@tnozicka (Contributor) commented Jun 27, 2018 (comment minimized)

@k8s-ci-robot k8s-ci-robot requested review from deads2k and liggitt Jun 27, 2018

msg := fmt.Sprintf("Found failed daemon pod %s/%s on node %s, will try to kill it", pod.Namespace, pod.Name, node.Name)
glog.V(2).Infof(msg)
// Emit an event so that it's discoverable to users.
dsc.eventRecorder.Eventf(ds, v1.EventTypeWarning, FailedDaemonPodReason, msg)
podsToDelete = append(podsToDelete, pod.Name)
failedPodsObserved++

@liggitt (Member) • Jun 27, 2018

it seems like the new backoff code should come here after we've emitted the event and incremented failedPodsObserved

@liggitt (Member) • Jun 27, 2018

oh, unless that means we would emit the event on every backoff check

@tnozicka (Contributor) • Jun 28, 2018

yes, that's exactly the reason why I did it before the event
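
For context, the ordering being discussed ends up roughly like the sketch below (simplified and partly paraphrased from the diff context above; the failedPodsBackoffKey helper name is an assumption): the backoff check short-circuits before any logging or event emission, so a pod that is still in backoff does not generate a warning event on every sync.

if pod.Status.Phase == v1.PodFailed {
	// Check the per-(DaemonSet, node) backoff first, before the event below,
	// so repeated syncs during the backoff window stay quiet.
	backoffKey := failedPodsBackoffKey(ds, node.Name)
	now := dsc.failedPodsBackoff.Clock.Now()
	if dsc.failedPodsBackoff.IsInBackOffSinceUpdate(backoffKey, now) {
		// Still backing off: skip the delete and emit no event.
		continue
	}
	dsc.failedPodsBackoff.Next(backoffKey, now)

	// Only now do we log, emit the warning event, and schedule the deletion.
	msg := fmt.Sprintf("Found failed daemon pod %s/%s on node %s, will try to kill it", pod.Namespace, pod.Name, node.Name)
	glog.V(2).Infof(msg)
	dsc.eventRecorder.Eventf(ds, v1.EventTypeWarning, FailedDaemonPodReason, msg)
	podsToDelete = append(podsToDelete, pod.Name)
	failedPodsObserved++
}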

@tnozicka

comments addressed

(outdated review threads on pkg/controller/daemon/daemon_controller.go and pkg/controller/daemon/daemon_controller_test.go)

@liggitt (Member) commented Jun 29, 2018

looks reasonable to me, though I haven't used the flowcontrol backoff before. would be good to get an ack from someone familiar with it to make sure the usage is right.

@tnozicka (Contributor) commented Jul 31, 2018

flake #66337
/test pull-kubernetes-e2e-gce-100-performance

@tnozicka (Contributor) commented Aug 3, 2018 (comment minimized)

@k8s-merge-robot (Contributor) commented Aug 3, 2018

[MILESTONENOTIFIER] Milestone Pull Request Needs Approval

@janetkuo @kow3ns @liggitt @tnozicka @kubernetes/sig-apps-misc

Action required: This pull request must have the status/approved-for-milestone label applied by a SIG maintainer. If the label is not applied within 2 days, the pull request will be moved out of the v1.12 milestone.

Pull Request Labels
  • sig/apps: Pull Request will be escalated to these SIGs if needed.
  • priority/important-soon: Escalate to the pull request owners and SIG owner; move out of milestone after several unsuccessful escalation attempts.
  • kind/bug: Fixes a bug discovered during the current release.

@liggitt (Member) commented Aug 6, 2018

looks pretty good to me, just a couple nits and a question about the starting backoff

@tnozicka (Contributor) commented Aug 13, 2018

/retest

@tnozicka (Contributor) commented Aug 13, 2018

(previous integration flake from DS was unrelated - fixed in #67346)

continue
}
dsc.failedPodsBackoff.Next(backoffKey, now)

@janetkuo (Member) • Aug 14, 2018

Do we need failedPodsBackoff? Why not just use dsc.enqueueDaemonSetRateLimited(ds)?

@liggitt (Member) • Aug 14, 2018

Yes, we need it. Rate-limited re-queuing does not prevent other events on the DaemonSet from causing it to be queued and processed prior to that rate-limited re-add, which would trigger delete/recreate of all of the failed pods on the failed nodes. We want the controller to remain responsive without churning on permanently failing nodes.
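
To make that concrete: the dedicated backoff gates the delete per (DaemonSet, node) key no matter how the DaemonSet got onto the work queue, and the object can still be re-synced once the window expires. A sketch of that branch (the enqueueDaemonSetAfter name and log wording are paraphrased from the PR's approach, not quoted):

if dsc.failedPodsBackoff.IsInBackOffSinceUpdate(backoffKey, now) {
	delay := dsc.failedPodsBackoff.Get(backoffKey)
	// Other events may still trigger a sync at any time, but this branch keeps
	// short-circuiting for this node until its backoff expires.
	glog.V(4).Infof("Deleting failed pod %s/%s on node %s is limited by backoff (current backoff %v)",
		pod.Namespace, pod.Name, node.Name, delay)
	// Re-sync once the delay has elapsed, rather than relying on a generic
	// rate-limited requeue that other events could bypass.
	dsc.enqueueDaemonSetAfter(ds, delay)
	continue
}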

@tnozicka (Contributor) commented Aug 15, 2018

(only fixed a Bazel conflict)

@liggitt (Member) commented Aug 15, 2018

one last nit on the starting backoff, then LGTM

@tnozicka (Contributor) commented Aug 15, 2018

nit fixed

$ branch=$(git rev-parse --abbrev-ref HEAD) && interdiff -U1 <(git diff $(git merge-base origin/${branch} upstream) origin/${branch}) <(git diff $(git merge-base ${branch} upstream) ${branch}) | colordiff
diff -U1 b/cmd/kube-controller-manager/app/apps.go b/cmd/kube-controller-manager/app/apps.go
--- b/cmd/kube-controller-manager/app/apps.go
+++ b/cmd/kube-controller-manager/app/apps.go
@@ -24,5 +24,4 @@
 	"fmt"
-	"time"
-
 	"net/http"
+	"time"
 
@@ -46,3 +45,3 @@
 		ctx.ClientBuilder.ClientOrDie("daemon-set-controller"),
-		flowcontrol.NewBackOff(5*time.Second, 15*time.Minute),
+		flowcontrol.NewBackOff(1*time.Second, 15*time.Minute),
 	)
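
For reference, after this nit the controller wiring in cmd/kube-controller-manager/app/apps.go looks roughly like the sketch below; only the last two arguments are confirmed by the interdiff above, the surrounding arguments are assumptions:

// The DaemonSet controller receives a shared failed-pods backoff: 1s initial
// delay, doubling per repeated failure of the same (DaemonSet, node) key,
// capped at 15m, with stale entries garbage collected once they age out.
dsc, err := daemon.NewDaemonSetsController(
	ctx.InformerFactory.Apps().V1().DaemonSets(),
	ctx.InformerFactory.Apps().V1().ControllerRevisions(),
	ctx.InformerFactory.Core().V1().Pods(),
	ctx.InformerFactory.Core().V1().Nodes(),
	ctx.ClientBuilder.ClientOrDie("daemon-set-controller"),
	flowcontrol.NewBackOff(1*time.Second, 15*time.Minute),
)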

@liggitt (Member) commented Aug 15, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Aug 15, 2018

@deads2k (Contributor) commented Aug 15, 2018

I'm glad to see this being fixed. Hopefully the scheduler based daemonset impl will remember to do something similar.

/approve

@k8s-ci-robot (Contributor) commented Aug 15, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, liggitt, tnozicka

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-merge-robot (Contributor) commented Aug 15, 2018

Automatic merge from submit-queue (batch tested with PRs 62441, 66702, 67254, 67421, 65309). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-merge-robot k8s-merge-robot merged commit c1f7df2 into kubernetes:master Aug 15, 2018

18 checks passed

Submit Queue: Queued to run github e2e tests a second time.
cla/linuxfoundation: tnozicka authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-cross: Skipped
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-100-performance: Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-e2e-gke: Skipped
pull-kubernetes-e2e-kops-aws: Job succeeded.
pull-kubernetes-e2e-kubeadm-gce: Skipped
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
pull-kubernetes-local-e2e: Skipped
pull-kubernetes-local-e2e-containerized: Skipped
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.

@tnozicka tnozicka deleted the tnozicka:add-ds-recreate-backoff branch Aug 22, 2018
