
[Flaky test] [sig-node] Kubelet should correctly account for terminated pods after restart #105686

Closed
ehashman opened this issue Oct 14, 2021 · 13 comments · Fixed by #105926, #106434 or #106544
Assignees: ehashman

Labels:
- kind/flake: Categorizes issue or PR as related to a flaky test.
- sig/node: Categorizes an issue or PR as relevant to SIG Node.
- triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Milestone: v1.23

Comments

@ehashman
Member

Which jobs are flaking?

node-kubelet-serial

Which tests are flaking?

Kubelet should correctly account for terminated pods after restart

Since when has it been flaking?

Since the test was created; it flakes 10-20% of the time. It's not entirely clear why; I haven't looked into it.

Testgrid link

https://testgrid.k8s.io/sig-node-kubelet#node-kubelet-serial&include-filter-by-regex=Kubelet%20should%20correctly%20account%20for%20terminated%20pods%20after%20restart

Reason for failure (if possible)

No response

Anything else we need to know?

/assign

Relevant SIG(s)

/sig node

@ehashman ehashman added the kind/flake Categorizes issue or PR as related to a flaky test. label Oct 14, 2021
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 14, 2021
@249043822
Member

Introduced by #105527; maybe we should replace time.Minute with startTimeout:

completedPods := waitForPods(f, podCountRestartNever, time.Minute)
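
A minimal sketch of the proposed change, assuming startTimeout is a timeout value already defined in this test file:

// Hypothetical fix: reuse the test's existing startTimeout instead of a
// hard-coded one-minute wait, since pods can take tens of seconds to start in CI.
completedPods := waitForPods(f, podCountRestartNever, startTimeout)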

See the log below: a pod can take tens of seconds to reach Succeeded.

I1017 16:41:22.066] Oct 17 15:20:55.797: INFO: Waiting up to 5m0s for pod "pod68c194cb-9705-46e8-bc3f-204f77093794" in namespace "pids-limit-test-6747" to be "Succeeded or Failed"
I1017 16:41:22.066] Oct 17 15:20:55.807: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 9.280568ms
I1017 16:41:22.066] Oct 17 15:20:57.812: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 2.014854193s
I1017 16:41:22.066] Oct 17 15:20:59.815: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 4.017547065s
I1017 16:41:22.067] Oct 17 15:21:01.819: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 6.021437568s
I1017 16:41:22.067] Oct 17 15:21:03.822: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 8.025113282s
I1017 16:41:22.067] Oct 17 15:21:05.828: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 10.030465264s
I1017 16:41:22.067] Oct 17 15:21:07.832: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 12.035131744s
I1017 16:41:22.067] Oct 17 15:21:09.836: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 14.038253919s
I1017 16:41:22.067] Oct 17 15:21:11.839: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 16.041491005s
I1017 16:41:22.068] Oct 17 15:21:13.843: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 18.045771345s
I1017 16:41:22.068] Oct 17 15:21:15.846: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 20.049193113s
I1017 16:41:22.068] Oct 17 15:21:17.852: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 22.054460658s
I1017 16:41:22.068] Oct 17 15:21:19.854: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 24.057172942s
I1017 16:41:22.068] Oct 17 15:21:21.859: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 26.061613448s
I1017 16:41:22.068] Oct 17 15:21:23.863: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 28.06609073s
I1017 16:41:22.069] Oct 17 15:21:25.869: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 30.072213316s
I1017 16:41:22.069] Oct 17 15:21:27.874: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 32.076514455s
I1017 16:41:22.069] Oct 17 15:21:29.878: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 34.080568245s
I1017 16:41:22.069] Oct 17 15:21:31.882: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 36.084891008s
I1017 16:41:22.069] Oct 17 15:21:33.885: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 38.087975373s
I1017 16:41:22.069] Oct 17 15:21:35.890: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 40.093100519s
I1017 16:41:22.070] Oct 17 15:21:37.895: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 42.097885965s
I1017 16:41:22.070] Oct 17 15:21:39.899: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Succeeded", Reason="", readiness=false. Elapsed: 44.102158089s

@249043822
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 18, 2021
@ehashman
Member Author

Introduced by #105527; maybe we should replace time.Minute with startTimeout:

completedPods := waitForPods(f, podCountRestartNever, time.Minute)

See the log below: a pod can take tens of seconds to reach Succeeded.

Yeah I only tested this locally and not in CI, it seems that in CI the pods take much longer. Timings just need to be tweaked.

@ehashman
Member Author

ehashman commented Nov 1, 2021

/reopen

Still a number of flakes since the attached PR merged; I don't think that fixed it.

@k8s-ci-robot
Contributor

@ehashman: Reopened this issue.

In response to this:

/reopen

Still a number of flakes since the attached PR merged; I don't think that fixed it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Nov 1, 2021

@pacoxu
Member

pacoxu commented Nov 15, 2021

I see this failure in https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/106408/pull-kubernetes-node-kubelet-serial/1459822195130765312/.

@249043822
Member

So this problem should have been fixed by #106371.

@ehashman
Member Author

From gubernator:

Failure cluster f3393094bdefc50e4d1f

Error text:
_output/local/go/src/k8s.io/kubernetes/test/e2e_node/restart_test.go:152
Nov  1 10:28:15.543: Failed to run sufficient restartNever pods, got 1 but expected 2
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113

Recent failures:

11/14/2021, 6:36:14 AM ci-kubernetes-node-kubelet-serial
11/13/2021, 6:35:14 AM ci-kubernetes-node-kubelet-serial
11/12/2021, 10:35:13 PM ci-kubernetes-node-kubelet-serial
11/12/2021, 6:34:13 PM ci-kubernetes-node-kubelet-serial
11/12/2021, 10:33:58 AM ci-kubernetes-node-kubelet-serial

In all cases, it's timing out. I'll try upping the startTimeout.
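
For illustration, a hedged sketch of that bump; the three-minute figure is a placeholder, not necessarily the value actually chosen:

// Hypothetical: widen the window restartNever pods get to reach Succeeded
// on slow CI nodes. The exact value here is a guess.
const startTimeout = 3 * time.Minute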

@ehashman
Member Author

/reopen

Still flaking.

@k8s-ci-robot
Contributor

@ehashman: Reopened this issue.

In response to this:

/reopen

Still flaking.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Nov 18, 2021
@ehashman
Member Author

Luckily the kubelet log is clear as to what's going on. From https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-node-kubelet-serial/1461404704922669056/artifacts/n1-standard-2-cos-93-16623-39-21-991516df/kubelet.log (flake https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-node-kubelet-serial/1461404704922669056)

_output/local/go/src/k8s.io/kubernetes/test/e2e_node/restart_test.go:146
Nov 18 19:38:37.481: Failed to run sufficient restartNever pods, got 1 but expected 2
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113
I1118 19:37:05.003842   43953 config.go:383] "Receiving a new pod" pod="restart-test-4322/test-a693eea8-a137-471f-bc38-0768abf47fa5"
I1118 19:37:05.003899   43953 kubelet.go:2063] "SyncLoop ADD" source="api" pods=[restart-test-4322/test-a693eea8-a137-471f-bc38-0768abf47fa5]
I1118 19:37:05.003930   43953 topology_manager.go:200] "Topology Admit Handler"
I1118 19:37:05.003948   43953 manager.go:914] "Looking for needed resources" needed=1 resourceName="cpu"
I1118 19:37:05.004036   43953 predicate.go:145] "Predicate failed on Pod" pod="restart-test-4322/test-a693eea8-a137-471f-bc38-0768abf47fa5" err="Node didn't have enough resource: cpu, requested: 1000, used: 1100, capacity: 2000"

The test should work fine if we drop the CPU request a bit; let me tweak it to make up for the possible extra 100m scheduled. I'm not sure what's causing that (aren't these tests serial? What the heck is getting scheduled?) but we should be resilient to it.
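
For illustration only, a sketch of the kind of adjustment described above; the container spec and the 900m figure are hypothetical, and the real test constructs its pods differently:

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// newRestartNeverContainer sketches the fix: request a bit less than half of a
// 2-CPU node so an unexpected extra ~100m already in use doesn't fail
// admission for the second pod.
func newRestartNeverContainer() v1.Container {
	return v1.Container{
		Name:  "restart-test",
		Image: "busybox",
		Resources: v1.ResourceRequirements{
			Requests: v1.ResourceList{
				// was 1000m per pod; 900m leaves ~200m headroom on 2000m capacity
				v1.ResourceCPU: resource.MustParse("900m"),
			},
		},
	}
}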

Also, https://github.com/kubernetes/kubernetes/pull/105926/files#r752688249 needs to be fixed: after that change, this test will no longer detect the regression it was designed to catch. I will revert it and also fix the resource requirements, which should fix this test.

@ehashman
Member Author

/milestone v1.23

Need to fix this before test freeze.
