
[Flaky test] [sig-node] Kubelet should correctly account for terminated pods after restart #105686

Closed
ehashman opened this issue Oct 14, 2021 · 13 comments · Fixed by #105926, #106434 or #106544
Assignees: ehashman

Labels:
- kind/flake: Categorizes issue or PR as related to a flaky test.
- sig/node: Categorizes an issue or PR as relevant to SIG Node.
- triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Milestone: v1.23

Comments

@ehashman
Member

Which jobs are flaking?

node-kubelet-serial

Which tests are flaking?

Kubelet should correctly account for terminated pods after restart

Since when has it been flaking?

Since the test was created; it flakes 10-20% of the time. It's not entirely clear why; I haven't looked into it.

Testgrid link

https://testgrid.k8s.io/sig-node-kubelet#node-kubelet-serial&include-filter-by-regex=Kubelet%20should%20correctly%20account%20for%20terminated%20pods%20after%20restart

Reason for failure (if possible)

No response

Anything else we need to know?

/assign

Relevant SIG(s)

/sig node

@ehashman ehashman added the kind/flake Categorizes issue or PR as related to a flaky test. label Oct 14, 2021
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 14, 2021
@249043822
Member

Introduced by #105527; maybe we should replace time.Minute with startTimeout:

completedPods := waitForPods(f, podCountRestartNever, time.Minute)
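
A minimal sketch of the proposed change, assuming startTimeout is a timeout value already defined in this test file:

// Hypothetical fix: reuse the test's existing startTimeout instead of a
// hard-coded one-minute wait, since pods can take tens of seconds to start in CI.
completedPods := waitForPods(f, podCountRestartNever, startTimeout)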

See the log below: a pod can take tens of seconds to reach Succeeded.

I1017 16:41:22.066] Oct 17 15:20:55.797: INFO: Waiting up to 5m0s for pod "pod68c194cb-9705-46e8-bc3f-204f77093794" in namespace "pids-limit-test-6747" to be "Succeeded or Failed"
I1017 16:41:22.066] Oct 17 15:20:55.807: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 9.280568ms
I1017 16:41:22.066] Oct 17 15:20:57.812: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 2.014854193s
I1017 16:41:22.066] Oct 17 15:20:59.815: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 4.017547065s
I1017 16:41:22.067] Oct 17 15:21:01.819: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 6.021437568s
I1017 16:41:22.067] Oct 17 15:21:03.822: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 8.025113282s
I1017 16:41:22.067] Oct 17 15:21:05.828: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 10.030465264s
I1017 16:41:22.067] Oct 17 15:21:07.832: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 12.035131744s
I1017 16:41:22.067] Oct 17 15:21:09.836: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 14.038253919s
I1017 16:41:22.067] Oct 17 15:21:11.839: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 16.041491005s
I1017 16:41:22.068] Oct 17 15:21:13.843: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 18.045771345s
I1017 16:41:22.068] Oct 17 15:21:15.846: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 20.049193113s
I1017 16:41:22.068] Oct 17 15:21:17.852: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 22.054460658s
I1017 16:41:22.068] Oct 17 15:21:19.854: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 24.057172942s
I1017 16:41:22.068] Oct 17 15:21:21.859: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 26.061613448s
I1017 16:41:22.068] Oct 17 15:21:23.863: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 28.06609073s
I1017 16:41:22.069] Oct 17 15:21:25.869: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 30.072213316s
I1017 16:41:22.069] Oct 17 15:21:27.874: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 32.076514455s
I1017 16:41:22.069] Oct 17 15:21:29.878: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 34.080568245s
I1017 16:41:22.069] Oct 17 15:21:31.882: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 36.084891008s
I1017 16:41:22.069] Oct 17 15:21:33.885: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 38.087975373s
I1017 16:41:22.069] Oct 17 15:21:35.890: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 40.093100519s
I1017 16:41:22.070] Oct 17 15:21:37.895: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Pending", Reason="", readiness=false. Elapsed: 42.097885965s
I1017 16:41:22.070] Oct 17 15:21:39.899: INFO: Pod "pod68c194cb-9705-46e8-bc3f-204f77093794": Phase="Succeeded", Reason="", readiness=false. Elapsed: 44.102158089s

@249043822
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 18, 2021
@ehashman
Member Author

Introduced by #105527; maybe we should replace time.Minute with startTimeout:

completedPods := waitForPods(f, podCountRestartNever, time.Minute)

See the log below: a pod can take tens of seconds to reach Succeeded.

Yeah I only tested this locally and not in CI, it seems that in CI the pods take much longer. Timings just need to be tweaked.

@ehashman
Member Author

ehashman commented Nov 1, 2021

/reopen

Still a number of flakes since the attached PR merged; I don't think that fixed it.

@k8s-ci-robot
Contributor

@ehashman: Reopened this issue.

In response to this:

/reopen

Still a number of flakes since the attached PR merged; I don't think that fixed it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Nov 1, 2021

@pacoxu
Member

pacoxu commented Nov 15, 2021

I see this failure in https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/106408/pull-kubernetes-node-kubelet-serial/1459822195130765312/.

@249043822
Member

So this problem should have been fixed by #106371.

@ehashman
Member Author

From gubernator:

Failure cluster f3393094bdefc50e4d1f

Error text:
_output/local/go/src/k8s.io/kubernetes/test/e2e_node/restart_test.go:152
Nov  1 10:28:15.543: Failed to run sufficient restartNever pods, got 1 but expected 2
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113

Recent failures:

11/14/2021, 6:36:14 AM ci-kubernetes-node-kubelet-serial
11/13/2021, 6:35:14 AM ci-kubernetes-node-kubelet-serial
11/12/2021, 10:35:13 PM ci-kubernetes-node-kubelet-serial
11/12/2021, 6:34:13 PM ci-kubernetes-node-kubelet-serial
11/12/2021, 10:33:58 AM ci-kubernetes-node-kubelet-serial

In all cases, it's timing out. I'll try upping the startTimeout.
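
For illustration, a hedged sketch of that bump; the three-minute figure is a placeholder, not necessarily the value actually chosen:

// Hypothetical: widen the window restartNever pods get to reach Succeeded
// on slow CI nodes. The exact value here is a guess.
const startTimeout = 3 * time.Minute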

@ehashman
Member Author

/reopen

Still flaking.

@k8s-ci-robot
Contributor

@ehashman: Reopened this issue.

In response to this:

/reopen

Still flaking.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Nov 18, 2021
@ehashman
Member Author

Luckily the kubelet log is clear as to what's going on. From https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-node-kubelet-serial/1461404704922669056/artifacts/n1-standard-2-cos-93-16623-39-21-991516df/kubelet.log (flake https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-node-kubelet-serial/1461404704922669056)

_output/local/go/src/k8s.io/kubernetes/test/e2e_node/restart_test.go:146
Nov 18 19:38:37.481: Failed to run sufficient restartNever pods, got 1 but expected 2
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113
I1118 19:37:05.003842   43953 config.go:383] "Receiving a new pod" pod="restart-test-4322/test-a693eea8-a137-471f-bc38-0768abf47fa5"
I1118 19:37:05.003899   43953 kubelet.go:2063] "SyncLoop ADD" source="api" pods=[restart-test-4322/test-a693eea8-a137-471f-bc38-0768abf47fa5]
I1118 19:37:05.003930   43953 topology_manager.go:200] "Topology Admit Handler"
I1118 19:37:05.003948   43953 manager.go:914] "Looking for needed resources" needed=1 resourceName="cpu"
I1118 19:37:05.004036   43953 predicate.go:145] "Predicate failed on Pod" pod="restart-test-4322/test-a693eea8-a137-471f-bc38-0768abf47fa5" err="Node didn't have enough resource: cpu, requested: 1000, used: 1100, capacity: 2000"

The test should work fine if we drop the CPU request a bit; let me tweak it to make up for the possible extra 100m scheduled. I'm not sure what's causing that (aren't these tests serial? What the heck is getting scheduled?) but we should be resilient to it.
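
For illustration only, a sketch of the kind of adjustment described above; the container spec and the 900m figure are hypothetical, and the real test constructs its pods differently:

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// newRestartNeverContainer sketches the fix: request a bit less than half of a
// 2-CPU node so an unexpected extra ~100m already in use doesn't fail
// admission for the second pod.
func newRestartNeverContainer() v1.Container {
	return v1.Container{
		Name:  "restart-test",
		Image: "busybox",
		Resources: v1.ResourceRequirements{
			Requests: v1.ResourceList{
				// was 1000m per pod; 900m leaves ~200m headroom on 2000m capacity
				v1.ResourceCPU: resource.MustParse("900m"),
			},
		},
	}
}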

Also, https://github.com/kubernetes/kubernetes/pull/105926/files#r752688249 needs to be fixed: after that change, this test will no longer detect the regression it was designed to catch. I will revert it and also fix the resource requirements, which should fix this test.

@ehashman
Member Author

/milestone v1.23

Need to fix this before test freeze.
