[flaking test][sig-node] NodeProblemDetector should run without error #121973
I found some similar logs in #115192. /cc @SergeyKanzhelev @harche
/cc @kubernetes/sig-node-test-failures
According to @Vyom-Yadav's investigation, this issue is really old but only started flaking recently. It started to flake on 10-24, the day we bumped the NPD version in #121382. I think we should revert that until we find the root cause.
/area test
@SergeyKanzhelev these node problem detector tests need to be tagged so they are not picked up by default; jobs that want to exercise these NPD tests should explicitly opt in.
The problem here is that NPD runs as a DaemonSet, which counts toward scheduling, and sometimes the instances are too small for the pods to be scheduled, so the e2e suite times out. kubernetes/test-infra#31312 NPD should not run by default, and the tests should be opt-in.
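Making the tests opt-in could look something like skipping them by name in jobs that run the default suite. A minimal sketch using the standard Ginkgo skip flag (the exact job wiring and regex are assumptions, not the actual fix):

```shell
# Sketch: exclude the NodeProblemDetector spec from a default e2e run.
# Jobs that want these tests would pass a matching --ginkgo.focus instead.
./e2e.test --ginkgo.skip='NodeProblemDetector'
```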
Perhaps we could run the tests serially? If running all tests in parallel results in all memory being consumed, that might help. I know there are a bunch of new features between NPD versions (0.8.9 to 0.8.13/14), but I don't recall any of them particularly affecting memory usage. /triage accepted
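Running serially, as suggested above, would mean forcing a single Ginkgo worker. A hypothetical invocation (assuming the Ginkgo v2 CLI; `--procs=1` runs specs one at a time, which could reduce peak memory pressure on small instances):

```shell
# Sketch: single worker process, focused on the flaking spec.
ginkgo --procs=1 --focus='NodeProblemDetector' ./test/e2e
```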
This will be fixed by #123114
This only flakes in https://testgrid.k8s.io/sig-release-1.29-blocking#gce-cos-k8sbeta-default. We need to watch it after we upgrade to v0.8.16. #123114
Flake in https://testgrid.k8s.io/sig-release-1.29-blocking#gce-cos-k8sstable1-default, but the CI is blocked by #124438.
It still uses registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.13
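One way to confirm which NPD image a cluster is actually running (assuming the DaemonSet is named `node-problem-detector` in `kube-system`, which may vary by deployment):

```shell
# Print the image of the first container in the NPD DaemonSet pod template.
kubectl -n kube-system get daemonset node-problem-detector \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```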
So this should be fixed with NPD v0.8.16.
/close
@pacoxu: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Failure cluster f0cef9ada6202025601f
https://storage.googleapis.com/k8s-triage/index.html?test=NodeProblemDetector%20should%20run%20without%20error
Error text:
Recent failures:
2023/11/20 18:55:47 ci-kubernetes-e2e-gci-gce-kube-dns
2023/11/20 18:41:47 ci-kubernetes-e2e-prow-canary
2023/11/20 16:55:47 ci-kubernetes-e2e-gci-gce-proto
2023/11/20 13:52:47 ci-kubernetes-e2e-gci-gce-network-proxy-grpc
2023/11/20 13:10:47 ci-kubernetes-e2e-gci-gce-ipvs
/kind flake
/sig node
This flaked on the master-blocking board recently: