
[Flaking Test] [sig-node] ☂️ node-kubelet-serial-containerd job multiple flakes 🌂 #120913

Open
mmiranda96 opened this issue Sep 27, 2023 · 9 comments
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@mmiranda96
Contributor

Which jobs are flaking?

node-kubelet-serial-containerd

Which tests are flaking?

There are multiple tests:

  • E2eNode Suite.[It] [sig-node] Density [Serial] [Slow] create a batch of pods latency/resource should be within limit when create 10 pods with 0s interval
  • E2eNode Suite.[It] [sig-node] POD Resources [Serial] [Feature:PodResources][NodeFeature:PodResources] with the builtin rate limit values should hit throttling when calling podresources List in a tight loop
  • E2eNode Suite.[It] [sig-node] Device Manager [Serial] [Feature:DeviceManager][NodeFeature:DeviceManager] With sample device plugin [Serial] [Disruptive] should deploy pod consuming devices first but fail with admission error after kubelet restart in case device plugin hasn't re-registered
  • E2eNode Suite.[It] [sig-node] Device Plugin [Feature:DevicePluginProbe][NodeFeature:DevicePluginProbe][Serial] DevicePlugin [Serial] [Disruptive] Keeps device plugin assignments across kubelet restarts (no pod restart, no device plugin restart)
  • E2eNode Suite.[It] [sig-node] Device Plugin [Feature:DevicePluginProbe][NodeFeature:DevicePluginProbe][Serial] DevicePlugin [Serial] [Disruptive] Keeps device plugin assignments across node reboots (no pod restart, no device plugin re-registration)

Since when has it been flaking?

Flakes have been present for a while.

Testgrid link

https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-serial-containerd

Reason for failure (if possible)

No response

Anything else we need to know?

We run each test multiple times (3); in most cases only one of the runs fails. This might not be a critical issue, but ideally we want a green Testgrid.

Relevant SIG(s)

/sig node

@mmiranda96 mmiranda96 added the kind/flake Categorizes issue or PR as related to a flaky test. label Sep 27, 2023
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 27, 2023
@pacoxu
Member

pacoxu commented Nov 21, 2023

/kind failing-test
/remove-kind flake

It keeps failing. The last (and only) success I can see now was on 11-07.

This CI job is on https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-serial-containerd, which is the sig-node release-blocking board. If it is release blocking, we should fix it ASAP; if not, we may move it to another board such as https://testgrid.k8s.io/sig-node-containerd.
/cc @SergeyKanzhelev @mrunalp

@k8s-ci-robot k8s-ci-robot added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. and removed kind/flake Categorizes issue or PR as related to a flaky test. labels Nov 21, 2023
@pacoxu
Member

pacoxu commented Nov 23, 2023

Linking the Slack thread here: https://kubernetes.slack.com/archives/C0BP8PW9G/p1700553934108539

@ffromani
Contributor

/cc

@SergeyKanzhelev
Member

Device manager tests are failing because of the socket reconnection error. Not a regression.

E2eNode Suite.[It] [sig-node] POD Resources [Serial] [Feature:PodResources][NodeFeature:PodResources] with the builtin rate limit values should hit throttling when calling podresources List in a tight loop

Another known issue, not a regression. The flakes come from the nature of the test and how it validates the throttling logic (a rough sketch of that check follows at the end of this comment).

E2eNode Suite.[It] [sig-node] Density [Serial] [Slow] create a batch of pods latency/resource should be within limit when create 10 pods with 0s interval

Also not a regression; needs a look after the release.
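
For readers less familiar with that podresources test: it validates the kubelet's built-in rate limiter by calling the List endpoint in a tight loop and expecting some calls to be rejected, which makes it inherently timing-sensitive. Below is a minimal, hypothetical sketch of that pattern, not the actual e2e code; the socket path, loop count, and the expectation of codes.ResourceExhausted are illustrative assumptions.

```go
// Hypothetical sketch of the "List in a tight loop" throttling check
// (not the real e2e test): hammer the kubelet podresources socket and
// count how many calls the built-in rate limiter rejects.
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/status"

	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	// Assumed default location of the kubelet podresources socket.
	const socket = "unix:///var/lib/kubelet/pod-resources/kubelet.sock"

	conn, err := grpc.Dial(socket, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := podresourcesv1.NewPodResourcesListerClient(conn)

	throttled := 0
	const calls = 200 // illustrative; the real test picks its own count
	for i := 0; i < calls; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		_, err := client.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
		cancel()
		if status.Code(err) == codes.ResourceExhausted {
			throttled++ // the rate limiter kicked in for this call
		}
	}
	fmt.Printf("throttled %d of %d calls\n", throttled, calls)
}
```

A check of this shape flakes easily: whether the limiter rejects a given burst depends on how fast the loop actually runs on the node under test.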

@SergeyKanzhelev SergeyKanzhelev moved this from Triage to Issues - To do in SIG Node CI/Test Board Nov 29, 2023
@ffromani
Contributor

This PR aims to reduce/remove the flakes: #122024

@bart0sh
Contributor

bart0sh commented Dec 12, 2023

The latest test run failed only the density tests, and only on 2 of the 3 nodes:

  • n1-standard-2-cos-stable-109-17800-66-32-85034074
    Dec 11 13:58:43.336: INFO: CPU usage of containers:
    container 50th% 90th% 95th% 99th% 100th%
    "/" N/A N/A N/A N/A N/A
    "runtime" 0.072 0.886 0.886 0.886 0.898
    "kubelet" 0.012 0.098 0.098 0.098 0.147
    ...
    [FAILED] CPU usage exceeding limits:
    container "runtime": expected 95th% usage < 0.600; got 0.886
    In [It] at: test/e2e_node/resource_usage_test.go:288 @ 12/11/23 13:58:43.337
  • n1-standard-2-ubuntu-gke-2204-1-25-v20231206-d39e8fd2
    Dec 11 14:58:22.650: INFO: CPU usage of containers:
    container 50th% 90th% 95th% 99th% 100th%
    "/" N/A N/A N/A N/A N/A
    "runtime" 0.003 0.757 0.757 0.757 0.779
    "kubelet" 0.013 0.118 0.118 0.118 0.159
    ...
    [FAILED] CPU usage exceeding limits:
    container "runtime": expected 95th% usage < 0.600; got 0.757

What's interesting is that the node that succeeded is configured similarly to the failed ones, but its runtime metrics are much better:

  • n1-standard-2-cos-stable-109-17800-66-32-e1854a72
    Dec 11 13:56:40.371: INFO: CPU usage of containers:
    container 50th% 90th% 95th% 99th% 100th%
    "/" N/A N/A N/A N/A N/A
    "runtime" 0.123 0.198 0.218 0.218 0.234
    "kubelet" 0.048 0.067 0.086 0.086 0.125

The only difference I can see is that one configuration requests 2 nvidia-tesla-k80 accelerators. I'm not sure whether that's related to the density test failures, though.
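
For context on the [FAILED] lines above: the resource-usage assertion (the log points at test/e2e_node/resource_usage_test.go) compares per-container CPU usage percentiles collected during the run against fixed limits, e.g. the 95th percentile of the "runtime" container must stay below 0.600 cores. Here is a rough, self-contained sketch of that kind of check; the types and most limit values are illustrative assumptions, only the 0.60 runtime-at-95th figure comes from the log.

```go
// Rough sketch of a percentile CPU-limit check like the one failing above.
// The types and most limit values are illustrative assumptions.
package main

import "fmt"

// containerUsage maps a percentile (e.g. 0.95) to CPU usage in cores.
type containerUsage map[float64]float64

// verifyCPULimits returns a violation message for every container whose
// observed usage at some percentile exceeds its configured limit.
func verifyCPULimits(limits, actual map[string]containerUsage) []string {
	var violations []string
	for container, perPercentile := range limits {
		usage, ok := actual[container]
		if !ok {
			continue
		}
		for pct, limit := range perPercentile {
			if got := usage[pct]; got > limit {
				violations = append(violations, fmt.Sprintf(
					"container %q: expected %.0fth%% usage < %.3f; got %.3f",
					container, pct*100, limit, got))
			}
		}
	}
	return violations
}

func main() {
	// Limits: only the runtime 95th-percentile value (0.60) is taken from the
	// log; the kubelet value is a made-up placeholder.
	limits := map[string]containerUsage{
		"runtime": {0.95: 0.60},
		"kubelet": {0.95: 0.35},
	}
	// Usage numbers from the first failed node in the comment above.
	actual := map[string]containerUsage{
		"runtime": {0.50: 0.072, 0.95: 0.886},
		"kubelet": {0.50: 0.012, 0.95: 0.098},
	}
	for _, v := range verifyCPULimits(limits, actual) {
		fmt.Println("[FAILED] CPU usage exceeding limits:", v)
	}
}
```

Run against the numbers from the first failed node, this reproduces the reported violation for the runtime container only.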

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 11, 2024
@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 14, 2024
@k8s-ci-robot k8s-ci-robot changed the title node-kubelet-serial-containerd job multiple flakes [Flaking Test] [sig-node] ☂️ node-kubelet-serial-containerd job multiple flakes 🌂 Mar 14, 2024
@pacoxu
Copy link
Member

pacoxu commented Mar 14, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 14, 2024