
[Flaking Test] [sig-node] ☂️ node-kubelet-serial-containerd job multiple flakes 🌂 #120913

Open
mmiranda96 opened this issue Sep 27, 2023 · 9 comments
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@mmiranda96
Contributor

Which jobs are flaking?

node-kubelet-serial-containerd

Which tests are flaking?

There are multiple tests:

  • E2eNode Suite.[It] [sig-node] Density [Serial] [Slow] create a batch of pods latency/resource should be within limit when create 10 pods with 0s interval
  • E2eNode Suite.[It] [sig-node] POD Resources [Serial] [Feature:PodResources][NodeFeature:PodResources] with the builtin rate limit values should hit throttling when calling podresources List in a tight loop
  • E2eNode Suite.[It] [sig-node] Device Manager [Serial] [Feature:DeviceManager][NodeFeature:DeviceManager] With sample device plugin [Serial] [Disruptive] should deploy pod consuming devices first but fail with admission error after kubelet restart in case device plugin hasn't re-registered
  • E2eNode Suite.[It] [sig-node] Device Plugin [Feature:DevicePluginProbe][NodeFeature:DevicePluginProbe][Serial] DevicePlugin [Serial] [Disruptive] Keeps device plugin assignments across kubelet restarts (no pod restart, no device plugin restart)
  • E2eNode Suite.[It] [sig-node] Device Plugin [Feature:DevicePluginProbe][NodeFeature:DevicePluginProbe][Serial] DevicePlugin [Serial] [Disruptive] Keeps device plugin assignments across node reboots (no pod restart, no device plugin re-registration)

Since when has it been flaking?

Flakes have been present for a while.

Testgrid link

https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-serial-containerd

Reason for failure (if possible)

No response

Anything else we need to know?

We run each test multiple times (3); in most cases only one of the runs fails. This might not be a critical issue, but ideally we want a green Testgrid.

Relevant SIG(s)

/sig node

@mmiranda96 mmiranda96 added the kind/flake Categorizes issue or PR as related to a flaky test. label Sep 27, 2023
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 27, 2023
@pacoxu
Member

pacoxu commented Nov 21, 2023

/kind failing-test
/remove-kind flake

It keeps failing. The last (and only) success I can see now was on 11-07.

This CI job is on https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-serial-containerd, which is the sig-node release-blocking board. If it is release blocking, we should fix it ASAP; if not, we may move it to another board such as https://testgrid.k8s.io/sig-node-containerd.
/cc @SergeyKanzhelev @mrunalp

@k8s-ci-robot k8s-ci-robot added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. and removed kind/flake Categorizes issue or PR as related to a flaky test. labels Nov 21, 2023
@pacoxu
Member

pacoxu commented Nov 23, 2023

Linking the Slack thread here: https://kubernetes.slack.com/archives/C0BP8PW9G/p1700553934108539

@ffromani
Contributor

/cc

@SergeyKanzhelev
Member

Device manager tests are failing because of the socket reconnection error. Not a regression.

E2eNode Suite.[It] [sig-node] POD Resources [Serial] [Feature:PodResources][NodeFeature:PodResources] with the builtin rate limit values should hit throttling when calling podresources List in a tight loop

Another known issue, not a regression. The flakes come from the nature of the test and how it validates the throttling logic (a rough sketch of that check follows at the end of this comment).

E2eNode Suite.[It] [sig-node] Density [Serial] [Slow] create a batch of pods latency/resource should be within limit when create 10 pods with 0s interval

Also not a regression; needs a look after the release.
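
For readers less familiar with that podresources test: it validates the kubelet's built-in rate limiter by calling the List endpoint in a tight loop and expecting some calls to be rejected, which makes it inherently timing-sensitive. Below is a minimal, hypothetical sketch of that pattern, not the actual e2e code; the socket path, loop count, and the expectation of codes.ResourceExhausted are illustrative assumptions.

```go
// Hypothetical sketch of the "List in a tight loop" throttling check
// (not the real e2e test): hammer the kubelet podresources socket and
// count how many calls the built-in rate limiter rejects.
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/status"

	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	// Assumed default location of the kubelet podresources socket.
	const socket = "unix:///var/lib/kubelet/pod-resources/kubelet.sock"

	conn, err := grpc.Dial(socket, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := podresourcesv1.NewPodResourcesListerClient(conn)

	throttled := 0
	const calls = 200 // illustrative; the real test picks its own count
	for i := 0; i < calls; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		_, err := client.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
		cancel()
		if status.Code(err) == codes.ResourceExhausted {
			throttled++ // the rate limiter kicked in for this call
		}
	}
	fmt.Printf("throttled %d of %d calls\n", throttled, calls)
}
```

A check of this shape flakes easily: whether the limiter rejects a given burst depends on how fast the loop actually runs on the node under test.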

@SergeyKanzhelev SergeyKanzhelev moved this from Triage to Issues - To do in SIG Node CI/Test Board Nov 29, 2023
@ffromani
Contributor

This PR aims to reduce/remove the flakes: #122024

@bart0sh
Contributor

bart0sh commented Dec 12, 2023

The latest test run failed only the density tests, and only on 2 of the 3 nodes:

  • n1-standard-2-cos-stable-109-17800-66-32-85034074
    Dec 11 13:58:43.336: INFO: CPU usage of containers:
    container 50th% 90th% 95th% 99th% 100th%
    "/" N/A N/A N/A N/A N/A
    "runtime" 0.072 0.886 0.886 0.886 0.898
    "kubelet" 0.012 0.098 0.098 0.098 0.147
    ...
    [FAILED] CPU usage exceeding limits:
    container "runtime": expected 95th% usage < 0.600; got 0.886
    In [It] at: test/e2e_node/resource_usage_test.go:288 @ 12/11/23 13:58:43.337
  • n1-standard-2-ubuntu-gke-2204-1-25-v20231206-d39e8fd2
    Dec 11 14:58:22.650: INFO: CPU usage of containers:
    container 50th% 90th% 95th% 99th% 100th%
    "/" N/A N/A N/A N/A N/A
    "runtime" 0.003 0.757 0.757 0.757 0.779
    "kubelet" 0.013 0.118 0.118 0.118 0.159
    ...
    [FAILED] CPU usage exceeding limits:
    container "runtime": expected 95th% usage < 0.600; got 0.757

What's interesting is that the node that succeeded is configured similarly to the failed ones, but its runtime metrics are much better:

  • n1-standard-2-cos-stable-109-17800-66-32-e1854a72
    Dec 11 13:56:40.371: INFO: CPU usage of containers:
    container 50th% 90th% 95th% 99th% 100th%
    "/" N/A N/A N/A N/A N/A
    "runtime" 0.123 0.198 0.218 0.218 0.234
    "kubelet" 0.048 0.067 0.086 0.086 0.125

The only difference I can see is that one configuration requests 2 nvidia-tesla-k80 accelerators. I'm not sure whether that's related to the density test failures, though.
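
For context on the [FAILED] lines above: the resource-usage assertion (the log points at test/e2e_node/resource_usage_test.go) compares per-container CPU usage percentiles collected during the run against fixed limits, e.g. the 95th percentile of the "runtime" container must stay below 0.600 cores. Here is a rough, self-contained sketch of that kind of check; the types and most limit values are illustrative assumptions, only the 0.60 runtime-at-95th figure comes from the log.

```go
// Rough sketch of a percentile CPU-limit check like the one failing above.
// The types and most limit values are illustrative assumptions.
package main

import "fmt"

// containerUsage maps a percentile (e.g. 0.95) to CPU usage in cores.
type containerUsage map[float64]float64

// verifyCPULimits returns a violation message for every container whose
// observed usage at some percentile exceeds its configured limit.
func verifyCPULimits(limits, actual map[string]containerUsage) []string {
	var violations []string
	for container, perPercentile := range limits {
		usage, ok := actual[container]
		if !ok {
			continue
		}
		for pct, limit := range perPercentile {
			if got := usage[pct]; got > limit {
				violations = append(violations, fmt.Sprintf(
					"container %q: expected %.0fth%% usage < %.3f; got %.3f",
					container, pct*100, limit, got))
			}
		}
	}
	return violations
}

func main() {
	// Limits: only the runtime 95th-percentile value (0.60) is taken from the
	// log; the kubelet value is a made-up placeholder.
	limits := map[string]containerUsage{
		"runtime": {0.95: 0.60},
		"kubelet": {0.95: 0.35},
	}
	// Usage numbers from the first failed node in the comment above.
	actual := map[string]containerUsage{
		"runtime": {0.50: 0.072, 0.95: 0.886},
		"kubelet": {0.50: 0.012, 0.95: 0.098},
	}
	for _, v := range verifyCPULimits(limits, actual) {
		fmt.Println("[FAILED] CPU usage exceeding limits:", v)
	}
}
```

Run against the numbers from the first failed node, this reproduces the reported violation for the runtime container only.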

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 11, 2024
@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 14, 2024
@k8s-ci-robot k8s-ci-robot changed the title node-kubelet-serial-containerd job multiple flakes [Flaking Test] [sig-node] ☂️ node-kubelet-serial-containerd job multiple flakes 🌂 Mar 14, 2024
@pacoxu
Copy link
Member

pacoxu commented Mar 14, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 14, 2024