[flaking test][sig-node] NodeProblemDetector should run without error #121973

Closed
pacoxu opened this issue Nov 21, 2023 · 13 comments
Labels
area/test
kind/flake: Categorizes issue or PR as related to a flaky test.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@pacoxu
Member

pacoxu commented Nov 21, 2023

Failure cluster f0cef9ada6202025601f

https://storage.googleapis.com/k8s-triage/index.html?test=NodeProblemDetector%20should%20run%20without%20error

Error text:
[FAILED] an error on the server ("Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = NotFound desc = failed to convert to cri containerd stats format: failed to decode container metrics for \"34e999f718280d188dc98de552ec36df97d2ee7649424427cb02dad2be6bd459\": failed to obtain cpu stats: failed to get usage nano cores, containerID: 34e999f718280d188dc98de552ec36df97d2ee7649424427cb02dad2be6bd459: failed to get container ID: 34e999f718280d188dc98de552ec36df97d2ee7649424427cb02dad2be6bd459: not found") has prevented the request from succeeding (get nodes bootstrap-e2e-minion-group-2qzf:10250)
In [It] at: test/e2e/node/node_problem_detector.go:379 @ 11/12/23 04:28:06.191

 Nov 19 22:21:12.806: INFO: Unexpected error: 
      <*errors.StatusError | 0xc00358c280>: 
      an error on the server ("Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = NotFound desc = failed to convert to cri containerd stats format: failed to decode container metrics for \"bb5ad9e5466f818f2a6e3b3d61e3841af1e65ea2a58abe9e0e68f26395d04acc\": failed to obtain cpu stats: failed to get usage nano cores, containerID: bb5ad9e5466f818f2a6e3b3d61e3841af1e65ea2a58abe9e0e68f26395d04acc: failed to get container ID: bb5ad9e5466f818f2a6e3b3d61e3841af1e65ea2a58abe9e0e68f26395d04acc: not found") has prevented the request from succeeding (get nodes bootstrap-e2e-minion-group-4fz9:10250)
      {
          ErrStatus: 
              code: 500
              details:
                causes:
                - message: 'Internal Error: failed to list pod stats: failed to list all container
                    stats: rpc error: code = NotFound desc = failed to convert to cri containerd
                    stats format: failed to decode container metrics for "bb5ad9e5466f818f2a6e3b3d61e3841af1e65ea2a58abe9e0e68f26395d04acc":
                    failed to obtain cpu stats: failed to get usage nano cores, containerID: bb5ad9e5466f818f2a6e3b3d61e3841af1e65ea2a58abe9e0e68f26395d04acc:
                    failed to get container ID: bb5ad9e5466f818f2a6e3b3d61e3841af1e65ea2a58abe9e0e68f26395d04acc:
                    not found'
                  reason: UnexpectedServerResponse
                kind: nodes
                name: bootstrap-e2e-minion-group-4fz9:10250
              message: 'an error on the server ("Internal Error: failed to list pod stats: failed
                to list all container stats: rpc error: code = NotFound desc = failed to convert
                to cri containerd stats format: failed to decode container metrics for \"bb5ad9e5466f818f2a6e3b3d61e3841af1e65ea2a58abe9e0e68f26395d04acc\":
                failed to obtain cpu stats: failed to get usage nano cores, containerID: bb5ad9e5466f818f2a6e3b3d61e3841af1e65ea2a58abe9e0e68f26395d04acc:
                failed to get container ID: bb5ad9e5466f818f2a6e3b3d61e3841af1e65ea2a58abe9e0e68f26395d04acc:
                not found") has prevented the request from succeeding (get nodes bootstrap-e2e-minion-group-4fz9:10250)'
              metadata: {}
              reason: InternalError
              status: Failure,
      }
  [FAILED] in [It] - test/e2e/node/node_problem_detector.go:379 @ 11/19/23 22:21:12.806
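
For context, the failing call is the e2e suite's node stats gathering, which asks the API server to proxy a request to the kubelet's /stats/summary endpoint on port 10250 (the `get nodes <node>:10250` in the error above). A minimal client-go sketch of that request shape, with placeholder names, not the exact code at node_problem_detector.go:379:

```go
// Sketch only: roughly how the stats request that fails above is shaped.
// clientset and nodeName are assumed to be set up elsewhere.
package main

import (
	"context"
	"fmt"

	"k8s.io/client-go/kubernetes"
)

func getStatsSummary(ctx context.Context, c kubernetes.Interface, nodeName string) ([]byte, error) {
	// The API server proxies this to the kubelet on port 10250; the
	// "failed to list pod stats" error is surfaced through this call.
	return c.CoreV1().RESTClient().Get().
		Resource("nodes").
		SubResource("proxy").
		Name(fmt.Sprintf("%s:%d", nodeName, 10250)).
		Suffix("stats/summary").
		Do(ctx).
		Raw()
}
```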

Recent failures:

2023/11/20 18:55:47 ci-kubernetes-e2e-gci-gce-kube-dns
2023/11/20 18:41:47 ci-kubernetes-e2e-prow-canary
2023/11/20 16:55:47 ci-kubernetes-e2e-gci-gce-proto
2023/11/20 13:52:47 ci-kubernetes-e2e-gci-gce-network-proxy-grpc
2023/11/20 13:10:47 ci-kubernetes-e2e-gci-gce-ipvs

/kind flake
/sig node

This flaked on the master-blocking board recently:

@k8s-ci-robot added the kind/flake, sig/node, and needs-triage labels on Nov 21, 2023
@pacoxu
Member Author

pacoxu commented Nov 21, 2023

"failed to list pod stats: failed to list all container stats: rpc error: code = Unavailable desc = error reading from server: EOF" request="/stats/summary"

I found some similar logs in #115192.

/cc @SergeyKanzhelev @harche
for more attention from SIG Node, as this flake may indicate a bug in the kubelet.

@Vyom-Yadav
Member

/cc @kubernetes/sig-node-test-failures

@pacoxu
Member Author

pacoxu commented Nov 21, 2023

According to @Vyom-Yadav's investigation, this test is quite old, but it only started flaking recently:
https://storage.googleapis.com/k8s-triage/index.html?date=2023-10-30&test=NodeProblemDetector%20should%20run%20without%20error#1adbf9900df472bd8059

It started to flake on 10-24, the day we bumped the NPD version in #121382.

I think we should revert it until we find the root cause.

@SergeyKanzhelev
Member

/area test

@aojea
Member

aojea commented Nov 29, 2023

/area test

@SergeyKanzhelev this node-problem-detector test needs to be tagged so it is not picked up by default; jobs that want to exercise these NPD tests should explicitly opt in to it

@aojea
Member

aojea commented Nov 29, 2023

It started to flake on 10-24, the day we bumped the NPD version in #121382.

The problem here is that, since NPD runs as a DaemonSet, it counts toward scheduling; sometimes the instances are so small that the pods cannot be scheduled and the e2e test times out.

kubernetes/test-infra#31312
kubernetes/test-infra#31315

NPD should not run by default, and the tests should be opt-in
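
A rough sketch of what opt-in could look like, assuming the usual Ginkgo feature-tag convention; the label name here is hypothetical and this is not the actual change:

```go
// Sketch only: tagging the spec with a [Feature:...] label keeps it out of the
// default e2e runs (which skip [Feature:.+]); jobs that want the NPD tests
// would opt in with --ginkgo.focus=NodeProblemDetector.
package node

import (
	"github.com/onsi/ginkgo/v2"
)

var _ = ginkgo.Describe("[Feature:NodeProblemDetector] NodeProblemDetector", func() {
	ginkgo.It("should run without error", func() {
		// existing test body unchanged
	})
})
```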

@mmiranda96
Contributor

Perhaps we could run the tests serially? If running all tests in parallel causes all the memory to be consumed, that might help (a minimal sketch follows below).

I know we have a bunch of new features between NPD versions (0.8.9 to 0.8.13/14), but I don't recall any of those particularly affecting memory usage.

/triage accepted
/priority important-soon
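
For reference, a minimal Ginkgo v2 sketch of the serial option mentioned above; this is not an actual change, and the real suite may use its own framework wrapper for serial specs:

```go
// Sketch only: ginkgo.Serial marks the container so its specs never run in
// parallel with the rest of the suite, so the NPD DaemonSet would not compete
// with other tests for node memory.
package node

import (
	"github.com/onsi/ginkgo/v2"
)

var _ = ginkgo.Describe("NodeProblemDetector", ginkgo.Serial, func() {
	ginkgo.It("should run without error", func() {
		// existing test body unchanged
	})
})
```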

@k8s-ci-robot added the triage/accepted and priority/important-soon labels and removed the needs-triage label on Dec 6, 2023
@mmiranda96 moved this from Triage to Issues - To do in SIG Node CI/Test Board on Dec 6, 2023
@pacoxu
Member Author

pacoxu commented Feb 18, 2024

@pacoxu
Member Author

pacoxu commented Mar 1, 2024

This will be fixed by #123114

@pacoxu
Member Author

pacoxu commented Mar 5, 2024

This only flakes in https://testgrid.k8s.io/sig-release-1.29-blocking#gce-cos-k8sbeta-default.

I need to keep watching after we upgrade to v0.8.16. #123114

@pacoxu
Member Author

pacoxu commented Apr 22, 2024

It flaked in https://testgrid.k8s.io/sig-release-1.29-blocking#gce-cos-k8sstable1-default, but the CI is blocked by #124438.

STEP: Gather node-problem-detector cpu and memory stats - k8s.io/kubernetes/test/e2e/node/node_problem_detector.go:191 @ 04/13/24 07:29:30.996
Apr 13 07:32:57.833: INFO: Unexpected error: 
    <*errors.StatusError | 0xc002b3a820>: 
    an error on the server ("Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = NotFound desc = failed to convert to cri containerd stats format: failed to decode container metrics for \"1a615e6a6281966ce4100f5686f0984d2d6fdd49e9a2a9a6bb46696db8558217\": failed to obtain cpu stats: failed to get usage nano cores, containerID: 1a615e6a6281966ce4100f5686f0984d2d6fdd49e9a2a9a6bb46696db8558217: failed to get container ID: 1a615e6a6281966ce4100f5686f0984d2d6fdd49e9a2a9a6bb46696db8558217: not found") has prevented the request from succeeding (get nodes test-359c739436-minion-group-wp50:10250)
    {
        ErrStatus: 
            code: 500
            details:
              causes:
              - message: 'Internal Error: failed to list pod stats: failed to list all container
                  stats: rpc error: code = NotFound desc = failed to convert to cri containerd
                  stats format: failed to decode container metrics for "1a615e6a6281966ce4100f5686f0984d2d6fdd49e9a2a9a6bb46696db8558217":
                  failed to obtain cpu stats: failed to get usage nano cores, containerID: 1a615e6a6281966ce4100f5686f0984d2d6fdd49e9a2a9a6bb46696db8558217:
                  failed to get container ID: 1a615e6a6281966ce4100f5686f0984d2d6fdd49e9a2a9a6bb46696db8558217:
                  not found'
                reason: UnexpectedServerResponse
              kind: nodes
              name: test-359c739436-minion-group-wp50:10250
            message: 'an error on the server ("Internal Error: failed to list pod stats: failed
              to list all container stats: rpc error: code = NotFound desc = failed to convert
              to cri containerd stats format: failed to decode container metrics for \"1a615e6a6281966ce4100f5686f0984d2d6fdd49e9a2a9a6bb46696db8558217\":
              failed to obtain cpu stats: failed to get usage nano cores, containerID: 1a615e6a6281966ce4100f5686f0984d2d6fdd49e9a2a9a6bb46696db8558217:
              failed to get container ID: 1a615e6a6281966ce4100f5686f0984d2d6fdd49e9a2a9a6bb46696db8558217:
              not found") has prevented the request from succeeding (get nodes test-359c739436-minion-group-wp50:10250)'
            metadata: {}
            reason: InternalError
            status: Failure,
    }
[FAILED] an error on the server ("Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = NotFound desc = failed to convert to cri containerd stats format: failed to decode container metrics for \"1a615e6a6281966ce4100f5686f0984d2d6fdd49e9a2a9a6bb46696db8558217\": failed to obtain cpu stats: failed to get usage nano cores, containerID: 1a615e6a6281966ce4100f5686f0984d2d6fdd49e9a2a9a6bb46696db8558217: failed to get container ID: 1a615e6a6281966ce4100f5686f0984d2d6fdd49e9a2a9a6bb46696db8558217: not found") has prevented the request from succeeding (get nodes test-359c739436-minion-group-wp50:10250)

It still uses registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.13:

                                                    application/vnd.docker.distribution.manifest.list.v2+json sha256:d65a5c35dc7e948ff6d0bbe15b5e272c55790acbc9db4963ad91dc9939b4e293 55.3 MiB  linux/amd64,linux/arm64                                                      io.cri-containerd.image=managed 

@pacoxu
Member Author

pacoxu commented Apr 22, 2024

https://github.com/kubernetes/test-infra/blob/181ac5037445e499549098c014897a4674c211fd/config/jobs/kubernetes/generated/generated.yaml#L297-L310

  • ci-kubernetes-e2e-gce-cos-k8sstable1-default is using latest-1.29.

/close

So this should be fixed with npd v0.8.16.

@k8s-ci-robot
Contributor

@pacoxu: Closing this issue.

In response to this:

https://github.com/kubernetes/test-infra/blob/181ac5037445e499549098c014897a4674c211fd/config/jobs/kubernetes/generated/generated.yaml#L297-L310

  • ci-kubernetes-e2e-gce-cos-k8sstable1-default is using latest-1.29.

/close

So this should be fixed with npd v0.8.16.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

SIG Node CI/Test Board automation moved this from Issues - To do to Done Apr 22, 2024