[flaking test][sig-node] NodeProblemDetector should run without error #121973

Closed
pacoxu opened this issue Nov 21, 2023 · 13 comments
Labels
area/test
kind/flake: Categorizes issue or PR as related to a flaky test.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@pacoxu
Member

pacoxu commented Nov 21, 2023

Failure cluster f0cef9ada6202025601f

https://storage.googleapis.com/k8s-triage/index.html?test=NodeProblemDetector%20should%20run%20without%20error

Error text:
[FAILED] an error on the server ("Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = NotFound desc = failed to convert to cri containerd stats format: failed to decode container metrics for \"34e999f718280d188dc98de552ec36df97d2ee7649424427cb02dad2be6bd459\": failed to obtain cpu stats: failed to get usage nano cores, containerID: 34e999f718280d188dc98de552ec36df97d2ee7649424427cb02dad2be6bd459: failed to get container ID: 34e999f718280d188dc98de552ec36df97d2ee7649424427cb02dad2be6bd459: not found") has prevented the request from succeeding (get nodes bootstrap-e2e-minion-group-2qzf:10250)
In [It] at: test/e2e/node/node_problem_detector.go:379 @ 11/12/23 04:28:06.191

 Nov 19 22:21:12.806: INFO: Unexpected error: 
      <*errors.StatusError | 0xc00358c280>: 
      an error on the server ("Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = NotFound desc = failed to convert to cri containerd stats format: failed to decode container metrics for \"bb5ad9e5466f818f2a6e3b3d61e3841af1e65ea2a58abe9e0e68f26395d04acc\": failed to obtain cpu stats: failed to get usage nano cores, containerID: bb5ad9e5466f818f2a6e3b3d61e3841af1e65ea2a58abe9e0e68f26395d04acc: failed to get container ID: bb5ad9e5466f818f2a6e3b3d61e3841af1e65ea2a58abe9e0e68f26395d04acc: not found") has prevented the request from succeeding (get nodes bootstrap-e2e-minion-group-4fz9:10250)
      {
          ErrStatus: 
              code: 500
              details:
                causes:
                - message: 'Internal Error: failed to list pod stats: failed to list all container
                    stats: rpc error: code = NotFound desc = failed to convert to cri containerd
                    stats format: failed to decode container metrics for "bb5ad9e5466f818f2a6e3b3d61e3841af1e65ea2a58abe9e0e68f26395d04acc":
                    failed to obtain cpu stats: failed to get usage nano cores, containerID: bb5ad9e5466f818f2a6e3b3d61e3841af1e65ea2a58abe9e0e68f26395d04acc:
                    failed to get container ID: bb5ad9e5466f818f2a6e3b3d61e3841af1e65ea2a58abe9e0e68f26395d04acc:
                    not found'
                  reason: UnexpectedServerResponse
                kind: nodes
                name: bootstrap-e2e-minion-group-4fz9:10250
              message: 'an error on the server ("Internal Error: failed to list pod stats: failed
                to list all container stats: rpc error: code = NotFound desc = failed to convert
                to cri containerd stats format: failed to decode container metrics for \"bb5ad9e5466f818f2a6e3b3d61e3841af1e65ea2a58abe9e0e68f26395d04acc\":
                failed to obtain cpu stats: failed to get usage nano cores, containerID: bb5ad9e5466f818f2a6e3b3d61e3841af1e65ea2a58abe9e0e68f26395d04acc:
                failed to get container ID: bb5ad9e5466f818f2a6e3b3d61e3841af1e65ea2a58abe9e0e68f26395d04acc:
                not found") has prevented the request from succeeding (get nodes bootstrap-e2e-minion-group-4fz9:10250)'
              metadata: {}
              reason: InternalError
              status: Failure,
      }
  [FAILED] in [It] - test/e2e/node/node_problem_detector.go:379 @ 11/19/23 22:21:12.806
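
For context, the failing call is the e2e suite's node stats gathering, which asks the API server to proxy a request to the kubelet's /stats/summary endpoint on port 10250 (the `get nodes <node>:10250` in the error above). A minimal client-go sketch of that request shape, with placeholder names, not the exact code at node_problem_detector.go:379:

```go
// Sketch only: roughly how the stats request that fails above is shaped.
// clientset and nodeName are assumed to be set up elsewhere.
package main

import (
	"context"
	"fmt"

	"k8s.io/client-go/kubernetes"
)

func getStatsSummary(ctx context.Context, c kubernetes.Interface, nodeName string) ([]byte, error) {
	// The API server proxies this to the kubelet on port 10250; the
	// "failed to list pod stats" error is surfaced through this call.
	return c.CoreV1().RESTClient().Get().
		Resource("nodes").
		SubResource("proxy").
		Name(fmt.Sprintf("%s:%d", nodeName, 10250)).
		Suffix("stats/summary").
		Do(ctx).
		Raw()
}
```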

Recent failures:

2023/11/20 18:55:47 ci-kubernetes-e2e-gci-gce-kube-dns
2023/11/20 18:41:47 ci-kubernetes-e2e-prow-canary
2023/11/20 16:55:47 ci-kubernetes-e2e-gci-gce-proto
2023/11/20 13:52:47 ci-kubernetes-e2e-gci-gce-network-proxy-grpc
2023/11/20 13:10:47 ci-kubernetes-e2e-gci-gce-ipvs

/kind flake
/sig node

This flaked on the master-blocking board recently:

@k8s-ci-robot added the kind/flake, sig/node, and needs-triage labels on Nov 21, 2023
@pacoxu
Member Author

pacoxu commented Nov 21, 2023

"failed to list pod stats: failed to list all container stats: rpc error: code = Unavailable desc = error reading from server: EOF" request="/stats/summary"

I found some similar logs in #115192.

/cc @SergeyKanzhelev @harche
for more attention from SIG Node, as this flake may indicate a bug in the kubelet.

@Vyom-Yadav
Member

/cc @kubernetes/sig-node-test-failures

@pacoxu
Member Author

pacoxu commented Nov 21, 2023

According to @Vyom-Yadav's investigation, this test is quite old, but it only started flaking recently:
https://storage.googleapis.com/k8s-triage/index.html?date=2023-10-30&test=NodeProblemDetector%20should%20run%20without%20error#1adbf9900df472bd8059

It started to flake on 10-24, the day we bumped the NPD version in #121382.

I think we should revert it until we find the root cause.

@SergeyKanzhelev
Member

/area test

@aojea
Member

aojea commented Nov 29, 2023

/area test

@SergeyKanzhelev this node-problem-detector test needs to be tagged so it is not picked up by default; jobs that want to exercise these NPD tests should explicitly opt in to it

@aojea
Member

aojea commented Nov 29, 2023

It started to flake on 10-24, the day we bumped the NPD version in #121382.

The problem here is that, since NPD runs as a DaemonSet, it counts toward scheduling; sometimes the instances are so small that the pods cannot be scheduled and the e2e test times out.

kubernetes/test-infra#31312
kubernetes/test-infra#31315

NPD should not run by default, and the tests should be opt-in
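
A rough sketch of what opt-in could look like, assuming the usual Ginkgo feature-tag convention; the label name here is hypothetical and this is not the actual change:

```go
// Sketch only: tagging the spec with a [Feature:...] label keeps it out of the
// default e2e runs (which skip [Feature:.+]); jobs that want the NPD tests
// would opt in with --ginkgo.focus=NodeProblemDetector.
package node

import (
	"github.com/onsi/ginkgo/v2"
)

var _ = ginkgo.Describe("[Feature:NodeProblemDetector] NodeProblemDetector", func() {
	ginkgo.It("should run without error", func() {
		// existing test body unchanged
	})
})
```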

@mmiranda96
Contributor

Perhaps we could run the tests serially? If running all tests in parallel causes all the memory to be consumed, that might help (a minimal sketch follows below).

I know we have a bunch of new features between NPD versions (0.8.9 to 0.8.13/14), but I don't recall any of those particularly affecting memory usage.

/triage accepted
/priority important-soon
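
For reference, a minimal Ginkgo v2 sketch of the serial option mentioned above; this is not an actual change, and the real suite may use its own framework wrapper for serial specs:

```go
// Sketch only: ginkgo.Serial marks the container so its specs never run in
// parallel with the rest of the suite, so the NPD DaemonSet would not compete
// with other tests for node memory.
package node

import (
	"github.com/onsi/ginkgo/v2"
)

var _ = ginkgo.Describe("NodeProblemDetector", ginkgo.Serial, func() {
	ginkgo.It("should run without error", func() {
		// existing test body unchanged
	})
})
```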

@k8s-ci-robot added the triage/accepted and priority/important-soon labels and removed the needs-triage label on Dec 6, 2023
@mmiranda96 moved this from Triage to Issues - To do in SIG Node CI/Test Board on Dec 6, 2023
@pacoxu
Member Author

pacoxu commented Feb 18, 2024

@pacoxu
Member Author

pacoxu commented Mar 1, 2024

This will be fixed by #123114

@pacoxu
Member Author

pacoxu commented Mar 5, 2024

This only flakes in https://testgrid.k8s.io/sig-release-1.29-blocking#gce-cos-k8sbeta-default.

I need to keep watching after we upgrade to v0.8.16. #123114

@pacoxu
Member Author

pacoxu commented Apr 22, 2024

It flaked in https://testgrid.k8s.io/sig-release-1.29-blocking#gce-cos-k8sstable1-default, but the CI is blocked by #124438.

STEP: Gather node-problem-detector cpu and memory stats - k8s.io/kubernetes/test/e2e/node/node_problem_detector.go:191 @ 04/13/24 07:29:30.996
Apr 13 07:32:57.833: INFO: Unexpected error: 
    <*errors.StatusError | 0xc002b3a820>: 
    an error on the server ("Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = NotFound desc = failed to convert to cri containerd stats format: failed to decode container metrics for \"1a615e6a6281966ce4100f5686f0984d2d6fdd49e9a2a9a6bb46696db8558217\": failed to obtain cpu stats: failed to get usage nano cores, containerID: 1a615e6a6281966ce4100f5686f0984d2d6fdd49e9a2a9a6bb46696db8558217: failed to get container ID: 1a615e6a6281966ce4100f5686f0984d2d6fdd49e9a2a9a6bb46696db8558217: not found") has prevented the request from succeeding (get nodes test-359c739436-minion-group-wp50:10250)
    {
        ErrStatus: 
            code: 500
            details:
              causes:
              - message: 'Internal Error: failed to list pod stats: failed to list all container
                  stats: rpc error: code = NotFound desc = failed to convert to cri containerd
                  stats format: failed to decode container metrics for "1a615e6a6281966ce4100f5686f0984d2d6fdd49e9a2a9a6bb46696db8558217":
                  failed to obtain cpu stats: failed to get usage nano cores, containerID: 1a615e6a6281966ce4100f5686f0984d2d6fdd49e9a2a9a6bb46696db8558217:
                  failed to get container ID: 1a615e6a6281966ce4100f5686f0984d2d6fdd49e9a2a9a6bb46696db8558217:
                  not found'
                reason: UnexpectedServerResponse
              kind: nodes
              name: test-359c739436-minion-group-wp50:10250
            message: 'an error on the server ("Internal Error: failed to list pod stats: failed
              to list all container stats: rpc error: code = NotFound desc = failed to convert
              to cri containerd stats format: failed to decode container metrics for \"1a615e6a6281966ce4100f5686f0984d2d6fdd49e9a2a9a6bb46696db8558217\":
              failed to obtain cpu stats: failed to get usage nano cores, containerID: 1a615e6a6281966ce4100f5686f0984d2d6fdd49e9a2a9a6bb46696db8558217:
              failed to get container ID: 1a615e6a6281966ce4100f5686f0984d2d6fdd49e9a2a9a6bb46696db8558217:
              not found") has prevented the request from succeeding (get nodes test-359c739436-minion-group-wp50:10250)'
            metadata: {}
            reason: InternalError
            status: Failure,
    }
[FAILED] an error on the server ("Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = NotFound desc = failed to convert to cri containerd stats format: failed to decode container metrics for \"1a615e6a6281966ce4100f5686f0984d2d6fdd49e9a2a9a6bb46696db8558217\": failed to obtain cpu stats: failed to get usage nano cores, containerID: 1a615e6a6281966ce4100f5686f0984d2d6fdd49e9a2a9a6bb46696db8558217: failed to get container ID: 1a615e6a6281966ce4100f5686f0984d2d6fdd49e9a2a9a6bb46696db8558217: not found") has prevented the request from succeeding (get nodes test-359c739436-minion-group-wp50:10250)

It still uses registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.13:

                                                    application/vnd.docker.distribution.manifest.list.v2+json sha256:d65a5c35dc7e948ff6d0bbe15b5e272c55790acbc9db4963ad91dc9939b4e293 55.3 MiB  linux/amd64,linux/arm64                                                      io.cri-containerd.image=managed 

@pacoxu
Member Author

pacoxu commented Apr 22, 2024

https://github.com/kubernetes/test-infra/blob/181ac5037445e499549098c014897a4674c211fd/config/jobs/kubernetes/generated/generated.yaml#L297-L310

  • ci-kubernetes-e2e-gce-cos-k8sstable1-default is using latest-1.29.

/close

So this should be fixed with npd v0.8.16.

@k8s-ci-robot
Contributor

@pacoxu: Closing this issue.

In response to this:

https://github.com/kubernetes/test-infra/blob/181ac5037445e499549098c014897a4674c211fd/config/jobs/kubernetes/generated/generated.yaml#L297-L310

  • ci-kubernetes-e2e-gce-cos-k8sstable1-default is using latest-1.29.

/close

So this should be fixed with npd v0.8.16.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

SIG Node CI/Test Board automation moved this from Issues - To do to Done Apr 22, 2024