Metrics server fails to pull metrics from Windows workers #91575

brianharwell · 2020-05-29T12:39:32Z

/sig windows

What happened:
I have a Kubernetes cluster with 2 Windows worker nodes. When I run kubectl top nodes the Windows nodes report as unknown. I did some investigating and I am seeing errors in the logs.

What you expected to happen:
When I do kubectl top nodes or kubectl top pods the Windows workers are included.

How to reproduce it (as minimally and precisely as possible):
No idea. One Windows node was reporting metrics yesterday, but today neither is reporting metrics.

Anything else we need to know?:

Environment:

Kubernetes version (use kubectl version): 1.17.5
Cloud provider or hardware configuration:
Rancher managed cluster of VMs on vSphere. 2 CentOS 7 managers, 2 CentOS 7 workers, 2 Windows Server 2019 1809 workers
OS (e.g: cat /etc/os-release): CentOS Linux 7 (Core)
Kernel (e.g. uname -a): 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Install tools: Rancher
Network plugin and version (if this is a network-related bug): Flannel in L2Bridge mode
Others:

Logs from metrics server

E0529 12:04:50.809303       1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem 7d7489daf4b889d4b55d7889a617017768035b7c1c43e8cef5ac0210e7b2ac65: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist."]
E0529 12:05:50.838175       1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem 7d7489daf4b889d4b55d7889a617017768035b7c1c43e8cef5ac0210e7b2ac65: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist."]
E0529 12:06:50.815777       1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem 7d7489daf4b889d4b55d7889a617017768035b7c1c43e8cef5ac0210e7b2ac65: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist."]
E0529 12:07:50.800927       1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem 7d7489daf4b889d4b55d7889a617017768035b7c1c43e8cef5ac0210e7b2ac65: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist."]
E0529 12:08:50.821804       1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem 7d7489daf4b889d4b55d7889a617017768035b7c1c43e8cef5ac0210e7b2ac65: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist."]
E0529 12:09:12.819567       1 reststorage.go:135] unable to fetch node metrics for node "qa-k8sw-win-02": no metrics known for node
E0529 12:09:12.819592       1 reststorage.go:135] unable to fetch node metrics for node "qa-k8sw-win-01": no metrics known for node
E0529 12:09:50.809012       1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem 7d7489daf4b889d4b55d7889a617017768035b7c1c43e8cef5ac0210e7b2ac65: A virtual machine or container with the specified identifier does not exist."]
E0529 12:10:53.085842       1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = container 29f403ebb265389ac1bcbe39f8a555045e1e461a2abf065f11d2f8b267f83b12 encountered an error during Properties: failure in a Windows system call: A system shutdown is in progress. (0x45b)"]
E0529 12:12:00.147458       1 reststorage.go:135] unable to fetch node metrics for node "qa-k8sw-win-02": no metrics known for node
E0529 12:12:00.147485       1 reststorage.go:135] unable to fetch node metrics for node "qa-k8sw-win-01": no metrics known for node
E0529 12:12:44.741135       1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): Get https://10.4.111.68:10250/stats/summary?only_cpu_and_memory=true: context deadline exceeded]
E0529 12:13:44.740851       1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): Get https://10.4.111.68:10250/stats/summary?only_cpu_and_memory=true: context deadline exceeded]
E0529 12:14:44.740965       1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): Get https://10.4.111.68:10250/stats/summary?only_cpu_and_memory=true: context deadline exceeded]
E0529 12:15:44.740936       1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): Get https://10.4.111.68:10250/stats/summary?only_cpu_and_memory=true: context deadline exceeded]
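
The log lines above repeat the same few failures for both Windows nodes. As a rough aid for reading them, here is a minimal sketch that counts failures per node and reason. It assumes the log format shown in this issue; the reason labels (`container-not-found`, `node-shutting-down`, `timeout`) and the `summarize` helper are illustrative names, not metrics-server terminology.

```python
from collections import Counter

# Marker that metrics-server prints before each per-node failure
# (copied from the log lines in this issue).
SOURCE_MARKER = "unable to fully scrape metrics from source kubelet_summary:"

# Substrings taken from the log lines above, mapped to short labels.
# The labels are hypothetical names chosen for this sketch.
REASONS = [
    ("specified identifier does not exist", "container-not-found"),
    ("A system shutdown is in progress", "node-shutting-down"),
    ("context deadline exceeded", "timeout"),
]


def summarize(log_text):
    """Count (node, reason) pairs across all scrape failures in log_text."""
    counts = Counter()
    for line in log_text.splitlines():
        # The first chunk is the log prefix; every later chunk is one node.
        for chunk in line.split(SOURCE_MARKER)[1:]:
            node = chunk.split(":", 1)[0].strip()
            reason = next(
                (label for needle, label in REASONS if needle in chunk),
                "other",
            )
            counts[(node, reason)] += 1
    return counts


def print_summary(log_text):
    """Print one tab-separated row per (node, reason) pair."""
    for (node, reason), n in sorted(summarize(log_text).items()):
        print(f"{node}\t{reason}\t{n}")
```

Calling `print_summary` on the saved metrics-server log gives one row per node and failure type, which makes it easier to see that qa-k8sw-win-01 fails the same way throughout while qa-k8sw-win-02 changes from a missing container to a shutdown error and then to timeouts.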

I saw an error trying to fetch the metrics, so from the node running the metrics server pod (qa-k8sm-02) I ran curl -v -k https://10.4.111.68:10250/stats/summary?only_cpu_and_memory=true

curl -v -k https://10.4.111.68:10250/stats/summary?only_cpu_and_memory=true
* About to connect() to 10.4.111.68 port 10250 (#0)
*   Trying 10.4.111.68...
* Connected to 10.4.111.68 (10.4.111.68) port 10250 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* NSS: client certificate not found (nickname not specified)
* SSL connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate:
*       subject: CN=qa-k8sw-win-02@1589888891
*       start date: May 19 10:48:10 2020 GMT
*       expire date: May 19 10:48:10 2021 GMT
*       common name: qa-k8sw-win-02@1589888891
*       issuer: CN=qa-k8sw-win-02-ca@1589888890
> GET /stats/summary?only_cpu_and_memory=true HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.4.111.68:10250
> Accept: */*
>
< HTTP/1.1 401 Unauthorized
< Date: Fri, 29 May 2020 12:30:40 GMT
< Content-Length: 12
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host 10.4.111.68 left intact
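
Note that the 401 above most likely just means the request carried no credentials: the kubelet rejects unauthenticated requests on port 10250 when anonymous access is disabled, so this curl call confirms reachability but does not reproduce the 500. A minimal sketch of the same request with a bearer token follows. It assumes a token whose account is allowed to read node stats; `TOKEN_PATH` is the usual in-pod service account location, and `build_summary_request` and `fetch_summary` are hypothetical helper names for this example.

```python
import ssl
import urllib.error
import urllib.request

# Usual in-pod location of a service account token. Outside a pod, pass a
# token obtained some other way (assumption: it may read node stats).
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def build_summary_request(node_ip, token, port=10250):
    """Build (but do not send) a GET for the kubelet stats summary endpoint."""
    url = f"https://{node_ip}:{port}/stats/summary?only_cpu_and_memory=true"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )


def fetch_summary(node_ip, token, timeout=10):
    """Send the request and return (status_code, body_text).

    Certificate verification is skipped, like `curl -k`, which is only
    reasonable for ad hoc debugging.
    """
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    req = build_summary_request(node_ip, token)
    try:
        with urllib.request.urlopen(req, timeout=timeout, context=ctx) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # 401, 403 and 500 responses arrive here; keep the body for inspection.
        return err.code, err.read().decode("utf-8", "replace")
```

With a valid token, `fetch_summary("10.4.111.68", token)` should return the same status and error body that metrics-server logs, which separates an authentication problem from the kubelet-side failure.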

The connection to the server worked, so I looked closer at the error. It was a 500, so I thought there was an issue with the pod on the Windows server and viewed the logs of cattle-node-agent-windows-gkrf9

WARN: Default docker named pipe is not found
WARN: Please bind mount in the docker named pipe to //./pipe/docker_engine if docker errors occur
WARN: example: docker run -v //./pipe/custom_docker_named_pipe://./pipe/docker_engine ...
INFO: https://rancher.mycompany.com is accessible
time="2020-05-29T07:12:09-05:00" level=info msg="Rancher agent version v2.4.3 is starting"
time="2020-05-29T07:12:09-05:00" level=info msg="Listening on /tmp/log.sock"
time="2020-05-29T07:12:09-05:00" level=info msg="Option etcd=false"
time="2020-05-29T07:12:09-05:00" level=info msg="Option controlPlane=false"
time="2020-05-29T07:12:09-05:00" level=info msg="Option worker=true"
time="2020-05-29T07:12:09-05:00" level=info msg="Option requestedHostname=qa-k8sw-win-02"
time="2020-05-29T07:12:09-05:00" level=info msg="Option customConfig=map[address:10.4.111.68 internalAddress: label:map[rke.cattle.io/windows-build:17763 rke.cattle.io/windows-kernel-version:17763.1.amd64fre.rs5_release.180914-1434 rke.cattle.io/windows-major-version:10 rke.cattle.io/windows-minor-version:0 rke.cattle.io/windows-release-id:1809 rke.cattle.io/windows-version:10.0.17763.1098] roles:[worker] taints:[]]"
time="2020-05-29T07:12:09-05:00" level=info msg="Connecting to wss://rancher.mycompany.com/v3/connect with token qt6xtcslz7gwrjfw5tszj8r5tjrjpqk5kwnfqm85l7wc64tkjwfcqj"
time="2020-05-29T07:12:09-05:00" level=info msg="Connecting to proxy" url="wss://rancher.mycompany.com/v3/connect"
time="2020-05-29T07:12:09-05:00" level=info msg="Starting plan monitor, checking every 120 seconds"

I didn't see any errors there, so I logged into the Windows server and started looking at container logs. When I viewed the logs for the kubelet I saw a bunch of errors...

E0529 07:18:32.644370   10452 eviction_manager.go:246] eviction manager: failed to get summary stats: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.
I0529 07:18:37.672584   10452 setters.go:73] Using node IP: "10.4.111.189"
E0529 07:18:42.692479   10452 remote_runtime.go:495] ListContainerStats with filter &ContainerStatsFilter{Id:,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.
E0529 07:18:42.692479   10452 eviction_manager.go:246] eviction manager: failed to get summary stats: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.
I0529 07:18:47.678996   10452 setters.go:73] Using node IP: "10.4.111.189"
E0529 07:18:50.811858   10452 remote_runtime.go:495] ListContainerStats with filter &ContainerStatsFilter{Id:,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.
E0529 07:18:50.811858   10452 handler.go:321] HTTP InternalServerError serving /stats/summary: Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.
E0529 07:18:52.740196   10452 remote_runtime.go:495] ListContainerStats with filter &ContainerStatsFilter{Id:,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.
E0529 07:18:52.740196   10452 eviction_manager.go:246] eviction manager: failed to get summary stats: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.

So the underlying cause appears to be "A virtual machine or container with the specified identifier does not exist."
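
To check which container IDs the runtime can no longer find, the kubelet log can be scanned for the IDs that follow `hcsshim::OpenComputeSystem`. This is a minimal sketch, assuming the error format shown in this issue and 64-character hex IDs; `missing_container_ids` is a hypothetical helper name. It removes line breaks first so that IDs split by terminal wrapping are still matched.

```python
import re
from collections import Counter

# hcsshim error prefix followed by a 64-character hex container ID,
# as it appears in the kubelet log lines in this issue.
ID_PATTERN = re.compile(r"hcsshim::OpenComputeSystem\s*([0-9a-f]{64})")


def missing_container_ids(log_text):
    """Count the container IDs that hcsshim reported as not existing."""
    # Drop hard line breaks so that tokens split by a wrapped terminal
    # (for example "hcsshim::O" / "penComputeSystem") are rejoined.
    unwrapped = log_text.replace("\r", "").replace("\n", "")
    return Counter(ID_PATTERN.findall(unwrapped))
```

If the result contains a single ID repeated many times, as in the excerpt above, that ID can be compared against the output of `docker ps -a --no-trunc` on the Windows node to see whether the runtime still lists a container that hcsshim no longer knows about.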


brianharwell · 2020-05-29T17:51:32Z

I apologize. I did not know there was a separate repo for metrics server. I'm going to post there.

AlexEyler · 2020-06-04T18:57:33Z

I'm seeing this issue as well and created an issue at kubernetes-sigs/metrics-server#539, since I couldn't find one created by Brian. Just wanted to share that in case anyone else is searching for this issue.

marosset · 2020-06-26T21:14:25Z

Fixed by #90554

brianharwell added the kind/bug Categorizes issue or PR as related to a bug. label May 29, 2020

k8s-ci-robot added the sig/windows Categorizes an issue or PR as relevant to SIG Windows. label May 29, 2020

brianharwell closed this as completed May 29, 2020

marosset added this to In Progress (v1.19) in SIG-Windows Jun 5, 2020

marosset moved this from In Progress (v1.19) to Done (v1.19) in SIG-Windows Jun 26, 2020

feinoujc mentioned this issue Jul 16, 2020

[EKS] [windows]: 1.17 windows node ami kubelet has issues reporting stats aws/containers-roadmap#988

Closed

brianpursley mentioned this issue May 14, 2024

"kubectl top nodes" reports "unknown" when executing multiple concurrent “kubectl exec” requests against a pod running on a Windows node #124700

Open
