Metrics server fails to pull metrics from Windows workers #91575

Closed
brianharwell opened this issue May 29, 2020 · 3 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/windows Categorizes an issue or PR as relevant to SIG Windows.

Comments

@brianharwell

/sig windows

What happened:
I have a Kubernetes cluster with 2 Windows worker nodes. When I run kubectl top nodes, the Windows nodes report as unknown. I did some investigating and found errors in the logs.

What you expected to happen:
When I run kubectl top nodes or kubectl top pods, the Windows workers are included.

How to reproduce it (as minimally and precisely as possible):
No idea. One Windows node was reporting metrics yesterday, but today neither node is reporting metrics.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.17.5
  • Cloud provider or hardware configuration:
    Rancher managed cluster of VMs on vSphere. 2 CentOS 7 managers, 2 CentOS 7 workers, 2 Windows Server 2019 1809 workers
  • OS (e.g: cat /etc/os-release): CentOS Linux 7 (Core)
  • Kernel (e.g. uname -a): 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4 23:02:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: Rancher
  • Network plugin and version (if this is a network-related bug): Flannel in L2Bridge mode
  • Others:

Logs from the metrics server:

E0529 12:04:50.809303       1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem 7d7489daf4b889d4b55d7889a617017768035b7c1c43e8cef5ac0210e7b2ac65: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist."]
E0529 12:05:50.838175       1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem 7d7489daf4b889d4b55d7889a617017768035b7c1c43e8cef5ac0210e7b2ac65: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist."]
E0529 12:06:50.815777       1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem 7d7489daf4b889d4b55d7889a617017768035b7c1c43e8cef5ac0210e7b2ac65: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist."]
E0529 12:07:50.800927       1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem 7d7489daf4b889d4b55d7889a617017768035b7c1c43e8cef5ac0210e7b2ac65: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist."]
E0529 12:08:50.821804       1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem 7d7489daf4b889d4b55d7889a617017768035b7c1c43e8cef5ac0210e7b2ac65: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist."]
E0529 12:09:12.819567       1 reststorage.go:135] unable to fetch node metrics for node "qa-k8sw-win-02": no metrics known for node
E0529 12:09:12.819592       1 reststorage.go:135] unable to fetch node metrics for node "qa-k8sw-win-01": no metrics known for node
E0529 12:09:50.809012       1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem 7d7489daf4b889d4b55d7889a617017768035b7c1c43e8cef5ac0210e7b2ac65: A virtual machine or container with the specified identifier does not exist."]
E0529 12:10:53.085842       1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = container 29f403ebb265389ac1bcbe39f8a555045e1e461a2abf065f11d2f8b267f83b12 encountered an error during Properties: failure in a Windows system call: A system shutdown is in progress. (0x45b)"]
E0529 12:12:00.147458       1 reststorage.go:135] unable to fetch node metrics for node "qa-k8sw-win-02": no metrics known for node
E0529 12:12:00.147485       1 reststorage.go:135] unable to fetch node metrics for node "qa-k8sw-win-01": no metrics known for node
E0529 12:12:44.741135       1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): Get https://10.4.111.68:10250/stats/summary?only_cpu_and_memory=true: context deadline exceeded]
E0529 12:13:44.740851       1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): Get https://10.4.111.68:10250/stats/summary?only_cpu_and_memory=true: context deadline exceeded]
E0529 12:14:44.740965       1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): Get https://10.4.111.68:10250/stats/summary?only_cpu_and_memory=true: context deadline exceeded]
E0529 12:15:44.740936       1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): Get https://10.4.111.68:10250/stats/summary?only_cpu_and_memory=true: context deadline exceeded]
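
Each of these metrics-server lines bundles one failure per node, and the reasons differ (missing container, system shutdown, timeout). A minimal sketch of splitting a line into per-node failures is below. The function names, the reason labels, and the regular expressions are my own assumptions based on the log text above, not part of metrics-server.

```python
import re

# Matches the start of one per-node failure inside a metrics-server log line.
# The pattern is inferred from the log text in this issue (an assumption).
NODE_RE = re.compile(r"unable to fetch metrics from Kubelet (?P<node>\S+) \((?P<ip>[\d.]+)\)")
# Container IDs in these logs are 64 lowercase hex characters.
CONTAINER_RE = re.compile(r"\b[0-9a-f]{64}\b")


def classify(detail):
    """Map the error text for one node to a short, made-up reason label."""
    if "specified identifier does not exist" in detail:
        return "missing-container"
    if "system shutdown is in progress" in detail:
        return "node-shutting-down"
    if "context deadline exceeded" in detail:
        return "timeout"
    return "other"


def parse_failures(line):
    """Return one dict per node failure found in a single log line."""
    matches = list(NODE_RE.finditer(line))
    failures = []
    for i, m in enumerate(matches):
        # The detail for this node runs until the next node's entry starts.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(line)
        detail = line[m.end():end]
        cid = CONTAINER_RE.search(detail)
        failures.append({
            "node": m.group("node"),
            "ip": m.group("ip"),
            "reason": classify(detail),
            "container_id": cid.group(0) if cid else None,
        })
    return failures


# The 12:10:53 line from the logs above, split across string literals.
SAMPLE = (
    'E0529 12:10:53.085842       1 manager.go:111] unable to fully collect metrics: '
    '[unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-01: '
    'unable to fetch metrics from Kubelet qa-k8sw-win-01 (10.4.111.189): request failed - '
    '"500 Internal Server Error", response: "Internal Error: failed to list pod stats: '
    'failed to list all container stats: rpc error: code = Unknown desc = '
    'hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: '
    'A virtual machine or container with the specified identifier does not exist.", '
    'unable to fully scrape metrics from source kubelet_summary:qa-k8sw-win-02: '
    'unable to fetch metrics from Kubelet qa-k8sw-win-02 (10.4.111.68): request failed - '
    '"500 Internal Server Error", response: "Internal Error: failed to list pod stats: '
    'failed to list all container stats: rpc error: code = Unknown desc = '
    'container 29f403ebb265389ac1bcbe39f8a555045e1e461a2abf065f11d2f8b267f83b12 encountered an '
    'error during Properties: failure in a Windows system call: '
    'A system shutdown is in progress. (0x45b)"]'
)

if __name__ == "__main__":
    for failure in parse_failures(SAMPLE):
        print(failure)
```

Grouping the output by container_id makes it easy to see whether a single stale container ID accounts for every failed scrape on a node, which is what the logs above suggest for qa-k8sw-win-01.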

I saw an error trying to fetch the metrics, so from the node running the metrics server pod (qa-k8sm-02) I ran curl -v -k https://10.4.111.68:10250/stats/summary?only_cpu_and_memory=true

curl -v -k https://10.4.111.68:10250/stats/summary?only_cpu_and_memory=true
* About to connect() to 10.4.111.68 port 10250 (#0)
*   Trying 10.4.111.68...
* Connected to 10.4.111.68 (10.4.111.68) port 10250 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* NSS: client certificate not found (nickname not specified)
* SSL connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate:
*       subject: CN=qa-k8sw-win-02@1589888891
*       start date: May 19 10:48:10 2020 GMT
*       expire date: May 19 10:48:10 2021 GMT
*       common name: qa-k8sw-win-02@1589888891
*       issuer: CN=qa-k8sw-win-02-ca@1589888890
> GET /stats/summary?only_cpu_and_memory=true HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.4.111.68:10250
> Accept: */*
>
< HTTP/1.1 401 Unauthorized
< Date: Fri, 29 May 2020 12:30:40 GMT
< Content-Length: 12
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host 10.4.111.68 left intact
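
The 401 Unauthorized above most likely means only that the request carried no credentials, so it confirms the kubelet is reachable but does not reproduce the 500 that the metrics server sees. A minimal sketch of repeating the request with a bearer token follows. It assumes the kubelet accepts bearer tokens and that the token belongs to an identity allowed to read node stats; both are assumptions about this cluster, and the function names are hypothetical.

```python
import ssl
import urllib.error
import urllib.request


def build_summary_request(node_ip, token, port=10250):
    """Build the same GET that curl sent, plus an Authorization header."""
    url = f"https://{node_ip}:{port}/stats/summary?only_cpu_and_memory=true"
    req = urllib.request.Request(url)
    # Assumption: the kubelet is configured to accept bearer tokens.
    req.add_header("Authorization", f"Bearer {token}")
    return req


def fetch_summary(node_ip, token, timeout=10.0):
    """Return (status_code, body) for the kubelet summary endpoint."""
    # Equivalent of curl -k: skip certificate verification (diagnostics only).
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    req = build_summary_request(node_ip, token)
    try:
        with urllib.request.urlopen(req, context=ctx, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # A 4xx/5xx response raises HTTPError; its body carries the kubelet's error text.
        return err.code, err.read().decode("utf-8", errors="replace")
```

Calling fetch_summary("10.4.111.68", token) with a suitable token should return the kubelet's own status code and error body, so a 500 here would match what the metrics server logs show.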

The connection to the server worked, so I looked closer at the error in the metrics server logs. It was a 500, so I thought there was an issue with the pod on the Windows server, and I viewed the logs of cattle-node-agent-windows-gkrf9:

WARN: Default docker named pipe is not found
WARN: Please bind mount in the docker named pipe to //./pipe/docker_engine if docker errors occur
WARN: example: docker run -v //./pipe/custom_docker_named_pipe://./pipe/docker_engine ...
INFO: https://rancher.mycompany.com is accessible
time="2020-05-29T07:12:09-05:00" level=info msg="Rancher agent version v2.4.3 is starting"
time="2020-05-29T07:12:09-05:00" level=info msg="Listening on /tmp/log.sock"
time="2020-05-29T07:12:09-05:00" level=info msg="Option etcd=false"
time="2020-05-29T07:12:09-05:00" level=info msg="Option controlPlane=false"
time="2020-05-29T07:12:09-05:00" level=info msg="Option worker=true"
time="2020-05-29T07:12:09-05:00" level=info msg="Option requestedHostname=qa-k8sw-win-02"
time="2020-05-29T07:12:09-05:00" level=info msg="Option customConfig=map[address:10.4.111.68 internalAddress: label:map[rke.cattle.io/windows-build:17763 rke.cattle.io/windows-kernel-version:17763.1.amd64fre.rs5_release.180914-1434 rke.cattle.io/windows-major-version:10 rke.cattle.io/windows-minor-version:0 rke.cattle.io/windows-release-id:1809 rke.cattle.io/windows-version:10.0.17763.1098] roles:[worker] taints:[]]"
time="2020-05-29T07:12:09-05:00" level=info msg="Connecting to wss://rancher.mycompany.com/v3/connect with token qt6xtcslz7gwrjfw5tszj8r5tjrjpqk5kwnfqm85l7wc64tkjwfcqj"
time="2020-05-29T07:12:09-05:00" level=info msg="Connecting to proxy" url="wss://rancher.mycompany.com/v3/connect"
time="2020-05-29T07:12:09-05:00" level=info msg="Starting plan monitor, checking every 120 seconds"

I didn't see any errors there, so I logged into the Windows server and started looking at container logs. When I viewed the logs for the kubelet, I saw a bunch of errors:

E0529 07:18:32.644370   10452 eviction_manager.go:246] eviction manager: failed to get summary stats: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.
I0529 07:18:37.672584   10452 setters.go:73] Using node IP: "10.4.111.189"
E0529 07:18:42.692479   10452 remote_runtime.go:495] ListContainerStats with filter &ContainerStatsFilter{Id:,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.
E0529 07:18:42.692479   10452 eviction_manager.go:246] eviction manager: failed to get summary stats: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.
I0529 07:18:47.678996   10452 setters.go:73] Using node IP: "10.4.111.189"
E0529 07:18:50.811858   10452 remote_runtime.go:495] ListContainerStats with filter &ContainerStatsFilter{Id:,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.
E0529 07:18:50.811858   10452 handler.go:321] HTTP InternalServerError serving /stats/summary: Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.
E0529 07:18:52.740196   10452 remote_runtime.go:495] ListContainerStats with filter &ContainerStatsFilter{Id:,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.
E0529 07:18:52.740196   10452 eviction_manager.go:246] eviction manager: failed to get summary stats: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem cf9d0d43c099624bbbb13cb4a607ca3a818a66aad1631897d47e1c66827782ac: A virtual machine or container with the specified identifier does not exist.

So the underlying cause appears to be the hcsshim error "A virtual machine or container with the specified identifier does not exist."

@brianharwell brianharwell added the kind/bug Categorizes issue or PR as related to a bug. label May 29, 2020
@k8s-ci-robot k8s-ci-robot added the sig/windows Categorizes an issue or PR as relevant to SIG Windows. label May 29, 2020
@brianharwell
Author

I apologize. I did not know there was a separate repo for metrics server. I'm going to post there.

@AlexEyler

I'm seeing this issue as well. I created an issue at kubernetes-sigs/metrics-server#539 (comment) since I couldn't find one created by Brian. Just wanted to share that in case anyone else is searching for this issue.

@marosset marosset added this to In Progress (v1.19) in SIG-Windows Jun 5, 2020
@marosset
Contributor

Fixed by #90554
