Metrics server fails to collect metrics for Windows pods/nodes #539

Closed
AlexEyler opened this issue Jun 2, 2020 · 6 comments


@AlexEyler

What happened:
Running kubectl top node results in this error for our Windows agents:
Error from server (NotFound): nodemetrics.metrics.k8s.io "7813k8s004" not found

What you expected to happen:
I'd expect to see some stats for the Windows agents

Anything else we need to know?:
Metrics for Linux pods/nodes work fine.

Environment:

  • Kubernetes distribution (GKE, EKS, Kubeadm, the hard way, etc.): aks-engine v0.50.0
  • Container Network Setup (flannel, calico, etc.): Azure CNI
  • Kubernetes version (use kubectl version): 1.18.2 (both client/server)
  • Metrics Server manifest:
Name:                 metrics-server-7f6c5b76b9-ghx5k
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 k8s-master-78132855-4/10.255.255.9
Start Time:           Mon, 11 May 2020 13:56:06 -0700
Labels:               k8s-app=metrics-server
                      pod-template-hash=7f6c5b76b9
Annotations:          kubernetes.io/psp: privileged
Status:               Running
IP:                   10.240.1.41
IPs:
  IP:           10.240.1.41
Controlled By:  ReplicaSet/metrics-server-7f6c5b76b9
Containers:
  metrics-server:
    Container ID:  docker://4fc7e8c86c8c781eab3d95d70561156c3d4a05915fe755075c25954a0c69306d
    Image:         mcr.microsoft.com/oss/kubernetes/metrics-server:v0.3.5
    Image ID:      docker-pullable://k8s.gcr.io/metrics-server-amd64@sha256:edab4c64c4e29f665adaf6c0e11fe00ac9d71bb1fdce61bfa3d9e2a664331f79
    Port:          <none>
    Host Port:     <none>
    Command:
      /metrics-server
      --kubelet-insecure-tls
    State:          Running
      Started:      Mon, 11 May 2020 13:56:10 -0700
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from metrics-server-token-pvjnb (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  metrics-server-token-pvjnb:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  metrics-server-token-pvjnb
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>
  • Kubelet config:
  • Metrics Server logs:
E0521 05:12:14.428027       1 manager.go:111] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:7813k8s004: unable to fetch metrics from Kubelet 7813k8s004 (7813k8s004): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem adc0bbe1bc6fb9d6535a096bdab23b2cf796d45c62633432d09e9dfaad43f515: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:7813k8s003: unable to fetch metrics from Kubelet 7813k8s003 (7813k8s003): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem ee1531ecad2e41a2c1f17299d622e6c81ddd6fdc7894c905c63a7567b377f897: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:7813k8s002: unable to fetch metrics from Kubelet 7813k8s002 (7813k8s002): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem 9ffc63a41104fa934d0706718f86f0d0b2329d988523b55136afe4dfb3406c7f: A virtual machine or container with the specified identifier does not exist.", unable to fully scrape metrics from source kubelet_summary:7813k8s001: unable to fetch metrics from Kubelet 7813k8s001 (7813k8s001): request failed - "500 Internal Server Error", response: "Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem 1e6d3e48fed736289f9a87703e00910c785ed6ecc3c04af00c9e7eb6aef7d972: A virtual machine or container with the specified identifier does not exist."]

/kind bug

@serathius
Contributor

serathius commented Jun 2, 2020

Metrics Server depends on scraping the Kubelet to get metrics. In the logs you can see the response from the Kubelet: 500 Internal Server Error

content:

Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem adc0bbe1bc6fb9d6535a096bdab23b2cf796d45c62633432d09e9dfaad43f515: A virtual machine or container with the specified identifier does not exist.

The problem is not on the Metrics Server side but in the Kubelet.
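
To see at a glance which nodes the Kubelet is failing on, the combined error line from the Metrics Server log can be split per node. The following is a minimal sketch, assuming the log layout shown in this issue; `failing_scrapes` and the pattern name are hypothetical helpers, not part of Metrics Server.

```python
import re

# Assumes each failure follows the layout seen in this issue's log:
#   kubelet_summary:<node>: ... request failed - "<status>" ...
#   hcsshim::OpenComputeSystem <container id>: ...
# Entries without an hcsshim container ID are not handled by this sketch.
SCRAPE_ERROR = re.compile(
    r'kubelet_summary:(?P<node>[^:\s]+):.*?'
    r'request failed - "(?P<status>[^"]+)".*?'
    r'hcsshim::OpenComputeSystem (?P<container>[0-9a-f]+):'
)


def failing_scrapes(log_line):
    """Return one (node, status, container_id) tuple per failed scrape."""
    return [
        (m["node"], m["status"], m["container"])
        for m in SCRAPE_ERROR.finditer(log_line)
    ]


# Excerpt of the log line from this issue (first failure only).
sample = (
    'unable to fully scrape metrics from source kubelet_summary:7813k8s004: '
    'unable to fetch metrics from Kubelet 7813k8s004 (7813k8s004): '
    'request failed - "500 Internal Server Error", response: "Internal Error: '
    'failed to list pod stats: failed to list all container stats: rpc error: '
    'code = Unknown desc = hcsshim::OpenComputeSystem '
    'adc0bbe1bc6fb9d6535a096bdab23b2cf796d45c62633432d09e9dfaad43f515: '
    'A virtual machine or container with the specified identifier does not exist."'
)

for node, status, container in failing_scrapes(sample):
    print(node, status, container[:12])
```

Every entry in the log above carries the same 500 status and an `hcsshim::OpenComputeSystem` lookup failure, which is consistent with the error originating in the Kubelet rather than in Metrics Server.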

@marosset

marosset commented Jun 2, 2020

I think this is fixed by kubernetes/kubernetes#90554, which has an outstanding backport request to 1.18.
@AlexEyler would you be able to see if this is resolved in k/k master?

@AlexEyler
Author

@marosset I'll look into this. Can I just deploy a 1.19 kubelet.exe/kube-proxy.exe onto the Windows VM? I haven't tried this before.

@marosset

marosset commented Jun 5, 2020

@alexeldeib Sorry, I don't fully understand your question.
You'd want to go through all of the steps to join Windows nodes to an existing K8s cluster.
Alternatively, you could start with a node that is already joined to a cluster, stop the kubelet service, and overwrite kubelet.exe with one from https://dl.k8s.io/v1.19.1-beta.1/kubernetes-node-windows-amd64.tar.gz
The specifics on how to do that depend on how the nodes were configured/deployed.
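
After swapping the kubelet binary, one way to check the node independently of Metrics Server is to request the Kubelet Summary API through the API server proxy (the `kubelet_summary` source named in the log). This is a rough sketch, assuming `kubectl` is on the PATH and configured for the cluster; `summary_path` and `kubelet_summary` are hypothetical helper names.

```python
import json
import subprocess


def summary_path(node_name):
    """Build the API server proxy path to a node's Kubelet Summary API."""
    return f"/api/v1/nodes/{node_name}/proxy/stats/summary"


def kubelet_summary(node_name):
    """Fetch and decode the node's summary via `kubectl get --raw`.

    Raises subprocess.CalledProcessError when kubectl exits non-zero,
    which is the expected outcome while the Kubelet still answers with
    an error for this endpoint.
    """
    result = subprocess.run(
        ["kubectl", "get", "--raw", summary_path(node_name)],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)
```

Calling `kubelet_summary("7813k8s004")` (a node name from this issue) should return a dictionary with `node` and `pods` entries once the Kubelet serves stats again; if the call still raises, `kubectl top node` will most likely keep reporting NotFound for that node.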

@serathius
Contributor

Closing, as the problem was identified as a Kubelet issue.

@milanankush

Hey @AlexEyler and @marosset, I want to add an HPA for my Windows pods, so I need Metrics Server configured to collect metrics from my Windows node. What documentation or steps can I follow to do so?
