
[EKS] unable to fetch metrics from Kubelet #129

Closed
sc-rz opened this issue Sep 1, 2018 · 21 comments

@sc-rz

sc-rz commented Sep 1, 2018

Hi,

I am testing the recently released HPA on Amazon's EKS, but I am running into an issue where metrics-server fails to reach the node.

(actual IP redacted)

$ kubectl logs -l app=metrics-server -n kube-system
...
E0901 04:09:10.815694       1 manager.go:102] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:ip-aa-bb-cc-dd.ec2.internal: unable to fetch metrics from Kubelet ip-aa-bb-cc-dd.ec2.internal (ip-aa-bb-cc-dd.ec2.internal): Get https://ip-aa-bb-cc-dd.ec2.internal:10250/stats/summary/: dial tcp: lookup ip-aa-bb-cc-dd.ec2.internal on 10.100.0.10:53: no such host, unable to fully scrape metrics from source 
$ kubectl get nodes
NAME                             STATUS    ROLES     AGE       VERSION
ip-aa-bb-cc-dd.ec2.internal   Ready     <none>    1h        v1.10.3
$ kubectl describe node 
...
Addresses:
  InternalIP:  aa.bb.cc.dd
  Hostname:    ip-aa-bb-cc-dd.ec2.internal

I am using v0.3 after running kubectl apply -f metrics-server/deploy/1.8+/ on commit 931ef84

Do I need to configure something?

Thanks

@sc-rz
Author

sc-rz commented Sep 1, 2018

Nevermind, this was an issue with my VPC DNS resolution

@sc-rz sc-rz closed this as completed Sep 1, 2018
@dijeesh

dijeesh commented Sep 1, 2018

Same here,

I manually set the image to metrics-server-amd64:v0.3.0 in metrics-server-deployment.yaml and deployed it.

But,

kubectl logs metrics-server-754478c688-j5ckq -n kube-system
I0901 03:49:30.403514       1 serving.go:273] Generated self-signed cert (apiserver.local.config/certificates/apiserver.crt, apiserver.local.config/certificates/apiserver.key)
W0901 03:49:30.723508       1 authentication.go:166] cluster doesn't provide client-ca-file in configmap/extension-apiserver-authentication in kube-system, so client certificate authentication to extension api-server won't work.
W0901 03:49:30.732733       1 authentication.go:210] cluster doesn't provide client-ca-file in configmap/extension-apiserver-authentication in kube-system, so client certificate authentication to extension api-server won't work.
[restful] 2018/09/01 03:49:30 log.go:33: [restful/swagger] listing is available at https://:443/swaggerapi
[restful] 2018/09/01 03:49:30 log.go:33: [restful/swagger] https://:443/swaggerui/ is mapped to folder /swagger-ui/
I0901 03:49:30.778391       1 serve.go:96] Serving securely on [::]:443

And HPA is still showing

Warning FailedGetResourceMetric 4m (x191 over 1h) horizontal-pod-autoscaler unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)

@sc-rz
Author

sc-rz commented Sep 1, 2018

I am also still unable to get HPA working. I ran kubectl describe apiservice v1beta1.metrics.k8s.io and am having the same errors as in #45

@sc-rz
Author

sc-rz commented Sep 2, 2018

Figured out my issue -- my worker node security group was misconfigured. I had to add an inbound rule to allow HTTPS (port 443) traffic from the control plane security group.
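For reference, that inbound rule can be expressed with the AWS CLI roughly as follows. This is a sketch: `<worker-sg>` and `<control-plane-sg>` are placeholders for your actual security group IDs, and the control plane needs to reach the worker nodes on 443 because that is the port the aggregated metrics API serves on.

```shell
# Allow the EKS control plane security group to reach the worker nodes on 443.
# <worker-sg> and <control-plane-sg> are placeholders; substitute your IDs.
aws ec2 authorize-security-group-ingress \
  --group-id <worker-sg> \
  --protocol tcp \
  --port 443 \
  --source-group <control-plane-sg>
```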

@dijeesh

dijeesh commented Sep 4, 2018

I just added an inbound rule for port 443 from the control plane security group, and it looks like it's working now. Thanks @sc-rz

@LucasSales

The solution proposed by @MIBc works. Change the metrics-server-deployment.yaml file and add:

command:
- /metrics-server
- --kubelet-preferred-address-types=InternalIP
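For context, here is a sketch of where that snippet lands inside metrics-server-deployment.yaml; the container name and image tag below are illustrative, not prescriptive:

```yaml
# Fragment of metrics-server-deployment.yaml (names/versions are examples).
containers:
- name: metrics-server
  image: k8s.gcr.io/metrics-server-amd64:v0.3.0
  command:
  - /metrics-server
  # Scrape kubelets by node InternalIP instead of the (unresolvable) hostname.
  - --kubelet-preferred-address-types=InternalIP
```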

@zhangzhaorui

Nevermind, this was an issue with my VPC DNS resolution

Hi! My metrics-server pod has the same error:

E1026 07:37:04.007899 1 reststorage.go:144] unable to fetch pod metrics for pod dev-java/csg-application-68584c6b66-c65k9: no metrics known for pod
E1026 07:37:34.022311 1 reststorage.go:144] unable to fetch pod metrics for pod dev-java/csg-application-68584c6b66-c65k9: no metrics known for pod
E1026 07:37:38.242410 1 manager.go:102] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:idc-k8snode-javaphp-001: unable to fetch metrics from Kubelet idc-k8snode-javaphp-001 (idc-k8snode-javaphp-001): Get https://idc-k8snode-javaphp-001:10250/stats/summary/: dial tcp: lookup idc-k8snode-javaphp-001 on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:idc-k8smaster-javaphp-001: unable to fetch metrics from Kubelet idc-k8smaster-javaphp-001 (idc-k8smaster-javaphp-001): Get https://idc-k8smaster-javaphp-001:10250/stats/summary/: dial tcp: lookup idc-k8smaster-javaphp-001 on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:idc-k8snode-javaphp-002: unable to fetch metrics from Kubelet idc-k8snode-javaphp-002 (idc-k8snode-javaphp-002): Get https://idc-k8snode-javaphp-002:10250/stats/summary/: dial tcp: lookup idc-k8snode-javaphp-002 on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:idc-k8snode-javaphp-003: unable to fetch metrics from Kubelet idc-k8snode-javaphp-003 (idc-k8snode-javaphp-003): Get https://idc-k8snode-javaphp-003:10250/stats/summary/: dial tcp: lookup idc-k8snode-javaphp-003 on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:idc-k8smaster-javaphp-002: unable to fetch metrics from Kubelet idc-k8smaster-javaphp-002 (idc-k8smaster-javaphp-002): Get https://idc-k8smaster-javaphp-002:10250/stats/summary/: dial tcp: lookup idc-k8smaster-javaphp-002 on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:idc-k8snode-javaphp-004: unable to fetch metrics from Kubelet idc-k8snode-javaphp-004 (idc-k8snode-javaphp-004): Get https://idc-k8snode-javaphp-004:10250/stats/summary/: dial tcp: lookup idc-k8snode-javaphp-004 on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:idc-k8smaster-javaphp-003: 
unable to fetch metrics from Kubelet idc-k8smaster-javaphp-003 (idc-k8smaster-javaphp-003): Get https://idc-k8smaster-javaphp-003:10250/stats/summary/: dial tcp: lookup idc-k8smaster-javaphp-003 on 10.96.0.10:53: no such host]

How did you solve it?!

@GeekyTex

GeekyTex commented Oct 26, 2018

Thanks @LucasSales, this ended up fixing the issue for me as well. It looks like port 443 has since been added to the needed SGs, but I was still getting the following error in my metrics-server:

E1026 14:41:58.325491 1 manager.go:102] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:ip-10-0-166-28.ec2.internal: unable to fetch metrics from Kubelet ip-10-0-166-28.ec2.internal (ip-10-0-166-28.ec2.internal): Get https://ip-10-0-166-28.ec2.internal:10250/stats/summary/: dial tcp: lookup ip-10-0-166-28.ec2.internal on 172.20.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:ip-10-0-135-135.ec2.internal: unable to fetch metrics from Kubelet ip-10-0-135-135.ec2.internal (ip-10-0-135-135.ec2.internal): Get https://ip-10-0-135-135.ec2.internal:10250/stats/summary/: dial tcp: lookup ip-10-0-135-135.ec2.internal on 172.20.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:ip-10-0-146-30.ec2.internal: unable to fetch metrics from Kubelet ip-10-0-146-30.ec2.internal (ip-10-0-146-30.ec2.internal): Get https://ip-10-0-146-30.ec2.internal:10250/stats/summary/: dial tcp: lookup ip-10-0-146-30.ec2.internal on 172.20.0.10:53: no such host]

Adding the command above works. Not sure if the root issue is related to CNI or something else. Would be curious to know if anyone else hits this.

FWIW, my cluster was manually set up (still in early POC phase) and was built per the current AWS Getting Started docs.

@kiahmed

kiahmed commented Nov 9, 2018

Stuck with this issue for over a week. Tried all of the above, including @LucasSales's approach, but that gives a certificate error saying the cert was not created for that host IP, and the hosts in my cluster change. Port 443 is open, though, so I'm not sure why everybody is focusing on that.

@DirectXMan12
Contributor

@kiahmed basically, you need to tell metrics-server to connect to your nodes using a name or address that it can actually look up. So, by saying InternalIP, you're telling metrics-server to not use hostnames, but instead use the internal IP address of the node. However, if your serving certificates on the Kubelet aren't valid for that IP, you'll get a certificate error.

@kiahmed

kiahmed commented Nov 13, 2018

--kubelet-insecure-tls did the job, which is okay for now for a dev cluster. But even in prod, the API would be accessed through the main Kubernetes apiserver anyway, and that has its own CA and validation, so does it really matter?

@DirectXMan12
Contributor

metrics-server doesn't talk to the nodes via the main API server -- it talks to them directly. Using --kubelet-insecure-tls means that someone could MITM the metrics-server <-> kubelet connection, unless you're using some sort of service mesh or what-have-you that provides its own auth.
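If skipping TLS verification is a concern, a sketch of the safer configuration looks like the following. This assumes your kubelet serving certificates are signed by a CA whose bundle you can mount into the metrics-server pod; the mount path is an example, and flag availability depends on your metrics-server version:

```yaml
command:
- /metrics-server
- --kubelet-preferred-address-types=InternalIP
# Verify kubelet serving certs against a mounted CA bundle instead of
# disabling TLS verification with --kubelet-insecure-tls.
- --kubelet-certificate-authority=/etc/metrics-server/ca.crt
```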

@cdmurph32

Nevermind, this was an issue with my VPC DNS resolution

I think I hit this issue as well, and it wasn't clear to me how VPC settings could break metrics server, besides NACLs.
So just in case other people are broken because of their VPC configuration (not because of NACLs):

  1. The value of http://169.254.169.254/latest/meta-data/local-hostname is set from the VPC DHCP settings. https://docs.aws.amazon.com/vpc/latest/userguide/VPC_DHCP_Options.html
  2. Kubernetes pods get their hostname from this ec2 instance metadata. This sets the node label kubernetes.io/hostname
    https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go#L1244
  3. Metrics server by default uses this label as the hostname for the node (makes sense).
    https://github.com/kubernetes-incubator/metrics-server/blob/master/pkg/sources/summary/addrs.go#L23-L40
  4. If your DHCP settings are wrong (e.g., you override the defaults unintentionally through copy-paste errors in CloudFormation templates, or your custom domain isn't resolvable from within Kubernetes), metrics server won't be able to get anything.
    unable to fully scrape metrics from source kubelet_summary:ip-10-68-234-200.us-west-2.compute.internal: unable to fetch metrics from Kubelet ip-10-68-234-200.us-west-2.compute.internal (ip-10-68-234-200.ec2.internal): Get https://ip-10-68-234-200.ec2.internal:10250/stats/summary/: dial tcp: lookup ip-10-68-234-200.ec2.internal on 172.20.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:ip-10-68-234-239.us-west-2.compute.internal: unable to fetch metrics from Kubelet ip-10-68-234-239.us-west-2.compute.internal (ip-10-68-234-239.ec2.internal): Get https://ip-10-68-234-239.ec2.internal:10250/stats/summary/: dial tcp: lookup ip-10-68-234-239.ec2.internal on 172.20.0.10:53: no such host
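A quick way to reproduce just the lookup step from inside the metrics-server pod (or a debug pod) is to check resolution with getent, which is roughly what the "no such host" errors above correspond to. The hostname below is a deliberately unresolvable placeholder, since ".invalid" is reserved and never resolves:

```shell
# Check whether a node hostname resolves; metrics-server fails the same way
# when it cannot resolve the node name taken from kubernetes.io/hostname.
check_resolves() {
  if getent hosts "$1" >/dev/null; then
    echo "resolvable"
  else
    echo "no such host"
  fi
}

# ".invalid" is a reserved TLD (RFC 2606), so this always fails to resolve.
check_resolves "ip-10-68-234-200.ec2.internal.invalid"   # prints: no such host
```

Running the same check against a real node hostname tells you immediately whether the problem is DNS (fix the VPC DHCP options) or something past DNS (security groups, certificates).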

@jitesh-prajapati123

jitesh-prajapati123 commented Dec 14, 2018

I am getting the following error.

E1214 06:23:17.408800 1 manager.go:102] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:ip-10-0-3-12.ec2.internal: unable to fetch metrics from Kubelet ip-10-0-3-12.ec2.internal (ip-10-0-3-12.ec2.internal): Get https://ip-10-0-3-12.ec2.internal:10250/stats/summary/: dial tcp: i/o timeout, unable to fully scrape metrics from source kubelet_summary:ip-10-0-1-54.ec2.internal: unable to fetch metrics from Kubelet ip-10-0-1-54.ec2.internal (ip-10-0-1-54.ec2.internal): Get https://ip-10-0-1-54.ec2.internal:10250/stats/summary/: dial tcp: i/o timeout]

When I curl https://ip-10-0-3-12.ec2.internal:10250/stats/summary/ it gives me the following.

SSL certificate problem: unable to get local issuer certificate
curl: (60) SSL certificate problem: unable to get local issuer certificate

@jitesh-prajapati123

Thanks @LucasSales, this ended up fixing the issue for me as well. It looks like port 443 has since been added to the needed SGs, but I was still getting the following error in my metrics-server:

E1026 14:41:58.325491 1 manager.go:102] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:ip-10-0-166-28.ec2.internal: unable to fetch metrics from Kubelet ip-10-0-166-28.ec2.internal (ip-10-0-166-28.ec2.internal): Get https://ip-10-0-166-28.ec2.internal:10250/stats/summary/: dial tcp: lookup ip-10-0-166-28.ec2.internal on 172.20.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:ip-10-0-135-135.ec2.internal: unable to fetch metrics from Kubelet ip-10-0-135-135.ec2.internal (ip-10-0-135-135.ec2.internal): Get https://ip-10-0-135-135.ec2.internal:10250/stats/summary/: dial tcp: lookup ip-10-0-135-135.ec2.internal on 172.20.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:ip-10-0-146-30.ec2.internal: unable to fetch metrics from Kubelet ip-10-0-146-30.ec2.internal (ip-10-0-146-30.ec2.internal): Get https://ip-10-0-146-30.ec2.internal:10250/stats/summary/: dial tcp: lookup ip-10-0-146-30.ec2.internal on 172.20.0.10:53: no such host]

Adding the command above works. Not sure if the root issue is related to CNI or something else. Would be curious to know if anyone else hits this.

FWIW, my cluster was manually set up (still in early POC phase) and was built per the current AWS Getting Started docs.

I have the same issue.

@jairovm

jairovm commented Jan 11, 2019

Hi guys, I'm running metrics-server through a helm chart on EKS and got all my HPAs working except one, see:

NAMESPACE       NAME                       REFERENCE                             TARGETS                        MINPODS   MAXPODS   REPLICAS   AGE
datateam        hpa1                       Deployment/hpa1                       15%/75%                        2         10        2          3h
default         hpa2                       Deployment/hpa2                       1%/75%                         2         10        2          21d
default         hpa3                       Deployment/hpa3                       596%/75%                       2         10        4          20d
nginx-ingress   nginx-ingress-controller   Deployment/nginx-ingress-controller   <unknown>/50%, <unknown>/50%   3         11        3          50m

The one that is not working is from another helm chart, stable/nginx-ingress.

I have tried with --kubelet-insecure-tls and --kubelet-preferred-address-types=InternalIP without any luck.

kubectl top pods works fine:

$ kubectl top pods -n nginx-ingress
NAME                                             CPU(cores)   MEMORY(bytes)
nginx-ingress-controller-6c54d8d8fd-hbnmf        3m           77Mi
nginx-ingress-controller-6c54d8d8fd-m8jb8        3m           76Mi
nginx-ingress-controller-6c54d8d8fd-xvm5d        4m           76Mi
nginx-ingress-default-backend-544cfb69fc-7zvnw   1m           2Mi

Let me know if you need more info, thanks.

Update:

I got nginx-ingress-controller hpa to work by defining resources in my values.yaml file 😅

  resources:
    limits:
      cpu: 100m
      memory: 128Mi
    requests:
      cpu: 100m
      memory: 128Mi
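This makes sense: the HPA computes utilization as current usage divided by the pod's resource request, so with no request set the target shows `<unknown>`. A toy calculation, using the usage from the kubectl top output above and the 100m request from values.yaml:

```shell
# HPA utilization = current usage / resource request * 100.
usage_millicores=3      # from `kubectl top pods` above
request_millicores=100  # cpu request set in values.yaml
echo "$(( usage_millicores * 100 / request_millicores ))%"   # prints 3%
```

With a 75% target and only 3% utilization, the HPA stays at the minimum replica count; before the request was set, it had nothing to divide by.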

@olereidar

I had the same issue. This solved my problem: https://stackoverflow.com/q/54106725/2291510

nlamirault added a commit to zeiot-old/jarvis that referenced this issue Mar 24, 2019
@piyushkumar13

@kiahmed and @DirectXMan12
Referring to your comment #129 (comment) and #129 (comment)
Adding --kubelet-insecure-tls has worked for me. But is it fine to use this flag for a production cluster? If not, what needs to be done to make metrics-server work?

@LucasSales

It is necessary to add resources, for example (note that requests must not exceed limits):

  resources:
    limits:
      cpu: 1000m
      memory: 1G
    requests:
      cpu: 500m
      memory: 254Mi

@lauer

lauer commented Sep 2, 2020

Had the same problem. Solved it with this command:

helm upgrade --install metrics stable/metrics-server --namespace kube-system --set hostNetwork.enabled=true --set args={kubelet-insecure-tls}

@edrimon

edrimon commented May 30, 2024

Figured out my issue -- my worker node security group was misconfigured. I had to add an inbound rule to allow HTTPS (port 443) traffic from the control plane security group.

Thank you so much, that was it, networking/firewall issue
