
[EKS] unable to fetch metrics from Kubelet #129

Closed
sc-rz opened this issue Sep 1, 2018 · 21 comments

@sc-rz

sc-rz commented Sep 1, 2018

Hi,

I am testing the recently released HPA on Amazon's EKS, but I am running into an issue where metrics-server fails to reach the node.

(actual IP redacted)

$ kubectl logs -l app=metrics-server -n kube-system
...
E0901 04:09:10.815694       1 manager.go:102] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:ip-aa-bb-cc-dd.ec2.internal: unable to fetch metrics from Kubelet ip-aa-bb-cc-dd.ec2.internal (ip-aa-bb-cc-dd.ec2.internal): Get https://ip-aa-bb-cc-dd.ec2.internal:10250/stats/summary/: dial tcp: lookup ip-aa-bb-cc-dd.ec2.internal on 10.100.0.10:53: no such host, unable to fully scrape metrics from source 
$ kubectl get nodes
NAME                             STATUS    ROLES     AGE       VERSION
ip-aa-bb-cc-dd.ec2.internal   Ready     <none>    1h        v1.10.3
$ kubectl describe node 
...
Addresses:
  InternalIP:  aa.bb.cc.dd
  Hostname:    ip-aa-bb-cc-dd.ec2.internal

I am using v0.3 after running kubectl apply -f metrics-server/deploy/1.8+/ on commit 931ef84

Do I need to configure something?

Thanks

@sc-rz
Author

sc-rz commented Sep 1, 2018

Nevermind, this was an issue with my VPC DNS resolution

@sc-rz sc-rz closed this as completed Sep 1, 2018
@dijeesh

dijeesh commented Sep 1, 2018

Same here,

I manually set the image to metrics-server-amd64:v0.3.0 in metrics-server-deployment.yaml and deployed it.

But,

kubectl logs metrics-server-754478c688-j5ckq -n kube-system
I0901 03:49:30.403514       1 serving.go:273] Generated self-signed cert (apiserver.local.config/certificates/apiserver.crt, apiserver.local.config/certificates/apiserver.key)
W0901 03:49:30.723508       1 authentication.go:166] cluster doesn't provide client-ca-file in configmap/extension-apiserver-authentication in kube-system, so client certificate authentication to extension api-server won't work.
W0901 03:49:30.732733       1 authentication.go:210] cluster doesn't provide client-ca-file in configmap/extension-apiserver-authentication in kube-system, so client certificate authentication to extension api-server won't work.
[restful] 2018/09/01 03:49:30 log.go:33: [restful/swagger] listing is available at https://:443/swaggerapi
[restful] 2018/09/01 03:49:30 log.go:33: [restful/swagger] https://:443/swaggerui/ is mapped to folder /swagger-ui/
I0901 03:49:30.778391       1 serve.go:96] Serving securely on [::]:443

And HPA is still showing

Warning FailedGetResourceMetric 4m (x191 over 1h) horizontal-pod-autoscaler unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)

@sc-rz
Author

sc-rz commented Sep 1, 2018

I am also still unable to get HPA working. I ran kubectl describe apiservice v1beta1.metrics.k8s.io and am having the same errors as in #45

@sc-rz
Author

sc-rz commented Sep 2, 2018

Figured out my issue -- my worker node security group was misconfigured. I had to add an inbound rule to allow HTTPS (port 443) traffic from the control plane security group.
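For reference, that inbound rule can be expressed with the AWS CLI roughly as follows. This is a sketch: `<worker-sg>` and `<control-plane-sg>` are placeholders for your actual security group IDs, and the control plane needs to reach the worker nodes on 443 because that is the port the aggregated metrics API serves on.

```shell
# Allow the EKS control plane security group to reach the worker nodes on 443.
# <worker-sg> and <control-plane-sg> are placeholders; substitute your IDs.
aws ec2 authorize-security-group-ingress \
  --group-id <worker-sg> \
  --protocol tcp \
  --port 443 \
  --source-group <control-plane-sg>
```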

@dijeesh

dijeesh commented Sep 4, 2018

I just added an inbound rule for port 443 from the control plane security group, and it looks like it's working now. Thanks @sc-rz

@LucasSales

The solution proposed by @MIBc works. Change the metrics-server-deployment.yaml file and add:

command:
- /metrics-server
- --kubelet-preferred-address-types=InternalIP
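For context, here is a sketch of where that snippet lands inside metrics-server-deployment.yaml; the container name and image tag below are illustrative, not prescriptive:

```yaml
# Fragment of metrics-server-deployment.yaml (names/versions are examples).
containers:
- name: metrics-server
  image: k8s.gcr.io/metrics-server-amd64:v0.3.0
  command:
  - /metrics-server
  # Scrape kubelets by node InternalIP instead of the (unresolvable) hostname.
  - --kubelet-preferred-address-types=InternalIP
```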

@zhangzhaorui

Nevermind, this was an issue with my VPC DNS resolution

Hi! My metrics-server pod has the same error:

E1026 07:37:04.007899 1 reststorage.go:144] unable to fetch pod metrics for pod dev-java/csg-application-68584c6b66-c65k9: no metrics known for pod
E1026 07:37:34.022311 1 reststorage.go:144] unable to fetch pod metrics for pod dev-java/csg-application-68584c6b66-c65k9: no metrics known for pod
E1026 07:37:38.242410 1 manager.go:102] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:idc-k8snode-javaphp-001: unable to fetch metrics from Kubelet idc-k8snode-javaphp-001 (idc-k8snode-javaphp-001): Get https://idc-k8snode-javaphp-001:10250/stats/summary/: dial tcp: lookup idc-k8snode-javaphp-001 on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:idc-k8smaster-javaphp-001: unable to fetch metrics from Kubelet idc-k8smaster-javaphp-001 (idc-k8smaster-javaphp-001): Get https://idc-k8smaster-javaphp-001:10250/stats/summary/: dial tcp: lookup idc-k8smaster-javaphp-001 on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:idc-k8snode-javaphp-002: unable to fetch metrics from Kubelet idc-k8snode-javaphp-002 (idc-k8snode-javaphp-002): Get https://idc-k8snode-javaphp-002:10250/stats/summary/: dial tcp: lookup idc-k8snode-javaphp-002 on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:idc-k8snode-javaphp-003: unable to fetch metrics from Kubelet idc-k8snode-javaphp-003 (idc-k8snode-javaphp-003): Get https://idc-k8snode-javaphp-003:10250/stats/summary/: dial tcp: lookup idc-k8snode-javaphp-003 on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:idc-k8smaster-javaphp-002: unable to fetch metrics from Kubelet idc-k8smaster-javaphp-002 (idc-k8smaster-javaphp-002): Get https://idc-k8smaster-javaphp-002:10250/stats/summary/: dial tcp: lookup idc-k8smaster-javaphp-002 on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:idc-k8snode-javaphp-004: unable to fetch metrics from Kubelet idc-k8snode-javaphp-004 (idc-k8snode-javaphp-004): Get https://idc-k8snode-javaphp-004:10250/stats/summary/: dial tcp: lookup idc-k8snode-javaphp-004 on 10.96.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:idc-k8smaster-javaphp-003: 
unable to fetch metrics from Kubelet idc-k8smaster-javaphp-003 (idc-k8smaster-javaphp-003): Get https://idc-k8smaster-javaphp-003:10250/stats/summary/: dial tcp: lookup idc-k8smaster-javaphp-003 on 10.96.0.10:53: no such host]

How did you solve it?!

@GeekyTex

GeekyTex commented Oct 26, 2018

Thanks @LucasSales, this ended up fixing the issue for me as well. It looks like port 443 has since been added to the needed SGs, but I was still getting the following error in my metrics-server:

E1026 14:41:58.325491 1 manager.go:102] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:ip-10-0-166-28.ec2.internal: unable to fetch metrics from Kubelet ip-10-0-166-28.ec2.internal (ip-10-0-166-28.ec2.internal): Get https://ip-10-0-166-28.ec2.internal:10250/stats/summary/: dial tcp: lookup ip-10-0-166-28.ec2.internal on 172.20.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:ip-10-0-135-135.ec2.internal: unable to fetch metrics from Kubelet ip-10-0-135-135.ec2.internal (ip-10-0-135-135.ec2.internal): Get https://ip-10-0-135-135.ec2.internal:10250/stats/summary/: dial tcp: lookup ip-10-0-135-135.ec2.internal on 172.20.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:ip-10-0-146-30.ec2.internal: unable to fetch metrics from Kubelet ip-10-0-146-30.ec2.internal (ip-10-0-146-30.ec2.internal): Get https://ip-10-0-146-30.ec2.internal:10250/stats/summary/: dial tcp: lookup ip-10-0-146-30.ec2.internal on 172.20.0.10:53: no such host]

Adding the command above works. Not sure if the root issue is related to CNI or something else. Would be curious to know if anyone else hits this.

FWIW, my cluster was manually set up (still in early POC phase) and was built per the current AWS Getting Started docs.

@kiahmed

kiahmed commented Nov 9, 2018

Stuck with this issue for over a week. Tried all of the above, including @LucasSales's approach, but that gives a certificate error saying the cert was not created for that host IP, and the hosts in my cluster change. Port 443 is open, though, so I'm not sure why everybody is focusing on that.

@DirectXMan12
Contributor

@kiahmed basically, you need to tell metrics-server to connect to your nodes using a name or address that it can actually look up. So, by saying InternalIP, you're telling metrics-server to not use hostnames, but instead use the internal IP address of the node. However, if your serving certificates on the Kubelet aren't valid for that IP, you'll get a certificate error.

@kiahmed

kiahmed commented Nov 13, 2018

--kubelet-insecure-tls did the job, which is okay for now for a dev cluster. But even in prod, the API would be accessed through the main Kubernetes apiserver anyway, and that has its own CA and validation, so does it really matter?

@DirectXMan12
Contributor

metrics-server doesn't talk to the nodes via the main API server -- it talks to them directly. Using --kubelet-insecure-tls means that someone could MITM the metrics-server <-> kubelet connection, unless you're using some sort of service mesh or what-have-you that provides its own auth.
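If skipping TLS verification is a concern, a sketch of the safer configuration looks like the following. This assumes your kubelet serving certificates are signed by a CA whose bundle you can mount into the metrics-server pod; the mount path is an example, and flag availability depends on your metrics-server version:

```yaml
command:
- /metrics-server
- --kubelet-preferred-address-types=InternalIP
# Verify kubelet serving certs against a mounted CA bundle instead of
# disabling TLS verification with --kubelet-insecure-tls.
- --kubelet-certificate-authority=/etc/metrics-server/ca.crt
```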

@cdmurph32

Nevermind, this was an issue with my VPC DNS resolution

I think I hit this issue as well, and it wasn't clear to me how VPC settings could break metrics server, besides NACLs.
So just in case other people are broken because of their VPC configuration (not because of NACLs):

  1. The value of http://169.254.169.254/latest/meta-data/local-hostname is set from the VPC DHCP settings. https://docs.aws.amazon.com/vpc/latest/userguide/VPC_DHCP_Options.html
  2. Kubernetes pods get their hostname from this ec2 instance metadata. This sets the node label kubernetes.io/hostname
    https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go#L1244
  3. Metrics server by default uses this label as the hostname for the node (makes sense).
    https://github.com/kubernetes-incubator/metrics-server/blob/master/pkg/sources/summary/addrs.go#L23-L40
  4. If your DHCP settings are wrong (e.g., you override the defaults unintentionally through copy-paste errors in CloudFormation templates, or your custom domain isn't resolvable from within Kubernetes), metrics server won't be able to get anything.
    unable to fully scrape metrics from source kubelet_summary:ip-10-68-234-200.us-west-2.compute.internal: unable to fetch metrics from Kubelet ip-10-68-234-200.us-west-2.compute.internal (ip-10-68-234-200.ec2.internal): Get https://ip-10-68-234-200.ec2.internal:10250/stats/summary/: dial tcp: lookup ip-10-68-234-200.ec2.internal on 172.20.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:ip-10-68-234-239.us-west-2.compute.internal: unable to fetch metrics from Kubelet ip-10-68-234-239.us-west-2.compute.internal (ip-10-68-234-239.ec2.internal): Get https://ip-10-68-234-239.ec2.internal:10250/stats/summary/: dial tcp: lookup ip-10-68-234-239.ec2.internal on 172.20.0.10:53: no such host
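A quick way to reproduce just the lookup step from inside the metrics-server pod (or a debug pod) is to check resolution with getent, which is roughly what the "no such host" errors above correspond to. The hostname below is a deliberately unresolvable placeholder, since ".invalid" is reserved and never resolves:

```shell
# Check whether a node hostname resolves; metrics-server fails the same way
# when it cannot resolve the node name taken from kubernetes.io/hostname.
check_resolves() {
  if getent hosts "$1" >/dev/null; then
    echo "resolvable"
  else
    echo "no such host"
  fi
}

# ".invalid" is a reserved TLD (RFC 2606), so this always fails to resolve.
check_resolves "ip-10-68-234-200.ec2.internal.invalid"   # prints: no such host
```

Running the same check against a real node hostname tells you immediately whether the problem is DNS (fix the VPC DHCP options) or something past DNS (security groups, certificates).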

@jitesh-prajapati123

jitesh-prajapati123 commented Dec 14, 2018

I am getting the following error.

E1214 06:23:17.408800 1 manager.go:102] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:ip-10-0-3-12.ec2.internal: unable to fetch metrics from Kubelet ip-10-0-3-12.ec2.internal (ip-10-0-3-12.ec2.internal): Get https://ip-10-0-3-12.ec2.internal:10250/stats/summary/: dial tcp: i/o timeout, unable to fully scrape metrics from source kubelet_summary:ip-10-0-1-54.ec2.internal: unable to fetch metrics from Kubelet ip-10-0-1-54.ec2.internal (ip-10-0-1-54.ec2.internal): Get https://ip-10-0-1-54.ec2.internal:10250/stats/summary/: dial tcp: i/o timeout]

When I curl https://ip-10-0-3-12.ec2.internal:10250/stats/summary/ it gives me the following.

SSL certificate problem: unable to get local issuer certificate
curl: (60) SSL certificate problem: unable to get local issuer certificate

@jitesh-prajapati123

Thanks @LucasSales, this ended up fixing the issue for me as well. It looks like port 443 has since been added to the needed SGs, but I was still getting the following error in my metrics-server:

E1026 14:41:58.325491 1 manager.go:102] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:ip-10-0-166-28.ec2.internal: unable to fetch metrics from Kubelet ip-10-0-166-28.ec2.internal (ip-10-0-166-28.ec2.internal): Get https://ip-10-0-166-28.ec2.internal:10250/stats/summary/: dial tcp: lookup ip-10-0-166-28.ec2.internal on 172.20.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:ip-10-0-135-135.ec2.internal: unable to fetch metrics from Kubelet ip-10-0-135-135.ec2.internal (ip-10-0-135-135.ec2.internal): Get https://ip-10-0-135-135.ec2.internal:10250/stats/summary/: dial tcp: lookup ip-10-0-135-135.ec2.internal on 172.20.0.10:53: no such host, unable to fully scrape metrics from source kubelet_summary:ip-10-0-146-30.ec2.internal: unable to fetch metrics from Kubelet ip-10-0-146-30.ec2.internal (ip-10-0-146-30.ec2.internal): Get https://ip-10-0-146-30.ec2.internal:10250/stats/summary/: dial tcp: lookup ip-10-0-146-30.ec2.internal on 172.20.0.10:53: no such host]

Adding the command above works. Not sure if the root issue is related to CNI or something else. Would be curious to know if anyone else hits this.

FWIW, my cluster was manually set up (still in early POC phase) and was built per the current AWS Getting Started docs.

I have the same issue.

@jairovm

jairovm commented Jan 11, 2019

Hi guys, I'm running metrics-server through a helm chart on EKS and got all my HPAs working except one, see:

NAMESPACE       NAME                       REFERENCE                             TARGETS                        MINPODS   MAXPODS   REPLICAS   AGE
datateam        hpa1                       Deployment/hpa1                       15%/75%                        2         10        2          3h
default         hpa2                       Deployment/hpa2                       1%/75%                         2         10        2          21d
default         hpa3                       Deployment/hpa3                       596%/75%                       2         10        4          20d
nginx-ingress   nginx-ingress-controller   Deployment/nginx-ingress-controller   <unknown>/50%, <unknown>/50%   3         11        3          50m

The one that is not working is from another helm chart, stable/nginx-ingress.

I have tried with --kubelet-insecure-tls and --kubelet-preferred-address-types=InternalIP without any luck.

kubectl top pods works fine:

$ kubectl top pods -n nginx-ingress
NAME                                             CPU(cores)   MEMORY(bytes)
nginx-ingress-controller-6c54d8d8fd-hbnmf        3m           77Mi
nginx-ingress-controller-6c54d8d8fd-m8jb8        3m           76Mi
nginx-ingress-controller-6c54d8d8fd-xvm5d        4m           76Mi
nginx-ingress-default-backend-544cfb69fc-7zvnw   1m           2Mi

Let me know if you need more info, thanks.

Update:

I got nginx-ingress-controller hpa to work by defining resources in my values.yaml file 😅

  resources:
    limits:
      cpu: 100m
      memory: 128Mi
    requests:
      cpu: 100m
      memory: 128Mi
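This makes sense: the HPA computes utilization as current usage divided by the pod's resource request, so with no request set the target shows `<unknown>`. A toy calculation, using the usage from the kubectl top output above and the 100m request from values.yaml:

```shell
# HPA utilization = current usage / resource request * 100.
usage_millicores=3      # from `kubectl top pods` above
request_millicores=100  # cpu request set in values.yaml
echo "$(( usage_millicores * 100 / request_millicores ))%"   # prints 3%
```

With a 75% target and only 3% utilization, the HPA stays at the minimum replica count; before the request was set, it had nothing to divide by.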

@olereidar

I had the same issue. This solved my problem: https://stackoverflow.com/q/54106725/2291510

nlamirault added a commit to zeiot-old/jarvis that referenced this issue Mar 24, 2019
@piyushkumar13

@kiahmed and @DirectXMan12
Referring to your comment #129 (comment) and #129 (comment)
Adding --kubelet-insecure-tls has worked for me. But is it fine to use this flag for a production cluster? If not, what needs to be done to make metrics-server work?

@LucasSales

It is necessary to add resources, for example (note that requests must not exceed limits):

  resources:
    limits:
      cpu: 1000m
      memory: 1G
    requests:
      cpu: 500m
      memory: 254Mi

@lauer

lauer commented Sep 2, 2020

Had the same problem. Solved it with this command:

helm upgrade --install metrics stable/metrics-server --namespace kube-system --set hostNetwork.enabled=true --set args={kubelet-insecure-tls}

@edrimon

edrimon commented May 30, 2024

Figured out my issue -- my worker node security group was misconfigured. I had to add an inbound rule to allow HTTPS (port 443) traffic from the control plane security group.

Thank you so much, that was it, networking/firewall issue
