
After GKE 1.6 upgrade kubernetes nodes metrics endpoint returns 401 #2606

Closed
JorritSalverda opened this Issue Apr 11, 2017 · 31 comments

JorritSalverda commented Apr 11, 2017

What did you do?

After upgrading a GKE cluster - both master and nodes - to 1.6.0, the job_name: 'kubernetes-nodes' as specified in the k8s configuration example results in all the node /metrics endpoints returning

server returned HTTP status 401 Unauthorized

What did you expect to see?

The node /metrics endpoints to be scraped as before upgrading to 1.6.0 (previous version was 1.5.6).

What did you see instead? Under which circumstances?

All the endpoints for kubernetes-nodes show as down with a server returned HTTP status 401 Unauthorized error.

Environment

  • Google Container Engine version: 1.6.0
  • System information: Linux 4.4.21+ x86_64
  • Prometheus version:
prometheus, version 1.5.2 (branch: master, revision: bd1182d29f462c39544f94cc822830e1c64cf55b)
  build user:       root@1a01c5f68840
  build date:       20170210-16:23:28
  go version:       go1.7.5
  • Prometheus configuration file:
- job_name: 'kubernetes-nodes'

  scheme: https

  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

  kubernetes_sd_configs:
  - role: node

  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)

brancz commented Apr 11, 2017

Your kubelet needs client certs for authorization. Not sure where you can get those on GKE, but you need to insert them in your tls_config. The two things you are going to need to set are cert_file and key_file.

Also note that since RBAC is on by default in 1.6, you need to make sure that the ServiceAccount your Prometheus instance runs under has permission to access the /metrics route.

Come to think of it, it would probably be a good idea to include a canonical ClusterRole suitable for running Prometheus with RBAC in this repository.
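
A minimal sketch of what that could look like in the scrape config, assuming the client certificate and key are mounted at hypothetical paths inside the Prometheus container (these paths are not GKE defaults):

- job_name: 'kubernetes-nodes'

  scheme: https

  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    # hypothetical mount paths for the kubelet client certificate and key
    cert_file: /etc/prometheus/secrets/kubelet-client/client.crt
    key_file: /etc/prometheus/secrets/kubelet-client/client.key

  kubernetes_sd_configs:
  - role: node

  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)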


JorritSalverda commented Apr 11, 2017

The token and ca.crt are present in the prometheus container, so apparently the default token no longer has access to the node metrics. This person tried various things with ClusterRole without any effect: http://serverfault.com/questions/843751/kubernetes-node-metrics-endpoint-returns-401/

What would it need to look like?


brancz commented Apr 11, 2017

There are two separate problems: scraping the kubelet requires the client certificates the apiserver uses to communicate with it, and an RBAC role so Prometheus is authorized to access the /metrics endpoint of the apiserver.

In CoreOS Tectonic we have a separate secret that contains the certificates the apiserver uses to communicate with the kubelet. We mount that secret into the Prometheus container for Prometheus to be able to authenticate with the kubelet and scrape it. No RBAC is required for this part.

The ClusterRole that we use for Prometheus to be authorized to access the /metrics endpoint is this, but the part about configmaps is unique to the Prometheus sidecar we have built, so the important part is this for service discovery and this for allowing access to the /metrics endpoint of the apiserver.
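
As an illustration of that secret-mounting approach, a hypothetical pod spec fragment could look like the following (the secret name, mount path, and image tag are assumptions, not the actual Tectonic names; the mount path should match cert_file/key_file in the scrape config):

apiVersion: v1
kind: Pod
metadata:
  name: prometheus
spec:
  serviceAccountName: prometheus
  containers:
  - name: prometheus
    image: prom/prometheus:v1.5.2
    volumeMounts:
    # mount point referenced by cert_file/key_file in tls_config
    - name: kubelet-client-certs
      mountPath: /etc/prometheus/secrets/kubelet-client
      readOnly: true
  volumes:
  - name: kubelet-client-certs
    secret:
      # hypothetical secret holding the apiserver->kubelet client cert and key
      secretName: apiserver-kubelet-client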


JorritSalverda commented Apr 11, 2017

Thanks, I'll work on setting up the ServiceAccount, ClusterRole and ClusterRoleBinding.

For the untrusted kubelet certificate I still resort to 'insecure_skip_verify: true', but will have a look at a solution similar to what you use.

I'll post the extra manifests I needed to this ticket once it's up and running, and will close it afterwards.


brancz commented Apr 11, 2017

For the untrusted kubelet certificate I still resort to 'insecure_skip_verify: true', but will have a look at a solution similar to what you use.

This is still an unsolved problem for us as well: the kubelets still create their TLS certs on the fly (therefore self-signed), but now additionally require client certs.


brancz commented Apr 11, 2017

I think we can close this, as it is mainly an operational issue. If anything else comes up, please take it to the prometheus-users mailing list: it has better visibility than this tracker and other users can benefit from it as well. Thanks! 🙂


JorritSalverda commented Apr 11, 2017

It turned out that for some reason the kubelet is no longer accessible over https on port 10250. Changing the scrape address to use http and port 10255 provides an acceptable workaround for now:

      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__address__]
        action: replace
        target_label: __address__
        regex: ([^:;]+):(\d+)
        replacement: ${1}:10255
      - source_labels: [__scheme__]
        action: replace
        target_label: __scheme__
        regex: https
        replacement: http

JorritSalverda commented Apr 14, 2017

I'd like to have this reopened. According to kubernetes/kubernetes#44330 (comment), GKE is not going to support RBAC-based authentication for the kubelet while closing off anonymous access. Scraping either needs to be proxied via the apiserver or use a client-side certificate created by the certificates API.


matthiasr commented Apr 14, 2017

How does "hitting the apiserver proxy" work? What URLs need to be hit?

matthiasr reopened this Apr 14, 2017


matthiasr commented Apr 14, 2017

I don't think there is anything we can fundamentally do if the kubelet doesn't expose the metrics in a usable form, but if there is a way we should document it.


matthiasr commented Apr 14, 2017

@JorritSalverda maybe you can use relabeling to bend the discovered nodes into being scraped through the apiserver? Similar to how the blackbox and SNMP exporters are addressed.


mikedanese commented Apr 14, 2017

You can access node metrics by hitting the kubernetes master, e.g.:

 https://<master-ip>/api/v1/nodes/gke-cluster-1-default-pool-b1eaf580-79km/proxy/metrics

Or you can use TLS client auth


JorritSalverda commented Apr 18, 2017

Thanks @mikedanese, that actually works. Now I have to figure out how to get the master IP during node relabeling, and then I'm set for now and the future.

@matthiasr do you have an idea for this? The documentation only shows the following meta labels to be available for the node role:

  • __meta_kubernetes_node_name: The name of the node object.
  • __meta_kubernetes_node_label_<labelname>: Each label from the node object.
  • __meta_kubernetes_node_annotation_<annotationname>: Each annotation from the node object.
  • __meta_kubernetes_node_address_<address_type>

mikedanese commented Apr 18, 2017

Do you need the actual IP or can you use the 'kubernetes' service in the default namespace?


yuvaldrori commented Apr 18, 2017

This worked for me:

relabel_configs:
    - target_label: __address__
      replacement: [my master ip]
    - target_label: __metrics_path__
      regex: (.+)
      source_labels: [__meta_kubernetes_node_name]
      action: replace
      replacement: /api/v1/nodes/$1/proxy/metrics
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)

JorritSalverda commented Apr 18, 2017

For my Prometheus server running inside GKE I now use the following relabeling:

relabel_configs:
- action: labelmap
  regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
  replacement: kubernetes.default.svc.cluster.local:443
- target_label: __scheme__
  replacement: https
- source_labels: [__meta_kubernetes_node_name]
  regex: (.+)
  target_label: __metrics_path__
  replacement: /api/v1/nodes/${1}/proxy/metrics

And the following ClusterRole bound to the service account used by Prometheus:

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]

Because the GKE cluster still has an ABAC fallback in case RBAC fails, I'm not 100% sure yet that this covers all required permissions.
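
For completeness, a minimal ClusterRoleBinding sketch that ties this role to the Prometheus ServiceAccount (the account name and namespace are assumptions):

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  # assumed ServiceAccount used by the Prometheus deployment
  name: prometheus
  namespace: default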


matthiasr commented Apr 19, 2017

I don't think we have a general way to get the master IP, but configuring the Kubernetes discovery requires it anyway. The example config assumes Prometheus running in the cluster, so in that case the server address can just be kubernetes.default.svc:443, which is independent of cluster specifics. For Prometheus outside of Kubernetes, plug in whatever you put in for the api_server. We could expose that as another meta label, but I don't think it's worth it just to avoid copying the value (which you can use config management for).

@JorritSalverda @yuvaldrori would you mind opening a PR against the example configuration to add this for the node job, and a comment mentioning the necessary role? Add the role as a separate file in the same directory. (At some point maybe we should reorganise the example directory to group the Kubernetes stuff, but that will break inbound links, so let's not tie it in with this.)


yuvaldrori commented Apr 19, 2017

@matthiasr some points to consider before I open a PR:

  1. I did not have to add a ClusterRole, but I do use my admin username and password for the GKE master in order to talk to it.
  2. I am not sure that using the master proxy is good for the "up" test - it might fail because the master is down rather than because the node is down.

Should I still open the PR with these caveats?

matthiasr commented Apr 19, 2017

  1. The cluster role is preferable to encoding the admin user name and password – it's limited in what it is allowed to do and works in any cluster, not just GKE.

  2. Using the apiserver proxy seems to be the most generic solution given what Prometheus already knows (how to talk to the apiserver). True, if the apiserver is down all nodes will not be scrapeable – please add a comment detailing that. Maybe also mention alternatives (client cert, insecure port)?


yuvaldrori commented Apr 19, 2017

  1. I would be happy to know if someone got the client cert to work and how.

JorritSalverda commented Apr 19, 2017

Sure, I'll work on the PR and leave this ticket open until that's fixed and merged.


JorritSalverda commented Apr 19, 2017

PR is at #2641


bviolier commented Apr 22, 2017

I ran into this issue and tried to resolve it by applying the rbac-setup.yaml as an IAM user via kubectl (which has all rights by default, right?), but I got the error:

serviceaccount "prometheus" configured
clusterrolebinding "prometheus" created
Error from server (Forbidden): error when creating "rbac-setup.yaml": clusterroles.rbac.authorization.k8s.io "prometheus" is forbidden: attempt to grant extra privileges: [{[get] [] [nodes] [] []} {[list] [] [nodes] [] []} {[watch] [] [nodes] [] []} {[get] [] [nodes/proxy] [] []} {[list] [] [nodes/proxy] [] []} {[watch] [] [nodes/proxy] [] []} {[get] [] [services] [] []} {[list] [] [services] [] []} {[watch] [] [services] [] []} {[get] [] [endpoints] [] []} {[list] [] [endpoints] [] []} {[watch] [] [endpoints] [] []} {[get] [] [pods] [] []} {[list] [] [pods] [] []} {[watch] [] [pods] [] []} {[get] [] [] [] [/metrics]}] user=&{<my_email> [system:authenticated] map[]} ownerrules=[{[create] [authorization.k8s.io] [selfsubjectaccessreviews] [] []} {[get] [] [] [] [/api /api/* /apis /apis/* /healthz /swaggerapi /swaggerapi/* /version]}] ruleResolutionErrors=[]

Am I doing anything wrong?


JorritSalverda commented Apr 22, 2017

@bviolier as a safety measure you're only allowed to hand out permissions your own account already has. In Google Cloud IAM you should assign yourself the Container Engine Cluster Admin role, which should allow you to apply this manifest. It seems to take a while before this becomes effective, though.

A faster way that worked for me is to add a ClusterRoleBinding for your own user to assign yourself the cluster-admin role using the following manifest:

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: additional-cluster-admins
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: User
  name: <your email address as used in GCE>

discostur commented Apr 27, 2017

I just deployed a new k8s 1.6 cluster (on premise) with prometheus and the following RBAC rules:

apiVersion: rbac.authorization.k8s.io/v1alpha1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus 
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus 
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus 
subjects:
- kind: ServiceAccount
  name: prometheus 
  namespace: default

My apiserver and custom endpoints get scraped, but Prometheus is not able to scrape the node endpoints. It still gets the following error:

https://IP_address:10250/metrics

server returned HTTP status 403 Forbidden

Can you tell me what rules I have to add to my RBAC? In the log output I can't see any errors...


mikedanese commented Apr 27, 2017

You can't hit the nodes directly. You need to use the config here:

#2606 (comment)


mikedanese commented Apr 27, 2017

Also see the PR that updates the documentation:

#2641


discostur commented Apr 27, 2017

@mikedanese after changing the relabel config as you said, I still get the error:

https://kubernetes.default.svc.cluster.local:443/api/v1/nodes/k8s-dev01.eu/proxy/metrics 

server returned HTTP status 403 Forbidden

Did I miss something?


discostur commented Apr 27, 2017

Forgot the

- nodes/proxy

rule in the ClusterRole; now it is working! Thanks @mikedanese for helping me out!
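
For reference, the rules section of the earlier ClusterRole with the missing resource added would look roughly like this (everything else unchanged):

rules:
- apiGroups: [""]
  resources:
  - nodes
  # needed so the ServiceAccount may reach /api/v1/nodes/<node>/proxy/metrics via the apiserver
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]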

kujenga added a commit to kujenga/charts that referenced this issue May 15, 2017

charts/stable/prometheus: modify config to support 1.6 by default
This commit adds support for Kubernetes 1.6 RBAC restrictions within
the prometheus configuration for scraping node metrics.

Fixes helm#955

Initial discussion of the issue:
prometheus/prometheus#2606

viglesiasce added a commit to helm/charts that referenced this issue Jul 6, 2017

Prometheus: modify config to support k8s 1.6 by default (#1080)
* charts/stable/prometheus: modify config to support 1.6 by default

This commit adds support for Kubernetes 1.6 RBAC restrictions within
the prometheus configuration for scraping node metrics.

Fixes #955

Initial discussion of the issue:
prometheus/prometheus#2606

* stable/prometheus: bump chart minor version

brian-brazil commented Jul 14, 2017

I believe this is all resolved now.

yanns pushed a commit to yanns/charts that referenced this issue Jul 28, 2017

Prometheus: modify config to support k8s 1.6 by default (helm#1080)
* charts/stable/prometheus: modify config to support 1.6 by default

This commit adds support for Kubernetes 1.6 RBAC restrictions within
the prometheus configuration for scraping node metrics.

Fixes helm#955

Initial discussion of the issue:
prometheus/prometheus#2606

* stable/prometheus: bump chart minor version

TinySong added a commit to TinySong/kubernete-mainifest that referenced this issue Sep 9, 2017

update prometheus's rbac-auth and prometheus's configmap
if the prometheus configmap is not modified, the endpoint metrics cannot be scraped
by prometheus
track:
1 PR. prometheus/prometheus#2641
2 issue. prometheus/prometheus#2606 (comment)

deepti-cloudibility commented Feb 11, 2019

It turned out that for some reason the kubelet is no longer accessible over https on port 10250. Changing the scrape address to use http and port 10255 provides an acceptable workaround for now:

      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__address__]
        action: replace
        target_label: __address__
        regex: ([^:;]+):(\d+)
        replacement: ${1}:10255
      - source_labels: [__scheme__]
        action: replace
        target_label: __scheme__
        regex: https
        replacement: http

Hi, how did you change the port from 10250 to 10255? For me it's not working on 10255, but when I'm curling ip:10250 it gives me output.
