
Provide metrics for kubernetes sd errors #3876

Closed
JosephSalisbury opened this Issue Feb 21, 2018 · 13 comments

JosephSalisbury commented Feb 21, 2018

We have a setup where we can dynamically add Kubernetes clusters to Prometheus. We have some (unrelated) issues where Kubernetes clusters (very occasionally :D) don't come up correctly.

When this happens, I see the following in logs:

level=error ts=2018-02-21T17:09:43.053597895Z caller=main.go:221 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:270: Failed to list *v1.Pod: Get https://master.3p8rz/api/v1/pods?resourceVersion=0: dial tcp 172.31.182.193:443: i/o timeout"
level=error ts=2018-02-21T17:09:43.053600054Z caller=main.go:221 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:268: Failed to list *v1.Endpoints: Get https://master.3p8rz/api/v1/endpoints?resourceVersion=0: dial tcp 172.31.182.193:443: i/o timeout"
level=error ts=2018-02-21T17:09:43.053882797Z caller=main.go:221 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:269: Failed to list *v1.Service: Get https://master.3p8rz/api/v1/services?resourceVersion=0: dial tcp 172.31.182.193:443: i/o timeout"

but in terms of metrics for the Kubernetes SD, I'm only getting:

Every 2.0s: curl -Ss prometheus.monitoring:9090/metrics | grep 'kubernetes' | less                                                                                                      Wed Feb 21 17:10:27 2018

# HELP prometheus_sd_kubernetes_events_total The number of Kubernetes events handled.
# TYPE prometheus_sd_kubernetes_events_total counter
prometheus_sd_kubernetes_events_total{event="add",role="endpoints"} 21600
prometheus_sd_kubernetes_events_total{event="add",role="node"} 2655
prometheus_sd_kubernetes_events_total{event="add",role="pod"} 0
prometheus_sd_kubernetes_events_total{event="add",role="service"} 19449
prometheus_sd_kubernetes_events_total{event="delete",role="endpoints"} 33
prometheus_sd_kubernetes_events_total{event="delete",role="node"} 0
prometheus_sd_kubernetes_events_total{event="delete",role="pod"} 0
prometheus_sd_kubernetes_events_total{event="delete",role="service"} 6
prometheus_sd_kubernetes_events_total{event="update",role="endpoints"} 64009
prometheus_sd_kubernetes_events_total{event="update",role="node"} 32245
prometheus_sd_kubernetes_events_total{event="update",role="pod"} 0
prometheus_sd_kubernetes_events_total{event="update",role="service"} 0

It would be very useful for Prometheus to expose metrics about Kubernetes SD errors, so we can alert on situations where Prometheus can't reach new clusters.

Thoughts? Useful? Happy to provide a PR if there's some consensus here.
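
For illustration, a minimal sketch of the kind of metric being asked for here; the metric name prometheus_sd_kubernetes_failures_total, the role label, and the recordListError helper are hypothetical, not anything Prometheus exposes today:

package kubernetes

import "github.com/prometheus/client_golang/prometheus"

// sdFailedTotal counts failed requests against the Kubernetes API, partitioned
// by SD role. Both the name and the label set are illustrative only.
var sdFailedTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "prometheus_sd_kubernetes_failures_total",
		Help: "Number of failed requests to the Kubernetes API (hypothetical).",
	},
	[]string{"role"},
)

func init() {
	prometheus.MustRegister(sdFailedTotal)
}

// recordListError would be called from the SD's error path, for example when a
// list call times out as in the log lines above.
func recordListError(role string) {
	sdFailedTotal.WithLabelValues(role).Inc()
}

With something like that in place, an alert could be as simple as rate(prometheus_sd_kubernetes_failures_total[5m]) > 0 on the affected Prometheus instance.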

FUSAKLA commented Feb 22, 2018

I also bumped into this today and would appreciate an error metric for the k8s SD.

FUSAKLA commented Mar 13, 2018

I understand this has no priority compared to the more important problems being solved right now, so I decided I'd give it a try. Unfortunately, I'm a complete newbie in Go, and this is a pretty big step for me as I look at the code.

If I read it right, Prometheus uses client-go's ListWatch for asynchronous querying of the k8s API and only periodically checks for updates.
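
For context, here is roughly what that wiring looks like, assuming client-go's usual helpers (cache.NewListWatchFromClient, cache.NewSharedInformer) and the import paths of client-go v6+; this is an illustrative sketch, not Prometheus's exact code. The reflector driving the informer is what emits the "Failed to list *v1.Pod" errors quoted above:

package kubernetes

import (
	"time"

	apiv1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// newPodInformer builds a ListWatch against /api/v1/pods and wraps it in a
// shared informer, which keeps a local cache in sync with the apiserver and
// periodically resyncs.
func newPodInformer(client kubernetes.Interface) cache.SharedInformer {
	lw := cache.NewListWatchFromClient(
		client.CoreV1().RESTClient(), // list and watch via the core/v1 REST client
		"pods",
		metav1.NamespaceAll,
		fields.Everything(),
	)
	return cache.NewSharedInformer(lw, &apiv1.Pod{}, 10*time.Minute)
}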

The problem is that the information about the requests is held by client-go itself. From my point of view, I see two possible solutions:

  • try to use client-go's abstract metrics interfaces, which I found in tools and in workqueue, but for both modules I'm not sure they are used at all

  • use the ResourceEventHandler, which triggers an event only when it succeeds (which rules out a failures metric), and I doubt the returned synced object would contain data about request duration (for a request_duration metric); see the sketch after this list
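
To make the second option concrete, a sketch of its limitation (the handler wiring and the eventCount variable, standing in for the existing prometheus_sd_kubernetes_events_total CounterVec, are assumed rather than Prometheus's exact code): handlers only ever fire for objects the informer has successfully listed or watched, so a failed API request never reaches them.

package kubernetes

import (
	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/client-go/tools/cache"
)

// registerEventHandlers feeds an events counter from informer callbacks. This
// can only ever count successful deliveries; list/watch failures stay inside
// client-go's reflector and never trigger a callback.
func registerEventHandlers(inf cache.SharedInformer, role string, eventCount *prometheus.CounterVec) {
	inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { eventCount.WithLabelValues("add", role).Inc() },
		UpdateFunc: func(oldObj, newObj interface{}) { eventCount.WithLabelValues("update", role).Inc() },
		DeleteFunc: func(obj interface{}) { eventCount.WithLabelValues("delete", role).Inc() },
	})
}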

I'll try to investigate further, but if any Prometheus developer or anyone else could give me a clue or some advice, I'd be really glad. Maybe I'm missing something.

FUSAKLA commented Mar 13, 2018

OK, I hope I've found the right direction: using the MetricsProvider interface of the cache package.
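
Conceptually, such a provider is just an adapter that hands client-go thin wrappers around Prometheus metrics, which client-go then calls Inc()/Set() on. Below is a minimal, self-contained sketch of that idea; the interfaces and metric names are simplified stand-ins (the real interface is cache.MetricsProvider in k8s.io/client-go/tools/cache and has several more constructor methods for watch counts, durations, and item counts):

package kubernetes

import "github.com/prometheus/client_golang/prometheus"

// Simplified stand-ins for the shapes client-go expects from a metrics provider.
type counterMetric interface{ Inc() }
type gaugeMetric interface{ Set(float64) }

// sdMetricsProvider adapts those hooks to Prometheus metrics.
type sdMetricsProvider struct {
	lists               *prometheus.CounterVec
	lastResourceVersion *prometheus.GaugeVec
}

func newSDMetricsProvider(reg prometheus.Registerer) *sdMetricsProvider {
	p := &sdMetricsProvider{
		lists: prometheus.NewCounterVec(prometheus.CounterOpts{
			Name: "prometheus_sd_kubernetes_cache_list_total", // name is illustrative
			Help: "Number of list operations performed by the SD's reflectors.",
		}, []string{"name"}),
		lastResourceVersion: prometheus.NewGaugeVec(prometheus.GaugeOpts{
			Name: "prometheus_sd_kubernetes_cache_last_resource_version", // name is illustrative
			Help: "Last resource version seen by the SD's reflectors.",
		}, []string{"name"}),
	}
	reg.MustRegister(p.lists, p.lastResourceVersion)
	return p
}

// NewListsMetric and NewLastResourceVersionMetric mirror two of the provider's
// constructor methods; a prometheus.Counter already satisfies Inc() and a
// prometheus.Gauge satisfies Set(float64), so the adapters are one-liners.
func (p *sdMetricsProvider) NewListsMetric(name string) counterMetric {
	return p.lists.WithLabelValues(name)
}

func (p *sdMetricsProvider) NewLastResourceVersionMetric(name string) gaugeMetric {
	return p.lastResourceVersion.WithLabelValues(name)
}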

FUSAKLA commented Mar 14, 2018

So the problem is that Prometheus is using the client-go Kubernetes library at version v3.0.0, and the above-mentioned MetricsProvider was introduced in version v5.0.0 (see the release notes in this PR).

The MetricsReflector is compatible with Prometheus's native metrics, so I have code that should work, but when I upgraded client-go to version v5.0.0 everything broke. There are other dependencies to be resolved, and that would probably be a major change, I suppose.

I think the client-go version should be updated anyway, looking at the compatibility matrix.

I'll be glad to keep working on this, but I'd like a word from the maintainers first, since it looks like a bigger change.

simonpasquier commented Mar 14, 2018

@FUSAKLA see #3895, which is about upgrading the k8s client to v6.

brancz commented Mar 14, 2018

The compatibility matrix really doesn't matter for the Prometheus project, as we only use resources that went v1 in Kubernetes 1.0 (maybe 1.2, I would have to look it up again to be honest, but it's a much older version than the client-go version we're using). That's not to say we shouldn't upgrade, but I wanted to make sure everyone knows there are no compatibility issues we'll run into.

FUSAKLA commented Mar 14, 2018

I'm sorry, I didn't mean for it to sound alarming, my bad.

brancz commented Mar 14, 2018

No worries, I didn't take it like that, I just wanted to make sure the information is here if people start reading this. 🙂

simonpasquier commented Aug 3, 2018

@FUSAKLA The k8s client-go has been updated in #4336. Would you have some time to work on this?

FUSAKLA commented Aug 3, 2018

@simonpasquier Hi, definitely. I had a working version based on Krasi's branch with the updated client-go, so hopefully it will be easy to port after those major updates to the k8s SD.

Hopefully I'll get to it this weekend.

FUSAKLA commented Aug 4, 2018

So there it is: #4458

FUSAKLA commented Sep 21, 2018

OK, it took a bit longer, but it should be in master now.

@JosephSalisbury could you take a look and see whether it satisfies your needs for alerting?

simonpasquier commented Oct 9, 2018

Closed by #4458

lock bot locked and limited conversation to collaborators on Apr 7, 2019
