Kubernetes discovery misses some changes #1316

Closed · matthiasr opened this issue on Jan 14, 2016 · 14 comments

matthiasr commented Jan 14, 2016

When bringing up a local cluster for integration tests, Prometheus (using the example Kubernetes configuration) sometimes does not pick up a service that requests scraping via the prometheus.io/scrape annotation.
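
For reference, the annotation looks like this when applied with kubectl; the service name here is just a placeholder:

# Hypothetical example: mark a service for scraping by the example Kubernetes configuration.
kubectl annotate service haproxy-exporter prometheus.io/scrape=true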

In this situation, sending a SIGHUP to the Prometheus process seems to "fix" this:

mr@ip-10-33-34-50 ~/k8s (git)-[master] % kubectl-local --namespace=kube-system exec -ti deploymentrc-2964570028-e8nbh -- wget -q -O - "http://localhost:9090/api/v1/query?query=haproxy_backend_up&time=$(date -u +%s)"
{"status":"success","data":{"resultType":"vector","result":[]}}
mr@ip-10-33-34-50 ~/k8s (git)-[master] % kubectl-local --namespace=kube-system exec -ti deploymentrc-2964570028-e8nbh -- pkill -HUP prometheus
mr@ip-10-33-34-50 ~/k8s (git)-[master] % kubectl-local --namespace=kube-system exec -ti deploymentrc-2964570028-e8nbh -- wget -q -O - "http://localhost:9090/api/v1/query?query=haproxy_backend_up&time=$(date -u +%s)"
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"haproxy_backend_up","backend":"default-web-80","instance":"172.17.128.7:9101","job":"kubernetes-services","kubernetes_name":"ingress-controller","kubernetes_namespace":"ingress-internal","kubernetes_role":"service"},"value":[1452788773,"1"]},{"metric":{"__name__":"haproxy_backend_up","backend":"deny","instance":"172.17.128.7:9101","job":"kubernetes-services","kubernetes_name":"ingress-controller","kubernetes_namespace":"ingress-internal","kubernetes_role":"service"},"value":[1452788773,"1"]},{"metric":{"__name__":"haproxy_backend_up","backend":"kube-system-kibana-logging-5601","instance":"172.17.128.7:9101","job":"kubernetes-services","kubernetes_name":"ingress-controller","kubernetes_namespace":"ingress-internal","kubernetes_role":"service"},"value":[1452788773,"1"]},{"metric":{"__name__":"haproxy_backend_up","backend":"kube-system-prometheus-9090","instance":"172.17.128.7:9101","job":"kubernetes-services","kubernetes_name":"ingress-controller","kubernetes_namespace":"ingress-internal","kubernetes_role":"service"},"value":[1452788773,"1"]},{"metric":{"__name__":"haproxy_backend_up","backend":"stats","instance":"172.17.128.7:9101","job":"kubernetes-services","kubernetes_name":"ingress-controller","kubernetes_namespace":"ingress-internal","kubernetes_role":"service"},"value":[1452788773,"1"]}]}}
mr@ip-10-33-34-50 ~/k8s (git)-[master] % kubectl-local --namespace=kube-system logs deploymentrc-2964570028-e8nbh
prometheus, version 0.16.1 (branch: HEAD, revision: 968ee35)
  build user:       @84057be02ab6
  build date:       20160111-15:47:17
  go version:       1.5.1
time="2016-01-14T16:08:50Z" level=info msg="Loading configuration file /etc/prometheus/prometheus.yml" source="main.go:196"
time="2016-01-14T16:08:50Z" level=info msg="Loading series map and head chunks..." source="storage.go:262"
time="2016-01-14T16:08:50Z" level=info msg="0 series loaded." source="storage.go:267"
time="2016-01-14T16:08:50Z" level=info msg="Starting target manager..." source="targetmanager.go:114"
time="2016-01-14T16:08:50Z" level=info msg="Listening on :9090" source="web.go:223"
time="2016-01-14T16:09:00Z" level=error msg="Unable to list Kubernetes nodes: Unable to query any masters" source="discovery.go:113"
time="2016-01-14T16:09:10Z" level=error msg="Unable to list Kubernetes nodes: Unable to query any masters" source="discovery.go:113"
time="2016-01-14T16:09:20Z" level=error msg="Failed to watch service endpoints: Unable to query any masters" source="discovery.go:551"
time="2016-01-14T16:09:20Z" level=error msg="Failed to watch service endpoints: Unable to query any masters" source="discovery.go:551"
time="2016-01-14T16:09:30Z" level=error msg="Failed to watch nodes: Unable to query any masters" source="discovery.go:365"
time="2016-01-14T16:09:30Z" level=error msg="Failed to watch nodes: Unable to query any masters" source="discovery.go:365"
time="2016-01-14T16:09:40Z" level=error msg="Failed to watch services: Unable to query any masters" source="discovery.go:407"
time="2016-01-14T16:09:40Z" level=error msg="Failed to watch services: Unable to query any masters" source="discovery.go:407"
time="2016-01-14T16:09:50Z" level=error msg="Failed to watch service endpoints: Unable to query any masters" source="discovery.go:551"
time="2016-01-14T16:09:50Z" level=error msg="Failed to watch service endpoints: Unable to query any masters" source="discovery.go:551"
time="2016-01-14T16:10:00Z" level=error msg="Failed to watch nodes: Unable to query any masters" source="discovery.go:365"
time="2016-01-14T16:10:00Z" level=error msg="Failed to watch nodes: Unable to query any masters" source="discovery.go:365"
time="2016-01-14T16:10:10Z" level=error msg="Failed to watch services: Unable to query any masters" source="discovery.go:407"
time="2016-01-14T16:10:10Z" level=error msg="Failed to watch services: Unable to query any masters" source="discovery.go:407"
time="2016-01-14T16:10:20Z" level=error msg="Failed to watch service endpoints: Unable to query any masters" source="discovery.go:551"
time="2016-01-14T16:10:20Z" level=error msg="Failed to watch service endpoints: Unable to query any masters" source="discovery.go:551"
time="2016-01-14T16:10:30Z" level=error msg="Failed to watch nodes: Unable to query any masters" source="discovery.go:365"
time="2016-01-14T16:10:30Z" level=error msg="Failed to watch nodes: Unable to query any masters" source="discovery.go:365"
time="2016-01-14T16:10:40Z" level=error msg="Failed to watch services: Unable to query any masters" source="discovery.go:407"
time="2016-01-14T16:10:40Z" level=error msg="Failed to watch services: Unable to query any masters" source="discovery.go:407"
time="2016-01-14T16:13:50Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:539"
time="2016-01-14T16:13:50Z" level=info msg="Done checkpointing in-memory metrics and chunks in 178.226463ms." source="persistence.go:563"
time="2016-01-14T16:18:50Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:539"
time="2016-01-14T16:18:50Z" level=info msg="Done checkpointing in-memory metrics and chunks in 167.210103ms." source="persistence.go:563"
time="2016-01-14T16:23:50Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:539"
time="2016-01-14T16:23:50Z" level=info msg="Done checkpointing in-memory metrics and chunks in 144.101098ms." source="persistence.go:563"
time="2016-01-14T16:26:10Z" level=info msg="Loading configuration file /etc/prometheus/prometheus.yml" source="main.go:196"
time="2016-01-14T16:26:10Z" level=info msg="Stopping target manager..." source="targetmanager.go:203"
time="2016-01-14T16:26:10Z" level=info msg="Target manager stopped." source="targetmanager.go:216"
time="2016-01-14T16:26:10Z" level=info msg="Starting target manager..." source="targetmanager.go:114"

It may be worth noting that during the bring-up of this cluster, several components are deployed at pretty much the same time, in parallel and in no particular order. Prometheus and the to-be-scraped HAProxy exporter may come up in either order or at the same time. About one time in five this ends up in the situation above, which may or may not be a race condition …

There is also a bunch of errors about being unable to query any masters, but those also appear when everything works out just fine.

@jimmidyson any idea what's up with that?

brian-brazil added the bug label on Jan 14, 2016

matthiasr commented Jan 14, 2016

PS: I'm now pushing components into the cluster serially; I'll report back if that makes things better.

jimmidyson commented Jan 14, 2016

The Unable to query any masters error is bad. We should probably log the reason for that failure to help diagnose it. But that error should mean no targets get discovered, yet from what you say it looks like some targets do get discovered? There shouldn't be any errors like that when things work. Normally this failure only happens with secured API servers, where either the API server certificate does not validate (CA certificate not distributed to pods, a config error in the API server IIRC) or, if using ABAC, the service account cannot read from the API server. But there are a few other scenarios... Let me add some more logging to help you diagnose what's going on.
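
One quick way to check the certificate/service-account case from inside the pod; the mount path is the standard service account location and the pod name is a placeholder:

# Hypothetical check: confirm the CA bundle and token are mounted in the Prometheus pod.
kubectl --namespace=kube-system exec <prometheus-pod> -- ls /var/run/secrets/kubernetes.io/serviceaccount/
# Expect ca.crt, namespace and token to be listed.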

It is possible there are races (however careful I was to try to avoid them), and I'll look at that again (other people's eyes are also welcome, if anyone has time!).

jimmidyson commented Jan 14, 2016

BTW, order should of course not matter in an environment like this. If you find that serializing deploys works, then we will have to look for races even more closely.

jimmidyson commented Jan 14, 2016

Ah, I see you're using the 0.16.1 tag; a number of fixes have been made on master since then (including the error logging I mentioned above!). Can you try with master, please? If you're using Docker (which I assume you are, deploying onto a Kubernetes cluster), you can use the prom/prometheus:master image, which is stable enough for me. Just be aware that there are a few config differences between the two versions.

matthiasr commented Jan 14, 2016

Because of the order of the tests, I know that it did successfully scrape the apiserver and node despite the failure to find this service.

matthiasr commented Jan 14, 2016

Hmm okay, let me try master.

matthiasr commented Jan 14, 2016

Update: unparallelizing the component deploy did not fix this. (However, as all components are deployed as Deployments or DaemonSets in our setup, this only meant that the Kubernetes API objects were created serially, not that the pods started up serially.)

matthiasr commented Jan 14, 2016

Here's an error message from :master:

time="2016-01-14T18:48:26Z" level=error msg="Unable to list Kubernetes nodes: Unable to query any API servers: Get https://kubernetes.default/api/v1/nodes: dial tcp: i/o timeout" source="discovery.go:113"

At the time of this message, the cluster was in a state where some components were up but the DNS pods in particular were still Pending, so I presume it is the DNS lookup that times out.
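
A quick way to confirm that, assuming the image ships a resolver tool such as nslookup:

# Hypothetical check: does the cluster-relative API server name resolve from inside the Prometheus pod?
# This fails while the DNS pods are still Pending.
kubectl-local --namespace=kube-system exec -ti deploymentrc-2964570028-e8nbh -- nslookup kubernetes.default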

matthiasr commented Jan 14, 2016

… and I'm giving up for today, because now it can't resolve the cluster-relative names. I'll update here when I have more information.

jimmidyson commented Jan 14, 2016

DNS is only required to resolve the Kubernetes API server if you've specified it in the config using the cluster-relative name (kubernetes.default), which is best to use since the underlying cluster IP range can differ between deployments. DNS is also required, IIRC, if you're scraping directly from the service (e.g. blackbox probing). Otherwise the pod IP is used, which doesn't require DNS.

There's not much we can do about DNS resolution not working, as we rely on glibc for resolution. Failures shouldn't be cached, AFAIK.

We have to somehow separate out which problems are cluster setup problems and which are Prometheus problems, I think.

matthiasr commented Jan 14, 2016

jimmidyson commented Jan 14, 2016

Thanks!

jimmidyson commented Jan 15, 2016

tl;dr: I've found a window in which this could happen and will look into a fix for it.

Long story: In Sources, discovery sends GET requests for nodes and services, storing the returned resource version. In Run, we send the WATCH request to watch services, starting at the resource version we got from the GET in Sources. Kubernetes doesn't keep a full history of resource versions (housekeeping to limit resource usage), so there is a chance that some events between the GET and the WATCH are dropped, depending on how loaded the API server is. As you're starting the cluster up from scratch, the API server could be pretty active, making missed events more likely.
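
To illustrate the pattern with plain API calls (a sketch only: the endpoints and the watch/resourceVersion parameters are the standard core v1 API, the token path is the usual service account mount, and the resource version value is made up):

# Hypothetical illustration of the LIST-then-WATCH gap, run from inside a pod.
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
CA=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt

# LIST: returns the current services plus a resourceVersion, say 12345.
curl -s --cacert $CA -H "Authorization: Bearer $TOKEN" https://kubernetes.default/api/v1/services

# Any events that occur here and are then compacted away by the API server
# will never be delivered to the watch below.

# WATCH: resumes from the recorded resourceVersion; if that version has already
# been compacted, the intervening events are lost.
curl -s --cacert $CA -H "Authorization: Bearer $TOKEN" "https://kubernetes.default/api/v1/services?watch=true&resourceVersion=12345"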

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited the conversation to collaborators on Mar 24, 2019
