Need to investigate why prometheus service backed by prometheus pod times out #2192

Closed
sudhirpandey opened this Issue Nov 16, 2016 · 4 comments


sudhirpandey commented Nov 16, 2016

What did you do?
We have Prometheus successfully deployed in an OpenShift Origin cluster. All in all it works well, and we are successfully graphing things in Grafana, which queries every 10s.

We have this query to plot the CPU usage of the various pods in the cluster:

sum (rate (container_cpu_usage_seconds_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^$Node$"}[1m])) by (container_label_io_kubernetes_pod_name)

This query returns values successfully when a particular node is selected, i.e. it has a value. When we use a value of .*, i.e. for the whole cluster, it never gets a value and instead times out.

We have given a large number of CPU cores to the pod, i.e. 4 cores, and it still seems unable to return data, while the Prometheus pod occasionally appears in the graph to use all the cores. Is there any other way to investigate what is going on here? There is also nothing interesting on the Prometheus pod's stdout.
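
For reference, after Grafana substitutes the $Node template variable, the two variants of the query would look roughly like this (the hostname below is a placeholder, not a real node from this cluster):

# Works: a single concrete node selected
sum (rate (container_cpu_usage_seconds_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^node-01.example.com$"}[1m])) by (container_label_io_kubernetes_pod_name)

# Times out: .* substituted, i.e. all nodes in the cluster
sum (rate (container_cpu_usage_seconds_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^.*$"}[1m])) by (container_label_io_kubernetes_pod_name)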

Is there any more info I could provide so we can better understand the problem?
What did you expect to see?
To see the query complete, values returned, and the graph plotted.

What did you see instead? Under which circumstances?
The query times out

Environment

  • OpenShift / Kubernetes version:
    openshift v1.3.0
    kubernetes v1.3.0+52492b4
  • System information:
    Linux 3.10.0-327.36.3.el7.x86_64 x86_64
  • Prometheus version:
    prometheus --version
    prometheus, version 1.3.1 (branch: master, revision: be476954e80349cb7ec3ba6a3247cd712189dfcb)
    build user:       root@37f0aa346b26
    build date:       20161104-20:24:03
    go version:       go1.7.3

juliusv commented Nov 17, 2016

How many nodes do you have? It might not be a CPU problem, but an IO bottleneck. Also, if you formulate a query without the regex filter at all (equivalent, since that also matches everything), does it complete?

There is a command-line flag for controlling the query timeout, by the way: -query.timeout. By default, it's 2 minutes.
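
A minimal sketch of what that could look like with 1.x flag syntax (the config file path here is just a placeholder for whatever the deployment actually uses):

# Raise the query timeout from the default 2m to 5m
prometheus -config.file=/etc/prometheus/prometheus.yml -query.timeout=5m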


sudhirpandey commented Nov 17, 2016

We have around 12 nodes, with 121 pods running. We are using GlusterFS as the data volume for Prometheus.

The problem exists both with and without the regex: we only get results back after some time when at most one or two concurrent requests are querying. Once a lot of users start to load the dashboard (Grafana), the CPU usage query times out for all of them.
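
For completeness, the fully stripped-down variant (dropping the regex matchers entirely, as suggested above) would look roughly like this:

sum (rate (container_cpu_usage_seconds_total{image!=""}[1m])) by (container_label_io_kubernetes_pod_name)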

Queries for other metrics do not take that long to respond; they are almost instant and can be graphed over a 1-hour interval by a lot of users.

sum (rate (container_cpu_usage_seconds_total{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~".*"}[1m])) by (container_label_io_kubernetes_pod_name) is reported by the Prometheus web console to complete in 279ms with resolution 14s.

Here is the response time of Prometheus when only one user was querying it:
for a 15m range it took 24s,
for 30m it took 22s,
and for 1h it took 20s.

Is there some instrumentation technique to find out where the bottleneck is happening? From the graph below, Prometheus seems to be the most CPU-intensive pod we have.

[Screenshot: screen shot 2016-11-17 at 10 22 07 am — pod CPU usage graph]

Those peaks occur in the graphs because we have one Grafana UI that reloads the dashboard at a 5-minute interval.
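
A minimal sketch of self-monitoring queries that could help narrow this down, assuming Prometheus scrapes itself under a job named "prometheus"; the cAdvisor filesystem metric and the pod-name label pattern are assumptions and may differ in this setup:

# CPU actually consumed by the Prometheus server process
rate(process_cpu_seconds_total{job="prometheus"}[1m])

# Resident memory of the Prometheus server process
process_resident_memory_bytes{job="prometheus"}

# Time the Prometheus container spends waiting on disk I/O (cAdvisor metric; name assumed)
rate(container_fs_io_time_seconds_total{container_label_io_kubernetes_pod_name=~"prometheus.*"}[1m])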


sudhirpandey commented Nov 17, 2016

We switched from Gluster volumes to local disk for Prometheus, and now it seems to perform decently. So it was indeed the IO bottleneck spiking up the CPU usage.

Thanks for the help.


lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 24, 2019
