Need to investigate why prometheus service backed by prometheus pod times out #2192
Comments
How many nodes do you have? It might not be a CPU problem but an IO bottleneck. Also, if you formulate the query without the regex filter at all (equivalent, since that also matches everything), does it complete? There is a command-line flag for controlling the query timeout, by the way; see the sketch below.
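A minimal sketch of how that timeout could be raised when starting the server, assuming a Prometheus 1.x binary (current at the time); the exact flag spelling differs between 1.x and 2.x:

```
# Start Prometheus with a longer query timeout.
# -query.timeout is the 1.x single-dash flag; 2.x spells it --query.timeout.
prometheus \
  -config.file=/etc/prometheus/prometheus.yml \
  -storage.local.path=/prometheus \
  -query.timeout=5m
```

Raising the timeout only papers over the problem, though; if the query is slow because of storage, the flag just lets it run longer.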
We have around 12 nodes, with 121 pods running. We are trying to use GlusterFS as the data volume for Prometheus. The problem is that, with and without the regex, we only get results back after some time when at most one or two concurrent requests are querying. Once a lot of users start to load the dashboard (Grafana), the CPU-usage query times out for all of them. Queries for other metrics do not take that long to respond; they are almost instant and can be graphed over a 1-hour interval by many users.
Here is the response time of Prometheus when only one user was querying it. Is there some instrumentation technique to find out where the bottleneck is happening? From the graph below, Prometheus seems to be the most CPU-intensive pod we have. The peaks in the graphs occur because we have one Grafana UI that reloads the dashboard at a 5-minute interval.
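One way to narrow down where the time goes is to have Prometheus scrape itself and graph its own metrics; a couple of illustrative queries, assuming a self-scrape job named prometheus and the 1.x local-storage engine (the job name is an assumption, not taken from this setup):

```
# CPU the Prometheus server process itself is using, per second
rate(process_cpu_seconds_total{job="prometheus"}[5m])

# Series held in memory by the 1.x local storage engine; this gives a sense of
# how many series are being tracked, which drives both memory and disk load
prometheus_local_storage_memory_series{job="prometheus"}
```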
We switched from Gluster volumes to local disk for Prometheus and it now seems to perform decently, so it was indeed an IO bottleneck spiking the CPU usage. Thanks for the help.
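For anyone landing here with the same symptoms, the fix amounts to backing the Prometheus data directory with node-local storage instead of a GlusterFS volume; a rough sketch of such a pod spec (image tag, names and paths are illustrative assumptions, not the reporter's actual manifest):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: prometheus
spec:
  nodeName: node-1                    # pin to one node so the data stays with the pod
  containers:
    - name: prometheus
      image: prom/prometheus:v1.4.1
      args:
        - -config.file=/etc/prometheus/prometheus.yml
        - -storage.local.path=/prometheus
      volumeMounts:
        - name: data
          mountPath: /prometheus
  volumes:
    - name: data
      hostPath:
        path: /var/lib/prometheus    # local disk on the node instead of a glusterfs volume
```

The trade-off is that the data is tied to that node; losing the node means losing the history.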
sudhirpandey closed this Nov 17, 2016
lock bot commented Mar 24, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

sudhirpandey commented Nov 16, 2016
What did you do?
We have Prometheus successfully deployed in an OpenShift Origin cluster. All in all it works well, and we are successfully graphing things in Grafana, which queries every 10s.
We have a query to plot the CPU usage of the various pods in the cluster. This query returns values successfully when a particular node is selected, i.e. it has a value. When the node value is .*, i.e. for the whole cluster, it never returns a value and instead times out. We have given a large amount of CPU cores to the pod, i.e. 4 cores, and it still does not seem able to return data, while in the graph the Prometheus pod occasionally seems to use all the cores. Is there any other way to investigate what is going on here? There is also nothing interesting on the Prometheus pod stdout.
Is there any more info I could provide so we can better understand the problem?
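The exact query was not preserved in this thread; a per-pod CPU-usage query of roughly the shape described above would look something like this (illustrative only, using cAdvisor metric and label names as exposed by Kubernetes/OpenShift around that time; the actual labels depend on the scrape configuration):

```
# Per-pod CPU usage, restricted to one node via a regex;
# widening the regex to ".*" makes it cover the whole cluster.
sum(
  rate(container_cpu_usage_seconds_total{instance=~"node-1.*", container_name!=""}[5m])
) by (pod_name)
```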
What did you expect to see?
See the query complete, values returned, and the graph plotted.
What did you see instead? Under which circumstances?
The query times out
Environment