Big increase of scrape duration and other durations after update to 1.6 #2782
Comments
It would be interesting to look at plots of …
Also, what happens if you let a Prometheus server run with the same setup as your main Prometheus, but you don't point a federating Prometheus at it? (To find out whether it has something to do with federation.)
Here are your requested graphs from our main Prometheus:

We would like to avoid trying a dual-running setup for now, as we are already running into timeouts when scraping metrics, and doubling the load on the scraping endpoints might actually make things worse.

To clarify things, our setup looks like this (simplified):

The metrics we shared in this issue were taken from the main Prometheus. If you would like, we can provide the same metrics for our federation Prometheus, but they look the same (despite different resources and therefore a different memory configuration).
OK, so the misunderstanding was that I thought the Federation Prometheus is the one that scrapes …. Am I right to assume that the Main Prometheus federates everything that is in the Federation Prometheus?
No, we don't federate everything. Basically, the Federation Prometheus automatically creates 5 recording rules per metric in order to get rid of the … (we know that not every combination makes sense, but it was easier to automate this way). The Main Prometheus federates all those pre-aggregated metrics. We do this in several scrape jobs because otherwise the scrape duration would be too long.
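(The actual rules are not shown in the thread. Purely as a sketch of the pattern, a Prometheus 1.x rule file that pre-aggregates a raw metric so that only the aggregated series need to be federated could look like the block below; the metric name http_requests, the instance label, and the five aggregation functions are placeholders guessed for illustration, not taken from this setup.)

```
# Hypothetical Prometheus 1.x recording rules: aggregate away the per-instance
# dimension so the main Prometheus only needs to federate the "job:" series.
job:http_requests:sum   = sum(http_requests)   without (instance)
job:http_requests:avg   = avg(http_requests)   without (instance)
job:http_requests:max   = max(http_requests)   without (instance)
job:http_requests:min   = min(http_requests)   without (instance)
job:http_requests:count = count(http_requests) without (instance)
```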
From your graphs, I see ~350k series right after the restart. That's indeed a lot of samples to federate and might cause trouble, no matter what version of Prometheus you are running.

This might all be triggered by a change unrelated to the version upgrade, which pushed your server over the cliff. For one, I see a steep increase in disk I/O on 5/20, which correlates with a steep increase in load average. Something changed there, and it was neither a restart of the server nor a version upgrade. The first suspect would be retention, i.e. on that day your Main Prometheus had, for the first time, enough history to start deleting metrics data. This is, sadly, very expensive by design. (The good news here is that retention cutoff will be almost free with Prometheus 2.) Perhaps that gave your Main Prometheus just enough heat to fail federation scrapes more often, or to let them run for so long that it explains the increased number of open fd's. (More fd's should be triggered either by more open network connections or by larger LevelDB indices. The raw sample storage of Prometheus never opens more than a handful of files at the same time. Of course, you can debug this with ….)

There are several things you can do now to investigate. On the one hand, you can try to exclude possible causes: you could verify my theory above by downgrading to Prometheus 1.5 and checking the behavior, or you could run a Prometheus 1.6 starting with empty data. On the other hand, you can take a more analytic approach and analyze goroutine dumps and ….

What could help if it is really the retention cutoff that is killing you: set the ….

In totally different news: your aggregation above is problematic. If …
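(The metric names this comment points at were lost in the quoted text. As a hedged illustration only: Prometheus 1.x exposes its own local-storage health metrics, and expressions along the following lines are commonly graphed when chasing this kind of problem. The names below are assumed from the standard 1.x instrumentation, not quoted from this thread.)

```
# Series currently held in memory by the local storage.
prometheus_local_storage_memory_series

# Backlog of chunks waiting to be persisted; sustained growth means the
# storage cannot keep up (for example during retention cutoff).
prometheus_local_storage_chunks_to_persist

# Open file descriptors of the Prometheus process itself.
process_open_fds
```

Graphing such expressions before and after the upgrade makes it easier to tell whether persistence pressure or file descriptors are the limiting factor.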
PirminTapken commented Jun 2, 2017
Thanks for your detailed answer @beorn7! I would rule out retention as a source for this per se, as this particular setup has already been running for quite a while with retention set to 32 days. The amount of data ingested didn't change either (as can be seen from the disk space usage).

We will go ahead with downgrading to 1.5 to rule out (or confirm) that a change in 1.6.x caused this behaviour, and will then update the issue with the results.

Thanks for the hint about our aggregation; we will have another look at it in the future. This aggregation is done for all kinds of metrics, and in case it doesn't make sense, our internal customers have ways to apply their own rules to the raw metrics.
BugRoger commented Jun 11, 2017

"Coincidentally" we're running a very similar setup of two Prometheus servers federating data. In our case we are aggregating the …

For what it's worth, we're running into scrape timeouts as well after upgrading to 1.6. I don't have such a detailed investigation, since we just kept on increasing the timeout... Consider this a 'me too' and maybe a +1 on allowing a metric to be dropped before ingestion. I would also love to get rid of this dual setup just to filter/aggregate a label.
You can drop metrics before ingestion, though it's primarily intended as a tactical option while you're fixing the metrics: https://www.robustperception.io/dropping-metrics-at-scrape-time-with-prometheus/
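(The linked article describes the mechanism. As a minimal sketch, a scrape job that drops a metric before ingestion via metric_relabel_configs might look like the following; the job name, target, and metric name are placeholders, not taken from this issue.)

```yaml
scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor.example.com:8080']
    metric_relabel_configs:
      # Drop this metric after the scrape but before it is written to storage.
      - source_labels: [__name__]
        regex: 'container_network_tcp_usage_total'
        action: drop
```

Note that this happens per scrape, so the samples are still transferred over the network; they just never reach storage.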
BugRoger commented Jun 11, 2017
OK, yes, I should have been more specific. We want to aggregate on … Fixing this in cAdvisor is not easy/possible due to philosophical differences.
If your issue is the bandwidth, there's a planned Go client feature that'll let you select the metrics to return (Java and Python already have it). Whether that'll work in an exporter like cAdvisor, or in your case, isn't certain.
I think what @BugRoger wants is to run a recording rule on a metric and then drop it, as he is only interested in keeping the result of the rule but not the input metrics for the rule.
About the wider issue: the performance degradations reported here are happening in a setup where changes between 1.5 and 1.6 happened in various places (memory management, federation, probably more). None of the changes has an obvious catch that could cause a performance degradation, but of course any of them could have a surprising side effect. So this is difficult to debug further without sitting in front of an affected system.

All of the ~70 Prometheus servers I work with have improved in performance by migrating to 1.6, and I wasn't able to reproduce anything of what was reported here. We need you to do further investigation, as suggested above, to narrow down the possible causes.
auhlig commented Jun 12, 2017
@beorn7 Regarding your first comment: is there any way to achieve this?
@auhlig No, not with Prometheus as we know it. I think the described setup is the closest you can get, i.e. a short-retention Prometheus server from which you federate the aggregated results into a long-term Prometheus server.
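(As an illustration of that pattern rather than the configuration used in this issue: the long-term server's federation job would typically restrict itself to the pre-aggregated series via match[] parameters. The job name, matcher, and target below are assumptions.)

```yaml
scrape_configs:
  - job_name: 'federate-aggregates'
    honor_labels: true
    metrics_path: '/federate'
    params:
      # Only pull series produced by recording rules, assumed here to be
      # named with a "job:" prefix; the raw input series are never federated.
      'match[]':
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets: ['short-retention-prometheus:9090']
```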
Just a small update: we are still investigating and trying different setups (downgrading both, running only one server on 1.7.1, different memory settings). What we can clearly see is that everything works as before when downgrading both Prometheus servers to 1.5.2.
brian-brazil added the component/local storage label Jul 14, 2017
We updated the …. Here is a summary of what we tried:
…

Afterwards we updated the main Prometheus to 1.7.1 without changing any setting apart from ….

To sum things up: setting …. This issue may be closed, thanks for your help.
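(The setting discussed throughout the thread appears to be the heap-size target introduced in Prometheus 1.6, which the issue body calls target-heap-size. As a hedged sketch only, with example values rather than the reporter's: the flag takes a value in bytes, and roughly two thirds of physical memory is the usual recommendation.)

```
# Example Prometheus 1.6/1.7 startup flags on a machine with 64 GiB of RAM
# (values are illustrative, not taken from this issue):
prometheus \
  -config.file=/etc/prometheus/prometheus.yml \
  -storage.local.retention=768h \
  -storage.local.target-heap-size=45097156608   # ~42 GiB, about 2/3 of 64 GiB
```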
grobie closed this Nov 12, 2017


Bonko commented May 30, 2017
Setup
We have two Prometheus servers. The main Prometheus scrapes (federates), among other targets, the federation Prometheus, which collects metrics from our microservices.
What did you do?
… set target-heap-size to ~2/3 of total memory.
What did you expect to see?
What did you see instead? Under which circumstances?
This improved after the update to 1.6.3 but is still much higher than with 1.5. All graphs are from the main Prometheus.

Scrape duration (excerpt):
Storage:
Rule eval duration:
Load and memory:

We observe this behaviour on both systems.
Environment
System information:
Prometheus version:
Logs:
On the federation Prometheus we can see the following appearing in the logs since the upgrade:
Questions
We would like to know whether the higher scrape duration is expected behaviour, or what we can do to mitigate it, as the increase leads to timeouts and missing metrics in some cases.