
Big increase of scrape duration and other durations after update to 1.6 #2782

Closed

Bonko opened this Issue May 30, 2017 · 18 comments

Bonko commented May 30, 2017

Setup

We have two Prometheus servers. The main Prometheus scrapes (federates), among other targets, the federation Prometheus, which collects metrics from our microservices.

What did you do?

  • May 17: main Prometheus: updated Prometheus from 1.5.2 to 1.6.2
  • May 18: federation Prometheus: updated Prometheus from 1.5.2 to 1.6.2
  • May 24: updated both Prometheus servers from 1.6.2 to 1.6.3
  • Set target-heap-size to ~2/3 of total memory (see the sketch below)
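For reference, a rough sketch of that heap-size setting on a host with 32 GiB of RAM (the number is illustrative; -storage.local.target-heap-size takes a value in bytes, and the config path is a placeholder):

  # illustrative: ~2/3 of 32 GiB as the target heap size
  prometheus \
    -config.file=prometheus.yml \
    -storage.local.target-heap-size=22906492245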

What did you expect to see?

  • Scrape durations and rule evaluations stay unchanged after the migration.
  • CPU utilization should increase as mentioned in the changelog.

What did you see instead? Under which circumstances?

  • Rule evaluation and all scrape durations increased sharply.

This improved after the update to 1.6.3, but durations are still much higher than with 1.5.

All graphs are from the main prometheus.

Scrape Duration (excerpt): [graph omitted]

Storage: [graph omitted]

Rule eval duration: [graph omitted]

Load and memory: [graph omitted]

We observe this behaviour on both systems.

Environment

  • System information:

    • main prometheus
    Linux 3.16.0-4-amd64 x86_64
    
    • federation prometheus
    Linux 4.9.24-coreos x86_64
    
  • Prometheus version:

    • main prometheus
    prometheus, version 1.6.3 (branch: HEAD, revision: c580b60c67f2c5f6b638c3322161bcdf6d68d7fc)
      build user:       root@0d7d8db544c3
      build date:       20170524-12:21:10
      go version:       go1.8.1
    
    • federation prometheus
    prometheus, version 1.6.3 (branch: master, revision: c580b60c67f2c5f6b638c3322161bcdf6d68d7fc)
      build user:       root@a6410e65f5c7
      build date:       20170522-09:15:06
      go version:       go1.8.1
    
  • Logs:

On the federation Prometheus, the following has been appearing in the logs since the upgrade:

level=error msg="federation failed" err="write tcp 127.0.0.1:9090->127.0.0.1:40082: write: broken pipe"

Questions

We would like to know whether the higher scrape duration is expected behaviour, and what we can do to mitigate it, as the increase leads to timeouts and missing metrics in some cases.

beorn7 (Member) commented May 31, 2017

It would be interesting to look at plots of prometheus_local_storage_memory_chunks and prometheus_local_storage_memory_series over the same time range as your other graphs.

beorn7 (Member) commented May 31, 2017

Also process_open_fds. Perhaps there is some FD leaking happening with Prometheus.

What happens if you let a Prometheus server run with the same setup as your main Prometheus but you don't point a federating Prometheus to it? (To find out if it has something to do with federation.)

Bonko (Author) commented Jun 1, 2017

Here are the requested graphs from our main Prometheus:

[graphs omitted]

We would like to avoid a dual-running setup for now, as we are already running into timeouts when scraping metrics, and doubling the load on the scraped endpoints might actually make things worse.

To clarify things:

Our setup looks like this (simplified):

[diagram omitted]

The metrics we shared in this issue were taken from the main Prometheus.

If you would like, we can provide the same metrics for our federation Prometheus, but they look the same (apart from different resources and therefore a different memory configuration). The federation Prometheus is configured with a retention of only 30 minutes and handles everything in memory (ramdisk).

beorn7 (Member) commented Jun 1, 2017

OK, so the misunderstanding was that I thought the Federation Prometheus is the one that scrapes /federate. I'll use your terminology from now on.

Am I right to assume that the Main Prometheus federates everything that is in the Federation Prometheus?

Bonko (Author) commented Jun 1, 2017

No, we don't federate everything.

Basically, the Federation Prometheus automatically creates five recording rules per metric in order to get rid of the instance label, which would otherwise result in very high cardinality. For example:

job:api_requests_total:sum{} = sum(api_requests_total) WITHOUT (instance)
job:api_requests_total:min{} = min(api_requests_total) WITHOUT (instance)
job:api_requests_total:max{} = max(api_requests_total) WITHOUT (instance)
job:api_requests_total:avg{} = avg(api_requests_total) WITHOUT (instance)
job:api_requests_total:count{} = count(api_requests_total) WITHOUT (instance)

(We know that not every combination makes sense, but it was easier to automate this way).

The Main Prometheus federates all those pre-aggregated metrics. We split this across several scrape jobs because otherwise a single scrape would take too long.
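A rough sketch of what such a federation scrape job could look like, assuming (as in the examples above) that all aggregated series carry a job: prefix; the job name and target hostname are placeholders:

  scrape_configs:
    - job_name: 'federate-aggregates'        # placeholder job name
      honor_labels: true
      metrics_path: '/federate'
      params:
        'match[]':
          - '{__name__=~"job:.*"}'           # pull only the pre-aggregated series
      static_configs:
        - targets:
            - 'federation-prometheus:9090'   # placeholder target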

beorn7 (Member) commented Jun 1, 2017

From your graphs, I see ~350k series right after the restart. That's indeed a lot of samples to federate and might cause trouble, no matter what version of Prometheus you are running. This might all be triggered by a change unrelated to the version upgrade, which pushed your server over the cliff.

For one, I see a steep increase in disk I/O on 5/20, which correlates with a steep increase in load average. Something changed there, and it was neither a restart of the server nor a version upgrade. The first suspect would be retention, i.e. on that day your Main Prometheus had for the first time enough history to start deleting metrics data. This is, sadly, very expensive by design. (Good news: retention cutoff will be almost free with Prometheus 2.) Perhaps that gave your Main Prometheus just enough heat to fail federation scrapes more often, or to let them run for so long that it explains the increased number of open FDs.

(More FDs should be triggered either by more open network connections or by larger LevelDB indices. The raw sample storage of Prometheus never opens more than a handful of files at the same time. Of course, you can debug this with lsof etc. The increase in FDs is a smoking gun in any case.) Since the FD count is increased but not constantly rising, I'd assume this isn't an FD leak. (Initially I thought a failed federation scrape might leak an FD.)

There are several things you can do now to investigate. On the one hand, you can try to exclude possible causes: you could verify my theory above by downgrading to Prometheus 1.5 and checking the behavior, or you could run a Prometheus 1.6 starting with empty data. On the other hand, you can take a more analytic approach and analyze goroutine dumps, lsof output, etc. to see where FDs are used and where the server spends all its CPU cycles.
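For the analytic approach, a minimal sketch of what to collect (assuming the server listens on the default port 9090 and its built-in pprof endpoints are reachable):

  # goroutine dump from the built-in pprof handler
  curl -s 'http://localhost:9090/debug/pprof/goroutine?debug=2' > goroutines.txt

  # open file descriptors of the Prometheus process
  lsof -p "$(pidof prometheus)" > prometheus-fds.txt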

What could help if it is really the retention cutoff that is killing you: Set the -storage.local.series-file-shrink-ratio flag to a much higher value than the default 0.1, like 0.3 or 0.5. This will increase the disk space you need but reduce the disk throughput consumed.
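For illustration, the flag would be added to the existing command line roughly like this (config path is a placeholder; keep all your other flags as they are):

  prometheus \
    -config.file=prometheus.yml \
    -storage.local.series-file-shrink-ratio=0.3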

In totally different news: Your aggregation above is problematic. If api_requests_total is really a counter, the aggregations you are performing will disrupt counter reset detection. In general, you should first take a rate of a counter before applying aggregations.
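A sketch of the recommended pattern, in the same rule syntax as above (the 5m range is an arbitrary choice):

  job:api_requests:rate5m = sum(rate(api_requests_total[5m])) WITHOUT (instance)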

PirminTapken commented Jun 2, 2017

Thanks for your detailed answer, @beorn7!

I would rule out retention as a cause per se, as this particular setup has already been running for quite a while with retention set to 32 days. The amount of data ingested didn't change either (as can be seen from the disk space usage).

We will go ahead with downgrading to 1.5 to rule out (or confirm) that a change in 1.6.x caused this behaviour. We will then update the issue with the results.

Thanks for the hint about our aggregation. We will have another look at this in the future. This aggregation is done for all kinds of metrics, and where it doesn't make sense our internal customers have ways to apply their own rules to the raw metrics.

BugRoger commented Jun 11, 2017

"Coincidentally" we're running a very similar setup of two Prometheus federating data. In our case we are aggregating the cpu label in cadvisor metrics. 😑

For what it's worth we're running into scrape timeouts as well after upgrading to 1.6. I don't have such a detailed investigation, since we just kept on increasing the timeout... Consider this a 'me too' and maybe a +1 on allowing to drop a metric before ingestion. I would also love to get rid of this dual setup just to filter/aggregate a label.

brian-brazil (Member) commented Jun 11, 2017

You can drop metrics before ingestion, though it's primarily intended as a tactical option while you're fixing the metrics. https://www.robustperception.io/dropping-metrics-at-scrape-time-with-prometheus/
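For reference, a minimal sketch of dropping a metric at scrape time via metric_relabel_configs (job name, target, and the metric regex are placeholders):

  scrape_configs:
    - job_name: 'example'                 # placeholder
      static_configs:
        - targets: ['localhost:9100']     # placeholder
      metric_relabel_configs:
        - source_labels: [__name__]
          regex: 'some_noisy_metric.*'    # metric(s) to drop before ingestion
          action: drop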

BugRoger commented Jun 11, 2017

Ok, yes. I should have been more specific. We want to aggregate on cpu and then drop the original metric. In order to aggregate it, it needs to be ingested. So, we ingest it. Aggregate it. Then drop it during federation. See also: google/cadvisor#1513

Fixing this in cAdvisor is not easy/possible due to philosophical differences.
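A rough sketch of that approach (the metric name is from cAdvisor; the rule and the federation match expression are illustrative): aggregate the cpu label away with a recording rule, then federate only the aggregated series so the per-cpu originals never reach the second server.

  # recording rule on the scraping Prometheus
  instance:container_cpu_usage_seconds:rate5m = sum(rate(container_cpu_usage_seconds_total[5m])) WITHOUT (cpu)

  # on the federating Prometheus, match only the aggregated series, e.g.
  # 'match[]': ['{__name__=~"instance:.*"}']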

brian-brazil (Member) commented Jun 11, 2017

If your issue is the bandwidth, there's a planned Go client feature that'll let you select the metrics to return (Java and Python already have it). Whether that'll work in an exporter like cAdvisor, or in your case, isn't certain.

beorn7 (Member) commented Jun 12, 2017

I think what @BugRoger wants is to run a recording rule on a metric and then drop it, as he is only interested in keeping the result of the rule, not the input metrics for the rule.

beorn7 (Member) commented Jun 12, 2017

About the wider issue: the performance degradations reported here are happening in a setup where changes between 1.5 and 1.6 have happened in various places (memory management, federation, probably more). None of those changes has an obvious catch that could cause a performance degradation, but of course any of them could have a surprising side effect. So this is difficult to debug further without sitting in front of an affected system. All of the ~70 Prometheus servers I work with have improved in performance by migrating to 1.6, and I wasn't able to reproduce anything of what was reported here. We need you to do further investigation as suggested above to narrow down the possible causes.

auhlig commented Jun 12, 2017

@beorn7 Regarding your 1st comment: Any way to achieve this?

beorn7 (Member) commented Jun 12, 2017

@auhlig No, not with Prometheus as we know it. I think the described set-up is the closest you can have, i.e. have a short-retention Prometheus server from which you federate the aggregated results into a long-term Prometheus server.

Bonko (Author) commented Jun 29, 2017

Just a small update:

We are still investigating and trying different setups (downgrading both, running only one server on 1.7.1, different memory settings).

What we can clearly see is that everything works as before when we downgrade both Prometheus servers to 1.5.2.

Bonko (Author) commented Aug 9, 2017

We updated the federation Prometheus back to 1.7.1 and tried to optimize it.

Here is a summary of what we tried:

  • We changed the Prometheus data volume from tmpfs to an EBS volume (with enough IOPS provisioned) -> performance improved
  • Raised storage.local.num-fingerprint-mutexes to 40960 -> performance improved even more
  • Raised storage.local.num-fingerprint-mutexes to 61440 -> performance got worse again -> reverted
  • Set target-heap-size to 2/3 of the memory Prometheus may use -> nothing changed (it was nearly the same as the value automatically calculated by Prometheus)
  • Set GOGC to 100, which was the default value in Prometheus 1.5 -> performance improved a lot, but was still worse than in 1.5 (keep in mind: retention is only 30 minutes)
  • Replaced EBS with tmpfs again while GOGC was still set to 100 and storage.local.num-fingerprint-mutexes was still set to 40960 -> performance nearly as good as in Prometheus 1.5 (apart from some spikes while scraping)
  • Set storage.local.num-fingerprint-mutexes back to the default of 4096 -> performance improved a bit

Afterwards we updated the main Prometheus to 1.7.1 without changing any setting apart from target-heap-size, and everything was OK again.

To sum things up: setting GOGC=100 on the federation Prometheus fixed this for us.
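For reference, the final working combination could be expressed roughly like this (flag values taken from the list above; config path and heap size are placeholders):

  GOGC=100 prometheus \
    -config.file=prometheus.yml \
    -storage.local.retention=30m \
    -storage.local.num-fingerprint-mutexes=4096 \
    -storage.local.target-heap-size=<~2/3 of available memory, in bytes>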

This issue may be closed, thanks for your help.

grobie closed this Nov 12, 2017

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 23, 2019