Storage slow down within 2h range #3604
Comments
Is it possible that there is a lot of churn in the Prometheus being scraped? If yes, then this is expected behavior; otherwise it is not.
gouthamve added the kind/more-info-needed label on Jan 18, 2018
@gouthamve No. Both Prometheus instances have static setups (they use Consul for SD, but the set of services is mostly static and they rarely restart). We have some metrics where new label values may appear over time (different error values), but they stabilise within several hours to a day after a service restarts. Yet the slowdown continues to be visible even with a completely static set of time series (after all service instances have been running for a long time without a restart).
Can you share a snapshot of the benchmark dashboard for this Prometheus? https://grafana.com/dashboards/3834
Sure. Here is the snapshot: https://snapshot.raintank.io/dashboard/snapshot/SSf8lrDSf9olUTlJWncACKnso8Ja17nO?orgId=2 JFYI, since the bug was filed we've started to federate a lot more data (more Prometheus instances and more metrics from each of them) and started to hit scrape timeouts on some targets. So I've split federation from each Prometheus into several jobs (one per job='' label), and the problem is no longer clearly visible when looking at individual scrape_duration_seconds metrics. But the sum of them clearly shows the pattern:
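For readers unfamiliar with this kind of split: a federation scrape config broken up per upstream job might look roughly like the sketch below. This is only an illustration under assumptions; the job names, targets, and match[] selectors are placeholders, not the actual configuration from this report.

```yaml
scrape_configs:
  # One federation scrape job per upstream `job` label, instead of a single
  # job that pulls everything. Names, targets and selectors are placeholders.
  - job_name: 'federate-node'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node"}'          # only this upstream job's series
    static_configs:
      - targets:
          - 'source-prometheus-1:9090'
          - 'source-prometheus-2:9090'

  - job_name: 'federate-api'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="api"}'
    static_configs:
      - targets:
          - 'source-prometheus-1:9090'
          - 'source-prometheus-2:9090'
```

With such a split each individual scrape is small, so the growth is easier to see in an aggregate such as `sum(scrape_duration_seconds{job=~"federate-.*"})`; the label matcher here is likewise an assumption about the job naming.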
This is an abuse of federation. However, given you have ~no churn, the time federation takes should be constant, as you're only looking at the latest chunk of every series.
brian-brazil added the kind/enhancement, component/local storage and priority/P3 labels and removed the kind/more-info-needed label on Jun 13, 2018
This looks stale, and we haven't had any other reports of such degradation, so it looks like a configuration issue.
krasi-georgiev closed this on Dec 5, 2018
@krasi-georgiev It still reproduces, but it doesn't affect our Prometheus setup in any way. I just thought it might be something worth investigating.
@pborzenkov thanks for the update. Can you confirm that you have no churn, as Brian suggested?
krasi-georgiev reopened this on Dec 5, 2018
It might also be worth pinging me on IRC to speed up the troubleshooting.
We have no churn. We have some metrics where new label values might appear over time, but usually the process stabilises after several hours and the set of time series stays constant until a service restarts. The restarts are quite rare. Here is the "Prometheus benchmark" dashboard snapshot for the same time period (Grafana is GMT+3). https://snapshot.raintank.io/dashboard/snapshot/XaSMZMpeIoBSVCV2gzlHO7DX8IzanwrW?orgId=2
Sure, will do.
There have been quite a few fixes in tsdb and Prometheus in recent weeks, and although I can't think of anything related to what you are seeing here, it might be worth trying master to see if it exhibits the same behaviour.
krasi-georgiev added the kind/bug label and removed the kind/enhancement label on Dec 5, 2018
Yes, I'm wondering if prometheus/tsdb@d2aa6ff fixes this.
We are currently running 2.4.2. I'll update our main setup to 2.5 (it doesn't include that fix). BTW, can this issue be affected by the source Prometheus instances? I believe they might be quite old (2.2 or 2.3, I'll check later).
I think only the last node (the federated one in your case) can delay the scrapes.
I've been running 'prometheus-2.6.0-rc.0' on a fresh DB with the same config as our main setup for the past day, and it shows the very same scrape slowdown within the 2h window: @krasi-georgiev I'll ping you on IRC a bit later if you are still interested.
OK, I am working on a few other PRs right now, but ping me on IRC in a few days and we'll continue the troubleshooting.
@pborzenkov can you test this with the 2.7 release? If it's still an issue, please ping me on IRC so I can try to replicate it and then search for the culprit.
@pborzenkov have you had a chance to test it with the latest release?
@krasi-georgiev We haven't updated all of our DCs to the latest Prometheus release yet, but the ones that have been updated to 2.7.1 don't experience any noticeable slowdown within the 2h window anymore. So, I assume, the issue can be closed now. Thanks for your help!
np, thanks for the update!



pborzenkov commented on Dec 20, 2017
What did you do?
I have a Prometheus instance that's configured to federate a lot of metrics from 2 other instances (~190000 samples each). I know this isn't the way federation is supposed to be used, but I still want to share the behaviour I observe.
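For context, this setup corresponds roughly to a single broad federation scrape job against the two source instances. A minimal sketch under assumptions (the targets, interval, and match[] selector below are placeholders, not the actual config):

```yaml
scrape_configs:
  # Single federation job pulling a large selection of series (~190k samples
  # per source instance). Targets, interval and selector are placeholders.
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~".+"}'      # very broad selector: federate (almost) everything
    static_configs:
      - targets:
          - 'source-prometheus-1:9090'
          - 'source-prometheus-2:9090'
```

The /federate endpoint returns the most recent sample of every matched series, so with a broad selector like this the scrape size scales with the total number of series on the source instances.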
What did you expect to see?
scrape_duration_seconds doesn't change considerably with time
What did you see instead? Under which circumstances?
I see scrape_duration_seconds increase gradually within the 2h window and then drop back to the original value (obviously, once a new TSDB block is started). See the attached screenshot.
Environment
System information:
Linux 4.13.0-17-generic x86_64
Prometheus version:
prometheus, version 2.0.0 (branch: HEAD, revision: 0a74f98)
build user: root@615b82cb36b6
build date: 20171108-07:11:59
go version: go1.9.2
Is this expected behaviour? The set of metrics doesn't change with time, so I expected TSDB to perform equally well regardless of the actual file size. If this is expected behaviour, please just close the issue, and sorry for the noise.