
Storage slow down within 2h range #3604

Closed
pborzenkov opened this Issue Dec 20, 2017 · 21 comments

pborzenkov commented Dec 20, 2017

What did you do?
I have a Prometheus instance that's configured to federate a lot of metrics from 2 other instances (~190000 samples each). I know this isn't the way federation is supposed to be used, but I still want to share the behaviour I observe.

What did you expect to see?
scrape_duration_seconds doesn't change considerably with time

What did you see instead? Under which circumstances?
I see scrape_duration_seconds increase gradually within each 2h window and then drop to the original value (obviously, after a new TSDB block is started). See the attached screenshot.

[screenshot from 2017-12-20 14:43: scrape_duration_seconds rising and resetting every 2h]
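For reference, the graph is just the per-target scrape duration of the federation job; a sketch of the query, using the 'federate' job name from the configuration below:

  scrape_duration_seconds{job="federate"}

Plotted over a range spanning several 2h TSDB blocks, the sawtooth aligned with block boundaries is easy to see.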

Environment

  • System information:
    Linux 4.13.0-17-generic x86_64

  • Prometheus version:
    prometheus, version 2.0.0 (branch: HEAD, revision: 0a74f98)
    build user: root@615b82cb36b6
    build date: 20171108-07:11:59
    go version: go1.9.2

  • Prometheus configuration file:

# vim: set shiftwidth=2 tabstop=2 expandtab:

global:
  scrape_interval: 30s
  scrape_timeout: 20s
  evaluation_interval: 30s

rule_files:
  - '/etc/prometheus/config/*.rules'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['127.0.0.1:9090']

  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'

    params:
      'match[]':
        - '{__name__=~"[a-zA-Z].*"}'

    static_configs:
      - targets:
        - '<redacted>'
        - '<redacted>'

alerting:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets:
          - "<redacted>"

Is this expected behaviour? The set of metrics doesn't change with time, so I expected TSDB to perform equally well regardless of the actual file size. If this is expected behaviour, please just close the issue, and sorry for the noise.
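For what it's worth, the assumption that the series set stays static can be checked directly on the federating instance: its head series count should stay flat across a 2h block even while scrape_duration_seconds climbs. A sketch, using the standard tsdb self-metric and the 'prometheus' self-scrape job from the config above:

  prometheus_tsdb_head_series{job="prometheus"}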

gouthamve commented Jan 18, 2018

Is it possible that there is a lot of churn involved in the Prometheus instances being scraped? If yes, then this is expected behaviour; if not, it isn't.
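A quick way to check for churn on the scraped instances is to watch their series-creation counter; a value near zero outside of restarts means effectively no churn. A sketch, run against each federated source:

  sum(rate(prometheus_tsdb_head_series_created_total[5m]))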

pborzenkov commented Jan 18, 2018

@gouthamve No. Both Prometheus instances have static setups (they use Consul for SD, but the set of services is mostly static and they rarely restart).

We have some metrics where new label values may appear over time (different error values), but they stabilise within several hours to a day after a service restarts. Yet the slowdown continues to be visible even with a completely static set of time series (after all service instances have been running for a long time without a restart).

brian-brazil commented Jun 13, 2018

Can you share a snapshot of the benchmark dashboard for this Prometheus? https://grafana.com/dashboards/3834

pborzenkov commented Jun 13, 2018

Sure. Here is the snapshot:

https://snapshot.raintank.io/dashboard/snapshot/SSf8lrDSf9olUTlJWncACKnso8Ja17nO?orgId=2

JFYI, since the bug was filed we've started to federate a lot more data (more Prometheus instances and more metrics from each of them) and started to hit scrape timeouts on some targets. So I've split the federation from each Prometheus into several jobs (one per job='' label), and the problem is no longer as clearly visible when looking at individual scrape_duration_seconds metrics. But the sum of them clearly shows the pattern:

[screenshot from 2018-06-13 17:50: summed scrape_duration_seconds showing the same 2h pattern]
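The summed view above corresponds to a query along these lines (a sketch; the federate.* job-name pattern is an assumption about how the split jobs are named):

  sum(scrape_duration_seconds{job=~"federate.*"})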

brian-brazil commented Jun 13, 2018

This is an abuse of federation; however, given you have ~no churn, the time federation takes should be constant, as you're only looking at the latest chunk of every series.

krasi-georgiev commented Dec 5, 2018

This looks stale and we haven't had any other reports of such degradation, so it looks like a configuration issue.
@pborzenkov feel free to reopen if you want to continue troubleshooting this.

pborzenkov commented Dec 5, 2018

@krasi-georgiev It still reproduces, but it doesn't affect our Prometheus setup in any way. I just thought it might be something worth investigating.

[screenshot from 2018-12-05 12:34: scrape_duration_seconds still showing the 2h pattern]

krasi-georgiev commented Dec 5, 2018

@pborzenkov thanks for the update. Can you confirm that you have no churn as Brian suggested?

krasi-georgiev reopened this Dec 5, 2018

krasi-georgiev commented Dec 5, 2018

Also, it might be worth pinging me on IRC (#prometheus-dev) to speed up the troubleshooting.

pborzenkov commented Dec 5, 2018

> @pborzenkov thanks for the update. Can you confirm that you have no churn as Brian suggested?

We have no churn. We have some metrics where new label values might appear over time, but the process usually stabilises after several hours, and the set of time series then stays constant until a service restarts. Restarts are quite rare.

Here is the "Prometheus benchmark" dashboard snapshot for the same time period (Grafana is GMT+3).

https://snapshot.raintank.io/dashboard/snapshot/XaSMZMpeIoBSVCV2gzlHO7DX8IzanwrW?orgId=2

pborzenkov commented Dec 5, 2018

> Also, it might be worth pinging me on IRC (#prometheus-dev) to speed up the troubleshooting.

Sure, will do.

krasi-georgiev commented Dec 5, 2018

There have been quite a few fixes in tsdb and Prometheus in recent weeks, and although I can't think of anything related to what you are seeing here, it might be worth trying master to see if it exhibits the same behaviour.

brian-brazil commented Dec 5, 2018

Yes, I'm wondering if prometheus/tsdb@d2aa6ff fixes this.

pborzenkov commented Dec 5, 2018

@krasi-georgiev @brian-brazil

We are currently running 2.4.2. I'll update our main setup to 2.5 (it doesn't include prometheus/tsdb@d2aa6ff, but it's time to update anyway) and, if nothing changes, bring up a new Prometheus instance built from the master branch on a separate machine.

BTW, can this issue be affected by the source Prometheus instances? I believe they might be quite old (2.2 or 2.3, I'll check later).
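One quick way to check without logging into those hosts: the catch-all match[] regex in the federation config also pulls in prometheus_build_info from the sources, so querying it on the federating instance should show their version labels (assuming honor_labels keeps the source job/instance labels, as configured above). A sketch:

  prometheus_build_info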

krasi-georgiev commented Dec 5, 2018

I think only the last node (the federated one in your case) can delay the scrapes.

pborzenkov commented Dec 11, 2018

I've been running prometheus-2.6.0-rc.0 on a fresh DB with the same config as our main setup for the past day, and it shows the very same scrape slowdown within the 2h window:

[screenshot from 2018-12-11 11:28: scrape_duration_seconds on 2.6.0-rc.0, same 2h slowdown]

@krasi-georgiev I'll ping you on IRC a bit later if you are still interested.

krasi-georgiev commented Dec 11, 2018

OK, I am working on a few other PRs right now, but ping me on IRC in a few days and we will continue the troubleshooting.

krasi-georgiev commented Jan 25, 2019

@pborzenkov Can you test this with the 2.7 release? If it's still an issue, please ping me on IRC so I can try to replicate it and then search for the culprit.

krasi-georgiev commented Feb 27, 2019

@pborzenkov have you had a chance to test it with the latest release?

pborzenkov commented Feb 27, 2019

@krasi-georgiev We haven't updated all of our DCs to the latest Prometheus release yet, but the ones that have been updated to 2.7.1 don't experience any noticeable slowdown within the 2h window anymore.

So I assume the issue can be closed now. Thanks for your help!

krasi-georgiev commented Feb 27, 2019

np, thanks for the update!
