
heavy memory usage #3774

Closed
tonobo opened this issue Jan 31, 2018 · 13 comments
@tonobo

tonobo commented Jan 31, 2018

The Prometheus server is currently allocating the whole 30GB of series data in memory. By just throwing hardware at it, it's back up and faster than before ;D. But in general that's not an option 😄

What did you do?

Queries, often!

Environment

  • System information:
level=info ts=2018-01-31T19:11:46.652846354Z caller=main.go:225 msg="Starting Prometheus" version="(version=2.1.0, branch=HEAD, revision=85f23d82a045d103ea7f3c89a91fba4a93e6367a)"
level=info ts=2018-01-31T19:11:46.65287715Z caller=main.go:226 build_context="(go=go1.9.2, user=root@6e784304d3ff, date=20180119-12:01:23)"
level=info ts=2018-01-31T19:11:46.6528934Z caller=main.go:227 host_details="(Linux 4.4.0-109-generic #132-Ubuntu SMP Tue Jan 9 19:52:39 UTC 2018 x86_64 node3 (none))"
level=info ts=2018-01-31T19:11:46.652904858Z caller=main.go:228 fd_limits="(soft=1048576, hard=1048576)"
  • Prometheus configuration file:
alerting:
    alertmanagers:
    -   static_configs:
        -   targets:
            -  localhost:9095
global:
    evaluation_interval: 15s
    scrape_interval: 15s
rule_files:
- /etc/prometheus/git_rules.d/*.yml
scrape_configs:
-   file_sd_configs:
    -   files:
        - /etc/prometheus/*.json
    honor_labels: true
    job_name: metrics
    scrape_interval: 2s
    scrape_timeout: 1s
  • Logs:

[screenshots attached]

@krasi-georgiev
Contributor

Yes, PromQL definitely needs optimising, and there is already an idea to use streaming:
#3690
Might be best to close this one and subscribe to the discussion there.

I am also trying a few things, but this will take time.

@brian-brazil
Contributor

There's insufficient information here to know if anything is wrong. What is your ingestion rate, and what sort of queries are you using? In particular, are they pulling in hours or days of data?

@tonobo
Author

tonobo commented Jan 31, 2018

99.5% of the queries are just pulling the last 5 minutes of data. No complex aggregations are performed either, and there is no notable CPU load.

@brian-brazil
Contributor

What about the other 0.5%? What is your ingestion rate? There's still insufficient information to know if this is unexpected.

Can you run this dashboard against your Prometheus: https://grafana.com/dashboards/3834
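For readers wondering how to answer the ingestion-rate question: it can be read from Prometheus's own metrics, assuming the server scrapes itself. A minimal sketch using the standard Prometheus 2.x TSDB metric names:

# Samples ingested per second, averaged over the last 5 minutes
rate(prometheus_tsdb_head_samples_appended_total[5m])

# Number of series currently held in memory (the TSDB head block)
prometheus_tsdb_head_series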

@tonobo
Author

tonobo commented Jan 31, 2018

[screenshot]

@brian-brazil
Contributor

It looks like you have some expensive rules that are taking too long to evaluate. 2.1 will show you how long each rule is taking, so you can pinpoint which rule it is. That may be what's pulling in all the data.

@tonobo
Author

tonobo commented Jan 31, 2018

Thanks a lot. Great support guys.

[screenshot]

@tonobo tonobo closed this as completed Jan 31, 2018
@tonobo
Author

tonobo commented Feb 1, 2018

I've done further investigation. The rule evaluation time is around 20s, but the memory seems to stay allocated for several minutes.

record: statistic_cpu_intensive
expr: avg(avg_over_time(cpu_percent[12h])) BY (target) > 95

I know the query needs to load all matching series, but this won't scale well if it is done entirely in memory. E.g. PostgreSQL allows configuring a memory limit; larger allocations trigger disk-based aggregations instead.
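One workaround for keeping a 12h average from reading 12 hours of raw samples in a single evaluation is to chain recording rules: record a cheap short-window average first, then average the recorded series over the long window. A rough sketch, not from this thread; the rule name target:cpu_percent:avg5m is illustrative, and the average of averages is only an approximation when sample counts per window differ. With the 2s scrape interval and 15s evaluation interval from the config above, the recorded series is much sparser than the raw data:

groups:
- name: cpu_aggregation
  rules:
  # Cheap pre-aggregation, evaluated every evaluation_interval (15s)
  - record: target:cpu_percent:avg5m
    expr: avg_over_time(cpu_percent[5m])
  # The expensive 12h window now reads the sparser recorded series
  - record: statistic_cpu_intensive
    expr: avg(avg_over_time(target:cpu_percent:avg5m[12h])) by (target)

(The > 95 comparison from the original rule is discussed in the next comment.)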

@brian-brazil
Contributor

brian-brazil commented Feb 1, 2018

Prometheus doesn't spill to disk, and if you have a query so expensive that it doesn't fit in RAM, you should probably be doing it outside of Prometheus, as that's getting into heavy reporting use cases.

That rule is also filtering, which is not a good idea; you should only filter in alerting rules. You want the bool modifier there.
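For reference, a minimal sketch of the two variants being described here, based on the rule quoted above (the alert name and for: duration are illustrative):

# Recording rule: with the bool modifier every target gets a 0/1 sample,
# instead of series silently disappearing while they are below the threshold
record: statistic_cpu_intensive
expr: avg(avg_over_time(cpu_percent[12h])) by (target) > bool 95

# Alerting rule: filtering with a plain comparison is appropriate here
alert: CpuIntensive
expr: avg(avg_over_time(cpu_percent[12h])) by (target) > 95
for: 15m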

@tonobo
Author

tonobo commented Feb 1, 2018

Ah ok, good to know.

@dobesv

dobesv commented Apr 6, 2018

I'm running Prometheus 2.2 now. How would I see how long each rule is taking, so I can pinpoint which rule is taking too long (if any)?

@genericgithubuser

For those running 2.2 and trying to find how long each rule is taking, the run time per rule is shown at http://<your-prometheus>/rules, which gives a quick way to identify the more expensive rules.
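If browsing the /rules page is not convenient, a rough view is also available from Prometheus's self-monitoring metrics (per rule group, not per individual rule); a small sketch, assuming Prometheus scrapes itself and the metric name matches your 2.x release:

# Top 5 rule groups by duration of their last evaluation (seconds)
topk(5, prometheus_rule_group_last_duration_seconds)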

@lock

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 22, 2019