
Expensive recording rules cause high memory/malloc use in 2.0.0 #3450

Closed
SuperQ opened this Issue Nov 9, 2017 · 7 comments

SuperQ (Member) commented Nov 9, 2017

What did you do?

Have some recording rules that require a large amount of data to evaluate.

What did you expect to see?

Moderate heap use/growth.

What did you see instead? Under which circumstances?

Large heap use/growth.

Environment

This recording rule requires about 12k metrics, which is admittedly expensive, but the heap grows to 10-15GB.

groups:
- name: recordings/recording.rules
  rules:
  - record: gitaly:grpc_server_handled_total:error_avg_rate12h
    expr: avg(rate(grpc_server_handled_total{grpc_code!="OK"}[12h])) BY (job, grpc_method,
      environment)
• System information:

  insert output of uname -srm here

• Prometheus version:

  prometheus, version 2.0.0 (branch: HEAD, revision: 0a74f98628a0463dddc90528220c94de5032d1a0)
    build user:       root@615b82cb36b6
    build date:       20171108-07:11:59
    go version:       go1.9.2

See attached pprof heap.svg.gz.
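
For scale, a rough sketch of what a single evaluation of this rule has to load. The count() query is just an ad-hoc way to confirm the series count from the expression browser, and the 15s scrape interval is an assumption, not something stated above:

# Should return roughly the ~12k series mentioned above.
count(grpc_server_handled_total{grpc_code!="OK"})

# At an assumed 15s scrape interval, the 12h range selector covers
# 12 * 3600 / 15 = 2,880 samples per series, i.e. about
# 12,000 * 2,880 ≈ 34.6 million samples loaded per rule evaluation.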

beorn7 (Member) commented Nov 9, 2017

Was the memory usage any different in Prometheus 1.x?

I'd assume the allocations happen in the query layer, so 1.x and 2.x should be affected in the same way.

And BTW, we have found predict_linear and deriv to be especially prone to allocating loads of memory when run over long time frames.
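
For illustration, a rule of the shape that tends to trigger this; the metric name, label, and window below are made up for the example and not taken from this issue:

groups:
- name: illustrative.rules
  rules:
  # Illustrative only: the long range selector makes the engine load every
  # sample in the 6h window for every matching series before the function runs.
  - record: job:node_filesystem_free:predict1h
    expr: predict_linear(node_filesystem_free{job="node"}[6h], 3600)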

SuperQ (Member, Author) commented Nov 9, 2017

Attaching the heap output for 1.8.1.
heap-1.8.2.svg.gz.

I don't think it's much different, but it was being masked by the heap use from the target-heap-size setting.

brian-brazil (Member) commented Nov 9, 2017

I would expect it's the time frame rather than the function. Both of those should be constant memory, as there's just a handful of floats to be tracked.

fabxc (Member) commented Nov 12, 2017

So is this 2.0 specific or not? If not, we should remove the 2.0 tag, since then it's not a regression but a general issue.

discordianfish (Member) commented Nov 15, 2017

I also have issues with memory usage on 2.0, so I'm not sure it's unrelated.
Here is my Prometheus 2.0 memory usage (I was running beta2 until today, but it is still running OOM):

It's possible that something in my infra caused this increase, but I can't explain it. I assume the only thing that could cause such growth would be a growing number of time series, but I'm not sure if that's what the open head series count shows?
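
If it helps narrow that down, the active series count is exported by Prometheus itself; a minimal sketch of the two queries to compare, assuming Prometheus scrapes its own metrics as in the default config:

# Prometheus 2.x: number of series currently in the TSDB head block.
prometheus_tsdb_head_series

# Prometheus 1.x equivalent: series currently held in memory.
prometheus_local_storage_memory_series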

I have a few recording rules which are more or less expensive:

- name: default
  rules:
  - record: probe_uptime:avg6h
    expr: avg_over_time(probe_success[6h]) * 100
  - record: probe_uptime:avg24h
    expr: avg_over_time(probe_success[1d]) * 100
  - record: probe_uptime:avg7d
    expr: avg_over_time(probe_success[1w]) * 100
  - record: probe_uptime:avg14d
    expr: avg_over_time(probe_success[2w]) * 100
  - record: probe_duration_seconds:success
    expr: probe_duration_seconds * (probe_success == 1)
  - record: probe_duration_seconds:failure
    expr: probe_duration_seconds + (probe_success == 0)
  - record: probe_duration_seconds:15m:99th
    expr: quantile_over_time(0.99, probe_duration_seconds:success[15m])
  - record: probe_duration_seconds:15m:95th
    expr: quantile_over_time(0.95, probe_duration_seconds:success[15m])
  - record: probe_duration_seconds:15m:90th
    expr: quantile_over_time(0.9, probe_duration_seconds:success[15m])
  - record: probe_duration_seconds:15m:75th
    expr: quantile_over_time(0.75, probe_duration_seconds:success[15m])
  - record: probe_duration_seconds:15m:median
    expr: quantile_over_time(0.5, probe_duration_seconds:success[15m])
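
For a rough sense of the cost, assuming the probes are scraped every 15s (an assumption; the actual interval isn't shown here), each range selector has to load approximately this many samples per probe_success series on every evaluation:

  [15m]      60
  [6h]    1,440
  [1d]    5,760
  [1w]   40,320
  [2w]   80,640

multiplied by the number of probe targets.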

Either way, this was running just fine until yesterday. Now it runs out of memory within minutes.

brian-brazil (Member) commented Jul 31, 2018

We've made quite a few performance improvements to PromQL since this was filed, so this should be a lot better now.

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited the conversation to collaborators on Mar 22, 2019
