
Implement strategies to limit memory usage. #455

Closed
beorn7 opened this Issue Jan 21, 2015 · 90 comments

beorn7 (Member) commented Jan 21, 2015

Currently, Prometheus simply limits the chunks in memory to a fixed number.

However, this number doesn't directly translate into total memory usage, as many other things take memory as well.

Prometheus could measure its own memory consumption and (optionally) evict chunks early if it needs too much memory.

It's non-trivial to measure "actual" memory consumption in a platform-independent way.

sammcj (Contributor) commented Mar 5, 2015

I've had issues with this too: Prometheus very quickly eats up a lot of RAM, and that usage can't easily be managed.

juliusv (Member) commented Mar 6, 2015

@sammcj The problem here is that there is no standard way to get a Go program's actual memory consumption. The heap sizes reported in http://golang.org/pkg/runtime/#MemStats are usually off by a factor of 3 or so from the actual resident memory. This can be due to a number of things: memory fragmentation, Go metadata overhead, or the Go runtime not returning pages to the OS very eagerly. A proper solution has yet to be found.
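
For illustration, a minimal sketch (not Prometheus code) of that mismatch: the Go runtime's own heap accounting next to the resident memory the OS reports. Linux-only, since it reads /proc/self/status.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strings"
)

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("HeapAlloc: %d MiB (live heap objects)\n", ms.HeapAlloc/1024/1024)
	fmt.Printf("HeapSys:   %d MiB (heap obtained from the OS)\n", ms.HeapSys/1024/1024)
	fmt.Printf("Sys:       %d MiB (all runtime-managed memory)\n", ms.Sys/1024/1024)

	// Compare with the kernel's view of resident memory (VmRSS).
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		if strings.HasPrefix(s.Text(), "VmRSS:") {
			fmt.Println(s.Text(), " <- what top/ps report")
		}
	}
}
```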

One thing you can tune right now is how many sample chunks to keep in RAM. See this flag:

  -storage.local.memory-chunks=1048576: How many chunks to keep in memory. While the size of a chunk is 1kiB, the total memory usage will be significantly higher than this value * 1kiB. Furthermore, for various reasons, more chunks might have to be kept in memory temporarily.

Keep in mind this is only one (albeit major) factor in RAM usage. Other factors are:

  • number of queries
  • number of time series
  • frequency of samples per time series
  • shape / length of label sets
  • etc.

beorn7 (Member, Author) commented Mar 6, 2015

Also the various queues (Prometheus's own, like the sample ingestion queue, but also Go- and OS-internal ones, like queued-up network requests or whatever). I have the idea of implementing a kind of memory chaperone that would not only evict evictable chunks but also throttle/reject queries and sample ingestion to keep total memory usage (or the amount of free memory on the machine) within limits. But that's all highly non-trivial stuff...

beorn7 changed the title from "Optional chunk eviction based on memory pressure." to "Implement strategies to limit memory usage." on Mar 19, 2015

beorn7 (Member, Author) commented Mar 19, 2015

There are by now many things that may take memory, and there are many knobs to turn to tweak it. I changed the name of the issue to something more generic.

Some good news already: the ingestion queue is gone, so there will no longer be wild RAM usage jumps when scrapes pile up during ingestion.

camerondavison commented Apr 28, 2015

I am running with a retention of 4 hours and the default "storage.local.memory-chunks", on version 0.13.1-fb3b464. While @juliusv said that I should expect the memory used to be more than 1GB, I am seeing it run out with the Docker container limit set at 2.5GB. Basically it looks like a memory leak, because on restart all of the memory goes back down and then slowly creeps back up over time. Is there any formula that could give me a good idea of what to set the memory limit to? Is there any way I can figure out whether there is a leak somewhere, or whether it's just because more and more data is coming in?

juliusv (Member) commented Apr 29, 2015

@a86c6f7964 One thing to start out with: if you configure Prometheus to monitor itself (I'd always recommend it), does the metric prometheus_local_storage_memory_chunks go up at the same rate as the memory usage you're seeing? Or does it plateau at the configured maximum while the memory usage continues to go up? Checking prometheus_local_storage_memory_series would also be interesting, to see how many series are currently in memory (i.e. not archived). If those are plateauing and the memory usage is still going up, we'll have to dig deeper.

camerondavison commented Apr 29, 2015

Yeah, it was going up. It got to almost 1 million, so maybe it just needs a little more memory.

beorn7 (Member, Author) commented Apr 29, 2015

@a86c6f7964 Retention is fundamentally a bad way to get memory usage under control. It will only affect memory usage if all your chunks fit into memory. Retention is meant to limit disk usage.

Please refer to http://prometheus.io/docs/operating/storage/#memory-usage for a starter. Applying the rule of thumb given there, you should set -storage.local.memory-chunks to 800,000 at most if you have only 2.5GiB available. The default is 1M, which will almost definitely make your Prometheus use more than 2.5GiB in steady state.

I recommend starting with -storage.local.memory-chunks=500000 and a retention tailored to your disk size (possibly many days or weeks).
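
As a rough back-of-the-envelope for that rule of thumb (a sketch only; the ~3 KiB of resident memory per in-memory chunk is an assumption drawn from the linked docs and the discussion later in this thread):

```go
package main

import "fmt"

func main() {
	// Assumption from the rule of thumb above: each chunk held in memory
	// costs roughly 3 KiB of resident memory overall (chunk data plus overhead).
	const bytesPerChunk = 3 * 1024

	availableRAM := int64(2.5 * 1024 * 1024 * 1024) // e.g. a 2.5GiB container limit
	maxChunks := availableRAM / bytesPerChunk

	fmt.Printf("keep -storage.local.memory-chunks below ~%d\n", maxChunks)
	// Prints ~873813, i.e. in the same ballpark as the "800,000 at most" advice above.
}
```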

beorn7 (Member, Author) commented Jul 15, 2015

The problem here is that "what's my memory usage?" or "how much memory is free on the system?" are highly non-trivial questions. See http://www.redhat.com/advice/tips/meminfo.html/ as a starter...

pousa commented Aug 20, 2015

I'm currently running Prometheus (0.15.1) on a bare-metal server with 64GB of memory, default settings (except retention: one week), and around 750 compute servers scraped every 30s. The server is dedicated to Prometheus.

We have observed that memory consumption goes up until the machine is not responding anymore. It takes around two days to reach this point, and killing the Prometheus process does not free all memory immediately. As suggested by @juliusv, I monitored prometheus_local_storage_memory_chunks. It started at 1.242443e+06 and ended up in a plateau around 1.86631e+06; please see below. My question is: what should I look at to get more information about this growth and where it is coming from?

Mem used 34549836 KiB
prometheus_local_storage_memory_chunks 1.964022e+06
prometheus_local_storage_memory_series 1.822449e+06

Mem used 38098228 KiB
prometheus_local_storage_memory_chunks 2.013611e+06
prometheus_local_storage_memory_series 1.648374e+06

Mem used 41139708 KiB
prometheus_local_storage_memory_chunks 2.062455e+06
prometheus_local_storage_memory_series 1.472947e+06

Mem used 53843712 KiB
top: 1431 prometh+  20   0 21.967g 0.015t   7968 S  99.4 23.9   2189:20 prometheus  
prometheus_local_storage_memory_chunks 1.899084e+06
prometheus_local_storage_memory_series 1.653677e+06

Mem used 56187240 KiB
top: 1431 prometh+  20   0 22.384g 0.015t   7968 S  86.8 25.0   2441:44 prometheus
prometheus_local_storage_memory_chunks 1.969578e+06
prometheus_local_storage_memory_series 1.518563e+06

Mem used 63289448 KiB
top: 1431 prometh+  20   0 23.886g 0.017t   7972 S  88.5 27.4   3461:40 prometheus   
prometheus_local_storage_memory_chunks 1.86631e+06
prometheus_local_storage_memory_series 1.518586e+06

juliusv (Member) commented Aug 20, 2015

@pousa Yeah, sounds like that kind of server should normally not use that much RAM.

Some things to dig into:

  • How much query traffic does the machine get? sum(rate(http_request_duration_microseconds_count{handler=~"query|metrics"}[5m]))
  • What's the sample ingestion rate? rate(prometheus_local_storage_ingested_samples_total[5m]). With 1.5 million series and a 30s scrape interval, I'd expect roughly 50k samples per second.
  • What's the type of monitored jobs? Are they node exporters or something else?
  • Doing a heap profile via go tool pprof http://prometheus-host:9090/debug/pprof/heap could be interesting to see what section of memory is growing over time (web in the resulting pprof shell will open an SVG graph in the browser).
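
For reference, the /debug/pprof endpoints come from Go's standard net/http/pprof package; below is a minimal standalone sketch (not Prometheus code) showing how any Go binary exposes the same heap profile:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// After this, `go tool pprof http://localhost:6060/debug/pprof/heap`
	// fetches a heap profile, just like against prometheus-host:9090 above.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```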

pousa commented Aug 20, 2015

I would say that the machine doesn't get that much traffic. I had to reboot it in the afternoon (memory problems again), and right now it has:

sum(rate(http_request_duration_microseconds_count{handler=~"query|metrics"}[5m])) = 0.029629190678656617
rate(prometheus_local_storage_ingested_samples_total[5m]) = 12478.366666666667

They are node exporters and a collectd plugin developed by us that basically aggregates data from /proc/PID/stat for each job. In our data center, a job is basically a parallel application from the HPC domain, composed of processes/threads.

Great! Thanks for the pprof tip, I will do it and post it here later.

juliusv (Member) commented Aug 20, 2015

@pousa That sounds like a low number of queries and a very reasonable ingestion rate for that kind of server. Though that makes me wonder: if you have 1.5 million series active in memory, I would have expected an ingestion rate of ~50k samples per second (at a 30s scrape interval) instead of 12k/s, unless your series are changing frequently, so that only a small subset of the active memory series gets an update on every scrape. If you are monitoring jobs via /proc/PID/stat, and these jobs are labeled by their PID, I wonder if that is what's leading to frequent churn in series (PIDs changing all the time?). Still not sure how exactly that would lead to your memory woes, though.

Your memory usage is shooting up, while memory chunks and series are staying pretty constant. Weird!

pousa commented Aug 21, 2015

@juliusv the ingestion rate stays around this level, and the number of monitored jobs is around 7k. I do not monitor PIDs explicitly but JOBIDs (sets of PIDs). However, they also change quite a lot. Jobs have a maximum duration of 4h, 24h or 120h, and it is common to have jobs that run for only a few minutes.

Yep, that is why I posted here. I could not understand this either. I still have to run pprof, will do this today.

pousa commented Aug 24, 2015

@juliusv I ran pprof and saw only 3GB being used. Need to investigate more...

juliusv (Member) commented Aug 24, 2015

@pousa The Go-reported heap size is always smaller than what the OS sees in terms of resident memory usage (due to internal metadata overhead and memory fragmentation), but that's usually a factor of 2 or so. I don't see how it would report 3GB but then fill up 64GB in reality. Odd!

pousa commented Aug 24, 2015

@juliusv Thanks for the information. Indeed odd. I replicated the service on a different server today, and I'm monitoring both servers/Prometheus instances. I want to see if this could somehow be related to the server and not Prometheus itself, since top reports only half of the memory being used by Prometheus.

killercentury commented Sep 8, 2015

I have a similar issue to @pousa: the EC2 instance repeatedly runs out of memory after about 2-4 days. I do have much less memory than @pousa, but I am wondering what minimum/recommended memory capacity is required for running Prometheus long-term. Is it possible for Prometheus to control its memory usage automatically instead of exhausting all the memory on the instance until it dies?

pousa commented Sep 8, 2015

@killercentury I still have the problem. I tried to build it with a newer version of Go, but no luck. I also looked into Go runtime environment variables but could not find anything.

beorn7 (Member, Author) commented Sep 8, 2015

Hi everybody, please read http://prometheus.io/docs/operating/storage/ first.

@killercentury if you have very little memory (less than ~4GiB), you want to reduce -storage.local.memory-chunks from its default value.

@pousa with 1.5M time series, you want to increase -storage.local.memory-chunks to something like 5M to make Prometheus work properly; -storage.local.max-chunks-to-persist should be increased then, too. Each time series should be able to keep at least 3 chunks in memory, ideally more. Also, if -storage.local.max-chunks-to-persist is not high enough, Prometheus will attempt a lot of little disk ops, which will slow everything else down and might increase RAM usage a lot because queues fill up. That's especially true with 7k targets. If everything slows down, this might easily result in a spiral of death.

Once you have tweaked the two flags mentioned above (perhaps to even higher settings), I would next increase the scrape interval to something very high (like 3min or so) to check if things improve. Then you can incrementally reduce the interval until you see problems arising.

(In other news: 7k is a very high number of targets. Sharding of some kind might be required. But that's a different story.)
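
To make the sizing advice above concrete, a back-of-the-envelope sketch in Go; the series count and the "at least 3 chunks per series" figure are taken from this comment, and the result is only a lower bound, not a tuning recommendation:

```go
package main

import "fmt"

func main() {
	// Figures taken from the advice above; treat them as illustrative assumptions.
	memorySeries := 1500000 // roughly the number of series pousa reports in memory
	minChunksPerSeries := 3 // "each time series should be able to keep at least 3 chunks"

	minMemoryChunks := memorySeries * minChunksPerSeries
	fmt.Printf("-storage.local.memory-chunks should be at least %d (5M suggested above for headroom)\n", minMemoryChunks)
	fmt.Println("-storage.local.max-chunks-to-persist should be raised alongside it")
}
```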

beorn7 (Member, Author) commented Sep 8, 2015

And yes, ideally all of these values would auto-tweak themselves. However, that's highly non-trivial and not a priority to implement right now.

pousa commented Sep 8, 2015

@beorn7 I will try tweaking those flags and if needed the scrape interval. Thanks!

pousa commented Sep 16, 2015

@beorn7 Changing those flags and increasing the scrape interval a bit allows our Prometheus instances to run without running out of memory. Thanks! However, I still see long times to get results back from expressions... sometimes I even get timeouts. Concerning sharding, we were already doing it.

beorn7 (Member, Author) commented Sep 16, 2015

Expensive queries can be pre-computed with recording rules: http://prometheus.io/docs/querying/rules/

To try out very expensive queries, you can increase the query timeout via the -query.timeout flag.

(Obviously, that's all now off-topic and has nothing to do with memory usage anymore. ;)

pousa commented Sep 16, 2015

@beorn7 Thanks, I will try the flag. We already have rules in place, and I was talking about very simple queries (e.g. single metrics). But, as you said, this is a different topic ;)

beorn7 (Member, Author) commented Sep 16, 2015

If a single time series takes a long time to query, then we are kind of on-topic again, because the time is most likely spent loading chunks from disk. By tweaking the flags discussed here, you can maximize the number of chunks Prometheus keeps in memory and thereby avoid loading chunks from disk.
But in other news, loading a single series from disk should be very fast (because all the data is in a single file; one seek only). So I guess your server is very busy and overloaded anyway, so that everything is slow.

ntquyen commented Nov 9, 2016

@juliusv If you read my comments above and @brian-brazil's answers, you wouldn't be sure that the high memory usage comes only from the storage.local.memory-chunks setting. It actually comes from the number of active series. To put it another way: the number of active series determines the memory usage, not storage.local.memory-chunks.

fabxc (Member) commented Nov 14, 2016

I think we have to do some deeper analysis here to identify bottlenecks in the case of many active time series. The more dynamic the environment, the higher the time series churn, and the more we encounter this issue.
Maybe we already did – back-of-the-envelope or by runtime investigation? @beorn7 @juliusv?

We all know I'm not a friend of storage tuning flags dictating memory usage (with limited accuracy), but at least it's static to a degree.
But memory usage that fluctuates heavily depending on the number of time series in various states is inherently a runtime property. Unlike the chunk flags, this is not just an undesirable step from an operational perspective but quite literally unmanageable, at least if we agree that restart cycles, between which a secondary process inspects and reconfigures from the outside, are not scalable when checkpoint/restore takes up to 15 minutes for sufficiently large servers.

Our in-memory data should be the largest part of memory usage, even considering a moderate querying load.
There's certainly per-series management overhead and there are buffers used for ingestion. However, even knowing the code quite well, I believe we must be missing something if our actual sample data only accounts for about 25% of memory usage, with that number shrinking as the number of series grows.

If there are any ideas where this memory goes to, I'm sure we can optimize in the right places.

beorn7 (Member, Author) commented Nov 14, 2016

I did some research, as reported in #455 (comment) above. I essentially tried to model the memory usage with two linear variables, chunks in memory and time series in memory, without any luck at all. There is definitely more that creates variable memory usage. Candidates include: number of targets, queries, service discovery, …

It would obviously be great to understand the memory consumption pattern in more detail. But we can pretty safely assume that it is complicated enough that you cannot derive a simple rule for how to set the flags (or how to auto-set them). My plans therefore go more in the direction of making chunk eviction depend on memory pressure, so that you ultimately tell Prometheus how much RAM it may take, and the server then tries to dynamically balance memory chunks (and even persistence effort) accordingly. Since these are the only levers we have, we don't really need to know where the memory is used (outside of finding memory leaks or optimizing code) as long as we know how much is used.
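
A deliberately simplified sketch of that idea, for illustration only; evictChunk is a hypothetical callback, and the real implementation (what later became the -storage.local.target-heap-size work) is far more involved:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// evictionLoop is a conceptual sketch: watch the Go heap and evict evictable
// chunks while usage exceeds a configured target.
func evictionLoop(targetHeapBytes uint64, evictChunk func() bool) {
	for range time.Tick(10 * time.Second) {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms) // note: this briefly stops the world
		for ms.HeapAlloc > targetHeapBytes {
			if !evictChunk() {
				// Nothing evictable left; a full solution would now throttle
				// ingestion and/or reject queries, as suggested earlier in this thread.
				break
			}
			runtime.ReadMemStats(&ms)
		}
	}
}

func main() {
	// Wire the loop up with a 2 GiB target and a dummy callback; a real
	// implementation would evict one evictable chunk and report whether it did.
	evictionLoop(2<<30, func() bool {
		fmt.Println("would evict one chunk here")
		return false // dummy: pretend nothing is evictable
	})
}
```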

fabxc (Member) commented Nov 14, 2016

Sorry, I missed that comment.

My general point: If there's a place where we can fix algorithmic complexity or a significant constant factor of our memory usage, it is well worth knowing about it.

For example, a chunk is 1KB, with in-memory chunks fully pre-allocated. Then there's obviously management around it – but why does that add up to 3KB?
Why does management of a single series need 10KB of memory, which is 10x as much as the maximum data of its active chunk and, in the K8s case, more than most series ever accumulate in total?

That's not to say that there aren't very good reasons for that. But it's a complex system by now and chances are we missed an opportunity for baseline improvement so far.

beorn7 (Member, Author) commented Nov 14, 2016

> For example, a chunk is 1KB, with in-memory chunks fully pre-allocated. Then there's obviously management around it – but why does that add up to 3KB?
> Why does management of a single series need 10KB of memory, which is 10x as much as the maximum data of its active chunk and, in the K8s case, more than most series ever accumulate in total?

My research resulted in the conclusion that there is no linear relationship like the above. The memory certainly goes somewhere, but it doesn't make sense to say that every memory chunk we add adds 3kiB of RAM usage or each series adds 10kiB of RAM usage.

beorn7 (Member, Author) commented Dec 13, 2016

More results from ongoing investigation:
I run two servers with exactly the same config and load, with the only exception that one is the active server serving dashboard queries while the other is a hot spare. Both have the same number of memory chunks and time series (i.e. the dashboard queries are not super-long-term, so they don't keep otherwise-archived series in RAM or similar). The active server uses around 55GiB of RAM, the hot spare 35GiB. Query load can thus create a significant memory footprint, even if it does not result in more memory chunks or memory series.

uschtwill referenced this issue on Dec 16, 2016: Prometheus fix #13 (merged)

coding-horror commented Jan 19, 2017

We are also seeing extreme memory usage by Prometheus at Discourse: 55GB of RAM. This actually caused our HTTPS termination machines to have small blip-like interruptions in service due to extreme memory pressure. :(

randomInteger commented Jan 19, 2017

What a kick in the pants, seeing Jeff Atwood's reply pop up in my inbox because of my involvement in this thread. I am a huge coding-horror fan.

Back on topic: I have been trying to tune Prometheus, but nothing I do seems to stop the ever-creeping memory consumption. I am still restarting Prometheus's container once every 24 hours, and that keeps memory use in check while still continuing to harvest/store/serve the data I'm collecting. It's not a great solution, but it's an OK workaround in this instance.

If there is anything I can do from my end to provide you with more data on this issue, please let me know.

brian-brazil (Member) commented Jan 19, 2017

I've implemented a number of improvements to Prometheus ingestion that'll be in 1.5.0 and, in the best case, will cut memory usage by 33%. That slow growth is likely chunkDescs, which depending on your setup could take weeks to stabilise; with my changes, things should now stabilise about 12 hours after your chunks fill.

https://www.robustperception.io/how-much-ram-does-my-prometheus-need-for-ingestion/ has more information.

JorritSalverda (Contributor) commented Jan 19, 2017

We finally managed to run Prometheus stably in our production GKE cluster with the following settings:

10 core cpu request
14 core cpu limit
64gb memory request
80gb memory limit
500gb pd-ssd

-storage.local.retention=168h0m0s
-storage.local.memory-chunks=13107200 (approx 64gb * 1024 * 1024 / 5)

Earlier tests with memory chunks calculated by dividing by 3 or 4 constantly ran us into out-of-memory kills; since using 5, this no longer seems to happen. Memory usage slowly grows after start but flattens off somewhere around 60GB.

We do experience an occasional hang; for that we just added -storage.local.max-chunks-to-persist=8738132, but we still have to see whether this has a positive effect on stability.

In the GKE cluster itself we have about 40 4-core nodes running 750 pods, scraped every 15 seconds. Each node runs the node-exporter, so there's a wealth of information coming from Kubernetes itself and those node exporters.

Ingestion isn't a problem at all for Prometheus; the querying side is trickier. We can see clearly in CPU pressure - and to a lesser extent memory usage - whether people have their Grafana dashboards open during the day. It drops to a really low level when the office goes empty.

brian-brazil (Member) commented Jan 19, 2017

The quick math indicates that you needed ~3.9KB per chunk with Prometheus 1.4 just to handle ingestion, so 5 with queries isn't surprising.

JorritSalverda (Contributor) commented Jan 19, 2017

With our current settings it seems more like 4.6KB per chunk (ending up at 60GB of used memory). But part of that memory usage could be for querying or disk cache, or any other resident memory usage by the application, I guess?

Is it possible to see the separate types of memory usage in Prometheus' own metrics?
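
As a quick sanity check (nothing more), the ~4.6KB figure follows directly from the two numbers reported for this deployment:

```go
package main

import "fmt"

func main() {
	// Numbers reported above for this GKE deployment.
	residentBytes := 60.0 * 1e9 // memory usage flattens out around 60GB
	memoryChunks := 13107200.0  // configured -storage.local.memory-chunks

	fmt.Printf("observed overhead: ~%.1f kB per chunk\n", residentBytes/memoryChunks/1000)
	// Prints ~4.6 kB, versus the ~3.9 kB ingestion-only estimate above;
	// the difference plausibly comes from querying, caches, etc.
}
```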

beorn7 (Member, Author) commented Mar 5, 2017

Branch https://github.com/prometheus/prometheus/tree/beorn7/storage currently contains the implementation of an experimental flag, -storage.local.target-heap-size, which obsoletes -storage.local.max-chunks-to-persist and -storage.local.memory-chunks. It's running with great success on a couple of SoundCloud servers right now, but I have to polish it a bit before submitting it for review.

brian-brazil (Member) commented Mar 5, 2017

That looks promising. One thing I noted during benchmarking was a ~10% overhead in memory usage beyond what the heap alone used, so reducing the number the user provides accordingly might be an idea.

coding-horror commented Mar 6, 2017

We did see solid improvements (a reduction in memory usage) after we deployed 1.5 on our infra. Thanks @brian-brazil -- keep the improvements coming! Let us know how we can help.

brian-brazil (Member) commented Mar 7, 2017

@beorn7 golang/go#16843 and linked discussions may be of interest.

beorn7 (Member, Author) commented Mar 7, 2017

Thanks for the pointer. The linked proposal document aligns very well with my research (and it even mentions Prometheus explicitly as a use case :).

brian-brazil (Member) commented Mar 7, 2017

I noticed that :)

The design also looks fairly similar to your proposal. Having two very similar control systems running on top of each other may cause undesirable interactions.

beorn7 (Member, Author) commented Mar 27, 2017

It's merged! Will be released in 1.6.0.

beorn7 closed this Mar 27, 2017

Eeemil commented Mar 13, 2019

--storage.local.target-heap-size seems to have been removed in 2.0; is there any strategy to limit memory usage nowadays?

beorn7 (Member, Author) commented Mar 14, 2019

See https://groups.google.com/forum/#!topic/prometheus-users/jU0Ghd_SyrQ

It makes more sense to ask questions like this on the prometheus-users mailing list rather than in a GitHub issue (in particular if it is a closed GitHub issue). On the mailing list, more people are available to potentially respond to your question, and the whole community can benefit from the answers provided.
