Prometheus on a docker container gets OOM Killed #1358

Closed
vjsamuel opened this Issue Jan 30, 2016 · 9 comments

vjsamuel commented Jan 30, 2016

I'm currently facing an issue: I'm running a Docker container with Prometheus on Kubernetes, and I've allocated a maximum of 3 GiB of memory to it. The process is started with the following command:
prometheus -config.file=/etc/prometheus/prometheus.yml -storage.local.path=/prometheus -web.console.libraries=/etc/prometheus/console_libraries -web.console.templates=/etc/prometheus/consoles -storage.local.memory-chunks=40960 -storage.local.max-chunks-to-persist=10240 -storage.local.retention=720h

I see that the container gets OOMKilled because it consumes all 3 GiB and then tries to consume more. Is there some setting I'm getting wrong? Reducing -storage.local.memory-chunks to 10240 has given the container a longer life, but it still runs at 99% memory utilization.
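
In case it is useful, this is roughly how the OOM kill shows up on the node (the container ID and pod name below are placeholders for whatever docker ps / kubectl get pods reports; recent kubectl versions show the last termination reason in the describe output):

docker inspect --format '{{.State.OOMKilled}}' <container-id>
kubectl describe pod <prometheus-pod> | grep -A 3 'Last State'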

Can someone please guide me on this?

Thanks in advance.

@beorn7 beorn7 added the question label Feb 1, 2016

beorn7 commented Feb 1, 2016

40960 memory chunks should lead to at most 200MiB of RAM usage. With 3GiB, you should be fine with many more, like 500,000 or even 1,000,000 chunks (especially if using recent versions like 0.16.2).

Could you check how many memory chunks your server is actually using (the metric prometheus_local_storage_memory_chunks exported by the Prometheus server itself)? And how many time series do you have in memory (prometheus_local_storage_memory_series)?

If you have many time series, the Prometheus server has to keep a couple of chunks for each of them in memory anyway. So let's say you have 1M time series: then your server needs to keep 1 to 2M chunks in memory to be operational, no matter what you have set as flags. (From version 0.17 on, the server will stop ingestion in that case, so you will definitely notice ;).) In that case, it will exceed the 3GiB it is allowed to use and get OOM-killed.
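
For reference, a quick way to read both values is to scrape the server's own /metrics endpoint (assuming it listens on the default localhost:9090; adjust the address to your setup):

curl -s http://localhost:9090/metrics | grep -E '^prometheus_local_storage_memory_(chunks|series) '

Both are plain gauges, so the output is simply the current counts.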

vjsamuel commented Feb 1, 2016

prometheus_local_storage_memory_chunks 729095
process_resident_memory_bytes 2.766176256e+09
version="0.16.1"

This is what I'm seeing right now.

The process is running with:
prometheus -alertmanager.url=http://192.168.52.164:9093 -config.file=/etc/prometheus/prometheus.yml -storage.local.path=/prometheus -web.console.libraries=/etc/prometheus/console_libraries -web.console.templates=/etc/prometheus/consoles -storage.local.memory-chunks=10240 -storage.local.max-chunks-to-persist=10240 -storage.local.retention=720h

I made a typo earlier: I'm actually running with -storage.local.memory-chunks=10240, yet the container still sits at 99% memory consistently.
This Prometheus is monitoring a 10-node Kubernetes cluster.

Docker stats:
CONTAINER CPU % MEM USAGE/LIMIT MEM % NET I/O
50ca7365007d 408.45% 3 GiB/3 GiB 99.99% 0 B/0 B

juliusv commented Feb 1, 2016

If you have 700k chunks in memory, that indicates you likely have a far larger number of series than the maximum number of chunks you have configured. Each active series will need at least one chunk, even if your storage.local.memory-chunks flag value is lower. What does the metric prometheus_local_storage_memory_series say about your number of active series? If that's not it, a very high number of queries could be another (less likely) reason.
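
(Rough arithmetic behind that suspicion, using the numbers reported above: the flag is set to 10240, but roughly 729k chunks are resident. With a floor of at least one chunk per active series, that would put the active-series count somewhere near the chunk count of 729095, far above the flag value, which is why the flag has no visible effect.)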

vjsamuel commented Feb 2, 2016

prometheus_local_storage_memory_series 729095

I doubt it's the query count, as I'm hardly executing any queries on the Prometheus server.

juliusv commented Feb 2, 2016

Yup, so your Prometheus has 729095 active time series. Prometheus keeps at least one chunk in memory for each of them, and it is even advisable to configure -storage.local.memory-chunks to 2x or 3x that count so that, ideally, you can keep more than one chunk per series in memory. That of course means you will need correspondingly more memory.

First of all, I would double-check your actual service metrics (label cardinalities and so on) to make sure that you really need all the dimensionality that's in there, or whether there's some kind of mistake (like tracking something such as a user ID in a label, which quickly blows up your number of time series). If that all looks ok, you will simply need more memory, and you should update your flags accordingly.
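
A sketch of how to spot runaway cardinality from the expression browser (the exact syntax may need adjusting for your Prometheus version; regex matching on __name__ is assumed to work here):

topk(10, count by (__name__)({__name__=~".+"}))
count by (job)({__name__=~".+"})

The first query lists the ten metric names with the most series; the second shows how many series each scrape job contributes, which usually points at the offending exporter.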

beorn7 commented Feb 2, 2016

So yes, with 729095 series you want at least ~2M memory chunks and thus ~6GiB of RAM.

Either get more RAM or cut down the number of metrics your monitored targets are exporting.
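
To make that concrete, a rough sizing sketch using only the numbers from this thread (2-3 chunks per series, and the chunk-to-RAM ratio implied by the 2M chunks ≈ 6GiB estimate above, i.e. roughly 3KiB per resident chunk):

729095 series × 2-3 chunks/series ≈ 1.5-2.2M chunks
2M chunks × ~3KiB/chunk ≈ 6GiB of RAM

And a sketch of the correspondingly adjusted command line (only -storage.local.memory-chunks is changed from the original; whether -storage.local.max-chunks-to-persist should also be raised is a separate tuning question):

prometheus -config.file=/etc/prometheus/prometheus.yml -storage.local.path=/prometheus -web.console.libraries=/etc/prometheus/console_libraries -web.console.templates=/etc/prometheus/consoles -storage.local.memory-chunks=2000000 -storage.local.max-chunks-to-persist=10240 -storage.local.retention=720h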

@beorn7 beorn7 closed this Feb 2, 2016

vjsamuel commented Feb 2, 2016

Thank you so much for the help. This helped me understand how to debug memory issues in Prometheus. Is this documented anywhere? I could submit a PR documenting it, as it would be useful for the next person.

juliusv commented Feb 2, 2016

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 24, 2019
