
Prometheus server runs out of memory #1396

Closed
meenupandey opened this Issue Feb 15, 2016 · 12 comments

meenupandey commented Feb 15, 2016

I am running a Prometheus server on a t2.large (2 vCPU, 8 GiB memory). The Prometheus server runs out of memory within a few days; the longest it has run is 8 days.
The Prometheus server, promdash, alertmanager and cloud-exporter are running as Docker containers.

Running top outside the container shows:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4159 root 20 0 6316428 4.552g 6752 S 23.3 58.3 1980:58 prometheus

Config flags: -storage.local.memory-chunks=1048576 -storage.local.max-chunks-to-persist=1048576 -storage.local.checkpoint-interval=3m0s

Here are some statistics:
prometheus_local_storage_memory_series 324159

prometheus_local_storage_memory_chunks 1048601

prometheus_local_storage_ingested_samples_total 7237728330

rate(prometheus_local_storage_ingested_samples_total[5m]) 20954

sum(rate(http_request_duration_microseconds_count{handler=~"query|metrics"}[5m])) 0.1762

We are monitoring 25 nodes using the node exporter and are also getting data from the application services running on those nodes.
Mem info: http://pastebin.com/L0ZLb2CD

Why is the memory usage so high (4.552g)?

What should I do to make Prometheus more reliable?
Also, is this machine configuration enough to support all the data?
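
(For reference, a quick way to spot-check the storage statistics above is to read them straight off the server's own /metrics endpoint. This is only a rough sketch; the default port 9090 is an assumption and should match whatever -web.listen-address is set to.)

```sh
# Minimal sketch: pull the local-storage metrics directly from Prometheus's
# own /metrics endpoint. Port 9090 is the default and an assumption here.
curl -s http://localhost:9090/metrics \
  | grep -E '^prometheus_local_storage_(memory_series|memory_chunks|ingested_samples_total)'
```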

beorn7 (Member) commented Feb 16, 2016

Are those numbers from just before it runs out of memory?

What version of the Prometheus server are you running?

In general, I'd expect your Prometheus server to use just below 4GiB with those settings. If it's really using 4.5GiB, that might be due to expensive queries or large, long-lasting scrapes.

Perhaps running all those other binaries (promdash, alertmanager and cloud-exporter) in Docker containers on the same host consumes the other half of your memory. I'd check how much RAM each of those containers uses.
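
(A rough way to check that from the host is docker stats; a minimal sketch follows, where the container names are assumptions and docker ps will list the real ones.)

```sh
# Sketch: live per-container CPU and memory usage as seen by Docker on the host.
# The container names are assumptions; run `docker ps` to see the actual names.
docker stats prometheus promdash alertmanager cloudwatch-exporter
```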


beorn7 added the question label Feb 16, 2016

meenupandey (Author) commented Feb 16, 2016

That number was from while the server was still running. I updated -storage.local.memory-chunks to 1572864 because I was seeing this warning very often:

Feb 16 14:24:52 prom prometheus[17446]: time="2016-02-16T14:24:52Z" level=warning msg="1572864 chunks waiting for persistence, sample ingestion suspended." source="storage.go:544"

But I still see this warning come back within a few hours.

I am running Prometheus version 0.16.
Current Prometheus stats:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
17491 root 20 0 7885184 6.116g 5388 D 5.0 78.3 64:26.87 prometheus
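
(A quick, hedged way to see how often ingestion is being suspended, assuming the Prometheus container is simply named "prometheus":)

```sh
# Sketch: count the ingestion-suspended warnings in the container's logs.
# Assumes the container name "prometheus"; adjust to whatever `docker ps` shows.
docker logs prometheus 2>&1 | grep -c "sample ingestion suspended"
```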

meenupandey (Author) commented Feb 16, 2016

Here is the output of ps aux

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND

root     19053  0.0  0.1 117056 13080 ?        Ssl  17:23   0:00 /usr/bin/docker run --name prometheus --rm -v /data/prometheus/prometheus:/prometheus -v /etc/prometheus/prometheus.yml:/etc
root     19130 39.0 63.8 5632568 5234580 ?     Ssl  17:23   1:23 /bin/prometheus -web.listen-address=:19090 -config.file=/etc/prometheus/prometheus.yml -storage.local.memory-chunks=1572864
root     19225  0.0  0.1 174396 13044 ?        Ssl  17:23   0:00 /usr/bin/docker run --name alertmanager --rm -v /etc/prometheus/alertmanager.conf:/alertmanager.conf -p 9093:9093/tcp prom/a
systemd+ 19327  0.0  0.1  74004 12848 ?        Ssl  17:23   0:00 /bin/alertmanager -config.file=/alertmanager.conf

root     19422  0.0  0.1 125252 13004 ?        Ssl  17:23   0:00 /usr/bin/docker run --name promdash --rm -v /data/prometheus/promdash:/promdash -e DATABASE_URL -p 127.0.0.1:3000:3000/tcp p
root     19656  0.0  0.1 117056 13000 ?        Ssl  17:23   0:00 /usr/bin/docker run --name cloudwatch-exporter --rm -p 9106:9106 -v /etc/prometheus/cloudwatch-exporter-config.json:/config.
root     19724  1.6  1.9 3522232 157180 ?      Ssl  17:23   0:03 java -jar /cloudwatch_exporter.jar 9106 /config.json

beorn7 (Member) commented Feb 16, 2016

The message above ("1572864 chunks waiting for persistence") means that too many chunks are waiting to be written to disk. So many, in fact, that it explains the memory consumption of your server. Sample ingestion has been suspended, which is really bad. You don't want to be in that state.

The root cause is that your disk is too slow. It might be that you don't have enough I/O ops on your machine. Your ingestion rate is moderate, so a normal disk should be able to handle it, but I/O ops on AWS instances are limited.

More memory helps to batch I/O ops, but then you need to tweak certain flags; see http://prometheus.io/docs/operating/storage/.
But I'd first try to get a higher I/O ops quota (I'm not an expert on how to do that on AWS).
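
(For illustration only, a hedged sketch of what raising those limits might look like; the flag names are the ones already used in this thread, but the values are placeholders that need to be sized against the storage docs linked above, not recommendations.)

```sh
# Sketch only: same flags as used above, with placeholder values for a box
# with more RAM. Roughly: allow more chunks in memory, allow a larger
# persistence backlog, and checkpoint less often so disk writes are batched.
/bin/prometheus \
  -config.file=/etc/prometheus/prometheus.yml \
  -storage.local.memory-chunks=2097152 \
  -storage.local.max-chunks-to-persist=1048576 \
  -storage.local.checkpoint-interval=10m0s
```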

meenupandey (Author) commented Feb 16, 2016

The attached volume is of gp2 type on AWS and is allowed 300/3000 IOPS (baseline/burst).

juliusv (Member) commented Feb 16, 2016

So looking at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html, that means you have a 100 GiB disk (100 × 3 = 300 baseline IOPS)? Maybe try io1?

I'm not sure how well seek-heavy workloads (lots of back-and-forth) perform on EBS in general, since at SoundCloud we always used instance store storage (no need to go over the network, but it is lost when the instance disappears). That's something to try as well, if it's an option.

beorn7 (Member) commented Feb 16, 2016

Can you see somewhere how many IOPS your server is consuming?

Prometheus is definitely IOPS-heavy on the disk. I strongly assume you have maxed out your I/O credits and are thus falling behind on persistence until ingestion stalls and/or you run out of memory.
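
(On a Linux host, a simple hedged way to watch the IOPS actually being issued is iostat from the sysstat package; AWS CloudWatch's per-volume read/write ops graphs are another option.)

```sh
# Sketch: extended per-device I/O statistics, refreshed every 5 seconds.
# The r/s and w/s columns are the read and write operations per second (IOPS).
iostat -x 5
```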

meenupandey (Author) commented Feb 17, 2016

From the CloudWatch metrics on AWS I can see read/write ops averaging around 150 and bursting up to 500 periodically. I removed a few node exporter stats that we were not using, and suddenly the memory usage seems under control at 4.8g, with sum(rate(http_request_duration_microseconds_count{handler=~"query|metrics"}[5m])) at 0.1355 and rate(prometheus_local_storage_ingested_samples_total[5m]) at 905.

I will see if we can increase IOPS on the EBS volume.

Puneeth-n commented Feb 17, 2016

@meenupandey Is there a reason for choosing a t2.large and a 100 GiB disk? I think it is a 100 GiB EBS volume, given the 300/3000 IOPS. Anyway, because of the burstable nature of T2 instances I would go for a C3/C4 instance type, as it has better performance (no CPU credits). I would also allocate a much larger EBS volume, since larger volumes come with a much higher baseline performance and a longer burst period.

I currently don't use Prometheus, but I have seen similar issues in production in our Carbon stack.

beorn7 (Member) commented Feb 18, 2016

If you keep the ingestion load slightly below what your storage device can handle, you will be good. If you are slightly above, your server will blow up at some point. (Suspending ingestion is handled better in master (not yet released), but in general you never want to get close to that kind of overload, because throttled sample ingestion throws off your metrics quite a bit.)
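
(As a hedged sketch of how one might keep an eye on that, reusing only the metrics and the port already quoted in this thread: if the in-memory chunk count sits pinned at the configured -storage.local.memory-chunks limit while the "waiting for persistence" warnings appear, persistence is not keeping up.)

```sh
# Sketch: watch the in-memory chunk count against the configured limit.
# Port 19090 matches the -web.listen-address used earlier; adjust as needed.
watch -n 30 'curl -s http://localhost:19090/metrics | grep "^prometheus_local_storage_memory_chunks"'
```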

Looks like the root cause is pretty clear now. I'm closing this. Feel free to re-open if there is new evidence.

beorn7 closed this Feb 18, 2016

meenupandey (Author) commented Feb 18, 2016

Thank you all. We upgraded our EBS volume and its IOPS allowance has increased to 500. The server seems to be working well now. Thanks again.

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 24, 2019
