Prometheus server runs out of memory #1396
Comments
beorn7 commented Feb 16, 2016

Are those numbers from just before it runs out of memory? What version of the Prometheus server are you running?

In general, I'd expect your Prometheus server to use just below 4GiB. Perhaps running all those other binaries (promdash, alertmanager and cloud-exporter) on the same machine accounts for the rest of the memory usage?
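For context, a rough back-of-the-envelope estimate of why ~4 GiB is the expected ballpark, assuming the ~1 KiB chunk size of the 0.x local storage; the overhead multiplier is a rough rule of thumb often quoted for that storage engine, not a number from this thread:

    configured memory chunks:   1048576
    chunk payload:              1048576 chunks x ~1 KiB/chunk ≈ 1 GiB
    series index, Go runtime,
    persistence overhead:       roughly 2-3x the chunk payload (rule of thumb)
    expected resident memory:   ~3-4 GiB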
beorn7 added the question label Feb 16, 2016
That number was from when the server was still running. I updated the … but I still see this error coming within a few hours. I am running Prometheus version 0.16.
Here is the output of …
The message above ("1572864 chunks waiting for persistence") means that too many chunks are waiting to be written to disk. So many, in fact, that it explains the memory consumption of your server. Sample ingestion has been suspended, which is really bad. You don't want to be in that state.

The root cause is that your disk is too slow. It might be that you don't have enough I/O ops on your machine. Your ingestion rate is moderate, so a normal disk should be able to handle it, but I/O ops on AWS instances are limited. More memory helps to batch I/O ops, but then you need to tweak certain flags; see http://prometheus.io/docs/operating/storage/.
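As a sketch of the flag tuning the storage documentation refers to, using the same flags this setup already passes; the concrete values are assumptions for an 8 GiB host, not recommendations from this thread or the docs:

    # Illustrative values only. -storage.local.memory-chunks is the number of
    # chunks the server tries to keep in memory (~1 KiB each),
    # -storage.local.max-chunks-to-persist caps the backlog of chunks waiting
    # to be written before ingestion is throttled, and a longer checkpoint
    # interval means fewer checkpoint writes at the cost of a longer replay
    # after a crash.
    prometheus \
      -storage.local.memory-chunks=1572864 \
      -storage.local.max-chunks-to-persist=786432 \
      -storage.local.checkpoint-interval=5m0s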
The attached volume is of gp2 type on AWS and allows 300 IOPS baseline / 3000 IOPS burst.
So looking at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html, that means you have a 100 GiB disk (100 * 3 = 300 baseline IOPS)? Maybe try …

I'm not sure how well seek-heavy workloads perform on EBS in general (lots of back-and-forth?), since at SoundCloud we always used instance store storage (no need to go over the network, but the data is lost when the instance disappears). That's something to try as well, if that's an option.
Can you see somewhere how many IOPS your server is consuming? Prometheus is definitely IOPS-heavy on the disk. I strongly suspect you have maxed out your I/O credits and are thus falling behind in persistence until ingestion stalls and/or you run out of memory.
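One way to check this from the instance itself (iostat comes from the sysstat package and is not mentioned in this thread; it is just a suggestion):

    # Extended per-device statistics, refreshed every 5 seconds.
    # r/s + w/s approximate the IOPS being consumed; %util shows device saturation.
    iostat -x 5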
From the CloudWatch metrics on AWS I could see read/write ops averaging around 150 and bursting up to 500 periodically. I removed a few node exporter stats that we were not using, and suddenly the memory usage seems to be under control. I will see if we can increase the IOPS on the EBS volume.
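For reference, pruning unneeded node exporter series can also be done at scrape time instead of on the exporter. A minimal sketch, assuming a scrape job named node and using node_zfs_.* purely as a placeholder for whichever metrics are unused (the thread does not say which stats were removed; also check that your Prometheus version supports metric_relabel_configs):

    scrape_configs:
      - job_name: 'node'
        # ... existing target configuration ...
        metric_relabel_configs:
          # Drop unused series before they are ingested (pattern is a placeholder).
          - source_labels: [__name__]
            regex: 'node_zfs_.*'
            action: drop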
Puneeth-n commented Feb 17, 2016
@meenupandey Is there a reason for choosing …? I currently don't use …
If you keep the ingestion load slightly below what your storage device can handle, you will be good. If you are slightly above, your server will blow up at some point. (Suspending ingestion is better handled in master (not yet released), but in general you really never want to get close to that kind of overload, because throttled sample ingestion throws off your metrics quite a bit.)

Looks like the root cause is pretty clear now. I'm closing this. Feel free to re-open if there is new evidence.
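A possible way to watch for that situation, using only the metrics already quoted earlier in this thread (the one-hour window and the greater-than-zero heuristic are assumptions, not an official recipe):

    # Current ingestion rate in samples per second:
    rate(prometheus_local_storage_ingested_samples_total[5m])

    # If the number of chunks held in memory keeps growing over a longer window,
    # persistence is probably not keeping up with ingestion:
    delta(prometheus_local_storage_memory_chunks[1h]) > 0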
beorn7 closed this Feb 18, 2016
Thank you all. We upgraded our EBS volume and its provisioned IOPS increased to 500 now. The server seems to be working well now. Thanks again.
lock bot commented Mar 24, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
meenupandey commented Feb 15, 2016
I am running the Prometheus server on a t2.large (2 vCPU, 8 GiB Mem). The Prometheus server runs out of memory within a few days. The longest it ran was for 8 days.
The Prometheus server, promdash, alertmanager and cloud-exporter are running as Docker containers.
The top command, run outside the container, shows:

    PID   USER  PR  NI  VIRT     RES     SHR   S  %CPU  %MEM  TIME+    COMMAND
    4159  root  20  0   6316428  4.552g  6752  S  23.3  58.3  1980:58  prometheus

Config flags: -storage.local.memory-chunks=1048576 -storage.local.max-chunks-to-persist=1048576 -storage.local.checkpoint-interval=3m0s
Here are some statistics:

    prometheus_local_storage_memory_series                       324159
    prometheus_local_storage_memory_chunks                       1048601
    prometheus_local_storage_ingested_samples_total              7237728330
    rate(prometheus_local_storage_ingested_samples_total[5m])    20954
    sum(rate(http_request_duration_microseconds_count{handler=~"query|metrics"}[5m]))    0.1762

We are monitoring 25 nodes using node exporter and also getting data from the application services running on those nodes.
Mem info: http://pastebin.com/L0ZLb2CD

Why is memory usage so high (4.552g)?
What should I do to make Prometheus more reliable?
Also, is this machine configuration enough to support all the data?