mmap: cannot allocate memory #4392
Comments
Thanks for your report. This message comes from the OS, so it can't be a bug in Prometheus. Another option would be to move the question to our user mailing list. If you haven't looked already, you might find your answer in the FAQ. If you think this is not purely a support question, feel free to comment here or in the dev mailing list. Once your questions have been answered, please add a link to the solution to help other Prometheans in trouble reaching this from a search.
krasi-georgiev closed this Jul 17, 2018
Thanks @krasi-georgiev. I couldn't solve this one; my solution in the end was just to delete the data directory and start again, which worked straight away. Luckily I've only been running it for a few months and it's not live yet, but I'll probably now look at federation.
marcelmay commented Jul 20, 2018
Just happened to me, too - on a large 256G memory machine, where Prometheus takes <12G of memory. There was >200G of free memory (buffers, cache) available, but Prometheus (v2.2.1 or the latest v2.3.2) refused to start. As noted above, the workaround was to create a new, empty data directory to get Prometheus up and running. It seems to affect more users too, e.g. #4168: mmap fails despite free memory.
@krasi-georgiev: When starting Prometheus, we noticed RES memory < 10G but VIRT memory skyrocketing.
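For anyone trying to confirm the same pattern, a minimal sketch for comparing resident vs. virtual memory of the running process (assuming the binary is named prometheus):

```sh
# VSZ = virtual size, RSS = resident size (both in KiB)
ps -C prometheus -o pid,vsz,rss,comm

# Refresh every 30 seconds to watch VIRT grow while RES stays low
watch -n 30 'ps -C prometheus -o pid,vsz,rss,comm'
```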
I'd also be interested in any performance tweaks that could help, but this instance is only monitoring a few hundred hosts. I've also noticed I can tell when it's going to crash, as the directories in the data directory start being created far more frequently than usual.
The host definitely isn't running out of RAM; it's a VM on an ESX cluster that isn't showing any health issues across the other VMs. I'll keep trying things, and I may have to build a server directly on tin to try and narrow down the issue...
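A quick way to spot that failure mode, sketched under the assumption that the storage path is /var/prometheus/data (adjust to your --storage.tsdb.path):

```sh
# Healthy servers cut a new ULID-named block directory roughly every 2h;
# a long run of directories with timestamps a minute apart signals the problem
ls -lt /var/prometheus/data | head -n 20
```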
Reopening this because it seems it was closed prematurely. Hitting an mmap limit like this is probably still a Prometheus problem, especially if the machines have plenty of available memory otherwise. Maybe the TSDB tries to mmap too many blocks at once at startup or something like that?
juliusv reopened this Jul 21, 2018
@marcelmay Have you had a look at http://mroonga.org/docs/faq/mmap_cannot_allocate_memory.html and increasing the vm.max_map_count kernel setting?
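For reference, a sketch of checking and raising that setting (the value below is purely illustrative):

```sh
# Show the current per-process limit on memory mappings (default is usually 65530)
sysctl vm.max_map_count

# Raise it temporarily, until the next reboot
sudo sysctl -w vm.max_map_count=262144
```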
marcelmay commented Jul 21, 2018
@juliusv: Thx for the hint. We tried that, starting at the default 65k and doubling it up to 512k. As Prometheus had worked nicely for at least 2 months, we stopped doubling it (we thought it would require only a small increase from the default). An interesting thing we noted was that the Prometheus startup bailed out at different mmapped files whenever we doubled the value, so it looks like the parameter has an impact. I will try increasing it further.
Previous research into this indicated it was a kernel problem, and we're not the only application to hit it.
If this has indeed been determined to be a kernel problem already, it's a good idea to link to the previously gained insights about the problem. Is it just about a kernel limit being set too low? There's been an equivalent issue that was also closed by just saying that the machine doesn't have enough memory, although reportedly there was enough memory available in that case too: #4168 (comment).
marcelmay commented Jul 24, 2018
Increasing the ulimit (-v) for virtual memory to a value greater than the data size did fix the issue for me. Running with the increased limit, Prometheus starts up fine. The previous virtual memory limit was smaller than the on-disk data, which matches Prometheus bailing out with the mmap error. From the setrlimit man page: exceeding the address-space limit (RLIMIT_AS) makes mmap fail with ENOMEM.
Unfortunately, Prometheus later ran into #4388 - but that's a different issue. Anyway, maybe Prometheus could print out a hint about the virtual memory limit (if someone else can confirm the fix) when mmap fails/when Prometheus notices that it reaches that limit? Final note: I have not reverted my earlier increase of vm.max_map_count.
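A sketch of how to inspect and lift that limit; the systemd unit name below is an assumption, adjust to however Prometheus is started on your host:

```sh
# Show the current address-space limit for this shell (in KiB, or "unlimited")
ulimit -v

# Lift it for processes started from this shell
ulimit -v unlimited

# For a systemd-managed Prometheus the equivalent knob is LimitAS in the unit:
#   [Service]
#   LimitAS=infinity
sudo systemctl edit prometheus.service
sudo systemctl restart prometheus.service
```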
Thanks a lot @marcelmay for the very detailed explanation. I agree that it would be good to print out the vm limit on startup like we do for fd limits.
Thanks for the explanations and advice, I will try the suggestions and get back to you @simonpasquier.
Even with the suggested changes applied, it crashed again. As before, removing the data directory allows it to start back up. I've been moving to Prometheus to get away from Icinga2, but I'm struggling now as a fresh VM with enough memory seems to corrupt my data every few days. I can see some mentions of TSDB work in 2.3.2; I could run an earlier version in one DC and 2.3.2 in the other to see if one outlives the other? I'm running out of things to try and I need this to go into production soon. I'd be interested to hear others' TSDB settings. My max block duration was 6h, but I've left that at the default now and it goes to 9d; here are my settings:
veox commented Jul 30, 2018
@iDemonix I have found that it may not be necessary to delete the entire data directory; removing "just" the block directories that are "too frequent" (as in your earlier comment) can be enough. I've done this on one "collector" machine, from inside the data directory, with:
find ./ -maxdepth 1 -type d -ctime 2 -exec sudo rm -rf {} \;
This left a couple of those "too frequent" block directories, but Prometheus started up again anyway.
(The below is not likely to be the same case as @iDemonix's.) There's nothing that monitors the machine other than itself (oops!..), and it seems that it was writing unusual amounts of data to disk. (I might've accidentally deleted a few blocks manually - oops, again.)
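If you try the same clean-up, a cautious variant (same assumptions: run from inside the data directory, Prometheus stopped first) is to list the candidates before deleting anything:

```sh
# Dry run: show which block directories the expression matches
find ./ -maxdepth 1 -type d -ctime 2 -print

# Only after reviewing the list above, actually remove them
find ./ -maxdepth 1 -type d -ctime 2 -exec sudo rm -rf {} \;
```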
Happened again overnight. I'm not sure whether to roll back a CentOS minor release or roll back a couple of Prometheus versions, as both have been upgraded in the last month, and I can't seem to find any way to run Prometheus for more than 48h now.
After running the find command from the comment above, I can start Prometheus again, albeit with missing data. Why does Prometheus start segregating data at 1-minute intervals, when the minimum block duration configured is 2 hours? Would it not be possible to throw an error or restart (some part of, or all of) the service? If I catch Prometheus doing this with the /data directory, I can stop the process, remove the broken files and restart without an issue; during all of this I don't run out of RAM, and I don't see an HDD being filled up rapidly (there's 480GB free).
Update: I usually just erase data/ and start again; this time I ran the above command, started back up, and I'm getting:
In general we recommend not to modify the min/max block duration settings. They're mostly used for benchmarking and testing. Having block directories that are created every minute or so seems to indicate a bug with the compaction. If the faulty server is still on v2.3.1, can you upgrade to v2.3.2? It includes several tsdb fixes that may be related to your issue.
Hi @simonpasquier, I'm running 2.3.2 and 2.3.1 on two separate servers now, with the same specs. I'm going to see which one lasts longer and will report back. The last few crashes have actually been from the 2.3.2 version, I believe (although I'm starting to lose track!).
@marcelmay @veox would you mind sharing your Linux flavour + kernel version? I'm trying to figure out what's changed. I ran Prometheus for 2-3 months without issue, but unfortunately a bad hardware failure lost that VM entirely. Since repeating the exact same build steps with the same config, I hit this mmap issue constantly. As I lost my original VM I can't check exact versions, but I believe there have been some CentOS 7 and Linux kernel updates since, so I'm starting to think about trying a kernel downgrade on a test box. My VMs are built by Puppet so they're identical in build; all I can think has changed is the Linux kernel version, and I may have been running Prometheus 2.2.1.
Some more logs below from my 2.3.2 install, which was creating repeated directories in the /var/prometheus data directory:
At the point of
@iDemonix as a double-check, can you verify the actual limits of the process by inspecting /proc/<pid>/limits?
IIUC, while persisting the head to an immutable block, tsdb fails to reload the blocks (because of the mmap error). A few seconds later, the same process happens: tsdb persists the head to a block, the reload fails, and now we have another almost identical block on disk. This doesn't explain why the mmap reading fails, but it at least explains why you have so many items in your data directory.
prometheus/vendor/github.com/prometheus/tsdb/db.go, lines 371 to 380 in 71af5e2
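A minimal way to run that check on the live process, assuming a single prometheus process on the host:

```sh
# Effective limits of the running process; the interesting rows are
# "Max address space" and "Max open files"
cat /proc/$(pgrep -x prometheus)/limits
```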
Hi Simon, here's the output:
Crashed again overnight and won't start back up; 10G of memory free and barely any metrics, but it won't start:
What I'm also a bit confused by: the duplicate directories it creates seem to be the same size from the faulty one onwards. Could it be this repeated writing of large files/dirs that makes it hit some sort of limit? I've now run out of things to try (the error occurs on both 2.3.2 and 2.3.1) apart from starting to drop kernel versions until it stops dying.
I also tried the above commands on both boxes, but no joy, even after persisting them and rebooting:
Does anyone have any other settings I could change? It looks like the latest Prometheus no longer runs on the latest CentOS 7 out of the box, even with plenty of RAM?
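For completeness, this is roughly what persisting such kernel settings across reboots looks like; the parameter and value are illustrative, not a recommendation:

```sh
# Put the override in a drop-in file so it survives reboots
echo 'vm.max_map_count = 262144' | sudo tee /etc/sysctl.d/99-prometheus.conf

# Apply it immediately without rebooting
sudo sysctl --system
```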
@iDemonix as I wrote in my earlier comment, the truncation process is creating the same block directory over and over because the reloading fails. This is a bug that needs to be fixed, but in the meantime, to get back to a normal state you can move the duplicate directories as well as the wal directory somewhere else, since they just amplify the mmap failure. It doesn't mean that the problem won't occur again, but it might give you some room. As you said, maybe a kernel update triggered the problem. I'm trying to reproduce it on my end.
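A sketch of that recovery step, assuming the storage path /var/prometheus/data; the ULID-style directory name is a placeholder for whichever blocks are duplicated in your logs:

```sh
# Stop Prometheus before touching the data directory
sudo systemctl stop prometheus

# Park the duplicated blocks and the WAL out of the way instead of deleting them
mkdir -p /var/prometheus/quarantine
mv /var/prometheus/data/01CKXXXXXXXXXXXXXXXXXXXXXX /var/prometheus/quarantine/   # repeat per duplicate block
mv /var/prometheus/data/wal /var/prometheus/quarantine/

sudo systemctl start prometheus
```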
Thanks @simonpasquier, I've bumped the kernel up to a 4.x release to try that and got the same thing after a couple of hours. I've since gone back to my original kernel version, and I'm running it in Docker as I need to start collecting some usable data - I will see if the Docker variant runs OK.
On my two identical boxes, I ran 2.3.0 overnight on one and 2.3.2 in a Docker container on the other. The 2.3.0 one is throwing mmap errors, but the Docker instance with a data volume doesn't appear to have any problems. It's not really a fix, but this is likely the route I'll now go down as it seems to be more stable.
I fail to reproduce the issue when setting a low virtual memory limit (ulimit -v). Can you share the following kernel parameters:
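For anyone else trying to reproduce, a sketch of that kind of test; the limit value and paths are illustrative:

```sh
# Start Prometheus under an artificially low address-space limit (here 2 GiB)
(ulimit -v 2097152; ./prometheus --config.file=prometheus.yml --storage.tsdb.path=./data)
```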
Hi @simonpasquier, requested output below.
My Prom1 box is just sat with a failed Prometheus that won't start anymore due to mmap issues; on my Prom2 box I installed Docker instead and so far, so good (using an external data volume). I'll rebuild the Prom1 box soon into a Docker host; if you'd like any more diagnostics over the weekend, let me know! I tried a few more vm.* and similar sysctl settings, but I couldn't get it to stop crashing, and as it can take anything from 1-12 hours for problems to start appearing, I've simply stuck with Docker for now. Thanks
veox commented Aug 31, 2018
Doesn't seem to show anything useful... I'm taking a different approach: inspecting the "broken" database on my workstation, which has much more memory. Currently, it's failing with "too many open files", so I'll "truncate" the database...
See https://www.robustperception.io/dealing-with-too-many-open-files for that.
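The usual remedy is raising the open-file limit for the process before starting it; a sketch, with an illustrative value:

```sh
# Check, then raise, the per-process open-file limit for this shell
ulimit -n
ulimit -n 65536

# For a systemd-managed Prometheus the equivalent is LimitNOFILE in the unit:
#   [Service]
#   LimitNOFILE=65536
```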
veox commented Aug 31, 2018
Yup, I know how to deal with that. :) Just wanted to give an interim result while I'm inspecting this...
Output of … Looks like just a …
veox commented Aug 31, 2018
BTW, here's the node-exporter view (of the occurrence today, not the earlier one in the gist above) - this time with memory. I had a thought that this might be happening due to the memory being already filled by disk cache, but that seems false. (Tried taking a screenshot of all the graphs, but Grafana has them in an …)
simonpasquier added kind/bug and component/local storage labels Sep 6, 2018
omegarus commented Oct 2, 2018
I don't know if it's relevant, but I had a similar problem. I changed the binary to amd64 and the problem disappeared.
@omegarus what do you mean exactly by "change binary to amd64"? What was the architecture before?
fusionswap commented Oct 11, 2018
I have been running into this issue now. ulimit -v unlimited and increasing vm.max_map_count don't help. Is there any workaround for this other than deleting the data directory? We are running on RHEL 7.5 (kernel 3.10.0-862.11.6.el7.x86_64). The Prometheus version is 2.1.0.
@fusionswap what is the error message you are getting?
@fusionswap can you upgrade to 2.4.3 and share the configuration + log files? Also …
fusionswap commented Oct 12, 2018
@krasi-georgiev it's the same error.
fusionswap commented Oct 12, 2018
I took a backup of the data directory and ran Prometheus 2.4.3 pointing it at this backup data directory. Config gist: https://gist.github.com/fusionswap/427b45be82de6a05d76fee8e8dcdc18a. I also ran it with strace. Also, note that I am seeing lots of directories created at 1-minute intervals, similar to @iDemonix. /proc details are below:
sh-4.2# cat /proc/sys/kernel/shmmni
sh-4.2# cat /proc/sys/kernel/shmmax
sh-4.2# cat /proc/sys/kernel/shmall
sh-4.2# cat /proc/meminfo
fusionswap commented Oct 18, 2018
I ran 2.4.3 with an empty data directory and I am again seeing this issue. It happens after 4 days.
After this compaction failed message, the block directories keep getting created at one-minute intervals.
fusionswap commented Oct 18, 2018
Well, this was a stupid mistake on our side. We were running linux-386 (32-bit) binaries on our x86-64 servers, so the process was limited to 4G of addressable memory. However, as stated by @simonpasquier, there is still a bug here: after certain failures compacting block directories due to memory allocation issues, Prometheus starts duplicating directories every minute until it stops running.
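An easy way to catch this mismatch; the binary path is an assumption, adjust to your install:

```sh
# Machine architecture vs. binary architecture
uname -m                          # expect x86_64 on a 64-bit server
file /usr/local/bin/prometheus    # should report "ELF 64-bit ... x86-64", not "ELF 32-bit ... 386"
```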
@fusionswap, thanks for the update, I will remember that one :)
@iDemonix, @veox is it possible that your issue is the same? A 32-bit binary on a 64-bit machine?
Just added the error to the wiki: https://github.com/prometheus/prometheus/wiki/FAQ#error-mmap-cannot-allocate-memory
@krasi-georgiev It is 100% possible, but I can't check now as the box was trashed and I went with a Dockerised install instead! I'm not sure why I would have got the wrong arch, but again, I could have accidentally pulled the wrong one! I'll try a rebuild at some point on the same CentOS release on a vanilla install to see if it happens again.
Since two people have confirmed, shall we close it and reopen if we get other reports?
I'm happy to close until I get a chance to re-test! If anyone confirms that using the correct binary fixes this, they can reopen. Thanks for the help.
iDemonix closed this Oct 18, 2018
veox commented Oct 19, 2018
@krasi-georgiev No, I'm running a 32-bit local build on a 32-bit machine.
It's OK for the issue to be closed as far as I'm concerned, because (apparently) it's expected that Prometheus should be allowed by the system to allocate as much memory as it wants. This is at the core of the issue, and a design choice that I'm not willing to argue. Especially since I'm not contributing any code that will help. :)
JIC someone else is running a limited-hardware set-up - my workaround:
So far (for ~2 months now) this set-up has "worked": i.e., no more loss of measurements because of the disk getting full, because there's no more thrashing the disk with identical uncompacted blocks, because there's no more failure to map memory while doing compaction.
@veox does it mean you upgraded the OS? One of your comments mentioned 32-bit? In summary:
So if you observe anything different, let me know and I will reopen to continue troubleshooting.
veox commented Oct 20, 2018
@krasi-georgiev Sorry; that means I was talking nonsense, or otherwise out of my mind. X_X I've edited the comment: running a 32-bit local build, on a 32-bit machine (as compared to 32-bit on 64-bit). An embarrassing brain-fart, no less.
Thank you for your patience. I'll open a PR if I have something concrete.
klaper commented Oct 28, 2018
I'm experiencing the same issue here.
The provided screenshots are from the standard Grafana dashboard offered to me upon Prometheus data source setup. I've hit this issue a couple of times, but that's the first time I caught it with Prometheus still working (it shuts down after a while, but I have no idea why). From the start, the Prometheus process uses up a bit more memory every 2 hours (the graph is interrupted due to a system shutdown - still, virtual memory usage got back to the same level):
And since then, every minute or so:
Also, then compaction messages started appearing:
Some relevant data:
And an strace of what happens when Prometheus writes those errors to the logs.
I am pretty sure that the problem occurs when the compaction expands the block, and it could probably be fixed with some workaround, but I personally don't think it is worth the trouble of supporting 32-bit systems. I would be interested to see what @fabxc and @gouthamve have to say about this.
klaper commented Oct 28, 2018
Stacktrace of a dying Prometheus. I guess it might be helpful, in case you decide it's worth fixing for 32-bit.
iDemonix commented Jul 17, 2018
I've been running Prometheus for a while now. I've returned after the weekend to find it has died and won't start back up; here are the logs:
The server has plenty of free capacity, both in terms of disk space and RAM/swap. If I temporarily move the block it complains about to /tmp, it just moves on to the next block and makes the same complaint. Any advice?
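For reference, the kind of test being described; the block directory name is a placeholder, and Prometheus should be stopped before moving anything:

```sh
# Move the block the log complains about out of the way, then retry the start
sudo systemctl stop prometheus
mv /var/prometheus/data/01CKXXXXXXXXXXXXXXXXXXXXXX /tmp/
sudo systemctl start prometheus

# Watch whether the error simply shifts to the next block
journalctl -u prometheus -f
```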