zfs seems to use more memory than it should #5035

Closed

haasn opened this issue Aug 28, 2016 · 28 comments
@haasn
Contributor

haasn commented Aug 28, 2016

I'm running into many memory-related issues since switching to ZFS, including instances where the OOM killer triggered despite plenty of free memory being available, and instances where programs fail due to out-of-memory conditions.

For an example of a failure, when trying to recreate my initramfs:

* Gentoo Linux Genkernel; Version 64
* Running with options: --install initramfs

* Using genkernel.conf from /etc/genkernel.conf
* Sourcing arch-specific config.sh from /usr/share/genkernel/arch/x86_64/config.sh ..
* Sourcing arch-specific modules_load from /usr/share/genkernel/arch/x86_64/modules_load ..

* Linux Kernel 4.7.2-hardened-gnu for x86_64...
* .. with config file /usr/share/genkernel/arch/x86_64/kernel-config
* busybox: >> Using cache
* initramfs: >> Initializing...
*         >> Appending base_layout cpio data...
*         >> Appending udev cpio data...
cp: cannot stat '/etc/modprobe.d/blacklist.conf': No such file or directory
* cannot copy /etc/modprobe.d/blacklist.conf from udev
cp: cannot stat '/lib/systemd/network/99-default.link': No such file or directory
* cannot copy /lib/systemd/network/99-default.link from udev
*         >> Appending auxilary cpio data...
*         >> Copying keymaps
*         >> Appending busybox cpio data...
*         >> Appending modules cpio data...
*         >> Appending zfs cpio data...
*         >> Including zpool.cache
*         >> Appending blkid cpio data...
*         >> Appending ld_so_conf cpio data...
* ldconfig: adding /sbin/ldconfig...
* ld.so.conf: adding /etc/ld.so.conf{.d/*,}...
cpio: lib64 not created: newer or same age version exists
cpio: lib64 not created: newer or same age version exists
cpio: lib64/ld-linux-x86-64.so.2 not created: newer or same age version exists
cpio: lib64/librt.so.1 not created: newer or same age version exists
cpio: lib64/libpthread.so.0 not created: newer or same age version exists
cpio: lib64/libuuid.so.1 not created: newer or same age version exists
cpio: lib64/libz.so.1 not created: newer or same age version exists
cpio: lib64/libblkid.so.1 not created: newer or same age version exists
cpio: lib64/libc.so.6 not created: newer or same age version exists
cpio: usr/lib64 not created: newer or same age version exists
cpio: lib64 not created: newer or same age version exists
cpio: lib64/ld-linux-x86-64.so.2 not created: newer or same age version exists
cpio: lib64/libuuid.so.1 not created: newer or same age version exists
cpio: lib64/libc.so.6 not created: newer or same age version exists
cpio: lib64/libblkid.so.1 not created: newer or same age version exists
*         >> Finalizing cpio...
*         >> Compressing cpio data (.xz)...
/usr/bin/xz: /var/tmp/genkernel/initramfs-4.7.2-hardened-gnu: Cannot allocate memory
* ERROR: Compression (/usr/bin/xz -e --check=none -z -f -9) failed

I've experienced this failure multiple times already. In every case, reducing the ARC size (i.e. temporarily reducing zfs_arc_max) solves it, as does e.g. echo 2 > /proc/sys/vm/drop_caches.
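
For reference, the workaround boils down to something like the following (8 GiB is only an example value; on a stock ZFS-on-Linux install the tunable lives under /sys/module/zfs/parameters and takes bytes):

# shrink the ARC cap at runtime (example: 8 GiB, in bytes)
echo $((8 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max
# drop reclaimable slab objects (2 = slab only, 3 = page cache + slab)
echo 2 > /proc/sys/vm/drop_caches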

At the time of this failure, my system reported about 80% of memory in use; after the echo 2 command described above, it went down to about 60%.

The weird thing is that I can't explain this high memory usage. This is my current output of arc_summary.py: https://0x0.st/MFr.txt

As you can see, ARC claims to be using about 11 GiB. Tallying together the top memory-hungry processes in top gives me about 3 GiB at most. There are no significant amounts of data in tmpfs either.

Together, this means that my system should be consuming about 11 + 3 = 14 GiB of memory, i.e. my usage should be 14/32 ≈ 43%, rather than 60%. Why does free -m report almost 19 GiB used? Where are the missing 5 GiB accounted for? I never had this weird issue before ZFS, nor have I ever “run out” of memory before ZFS.

I'm considering drastically reducing the ARC size as a temporary measure at least until this issue can be tracked down and fixed, wherever it comes from.

@kernelOfTruth
Contributor

referencing:

#2298 Document or distribute memory fragmentation workarounds in 0.6.3
#4953 External Fragmentation leading to Out of memory Condition
#466 Memory usage keeps going up

#3441 ABD: linear/scatter dual typed buffer for ARC (ver 2)

@ironMann
Contributor

@haasn xz -9 allocates a lot of memory, even by today's standards. Memory fragmentation can cause it to fail even if there are free pages around. You can get a sense of just how fragmented your memory is by looking at /proc/buddyinfo (you want large numbers on the right).
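
For anyone unfamiliar with the format: each column of /proc/buddyinfo counts free blocks of one order, from order 0 (single pages) on the left up to order 10 on the right. A quick way to label the columns by block size (assuming 4 KiB pages):

awk '/Normal/ { printf "%s %s %s:", $1, $2, $4;
                for (i = 5; i <= NF; i++) printf " %d x %dKiB", $i, 4 * 2^(i-5);
                printf "\n" }' /proc/buddyinfo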

@haasn
Contributor Author

haasn commented Aug 28, 2016

A small update: after setting the ARC size limit to 8 GiB, I rebooted. I'm now running a clean system with virtually no memory consumption from programs (about 700 MiB from the browser I'm typing this in, and basically nothing else).

I then read a bunch of data from disk (tar cf /dev/null) to fill up the ARC.

After doing this, it reports the current usage as 6 GiB (75% of the 8 GiB max), yet free -m considers my total used memory to be 10 GiB (32% used). Again I have about a 4 GiB deficit between what zfs claims to consume and what it actually consumes.

Note: As an experiment, I tried removing my L2ARC devices, because I read that zfs needs to store tables of some sort to support them. However, this did not affect memory usage at all. (Unless I need to reboot for the change to take effect?)
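
(Incidentally, the memory consumed by the L2ARC headers can be checked directly; the relevant arcstats counter is l2_hdr_size:)

grep '^l2_hdr_size' /proc/spl/kstat/zfs/arcstats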

You can find my current arc_summary.py output here: https://0x0.st/MFJ.txt

This is my current: /proc/buddyinfo:

Node 0, zone      DMA      1      1      0      1      1      1      1      0      1      1      3 
Node 0, zone    DMA32  25784   4161    216    389    146     78     32     15      1      1    257 
Node 0, zone   Normal 175248  31387   1710   2261    780    395    236    158     28      8   1959 
Node 1, zone   Normal  97675  18760    574    323    304    539    404    266     29     20   2618 

Also worth noting is that I have two pools imported, although the second pool has primarycache=metadata. Exporting the second pool did not affect memory usage.

After running echo 2 > /proc/sys/vm/drop_caches, my memory usage went down by about 2 GiB, to 8 GiB (26%), while arc_summary.py reports 4.5 GiB for the ARC size. My total usage is still consistently about 4 GiB higher than what zfs claims.

(To work around this temporarily, and since the number seems to be fairly constant, I'm going to subtract 4 GiB from my normal zfs_arc_max setting, giving it 16-4 = 12 GiB total.)

P.s. I forgot to mention, I am on kernel 4.7.2 and spl/zfs git master.

@haasn
Contributor Author

haasn commented Aug 28, 2016

I decided to re-investigate this after fixing #5036 to eliminate that as the cause. Additionally, I am now testing on a stock kernel (not hardened) to eliminate more potential issues.

Long story short: Problem persists, the difference between the actual and observed RAM is again almost exactly 4 GiB.

(1.7 GiB is the total sum of all resident+shared memory currently in use, 6.91 GiB is what ARC reports → totals to 8.61 GiB, but free reports 12.6 GiB in use)

@haasn
Contributor Author

haasn commented Aug 30, 2016

It seems like this memory usage is slowly growing over time, while the node 0 memory fragmentation also grew (according to /proc/buddyinfo). I would give you more details, but when I tried reducing the ARC size followed by echo 2 > /proc/sys/vm/drop_caches and echo 1 > /proc/sys/vm/compact_memory, my system hard-froze shortly thereafter. (Completely unresponsive to input and networking, didn't even respond to magic sysrq)

I slightly suspect that there may be some sort of fragmentation-inducing memory leak in some SPL/ZFS component on my machine, since I never had these problems while running btrfs on the same hardware, with the same kernel version, doing the same things.

@haasn
Contributor Author

haasn commented Aug 30, 2016

Further update: I had a look through /proc/spl/kmem/slab and noticed the allocation of many zio_buf/zio_data_buf slabs, about 3GB in total (and currently shrinking). (Full output here: https://0x0.st/uX8.txt)

With this extra data accounted for, I'm only “missing” about 2 GB currently, which probably has some other, similar explanation. It seems I was under the misguided assumption that ZFS would only use about as much RAM as I had configured for the ARC. Is it normal for ZFS to have several extra GB of slabs allocated for other purposes?

Perhaps worth noting: I am using SLAB instead of SLUB as my kernel slab allocator, because I have previously observed it performing better for me under certain workloads, but it might be worth re-evaluating that assumption for ZFS.

Edit: Unfortunately, it seems like the /proc/spl/kmem/slab output also includes objects associated with the ARC, so I was doing a bit of double-counting. (I'm also not entirely sure which size fields to go by; the output of this file seems a bit cryptic.) So my above reasoning is probably invalid.
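
(Side note for anyone doing the same accounting: the kernel's own slab totals can be read from /proc/meminfo, though as far as I understand the SPL caches backed by its own vmalloc-based slab do not show up there, which is part of what makes the numbers hard to reconcile:)

grep -E '^(Slab|SReclaimable|SUnreclaim|VmallocUsed):' /proc/meminfo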

@kernelOfTruth
Contributor

referencing:

https://utcc.utoronto.ca/~cks/space/blog/linux/ZFSonLinuxMemoryWhere
Where your memory can be going with ZFS on Linux

https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs-discuss/tXHQPBE6uHg Where ZFS on Linux's memory usage goes and gets reported (I think)

@haasn
Contributor Author

haasn commented Aug 30, 2016

Interesting, that first article in particular pretty much completely answers all of the doubts and questions I had, and also helps me understand why ZFS is causing me so many out-of-memory style conditions when I have other programs running at the same time (i.e. I can't entirely dedicate my RAM to ARC and SPL slab objects the way the defaults seem to be tuned for).

I'm considering this issue resolved unless I run into more troubles. Thanks for the pointers.

@haasn haasn closed this as completed Aug 30, 2016
@behlendorf
Contributor

For reference, PR #5009, which is being actively worked on, is a big step towards addressing these issues.

@kernelOfTruth
Contributor

@haasn take a look at: http://list.zfsonlinux.org/pipermail/zfs-discuss/2013-November/012336.html

Disabling transparent hugepages might also lessen memory fragmentation and consumption.
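
(i.e. something along the lines of:)

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag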

@haasn
Contributor Author

haasn commented Sep 1, 2016

@kernelOfTruth I gave it a try. (I also got around to setting up monitoring/graphs so I could observe this over time)

[graph: memory fragmentation]

I tried out your suggestion by using echo never > /sys/kernel/mm/transparent_hugepage/enabled when I read it, which was shortly after 20:40 local time. As you can see, there is a dramatic drop-off in the number of free pages (i.e. increase in fragmentation) corresponding almost exactly with the change.

Just now (at around 4:00 local time) I saw this graph and decided to re-enable transparent_hugepage/enabled=always, as well as transparent_hugepage/defrag=always for good measure. (Since I wasn't sure whether just enabling transparent_hugepage would have changed anything)

As you can see, memory fragmentation immediately went down rather dramatically, at least for smaller fragments, despite practically no change in the amount of consumed memory (nor in its distribution). The number of free pages for large chunks still seems rather low compared to a fresh boot, but it's still higher than it was before.

@haasn
Contributor Author

haasn commented Sep 1, 2016

Update: Seems to have been a fluke caused by switching the setting more than anything. Not an hour after enabling hugepages again, memory fragmentation has gotten even worse than before (free page count dropped to basically nothing).

If I had to guess, I think what I'm seeing would be explained by changes to this variable only taking effect for new allocations, rather than existing ones - and enabling defragmentation caused a spike in the available free pages due to defragmenting all of the existing ones.

Edit: That being said, I disabled it again and not much later my free page count skyrocketed again, so now I'm not really sure. These values are probably pretty unreliable at the moment either way, since I'm rsyncing some data off old disks. I'll comment again when I can provide more concrete data.

Edit 2: Confirmed that it was the rsync causing heavy memory pressure and thereby increasing my fragmentation; I cross-checked this by looking at the overall memory usage and noticing that nearly all available memory was being used for internal caches. It seems that the “memory fragmentation” graph really only considers free memory, rather than available memory (which is somewhat odd IMO, but oh well). Everything's fine again now.

@haasn
Contributor Author

haasn commented Sep 14, 2016

It seems this problem won't leave me alone. On my machine right now, 75% of my RAM is being used. (It was at 90% before I reduced my ARC size)

The current ARC size is 8 GiB (25%). Another 3 GiB or so is applications, and another 3 GiB is in tmpfs. This memory (which I can directly account for) adds up to 14 GiB (43%), leaving 10 GiB of memory in use by zfs's various slabs.

Any tips on how I can track down why exactly they are being used, and ideally, limit them? I wonder what would have happened if I had less available RAM to begin with. Would zfs have exploded, or would it have self-limited? If the latter, can I do this manually?

@kernelOfTruth
Contributor

kernelOfTruth commented Sep 14, 2016

@haasn is it possible for you to test out current master?

I've pre-set 10 GB for the ARC, but the following is currently in use while transferring 2.5 TB of data from one ZFS pool to another (via rsync):

p                               4    5368709120
c                               4    10737418240
c_min                           4    1073741824
c_max                           4    10737418240
size                            4    3173034432

meaning it hovered between 2.5 and 3.7 GB; when adding the other memory consumption it is probably still less than 10 GB, so either the compressed ARC makes usage really efficient and/or it is now able to stick much more closely to the preset limits.
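
(Those values are read straight from the ARC kstats; to make the same comparison on your own machine, something like this works:)

awk '$1 ~ /^(p|c|c_min|c_max|size)$/' /proc/spl/kstat/zfs/arcstats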

Since you mentioned rsync, linking https://github.com/Feh/nocache and http://insights.oetiker.ch/linux/fadvise/ here.

@haasn
Contributor Author

haasn commented Sep 14, 2016

@kernelOfTruth I can upgrade. Right now I'm on commit 9907cc1 (and openzfs/spl@aeb9baa); is there a commit in particular that you think will help?

One thing I noticed while inspecting /proc/spl/kmem/slab is that the zio_data_buf_131072 usage is extremely high: https://0x0.st/SOb.txt (6 GiB in total). Also, everything about this is 0. At first I thought that was because of spl_kmem_cache_slab_limit, but it seems like that is set to 16384 - not 163840 (off by a factor of 10).

Edit: I just realized, spl_kmem_cache_slab_limit is a lower limit, not an upper limit.
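
(For anyone checking their own setting: the value is exposed as a module parameter, e.g. on a stock install:)

cat /sys/module/spl/parameters/spl_kmem_cache_slab_limit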

@kernelOfTruth
Contributor

kernelOfTruth commented Sep 14, 2016

@haasn the changes since September 7th, specifically:
#5078 Compressed ARC, Compressed Send/Recv, ARC refactoring (changes from yesterday, September 13th)

For spl, basically tag 0.7.0-rc1 (September 7th).

Make sure to have recent backups of your data (just in case, which is actually always a good idea & practice)

@kernelOfTruth
Contributor

kernelOfTruth commented Sep 14, 2016

As an example:

current arc stats (after several hundred GB of data transferred)

p                               4    5368709120
c                               4    10737418240
c_min                           4    1073741824
c_max                           4    10737418240
size                            4    3332885640
compressed_size                 4    1000249856
uncompressed_size               4    4317090816
overhead_size                   4    794254848
hdr_size                        4    120223344
data_size                       4    113368576
metadata_size                   4    1681136128
dbuf_size                       4    304738112
dnode_size                      4    807605400
bonus_size                      4    305814080
anon_size                       4    2877952

plus

1147872 1033816  90%    0.50K  35871       32    573936K zio_buf_512
1125514 994428  88%    0.30K  43289   26    346312K dmu_buf_impl_t
1082484 960373  88%    0.82K  27756   39    888192K dnode_t
939872 937707  99%    0.99K  29371   32    939872K zfs_znode_cache
939455 937174  99%    0.27K  32395   29    259160K sa_cache
774976 246379  31%    0.06K  12109   64     48436K kmalloc-64
707994 689363  97%    0.19K  33714   21    134856K dentry
370944 345673  93%    0.32K  15456   24    123648K arc_buf_hdr_t_full
219648  87379  39%    0.01K    429  512  1716K kmalloc-8
188118 142108  75%    0.09K   4479   42     17916K kmalloc-96
164240 156315  95%    4.00K  20530        8    656960K zio_buf_4096
146304 127594  87%    0.03K   1143  128  4572K kmalloc-32
 78387  42943  54%    0.08K   1537   51  6148K arc_buf_t
 49344  41727  84%    0.06K    771   64  3084K range_seg_cache
 40256  40256 100%    0.12K   1184   34  4736K kernfs_node_cache
 39936  39139  98%    1.00K   1248   32     39936K zio_buf_1024
 38690  38538  99%   16.00K  19345        2    619040K zio_buf_16384
 35272  35241  99%    8.00K   8818        4    282176K kmalloc-8192
 21632  20889  96%    0.03K    169  128   676K nvidia_pte_cache
 20820  20317  97%    2.50K   1735   12     55520K zio_buf_2560

which adds around 2-3 GB to the 3 GB already accounted for in arcstats; that comes close to 6 GB in total, still significantly lower than 10 GB (the set limit)

@kernelOfTruth
Contributor

kernelOfTruth commented Sep 18, 2016

@haasn it seems that the conservative memory usage after the compressed ARC patches was a regression rather than a fundamental change in behavior:

#5128 Poor cache performance
#5129 Fix arc_adjust_meta_balanced()

So the only workaround right now is to set the ARC size lower than what you actually want it to occupy, e.g. at approx. 40% of RAM when you want it to occupy 50%.
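
(A rough sketch of that workaround, with the 40% figure from the rule of thumb above; MemTotal is in kB and zfs_arc_max takes bytes:)

# cap the ARC at ~40% of physical RAM
total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
echo $(( total_kb * 1024 * 40 / 100 )) > /sys/module/zfs/parameters/zfs_arc_max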

@haasn
Contributor Author

haasn commented Oct 14, 2016

The netdata graphs shed new light on the situation:

[netdata memory graph]

According to this graph, which does not seem to be displaying my ARC size (16 GB at the time of writing), I have 9-10 GB of memory spent on unreclaimable slab objects. What are these 10 GB currently doing? Is there any way I can introspect this figure further?
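
(One way to break that figure down further, at least for the caches the kernel itself accounts for; /proc/spl/kmem/slab additionally covers the SPL-internal, vmalloc-backed caches:)

grep -E '^(Slab|SUnreclaim):' /proc/meminfo    # totals as the kernel sees them
slabtop -o -s c | head -n 20                   # largest caches, sorted by total size
cat /proc/spl/kmem/slab                        # SPL/ZFS cache accounting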

@meteozond

Hi @haasn, we encountered the same problem on several servers. Did you manage to find a solution?

@petermaloney

@meteozond here is my solution to the problem: http://brockmann-consult.de/peter2/bc-zfsonlinux-memorymanagement2.tgz

It will loop forever, keep tuning, and drop caches if used RAM gets too high.

Both of my large 36-disk ZoL machines hang if I don't run this. I originally wrote it in bash years ago and recently redid it in python3 to fix the float handling and exceptions.

@lechup

lechup commented Jul 17, 2017

@meteozond @haasn any update on this problem? I use nmon to get info about current slab usage.

When I run echo 3 > /proc/sys/vm/drop_caches on a system where slab is around 6 GB, it reduces it to 1.6 GB almost instantaneously; after 15 minutes it even dropped further to ~400 MB.

@petermaloney is your script doing anything fancier than that?

PS: I'm on Ubuntu 16.04 + zfsutils-linux 0.6.5.6-0ubuntu17

@petermaloney

petermaloney commented Jul 17, 2017

@lechup The source is there for you to read what it does (not sure if you need to know python to understand it). Yes, the script is fancier than that. What it does is constantly manage the zfs_arc_meta_limit and zfs_arc_max module parameters to keep the total system used RAM within a specified range, like 89-93% used, which is the default I set (works well for me, but it's configurable).

I found that setting those module parameters just once doesn't work well... a low value might still end up using all your RAM, and a not-so-high value might not use enough RAM for best performance. Or a value that works well sometimes might not work well at other times. So this script keeps your free RAM around 10%. The machine I originally wrote it for was very slow if it didn't use enough RAM, so it was very important to use lots of RAM when available, and this strategy was very effective.

The script will also use drop_caches (and zfs set primarycache=metadata) if it panics because setting the lowest value wasn't enough to keep RAM low, and then it sets primarycache=all again later. Even if it runs fine for weeks, there's still a chance that zfsonlinux eats way more RAM than usual for a short time, causing this to happen.
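
Roughly, the idea looks like this in shell form (only a sketch of the feedback loop, not the actual script; the thresholds, step size and poll interval are illustrative, and it assumes zfs_arc_max has already been set to an explicit value):

#!/bin/bash
# Sketch only: keep used RAM inside a target band by stepping zfs_arc_max up/down.
min_arc=$((1 * 1024 * 1024 * 1024))   # never shrink the cap below 1 GiB
step=$((1 * 1024 * 1024 * 1024))      # adjust in 1 GiB steps
while true; do
    total=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    avail=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo)
    used_pct=$(( (total - avail) * 100 / total ))
    arc_max=$(cat /sys/module/zfs/parameters/zfs_arc_max)
    if [ "$used_pct" -gt 93 ] && [ "$arc_max" -gt "$min_arc" ]; then
        echo $(( arc_max - step )) > /sys/module/zfs/parameters/zfs_arc_max
        echo 2 > /proc/sys/vm/drop_caches   # emergency pressure relief
    elif [ "$used_pct" -lt 89 ]; then
        echo $(( arc_max + step )) > /sys/module/zfs/parameters/zfs_arc_max
    fi
    sleep 30
done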

@lechup

lechup commented Jul 17, 2017

@petermaloney thanks for sharing your code and explaining how it works - I'll give it a shot!

@gdevenyi
Contributor

gdevenyi commented Dec 1, 2018

@petermaloney your link is dead; do you have it on GitHub somewhere?

@petermaloney

@gdevenyi the path changed slightly https://www.brockmann-consult.de/peter2/zfs/bc-zfsonlinux-memorymanagement2.tgz

And BTW, there's a hang bug in the zfs version shipped with Ubuntu 16.04, which I think might not get triggered if you lower the meta limit (my script sets it very generously)... (see 25458cb). So to maybe prevent that (still testing...), patch the script like:

-meta_limit = int((limit_gb-2)*1024*1024*1024)
+meta_limit = int((limit_gb*2/3)*1024*1024*1024)

@gdevenyi
Contributor

gdevenyi commented Dec 1, 2018

Thanks @petermaloney. I'm still on 0.6.5.11. I just started having this exploding memory usage and OOMs on a fileserver that had been running fine for years. Right now I'm seeing huge zfs slabs for unknown reasons. The only thing that's saved me is greatly relaxing arc_min so that the ARC can shrink. Interestingly, your script does absolutely nothing for me. I guess I'll wait another month to see if the 0.7.x series is finally stable, since new features are only being added to 0.8.x now.
