Swap deadlock in 0.7.9 #7734
@jgallag88 and I have been observing this as well. Here's our bug report with a simple reproducer:

System information

Describe the problem you're observing

With … We also tested this situation with a … This is the memory usage on one of our typical VMs:

Reproducing the problem

Configure a …
Can reproduce this as well on:

System information

The swap zvol was created as above, with `-o compression=zle -o secondarycache=none` as additional parameters. I could reproduce the issue with plain Debian Stretch: kernel 4.9.110-3+deb9u4 and ZFS 0.7.9 (from stretch-backports). Please let me know if/how I can help in hunting this down/fixing this.

Console output w/ hung tasks on 4.15.18-4 (ZFS 0.7.9):
If possible, could you dump the stack from the …
Managed to get the following by running …
I decided to give this a try and, as advertised, the deadlock is quite easy to reproduce.

First, on a 4.14 kernel, it's interesting to note that a bunch of "page allocation stalls for ..." messages are generated. That particular message was removed in later kernels because it was determined that merely trying to print it could, itself, cause deadlocks.

On a 4.16 kernel, which does not have the message, the processes are in quite a few different states. Among other things, an interesting observation is that the "zvol" tasks are generally in the running (R) state, while most of the regular user processes are in various states of dormancy (D). This sshd process is typical ("?" lines elided for brevity):
Other dormant user processes, such as agetty, etc., all have pretty much identical stacks. As far as ZFS is concerned, in the currently deadlocked system I've got running, the only blocked process is "txg_quiesce", with a rather uninteresting stack of:

which is in the … The most interesting ZFS-related thing is that all the …
I'm going to try a couple of things: first, set …
Naively, I don't see how swap can be expected to work as things are; I think we should document/label it as such. @behlendorf given the amount of code complexity and the allocations potentially required, is supporting swap even realistic?
@cwedgwood It's definitely been a challenging area to get working, and it is not heavily stress-tested on a wide range of kernels. It was working reliably with older ZFS releases and older kernels but, as you say, given the code complexity and the surface area we'd need to test, it's not something I'd recommend. I tend to agree we should make this clearer in the documentation.
I am using this (swap on zfs) on several production systems (home and work) with recent kernels. Thankfully it has not been an issue for me yet.
Same issue on Gentoo with ZFS 0.8.0-rc1 and native encryption.
@inpos Out of curiosity, how did you get 0.8.0-rc1 on Gentoo? I have not finished my review of it, so I haven't pushed it to the main tree yet. I might just push it unkeyworded if people would otherwise resort to building outside the package manager.
@behlendorf This was never 100% sane, although the return of the zvol threads certainly did not help. I started (but did not finish) writing a patch that could help with this if finished: https://paste.pound-python.org/show/KWlvrHdBU2mA9ev2odXL/

At this point, it is fairly clear to me that my offline obligations prevent me from speculating on if/when I will finish it, but other developers should be able to see the idea there. It ought to help if/when finished. The current version will NOT compile, so I would prefer that users not attempt it; trying to compile it would be a waste of their time.

Alternatively, there is a nuclear option for resolving this. If we treat swap the way illumos treats dump devices, by disabling CoW and checksums, things will definitely work, at the expense of protection against bitrot on the swap. This requires modifications to the code, because the current codebase is not able to support a zvol in that configuration. Quite honestly, if we were going for the nuclear option, I'd prefer to create a new "dataset" type and implement extent-based allocation.

@ahrens Not that I am seriously considering this, but it would be nice to hear your thoughts on the nuclear option for dealing with swap.
On second thought, given that swap devices are simultaneously dump devices on Linux, it might make sense to implement support for creating/writing to illumos dump devices. That would probably be a good middle ground here.
I'm not an expert in this area of the code, but I think that swap on a ZVOL is inherently unreliable, due to writes to the swap ZVOL having to go through the normal TXG sync and ZIO write paths, which by design can require lots of memory allocations (and these allocations can stall in a low-memory situation). I believe this to be true for swap on a ZVOL on illumos as well as Linux, and presumably FreeBSD too (although I have no experience using it on FreeBSD, so I could be wrong).

I think the "proper" way to address this is to mimic the write path of a ZVOL dump device on illumos for the ZVOL swap device: i.e. preallocate the ZVOL's blocks on disk, then do a "direct write" to the preallocated LBA such that it doesn't go through the normal TXG sync and ZIO write code paths. This would significantly reduce the amount of work (e.g. memory allocations) required to write a swap page to disk, thereby increasing the reliability of that write. The drawback to this approach is that we won't get the data consistency guarantees we normally get with ZFS, e.g. data checksums. I think this is a reasonable tradeoff (i.e. swap zvol + no hangs + no checksums vs. swap zvol + hangs + checksums), given that other Linux filesystems are no better (right?).

I spent a couple of hours reading the ZVOL dump device write code paths during the OpenZFS dev summit last week, and I think this approach is viable. It'll require porting the code over to work on Linux, where we'll need to rework it to use the Linux block device layer, since (IIRC) that's what's used to issue the writes to disk (see …).

With all that said, I've only spent an hour or two looking into this, and I don't have any prior experience with swap, so I may be overlooking something or be flat-out wrong in my analysis so far. Also, I won't have any time in the near future to look into this in detail, but I could potentially help answer questions if somebody else wants to try to prototype the changes I'm proposing.
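The preallocate-then-direct-write scheme described above can be modeled in a few lines of userspace C. This is only a toy sketch under stated assumptions, not ZFS code, and every name in it is made up: an array stands in for the vdev, and the extent map stands in for the block pointers written at swapon time. The point it illustrates is that once the extents exist, the write path is a table lookup plus a copy, with no allocations at all.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ    4096
#define SWAP_PAGES 8

/* Toy model: every "disk" block backing the swap zvol is reserved up
 * front, so writing a swap page never has to allocate memory or enter
 * anything like the TXG sync / ZIO pipeline. */
static uint8_t disk[SWAP_PAGES * PAGE_SZ]; /* stands in for the vdev  */
static size_t  lba_of[SWAP_PAGES];         /* extent map, built once  */

static void swap_preallocate(void)
{
    /* In the real proposal, this pass would allocate and write the
     * zvol's block pointers once, at swapon time. */
    for (size_t i = 0; i < SWAP_PAGES; i++)
        lba_of[i] = i * PAGE_SZ;
}

/* Write path: must not allocate memory. */
static void swap_write(size_t page, const uint8_t *buf)
{
    memcpy(&disk[lba_of[page]], buf, PAGE_SZ);
}

static void swap_read(size_t page, uint8_t *buf)
{
    memcpy(buf, &disk[lba_of[page]], PAGE_SZ);
}
```

The tradeoff the comment describes falls out directly: because nothing but the data block is touched on the write path, nothing (checksums, CoW) can be updated there either.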
I took a look at mkswap.c and, aside from writing a unique signature to the device, it doesn't do anything which would let ZFS easily identify it as a swap device. One solution would be to add a udev helper which detects ZFS volumes with the swap signature and calls …
@behlendorf that's unfortunate; given there's no good entry point to do the preallocation, your two suggestions seem reasonable at first glance. Using a new …
Maybe this wiki page https://github.com/zfsonlinux/pkg-zfs/wiki/HOWTO-use-a-zvol-as-a-swap-device should be edited to warn about this issue?
I took the liberty of amending the wiki (I've been affected by this as well). It's quite reproducible in v0.6.5.9-5~bpo8+1 running on 3.16.0-7-amd64 (Debian 3.16.59-1), so I look forward to an eventual fix. I've found no sustainable workaround (other than to stop using zvols for swap).
Thanks @MobyGamer. I'm running ZFS 0.7.12-1~bpo9+1 on Debian Stretch (kernel 4.19.12-1~bpo9+1) on a system with 2 GB of RAM, and I experienced this issue. I had followed the Debian Stretch Root on ZFS guide and was led to believe I should use swap on ZFS. Searching for the cause, there was really nothing in the wikis about potential downsides to doing it, and the only "against" I managed to find was this issue. It's good that there is at least some documentation regarding it now.
@mafredri I updated the wiki. Indeed, we didn't find problems with swap on the ZFS 0.7 branch for a while.
FYI, this issue is what made us switch from a swap zvol to a dedicated partition at the last minute, after users' feedback and our own experience with our installer on Ubuntu 19.10 (https://bugs.launchpad.net/bugs/1847628). This is quite an annoying issue: for ext4 we have been using a swapfile for quite a long time, to avoid too much partitioning on users' machines, and we would like our default ZFS experience to be similar.
With multi_vdev_crash_dump it should be possible to swap to a preallocated zvol or file, with the writes bypassing ZFS completely, right? But it is not implemented yet.
@didrocks I gave some information on a possible approach to address this long term in my comment above. At this point, I think it's accepted that using a ZVOL for swap will result in instabilities, but we haven't had anybody with the time or motivation to step up and attempt to fix it. I think the lack of motivation stems at least partially from the fact that (I assume) most folks using ZFS on Linux are not using it for the root filesystem (I use it personally, and so does my employer, so I don't mean to say nobody uses root on ZFS). Now that Ubuntu is officially supporting a root-filesystem-on-ZFS configuration, perhaps that will change, both in terms of more users of root on ZFS, and in terms of more developers with sufficient motivation to fix this issue.
Not that my 2¢ means much as a non-contributor, but I'd personally really prefer to see a design where at least checksums remain functional, so the possibility of end-to-end integrity can be preserved. Can anyone speak to the amount of complexity/effort that that single feature would add to this work? I know illumos dump devices were mentioned above. I am rather unfamiliar with illumos, but I thought dump and swap devices were not the same thing, and that illumos actually did use zvols for swap. Am I incorrect, or were the above comparisons not quite valid?
I think we'd all prefer this, but I don't think it's currently possible if we write directly to pre-allocated blocks, due to the checksums being stored in the indirect blocks, and the indirect blocks being written during pre-allocation rather than when we write out the swap data.
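The point about indirect blocks can be made concrete with a toy model (all names here are invented for illustration; the checksum is a stand-in, not ZFS's fletcher4): the block pointer holding the checksum is "written" once at preallocation time, so a later direct overwrite of the data block leaves a stale checksum behind.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLK 16

/* Stand-in for a ZFS block pointer stored in an indirect block. */
struct blkptr { uint32_t checksum; };

/* Toy rolling checksum, just for the demonstration. */
static uint32_t toy_checksum(const uint8_t *b)
{
    uint32_t s = 0;
    for (int i = 0; i < BLK; i++)
        s = s * 31 + b[i];
    return s;
}

static uint8_t data[BLK];
static struct blkptr indirect;   /* written once, at preallocation time */

static void preallocate(void)
{
    indirect.checksum = toy_checksum(data);
}

static void direct_write(const uint8_t *buf)
{
    /* Bypasses the write pipeline: the indirect block holding the
     * checksum is NOT rewritten, so it goes stale. */
    memcpy(data, buf, BLK);
}

static int verify(void)
{
    return indirect.checksum == toy_checksum(data);
}
```

After one `direct_write()` of new data, `verify()` fails, which is exactly why the direct-write approach gives up checksums unless the indirect blocks are rewritten on every write.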
Yes, we used ZVOLs for swap on our illumos-based appliance, and we often saw (likely the same) deadlocks for workloads that tried to actually use the swap space.
Yes, excellent. Compare Linux swap on bcachefs vs OpenZFS. Hopefully the ZFS devs read this. Thanks.
Note that there are already some ideas in this thread for how to implement this. For instance: #7734 (comment) and #7734 (comment) (looks like the doc is set to private now: @ahrens should we put this info somewhere else?). The problem is finding volunteers who'd be willing to implement and test it :).
@pzakha I've updated the URL in the previous comment to link to a publicly-readable version: https://docs.google.com/document/d/14KM0A8tlmvGyqF2FOqsaWCS3ACaNBBRQMJcgCkZ9cjM/edit?usp=sharing While I'm here I'll also point to my previous thoughts on the subject: #7734 (comment)
I use redundant zvol swap on FreeBSD 13.0 and haven't seen issues yet, but based on reading threads such as this one, I do try to carefully manage the systems to ensure they don't exhaust memory. (Before using ZFS swap, I had multiple crashes/lockups when init, ssh, virtual consoles, database servers, etc. were all swapped out to standard FreeBSD swap partitions on spinning disks which failed - so far ZFS swap has been more stable thanks to the redundancy.)
Semi-related: from testing zvol storage, I've noticed zvols performing at less than half the speed of a dataset + qcow2 file (121 MB/s vs 254 MB/s). I don't know if this would be an issue with zvol-based swap.
Was curious about this, given that file(1) does identify a swap device:
https://github.com/file/file/blob/d17d8e9ff8ad8e95fdf66239ccdcc2133d1ce5ce/magic/Magdir/linux#L84
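The signature that file(1) (and the udev-helper idea above) can key on is the mkswap magic: the ASCII string `SWAPSPACE2` (or `SWAP-SPACE` for the old v0 format) in the last 10 bytes of the first page of the device. A minimal detection sketch, assuming a fixed 4 KiB page for illustration (real code would use the runtime page size, and the function name here is made up):

```c
#include <assert.h>
#include <string.h>

#define PAGE_SZ 4096

/* mkswap writes its magic in the last 10 bytes of the first page:
 * "SWAPSPACE2" for the v1 header, "SWAP-SPACE" for the old v0 one. */
static int is_swap_signature(const unsigned char *first_page)
{
    const char *magic = (const char *)first_page + PAGE_SZ - 10;
    return memcmp(magic, "SWAPSPACE2", 10) == 0 ||
           memcmp(magic, "SWAP-SPACE", 10) == 0;
}
```

A helper would read the first page of each zvol's block device and apply this check, which is essentially what the linked magic(5) entry does at offset 4086.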
I'd be interested to see how Oracle Solaris handles this. They appear to support swap on ZVOLs on both SPARC and x86. I'm sure the Oracle Solaris ZFS code and the OpenZFS code have massively diverged since Solaris was closed back up, but they would've had to deal with the exact same problems. Also, has anyone noticed whether size seems to affect whether swap on a ZVOL locks up? I haven't had any issues running swap on a ZVOL on my laptop, but I only have 16 GB of RAM, 8 GB of swap, relatively light workloads (memory-wise, anyway), and on the occasions I do swap, it's not a huge amount.
The only solution I can think of is to not require memory allocations during writing.
That would be doable. Oracle Solaris does not differentiate a ZVOL used for swap from one used for any other purpose, which could have performance implications. Perhaps having a separate ZVOL type just for swap, which pre-allocates memory as you suggest, would be the best solution?
It should be possible to differentiate between the two based on how they are used, without requiring any manual work on the sysadmin's part.
Thanks to everyone for the good progress in thinking about solutions. Hope the ZFS devs see your ideas.
As was mentioned above, "a separate ZVOL type just for swaps, which pre-allocates memory". Sounds good.
Perhaps. I'd rather just have an option toggle, like "-pamem", which tells ZFS to pre-allocate memory for the ZVOL. It wouldn't require any extra work on the part of the SA, other than setting the option when creating the ZVOL.
Yes. Can have option toggles. Good. Thanks.
It would be nice to have a single thread preallocating some static memory, protected from being reclaimed by the Linux kernel's overcommit optimizations, plus a flag that marks the ZVOL as the system's swap device. That way overall performance isn't hurt by this "workaround", but the device is guaranteed to be able to accept writes without having to ask for more memory, since the memory is bound to this thread. Hey @behlendorf, maybe you can have a look at this if you get time? Seems rather important! :) It doesn't even have to "properly" write the data out; just dumping it into a reserved section of the disk would be fine, I guess. Then Linux can reclaim the memory, and other ZFS threads can read it back and write it out "properly".
Another option may be an emergency mode for the ARC: if there's high write demand on this specific device, the ARC in memory gets wiped and used to buffer it, while the thread also starts writing the data out properly to disk. Since there's a reserved minimum for the ARC anyway, in a low-memory situation it's better to get rid of some pressure right away, and this doesn't depend so heavily on the disk I/O being good. Not sure which is easier to implement.
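The preallocated-reserve idea in the two comments above is essentially the guarantee the kernel's mempool API provides. A minimal userspace model (hypothetical names, not ZFS code): the reserve is filled once at init, and from then on the swap path can only take and return buffers, never call the allocator.

```c
#include <assert.h>
#include <stdlib.h>

#define RESERVE_BUFS 4
#define BUF_SZ       4096

/* Fixed reserve, filled once at init.  reserve_get()/reserve_put()
 * never touch malloc, so they cannot stall under memory pressure. */
static void *reserve[RESERVE_BUFS];
static int   navail;

static int reserve_init(void)
{
    for (navail = 0; navail < RESERVE_BUFS; navail++)
        if ((reserve[navail] = malloc(BUF_SZ)) == NULL)
            return -1;        /* fail at setup time, not at swap time */
    return 0;
}

static void *reserve_get(void)    /* NULL means "caller must wait" */
{
    return navail > 0 ? reserve[--navail] : NULL;
}

static void reserve_put(void *buf)
{
    reserve[navail++] = buf;
}
```

When the reserve is exhausted, a caller waits for a buffer to come back rather than allocating; that backpressure, not more memory, is what keeps the write path deadlock-free.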
I just installed Ubuntu using ZFS, as it seemed the most recent option, and started experiencing lockups during heavy usage that seem linked to swap usage (in my case, a swap volume on my ZFS filesystem). Four years after this was opened, it seems unlikely it will ever be fixed. Considering I cannot resize my ZFS pool (it takes the full disk) to make space for a normal swap partition, and I need swap for some heavy processes, what are my alternatives/workarounds, apart from reinstalling everything to ditch ZFS and adding another voice telling Ubuntu to take it out of the setup?
You should look into zram-generator. This is what I am using with excellent results.
Could you provide some more detail? I'm not sure I get how to use this to have a swap that bypasses the zvol/zfs swap problem...?
zram-generator reserves part of your RAM for swap. This sounds counterintuitive, but in fact it works very well. My machine, for example, has 64 GB of RAM and 12 GB of swap with zram-generator. I do frequent stress tests where 76 GB of RAM are consumed, and it works very well. And the swap is very fast.
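For anyone wanting to try the zram-generator route, the setup is a single drop-in config file. A sketch, with illustrative values only (I believe `zram-size` and `compression-algorithm` are the documented option names, and that `zram-size` accepts expressions over `ram` in MiB, but check the zram-generator.conf man page for your version):

```ini
# /etc/systemd/zram-generator.conf
[zram0]
# Half of RAM, capped at 8 GiB; "ram" is total RAM in MiB.
zram-size = min(ram / 2, 8192)
compression-algorithm = zstd
```

After writing the file, `systemctl daemon-reload` and `systemctl start systemd-zram-setup@zram0.service` (or a reboot) should activate the device.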
I'm afraid this only applies to configurations with lots of RAM. In my case, a netbook with 8 GB of RAM, I cannot do that; I need a swap that is a real swap on disk and that is bigger than the RAM, so it cannot stay in RAM.
If that is the case, then your only choice is reinstallation with a new partition scheme that has a dedicated swap partition.
I also have the exact same issue: running Ubuntu with ZFS on an old laptop with 8 GB of RAM... can't use zram.
Off-topic, but in fact ZRAM can help even with 8 GB of RAM, at the expense of CPU usage. zstd works really great in laptop scenarios; it compresses 4 GB of browser/etc. RAM into 1 GB. You just need to give it enough space (note that this is the maximum swapped RAM space; in the worst case it will be used as-is, in the usual case see below). With 16 GB I gave it 75% of RAM as the threshold, and usually it looks like this (live output from my laptop):

… so it compressed 4 GB into 1 GB. Usually it's 4x with zstd and 3x with lz4; I've seen 10 GB compressed into 2-3 GB too. I use … (many Android firmwares use zram too, so it works even in 1 GB RAM cases for consumer usage).
zram: On Ubuntu I only install …

Ubuntu zfs: They should make a separate swap partition (not ditch ZFS altogether).
Doesn't it mean there is a flaw in the default configuration of Linux, if zram-generator is beating swap and/or no-swap configs? I mean, why would zram-generator beat no swap, except in the rare case that you need to swap really often (completely running out of RAM often)? I've not used swap on most modern systems because there doesn't seem to be a need. The reason I'm in this thread is that I've got one of those modern …
For focus: maybe take workarounds, and other things that don't directly progress this issue, to Discussions: https://github.com/openzfs/zfs/discussions. Thanks.
Can this be fixed? Ubuntu on ZFS should not be an option if this isn't fixed. |
The Ubuntu on ZFS installation should just have a separate (LUKS encrypted!) swap partition, as swap on ZFS apparently won't be fixed rapidly. |
Could the relevant allocations be wrapped in …?
runderwo commented Jul 21, 2018 (edited)
System information
Describe the problem you're observing
System deadlocked while forcing a page out to a swap zvol. Unfortunately I do not have the rest of the backtraces.
Describe how to reproduce the problem
rsync a filesystem with O(1M) files.
Include any warning/errors/backtraces from the system logs