Disable direct reclaim on zvols #669
Conversation
On my openSUSE 12.1 system, this patch does not work. I increased min_free_kbytes and patched zvol.c, but with memtester the system freezes as soon as it hits the swap space. I tried it twice: once it froze at 270 MB, the next time at 670 MB.
This is a default kernel: CONFIG_PREEMPT_NONE=y is set, while CONFIG_PREEMPT_VOLUNTARY and CONFIG_PREEMPT are not. So I need to recompile the kernel to test your patch.
CONFIG_PREEMPT_NONE=y is what should be set. How much RAM do you have? What is your pool configuration? How quickly does it freeze without this patch? If you are booting off ZFS, did you remember to rebuild your initramfs?
@pyavdr I just tried running memtester on my system and it works. Note that I also have patches from pull requests #618, #651 and #660 installed, so they might be the reason my system is stable while yours is not. In particular, I suspect that pull request #618 is responsible. Pull request #660 might also play a role.
OK, I use 12 GB in a VM session, and can go up to 25 GB when needed. (The pool is the latest version: 4 drives, no raidz/mirror, with a 10 GB zfsswap volume.) I would like to try those patches too, but I can't figure out how to apply all these various patches in their different states (I edited your patch manually), so I need some time to apply them. The "how to" for applying all these patches is spread all over. If you find the time, maybe you could put them together into a single patch; it would be easier to figure out how to apply. But besides that, if you really found it, congratulations!
@pyavdr You can get the code with all of these patches from the gentoo branch of my fork: https://github.com/gentoofan/zfs/tree/gentoo I intend to snapshot that branch when I feel that ZFS is ready to enter Gentoo's testing tree. By the way, pull request #660 resolves a stability issue that predominantly affects systems with more than 8GB of RAM. It seems like it might address your issue. Alternatively, you could reduce the amount of RAM that you give to your virtual machine.
How much RAM are you asking memtester to use? It will do mlock, which will prevent the kernel from having access to that RAM. It is possible that this is what is killing your system. On my system with 8GB of RAM, I only let memtester take half of it. There was a significant lag when it did this because of mlock, although it went away after the kernel finished reorganizing system memory. Also note that 131072 is the correct value for my system, but the correct value for yours could be higher given that you have more RAM.
OK, I applied #651 and set vm.min_free_kbytes=512000. Started memtester with 2000/4000/6000/8000/10000: no problem, it runs through, no swap needed. Started memtester with 12000: hmmm: freeze, with some traffic on the zfsswap devices. After some minutes, my Windows 7 host gives a BSOD: memory management. I need to reboot the whole system, brrrrrr.
I cannot say that this is surprising. Memtester does allocations that are incapable of being swapped. When you run memtester, memory is effectively removed from your system. Giving it nearly all of your RAM like you did would likely kill Linux regardless of the filesystem used. You need to find something else to do allocations that can be swapped. A few instances of …
OK, starting instances of python -c "print 21010" & ... one ... 20 seconds waiting ... two ... three, and so on ... after 7 instances the system hits the swap space, which leads to a freeze.
What is the default setting for vm.min_free_kbytes? Did you try increasing it?
I started some hours ago with the default of about 68000, increased it to 131072, and finally to 512000.
Are you certain that you properly patched the kernel modules and that you are running tests with the updated setting? If you are using code from my Git repository, did you do …
Yes, I'm pretty sure. What I did to merge the patches: I edited the specific files (.../module/zfs/arc.c, ...) with the changes from the patches. These are only a few lines, no problem. After that: make; make install. Make finds the changes, compiles, and links. "make install" copies the new files to their destination, then I reboot. In the last weeks I changed several source files in zpool.c or zfs.c, so that procedure works as expected. Just to be sure, I now tried it from the start: make clean; ./configure; make; make install; reboot. After starting the pythons, it freezes again. I don't know the git interface yet; I need to check it, there is already a book lying here on my desk :-).
Thanks for digging into this; I suspected something like this was going on. There have been a number of similar deadlocks which were resolved by disabling direct reclaim where appropriate. In general, I've tried to do this in as targeted a manner as possible. As I'm sure you saw, I have been forced to resort to setting PF_MEMALLOC on occasion. Considerable care needs to be taken when doing this to ensure no other bits get mistakenly cleared or set in current->flags. For this reason, it's usually better to target the specific memory allocations which are causing the issue. Direct reclaim can be disabled for those kmem_alloc()'s by using KM_PUSHPAGE instead of KM_SLEEP. In this case I'm not 100% sure it will be possible to pass KM_PUSHPAGE to all the offending allocations under zvol_write(), but it's worth investigating since it may be cleaner. Getting a full stack of the exact deadlock would be helpful. However, if we stick with your proposed patch, you're going to need to move the setting of PF_MEMALLOC to the very top of the function, or even better into a wrapper function. Both zil_commit() and zfs_range_lock() have the potential to enter direct reclaim, and they are outside the flag's scope.
@behlendorf I am not certain that a wrapper function would be appropriate given that there is only a single call to zvol_write(), but it might be useful to write a helper macro that takes a variable, its type, some flags to set temporarily in that variable, a function pointer, and the arguments to that function. It could be used to wrap the zvol_write() call, but I am not sure where the appropriate place would be to put this macro. I have modified the patch to address the issues that you highlighted. I have also updated the pull request message at the top. In theory, we could use thread-local storage to control the value used in all allocations that currently use KM_SLEEP, so that we can flip them to KM_PUSHPAGE on demand. That would work well with the wrapper macro idea. In addition, I suspect that the reason I need to set vm.min_free_kbytes is that indirect reclaim can fail. I believe that results in the additional deadlock I have observed with this patch, where the system is not immediately crippled. I think we can solve that by modifying the SPL to maintain a pool of pages as per /proc/sys/kernel/spl/vm/swapfs_reserve and to provide a thread-local storage flag that ZFS can set to permit indirect reclaim to draw from those pages. That should probably be a separate patch.
On second thought, maintaining a pool of pages in the SPL that are released on demand would suffer from a race condition on SMP systems, where another thread could steal the pages meant for ZFS. This could also happen on uniprocessor systems where preemption is possible if we are not careful. Addressing that could require implementing a memory allocator for ZFS, which should be able to guarantee that pages reserved for ZFS would only be used by ZFS.
I have done some additional testing. This patch permits a single-disk pool to use swap on a zvol if vm.min_free_kbytes is sufficiently high, but it does not appear to have the same effect on a pool with a single 6-disk raidz vdev on my server, which has 16GB of RAM. Increasing the value of vm.min_free_kbytes to 1048576 permits some amount of writing to swap, but then all writes stop as what appears to be a soft deadlock occurs. Running the …
@gentoofan Your updated patch doesn't do what you think it does. The zvol_dispatch() function dispatches the zvol_write() function to be executed in the context of one of the zvol taskq threads. So you're setting the PF_MEMALLOC flag in the dispatching thread, but that will have no effect, since zvol_write() will be done by one of the taskq worker threads. If we're going to take the PF_MEMALLOC approach, this bit must be set in zvol_write(). I'd suggest a wrapped function like this:

```c
__zvol_write()
{
	/* Existing zvol_write() implementation */
}

zvol_write()
{
	if (current->flags & PF_MEMALLOC) {
		error = __zvol_write();
	} else {
		current->flags |= PF_MEMALLOC;
		error = __zvol_write();
		current->flags &= ~PF_MEMALLOC;
	}

	return (error);
}
```

This should also resolve your indirect reclaim case for kswapd, which will already have set PF_MEMALLOC. See commit 6a95d0b for a better explanation of this race; we fixed a similar subtle issue in the mmap code the same way. Longer term, I think the best way to address this is still to use KM_PUSHPAGE in all the offending allocations. This is what all the other filesystems in the Linux kernel do; they must be very careful about not allocating any memory in the write path. If they absolutely have to, this flag can be used, which really maps to GFP_NOFS.
Related to this, I still really want to see a stack to ensure we're addressing the real deadlock here. Do you happen to have a trivial reproducer for a VM? I'm set up to get a stack.
@behlendorf Thanks for catching that. I will revise the patch shortly. As for reproducing this, give the VM 2G of RAM and do this:
…
@behlendorf I have pushed a revised version of my patch, but I think that this still needs more work. Swap appears to work properly on both my desktop and my server, and there is no longer any need to edit vm.min_free_kbytes. Unfortunately, I was able to observe a hard deadlock on my desktop when running an instance of …

I believe that a deadlock can occur where other threads consume pages as indirect reclaim frees them, starving the kernel thread that needs those pages to be able to swap. I think that can be fixed by implementing a TLS flag in the SPL that would enable us to flip allocations to KM_NOSLEEP. This would guarantee that allocations fall back on emergency memory pools, which should prevent the deadlock I observed under load.
It looks like using PF_MEMALLOC is inappropriate: http://lkml.indiana.edu/hypermail/linux/kernel/0911.2/00576.html The main issue is that PF_MEMALLOC permits ZFS to take pages out of ZONE_DMA. That could cause a crash by exhausting the pages available for DMA. We should be able to address this issue by flipping allocations to use KM_NOSLEEP instead of setting PF_MEMALLOC.
Exactly right, PF_MEMALLOC is a bit of a last resort and we should avoid using it if at all possible. The two places in the existing code where I was forced to use it are because I was unable to modify the exact point of allocation, since it was in the kernel proper. Setting PF_MEMALLOC allowed me to work around the issue without forcing people to patch their kernels. Anyway, back to this particular issue. I agree the best solution is to pass the proper flags at all the offending allocation points. KM_PUSHPAGE should be enough for this; I don't think we need to resort to KM_NOSLEEP, which comes with its own issues. We just need to avoid a deadlock due to reentering reclaim while we're writing out pages. I'll try to get some stacks tomorrow, which I expect will make the issue a bit more concrete.
@behlendorf If you have a kernel patch to eliminate the need for PF_MEMALLOC in ZFS code that is ready for upstream, I could try talking to Greg Kroah-Hartman about sending it to Linus Torvalds for inclusion. Greg is a Gentoo developer and he might be willing to assist my ZFS efforts in Gentoo.
@gentoofan Yes and no. Ricardo and I started a nice thread on linux-mm with Andrew Morton and got everyone to agree that this is in fact a real bug which should be fixed. However, the right fix (in the thread) is pretty invasive and ends up touching all the various arches, which makes it a testing nightmare. Anyway, since I was able to work around it (which I need to do for older kernels anyway), I stopped pushing the issue. Also, since it relates to vmalloc(), which is something we need to stop using heavily in the long run, I didn't feel it was worth the fight. Still, I encourage you to read the thread.
@behlendorf I tried modifying the code to use KM_PUSHPAGE, but the system will not write to swap: https://github.com/gentoofan/spl/commits/gentoo I am still examining this, but any thoughts that you might have would be appreciated.
Increasing vm.min_free_kbytes to 524288 enables my new patchset to swap. Without that, the system refuses to write to swap, but it does not hard deadlock immediately. Lower values might also work, although I have not tested them yet. Also, the deadlock involving 4 simultaneous python processes does not appear to occur with my new patchset.
To be honest, I'm not a big fan of the TSD approach. Having the lower layers modify the passed flags is asking for problems in my view. Plus, the TSD code is already rarely used in ZFS, and I've been tempted a few times to remove it. I'd prefer to either: A) just set PF_MEMALLOC, or B) explicitly pass KM_PUSHPAGE for all impacted allocations. The latter might be a little broad, but it is hardly any worse than disabling reclaim for all zvol writes. I will try to spend some time on this myself over the next week or two.
The following appears to work:
I guess the question should be what the loopback device does that zvols fail to do.
@behlendorf It looks like the loopback device works because of the following lines in taskq_thread() in the SPL, which set PF_MEMALLOC:
…
I noticed that the loopback device sets nice to -20. I patched the SPL to use that as well and ran my stress tests. My desktop no longer deadlocks when running 4 simultaneous python processes, so I have opened a pull request with zfsonlinux/spl. Addressing the issue of PF_MEMALLOC taking pages from ZONE_DMA is important, but that would probably be best addressed as part of a more comprehensive fix.
I have revised my patch to set the flag that is being used in the codepath taken by the loopback device. I have also revised the commit message to reflect that. I am now at the point where I feel that this should close issue #342.
Previously, it was possible for the direct reclaim path to be invoked when a write to a zvol was made. When a zvol is used as a swap device, this often causes swap requests to depend on additional swap requests, which deadlocks. We address this by disabling the direct reclaim path on zvols. This closes issue openzfs#342.
@ryao So what's the latest on this change? I lost track of the latest testing. Is setting TASKQ_NORECLAIM enough to resolve most issues? If so, I'm not averse to merging it, since it clearly does help, although I suspect this will need more work.
@behlendorf Setting TASKQ_NORECLAIM eliminated all issues that I have encountered with swap on zvols. The only known issue is the theoretical one of DMA pages being consumed by ARC. The dma-kmalloc* entries in my desktop's /proc/slabinfo show only a single slab consuming DMA pages after several days of uptime and several instances of heavy swap usage. This suggests to me that crashes caused by DMA page consumption would be incredibly rare in practice:

```
slabinfo - version: 2.1
# name <active_objs> <num_objs> : tunables : slabdata <active_slabs> <num_slabs>
delayed_node 0 0 328 24 2 : tunables 0 0 0 : slabdata 0 0 0
```
Awesome, then I'll merge this patch into master since it's clearly safe and improves stability. It may be a little broad, but we can always revisit this later if it leads to issues such as larger latencies. Thank you again for all your testing of this change and for iterating with me on a reasonable fix.
Merged as commit ce90208 |