avl_find avl_add z_unmount panic on zfs with snapshots #3243

Closed

toppk opened this issue Apr 1, 2015 · 21 comments
toppk commented Apr 1, 2015

[toppk@static ~]$ uname -a;rpm -q zfs
Linux static.bllue.org 3.19.3-200.fc21.x86_64 #1 SMP Thu Mar 26 21:39:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
zfs-0.6.3-1.3.fc21.x86_64

[image: img_20150331_195137]

Here is how I created the filesystem:
zpool create t3 mirror /dev/mapper/hp35.cfs /dev/mapper/hp34.cfs
zfs create t3/home
zfs set compression=lz4 t3/home
zfs snapshot t3/home@$( date '+%Y%m%d_%H:%M:%S_%z' )
zfs set snapdir=visible t3/home

I was running some zfs list commands and "df /t3/home/.zfs/snaphots/" around the time it crashed.

toppk commented Apr 1, 2015

cat /sys/module/zfs/parameters/zfs_expire_snapshot

300

After reading zfs_ctldir.c, I figure I can avoid this entire code path by going hidden:
zfs set snapdir=hidden some/path

https://github.com/zfsonlinux/zfs/blob/9b67f605601c77c814037613d8129562db642a29/module/zfs/zfs_ctldir.c

It seems like there's plenty of opportunity for this race condition to be triggered; the code path in question is sketched below. This is the third panic since I set up ZFS, but I didn't take pictures to be sure it was the same.
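
For reference, this is roughly the pattern in zfsctl_unmount_snapshot() in the linked zfs_ctldir.c (a condensed sketch, simplified, not the exact source):

/*
 * Condensed sketch of zfsctl_unmount_snapshot() (the signature and names
 * are approximate).  The race window is while z_ctldir_lock is dropped
 * for the userspace umount: another thread can re-register the same
 * snapshot name, so the avl_add() on the EBUSY path can find an existing
 * node and panic.
 */
static int
zfsctl_unmount_snapshot(zfs_sb_t *zsb, char *name, int flags)
{
        zfs_snapentry_t search;
        zfs_snapentry_t *sep;
        int error = 0;

        mutex_enter(&zsb->z_ctldir_lock);

        search.se_name = name;
        sep = avl_find(&zsb->z_ctldir_snaps, &search, NULL);
        if (sep) {
                avl_remove(&zsb->z_ctldir_snaps, sep);
                mutex_exit(&zsb->z_ctldir_lock);        /* lock dropped here */

                /* runs "umount -t zfs ..." via call_usermodehelper() */
                error = __zfsctl_unmount_snapshot(sep, flags);

                mutex_enter(&zsb->z_ctldir_lock);       /* lock re-taken here */
                if (error == EBUSY)
                        /* panics if the same name was re-added meanwhile */
                        avl_add(&zsb->z_ctldir_snaps, sep);
                else
                        zfsctl_sep_free(sep);
        } else {
                error = SET_ERROR(ENOENT);
        }

        mutex_exit(&zsb->z_ctldir_lock);
        return (error);
}

If another thread re-registers the same snapshot name while the lock is dropped, that avl_add() finds an existing node and panics with "avl_find() succeeded inside avl_add()".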

toppk commented Apr 1, 2015

Still panicked. This time there was nothing I did that should have mounted the snapshot, so there was nothing that would need to be unmounted.

I'm setting the expire timeout to zero; let's see if that helps.

cat /sys/module/zfs/parameters/zfs_expire_snapshot

0

cooper75 commented Apr 7, 2015

Same problem for me on Fedora 21, 3.19.3-200.fc21.x86_64; posted here: #3257

toppk commented Apr 7, 2015

Hi, I'm trying this debugging patch, which I hope does two things:

  • avoid the panic (printing when it would have occurred)
  • print when adding to the AVL tree, so as to find the culprit.
# diff -u zfs_ctldir.c.old zfs_ctldir.c
--- zfs_ctldir.c.old    2015-04-07 18:30:24.600190585 -0400
+++ zfs_ctldir.c        2015-04-07 18:30:24.602190547 -0400
@@ -733,6 +733,7 @@
 {
        zfs_snapentry_t search;
        zfs_snapentry_t *sep;
+       zfs_snapentry_t *sep2;
        int error = 0;

        mutex_enter(&zsb->z_ctldir_lock);
@@ -746,9 +747,15 @@
                error = __zfsctl_unmount_snapshot(sep, flags);

                mutex_enter(&zsb->z_ctldir_lock);
-               if (error == EBUSY)
+               if (error == EBUSY) {
+                    sep2 = avl_find(&zsb->z_ctldir_snaps, &search, NULL);
+                    if (sep2) {
+                          printk("ZFS: find avl entry during zfsctl_unmount_snapshot\n");
+                    } else {
+                          printk("ZFS: adding avl entry during zfsctl_unmount_snapshot\n");
                        avl_add(&zsb->z_ctldir_snaps, sep);
-               else
+                    }
+               } else
                        zfsctl_sep_free(sep);
        } else {
                error = SET_ERROR(ENOENT);
@@ -786,6 +793,8 @@

                mutex_enter(&zsb->z_ctldir_lock);
                if (error == EBUSY) {
+                           printk("ZFS: adding avl entry during zfsctl_unmount_snapshots\n");
+
                        avl_add(&zsb->z_ctldir_snaps, sep);
                        (*count)++;
                } else {
@@ -877,6 +886,8 @@
        sep->se_path = full_path;
        sep->se_inode = ip;
        avl_add(&zsb->z_ctldir_snaps, sep);
+        printk("ZFS: adding avl entry during zfsctl_mount_snapshot\n");
+

        sep->se_taskqid = taskq_dispatch_delay(zfs_expire_taskq,
            zfsctl_expire_snapshot, sep, TQ_SLEEP,

toppk commented Apr 7, 2015

Okay. While playing around with quick snapshot creation after applying my patch, I hit a would-have-been panic, as well as the other issues people were reporting in linked tickets (i.e. "??????" for directory permissions).

I could not create a simple reproducer for the avl_find() hit in unmount_snapshot, but I did see it happen a few times with my logging, which also shows things that surprise me, like a long series of unmount_snapshot calls firing in the same second.

Ultimately, I'm giving up again, because the kernel still hung, with the last kernel message being: "BUG: unable to handle kernel paging request at 0000000400000029"

No issues in the rest of ZFS (when not using snapshots and leaving snapdir=hidden).

I'm not sure how automount is supposed to work, but it was mounting even when I was cd'ing into subdirectories of the snapshot mount point. I wasn't printing out path and name, and I don't know what the shell was doing, so I'm not sure whether that means anything.

The shell was printing "too many links" errors and getting confused about its cwd.

toppk commented Apr 8, 2015

BTW, I didn't find an entry in /proc/mounts for the snapshot, even though it was mounted.

https://github.com/zfsonlinux/zfs/blob/9b67f605601c77c814037613d8129562db642a29/module/zfs/zpl_ctldir.c

/*
 * Rather than returning the new vfsmount for the snapshot we must
 * return NULL to indicate a mount collision.  This is done because
 * the user space mount calls do_add_mount() which adds the vfsmount
 * to the name space.  If we returned the new mount here it would be
 * added again to the vfsmount list resulting in list corruption.
 */
return (NULL);

}

grigio commented May 6, 2015

I confirm this bug with zfs-dkms 0.6.4.1-1~vivid and Linux 4.0.0-040000-generic.
I have ZFS on a single disk on LUKS and I use zfs-systemd.

I think it's related to the disks going into standby, because I didn't notice this problem with manual import/export.

chrisrd commented May 6, 2015

Anyone seeing this might like to try applying #3381 / @f5882e2 to see if it helps.

@GalenOfTheShadows

@chrisrd, are there other changes in your branch? On Fedora I'm seeing a build error with your patch: avl_walk is a different type than expected.

Unfortunately, my DKMS setup seems to make cc a little stricter than usual, but hey, we are talking about kernel modules here. ;-)

chrisrd commented May 18, 2015

@GalenOfTheShadows no, no other changes.

@fvanniere

Got the same panic:

Linux logger 4.0.4-xenU #1 SMP Tue May 19 19:11:15 CEST 2015 x86_64 GNU/Linux
Jun  1 14:48:51 logger kernel: SPL: Loaded module v0.6.4.1-1
Jun  1 14:48:51 logger kernel: SPLAT: Loaded module v0.6.4.1-1
Jun  1 14:48:51 logger kernel: ZFS: Loaded module v0.6.4.1-1, ZFS pool version 5000, ZFS filesystem version 5
Jun  1 14:48:51 logger kernel: SPL: using hostid 0x030ab800

I created a snapshot (zfs snapshot /tank/root@install), went into /.zfs/snapshot/install and, several minutes later:

Kernel panic - not syncing: avl_find() succeeded inside avl_add()
CPU: 1 PID: 811 Comm: z_unmount/0 Not tainted 4.0.4-xenU #1
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 ffff8800b856c380 ffff88007f11bcb8 ffffffff81bece73 ffff88011fc8f4f0
 ffffffff81eab620 ffff88007f11bd38 ffffffff81beabc2 0000000000000002
 0000000000000008 ffff88007f11bd48 ffff88007f11bce8 ffff88007f11bd38
Call Trace:
 [<ffffffff81bece73>] dump_stack+0x45/0x57
 [<ffffffff81beabc2>] panic+0xc1/0x1d4
 [<ffffffff811e0cda>] avl_add+0x4a/0x50
 [<ffffffff8127e04c>] zfsctl_unmount_snapshot+0x18c/0x1a0
 [<ffffffff81bef358>] ? __schedule+0x268/0x940
 [<ffffffff8127e21d>] zfsctl_expire_snapshot+0x2d/0x80
 [<ffffffff81179de0>] taskq_thread+0x1d0/0x380
 [<ffffffff81081340>] ? wake_up_state+0x20/0x20
 [<ffffffff81179c10>] ? task_done+0xb0/0xb0
 [<ffffffff8107525b>] kthread+0xdb/0x100
 [<ffffffff81010000>] ? perf_trace_xen_mc_entry+0xf0/0x130
 [<ffffffff81075180>] ? __kthread_parkme+0x90/0x90
 [<ffffffff81bf3618>] ret_from_fork+0x58/0x90
 [<ffffffff81075180>] ? __kthread_parkme+0x90/0x90
Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
---[ end Kernel panic - not syncing: avl_find() succeeded inside avl_add()

@GalenOfTheShadows

I can also report that this is still an issue on Fedora 21 running 4.0.4-201 with 0.6.4.1.

The funny part is that, for me, it only triggers when more than 3 snapshot functions are scheduled. No issues when it's just cycling out an hourly snapshot for a new one, but when the daily/weekly/monthly jobs hit along with the hourly, it craps out. Also, with kernel debugging turned on, a running scrub seems to hold it off until about 25-35 minutes after the scrub finishes, no matter what the status of the snapshot jobs is.

chrisrd commented Jun 1, 2015

The change referenced above:

Anyone seeing this might like to try applying #3381 / @f5882e2 to see if it helps.

...has been merged into master @7224c67. Can anyone seeing this please let us know if this change does or doesn't fix the problem?

@dylanpiergies

I'm also seeing this issue. I'm running a build from master @2a34db1b. If anyone needs me to extract any more information from the core dump, please let me know. kdump essentials follow:

crash 7.1.0
GNU gdb (GDB) 7.6

    KERNEL: /lib/modules/4.0.4-2-debug/build/vmlinux
    DUMPFILE: /mnt/vmcore
        CPUS: 12
        DATE: Thu Jan  1 01:00:00 1970
    UPTIME: 04:25:14
LOAD AVERAGE: 0.09, 0.27, 0.18
    TASKS: 709
    NODENAME: libertine
    RELEASE: 4.0.4-2-debug
    VERSION: #1 SMP PREEMPT Tue Jun 2 19:14:31 BST 2015
    MACHINE: x86_64  (3201 Mhz)
    MEMORY: 32 GB
    PANIC: "Kernel panic - not syncing: avl_find() succeeded inside avl_add()"
        PID: 245
    COMMAND: "z_unmount/0"
        TASK: ffff8807ef2ddb20  [THREAD_INFO: ffff8807ef314000]
        CPU: 0
    STATE: TASK_RUNNING (PANIC)

crash> bt -a
PID: 245    TASK: ffff8807ef2ddb20  CPU: 0   COMMAND: "z_unmount/0"
#0 [ffff8807ef317b70] machine_kexec at ffffffff810563db
#1 [ffff8807ef317be0] crash_kexec at ffffffff810fec22
#2 [ffff8807ef317cb0] panic at ffffffff8156dfad
#3 [ffff8807ef317d30] avl_add at ffffffffa000f7c8 [zavl]
#4 [ffff8807ef317d60] zfsctl_unmount_snapshot at ffffffffa02f153b [zfs]
#5 [ffff8807ef317df0] zfsctl_expire_snapshot at ffffffffa02f16fd [zfs]
#6 [ffff8807ef317e10] taskq_thread at ffffffffa0070501 [spl]
#7 [ffff8807ef317ec0] kthread at ffffffff81092ae8
#8 [ffff8807ef317f50] ret_from_fork at ffffffff81574598

PID: 0      TASK: ffff8807fbfdef60  CPU: 1   COMMAND: "swapper/1"
#0 [ffff88081fc25e20] crash_nmi_callback at ffffffff8104a721
#1 [ffff88081fc25e30] nmi_handle at ffffffff81019f6c
#2 [ffff88081fc25ea0] default_do_nmi at ffffffff8101a58a
#3 [ffff88081fc25ed0] do_nmi at ffffffff8101a708
#4 [ffff88081fc25ef0] end_repeat_nmi at ffffffff81576921
    [exception RIP: intel_idle+212]
    RIP: ffffffff8133e9b4  RSP: ffff8807fb837e18  RFLAGS: 00000046
    RAX: 0000000000000010  RBX: 0000000000000004  RCX: 0000000000000001
    RDX: 0000000000000000  RSI: ffff8807fb837fd8  RDI: 0000000000000001
    RBP: ffff8807fb837e48   R8: 00000000004ff398   R9: 0000000000000018
    R10: 000000000000049a  R11: 0000000000006f21  R12: 0000000000000010
    R13: 0000000000000002  R14: 0000000000000003  R15: 00000e7959a74f3c
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#5 [ffff8807fb837e18] intel_idle at ffffffff8133e9b4
#6 [ffff8807fb837e50] cpuidle_enter_state at ffffffff8142e139
#7 [ffff8807fb837ea0] cpuidle_enter at ffffffff8142e2b7
#8 [ffff8807fb837eb0] cpu_startup_entry at ffffffff810b6480
#9 [ffff8807fb837f30] start_secondary at ffffffff8104bc1a

PID: 0      TASK: ffff8807fb838000  CPU: 2   COMMAND: "swapper/2"
#0 [ffff88081fc45e20] crash_nmi_callback at ffffffff8104a721
#1 [ffff88081fc45e30] nmi_handle at ffffffff81019f6c
#2 [ffff88081fc45ea0] default_do_nmi at ffffffff8101a58a
#3 [ffff88081fc45ed0] do_nmi at ffffffff8101a708
#4 [ffff88081fc45ef0] end_repeat_nmi at ffffffff81576921
    [exception RIP: intel_idle+212]
    RIP: ffffffff8133e9b4  RSP: ffff8807fb843e18  RFLAGS: 00000046
    RAX: 0000000000000030  RBX: 0000000000000010  RCX: 0000000000000001
    RDX: 0000000000000000  RSI: ffff8807fb843fd8  RDI: 000000000280b000
    RBP: ffff8807fb843e48   R8: 00000000004ff398   R9: 0000000000000018
    R10: 0000000000029c9f  R11: 00000000000011eb  R12: 0000000000000030
    R13: 0000000000000004  R14: 0000000000000005  R15: 00000e7959854d08
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#5 [ffff8807fb843e18] intel_idle at ffffffff8133e9b4
#6 [ffff8807fb843e50] cpuidle_enter_state at ffffffff8142e139
#7 [ffff8807fb843ea0] cpuidle_enter at ffffffff8142e2b7
#8 [ffff8807fb843eb0] cpu_startup_entry at ffffffff810b6480
#9 [ffff8807fb843f30] start_secondary at ffffffff8104bc1a

PID: 0      TASK: ffff8807fb838a20  CPU: 3   COMMAND: "swapper/3"
#0 [ffff88081fc65e20] crash_nmi_callback at ffffffff8104a721
#1 [ffff88081fc65e30] nmi_handle at ffffffff81019f6c
#2 [ffff88081fc65ea0] default_do_nmi at ffffffff8101a58a
#3 [ffff88081fc65ed0] do_nmi at ffffffff8101a708
#4 [ffff88081fc65ef0] end_repeat_nmi at ffffffff81576921
    [exception RIP: intel_idle+212]
    RIP: ffffffff8133e9b4  RSP: ffff8807fb847e18  RFLAGS: 00000046
    RAX: 0000000000000030  RBX: 0000000000000010  RCX: 0000000000000001
    RDX: 0000000000000000  RSI: ffff8807fb847fd8  RDI: 0000000000000003
    RBP: ffff8807fb847e48   R8: 00000000004ff398   R9: 0000000000000018
    R10: 0000000082cd8698  R11: 0000000000004668  R12: 0000000000000030
    R13: 0000000000000004  R14: 0000000000000005  R15: 00000e7959a22602
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#5 [ffff8807fb847e18] intel_idle at ffffffff8133e9b4
#6 [ffff8807fb847e50] cpuidle_enter_state at ffffffff8142e139
#7 [ffff8807fb847ea0] cpuidle_enter at ffffffff8142e2b7
#8 [ffff8807fb847eb0] cpu_startup_entry at ffffffff810b6480
#9 [ffff8807fb847f30] start_secondary at ffffffff8104bc1a

PID: 0      TASK: ffff8807fb839440  CPU: 4   COMMAND: "swapper/4"
#0 [ffff88081fc85e20] crash_nmi_callback at ffffffff8104a721
#1 [ffff88081fc85e30] nmi_handle at ffffffff81019f6c
#2 [ffff88081fc85ea0] default_do_nmi at ffffffff8101a58a
#3 [ffff88081fc85ed0] do_nmi at ffffffff8101a708
#4 [ffff88081fc85ef0] end_repeat_nmi at ffffffff81576921
    [exception RIP: intel_idle+212]
    RIP: ffffffff8133e9b4  RSP: ffff8807fb84be18  RFLAGS: 00000046
    RAX: 0000000000000030  RBX: 0000000000000010  RCX: 0000000000000001
    RDX: 0000000000000000  RSI: ffff8807fb84bfd8  RDI: 0000000000000004
    RBP: ffff8807fb84be48   R8: 00000000004ff398   R9: 0000000000000018
    R10: 000000000000f810  R11: 0000000000001f8e  R12: 0000000000000030
    R13: 0000000000000004  R14: 0000000000000005  R15: 00000e7959a603d2
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#5 [ffff8807fb84be18] intel_idle at ffffffff8133e9b4
#6 [ffff8807fb84be50] cpuidle_enter_state at ffffffff8142e139
#7 [ffff8807fb84bea0] cpuidle_enter at ffffffff8142e2b7
#8 [ffff8807fb84beb0] cpu_startup_entry at ffffffff810b6480
#9 [ffff8807fb84bf30] start_secondary at ffffffff8104bc1a

PID: 0      TASK: ffff8807fb839e60  CPU: 5   COMMAND: "swapper/5"
#0 [ffff88081fca5e20] crash_nmi_callback at ffffffff8104a721
#1 [ffff88081fca5e30] nmi_handle at ffffffff81019f6c
#2 [ffff88081fca5ea0] default_do_nmi at ffffffff8101a58a
#3 [ffff88081fca5ed0] do_nmi at ffffffff8101a708
#4 [ffff88081fca5ef0] end_repeat_nmi at ffffffff81576921
    [exception RIP: intel_idle+212]
    RIP: ffffffff8133e9b4  RSP: ffff8807fb84fe18  RFLAGS: 00000046
    RAX: 0000000000000030  RBX: 0000000000000010  RCX: 0000000000000001
    RDX: 0000000000000000  RSI: ffff8807fb84ffd8  RDI: 000000000280b000
    RBP: ffff8807fb84fe48   R8: 00000000004ff398   R9: 0000000000000018
    R10: 0000000000010402  R11: 0000000000002381  R12: 0000000000000030
    R13: 0000000000000004  R14: 0000000000000005  R15: 00000e7959863100
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#5 [ffff8807fb84fe18] intel_idle at ffffffff8133e9b4
#6 [ffff8807fb84fe50] cpuidle_enter_state at ffffffff8142e139
#7 [ffff8807fb84fea0] cpuidle_enter at ffffffff8142e2b7
#8 [ffff8807fb84feb0] cpu_startup_entry at ffffffff810b6480
#9 [ffff8807fb84ff30] start_secondary at ffffffff8104bc1a

PID: 0      TASK: ffff8807fb83a880  CPU: 6   COMMAND: "swapper/6"
#0 [ffff88081fcc5e20] crash_nmi_callback at ffffffff8104a721
#1 [ffff88081fcc5e30] nmi_handle at ffffffff81019f6c
#2 [ffff88081fcc5ea0] default_do_nmi at ffffffff8101a58a
#3 [ffff88081fcc5ed0] do_nmi at ffffffff8101a708
#4 [ffff88081fcc5ef0] end_repeat_nmi at ffffffff81576921
    [exception RIP: intel_idle+212]
    RIP: ffffffff8133e9b4  RSP: ffff8807fb85be18  RFLAGS: 00000046
    RAX: 0000000000000030  RBX: 0000000000000010  RCX: 0000000000000001
    RDX: 0000000000000000  RSI: ffff8807fb85bfd8  RDI: 0000000000000006
    RBP: ffff8807fb85be48   R8: 00000000004ff398   R9: 0000000000000018
    R10: 0000000000000000  R11: 000000000000268a  R12: 0000000000000030
    R13: 0000000000000004  R14: 0000000000000005  R15: 00000e79597fe92f
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#5 [ffff8807fb85be18] intel_idle at ffffffff8133e9b4
#6 [ffff8807fb85be50] cpuidle_enter_state at ffffffff8142e139
#7 [ffff8807fb85bea0] cpuidle_enter at ffffffff8142e2b7
#8 [ffff8807fb85beb0] cpu_startup_entry at ffffffff810b6480
#9 [ffff8807fb85bf30] start_secondary at ffffffff8104bc1a

PID: 0      TASK: ffff8807fb83b2a0  CPU: 7   COMMAND: "swapper/7"
#0 [ffff88081fce5e20] crash_nmi_callback at ffffffff8104a721
#1 [ffff88081fce5e30] nmi_handle at ffffffff81019f6c
#2 [ffff88081fce5ea0] default_do_nmi at ffffffff8101a58a
#3 [ffff88081fce5ed0] do_nmi at ffffffff8101a708
#4 [ffff88081fce5ef0] end_repeat_nmi at ffffffff81576921
    [exception RIP: intel_idle+212]
    RIP: ffffffff8133e9b4  RSP: ffff8807fb85fe18  RFLAGS: 00000046
    RAX: 0000000000000030  RBX: 0000000000000010  RCX: 0000000000000001
    RDX: 0000000000000000  RSI: ffff8807fb85ffd8  RDI: 0000000000000007
    RBP: ffff8807fb85fe48   R8: 00000000004ff397   R9: 0000000000000018
    R10: 0000000000000003  R11: 0000000000002cc5  R12: 0000000000000030
    R13: 0000000000000004  R14: 0000000000000005  R15: 00000e7958e3f21c
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#5 [ffff8807fb85fe18] intel_idle at ffffffff8133e9b4
#6 [ffff8807fb85fe50] cpuidle_enter_state at ffffffff8142e139
#7 [ffff8807fb85fea0] cpuidle_enter at ffffffff8142e2b7
#8 [ffff8807fb85feb0] cpu_startup_entry at ffffffff810b6480
#9 [ffff8807fb85ff30] start_secondary at ffffffff8104bc1a

PID: 0      TASK: ffff8807fb83bcc0  CPU: 8   COMMAND: "swapper/8"
#0 [ffff88081fd05e20] crash_nmi_callback at ffffffff8104a721
#1 [ffff88081fd05e30] nmi_handle at ffffffff81019f6c
#2 [ffff88081fd05ea0] default_do_nmi at ffffffff8101a58a
#3 [ffff88081fd05ed0] do_nmi at ffffffff8101a708
#4 [ffff88081fd05ef0] end_repeat_nmi at ffffffff81576921
    [exception RIP: intel_idle+212]
    RIP: ffffffff8133e9b4  RSP: ffff8807fb863e18  RFLAGS: 00000046
    RAX: 0000000000000030  RBX: 0000000000000010  RCX: 0000000000000001
    RDX: 0000000000000000  RSI: ffff8807fb863fd8  RDI: 0000000000000008
    RBP: ffff8807fb863e48   R8: 00000000004ff398   R9: 0000000000000018
    R10: 0000000017320095  R11: 000000000000482d  R12: 0000000000000030
    R13: 0000000000000004  R14: 0000000000000005  R15: 00000e79597be18c
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#5 [ffff8807fb863e18] intel_idle at ffffffff8133e9b4
#6 [ffff8807fb863e50] cpuidle_enter_state at ffffffff8142e139
#7 [ffff8807fb863ea0] cpuidle_enter at ffffffff8142e2b7
#8 [ffff8807fb863eb0] cpu_startup_entry at ffffffff810b6480
#9 [ffff8807fb863f30] start_secondary at ffffffff8104bc1a

PID: 0      TASK: ffff8807fb83c6e0  CPU: 9   COMMAND: "swapper/9"
#0 [ffff88081fd25e20] crash_nmi_callback at ffffffff8104a721
#1 [ffff88081fd25e30] nmi_handle at ffffffff81019f6c
#2 [ffff88081fd25ea0] default_do_nmi at ffffffff8101a58a
#3 [ffff88081fd25ed0] do_nmi at ffffffff8101a708
#4 [ffff88081fd25ef0] end_repeat_nmi at ffffffff81576921
    [exception RIP: intel_idle+212]
    RIP: ffffffff8133e9b4  RSP: ffff8807fb867e18  RFLAGS: 00000046
    RAX: 0000000000000030  RBX: 0000000000000010  RCX: 0000000000000001
    RDX: 0000000000000000  RSI: ffff8807fb867fd8  RDI: 0000000000000009
    RBP: ffff8807fb867e48   R8: 00000000004ff398   R9: 0000000000000018
    R10: 0000000000000049  R11: 00000000000021e7  R12: 0000000000000030
    R13: 0000000000000004  R14: 0000000000000005  R15: 00000e79597ed862
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#5 [ffff8807fb867e18] intel_idle at ffffffff8133e9b4
#6 [ffff8807fb867e50] cpuidle_enter_state at ffffffff8142e139
#7 [ffff8807fb867ea0] cpuidle_enter at ffffffff8142e2b7
#8 [ffff8807fb867eb0] cpu_startup_entry at ffffffff810b6480
#9 [ffff8807fb867f30] start_secondary at ffffffff8104bc1a

PID: 0      TASK: ffff8807fb83d100  CPU: 10  COMMAND: "swapper/10"
#0 [ffff88081fd45e20] crash_nmi_callback at ffffffff8104a721
#1 [ffff88081fd45e30] nmi_handle at ffffffff81019f6c
#2 [ffff88081fd45ea0] default_do_nmi at ffffffff8101a58a
#3 [ffff88081fd45ed0] do_nmi at ffffffff8101a708
#4 [ffff88081fd45ef0] end_repeat_nmi at ffffffff81576921
    [exception RIP: intel_idle+212]
    RIP: ffffffff8133e9b4  RSP: ffff8807fb86be18  RFLAGS: 00000046
    RAX: 0000000000000030  RBX: 0000000000000010  RCX: 0000000000000001
    RDX: 0000000000000000  RSI: ffff8807fb86bfd8  RDI: 000000000000000a
    RBP: ffff8807fb86be48   R8: 00000000004ff397   R9: 0000000000000018
    R10: ffff88081fd57e10  R11: 0000000000016687  R12: 0000000000000030
    R13: 0000000000000004  R14: 0000000000000005  R15: 00000e7958e51b76
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#5 [ffff8807fb86be18] intel_idle at ffffffff8133e9b4
#6 [ffff8807fb86be50] cpuidle_enter_state at ffffffff8142e139
#7 [ffff8807fb86bea0] cpuidle_enter at ffffffff8142e2b7
#8 [ffff8807fb86beb0] cpu_startup_entry at ffffffff810b6480
#9 [ffff8807fb86bf30] start_secondary at ffffffff8104bc1a

PID: 0      TASK: ffff8807fb83db20  CPU: 11  COMMAND: "swapper/11"
#0 [ffff88081fd65e20] crash_nmi_callback at ffffffff8104a721
#1 [ffff88081fd65e30] nmi_handle at ffffffff81019f6c
#2 [ffff88081fd65ea0] default_do_nmi at ffffffff8101a58a
#3 [ffff88081fd65ed0] do_nmi at ffffffff8101a708
#4 [ffff88081fd65ef0] end_repeat_nmi at ffffffff81576921
    [exception RIP: intel_idle+212]
    RIP: ffffffff8133e9b4  RSP: ffff8807fb86fe18  RFLAGS: 00000046
    RAX: 0000000000000030  RBX: 0000000000000010  RCX: 0000000000000001
    RDX: 0000000000000000  RSI: ffff8807fb86ffd8  RDI: 000000000000000b
    RBP: ffff8807fb86fe48   R8: 00000000004ff397   R9: 0000000000000018
    R10: 000000000000049a  R11: 000000000000e961  R12: 0000000000000030
    R13: 0000000000000004  R14: 0000000000000005  R15: 00000e7958e4ec73
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#5 [ffff8807fb86fe18] intel_idle at ffffffff8133e9b4
#6 [ffff8807fb86fe50] cpuidle_enter_state at ffffffff8142e139
#7 [ffff8807fb86fea0] cpuidle_enter at ffffffff8142e2b7
#8 [ffff8807fb86feb0] cpu_startup_entry at ffffffff810b6480
#9 [ffff8807fb86ff30] start_secondary at ffffffff8104bc1a

Bronek commented Jun 3, 2015

@fvanniere your stack trace looks identical to the one I posted in #3257 on 13th April. I was able to fix the kernel panic by upgrading to a newer kernel (3.18.14, but I also tested 4.0.4); this seems to have been fixed in torvalds/linux@8f502d5 (I also applied #3344, but it probably was not necessary).

The current situation is still not perfect: there are kernel "Oopses" when trying to access non-existent snapshots (which, depending on your configuration, may lead to a panic; for example see #3030). I have no other problems if I avoid accessing non-existent snapshots.

@dylanpiergies

Is anyone still actively investigating this?

@behlendorf

@dylanpiergies yes, although I haven't had time to focus on it.

@neurotensin

Hi there,

This patch addresses the avl_find avl_add z_unmount panic on zfs with snapshots.

The cause of the problem is that avl_add() panics if it finds a node that already exists.

This patch addresses that by removing the unnecessary calls to avl_add() and the potential panic.

The root of the problem is that avl_remove() is called before the unmount, and on EBUSY the code tries to re-add the node.

This is logically incorrect, because the EBUSY does not accurately reflect the filesystem state.

The solution in the following patch is to:

a) Properly guard the filesystem operations using ZFS_ENTER/ZFS_EXIT. Linux (and Solaris) kernel mutexes are not reentrant; I have some evidence this might cause other problems with other processes accessing the filesystem during these operations. Other developers, please take note.

b) Use a simple userspace program that reports the mount state side-effect-free (by reading /proc/mounts) and can therefore determine the state of the filesystem before the unmount. If a mount is not present, it is ignored. EBUSY is returned ONLY if a mount was present before the attempted umount.
The unmount is non-blocking, because blocking here is NOT safe: umount can fail in many interesting ways, and that would torpedo the module. Hence we use the userspace "sidefree" program to test for the mount before and after the unmount command. It may be necessary to add a delay for sanity, but it doesn't necessarily matter, as we actively test for mounted snapshots.

c) Add an ENOENT state for __zfsctl_unmount_snapshot() to reflect that the snapshot was not mounted.

The machine this was tested on runs Linux 4.1.1 with zfs/spl 0.6.4.2-1, has 32 threads and 256G of memory, and now correctly creates/removes and mounts/unmounts snapshots.

There remains a problem for someone who knows this code better than I do. If you cd $snapshot && cat /proc/mounts, the mount for $snapshot is shown; immediately afterwards it disappears. I have instrumented the unmount code in zfs_ctldir.c, so I know it is not being unmounted there. There also remains the "too many links" problem, which I think is connected to the ".." structure.

My simple userspace code to test for a mounted snapshot has no license, so add whatever you feel is necessary for inclusion. I could of course make it a module function, but I wanted a simple drop-in for the "stat" and "mountpoint" programs, both of which force an automount when testing for the mountpoint.

Cheers

P.

--- zfs_ctldir.c.orig   2015-07-06 16:48:25.901104564 -0400
+++ zfs_ctldir.c        2015-07-08 13:17:40.311619328 -0400
@@ -695,33 +695,83 @@
        "exec 0</dev/null " \
        "     1>/dev/null " \
        "     2>/dev/null; " \
-       "umount -t zfs -n %s'%s'"
+       "umount -t zfs -n -l %s'%s'"
+
+#define STAT_CMD \
+       "exec 0</dev/null "\
+       "     1>/dev/null " \
+       "     2>/dev/null; "\
+       "sidefree '%s'"

 static int
 __zfsctl_unmount_snapshot(zfs_snapentry_t *sep, int flags)
 {
        char *argv[] = { "/bin/sh", "-c", NULL, NULL };
        char *envp[] = { NULL };
-       int error;
+       int error,stat1,stat2;
+       int ismounted=0;

-       argv[2] = kmem_asprintf(SET_UNMOUNT_CMD,
-           flags & MNT_FORCE ? "-f " : "", sep->se_path);
-       error = call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
-       strfree(argv[2]);
+       /* 
+        Since we don't get back status information as we are not waiting for umount , 
+        we use a stat to see if the mount still exists. if it *does* this would be the error condition
+        in the original scheme, since the mount might be in use. Note, we assume /proc/mounts in sync... */

+       argv[2] = kmem_asprintf(STAT_CMD, sep->se_path); /* note, only one arg */
+       stat1 = call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
+       strfree(argv[2]);
+       if(!stat1){
+               ismounted=1; /* before we try and unmount, is it mounted? */
+       }else{
+               ismounted=0;
+       }
+
        /*
-        * The umount system utility will return 256 on error.  We must
-        * assume this error is because the file system is busy so it is
-        * converted to the more sensible EBUSY.
+        * The "stat"(our side-effect free program)  call will return 1 on error, in otherwords the mount does not exist.
+        * This is the opposite behaviour of before.
+        * Hence, if  no error, mount is present, EBUSY is probably true.
+        *  
         */
-       if (error)
-               error = SET_ERROR(EBUSY);
+
+       if (ismounted){
+               argv[2] = kmem_asprintf(SET_UNMOUNT_CMD,
+                       flags & MNT_FORCE ? "-f " : "", sep->se_path);
+               error = call_usermodehelper(argv[0], argv, envp, UMH_NO_WAIT);
+               strfree(argv[2]);
+
+               /* OK we tried to unmount. Change in state? */
+               argv[2] = kmem_asprintf(STAT_CMD, sep->se_path); /* note, only one arg */
+               stat2 = call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
+               strfree(argv[2]);
+
+               /* Since we only try to unmount if mounted. if still mounted then EBUSY is correct.
+               If stat2=0 then success unmounting, so change state.
+               */
+               if(!stat2){
+                       ismounted=0;
+                       printk("ZFS: snapshot %s was unmounted from mounted state.\n",  sep->se_path );
+                       error=0; /* signifies no error */
+               }else{
+                       ismounted=1;
+                       error = SET_ERROR(EBUSY);
+               }
+
+       }else{ /* if not mounted initially, we notice this. Unmounted cannot change state */
+                       printk("ZFS: snapshot %s not mounted. Ignoring\n",  sep->se_path );
+                       error = SET_ERROR(ENOENT);
+       }
+
+
+
+
+
+       /* make sure we return EBUSY if still mounted*/
+       /* make sure we return ENOENT if was never  mounted*/

        /*
         * This was the result of a manual unmount, cancel the delayed work
         * to prevent zfsctl_expire_snapshot() from attempting a unmount.
         */
-       if ((error == 0) && !(flags & MNT_EXPIRE))
+       if ((ismounted==0) && !(flags & MNT_EXPIRE))
                taskq_cancel_id(zfs_expire_taskq, sep->se_taskqid);


@@ -734,29 +784,33 @@
        zfs_snapentry_t search;
        zfs_snapentry_t *sep;
        int error = 0;
-
        mutex_enter(&zsb->z_ctldir_lock);

        search.se_name = name;
        sep = avl_find(&zsb->z_ctldir_snaps, &search, NULL);
        if (sep) {
-               avl_remove(&zsb->z_ctldir_snaps, sep);
                mutex_exit(&zsb->z_ctldir_lock);

                error = __zfsctl_unmount_snapshot(sep, flags);

                mutex_enter(&zsb->z_ctldir_lock);
-               if (error == EBUSY)
-                       avl_add(&zsb->z_ctldir_snaps, sep);
-               else
+               if (error == EBUSY){
+                   printk("ZFS: Could not unmount Busy snapshot %s\n",  sep->se_path );
+               }else if ( error == ENOENT){
+                   printk("ZFS: Ignore unmounted snapshot %s\n",  sep->se_path );
+                   error=0; /* do nothing */
+               }else{
+               /* We only remove when the unmount is successful not if never mounted */
+                   printk("ZFS: success unmounting snapshot %s\n",  sep->se_path );
+                       avl_remove(&zsb->z_ctldir_snaps, sep);
                        zfsctl_sep_free(sep);
-       } else {
+               }
+       } else { /* This indicates the snapshot was not found internally */
                error = SET_ERROR(ENOENT);
        }

        mutex_exit(&zsb->z_ctldir_lock);
        ASSERT3S(error, >=, 0);
-
        return (error);
 }

@@ -773,22 +827,25 @@

        *count = 0;

+ZFS_ENTER(zsb);
        ASSERT(zsb->z_ctldir != NULL);
        mutex_enter(&zsb->z_ctldir_lock);

        sep = avl_first(&zsb->z_ctldir_snaps);
        while (sep != NULL) {
                next = AVL_NEXT(&zsb->z_ctldir_snaps, sep);
-               avl_remove(&zsb->z_ctldir_snaps, sep);
                mutex_exit(&zsb->z_ctldir_lock);

                error = __zfsctl_unmount_snapshot(sep, flags);

                mutex_enter(&zsb->z_ctldir_lock);
                if (error == EBUSY) {
-                       avl_add(&zsb->z_ctldir_snaps, sep);
+               /* don't need this now  avl_add(&zsb->z_ctldir_snaps, sep); */
                        (*count)++;
-               } else {
+               } else if (error == ENOENT) { /* was never mounted. Ignore */
+
+               } else { /* we get here, it was mounted, and unmounted successfully. */
+                       avl_remove(&zsb->z_ctldir_snaps, sep);
                        zfsctl_sep_free(sep);
                }

@@ -796,7 +853,7 @@
        }

        mutex_exit(&zsb->z_ctldir_lock);
-
+ZFS_EXIT(zsb);
        return ((*count > 0) ? EEXIST : 0);
 }

A simple program for a side-effect-free mount check; install as "/sbin/sidefree".

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

/*
 * A side-effect-free way of reading /proc/mounts and reporting whether the
 * directory argument is mounted.  For use with zfs snapshots, since the
 * stat and mountpoint commands force automount behaviour.
 *
 * We take one argument.  With no argument, or if the path is not mounted,
 * we report "not mounted" (exit 0) and fail silently.  If the path is
 * mounted as zfs we exit 1.
 */
int main(int argc, char **argv)
{
	const char *mounts = "/proc/mounts";
	FILE *in;
	char *targ;
	char *line = NULL;
	size_t len = 0;
	ssize_t nread;
	char dev[2048], path[2048], vfs[2048];

	if (argc <= 1)
		exit(0);

	targ = strdup(argv[1]);		/* copy of the path to test */

	in = fopen(mounts, "r");
	if (in == NULL)
		exit(0);

	while ((nread = getline(&line, &len, in)) != -1) {
		/* each line is "device mountpoint fstype options ..." */
		if (sscanf(line, "%2047s %2047s %2047s", dev, path, vfs) != 3)
			continue;
		if (strcmp(vfs, "zfs") == 0 &&
		    strncmp(path, targ, strlen(path)) == 0) {
			/* found a zfs mount matching the argument */
			exit(1);
		}
	}

	free(line);
	fclose(in);
	exit(0);
} /* end main */

@behlendorf

@neurotensin it would be great if you could open a new pull request with your proposed fix. That will make it much easier to review and allow it to run through the automated testing. https://help.github.com/articles/creating-a-pull-request/

behlendorf added this to the 0.6.5 milestone Jul 10, 2015
@neurotensin

Hi,

I'm working on a pull request now. I just wanted to get it out there, as it greatly improves stability and makes other things easier to test.

One question: where's the best place to add the "sidefree" helper code? Another binary in the package (we could call it zfsmountcheck), or an internal module call?

In the pull request I will add it as a binary to be installed in /sbin/, unless others wish to move the code.

Cheers

P.


@neurotensin

Hi,

#3589

This is the pull request. Sorry for rerouting, but I had to clean it up and make sure it applied to master, not just the release I am running.

P.

behlendorf added a commit to behlendorf/zfs that referenced this issue Aug 30, 2015
Re-factor the .zfs/snapshot auto-mounting code to take into account
changes made to the upstream kernels, and to lay the groundwork for
enabling access to .zfs snapshots via NFS clients.  This patch makes
the following core improvements.

* All actively auto-mounted snapshots are now tracked in two global
trees which are indexed by snapshot name and objset id respectively.
This allows for fast lookups of any auto-mounted snapshot without
needing access to the parent dataset.

* Snapshot entries are added to the tree in zfsctl_snapshot_mount().
However, they are now removed from the tree in the context of the
unmount process.  This eliminates the need for complicated error logic
in zfsctl_snapshot_unmount() to handle unmount failures.

* References are now taken on the snapshot entries in the tree to
ensure they always remain valid while a task is outstanding.

* The MNT_SHRINKABLE flag is set on the snapshot vfsmount_t right
after the auto-mount succeeds.  This allows the kernel to unmount
idle auto-mounted snapshots if needed, removing the need for the
zfsctl_unmount_snapshots() function.

* Snapshots in active use will not be automatically unmounted.  As
long as at least one dentry is revalidated every zfs_expire_snapshot/2
seconds the auto-unmount expiration timer will be extended.

* Commit torvalds/linux@bafc9b7 caused snapshots auto-mounted by ZFS
to be immediately unmounted when the dentry was revalidated.  This
was a consequence of ZFS invalidating all snapdir dentries to ensure
that negative dentries didn't mask new snapshots.  This patch modifies
the behavior such that only negative dentries are invalidated.  This
solves the issue and may result in a performance improvement.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#3589
Issue openzfs#3344
Issue openzfs#3295
Issue openzfs#3257
Issue openzfs#3243
Issue openzfs#3030
Issue openzfs#2841
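
A minimal sketch of the two-tree tracking described in that commit message might look like the following (illustrative only; the identifiers, locking, and reference counting are simplified and are not the actual ZFS code):

/*
 * Illustrative sketch: track auto-mounted snapshots in two global AVL
 * trees, one indexed by name and one by objset id, as the commit
 * message above describes.
 */
#include <sys/zfs_context.h>
#include <sys/avl.h>

typedef struct zfs_snapentry {
        char            *se_name;               /* snapshot name */
        uint64_t        se_objsetid;            /* objset id of the snapshot */
        avl_node_t      se_node_name;           /* link in the by-name tree */
        avl_node_t      se_node_objsetid;       /* link in the by-objsetid tree */
} zfs_snapentry_t;

static avl_tree_t zfs_snapshots_by_name;
static avl_tree_t zfs_snapshots_by_objsetid;
static kmutex_t zfs_snapshot_lock;

static int
snapentry_compare_by_name(const void *a, const void *b)
{
        const zfs_snapentry_t *sa = a, *sb = b;
        int cmp = strcmp(sa->se_name, sb->se_name);

        return (cmp < 0 ? -1 : (cmp > 0 ? 1 : 0));
}

static int
snapentry_compare_by_objsetid(const void *a, const void *b)
{
        const zfs_snapentry_t *sa = a, *sb = b;

        if (sa->se_objsetid < sb->se_objsetid)
                return (-1);
        return (sa->se_objsetid > sb->se_objsetid ? 1 : 0);
}

static void
zfsctl_snapshot_trees_init(void)
{
        mutex_init(&zfs_snapshot_lock, NULL, MUTEX_DEFAULT, NULL);
        avl_create(&zfs_snapshots_by_name, snapentry_compare_by_name,
            sizeof (zfs_snapentry_t), offsetof(zfs_snapentry_t, se_node_name));
        avl_create(&zfs_snapshots_by_objsetid, snapentry_compare_by_objsetid,
            sizeof (zfs_snapentry_t),
            offsetof(zfs_snapentry_t, se_node_objsetid));
}

/* Register an entry in both trees once the auto-mount has succeeded. */
static void
zfsctl_snapshot_add_sketch(zfs_snapentry_t *se)
{
        mutex_enter(&zfs_snapshot_lock);
        avl_add(&zfs_snapshots_by_name, se);    /* caller guarantees uniqueness */
        avl_add(&zfs_snapshots_by_objsetid, se);
        mutex_exit(&zfs_snapshot_lock);
}

/* Drop an entry from both trees; called from the unmount path itself. */
static void
zfsctl_snapshot_remove_sketch(zfs_snapentry_t *se)
{
        mutex_enter(&zfs_snapshot_lock);
        avl_remove(&zfs_snapshots_by_name, se);
        avl_remove(&zfs_snapshots_by_objsetid, se);
        mutex_exit(&zfs_snapshot_lock);
}

Because entries are only removed once the unmount actually happens, there is no EBUSY re-add path, and therefore no way for avl_add() to collide with an existing node, which is what the panic in this issue boils down to.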
behlendorf added a commit to behlendorf/zfs that referenced this issue Aug 31, 2015
behlendorf added a commit to behlendorf/zfs that referenced this issue Aug 31, 2015
tomgarcia pushed a commit to tomgarcia/zfs that referenced this issue Sep 11, 2015
JKDingwall pushed a commit to JKDingwall/zfs that referenced this issue Aug 11, 2016