Mount namespaces can make ZFS snapshots not automountable by normal processes #7849

Closed
siebenmann opened this issue Aug 30, 2018 · 15 comments
Labels
Status: Stale No recent activity for issue

Comments

@siebenmann
Contributor

System information

Type Version/Name
Distribution Name Fedora, also Ubuntu
Distribution Version 28 (Fedora), 18.04 LTS (Ubuntu)
Linux Kernel 4.17.18-200.fc28.x86_64 (Fedora), 4.15.0-32-generic (Ubuntu)
Architecture x86_64
ZFS Version 0.7.0-1530_g5097b4e42 (Fedora; almost git tip), 0.7.5-1ubuntu15 (Ubuntu)
SPL Version 0.7.0-1530_g5097b4e42 (Fedora), 0.7.5-1ubuntu1 (Ubuntu)

Describe the problem you're observing

Cloned mount namespaces can 'trap' ZFS snapshot automounts, leaving them mostly inaccessible to regular processes in the regular host namespace: after ZFS expires the initial automount, it will refuse to re-automount them. Under some circumstances this can lead to a panic in the NFS server code (see issue #7764), although it's not the only way to cause or reproduce that particular panic.

Describe how to reproduce the problem

  1. Cause a snapshot to be automounted in the regular (PID 1) mount namespace, e.g. ls /tank/tst/.zfs/snapshot/snap.
  2. Create a new cloned mount namespace, for example with PS1="newns# " unshare -m sh. This cloned mount namespace inherits the automounted snapshot.
  3. Wait for the automounted snapshot's unmount timer to expire (five minutes by default). ZFS will attempt to unmount it, ultimately by running umount in the regular mount namespace; if the snapshot is unused there, this will succeed. However, it does not unmount the snapshot in the cloned mount namespace.
  4. Attempt to access the snapshot again to re-automount it, again with ls /tank/tst/.zfs/snapshot/snap. This time it fails with 'Too many levels of symbolic links' (ELOOP). A condensed reproducer follows these steps.
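
A condensed reproducer of the above (a sketch: it assumes the tank/tst@snap snapshot from step 1 and shortens the expiry timer for convenience):

echo 15 > /sys/module/zfs/parameters/zfs_expire_snapshot   # default is 300
ls /tank/tst/.zfs/snapshot/snap   # step 1: trigger the automount
unshare -m sleep 60 &             # step 2: cloned namespace inherits the mount
sleep 30                          # step 3: expiry unmounts it in the host namespace
ls /tank/tst/.zfs/snapshot/snap   # step 4: fails with ELOOP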

This happens because zfsctl_snapshot_mount() refuses to automount a snapshot that is already mounted, but the check is effectively 'is the snapshot mounted anywhere, in any namespace' (implemented in zfsctl_snapshot_ismounted() as 'is this snapshot still on our internal list of mounted snapshots'). The specific error is ELOOP, for reasons explained in the comments further down in zfsctl_snapshot_mount().
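
One way to observe the stale state that check reacts to (a sketch; $NSPID is a placeholder for the PID of the shell left running in the cloned namespace):

findmnt -t zfs                                     # host namespace: snapshot mount is gone
nsenter --mount --target "$NSPID" findmnt -t zfs   # cloned namespace: still mounted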

Unfortunately it's not clear to me what (if anything) can be done about this. It might be possible to make it so that ZFS snapshot automounts can only be present in the main (PID 1) mount namespace, but that feels like overkill because there are some situations where this works right in additional mount namespaces. I don't know if there's a way to make a re-mount work right if the snapshot is mounted but not in the main mount namespace, or if that would cause explosions.

(The good news is that systemd's handling of mount namespaces for regular units with things like DynamicUser or PrivateTmp doesn't appear to have this problem; unmounts propagate into the modified mount namespace used by the service.)

@loli10K
Contributor

loli10K commented Aug 30, 2018

> However, this doesn't unmount the snapshot in the cloned mount namespace.

I don't see this happening on 0.7.9:

root@linux:~# echo 5 > /sys/module/zfs/parameters/zfs_expire_snapshot 
root@linux:~# 
root@linux:~# zfs create $POOLNAME/fs
root@linux:~# zfs snap $POOLNAME/fs@snap
root@linux:~# ls -l /$POOLNAME/fs/.zfs/snapshot/snap
total 0
root@linux:~# df -t zfs
Filesystem       1K-blocks  Used Available Use% Mounted on
testpool             57216     0     57216   0% /testpool
testpool/fs          57216     0     57216   0% /testpool/fs
testpool/fs@snap     57216     0     57216   0% /testpool/fs/.zfs/snapshot/snap
root@linux:~# PS1='newns# ' unshare -m sh
newns# sleep 10
newns# df -t zfs
Filesystem     1K-blocks  Used Available Use% Mounted on
testpool           57216     0     57216   0% /testpool
testpool/fs        57216     0     57216   0% /testpool/fs
newns# 
root@linux:~# df -t zfs
Filesystem     1K-blocks  Used Available Use% Mounted on
testpool           57216     0     57216   0% /testpool
testpool/fs        57216     0     57216   0% /testpool/fs
root@linux:~# ls -l /$POOLNAME/fs/.zfs/snapshot/snap
total 0
root@linux:~# df -t zfs
Filesystem       1K-blocks  Used Available Use% Mounted on
testpool             57216     0     57216   0% /testpool
testpool/fs          57216     0     57216   0% /testpool/fs
testpool/fs@snap     57216     0     57216   0% /testpool/fs/.zfs/snapshot/snap
root@linux:~# 

@Ukko-Ylijumala

Maybe something keeps the snapshot mounted in the cloned namespace?

@siebenmann
Contributor Author

I was just able to reproduce my issue with 0.7.9 on Fedora 28 (with the latest, just-released kernel). I used a 15-second zfs_expire_snapshot, though, since I wasn't sure I could type all the necessary commands within five seconds, and I confirmed that the snapshot was mounted in newns# before the sleep (with df -t zfs). @loli10K, is it possible that in your case the snapshot expired in the main namespace before your unshare? What kernel and Linux distribution were you using here?
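
For reference, the shortened expiry uses the same knob shown in the transcript above:

echo 15 > /sys/module/zfs/parameters/zfs_expire_snapshot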

@loli10K
Contributor

loli10K commented Aug 30, 2018

@siebenmann Debian8 with stock kernel:

root@linux:~# cat /tmp/issue-7849.sh 
#!/bin/bash

echo 50 > /sys/module/zfs/parameters/zfs_expire_snapshot
# misc functions
function is_linux() {
   if [[ "$(uname)" == "Linux" ]]; then
      return 0
   else
      return 1
   fi
}
# setup
POOLNAME='testpool'
if is_linux; then
   TMPDIR='/var/tmp'
   mountpoint -q $TMPDIR || mount -t tmpfs tmpfs $TMPDIR
   zpool destroy $POOLNAME
   rm -f $TMPDIR/disk*
   truncate -s 128m $TMPDIR/disk{0,1}
   zpool create $POOLNAME $TMPDIR/disk0
else
   TMPDIR='/tmp'
   zpool destroy $POOLNAME
   rm -f $TMPDIR/zpool_$POOLNAME.dat
   mkfile 128m $TMPDIR/zpool_$POOLNAME.dat
   zpool create $POOLNAME $TMPDIR/zpool_$POOLNAME.dat
fi
zfs create $POOLNAME/fs
zfs snap $POOLNAME/fs@snap
ls -l /$POOLNAME/fs/.zfs/snapshot/snap
df -t zfs
echo 'unshare ->'
PS1='newns# ' unshare -m sh -c "df -t zfs && echo sleep && sleep 60 && echo slept && df -t zfs"
echo 'unshare <-'
df -t zfs

root@linux:~# /tmp/issue-7849.sh 
total 0
Filesystem       1K-blocks  Used Available Use% Mounted on
testpool             57216     0     57216   0% /testpool
testpool/fs          57216     0     57216   0% /testpool/fs
testpool/fs@snap     57216     0     57216   0% /testpool/fs/.zfs/snapshot/snap
unshare ->
Filesystem       1K-blocks  Used Available Use% Mounted on
testpool             57216     0     57216   0% /testpool
testpool/fs          57216     0     57216   0% /testpool/fs
testpool/fs@snap     57216     0     57216   0% /testpool/fs/.zfs/snapshot/snap
sleep
slept
Filesystem     1K-blocks  Used Available Use% Mounted on
testpool           57216     0     57216   0% /testpool
testpool/fs        57216     0     57216   0% /testpool/fs
unshare <-
Filesystem     1K-blocks  Used Available Use% Mounted on
testpool           57216     0     57216   0% /testpool
testpool/fs        57216     0     57216   0% /testpool/fs
root@linux:~# uname -a
Linux linux 3.16.0-4-amd64 #1 SMP Debian 3.16.51-3 (2017-12-13) x86_64 GNU/Linux
root@linux:~# 

@siebenmann
Contributor Author

Thank you for the script! I ran it on my test Fedora 28 machine and it reproduced the problem:

[...]
unshare ->
Filesystem       1K-blocks   Used Available Use% Mounted on
tank                558592      0    558592   0% /tank
tank/fs             900864 342272    558592  38% /tank/fs
testpool             57216      0     57216   0% /testpool
testpool/fs          57216      0     57216   0% /testpool/fs
testpool/fs@snap     57216      0     57216   0% /testpool/fs/.zfs/snapshot/snap
sleep
slept
Filesystem       1K-blocks   Used Available Use% Mounted on
tank                558592      0    558592   0% /tank
tank/fs             900864 342272    558592  38% /tank/fs
testpool             57216      0     57216   0% /testpool
testpool/fs          57216      0     57216   0% /testpool/fs
testpool/fs@snap     57216      0     57216   0% /testpool/fs/.zfs/snapshot/snap
unshare <-
[...]

This is with 0.7.9 straight from the official ZoL Fedora repo, so all I can think of is some difference in either the kernel's behavior or the unshare command on Debian 8 vs Fedora.

@loli10K
Contributor

loli10K commented Aug 31, 2018

> my test Fedora 28 machine and it reproduced the problem

@siebenmann does this happen only with ZFS automounted snapshots? Please try the following script on that Fedora28 machine:

#!/bin/bash

mountpoint -q /mnt && umount /mnt
truncate -s 64m /var/tmp/file
yes | mkfs.ext4 /var/tmp/file
mount /var/tmp/file /mnt
df /mnt
(sleep 10 && echo 'umounting' && umount /mnt && echo 'umounted') &
echo 'unshare ->'
PS1='newns# ' unshare -m sh -c "df /mnt && echo sleep && sleep 20 && echo slept && df /mnt"
echo 'unshare <-'
df /mnt

@siebenmann
Contributor Author

The held-in-namespace mount also happens with the /dev/loop0 mount:

unshare ->
Filesystem     1K-blocks  Used Available Use% Mounted on
/dev/loop0         59365  1294     53485   3% /mnt
sleep
umounting
umounted
slept
Filesystem     1K-blocks  Used Available Use% Mounted on
/dev/loop0         59365  1294     53485   3% /mnt
unshare <-

Just to be sure, I tested a variant of your script that did a df /mnt in the regular namespace after unmounting /mnt (so between 'umounted' and 'slept'), and it showed that the mount was indeed gone there.

@loli10K
Contributor

loli10K commented Sep 2, 2018

It is the unshare binary:

  • tested kernel 4.18.5 on the same Debian8 with ZFS 0.7.9 + 0.7.10 patches: no issues
  • copied unshare from Fedora24 to the same Debian8: reproduced the issue
  • copied unshare from Ubuntu16 to the same Debian8: reproduced the issue

"util-linux" versions:

Distribution Version
Debian8 2.25.2-6
Ubuntu16 2.27.1-6ubuntu3.3
Fedora24 2.28.2-1.fc24

By default newer versions execute an additional

mount("none", "/", NULL, MS_REC|MS_PRIVATE, NULL)

when unsharing the mount namespace: specifying --propagation unchanged (the default is private) seems to avoid this specific issue. The man page confirms this:

   mount namespace
         Mounting and unmounting filesystems will not affect the rest of the system (CLONE_NEWNS flag), except for filesystems which are explicitly marked as shared (with mount --make-shared; see /proc/self/mountinfo or findmnt -o+PROPAGATION for the shared flags).

         unshare since util-linux version 2.27 automatically sets propagation to private in a new mount namespace to make sure that the new namespace is really unshared. It's possible to disable this feature with option --propagation unchanged. Note that private is the kernel default.
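
Concretely, the workaround (assuming util-linux 2.27 or newer; the mountpoint is the one from the transcripts above):

# clone the mount namespace without the implicit MS_REC|MS_PRIVATE remount of /
PS1='newns# ' unshare -m --propagation unchanged sh
# inspect the propagation flags of the automounted snapshot
findmnt -o TARGET,PROPAGATION /testpool/fs/.zfs/snapshot/snap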

The issue here is we don't get to properly clean up all the references (the zfs_snapshots_by_name and zfs_snapshots_by_objsetid AVL trees) to the mounted snapshot when we (auto)umount the one in the "parent" namespace.

The cleanup code path is deactivate_super -> deactivate_locked_super -> zpl_kill_sb -> zfs_preumount -> zfsctl_destroy -> zfsctl_snapshot_remove. With a cloned, private namespace, zpl_kill_sb is never called because we still hold an additional reference (sb->s_active) to the mounted snapshot.

Tracing relevant function calls via systemtap, without private mount:

 -> call_usermodehelper "/usr/bin/env" "umount" "-t" "zfs" "-n" "/testpool/fs/.zfs/snapshot/snap"           (null)
 cleanup_mnt <- deactivate_super
 kretprobe_trampoline -> deactivate_super {.counter=2}
 task_work_run <- deactivate_super
 cleanup_mnt <- deactivate_super
 kretprobe_trampoline -> deactivate_super {.counter=1}
 kretprobe_trampoline <- deactivate_locked_super
 deactivate_locked_super <- zpl_kill_sb
 zfsctl_destroy -> zfsctl_snapshot_remove se={.se_name="testpool/fs@snap", .se_path="/testpool/fs/.zfs/snapshot/snap", ....
 zfs_preumount <- zfsctl_snapshot_remove 
 kretprobe_trampoline <- zpl_kill_sb
 task_work_run <- deactivate_locked_super
 task_work_run <- deactivate_super

and with private mount:

 -> call_usermodehelper "/usr/bin/env" "umount" "-t" "zfs" "-n" "/testpool/fs/.zfs/snapshot/snap"           (null)
 cleanup_mnt <- deactivate_super
 kretprobe_trampoline -> deactivate_super {.counter=2}
 task_work_run <- deactivate_super

This is confirmed by the fact that we keep trying to umount the snapshot in the parent namespace (because after a "failed" umount we get "rescheduled" in snapentry_expire):

 -> call_usermodehelper "/usr/bin/env" "umount" "-t" "zfs" "-n" "/testpool/fs/.zfs/snapshot/snap"           (null)
 cleanup_mnt <- deactivate_super
 kretprobe_trampoline -> deactivate_super {.counter=2}
 task_work_run <- deactivate_super
 -> call_usermodehelper "/usr/bin/env" "umount" "-t" "zfs" "-n" "/testpool/fs/.zfs/snapshot/snap"           (null)
 -> call_usermodehelper "/usr/bin/env" "umount" "-t" "zfs" "-n" "/testpool/fs/.zfs/snapshot/snap"           (null)
 -> call_usermodehelper "/usr/bin/env" "umount" "-t" "zfs" "-n" "/testpool/fs/.zfs/snapshot/snap"           (null)
 -> call_usermodehelper "/usr/bin/env" "umount" "-t" "zfs" "-n" "/testpool/fs/.zfs/snapshot/snap"           (null)
 -> call_usermodehelper "/usr/bin/env" "umount" "-t" "zfs" "-n" "/testpool/fs/.zfs/snapshot/snap"           (null)
 -> call_usermodehelper "/usr/bin/env" "umount" "-t" "zfs" "-n" "/testpool/fs/.zfs/snapshot/snap"           (null)
 -> call_usermodehelper "/usr/bin/env" "umount" "-t" "zfs" "-n" "/testpool/fs/.zfs/snapshot/snap"           (null)
 -> call_usermodehelper "/usr/bin/env" "umount" "-t" "zfs" "-n" "/testpool/fs/.zfs/snapshot/snap"           (null)
 -> call_usermodehelper "/usr/bin/env" "umount" "-t" "zfs" "-n" "/testpool/fs/.zfs/snapshot/snap"           (null)
 -> call_usermodehelper "/usr/bin/env" "umount" "-t" "zfs" "-n" "/testpool/fs/.zfs/snapshot/snap"           (null)
 -> call_usermodehelper "/usr/bin/env" "umount" "-t" "zfs" "-n" "/testpool/fs/.zfs/snapshot/snap"           (null)
 -> call_usermodehelper "/usr/bin/env" "umount" "-t" "zfs" "-n" "/testpool/fs/.zfs/snapshot/snap"           (null)
 -> call_usermodehelper "/usr/bin/env" "umount" "-t" "zfs" "-n" "/testpool/fs/.zfs/snapshot/snap"           (null)
 -> call_usermodehelper "/usr/bin/env" "umount" "-t" "zfs" "-n" "/testpool/fs/.zfs/snapshot/snap"           (null)
 -> call_usermodehelper "/usr/bin/env" "umount" "-t" "zfs" "-n" "/testpool/fs/.zfs/snapshot/snap"           (null)
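
For completeness, a rough reconstruction of the kind of systemtap probes behind the traces above; the exact script was not posted in this thread, so treat this as a sketch:

 # hypothetical reconstruction -- the original systemtap script was not posted
 stap -e '
 probe kernel.function("call_usermodehelper"),
       kernel.function("deactivate_super"),
       kernel.function("deactivate_locked_super"),
       module("zfs").function("zpl_kill_sb"),
       module("zfs").function("zfsctl_snapshot_remove")
 {
     printf("-> %s\n", probefunc())
 }'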

@brandonhaberfeld

brandonhaberfeld commented Sep 5, 2018

Would this be the root cause of the ELOOP error which ultimately causes the issues in #7764, fixed in #7864?

@lundman
Contributor

lundman commented Sep 7, 2018

We have this problem with snapshots, where we get an ELOOP error, which could be related to this bug report. Is there an easy way to test whether we are hitting this problem? How do we pass --propagation unchanged to unshare to see if it stops happening?

@stale

stale bot commented Aug 25, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Aug 25, 2020
@coolhaircut

The disk device type now specifies the propagation key, which seems like it would resolve this and several related issues. This could probably be proactively closed as resolved rather than autoclosed as stale.

Can you confirm this fixes it, @siebenmann?

@stale stale bot removed the Status: Stale No recent activity for issue label Sep 25, 2020
@siebenmann
Contributor Author

I will need to build a test environment and rebuild my familiarity with this bug, so it will be a few days before I can give you an answer here.

@siebenmann
Contributor Author

The current state of git tip on Fedora 32 is confusing me, but the issue is still present: the snapshot remains mounted in the second namespace even if it has been unmounted in the PID 1 namespace. Under some circumstances the snapshot also seems to remain mounted in the PID 1 namespace even though the timer has theoretically expired, but if I set a fast enough unmount time in @loli10K's ZFS test script, I can see the snapshot mount disappear, and attempting to access it then gets the same error. When the snapshot remains mounted in the PID 1 namespace, it appears to become slow to unmount automatically, or perhaps no longer unmounts automatically at all (I've waited several multiples of a 50-second unmount timeout and it is still mounted).

(With the script's timing settings of a 50-second unmount timer and a 60-second sleep, the snapshot remains mounted in the PID 1 namespace. With a 15-second timeout it doesn't, although it stays mounted in the unshared namespace.)

@stale

stale bot commented Nov 18, 2021

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Nov 18, 2021
@stale stale bot closed this as completed Feb 18, 2022