"Too many levels of symbolic links" when accessing snapshots #4514

odoucet · 2016-04-12T14:50:31Z

This is the same error as #816, but that old ticket was closed as fixed because it happened on old ZFS versions with an old kernel. I'm opening a new issue with additional information to get it fixed.

This happens on two different systems (with same data, replicated with zfs send/recv).
System 1 is Kernel 4.4.4 + SPL/ZFS 0.6.5.5
System 2 is Kernel 4.5.0 + SPL/ZFS 0.6.5.6

This worked on ZFS 0.6.3 with kernel 3.10 ...

$ ls -lah /backup/xxxxobfuscatedxxx/.zfs/snapshot/2016-04-11.000000Z/
ls: cannot access /backup/xxxxobfuscatedxxx/.zfs/snapshot/2016-04-11.000000Z/: Too many levels of symbolic links

But this almost works:

$ cd /backup/xxxxobfuscatedxxx/.zfs/snapshot/2016-04-11.000000Z/
$ ls
home/
# Great, I have it ! 
$ pwd
/backup/xxxxobfuscatedxxx/.zfs/snapshot/2016-04-11.000000Z/
$ cd home && pwd
(unreachable)/home

But as the path seems completely f**ed up, tools like rsync do not work to restore data...

Fortunately, I can still access my data with mount:

$ mount -t zfs backupxx/xxxxobfuscatedxxx@2016-04-11.000000Z /mnt/test
# works \o/

Running strace on ls does not result in a segfault as in the old ticket :)

stat("/backup/xxxxobfuscatedxxx/.zfs/snapshot/2016-04-11.000000Z/", 0x1ac50d0) = -1 ELOOP
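The same check can be made without strace. Below is a minimal Python sketch that reports the errno name from a failing stat call; `stat_errno` is a hypothetical helper name, not part of any ZFS tooling.

```python
import errno
import os


def stat_errno(path):
    """Return None if os.stat(path) succeeds, else the errno name (e.g. 'ELOOP')."""
    try:
        os.stat(path)  # follows symlinks, like the stat() call in the strace output
    except OSError as exc:
        # Map the numeric errno to its symbolic name; fall back to the number.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return None
```

Calling `stat_errno()` on the snapshot path from the strace line above should report `ELOOP` on an affected system, assuming the failure is reproducible there.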

This volume currently has 57 snapshots.

This can be reproduced easily if needed. Just tell me how I can help.

tuxoko · 2016-04-12T21:15:40Z

Is the snapshot already mounted somewhere else?

odoucet · 2016-04-12T21:20:56Z

None of the snapshots in this filesystem are mounted elsewhere.
The initial filesystem is mounted, though.

tuxoko · 2016-04-12T21:41:33Z

When you see the ELOOP error, can you manually mount the snapshot at the .zfs/snapshot/xxx location, and does everything work afterward?

odoucet · 2016-04-12T22:22:55Z

I'm not sure what you mean ... Yes, I can mount the same snapshot I have the ELOOP error with.

Or do you mean

mount -t zfs xxxx@2016-04-10.000000Z /backup/xxxx/.zfs/snapshot/2016-04-10.000000Z/ ?

That doesn't make any sense, does it? mount reports no error, but ls still fails with ELOOP.

And a third try:

$ mount -t zfs xxx@2016-04-10.000000Z /mnt/test
$ ls /backup/xxx/.zfs/snapshot/2016-04-10.000000Z/
Too many levels of symbolic links

tuxoko · 2016-04-12T22:35:22Z

Do this: mount -t zfs xxxx@2016-04-10.000000Z /backup/xxxx/.zfs/snapshot/2016-04-10.000000Z/
It might not make sense to you, but I want to have an idea of which part in the code went wrong.

odoucet · 2016-04-13T06:41:59Z

Yes, that's what I tested in my previous post ;) The mount command reports no error, but ELOOP is still there, and /proc/mounts does not show the mount.
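To confirm whether a mount point is listed in /proc/mounts, something like the following Python sketch could be used. `is_mounted` is a hypothetical helper; it takes the file contents as a string so it can be tried on sample text, and it decodes the octal escapes (such as `\040` for a space) that the kernel uses in that file.

```python
import re


def _decode(field):
    """Decode octal escapes such as \\040 used in /proc/mounts fields."""
    return re.sub(r"\\([0-7]{3})", lambda m: chr(int(m.group(1), 8)), field)


def is_mounted(mountpoint, mounts_text):
    """Return True if mountpoint is the second field of any line in mounts_text."""
    target = mountpoint.rstrip("/") or "/"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and _decode(fields[1]) == target:
            return True
    return False
```

On a live system the second argument would be the result of `open("/proc/mounts").read()`.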

odoucet · 2016-04-15T15:01:35Z

This bug also prevents "zfs diff" from working (it fails with the message "Unable to obtain diffs: No such file or directory"), in case that helps.
The .zfs/shares folder is empty.

tuxoko · 2016-04-15T22:06:38Z

@odoucet
Please try strace mount.zfs xxxx@2016-04-10.000000Z /backup/xxxx/.zfs/snapshot/2016-04-10.000000Z/

odoucet · 2016-04-16T09:16:28Z

[...]
stat("xxxxxxx/xxxxxx@2016-04-10.000000Z", 0x7ffe33b45500) = -1 ENOENT (No such file or directory)
getcwd("/root", 4096)                   = 6
lstat("/xx", {st_mode=S_IFDIR|0755, st_size=10, ...}) = 0
lstat("/xx/xx", {st_mode=S_IFDIR|0755, st_size=50, ...}) = 0
lstat("/xx/xx/xx", {st_mode=S_IFDIR|0755, st_size=25, ...}) = 0
lstat("/xx/xx/xx/xx", {st_mode=S_IFDIR|0755, st_size=3, ...}) = 0
lstat("/xx/xx/xx/xx/.zfs", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
lstat("/xx/xx/xx/xx/.zfs/snapshot", {st_mode=S_IFDIR|0555, st_size=2, ...}) = 0
lstat("/xx/xx/xx/xx/.zfs/snapshot/2016-04-10.000000Z", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
access("/sys/module/zfs", F_OK)         = 0
access("/sys/module/zfs", F_OK)         = 0
open("/dev/zfs", O_RDWR)                = 3
close(3)                                = 0
open("/dev/zfs", O_RDWR)                = 3
open("/etc/mtab", O_RDONLY)             = 4
open("/etc/dfs/sharetab", O_RDONLY)     = 5
open("/dev/zfs", O_RDWR)                = 6
open("/usr/share/locale/locale.alias", O_RDONLY) = 7
fstat(7, {st_mode=S_IFREG|0644, st_size=2512, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2c28673000
read(7, "# Locale name alias data base.\n#"..., 4096) = 2512
read(7, "", 4096)                       = 0
close(7)                                = 0
munmap(0x7f2c28673000, 4096)            = 0
open("/usr/share/locale/en_US.UTF-8/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US.utf8/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en.UTF-8/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en.utf8/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
ioctl(3, 0x5a12, 0x7ffe33b41a70)        = 0
ioctl(3, 0x5a05, 0x7ffe33b3e420)        = 0
ioctl(3, 0x5a13, 0x7ffe33b41e80)        = 0
close(3)                                = 0
close(4)                                = 0
close(5)                                = 0
close(6)                                = 0
mount("xxxxxxx/xxxxxx@2016-04-10.000000Z", "/xxxxxxx/.zfs/snapshot/2016-04-10.000000Z", "zfs", 0, ",mntpoint=/xxxxxxxxx"...) = 0
lstat("/etc/mtab", {st_mode=S_IFREG|0644, st_size=17740, ...}) = 0
open("/etc/mtab", O_RDWR|O_CREAT, 0644) = 3
close(3)                                = 0
open("/etc/mtab", O_RDWR|O_CREAT|O_APPEND, 0666) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=17740, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2c28673000
fstat(3, {st_mode=S_IFREG|0644, st_size=17740, ...}) = 0
lseek(3, 16384, SEEK_SET)               = 16384
read(3, "rw,noatime 0 0\nxxxxxxx"..., 1356) = 1356
write(3, "xxxxxx"..., 147) = 147
close(3)                                = 0
munmap(0x7f2c28673000, 4096)            = 0
exit_group(0)                           = ?
+++ exited with 0 +++

odoucet · 2016-04-25T16:54:27Z

Any update? What can I do to help?

tuxoko · 2016-04-25T18:10:11Z

The mount command returns success; I don't know why it wouldn't work for you.

odoucet · 2016-04-26T13:09:19Z

I just tested a different kernel with the same SPL/ZFS version (v0.6.5.6-1).
On kernel 4.4.4 : not working
Rebooted same system on kernel 3.10.101 : working

On kernel 4.4.4, the behaviour is really strange:

$ ls /backup/xxxobfuscatedxxx/.zfs/snapshot/2016-04-26.000000Z/
ls: cannot access /backup/xxxobfuscatedxxx/.zfs/snapshot/2016-04-26.000000Z/: Too many levels of symbolic links
$ cd xxobfuscatedxxx && ls 2016-04-26.000000Z/
home/

As stated above, mounting the snapshot with mount works.

m-r-r · 2016-05-01T20:56:22Z

Hello @odoucet,

I have the same problem with a ZFS filesystem that is bind-mounted in an LXC container.
Is your ZFS filesystem also accessible from an LXC container? If so, that could be related…

odoucet · 2016-05-02T07:55:34Z

Hi @m-r-r , sorry, no LXC involved in my setup ...

JuliaVixen · 2016-05-04T22:58:23Z

I'm having what appears to be the same issue. Short summary: 'ls' on a snapshot when the CWD is under .zfs works as expected. When the CWD is elsewhere: "Too many levels of symbolic links".

localhost ~ # uname -a
Linux localhost 4.4.6-gentoo #1 SMP Mon Apr 25 02:58:59 Local time zone must be set--see zic  x86_64 Intel(R) Xeon(R) CPU E3-1220 v3 @ 3.10GHz GenuineIntel GNU/Linux

localhost ~ # modinfo zfs
filename:       /lib/modules/4.4.6-gentoo/extra/zfs/zfs.ko
version:        0.6.5.4-r1-gentoo
license:        CDDL
author:         OpenZFS on Linux
description:    ZFS
srcversion:     4251E810337436FD7B850DA
depends:        spl,znvpair,zunicode,zcommon,zavl
vermagic:       4.4.6-gentoo SMP mod_unload modversions 
[... And so on.]

localhost ~ # ls /l/photos/.zfs/snapshot/send/
ls: cannot open directory /l/photos/.zfs/snapshot/send/: Too many levels of symbolic links
[Wait a minute]
localhost ~ # ls /l/photos/.zfs/snapshot/send/
ls: cannot open directory /l/photos/.zfs/snapshot/send/: Too many levels of symbolic links

localhost ~ # pwd
/root

localhost ~ # ls /l/photos/.zfs/snapshot/send/
ls: cannot open directory /l/photos/.zfs/snapshot/send/: Too many levels of symbolic links
localhost ~ # ls /l/photos/.zfs/snapshot/send
ls: cannot access /l/photos/.zfs/snapshot/send/stuff: Too many levels of symbolic links
ls: cannot access /l/photos/.zfs/snapshot/send/thing: Too many levels of symbolic links
ls: cannot access /l/photos/.zfs/snapshot/send/whatever: Too many levels of symbolic links
[...]
localhost ~ # ls /l/photos/.zfs/snapshot/
[Expected output, no error]
localhost ~ # ls /l/photos/.zfs/
[Expected output, no error]
localhost ~ # ls /l/photos/
[Expected output, no error]
localhost ~ # ls /l/
[Expected output, no error]

localhost photos # cd /l/photos
localhost photos # pwd
/l/photos

localhost photos # ls .zfs/snapshot/send/
ls: cannot open directory .zfs/snapshot/send/: Too many levels of symbolic links

localhost photos # cd .zfs
localhost .zfs # pwd
/l/photos/.zfs
localhost .zfs # ls snapshot/send/
thing
stuff
whatever
[...And the correctly expected results with no errors.]

Using an absolute path fails, even with the CWD being .zfs:

localhost .zfs # ls /l/photos/.zfs/snapshot/send/
ls: cannot open directory /l/photos/.zfs/snapshot/send/: Too many levels of symbolic links

Only relative paths, with the CWD at .zfs or below, work without error.
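The absolute-versus-relative comparison above can be scripted. This is a minimal Python sketch under the assumption that listing the directory is enough to trigger the error; `compare_access` is a hypothetical helper name.

```python
import errno
import os


def _attempt(path):
    """Try to list path; return 'ok' or the errno name of the failure."""
    try:
        os.listdir(path)
        return "ok"
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))


def compare_access(base, relative):
    """List base/relative once via an absolute path and once relative to base."""
    absolute = os.path.join(os.path.abspath(base), relative)
    result = {"absolute": _attempt(absolute)}
    saved = os.getcwd()
    try:
        os.chdir(base)  # make base the CWD for the relative attempt
        result["relative"] = _attempt(relative)
    finally:
        os.chdir(saved)  # always restore the original CWD
    return result
```

For the case shown above, the call would be `compare_access("/l/photos/.zfs", "snapshot/send")`. If the behaviour matches the transcript, the absolute attempt would fail with ELOOP while the relative one succeeds.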

There's nothing in dmesg, and I've been scrubbing the pool, but there are no errors detected so far.

Oh, also...

localhost .zfs # zpool get all l
NAME  PROPERTY                    VALUE                       SOURCE
l     size                        36.2T                       -
l     capacity                    67%                         -
l     altroot                     -                           default
l     health                      ONLINE                      -
l     guid                        4946876290228094116         default
l     version                     -                           default
l     bootfs                      -                           default
l     delegation                  on                          default
l     autoreplace                 off                         default
l     cachefile                   -                           default
l     failmode                    wait                        default
l     listsnapshots               off                         default
l     autoexpand                  off                         default
l     dedupditto                  0                           default
l     dedupratio                  1.00x                       -
l     free                        11.9T                       -
l     allocated                   24.3T                       -
l     readonly                    off                         -
l     ashift                      12                          local
l     comment                     -                           default
l     expandsize                  -                           -
l     freeing                     0                           default
l     fragmentation               37%                         -
l     leaked                      0                           default
l     feature@async_destroy       enabled                     local
l     feature@empty_bpobj         active                      local
l     feature@lz4_compress        active                      local
l     feature@spacemap_histogram  active                      local
l     feature@enabled_txg         active                      local
l     feature@hole_birth          active                      local
l     feature@extensible_dataset  active                      local
l     feature@embedded_data       active                      local
l     feature@bookmarks           enabled                     local
l     feature@filesystem_limits   enabled                     local
l     feature@large_blocks        active                      local

localhost .zfs # zfs get all l/photos
NAME      PROPERTY              VALUE                                                       SOURCE
l/photos  type                  filesystem                                                  -
l/photos  creation              Tue Apr 26 19:43 2016                                       -
l/photos  used                  13.8T                                                       -
l/photos  available             8.64T                                                       -
l/photos  referenced            13.2T                                                       -
l/photos  compressratio         1.01x                                                       -
l/photos  mounted               yes                                                         -
l/photos  quota                 none                                                        default
l/photos  reservation           none                                                        default
l/photos  recordsize            128K                                                        default
l/photos  mountpoint            /l/photos                                                   default
l/photos  sharenfs              fsid=25,rw=172.16.111.0/24,sec=sys,insecure,insecure_locks  received
l/photos  checksum              on                                                          default
l/photos  compression           on                                                          inherited from l
l/photos  atime                 off                                                         inherited from l
l/photos  devices               off                                                         inherited from l
l/photos  exec                  on                                                          default
l/photos  setuid                off                                                         inherited from l
l/photos  readonly              off                                                         default
l/photos  zoned                 off                                                         default
l/photos  snapdir               hidden                                                      default
l/photos  aclinherit            restricted                                                  default
l/photos  canmount              on                                                          default
l/photos  xattr                 on                                                          default
l/photos  copies                1                                                           default
l/photos  version               1                                                           -
l/photos  utf8only              off                                                         default
l/photos  normalization         none                                                        default
l/photos  casesensitivity       sensitive                                                   default
l/photos  vscan                 off                                                         default
l/photos  nbmand                off                                                         default
l/photos  sharesmb              off                                                         default
l/photos  refquota              none                                                        default
l/photos  refreservation        none                                                        default
l/photos  primarycache          all                                                         default
l/photos  secondarycache        all                                                         default
l/photos  usedbysnapshots       680G                                                        -
l/photos  usedbydataset         13.2T                                                       -
l/photos  usedbychildren        0                                                           -
l/photos  usedbyrefreservation  0                                                           -
l/photos  logbias               latency                                                     default
l/photos  dedup                 off                                                         default
l/photos  mlslabel              none                                                        default
l/photos  sync                  standard                                                    default
l/photos  refcompressratio      1.01x                                                       -
l/photos  written               0                                                           -
l/photos  logicalused           14.0T                                                       -
l/photos  logicalreferenced     13.3T                                                       -
l/photos  filesystem_limit      none                                                        default
l/photos  snapshot_limit        none                                                        default
l/photos  filesystem_count      none                                                        default
l/photos  snapshot_count        none                                                        default
l/photos  snapdev               hidden                                                      default
l/photos  acltype               off                                                         default
l/photos  context               none                                                        default
l/photos  fscontext             none                                                        default
l/photos  defcontext            none                                                        default
l/photos  rootcontext           none                                                        default
l/photos  relatime              off                                                         default
l/photos  redundant_metadata    all                                                         default
l/photos  overlay               off                                                         default

There are 43 snapshots under /l/photos/.zfs/snapshot. The filesystem has been through "zfs send" and "zfs recv" a few times over the years, from older versions of ZFS on Solaris, FreeBSD, and Linux. It's been running under ZFSonLinux for two years now. I just did a...

zfs send -eLRv -I May_20_2015 h/photos@send | zfs recv -evF l

...a few days ago. (The h pool was an older version without many features turned on. And I don't remember this snapshot error on that pool, but I'd have to plug the drives back in to check...)

JuliaVixen · 2016-05-04T23:48:13Z

I plugged the drives back in; there is no error when using the old pool.

localhost ~ # ls /l/photos/.zfs/snapshot/send/
ls: cannot open directory /l/photos/.zfs/snapshot/send/: Too many levels of symbolic links
localhost ~ # ls /h/photos/.zfs/snapshot/send/
things
stuff
[...expected results, no errors]

localhost ~ # zpool get all h

[Ignore the "DEGRADED" state, I only plugged in just enough drives to import this to test.]

h     size                        21.8T                       -
h     capacity                    96%                         -
h     altroot                     -                           default
h     health                      DEGRADED                    -
h     guid                        5105680881284105628         default
h     version                     -                           default
h     bootfs                      -                           default
h     delegation                  on                          default
h     autoreplace                 off                         default
h     cachefile                   -                           default
h     failmode                    wait                        default
h     listsnapshots               off                         default
h     autoexpand                  off                         default
h     dedupditto                  0                           default
h     dedupratio                  1.00x                       -
h     free                        825G                        -
h     allocated                   20.9T                       -
h     readonly                    on                          -
h     ashift                      12                          local
h     comment                     -                           default
h     expandsize                  -                           -
h     freeing                     0                           default
h     fragmentation               0%                          -
h     leaked                      0                           default
h     feature@async_destroy       enabled                     local
h     feature@empty_bpobj         active                      local
h     feature@lz4_compress        active                      local
h     feature@spacemap_histogram  disabled                    local
h     feature@enabled_txg         disabled                    local
h     feature@hole_birth          disabled                    local
h     feature@extensible_dataset  disabled                    local
h     feature@embedded_data       active                      local
h     feature@bookmarks           disabled                    local
h     feature@filesystem_limits   disabled                    local
h     feature@large_blocks        disabled                    local

localhost ~ # zfs get all h/photos
NAME      PROPERTY              VALUE                                                       SOURCE
h/photos  type                  filesystem                                                  -
h/photos  creation              Sat Mar 21 20:06 2015                                       -
h/photos  used                  13.9T                                                       -
h/photos  available             85.7G                                                       -
h/photos  referenced            13.0T                                                       -
h/photos  compressratio         1.00x                                                       -
h/photos  mounted               yes                                                         -
h/photos  quota                 none                                                        default
h/photos  reservation           none                                                        default
h/photos  recordsize            128K                                                        default
h/photos  mountpoint            /h/photos                                                   default
h/photos  sharenfs              fsid=25,rw=172.16.111.0/24,sec=sys,insecure,insecure_locks  received
h/photos  checksum              on                                                          default
h/photos  compression           lz4                                                         local
h/photos  atime                 off                                                         local
h/photos  devices               off                                                         local
h/photos  exec                  off                                                         local
h/photos  setuid                off                                                         local
h/photos  readonly              on                                                          temporary
h/photos  zoned                 off                                                         default
h/photos  snapdir               hidden                                                      default
h/photos  aclinherit            restricted                                                  default
h/photos  canmount              on                                                          default
h/photos  xattr                 on                                                          default
h/photos  copies                1                                                           default
h/photos  version               1                                                           -
h/photos  utf8only              off                                                         default
h/photos  normalization         none                                                        default
h/photos  casesensitivity       sensitive                                                   default
h/photos  vscan                 off                                                         default
h/photos  nbmand                off                                                         default
h/photos  sharesmb              off                                                         default
h/photos  refquota              none                                                        default
h/photos  refreservation        none                                                        default
h/photos  primarycache          all                                                         default
h/photos  secondarycache        all                                                         default
h/photos  usedbysnapshots       988G                                                        -
h/photos  usedbydataset         13.0T                                                       -
h/photos  usedbychildren        0                                                           -
h/photos  usedbyrefreservation  0                                                           -
h/photos  logbias               latency                                                     default
h/photos  dedup                 off                                                         default
h/photos  mlslabel              none                                                        default
h/photos  sync                  standard                                                    default
h/photos  refcompressratio      1.00x                                                       -
h/photos  written               234K                                                        -
h/photos  logicalused           13.9T                                                       -
h/photos  logicalreferenced     13.0T                                                       -
h/photos  filesystem_limit      none                                                        default
h/photos  snapshot_limit        none                                                        default
h/photos  filesystem_count      none                                                        default
h/photos  snapshot_count        none                                                        default
h/photos  snapdev               hidden                                                      default
h/photos  acltype               off                                                         default
h/photos  context               none                                                        default
h/photos  fscontext             none                                                        default
h/photos  defcontext            none                                                        default
h/photos  rootcontext           none                                                        default
h/photos  relatime              off                                                         default
h/photos  redundant_metadata    all                                                         default
h/photos  overlay               off                                                         default

I have this in my zpool history:
2016-04-26.06:53:29 zpool set feature@embedded_data=enabled h
But I haven't modified any data in the pool at all, so it has probably had no tangible effect.
I also set these on h/photos, just before I made the h/photos@send snapshot:
h/photos compression lz4 local
h/photos atime off local
h/photos devices off local
h/photos exec off local
h/photos setuid off local
These were all the default values (off), for all snapshots prior to that last snapshot.

All 43 snapshots on the pool "h" work without error; all 43 snapshots on the pool "l" return the "Too many levels of symbolic links" error message.

localhost ~ # zdb -C l

MOS Configuration:
    version: 5000
    name: 'l'
    state: 0
    txg: 623065
    pool_guid: 4946876290228094116
    errata: 0
    hostname: 'localhost'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 4946876290228094116
        create_txg: 4
        children[0]:
            type: 'raidz'
            id: 0
            guid: 3883405302801716307
            nparity: 1
            metaslab_array: 35
            metaslab_shift: 38
            ashift: 12
            asize: 40007743569920
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 560335158339124174
                path: '/dev/disk/by-id/ata-WDC_WD80EFZX-68UW8N0_VKGU6V2X-part1'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 17515656895715273874
                path: '/dev/disk/by-id/ata-WDC_WD80EFZX-68UW8N0_VKH408MX-part1'
                whole_disk: 1
                DTL: 425
                create_txg: 4
            children[2]:
                type: 'disk'
                id: 2
                guid: 8921389355845190819
                path: '/dev/disk/by-id/ata-WDC_WD80EFZX-68UW8N0_VKH52T6X-part1'
                whole_disk: 1
                create_txg: 4
            children[3]:
                type: 'disk'
                id: 3
                guid: 13116042119914339975
                path: '/dev/disk/by-id/ata-WDC_WD80EFZX-68UW8N0_VKHLNHZX-part1'
                whole_disk: 1
                create_txg: 4
            children[4]:
                type: 'disk'
                id: 4
                guid: 6003650805274103645
                path: '/dev/disk/by-id/ata-WDC_WD80EFZX-68UW8N0_VKHNJ93X-part1'
                whole_disk: 1
                create_txg: 4
    features_for_read:
        com.delphix:embedded_data
        com.delphix:hole_birth
space map refcount mismatch: expected 150 != actual 146

localhost ~ # zdb -C h
zdb: can't open 'h': No such file or directory
[That's unexpected...]

localhost ~ # zpool list
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH    ALTROOT
h     21.8T  20.9T   825G         -     0%    96%  1.00x  DEGRADED  -
l     36.2T  24.3T  11.9T         -    37%    67%  1.00x  ONLINE    -

localhost ~ # zdb -d l
Dataset mos [META], ID 0, cr_txg 4, 356M, 672 objects
Dataset l/photos@May_20_2015 [ZPL], ID 632, cr_txg 550939, 8.96T, 1946909 objects
[etc...]
Dataset l/photos@Oct_19_2015 [ZPL], ID 861, cr_txg 601790, 12.0T, 2214649 objects
Dataset l/photos [ZPL], ID 218, cr_txg 69828, 13.2T, 2290873 objects
Dataset l/test [ZPL], ID 356, cr_txg 131705, 153K, 6 objects
Dataset l/new [ZPL], ID 42, cr_txg 75, 4.40T, 35274 objects
Dataset l/CDs [ZPL], ID 49, cr_txg 1607, 78.8G, 144 objects
Dataset l [ZPL], ID 21, cr_txg 1, 1.11T, 2351 objects
Verified large_blocks feature refcount is correct (1)
space map refcount mismatch: expected 150 != actual 146

localhost ~ # zdb -d h
zdb: can't open 'h': No such file or directory

That's strange, zdb isn't seeing the other pool. Is this another bug? "zdb -C" only lists the "l" pool and not the "h" pool. Also the "h" pool doesn't appear in "/etc/zfs/zpool.cache". Could this be because I imported "h" with:

zpool import -o readonly=on h

JuliaVixen · 2016-05-05T00:07:48Z

So, I just checked, and yes, when a pool is imported "readonly=on", it doesn't appear in zpool.cache and "zdb" can't see it. I'm going to create an issue for this if there isn't one already...

localhost ~ # zpool import -o readonly=on h
localhost ~ # zpool list
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
h     21.8T  20.9T   825G         -     0%    96%  1.00x  ONLINE  -
l     36.2T  24.3T  11.9T         -    37%    67%  1.00x  ONLINE  -

localhost ~ # strings /etc/zfs/zpool.cache 
[There's just one pool there]

localhost ~ # zdb -C  
l:
    version: 5000
    name: 'l'
    state: 0
[etc etc, just this one pool]

[But, re-import read-write...]

localhost ~ # zpool export h
localhost ~ # zpool import h
localhost ~ # zdb -C
h:
    version: 5000
    name: 'h'
    state: 0
    txg: 5195339
    pool_guid: 5105680881284105628
    errata: 0
    hostname: 'localhost'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 5105680881284105628
        children[0]:
            type: 'raidz'
            id: 0
            guid: 16056632966898540210
            nparity: 1
            metaslab_array: 34
            metaslab_shift: 37
            ashift: 12
            asize: 24004646141952
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 2091825546480782105
                path: '/dev/sdn1'
                whole_disk: 1
                DTL: 381
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 10983704478065269639
                path: '/dev/sdo1'
                whole_disk: 1
                DTL: 261
                create_txg: 4
            children[2]:
                type: 'disk'
                id: 2
                guid: 16892970404048100612
                path: '/dev/sdm1'
                whole_disk: 1
                DTL: 380
                create_txg: 4
    features_for_read:
        com.delphix:embedded_data

l:
    version: 5000
    name: 'l'
    state: 0
    txg: 623065
    pool_guid: 4946876290228094116
    errata: 0
[etc. etc.]
localhost ~ # zdb -d h
Dataset mos [META], ID 0, cr_txg 4, 354M, 680 objects
Dataset h/photos@2015_Jul_8 [ZPL], ID 330, cr_txg 1261118, 10.1T, 1910968 objects
[etc. etc].

JuliaVixen · 2016-05-18T23:51:33Z

I have updated to the Gentoo zfs-9999 package (as of May 17, 2016), and with this version I can no longer reproduce the "Too many levels of symbolic links" error when I "ls" the ".zfs/snapshot" directory with an absolute path. As far as I know, nothing has changed in the configuration of my filesystem, except that I made a new snapshot yesterday; I haven't even written any data to this filesystem since I reported this bug. The only things that have changed in the configuration of the zpool this filesystem sits on are the creation of one or two new zfs filesystems, the destruction (I think) of one filesystem, a few TiB of data written into the other zfs filesystems, and some "zfs send"s of all the filesystems.

The exact version of zfs-9999 I'm using is this one:
#4582 (comment)

Skip ctldir in zfs_rezget, otherwise they will always get invalidated. This will cause funny behaviour for the mounted snapdirs. Especially for Linux >= 3.18, d_invalidate will detach the mountpoint and prevent anyone from automounting it again as long as someone is still using the detached mount.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#4514
Closes openzfs#4661
Closes openzfs#4672

* Consistently use parsable instead of parseable

  This is a purely cosmetic change, to consistently prefer one of two (both acceptable) choices for the word parsable in documentation and code. I don't really care which to use, but according to wiktionary https://en.wiktionary.org/wiki/parsable#English parsable is preferred.

  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Closes #4682

* Add missing RPM BuildRequires

  Both libudev and libattr are recommended build requirements. As such their development headers should be listed in the rpm spec file so those dependencies are pulled in when building rpm packages.

  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Closes #4676

* Skip ctldir znode in zfs_rezget to fix snapdir issues

  Skip ctldir in zfs_rezget, otherwise they will always get invalidated. This will cause funny behaviour for the mounted snapdirs. Especially for Linux >= 3.18, d_invalidate will detach the mountpoint and prevent anyone from automounting it again as long as someone is still using the detached mount.

  Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Closes #4514
  Closes #4661
  Closes #4672

* Improve zfs-module-parameters(5)

  Various rewrites to the descriptions of module parameters. Corrects spelling mistakes, makes the descriptions more user-friendly and describes some ZFS quirks which should be understood before changing parameter values.

  Signed-off-by: DHE <git@dehacked.net>
  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Closes #4671

* Fix arc_prune_task use-after-free

  arc_prune_task uses a refcount to protect arc_prune_t, but it doesn't prevent the underlying zsb from disappearing if there's a concurrent umount. We fix this by forcing the caller of arc_remove_prune_callback to wait for arc_prune_taskq to finish.

  Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Closes #4687
  Closes #4690

* Add request size histograms (-r) to zpool iostat, minor man page fix

  Add -r option to "zpool iostat" to print request size histograms for the leaf ZIOs. This includes histograms of individual ZIOs ("ind") and aggregate ZIOs ("agg"). These stats can be useful for seeing how well the ZFS IO aggregator is working.

      $ zpool iostat -r

      mypool         sync_read    sync_write    async_read   async_write         scrub
      req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
      ----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
      512             0      0      0      0      0      0    530      0      0      0
      1K              0      0    260      0      0      0    116    246      0      0
      2K              0      0      0      0      0      0      0    431      0      0
      4K              0      0      0      0      0      0      3    107      0      0
      8K             15      0     35      0      0      0      0      6      0      0
      16K             0      0      0      0      0      0      0     39      0      0
      32K             0      0      0      0      0      0      0      0      0      0
      64K            20      0     40      0      0      0      0      0      0      0
      128K            0      0     20      0      0      0      0      0      0      0
      256K            0      0      0      0      0      0      0      0      0      0
      512K            0      0      0      0      0      0      0      0      0      0
      1M              0      0      0      0      0      0      0      0      0      0
      2M              0      0      0      0      0      0      0      0      0      0
      4M              0      0      0      0      0      0    155     19      0      0
      8M              0      0      0      0      0      0      0    811      0      0
      16M             0      0      0      0      0      0      0     68      0      0
      --------------------------------------------------------------------------------

  Also rename the stray "-G" in the man page to be "-w" for latency histograms.

  Signed-off-by: Tony Hutter <hutter2@llnl.gov>
  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Signed-off-by: Tim Chase <tim@chase2k.com>
  Closes #4659

* OpenZFS 6531 - Provide mechanism to artificially limit disk performance

  Reviewed by: Paul Dagnelie <pcd@delphix.com>
  Reviewed by: Matthew Ahrens <mahrens@delphix.com>
  Reviewed by: George Wilson <george.wilson@delphix.com>
  Approved by: Dan McDonald <danmcd@omniti.com>
  Ported by: Tony Hutter <hutter2@llnl.gov>
  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

  OpenZFS-issue: https://www.illumos.org/issues/6531
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/97e8130

  Porting notes:
  - Added new IO delay tracepoints, and moved common ZIO tracepoint macros to a new trace_common.h file.
  - Used zio_delay_taskq() in place of OpenZFS's timeout_generic() function.
  - Updated zinject man page
  - Updated zpool_scrub test files

* Systemd configuration fixes

  * Disable zfs-import-scan.service by default. This ensures that pools will not be automatically imported unless they appear in the cache file. When this service is explicitly enabled pools will be imported with the "cachefile=none" property set. This prevents the creation of, or update to, an existing cache file.

        $ systemctl list-unit-files | grep zfs
        zfs-import-cache.service  enabled
        zfs-import-scan.service   disabled
        zfs-mount.service         enabled
        zfs-share.service         enabled
        zfs-zed.service           enabled
        zfs.target                enabled

  * Change services to dynamic from static by adding an [Install] section and adding 'WantedBy' tags in favor of 'Requires' tags. This allows for easier customization of the boot behavior.

  * Start the zfs-import-cache.service after the root pivot so the cache file is available in the standard location.

  * Start the zfs-mount.service after the systemd-remount-fs.service to ensure the root fs is writeable and the ZFS filesystems can create their mount points.

  * Change the default behavior to only load the ZFS kernel modules in zfs-import-*.service or when blkid(8) detects a pool. Users who wish to unconditionally load the kernel modules must uncomment the list of modules in /lib/modules-load.d/zfs.conf.

  Reviewed-by: Richard Laager <rlaager@wiktel.com>
  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Closes #4325
  Closes #4496
  Closes #4658
  Closes #4699

* Fix self-healing IO prior to dsl_pool_init() completion

  Async writes triggered by a self-healing IO may be issued before the pool finishes the process of initialization. This results in a NULL dereference of `spa->spa_dsl_pool` in vdev_queue_max_async_writes().

  George Wilson recommended addressing this issue by initializing the passed `dsl_pool_t **` prior to dmu_objset_open_impl(). Since the caller is passing the `spa->spa_dsl_pool` this has the effect of ensuring it's initialized.

  However, since this depends on the caller knowing they must pass the `spa->spa_dsl_pool` an additional NULL check was added to vdev_queue_max_async_writes(). This guards against any future restructuring of the code which might result in dsl_pool_init() being called differently.

  Signed-off-by: GeLiXin <47034221@qq.com>
  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Closes #4652

* Add isa_defs for MIPS

  GCC for MIPS only defines _LP64 when 64bit, while no _ILP32 defined when 32bit.

  Signed-off-by: YunQiang Su <syq@debian.org>
  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Closes #4712

* Fix out-of-bound access in zfs_fillpage

  The original code will do an out-of-bound access on pl[] during last iteration.

      ==================================================================
      BUG: KASAN: stack-out-of-bounds in zfs_getpage+0x14c/0x2d0 [zfs]
      Read of size 8 by task tmpfile/7850
      page:ffffea00017c6dc0 count:0 mapcount:0 mapping: (null) index:0x0
      flags: 0xffff8000000000()
      page dumped because: kasan: bad access detected
      CPU: 3 PID: 7850 Comm: tmpfile Tainted: G OE 4.6.0+ #3
       ffff88005f1b7678 0000000006dbe035 ffff88005f1b7508 ffffffff81635618
       ffff88005f1b7678 ffff88005f1b75a0 ffff88005f1b7590 ffffffff81313ee8
       ffffea0001ae8dd0 ffff88005f1b7670 0000000000000246 0000000041b58ab3
      Call Trace:
       [<ffffffff81635618>] dump_stack+0x63/0x8b
       [<ffffffff81313ee8>] kasan_report_error+0x528/0x560
       [<ffffffff81278f20>] ? filemap_map_pages+0x5f0/0x5f0
       [<ffffffff813144b8>] kasan_report+0x58/0x60
       [<ffffffffc12250dc>] ? zfs_getpage+0x14c/0x2d0 [zfs]
       [<ffffffff81312e4e>] __asan_load8+0x5e/0x70
       [<ffffffffc12250dc>] zfs_getpage+0x14c/0x2d0 [zfs]
       [<ffffffffc1252131>] zpl_readpage+0xd1/0x180 [zfs]
       [<ffffffff81353c3a>] SyS_execve+0x3a/0x50
       [<ffffffff810058ef>] do_syscall_64+0xef/0x180
       [<ffffffff81d0ee25>] entry_SYSCALL64_slow_path+0x25/0x25
      Memory state around the buggy address:
       ffff88005f1b7500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       ffff88005f1b7580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      >ffff88005f1b7600: 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f4
       ^
       ffff88005f1b7680: f4 f4 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00
       ffff88005f1b7700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      ==================================================================

  Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
  Signed-off-by: Tony Hutter <hutter2@llnl.gov>
  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Closes #4705
  Issue #4708

* Fix memleak in zpl_parse_options

  strsep() will advance tmp_mntopts, and will change it to NULL on last iteration. This will cause strfree(tmp_mntopts) to not free anything.

      unreferenced object 0xffff8800883976c0 (size 64):
        comm "mount.zfs", pid 3361, jiffies 4294931877 (age 1482.408s)
        hex dump (first 32 bytes):
          72 77 00 73 74 72 69 63 74 61 74 69 6d 65 00 7a  rw.strictatime.z
          66 73 75 74 69 6c 00 6d 6e 74 70 6f 69 6e 74 3d  fsutil.mntpoint=
        backtrace:
          [<ffffffff81810c4e>] kmemleak_alloc+0x4e/0xb0
          [<ffffffff811f9cac>] __kmalloc+0x16c/0x250
          [<ffffffffc065ce9b>] strdup+0x3b/0x60 [spl]
          [<ffffffffc080fad6>] zpl_parse_options+0x56/0x300 [zfs]
          [<ffffffffc080fe46>] zpl_mount+0x36/0x80 [zfs]
          [<ffffffff81222dc8>] mount_fs+0x38/0x160
          [<ffffffff81240097>] vfs_kern_mount+0x67/0x110
          [<ffffffff812428e0>] do_mount+0x250/0xe20
          [<ffffffff812437d5>] SyS_mount+0x95/0xe0
          [<ffffffff8181aff6>] entry_SYSCALL_64_fastpath+0x1e/0xa8
          [<ffffffffffffffff>] 0xffffffffffffffff

  Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
  Signed-off-by: Tony Hutter <hutter2@llnl.gov>
  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Closes #4706
  Issue #4708

* Fix memleak in vdev_config_generate_stats

  fnvlist_add_nvlist will copy the contents of nvx, so we need to free it here.

      unreferenced object 0xffff8800a6934e80 (size 64):
        comm "zpool", pid 3398, jiffies 4295007406 (age 214.180s)
        hex dump (first 32 bytes):
          60 06 c2 73 00 88 ff ff 00 7c 8c 73 00 88 ff ff  `..s.....|.s....
          00 00 00 00 00 00 00 00 40 b0 70 c0 ff ff ff ff  ........@.p.....
        backtrace:
          [<ffffffff81810c4e>] kmemleak_alloc+0x4e/0xb0
          [<ffffffff811fac7d>] __kmalloc_node+0x17d/0x310
          [<ffffffffc065528c>] spl_kmem_alloc_impl+0xac/0x180 [spl]
          [<ffffffffc0657379>] spl_vmem_alloc+0x19/0x20 [spl]
          [<ffffffffc07056cf>] nv_alloc_sleep_spl+0x1f/0x30 [znvpair]
          [<ffffffffc07006b7>] nvlist_xalloc.part.13+0x27/0xc0 [znvpair]
          [<ffffffffc07007ad>] nvlist_alloc+0x3d/0x40 [znvpair]
          [<ffffffffc0703abc>] fnvlist_alloc+0x2c/0x80 [znvpair]
          [<ffffffffc07b1783>] vdev_config_generate_stats+0x83/0x370 [zfs]
          [<ffffffffc07b1f53>] vdev_config_generate+0x4e3/0x650 [zfs]
          [<ffffffffc07996db>] spa_config_generate+0x20b/0x4b0 [zfs]
          [<ffffffffc0794f64>] spa_tryimport+0xc4/0x430 [zfs]
          [<ffffffffc07d11d8>] zfs_ioc_pool_tryimport+0x68/0x110 [zfs]
          [<ffffffffc07d4fc6>] zfsdev_ioctl+0x646/0x7a0 [zfs]
          [<ffffffff81232e31>] do_vfs_ioctl+0xa1/0x5b0
          [<ffffffff812333b9>] SyS_ioctl+0x79/0x90

  Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
  Signed-off-by: Tony Hutter <hutter2@llnl.gov>
  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Closes #4707
  Issue #4708

* Linux 4.7 compat: handler->set() takes both dentry and inode

  Counterpart to fd4c7b7, the same approach was taken to resolve the compatibility issue.

  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
  Closes #4717
  Issue #4665

* Implementation of AVX2 optimized Fletcher-4

  New functionality:
  - Preserves existing scalar implementation.
  - Adds AVX2 optimized Fletcher-4 computation.
  - Fastest routines selected on module load (benchmark).
  - Test case for Fletcher-4 added to ztest.

  New zcommon module parameters:
  - zfs_fletcher_4_impl (str): selects the implementation to use.
      "fastest" - use the fastest version available
      "cycle"   - cycle through all available impl for ztest
      "scalar"  - use the original version
      "avx2"    - new AVX2 implementation if available

  Performance comparison (Intel i7 CPU, 1MB data buffers):
  - Scalar: 4216 MB/s
  - AVX2: 14499 MB/s

  See contents of `/sys/module/zcommon/parameters/zfs_fletcher_4_impl` to get list of supported values. If an implementation is not supported on the system, it will not be shown. Currently selected option is enclosed in `[]`.

  Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
  Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Closes #4330

* Fix cstyle.pl warnings

  As of perl v5.22.1 the following warnings are generated:

  * Redundant argument in printf at scripts/cstyle.pl line 194
  * Unescaped left brace in regex is deprecated, passed through in regex; marked by <-- HERE in m/\S{ <-- HERE / at scripts/cstyle.pl line 608.

  They have been addressed by escaping the left braces and by providing the correct number of arguments to printf based on the fmt specifier set by the verbose option.

  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Closes #4723

* Fix minor spelling mistakes

  Trivial spelling mistake fix in error message text.

  * Fix spelling mistake "adminstrator" -> "administrator"
  * Fix spelling mistake "specificed" -> "specified"
  * Fix spelling mistake "interperted" -> "interpreted"

  Signed-off-by: Colin Ian King <colin.king@canonical.com>
  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Closes #4728

* Add `zfs allow` and `zfs unallow` support

  ZFS allows for specific permissions to be delegated to normal users with the `zfs allow` and `zfs unallow` commands. In addition, non-privileged users should be able to run all of the following commands:

  * zpool [list | iostat | status | get]
  * zfs [list | get]

  Historically this functionality was not available on Linux. In order to add it the secpolicy_* functions needed to be implemented and mapped to the equivalent Linux capability. Only then could the permissions on the `/dev/zfs` be relaxed and the internal ZFS permission checks used.

  Even with this change some limitations remain. Under Linux only the root user is allowed to modify the namespace (unless it's a private namespace). This means the mount, mountpoint, canmount, unmount, and remount delegations cannot be supported with the existing code. It may be possible to add this functionality in the future.

  This functionality was validated with the cli_user and delegation test cases from the ZFS Test Suite. These tests exhaustively verify each of the supported permissions which can be delegated and ensures only an authorized user can perform it.

  Two minor bug fixes were required for test-running.py. First, the Timer() object cannot be safely created in a `try:` block when there is an unconditional `finally` block which references it. Second, when running as a normal user also check for scripts using both the .ksh and .sh suffixes.

  Finally, existing users who are simulating delegations by setting group permissions on the /dev/zfs device should revert that customization when updating to a version with this change.

  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Signed-off-by: Tony Hutter <hutter2@llnl.gov>
  Closes #362
  Closes #434
  Closes #4100
  Closes #4394
  Closes #4410
  Closes #4487

* Remove libzfs_graph.c

  The libzfs_graph.c source file should have been removed in 330d06f, it is entirely unused.

  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Closes #4766

* Linux 4.6 compat: Fall back to d_prune_aliases() if necessary

  As of 4.6, the icache and dcache LRUs are memcg aware insofar as the kernel's per-superblock shrinker is concerned. The effect is that dcache or icache entries added by a task in a non-root memcg won't be scanned by the shrinker in the context of the root (or NULL) memcg. This defeats the attempts by zfs_sb_prune() to unpin buffers and can allow metadata to grow uncontrollably. This patch reverts to the d_prune_aliases() method in case the kernel's per-superblock shrinker is not able to free anything.

  Signed-off-by: Tim Chase <tim@chase2k.com>
  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
  Closes: #4726

* SIMD implementation of vdev_raidz generate and reconstruct routines

  This is a new implementation of RAIDZ1/2/3 routines using x86_64 scalar, SSE, and AVX2 instruction sets. Included are 3 parity generation routines (P, PQ, and PQR) and 7 reconstruction routines, for all RAIDZ levels. On module load, a quick benchmark of supported routines will select the fastest for each operation and they will be used at runtime. Original implementation is still present and can be selected via module parameter.

  Patch contains:
  - specialized gen/rec routines for all RAIDZ levels,
  - new scalar raidz implementation (unrolled),
  - two x86_64 SIMD implementations (SSE and AVX2 instructions sets),
  - fastest routines selected on module load (benchmark).
  - cmd/raidz_test - verify and benchmark all implementations
  - added raidz_test to the ZFS Test Suite

  New zfs module parameters:
  - zfs_vdev_raidz_impl (str): selects the implementation to use. On module load, the parameter will only accept first 3 options, and the other implementations can be set once module is finished loading. Possible values for this option are:
      "fastest"  - use the fastest math available
      "original" - use the original raidz code
      "scalar"   - new scalar impl
      "sse"      - new SSE impl if available
      "avx2"     - new AVX2 impl if available

  See contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` to get the list of supported values. If an implementation is not supported on the system, it will not be shown. Currently selected option is enclosed in `[]`.

  Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Closes #4328

* Fix NFS credential

  The commit f74b821 caused a regression where creating a file through NFS will always create a file owned by root. This is because the patch enables the KSID code in zfs_acl_ids_create, which would use the euid and egid of the current process. However, on Linux, we should use fsuid and fsgid for file operations, which is the original behaviour. So we revert this part of the code.

  The patch also enables secpolicy_vnode_*; since they are also used in file operations, we change them to use fsuid and fsgid.

  Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Closes #4772
  Closes #4758

* OpenZFS 6513 - partially filled holes lose birth time

  Reviewed by: Matthew Ahrens <mahrens@delphix.com>
  Reviewed by: George Wilson <george.wilson@delphix.com>
  Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
  Approved by: Richard Lowe <richlowe@richlowe.net>
  Ported by: Boris Protopopov <bprotopopov@actifio.com>
  Signed-off-by: Boris Protopopov <bprotopopov@actifio.com>
  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

  OpenZFS-issue: https://www.illumos.org/issues/6513
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8df0bcf0

  If a ZFS object contains a hole at level one, and then a data block is created at level 0 underneath that l1 block, l0 holes will be created. However, these l0 holes do not have the birth time property set; as a result, incremental sends will not send those holes.

  Fix is to modify the dbuf_read code to fill in birth time data.
* Add a test case for dmu_free_long_range() to ztest

  Signed-off-by: Boris Protopopov <bprotopopov@actifio.com>
  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Closes #4754

* Revert "Add a test case for dmu_free_long_range() to ztest"

  This reverts commit d0de2e82df579f4e4edf5643b674a1464fae485f which introduced a new test case to ztest which is failing occasionally during automated testing. The change is being reverted until the issue can be fully investigated.

  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Issue #4754

* OpenZFS 6878 - Add scrub completion info to "zpool history"

  Reviewed by: Matthew Ahrens <mahrens@delphix.com>
  Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
  Approved by: Dan McDonald <danmcd@omniti.com>
  Authored by: Nav Ravindranath <nav@delphix.com>
  Ported-by: Chris Dunlop <chris@onthe.net.au>
  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

  OpenZFS-issue: https://www.illumos.org/issues/6878
  OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1825bc5
  Closes #4787

* FreeBSD rS271776 - Persist vdev_resilver_txg changes

  Persist vdev_resilver_txg changes to avoid panic caused by validation vs a vdev_resilver_txg value from a previous resilver.

  Authored-by: smh <smh@FreeBSD.org>
  Ported-by: Chris Dunlop <chris@onthe.net.au>
  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

  OpenZFS-issue: https://www.illumos.org/issues/5154
  FreeBSD-issue: https://reviews.freebsd.org/rS271776
  FreeBSD-commit: https://github.com/freebsd/freebsd/commit/c3c60bf
  Closes #4790

* xattrtest: allow verify with -R and other improvements

  - Use a fixed buffer of random bytes when random xattr values are in effect. This eliminates the potential performance bottleneck of reading from /dev/urandom for each file. This also allows us to verify xattrs in random value mode.
  - Show the rate of operations per second in addition to elapsed time for each phase of the test. This may be useful for benchmarking.
  - Set default xattr size to 6 so that verify doesn't fail if user doesn't specify a size. We need at least six bytes to store the leading "size=X" string that is used for verification.
  - Allow user to execute just one phase of the test. Acceptable values for -o and their meanings are:
      1 - run the create phase
      2 - run the setxattr phase
      3 - run the getxattr phase
      4 - run the unlink phase

  Signed-off-by: Ned Bass <bass6@llnl.gov>
  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

* Backfill metadnode more intelligently

  Only attempt to backfill lower metadnode object numbers if at least 4096 objects have been freed since the last rescan, and at most once per transaction group. This avoids a pathology in dmu_object_alloc() that caused O(N^2) behavior for create-heavy workloads and substantially improves object creation rates. As summarized by @mahrens in #4636:

  "Normally, the object allocator simply checks to see if the next object is available. The slow calls happened when dmu_object_alloc() checks to see if it can backfill lower object numbers. This happens every time we move on to a new L1 indirect block (i.e. every 32 * 128 = 4096 objects). When re-checking lower object numbers, we use the on-disk fill count (blkptr_t:blk_fill) to quickly skip over indirect blocks that don’t have enough free dnodes (defined as an L2 with at least 393,216 of 524,288 dnodes free). Therefore, we may find that a block of dnodes has a low (or zero) fill count, and yet we can’t allocate any of its dnodes, because they've been allocated in memory but not yet written to disk. In this case we have to hold each of the dnodes and then notice that it has been allocated in memory. The end result is that allocating N objects in the same TXG can require CPU usage proportional to N^2."

  Add a tunable dmu_rescan_dnode_threshold to define the number of objects that must be freed before a rescan is performed. Don't bother to export this as a module option because testing doesn't show a compelling reason to change it. The vast majority of the performance gain comes from limiting the rescan to at most once per TXG.

  Signed-off-by: Ned Bass <bass6@llnl.gov>
  Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

* Implement large_dnode pool feature

  Justification
  -------------
  This feature adds support for variable length dnodes. Our motivation is to eliminate the overhead associated with using spill blocks. Spill blocks are used to store system attribute data (i.e. file metadata) that does not fit in the dnode's bonus buffer. By allowing a larger bonus buffer area the use of a spill block can be avoided. Spill blocks potentially incur an additional read I/O for every dnode in a dnode block. As a worst case example, reading 32 dnodes from a 16k dnode block and all of the spill blocks could issue 33 separate reads. Now suppose those dnodes have size 1024 and therefore don't need spill blocks. Then the worst case number of blocks read is reduced from 33 to two--one per dnode block. In practice spill blocks may tend to be co-located on disk with the dnode blocks so the reduction in I/O would not be this drastic. In a badly fragmented pool, however, the improvement could be significant.

  ZFS-on-Linux systems that make heavy use of extended attributes would benefit from this feature. In particular, ZFS-on-Linux supports the xattr=sa dataset property which allows file extended attribute data to be stored in the dnode bonus buffer as an alternative to the traditional directory-based format. Workloads such as SELinux and the Lustre distributed filesystem often store enough xattr data to force spill blocks when xattr=sa is in effect. Large dnodes may therefore provide a performance benefit to such systems. Other use cases that may benefit from this feature include files with large ACLs and symbolic links with long target names. Furthermore, this feature may be desirable on other platforms in case future applications or features are developed that could make use of a larger bonus buffer area.

  Implementation
  --------------
  The size of a dnode may be a multiple of 512 bytes up to the size of a dnode block (currently 16384 bytes). A dn_extra_slots field was added to the current on-disk dnode_phys_t structure to describe the size of the physical dnode on disk. The 8 bits for this field were taken from the zero filled dn_pad2 field. The field represents how many "extra" dnode_phys_t slots a dnode consumes in its dnode block. This convention results in a value of 0 for 512 byte dnodes which preserves on-disk format compatibility with older software.

  Similarly, the in-memory dnode_t structure has a new dn_num_slots field to represent the total number of dnode_phys_t slots consumed on disk. Thus dn->dn_num_slots is 1 greater than the corresponding dnp->dn_extra_slots. This difference in convention was adopted because, unlike on-disk structures, backward compatibility is not a concern for in-memory objects, so we used a more natural way to represent size for a dnode_t.

  The default size for newly created dnodes is determined by the value of a new "dnodesize" dataset property. By default the property is set to "legacy" which is compatible with older software. Setting the property to "auto" will allow the filesystem to choose the most suitable dnode size. Currently this just sets the default dnode size to 1k, but future code improvements could dynamically choose a size based on observed workload patterns. Dnodes of varying sizes can coexist within the same dataset and even within the same dnode block. For example, to enable automatically-sized dnodes, run

      # zfs set dnodesize=auto tank/fish

  The user can also specify literal values for the dnodesize property. These are currently limited to powers of two from 1k to 16k. The power-of-2 limitation is only for simplicity of the user interface. Internally the implementation can handle any multiple of 512 up to 16k, and consumers of the DMU API can specify any legal dnode value.

  The size of a new dnode is determined at object allocation time and stored as a new field in the znode in-memory structure. New DMU interfaces are added to allow the consumer to specify the dnode size that a newly allocated object should use. Existing interfaces are unchanged to avoid having to update every call site and to preserve compatibility with external consumers such as Lustre. The new interface names are given below. The versions of these functions that don't take a dnodesize parameter now just call the _dnsize() versions with a dnodesize of 0, which means use the legacy dnode size.

  New DMU interfaces:
    dmu_object_alloc_dnsize()
    dmu_object_claim_dnsize()
    dmu_object_reclaim_dnsize()

  New ZAP interfaces:
    zap_create_dnsize()
    zap_create_norm_dnsize()
    zap_create_flags_dnsize()
    zap_create_claim_norm_dnsize()
    zap_create_link_dnsize()

  The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The spa_maxdnodesize() function should be used to determine the maximum bonus length for a pool.

  These are a few noteworthy changes to key functions:

  * The prototype for dnode_hold_impl() now takes a "slots" parameter. When the DNODE_MUST_BE_FREE flag is set, this parameter is used to ensure the hole at the specified object offset is large enough to hold the dnode being created. The slots parameter is also used to ensure a dnode does not span multiple dnode blocks. In both of these cases, if a failure occurs, ENOSPC is returned. Keep in mind, these failure cases are only possible when using DNODE_MUST_BE_FREE.

    If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0. dnode_hold_impl() will check if the requested dnode is already consumed as an extra dnode slot by a large dnode, in which case it returns ENOENT.

  * The function dmu_object_alloc() advances to the next dnode block if dnode_hold_impl() returns an error for a requested object. This is because the beginning of the next dnode block is the only location it can safely assume to either be a hole or a valid starting point for a dnode.

  * dnode_next_offset_level() and other functions that iterate through dnode blocks may no longer use a simple array indexing scheme. These now use the current dnode's dn_num_slots field to advance to the next dnode in the block. This is to ensure we properly skip the current dnode's bonus area and don't interpret it as a valid dnode.

  zdb
  ---
  The zdb command was updated to display a dnode's size under the "dnsize" column when the object is dumped.

  For ZIL create log records, zdb will now display the slot count for the object.

  ztest
  -----
  Ztest chooses a random dnodesize for every newly created object. The random distribution is more heavily weighted toward small dnodes to better simulate real-world datasets.

  Unused bonus buffer space is filled with non-zero values computed from the object number, dataset id, offset, and generation number. This helps ensure that the dnode traversal code properly skips the interior regions of large dnodes, and that these interior regions are not overwritten by data belonging to other dnodes. A new test visits each object in a dataset. It verifies that the actual dnode size matches what was stored in the ztest block tag when it was created. It also verifies that the unused bonus buffer space is filled with the expected data patterns.

  ZFS Test Suite
  --------------
  Added six new large dnode-specific tests, and integrated the dnodesize property into existing tests for zfs allow and send/recv.

  Send/Receive
  ------------
  ZFS send streams for datasets containing large dnodes cannot be received on pools that don't support the large_dnode feature. A send stream with large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be unrecognized by an incompatible receiving pool so that the zfs receive will fail gracefully.

  While not implemented here, it may be possible to generate a backward-compatible send stream from a dataset containing large dnodes. The implementation may be tricky, however, because the send object record for a large dnode would need to be resized to a 512 byte dnode, possibly kicking in a spill block in the process. This means we would need to construct a new SA layout and possibly register it in the SA layout object. The SA layout is normally just sent as an ordinary object record. But if we are constructing new layouts while generating the send stream we'd have to build the SA layout object dynamically and send it at the end of the stream.

  For sending and receiving between pools that do support large dnodes, the drr_object send record type is extended with a new field to store the dnode slot count. This field was repurposed from unused padding in the structure.

  ZIL Replay
  ----------
  The dnode slot count is stored in the uppermost 8 bits of the lr_foid field. The bits were unused as the object id is currently capped at 48 bits.

  Resizing Dnodes
  ---------------
  It should be possible to resize a dnode when it is dirtied if the current dnodesize dataset property differs from the dnode's size, but this functionality is not currently implemented. Clearly a dnode can only grow if there are sufficient contiguous unused slots in the dnode block, but it should always be possible to shrink a dnode. Growing dnodes may be useful to reduce fragmentation in a pool with many spill blocks in use. Shrinking dnodes may be useful to allow sending a dataset to a pool that doesn't support the large_dnode feature.
Feature Reference Counting -------------------------- The reference count for the large_dnode pool feature tracks the number of datasets that have ever contained a dnode of size larger than 512 bytes. The first time a large dnode is created in a dataset the dataset is converted to an extensible dataset. This is a one-way operation and the only way to decrement the feature count is to destroy the dataset, even if the dataset no longer contains any large dnodes. The complexity of reference counting on a per-dnode basis was too high, so we chose to track it on a per-dataset basis similarly to the large_block feature. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3542 * Sync DMU_BACKUP_FEATURE_* flags Flag 20 was used in OpenZFS as DMU_BACKUP_FEATURE_RESUMING. The DMU_BACKUP_FEATURE_LARGE_DNODE flag must be shifted to 21 and then reserved in the upstream OpenZFS implementation. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #4795 * OpenZFS 2605, 6980, 6902 2605 want to resume interrupted zfs send Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed by: Xin Li <delphij@freebsd.org> Reviewed by: Arne Jansen <sensille@gmx.net> Approved by: Dan McDonald <danmcd@omniti.com> Ported-by: kernelOfTruth <kerneloftruth@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/2605 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9c3fd12 6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6980 OpenZFS-commit: 
https://github.com/openzfs/openzfs/commit/ea4a67f Porting notes: - All rsend and snapshop tests enabled and updated for Linux. - Fix misuse of input argument in traverse_visitbp(). - Fix ISO C90 warnings and errors. - Fix gcc 'missing braces around initializer' in 'struct send_thread_arg to_arg =' warning. - Replace 4 argument fletcher_4_native() with 3 argument version, this change was made in OpenZFS 4185 which has not been ported. - Part of the sections for 'zfs receive' and 'zfs send' was rewritten and reordered to approximate upstream. - Fix mktree xattr creation, 'user.' prefix required. - Minor fixes to newly enabled test cases - Long holds for volumes allowed during receive for minor registration. * OpenZFS 6051 - lzc_receive: allow the caller to read the begin record Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6051 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/620f322 * OpenZFS 6393 - zfs receive a full send as a clone Authored by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Dan McDonald <danmcd@omniti.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6394 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/68ecb2e * OpenZFS 6536 - zfs send: want a way to disable setting of DRR_FLAG_FREERECORDS Authored by: Andrew Stormont <astormont@racktopsystems.com> Reviewed by: Anil Vijarnia <avijarnia@racktopsystems.com> Reviewed by: Kim Shrier <kshrier@racktopsystems.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: 
https://www.illumos.org/issues/6536 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/880094b * OpenZFS 6738 - zfs send stream padding needs documentation Authored by: Eli Rosenthal <eli.rosenthal@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6738 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c20404ff * OpenZFS 4986 - receiving replication stream fails if any snapshot exceeds refquota Authored by: Dan McDonald <danmcd@omniti.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Gordon Ross <gordon.ross@nexenta.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/4986 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5878fad * OpenZFS 6562 - Refquota on receive doesn't account for overage Authored by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Toomas Soome <tsoome@me.com> Approved by: Gordon Ross <gwr@nexenta.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6562 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5f7a8e6 * Implement zfs_ioc_recv_new() for OpenZFS 2605 Adds ZFS_IOC_RECV_NEW for resumable streams and preserves the legacy ZFS_IOC_RECV user/kernel interface. The new interface supports all stream options but is currently only used for resumable streams. This way updated user space utilities will interoperate with older kernel modules. ZFS_IOC_RECV_NEW is modeled after the existing ZFS_IOC_SEND_NEW handler. 
Non-Linux OpenZFS platforms have opted to change the legacy interface in an incompatible fashion instead of adding a new ioctl. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> * OpenZFS 6314 - buffer overflow in dsl_dataset_name Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6314 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d6160ee * OpenZFS 6876 - Stack corruption after importing a pool with a too-long name Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Calling dsl_dataset_name on a dataset with a 256 byte buffer is asking for trouble. We should check every dataset on import, using a 1024 byte buffer and checking each time to see if the dataset's new name is longer than 256 bytes. OpenZFS-issue: https://www.illumos.org/issues/6876 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ca8674e * Vectorized fletcher_4 must be 128-bit aligned The fletcher_4_native() and fletcher_4_byteswap() functions may only safely use the vectorized implementations when the buffer is 128-bit aligned. This is because both the AVX2 and SSE implementations process four 32-bit words per iterations. Fallback to the scalar implementation which only processes a single 32-bit word for unaligned buffers. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Issue #4330 * Allow building with `CFLAGS="-O0"` If compiled with -O0, gcc doesn't do any stack frame coalescing and -Wframe-larger-than=1024 is triggered in debug mode. 
Starting with gcc 4.8, new opt level -Og is introduced for debugging, which does not trigger this warning. Fix bench zio size, using SPA_OLD_MAXBLOCKSHIFT Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4799 * Don't allow accessing XATTR via export handle Allow accessing XATTR through export handle is a very bad idea. It would allow user to write whatever they want in fields where they otherwise could not. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4828 * Fix get_zfs_sb race with concurrent umount Certain ioctl operations will call get_zfs_sb, which will holds an active count on sb without checking whether it's active or not. This will result in use-after-free. We fix this by using atomic_inc_not_zero to make sure we got an active sb. P1 P2 --- --- deactivate_locked_super(): s_active = 0 zfs_sb_hold() ->get_zfs_sb(): s_active = 1 ->zpl_kill_sb() -->zpl_put_super() --->zfs_umount() ---->zfs_sb_free(zsb) zfs_sb_rele(zsb) Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> * Fix Large kmem_alloc in vdev_metaslab_init This allocation can go way over 1MB, so we should use vmem_alloc instead of kmem_alloc. Large kmem_alloc(1430784, 0x1000), please file an issue... Call Trace: [<ffffffffa0324aff>] ? spl_kmem_zalloc+0xef/0x160 [spl] [<ffffffffa17d0c8d>] ? vdev_metaslab_init+0x9d/0x1f0 [zfs] [<ffffffffa17d46d0>] ? vdev_load+0xc0/0xd0 [zfs] [<ffffffffa17d4643>] ? vdev_load+0x33/0xd0 [zfs] [<ffffffffa17c0004>] ? spa_load+0xfc4/0x1b60 [zfs] [<ffffffffa17c1838>] ? spa_tryimport+0x98/0x430 [zfs] [<ffffffffa17f28b1>] ? zfs_ioc_pool_tryimport+0x41/0x80 [zfs] [<ffffffffa17f5669>] ? zfsdev_ioctl+0x4a9/0x4e0 [zfs] [<ffffffff811bacdf>] ? do_vfs_ioctl+0x2cf/0x4b0 [<ffffffff811baf41>] ? 
SyS_ioctl+0x81/0xa0 Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4752 * Add configure result for xattr_handler Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4828 * fh_to_dentry should return ESTALE when generation mismatch When generation mismatch, it usually means the file pointed by the file handle was deleted. We should return ESTALE to indicate this. We return ENOENT in zfs_vget since zpl_fh_to_dentry will convert it to ESTALE. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4828 * xattr dir doesn't get purged during iput We need to set inode->i_nlink to zero so iput will purge it. Without this, it will get purged during shrink cache or umount, which would likely result in deadlock due to zfs_zget waiting forever on its children which are in the dispose_list of the same thread. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Issue #4359 Issue #3508 Issue #4413 Issue #4827 * Kill zp->z_xattr_parent to prevent pinning zp->z_xattr_parent will pin the parent. This will cause huge issue when unlink a file with xattr. Because the unlinked file is pinned, it will never get purged immediately. And because of that, the xattr stuff will never be marked as unlinked. So the whole unlinked stuff will stay there until shrink cache or umount. This change partially reverts e89260a. This is safe because only the zp->z_xattr_parent optimization is removed, zpl_xattr_security_init() is still called from the zpl outside the inode lock. 
Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Issue #4359 Issue #3508 Issue #4413 Issue #4827 * Fix RAIDZ_TEST tests Remove stray trailing } which prevented the raidz stress tests from running in-tree. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> * Fix PANIC: metaslab_free_dva(): bad DVA X:Y:Z The following scenario can result in garbage in the dn_spill field. The db->db_blkptr must be set to NULL when DNODE_FLAG_SPILL_BLKPTR is clear to ensure the dn_spill field is cleared. Current txg = A. * A new spill buffer is created. Its dbuf is initialized with db_blkptr = NULL and it's dirtied. Current txg = B. * The spill buffer is modified. It's marked as dirty in this txg. * Additional changes make the spill buffer unnecessary because the xattr fits into the bonus buffer, so it's removed. The dbuf is undirtied in this txg, but it's still referenced and cannot be destroyed. Current txg = C. * Starts syncing of txg A * dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr is NULL, dbuf_check_blkptr() is called. * The dbuf starts being written and it reaches the ready state (not done yet). * A new change makes the spill buffer necessary again. sa_build_layouts() ends up calling dbuf_find() to locate the dbuf. It finds the old dbuf because it has not been destroyed yet (it will be destroyed when the previous write is done and there are no more references). The old dbuf has db_blkptr != NULL. * txg A write is complete and the dbuf released. However it's still referenced, so it's not destroyed. Current txg = D. * Starts syncing of txg B * dbuf_sync_leaf() is called for the bonus buffer. Its contents are directly copied into the dnode, overwriting the blkptr area because, in txg B, the bonus buffer was big enough to hold the entire xattr. * At this point, the db_blkptr of the spill buffer used in txg C gets corrupted. 
Signed-off-by: Peng <peng.hse@xtaotech.com> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3937 * Fix handling of errors nvlist in zfs_ioc_recv_new() zfs_ioc_recv_impl() is changed to always allocate the 'errors' nvlist, its callers are responsible for freeing it. Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4829 * Add RAID-Z routines for SSE2 instruction set, in x86_64 mode. The patch covers low-end and older x86 CPUs. Parity generation is equivalent to SSSE3 implementation, but reconstruction is somewhat slower. Previous 'sse' implementation is renamed to 'ssse3' to indicate highest instruction set used. Benchmark results: scalar_rec_p 4 720476442 scalar_rec_q 4 187462804 scalar_rec_r 4 138996096 scalar_rec_pq 4 140834951 scalar_rec_pr 4 129332035 scalar_rec_qr 4 81619194 scalar_rec_pqr 4 53376668 sse2_rec_p 4 2427757064 sse2_rec_q 4 747120861 sse2_rec_r 4 499871637 sse2_rec_pq 4 522403710 sse2_rec_pr 4 464632780 sse2_rec_qr 4 319124434 sse2_rec_pqr 4 205794190 ssse3_rec_p 4 2519939444 ssse3_rec_q 4 1003019289 ssse3_rec_r 4 616428767 ssse3_rec_pq 4 706326396 ssse3_rec_pr 4 570493618 ssse3_rec_qr 4 400185250 ssse3_rec_pqr 4 377541245 original_rec_p 4 691658568 original_rec_q 4 195510948 original_rec_r 4 26075538 original_rec_pq 4 103087368 original_rec_pr 4 15767058 original_rec_qr 4 15513175 original_rec_pqr 4 10746357 Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4783 * Enable zpool_upgrade test cases Creating the pool in a striped rather than mirrored configuration provides enough space for all upgrade tests to run. Test case zpool_upgrade_007_pos still fails and must be investigated so it has been left disabled. 
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4852 * Prevent null dereferences when accessing dbuf kstat In arc_buf_info(), the arc_buf_t may have no header. If not, don't try to fetch the arc buffer stats and instead just zero them. The null dereferences were observed while accessing the dbuf kstat with awk on a system in which millions of small files were being created in order to overflow the system's metadata limit. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4837 * Fix dbuf_stats_hash_table_data race Dropping DBUF_HASH_MUTEX when walking the hash list is unsafe. The dbuf can be freed at any time. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4846 * Use native inode->i_nlink instead of znode->z_links A mostly mechanical change, taking into account i_nlink is 32 bits vs ZFS's 64 bit on-disk link count. We revert "xattr dir doesn't get purged during iput" (ddae16a) as this is a more Linux-integrated fix for the same issue. In addition, setting the initial link count on a new node has been changed from setting one less than required in zfs_mknode() then incrementing to the correct count in zfs_link_create() (which was somewhat bizarre in the first place), to setting the correct count in zfs_mknode() and not incrementing it in zfs_link_create(). This both means we no longer set the link count in sa_bulk_update() twice (once for the initial incorrect count then again for the correct count), as well as adhering to the Linux requirement of not incrementing a zero link count without I_LINKABLE (see linux commit f4e0c30c). 
Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4838 Issue #227 * Implementation of SSE optimized Fletcher-4 Builds off of 1eeb4562 (Implementation of AVX2 optimized Fletcher-4) This commit adds another implementation of the Fletcher-4 algorithm. It is automatically selected at module load if it benchmarks higher than all other available implementations. The module benchmark was also amended to analyze the performance of the byteswapped version of Fletcher-4, as well as the non-byteswapped version. The average performance of the two is used to select the fastest implementation available on the host system. Adds a pair of fields to an existing zcommon module parameter: - zfs_fletcher_4_impl (str) "sse2" - new SSE2 implementation if available "ssse3" - new SSSE3 implementation if available Signed-off-by: Tyler J. Stachecki <stachecki.tyler@gmail.com> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4789 * Fix filesystem destroy with receive_resume_token It is possible that the given DS may have hidden child (%recv) datasets - "leftovers" resulting from the previously interrupted 'zfs receive'. Try to remove the hidden child (%recv) and after that try to remove the target dataset. If the hidden child (%recv) does not exist the original error (EEXIST) will be returned. Signed-off-by: Roman Strashkin <roman.strashkin@nexenta.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4818 * Prevent segfaults in SSE optimized Fletcher-4 In some cases, the compiler was not respecting the GNU aligned attribute for stack variables in 35a76a0. This was resulting in a segfault on CentOS 6.7 hosts using gcc 4.4.7-17. This issue was fixed in gcc 4.6. 
To prevent this from occurring, use unaligned loads and stores for all stack and global memory references in the SSE optimized Fletcher-4 code. Disable zimport testing against master where this flaw exists: TEST_ZIMPORT_VERSIONS="installed" Signed-off-by: Tyler J. Stachecki <stachecki.tyler@gmail.com> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4862 * Update arc_summary.py for prefetch changes Commit 7f60329 removed several kstats which arc_summary.py read. Remove these kstats from arc_summary.py in the same way this was handled in FreeNAS. FreeNAS-commit: https://github.com/freenas/freenas/commit/3901f73 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4695 * Wait iput_async before evict_inodes to prevent race Wait for iput_async before entering evict_inodes in generic_shutdown_super. The reason we must finish before evict_inodes is when lazytime is on, or when zfs_purgedir calls zfs_zget, iput would bump i_count from 0 to 1. This would race with the i_count check in evict_inodes. This means it could destroy the inode while we are still using it. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4854 * Fixes and enhancements of SIMD raidz parity - Implementation lock replaced with atomic variable - Trailing whitespace is removed from user specified parameter, to enhance experience when using commands that add newline, e.g. `echo` - raidz_test: remove dependency on `getrusage()` and RUSAGE_THREAD, Issue #4813 - silence `cppcheck` in vdev_raidz, partial solution of Issue #1392 - Minor fixes and cleanups - Enable use of original parity methods in [fastest] configuration. New opaque original ops structure, representing native methods, is added to supported raidz methods. Original parity methods are executed if selected implementation has NULL fn pointer. 
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4813 Issue #1392 * RAIDZ parity kstat rework Print table with speed of methods for each implementation. Last line describes contents of [fastest] selection. Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4860 * Fix NULL pointer in zfs_preumount from 1d9b3bd When zfs_domount fails zsb will be freed, and its caller mount_nodev/get_sb_nodev will do deactivate_locked_super and calls into zfs_preumount. In order to make sure we don't touch any nonexistent stuff, we must make sure s_fs_info is NULL in the fail path so zfs_preumount can easily check that. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4867 Issue #4854 * Illumos Crypto Port module added to enable native encryption in zfs A port of the Illumos Crypto Framework to a Linux kernel module (found in module/icp). This is needed to do the actual encryption work. We cannot use the Linux kernel's built in crypto api because it is only exported to GPL-licensed modules. Having the ICP also means the crypto code can run on any of the other kernels under OpenZFS. I ended up porting over most of the internals of the framework, which means that porting over other API calls (if we need them) should be fairly easy. Specifically, I have ported over the API functions related to encryption, digests, macs, and crypto templates. The ICP is able to use assembly-accelerated encryption on amd64 machines and AES-NI instructions on Intel chips that support it. There are place-holder directories for similar assembly optimizations for other architectures (although they have not been written). 
Signed-off-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4329 * Fix for compilation error when using the kernel's CONFIG_LOCKDEP Signed-off-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4329 * zloop: print backtrace from core files Find the core file by using `/proc/sys/kernel/core_pattern` Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4874 * Fix for metaslab_fastwrite_unmark() assert failure Currently there is an issue where metaslab_fastwrite_unmark() unmarks fastwrites on vdev_t's that have never had fastwrites marked on them. The 'fastwrite mark' is essentially a count of outstanding bytes that will be written to a vdev and is used in syncing context. The problem stems from the fact that the vdev_pending_fastwrite field is not being transferred over when replacing a top-level vdev. As a result, the metaslab is marked for fastwrite on the old vdev and unmarked on the new one, which brings the fastwrite count below zero. This fix simply assigns vdev_pending_fastwrite from the old vdev to the new one so this count is not lost. Signed-off-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4267 * Remove znode's z_uid/z_gid member Remove duplicate z_uid/z_gid member which are also held in the generic vfs inode struct. This is done by first removing the members from struct znode and then using the KUID_TO_SUID/KGID_TO_SGID macros to access the respective member from struct inode. In cases where the uid/gids are being marshalled from/to disk, use the newly introduced zfs_(uid|gid)_(read|write) functions to properly save the uids rather than the internal kernel representation. 
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4685 Issue #227 * Check whether the kernel supports i_uid/gid_read/write helpers Since the concept of a kuid and the need to translate from it to ordinary integer type was added in kernel version 3.5 implement necessary plumbing to be able to detect this condition during compile time. If the kernel doesn't support the kuid then just fall back to directly accessing the respective struct inode's members Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4685 Issue #227 * Fix uninitialized variable in avl_add() Silence the following warning when compiling with gcc 5.4.0. Specifically gcc (Ubuntu 5.4.0-6ubuntu1~16.04.1) 5.4.0 20160609. module/avl/avl.c: In function ‘avl_add’: module/avl/avl.c:647:2: warning: ‘where’ may be used uninitialized in this function [-Wmaybe-uninitialized] avl_insert(tree, new_node, where); Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> * Fix sync behavior for disk vdevs Prior to b39c22b, which was first generally available in the 0.6.5 release as b39c22b, ZoL never actually submitted synchronous read or write requests to the Linux block layer. This means the vdev_disk_dio_is_sync() function had always returned false and, therefore, the completion in dio_request_t.dr_comp was never actually used. In b39c22b, synchronous ZIO operations were translated to synchronous BIO requests in vdev_disk_io_start(). The follow-on commits 5592404 and aa159af fixed several problems introduced by b39c22b. In particular, 5592404 introduced the new flag parameter "wait" to __vdev_disk_physio() but under ZoL, since vdev_disk_physio() is never actually used, the wait flag was always zero so the new code had no effect other than to cause a bug in the use of the dio_request_t.dr_comp which was fixed by aa159af. 
The original rationale for introducing synchronous operations in b39c22b was to hurry certain requests through the BIO layer which would have otherwise been subject to its unplug timer which would increase the latency. This behavior of the unplug timer, however, went away during the transition of the plug/unplug system between kernels 2.6.32 and 2.6.39. To handle the unplug timer behavior on 2.6.32-2.6.35 kernels the BIO_RW_UNPLUG flag is used as a hint to suppress the plugging behavior. For kernels 2.6.36-2.6.38, the REQ_UNPLUG macro will be available and is used for the same purpose. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4858 * Limit the amount of dnode metadata in the ARC Metadata-intensive workloads can cause the ARC to become permanently filled with dnode_t objects as they're pinned by the VFS layer. Subsequent data-intensive workloads may only benefit from about 25% of the potential ARC (arc_c_max - arc_meta_limit). In order to help track metadata usage more precisely, the other_size metadata arcstat has been replaced with dbuf_size, dnode_size and bonus_size. The new zfs_arc_dnode_limit tunable, which defaults to 10% of zfs_arc_meta_limit, defines the minimum number of bytes which is desirable to be consumed by dnodes. Attempts to evict non-metadata will trigger async prune tasks if the space used by dnodes exceeds this limit. The new zfs_arc_dnode_reduce_percent tunable specifies the amount by which the excess dnode space is attempted to be pruned as a percentage of the amount by which zfs_arc_dnode_limit is being exceeded. By default, it tries to unpin 10% of the dnodes. The problem of dnode metadata pinning was observed with the following testing procedure (in this example, zfs_arc_max is set to 4GiB): - Create a large number of small files until arc_meta_used exceeds arc_meta_limit (3GiB with default tuning) and arc_prune starts increasing. - Create a 3GiB file with dd. 
Observe arc_meta_used. It will still be around 3GiB. - Repeatedly read the 3GiB file and observe arc_meta_limit as before. It will continue to stay around 3GiB. With this modification, space for the 3GiB file is gradually made available as subsequent demands on th…

Skip ctldir in zfs_rezget, otherwise they will always get invalidated. This causes odd behaviour for mounted snapdirs. In particular, on Linux >= 3.18, d_invalidate will detach the mountpoint and prevent anyone from automounting it again as long as someone is still using the detached mount.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#4514
Closes openzfs#4661
Closes openzfs#4672

pivot69 · 2016-12-08T17:50:04Z

I'm using ZFS on Ubuntu Server 16.04.1 and had the same "Too many levels of symbolic links" error when accessing snapshots. The error appeared after sending incremental snapshots from another Ubuntu server (running Ubuntu Server 14.04).

After updating the affected server and trying everything I could think of (atime off, compression off, mountpoints, etc.), it still did not work. I rebooted and suddenly everything worked again, until I transferred new incremental snapshots.

This led me to try unmounting and remounting the filesystem each time I transferred snapshots, and that seemed to do the trick! Now I just put the remount commands into my script, and I am no longer bothered by this bug.

This is not a fix, only a workaround. But if someone cannot get it working even with the newest versions of everything, try this! :)
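
The remount step described above could be scripted roughly as follows. This is only a sketch under stated assumptions: the function names `run` and `remount_dataset`, the `DRY_RUN` variable, and the dataset name in the usage comment are made up for illustration; only `zfs unmount` and `zfs mount` are real commands. Unmounting fails if the dataset is busy, so this is best run while nothing is using the filesystem.

```shell
#!/bin/sh
# Sketch of the unmount/remount workaround (all names here are hypothetical).
# Set DRY_RUN=1 to print the commands instead of executing them.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Unmount the given dataset, then mount it again.
# The mount is attempted only if the unmount succeeded.
remount_dataset() {
    dataset=$1
    run zfs unmount "$dataset" && run zfs mount "$dataset"
}

# Example (hypothetical dataset name), to be called after 'zfs receive' finishes:
#   remount_dataset backup/example
```

Trying it with `DRY_RUN=1` first shows which commands would run without touching any mounts.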

Just some additional info:
The ubuntu server sending snapshots (14.04) has the ubuntu-zfs package installed.
[ 1.570547] ZFS: Loaded module v0.6.5.7-1~trusty, ZFS pool version 5000, ZFS filesystem version 5
The ubuntu server receiving snapshots (16.04.1) has zfs native
[ 17.440504] ZFS: Loaded module v0.6.5.6-0ubuntu10, ZFS pool version 5000, ZFS filesystem version 5

zffocussss · 2019-05-17T08:51:28Z

I get the same error message when using autofs with a Docker volume mount.

jgoerzen · 2020-10-10T16:15:47Z

I encountered this when attempting to access a snapshot under /usr/.zfs/snapshots on my system. On this system, I have filesystems like this:

tank/hephaestus-1                    20.2G   135G       96K  /tank/hephaestus-1
tank/hephaestus-1/ROOT               15.9G  8.78G     1.22G  /
tank/hephaestus-1/ROOT/opt           1.51G   135G      823M  /opt
tank/hephaestus-1/ROOT/usr           12.4G  9.47G     10.5G  /usr
tank/hephaestus-1/var                4.35G  6.96G     3.04G  legacy

There are no bind mounts involved.

The relevant snapshots would have been created with zfs snapshot -r tank@foo. I observed df showing directories under /.zfs/snapshot -- showing snapshots of tank/hephaestus-1/ROOT. It did not make the /usr snapshot available in any way.

helamonster · 2021-11-23T19:18:05Z

I have just encountered this for the first time myself.
My environment:

root@myserver ~ # lsb_release  -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:        20.04
Codename:       focal

root@myserver ~ # uname -a
Linux myserver.mydomain.com 5.4.0-66-generic #74-Ubuntu SMP Wed Jan 27 22:54:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

root@myserver ~ # zfs version
zfs-2.0.3-0york0~20.04
zfs-kmod-2.0.3-0york0~20.04

Observe:

root@myserver ~ # pwd
/root

root@myserver ~ # ls -l  /var/.zfs/snapshot/zrepl_20211116_231549_000/
ls: cannot access '/var/.zfs/snapshot/zrepl_20211116_231549_000/': Too many levels of symbolic links

root@myserver ~ # cd   /var/.zfs/snapshot/zrepl_20211116_231549_000/
-bash: cd: /var/.zfs/snapshot/zrepl_20211116_231549_000/: Too many levels of symbolic links

root@myserver ~ # cd /var/.zfs

root@myserver .zfs # cd snapshot/

root@myserver snapshot # cd zrepl_20211116_231549_000
-bash: cd: zrepl_20211116_231549_000: Too many levels of symbolic links

root@myserver snapshot # mount.zfs rpool/ROOT/ubuntu/var@zrepl_20211116_231549_000 /var/.zfs/snapshot/zrepl_20211116_231549_000/  
filesystem 'rpool/ROOT/ubuntu/var@zrepl_20211116_231549_000' is already mounted

Interestingly, I found that the snapshot is indeed mounted but at a different level in the filesystem (/.zfs/snapshot instead of /var/.zfs/snapshot). Strange...

root@myserver snapshot # mount.zfs rpool/ROOT/ubuntu/var@zrepl_20211116_231549_000 /mnt/var
filesystem 'rpool/ROOT/ubuntu/var@zrepl_20211116_231549_000' is already mounted

root@myserver snapshot # mount | grep zrepl
rpool/ROOT/ubuntu/var@zrepl_20211116_231549_000 on /.zfs/snapshot/zrepl_20211116_231549_000 type zfs (ro,relatime,xattr,posixacl)

root@myserver snapshot # ls /.zfs/snapshot/zrepl_20211116_231549_000
backups  cache  empty  lib  local  lock  log  mail  opt  run  snap  spool  tmp

root@myserver snapshot # zfs list -r rpool -o name,used,avail,refer,canmount,mounted,mountpoint
NAME                    USED  AVAIL     REFER  CANMOUNT  MOUNTED  MOUNTPOINT
rpool                  30.9G   182G       96K  off       no       none
rpool/ROOT             30.8G   182G       96K  off       no       none
rpool/ROOT/ubuntu      30.8G   182G     2.88G  on        yes      /
rpool/ROOT/ubuntu/tmp  48.3M   182G      440K  on        yes      /tmp
rpool/ROOT/ubuntu/var  27.9G   182G     11.7G  on        yes      /var
rpool/temp               96K   182G       96K  on        yes      /temp

I can confirm that trying to access the correct .zfs/snapshot directory causes the snapshot to be mounted at the parent directory's .zfs/snapshot directory instead. Odd.

root@myserver ~ # mount | grep 'rpool.*zfs'
rpool/ROOT/ubuntu on / type zfs (rw,relatime,xattr,posixacl)
rpool/ROOT/ubuntu/tmp on /tmp type zfs (rw,relatime,xattr,posixacl)
rpool/ROOT/ubuntu/var on /var type zfs (rw,relatime,xattr,posixacl)
rpool/temp on /temp type zfs (rw,noatime,xattr,posixacl)
rpool/ROOT/ubuntu/var@zrepl_20211116_231549_000 on /.zfs/snapshot/zrepl_20211116_231549_000 type zfs (ro,relatime,xattr,posixacl)

root@myserver ~ # umount /.zfs/snapshot/zrepl_20211116_231549_000

root@myserver ~ # mount | grep 'rpool.*zfs'
rpool/ROOT/ubuntu on / type zfs (rw,relatime,xattr,posixacl)
rpool/ROOT/ubuntu/tmp on /tmp type zfs (rw,relatime,xattr,posixacl)
rpool/ROOT/ubuntu/var on /var type zfs (rw,relatime,xattr,posixacl)
rpool/temp on /temp type zfs (rw,noatime,xattr,posixacl)

root@myserver ~ # ls /var/.zfs/snapshot/zrepl_20211116_231549_000
ls: cannot open directory '/var/.zfs/snapshot/zrepl_20211116_231549_000': Too many levels of symbolic links

root@myserver ~ # mount | grep 'rpool.*zfs'
rpool/ROOT/ubuntu on / type zfs (rw,relatime,xattr,posixacl)
rpool/ROOT/ubuntu/tmp on /tmp type zfs (rw,relatime,xattr,posixacl)
rpool/ROOT/ubuntu/var on /var type zfs (rw,relatime,xattr,posixacl)
rpool/temp on /temp type zfs (rw,noatime,xattr,posixacl)
rpool/ROOT/ubuntu/var@zrepl_20211116_231549_000 on /.zfs/snapshot/zrepl_20211116_231549_000 type zfs (ro,relatime,xattr,posixacl)

Nothing is written to dmesg, syslog, or /proc/spl/kstat/zfs/dbgmsg as a result of executing these failing commands.

I can access files under the snapshot where it was mounted (/.zfs/snapshot/zrepl_20211116_231549_000) just fine.
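
The misplaced mount shown above can be detected mechanically. The following is a minimal sketch, assuming input in /proc/mounts format (source, mount point, type, options, two numeric fields), no whitespace in mount points, and each dataset mounted only once. The function name misplaced_snapshot_mounts is hypothetical. It compares every mounted snapshot against the path one would expect from its filesystem's mount point.

```shell
# Hypothetical diagnostic: report ZFS snapshots that are mounted somewhere
# other than <filesystem mountpoint>/.zfs/snapshot/<snapshot name>.
# Reads a file in /proc/mounts format (default: /proc/mounts).
misplaced_snapshot_mounts() {
    awk '
        # Filesystems have no "@" in the source name; remember their mount points.
        $3 == "zfs" && index($1, "@") == 0 { mp[$1] = $2 }
        # Snapshots are named dataset@snapname; remember where they are mounted.
        $3 == "zfs" && index($1, "@") > 0  { snap[$1] = $2 }
        END {
            for (s in snap) {
                split(s, part, "@")
                if (!(part[1] in mp)) continue      # parent filesystem not mounted
                base = (mp[part[1]] == "/") ? "" : mp[part[1]]
                expected = base "/.zfs/snapshot/" part[2]
                if (snap[s] != expected)
                    printf "%s mounted at %s, expected %s\n", s, snap[s], expected
            }
        }
    ' "${1:-/proc/mounts}"
}
```

Run without arguments it inspects the live mount table; an empty result means every mounted snapshot sits under its own filesystem's .zfs/snapshot directory.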

parke · 2022-05-23T00:29:11Z

Fyi, it appears this issue may be related to #9958.

rptb1 · 2023-01-04T12:07:26Z

Just another "me too", since this issue is marked closed but is still happening.

$ uname -a
Linux plover 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.1 LTS
Release:	22.04
Codename:	jammy
$ dpkg-query -l 'zfs*' | grep ii
ii  zfs-initramfs  2.1.4-0ubuntu0.1 amd64        OpenZFS root filesystem capabilities for Linux - initramfs
ii  zfs-zed        2.1.4-0ubuntu0.1 amd64        OpenZFS Event Daemon
ii  zfsutils-linux 2.1.4-0ubuntu0.1 amd64        command-line tools to manage OpenZFS filesystems
$ 
$ ls -ld /var/lib/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19
ls: cannot access '/var/lib/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': Too many levels of symbolic links

The system is set up according to https://openzfs.github.io/openzfs-docs/Getting%20Started/Ubuntu/Ubuntu%2022.04%20Root%20on%20ZFS.html

It's not consistent for all filesystems:

$ zfs list -H -o mountpoint | grep '^/' | sort -u | while read m; do sudo ls -ld "$m/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19"; done
drwx------ 2 root root 30 Nov 13 09:48 //.zfs/snapshot/backup-pelican-2022-11-13-09-53-19
ls: cannot access '/boot/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': No such file or directory
drwxr-xr-x 28 rb rb 46 Nov 12 19:55 /home/rb/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19
ls: cannot access '/home/rb/.cache/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': No such file or directory
ls: cannot access '/home/rb/private/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': No such file or directory
ls: cannot access '/home/rb/snap/firefox/common/.cache/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': No such file or directory
ls: cannot access '/home/rb/tmp/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': No such file or directory
drwx------ 9 root root 16 Nov  8 14:43 /root/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19
ls: cannot access '/tmp/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': No such file or directory
ls: cannot access '/usr/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': No such file or directory
ls: cannot access '/var/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': No such file or directory
ls: cannot access '/var/cache/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': No such file or directory
ls: cannot access '/var/lib/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': Too many levels of symbolic links
ls: cannot access '/var/lib/AccountsService/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': Too many levels of symbolic links
ls: cannot access '/var/lib/docker/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': Too many levels of symbolic links
ls: cannot access '/var/lib/NetworkManager/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': Too many levels of symbolic links
ls: cannot access '/var/log/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': Too many levels of symbolic links
ls: cannot access '/var/snap/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': Too many levels of symbolic links
ls: cannot access '/var/spool/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': Too many levels of symbolic links
ls: cannot access '/var/tmp/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': No such file or directory

I noticed a large number of mounts of snapshots:

$ mount | grep 'type zfs' | grep '@' | wc -l
76

Perhaps there is some resource limit that's being exceeded?

I frequently browse lists of snapshots in Emacs dired, which will be doing at least ls -l on .zfs/snapshot. I wonder if that triggers a lot of mounts at once?

EDIT: Later on, I'm getting messages from zfs-auto-snapshot like this:

cannot destroy snapshot kiwi-rpool/ROOT/ubuntu_8lgjho/var/lib@zfs-auto-snap_frequent-2023-01-04-1100: dataset is busy
cannot destroy snapshot kiwi-rpool/ROOT/ubuntu_8lgjho/var/lib@zfs-auto-snap_frequent-2023-01-04-1045: dataset is busy

Notice that these are in the same filesystem as the original problem. Does this also suggest a problem with automatic mount/unmount?

Can someone point me at whatever code/daemon/agent is responsible for this mounting and unmounting of snapshots? Perhaps I can figure it out.

rptb1 · 2023-01-04T16:24:30Z

Can someone point me at whatever code/daemon/agent is responsible for this mounting and unmounting of snapshots? Perhaps I can figure it out.

Digging deeper, I have bogus entries like this in /etc/mtab:

kiwi-rpool/ROOT/ubuntu_8lgjho/var/lib@backup-pelican-2022-11-13-09-53-19 /.zfs/snapshot/backup-pelican-2022-11-13-09-53-19 zfs ro,relatime,xattr,posixacl 0 0

Notice that this is a snapshot for filesystem kiwi-rpool/ROOT/ubuntu_8lgjho/var/lib but look at where it's mounted -- it's mounted at /.zfs in the wrong place. (Something like this was mentioned by #4514 (comment).)

And if I unmount it from there, then the problem is fixed:

rb@plover:/var/lib/.zfs/snapshot$ sudo umount /.zfs/snapshot/backup-pelican-2022-12-15-07-36-11
rb@plover:/var/lib/.zfs/snapshot$ ls -ld /.zfs/snapshot/backup-pelican-2023-01-03-08-08-09
drwxr-xr-x 79 root root 81 Dec 11 08:28 /.zfs/snapshot/backup-pelican-2023-01-03-08-08-09
rb@plover:/var/lib/.zfs/snapshot$

But even more weirdness, I can get the wrong directory contents to appear at /.zfs/snapshot:

$ ls -l /.zfs/snapshot/backup-pelican-2022-12-15-07-36-11
[shows correctly the things I expect to see in / ]
$ mount | grep backup-pelican-2023-01-03-08-08-09
kiwi-rpool/ROOT/ubuntu_8lgjho/var/lib@backup-pelican-2023-01-03-08-08-09 on /.zfs/snapshot/backup-pelican-2023-01-03-08-08-09 type zfs (ro,relatime,xattr,posixacl)
$ ls -l /.zfs/snapshot/backup-pelican-2022-12-15-07-36-11
[shows incorrectly the contents of /var/lib]

So it seems to me that there is something very broken about the automounting of snapshots. They're mounting in /.zfs and ignoring their filesystem mountpoint, giving incorrect contents, and probably causing the symbolic link problem.

EDIT: I can get correct and incorrect contents in the same location like this:

rb@plover:/.zfs/snapshot$ ls backup-pelican-2023-01-03-08-08-09
AccountsService      flatpak         [and more contents of /var/lib]
rb@plover:/.zfs/snapshot$ sudo umount backup-pelican-2023-01-03-08-08-09
rb@plover:/.zfs/snapshot$ ls backup-pelican-2023-01-03-08-08-09
bin   etc  [and more contents of /]

Does this warrant another issue? Showing the wrong contents at the /.zfs/snapshot path seems like it might be.

rptb1 · 2023-01-04T16:38:34Z

Possibly relevant

zfs/module/os/linux/zfs/zfs_ctldir.c

Lines 60 to 62 in c935fe2

 * All mounts are handled automatically by an user mode helper which invokes
 * the mount procedure.  Unmounts are handled by allowing the mount
 * point to expire so the kernel may automatically unmount it.

and I note that this is different to the FreeBSD module.

The call to the user mode helper (mount.zfs) is constructed here:

zfs/module/os/linux/zfs/zfs_ctldir.c

Lines 1095 to 1102 in c935fe2

    /*
     * Construct a mount point path from sb of the ctldir inode and dirent
     * name, instead of from d_path(), so that chroot'd process doesn't fail
     * on mount.zfs(8).
     */
    snprintf(full_path, MAXPATHLEN, "%s/.zfs/snapshot/%s",
        zfsvfs->z_vfs->vfs_mntpoint ? zfsvfs->z_vfs->vfs_mntpoint : "",
        dname(dentry));

Note that this gives up and uses "" in the conditional, which might cause everything to be mounted at /.zfs. The gun is smoking. I am suspicious!
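
The effect of the empty-string fallback on the resulting path can be reproduced outside the kernel. This is a minimal sketch using shell printf with the same format string as the snprintf call quoted above. The function name snapshot_mount_path is hypothetical, and the sketch only illustrates the string formatting; it does not show what the module actually stores in vfs_mntpoint.

```shell
# Hypothetical illustration of the "%s/.zfs/snapshot/%s" formatting.
# $1: recorded mount point (may be empty, like the "" fallback)
# $2: snapshot name
snapshot_mount_path() {
    printf '%s/.zfs/snapshot/%s\n' "${1-}" "$2"
}

snapshot_mount_path /var/lib backup-a   # prints /var/lib/.zfs/snapshot/backup-a
snapshot_mount_path ""       backup-a   # prints /.zfs/snapshot/backup-a
```

If the recorded mount point is empty or wrong, the formatted path lands under a different filesystem's .zfs directory, which would match the misplaced mounts seen in /etc/mtab.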

rptb1 · 2023-01-04T18:20:30Z

Note that this gives up and uses "" in the conditional, which might cause everything to be mounted at /.zfs. The gun is smoking. I am suspicious!

I have tried to confirm this is happening by enabling debug logging in the hope that I would see a message from

zfs/module/os/linux/zfs/zfs_ctldir.c

Line 1125 in c935fe2

dprintf("mount; name=%s path=%s\n", full_name, full_path);

but I suspect my zfs module is not built with ZFS_DEBUG and so this is disabled.

In case someone with a debug build can do it, here is what I tried:

cd /var/lib/.zfs/snapshot
umount * /.zfs/snapshot/*
# clear debug log
echo 0 > /proc/spl/kstat/zfs/dbgmsg
# enable debug messages
echo 1 >> /sys/module/zfs/parameters/zfs_flags
# provoke "too many symbolic links"
ls backup-pelican-2023-01-03-08-08-09
# stop debug messages
echo 0 >> /sys/module/zfs/parameters/zfs_flags
# examine debug log
cat /proc/spl/kstat/zfs/dbgmsg

I see messages from zfs_dbgmsg calls but not dprintf.

nachtgeist · 2023-01-13T02:02:12Z

So it seems to me that there is something very broken about the automounting of snapshots. They're mounting in /.zfs and ignoring their filesystem mountpoint, giving incorrect contents, and probably causing the symbolic link problem.

Yup. Got another case:

Debian Bullseye with ZFS 2.1.7, systemd
rootfs on ZFS
no bind mounts
zfs get mountpoint -r tank -t filesystem gives:

NAME                        PROPERTY    VALUE               SOURCE
tank                        mountpoint  /                   local
tank/home                   mountpoint  /home               inherited from tank
tank/opt                    mountpoint  /opt                inherited from tank
tank/rootfs                 mountpoint  /                   local
tank/root                   mountpoint  /root               inherited from tank
tank/srv                    mountpoint  /srv                inherited from tank

Listing and auto-mounting of snapshots via $MNTPOINT/.zfs/snapshot/* works everywhere EXCEPT in /.
When I realized the connection explained below, there were snapshots with identical names present for different datasets - date-time-strings, actually - which had been created by sanoid via cron. So let's say there are identically named snapshots of tank/rootfs and tank/root like so:

tank/rootfs@a
tank/rootfs@b

tank/root@a
tank/root@b

This leads to:

Accessing /root/.zfs/snapshot/* works, i.e. ls, cd, cat, cp...
let's do an umount /root/.zfs/snapshot/* (It'll become clear why in a second)
ls /.zfs/snapshot/ gives the dreaded "too many levels of symbolic links"...but oh look!
cd /root/.zfs/snapshot/a; ls now displays the contents of snapshot "a" of tank/rootfs@a, NOT tank/root@a!

This got me thinking and I tinkered with the init script from initramfs-tools and just applied this patch:

diff -Naur /usr/share/initramfs-tools/init.orig /usr/share/initramfs-tools/init
--- /usr/share/initramfs-tools/init.orig        2023-01-13 02:30:14.717469685 +0100
+++ /usr/share/initramfs-tools/init     2023-01-13 02:30:09.565173985 +0100
@@ -57,7 +57,7 @@
 export break=
 export init=/sbin/init
 export readonly=y
-export rootmnt=/root
+export rootmnt=/rfs
 mkdir "$rootmnt"
 export debug=
 export panic=

The only thing this changes is the name of the mount point where the rootfs-on-zfs gets initially mounted during boot by the initramfs. Re-build the initramfs and reboot...

Now, ls /.zfs/snapshot nicely lists a and b, ls /.zfs/snapshot/* yields empty directories, however. And nothing got mounted.

Let's mkdir -p /rfs/.zfs/snapshot and ls /.zfs/snapshot/a.

No error printed, again empty directories and nothing got mounted.

Now let's mkdir /rfs/.zfs/snapshot/a and ls /.zfs/snapshot/a again yields "ls: cannot access '/.zfs/snapshot/b': Too many levels of symbolic links",

BUT

ls /rfs/.zfs/snapshot/a shows the expected snapshot content.

Under /rfs. The mountpoint where the initramfs's init mounted tank/rfs to before pivot-rooting it to /.

Note: This did NOT happen on a rootfs-on-zfs setup on Debian Buster with ZFS 2.0.3 from buster-backports. However, that buster machine still ran sysvinit instead of systemd.

~~I'll try the steps described here on bullseye-sysvinit and buster-systemd machines in the coming days...~~
#9461 (comment) seems to describe the root cause

rptb1 · 2023-02-15T19:20:02Z

I'm not sure why this issue is still closed. It's clearly not fixed.

cjthompson · 2023-05-18T03:22:19Z

I'm not sure why this issue is still closed. It's clearly not fixed.

I just got this error today as well (zfs v2.1.9-2ubuntu1)

ceastus · 2023-06-23T10:42:22Z

I just upgraded from Ubuntu 22.04 to 23.04 (reusing my pool) and I see the same issue, just as nachtgeist described.
ZFS 2.1.9-2ubuntu1.1
6.2.0-23-generic #23-Ubuntu

Listing and auto-mounting of snapshots via $MNTPOINT/.zfs/snapshot/* works everywhere EXCEPT in /.

The snaps for / are mounted under /root/.zfs/snapshot

root@xander:/.zfs# ls -l snapshot/pre-rebuild-2023-06-22/
ls: cannot access 'snapshot/pre-rebuild-2023-06-22/': Too many levels of symbolic links
2 root@xander:/.zfs# mount |grep xander.ubuntu
xander/ubuntu on / type zfs (rw,nodev,noatime,xattr,posixacl)
xander/ubuntu@pre-rebuild-2023-06-22 on /root/.zfs/snapshot/pre-rebuild-2023-06-22 type zfs (ro,relatime,xattr,posixacl)

I don't see this issue when booting from the ISO, so that meshes.
I can manually mount snaps to the correct locations.

GregorKopka · 2023-12-06T13:29:42Z

@ahrens @behlendorf
please reopen.

behlendorf · 2023-12-06T19:43:47Z

Reopening. The issue here is further described in #9461 (comment)

mariaa144 · 2024-04-16T13:27:03Z

I faced this problem on NixOS using ZFS as root. It was upsetting when I couldn't get to my snapshots quickly to restore a Firefox session I had accidentally deleted.

A simple workaround which allowed me to access my snapshots is to manually mount the snapshot in another directory, instead of using the .zfs directory.

I did the following:

# Find my snapshot
zfs list -t snapshot

# create a mount directory
sudo mkdir /mnt/snapshot_test

# mount the snapshot
sudo mount -t zfs rpool/nixos/home@zfs-auto-snap_daily-2024-04-16-08h01 /mnt/snapshot_test/

This allowed me to copy my files from the snapshot in the directory /mnt/snapshot_test.

pbek · 2024-05-04T11:56:29Z

I have the same issue when I try to use a .zfs/snapshot as source for creating a backup with restic inside a docker container. The snapshots were done by sanoid on NixOS.

This e.g. also yields the error inside the docker container:

# "/backup" is the mount inside the docker container
ls /backup/home/.zfs/snapshot/autosnap_2024-05-04_11:00:06_hourly

Outside the docker container, this doesn't yield the error:

ls /home/.zfs/snapshot/autosnap_2024-05-04_11:00:06_hourly

This was referenced May 19, 2016

Snapshots are showing unreachable inside snapdir #4661

Closed

Skip ctldir znode in zfs_rezget to fix snapdir issues #4672

Closed

behlendorf closed this as completed in cbecb4f May 23, 2016

jonalbrecht mentioned this issue Aug 22, 2016

syncoid: getting "Too many levels of symbolic links" in snapshot dir on remote host jimsalterjrs/sanoid#48

Closed

pivot69 mentioned this issue Dec 8, 2016

"Too many levels of symbolic links" when "cd"ing to snapshot subdir #816

Closed

spacerunner5 mentioned this issue Oct 18, 2019

Container Bind Mounts: Snaphots: Too many levels of symbolic links #9479

Closed

behlendorf reopened this Dec 6, 2023