"Too many levels of symbolic links" when accessing snapshots #4514

Open
odoucet opened this issue Apr 12, 2016 · 35 comments

@odoucet

odoucet commented Apr 12, 2016

This is the same error as #816, but that old ticket was closed as fixed because it happened on old ZFS versions with an old kernel. I'm opening a new issue with additional information to get it fixed.

This happens on two different systems (with the same data, replicated with zfs send/recv).
System 1: kernel 4.4.4 + SPL/ZFS 0.6.5.5
System 2: kernel 4.5.0 + SPL/ZFS 0.6.5.6

This worked on ZFS 0.6.3 with kernel 3.10 ...

$ ls -lah /backup/xxxxobfuscatedxxx/.zfs/snapshot/2016-04-11.000000Z/
ls: cannot access /backup/xxxxobfuscatedxxx/.zfs/snapshot/2016-04-11.000000Z/: Too many levels of symbolic links

But this almost works:

$ cd /backup/xxxxobfuscatedxxx/.zfs/snapshot/2016-04-11.000000Z/
$ ls
home/
# Great, I have it ! 
$ pwd
/backup/xxxxobfuscatedxxx/.zfs/snapshot/2016-04-11.000000Z/
$ cd home && pwd
(unreachable)/home

But as the path seems completely f**ed up, tools like rsync do not work to restore data...

Fortunately, I can still access my data with mount:

$ mount -t zfs backupxx/xxxxobfuscatedxxx@2016-04-11.000000Z /mnt/test
# works \o/
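
For anyone who needs to restore data while the .zfs path is broken, the manual-mount workaround can be wrapped in a small helper. This is only a sketch under assumptions: the function names, the DRY_RUN switch, and the SNAP_MNT variable are made up for illustration, and the temporary mountpoint (default /mnt/test, as above) is assumed to exist already.

```shell
#!/bin/sh
# Restore files from a snapshot by mounting it outside .zfs, as in the
# workaround above. DRY_RUN=1 prints the commands instead of running them.
# All names here (run, restore_from_snapshot, SNAP_MNT) are hypothetical.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# restore_from_snapshot <dataset@snapshot> <subdir-in-snapshot> <destination>
restore_from_snapshot() {
    snap=$1; subdir=$2; dest=$3
    mnt=${SNAP_MNT:-/mnt/test}      # temporary mountpoint, assumed to exist
    run mount -t zfs "$snap" "$mnt" || return 1
    run rsync -a "$mnt/$subdir/" "$dest/"
    rc=$?                           # remember the rsync status
    run umount "$mnt"               # always try to unmount afterwards
    return $rc
}
```

Running it first with DRY_RUN=1 shows the three commands (mount, rsync, umount) without touching anything, which seems prudent on a backup server.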

Running strace on ls does not result in a segfault as in the old ticket :)

stat("/backup/xxxxobfuscatedxxx/.zfs/snapshot/2016-04-11.000000Z/", 0x1ac50d0) = -1 ELOOP
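
To check many paths for this without strace, a rough probe along these lines may help. It is a sketch, not part of ZFS: the function name probe_path is made up, and it classifies the failure only by matching the English error text printed by stat(1), which is why LC_ALL=C is forced.

```shell
#!/bin/sh
# Classify why a path cannot be stat'ed, based on the error text from stat(1).
# -L follows symlinks, like the stat() call in the trace above.
# LC_ALL=C forces untranslated messages so the patterns below can match.
probe_path() {
    if err=$(LC_ALL=C stat -L -- "$1" 2>&1 >/dev/null); then
        echo "ok"
        return 0
    fi
    case "$err" in
        *"Too many levels of symbolic links"*) echo "ELOOP" ;;
        *"No such file or directory"*)         echo "ENOENT" ;;
        *)                                     echo "other: $err" ;;
    esac
    return 1
}
```

For example, `probe_path /backup/xxxxobfuscatedxxx/.zfs/snapshot/2016-04-11.000000Z/` would presumably print ELOOP on the affected systems.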

This volume currently has 57 snapshots.

This can be reproduced easily if needed. Just tell me how I can help...

@tuxoko
Contributor

tuxoko commented Apr 12, 2016

Is the snapshot already mounted somewhere else?

@odoucet
Author

odoucet commented Apr 12, 2016

None of the snapshots in this filesystem are mounted elsewhere.
The parent filesystem itself is mounted, though.

@tuxoko
Contributor

tuxoko commented Apr 12, 2016

When you see the ELOOP error, can you manually mount the snapshot at the .zfs/snapshot/xxx location, and does everything work afterward?

@odoucet
Author

odoucet commented Apr 12, 2016

I'm not sure what you mean ... Yes, I can mount the same snapshot I have the ELOOP error with.

Or do you mean

mount -t zfs xxxx@2016-04-10.000000Z /backup/xxxx/.zfs/snapshot/2016-04-10.000000Z/ ?

That does not make much sense, does it? mount reports no error, but ls still fails with ELOOP.

And a third try:

$ mount -t zfs xxx@2016-04-10.000000Z /mnt/test
$ ls /backup/xxx/.zfs/snapshot/2016-04-10.000000Z/
Too many levels of symbolic links

@tuxoko
Contributor

tuxoko commented Apr 12, 2016

Do this: mount -t zfs xxxx@2016-04-10.000000Z /backup/xxxx/.zfs/snapshot/2016-04-10.000000Z/
It might not make sense to you, but I want to get an idea of which part of the code went wrong.

@odoucet
Author

odoucet commented Apr 13, 2016

Yes, that's what I tested in my previous post ;) The mount command reports no error, but the ELOOP is still there, and /proc/mounts does not show the mount.
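
A quick way to repeat the /proc/mounts check after each mount attempt, as a sketch: is_mounted is a made-up helper name, and the optional second argument (an alternative mounts file) exists only so the function can be tried without a real mount.

```shell
#!/bin/sh
# Succeed if <path> appears as a mountpoint (second field) in a mounts table.
# The table defaults to /proc/mounts; another file can be passed for testing.
# Note: the kernel escapes spaces in mountpoints as \040, which this sketch
# does not decode. A trailing slash on <path> is stripped before comparing.
is_mounted() {
    path=${1%/}
    [ -n "$path" ] || path=/
    MNT_PATH=$path awk '$2 == ENVIRON["MNT_PATH"] { found = 1 }
                        END { exit !found }' "${2:-/proc/mounts}"
}
```

Usage would be something like `is_mounted /backup/xxxx/.zfs/snapshot/2016-04-10.000000Z/ && echo mounted || echo "not mounted"`.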

@odoucet
Author

odoucet commented Apr 15, 2016

In case it helps: this bug also prevents "zfs diff" from working (it fails with the message "Unable to obtain diffs: No such file or directory").
The .zfs/shares folder is empty.

@tuxoko
Contributor

tuxoko commented Apr 15, 2016

@odoucet
Please try strace mount.zfs xxxx@2016-04-10.000000Z /backup/xxxx/.zfs/snapshot/2016-04-10.000000Z/

@odoucet
Author

odoucet commented Apr 16, 2016

[...]
stat("xxxxxxx/xxxxxx@2016-04-10.000000Z", 0x7ffe33b45500) = -1 ENOENT (No such file or directory)
getcwd("/root", 4096)                   = 6
lstat("/xx", {st_mode=S_IFDIR|0755, st_size=10, ...}) = 0
lstat("/xx/xx", {st_mode=S_IFDIR|0755, st_size=50, ...}) = 0
lstat("/xx/xx/xx", {st_mode=S_IFDIR|0755, st_size=25, ...}) = 0
lstat("/xx/xx/xx/xx", {st_mode=S_IFDIR|0755, st_size=3, ...}) = 0
lstat("/xx/xx/xx/xx/.zfs", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
lstat("/xx/xx/xx/xx/.zfs/snapshot", {st_mode=S_IFDIR|0555, st_size=2, ...}) = 0
lstat("/xx/xx/xx/xx/.zfs/snapshot/2016-04-10.000000Z", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
access("/sys/module/zfs", F_OK)         = 0
access("/sys/module/zfs", F_OK)         = 0
open("/dev/zfs", O_RDWR)                = 3
close(3)                                = 0
open("/dev/zfs", O_RDWR)                = 3
open("/etc/mtab", O_RDONLY)             = 4
open("/etc/dfs/sharetab", O_RDONLY)     = 5
open("/dev/zfs", O_RDWR)                = 6
open("/usr/share/locale/locale.alias", O_RDONLY) = 7
fstat(7, {st_mode=S_IFREG|0644, st_size=2512, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2c28673000
read(7, "# Locale name alias data base.\n#"..., 4096) = 2512
read(7, "", 4096)                       = 0
close(7)                                = 0
munmap(0x7f2c28673000, 4096)            = 0
open("/usr/share/locale/en_US.UTF-8/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US.utf8/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en.UTF-8/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en.utf8/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en/LC_MESSAGES/zfs-linux-user.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
ioctl(3, 0x5a12, 0x7ffe33b41a70)        = 0
ioctl(3, 0x5a05, 0x7ffe33b3e420)        = 0
ioctl(3, 0x5a13, 0x7ffe33b41e80)        = 0
close(3)                                = 0
close(4)                                = 0
close(5)                                = 0
close(6)                                = 0
mount("xxxxxxx/xxxxxx@2016-04-10.000000Z", "/xxxxxxx/.zfs/snapshot/2016-04-10.000000Z", "zfs", 0, ",mntpoint=/xxxxxxxxx"...) = 0
lstat("/etc/mtab", {st_mode=S_IFREG|0644, st_size=17740, ...}) = 0
open("/etc/mtab", O_RDWR|O_CREAT, 0644) = 3
close(3)                                = 0
open("/etc/mtab", O_RDWR|O_CREAT|O_APPEND, 0666) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=17740, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2c28673000
fstat(3, {st_mode=S_IFREG|0644, st_size=17740, ...}) = 0
lseek(3, 16384, SEEK_SET)               = 16384
read(3, "rw,noatime 0 0\nxxxxxxx"..., 1356) = 1356
write(3, "xxxxxx"..., 147) = 147
close(3)                                = 0
munmap(0x7f2c28673000, 4096)            = 0
exit_group(0)                           = ?
+++ exited with 0 +++
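
To skim a trace like this one for failures only, a filter such as the following can be used. It is a sketch; failed_syscalls is a made-up name, and it relies only on strace's usual "= -1 ERRNAME" notation for failed calls.

```shell
#!/bin/sh
# Print only the syscalls that failed in strace output read from stdin.
# strace reports failures as "= -1 ERRNAME (description)".
# "|| true" keeps the exit status at 0 when nothing failed.
failed_syscalls() {
    grep -E '= -1 E[A-Z0-9]+' || true
}
```

Since strace writes to stderr, usage would be `strace mount.zfs ... 2>&1 | failed_syscalls`. Applied to the trace above, it would leave the initial stat() on the dataset name and the missing locale files, while the mount() call itself returned 0.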

@odoucet
Author

odoucet commented Apr 25, 2016

Any update? What can I do to help?

@tuxoko
Contributor

tuxoko commented Apr 25, 2016

The mount command returns success, so I don't know why it wouldn't work for you.

@odoucet
Author

odoucet commented Apr 26, 2016

I just tested a different kernel with the same SPL/ZFS version (v0.6.5.6-1).
On kernel 4.4.4: not working
Rebooted the same system on kernel 3.10.101: working

On kernel 4.4.4, the behaviour is really strange:

$ ls /backup/xxxobfuscatedxxx/.zfs/snapshot/2016-04-26.000000Z/
ls: cannot access /backup/xxxobfuscatedxxx/.zfs/snapshot/2016-04-26.000000Z/: Too many levels of symbolic links
$ cd xxobfuscatedxxx && ls 2016-04-26.000000Z/
home/

As stated above, mounting the snapshot manually works...

@m-r-r

m-r-r commented May 1, 2016

Hello @odoucet,

I have the same problem with a ZFS filesystem that is bind-mounted in an LXC container.
Is your ZFS filesystem also accessible from an LXC container? If so, that could be related…

@odoucet
Author

odoucet commented May 2, 2016

Hi @m-r-r , sorry, no LXC involved in my setup ...

@JuliaVixen

I'm having what appears to be the same issue. Short summary: 'ls' on a snapshot works as expected when the CWD is under .zfs and a relative path is used. When the CWD is elsewhere: "Too many levels of symbolic links".

localhost ~ # uname -a
Linux localhost 4.4.6-gentoo #1 SMP Mon Apr 25 02:58:59 Local time zone must be set--see zic  x86_64 Intel(R) Xeon(R) CPU E3-1220 v3 @ 3.10GHz GenuineIntel GNU/Linux

localhost ~ # modinfo zfs
filename:       /lib/modules/4.4.6-gentoo/extra/zfs/zfs.ko
version:        0.6.5.4-r1-gentoo
license:        CDDL
author:         OpenZFS on Linux
description:    ZFS
srcversion:     4251E810337436FD7B850DA
depends:        spl,znvpair,zunicode,zcommon,zavl
vermagic:       4.4.6-gentoo SMP mod_unload modversions 
[... And so on.]

localhost ~ # ls /l/photos/.zfs/snapshot/send/
ls: cannot open directory /l/photos/.zfs/snapshot/send/: Too many levels of symbolic links
[Wait a minute]
localhost ~ # ls /l/photos/.zfs/snapshot/send/
ls: cannot open directory /l/photos/.zfs/snapshot/send/: Too many levels of symbolic links

localhost ~ # pwd
/root

localhost ~ # ls /l/photos/.zfs/snapshot/send/
ls: cannot open directory /l/photos/.zfs/snapshot/send/: Too many levels of symbolic links
localhost ~ # ls /l/photos/.zfs/snapshot/send
ls: cannot access /l/photos/.zfs/snapshot/send/stuff: Too many levels of symbolic links
ls: cannot access /l/photos/.zfs/snapshot/send/thing: Too many levels of symbolic links
ls: cannot access /l/photos/.zfs/snapshot/send/whatever: Too many levels of symbolic links
[...]
localhost ~ # ls /l/photos/.zfs/snapshot/
[Expected output, no error]
localhost ~ # ls /l/photos/.zfs/
[Expected output, no error]
localhost ~ # ls /l/photos/
[Expected output, no error]
localhost ~ # ls /l/
[Expected output, no error]

localhost photos # cd /l/photos
localhost photos # pwd
/l/photos

localhost photos # ls .zfs/snapshot/send/
ls: cannot open directory .zfs/snapshot/send/: Too many levels of symbolic links

localhost photos # cd .zfs
localhost .zfs # pwd
/l/photos/.zfs
localhost .zfs # ls snapshot/send/
thing
stuff
whatever
[...And the correctly expected results with no errors.]

Using an absolute path, even with CWD being .zfs

localhost .zfs # ls /l/photos/.zfs/snapshot/send/
ls: cannot open directory /l/photos/.zfs/snapshot/send/: Too many levels of symbolic links

Only relative paths to .zfs and below work without error.
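
The comparison above can be scripted for anyone who wants to check their own system. This is a sketch; compare_access is a made-up name, and it simply lists the directory once by absolute path and once by relative path from its parent directory (for a snapshot, the parent is .zfs/snapshot, which is below .zfs).

```shell
#!/bin/sh
# Report whether a directory can be listed via its absolute path and via a
# relative path from its parent directory.
compare_access() {
    dir=${1%/}
    parent=$(dirname -- "$dir")
    name=$(basename -- "$dir")
    if ls -- "$dir/" >/dev/null 2>&1; then abs=ok; else abs=fail; fi
    # The subshell keeps the cd from changing the caller's working directory.
    if (cd -- "$parent" && ls -- "$name/" >/dev/null 2>&1); then rel=ok; else rel=fail; fi
    echo "absolute=$abs relative=$rel"
}
```

Going by the transcript above, `compare_access /l/photos/.zfs/snapshot/send` would presumably report a failure for the absolute path and success for the relative one on an affected system.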

There's nothing in dmesg, and I've been scrubbing the pool, but there are no errors detected so far.

Oh, also...

localhost .zfs # zpool get all l
NAME  PROPERTY                    VALUE                       SOURCE
l     size                        36.2T                       -
l     capacity                    67%                         -
l     altroot                     -                           default
l     health                      ONLINE                      -
l     guid                        4946876290228094116         default
l     version                     -                           default
l     bootfs                      -                           default
l     delegation                  on                          default
l     autoreplace                 off                         default
l     cachefile                   -                           default
l     failmode                    wait                        default
l     listsnapshots               off                         default
l     autoexpand                  off                         default
l     dedupditto                  0                           default
l     dedupratio                  1.00x                       -
l     free                        11.9T                       -
l     allocated                   24.3T                       -
l     readonly                    off                         -
l     ashift                      12                          local
l     comment                     -                           default
l     expandsize                  -                           -
l     freeing                     0                           default
l     fragmentation               37%                         -
l     leaked                      0                           default
l     feature@async_destroy       enabled                     local
l     feature@empty_bpobj         active                      local
l     feature@lz4_compress        active                      local
l     feature@spacemap_histogram  active                      local
l     feature@enabled_txg         active                      local
l     feature@hole_birth          active                      local
l     feature@extensible_dataset  active                      local
l     feature@embedded_data       active                      local
l     feature@bookmarks           enabled                     local
l     feature@filesystem_limits   enabled                     local
l     feature@large_blocks        active                      local

localhost .zfs # zfs get all l/photos
NAME      PROPERTY              VALUE                                                       SOURCE
l/photos  type                  filesystem                                                  -
l/photos  creation              Tue Apr 26 19:43 2016                                       -
l/photos  used                  13.8T                                                       -
l/photos  available             8.64T                                                       -
l/photos  referenced            13.2T                                                       -
l/photos  compressratio         1.01x                                                       -
l/photos  mounted               yes                                                         -
l/photos  quota                 none                                                        default
l/photos  reservation           none                                                        default
l/photos  recordsize            128K                                                        default
l/photos  mountpoint            /l/photos                                                   default
l/photos  sharenfs              fsid=25,rw=172.16.111.0/24,sec=sys,insecure,insecure_locks  received
l/photos  checksum              on                                                          default
l/photos  compression           on                                                          inherited from l
l/photos  atime                 off                                                         inherited from l
l/photos  devices               off                                                         inherited from l
l/photos  exec                  on                                                          default
l/photos  setuid                off                                                         inherited from l
l/photos  readonly              off                                                         default
l/photos  zoned                 off                                                         default
l/photos  snapdir               hidden                                                      default
l/photos  aclinherit            restricted                                                  default
l/photos  canmount              on                                                          default
l/photos  xattr                 on                                                          default
l/photos  copies                1                                                           default
l/photos  version               1                                                           -
l/photos  utf8only              off                                                         default
l/photos  normalization         none                                                        default
l/photos  casesensitivity       sensitive                                                   default
l/photos  vscan                 off                                                         default
l/photos  nbmand                off                                                         default
l/photos  sharesmb              off                                                         default
l/photos  refquota              none                                                        default
l/photos  refreservation        none                                                        default
l/photos  primarycache          all                                                         default
l/photos  secondarycache        all                                                         default
l/photos  usedbysnapshots       680G                                                        -
l/photos  usedbydataset         13.2T                                                       -
l/photos  usedbychildren        0                                                           -
l/photos  usedbyrefreservation  0                                                           -
l/photos  logbias               latency                                                     default
l/photos  dedup                 off                                                         default
l/photos  mlslabel              none                                                        default
l/photos  sync                  standard                                                    default
l/photos  refcompressratio      1.01x                                                       -
l/photos  written               0                                                           -
l/photos  logicalused           14.0T                                                       -
l/photos  logicalreferenced     13.3T                                                       -
l/photos  filesystem_limit      none                                                        default
l/photos  snapshot_limit        none                                                        default
l/photos  filesystem_count      none                                                        default
l/photos  snapshot_count        none                                                        default
l/photos  snapdev               hidden                                                      default
l/photos  acltype               off                                                         default
l/photos  context               none                                                        default
l/photos  fscontext             none                                                        default
l/photos  defcontext            none                                                        default
l/photos  rootcontext           none                                                        default
l/photos  relatime              off                                                         default
l/photos  redundant_metadata    all                                                         default
l/photos  overlay               off                                                         default

There are 43 snapshots under /l/photos/.zfs/snapshot. The filesystem has been sent with "zfs send" and received with "zfs receive" a few times over the years, from older versions of ZFS on Solaris, FreeBSD, and Linux. It's been running under ZFSonLinux for two years now. I just did a...

zfs send -eLRv -I May_20_2015 h/photos@send | zfs recv -evF l

...a few days ago. (The h pool was an older version without many features turned on. And I don't remember this snapshot error on that pool, but I'd have to plug the drives back in to check...)

@JuliaVixen

I plugged the drives back in; there is no error when using the old pool.

localhost ~ # ls /l/photos/.zfs/snapshot/send/
ls: cannot open directory /l/photos/.zfs/snapshot/send/: Too many levels of symbolic links
localhost ~ # ls /h/photos/.zfs/snapshot/send/
things
stuff
[...expected results, no errors]

localhost ~ # zpool get all h

[Ignore the "DEGRADED" state, I only plugged in just enough drives to import this to test.]

h     size                        21.8T                       -
h     capacity                    96%                         -
h     altroot                     -                           default
h     health                      DEGRADED                    -
h     guid                        5105680881284105628         default
h     version                     -                           default
h     bootfs                      -                           default
h     delegation                  on                          default
h     autoreplace                 off                         default
h     cachefile                   -                           default
h     failmode                    wait                        default
h     listsnapshots               off                         default
h     autoexpand                  off                         default
h     dedupditto                  0                           default
h     dedupratio                  1.00x                       -
h     free                        825G                        -
h     allocated                   20.9T                       -
h     readonly                    on                          -
h     ashift                      12                          local
h     comment                     -                           default
h     expandsize                  -                           -
h     freeing                     0                           default
h     fragmentation               0%                          -
h     leaked                      0                           default
h     feature@async_destroy       enabled                     local
h     feature@empty_bpobj         active                      local
h     feature@lz4_compress        active                      local
h     feature@spacemap_histogram  disabled                    local
h     feature@enabled_txg         disabled                    local
h     feature@hole_birth          disabled                    local
h     feature@extensible_dataset  disabled                    local
h     feature@embedded_data       active                      local
h     feature@bookmarks           disabled                    local
h     feature@filesystem_limits   disabled                    local
h     feature@large_blocks        disabled                    local

localhost ~ # zfs get all h/photos
NAME      PROPERTY              VALUE                                                       SOURCE
h/photos  type                  filesystem                                                  -
h/photos  creation              Sat Mar 21 20:06 2015                                       -
h/photos  used                  13.9T                                                       -
h/photos  available             85.7G                                                       -
h/photos  referenced            13.0T                                                       -
h/photos  compressratio         1.00x                                                       -
h/photos  mounted               yes                                                         -
h/photos  quota                 none                                                        default
h/photos  reservation           none                                                        default
h/photos  recordsize            128K                                                        default
h/photos  mountpoint            /h/photos                                                   default
h/photos  sharenfs              fsid=25,rw=172.16.111.0/24,sec=sys,insecure,insecure_locks  received
h/photos  checksum              on                                                          default
h/photos  compression           lz4                                                         local
h/photos  atime                 off                                                         local
h/photos  devices               off                                                         local
h/photos  exec                  off                                                         local
h/photos  setuid                off                                                         local
h/photos  readonly              on                                                          temporary
h/photos  zoned                 off                                                         default
h/photos  snapdir               hidden                                                      default
h/photos  aclinherit            restricted                                                  default
h/photos  canmount              on                                                          default
h/photos  xattr                 on                                                          default
h/photos  copies                1                                                           default
h/photos  version               1                                                           -
h/photos  utf8only              off                                                         default
h/photos  normalization         none                                                        default
h/photos  casesensitivity       sensitive                                                   default
h/photos  vscan                 off                                                         default
h/photos  nbmand                off                                                         default
h/photos  sharesmb              off                                                         default
h/photos  refquota              none                                                        default
h/photos  refreservation        none                                                        default
h/photos  primarycache          all                                                         default
h/photos  secondarycache        all                                                         default
h/photos  usedbysnapshots       988G                                                        -
h/photos  usedbydataset         13.0T                                                       -
h/photos  usedbychildren        0                                                           -
h/photos  usedbyrefreservation  0                                                           -
h/photos  logbias               latency                                                     default
h/photos  dedup                 off                                                         default
h/photos  mlslabel              none                                                        default
h/photos  sync                  standard                                                    default
h/photos  refcompressratio      1.00x                                                       -
h/photos  written               234K                                                        -
h/photos  logicalused           13.9T                                                       -
h/photos  logicalreferenced     13.0T                                                       -
h/photos  filesystem_limit      none                                                        default
h/photos  snapshot_limit        none                                                        default
h/photos  filesystem_count      none                                                        default
h/photos  snapshot_count        none                                                        default
h/photos  snapdev               hidden                                                      default
h/photos  acltype               off                                                         default
h/photos  context               none                                                        default
h/photos  fscontext             none                                                        default
h/photos  defcontext            none                                                        default
h/photos  rootcontext           none                                                        default
h/photos  relatime              off                                                         default
h/photos  redundant_metadata    all                                                         default
h/photos  overlay               off                                                         default

I have this in my zpool history:
2016-04-26.06:53:29 zpool set feature@embedded_data=enabled h
But I haven't modified any data in the pool at all, so it has probably had no tangible effect.
I also set these on h/photos, just before I made the h/photos@send snapshot:
h/photos compression lz4 local
h/photos atime off local
h/photos devices off local
h/photos exec off local
h/photos setuid off local
These were all the default values (off), for all snapshots prior to that last snapshot.

All 43 snapshots on the pool "h" work without error; all 43 snapshots on the pool "l" return the "Too many levels of symbolic links" error message.
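
Checking every snapshot by hand gets tedious, so here is a sketch of a loop that does it. check_snapshots is a made-up name; it relies on the observation above that listing .zfs/snapshot/ itself still works, so the glob can expand even when the individual snapshots cannot be opened.

```shell
#!/bin/sh
# List every entry under a snapshot directory and report which ones can be
# opened by absolute path. Prints "<name> ok" or "<name> FAIL" per entry.
check_snapshots() {
    snapdir=${1%/}
    for s in "$snapdir"/*; do
        # If the directory is empty the glob stays unexpanded; skip it.
        [ "$s" = "$snapdir/*" ] && continue
        if ls -- "$s/" >/dev/null 2>&1; then
            echo "$(basename -- "$s") ok"
        else
            echo "$(basename -- "$s") FAIL"
        fi
    done
}
```

Usage would be `check_snapshots /l/photos/.zfs/snapshot` and `check_snapshots /h/photos/.zfs/snapshot` to compare the two pools.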

localhost ~ # zdb -C l

MOS Configuration:
    version: 5000
    name: 'l'
    state: 0
    txg: 623065
    pool_guid: 4946876290228094116
    errata: 0
    hostname: 'localhost'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 4946876290228094116
        create_txg: 4
        children[0]:
            type: 'raidz'
            id: 0
            guid: 3883405302801716307
            nparity: 1
            metaslab_array: 35
            metaslab_shift: 38
            ashift: 12
            asize: 40007743569920
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 560335158339124174
                path: '/dev/disk/by-id/ata-WDC_WD80EFZX-68UW8N0_VKGU6V2X-part1'
                whole_disk: 1
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 17515656895715273874
                path: '/dev/disk/by-id/ata-WDC_WD80EFZX-68UW8N0_VKH408MX-part1'
                whole_disk: 1
                DTL: 425
                create_txg: 4
            children[2]:
                type: 'disk'
                id: 2
                guid: 8921389355845190819
                path: '/dev/disk/by-id/ata-WDC_WD80EFZX-68UW8N0_VKH52T6X-part1'
                whole_disk: 1
                create_txg: 4
            children[3]:
                type: 'disk'
                id: 3
                guid: 13116042119914339975
                path: '/dev/disk/by-id/ata-WDC_WD80EFZX-68UW8N0_VKHLNHZX-part1'
                whole_disk: 1
                create_txg: 4
            children[4]:
                type: 'disk'
                id: 4
                guid: 6003650805274103645
                path: '/dev/disk/by-id/ata-WDC_WD80EFZX-68UW8N0_VKHNJ93X-part1'
                whole_disk: 1
                create_txg: 4
    features_for_read:
        com.delphix:embedded_data
        com.delphix:hole_birth
space map refcount mismatch: expected 150 != actual 146

localhost ~ # zdb -C h
zdb: can't open 'h': No such file or directory
[That's unexpected...]

localhost ~ # zpool list
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH    ALTROOT
h     21.8T  20.9T   825G         -     0%    96%  1.00x  DEGRADED  -
l     36.2T  24.3T  11.9T         -    37%    67%  1.00x  ONLINE    -

localhost ~ # zdb -d l
Dataset mos [META], ID 0, cr_txg 4, 356M, 672 objects
Dataset l/photos@May_20_2015 [ZPL], ID 632, cr_txg 550939, 8.96T, 1946909 objects
[etc...]
Dataset l/photos@Oct_19_2015 [ZPL], ID 861, cr_txg 601790, 12.0T, 2214649 objects
Dataset l/photos [ZPL], ID 218, cr_txg 69828, 13.2T, 2290873 objects
Dataset l/test [ZPL], ID 356, cr_txg 131705, 153K, 6 objects
Dataset l/new [ZPL], ID 42, cr_txg 75, 4.40T, 35274 objects
Dataset l/CDs [ZPL], ID 49, cr_txg 1607, 78.8G, 144 objects
Dataset l [ZPL], ID 21, cr_txg 1, 1.11T, 2351 objects
Verified large_blocks feature refcount is correct (1)
space map refcount mismatch: expected 150 != actual 146

localhost ~ # zdb -d h
zdb: can't open 'h': No such file or directory

That's strange, zdb isn't seeing the other pool. Is this another bug? "zdb -C" only lists the "l" pool and not the "h" pool. Also the "h" pool doesn't appear in "/etc/zfs/zpool.cache". Could this be because I imported "h" with:

zpool import -o readonly=on h

@JuliaVixen

So, I just checked, and yes, when a pool is imported "readonly=on", it doesn't appear in zpool.cache and "zdb" can't see it. I'm going to create an issue for this if there isn't one already...

localhost ~ # zpool import -o readonly=on h
localhost ~ # zpool list
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
h     21.8T  20.9T   825G         -     0%    96%  1.00x  ONLINE  -
l     36.2T  24.3T  11.9T         -    37%    67%  1.00x  ONLINE  -

localhost ~ # strings /etc/zfs/zpool.cache 
[There's just one pool there]

localhost ~ # zdb -C  
l:
    version: 5000
    name: 'l'
    state: 0
[etc etc, just this one pool]

[But, re-import read-write...]

localhost ~ # zpool export h
localhost ~ # zpool import h
localhost ~ # zdb -C
h:
    version: 5000
    name: 'h'
    state: 0
    txg: 5195339
    pool_guid: 5105680881284105628
    errata: 0
    hostname: 'localhost'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 5105680881284105628
        children[0]:
            type: 'raidz'
            id: 0
            guid: 16056632966898540210
            nparity: 1
            metaslab_array: 34
            metaslab_shift: 37
            ashift: 12
            asize: 24004646141952
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 2091825546480782105
                path: '/dev/sdn1'
                whole_disk: 1
                DTL: 381
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 10983704478065269639
                path: '/dev/sdo1'
                whole_disk: 1
                DTL: 261
                create_txg: 4
            children[2]:
                type: 'disk'
                id: 2
                guid: 16892970404048100612
                path: '/dev/sdm1'
                whole_disk: 1
                DTL: 380
                create_txg: 4
    features_for_read:
        com.delphix:embedded_data

l:
    version: 5000
    name: 'l'
    state: 0
    txg: 623065
    pool_guid: 4946876290228094116
    errata: 0
[etc. etc.]
36.2T  24.3T  11.9T         -    37%    67%  1.00x  ONLINE  -
localhost ~ # zdb -d h
Dataset mos [META], ID 0, cr_txg 4, 354M, 680 objects
Dataset h/photos@2015_Jul_8 [ZPL], ID 330, cr_txg 1261118, 10.1T, 1910968 objects
[etc. etc].

@JuliaVixen

I have updated to the Gentoo zfs-9999 package (as of May 17, 2016), and with this version I can no longer reproduce the "Too many levels of symbolic links" error when I "ls" the ".zfs/snapshot" directory with an absolute path. As far as I know, nothing has changed in the configuration of my filesystem, except that I made a new snapshot yesterday; I haven't written any data to this filesystem since I wrote up this bug. The only changes to the zpool which this filesystem sits on are the creation of one or two new zfs filesystems, the destruction (I think) of one filesystem, a few TiB of data written into the other zfs filesystems, and some "zfs send"s of all the filesystems.

The exact version of zfs-9999 I'm using is this one:
#4582 (comment)

ryao pushed a commit to ClusterHQ/zfs that referenced this issue Jun 7, 2016
Skip ctldir znodes in zfs_rezget; otherwise they will always get invalidated.
This causes odd behaviour for the mounted snapdirs. In particular, on
Linux >= 3.18, d_invalidate will detach the mountpoint and prevent anyone
from automounting it again as long as someone is still using the detached mount.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#4514
Closes openzfs#4661
Closes openzfs#4672
tuxoko pushed a commit to tuxoko/zfs that referenced this issue Jun 14, 2016
ironMann pushed a commit to ironMann/zfs that referenced this issue Jun 30, 2016
GeLiXin added a commit to GeLiXin/zfs that referenced this issue Aug 1, 2016
* Consistently use parsable instead of parseable

This is a purely cosmetic change, to consistently prefer one of
two (both acceptable) choices for the word parsable in documentation and
code. I don't really care which to use, but according to Wiktionary
https://en.wiktionary.org/wiki/parsable#English parsable is preferred.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4682

* Add missing RPM BuildRequires

Both libudev and libattr are recommended build requirements.  As
such their development headers should be listed in the rpm spec file
so those dependencies are pulled in when building rpm packages.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4676

* Skip ctldir znode in zfs_rezget to fix snapdir issues

Skip ctldir znodes in zfs_rezget; otherwise they will always get invalidated.
This causes odd behaviour for the mounted snapdirs. In particular, on
Linux >= 3.18, d_invalidate will detach the mountpoint and prevent anyone
from automounting it again as long as someone is still using the detached mount.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4514
Closes #4661
Closes #4672

* Improve zfs-module-parameters(5)

Various rewrites to the descriptions of module parameters. Corrects
spelling mistakes, makes the descriptions more user-friendly and
describes some ZFS quirks which should be understood before changing
parameter values.

Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4671

* Fix arc_prune_task use-after-free

arc_prune_task uses a refcount to protect arc_prune_t, but it doesn't prevent
the underlying zsb from disappearing if there's a concurrent umount. We fix
this by forcing the caller of arc_remove_prune_callback to wait for
arc_prune_taskq to finish.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4687
Closes #4690

* Add request size histograms (-r) to zpool iostat, minor man page fix

Add -r option to "zpool iostat" to print request size histograms for the leaf
ZIOs. This includes histograms of individual ZIOs ("ind") and aggregate ZIOs
("agg"). These stats can be useful for seeing how well the ZFS IO aggregator
is working.

$ zpool iostat -r
mypool        sync_read    sync_write    async_read    async_write      scrub
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0    530      0      0      0
1K              0      0    260      0      0      0    116    246      0      0
2K              0      0      0      0      0      0      0    431      0      0
4K              0      0      0      0      0      0      3    107      0      0
8K             15      0     35      0      0      0      0      6      0      0
16K             0      0      0      0      0      0      0     39      0      0
32K             0      0      0      0      0      0      0      0      0      0
64K            20      0     40      0      0      0      0      0      0      0
128K            0      0     20      0      0      0      0      0      0      0
256K            0      0      0      0      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0    155     19      0      0
8M              0      0      0      0      0      0      0    811      0      0
16M             0      0      0      0      0      0      0     68      0      0
--------------------------------------------------------------------------------

Also rename the stray "-G" in the man page to be "-w" for latency histograms.

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #4659

* OpenZFS 6531 - Provide mechanism to artificially limit disk performance

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6531
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/97e8130

Porting notes:
- Added new IO delay tracepoints, and moved common ZIO tracepoint macros
  to a new trace_common.h file.
- Used zio_delay_taskq() in place of OpenZFS's timeout_generic() function.
- Updated zinject man page
- Updated zpool_scrub test files

* Systemd configuration fixes

* Disable zfs-import-scan.service by default.  This ensures that
pools will not be automatically imported unless they appear in
the cache file.  When this service is explicitly enabled pools
will be imported with the "cachefile=none" property set.  This
prevents the creation of, or update to, an existing cache file.

    $ systemctl list-unit-files | grep zfs
    zfs-import-cache.service                  enabled
    zfs-import-scan.service                   disabled
    zfs-mount.service                         enabled
    zfs-share.service                         enabled
    zfs-zed.service                           enabled
    zfs.target                                enabled

* Change services to dynamic from static by adding an [Install]
section and adding 'WantedBy' tags in favor of 'Requires' tags.
This allows for easier customization of the boot behavior.

* Start the zfs-import-cache.service after the root pivot so
the cache file is available in the standard location.

* Start the zfs-mount.service after the systemd-remount-fs.service
to ensure the root fs is writeable and the ZFS filesystems can
create their mount points.

* Change the default behavior to only load the ZFS kernel modules
in zfs-import-*.service or when blkid(8) detects a pool.  Users
who wish to unconditionally load the kernel modules must uncomment
the list of modules in /lib/modules-load.d/zfs.conf.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4325
Closes #4496
Closes #4658
Closes #4699

* Fix self-healing IO prior to dsl_pool_init() completion

Async writes triggered by a self-healing IO may be issued before the
pool finishes the process of initialization.  This results in a NULL
dereference of `spa->spa_dsl_pool` in vdev_queue_max_async_writes().

George Wilson recommended addressing this issue by initializing the
passed `dsl_pool_t **` prior to dmu_objset_open_impl().  Since the
caller is passing the `spa->spa_dsl_pool` this has the effect of
ensuring it's initialized.

However, since this depends on the caller knowing they must pass
the `spa->spa_dsl_pool` an additional NULL check was added to
vdev_queue_max_async_writes().  This guards against any future
restructuring of the code which might result in dsl_pool_init()
being called differently.

Signed-off-by: GeLiXin <47034221@qq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4652

* Add isa_defs for MIPS

GCC for MIPS defines _LP64 only for 64-bit targets,
and does not define _ILP32 for 32-bit targets.

Signed-off-by: YunQiang Su <syq@debian.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4712

* Fix out-of-bound access in zfs_fillpage

The original code performs an out-of-bounds access on pl[] during the last
iteration.

 ==================================================================
 BUG: KASAN: stack-out-of-bounds in zfs_getpage+0x14c/0x2d0 [zfs]
 Read of size 8 by task tmpfile/7850
 page:ffffea00017c6dc0 count:0 mapcount:0 mapping:          (null) index:0x0
 flags: 0xffff8000000000()
 page dumped because: kasan: bad access detected
 CPU: 3 PID: 7850 Comm: tmpfile Tainted: G           OE   4.6.0+ #3
  ffff88005f1b7678 0000000006dbe035 ffff88005f1b7508 ffffffff81635618
  ffff88005f1b7678 ffff88005f1b75a0 ffff88005f1b7590 ffffffff81313ee8
  ffffea0001ae8dd0 ffff88005f1b7670 0000000000000246 0000000041b58ab3
 Call Trace:
  [<ffffffff81635618>] dump_stack+0x63/0x8b
  [<ffffffff81313ee8>] kasan_report_error+0x528/0x560
  [<ffffffff81278f20>] ? filemap_map_pages+0x5f0/0x5f0
  [<ffffffff813144b8>] kasan_report+0x58/0x60
  [<ffffffffc12250dc>] ? zfs_getpage+0x14c/0x2d0 [zfs]
  [<ffffffff81312e4e>] __asan_load8+0x5e/0x70
  [<ffffffffc12250dc>] zfs_getpage+0x14c/0x2d0 [zfs]
  [<ffffffffc1252131>] zpl_readpage+0xd1/0x180 [zfs]

  [<ffffffff81353c3a>] SyS_execve+0x3a/0x50
  [<ffffffff810058ef>] do_syscall_64+0xef/0x180
  [<ffffffff81d0ee25>] entry_SYSCALL64_slow_path+0x25/0x25
 Memory state around the buggy address:
  ffff88005f1b7500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ffff88005f1b7580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 >ffff88005f1b7600: 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f4
                                                                 ^
  ffff88005f1b7680: f4 f4 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00
  ffff88005f1b7700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ==================================================================

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4705
Issue #4708

* Fix memleak in zpl_parse_options

strsep() will advance tmp_mntopts, and will change it to NULL on last
iteration.  This will cause strfree(tmp_mntopts) to not free anything.

unreferenced object 0xffff8800883976c0 (size 64):
  comm "mount.zfs", pid 3361, jiffies 4294931877 (age 1482.408s)
  hex dump (first 32 bytes):
    72 77 00 73 74 72 69 63 74 61 74 69 6d 65 00 7a  rw.strictatime.z
    66 73 75 74 69 6c 00 6d 6e 74 70 6f 69 6e 74 3d  fsutil.mntpoint=
  backtrace:
    [<ffffffff81810c4e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff811f9cac>] __kmalloc+0x16c/0x250
    [<ffffffffc065ce9b>] strdup+0x3b/0x60 [spl]
    [<ffffffffc080fad6>] zpl_parse_options+0x56/0x300 [zfs]
    [<ffffffffc080fe46>] zpl_mount+0x36/0x80 [zfs]
    [<ffffffff81222dc8>] mount_fs+0x38/0x160
    [<ffffffff81240097>] vfs_kern_mount+0x67/0x110
    [<ffffffff812428e0>] do_mount+0x250/0xe20
    [<ffffffff812437d5>] SyS_mount+0x95/0xe0
    [<ffffffff8181aff6>] entry_SYSCALL_64_fastpath+0x1e/0xa8
    [<ffffffffffffffff>] 0xffffffffffffffff
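
The leak pattern described above can be illustrated in userspace C. This is only a sketch, not the zpl_parse_options code: `count_mount_options` and the names `base` and `cursor` are hypothetical, and `strdup`/`free` stand in for the kernel's `strdup`/`strfree`. It keeps one pointer that owns the allocation and lets `strsep()` move a second one.

```c
#define _DEFAULT_SOURCE   /* exposes strsep() and strdup() on glibc */
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: count the non-empty, comma-separated options in a
 * private copy of `opts`. strsep() advances `cursor` and sets it to NULL
 * after the last token, so the copy must be released through `base`.
 */
static int
count_mount_options(const char *opts)
{
    char *base = strdup(opts);  /* owns the allocation */
    char *cursor = base;        /* strsep() is free to move this one */
    char *token;
    int count = 0;

    if (base == NULL)
        return (-1);

    while ((token = strsep(&cursor, ",")) != NULL) {
        if (*token != '\0')
            count++;
    }

    /* cursor is NULL here: free(cursor) would do nothing and leak base. */
    free(base);
    return (count);
}
```

Freeing through the untouched `base` pointer is the usual way to avoid this class of leak; whether the actual fix uses a second pointer or another approach is not stated in the commit message.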

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4706
Issue #4708

* Fix memleak in vdev_config_generate_stats

fnvlist_add_nvlist will copy the contents of nvx, so we need to
free it here.

unreferenced object 0xffff8800a6934e80 (size 64):
  comm "zpool", pid 3398, jiffies 4295007406 (age 214.180s)
  hex dump (first 32 bytes):
    60 06 c2 73 00 88 ff ff 00 7c 8c 73 00 88 ff ff  `..s.....|.s....
    00 00 00 00 00 00 00 00 40 b0 70 c0 ff ff ff ff  ........@.p.....
  backtrace:
    [<ffffffff81810c4e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff811fac7d>] __kmalloc_node+0x17d/0x310
    [<ffffffffc065528c>] spl_kmem_alloc_impl+0xac/0x180 [spl]
    [<ffffffffc0657379>] spl_vmem_alloc+0x19/0x20 [spl]
    [<ffffffffc07056cf>] nv_alloc_sleep_spl+0x1f/0x30 [znvpair]
    [<ffffffffc07006b7>] nvlist_xalloc.part.13+0x27/0xc0 [znvpair]
    [<ffffffffc07007ad>] nvlist_alloc+0x3d/0x40 [znvpair]
    [<ffffffffc0703abc>] fnvlist_alloc+0x2c/0x80 [znvpair]
    [<ffffffffc07b1783>] vdev_config_generate_stats+0x83/0x370 [zfs]
    [<ffffffffc07b1f53>] vdev_config_generate+0x4e3/0x650 [zfs]
    [<ffffffffc07996db>] spa_config_generate+0x20b/0x4b0 [zfs]
    [<ffffffffc0794f64>] spa_tryimport+0xc4/0x430 [zfs]
    [<ffffffffc07d11d8>] zfs_ioc_pool_tryimport+0x68/0x110 [zfs]
    [<ffffffffc07d4fc6>] zfsdev_ioctl+0x646/0x7a0 [zfs]
    [<ffffffff81232e31>] do_vfs_ioctl+0xa1/0x5b0
    [<ffffffff812333b9>] SyS_ioctl+0x79/0x90

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4707
Issue #4708

* Linux 4.7 compat: handler->set() takes both dentry and inode

Counterpart to fd4c7b7, the same approach was taken to resolve
the compatibility issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4717 
Issue #4665

* Implementation of AVX2 optimized Fletcher-4

New functionality:
- Preserves existing scalar implementation.
- Adds AVX2 optimized Fletcher-4 computation.
- Fastest routines selected on module load (benchmark).
- Test case for Fletcher-4 added to ztest.
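
For readers unfamiliar with the checksum, a scalar Fletcher-4 can be sketched as four running 64-bit sums over 32-bit words. The code below is an illustrative sketch, not the routine from this patch: `struct fletcher4_sums` and `fletcher4_scalar` are hypothetical names, and the words are assumed to be in native byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Four running 64-bit accumulators, one per Fletcher-4 sum. */
struct fletcher4_sums {
    uint64_t a, b, c, d;
};

/*
 * Hypothetical scalar reference: each 32-bit word is added to `a`, and
 * every later accumulator absorbs the one before it. Overflow wraps
 * modulo 2^64, which is well defined for unsigned types in C.
 */
static struct fletcher4_sums
fletcher4_scalar(const uint32_t *words, size_t nwords)
{
    struct fletcher4_sums s = { 0, 0, 0, 0 };

    for (size_t i = 0; i < nwords; i++) {
        s.a += words[i];
        s.b += s.a;
        s.c += s.b;
        s.d += s.c;
    }
    return (s);
}
```

Each iteration depends on the previous one, which is why a vectorized version has to restructure the computation rather than simply unroll this loop.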

New zcommon module parameters:
-  zfs_fletcher_4_impl (str): selects the implementation to use.
    "fastest" - use the fastest version available
    "cycle"   - cycle through all available impl for ztest
    "scalar"  - use the original version
    "avx2"    - new AVX2 implementation if available

Performance comparison (Intel i7 CPU, 1MB data buffers):
- Scalar:  4216 MB/s
- AVX2:   14499 MB/s

See contents of `/sys/module/zcommon/parameters/zfs_fletcher_4_impl`
to get list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.
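
A small sketch of how a tool might pick out the currently selected option from such a line. `selected_impl` is a hypothetical helper, and the assumption that the options appear on one line separated by spaces is mine; the commit message only says the selected value is enclosed in `[]`.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy the token enclosed in '[' and ']' from `line`
 * into `out`. Returns 0 on success, or -1 if there is no bracketed token
 * or it does not fit in `outlen` bytes (including the terminating NUL).
 */
static int
selected_impl(const char *line, char *out, size_t outlen)
{
    const char *open = strchr(line, '[');
    const char *close;
    size_t len;

    if (open == NULL)
        return (-1);
    close = strchr(open + 1, ']');
    if (close == NULL)
        return (-1);

    len = (size_t)(close - (open + 1));
    if (len + 1 > outlen)
        return (-1);

    memcpy(out, open + 1, len);
    out[len] = '\0';
    return (0);
}
```

In practice the line would be read with `fopen()` and `fgets()` from the sysfs path named above before being passed to the helper.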

Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4330

* Fix cstyle.pl warnings

As of perl v5.22.1 the following warnings are generated:

* Redundant argument in printf at scripts/cstyle.pl line 194

* Unescaped left brace in regex is deprecated, passed through
  in regex; marked by <-- HERE in m/\S{ <-- HERE / at
  scripts/cstyle.pl line 608.

They have been addressed by escaping the left braces and by
providing the correct number of arguments to printf based on
the fmt specifier set by the verbose option.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4723

* Fix minor spelling mistakes

Trivial spelling mistake fix in error message text.

* Fix spelling mistake "adminstrator" -> "administrator"
* Fix spelling mistake "specificed" -> "specified"
* Fix spelling mistake "interperted" -> "interpreted"

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4728

* Add `zfs allow` and `zfs unallow` support

ZFS allows for specific permissions to be delegated to normal users
with the `zfs allow` and `zfs unallow` commands.  In addition, non-
privileged users should be able to run all of the following commands:

  * zpool [list | iostat | status | get]
  * zfs [list | get]

Historically this functionality was not available on Linux.  In order
to add it the secpolicy_* functions needed to be implemented and mapped
to the equivalent Linux capability.  Only then could the permissions on
the `/dev/zfs` be relaxed and the internal ZFS permission checks used.

Even with this change some limitations remain.  Under Linux only the
root user is allowed to modify the namespace (unless it's a private
namespace).  This means the mount, mountpoint, canmount, unmount,
and remount delegations cannot be supported with the existing code.  It
may be possible to add this functionality in the future.

This functionality was validated with the cli_user and delegation test
cases from the ZFS Test Suite.  These tests exhaustively verify each
of the supported permissions which can be delegated and ensure only
an authorized user can perform it.

Two minor bug fixes were required for test-running.py.  First, the
Timer() object cannot be safely created in a `try:` block when there
is an unconditional `finally` block which references it.  Second,
when running as a normal user also check for scripts using
both the .ksh and .sh suffixes.

Finally, existing users who are simulating delegations by setting
group permissions on the /dev/zfs device should revert that
customization when updating to a version with this change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #362 
Closes #434 
Closes #4100
Closes #4394 
Closes #4410 
Closes #4487

* Remove libzfs_graph.c

The libzfs_graph.c source file should have been removed in 330d06f,
it is entirely unused.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4766

* Linux 4.6 compat: Fall back to d_prune_aliases() if necessary

As of 4.6, the icache and dcache LRUs are memcg aware insofar as the
kernel's per-superblock shrinker is concerned.  The effect is that dcache
or icache entries added by a task in a non-root memcg won't be scanned
by the shrinker in the context of the root (or NULL) memcg.  This defeats
the attempts by zfs_sb_prune() to unpin buffers and can allow metadata to
grow uncontrollably.  This patch reverts to the d_prune_aliases() method
in case the kernel's per-superblock shrinker is not able to free anything.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes: #4726

* SIMD implementation of vdev_raidz generate and reconstruct routines

This is a new implementation of RAIDZ1/2/3 routines using x86_64
scalar, SSE, and AVX2 instruction sets. Included are 3 parity
generation routines (P, PQ, and PQR) and 7 reconstruction routines,
for all RAIDZ levels. On module load, a quick benchmark of supported
routines will select the fastest for each operation and they will
be used at runtime. Original implementation is still present and
can be selected via module parameter.

Patch contains:
- specialized gen/rec routines for all RAIDZ levels,
- new scalar raidz implementation (unrolled),
- two x86_64 SIMD implementations (SSE and AVX2 instruction sets),
- fastest routines selected on module load (benchmark).
- cmd/raidz_test - verify and benchmark all implementations
- added raidz_test to the ZFS Test Suite

New zfs module parameters:
- zfs_vdev_raidz_impl (str): selects the implementation to use. On
  module load, the parameter will only accept first 3 options, and
  the other implementations can be set once module is finished
  loading. Possible values for this option are:
    "fastest" - use the fastest math available
    "original" - use the original raidz code
    "scalar" - new scalar impl
    "sse" - new SSE impl if available
    "avx2" - new AVX2 impl if available

See contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` to
get the list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4328

* Fix NFS credential

Commit f74b821 caused a regression where creating a file through NFS will
always create a file owned by root. This is because the patch enables the KSID
code in zfs_acl_ids_create, which would use the euid and egid of the current
process. However, on Linux, we should use fsuid and fsgid for file operations,
which is the original behaviour. So we revert this part of the code.

The patch also enables secpolicy_vnode_*. Since they are also used in file
operations, we change them to use fsuid and fsgid.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4772
Closes #4758

* OpenZFS 6513 - partially filled holes lose birth time

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported by: Boris Protopopov <bprotopopov@actifio.com>
Signed-off-by: Boris Protopopov <bprotopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6513
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8df0bcf0

If a ZFS object contains a hole at level one, and then a data block is
created at level 0 underneath that l1 block, l0 holes will be created.
However, these l0 holes do not have the birth time property set; as a
result, incremental sends will not send those holes.

Fix is to modify the dbuf_read code to fill in birth time data.

* Add a test case for dmu_free_long_range() to ztest

Signed-off-by: Boris Protopopov <bprotopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4754

* Revert "Add a test case for dmu_free_long_range() to ztest"

This reverts commit d0de2e82df579f4e4edf5643b674a1464fae485f which
introduced a new test case to ztest which is failing occasionally
during automated testing.  The change is being reverted until
the issue can be fully investigated.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4754

* OpenZFS 6878 - Add scrub completion info to "zpool history"

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Authored by: Nav Ravindranath <nav@delphix.com>
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6878
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1825bc5
Closes #4787

* FreeBSD rS271776 - Persist vdev_resilver_txg changes

Persist vdev_resilver_txg changes to avoid panic caused by validation
vs a vdev_resilver_txg value from a previous resilver.

Authored-by: smh <smh@FreeBSD.org>
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/5154
FreeBSD-issue: https://reviews.freebsd.org/rS271776
FreeBSD-commit: https://github.com/freebsd/freebsd/commit/c3c60bf
Closes #4790

* xattrtest: allow verify with -R and other improvements

- Use a fixed buffer of random bytes when random xattr values are in
  effect.  This eliminates the potential performance bottleneck of
  reading from /dev/urandom for each file. This also allows us to
  verify xattrs in random value mode.

- Show the rate of operations per second in addition to elapsed time
  for each phase of the test. This may be useful for benchmarking.

- Set default xattr size to 6 so that verify doesn't fail if user
  doesn't specify a size. We need at least six bytes to store the
  leading "size=X" string that is used for verification.

- Allow user to execute just one phase of the test. Acceptable
  values for -o and their meanings are:

   1 - run the create phase
   2 - run the setxattr phase
   3 - run the getxattr phase
   4 - run the unlink phase

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

* Backfill metadnode more intelligently

Only attempt to backfill lower metadnode object numbers if at least
4096 objects have been freed since the last rescan, and at most once
per transaction group. This avoids a pathology in dmu_object_alloc()
that caused O(N^2) behavior for create-heavy workloads and
substantially improves object creation rates.  As summarized by
@mahrens in #4636:

"Normally, the object allocator simply checks to see if the next
object is available. The slow calls happened when dmu_object_alloc()
checks to see if it can backfill lower object numbers. This happens
every time we move on to a new L1 indirect block (i.e. every 32 *
128 = 4096 objects).  When re-checking lower object numbers, we use
the on-disk fill count (blkptr_t:blk_fill) to quickly skip over
indirect blocks that don’t have enough free dnodes (defined as an L2
with at least 393,216 of 524,288 dnodes free). Therefore, we may
find that a block of dnodes has a low (or zero) fill count, and yet
we can’t allocate any of its dnodes, because they've been allocated
in memory but not yet written to disk. In this case we have to hold
each of the dnodes and then notice that it has been allocated in
memory.

The end result is that allocating N objects in the same TXG can
require CPU usage proportional to N^2."

Add a tunable dmu_rescan_dnode_threshold to define the number of
objects that must be freed before a rescan is performed. Don't bother
to export this as a module option because testing doesn't show a
compelling reason to change it. The vast majority of the performance
gain comes from limiting the rescan to at most once per TXG.
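
The two conditions described above (enough objects freed, and at most one rescan per transaction group) can be sketched as a small decision function. This is an illustration under stated assumptions, not the dmu_object_alloc() code: `struct alloc_state`, its fields, and `should_rescan` are hypothetical names, and `RESCAN_THRESHOLD` merely stands in for dmu_rescan_dnode_threshold.

```c
#include <stdbool.h>
#include <stdint.h>

#define RESCAN_THRESHOLD 4096u  /* stand-in for dmu_rescan_dnode_threshold */

/* Hypothetical bookkeeping; not the real objset fields. */
struct alloc_state {
    uint64_t freed_since_rescan;  /* objects freed since the last rescan */
    uint64_t last_rescan_txg;     /* txg in which the last rescan ran */
};

/*
 * Decide whether the allocator should go back and look for free lower
 * object numbers. Both conditions from the commit message must hold.
 */
static bool
should_rescan(struct alloc_state *st, uint64_t txg)
{
    if (st->freed_since_rescan < RESCAN_THRESHOLD)
        return (false);  /* too few holes to be worth a rescan */
    if (st->last_rescan_txg == txg)
        return (false);  /* already rescanned in this txg */

    st->freed_since_rescan = 0;
    st->last_rescan_txg = txg;
    return (true);
}
```

The per-txg check is the part the commit message credits with most of the gain: dnodes allocated in memory but not yet written would make a second rescan in the same txg walk the same blocks again.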

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

* Implement large_dnode pool feature

Justification
-------------

This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks.  Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided.  Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks.  Then
the worst case number of blocks read is reduced from 33 to two, one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
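
The worst-case arithmetic in this paragraph can be reproduced with a short sketch. `worst_case_reads` is a hypothetical helper written for illustration; it assumes one read per dnode block plus one read per spill block, exactly as the example above does.

```c
#include <stdint.h>

#define DNODE_BLOCK_SIZE 16384u  /* 16k dnode block, as in the text */

/*
 * Hypothetical helper: worst-case number of block reads needed to fetch
 * `ndnodes` dnodes of `dnode_size` bytes each, plus one spill block per
 * dnode when `needs_spill` is nonzero.
 */
static uint32_t
worst_case_reads(uint32_t ndnodes, uint32_t dnode_size, int needs_spill)
{
    uint32_t per_block = DNODE_BLOCK_SIZE / dnode_size;
    uint32_t blocks = (ndnodes + per_block - 1) / per_block;  /* round up */

    return (blocks + (needs_spill ? ndnodes : 0));
}
```

With 512-byte dnodes that all spill, 32 dnodes fit in one block and cost 1 + 32 = 33 reads; with 1024-byte dnodes and no spill blocks, the same 32 dnodes span two blocks and cost 2 reads.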

ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill blocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.

Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.

Implementation
--------------

The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.

Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
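
The relationship between the two conventions can be sketched as below.
The helper names and SKETCH_ constants are hypothetical; only the
512-byte slot size and the "extra slots" versus "total slots"
conventions come from the description above.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SLOT_SHIFT 9              /* 512-byte slots */
#define SKETCH_MAX_DNODE_SIZE 16384u     /* one full dnode block */

/* On-disk convention: number of slots beyond the first one. */
static uint8_t
sketch_size_to_extra_slots(uint32_t dnode_size)
{
    assert(dnode_size >= 512 && dnode_size <= SKETCH_MAX_DNODE_SIZE);
    assert(dnode_size % 512 == 0);
    return ((uint8_t)((dnode_size >> SKETCH_SLOT_SHIFT) - 1));
}

/* In-memory convention: total number of slots consumed. */
static uint8_t
sketch_extra_to_num_slots(uint8_t extra_slots)
{
    return ((uint8_t)(extra_slots + 1));
}
```

A 512 byte dnode maps to 0 extra slots, which is why the zero-filled
padding in existing dnodes already reads as the legacy size.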

The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run

 # zfs set dnodesize=auto tank/fish

The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
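
A minimal sketch of the two validity rules, assuming hypothetical helper
names (the actual property parsing code is not reproduced here):

```c
#include <stdbool.h>
#include <stdint.h>

/* Internal rule: any multiple of 512 bytes up to 16k. */
static bool
sketch_dnsize_valid_internal(uint32_t size)
{
    return (size >= 512 && size <= 16384 && size % 512 == 0);
}

/* User-facing rule for literal values: powers of two from 1k to 16k. */
static bool
sketch_dnsize_valid_property(uint32_t size)
{
    return (size >= 1024 && size <= 16384 && (size & (size - 1)) == 0);
}
```

For example, 1536 passes the internal check but not the property check.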

The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interface names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.

New DMU interfaces:
  dmu_object_alloc_dnsize()
  dmu_object_claim_dnsize()
  dmu_object_reclaim_dnsize()

New ZAP interfaces:
  zap_create_dnsize()
  zap_create_norm_dnsize()
  zap_create_flags_dnsize()
  zap_create_claim_norm_dnsize()
  zap_create_link_dnsize()

The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.

These are a few noteworthy changes to key functions:

* The prototype for dnode_hold_impl() now takes a "slots" parameter.
  When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
  ensure the hole at the specified object offset is large enough to
  hold the dnode being created. The slots parameter is also used
  to ensure a dnode does not span multiple dnode blocks. In both of
  these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
  these failure cases are only possible when using DNODE_MUST_BE_FREE.

  If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
  dnode_hold_impl() will check if the requested dnode is already
  consumed as an extra dnode slot by a large dnode, in which case
  it returns ENOENT.

* The function dmu_object_alloc() advances to the next dnode block
  if dnode_hold_impl() returns an error for a requested object.
  This is because the beginning of the next dnode block is the only
  location it can safely assume to either be a hole or a valid
  starting point for a dnode.

* dnode_next_offset_level() and other functions that iterate
  through dnode blocks may no longer use a simple array indexing
  scheme. These now use the current dnode's dn_num_slots field to
  advance to the next dnode in the block. This is to ensure we
  properly skip the current dnode's bonus area and don't interpret it
  as a valid dnode.
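
The slot-aware iteration described in the last item can be sketched as
follows. The sketch_slot_t type is a hypothetical stand-in for a
512-byte on-disk slot, not the real dnode_phys_t; it only models whether
a dnode starts in the slot and how many extra slots it consumes.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SLOTS_PER_BLOCK 32 /* 16k dnode block / 512-byte slots */

/* Hypothetical stand-in for one 512-byte slot in a dnode block. */
typedef struct sketch_slot {
    uint8_t ss_allocated;   /* nonzero if a dnode starts in this slot */
    uint8_t ss_extra_slots; /* extra slots consumed by that dnode */
} sketch_slot_t;

/*
 * Count the dnodes in a block.  After visiting a dnode, advance by its
 * total slot count so that its interior (bonus area) is never
 * interpreted as a dnode header.
 */
static size_t
sketch_count_dnodes(const sketch_slot_t *blk)
{
    size_t i = 0, count = 0;

    while (i < SKETCH_SLOTS_PER_BLOCK) {
        if (blk[i].ss_allocated) {
            count++;
            i += (size_t)blk[i].ss_extra_slots + 1;
        } else {
            i++; /* hole: a single free slot */
        }
    }
    return (count);
}
```

Simple array indexing would visit every slot, including interior slots
whose bytes may happen to look like a valid dnode.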

zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.

For ZIL create log records, zdb will now display the slot count for
the object.

ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.

Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number.  This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.

ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.

Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
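
The graceful failure relies on the receiver rejecting any feature bit it
does not know. A minimal sketch, assuming hypothetical flag names and
illustrative bit positions that are not taken from the ZFS headers:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions only; not the real DMU_BACKUP_FEATURE_* values. */
#define SKETCH_FEATURE_OTHER       (1ULL << 0)
#define SKETCH_FEATURE_LARGE_DNODE (1ULL << 1)

/* A stream is acceptable only if every flag it sets is supported. */
static bool
sketch_stream_supported(uint64_t stream_flags, uint64_t supported_mask)
{
    return ((stream_flags & ~supported_mask) == 0);
}
```

A receiver whose mask lacks the large dnode bit refuses the stream up
front instead of misinterpreting its object records.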

While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.

For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.

ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
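
The bit layout can be sketched as below. The helper names are
hypothetical; only the layout (slot count in the top 8 bits, object id
in the low 48 bits of a 64-bit field) comes from the text.

```c
#include <stdint.h>

#define SKETCH_FOID_SLOT_SHIFT 56
#define SKETCH_FOID_OBJ_MASK   ((1ULL << 48) - 1)

/* Pack an object id and a slot count into one 64-bit value. */
static uint64_t
sketch_foid_pack(uint64_t obj, uint8_t slots)
{
    return ((obj & SKETCH_FOID_OBJ_MASK) |
        ((uint64_t)slots << SKETCH_FOID_SLOT_SHIFT));
}

/* Extract the object id from the low 48 bits. */
static uint64_t
sketch_foid_obj(uint64_t foid)
{
    return (foid & SKETCH_FOID_OBJ_MASK);
}

/* Extract the slot count from the uppermost 8 bits. */
static uint8_t
sketch_foid_slots(uint64_t foid)
{
    return ((uint8_t)(foid >> SKETCH_FOID_SLOT_SHIFT));
}
```

A record written before this change has zeros in the upper bits, so it
decodes to a slot count of 0.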

Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.

Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542

* Sync DMU_BACKUP_FEATURE_* flags

Flag 20 was used in OpenZFS as DMU_BACKUP_FEATURE_RESUMING.  The
DMU_BACKUP_FEATURE_LARGE_DNODE flag must be shifted to 21 and
then reserved in the upstream OpenZFS implementation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #4795

* OpenZFS 2605, 6980, 6902

2605 want to resume interrupted zfs send
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/2605
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9c3fd12

6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6980
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ea4a67f

Porting notes:
- All rsend and snapshot tests enabled and updated for Linux.
- Fix misuse of input argument in traverse_visitbp().
- Fix ISO C90 warnings and errors.
- Fix gcc 'missing braces around initializer' in
  'struct send_thread_arg to_arg =' warning.
- Replace 4 argument fletcher_4_native() with 3 argument version,
  this change was made in OpenZFS 4185 which has not been ported.
- Part of the sections for 'zfs receive' and 'zfs send' was
  rewritten and reordered to approximate upstream.
- Fix mktree xattr creation, 'user.' prefix required.
- Minor fixes to newly enabled test cases
- Long holds for volumes allowed during receive for minor registration.

* OpenZFS 6051 - lzc_receive: allow the caller to read the begin record

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6051
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/620f322

* OpenZFS 6393 - zfs receive a full send as a clone

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6394
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/68ecb2e

* OpenZFS 6536 - zfs send: want a way to disable setting of DRR_FLAG_FREERECORDS

Authored by: Andrew Stormont <astormont@racktopsystems.com>
Reviewed by: Anil Vijarnia <avijarnia@racktopsystems.com>
Reviewed by: Kim Shrier <kshrier@racktopsystems.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6536
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/880094b

* OpenZFS 6738 - zfs send stream padding needs documentation

Authored by: Eli Rosenthal <eli.rosenthal@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6738
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c20404ff

* OpenZFS 4986 - receiving replication stream fails if any snapshot exceeds refquota

Authored by: Dan McDonald <danmcd@omniti.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/4986
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5878fad

* OpenZFS 6562 - Refquota on receive doesn't account for overage

Authored by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6562
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5f7a8e6

* Implement zfs_ioc_recv_new() for OpenZFS 2605

Adds ZFS_IOC_RECV_NEW for resumable streams and preserves the legacy
ZFS_IOC_RECV user/kernel interface.  The new interface supports all
stream options but is currently only used for resumable streams.
This way updated user space utilities will interoperate with older
kernel modules.

ZFS_IOC_RECV_NEW is modeled after the existing ZFS_IOC_SEND_NEW
handler.  Non-Linux OpenZFS platforms have opted to change the
legacy interface in an incompatible fashion instead of adding a
new ioctl.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

* OpenZFS 6314 - buffer overflow in dsl_dataset_name

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6314
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d6160ee

* OpenZFS 6876 - Stack corruption after importing a pool with a too-long name

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Calling dsl_dataset_name on a dataset with a 256 byte buffer is asking
for trouble. We should check every dataset on import, using a 1024 byte
buffer and checking each time to see if the dataset's new name is longer
than 256 bytes.

OpenZFS-issue: https://www.illumos.org/issues/6876
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ca8674e

* Vectorized fletcher_4 must be 128-bit aligned

The fletcher_4_native() and fletcher_4_byteswap() functions may only
safely use the vectorized implementations when the buffer is 128-bit
aligned.  This is because both the AVX2 and SSE implementations process
four 32-bit words per iteration.  Fall back to the scalar implementation
which only processes a single 32-bit word for unaligned buffers.
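
A sketch of the alignment test and the scalar fallback, using
hypothetical function names. The vectorized path is not shown; a caller
would pick it only when the alignment check passes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if buf sits on a 128-bit (16-byte) boundary. */
static bool
sketch_is_128bit_aligned(const void *buf)
{
    return (((uintptr_t)buf & 15) == 0);
}

/* Scalar Fletcher-4: processes one 32-bit word per iteration. */
static void
sketch_fletcher_4_scalar(const uint32_t *ip, size_t nwords, uint64_t out[4])
{
    uint64_t a = 0, b = 0, c = 0, d = 0;
    size_t i;

    for (i = 0; i < nwords; i++) {
        a += ip[i];
        b += a;
        c += b;
        d += c;
    }
    out[0] = a;
    out[1] = b;
    out[2] = c;
    out[3] = d;
}
```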

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Issue #4330

* Allow building with `CFLAGS="-O0"`

If compiled with -O0, gcc doesn't do any stack frame coalescing
and -Wframe-larger-than=1024 is triggered in debug mode.
gcc 4.8 introduced a new optimization level, -Og, for debugging, which
does not trigger this warning.

Fix bench zio size, using SPA_OLD_MAXBLOCKSHIFT

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4799

* Don't allow accessing XATTR via export handle

Allowing access to XATTRs through an export handle is a very bad idea.
It would allow users to write whatever they want in fields where they
otherwise could not.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4828

* Fix get_zfs_sb race with concurrent umount

Certain ioctl operations will call get_zfs_sb, which will hold an active
count on sb without checking whether it's active or not. This will result
in use-after-free. We fix this by using atomic_inc_not_zero to make sure
we got an active sb.

P1                                          P2
---                                         ---
deactivate_locked_super(): s_active = 0
                                            zfs_sb_hold()
                                            ->get_zfs_sb(): s_active = 1
->zpl_kill_sb()
-->zpl_put_super()
--->zfs_umount()
---->zfs_sb_free(zsb)
                                            zfs_sb_rele(zsb)
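
The "increment only if not already zero" idea can be sketched in user
space with C11 atomics. This is not the kernel's atomic_inc_not_zero
implementation, just a model of the same semantics under a hypothetical
name.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Take a reference only if the count is currently nonzero.  Returns
 * false if the object is already on its way to being torn down.
 */
static bool
sketch_inc_not_zero(atomic_int *count)
{
    int old = atomic_load(count);

    while (old != 0) {
        /* On failure, `old` is refreshed with the current value. */
        if (atomic_compare_exchange_weak(count, &old, old + 1))
            return (true);
    }
    return (false);
}
```

In the race above, P2 would see s_active == 0 and back off rather than
resurrecting a superblock that P1 is about to free.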

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

* Fix Large kmem_alloc in vdev_metaslab_init

This allocation can go way over 1MB, so we should use vmem_alloc
instead of kmem_alloc.

  Large kmem_alloc(1430784, 0x1000), please file an issue...
  Call Trace:
   [<ffffffffa0324aff>] ? spl_kmem_zalloc+0xef/0x160 [spl]
   [<ffffffffa17d0c8d>] ? vdev_metaslab_init+0x9d/0x1f0 [zfs]
   [<ffffffffa17d46d0>] ? vdev_load+0xc0/0xd0 [zfs]
   [<ffffffffa17d4643>] ? vdev_load+0x33/0xd0 [zfs]
   [<ffffffffa17c0004>] ? spa_load+0xfc4/0x1b60 [zfs]
   [<ffffffffa17c1838>] ? spa_tryimport+0x98/0x430 [zfs]
   [<ffffffffa17f28b1>] ? zfs_ioc_pool_tryimport+0x41/0x80 [zfs]
   [<ffffffffa17f5669>] ? zfsdev_ioctl+0x4a9/0x4e0 [zfs]
   [<ffffffff811bacdf>] ? do_vfs_ioctl+0x2cf/0x4b0
   [<ffffffff811baf41>] ? SyS_ioctl+0x81/0xa0

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4752

* Add configure result for xattr_handler

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4828

* fh_to_dentry should return ESTALE when generation mismatch

When the generation does not match, it usually means the file pointed to
by the file handle was deleted. We should return ESTALE to indicate this.
We return ENOENT in zfs_vget since zpl_fh_to_dentry will convert it to
ESTALE.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4828

* xattr dir doesn't get purged during iput

We need to set inode->i_nlink to zero so iput will purge it. Without this, it
will get purged during shrink cache or umount, which would likely result in
deadlock due to zfs_zget waiting forever on its children which are in the
dispose_list of the same thread.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Issue #4359
Issue #3508
Issue #4413
Issue #4827

* Kill zp->z_xattr_parent to prevent pinning

zp->z_xattr_parent will pin the parent. This causes a huge issue when
unlinking a file with xattrs. Because the unlinked file is pinned, it
is never purged immediately, and because of that the xattr stuff is
never marked as unlinked. So all of the unlinked stuff stays around
until shrink cache or umount.

This change partially reverts e89260a.  This is safe because only the
zp->z_xattr_parent optimization is removed, zpl_xattr_security_init()
is still called from the zpl outside the inode lock.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Issue #4359
Issue #3508
Issue #4413
Issue #4827

* Fix RAIDZ_TEST tests

Remove stray trailing } which prevented the raidz stress tests from
running in-tree.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

* Fix PANIC: metaslab_free_dva(): bad DVA X:Y:Z

The following scenario can result in garbage in the dn_spill field.
The db->db_blkptr must be set to NULL when DNODE_FLAG_SPILL_BLKPTR
is clear to ensure the dn_spill field is cleared.

Current txg = A.
* A new spill buffer is created. Its dbuf is initialized with
  db_blkptr = NULL and it's dirtied.

Current txg = B.
* The spill buffer is modified. It's marked as dirty in this txg.
* Additional changes make the spill buffer unnecessary because the
  xattr fits into the bonus buffer, so it's removed. The dbuf is
  undirtied in this txg, but it's still referenced and cannot be
  destroyed.

Current txg = C.
* Starts syncing of txg A
* dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr
  is NULL, dbuf_check_blkptr() is called.
* The dbuf starts being written and it reaches the ready state
  (not done yet).
* A new change makes the spill buffer necessary again.
  sa_build_layouts() ends up calling dbuf_find() to locate the
  dbuf.  It finds the old dbuf because it has not been destroyed yet
  (it will be destroyed when the previous write is done and there
  are no more references). The old dbuf has db_blkptr != NULL.
* txg A write is complete and the dbuf released. However it's still
  referenced, so it's not destroyed.

Current txg = D.
* Starts syncing of txg B
* dbuf_sync_leaf() is called for the bonus buffer. Its contents are
  directly copied into the dnode, overwriting the blkptr area because,
  in txg B, the bonus buffer was big enough to hold the entire xattr.
* At this point, the db_blkptr of the spill buffer used in txg C
  gets corrupted.

Signed-off-by: Peng <peng.hse@xtaotech.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3937

* Fix handling of errors nvlist in zfs_ioc_recv_new()

zfs_ioc_recv_impl() is changed to always allocate the 'errors'
nvlist, its callers are responsible for freeing it.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4829

* Add RAID-Z routines for SSE2 instruction set, in x86_64 mode.

The patch covers low-end and older x86 CPUs.  Parity generation is
equivalent to SSSE3 implementation, but reconstruction is somewhat
slower.  Previous 'sse' implementation is renamed to 'ssse3' to
indicate highest instruction set used.

Benchmark results:
scalar_rec_p                    4    720476442
scalar_rec_q                    4    187462804
scalar_rec_r                    4    138996096
scalar_rec_pq                   4    140834951
scalar_rec_pr                   4    129332035
scalar_rec_qr                   4    81619194
scalar_rec_pqr                  4    53376668

sse2_rec_p                      4    2427757064
sse2_rec_q                      4    747120861
sse2_rec_r                      4    499871637
sse2_rec_pq                     4    522403710
sse2_rec_pr                     4    464632780
sse2_rec_qr                     4    319124434
sse2_rec_pqr                    4    205794190

ssse3_rec_p                     4    2519939444
ssse3_rec_q                     4    1003019289
ssse3_rec_r                     4    616428767
ssse3_rec_pq                    4    706326396
ssse3_rec_pr                    4    570493618
ssse3_rec_qr                    4    400185250
ssse3_rec_pqr                   4    377541245

original_rec_p                  4    691658568
original_rec_q                  4    195510948
original_rec_r                  4    26075538
original_rec_pq                 4    103087368
original_rec_pr                 4    15767058
original_rec_qr                 4    15513175
original_rec_pqr                4    10746357

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4783

* Enable zpool_upgrade test cases

Creating the pool in a striped rather than mirrored configuration
provides enough space for all upgrade tests to run.  Test case
zpool_upgrade_007_pos still fails and must be investigated so
it has been left disabled.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4852

* Prevent null dereferences when accessing dbuf kstat

In arc_buf_info(), the arc_buf_t may have no header.  If it has none,
don't try to fetch the arc buffer stats and instead just zero them.

The null dereferences were observed while accessing the dbuf kstat with
awk on a system in which millions of small files were being created in
order to overflow the system's metadata limit.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4837

* Fix dbuf_stats_hash_table_data race

Dropping DBUF_HASH_MUTEX when walking the hash list is unsafe. The dbuf
can be freed at any time.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4846

* Use native inode->i_nlink instead of znode->z_links

A mostly mechanical change, taking into account i_nlink is 32 bits vs ZFS's
64 bit on-disk link count.

We revert "xattr dir doesn't get purged during iput" (ddae16a) as this is a
more Linux-integrated fix for the same issue.

In addition, setting the initial link count on a new node has been changed
from setting one less than required in zfs_mknode() then incrementing to the
correct count in zfs_link_create() (which was somewhat bizarre in the first
place), to setting the correct count in zfs_mknode() and not incrementing it
in zfs_link_create(). This both means we no longer set the link count in
sa_bulk_update() twice (once for the initial incorrect count then again for
the correct count), as well as adhering to the Linux requirement of not
incrementing a zero link count without I_LINKABLE (see linux commit
f4e0c30c).

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #4838
Issue #227

* Implementation of SSE optimized Fletcher-4

Builds off of 1eeb4562 (Implementation of AVX2 optimized Fletcher-4)
This commit adds another implementation of the Fletcher-4 algorithm.
It is automatically selected at module load if it benchmarks higher
than all other available implementations.

The module benchmark was also amended to analyze the performance of
the byteswap-ed version of Fletcher-4, as well as the non-byteswaped
version. The average performance of the two is used to select the
fastest implementation available on the host system.
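
The selection rule can be sketched as below. The struct and function
names are hypothetical, and the throughput numbers a caller passes in
are whatever its own benchmark measured.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical benchmark record for one Fletcher-4 implementation. */
typedef struct sketch_bench {
    const char *sb_name;
    uint64_t sb_native;   /* throughput, native byte order */
    uint64_t sb_byteswap; /* throughput, byteswapped */
} sketch_bench_t;

/* Return the entry with the highest average of the two results. */
static const sketch_bench_t *
sketch_select_fastest(const sketch_bench_t *res, size_t n)
{
    const sketch_bench_t *best = NULL;
    uint64_t best_avg = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        /* Halve before adding to avoid overflow on large values. */
        uint64_t avg = res[i].sb_native / 2 + res[i].sb_byteswap / 2;

        if (best == NULL || avg > best_avg) {
            best = &res[i];
            best_avg = avg;
        }
    }
    return (best);
}
```

Averaging keeps an implementation that is fast in only one byte order
from winning over one that is consistently fast in both.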

Adds a pair of fields to an existing zcommon module parameter:
-  zfs_fletcher_4_impl (str)
    "sse2"    - new SSE2 implementation if available
    "ssse3"   - new SSSE3 implementation if available

Signed-off-by: Tyler J. Stachecki <stachecki.tyler@gmail.com>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4789

* Fix filesystem destroy with receive_resume_token

It is possible that the given DS may have hidden child (%recv)
datasets - "leftovers" resulting from the previously interrupted
'zfs receive'.  Try to remove the hidden child (%recv) and after
that try to remove the target dataset.   If the hidden child
(%recv) does not exist the original error (EEXIST) will be returned.

Signed-off-by: Roman Strashkin <roman.strashkin@nexenta.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4818

* Prevent segfaults in SSE optimized Fletcher-4

In some cases, the compiler was not respecting the GNU aligned
attribute for stack variables in 35a76a0. This was resulting in
a segfault on CentOS 6.7 hosts using gcc 4.4.7-17.  This issue
was fixed in gcc 4.6.

To prevent this from occurring, use unaligned loads and stores
for all stack and global memory references in the SSE optimized
Fletcher-4 code.

Disable zimport testing against master where this flaw exists:

TEST_ZIMPORT_VERSIONS="installed"

Signed-off-by: Tyler J. Stachecki <stachecki.tyler@gmail.com>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4862

* Update arc_summary.py for prefetch changes

Commit 7f60329 removed several kstats which arc_summary.py read.
Remove these kstats from arc_summary.py in the same way this was
handled in FreeNAS.

FreeNAS-commit: https://github.com/freenas/freenas/commit/3901f73

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4695

* Wait iput_async before evict_inodes to prevent race

Wait for iput_async before entering evict_inodes in
generic_shutdown_super. We must finish first because, when lazytime
is on or when zfs_purgedir calls zfs_zget, iput would bump i_count
from 0 to 1. This would race with the i_count check in evict_inodes,
which could then destroy the inode while we are still using it.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4854

* Fixes and enhancements of SIMD raidz parity

- Implementation lock replaced with atomic variable

- Trailing whitespace is removed from user specified parameter, to enhance
experience when using commands that add newline, e.g. `echo`

- raidz_test: remove dependency on `getrusage()` and RUSAGE_THREAD, Issue #4813

- silence `cppcheck` in vdev_raidz, partial solution of Issue #1392

- Minor fixes and cleanups

- Enable use of original parity methods in [fastest] configuration.
New opaque original ops structure, representing native methods, is added
to supported raidz methods. Original parity methods are executed if selected
implementation has NULL fn pointer.
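
The trailing-whitespace handling mentioned in the list above can be
sketched as follows. The function name is hypothetical and this is a
user-space illustration, not the module parameter handler itself.

```c
#include <ctype.h>
#include <string.h>

/* Remove trailing whitespace (including a newline) in place. */
static void
sketch_strip_trailing_ws(char *s)
{
    size_t len = strlen(s);

    while (len > 0 && isspace((unsigned char)s[len - 1]))
        s[--len] = '\0';
}
```

This lets a value written with `echo`, which appends a newline, match
the same implementation name as one written without it.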

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4813
Issue #1392

* RAIDZ parity kstat rework

Print table with speed of methods for each implementation.
Last line describes contents of [fastest] selection.

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4860

* Fix NULL pointer in zfs_preumount from 1d9b3bd

When zfs_domount fails zsb will be freed, and its caller
mount_nodev/get_sb_nodev will call deactivate_locked_super, which calls
into zfs_preumount.

In order to make sure we don't touch any nonexistent stuff, we must make sure
s_fs_info is NULL in the fail path so zfs_preumount can easily check that.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4867
Issue #4854

* Illumos Crypto Port module added to enable native encryption in zfs

A port of the Illumos Crypto Framework to a Linux kernel module (found
in module/icp). This is needed to do the actual encryption work. We cannot
use the Linux kernel's built in crypto api because it is only exported to
GPL-licensed modules. Having the ICP also means the crypto code can run on
any of the other kernels under OpenZFS. I ended up porting over most of the
internals of the framework, which means that porting over other API calls (if
we need them) should be fairly easy. Specifically, I have ported over the API
functions related to encryption, digests, macs, and crypto templates. The ICP
is able to use assembly-accelerated encryption on amd64 machines and AES-NI
instructions on Intel chips that support it. There are place-holder
directories for similar assembly optimizations for other architectures
(although they have not been written).

Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4329

* Fix for compilation error when using the kernel's CONFIG_LOCKDEP

Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4329

* zloop: print backtrace from core files

Find the core file by using `/proc/sys/kernel/core_pattern`

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4874

* Fix for metaslab_fastwrite_unmark() assert failure

Currently there is an issue where metaslab_fastwrite_unmark() unmarks
fastwrites on vdev_t's that have never had fastwrites marked on them.
The 'fastwrite mark' is essentially a count of outstanding bytes that
will be written to a vdev and is used in syncing context. The problem
stems from the fact that the vdev_pending_fastwrite field is not being
transferred over when replacing a top-level vdev. As a result, the
metaslab is marked for fastwrite on the old vdev and unmarked on the
new one, which brings the fastwrite count below zero. This fix simply
assigns vdev_pending_fastwrite from the old vdev to the new one so
this count is not lost.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4267

* Remove znode's z_uid/z_gid member

Remove duplicate z_uid/z_gid member which are also held in the
generic vfs inode struct. This is done by first removing the members
from struct znode and then using the KUID_TO_SUID/KGID_TO_SGID
macros to access the respective member from struct inode. In cases
where the uid/gids are being marshalled from/to disk, use the newly
introduced zfs_(uid|gid)_(read|write) functions to properly
save the uids rather than the internal kernel representation.

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4685
Issue #227

* Check whether the kernel supports i_uid/gid_read/write helpers

Since the concept of a kuid and the need to translate from it to
ordinary integer type was added in kernel version 3.5 implement necessary
plumbing to be able to detect this condition during compile time. If
the kernel doesn't support the kuid then just fall back to directly
accessing the respective struct inode's members

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4685
Issue #227

* Fix uninitialized variable in avl_add()

Silence the following warning when compiling with gcc 5.4.0.
Specifically gcc (Ubuntu 5.4.0-6ubuntu1~16.04.1) 5.4.0 20160609.

module/avl/avl.c: In function ‘avl_add’:
module/avl/avl.c:647:2: warning: ‘where’ may be used uninitialized
    in this function [-Wmaybe-uninitialized]
  avl_insert(tree, new_node, where);

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

* Fix sync behavior for disk vdevs

Prior to b39c22b, which was first generally available in the 0.6.5
release as b39c22b, ZoL never actually submitted synchronous read or write
requests to the Linux block layer.  This means the vdev_disk_dio_is_sync()
function had always returned false and, therefore, the completion in
dio_request_t.dr_comp was never actually used.

In b39c22b, synchronous ZIO operations were translated to synchronous
BIO requests in vdev_disk_io_start().  The follow-on commits 5592404 and
aa159af fixed several problems introduced by b39c22b.  In particular,
5592404 introduced the new flag parameter "wait" to __vdev_disk_physio()
but under ZoL, since vdev_disk_physio() is never actually used, the wait
flag was always zero so the new code had no effect other than to cause
a bug in the use of the dio_request_t.dr_comp which was fixed by aa159af.

The original rationale for introducing synchronous operations in b39c22b
was to hurry certain requests through the BIO layer which would have
otherwise been subject to its unplug timer which would increase the
latency.  This behavior of the unplug timer, however, went away during the
transition of the plug/unplug system between kernels 2.6.32 and 2.6.39.

To handle the unplug timer behavior on 2.6.32-2.6.35 kernels the
BIO_RW_UNPLUG flag is used as a hint to suppress the plugging behavior.

For kernels 2.6.36-2.6.38, the REQ_UNPLUG macro will be available and
is used for the same purpose.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4858

* Limit the amount of dnode metadata in the ARC

Metadata-intensive workloads can cause the ARC to become permanently
filled with dnode_t objects as they're pinned by the VFS layer.
Subsequent data-intensive workloads may only benefit from about
25% of the potential ARC (arc_c_max - arc_meta_limit).

In order to help track metadata usage more precisely, the other_size
metadata arcstat has been replaced with dbuf_size, dnode_size and bonus_size.

The new zfs_arc_dnode_limit tunable, which defaults to 10% of
zfs_arc_meta_limit, defines the minimum number of bytes which is desirable
to be consumed by dnodes.  Attempts to evict non-metadata will trigger
async prune tasks if the space used by dnodes exceeds this limit.

The new zfs_arc_dnode_reduce_percent tunable specifies the amount by
which the excess dnode space is attempted to be pruned as a percentage of
the amount by which zfs_arc_dnode_limit is being exceeded.  By default,
it tries to unpin 10% of the dnodes.

The problem of dnode metadata pinning was observed with the following
testing procedure (in this example, zfs_arc_max is set to 4GiB):

    - Create a large number of small files until arc_meta_used exceeds
      arc_meta_limit (3GiB with default tuning) and arc_prune
      starts increasing.

    - Create a 3GiB file with dd.  Observe arc_meta_used.  It will still
      be around 3GiB.

    - Repeatedly read the 3GiB file and observe arc_meta_limit as before.
      It will continue to stay around 3GiB.

With this modification, space for the 3GiB file is gradually made
available as subsequent demands on th…
nedbass pushed a commit to nedbass/zfs that referenced this issue Aug 26, 2016
Skip ctldir in zfs_rezget, otherwise they will always get invalidated. This
will cause funny behaviour for the mounted snapdirs. Especially for
Linux >= 3.18, d_invalidate will detach the mountpoint and prevent anyone
from automounting it again as long as someone is still using the detached mount.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#4514
Closes openzfs#4661
Closes openzfs#4672
nedbass pushed a commit to nedbass/zfs that referenced this issue Sep 3, 2016
nedbass pushed a commit to nedbass/zfs that referenced this issue Sep 5, 2016
@pivot69

pivot69 commented Dec 8, 2016

I'm using ZFS on Ubuntu Server 16.04.1 and I had the same symlink error when accessing the snapshots. The error appeared after receiving incremental snapshots from another Ubuntu server (running Ubuntu Server 14.04).

After updating the affected server and trying everything in my mind (atime off, compression off, mountpoints etc) it still did not work. I did a reboot and suddenly everything worked again - until I transferred new incremental snapshots.

This led me to try unmounting and remounting the filesystem after each time I transferred snapshots, and that seemed to do the trick! Now I just put the remount-commands into my script, and I am no longer bothered by this bug.

This is not a fix, it is only a workaround. But in case someone cannot get it working, even with the newest versions of everything, then try this! :)
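The remount step described above could be scripted roughly as follows. This is a minimal sketch, not the script actually used in this comment: `remount_dataset` and the `ZFS` override variable are hypothetical names, and the dataset name is an assumption supplied by the caller. It only wraps `zfs unmount` and `zfs mount`, and it will fail if the dataset is busy.

```shell
#!/bin/sh
# Hypothetical helper: unmount and remount a dataset after `zfs recv`,
# following the workaround described above.
# ZFS can be overridden (e.g. ZFS="echo zfs") to preview the commands
# without running them.
ZFS="${ZFS:-zfs}"

remount_dataset() {
    dataset="$1"
    [ -n "$dataset" ] || { echo "usage: remount_dataset <dataset>" >&2; return 2; }
    # Only remount if the unmount succeeded.
    $ZFS unmount "$dataset" && $ZFS mount "$dataset"
}
```

Calling `remount_dataset <dataset>` right after each incremental receive would mirror the order of operations described in the comment.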

Just some additional info:
The ubuntu server sending snapshots (14.04) has the ubuntu-zfs package installed.
[ 1.570547] ZFS: Loaded module v0.6.5.7-1~trusty, ZFS pool version 5000, ZFS filesystem version 5
The ubuntu server receiving snapshots (16.04.1) has zfs native
[ 17.440504] ZFS: Loaded module v0.6.5.6-0ubuntu10, ZFS pool version 5000, ZFS filesystem version 5

@zffocussss

I get the same error messages when using autofs together with a Docker volume mount.

@jgoerzen

I encountered this when attempting to access a snapshot under /usr/.zfs/snapshot on my system. On this system, I have filesystems like this:

tank/hephaestus-1                    20.2G   135G       96K  /tank/hephaestus-1
tank/hephaestus-1/ROOT               15.9G  8.78G     1.22G  /
tank/hephaestus-1/ROOT/opt           1.51G   135G      823M  /opt
tank/hephaestus-1/ROOT/usr           12.4G  9.47G     10.5G  /usr
tank/hephaestus-1/var                4.35G  6.96G     3.04G  legacy

There are no bind mounts involved.

The relevant snapshots would have been created with zfs snapshot -r tank@foo. I observed df showing directories under /.zfs/snapshot -- showing snapshots of tank/hephaestus-1/ROOT. It did not make the /usr snapshot available in any way.

@helamonster

I have just encountered this for the first time myself.
My environment:

root@myserver ~ # lsb_release  -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:        20.04
Codename:       focal

root@myserver ~ # uname -a
Linux myserver.mydomain.com 5.4.0-66-generic #74-Ubuntu SMP Wed Jan 27 22:54:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

root@myserver ~ # zfs version
zfs-2.0.3-0york0~20.04
zfs-kmod-2.0.3-0york0~20.04

Observe:

root@myserver ~ # pwd
/root

root@myserver ~ # ls -l  /var/.zfs/snapshot/zrepl_20211116_231549_000/
ls: cannot access '/var/.zfs/snapshot/zrepl_20211116_231549_000/': Too many levels of symbolic links

root@myserver ~ # cd   /var/.zfs/snapshot/zrepl_20211116_231549_000/
-bash: cd: /var/.zfs/snapshot/zrepl_20211116_231549_000/: Too many levels of symbolic links

root@myserver ~ # cd /var/.zfs

root@myserver .zfs # cd snapshot/

root@myserver snapshot # cd zrepl_20211116_231549_000
-bash: cd: zrepl_20211116_231549_000: Too many levels of symbolic links

root@myserver snapshot # mount.zfs rpool/ROOT/ubuntu/var@zrepl_20211116_231549_000 /var/.zfs/snapshot/zrepl_20211116_231549_000/  
filesystem 'rpool/ROOT/ubuntu/var@zrepl_20211116_231549_000' is already mounted

Interestingly, I found that the snapshot is indeed mounted but at a different level in the filesystem (/.zfs/snapshot instead of /var/.zfs/snapshot). Strange...

root@myserver snapshot # mount.zfs rpool/ROOT/ubuntu/var@zrepl_20211116_231549_000 /mnt/var
filesystem 'rpool/ROOT/ubuntu/var@zrepl_20211116_231549_000' is already mounted

root@myserver snapshot # mount | grep zrepl
rpool/ROOT/ubuntu/var@zrepl_20211116_231549_000 on /.zfs/snapshot/zrepl_20211116_231549_000 type zfs (ro,relatime,xattr,posixacl)

root@myserver snapshot # ls /.zfs/snapshot/zrepl_20211116_231549_000
backups  cache  empty  lib  local  lock  log  mail  opt  run  snap  spool  tmp

root@myserver snapshot # zfs list -r rpool -o name,used,avail,refer,canmount,mounted,mountpoint
NAME                    USED  AVAIL     REFER  CANMOUNT  MOUNTED  MOUNTPOINT
rpool                  30.9G   182G       96K  off       no       none
rpool/ROOT             30.8G   182G       96K  off       no       none
rpool/ROOT/ubuntu      30.8G   182G     2.88G  on        yes      /
rpool/ROOT/ubuntu/tmp  48.3M   182G      440K  on        yes      /tmp
rpool/ROOT/ubuntu/var  27.9G   182G     11.7G  on        yes      /var
rpool/temp               96K   182G       96K  on        yes      /temp

I can confirm that trying to access the correct .zfs/snapshot directory causes the snapshot to be mounted at the parent directory's .zfs/snapshot directory instead. Odd.

root@myserver ~ # mount | grep 'rpool.*zfs'
rpool/ROOT/ubuntu on / type zfs (rw,relatime,xattr,posixacl)
rpool/ROOT/ubuntu/tmp on /tmp type zfs (rw,relatime,xattr,posixacl)
rpool/ROOT/ubuntu/var on /var type zfs (rw,relatime,xattr,posixacl)
rpool/temp on /temp type zfs (rw,noatime,xattr,posixacl)
rpool/ROOT/ubuntu/var@zrepl_20211116_231549_000 on /.zfs/snapshot/zrepl_20211116_231549_000 type zfs (ro,relatime,xattr,posixacl)

root@myserver ~ # umount /.zfs/snapshot/zrepl_20211116_231549_000

root@myserver ~ # mount | grep 'rpool.*zfs'
rpool/ROOT/ubuntu on / type zfs (rw,relatime,xattr,posixacl)
rpool/ROOT/ubuntu/tmp on /tmp type zfs (rw,relatime,xattr,posixacl)
rpool/ROOT/ubuntu/var on /var type zfs (rw,relatime,xattr,posixacl)
rpool/temp on /temp type zfs (rw,noatime,xattr,posixacl)

root@myserver ~ # ls /var/.zfs/snapshot/zrepl_20211116_231549_000
ls: cannot open directory '/var/.zfs/snapshot/zrepl_20211116_231549_000': Too many levels of symbolic links

root@myserver ~ # mount | grep 'rpool.*zfs'
rpool/ROOT/ubuntu on / type zfs (rw,relatime,xattr,posixacl)
rpool/ROOT/ubuntu/tmp on /tmp type zfs (rw,relatime,xattr,posixacl)
rpool/ROOT/ubuntu/var on /var type zfs (rw,relatime,xattr,posixacl)
rpool/temp on /temp type zfs (rw,noatime,xattr,posixacl)
rpool/ROOT/ubuntu/var@zrepl_20211116_231549_000 on /.zfs/snapshot/zrepl_20211116_231549_000 type zfs (ro,relatime,xattr,posixacl)

Nothing is written to dmesg, syslog, or /proc/spl/kstat/zfs/dbgmsg as a result of executing these failing commands.

I can access files under the snapshot where it was mounted (/.zfs/snapshot/zrepl_20211116_231549_000) just fine.

@parke

parke commented May 23, 2022

Fyi, it appears this issue may be related to #9958.

@rptb1

rptb1 commented Jan 4, 2023

Just another "me too", since this issue is marked closed but is still happening.

$ uname -a
Linux plover 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.1 LTS
Release:	22.04
Codename:	jammy
$ dpkg-query -l 'zfs*' | grep ii
ii  zfs-initramfs  2.1.4-0ubuntu0.1 amd64        OpenZFS root filesystem capabilities for Linux - initramfs
ii  zfs-zed        2.1.4-0ubuntu0.1 amd64        OpenZFS Event Daemon
ii  zfsutils-linux 2.1.4-0ubuntu0.1 amd64        command-line tools to manage OpenZFS filesystems
$ 
$ ls -ld /var/lib/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19
ls: cannot access '/var/lib/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': Too many levels of symbolic links

The system is set up according to https://openzfs.github.io/openzfs-docs/Getting%20Started/Ubuntu/Ubuntu%2022.04%20Root%20on%20ZFS.html

It's not consistent for all filesystems:

$ zfs list -H -o mountpoint | grep '^/' | sort -u | while read m; do sudo ls -ld "$m/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19"; done
drwx------ 2 root root 30 Nov 13 09:48 //.zfs/snapshot/backup-pelican-2022-11-13-09-53-19
ls: cannot access '/boot/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': No such file or directory
drwxr-xr-x 28 rb rb 46 Nov 12 19:55 /home/rb/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19
ls: cannot access '/home/rb/.cache/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': No such file or directory
ls: cannot access '/home/rb/private/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': No such file or directory
ls: cannot access '/home/rb/snap/firefox/common/.cache/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': No such file or directory
ls: cannot access '/home/rb/tmp/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': No such file or directory
drwx------ 9 root root 16 Nov  8 14:43 /root/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19
ls: cannot access '/tmp/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': No such file or directory
ls: cannot access '/usr/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': No such file or directory
ls: cannot access '/var/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': No such file or directory
ls: cannot access '/var/cache/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': No such file or directory
ls: cannot access '/var/lib/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': Too many levels of symbolic links
ls: cannot access '/var/lib/AccountsService/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': Too many levels of symbolic links
ls: cannot access '/var/lib/docker/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': Too many levels of symbolic links
ls: cannot access '/var/lib/NetworkManager/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': Too many levels of symbolic links
ls: cannot access '/var/log/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': Too many levels of symbolic links
ls: cannot access '/var/snap/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': Too many levels of symbolic links
ls: cannot access '/var/spool/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': Too many levels of symbolic links
ls: cannot access '/var/tmp/.zfs/snapshot/backup-pelican-2022-11-13-09-53-19': No such file or directory

I noticed a large number of mounts of snapshots:

$ mount | grep 'type zfs' | grep '@' | wc -l
76

Perhaps there is some resource limit that's being exceeded?

I frequently browse lists of snapshots in Emacs dired, which will be doing at least ls -l on .zfs/snapshot. I wonder if that triggers a lot of mounts at once?

EDIT: Later on, I'm getting messages from zfs-auto-snapshot like this:

cannot destroy snapshot kiwi-rpool/ROOT/ubuntu_8lgjho/var/lib@zfs-auto-snap_frequent-2023-01-04-1100: dataset is busy
cannot destroy snapshot kiwi-rpool/ROOT/ubuntu_8lgjho/var/lib@zfs-auto-snap_frequent-2023-01-04-1045: dataset is busy

Notice that these are in the same filesystem as the original problem. Does this also suggest a problem with automatic mount/unmount?

Can someone point me at whatever code/daemon/agent is responsible for this mounting and unmounting of snapshots? Perhaps I can figure it out.

@rptb1

rptb1 commented Jan 4, 2023

Can someone point me at whatever code/daemon/agent is responsible for this mounting and unmounting of snapshots? Perhaps I can figure it out.

Digging deeper, I have bogus entries like this in /etc/mtab:

kiwi-rpool/ROOT/ubuntu_8lgjho/var/lib@backup-pelican-2022-11-13-09-53-19 /.zfs/snapshot/backup-pelican-2022-11-13-09-53-19 zfs ro,relatime,xattr,posixacl 0 0

Notice that this is a snapshot of the filesystem kiwi-rpool/ROOT/ubuntu_8lgjho/var/lib, but look at where it's mounted: under /.zfs, which is the wrong place. (Something like this was mentioned in #4514 (comment).)

And if I unmount it from there, then the problem is fixed:

rb@plover:/var/lib/.zfs/snapshot$ sudo umount /.zfs/snapshot/backup-pelican-2022-12-15-07-36-11
rb@plover:/var/lib/.zfs/snapshot$ ls -ld /.zfs/snapshot/backup-pelican-2023-01-03-08-08-09
drwxr-xr-x 79 root root 81 Dec 11 08:28 /.zfs/snapshot/backup-pelican-2023-01-03-08-08-09
rb@plover:/var/lib/.zfs/snapshot$ 

But even more weirdness, I can get the wrong directory contents to appear at /.zfs/snapshot:

$ ls -l /.zfs/snapshot/backup-pelican-2022-12-15-07-36-11
[shows correctly the things I expect to see in / ]
$ mount | grep backup-pelican-2023-01-03-08-08-09
kiwi-rpool/ROOT/ubuntu_8lgjho/var/lib@backup-pelican-2023-01-03-08-08-09 on /.zfs/snapshot/backup-pelican-2023-01-03-08-08-09 type zfs (ro,relatime,xattr,posixacl)
$ ls -l /.zfs/snapshot/backup-pelican-2022-12-15-07-36-11
[shows incorrectly the contents of /var/lib]

So it seems to me that there is something very broken about the automounting of snapshots. They're mounting in /.zfs and ignoring their filesystem mountpoint, giving incorrect contents, and probably causing the symbolic link problem.
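A rough way to spot such misplaced mounts is to compare each snapshot's mount point with the path one would expect from the dataset's `mountpoint` property. The sketch below is only an illustration: `find_misplaced_snapshots`, `mountpoint_of` and the `MOUNTPOINT_OF` override are hypothetical names, it assumes `/proc/mounts`-style input with no escaped spaces in paths, and it skips datasets whose mountpoint is `legacy` or `none`.

```shell
#!/bin/sh
# Hypothetical check: report snapshot mounts whose mount point is not
# "<dataset mountpoint>/.zfs/snapshot/<snapshot name>".
# Reads /proc/mounts-style lines on stdin.

# Default lookup of a dataset's mountpoint property via `zfs get`.
mountpoint_of() {
    zfs get -H -o value mountpoint "$1"
}
# MOUNTPOINT_OF names the lookup command; it can be replaced for testing.
MOUNTPOINT_OF="${MOUNTPOINT_OF:-mountpoint_of}"

find_misplaced_snapshots() {
    while read -r source target fstype _rest; do
        [ "$fstype" = "zfs" ] || continue
        case "$source" in
            *@*) ;;           # only snapshot sources contain an '@'
            *) continue ;;
        esac
        dataset="${source%@*}"
        snap="${source#*@}"
        base="$($MOUNTPOINT_OF "$dataset")" || continue
        case "$base" in
            /*) ;;            # absolute path: usable
            *) continue ;;    # "legacy", "none" or empty: skip
        esac
        [ "$base" = "/" ] && base=""   # avoid a double slash for the root dataset
        expected="$base/.zfs/snapshot/$snap"
        [ "$target" = "$expected" ] || echo "$source: mounted at $target, expected $expected"
    done
}
```

Running `find_misplaced_snapshots < /proc/mounts` would then print one line per snapshot mounted somewhere other than under its own filesystem's `.zfs/snapshot` directory.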

EDIT: I can get correct and incorrect contents in the same location like this:

rb@plover:/.zfs/snapshot$ ls backup-pelican-2023-01-03-08-08-09
AccountsService      flatpak         [and more contents of /var/lib]
rb@plover:/.zfs/snapshot$ sudo umount backup-pelican-2023-01-03-08-08-09
rb@plover:/.zfs/snapshot$ ls backup-pelican-2023-01-03-08-08-09
bin   etc  [and more contents of /]

Does this warrant another issue? Showing the wrong contents at the /.zfs/snapshot path seems like it might be.

@rptb1

rptb1 commented Jan 4, 2023

Possibly relevant

* All mounts are handled automatically by an user mode helper which invokes
* the mount procedure. Unmounts are handled by allowing the mount
* point to expire so the kernel may automatically unmount it.
and I note that this is different to the FreeBSD module.

The call to the user mode helper (mount.zfs) is constructed here:

/*
 * Construct a mount point path from sb of the ctldir inode and dirent
 * name, instead of from d_path(), so that chroot'd process doesn't fail
 * on mount.zfs(8).
 */
snprintf(full_path, MAXPATHLEN, "%s/.zfs/snapshot/%s",
    zfsvfs->z_vfs->vfs_mntpoint ? zfsvfs->z_vfs->vfs_mntpoint : "",
    dname(dentry));

Note that this gives up and uses "" in the conditional, which might cause everything to be mounted at /.zfs. The gun is smoking. I am suspicious!
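To make the suspicion concrete, here is a small sketch that reproduces only the format string from the snippet above. `snapshot_mount_path` is a hypothetical name, and an empty first argument stands in for a NULL `vfs_mntpoint`; whether that field is actually NULL in the failing cases is not confirmed here.

```shell
#!/bin/sh
# Mimics the "%s/.zfs/snapshot/%s" format used to build the mount path.
# $1: filesystem mount point ("" stands in for a NULL vfs_mntpoint)
# $2: snapshot name
snapshot_mount_path() {
    printf '%s/.zfs/snapshot/%s\n' "$1" "$2"
}

snapshot_mount_path "/var/lib" "backup-a"   # /var/lib/.zfs/snapshot/backup-a
snapshot_mount_path ""         "backup-a"   # /.zfs/snapshot/backup-a
```

With an empty mount point the result lands directly under /.zfs/snapshot, which matches the misplaced mounts seen in /etc/mtab.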

@rptb1

rptb1 commented Jan 4, 2023

Note that this gives up and uses "" in the conditional, which might cause everything to be mounted at /.zfs. The gun is smoking. I am suspicious!

I have tried to confirm this is happening by enabling debug logging in the hope that I would see a message from

dprintf("mount; name=%s path=%s\n", full_name, full_path);
but I suspect my zfs module is not built with ZFS_DEBUG and so this is disabled.

In case someone with a debug build can do it, here is what I tried:

cd /var/lib/.zfs/snapshot
umount * /.zfs/snapshot/*
# clear debug log
echo 0 > /proc/spl/kstat/zfs/dbgmsg
# enable debug messages
echo 1 >> /sys/module/zfs/parameters/zfs_flags
# provoke "too many symbolic links"
ls backup-pelican-2023-01-03-08-08-09
# stop debug messages
echo 0 >> /sys/module/zfs/parameters/zfs_flags
# examine debug log
cat /proc/spl/kstat/zfs/dbgmsg

I see messages from zfs_dbgmsg calls but not dprintf.

@nachtgeist
Contributor

nachtgeist commented Jan 13, 2023

So it seems to me that there is something very broken about the automounting of snapshots. They're mounting in /.zfs and ignoring their filesystem mountpoint, giving incorrect contents, and probably causing the symbolic link problem.

Yup. Got another case:

  • Debian Bullseye with ZFS 2.1.7, systemd
  • rootfs on ZFS
  • no bind mounts
  • zfs get mountpoint -r tank -t filesystem gives:
NAME                        PROPERTY    VALUE               SOURCE
tank                        mountpoint  /                   local
tank/home                   mountpoint  /home               inherited from tank
tank/opt                    mountpoint  /opt                inherited from tank
tank/rootfs                 mountpoint  /                   local
tank/root                   mountpoint  /root               inherited from tank
tank/srv                    mountpoint  /srv                inherited from tank

Listing and auto-mounting of snapshots via $MNTPOINT/.zfs/snapshot/* works everywhere EXCEPT in /.
When I realized the connection explained below, there were snapshots with identical names present for different datasets - date-time-strings, actually - which had
been created by sanoid via cron. So let's say there are identically named snapshots of tank/rootfs and tank/root like so:

tank/rootfs@a
tank/rootfs@b

tank/root@a
tank/root@b

This leads to:

  • Accessing /root/.zfs/snapshot/* works, i.e. ls, cd, cat, cp...
  • let's do an umount /root/.zfs/snapshot/* (It'll become clear why in a second)
  • ls /.zfs/snapshot/ gives the dreaded "too many levels of symbolic links"...but oh look!
  • cd /root/.zfs/snapshot/a; ls now displays the contents of snapshot "a" of tank/rootfs@a, NOT tank/root@a!

This got me thinking and I tinkered with the init script from initramfs-tools and just applied this patch:

diff -Naur /usr/share/initramfs-tools/init.orig /usr/share/initramfs-tools/init
--- /usr/share/initramfs-tools/init.orig        2023-01-13 02:30:14.717469685 +0100
+++ /usr/share/initramfs-tools/init     2023-01-13 02:30:09.565173985 +0100
@@ -57,7 +57,7 @@
 export break=
 export init=/sbin/init
 export readonly=y
-export rootmnt=/root
+export rootmnt=/rfs
 mkdir "$rootmnt"
 export debug=
 export panic=

The only thing this changes is the name of the mount point where the rootfs-on-zfs gets initially mounted during boot by the initramfs. Re-build the initramfs and reboot...

Now, ls /.zfs/snapshot nicely lists a and b, ls /.zfs/snapshot/* yields empty directories, however. And nothing got mounted.

Let's mkdir -p /rfs/.zfs/snapshot and ls /.zfs/snapshot/a.

No error printed, again empty directories and nothing got mounted.

Now let's mkdir /rfs/.zfs/snapshot/a and ls /.zfs/snapshot/a again yields "ls: cannot access '/.zfs/snapshot/b': Too many levels of symbolic links",

BUT

ls /rfs/.zfs/snapshot/a shows the expected snapshot content.

Under /rfs. The mountpoint where the initramfs's init mounted tank/rfs to before pivot-rooting it to /.

Note: This did NOT happen on a rootfs-on-zfs setup on Debian Buster with ZFS 2.0.3 from buster-backports. However, that buster machine still ran sysvinit instead of systemd.

I'll try the steps described here on bullseye-sysvinit and buster-systemd machines in the coming days...
#9461 (comment) seems to describe the root cause

@rptb1
Copy link

rptb1 commented Feb 15, 2023

I'm not sure why this issue is still closed. It's clearly not fixed.

@cjthompson

cjthompson commented May 18, 2023

I'm not sure why this issue is still closed. It's clearly not fixed.

I just got this error today as well (zfs v2.1.9-2ubuntu1)

@ceastus

ceastus commented Jun 23, 2023

I just upgraded from Ubuntu 22.04 to 23.04 (reusing my pool) and I see the same issue, just as nachtgeist described.
ZFS 2.1.9-2ubuntu1.1
6.2.0-23-generic #23-Ubuntu

Listing and auto-mounting of snapshots via $MNTPOINT/.zfs/snapshot/* works everywhere EXCEPT in /.

The snaps for / are mounted under /root/.zfs/snapshot

root@xander:/.zfs# ls -l snapshot/pre-rebuild-2023-06-22/
ls: cannot access 'snapshot/pre-rebuild-2023-06-22/': Too many levels of symbolic links
2 root@xander:/.zfs# mount |grep xander.ubuntu
xander/ubuntu on / type zfs (rw,nodev,noatime,xattr,posixacl)
xander/ubuntu@pre-rebuild-2023-06-22 on /root/.zfs/snapshot/pre-rebuild-2023-06-22 type zfs (ro,relatime,xattr,posixacl)

I don't see this issue when booting from the ISO, so that meshes.
I can manually mount snaps to the correct locations.

@GregorKopka
Contributor

@ahrens @behlendorf
please reopen.

@behlendorf
Contributor

behlendorf commented Dec 6, 2023

Reopening. The issue here is further described in #9461 (comment)

@behlendorf behlendorf reopened this Dec 6, 2023
@mariaa144

mariaa144 commented Apr 16, 2024

I faced this problem on NixOS with ZFS as root. It was upsetting that I couldn't quickly get to my snapshots to restore a Firefox session I had accidentally deleted.

A simple workaround that let me access my snapshots is to mount the snapshot manually in another directory instead of going through the .zfs directory.

I did the following:

# Find my snapshot
zfs list -t snapshot

# create a mount directory
sudo mkdir /mnt/snapshot_test

# mount the snapshot
sudo mount -t zfs rpool/nixos/home@zfs-auto-snap_daily-2024-04-16-08h01 /mnt/snapshot_test/

This allowed me to copy my files from the snapshot in the directory /mnt/snapshot_test.

@pbek

pbek commented May 4, 2024

I have the same issue when I try to use a .zfs/snapshot as source for creating a backup with restic inside a docker container. The snapshots were done by sanoid on NixOS.

This e.g. also yields the error inside the docker container:

# "/backup" is the mount inside the docker container
ls /backup/home/.zfs/snapshot/autosnap_2024-05-04_11:00:06_hourly

Outside the docker container, this doesn't yield the error:

ls /home/.zfs/snapshot/autosnap_2024-05-04_11:00:06_hourly
