File incorrectly zeroed when receiving incremental stream that toggles -L #6224

Closed
zrav opened this issue Jun 13, 2017 · 59 comments · Fixed by #10383
Labels
Component: Send/Recv ("zfs send/recv" feature) · Type: Defect (Incorrect behavior, e.g. crash, hang)

Comments

@zrav

zrav commented Jun 13, 2017

System information

Type                  Version/Name
Distribution Name     Ubuntu
Distribution Version  17.04
Linux Kernel          4.10.0-22-generic
Architecture          x86_64
ZFS Version           0.6.5.9
SPL Version           0.6.5.9

Describe the problem you're observing

I have found a data corruption issue in zfs send. In pools using 1M recordsize, incremental sends without the -L flag sometimes silently zero out some files. The results are repeatable. Scrub does not find any errors.

Tested on Ubuntu Xenial with 0.6.5.6 and Zesty with 0.6.5.9 on the same systems.

Describe how to reproduce the problem

Source pool:

root@oddity:~# modinfo zfs
filename:       /lib/modules/4.10.0-22-generic/kernel/zfs/zfs/zfs.ko
version:        0.6.5.9-2
license:        CDDL
author:         OpenZFS on Linux
description:    ZFS
srcversion:     42C4AB70887EA26A9970936
depends:        spl,znvpair,zcommon,zunicode,zavl
vermagic:       4.10.0-22-generic SMP mod_unload
...
root@oddity:~# zpool status
  pool: tank
 state: ONLINE
  scan: scrub canceled on Sun Jun 11 09:52:38 2017
config:

        NAME                                 STATE     READ WRITE CKSUM
        tank                                 ONLINE       0     0     0
          raidz2-0                           ONLINE       0     0     0
            ata-ST4000VN000-1H4168_XXXXXXXX  ONLINE       0     0     0
            ata-ST4000VN000-1H4168_XXXXXXXX  ONLINE       0     0     0
            ata-ST4000VN000-2AH166_XXXXXXXX  ONLINE       0     0     0
            ata-ST4000VN000-1H4168_XXXXXXXX  ONLINE       0     0     0
            ata-ST4000VN000-1H4168_XXXXXXXX  ONLINE       0     0     0
            ata-ST4000VN000-1H4168_XXXXXXXX  ONLINE       0     0     0
            ata-ST4000VN000-1H4168_XXXXXXXX  ONLINE       0     0     0
            ata-ST4000VN000-1H4168_XXXXXXXX  ONLINE       0     0     0

errors: No known data errors

root@oddity:~# zpool get all tank
NAME  PROPERTY                    VALUE                       SOURCE
tank  size                        29T                         -
tank  capacity                    57%                         -
tank  altroot                     -                           default
tank  health                      ONLINE                      -
tank  guid                        18319514605431597227        default
tank  version                     -                           default
tank  bootfs                      -                           default
tank  delegation                  on                          default
tank  autoreplace                 off                         default
tank  cachefile                   -                           default
tank  failmode                    wait                        default
tank  listsnapshots               off                         default
tank  autoexpand                  off                         default
tank  dedupditto                  0                           default
tank  dedupratio                  1.00x                       -
tank  free                        12.3T                       -
tank  allocated                   16.7T                       -
tank  readonly                    off                         -
tank  ashift                      12                          local
tank  comment                     -                           default
tank  expandsize                  -                           -
tank  freeing                     0                           default
tank  fragmentation               5%                          -
tank  leaked                      0                           default
tank  feature@async_destroy       enabled                     local
tank  feature@empty_bpobj         active                      local
tank  feature@lz4_compress        active                      local
tank  feature@spacemap_histogram  active                      local
tank  feature@enabled_txg         active                      local
tank  feature@hole_birth          active                      local
tank  feature@extensible_dataset  active                      local
tank  feature@embedded_data       active                      local
tank  feature@bookmarks           enabled                     local
tank  feature@filesystem_limits   enabled                     local
tank  feature@large_blocks        active                      local

root@oddity:~# zfs get all tank
NAME  PROPERTY              VALUE                  SOURCE
tank  type                  filesystem             -
tank  creation              Fri May 13 19:22 2016  -
tank  used                  11.8T                  -
tank  available             8.13T                  -
tank  referenced            222K                   -
tank  compressratio         1.03x                  -
tank  mounted               yes                    -
tank  quota                 none                   default
tank  reservation           none                   default
tank  recordsize            1M                     local
tank  mountpoint            /tank                  default
tank  sharenfs              off                    default
tank  checksum              on                     default
tank  compression           lz4                    local
tank  atime                 off                    local
tank  devices               on                     default
tank  exec                  on                     default
tank  setuid                on                     default
tank  readonly              off                    default
tank  zoned                 off                    default
tank  snapdir               hidden                 default
tank  aclinherit            passthrough            local
tank  canmount              on                     default
tank  xattr                 sa                     local
tank  copies                1                      default
tank  version               5                      -
tank  utf8only              off                    -
tank  normalization         none                   -
tank  casesensitivity       mixed                  -
tank  vscan                 off                    default
tank  nbmand                off                    default
tank  sharesmb              off                    default
tank  refquota              none                   default
tank  refreservation        none                   default
tank  primarycache          all                    default
tank  secondarycache        all                    default
tank  usedbysnapshots       154K                   -
tank  usedbydataset         222K                   -
tank  usedbychildren        11.8T                  -
tank  usedbyrefreservation  0                      -
tank  logbias               latency                default
tank  dedup                 off                    default
tank  mlslabel              none                   default
tank  sync                  standard               default
tank  refcompressratio      1.00x                  -
tank  written               0                      -
tank  logicalused           12.8T                  -
tank  logicalreferenced     41K                    -
tank  filesystem_limit      none                   default
tank  snapshot_limit        none                   default
tank  filesystem_count      none                   default
tank  snapshot_count        none                   default
tank  snapdev               hidden                 default
tank  acltype               posixacl               local
tank  context               none                   default
tank  fscontext             none                   default
tank  defcontext            none                   default
tank  rootcontext           none                   default
tank  relatime              off                    default
tank  redundant_metadata    all                    default
tank  overlay               off                    default

The target pool:

root@ubackup:~# modinfo zfs
filename:       /lib/modules/4.10.0-22-generic/kernel/zfs/zfs/zfs.ko
version:        0.6.5.9-2
license:        CDDL
author:         OpenZFS on Linux
description:    ZFS
srcversion:     42C4AB70887EA26A9970936
depends:        spl,znvpair,zcommon,zunicode,zavl
vermagic:       4.10.0-22-generic SMP mod_unload
...
root@ubackup:~# zpool status
  pool: btank
 state: ONLINE
  scan: scrub repaired 0 in 3h36m with 0 errors on Tue Jun 13 13:34:08 2017
config:

        NAME                                          STATE     READ WRITE CKSUM
        btank                                         ONLINE       0     0     0
          raidz1-0                                    ONLINE       0     0     0
            ata-WDC_WD30EZRX-00MMMB0_WD-XXXXXXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD30EZRX-00MMMB0_WD-XXXXXXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD30EZRX-00MMMB0_WD-XXXXXXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD30EZRX-00MMMB0_WD-XXXXXXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD30EZRX-00MMMB0_WD-XXXXXXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD30EZRX-00MMMB0_WD-XXXXXXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD30EZRX-00MMMB0_WD-XXXXXXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD30EZRX-00MMMB0_WD-XXXXXXXXXXXX  ONLINE       0     0     0
            ata-WDC_WD30EZRX-00MMMB0_WD-XXXXXXXXXXXX  ONLINE       0     0     0

errors: No known data errors
root@ubackup:~# zpool get all btank
NAME   PROPERTY                    VALUE                       SOURCE
btank  size                        24.5T                       -
btank  capacity                    23%                         -
btank  altroot                     -                           default
btank  health                      ONLINE                      -
btank  guid                        14601555808903550874        default
btank  version                     -                           default
btank  bootfs                      -                           default
btank  delegation                  on                          default
btank  autoreplace                 off                         default
btank  cachefile                   -                           default
btank  failmode                    wait                        default
btank  listsnapshots               off                         default
btank  autoexpand                  off                         default
btank  dedupditto                  0                           default
btank  dedupratio                  1.00x                       -
btank  free                        18.9T                       -
btank  allocated                   5.64T                       -
btank  readonly                    off                         -
btank  ashift                      12                          local
btank  comment                     -                           default
btank  expandsize                  -                           -
btank  freeing                     0                           default
btank  fragmentation               9%                          -
btank  leaked                      0                           default
btank  feature@async_destroy       enabled                     local
btank  feature@empty_bpobj         active                      local
btank  feature@lz4_compress        active                      local
btank  feature@spacemap_histogram  active                      local
btank  feature@enabled_txg         active                      local
btank  feature@hole_birth          active                      local
btank  feature@extensible_dataset  active                      local
btank  feature@embedded_data       active                      local
btank  feature@bookmarks           enabled                     local
btank  feature@filesystem_limits   enabled                     local
btank  feature@large_blocks        active                      local
root@ubackup:~# zfs get all btank
NAME   PROPERTY              VALUE                  SOURCE
btank  type                  filesystem             -
btank  creation              Mon Jun 12 18:41 2017  -
btank  used                  5.01T                  -
btank  available             16.1T                  -
btank  referenced            171K                   -
btank  compressratio         1.03x                  -
btank  mounted               yes                    -
btank  quota                 none                   default
btank  reservation           none                   default
btank  recordsize            1M                     local
btank  mountpoint            /btank                 default
btank  sharenfs              off                    default
btank  checksum              on                     default
btank  compression           lz4                    local
btank  atime                 off                    local
btank  devices               on                     default
btank  exec                  on                     default
btank  setuid                on                     default
btank  readonly              off                    default
btank  zoned                 off                    default
btank  snapdir               hidden                 default
btank  aclinherit            passthrough            local
btank  canmount              on                     default
btank  xattr                 sa                     local
btank  copies                1                      default
btank  version               5                      -
btank  utf8only              on                     -
btank  normalization         formD                  -
btank  casesensitivity       mixed                  -
btank  vscan                 off                    default
btank  nbmand                off                    default
btank  sharesmb              off                    default
btank  refquota              none                   default
btank  refreservation        none                   default
btank  primarycache          all                    default
btank  secondarycache        all                    default
btank  usedbysnapshots       0                      -
btank  usedbydataset         171K                   -
btank  usedbychildren        5.01T                  -
btank  usedbyrefreservation  0                      -
btank  logbias               latency                default
btank  dedup                 off                    default
btank  mlslabel              none                   default
btank  sync                  disabled               local
btank  refcompressratio      1.00x                  -
btank  written               171K                   -
btank  logicalused           5.18T                  -
btank  logicalreferenced     40K                    -
btank  filesystem_limit      none                   default
btank  snapshot_limit        none                   default
btank  filesystem_count      none                   default
btank  snapshot_count        none                   default
btank  snapdev               hidden                 default
btank  acltype               posixacl               local
btank  context               none                   default
btank  fscontext             none                   default
btank  defcontext            none                   default
btank  rootcontext           none                   default
btank  relatime              off                    default
btank  redundant_metadata    all                    default
btank  overlay               off                    default

While the issue was observed with multiple datasets, I'll focus on a smaller one. First, the source (some uneventful snapshots omitted):

root@oddity:~# zfs list -o space -r tank/dataz/Backup
NAME               AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
tank/dataz/Backup  8.13T   126G      461K    126G              0          0
root@oddity:~# zfs list -t snapshot -r tank/dataz/Backup
NAME                                                      USED  AVAIL  REFER  MOUNTPOINT
tank/dataz/Backup@zfs-auto-snap_monthly-2017-06-01-0100      0      -   125G  -
tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-12-1900       0      -   125G  -
tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-12-2000    205K      -   125G  -
tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-12-2100       0      -   126G  -
tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-13-0900       0      -   126G  -

The initial send to ubackup has been performed with the -L and -e flags (not sure if this is relevant). Then performing an incremental send with the -L flag produces the expected result (size differences are due to different pool geometry):

root@oddity:~# zfs send -L -e -I tank/dataz/Backup@zfs-auto-snap_monthly-2017-06-01-0100 tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-13-0900 | ssh ubackup "zfs receive btank/oddity/tank/dataz/Backup"

root@ubackup:~# zfs list -o space -r btank/oddity/tank/dataz/Backup
NAME                            AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
btank/oddity/tank/dataz/Backup  16.0T   133G      754K    133G              0          0
root@ubackup:~# zfs list -t snapshot -r btank/oddity/tank/dataz/Backup
NAME                                                                   USED  AVAIL  REFER  MOUNTPOINT
btank/oddity/tank/dataz/Backup@zfs-auto-snap_monthly-2017-06-01-0100  14.2K      -   131G  -
btank/oddity/tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-12-1900   14.2K      -   131G  -
btank/oddity/tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-12-2000    156K      -   131G  -
btank/oddity/tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-12-2100   14.2K      -   133G  -
btank/oddity/tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-13-0900       0      -   133G  -

But repeating the same send without the -L flag results in corruption:

root@ubackup:~# zfs rollback -r btank/oddity/tank/dataz/Backup@zfs-auto-snap_monthly-2017-06-01-0100

root@oddity:~# zfs send -e -I tank/dataz/Backup@zfs-auto-snap_monthly-2017-06-01-0100 tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-13-0900 | ssh ubackup "zfs receive btank/oddity/tank/dataz/Backup"

root@ubackup:~# zfs list -o space -r btank/oddity/tank/dataz/Backup
NAME                            AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
btank/oddity/tank/dataz/Backup  16.0T   133G     19.0G    114G              0          0
root@ubackup:~# zfs list -t snapshot -r btank/oddity/tank/dataz/Backup
NAME                                                                   USED  AVAIL  REFER  MOUNTPOINT
btank/oddity/tank/dataz/Backup@zfs-auto-snap_monthly-2017-06-01-0100  14.2K      -   131G  -
btank/oddity/tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-12-1900   14.2K      -   131G  -
btank/oddity/tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-12-2000    156K      -   112G  -
btank/oddity/tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-12-2100   14.2K      -   114G  -
btank/oddity/tank/dataz/Backup@zfs-auto-snap_hourly-2017-06-13-0900       0      -   114G  -

Notice how the REFER size drops and USEDSNAP increases between the two runs. I looked around to find where the data had gone missing and found several random files whose reported size on disk was reduced to 512 bytes. Reading these files returns data of the full original length, but the content is all zeros.
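A rough way to sweep a replica for such files (the path below assumes the default mountpoint of the dataset above; GNU find and stat are assumed): flag any regular file whose apparent size is non-trivial but whose allocated size has collapsed to a single 512-byte block.

# Hypothetical sweep for zeroed-out files on the receiving side
find /btank/oddity/tank/dataz/Backup -type f -size +1M | while read -r f; do
    alloc=$(stat -c %b "$f")                # 512-byte blocks actually allocated
    [ "$alloc" -le 1 ] && echo "suspect: $f"
done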

I repeated the whole process multiple times, also recreating the target pool, with the same result. The affected files are always the same, but I haven't found a common characteristic among them. Scrubbing both pools finds no errors. Syslog shows nothing unusual. I have never seen anything like this before. The only notable change I made recently is that I started using SA-based xattrs in May, but that might be unrelated.

I am happy to provide more info as far as I'm able.

@zrav
Author

zrav commented Jun 13, 2017

I think I found the unifying characteristic of all corrupted files: All of them are in directories that have been updated via file creation or deletion between snapshots. Which of the files in the directory get corrupted seems to be random, though.

@zrav
Author

zrav commented Jun 16, 2017

I'll try to put something together. I'm busy for the next week though, so it'll take a while.

@zrav
Author

zrav commented Jun 19, 2017

I can reproduce the issue consistently with this script, tested on two different machines (0.6.5.6 and 0.6.5.9):

#!/bin/bash

BASE_DS="tank/issue6224"

zfs create -o recordsize=1M -o compression=lz4 $BASE_DS
zfs create $BASE_DS/source
cd /$BASE_DS/source
wget https://github.com/zfsonlinux/zfs/releases/download/zfs-0.6.5.10/zfs-0.6.5.10.tar.gz
zfs snapshot $BASE_DS/source@snap1
zfs send -L -e $BASE_DS/source@snap1 | zfs receive $BASE_DS/target
cp zfs-0.6.5.10.tar.gz zfs-0.6.5.10.tar.gz.1
zfs snapshot $BASE_DS/source@snap2
zfs send -e -i $BASE_DS/source@snap1 $BASE_DS/source@snap2 | zfs receive $BASE_DS/target

After running, check the size on disk of the files in the target dataset using du or ncdu (a sample check is sketched below the listing). Note that the first file's size on disk is only 512 bytes and its content reads back as all zeros.

NAME                   AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
tank/issue6224         8.13T  10.5M         0    205K              0      10.3M
tank/issue6224/source  8.13T  5.10M      136K   4.97M              0          0
tank/issue6224/target  8.13T  5.22M     2.52M   2.70M              0          0
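A quick way to confirm the result (paths assume the default mountpoints created by the script above):

du -h /tank/issue6224/target/zfs-0.6.5.10.tar.gz                    # size on disk: 512 bytes
du -h --apparent-size /tank/issue6224/target/zfs-0.6.5.10.tar.gz    # apparent size: still ~2.5M
cmp /tank/issue6224/source/zfs-0.6.5.10.tar.gz \
    /tank/issue6224/target/zfs-0.6.5.10.tar.gz                      # differs: target reads back as zeros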

@loli10K
Contributor

loli10K commented Jun 24, 2017

@zrav thank you for providing a working reproducer!

I've not had the time to debug this properly yet, but after a cursory reading of the code I think we're seeing the result of the following block in receive_object():

	/*
	 * If we are losing blkptrs or changing the block size this must
	 * be a new file instance.  We must clear out the previous file
	 * contents before we can change this type of metadata in the dnode.
	 */
	if (err == 0) {
		int nblkptr;

		nblkptr = deduce_nblkptr(drro->drr_bonustype,
		    drro->drr_bonuslen);

		if (drro->drr_blksz != doi.doi_data_block_size ||
		    nblkptr < doi.doi_nblkptr) {
			err = dmu_free_long_range(rwa->os, drro->drr_object,
			    0, DMU_OBJECT_END);
			if (err != 0)
				return (SET_ERROR(EINVAL));
		}
	}

When we send the incremental stream, the OBJECT record for the corrupted file's object number (inumof($corrupted_file)) carries a block size different from the one already on disk (which was written by the first send with -L), so we "clear out the previous file contents":

First send (blksz = 1M)

[root@centos ~]# zfs send -L $POOLNAME/source@snap1 | zstreamdump -vv | grep "object = `inum /mnt/$POOLNAME/source/file1.dat`"
OBJECT object = 2 type = 19 bonustype = 44 blksz = 1048576 bonuslen = 176
FREE object = 2 offset = 5242880 length = -1
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
FREE object = 2 offset = 5242880 length = 1068498944

Incremental send (blksz = 128K)

[root@centos ~]# zfs send -i $POOLNAME/source@snap1 $POOLNAME/source@snap2 | zstreamdump -vv | grep "object = `inum /mnt/$POOLNAME/source/file1.dat`"
OBJECT object = 2 type = 19 bonustype = 44 blksz = 131072 bonuslen = 176
FREE object = 2 offset = 5242880 length = -1

The resulting file on disk is:

[root@centos ~]# zdb -ddddddd -bbbbbbb $POOLNAME/target `inum /mnt/$POOLNAME/target/file1.dat`
Dataset testpool/target [ZPL], ID 76, cr_txg 18, 5.03M, 12 objects, rootbp DVA[0]=<0:1527000:200> DVA[1]=<0:1527200:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=62L/62P fill=12 cksum=dd1da4093:545ff5a1b20:10a4e0a522bd7:2427c83816b784

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         2    2   128K   128K      0     512   128K    0.00  ZFS plain file (K=inherit) (Z=inherit)
                                               176   bonus  System attributes
	dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED 
	dnode maxblkid: 0
	path	/file1.dat
	uid     0
	gid     0
	atime	Sat Jun 24 07:26:16 2017
	mtime	Sat Jun 12 07:26:16 2018
	ctime	Sat Jun 12 07:26:16 2018
	crtime	Sat Jun 12 07:26:15 2018
	gen	11
	mode	100644
	size	5242880
	parent	34
	links	1
	pflags	40800000004
	xattr	3
Indirect blocks:
               0 L1  HOLE [L1 ZFS plain file] size=20000L birth=43L

The code I was mentioning earlier was integrated with 6c59307 (Illumos 3693 - restore_object uses at least two transactions to restore an object), even before f1512ee (Illumos 5027 - zfs large block support): if all this is true, I think every ZFS version since Illumos 5027 is affected, possibly on other platforms too.

@loli10K
Contributor

loli10K commented Jun 27, 2017

This reproduces on current master and Illumos, which is quite troubling. Cross-platform reproducer here https://gist.github.com/loli10K/af70fb0f1aae7b174822aa5657e07c28.

ZoL:

root@linux:~# cat /sys/module/zfs/version 
0.7.0-rc4_67_g7e35ea783
root@linux:~# bash -x ./issue-6224.sh 
+ POOLNAME=testpool
+ is_linux
++ uname
+ [[ Linux == \L\i\n\u\x ]]
+ return 0
+ TMPDIR=/var/tmp
+ mountpoint -q /var/tmp
+ zpool destroy testpool
+ fallocate -l 128m /var/tmp/zpool.dat
+ zpool create -O mountpoint=/mnt/testpool testpool /var/tmp/zpool.dat
+ zfs create -o recordsize=1M testpool/source
+ dd if=/dev/urandom of=/mnt/testpool/source/file1.dat bs=1M count=5
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 0.552689 s, 9.5 MB/s
+ zfs snapshot testpool/source@snap1
+ zfs send -L testpool/source@snap1
+ cat /var/tmp/full.dat
++ inum /mnt/testpool/source/file1.dat
++ stat -c %i /mnt/testpool/source/file1.dat
+ zstreamdump -v
+ grep 'object = 2'
OBJECT object = 2 type = 19 bonustype = 44 blksz = 1048576 bonuslen = 168
FREE object = 2 offset = 5242880 length = -1
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
WRITE object = 2 type = 19 checksum type = 7 compression type = 0
FREE object = 2 offset = 5242880 length = 1068498944
+ cat /var/tmp/full.dat
+ zfs receive -F testpool/target
+ cp /mnt/testpool/source/file1.dat /mnt/testpool/source/full.dat
+ du -sh /mnt/testpool/source/file1.dat /mnt/testpool/source/full.dat /mnt/testpool/target/file1.dat
5.1M	/mnt/testpool/source/file1.dat
5.1M	/mnt/testpool/source/full.dat
5.1M	/mnt/testpool/target/file1.dat
+ zfs snapshot testpool/source@snap2
+ zfs send -i testpool/source@snap1 testpool/source@snap2
+ cat /var/tmp/incr.dat
++ inum /mnt/testpool/source/file1.dat
++ stat -c %i /mnt/testpool/source/file1.dat
+ grep 'object = 2'
+ zstreamdump -v
OBJECT object = 2 type = 19 bonustype = 44 blksz = 131072 bonuslen = 168
FREE object = 2 offset = 5242880 length = -1
+ cat /var/tmp/incr.dat
+ zfs receive -F testpool/target
+ du -sh /mnt/testpool/source/file1.dat /mnt/testpool/source/full.dat /mnt/testpool/target/file1.dat /mnt/testpool/target/full.dat
5.1M	/mnt/testpool/source/file1.dat
5.1M	/mnt/testpool/source/full.dat
512	/mnt/testpool/target/file1.dat
5.1M	/mnt/testpool/target/full.dat
++ inum /mnt/testpool/target/file1.dat
++ stat -c %i /mnt/testpool/target/file1.dat
+ zdb -ddddddd -bbbbbbb testpool/target 2
Dataset testpool/target [ZPL], ID 268, cr_txg 16, 5.03M, 8 objects, rootbp DVA[0]=<0:14fa000:200> DVA[1]=<0:14fa200:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=50L/50P fill=8 cksum=db27241b5:541d80cfdbd:10b6c0498b32d:2497961710b23a

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         2    2   128K   128K      0     512   128K    0.00  ZFS plain file (K=inherit) (Z=inherit)
                                               168   bonus  System attributes
	dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED 
	dnode maxblkid: 0
	path	/file1.dat
	uid     0
	gid     0
	atime	Mon Jun 26 19:38:50 2017
	mtime	Mon Jun 26 19:38:49 2017
	ctime	Mon Jun 26 19:38:49 2017
	crtime	Mon Jun 26 19:38:49 2017
	gen	9
	mode	100644
	size	5242880
	parent	34
	links	1
	pflags	40800000004
Indirect blocks:
               0 L1  HOLE [L1 ZFS plain file] size=20000L birth=35L


root@linux:~# 

Illumos (SmartOS):

[root@52-54-00-d3-7a-01 ~]# uname -v
joyent_20170622T212149Z
[root@52-54-00-d3-7a-01 ~]# bash -x ./issue-6224.sh
+ POOLNAME=testpool
+ is_linux
++ uname
+ [[ SunOS == \L\i\n\u\x ]]
+ return 1
+ TMPDIR=/tmp
+ zpool destroy testpool
+ mkfile 128m /tmp/zpool.dat
+ zpool create -O mountpoint=/mnt/testpool testpool /tmp/zpool.dat
+ zfs create -o recordsize=1M testpool/source
+ dd if=/dev/urandom of=/mnt/testpool/source/file1.dat bs=1M count=5
5+0 records in
5+0 records out
5242880 bytes transferred in 0.078813 secs (66523111 bytes/sec)
+ zfs snapshot testpool/source@snap1
+ zfs send -L testpool/source@snap1
+ cat /tmp/full.dat
+ zstreamdump -v
++ inum /mnt/testpool/source/file1.dat
++ stat -c %i /mnt/testpool/source/file1.dat
+ grep 'object = 8'
OBJECT object = 8 type = 19 bonustype = 44 blksz = 1048576 bonuslen = 168
FREE object = 8 offset = 5242880 length = -1
WRITE object = 8 type = 19 checksum type = 7 compression type = 0
WRITE object = 8 type = 19 checksum type = 7 compression type = 0
WRITE object = 8 type = 19 checksum type = 7 compression type = 0
WRITE object = 8 type = 19 checksum type = 7 compression type = 0
WRITE object = 8 type = 19 checksum type = 7 compression type = 0
FREE object = 8 offset = 5242880 length = 1068498944
+ cat /tmp/full.dat
+ zfs receive -F testpool/target
+ cp /mnt/testpool/source/file1.dat /mnt/testpool/source/full.dat
+ du -sh /mnt/testpool/source/file1.dat /mnt/testpool/source/full.dat /mnt/testpool/target/file1.dat
 5.0M   /mnt/testpool/source/file1.dat
 5.0M   /mnt/testpool/source/full.dat
 5.0M   /mnt/testpool/target/file1.dat
+ zfs snapshot testpool/source@snap2
+ zfs send -i testpool/source@snap1 testpool/source@snap2
+ cat /tmp/incr.dat
+ zstreamdump -v
++ inum /mnt/testpool/source/file1.dat
++ stat -c %i /mnt/testpool/source/file1.dat
+ grep 'object = 8'
OBJECT object = 8 type = 19 bonustype = 44 blksz = 131072 bonuslen = 168
FREE object = 8 offset = 5242880 length = -1
+ cat /tmp/incr.dat
+ zfs receive -F testpool/target
+ du -sh /mnt/testpool/source/file1.dat /mnt/testpool/source/full.dat /mnt/testpool/target/file1.dat /mnt/testpool/target/full.dat
 5.0M   /mnt/testpool/source/file1.dat
 5.0M   /mnt/testpool/source/full.dat
   0K   /mnt/testpool/target/file1.dat
 5.0M   /mnt/testpool/target/full.dat
++ inum /mnt/testpool/target/file1.dat
++ stat -c %i /mnt/testpool/target/file1.dat
+ zdb -ddddddd -bbbbbbb testpool/target 8
Dataset testpool/target [ZPL], ID 51, cr_txg 15, 5.03M, 9 objects, rootbp DVA[0]=<0:14f8200:200> DVA[1]=<0:14f8400:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=49L/49P fill=9 cksum=b47a0b8ce:44a5957f452:d88c20f8e067:1d7addab9d69db

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         8    2   128K   128K      0   128K    0.00  ZFS plain file (K=inherit) (Z=inherit)
                                        168   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED 
        dnode maxblkid: 0
        path    /file1.dat
        uid     0
        gid     0
        atime   Mon Jun 26 17:40:50 2017
        mtime   Mon Jun 26 17:40:50 2017
        ctime   Mon Jun 26 17:40:50 2017
        crtime  Mon Jun 26 17:40:50 2017
        gen     8
        mode    100644
        size    5242880
        parent  4
        links   1
        pflags  40800000004
Indirect blocks:
               0 L1  HOLE [L1 ZFS plain file] size=20000L birth=34L


[root@52-54-00-d3-7a-01 ~]# 

@behlendorf behlendorf added this to the 0.7.0 milestone Jun 27, 2017
@behlendorf behlendorf removed this from the 0.7.0 milestone Jul 25, 2017
@rottegift

FWIW, although it says above that there is a cross-platform reproducer that works on Illumos, @lundman notes that it hasn't come up on the Slack and wonders if @ahrens has seen this.

(Reproduces in openzfsonosx too, and annoyingly I have a few affected files in backups that thankfully I haven't had to restore from.)

@zrav
Author

zrav commented Mar 7, 2018

@behlendorf any particular reason this wasn't revisited?

@behlendorf
Contributor

Available developer resources. I'll add it to the 0.8 milestone so we don't lose track of it.

@loli10K
Contributor

loli10K commented Nov 3, 2018

@ahrens
Member

ahrens commented Nov 7, 2018

@loli10K that issue is related but different. I would guess the problem here is with turning -L on/off between incrementals. The incremental will report the drr_blksz in the OBJECT record as different from what is on the receiving system, so the recv assumes that the object must have been freed and reallocated, because the block size can't change on the sending system (if there are multiple blocks).

The solution might be a bit tricky, but I think it will have to involve disallowing some kinds of zfs receive.
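As a concrete illustration of that mismatch (dataset name is a placeholder; the blksz values mirror the zstreamdump output earlier in this thread), the same snapshot advertises a different block size in its OBJECT record depending on -L, and that field is all the receiver's heuristic has to go on:

zfs send -L pool/fs@snap | zstreamdump -v | grep '^OBJECT'    # blksz = 1048576
zfs send    pool/fs@snap | zstreamdump -v | grep '^OBJECT'    # blksz = 131072 (large blocks split)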

@ahrens ahrens changed the title from "Silent corruption in incremental send without -L flag with large block pool" to "File incorrectly zeroed when receiving incremental stream that toggles -L" Nov 7, 2018
@ahrens
Member

ahrens commented Nov 7, 2018

@pcd1193182 or @tcaputi also have some experience with zfs send/recv.

@tcaputi
Contributor

tcaputi commented Nov 7, 2018

We are having the same kind of issue in openzfs/openzfs#705... I don't have a solution right at this moment (other than simply forcing zfs or the user to use -L if the filesystem uses large blocks). Both of these issues stem from the fact that we are trying to infer whether an object was deleted from a few properties of the object, but this would be much more solid if we could somehow know for sure that this happened. I'll try to look into it more today.

@sjau

sjau commented Nov 7, 2018

I just tried this on an encrypted dataset with raw sending. Since I use legacy mountpoint I altered the script to:

#!/usr/bin/env bash
set -x

BASE_DS="tankSubi/encZFS/issue6224"
MNT="/tmp/test-6224"

zfs create -o recordsize=1M -o compression=lz4 $BASE_DS
zfs create $BASE_DS/source
mkdir -p $MNT
mount -t zfs $BASE_DS/source $MNT
cd $MNT
wget https://github.com/zfsonlinux/zfs/releases/download/zfs-0.6.5.10/zfs-0.6.5.10.tar.gz
zfs snapshot $BASE_DS/source@snap1
zfs send -L -e --raw $BASE_DS/source@snap1 | zfs receive $BASE_DS/target
cp zfs-0.6.5.10.tar.gz zfs-0.6.5.10.tar.gz.1
zfs snapshot $BASE_DS/source@snap2
zfs send -e --raw -i $BASE_DS/source@snap1 $BASE_DS/source@snap2 | zfs receive $BASE_DS/target
zfs list -t snapshot -r $BASE_DS

and it produced:

root@subi:/tmp# ./6224
+ BASE_DS=tankSubi/encZFS/issue6224
+ MNT=/tmp/test-6224
+ zfs create -o recordsize=1M -o compression=lz4 tankSubi/encZFS/issue6224
+ zfs create tankSubi/encZFS/issue6224/source
+ mkdir -p /tmp/test-6224
+ mount -t zfs tankSubi/encZFS/issue6224/source /tmp/test-6224
+ cd /tmp/test-6224
+ wget https://github.com/zfsonlinux/zfs/releases/download/zfs-0.6.5.10/zfs-0.6.5.10.tar.gz
--2018-11-07 20:13:24--  https://github.com/zfsonlinux/zfs/releases/download/zfs-0.6.5.10/zfs-0.6.5.10.tar.gz
Resolving github.com (github.com)... 140.82.118.4, 140.82.118.3
Connecting to github.com (github.com)|140.82.118.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-production-release-asset-2e65be.s3.amazonaws.com/437011/79cf5bd6-5109-11e7-8255-038691a19c2c?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20181107%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20181107T191325Z&X-Amz-Expires=300&X-Amz-Signature=47b1a953908e5f4436159c482f08d2c65c2d417ae94afab36bea78134d24f80c&X-Amz-SignedHeaders=host&actor_id=0&response-content-disposition=attachment%3B%20filename%3Dzfs-0.6.5.10.tar.gz&response-content-type=application%2Foctet-stream [following]
--2018-11-07 20:13:25--  https://github-production-release-asset-2e65be.s3.amazonaws.com/437011/79cf5bd6-5109-11e7-8255-038691a19c2c?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20181107%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20181107T191325Z&X-Amz-Expires=300&X-Amz-Signature=47b1a953908e5f4436159c482f08d2c65c2d417ae94afab36bea78134d24f80c&X-Amz-SignedHeaders=host&actor_id=0&response-content-disposition=attachment%3B%20filename%3Dzfs-0.6.5.10.tar.gz&response-content-type=application%2Foctet-stream
Resolving github-production-release-asset-2e65be.s3.amazonaws.com (github-production-release-asset-2e65be.s3.amazonaws.com)... 52.216.134.19                                                                                                
Connecting to github-production-release-asset-2e65be.s3.amazonaws.com (github-production-release-asset-2e65be.s3.amazonaws.com)|52.216.134.19|:443... connected.                                                                            
HTTP request sent, awaiting response... 200 OK
Length: 2597676 (2.5M) [application/octet-stream]
Saving to: ‘zfs-0.6.5.10.tar.gz’

zfs-0.6.5.10.tar.gz                                         100%[========================================================================================================================================>]   2.48M  3.26MB/s    in 0.8s    

2018-11-07 20:13:26 (3.26 MB/s) - ‘zfs-0.6.5.10.tar.gz’ saved [2597676/2597676]

+ zfs snapshot tankSubi/encZFS/issue6224/source@snap1
+ zfs send -L -e --raw tankSubi/encZFS/issue6224/source@snap1
+ zfs receive tankSubi/encZFS/issue6224/target
+ cp zfs-0.6.5.10.tar.gz zfs-0.6.5.10.tar.gz.1
+ zfs snapshot tankSubi/encZFS/issue6224/source@snap2
+ zfs receive tankSubi/encZFS/issue6224/target
+ zfs send -e --raw -i tankSubi/encZFS/issue6224/source@snap1 tankSubi/encZFS/issue6224/source@snap2
+ zfs list -t snapshot -r tankSubi/encZFS/issue6224
NAME                                     USED  AVAIL  REFER  MOUNTPOINT
tankSubi/encZFS/issue6224/source@snap1   136K      -  2.69M  -
tankSubi/encZFS/issue6224/source@snap2     0B      -  5.21M  -
tankSubi/encZFS/issue6224/target@snap1   120K      -  2.67M  -
tankSubi/encZFS/issue6224/target@snap2     0B      -  5.20M  -

Also, after mounting, it seems to be fine:

root@subi:/tmp# mkdir snp2
root@subi:/tmp# zfs load-key tankSubi/encZFS/issue6224/target
Enter passphrase for 'tankSubi/encZFS/issue6224/target': 
root@subi:/tmp# mount -t zfs tankSubi/encZFS/issue6224/target /tmp/snp2
root@subi:/tmp# cd snp2/
root@subi:/tmp/snp2# ls -al
total 5114
drwxr-xr-x  2 root root       4 Nov  7 20:13 .
drwxrwxrwt 24 root root     860 Nov  7 20:17 ..
-rw-r--r--  1 root root 2597676 Jun 14  2017 zfs-0.6.5.10.tar.gz
-rw-r--r--  1 root root 2597676 Nov  7 20:13 zfs-0.6.5.10.tar.gz.1

It seems that this bug is not triggered when using raw send on an encrypted dataset.

@seonwoolee

seonwoolee commented Nov 7, 2018

Right, because --raw sends the data exactly as it is on disk. That implies not changing the block size, which is also what -L does.

@tcaputi
Contributor

tcaputi commented Nov 7, 2018

raw sends imply -L for large-block datasets. Encrypted data cannot be split into smaller blocks.
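A sketch of confirming that (dataset names are assumptions; an encrypted, recordsize=1M dataset with existing snapshots is presumed):

zfs send --raw -i pool/enc@snap1 pool/enc@snap2 | zstreamdump -v | grep '^OBJECT'
# expected: OBJECT ... blksz = 1048576 ... (the on-disk 1M block size is preserved)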

@sjau

sjau commented Nov 7, 2018

ok, thanks for the explanation.

@behlendorf behlendorf added the Type: Defect Incorrect behavior (e.g. crash, hang) label Mar 15, 2019
@behlendorf behlendorf removed this from the 0.8.0 milestone Mar 15, 2019
@behlendorf behlendorf added the Component: Send/Recv "zfs send/recv" feature label Apr 13, 2019
@behlendorf behlendorf added this to the 0.8.0 milestone Apr 17, 2019
behlendorf added a commit to behlendorf/zfs that referenced this issue Apr 24, 2019
When receiving an object in a send stream the receive_object()
function must determine if it is an existing or new object.  This
is normally straightforward since that object number will usually
not be allocated, and therefore it must be a new object.

However, when the object exists there are two possible scenarios.

1) The object may have been freed and an entirely new object
   allocated.  In which case it needs to be reallocated to free
   any attached spill block and to set the new attributes (i.e.
   block size, bonus size, etc).  Or,

2) The object's attributes, like block size, were modified at
   the source but it is the same original object.  In which case
   only those attributes should be updated, and everything else
   preserved.

The issue is that this determination is accomplished using a set
of heuristics from the OBJECT record.  Unfortunately, these fields
aren't sufficient to always distinguish between these two cases.

The result is that a change in the object's block size will
result in it being reallocated.  As part of this reallocation any
spill block associated with the object will be freed.

When performing a normal send/recv this issue will most likely
manifest itself as a file with missing xattrs.  This is because
when the xattr=sa property is set the xattrs can be stored in
this lost spill block.

If this issue occurs when performing a raw send then the missing
spill block will trigger an authentication error.  This error will
prevent the receiving side from accessing the damaged dnode block.
Furthermore, if the first dnode block is damaged in this way it will
make it impossible to mount the received snapshot.

This change resolves the issue by updating the sender to always
include a SPILL record for each OBJECT record with a spill block.
This allows the missing spill block to be recreated if it's freed
during receive_object().

The major advantage of this approach is that it is backwards
compatible with existing versions of 'zfs receive'.   This means
there's no need to add an incompatible feature flag which is only
understood by the latest versions.  Older versions of the software
which already know how to handle spill blocks will do the right thing.

The downside to this approach is that it can increase the size of
the stream due to the additional spill blocks.  Additionally, since
new spill blocks will be written the received snapshot will consume
more capacity.  These drawbacks can be largely mitigated by using
the large dnode feature which reduces the need for spill blocks.

Both the send_realloc_files and send_realloc_encrypted_files ZTS
test cases were updated to create xattrs in order to force spill
blocks.  As part of validating an incremental receive the contents
of all received xattrs are verified against the source snapshot.

OpenZFS-issue: https://www.illumos.org/issues/9952
FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=233277

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#6224
behlendorf added a commit to behlendorf/zfs that referenced this issue Apr 26, 2019
behlendorf added a commit to behlendorf/zfs that referenced this issue Apr 26, 2019
@ahrens
Member

ahrens commented May 15, 2020

@scineram I think that would work, as a partial solution.

If the dataset does not have large blocks, but the stream does, we don't currently have a way of knowing whether that's because the earlier send split the large blocks (in which case we should reject this receive), or the earlier snapshot didn't happen to have any large blocks in it (in which case we should accept it).

See my earlier comment #6224 (comment) for more thoughts on this.

@ahrens
Member

ahrens commented May 15, 2020

@gerardba

What happens when using 'zfs send -t'

Unfortunately, it looks like zfs send -Lt will add the large-block flag even if the token doesn't have it set, thus allowing you to toggle -L and hit this bug.

wouldn't that actually cause me to be flipping -L off/on and cause corruption?

Yes. One difficulty of dealing with this issue is that the current default is not the recommended way to do sends of filesystems with large blocks, but switching to the recommended way can hit this bug.

@pcd1193182
Contributor

pcd1193182 commented May 16, 2020

It seems like there is a lot of duplication of ideas and discussion going on here and elsewhere, so I'm going to attempt to consolidate and summarize all the various suggestions that have been made, and highlight the problems with each of them. This will hopefully help people understand why this issue hasn't been resolved yet; it's not just a matter of implementing "[a solution] that is free of data loss", because none of the proposed solutions avoids data loss in all cases for all users.

  1. Make -L the default behavior...
    Users with existing large-block datasets who haven't been using -L are suddenly, forcibly exposed to the bug. This can be mitigated by application of solution #2, but not totally, as @ahrens broke down; it forces users with existing large-block datasets to re-send from scratch. This is especially pernicious for the backup use case, where the user may not realize that their sends can no longer be received until the source material is gone. Option 1b helps mitigate this downside, but see its entry.
    a. Without a --no-large-blocks flag
    This approach has the downside of removing the only way we currently have to de-large-blockify datasets.
    b. With a --no-large-blocks flag / with zstream split-blocks
    This solution reopens the bug window through incorrect usage of the --no-large-blocks flag or the userland utility. It does require that you do a more explicit step to cause yourself grief, but it by no means absolutely prevents it. Of the two, I would prefer the latter, since it would prevent the backup use case from being griefed when a user has mistakenly not passed --no-large-blocks.
  2. Reject sends that don't match the large-block-having state of the receiving system
    This solution works well moving forwards, but for systems that already have some received state, this solution does little to address their issues. Again, see @ahrens' previous comment.
  3. Reimplement zfs recv to not be block based
    This suggestion, in addition to being a substantial developer/QA workload, has numerous substantial downsides. See this comment for some of them; I'm sure others can think of more.
  4. Improve/perfect detection of whether an object needs to be reallocated, independent of large blocks
    To me, this is the ideal solution, but it is not at all simple to achieve. The bug we are faced with ultimately stems from the heuristics that attempt to determine if the object has been reallocated between the previous snapshot and this one. Because these heuristics are impossible to make perfect with the current set of available data, the only solution is to add a creation_txg or generation counter to the dnode_phys_t. This would give us perfect information about reallocation for datasets created or upgraded after this change lands, but would do nothing for other users. The heuristics would have to remain, and we would still need to come up with solutions for users with existing data, and users who can't/don't want to upgrade their pools to use the new dnode structure. This solution also needs an appropriate helping of dev and testing effort to ensure correctness in all the various edge cases that may arise.

Personally, I would advocate that we immediately move forwards with #4; it is the only perfect long-term solution to this problem and others like it that we may encounter later on. The sooner we get the feature into zfs, the more data will have the creation_txg in it.

For all existing data, I think that the best behavior is probably a combination of 1b and 2, protecting existing users with large-block datasets who have used -L, users who have small-block datasets, and requiring large-block users who have not used -L thus far to restart their sends or requiring them to deliberately use a zstream split-blocks feature. Hopefully having to use a separate command line utility is sufficiently abnormal for the zfs send workflow that it will give users pause, and prevent them from encountering this bug, especially if adequate explanation is placed in the man pages/usage messages for various commands. Unfortunately, my guess is that there are a lot of users who fall into the large-block-without-dash-L camp, and this solution will inconvenience all of them. I don't know how to fix that.

If there are solutions that people proposed/like that they did not address, feel free to let me know and I will attempt to keep this up to date. If you see a way around one of the issues, or a new solution, again let me know and I will try to add/address it. Hopefully this helps people understand the current state of this bug and our attempts to come up with an ideal solution.

@ahrens
Member

ahrens commented May 19, 2020

Thanks for the great write-up, @pcd1193182!

I realized there's a simpler way to do #4 (Improve/perfect detection of whether an object needs to be reallocated) which allows us to implement #2 perfectly (reject receives that toggle -L).

The ZPL already has a per-file generation number (unique for each actual file, even if they have the same object number), which it stores in the bonus buffer, which is included in the OBJECT record. We can use the generation number to tell if a file's block size changed due to toggling -L, and reject the receive in that case. Using the ZPL's generation number is a slight layering violation (send/receive operates at the DMU level and otherwise doesn't know about object contents / ZPL state), but I think that using it as an error-detection mechanism is fair, especially compared to the alternatives.
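That generation number is already visible in the zdb dumps earlier in this thread ("gen 11", "gen 9"); a sketch of inspecting it by hand, reusing the dataset and object number from the reproducer above:

zdb -ddddddd testpool/source 2 | grep -w gen    # gen of the file on the sending side
zdb -ddddddd testpool/target 2 | grep -w gen    # same gen on the receiver means same file, so a
                                                # changed blksz indicates -L was toggled, not a realloc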

I think we can also orthogonally consider 1b (make -L the default), once we have the safety of the above solution, which I'm working on implementing.
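
For readers who want to look at the fields involved, both an object's block size and its ZPL generation number can be inspected with zdb. A minimal sketch, assuming a hypothetical dataset tank/fs and object number 128; the exact output layout varies between releases:

    # dump one object's dnode and its ZPL bonus buffer
    zdb -dddd tank/fs 128
    # the object header line reports the data block size (dblk, e.g. 1M vs 128K),
    # and the bonus-buffer dump includes a "gen" field, the ZPL generation number

An object number that comes back with the same gen but a different block size is the case the proposed check would treat as a toggled -L rather than a reallocated object.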

@tcaputi
Copy link
Contributor

tcaputi commented May 20, 2020

@ahrens Do we want to reject the receive in this case? Or do we just want to correctly reallocate the dnode (as the heuristic should have done)?

@ahrens
Copy link
Member

ahrens commented May 20, 2020

After discussion with @tcaputi, I think I have a way to correctly process send streams that change to have -L, and reject send streams that change to not have -L. That would give us a better path forward towards having -L be the default (or only) option in a future release (likely after 2.0).

@GregorKopka
Copy link
Contributor

@pcd1193182 Thank you for taking the time to compile the options.

As far as I understand the on-disk format and the code, the root of the issue lies in the details of how objects are stored in ObjSets:

Each object within an object set is uniquely identified by a 64-bit integer called an object number. An object's “object number” identifies the array element, in the dnode array, containing this object's dnode_phys_t.

which means that removing objects from the set leaves gaps in the dnode array. Those gaps are then (I presume) greedily reused by ZFS to avoid growing the array unless absolutely necessary, so an "object number" is ambiguous in a zfs send stream. Hence the decision to have zfs recv detect recycling of an object number by looking for a changed record size, and finally the toggling of the zfs send -L option incorrectly triggering that heuristic.

Did I understand that correctly so far?

If yes: Do we really need that check for reuse of object numbers? Shouldn't the dead list when sending the snapshot take care of removing all stale data (that belonged to the removed object) from the destination?

Apart from that...

I still think that ZFS send/recv should have the ability to convert datasets from one record-/volblocksize to another. Machines get retired, and pools get transferred to new and different hardware as technology advances. If ZFS is meant to be the last word in filesystems... shouldn't it have the ability to remap the block size of the data in existing datasets (without sacrificing the snapshot history) to benefit from the capabilities of the destination system's hardware as much as possible?

Wouldn't it make sense to be able to instruct a backup system using shingled drives to store data in 8MB records (just to pull a number from thin air) that align with the native write stripe size of the drives, regardless of the block size of the source? And, when transferring such a backup to another system where a database daemon will access and modify the dataset in the DB's native block size of (say) 16KiB... to be able to recv that dataset with such a recordsize?

@ahrens
Copy link
Member

ahrens commented May 21, 2020

@GregorKopka

Did I understand that correctly so far?

That's right.

Do we really need that check for reuse of object numbers?

We need it so that the new object is set up correctly before we start getting WRITE records, e.g. so that it has the correct block size and dnode size (number of "slots").

I still think that ZFS send/recv should have the ability to convert datasets from one record-/volblocksize to another.

I hear that you would like us to implement that new feature. It's a reasonable thing to want, but not straightforward to implement. Right now I am prioritizing this data-loss bug.

@GregorKopka
Copy link
Contributor

I agree with your priorities.

My question is: will the ZPL's generation number be deterministic enough to completely solve this issue, or will it just trade this issue for a more convoluted and obscure variation?

ahrens added a commit to ahrens/zfs that referenced this issue May 28, 2020
…s -L

Background:

By increasing the recordsize property above the default of 128KB, a
filesystem may have "large" blocks.  By default, a send stream of such a
filesystem does not contain large WRITE records, instead it decreases
objects' block sizes to 128KB and splits the large blocks into 128KB
blocks, allowing the large-block filesystem to be received by a system
that does not support the `large_blocks` feature.  A send stream
generated by `zfs send -L` (or `--large-block`) preserves the large
block size on the receiving system, by using large WRITE records.

When receiving an incremental send stream for a filesystem with large
blocks, if the send stream's -L flag was toggled, a bug is encountered
in which the file's contents are incorrectly zeroed out.  The contents
of any blocks that were not modified by this send stream will be lost.
"Toggled" means that the previous send used `-L`, but this incremental
does not use `-L` (-L to no-L); or that the previous send did not use
`-L`, but this incremental does use `-L` (no-L to -L).

Changes:

This commit addresses the problem with several changes to the semantics
of zfs send/receive:

1. "-L to no-L" incrementals are rejected.  If the previous send used
`-L`, but this incremental does not use `-L`, the `zfs receive` will
fail with this error message:

    incremental send stream requires -L (--large-block), to match
    previous receive.

2. "no-L to -L" incrementals are handled correctly, preserving the
smaller (128KB) block size of any already-received files that used large
blocks on the sending system but were split by `zfs send` without the
`-L` flag.

3. A new send stream format flag is added, `SWITCH_TO_LARGE_BLOCKS`.
This feature indicates that we can correctly handle "no-L to -L"
incrementals.  This flag is currently not set on any send streams.  In
the future, we intend for incremental send streams of snapshots that
have large blocks to use `-L` by default, and these streams will also
have the `SWITCH_TO_LARGE_BLOCKS` feature set. This ensures that streams
from the default use of `zfs send` won't encounter the bug mentioned
above, because they can't be received by software with the bug.

Implementation notes:

To facilitate accessing the ZPL's generation number,
`zfs_space_delta_cb()` has been renamed to `zpl_get_file_info()` and
restructured to fill in a struct with ZPL-specific info including owner
and generation.

In the "no-L to -L" case, if this is a compressed send stream (from
`zfs send -cL`), large WRITE records that are being written to small
(128KB) blocksize files need to be decompressed so that they can be
written split up into multiple blocks.  The zio pipeline will recompress
each smaller block individually.

A new test case, `send-L_toggle`, is added, which tests the "no-L to -L"
case and verifies that we get an error for the "-L to no-L" case.

Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes openzfs#6224
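
To make these semantics concrete, a minimal sketch with hypothetical pool, dataset, and snapshot names, assuming the receiving side runs code with this fix:

    # initial full send of a large-block dataset without -L:
    # large blocks are split into 128K blocks on the target
    zfs send tank/fs@snap1 | zfs receive backup/fs

    # "no-L to -L" incremental: handled correctly, already-received
    # files keep their 128K block size
    zfs send -L -i @snap1 tank/fs@snap2 | zfs receive backup/fs

    # "-L to no-L" incremental: rejected with
    # "incremental send stream requires -L (--large-block), to match
    #  previous receive."
    zfs send -L tank/fs@snap1 | zfs receive backup2/fs
    zfs send -i @snap1 tank/fs@snap2 | zfs receive backup2/fs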
ahrens added a commit to ahrens/zfs that referenced this issue May 28, 2020
ahrens added a commit to ahrens/zfs that referenced this issue Jun 8, 2020
behlendorf pushed a commit that referenced this issue Jun 9, 2020
BrainSlayer pushed a commit to BrainSlayer/zfs that referenced this issue Jun 10, 2020
lundman referenced this issue in openzfsonosx/openzfs Jun 12, 2020
as-com pushed a commit to as-com/zfs that referenced this issue Jun 20, 2020
as-com pushed a commit to as-com/zfs that referenced this issue Jun 20, 2020
@f00b4r0
Copy link

f00b4r0 commented Aug 31, 2020

Sorry to hijack this but I think I've been bitten by this bug on a replication between two zfs pools (running FreeNAS). Is there any recovery path once corruption is established or is the only option to destroy the replicated pool and start from scratch?

Also, pardon the necessarily stupid question, but how come a zeroed file doesn't trigger any checksumming error upon reading? I was under the impression that zfs send/recv carries all the checksumming info necessary to confirm that what's written on the recv end exactly matches what was sent; am I mistaken?

Thanks

@ahrens
Copy link
Member

ahrens commented Aug 31, 2020

@f00b4r0 I'm not aware of any recovery path.

The send stream is itself checksummed, but it doesn't include checksums of every block of every file.
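
End-to-end verification of a replica therefore has to happen out of band. A minimal sketch, assuming a common snapshot named common whose contents are reachable through the .zfs control directory on both machines (all paths hypothetical):

    # on the source
    cd /tank/fs/.zfs/snapshot/common
    find . -type f -print0 | sort -z | xargs -0 sha256sum > /tmp/src.sha256

    # on the destination
    cd /backup/fs/.zfs/snapshot/common
    find . -type f -print0 | sort -z | xargs -0 sha256sum > /tmp/dst.sha256

    # differing or missing entries point at files worth inspecting
    diff /tmp/src.sha256 /tmp/dst.sha256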

@f00b4r0
Copy link

f00b4r0 commented Aug 31, 2020

@f00b4r0 I'm not aware of any recovery path.

sigh, thanks. Not what I was hoping for, but not entirely unexpected.

The send stream is itself checksummed, but it doesn't include checksums of every block of every file.

This sounds like a design flaw; per-block checksums would have immediately exposed this bug, or am I missing something?

@lschweiss-wustl
Copy link

@f00b4r0 If the receiving filesystem has snapshots from before the corruption, the data should still be in those snapshots. The trick is locating all the null-filled files and the clean snapshot. I recovered millions of files from my snapshot history with some scripting work.

It is best if you have a clean copy that wasn't affected by this bug and redo the replicated copy.
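
For anyone else facing the same cleanup, a minimal GNU/Linux sketch of the "locate the null-filled files" step, assuming the affected replica is mounted at /backup/fs (hypothetical path):

    # flag regular files whose entire contents compare equal to zero bytes
    find /backup/fs -type f -size +0c -print0 |
    while IFS= read -r -d '' f; do
        # cmp -n limits the comparison to the file's length, so a file that
        # matches /dev/zero over its whole length is a candidate
        if cmp -s -n "$(stat -c%s "$f")" "$f" /dev/zero; then
            printf 'possibly zeroed: %s\n' "$f"
        fi
    done

Candidates still need to be cross-checked against an older snapshot or the source before restoring, since a legitimately all-zero file looks exactly the same.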

@f00b4r0
Copy link

f00b4r0 commented Aug 31, 2020

@lschweiss-wustl thanks. It looks like I'll have to start from scratch. Then again that bug bites so hard, I'm not sure I can ever trust send/recv again. At least rsync doesn't silently wipe data ;P

@shodanshok
Copy link
Contributor

Sorry if I'm asking something obvious, but I have a question: sending a dataset from a pool with large_block enabled, and with 1M recordsize datasets, to a non-large-block-enabled pool without ever using -L, am I right in saying that I would not trigger the bug? What about the inverse scenario (sending from a 128K block dataset/pool to a large_block enabled pool)?

In other words: if I never used -L I am safe, right?

Thanks.

@ahrens
Copy link
Member

ahrens commented Oct 3, 2020

@shodanshok

sending a dataset from a pool with large_block enabled, and with 1M recordsize datasets, to a non-large-block-enabled pool without ever using -L, am I right in saying that I would not trigger the bug?

That's right. The bug is about changing the -L setting in an incremental send.

What about the inverse scenario (sending from a 128K block dataset/pool to a large_block enabled pool)?

In this scenario you don't have any large blocks, so it doesn't matter. You can even toggle -L, because it's a no-op if you don't have large blocks.

if I never used -L I am safe, right?

Correct.
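
The practical takeaway is to keep the flags of every incremental identical to those used for the initial full send. A minimal sketch of a consistent no-L replication cycle (dataset and snapshot names hypothetical):

    # initial full send without -L: large blocks are split to 128K on the target
    zfs snapshot tank/fs@base
    zfs send tank/fs@base | zfs receive backup/fs

    # every later incremental uses exactly the same flags (still no -L)
    zfs snapshot tank/fs@next
    zfs send -i @base tank/fs@next | zfs receive backup/fs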

@behlendorf behlendorf moved this from To do to In Progress in 0.8-release Dec 9, 2020
@behlendorf behlendorf removed this from In Progress in 0.8-release Dec 18, 2020
jsai20 pushed a commit to jsai20/zfs that referenced this issue Mar 30, 2021
sempervictus pushed a commit to sempervictus/zfs that referenced this issue May 31, 2021