Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compressed receive with different ashift can result in incorrect PSIZE on disk #8462

Closed
pcd1193182 opened this issue Feb 27, 2019 · 7 comments
Labels
Component: Send/Recv "zfs send/recv" feature Status: Understood The root cause of the issue is known Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@pcd1193182
Copy link
Contributor

System information

N/A, problem found from code inspection, verified on Illumos

Describe the problem you're observing

When performing a compressed send, the PSIZE of the block is used as the compressed_size of the record. When the receiving system receives these records, they do a raw write, taking the compressed data and writing it directly to disk. When the sending pool has disks with ashift=9 and the receiving system has disks with ashift=12, this result in a bug.

For normal writes, zio_write_compress rounds the PSIZE of the blkptr up to the ashift of the smallest ashift device. For raw writes, this step is bypassed. Instead, the zio's io_size is used directly as the psize. The zio's io_size comes from the arc buf header's size via dbuf_sync_leaf -> dbuf_write -> arc_write -> zio_write. The arc buf header's size is set to be the compressed_size of the record in receive_read_record . Since this size is from a system with a different set of ashifts, this can result in a situation where a block is stored using 4kb on disk, but the psize is only 512 bytes.

So far the only negative effect of this issue I've found is that the dataset's bookkeeping is messed up, resulting in incorrect lrefer and compression ratio statistics. There may be other issues lurking, however.

Describe how to reproduce the problem

  1. Create a pool on a system with ashift=9 disks, and a pool on a system with ashift=12 disks (or set spa_min_ashift when creating the second pool).
  2. Create a filesystem with compression enabled on the source pool.
  3. Populate that filesystem with compressible data.
  4. Snapshot that filesystem
  5. Send that filesystem with -c, and receive it on the target pool.
  6. Verify that the asize is incorrect using zdb -vvvvv to examine the block pointers.

Include any warning/errors/backtraces from the system logs

N/A

@pcd1193182 pcd1193182 added the Type: Defect Incorrect behavior (e.g. crash, hang) label Feb 27, 2019
@tcaputi
Copy link
Contributor

tcaputi commented Feb 27, 2019

I don't know if I'l have time to work on this in the near future. Im going to unassign it from me for the moment.

@ahrens
Copy link
Member

ahrens commented Feb 27, 2019

No worries, I accidentally assigned it to you, thinking that I was requesting a code review.

@stale
Copy link

stale bot commented Aug 24, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Aug 24, 2020
@ahrens ahrens added Status: Understood The root cause of the issue is known and removed Status: Stale No recent activity for issue labels Aug 25, 2020
@stale
Copy link

stale bot commented Aug 25, 2021

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Aug 25, 2021
@pcd1193182
Copy link
Contributor Author

It looks like this issue may have been partially modified by the lightweight write changes. The code that directly manipulated the arc buf hdr in dmu_recv was removed, except in the case of spill blocks. There is no longer an arc_buf_hdr for the lightweight writes; instead, they stay as lightweight write structures until dbuf_sync_lightweight, when they get zio_writed. Because that doesn't correct the asize issue, it appears that the problem still exists. However, we can resolve it in one of two locations. Either in dbuf_sync_lightweight, by modifying the psize passed into zio_write, or in receive_read_record by modifying the DRR_WRITE case to bump up the size of the allocated abd to a multiple of asize, zero it before use if it had to be increased, and then do the receive_read_payload_and_next_header of the original psize.

For the DRR_SPILL case, only the second approach will work, so that is probably the approach to take.

@stale stale bot removed the Status: Stale No recent activity for issue label Aug 26, 2021
@ahrens
Copy link
Member

ahrens commented Aug 26, 2021

I think those would work with compressed streams, and then you could add assertions in the zio pipeline that the buffer/psize provided is a multiple of 1<<ashift. It might be cleaner if the zio pipeline accepted (raw) writes that are not sector-aligned and rounded them up itself, as it does for normal writes.

However, I'm not sure that either of these approaches will work in practice with encrypted streams, because the PSIZE (of L0-blocks) is verified by the MAC, so I think it can not change across an encrypted send. See zio_crypt_bp_zero_nonportable_blkprop().

@pcd1193182
Copy link
Contributor Author

After some offline discussions, we decided that the plan we're going to attempt moving forwards is to fix the PSIZE for future cross-ashift compressed-but-not-encrypted receives. This can be done pretty simply in the receive logic or the zio pipeline. Fixing existing incorrect blocks is much more challenging; rewriting the blocks to use the correct PSIZE would be slow, fixing just the ds_ metadata would also be somewhat slow, and no solution we've come up with would allow us to fix the PSIZE of received encrypted blocks.

We also considered modifying the ds_dataset_block_* functions to do the math based on the corrected PSIZE, but most ways that could be implemented continue to leave out cases or result in situations where the ds_* statistics can become negative, which is significantly worse than the current failure mode.

pcd1193182 added a commit to pcd1193182/zfs that referenced this issue Aug 30, 2021
…E on disk (openzfs#8462)

Signed-off-by: Paul Dagnelie <pcd@delphix.com>
pcd1193182 added a commit to pcd1193182/zfs that referenced this issue Aug 30, 2021
…E on disk (openzfs#8462)

Signed-off-by: Paul Dagnelie <pcd@delphix.com>
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Sep 15, 2021
…E on disk

We round up the psize to the nearest multiple of the asize or to the
lsize, whichever is smaller. Once that's done, we allocate a new
buffer of the appropriate size, zero the tail, and copy the data
into it. This adds a small performance cost to these kinds of writes,
but fixes the bookkeeping problems.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Matthew Ahrens <matthew.ahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes openzfs#12522
Closes openzfs#8462
rincebrain pushed a commit to rincebrain/zfs that referenced this issue Sep 22, 2021
…E on disk

We round up the psize to the nearest multiple of the asize or to the
lsize, whichever is smaller. Once that's done, we allocate a new
buffer of the appropriate size, zero the tail, and copy the data
into it. This adds a small performance cost to these kinds of writes,
but fixes the bookkeeping problems.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Matthew Ahrens <matthew.ahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes openzfs#12522
Closes openzfs#8462
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Send/Recv "zfs send/recv" feature Status: Understood The root cause of the issue is known Type: Defect Incorrect behavior (e.g. crash, hang)
Projects
None yet
Development

No branches or pull requests

5 participants
@pcd1193182 @behlendorf @ahrens @tcaputi and others