dmu_buf_will_clone: fix race in transition back to NOFILL #15566

Merged: 1 commit into openzfs:master, Nov 28, 2023
Conversation

robn (Contributor) commented Nov 23, 2023

Motivation and Context

#15526 and the related unpleasantness of the week.

Description

Previously, dmu_buf_will_clone() would roll back any dirty record, but would not clean out the modified data nor reset the state before releasing the lock. That leaves the last-written data in db_data, but the dbuf in the wrong state.

This is eventually corrected when the dbuf state is made NOFILL and dbuf_noread() is called (which clears out the old data), but at that point it's too late, because the lock was already dropped with the dbuf in that invalid state.

Any caller acquiring the lock before the call into dmu_buf_will_not_fill() could find what appears to be a clean, readable buffer and take the wrong state from it: it should be getting the data from the cloned block, not from the earlier (unwritten) dirty data.

Even after the state was switched to NOFILL, the old data was still not cleaned out until dbuf_noread(), which is another gap for a caller to take the lock and read the wrong data.

This commit fixes all of this by properly cleaning up the previous state and then setting the new state before dropping the lock. The DBUF_VERIFY() calls confirm that the dbuf is in a valid state both when the lock is taken and just before it is released.
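
For a concrete picture, here is a minimal sketch of the shape the function takes after the fix. It is reconstructed from the description above, not copied from the patch; the helpers and fields (db_mtx, dbuf_undirty(), arc_buf_destroy(), DBUF_VERIFY(), DB_NOFILL) are the ones already used in module/zfs/dbuf.c, but the exact cleanup details may differ from what was merged:

/*
 * Sketch only: approximate shape of the fixed dmu_buf_will_clone(),
 * reconstructed from the PR description. Not the literal patch.
 */
void
dmu_buf_will_clone(dmu_buf_t *db_fake, dmu_tx_t *tx)
{
        dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;

        mutex_enter(&db->db_mtx);
        DBUF_VERIFY(db);        /* sane on entry */

        /* As before: roll back any dirty record for this txg. */
        VERIFY(!dbuf_undirty(db, tx));

        /*
         * New: also drop the old (unwritten) data now, so a racing caller
         * cannot mistake it for clean, readable content.
         */
        if (db->db_buf != NULL) {
                arc_buf_destroy(db->db_buf, db);
                db->db_buf = NULL;
                db->db.db_data = NULL;
        }

        /*
         * New: switch to DB_NOFILL here, while db_mtx is still held, so
         * the dbuf is never visible in a clean, CACHED-looking state in
         * the gap before dmu_buf_will_not_fill() runs.
         */
        db->db_state = DB_NOFILL;

        DBUF_VERIFY(db);        /* still sane before dropping the lock */
        mutex_exit(&db->db_mtx);

        dmu_buf_will_not_fill(db_fake, tx);
}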

Notes

How I got here is interesting. First, I added DBUF_VERIFY(db) calls in dmu_buf_will_clone(), just after taking and just before releasing db_mtx.

Then I ran:

dd if=/dev/random of=/tank/file bs=1M count=64 status=progress
zpool sync

while true ; do clonefile -t -c /tank/file /tank/clone ; done &
while true ; do dd if=/dev/random of=/tank/clone bs=4K count=16384 status=progress ; done &

wait

This immediately blew up on the second DBUF_VERIFY, before releasing the lock. Examining the dbuf, it was in state DB_CACHED, with no block pointer and no dirty records, but with db_data set, so assertions tripped there.
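
Roughly, the dbuf satisfied all of the following at once under db_mtx, which is the combination DBUF_VERIFY rejects (a paraphrase using dmu_buf_impl_t field names, not the literal macro checks):

        /*
         * Paraphrased state of the broken dbuf, observed under db_mtx
         * at the second DBUF_VERIFY:
         */
        db->db_state == DB_CACHED &&             /* claims valid cached data */
        db->db_blkptr == NULL &&                 /* ...but no block pointer  */
        list_is_empty(&db->db_dirty_records) &&  /* ...and nothing dirty     */
        db->db.db_data != NULL                   /* yet stale data attached  */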

And then I just kept following the breadcrumbs until I arrived here, with DBUF_VERIFY passing.

Of course this is all assuming that DBUF_VERIFY is even correct for a clone write, but I think it is (as much as I can be sure of anything anymore).

More notes

I am not sure if this is right. I wonder what happens if the dbuf is written to and then cloned late, while that write is being synced out. I don't know if this even makes sense, but dbuf_fix_old_data() exists, and it seems like this is not unlike the != NOFILL case in dbuf_dirty(). But I'm very tired, so take that with a grain of salt.

Even if this is right, it isn't the whole story for #15526. There's a similar thing that can be tripped in 2.1 (pre-block-cloning). So even if you love this, don't assume it's the whole game.

How Has This Been Tested?

block_cloning tests pass, so at least clones still sorta work. DBUF_VERIFY is happy. My gut says it's not worse, but this stuff is mad. Past performance is no guarantee of future results.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


robn (Contributor, Author) commented Nov 24, 2023

A day later, I believe #15571 is the actual fix for #15526, but I suspect that the block cloning bug fixed here made it a little easier to hit, because in the gap between dmu_buf_will_clone() and dmu_buf_will_not_fill() the dbuf would be clean, and so would the dnode.

(That, and coreutils 9.2/9.3 doing clones and sparse copies by default.)

Previously, dmu_buf_will_clone() would roll back any dirty record, but
would not clean out the modified data nor reset the state before
releasing the lock. That leaves the last-written data in db_data, but
the dbuf in the wrong state.

This is eventually corrected when the dbuf state is made NOFILL, and
dbuf_noread() is called (which clears out the old data), but at this
point it's too late, because the lock was already dropped with that
invalid state.

Any caller acquiring the lock before the call into
dmu_buf_will_not_fill() can find what appears to be a clean, readable
buffer, and would take the wrong state from it: it should be getting the
data from the cloned block, not from earlier (unwritten) dirty data.

Even after the state was switched to NOFILL, the old data was still not
cleaned out until dbuf_noread(), which is another gap for a caller to
take the lock and read the wrong data.

This commit fixes all this by properly cleaning up the previous state
and then setting the new state before dropping the lock. The
DBUF_VERIFY() calls confirm that the dbuf is in a valid state when the
lock is down.

Sponsored-by: Klara, Inc.
Sponsored-By: OpenDrives Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
behlendorf added the Status: Code Review Needed (Ready for review and testing) label on Nov 28, 2023
pjd (Contributor) commented Nov 28, 2023

I ran my tests (mostly exercising various corner cases) with these changes and they all passed. The change also looks good.
Thank you for working on this!

behlendorf added the Status: Accepted (Ready to integrate: reviewed, tested) label and removed the Status: Code Review Needed (Ready for review and testing) label on Nov 28, 2023
behlendorf merged commit 688514e into openzfs:master on Nov 28, 2023 (24 of 26 checks passed)
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Nov 28, 2023
behlendorf pushed a commit that referenced this pull request Nov 28, 2023
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Dec 12, 2023