
dnode_is_dirty: use dn_dirty_txg to check dirtiness #15615

Draft: wants to merge 1 commit into master
Conversation

robn
Contributor

@robn robn commented Nov 30, 2023

Motivation and Context

#15571 is a reliable fix for #15526, but it wasn't clear why it was necessary. This PR explains it, and offers the correct fix.

For the avoidance of doubt: #15571 fixes the problem, and there's no new problem that this fixes. There's no hurry at all to ship this (assuming it's even right).

Description

dn_dirty_txg is always set to the highest txg that has ever dirtied the dnode. It is set in dbuf_dirty() when a data or metadnode dbuf is dirtied, and never cleared.

[analysis of bug #15526 and fix #15571 below, for future readers]

The previous dirty check was:

for (int i = 0; i < TXG_SIZE; i++) {
    if (multilist_link_active(&dn->dn_dirty_link[i]))
        [dnode is dirty]

However, this check is not "is the dnode dirty?" but rather, "is the dnode on a list?".

There is a gap in dmu_objset_sync_dnodes() where the dnode is moved from os_dirty_dnodes to os_synced_dnodes, before dnode_sync() is called to write out the dirty dbufs. So, there is a moment when the dnode is not on a list, and so the check fails.
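That window can be sketched with plain C list operations (the names here are illustrative stand-ins, not the actual ZFS multilist API): a node being moved between two lists is momentarily on neither, so a link-membership test reports it as clean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy doubly-linked list link, standing in for the dnode's dn_dirty_link. */
struct link { struct link *prev, *next; };

static void list_init(struct link *head) {
	head->prev = head->next = head;
}

/* Mirrors the multilist_link_active() idea: "dirty" == "on a list". */
static bool link_active(const struct link *l) {
	return l->prev != NULL || l->next != NULL;
}

static void list_remove(struct link *l) {
	l->prev->next = l->next;
	l->next->prev = l->prev;
	l->prev = l->next = NULL;	/* now on no list at all */
}

static void list_insert(struct link *head, struct link *l) {
	l->next = head->next;
	l->prev = head;
	head->next->prev = l;
	head->next = l;
}
```

Between list_remove() and the second list_insert(), the node's unsynced state still exists, but link_active() returns false: that is the shape of the gap dmu_objset_sync_dnodes() opens between os_dirty_dnodes and os_synced_dnodes.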

It doesn't matter that the dirty check takes dn_mtx, because that lock isn't used for dn_dirty_link. The os_dirty_dnodes sublist lock is held in dmu_objset_sync_dnodes(), but trying to take that would mean possibly waiting until everything on that sublist has been synced.

The correct fix has to check something that positively asserts the dnode is dirty, rather than an implementation detail. dn_dirty_txg (via DNODE_IS_DIRTY()) is that: it's a normal piece of dnode state, protected by the dn_mtx lock, and it unambiguously indicates whether there are changes pending.
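As a hedged sketch (toy types, not the in-tree ones; per the discussion below, DNODE_IS_DIRTY() compares dn_dirty_txg against the syncing txg), the proposed check becomes a positive statement read under the dnode's own lock rather than a list-membership test:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins: mtx plays dn_mtx, dirty_txg plays dn_dirty_txg. */
struct txg_dnode {
	pthread_mutex_t mtx;
	uint64_t dirty_txg;	/* highest txg to dirty this dnode; 0 = never */
};

static uint64_t syncing_txg;	/* stands in for spa_syncing_txg() */

/*
 * Dirty iff the last dirtying txg has not yet finished syncing. Unlike a
 * list-membership test, this stays true while the sync pipeline shuffles
 * the dnode between lists, and it is read under the same per-dnode lock
 * that writers take.
 */
static bool txg_dnode_is_dirty(struct txg_dnode *dn) {
	pthread_mutex_lock(&dn->mtx);
	bool dirty = (dn->dirty_txg != 0 && dn->dirty_txg >= syncing_txg);
	pthread_mutex_unlock(&dn->mtx);
	return dirty;
}
```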

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

How Has This Been Tested?

Running a variant of the reproducer from #15526, with the below patch applied to substantially widen the gap:

diff --git module/zfs/dmu_objset.c module/zfs/dmu_objset.c
index f098e1daa..837f9786f 100644
--- module/zfs/dmu_objset.c
+++ module/zfs/dmu_objset.c
@@ -1575,6 +1575,8 @@ dmu_objset_sync_dnodes(multilist_sublist_t *list, dmu_tx_t *tx)
 		ASSERT3U(dn->dn_nlevels, <=, DN_MAX_LEVELS);
 		multilist_sublist_remove(list, dn);
 
+		zfs_sleep_until(gethrtime()+USEC2NSEC(100));
+
 		/*
 		 * See the comment above dnode_rele_task() for an explanation
 		 * of why this dnode hold is always needed (even when not

Without the previous fix or this one, the bug is easy to hit over and over again. With this fix in place, it's silent.

Doing this also goes some way toward supporting the analysis.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Member

@amotin amotin left a comment


I like the cleanness of this approach, but as far as I can see, dn_dirty_txg is now updated only when one of the dnode's dbufs gets dirtied. At the same time, dnode_next_offset() seems to depend heavily on structural data protected by dn_struct_rwlock, like the number of blocks in a file, the block size, the number of indirection levels, etc. We must make sure that in all cases where those can change and become stale, dn_dirty_txg is also bumped.

While looking at this I had an optimization idea: if dnode_is_dirty() returned the txg in which the dnode was last dirtied, we could pass that value to txg_wait_synced() instead of 0, and so reduce the number of transactions that have to be synced before the dnode becomes clean and we can call dnode_next_offset() on it. There is a chance the dnode was modified a while ago and its txg is already in the process of syncing.
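A rough model of the potential saving (the txg numbers and the "0 means wait through the currently open txg" convention are assumptions for illustration, not the txg_wait_synced() contract):

```c
#include <assert.h>
#include <stdint.h>

/*
 * How many txgs must finish syncing before a txg_wait_synced()-style
 * wait returns. Passing 0 is modeled as "wait through the currently
 * open txg"; passing the dnode's last-dirtied txg can return sooner,
 * especially if that txg is already syncing.
 */
static uint64_t txgs_to_wait(uint64_t syncing_txg, uint64_t open_txg,
    uint64_t target_txg)
{
	if (target_txg == 0)
		target_txg = open_txg;	/* wait for everything */
	if (target_txg < syncing_txg)
		return 0;		/* that txg already synced */
	return target_txg - syncing_txg + 1;
}
```

With txg 10 syncing and txg 12 open: waiting on 0 costs three txgs, waiting on the already-syncing txg 10 costs one, and a dnode last dirtied in txg 9 needs no wait at all.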

@behlendorf
Contributor

I agree, this would be a nice way to perform the dirty check. Of course the devil's in the details; as @amotin pointed out, dn_dirty_txg currently only gets set in dbuf_dirty(), which may not cover all possible cases.

@robn
Contributor Author

robn commented Dec 2, 2023

Ahh, I think I see where I went wrong: I saw dnode_setdirty() being called on all (?) dnode changes, and that calls dbuf_dirty(dn->dn_dbuf). But of course a change to dn_dbuf dirties the object 0 dnode, not this dnode. Right. Bummer. So I guess dn_dirty_txg isn't what we want.

There is also dn_dirtyctx, but it seems that's never cleared, only bumped from DN_UNDIRTIED to DN_DIRTY_SYNC or DN_DIRTY_OPEN. Interestingly, we used to use that in the dirty check in dmu_offset_next(), but we stopped in 454365b (#6867). That's after the dbuf cache was introduced, which made dnodes live on in memory for a lot longer. So I wonder if dn_dirtyctx never used to be set back to DN_UNDIRTIED because the expectation was that the dnode_t was about to be freed anyway, but then we started caching them? Maybe then the answer is to fix it so dn_dirtyctx is returned to DN_UNDIRTIED, and then start using it again for the dirty test?

I agree that now we've got a fix in place, it's worth taking the time to study all the cases and fix this up properly. So no hurry on this. Give me your thoughts on the above, and I'll come back to this as I have time (I kind of have to get back to my actual work for a little while, haha).

@robn robn marked this pull request as draft December 2, 2023 02:02
@mikkorantalainen

It doesn't matter that the dirty check takes dn_mtx, because that lock isn't used for dn_dirty_link. The os_dirty_dnodes sublist lock is held in dmu_objset_sync_dnodes(), but trying to take that would mean possibly waiting until everything on that sublist has been synced.

The correct fix has to check something that positively asserts the dnode is dirty, rather than an implementation detail. dn_dirty_txg (via DNODE_IS_DIRTY()) is that - its a normal bit of dnode state, under the dn_mtx lock, and unambiguously indicates whether or not there's changes pending.

Shouldn't correct code do logically following?

  1. Start critical section
  2. Check if dirty
  3. Act on dirty status if needed
  4. End critical section

If you have no critical section at all, the code will be racy forever. And if you only have a critical section for the test, the code will be potentially racy, too, in case the action is ever changed without checking if the critical test alone is acceptable with the modified code.

Or am I missing something here?
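The four-step pattern above, sketched as one critical section covering both the check and the action (names here are hypothetical, not ZFS code):

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct guarded {
	pthread_mutex_t mtx;
	bool dirty;
	int actions;	/* how many times we acted on a dirty state */
};

static void check_and_act(struct guarded *g) {
	pthread_mutex_lock(&g->mtx);	/* 1. start critical section */
	if (g->dirty) {			/* 2. check if dirty */
		g->actions++;		/* 3. act on dirty status */
		g->dirty = false;
	}
	pthread_mutex_unlock(&g->mtx);	/* 4. end critical section */
}
```

The design question in this PR is which lock plays the role of mtx: the wide sublist lock would serialize against the whole sync pass, so the fix instead looks for dirtiness state covered by the per-dnode dn_mtx.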

@robn
Contributor Author

robn commented Dec 5, 2023

@mikkorantalainen

If you have no critical section at all, the code will be racy forever. And if you only have a critical section for the test, the code will be potentially racy, too, in case the action is ever changed without checking if the critical test alone is acceptable with the modified code.

Or am I missing something here?

Nope, you're exactly right. The bit of code that syncs things out does the right thing, in that it takes the list lock before moving things on and off the dirty list. The dirty check that caused all the trouble, however, does not take that lock first. It certainly could, but I don't want to do that because the list lock has a much wider scope than a single dnode, so taking it can cause a significant stall. So instead I'm fishing around for an equivalent item on the dnode itself, so we can take only the dnode lock to do the check (and of course take the dnode lock in the sync function too, to keep it correct for both observers).

I feel like I didn't explain that so well though... 😅

@rrevans
Contributor

rrevans commented Dec 28, 2023

(Hi all, first time contributing here. I've been spending a while reading the code and history trying to understand the precise causes and history of #15526. I have written a note with my findings here.)

dn_dirty_txg would work here if dnode_setdirty() set it instead of dbuf_dirty(). However, the racy plain read of spa_syncing_txg in DNODE_IS_DIRTY() is not ideal.

Another approach is to count the txgs in which the dnode is dirtied (similar to db_dirtycnt), incrementing in dnode_setdirty() and decrementing in userquota_updates_task() / dnode_rele_task(). That wouldn't require reads of the spa txg state to check whether the dnode is dirty.
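A sketch of that counting scheme (the dirtycnt field and these helper names are hypothetical; in the real code the increment would sit in dnode_setdirty() and the decrement where sync finally releases the dnode):

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy dnode with a count of txgs that still have pending changes. */
struct cnt_dnode {
	pthread_mutex_t mtx;
	uint8_t dirtycnt;	/* 0..TXG_SIZE txgs with unsynced changes */
};

static void cnt_dnode_setdirty(struct cnt_dnode *dn) {
	pthread_mutex_lock(&dn->mtx);
	dn->dirtycnt++;		/* once per txg that dirties the dnode */
	pthread_mutex_unlock(&dn->mtx);
}

static void cnt_dnode_synced(struct cnt_dnode *dn) {
	pthread_mutex_lock(&dn->mtx);
	dn->dirtycnt--;		/* that txg's changes are fully written out */
	pthread_mutex_unlock(&dn->mtx);
}

/* No global spa txg state consulted: purely local, purely locked. */
static bool cnt_dnode_is_dirty(struct cnt_dnode *dn) {
	pthread_mutex_lock(&dn->mtx);
	bool dirty = (dn->dirtycnt != 0);
	pthread_mutex_unlock(&dn->mtx);
	return dirty;
}
```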

Even better is to make dnode_next_offset() work on dirty objects, avoiding sync checks entirely. Adding or freeing leaf blocks will in turn dirty any parent indirect blocks up to the dnode, and dnode_next_offset() can use this to locate dirty leaf blocks before their block pointers are updated, while still skipping over unmodified blocks. dnode_next_offset_level() can check whether a child block is dirty with dbuf_find() and db_dirtycnt > 0. If dirty, dnode_next_offset_level() returns a match to continue the search there; if not, it uses the (clean) block pointer to detect data or holes. For L1 scans, dnode_next_offset_level() also needs to check for dirty freed L0 blocks using dnode_block_freed(). Some more care is needed when the tree grows (or increases levels), and dnode_next_offset() must loop to continue the search one level up when a dirty L1 block does not have any data or holes. All this behavior can be gated behind a new dnode_next_offset() flag so that only dmu_offset_next() triggers dirty checks. I have implemented a proof of concept which seems to work, and I am continuing to test it.

p.s. dn_dirty_txg looks redundant in dnode_check_slots_free? dn->dn_type == DMU_OT_NONE && zfs_refcount_is_zero(&dn->dn_holds) should be sufficient to check a freed object is fully synced. dn_dirty_txg was added in #7388 to fix #7147 after large_dnodes (#3542) omitted hold checks for interior slots (which are needed because sync holds the dnode after freeing it until userquota is updated). The holds check was added back in #8249, after which one of the two conditions is no longer needed. Maybe both should be replaced by a new dn_dirtycnt?

@rrevans
Contributor

rrevans commented Mar 28, 2024

@robn See #16025 for the first step and master...rrevans:zfs:find_dirty for the rest of the approach described in my comment above.

Labels
Status: Code Review Needed Ready for review and testing