Process 'zfs_unlinked_drain' asynchronously on mount #3814

Closed

behlendorf opened this issue Sep 22, 2015 · 16 comments

@behlendorf (Contributor)

Historically ZFS has always processed the unlinked set synchronously during mount. When the unlinked set is very large this can lead to long delays mounting the filesystem and it may appear as if the system is hung. It would be desirable to handle this work asynchronously after the mount completes. There's no technical reason this can't be done and it would significantly improve availability in certain situations.
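
For illustration, a minimal sketch of what handling the drain asynchronously could look like, using the SPL taskq API. The `zfs_unlinked_drain_async`/`zfs_unlinked_drain_task` names and the dedicated `"z_unlinked_drain"` taskq are assumptions for this sketch rather than an actual patch, it uses the `zfsvfs_t` naming (the 0.6.x Linux code called this `zfs_sb_t`), and unmount synchronization is left out:

```c
/* Illustrative only: dispatch the existing zfs_unlinked_drain() onto a
 * taskq at mount time instead of calling it synchronously. */
static void
zfs_unlinked_drain_task(void *arg)
{
	zfsvfs_t *zfsvfs = arg;

	zfs_unlinked_drain(zfsvfs);	/* the existing synchronous routine */
}

/* Called from the mount path in place of the direct call. */
static void
zfs_unlinked_drain_async(zfsvfs_t *zfsvfs)
{
	taskq_t *tq = taskq_create("z_unlinked_drain", 1, defclsyspri,
	    1, 1, TASKQ_DYNAMIC);

	/* A real patch would store tq in the zfsvfs and make unmount wait
	 * for (or cancel) the task before tearing the filesystem down. */
	(void) taskq_dispatch(tq, zfs_unlinked_drain_task, zfsvfs, TQ_SLEEP);
}
```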

@behlendorf added the Type: Feature label on Sep 22, 2015
@behlendorf added this to the 0.7.0 milestone on Sep 22, 2015
@lundman (Contributor) commented Sep 22, 2015

We did something quick and dirty:

openzfsonosx/zfs@11dd716

@neingeist

I see this problem on my system. The system doesn't just appear to be hung, it is effectively hung. I can import the pool in question read-only without problems.

@ilovezfs (Contributor)

How do you know it is "this problem"? Ordinarily, the unlinked drain list should not be very long.

@neingeist

The kernel dumps a trace (on the console) that contains "zfs_unlinked_drain" (with both 0.6.5 and 0.6.4), and the last thing I did before the system crashed was delete a lot of files (many hundreds of thousands). This, and the fact that I can import with -o readonly=on, makes me believe it is this issue.

@ilovezfs (Contributor)

If it is indeed this issue, the solution is to wait until it's done.

@neingeist

Agreed. One problem though: I have absolutely no indication that it's actually making progress, as the system appears to be hung. I can only switch consoles and maybe get dmesg output on one of the previously running shells (this seems to work better with 0.6.4).

@ilovezfs (Contributor)

Are you able to build from source? lundman's patch does work, though we removed it from openzfsonosx master because it would occasionally deadlock, once it was no longer needed to catch up on the unlinked drain lists we hadn't been draining at all. Even though the full patch was removed, openzfsonosx master still prints status messages when the list is long, so you could add just those if all you need is a way to monitor progress:
https://github.com/openzfsonosx/zfs/blob/master/module/zfs/zfs_dir.c#L537-L546
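
For reference, progress logging of that general shape inside zfs_unlinked_drain() could look roughly like the sketch below. This is not the exact openzfsonosx code; the 10,000-entry threshold and the message wording are made up here:

```c
/* Sketch only: progress messages around the existing drain loop. */
void
zfs_unlinked_drain(zfsvfs_t *zfsvfs)
{
	zap_cursor_t zc;
	zap_attribute_t za;
	uint64_t count = 0, processed = 0;

	/* Announce the drain if the delete queue is unusually long. */
	if (zap_count(zfsvfs->z_os, zfsvfs->z_unlinkedobj, &count) == 0 &&
	    count > 10000)
		printk(KERN_NOTICE "ZFS: unlinked drain started, "
		    "%llu entries\n", (u_longlong_t)count);

	for (zap_cursor_init(&zc, zfsvfs->z_os, zfsvfs->z_unlinkedobj);
	    zap_cursor_retrieve(&zc, &za) == 0;
	    zap_cursor_advance(&zc)) {
		/* ... existing per-entry zfs_zget()/zfs_rmnode() work ... */

		if ((++processed % 10000) == 0)
			printk(KERN_NOTICE "ZFS: unlinked drain progress "
			    "%llu/%llu\n", (u_longlong_t)processed,
			    (u_longlong_t)count);
	}
	zap_cursor_fini(&zc);
}
```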

@neingeist

Thanks! I'll patch it in later/tomorrow!

@neingeist

So, I patched zfs-0.6.5.3 with this: neingeist@2edff94

[ 1283.738635] ZFS: unlinked drain started.
[ 1283.963863] ZFS: unlinked drain completed (0).
[ 1284.125962] ZFS: unlinked drain started.

It stops there.

It's definitely hung up in the unlinked drain on the dataset with the many thousands of deletes. Here's a call trace:

[ 1289.332037] Call Trace:
[ 1289.332037]  <NMI>  [<ffffffff81046cbd>] ? warn_slowpath_common+0x78/0x8c
[ 1289.332037]  [<ffffffff81046d69>] ? warn_slowpath_fmt+0x45/0x4a
[ 1289.332037]  [<ffffffff810139dc>] ? sched_clock+0x5/0x8
[ 1289.332037]  [<ffffffff810906a4>] ? watchdog_overflow_callback+0x93/0x9e
[ 1289.332037]  [<ffffffff81090611>] ? touch_nmi_watchdog+0x60/0x60
[ 1289.332037]  [<ffffffff810b05ca>] ? __perf_event_overflow+0xf9/0x17b
[ 1289.332037]  [<ffffffff81063fbe>] ? local_clock+0x24/0x2b
[ 1289.332037]  [<ffffffff810aea04>] ? perf_event_update_userpage+0x9/0x9d
[ 1289.332037]  [<ffffffff81019ea7>] ? x86_perf_event_set_period+0x101/0x10b
[ 1289.332037]  [<ffffffff8101a293>] ? x86_pmu_handle_irq+0xb7/0xec
[ 1289.332037]  [<ffffffff8135015c>] ? nmi_handle.isra.0+0x41/0x64
[ 1289.332037]  [<ffffffff813501fc>] ? do_nmi+0x7d/0x258
[ 1289.332037]  [<ffffffff8134fa70>] ? nmi+0x20/0x30
[ 1289.332037]  [<ffffffff81070ff2>] ? do_raw_spin_lock+0x11/0x1b
[ 1289.332037]  <<EOE>>  [<ffffffff8134f0f5>] ? rwsem_down_failed_common+0x41/0x114
[ 1289.332037]  [<ffffffffa06a7529>] ? dmu_buf_hold_noread+0x96/0xbd [zfs]
[ 1289.332037]  [<ffffffff811b3f24>] ? call_rwsem_down_read_failed+0x14/0x30
[ 1289.332037]  [<ffffffff8134eb1a>] ? down_read+0x17/0x19
[ 1289.332037]  [<ffffffffa06f57f5>] ? zap_get_leaf_byblk+0x15b/0x1c7 [zfs]
[ 1289.332037]  [<ffffffffa06f5a70>] ? zap_deref_leaf+0x65/0x6e [zfs]
[ 1289.332037]  [<ffffffffa06f7563>] ? fzap_cursor_retrieve+0x8c/0x205 [zfs]
[ 1289.332037]  [<ffffffffa06f9ff6>] ? zap_cursor_retrieve+0xc1/0x1e4 [zfs]
[ 1289.332037]  [<ffffffffa06d6889>] ? sa_lookup+0x75/0x87 [zfs]
[ 1289.332037]  [<ffffffffa070214b>] ? zfs_purgedir+0x192/0x1dc [zfs]
[ 1289.332037]  [<ffffffffa0719c84>] ? zfs_inode_update+0x114/0x17a [zfs]
[ 1289.332037]  [<ffffffff8105f9ba>] ? bit_waitqueue+0x17/0x75
[ 1289.332037]  [<ffffffff8105fa52>] ? wake_up_bit+0xd/0x1e
[ 1289.332037]  [<ffffffff8110c92e>] ? unlock_new_inode+0x43/0x4c
[ 1289.332037]  [<ffffffffa071a17b>] ? zfs_znode_alloc+0x491/0x4bb [zfs]
[ 1289.332037]  [<ffffffff81036628>] ? should_resched+0x5/0x23
[ 1289.332037]  [<ffffffff8134deec>] ? _cond_resched+0x7/0x1c
[ 1289.332037]  [<ffffffff81036628>] ? should_resched+0x5/0x23
[ 1289.332037]  [<ffffffffa07021e1>] ? zfs_rmnode+0x4c/0x2b8 [zfs]
[ 1289.332037]  [<ffffffffa071c932>] ? zfs_zinactive+0xb3/0x139 [zfs]
[ 1289.332037]  [<ffffffffa07129b8>] ? zfs_inactive+0x139/0x18a [zfs]
[ 1289.332037]  [<ffffffffa0729aa7>] ? zpl_evict_inode+0x3e/0x4c [zfs]
[ 1289.332037]  [<ffffffff8110d068>] ? evict+0x96/0x148
[ 1289.332037]  [<ffffffffa0701779>] ? zfs_unlinked_drain+0x9d/0x11d [zfs]
[ 1289.332037]  [<ffffffff81070fc1>] ? arch_local_irq_save+0x11/0x17
[ 1289.332037]  [<ffffffff81070fc1>] ? arch_local_irq_save+0x11/0x17
[ 1289.332037]  [<ffffffff8134f209>] ? _raw_spin_lock_irqsave+0x9/0x25
[ 1289.332037]  [<ffffffff8105fd09>] ? finish_wait+0x35/0x60
[ 1289.332037]  [<ffffffff8134f247>] ? _raw_spin_unlock_irqrestore+0xe/0xf
[ 1289.332037]  [<ffffffffa05fd0af>] ? taskq_create+0x277/0x299 [spl]
[ 1289.332037]  [<ffffffff8105fc83>] ? add_wait_queue+0x3c/0x3c
[ 1289.332037]  [<ffffffffa0718935>] ? zfs_get_done+0x5a/0x5a [zfs]
[ 1289.332037]  [<ffffffffa0711355>] ? zfs_sb_setup+0xdc/0x150 [zfs]
[ 1289.332037]  [<ffffffffa0712111>] ? zfs_domount+0x245/0x2da [zfs]
[ 1289.332037]  [<ffffffff8134eae5>] ? down_write+0x9/0x27
[ 1289.332037]  [<ffffffff810c143b>] ? register_shrinker+0x18/0x46
[ 1289.332037]  [<ffffffffa072991d>] ? spl_fstrans_unmark.isra.0+0x14/0x14 [zfs]
[ 1289.332037]  [<ffffffffa072995a>] ? zpl_fill_super+0x3d/0x52 [zfs]
[ 1289.332037]  [<ffffffff810fc731>] ? mount_nodev+0x42/0x78
[ 1289.332037]  [<ffffffff810fd117>] ? mount_fs+0x61/0x146
[ 1289.332037]  [<ffffffff811101a6>] ? vfs_kern_mount+0x5f/0x99
[ 1289.332037]  [<ffffffff81110590>] ? do_kern_mount+0x49/0xd8
[ 1289.332037]  [<ffffffff81111c0b>] ? do_mount+0x660/0x6c6
[ 1289.332037]  [<ffffffff81111f19>] ? sys_mount+0x88/0xc3
[ 1289.332037]  [<ffffffff81354212>] ? system_call_fastpath+0x16/0x1b

With my limited understanding of this, I think it's not the size of the list but maybe a deadlock?

@tuxoko (Contributor) commented Nov 13, 2015

@neingeist see #4006, also from zfs_unlinked_drain. The problem seems to be some kind of memory corruption.

@DeHackEd (Contributor)

I've run into something similar, but caused by the ZAP structure growing to an enormous size and not shrinking when done. A 391-megabyte ZAP with 0 entries in it delays boot by around 10 minutes.

I might try porting this patch.

@DeHackEd (Contributor)

A few issues came up when looking at the OS X patch as-is.

@lundman (Contributor) commented Mar 25, 2017

Once the unlinked-drain list has been very large (we've had users with multiple millions of entries), the ZAP object is still slow to process on each mount, even if it is empty now. When I talked to ahrens, the idea was that it might need some code to destroy the object (assuming the unlinked set is empty and there were no errors) and recreate it, so mount doesn't need to traverse all the disk buffers only to find there is nothing there. We have not yet had time to look at that.

The async drain did help people use their FS immediately instead of having to wait, but we removed it because it treated the symptom, not the cause.
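
For anyone who wants to experiment with that suggestion, a rough sketch might look like the following. zfs_unlinked_set_reset() is a made-up name, it assumes the drain completed without errors, and a real patch would need more care with the on-disk update than shown:

```c
/* Sketch only: replace an empty but bloated unlinked-set ZAP with a
 * fresh object so future mounts don't walk all of its leaf blocks. */
static int
zfs_unlinked_set_reset(zfsvfs_t *zfsvfs)
{
	objset_t *os = zfsvfs->z_os;
	uint64_t count;
	dmu_tx_t *tx;
	int error;

	if ((error = zap_count(os, zfsvfs->z_unlinkedobj, &count)) != 0)
		return (error);
	if (count != 0)
		return (0);		/* only reset when truly empty */

	tx = dmu_tx_create(os);
	dmu_tx_hold_free(tx, zfsvfs->z_unlinkedobj, 0, DMU_OBJECT_END);
	dmu_tx_hold_zap(tx, MASTER_NODE_OBJ, B_TRUE, ZFS_UNLINKED_SET);
	if ((error = dmu_tx_assign(tx, TXG_WAIT)) != 0) {
		dmu_tx_abort(tx);
		return (error);
	}

	/* Destroy the old object and point the master node at a new one. */
	VERIFY0(zap_destroy(os, zfsvfs->z_unlinkedobj, tx));
	zfsvfs->z_unlinkedobj = zap_create(os, DMU_OT_UNLINKED_SET,
	    DMU_OT_NONE, 0, tx);
	VERIFY0(zap_update(os, MASTER_NODE_OBJ, ZFS_UNLINKED_SET, 8, 1,
	    &zfsvfs->z_unlinkedobj, tx));
	dmu_tx_commit(tx);

	return (0);
}
```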

@behlendorf (Contributor, Author)

A potentially nice way to tackle this issue would be to update the ZAP code so it's smart about reverting back to a microzap when a fatzap is no longer necessary. This is something we've wanted to do for quite some time, since it has the advantage of improving other common use cases. For example, when a large number of files are added to a directory and then removed, readdir performance in that directory will continue to suffer. I think it would be great if someone had time to tackle this optimization.
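
As a very rough sketch of the check such a shrink path would need (illustrative only: upstream ZAP code currently only upgrades via mzap_upgrade() and has no downgrade path, and max_micro_entries is a hypothetical parameter derived from the microzap block size):

```c
/*
 * Hypothetical helper: decide whether a fatzap could be converted back
 * to a microzap after removals.  A microzap requires every entry to have
 * a short name and a single uint64_t value, and all entries to fit in a
 * single block; checking those per-entry constraints is omitted here.
 */
static boolean_t
zap_can_shrink_to_micro(objset_t *os, uint64_t zapobj,
    uint64_t max_micro_entries)
{
	uint64_t count;

	if (zap_count(os, zapobj, &count) != 0)
		return (B_FALSE);

	return (count <= max_micro_entries);
}
```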

@behlendorf modified the milestones: 0.8.0, 0.9.0 on Feb 9, 2018
@DeHackEd (Contributor)

It processes the ZAP asynchronously in the background, hence not blocking the mounting of the filesystem. Fundamentally what was asked for has been provided.

...which is the most important bit of what we want, but there's a related pathological case where the thread may spend a lot of time and disk I/O accomplishing nothing, which is what behlendorf's most recent comment and my complaint above are about.
