Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zfs rename can cause zfs snapshot to deadlock the txg_sync thread on zvols #2654

Closed
ryao opened this issue Sep 2, 2014 · 2 comments
Closed

Comments

@ryao
Copy link
Contributor

ryao commented Sep 2, 2014

I was sent the following backtraces for analysis following a deadlock that was encountered on SoftNAS during a long running send:

Aug 20 10:02:48 ip-10-0-5-75 kernel: INFO: task txg_sync:1157 blocked for more than 120 seconds.
Aug 20 10:02:48 ip-10-0-5-75 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 20 10:02:48 ip-10-0-5-75 kernel: txg_sync      D ffff8800280c2680     0  1157      2 0x00000000
Aug 20 10:02:48 ip-10-0-5-75 kernel: ffff880399affad0 0000000000000246 ffffffff81007ca2 0000000000000000
Aug 20 10:02:48 ip-10-0-5-75 kernel: ffff880399affad8 0000000000000000 0000000000000200 0000000000000200
Aug 20 10:02:48 ip-10-0-5-75 kernel: ffff8803a0961098 ffff880399afffd8 000000000000fb88 ffff8803a0961098
Aug 20 10:02:48 ip-10-0-5-75 kernel: Call Trace:
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff81007ca2>] ? check_events+0x12/0x20
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff814eb2ae>] __mutex_lock_slowpath+0x13e/0x180
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff814eb14b>] mutex_lock+0x2b/0x50
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa0294619>] zvol_rename_minors+0x79/0x180 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa020f319>] dsl_dataset_rename_snapshot_sync_impl+0x199/0x2d0 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa02113cf>] dsl_dataset_rename_snapshot_sync+0xbf/0x1a0 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa02211e2>] dsl_sync_task_sync+0xf2/0x100 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa02199fb>] dsl_pool_sync+0x2eb/0x440 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa022d3ab>] spa_sync+0x40b/0xae0 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa0241749>] txg_sync_thread+0x369/0x5e0 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff810527f9>] ? set_user_nice+0xc9/0x130
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa02413e0>] ? txg_sync_thread+0x0/0x5e0 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa01023b8>] thread_generic_wrapper+0x68/0x80 [spl]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa0102350>] ? thread_generic_wrapper+0x0/0x80 [spl]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff81090626>] kthread+0x96/0xa0
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff8100b294>] ? int_ret_from_sys_call+0x7/0x1b
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff8100ba1d>] ? retint_restore_args+0x5/0x6
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Aug 20 10:02:48 ip-10-0-5-75 kernel: INFO: task zfs:23467 blocked for more than 120 seconds.
Aug 20 10:02:48 ip-10-0-5-75 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 20 10:02:48 ip-10-0-5-75 kernel: zfs           D ffff880028084680     0 23467  23462 0x00000000
Aug 20 10:02:48 ip-10-0-5-75 kernel: ffff8803148718d8 0000000000000286 ffffffff810074fd ffff880314871898
Aug 20 10:02:48 ip-10-0-5-75 kernel: ffffffff81007ca2 0000000000000000 0000000000000000 00000000ffffffff
Aug 20 10:02:48 ip-10-0-5-75 kernel: ffff8803ac1145f8 ffff880314871fd8 000000000000fb88 ffff8803ac1145f8
Aug 20 10:02:48 ip-10-0-5-75 kernel: Call Trace:
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff810074fd>] ? xen_force_evtchn_callback+0xd/0x10
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff81007ca2>] ? check_events+0x12/0x20
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff81007c8f>] ? xen_restore_fl_direct_end+0x0/0x1
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff814ec58c>] ? _spin_unlock_irqrestore+0x1c/0x20
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff81090b9e>] ? prepare_to_wait_exclusive+0x4e/0x80
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff810074fd>] ? xen_force_evtchn_callback+0xd/0x10
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa010a635>] cv_wait_common+0x105/0x1c0 [spl]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa023d6c3>] ? spa_lookup+0x53/0x60 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa010a745>] __cv_wait+0x15/0x20 [spl]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa022810b>] rrw_enter_read+0x3b/0x150 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff81007ca2>] ? check_events+0x12/0x20
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa0228240>] rrw_enter+0x20/0x30 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa021898d>] dsl_pool_config_enter+0x1d/0x20 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa0218a1a>] dsl_pool_hold+0x4a/0x60 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff81007c8f>] ? xen_restore_fl_direct_end+0x0/0x1
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa01fd16b>] dmu_objset_hold+0x2b/0xb0 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa021cd8f>] dsl_prop_get+0x3f/0x90 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff8115f977>] ? kfree+0x127/0x320
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa021cdfe>] dsl_prop_get_integer+0x1e/0x20 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa0293fa7>] __zvol_create_minor+0xa7/0x640 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa00fed0b>] ? kmem_free_debug+0x4b/0x150 [spl]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa0294567>] zvol_create_minor+0x27/0x40 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa029458e>] zvol_create_minors_cb+0xe/0x20 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa01faace>] dmu_objset_find_impl+0x36e/0x3e0 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa0294580>] ? zvol_create_minors_cb+0x0/0x20 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa01fa91e>] dmu_objset_find_impl+0x1be/0x3e0 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa0294580>] ? zvol_create_minors_cb+0x0/0x20 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa0294580>] ? zvol_create_minors_cb+0x0/0x20 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa01fab92>] dmu_objset_find+0x52/0x80 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa02930b3>] zvol_create_minors+0x33/0x40 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa0268e7e>] zfs_ioc_snapshot+0x26e/0x2d0 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffffa026a1d7>] zfsdev_ioctl+0x1e7/0x500 [zfs]
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff81189012>] vfs_ioctl+0x22/0xa0
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff811891b4>] do_vfs_ioctl+0x84/0x580
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff81007ca2>] ? check_events+0x12/0x20
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff81189731>] sys_ioctl+0x81/0xa0
Aug 20 10:02:48 ip-10-0-5-75 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

The txg_sync task called rrw_enter(&dp->dp_config_rwlock, RW_WRITER, FTAG); in dsl_sync_task_sync() and then zvol_rename_minors(), which immediately called mutex_lock(&zvol_state_lock). Simultaneously, A userland process started a snapshot and took mutex_lock(&zvol_state_lock) before creating minors. It then called dsl_pool_config_enter(), which did rrw_enter(&dp->dp_config_rwlock, RW_READER, tag);.

The system where this happened is running the latest version of SoftNAS, which has a CentOS 6.5 userland, Linux 3.14.8 and ZoL 0.6.3 with some light patches on top.

@ryao
Copy link
Contributor Author

ryao commented Sep 2, 2014

This is a Linux-specific regression because Illumos does not have anything for renaming zvol snapshots (I am not sure if the snapshots are even accessible). I have a rough plan of action sketched out in my head. The general idea is to make the rename functionality asynchronous. I will just have a couple of lists of zvols to rename (syncing/quiscing style) similar to the txg commit. Then I will kick the system taskq to run this code whenever work is batched. That ought to avoid this deadlock.

@ryao ryao changed the title zfs send can cause zfs snapshot to deadlock the txg_sync thread on zvols zfs rename can cause zfs snapshot to deadlock the txg_sync thread on zvols Sep 2, 2014
@ryao
Copy link
Contributor Author

ryao commented Sep 2, 2014

Closed as duplicate of #2652.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant