Deadlock under heavy I/O using xattrs and deduplication #1657

Closed
wbrown opened this issue Aug 15, 2013 · 12 comments

@wbrown

wbrown commented Aug 15, 2013

I've been having system hangs on my ZoL systems running Gluster -- this typically seems to happen under heavy I/O, and appears to be a race condition of some sort.

The zpool in question has 15 mirrors, an L2ARC SSD, and forced sync to a SAS ZIL.
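
For reference, the layout is roughly equivalent to the sketch below; the device names are placeholders, not the actual devices on this system:

# 15 mirrored pairs plus dedicated cache (L2ARC) and log (ZIL) devices
zpool create tank \
    mirror sda sdb  mirror sdc sdd   # ...and so on for all 15 mirror vdevs
zpool add tank cache sdx             # L2ARC SSD
zpool add tank log sdy               # SAS SSD serving as the ZIL/SLOG
zfs set sync=always tank             # force writes through the ZIL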

[  604.666195] mlx4_core 0000:08:00.0: command 0x5a failed: fw status = 0x2
[13564.897053] INFO: task z_wr_iss/1:4228 blocked for more than 120 seconds.
[13564.926105] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[13564.955875] z_wr_iss/1      D 0000000000000002     0  4228      2 0x00000000
[13564.955880]  ffff881f7e39d7d0 0000000000000046 0000000000014280 ffff881f7e39dfd8
[13564.955889]  ffff881f7e39dfd8 0000000000014280 ffff881f7dd95c20 ffffffff814d0740
[13564.955893]  ffff8837f26da720 0000000000000001 ffff881f7e39d7a8 2222222222222222
[13564.955897] Call Trace:
[13564.955908]  [<ffffffff814d0740>] ? __mutex_lock_slowpath+0x260/0x360
[13564.955942]  [<ffffffffa07c1a40>] ? vdev_config_sync+0x150/0x150 [zfs]
[13564.955952]  [<ffffffff814d0740>] ? __mutex_lock_slowpath+0x260/0x360
[13564.955957]  [<ffffffff814d1bf9>] schedule+0x29/0x70
[13564.955960]  [<ffffffff814d1e9f>] io_schedule+0x8f/0xe0
[13564.955970]  [<ffffffffa06b2376>] cv_wait_common+0xa6/0x1c0 [spl]
[13564.955975]  [<ffffffff8107c3a0>] ? wake_up_bit+0x30/0x30
[13564.955982]  [<ffffffffa06b24e8>] __cv_wait_io+0x18/0x20 [spl]
[13564.955996]  [<ffffffffa07fde7b>] zio_wait+0x11b/0x1c0 [zfs]
[13564.956012]  [<ffffffffa0760d1a>] dbuf_read+0x2ea/0x8a0 [zfs]
[13564.956026]  [<ffffffffa076860a>] dmu_buf_hold+0xea/0x1b0 [zfs]
[13564.956045]  [<ffffffffa07c61a6>] zap_get_leaf_byblk+0x46/0x2c0 [zfs]
[13564.956063]  [<ffffffffa07c6594>] ? zap_idx_to_blk+0xd4/0x130 [zfs]
[13564.956080]  [<ffffffffa07c6654>] zap_deref_leaf+0x64/0x80 [zfs]
[13564.956099]  [<ffffffffa07c79b2>] fzap_length+0x32/0x90 [zfs]
[13564.956116]  [<ffffffffa07cb226>] ? zap_name_alloc_uint64+0x66/0x80 [zfs]
[13564.956132]  [<ffffffffa07cc975>] zap_length_uint64+0x75/0xc0 [zfs]
[13564.956147]  [<ffffffffa076830e>] ddt_zap_lookup+0x5e/0xd0 [zfs]
[13564.956161]  [<ffffffffa07682b0>] ? ddt_zap_prefetch+0x20/0x20 [zfs]
[13564.956174]  [<ffffffffa0767001>] ddt_lookup+0xe1/0x1f0 [zfs]
[13564.956191]  [<ffffffffa07a244d>] ? SHA256Transform+0xad/0x210 [zfs]
[13564.956196]  [<ffffffff814d0740>] ? __mutex_lock_slowpath+0x260/0x360
[13564.956200]  [<ffffffff814d0852>] ? mutex_lock+0x12/0x30
[13564.956214]  [<ffffffffa07fe931>] zio_ddt_write+0x61/0x490 [zfs]
[13564.956227]  [<ffffffffa0800711>] ? zio_checksum_compute+0xd1/0x160 [zfs]
[13564.956243]  [<ffffffffa07fbf66>] zio_execute+0xa6/0x120 [zfs]
[13564.956249]  [<ffffffffa06acb67>] taskq_thread+0x237/0x4b0 [spl]
[13564.956256]  [<ffffffff8108cb80>] ? wake_up_process+0x40/0x40
[13564.956262]  [<ffffffffa06ac930>] ? taskq_cancel_id+0x1f0/0x1f0 [spl]
[13564.956265]  [<ffffffff8107b580>] kthread+0xc0/0xd0
[13564.956270]  [<ffffffff8107b4c0>] ? kthread_create_on_node+0x120/0x120
[13564.956274]  [<ffffffff814da8ec>] ret_from_fork+0x7c/0xb0
[13564.956278]  [<ffffffff8107b4c0>] ? kthread_create_on_node+0x120/0x120
[13564.956283] INFO: task z_wr_iss/15:4242 blocked for more than 120 seconds.
[13564.986062] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[13565.016404] z_wr_iss/15     D 0000000000000002     0  4242      2 0x00000000
[13565.016408]  ffff881f7de937d0 0000000000000046 0000000000014280 ffff881f7de93fd8
[13565.016413]  ffff881f7de93fd8 0000000000014280 ffff880fc89fa180 ffffffff814d0740
[13565.016417]  ffff8837f2455848 0000000000000001 ffff881f7de937a8 2222222222222222
[13565.016424] Call Trace:
[13565.016429]  [<ffffffff814d0740>] ? __mutex_lock_slowpath+0x260/0x360
[13565.016447]  [<ffffffffa07c1a40>] ? vdev_config_sync+0x150/0x150 [zfs]
[13565.016454]  [<ffffffff814d0740>] ? __mutex_lock_slowpath+0x260/0x360
[13565.016458]  [<ffffffff814d1bf9>] schedule+0x29/0x70
[13565.016461]  [<ffffffff814d1e9f>] io_schedule+0x8f/0xe0
[13565.016467]  [<ffffffffa06b2376>] cv_wait_common+0xa6/0x1c0 [spl]
[13565.016471]  [<ffffffff8107c3a0>] ? wake_up_bit+0x30/0x30
[13565.016479]  [<ffffffffa06b24e8>] __cv_wait_io+0x18/0x20 [spl]
[13565.016492]  [<ffffffffa07fde7b>] zio_wait+0x11b/0x1c0 [zfs]
[13565.016507]  [<ffffffffa0760d1a>] dbuf_read+0x2ea/0x8a0 [zfs]
[13565.016521]  [<ffffffffa076860a>] dmu_buf_hold+0xea/0x1b0 [zfs]
[13565.016539]  [<ffffffffa07c61a6>] zap_get_leaf_byblk+0x46/0x2c0 [zfs]
[13565.016557]  [<ffffffffa07c6594>] ? zap_idx_to_blk+0xd4/0x130 [zfs]
[13565.016575]  [<ffffffffa07c6654>] zap_deref_leaf+0x64/0x80 [zfs]
[13565.016592]  [<ffffffffa07c79b2>] fzap_length+0x32/0x90 [zfs]
[13565.016609]  [<ffffffffa07cb226>] ? zap_name_alloc_uint64+0x66/0x80 [zfs]
[13565.016625]  [<ffffffffa07cc975>] zap_length_uint64+0x75/0xc0 [zfs]
[13565.016638]  [<ffffffffa076830e>] ddt_zap_lookup+0x5e/0xd0 [zfs]
[13565.016654]  [<ffffffffa07682b0>] ? ddt_zap_prefetch+0x20/0x20 [zfs]
[13565.016668]  [<ffffffffa0767001>] ddt_lookup+0xe1/0x1f0 [zfs]
[13565.016687]  [<ffffffffa07a23f4>] ? SHA256Transform+0x54/0x210 [zfs]
[13565.016692]  [<ffffffff814d0740>] ? __mutex_lock_slowpath+0x260/0x360
[13565.016696]  [<ffffffff814d0852>] ? mutex_lock+0x12/0x30
[13565.016710]  [<ffffffffa07fe931>] zio_ddt_write+0x61/0x490 [zfs]
[13565.016723]  [<ffffffffa0800711>] ? zio_checksum_compute+0xd1/0x160 [zfs]
[13565.016739]  [<ffffffffa07fbf66>] zio_execute+0xa6/0x120 [zfs]
[13565.016745]  [<ffffffffa06acb67>] taskq_thread+0x237/0x4b0 [spl]
[13565.016752]  [<ffffffff8108cb80>] ? wake_up_process+0x40/0x40
[13565.016759]  [<ffffffffa06ac930>] ? taskq_cancel_id+0x1f0/0x1f0 [spl]
[13565.016762]  [<ffffffff8107b580>] kthread+0xc0/0xd0
[13565.016766]  [<ffffffff8107b4c0>] ? kthread_create_on_node+0x120/0x120
[13565.016769]  [<ffffffff814da8ec>] ret_from_fork+0x7c/0xb0
[13565.016774]  [<ffffffff8107b4c0>] ? kthread_create_on_node+0x120/0x120
[13685.021807] INFO: task glusterfsd:5053 blocked for more than 120 seconds.
[13685.052463] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[13685.083326] glusterfsd      D 0000000000000002     0  5053      1 0x00000000
[13685.083331]  ffff881f6bb3db98 0000000000000086 0000000000014280 ffff881f6bb3dfd8
[13685.083340]  ffff881f6bb3dfd8 0000000000014280 ffff881f7dcc6480 ffff881f6bb3db08
[13685.083345]  ffffffff81086caa 0000000000014280 ffff880fc8b69920 ffff880fc8b69f42
[13685.083349] Call Trace:
[13685.083359]  [<ffffffff81086caa>] ? ttwu_stat+0x9a/0x110
[13685.083366]  [<ffffffff8108c9d9>] ? try_to_wake_up+0x189/0x2f0
[13685.083370]  [<ffffffff8108cb92>] ? default_wake_function+0x12/0x20
[13685.083376]  [<ffffffff814d1bf9>] schedule+0x29/0x70
[13685.083397]  [<ffffffffa06b23dd>] cv_wait_common+0x10d/0x1c0 [spl]
[13685.083403]  [<ffffffff8107c3a0>] ? wake_up_bit+0x30/0x30
[13685.083412]  [<ffffffffa06b24a5>] __cv_wait+0x15/0x20 [spl]
[13685.083434]  [<ffffffffa07b6d73>] txg_wait_open+0xa3/0x130 [zfs]
[13685.083450]  [<ffffffffa0775acd>] dmu_tx_wait+0x10d/0x110 [zfs]
[13685.083464]  [<ffffffffa07ef4d2>] zfs_write+0x362/0xca0 [zfs]
[13685.083473]  [<ffffffff811adc0c>] ? generic_getxattr+0x4c/0x70
[13685.083476]  [<ffffffff811ae048>] ? vfs_getxattr+0x88/0xb0
[13685.083489]  [<ffffffffa08038d2>] zpl_write_common+0x52/0x80 [zfs]
[13685.083502]  [<ffffffffa0803960>] zpl_write+0x60/0x90 [zfs]
[13685.083507]  [<ffffffff8118bb2b>] vfs_write+0x9b/0x170
[13685.083510]  [<ffffffff8118bfa2>] sys_pwrite64+0x72/0xb0
[13685.083515]  [<ffffffff814da99d>] system_call_fastpath+0x1a/0x1f
[13685.083518] INFO: task glusterfsd:5055 blocked for more than 120 seconds.
[13685.114538] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[13685.145946] glusterfsd      D 0000000000000002     0  5055      1 0x00000000
[13685.145949]  ffff883f8b113c98 0000000000000086 0000000000014280 ffff883f8b113fd8
[13685.145956]  ffff883f8b113fd8 0000000000014280 ffff883fc606d3c0 ffff883f8b113c08
[13685.145960]  ffffffff81086caa 0000000000014280 ffff880fc8b69920 ffff880fc8b69f42
[13685.145963] Call Trace:
[13685.145967]  [<ffffffff81086caa>] ? ttwu_stat+0x9a/0x110
[13685.145972]  [<ffffffff8108c9d9>] ? try_to_wake_up+0x189/0x2f0
[13685.145978]  [<ffffffff8108cb92>] ? default_wake_function+0x12/0x20
[13685.145982]  [<ffffffff814d1bf9>] schedule+0x29/0x70
[13685.145990]  [<ffffffffa06b23dd>] cv_wait_common+0x10d/0x1c0 [spl]
[13685.145994]  [<ffffffff8107c3a0>] ? wake_up_bit+0x30/0x30
[13685.146003]  [<ffffffffa06b24a5>] __cv_wait+0x15/0x20 [spl]
[13685.146019]  [<ffffffffa07b6d73>] txg_wait_open+0xa3/0x130 [zfs]
[13685.146035]  [<ffffffffa0775acd>] dmu_tx_wait+0x10d/0x110 [zfs]
[13685.146048]  [<ffffffffa07ea2fa>] zfs_create+0x28a/0x6f0 [zfs]
[13685.146064]  [<ffffffffa08043cd>] zpl_create+0x9d/0x130 [zfs]
[13685.146069]  [<ffffffff811975b4>] vfs_create+0xb4/0x120
[13685.146073]  [<ffffffff8119b368>] sys_mknodat+0x1e8/0x210
[13685.146076]  [<ffffffff8119b3ad>] sys_mknod+0x1d/0x20
[13685.146079]  [<ffffffff814da99d>] system_call_fastpath+0x1a/0x1f
[13685.146082] INFO: task glusterfsd:5056 blocked for more than 120 seconds.
[13685.177854] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[13685.210395] glusterfsd      D 0000000000000002     0  5056      1 0x00000000
[13685.210399]  ffff880e02057b98 0000000000000086 0000000000014280 ffff880e02057fd8
[13685.210406]  ffff880e02057fd8 0000000000014280 ffff880d922a9920 ffff880e02057b08
[13685.210410]  ffffffff81086caa 0000000000014280 ffff880fc8b69920 ffff880fc8b69f42
[13685.210413] Call Trace:
[13685.210418]  [<ffffffff81086caa>] ? ttwu_stat+0x9a/0x110
[13685.210422]  [<ffffffff8108c9d9>] ? try_to_wake_up+0x189/0x2f0
[13685.210427]  [<ffffffff8108cb92>] ? default_wake_function+0x12/0x20
[13685.210432]  [<ffffffff814d1bf9>] schedule+0x29/0x70
[13685.210440]  [<ffffffffa06b23dd>] cv_wait_common+0x10d/0x1c0 [spl]
[13685.210443]  [<ffffffff8107c3a0>] ? wake_up_bit+0x30/0x30
[13685.210449]  [<ffffffffa06b24a5>] __cv_wait+0x15/0x20 [spl]
[13685.210468]  [<ffffffffa07b6d73>] txg_wait_open+0xa3/0x130 [zfs]
[13685.210482]  [<ffffffffa0775acd>] dmu_tx_wait+0x10d/0x110 [zfs]
[13685.210494]  [<ffffffffa07ef4d2>] zfs_write+0x362/0xca0 [zfs]
[13685.210498]  [<ffffffff811adc0c>] ? generic_getxattr+0x4c/0x70
[13685.210501]  [<ffffffff811ae048>] ? vfs_getxattr+0x88/0xb0
[13685.210515]  [<ffffffffa08038d2>] zpl_write_common+0x52/0x80 [zfs]
[13685.210528]  [<ffffffffa0803960>] zpl_write+0x60/0x90 [zfs]
[13685.210532]  [<ffffffff8118bb2b>] vfs_write+0x9b/0x170
[13685.210536]  [<ffffffff8118bfa2>] sys_pwrite64+0x72/0xb0
[13685.210542]  [<ffffffff814da99d>] system_call_fastpath+0x1a/0x1f
[13685.210546] INFO: task glusterfsd:5059 blocked for more than 120 seconds.
[13685.243453] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[13685.277032] glusterfsd      D 0000000000000002     0  5059      1 0x00000000
[13685.277036]  ffff881f6bbb1b98 0000000000000086 0000000000014280 ffff881f6bbb1fd8
[13685.277041]  ffff881f6bbb1fd8 0000000000014280 ffff881f7d9eece0 ffff881f6bbb1b08
[13685.277044]  ffffffff81086caa 0000000000014280 ffff882fc7f88000 ffff882fc7f88622
[13685.277048] Call Trace:
[13685.277052]  [<ffffffff81086caa>] ? ttwu_stat+0x9a/0x110
[13685.277056]  [<ffffffff8108c9d9>] ? try_to_wake_up+0x189/0x2f0
[13685.277061]  [<ffffffff8108cb92>] ? default_wake_function+0x12/0x20
[13685.277065]  [<ffffffff814d1bf9>] schedule+0x29/0x70
[13685.277073]  [<ffffffffa06b23dd>] cv_wait_common+0x10d/0x1c0 [spl]
[13685.277077]  [<ffffffff8107c3a0>] ? wake_up_bit+0x30/0x30
[13685.277084]  [<ffffffffa06b24a5>] __cv_wait+0x15/0x20 [spl]
[13685.277100]  [<ffffffffa07b6d73>] txg_wait_open+0xa3/0x130 [zfs]
[13685.277114]  [<ffffffffa0775acd>] dmu_tx_wait+0x10d/0x110 [zfs]
[13685.277126]  [<ffffffffa07ef4d2>] zfs_write+0x362/0xca0 [zfs]
[13685.277130]  [<ffffffff811adc0c>] ? generic_getxattr+0x4c/0x70
[13685.277133]  [<ffffffff811ae048>] ? vfs_getxattr+0x88/0xb0
[13685.277146]  [<ffffffffa08038d2>] zpl_write_common+0x52/0x80 [zfs]
[13685.277158]  [<ffffffffa0803960>] zpl_write+0x60/0x90 [zfs]
[13685.277162]  [<ffffffff8118bb2b>] vfs_write+0x9b/0x170
[13685.277165]  [<ffffffff8118bfa2>] sys_pwrite64+0x72/0xb0
[13685.277169]  [<ffffffff814da99d>] system_call_fastpath+0x1a/0x1f
[13685.277172] INFO: task glusterfsd:5061 blocked for more than 120 seconds.
[13685.311319] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[13685.345908] glusterfsd      D 0000000000000002     0  5061      1 0x00000000
[13685.345912]  ffff880e038dfb98 0000000000000086 0000000000014280 ffff880e038dffd8
[13685.345917]  ffff880e038dffd8 0000000000014280 ffff880d922aa9e0 ffff880e038dfb08
[13685.345920]  ffffffff81086caa 0000000000014280 ffff880fc8b69920 ffff880fc8b69f42
[13685.345927] Call Trace:
[13685.345931]  [<ffffffff81086caa>] ? ttwu_stat+0x9a/0x110
[13685.345935]  [<ffffffff8108c9d9>] ? try_to_wake_up+0x189/0x2f0
[13685.345939]  [<ffffffff8108cb92>] ? default_wake_function+0x12/0x20
[13685.345942]  [<ffffffff814d1bf9>] schedule+0x29/0x70
[13685.345950]  [<ffffffffa06b23dd>] cv_wait_common+0x10d/0x1c0 [spl]
[13685.345955]  [<ffffffff8107c3a0>] ? wake_up_bit+0x30/0x30
[13685.345961]  [<ffffffffa06b24a5>] __cv_wait+0x15/0x20 [spl]
[13685.345977]  [<ffffffffa07b6d73>] txg_wait_open+0xa3/0x130 [zfs]
[13685.345994]  [<ffffffffa0775acd>] dmu_tx_wait+0x10d/0x110 [zfs]
[13685.346007]  [<ffffffffa07ef4d2>] zfs_write+0x362/0xca0 [zfs]
[13685.346013]  [<ffffffff811adc0c>] ? generic_getxattr+0x4c/0x70
[13685.346017]  [<ffffffff811ae048>] ? vfs_getxattr+0x88/0xb0
[13685.346028]  [<ffffffffa08038d2>] zpl_write_common+0x52/0x80 [zfs]
[13685.346041]  [<ffffffffa0803960>] zpl_write+0x60/0x90 [zfs]
[13685.346046]  [<ffffffff8118bb2b>] vfs_write+0x9b/0x170
[13685.346049]  [<ffffffff8118bfa2>] sys_pwrite64+0x72/0xb0
[13685.346053]  [<ffffffff814da99d>] system_call_fastpath+0x1a/0x1f
[13685.346061] INFO: task glusterfsd:5160 blocked for more than 120 seconds.
[13685.381021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[13685.416679] glusterfsd      D 0000000000000002     0  5160      1 0x00000000
[13685.416682]  ffff88313286bb10 0000000000000086 0000000000014280 ffff88313286bfd8
[13685.416689]  ffff88313286bfd8 0000000000014280 ffff883f4f1ab240 ffff88313286ba80
[13685.416693]  ffffffff81086caa 0000000000014280 ffff880fc8b69920 ffff880fc8b69f42
[13685.416697] Call Trace:
[13685.416726]  [<ffffffff81086caa>] ? ttwu_stat+0x9a/0x110
[13685.416732]  [<ffffffff8108c9d9>] ? try_to_wake_up+0x189/0x2f0
[13685.416737]  [<ffffffff8108cb92>] ? default_wake_function+0x12/0x20
[13685.416740]  [<ffffffff814d1bf9>] schedule+0x29/0x70
[13685.416747]  [<ffffffffa06b23dd>] cv_wait_common+0x10d/0x1c0 [spl]
[13685.416753]  [<ffffffff8107c3a0>] ? wake_up_bit+0x30/0x30
[13685.416760]  [<ffffffffa06b24a5>] __cv_wait+0x15/0x20 [spl]
[13685.416776]  [<ffffffffa07b6d73>] txg_wait_open+0xa3/0x130 [zfs]
[13685.416791]  [<ffffffffa0775acd>] dmu_tx_wait+0x10d/0x110 [zfs]
[13685.416804]  [<ffffffffa07f0436>] zfs_rename+0x436/0xf50 [zfs]
[13685.416810]  [<ffffffff81196d45>] ? link_path_walk+0x255/0x930
[13685.416815]  [<ffffffff810642f5>] ? inode_capable+0x15/0x20
[13685.416819]  [<ffffffff811753a2>] ? __kmalloc_track_caller+0x32/0x230
[13685.416831]  [<ffffffffa0803c4f>] zpl_rename+0x4f/0x70 [zfs]
[13685.416837]  [<ffffffff8119aa8a>] vfs_rename+0x12a/0x4b0
[13685.416841]  [<ffffffff8119bd09>] sys_renameat+0x439/0x470
[13685.416845]  [<ffffffff8119b011>] ? user_path_at+0x11/0x20
[13685.416849]  [<ffffffff811908d7>] ? sys_newlstat+0x27/0x40
[13685.416853]  [<ffffffff8119bd5b>] sys_rename+0x1b/0x20
[13685.416856]  [<ffffffff814da99d>] system_call_fastpath+0x1a/0x1f
[13685.416867] INFO: task glusterfsd:4881 blocked for more than 120 seconds.
[13685.453112] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[13685.489782] glusterfsd      D 0000000000000002     0  4881      1 0x00000000
[13685.489786]  ffff881fc2175ab0 0000000000000082 0000000000014280 ffff881fc2175fd8
[13685.489790]  ffff881fc2175fd8 0000000000014280 ffff881f7d9e90c0 ffffffff814d0740
[13685.489794]  ffff88182c6b3b80 2222222222222222 2222222222222222 2222222222222222
[13685.489798] Call Trace:
[13685.489802]  [<ffffffff814d0740>] ? __mutex_lock_slowpath+0x260/0x360
[13685.489806]  [<ffffffff814d04de>] ? mutex_unlock+0xe/0x10
[13685.489811]  [<ffffffff814d0740>] ? __mutex_lock_slowpath+0x260/0x360
[13685.489815]  [<ffffffff811a569b>] ? iput+0x3b/0x190
[13685.489818]  [<ffffffff814d1bf9>] schedule+0x29/0x70
[13685.489826]  [<ffffffffa06b23dd>] cv_wait_common+0x10d/0x1c0 [spl]
[13685.489830]  [<ffffffff8107c3a0>] ? wake_up_bit+0x30/0x30
[13685.489835]  [<ffffffffa06b24a5>] __cv_wait+0x15/0x20 [spl]
[13685.489850]  [<ffffffffa07f889b>] zil_commit.part.12+0x7b/0x720 [zfs]
[13685.489863]  [<ffffffffa07f94f7>] zil_commit+0x17/0x20 [zfs]
[13685.489876]  [<ffffffffa07e91c7>] zfs_read+0x417/0x4b0 [zfs]
[13685.489881]  [<ffffffffa06b494c>] ? tsd_exit+0x28c/0x2c0 [spl]
[13685.489893]  [<ffffffffa08037c2>] zpl_read_common+0x52/0x80 [zfs]
[13685.489904]  [<ffffffffa0804a59>] __zpl_xattr_get+0x1e9/0x240 [zfs]
[13685.489916]  [<ffffffffa0804b0c>] zpl_xattr_get+0x5c/0x130 [zfs]
[13685.489928]  [<ffffffffa0804cd1>] zpl_xattr_trusted_get+0x61/0x90 [zfs]
[13685.489931]  [<ffffffff811adc0c>] generic_getxattr+0x4c/0x70
[13685.489934]  [<ffffffff811ae048>] vfs_getxattr+0x88/0xb0
[13685.489937]  [<ffffffff811ae2cb>] getxattr+0xab/0x1d0
[13685.489949]  [<ffffffffa07ec30c>] ? zfs_getattr_fast+0xec/0x170 [zfs]
[13685.489953]  [<ffffffff811904bb>] ? cp_new_stat+0x10b/0x120
[13685.489957]  [<ffffffff811aef21>] sys_fgetxattr+0x61/0xc0
[13685.489961]  [<ffffffff814da99d>] system_call_fastpath+0x1a/0x1f
[13685.489964] INFO: task glusterfsd:4946 blocked for more than 120 seconds.
[13685.526669] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[13685.563752] glusterfsd      D 0000000000000002     0  4946      1 0x00000000
[13685.563756]  ffff880d883edab0 0000000000000082 0000000000014280 ffff880d883edfd8
[13685.563760]  ffff880d883edfd8 0000000000014280 ffff880fc912dc20 2222222200070008
[13685.563763]  ffff883694fc7230 2222222200080001 ffff8814ca4ea4a0 0000000000000000
[13685.563767] Call Trace:
[13685.563772]  [<ffffffff814d0740>] ? __mutex_lock_slowpath+0x260/0x360
[13685.563775]  [<ffffffff814d1bf9>] schedule+0x29/0x70
[13685.563782]  [<ffffffffa06b23dd>] cv_wait_common+0x10d/0x1c0 [spl]
[13685.563786]  [<ffffffff8107c3a0>] ? wake_up_bit+0x30/0x30
[13685.563792]  [<ffffffffa06b24a5>] __cv_wait+0x15/0x20 [spl]
[13685.563806]  [<ffffffffa07f889b>] zil_commit.part.12+0x7b/0x720 [zfs]
[13685.563819]  [<ffffffffa07f94f7>] zil_commit+0x17/0x20 [zfs]
[13685.563833]  [<ffffffffa07e91c7>] zfs_read+0x417/0x4b0 [zfs]
[13685.563838]  [<ffffffffa06b494c>] ? tsd_exit+0x28c/0x2c0 [spl]
[13685.563851]  [<ffffffffa08037c2>] zpl_read_common+0x52/0x80 [zfs]
[13685.563864]  [<ffffffffa0804a59>] __zpl_xattr_get+0x1e9/0x240 [zfs]
[13685.563877]  [<ffffffffa0804b0c>] zpl_xattr_get+0x5c/0x130 [zfs]
[13685.563890]  [<ffffffffa0804cd1>] zpl_xattr_trusted_get+0x61/0x90 [zfs]
[13685.563893]  [<ffffffff811adc0c>] generic_getxattr+0x4c/0x70
[13685.563897]  [<ffffffff811ae048>] vfs_getxattr+0x88/0xb0
[13685.563900]  [<ffffffff811ae2cb>] getxattr+0xab/0x1d0
[13685.563914]  [<ffffffffa07ec30c>] ? zfs_getattr_fast+0xec/0x170 [zfs]
[13685.563917]  [<ffffffff811904bb>] ? cp_new_stat+0x10b/0x120
[13685.563920]  [<ffffffff811aef21>] sys_fgetxattr+0x61/0xc0
[13685.563924]  [<ffffffff814da99d>] system_call_fastpath+0x1a/0x1f
@ryao
Contributor

ryao commented Aug 16, 2013

Which version of ZFSOnLinux are you using and which distribution?

@wbrown
Author

wbrown commented Aug 16, 2013

The latest source release download, 0.6.1.

Linux is 3.9.10 with Infiniband and a 1000 Hz kernel. It's not a stock distribution; it's derived from Arch Linux with a pretty custom stack.

@behlendorf
Contributor

@wbrown It appears the system is blocked reading deduplication table entries from disk. When it hangs like this, are you still seeing lots of I/O being issued, or does the system appear idle?

@wbrown
Author

wbrown commented Aug 16, 2013

When it hangs like this, I/O comes to a near standstill; zero writes and reads across the board to the entire pool, except for a few twitches.

@behlendorf
Contributor

@wbrown So what's happening here is that a critical I/O is not completing (either successfully or with a failure). There have been a handful of reports of this sort of problem, and the root cause in most instances is misbehaving hardware, typically under heavy load. However, similar problems do occasionally get reported on the other ZFS platforms, so there may be a common bug here.

What's never been easy to determine in these situations is exactly what state the I/O is in. If you're running fairly recent ZoL code, it would be helpful to dump the zpool events -v output when this occurs. That should give us some additional information on what state the I/Os are in.

@wbrown
Author

wbrown commented Aug 16, 2013

Next time I see this hang, I'll do a zpool events -v and see what's going on there.

Some data points:

  • We have terabytes of deduplicated data being written to at high data rates, along with non-deduplicated data.
  • We have zfs_arc_max set to about 160GB and zfs_arc_meta_limit set to 128GB on a 256GB system.
  • We have a 200GB SAS SSD for L2ARC, with secondarycache set to metadata.
  • zfs_txg_timeout being set to 30, along with zfs_txg_synctime_ms set to 5000, may be affecting this (these tunables are sketched after this list). That was in response to performance issues interacting with Gluster with very busy z_wr_iss threads: when the transaction group is flushed out, write I/O is blocked, and the Gluster daemons appear to do a fast busy-poll while they are blocked.
  • When these hangs happen, they happen under extremely heavy load, to the tune of 200-400MB/sec of disk I/O.
  • Altering kmem_cache_expire to address another issue, where spl_kmem_cache completely pegged CPU0 and blocked I/O, may be affecting this as well.
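
For concreteness, the tunables above are set roughly like this; the pool name is a placeholder, and the byte values match the sizes described above:

# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=171798691840        # ~160GB ARC cap
options zfs zfs_arc_meta_limit=137438953472 # 128GB ARC metadata limit
options zfs zfs_txg_timeout=30
options zfs zfs_txg_synctime_ms=5000

# L2ARC restricted to metadata (for the deduplication tables)
zfs set secondarycache=metadata tank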

@wbrown
Author

wbrown commented Aug 19, 2013

One thing I'd altered was disabling the zfs_dedup_prefetch flag. This seems to help, but I just got hit by this on one of my systems:

[80287.854438] INFO: task kswapd2:102 blocked for more than 120 seconds.
[80287.884260] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[80287.914664] kswapd2         D 0000000000000246     0   102      2 0x00000000
[80287.914669]  ffff880fc8d15a80 0000000000000046 0000000000014280 ffff880fc8d15fd8
[80287.914675]  ffff880fc8d15fd8 0000000000014280 ffff883fc816d3c0 ffffffff810870b9
[80287.914679]  0000000000000000 ffff882fdfc14280 ffff883fc74d3f00 0000000000000001
[80287.914683] Call Trace:
[80287.914693]  [<ffffffff810870b9>] ? finish_task_switch+0x49/0xe0
[80287.914699]  [<ffffffff814d1686>] ? __schedule+0x3f6/0x940
[80287.914703]  [<ffffffff814d0740>] ? __mutex_lock_slowpath+0x260/0x360
[80287.914706]  [<ffffffff814d1bf9>] schedule+0x29/0x70
[80287.914709]  [<ffffffff814d2033>] schedule_preempt_disabled+0x23/0x30
[80287.914712]  [<ffffffff814d0639>] __mutex_lock_slowpath+0x159/0x360
[80287.914716]  [<ffffffff814d0852>] mutex_lock+0x12/0x30
[80287.914747]  [<ffffffffa07c5b0d>] zfs_zinactive+0x6d/0x150 [zfs]
[80287.914761]  [<ffffffffa07be9f4>] zfs_inactive+0x64/0x210 [zfs]
[80287.914767]  [<ffffffff81130819>] ? truncate_pagecache+0x59/0x60
[80287.914781]  [<ffffffffa07d4654>] zpl_evict_inode+0x24/0x30 [zfs]
[80287.914785]  [<ffffffff811a4eb0>] evict+0xb0/0x1b0
[80287.914788]  [<ffffffff811a4fe9>] dispose_list+0x39/0x50
[80287.914791]  [<ffffffff811a5f43>] prune_icache_sb+0x173/0x340
[80287.914796]  [<ffffffff8118ed06>] prune_super+0xd6/0x1a0
[80287.914800]  [<ffffffff811338d1>] shrink_slab+0x161/0x350
[80287.914805]  [<ffffffff81180a59>] ? mem_cgroup_iter+0x189/0x240
[80287.914809]  [<ffffffff8113715e>] balance_pgdat+0x45e/0x5a0
[80287.914814]  [<ffffffff81137401>] kswapd+0x161/0x470
[80287.914818]  [<ffffffff8107c3a0>] ? wake_up_bit+0x30/0x30
[80287.914822]  [<ffffffff811372a0>] ? balance_pgdat+0x5a0/0x5a0
[80287.914825]  [<ffffffff8107b580>] kthread+0xc0/0xd0
[80287.914829]  [<ffffffff8107b4c0>] ? kthread_create_on_node+0x120/0x120
[80287.914833]  [<ffffffff814da8ec>] ret_from_fork+0x7c/0xb0
[80287.914836]  [<ffffffff8107b4c0>] ? kthread_create_on_node+0x120/0x120
[80287.914902] INFO: task z_fr_iss_0/0:4667 blocked for more than 120 seconds.
[80287.945619] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[80287.976675] z_fr_iss_0/0    D 0000000000000246     0  4667      2 0x00000000
[80287.976681]  ffff880effe5bc90 0000000000000046 0000000000014280 ffff880effe5bfd8
[80287.976687]  ffff880effe5bfd8 0000000000014280 ffff880eff8ff540 ffff880effe5bbd0
[80287.976692]  ffffffff81090508 ffff880ef24008a8 ffff880eff8ff588 ffff880eff8ff540
[80287.976696] Call Trace:
[80287.976703]  [<ffffffff81090508>] ? __enqueue_entity+0x78/0x80
[80287.976709]  [<ffffffff81091109>] ? wakeup_gran.isra.35+0x29/0x30
[80287.976712]  [<ffffffff81093bd8>] ? check_preempt_wakeup+0x1a8/0x270
[80287.976718]  [<ffffffff8108a715>] ? check_preempt_curr+0x85/0xa0
[80287.976722]  [<ffffffff8108a75c>] ? ttwu_do_wakeup+0x2c/0x120
[80287.976726]  [<ffffffff814d1bf9>] schedule+0x29/0x70
[80287.976731]  [<ffffffff814d2033>] schedule_preempt_disabled+0x23/0x30
[80287.976735]  [<ffffffff814d0639>] __mutex_lock_slowpath+0x159/0x360
[80287.976740]  [<ffffffff814d0852>] mutex_lock+0x12/0x30
[80287.976759]  [<ffffffffa076bf71>] metaslab_free_dva+0xc1/0x1c0 [zfs]
[80287.976779]  [<ffffffffa076ddcd>] metaslab_free+0x8d/0xc0 [zfs]
[80287.976794]  [<ffffffffa07cad2c>] zio_dva_free+0x1c/0x30 [zfs]
[80287.976810]  [<ffffffffa07cbf66>] zio_execute+0xa6/0x120 [zfs]
[80287.976820]  [<ffffffffa067cb67>] taskq_thread+0x237/0x4b0 [spl]
[80287.976826]  [<ffffffff8108cb80>] ? wake_up_process+0x40/0x40
[80287.976833]  [<ffffffffa067c930>] ? taskq_cancel_id+0x1f0/0x1f0 [spl]
[80287.976838]  [<ffffffff8107b580>] kthread+0xc0/0xd0
[80287.976842]  [<ffffffff8107b4c0>] ? kthread_create_on_node+0x120/0x120
[80287.976846]  [<ffffffff814da8ec>] ret_from_fork+0x7c/0xb0
[80287.976850]  [<ffffffff8107b4c0>] ? kthread_create_on_node+0x120/0x120
[80287.976854] INFO: task z_fr_iss_0/1:4668 blocked for more than 120 seconds.
[80288.007867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[80288.039347] z_fr_iss_0/1    D 0000000000000246     0  4668      2 0x00000000
[80288.039352]  ffff880effe71c90 0000000000000046 0000000000014280 ffff880effe71fd8
[80288.039357]  ffff880effe71fd8 0000000000014280 ffff880fc7c76ce0 ffff880effe71bd0
[80288.039362]  ffffffff81090508 ffff880fc7c70048 ffff880fc7c76d28 ffff880fc7c76ce0
[80288.039366] Call Trace:
[80288.039371]  [<ffffffff81090508>] ? __enqueue_entity+0x78/0x80
[80288.039375]  [<ffffffff81091109>] ? wakeup_gran.isra.35+0x29/0x30
[80288.039380]  [<ffffffff81093bd8>] ? check_preempt_wakeup+0x1a8/0x270
[80288.039385]  [<ffffffff8108a715>] ? check_preempt_curr+0x85/0xa0
[80288.039389]  [<ffffffff8108a75c>] ? ttwu_do_wakeup+0x2c/0x120
[80288.039394]  [<ffffffff814d1bf9>] schedule+0x29/0x70
[80288.039399]  [<ffffffff814d2033>] schedule_preempt_disabled+0x23/0x30
[80288.039402]  [<ffffffff814d0639>] __mutex_lock_slowpath+0x159/0x360
[80288.039407]  [<ffffffff814d0852>] mutex_lock+0x12/0x30
[80288.039426]  [<ffffffffa076bf71>] metaslab_free_dva+0xc1/0x1c0 [zfs]
[80288.039450]  [<ffffffffa076ddcd>] metaslab_free+0x8d/0xc0 [zfs]
[80288.039465]  [<ffffffffa07cad2c>] zio_dva_free+0x1c/0x30 [zfs]
[80288.039480]  [<ffffffffa07cbf66>] zio_execute+0xa6/0x120 [zfs]
[80288.039488]  [<ffffffffa067cb67>] taskq_thread+0x237/0x4b0 [spl]
[80288.039494]  [<ffffffff8108cb80>] ? wake_up_process+0x40/0x40
[80288.039501]  [<ffffffffa067c930>] ? taskq_cancel_id+0x1f0/0x1f0 [spl]
[80288.039507]  [<ffffffff8107b580>] kthread+0xc0/0xd0
[80288.039512]  [<ffffffff8107b4c0>] ? kthread_create_on_node+0x120/0x120
[80288.039515]  [<ffffffff814da8ec>] ret_from_fork+0x7c/0xb0
[80288.039520]  [<ffffffff8107b4c0>] ? kthread_create_on_node+0x120/0x120
[80288.039524] INFO: task z_fr_iss_0/3:4670 blocked for more than 120 seconds.
[80288.071194] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[80288.103446] z_fr_iss_0/3    D 0000000000000246     0  4670      2 0x00000000
[80288.103450]  ffff880effec5c90 0000000000000046 0000000000014280 ffff880effec5fd8
[80288.103456]  ffff880effec5fd8 0000000000014280 ffff880ef2400000 ffff880effec5bd0
[80288.103460]  ffffffff81090508 ffff880ef2407588 ffff880ef2400048 ffff880ef2400000
[80288.103465] Call Trace:
[80288.103471]  [<ffffffff81090508>] ? __enqueue_entity+0x78/0x80
[80288.103474]  [<ffffffff81091109>] ? wakeup_gran.isra.35+0x29/0x30
[80288.103477]  [<ffffffff81093bd8>] ? check_preempt_wakeup+0x1a8/0x270
[80288.103483]  [<ffffffff8108a715>] ? check_preempt_curr+0x85/0xa0
[80288.103486]  [<ffffffff814d1bf9>] schedule+0x29/0x70
[80288.103492]  [<ffffffff814d2033>] schedule_preempt_disabled+0x23/0x30
[80288.103495]  [<ffffffff814d0639>] __mutex_lock_slowpath+0x159/0x360
[80288.103498]  [<ffffffff814d0852>] mutex_lock+0x12/0x30
[80288.103517]  [<ffffffffa076bf71>] metaslab_free_dva+0xc1/0x1c0 [zfs]
[80288.103536]  [<ffffffffa076ddcd>] metaslab_free+0x8d/0xc0 [zfs]
[80288.103552]  [<ffffffffa07cad2c>] zio_dva_free+0x1c/0x30 [zfs]
[80288.103567]  [<ffffffffa07cbf66>] zio_execute+0xa6/0x120 [zfs]
[80288.103575]  [<ffffffffa067cb67>] taskq_thread+0x237/0x4b0 [spl]
[80288.103581]  [<ffffffff8108cb80>] ? wake_up_process+0x40/0x40
[80288.103588]  [<ffffffffa067c930>] ? taskq_cancel_id+0x1f0/0x1f0 [spl]
[80288.103593]  [<ffffffff8107b580>] kthread+0xc0/0xd0
[80288.103598]  [<ffffffff8107b4c0>] ? kthread_create_on_node+0x120/0x120
[80288.103601]  [<ffffffff814da8ec>] ret_from_fork+0x7c/0xb0
[80288.103606]  [<ffffffff8107b4c0>] ? kthread_create_on_node+0x120/0x120
[80288.103610] INFO: task z_fr_iss_1/0:4671 blocked for more than 120 seconds.
[80288.136473] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[80288.169723] z_fr_iss_1/0    D 0000000000000246     0  4671      2 0x00000000
[80288.169726]  ffff880effec7c90 0000000000000046 0000000000014280 ffff880effec7fd8
[80288.169733]  ffff880effec7fd8 0000000000014280 ffff880ef2400860 ffff880effec7bd0
[80288.169738]  ffffffff81090508 ffff880ef2402a28 ffff880ef24008a8 ffff880ef2400860
[80288.169744] Call Trace:
[80288.169748]  [<ffffffff81090508>] ? __enqueue_entity+0x78/0x80
[80288.169754]  [<ffffffff81091109>] ? wakeup_gran.isra.35+0x29/0x30
[80288.169758]  [<ffffffff81093bd8>] ? check_preempt_wakeup+0x1a8/0x270
[80288.169762]  [<ffffffff814d1bf9>] schedule+0x29/0x70
[80288.169768]  [<ffffffff814d2033>] schedule_preempt_disabled+0x23/0x30
[80288.169771]  [<ffffffff814d0639>] __mutex_lock_slowpath+0x159/0x360
[80288.169776]  [<ffffffff814d0852>] mutex_lock+0x12/0x30
[80288.169795]  [<ffffffffa076bf71>] metaslab_free_dva+0xc1/0x1c0 [zfs]
[80288.169814]  [<ffffffffa076ddcd>] metaslab_free+0x8d/0xc0 [zfs]
[80288.169829]  [<ffffffffa07cad2c>] zio_dva_free+0x1c/0x30 [zfs]
[80288.169844]  [<ffffffffa07cbf66>] zio_execute+0xa6/0x120 [zfs]
[80288.169853]  [<ffffffffa067cb67>] taskq_thread+0x237/0x4b0 [spl]
[80288.169859]  [<ffffffff8108cb80>] ? wake_up_process+0x40/0x40
[80288.169867]  [<ffffffffa067c930>] ? taskq_cancel_id+0x1f0/0x1f0 [spl]
[80288.169872]  [<ffffffff8107b580>] kthread+0xc0/0xd0
[80288.169877]  [<ffffffff8107b4c0>] ? kthread_create_on_node+0x120/0x120
[80288.169881]  [<ffffffff814da8ec>] ret_from_fork+0x7c/0xb0
[80288.169886]  [<ffffffff8107b4c0>] ? kthread_create_on_node+0x120/0x120
[80288.169890] INFO: task z_fr_iss_1/1:4672 blocked for more than 120 seconds.
[80288.203696] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[80288.238487] z_fr_iss_1/1    D 0000000000000246     0  4672      2 0x00000000
[80288.238491]  ffff880effee5c90 0000000000000046 0000000000014280 ffff880effee5fd8
[80288.238496]  ffff880effee5fd8 0000000000014280 ffff880fc7c74b60 ffff880effee5bd0
[80288.238499]  ffffffff81090508 ffff880fc7c71108 ffff883fc816e4c8 ffff883fc816e480
[80288.238502] Call Trace:
[80288.238507]  [<ffffffff81090508>] ? __enqueue_entity+0x78/0x80
[80288.238510]  [<ffffffff81091109>] ? wakeup_gran.isra.35+0x29/0x30
[80288.238513]  [<ffffffff81093bd8>] ? check_preempt_wakeup+0x1a8/0x270
[80288.238517]  [<ffffffff8108a715>] ? check_preempt_curr+0x85/0xa0
[80288.238520]  [<ffffffff814d1bf9>] schedule+0x29/0x70
[80288.238523]  [<ffffffff814d2033>] schedule_preempt_disabled+0x23/0x30
[80288.238525]  [<ffffffff814d0639>] __mutex_lock_slowpath+0x159/0x360
[80288.238528]  [<ffffffff814d0852>] mutex_lock+0x12/0x30
[80288.238548]  [<ffffffffa076bf71>] metaslab_free_dva+0xc1/0x1c0 [zfs]
[80288.238562]  [<ffffffffa076ddcd>] metaslab_free+0x8d/0xc0 [zfs]
[80288.238573]  [<ffffffffa07cad2c>] zio_dva_free+0x1c/0x30 [zfs]
[80288.238584]  [<ffffffffa07cbf66>] zio_execute+0xa6/0x120 [zfs]
[80288.238590]  [<ffffffffa067cb67>] taskq_thread+0x237/0x4b0 [spl]
[80288.238594]  [<ffffffff8108cb80>] ? wake_up_process+0x40/0x40
[80288.238600]  [<ffffffffa067c930>] ? taskq_cancel_id+0x1f0/0x1f0 [spl]
[80288.238603]  [<ffffffff8107b580>] kthread+0xc0/0xd0
[80288.238606]  [<ffffffff8107b4c0>] ? kthread_create_on_node+0x120/0x120
[80288.238609]  [<ffffffff814da8ec>] ret_from_fork+0x7c/0xb0
[80288.238612]  [<ffffffff8107b4c0>] ? kthread_create_on_node+0x120/0x120
[80288.238616] INFO: task z_fr_iss_1/3:4674 blocked for more than 120 seconds.
[80288.273367] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[80288.308707] z_fr_iss_1/3    D 0000000000000246     0  4674      2 0x00000000
[80288.308712]  ffff880faf93dc90 0000000000000046 0000000000014280 ffff880faf93dfd8
[80288.308717]  ffff880faf93dfd8 0000000000014280 ffff880ef24010c0 ffff880faf93dbd0
[80288.308720]  ffffffff81090508 ffff880ef2400048 ffff880ef2401108 ffff880ef24010c0
[80288.308723] Call Trace:
[80288.308732]  [<ffffffff81090508>] ? __enqueue_entity+0x78/0x80
[80288.308735]  [<ffffffff81091109>] ? wakeup_gran.isra.35+0x29/0x30
[80288.308738]  [<ffffffff81093bd8>] ? check_preempt_wakeup+0x1a8/0x270
[80288.308742]  [<ffffffff8108a715>] ? check_preempt_curr+0x85/0xa0
[80288.308750]  [<ffffffff814d1bf9>] schedule+0x29/0x70
[80288.308753]  [<ffffffff814d2033>] schedule_preempt_disabled+0x23/0x30
[80288.308756]  [<ffffffff814d0639>] __mutex_lock_slowpath+0x159/0x360
[80288.308759]  [<ffffffff814d0852>] mutex_lock+0x12/0x30
[80288.308789]  [<ffffffffa076bf71>] metaslab_free_dva+0xc1/0x1c0 [zfs]
[80288.308804]  [<ffffffffa076ddcd>] metaslab_free+0x8d/0xc0 [zfs]
[80288.308816]  [<ffffffffa07cad2c>] zio_dva_free+0x1c/0x30 [zfs]
[80288.308827]  [<ffffffffa07cbf66>] zio_execute+0xa6/0x120 [zfs]
[80288.308835]  [<ffffffffa067cb67>] taskq_thread+0x237/0x4b0 [spl]
[80288.308839]  [<ffffffff8108cb80>] ? wake_up_process+0x40/0x40
[80288.308845]  [<ffffffffa067c930>] ? taskq_cancel_id+0x1f0/0x1f0 [spl]
[80288.308849]  [<ffffffff8107b580>] kthread+0xc0/0xd0
[80288.308853]  [<ffffffff8107b4c0>] ? kthread_create_on_node+0x120/0x120
[80288.308857]  [<ffffffff814da8ec>] ret_from_fork+0x7c/0xb0
[80288.308860]  [<ffffffff8107b4c0>] ? kthread_create_on_node+0x120/0x120
[80288.308864] INFO: task z_fr_iss_2/0:4675 blocked for more than 120 seconds.
[80288.344666] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[80288.381258] z_fr_iss_2/0    D 0000000000000246     0  4675      2 0x00000000
[80288.381262]  ffff880faf93fc90 0000000000000046 0000000000014280 ffff880faf93ffd8
[80288.381266]  ffff880faf93ffd8 0000000000014280 ffff880ef2401920 000000000069be66
[80288.381270]  ffff880fdfc14280 ffff880ef24029e0 ffff880fdfc14280 0000000000000000
[80288.381274] Call Trace:
[80288.381279]  [<ffffffff8108a715>] ? check_preempt_curr+0x85/0xa0
[80288.381284]  [<ffffffff8108a75c>] ? ttwu_do_wakeup+0x2c/0x120
[80288.381288]  [<ffffffff814d1bf9>] schedule+0x29/0x70
[80288.381292]  [<ffffffff814d2033>] schedule_preempt_disabled+0x23/0x30
[80288.381295]  [<ffffffff814d0639>] __mutex_lock_slowpath+0x159/0x360
[80288.381299]  [<ffffffff814d0852>] mutex_lock+0x12/0x30
[80288.381320]  [<ffffffffa076bf71>] metaslab_free_dva+0xc1/0x1c0 [zfs]
[80288.381337]  [<ffffffffa076ddcd>] metaslab_free+0x8d/0xc0 [zfs]
[80288.381350]  [<ffffffffa07cad2c>] zio_dva_free+0x1c/0x30 [zfs]
[80288.381362]  [<ffffffffa07cbf66>] zio_execute+0xa6/0x120 [zfs]
[80288.381369]  [<ffffffffa067cb67>] taskq_thread+0x237/0x4b0 [spl]
[80288.381374]  [<ffffffff8108cb80>] ? wake_up_process+0x40/0x40
[80288.381381]  [<ffffffffa067c930>] ? taskq_cancel_id+0x1f0/0x1f0 [spl]
[80288.381384]  [<ffffffff8107b580>] kthread+0xc0/0xd0
[80288.381388]  [<ffffffff8107b4c0>] ? kthread_create_on_node+0x120/0x120
[80288.381392]  [<ffffffff814da8ec>] ret_from_fork+0x7c/0xb0
[80288.381395]  [<ffffffff8107b4c0>] ? kthread_create_on_node+0x120/0x120
[80288.381399] INFO: task z_fr_iss_2/1:4676 blocked for more than 120 seconds.
[80288.418291] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[80288.455536] z_fr_iss_2/1    D 0000000000000246     0  4676      2 0x00000000
[80288.455540]  ffff880efff95c90 0000000000000046 0000000000014280 ffff880efff95fd8
[80288.455546]  ffff880efff95fd8 0000000000014280 ffff880fc7c73aa0 ffff880efff95bd0
[80288.455550]  ffffffff81090508 ffff880fc7c721c8 ffff880fc7c73ae8 ffff880fc7c73aa0
[80288.455553] Call Trace:
[80288.455561]  [<ffffffff81090508>] ? __enqueue_entity+0x78/0x80
[80288.455566]  [<ffffffff81091109>] ? wakeup_gran.isra.35+0x29/0x30
[80288.455569]  [<ffffffff81093bd8>] ? check_preempt_wakeup+0x1a8/0x270
[80288.455573]  [<ffffffff8108a715>] ? check_preempt_curr+0x85/0xa0
[80288.455578]  [<ffffffff814d1bf9>] schedule+0x29/0x70
[80288.455581]  [<ffffffff814d2033>] schedule_preempt_disabled+0x23/0x30
[80288.455585]  [<ffffffff814d0639>] __mutex_lock_slowpath+0x159/0x360
[80288.455588]  [<ffffffff814d0852>] mutex_lock+0x12/0x30
[80288.455615]  [<ffffffffa076bf71>] metaslab_free_dva+0xc1/0x1c0 [zfs]
[80288.455632]  [<ffffffffa076ddcd>] metaslab_free+0x8d/0xc0 [zfs]
[80288.455645]  [<ffffffffa07cad2c>] zio_dva_free+0x1c/0x30 [zfs]
[80288.455659]  [<ffffffffa07cbf66>] zio_execute+0xa6/0x120 [zfs]
[80288.455667]  [<ffffffffa067cb67>] taskq_thread+0x237/0x4b0 [spl]
[80288.455671]  [<ffffffff8108cb80>] ? wake_up_process+0x40/0x40
[80288.455678]  [<ffffffffa067c930>] ? taskq_cancel_id+0x1f0/0x1f0 [spl]
[80288.455681]  [<ffffffff8107b580>] kthread+0xc0/0xd0
[80288.455685]  [<ffffffff8107b4c0>] ? kthread_create_on_node+0x120/0x120
[80288.455689]  [<ffffffff814da8ec>] ret_from_fork+0x7c/0xb0
[80288.455692]  [<ffffffff8107b4c0>] ? kthread_create_on_node+0x120/0x120
[80288.455695] INFO: task z_fr_iss_2/3:4678 blocked for more than 120 seconds.
[80288.492942] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[80288.531045] z_fr_iss_2/3    D 0000000000000246     0  4678      2 0x00000000
[80288.531049]  ffff880efffc1c90 0000000000000046 0000000000014280 ffff880efffc1fd8
[80288.531053]  ffff880efffc1fd8 0000000000014280 ffff880ef2402180 ffff880efffc1bd0
[80288.531057]  ffffffff81090508 ffff880ef2403288 ffff880ef24021c8 ffff880ef2402180
[80288.531060] Call Trace:
[80288.531064]  [<ffffffff81090508>] ? __enqueue_entity+0x78/0x80
[80288.531068]  [<ffffffff81091109>] ? wakeup_gran.isra.35+0x29/0x30
[80288.531071]  [<ffffffff81093bd8>] ? check_preempt_wakeup+0x1a8/0x270
[80288.531075]  [<ffffffff8108a715>] ? check_preempt_curr+0x85/0xa0
[80288.531078]  [<ffffffff814d1bf9>] schedule+0x29/0x70
[80288.531081]  [<ffffffff814d2033>] schedule_preempt_disabled+0x23/0x30
[80288.531085]  [<ffffffff814d0639>] __mutex_lock_slowpath+0x159/0x360
[80288.531088]  [<ffffffff814d0852>] mutex_lock+0x12/0x30
[80288.531106]  [<ffffffffa076bf71>] metaslab_free_dva+0xc1/0x1c0 [zfs]
[80288.531123]  [<ffffffffa076ddcd>] metaslab_free+0x8d/0xc0 [zfs]
[80288.531137]  [<ffffffffa07cad2c>] zio_dva_free+0x1c/0x30 [zfs]
[80288.531150]  [<ffffffffa07cbf66>] zio_execute+0xa6/0x120 [zfs]
[80288.531156]  [<ffffffffa067cb67>] taskq_thread+0x237/0x4b0 [spl]
[80288.531160]  [<ffffffff8108cb80>] ? wake_up_process+0x40/0x40
[80288.531166]  [<ffffffffa067c930>] ? taskq_cancel_id+0x1f0/0x1f0 [spl]
[80288.531169]  [<ffffffff8107b580>] kthread+0xc0/0xd0
[80288.531172]  [<ffffffff8107b4c0>] ? kthread_create_on_node+0x120/0x120
[80288.531176]  [<ffffffff814da8ec>] ret_from_fork+0x7c/0xb0
[80288.531179]  [<ffffffff8107b4c0>] ? kthread_create_on_node+0x120/0x120

The output from zpool events -v, as requested:

[wbrown@chagas ~]$ sudo zpool events -v
TIME                           CLASS
Aug 16 2013 21:03:13.918039294 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0x95a8c79052289ae1
        vdev_state = 0x7
        time = 0x520ecbd1 0x36b82afe

Aug 16 2013 21:03:13.918039294 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0xe5edc3e7ab26c93b
        vdev_state = 0x7
        time = 0x520ecbd1 0x36b82afe

Aug 16 2013 21:03:13.918039294 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0xdc94dffe01333c7b
        vdev_state = 0x7
        time = 0x520ecbd1 0x36b82afe

Aug 16 2013 21:03:13.918039294 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0x10abd9aae6f8442c
        vdev_state = 0x7
        time = 0x520ecbd1 0x36b82afe

Aug 16 2013 21:03:13.918039294 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0xf90345f163e1bb6c
        vdev_state = 0x7
        time = 0x520ecbd1 0x36b82afe

Aug 16 2013 21:03:13.918039294 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0x66ae155db8270cef
        vdev_state = 0x7
        time = 0x520ecbd1 0x36b82afe

Aug 16 2013 21:03:13.918039294 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0xb15c0504b6e65daf
        vdev_state = 0x7
        time = 0x520ecbd1 0x36b82afe

Aug 16 2013 21:03:13.918039294 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0xbc5f66766caa9cb8
        vdev_state = 0x7
        time = 0x520ecbd1 0x36b82afe

Aug 16 2013 21:03:13.918039294 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0x8b25441d3474f6ac
        vdev_state = 0x7
        time = 0x520ecbd1 0x36b82afe

Aug 16 2013 21:03:13.918039294 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0x300ef787fcc9fba
        vdev_state = 0x7
        time = 0x520ecbd1 0x36b82afe

Aug 16 2013 21:03:13.918039294 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0x51217a4b517ea5ee
        vdev_state = 0x7
        time = 0x520ecbd1 0x36b82afe

Aug 16 2013 21:03:13.918039294 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0x1d5712408277aff5
        vdev_state = 0x7
        time = 0x520ecbd1 0x36b82afe

Aug 16 2013 21:03:13.918039294 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0xc4273abd530a0080
        vdev_state = 0x7
        time = 0x520ecbd1 0x36b82afe

Aug 16 2013 21:03:13.918039294 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0xa610627a9eb06fd8
        vdev_state = 0x7
        time = 0x520ecbd1 0x36b82afe

Aug 16 2013 21:03:13.918039294 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0xafe430f332f7e976
        vdev_state = 0x7
        time = 0x520ecbd1 0x36b82afe

Aug 16 2013 21:03:13.918039294 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0x5d87a7bf4eca5eb3
        vdev_state = 0x7
        time = 0x520ecbd1 0x36b82afe

Aug 16 2013 21:03:13.918039294 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0x9e417c1998aef71b
        vdev_state = 0x7
        time = 0x520ecbd1 0x36b82afe

Aug 16 2013 21:03:13.918039294 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0x707acc9b40439bfd
        vdev_state = 0x7
        time = 0x520ecbd1 0x36b82afe

Aug 16 2013 21:03:13.918039294 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0x8618d275f569a005
        vdev_state = 0x7
        time = 0x520ecbd1 0x36b82afe

Aug 16 2013 21:03:13.918039294 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0x233f0ae809fa55fc
        vdev_state = 0x7
        time = 0x520ecbd1 0x36b82afe

Aug 16 2013 21:03:13.919039282 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0x379ecd5384f0bcf
        vdev_state = 0x7
        time = 0x520ecbd1 0x36c76d32

Aug 16 2013 21:03:13.919039282 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0xb3ba4c3ab932b56a
        vdev_state = 0x7
        time = 0x520ecbd1 0x36c76d32

Aug 16 2013 21:03:13.919039282 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0x3420bae781661a99
        vdev_state = 0x7
        time = 0x520ecbd1 0x36c76d32

Aug 16 2013 21:03:13.919039282 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0x3ea91f3aaf24a19b
        vdev_state = 0x7
        time = 0x520ecbd1 0x36c76d32

Aug 16 2013 21:03:13.919039282 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0xb29b9c54afc9f53e
        vdev_state = 0x7
        time = 0x520ecbd1 0x36c76d32

Aug 16 2013 21:03:13.919039282 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0xdaf088f2bbac2896
        vdev_state = 0x7
        time = 0x520ecbd1 0x36c76d32

Aug 16 2013 21:03:13.919039282 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0x3e763fdac4773fe7
        vdev_state = 0x7
        time = 0x520ecbd1 0x36c76d32

Aug 16 2013 21:03:13.919039282 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0x41ed0252c12c3379
        vdev_state = 0x7
        time = 0x520ecbd1 0x36c76d32

Aug 16 2013 21:03:13.919039282 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0x16cc1a60fe30e215
        vdev_state = 0x7
        time = 0x520ecbd1 0x36c76d32

Aug 16 2013 21:03:13.919039282 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0xbda9b3bedcbe5f88
        vdev_state = 0x7
        time = 0x520ecbd1 0x36c76d32

Aug 16 2013 21:03:13.919039282 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0x4d40a6f2e4049541
        vdev_state = 0x7
        time = 0x520ecbd1 0x36c76d32

Aug 16 2013 21:03:13.920039269 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0x71bf9492250bf525
        vdev_state = 0x7
        time = 0x520ecbd1 0x36d6af65

Aug 16 2013 21:03:13.920039269 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0xecb6ee35e01c9272
        vdev_state = 0x7
        time = 0x520ecbd1 0x36d6af65

Aug 16 2013 21:03:14.387033398 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0x245176f9263e1a3
        vdev_state = 0x7
        time = 0x520ecbd2 0x1711a936

Aug 16 2013 21:03:20.005962759 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0x245176f9263e1a3
        vdev_state = 0x7
        time = 0x520ecbd8 0x5afc07

Aug 16 2013 21:03:20.018962595 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0x245176f9263e1a3
        vdev_state = 0x7
        time = 0x520ecbd8 0x12158a3

Aug 16 2013 21:03:20.031962432 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1f62a992a0855ec2
        vdev_guid = 0x245176f9263e1a3
        vdev_state = 0x7
        time = 0x520ecbd8 0x1e7b540

Aug 16 2013 21:03:20.580955530 ereport.fs.zfs.config.sync
        class = "ereport.fs.zfs.config.sync"
        ena = 0x1453b4736800801
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0x1f62a992a0855ec2
        (end detector)
        pool = "chagas-internal"
        pool_guid = 0x1f62a992a0855ec2
        pool_context = 0x0
        pool_failmode = "wait"
        time = 0x520ecbd8 0x22a0ad8a

Aug 16 2013 21:03:21.062949471 ereport.fs.zfs.config.sync
        class = "ereport.fs.zfs.config.sync"
        ena = 0x14706a7af001c01
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0x1f62a992a0855ec2
        (end detector)
        pool = "chagas-internal"
        pool_guid = 0x1f62a992a0855ec2
        pool_context = 0x0
        pool_failmode = "wait"
        time = 0x520ecbd9 0x3c0885f

@wbrown
Author

wbrown commented Aug 21, 2013

Some further news here -- after studying the issue, I realized that, despite attempting to force all writes to go through my ZeusRAM ZIL SLOG by setting the pool's sync property to always, it wasn't happening. This was new behavior to me, after much experience with ZFS on illumos platforms. I discovered the zfs_slog_limit parameter discussed in the threads below.

#1012
#783

It turned out that, despite my attempts to use the ZIL as the write throttle, writes larger than 1MB were going straight to the pool. GlusterFS does write coalescing, so writes can often be larger than 1MB. The combination of Gluster's write-back cache and the zfs_slog_limit behavior actually increased the latency and decreased the throughput of my pool under a heavy deduplication load.

With the ZIL being written straight to the pool and zfs_no_write_throttle enabled, memory was gradually consumed until there was so much memory pressure that deduplication got into a deadlock. zfs_no_write_throttle had been enabled due to write performance issues when writing deduplicated files out, and the ZIL was relied upon as the upper limit on how many transactions could be pending.

Increasing the zfs_slog_limit to 256MB on the ZFS machines seems to have increased their stability, decreased the latency, and increased the throughput.
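
For anyone else hitting this: the module parameter in question is zil_slog_limit, and raising it looks roughly like the following (the /sys path assumes the standard module parameter location):

# take effect immediately on a running system
echo 268435456 > /sys/module/zfs/parameters/zil_slog_limit

# persistent across reboots, e.g. in /etc/modprobe.d/zfs.conf
options zfs zil_slog_limit=268435456    # 256MB; the old 1MB default is what let large writes bypass the SLOG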

@behlendorf
Contributor

@wbrown Thank you for following up on the issue. This completely explains why you were observing problems with your workload. Can you summarize what non-standard tunings you needed to make on your system to get Gluster behaving well?

@wbrown
Author

wbrown commented Aug 21, 2013

Sure, @behlendorf. This is for a GlusterFS setup layered on top of ZFS, with heavy deduplication of content that deduplicates well.

Hardware:

  • 2x AMD Opteron 6328s (16 cores @ 3.2GHz)
  • 256GB of ECC DDR3 1600MHz
  • Dual-port QDR Infiniband
  • LSI SAS 9211-8i SAS adapter
  • STEC 8GB ZeusRAM SAS SSD (ZIL)
  • STEC 200GB 840 SAS SSD (L2ARC)
  • 16x 4TB Seagate Constellation SAS drives (main pool, mirrored pairs)

The pool is set to:

  • compression=lz4

Compression is good, especially cheap LZ4 compression, as it reduces the amount of I/O to disk.

  • dedup=on

Deduplication is also good if you have a dataset that works really well for it. But at the scale of terabytes, you really need a maxed-out system.

  • recordsize=8k

Best results here for my data set.

  • sync=always

This is to force all writes to be committed via the ZIL SSD. This device has both low latency and high throughput, so the zfs_slog_limit optimization for non-enterprise SSDs really screws up performance for me. I use the ZIL SLOG as a natural write throttle, to protect myself from the effects of zfs_no_write_throttle=1, and get much better performance as a result. I am also protected from crashes -- theoretically.
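
Concretely, the pool properties above are applied with zfs set; the pool name here is a placeholder:

zfs set compression=lz4 tank
zfs set dedup=on tank
zfs set recordsize=8k tank
zfs set sync=always tank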

The pool's L2ARC is set to metadata only, for the deduplication tables:

  • secondarycache=metadata

options zfs zil_slog_limit=268435456

Set the ZIL SLOG limit to 256MB so that it's a near-dead certainty that writes are going to the ZIL. If this doesn't happen, I risk a daily system implosion from zfs_no_write_throttle=1 below.

options zfs zfs_no_write_throttle=1

I have had to set this to get any sort of decent performance from my workload -- the write throttle algorithm seems to really be off in the case of a heavy synchronous dedup load, and stalls my systems a lot.

options zfs zfs_prefetch_disable=1

Prefetch is nearly worthless in the context of many, many small files -- in the range of hundreds of millions.

options zfs zfs_dedup_prefetch=0

The above appears to help the stability of my system, though this is questionable and not proven.

options zfs zfs_arc_max=171798691840

This allows ZFS to take up about 160GB of system memory (roughly two-thirds) for the ARC.

options zfs zfs_arc_meta_limit=137438953472

One of the most important settings for a deduplication server -- as high an ARC metadata limit as you can get away with.
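
To check whether the metadata limit is actually being hit, the ARC kstats can be watched (assuming the standard /proc/spl kstat location):

# current ARC metadata usage vs. the configured limit
grep -E 'arc_meta_used|arc_meta_limit' /proc/spl/kstat/zfs/arcstats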

options zfs zvol_threads=512

This is helpful when you have hundreds of synchronous write threads.

options zfs zfs_vdev_cache_size=67108864

I get a 50% cache hit rate, and this helps with the metadata loads for deduplication.
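
The hit rate can be read from the vdev cache kstat (again assuming the standard /proc/spl location):

# delegations, hits, and misses for the vdev cache
cat /proc/spl/kstat/zfs/vdev_cache_stats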

options zfs zfs_txg_timeout=5
options zfs zfs_txg_synctime_ms=1000

This used to be set to 30 -- combined with the ZeusRAM SLOG, I got really good write performance with my workload, but it was apparently impacting system stability.

options zfs zfs_vdev_max_pending=30

I have enterprise SAS disks with dual-pathing, so this leverages the SAS bus to its fullest. Don't try this with SATA disks.

options zfs l2arc_write_max=8388608
options zfs l2arc_write_boost=25165824
options zfs l2arc_feed_again=0

I want my L2ARC to be filled with metadata as fast as it can take it, rather than risk page eviction.
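
All of the module options above can be checked at runtime through /sys/module, and most of them can be adjusted on the fly without reloading the module; for example:

# confirm the running values match /etc/modprobe.d/zfs.conf
grep . /sys/module/zfs/parameters/zfs_arc_max /sys/module/zfs/parameters/l2arc_write_max

# flip a tunable on a live system
echo 0 > /sys/module/zfs/parameters/l2arc_feed_again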

@ryao
Contributor

ryao commented Jul 11, 2014

@wbrown Can you reproduce this in 0.6.3?

@behlendorf
Contributor

Closing for now; we can reopen it if we get confirmation it's still an issue.
