Deadlock due to filesystem re-entry during zfs_evict_inode() in 0.6.5 #3808

Closed
rgmiller opened this issue Sep 21, 2015 · 17 comments
Comments

@rgmiller

I upgraded to 0.6.5 when it became available for CentOS 7 recently and I've begun to notice problems with large rsync transfers. (I'm using ZFS as the main storage for a BackupPC server.) Small rsyncs (i.e. incremental backups) seem to be OK, but large ones (from the full backups) end up hanging and basically locking up the computer.

I managed to get a stack trace from dmesg this evening, though, and I'll paste it below. The traces all seem to be in a mutex locking function, so maybe this is some kind of deadlock scenario?

I think I can trigger this situation fairly reliably by kicking off a full backup, so if there's something specific you want me to try, just let me know.
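
For what it's worth, the traces below came straight out of dmesg after the hung-task watchdog fired. On a machine that's already wedged they can also be forced on demand via SysRq (a minimal sketch, assuming the kernel was built with magic SysRq support):

    # dump stack traces of every task stuck in uninterruptible (D) state
    echo w > /proc/sysrq-trigger
    # then read them back out of the kernel log
    dmesg | tail -n 300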

@rgmiller
Author

dmesg output:

[153452.271834] INFO: task kswapd0:46 blocked for more than 120 seconds.
[153452.271853] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[153452.271872] kswapd0 D ffff88041e313680 0 46 2 0x00000000
[153452.271874] ffff88040766b760 0000000000000046 ffff8804083fc440 ffff88040766bfd8
[153452.271876] ffff88040766bfd8 ffff88040766bfd8 ffff8804083fc440 ffff88041e313f48
[153452.271878] ffff88018a4b37a8 ffff88018a4b37e8 ffff88018a4b37d0 0000000000000001
[153452.271879] Call Trace:
[153452.271884] [] io_schedule+0x9d/0x130
[153452.271899] [] cv_wait_common+0xae/0x150 [spl]
[153452.271902] [] ? wake_up_bit+0x30/0x30
[153452.271907] [] __cv_wait_io+0x18/0x20 [spl]
[153452.271931] [] zio_wait+0x123/0x210 [zfs]
[153452.271941] [] dbuf_read+0x30d/0x930 [zfs]
[153452.271953] [] dmu_buf_hold+0x50/0x80 [zfs]
[153452.271970] [] zap_get_leaf_byblk+0x5c/0x2c0 [zfs]
[153452.271980] [] ? dmu_buf_rele+0xe/0x10 [zfs]
[153452.271995] [] ? zap_idx_to_blk+0x103/0x180 [zfs]
[153452.272011] [] zap_deref_leaf+0x7a/0xa0 [zfs]
[153452.272026] [] fzap_remove+0x3f/0xb0 [zfs]
[153452.272042] [] ? zap_name_alloc+0x73/0xd0 [zfs]
[153452.272058] [] zap_remove_norm+0x17b/0x1e0 [zfs]
[153452.272073] [] zap_remove+0x13/0x20 [zfs]
[153452.272089] [] zap_remove_int+0x54/0x80 [zfs]
[153452.272104] [] zfs_rmnode+0x224/0x350 [zfs]
[153452.272106] [] ? mutex_lock+0x12/0x2f
[153452.272122] [] zfs_zinactive+0x168/0x180 [zfs]
[153452.272138] [] zfs_inactive+0x67/0x240 [zfs]
[153452.272141] [] ? truncate_pagecache+0x59/0x60
[153452.272156] [] zpl_evict_inode+0x43/0x60 [zfs]
[153452.272159] [] evict+0xa7/0x170
[153452.272161] [] dispose_list+0x3e/0x50
[153452.272162] [] prune_icache_sb+0x163/0x320
[153452.272164] [] prune_super+0xd6/0x1a0
[153452.272166] [] shrink_slab+0x165/0x300
[153452.272168] [] ? vmpressure+0x87/0x90
[153452.272170] [] balance_pgdat+0x4b1/0x5e0
[153452.272171] [] kswapd+0x173/0x450
[153452.272173] [] ? wake_up_bit+0x30/0x30
[153452.272175] [] ? balance_pgdat+0x5e0/0x5e0
[153452.272176] [] kthread+0xcf/0xe0
[153452.272177] [] ? kthread_create_on_node+0x140/0x140
[153452.272180] [] ret_from_fork+0x58/0x90
[153452.272181] [] ? kthread_create_on_node+0x140/0x140
[153452.272188] INFO: task spl_kmem_cache:558 blocked for more than 120 seconds.
[153452.272205] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[153452.272224] spl_kmem_cache D ffff88041e213680 0 558 2 0x00000000
[153452.272226] ffff880406af7498 0000000000000046 ffff880407c738e0 ffff880406af7fd8
[153452.272227] ffff880406af7fd8 ffff880406af7fd8 ffff880407c738e0 ffff88041e213f48
[153452.272229] ffff8801d7f0daa8 ffff8801d7f0dae8 ffff8801d7f0dad0 0000000000000001
[153452.272230] Call Trace:
[153452.272232] [] io_schedule+0x9d/0x130
[153452.272237] [] cv_wait_common+0xae/0x150 [spl]
[153452.272238] [] ? wake_up_bit+0x30/0x30
[153452.272243] [] __cv_wait_io+0x18/0x20 [spl]
[153452.272258] [] zio_wait+0x123/0x210 [zfs]
[153452.272268] [] dbuf_read+0x30d/0x930 [zfs]
[153452.272279] [] dmu_buf_hold+0x50/0x80 [zfs]
[153452.272296] [] zap_get_leaf_byblk+0x5c/0x2c0 [zfs]
[153452.272305] [] ? dmu_buf_rele+0xe/0x10 [zfs]
[153452.272320] [] ? zap_idx_to_blk+0x103/0x180 [zfs]
[153452.272336] [] zap_deref_leaf+0x7a/0xa0 [zfs]
[153452.272350] [] fzap_remove+0x3f/0xb0 [zfs]
[153452.272366] [] ? zap_name_alloc+0x73/0xd0 [zfs]
[153452.272381] [] zap_remove_norm+0x17b/0x1e0 [zfs]
[153452.272396] [] zap_remove+0x13/0x20 [zfs]
[153452.272410] [] zap_remove_int+0x54/0x80 [zfs]
[153452.272425] [] zfs_rmnode+0x224/0x350 [zfs]
[153452.272427] [] ? mutex_lock+0x12/0x2f
[153452.272442] [] zfs_zinactive+0x168/0x180 [zfs]
[153452.272458] [] zfs_inactive+0x67/0x240 [zfs]
[153452.272459] [] ? truncate_pagecache+0x59/0x60
[153452.272474] [] zpl_evict_inode+0x43/0x60 [zfs]
[153452.272477] [] evict+0xa7/0x170
[153452.272479] [] dispose_list+0x3e/0x50
[153452.272480] [] prune_icache_sb+0x163/0x320
[153452.272481] [] prune_super+0xd6/0x1a0
[153452.272483] [] shrink_slab+0x165/0x300
[153452.272484] [] ? vmpressure+0x21/0x90
[153452.272486] [] do_try_to_free_pages+0x3c2/0x4e0
[153452.272488] [] try_to_free_pages+0xfc/0x180
[153452.272490] [] __alloc_pages_nodemask+0x7fd/0xb90
[153452.272492] [] alloc_pages_current+0xa9/0x170
[153452.272495] [] __vmalloc_node_range+0x15b/0x270
[153452.272499] [] ? spl_vmalloc+0x34/0x60 [spl]
[153452.272500] [] __vmalloc+0x41/0x50
[153452.272504] [] ? spl_vmalloc+0x34/0x60 [spl]
[153452.272508] [] spl_vmalloc+0x34/0x60 [spl]
[153452.272512] [] kv_alloc.isra.5+0x87/0x90 [spl]
[153452.272515] [] spl_cache_grow_work+0x55/0x2e0 [spl]
[153452.272519] [] taskq_thread+0x21e/0x420 [spl]
[153452.272521] [] ? wake_up_state+0x20/0x20
[153452.272533] [] ? taskq_thread_spawn+0x60/0x60 [spl]
[153452.272534] [] kthread+0xcf/0xe0
[153452.272536] [] ? kthread_create_on_node+0x140/0x140
[153452.272538] [] ret_from_fork+0x58/0x90
[153452.272539] [] ? kthread_create_on_node+0x140/0x140
[153452.272543] INFO: task z_rd_int_0:972 blocked for more than 120 seconds.
[153452.272558] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[153452.272576] z_rd_int_0 D ffff88041e393680 0 972 2 0x00000000
[153452.272578] ffff8803fe17fcc0 0000000000000046 ffff8803fe37f1c0 ffff8803fe17ffd8
[153452.272579] ffff8803fe17ffd8 ffff8803fe17ffd8 ffff8803fe37f1c0 ffff8803fd5159d0
[153452.272580] ffff8803fd5159d4 ffff8803fe37f1c0 00000000ffffffff ffff8803fd5159d8
[153452.272582] Call Trace:
[153452.272584] [] schedule_preempt_disabled+0x29/0x70
[153452.272585] [] __mutex_lock_slowpath+0xc5/0x1c0
[153452.272588] [] mutex_lock+0x1f/0x2f
[153452.272603] [] vdev_queue_io_done+0x57/0x290 [zfs]
[153452.272605] [] ? __switch_to+0x136/0x4a0
[153452.272607] [] ? mutex_lock+0x12/0x2f
[153452.272622] [] zio_vdev_io_done+0x88/0x180 [zfs]
[153452.272636] [] zio_execute+0xc8/0x180 [zfs]
[153452.272641] [] taskq_thread+0x21e/0x420 [spl]
[153452.272642] [] ? wake_up_state+0x20/0x20
[153452.272646] [] ? taskq_thread_spawn+0x60/0x60 [spl]
[153452.272647] [] kthread+0xcf/0xe0
[153452.272649] [] ? kthread_create_on_node+0x140/0x140
[153452.272650] [] ret_from_fork+0x58/0x90
[153452.272652] [] ? kthread_create_on_node+0x140/0x140
[153452.272653] INFO: task z_rd_int_1:973 blocked for more than 120 seconds.
[153452.272669] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[153452.272687] z_rd_int_1 D ffff88041e313680 0 973 2 0x00000000
[153452.272689] ffff8803fe2cbcc0 0000000000000046 ffff8803fe37ad80 ffff8803fe2cbfd8
[153452.272690] ffff8803fe2cbfd8 ffff8803fe2cbfd8 ffff8803fe37ad80 ffff8803fd5159d0
[153452.272691] ffff8803fd5159d4 ffff8803fe37ad80 00000000ffffffff ffff8803fd5159d8
[153452.272693] Call Trace:
[153452.272694] [] schedule_preempt_disabled+0x29/0x70
[153452.272696] [] __mutex_lock_slowpath+0xc5/0x1c0
[153452.272698] [] mutex_lock+0x1f/0x2f
[153452.272714] [] vdev_queue_io_done+0x57/0x290 [zfs]
[153452.272715] [] ? __switch_to+0x136/0x4a0
[153452.272717] [] ? mutex_lock+0x12/0x2f
[153452.272732] [] zio_vdev_io_done+0x88/0x180 [zfs]
[153452.272747] [] zio_execute+0xc8/0x180 [zfs]
[153452.272751] [] taskq_thread+0x21e/0x420 [spl]
[153452.272752] [] ? wake_up_state+0x20/0x20
[153452.272756] [] ? taskq_thread_spawn+0x60/0x60 [spl]
[153452.272757] [] kthread+0xcf/0xe0
[153452.272759] [] ? kthread_create_on_node+0x140/0x140
[153452.272760] [] ret_from_fork+0x58/0x90
[153452.272762] [] ? kthread_create_on_node+0x140/0x140
[153452.272763] INFO: task z_rd_int_2:974 blocked for more than 120 seconds.
[153452.272778] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[153452.272796] z_rd_int_2 D ffff88041e213680 0 974 2 0x00000000
[153452.272797] ffff8803fe00bcc0 0000000000000046 ffff8803fe378000 ffff8803fe00bfd8
[153452.272799] ffff8803fe00bfd8 ffff8803fe00bfd8 ffff8803fe378000 ffff8803fd5159d0
[153452.272800] ffff8803fd5159d4 ffff8803fe378000 00000000ffffffff ffff8803fd5159d8
[153452.272801] Call Trace:
[153452.272803] [] schedule_preempt_disabled+0x29/0x70
[153452.272805] [] __mutex_lock_slowpath+0xc5/0x1c0
[153452.272807] [] mutex_lock+0x1f/0x2f
[153452.272822] [] vdev_queue_io_done+0x57/0x290 [zfs]
[153452.272827] [] ? __switch_to+0x136/0x4a0
[153452.272829] [] ? mutex_lock+0x12/0x2f
[153452.272844] [] zio_vdev_io_done+0x88/0x180 [zfs]
[153452.272860] [] zio_execute+0xc8/0x180 [zfs]
[153452.272865] [] taskq_thread+0x21e/0x420 [spl]
[153452.272866] [] ? wake_up_state+0x20/0x20
[153452.272870] [] ? taskq_thread_spawn+0x60/0x60 [spl]
[153452.272871] [] kthread+0xcf/0xe0
[153452.272872] [] ? kthread_create_on_node+0x140/0x140
[153452.272874] [] ret_from_fork+0x58/0x90
[153452.272875] [] ? kthread_create_on_node+0x140/0x140
[153452.272876] INFO: task z_rd_int_3:975 blocked for more than 120 seconds.
[153452.272892] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[153452.272910] z_rd_int_3 D ffff88041e213680 0 975 2 0x00000000
[153452.272911] ffff8803fe007cc0 0000000000000046 ffff8803fe37a220 ffff8803fe007fd8
[153452.272913] ffff8803fe007fd8 ffff8803fe007fd8 ffff8803fe37a220 ffff8803fd5169d0
[153452.272914] ffff8803fd5169d4 ffff8803fe37a220 00000000ffffffff ffff8803fd5169d8
[153452.272915] Call Trace:
[153452.272917] [] schedule_preempt_disabled+0x29/0x70
[153452.272919] [] __mutex_lock_slowpath+0xc5/0x1c0
[153452.272921] [] mutex_lock+0x1f/0x2f
[153452.272937] [] vdev_queue_io_done+0x57/0x290 [zfs]
[153452.272938] [] ? __switch_to+0x136/0x4a0
[153452.272940] [] ? mutex_lock+0x12/0x2f
[153452.272955] [] zio_vdev_io_done+0x88/0x180 [zfs]
[153452.272969] [] zio_execute+0xc8/0x180 [zfs]
[153452.272974] [] taskq_thread+0x21e/0x420 [spl]
[153452.272976] [] ? wake_up_state+0x20/0x20
[153452.272979] [] ? taskq_thread_spawn+0x60/0x60 [spl]
[153452.272981] [] kthread+0xcf/0xe0
[153452.272982] [] ? kthread_create_on_node+0x140/0x140
[153452.272984] [] ret_from_fork+0x58/0x90
[153452.272985] [] ? kthread_create_on_node+0x140/0x140
[153452.272987] INFO: task z_rd_int_5:977 blocked for more than 120 seconds.
[153452.273002] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[153452.273020] z_rd_int_5 D ffff88041e293680 0 977 2 0x00000000
[153452.273021] ffff880400617cc0 0000000000000046 ffff8803fe37b8e0 ffff880400617fd8
[153452.273023] ffff880400617fd8 ffff880400617fd8 ffff8803fe37b8e0 ffff8803fd5169d0
[153452.273024] ffff8803fd5169d4 ffff8803fe37b8e0 00000000ffffffff ffff8803fd5169d8
[153452.273026] Call Trace:
[153452.273028] [] schedule_preempt_disabled+0x29/0x70
[153452.273029] [] __mutex_lock_slowpath+0xc5/0x1c0
[153452.273031] [] mutex_lock+0x1f/0x2f
[153452.273047] [] vdev_queue_io_done+0x57/0x290 [zfs]
[153452.273049] [] ? __switch_to+0x136/0x4a0
[153452.273050] [] ? mutex_lock+0x12/0x2f
[153452.273066] [] zio_vdev_io_done+0x88/0x180 [zfs]
[153452.273080] [] zio_execute+0xc8/0x180 [zfs]
[153452.273085] [] taskq_thread+0x21e/0x420 [spl]
[153452.273087] [] ? wake_up_state+0x20/0x20
[153452.273090] [] ? taskq_thread_spawn+0x60/0x60 [spl]
[153452.273092] [] kthread+0xcf/0xe0
[153452.273094] [] ? kthread_create_on_node+0x140/0x140
[153452.273095] [] ret_from_fork+0x58/0x90
[153452.273097] [] ? kthread_create_on_node+0x140/0x140
[153452.273098] INFO: task z_rd_int_6:978 blocked for more than 120 seconds.
[153452.273113] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[153452.273131] z_rd_int_6 D ffff88041e213680 0 978 2 0x00000000
[153452.273133] ffff8803fd60bcc0 0000000000000046 ffff8803fe3796c0 ffff8803fd60bfd8
[153452.273134] ffff8803fd60bfd8 ffff8803fd60bfd8 ffff8803fe3796c0 ffff8803fd5169d0
[153452.273135] ffff8803fd5169d4 ffff8803fe3796c0 00000000ffffffff ffff8803fd5169d8
[153452.273137] Call Trace:
[153452.273138] [] schedule_preempt_disabled+0x29/0x70
[153452.273140] [] __mutex_lock_slowpath+0xc5/0x1c0
[153452.273142] [] mutex_lock+0x1f/0x2f
[153452.273158] [] vdev_queue_io_done+0x57/0x290 [zfs]
[153452.273160] [] ? __switch_to+0x136/0x4a0
[153452.273161] [] ? mutex_lock+0x12/0x2f
[153452.273177] [] zio_vdev_io_done+0x88/0x180 [zfs]
[153452.273192] [] zio_execute+0xc8/0x180 [zfs]
[153452.273197] [] taskq_thread+0x21e/0x420 [spl]
[153452.273198] [] ? wake_up_state+0x20/0x20
[153452.273202] [] ? taskq_thread_spawn+0x60/0x60 [spl]
[153452.273204] [] kthread+0xcf/0xe0
[153452.273205] [] ? kthread_create_on_node+0x140/0x140
[153452.273207] [] ret_from_fork+0x58/0x90
[153452.273208] [] ? kthread_create_on_node+0x140/0x140
[153452.273209] INFO: task z_rd_int_7:979 blocked for more than 120 seconds.
[153452.273225] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[153452.273242] z_rd_int_7 D ffff88041e393680 0 979 2 0x00000000
[153452.273244] ffff8803fd463cc0 0000000000000046 ffff8803fe378b60 ffff8803fd463fd8
[153452.273246] ffff8803fd463fd8 ffff8803fd463fd8 ffff8803fe378b60 ffff8803fd5159d0
[153452.273247] ffff8803fd5159d4 ffff8803fe378b60 00000000ffffffff ffff8803fd5159d8
[153452.273248] Call Trace:
[153452.273250] [] schedule_preempt_disabled+0x29/0x70
[153452.273252] [] __mutex_lock_slowpath+0xc5/0x1c0
[153452.273254] [] mutex_lock+0x1f/0x2f
[153452.273269] [] vdev_queue_io_done+0x57/0x290 [zfs]
[153452.273272] [] ? __switch_to+0x136/0x4a0
[153452.273273] [] ? mutex_lock+0x12/0x2f
[153452.273289] [] zio_vdev_io_done+0x88/0x180 [zfs]
[153452.273304] [] zio_execute+0xc8/0x180 [zfs]
[153452.273308] [] taskq_thread+0x21e/0x420 [spl]
[153452.273310] [] ? wake_up_state+0x20/0x20
[153452.273313] [] ? taskq_thread_spawn+0x60/0x60 [spl]
[153452.273315] [] kthread+0xcf/0xe0
[153452.273317] [] ? kthread_create_on_node+0x140/0x140
[153452.273318] [] ret_from_fork+0x58/0x90
[153452.273319] [] ? kthread_create_on_node+0x140/0x140
[153452.273321] INFO: task z_wr_iss:980 blocked for more than 120 seconds.
[153452.273336] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[153452.273353] z_wr_iss D ffff88041e393680 0 980 2 0x00000000
[153452.273355] ffff8803fe0f7af0 0000000000000046 ffff8804049d6660 ffff8803fe0f7fd8
[153452.273357] ffff8803fe0f7fd8 ffff8803fe0f7fd8 ffff8804049d6660 ffff8803fd5169d0
[153452.273358] ffff8803fd5169d4 ffff8804049d6660 00000000ffffffff ffff8803fd5169d8
[153452.273359] Call Trace:
[153452.273361] [] schedule_preempt_disabled+0x29/0x70
[153452.273363] [] __mutex_lock_slowpath+0xc5/0x1c0
[153452.273364] [] mutex_lock+0x1f/0x2f
[153452.273380] [] vdev_queue_io+0x88/0x200 [zfs]
[153452.273396] [] ? zio_create+0x3fa/0x500 [zfs]
[153452.273411] [] zio_vdev_io_start+0x187/0x2e0 [zfs]
[153452.273426] [] zio_nowait+0xc6/0x1b0 [zfs]
[153452.273442] [] vdev_mirror_io_start+0xa7/0x1a0 [zfs]
[153452.273457] [] ? vdev_config_sync+0x140/0x140 [zfs]
[153452.273472] [] ? zio_push_transform+0x39/0x90 [zfs]
[153452.273487] [] zio_vdev_io_start+0x9f/0x2e0 [zfs]
[153452.273501] [] zio_nowait+0xc6/0x1b0 [zfs]
[153452.273517] [] vdev_mirror_io_start+0xa7/0x1a0 [zfs]
[153452.273532] [] ? vdev_config_sync+0x140/0x140 [zfs]
[153452.273547] [] zio_vdev_io_start+0x1dd/0x2e0 [zfs]
[153452.273561] [] zio_execute+0xc8/0x180 [zfs]
[153452.273566] [] taskq_thread+0x21e/0x420 [spl]
[153452.273567] [] ? wake_up_state+0x20/0x20
[153452.273571] [] ? taskq_thread_spawn+0x60/0x60 [spl]
[153452.273573] [] kthread+0xcf/0xe0
[153452.273574] [] ? kthread_create_on_node+0x140/0x140
[153452.273576] [] ret_from_fork+0x58/0x90
[153452.273577] [] ? kthread_create_on_node+0x140/0x140

@gmarkey

gmarkey commented Sep 21, 2015

I have a very similar problem; in my case it appears to be related to ARC metadata reclaim: arc_meta_max starts to drop, taking arc_meta_used with it, before arc_meta_max shoots back up to a value higher than it was before the drop. At that point I've seen this issue (or at least one with the same symptoms) occur once arc_meta_used exceeds arc_meta_limit.
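
For anyone wanting to watch for the same pattern, the counters I'm referring to can be sampled from the ARC kstats (a quick sketch; these are the kstat names ZoL 0.6.x exposes):

    # snapshot the ARC metadata counters
    grep -E 'arc_meta_(used|limit|max)' /proc/spl/kstat/zfs/arcstats
    # or watch them drift while a backup is running
    watch -n 5 "grep -E 'arc_meta_(used|limit|max)' /proc/spl/kstat/zfs/arcstats"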

I've actually just switched off the ZoL machines today in favour of Btrfs until this issue is resolved, so I sadly have no stack trace. Also running RHEL 7; I even tried the 4.x kernels from elrepo.org.

There are other bug reports that sounded very similar, and I've tried the workarounds from them without success. Perhaps these will help:

@fling-
Contributor

fling- commented Sep 21, 2015

@rgmiller Which kernel version are you running?
I fixed my issues by downgrading the kernel to 3.17.7

@rgmiller
Author

@fling: The current stock CentOS 7 kernel: 3.10.0-229.14.1.el7.x86_64
@gmarkey : Thanks for the tip. I'll take a look at those links.

@dweeezil
Contributor

@rgmiller Does the target system have lots of files with xattrs? If so, is xattr=sa set? This deadlock seems like it would be a lot more likely when there are lots of dir-style xattrs. The reason I ask is that a bug in 0.6.5 caused xattr=sa to not be enabled properly. If this is the cause of your issue, there should be a fix coming very shortly in a point release which will cause xattr=sa to be properly enabled.
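
For reference, the property can be checked and, if needed, set per dataset (a sketch; tank/backuppc below is a placeholder for the actual dataset name):

    zfs get xattr tank/backuppc
    zfs set xattr=sa tank/backuppc

Note that xattr=sa only applies to xattrs written after the property is set; existing dir-style xattrs are not converted.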

That said, however, this is clearly a deadlock situation which could occur in any case. I'm not sure why, other than the above, it would be more likely in 0.6.5 than in earlier releases. Or for that matter, why it might be more likely under kernels newer than 3.17.7.

These stack traces are very useful. In summary, you've shown a deadlock which can occur during kswapd-invoked inode cache pruning due to the IO required during management of the unlinked set (f.k.a. "delete queue"). Actually, in this particular case, it's due to waiting for dependent IO operations (parent zio).

@kernelOfTruth
Contributor

If there's still no improvement you could test https://github.com/Feh/nocache

@rgmiller
Author

@dweeezil Yes, I'm pretty sure BackupPC makes heavy use of xattrs, and I have set xattr=sa on the zpool.

@dweeezil
Contributor

@rgmiller I suspect 3af56fd, which was committed to master over the weekend and should appear in a point release shortly, will fix your problem by causing SA xattrs to actually be used. As I pointed out, however, your stack does show a very interesting case of potential deadlock which should be investigated further. It demonstrates a manner in which IO-causing filesystem code can be entered via a reclaim-like operation from the swapper.

I'd suggest leaving this issue open and re-titling it to something like:

Deadlock due to filesystem re-entry during zfs_evict_inode()

or something similar even if reinstating SA xattrs fixes your problem.

@rgmiller rgmiller changed the title from "Problems with large rsync's in 0.6.5" to "Deadlock due to filesystem re-entry during zfs_evict_inode() in 0.6.5" on Sep 21, 2015
@spacelama

@dweeezil Yes, I'm pretty sure BackupPC makes heavy use of xattrs, and I have set xattr=sa on the zpool.

Are you sure? I'm quite familiar with BackupPC, and I'm pretty sure it doesn't use xattrs. It's just files and its own Perl implementation of rsync. It creates a single file in each directory that records the attrs on the remote end, but it's a normal file.

Tim Connors

@rgmiller
Author

@dweeezil I see that the 0.6.5.1 tarball is available for download (including the change for the SA xattrs). I'm inclined to install it fairly soon so that I can get my backups running again. Before I do, are there any tests you'd like me to run while I've still got the system in a state that can (probably) reproduce this problem?

@dweeezil
Contributor

@rgmiller If the problem goes away when you get the SA xattrs enabled again, then there's not much else to do. As I said, there's a deeper problem here which we'll have to address in some way. ZoL tries to prevent re-entering ZFS (causing IO operations) during ZFS operations themselves, but there's still a possibility of re-entry from the kernel itself, and I believe that's what your case demonstrates.

@spacelama There's always a chance the filesystems simply have a lot of xattrs due to posixacl and/or selinux. I've got no experience with BackupPC and no knowledge of whether it uses xattrs for its own purposes.

@rgmiller
Author

I updated to 0.6.5.1 last night and let my usual BackupPC jobs run. Unfortunately, they still hung, though they at least didn't lock up the entire server. I've opened up a new issue - #3822 - because the stack traces looked different. It's possible the issues are related, though.

@behlendorf
Contributor

@rgmiller this problem may be made more likely in 0.6.5 due to the dynamic taskqs. Because we're creating and destroying threads more often, there are more chances to trigger this deadlock. My suggestion would be to disable this support for now by setting the module option spl_taskq_thread_dynamic=0.
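
A minimal sketch of setting that persistently, assuming the modules are loaded via modprobe (the option takes effect the next time the spl module is loaded, e.g. after a reboot):

    # make the option persistent across module reloads
    echo "options spl spl_taskq_thread_dynamic=0" >> /etc/modprobe.d/spl.conf
    # after reloading, confirm the value (assuming the parameter is exported here)
    cat /sys/module/spl/parameters/spl_taskq_thread_dynamic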

behlendorf added a commit to behlendorf/zfs that referenced this issue Sep 23, 2015
As described in the comment above arc_reclaim_thread() it's critical
that the reclaim thread be careful about blocking.  Just like it must
never wait on a hash lock, it must never wait on a task which can in
turn wait on the CV in arc_get_data_buf().  This will deadlock, see
issue openzfs#3822 for full backtraces showing the problem.

To resolve this issue arc_kmem_reap_now() has been updated to use the
asynchronous arc prune function.  This means that arc_prune_async()
may now be called while there are still outstanding arc_prune_tasks.
However, this isn't a problem because arc_prune_async() already
keeps a reference count preventing multiple outstanding tasks per
registered consumer.  Functionally, this behavior is the same as
the counterpart illumos function dnlc_reduce_cache().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#3808
Issue openzfs#3822
@behlendorf
Contributor

@rgmiller never mind my previous request. I've posted a proper fix in #3826; if you're in a position where you could apply the fix (which is safe) and verify it resolves the issue, that would be very helpful.

@rgmiller
Author

@behlendorf Disabling dynamic taskqs seemed to help a little, but not much. I made the change Wednesday evening and was able to run a few backup jobs. However, the machine still locked up overnight. (The normal backup jobs are run by cron starting at 2:00 AM.) I couldn't even log in via the console and ended up having to hit the reset button on the machine.

behlendorf added a commit that referenced this issue Sep 25, 2015
As described in the comment above arc_reclaim_thread() it's critical
that the reclaim thread be careful about blocking.  Just like it must
never wait on a hash lock, it must never wait on a task which can in
turn wait on the CV in arc_get_data_buf().  This will deadlock, see
issue #3822 for full backtraces showing the problem.

To resolve this issue arc_kmem_reap_now() has been updated to use the
asynchronous arc prune function.  This means that arc_prune_async()
may now be called while there are still outstanding arc_prune_tasks.
However, this isn't a problem because arc_prune_async() already
keeps a reference count preventing multiple outstanding tasks per
registered consumer.  Functionally, this behavior is the same as
the counterpart illumos function dnlc_reduce_cache().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #3808
Issue #3834
Issue #3822
@behlendorf
Contributor

Resolved by ef5b2e1 which will be cherry-picked into the 0.6.5.2 release.
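
Once 0.6.5.2 is out, a quick way to confirm the fixed release is actually loaded (a sketch; assuming the modules expose their versions under /sys/module as ZoL builds normally do):

    cat /sys/module/zfs/version /sys/module/spl/version
    # or check what the installed module reports
    modinfo zfs | grep -i ^version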

behlendorf added a commit that referenced this issue Sep 30, 2015
MorpheusTeam pushed a commit to Xyratex/lustre-stable that referenced this issue Oct 17, 2015
ZFS/SPL 0.6.5.2

Bug Fixes
* Init script fixes openzfs/zfs#3816
* Fix uioskip crash when skip to end openzfs/zfs#3806
  openzfs/zfs#3850
* Userspace can trigger an assertion openzfs/zfs#3792
* Fix quota userused underflow bug openzfs/zfs#3789
* Fix performance regression from unwanted synchronous I/O
  openzfs/zfs#3780
* Fix deadlock during ARC reclaim openzfs/zfs#3808
  openzfs/zfs#3834
* Fix deadlock with zfs receive and clamscan openzfs/zfs#3719
* Allow NFS activity to defer snapshot unmounts openzfs/zfs#3794
* Linux 4.3 compatibility openzfs/zfs#3799
* Zed reload fixes openzfs/zfs#3773
* Fix PAX Patch/Grsec SLAB_USERCOPY panic openzfs/zfs#3796
* Always remove during dkms uninstall/update openzfs/spl#476

ZFS/SPL 0.6.5.1

Bug Fixes

* Fix zvol corruption with TRIM/discard openzfs/zfs#3798
* Fix NULL as mount(2) syscall data parameter openzfs/zfs#3804
* Fix xattr=sa dataset property not honored openzfs/zfs#3787

ZFS/SPL 0.6.5

Supported Kernels

* Compatible with 2.6.32 - 4.2 Linux kernels.

New Functionality

* Support for temporary mount options.
* Support for accessing the .zfs/snapshot over NFS.
* Support for estimating send stream size when source is a bookmark.
* Administrative commands are allowed to use reserved space improving
  robustness.
* New notify ZEDLETs support email and pushbullet notifications.
* New keyword 'slot' for vdev_id.conf to control what is used for the
  slot number.
* New zpool export -a option unmounts and exports all imported pools.
* New zpool iostat -y omits the first report with statistics since
  boot.
* New zdb can now open the root dataset.
* New zdb can print the numbers of ganged blocks.
* New zdb -ddddd can print details of block pointer objects.
* New zdb -b performance improved.
* New zstreamdump -d prints contents of blocks.

New Feature Flags

* large_blocks - This feature allows the record size on a dataset to
be set larger than 128KB. We currently support block sizes from 512
bytes to 16MB. The benefits of larger blocks, and thus larger IO, need
to be weighed against the cost of COWing a giant block to modify one
byte. Additionally, very large blocks can have an impact on I/O
latency, and also potentially on the memory allocator. Therefore, we
do not allow the record size to be set larger than zfs_max_recordsize
(default 1MB). Larger blocks can be created by changing this tuning;
pools with larger blocks can always be imported and used, regardless
of this setting.

* filesystem_limits - This feature enables filesystem and snapshot
limits. These limits can be used to control how many filesystems
and/or snapshots can be created at the point in the tree on which the
limits are set.

Performance

* Improved zvol performance on all kernels (>50% higher throughput,
  >20% lower latency)
* Improved zil performance on Linux 2.6.39 and earlier kernels (10x
  lower latency)
* Improved allocation behavior on mostly full SSD/file pools (5% to
  10% improvement on 90% full pools)
* Improved performance when removing large files.
* Caching improvements (ARC):
** Better cached read performance due to reduced lock contention.
** Smarter heuristics for managing the total size of the cache and the
   distribution of data/metadata.
** Faster release of cached buffers due to unexpected memory pressure.

Changes in Behavior

* Default reserved space was increased from 1.6% to 3.3% of total pool
capacity. This default percentage can be controlled through the new
spa_slop_shift module option, setting it to 6 will restore the
previous percentage.

* Loading of the ZFS module stack is now handled by systemd or the
sysv init scripts. Invoking the zfs/zpool commands will not cause the
modules to be automatically loaded. The previous behavior can be
restored by setting the ZFS_MODULE_LOADING=yes environment variable
but this functionality will be removed in a future release.

* Unified SYSV and Gentoo OpenRC initialization scripts. The previous
functionality has been split in to zfs-import, zfs-mount, zfs-share,
and zfs-zed scripts. This allows for independent control of the
services and is consistent with the unit files provided for a systemd
based system. Complete details of the functionality provided by the
updated scripts can be found here.

* Task queues are now dynamic and worker threads will be created and
destroyed as needed. This allows the system to automatically tune
itself to ensure the optimal number of threads are used for the active
workload which can result in a performance improvement.

* Task queue thread priorities were correctly aligned with the default
Linux file system thread priorities. This allows ZFS to compete fairly
with other active Linux file systems when the system is under heavy
load.

* When compression=on the default compression algorithm will be lz4 as
long as the feature is enabled. Otherwise the default remains lzjb.
Similarly lz4 is now the preferred method for compressing meta data
when available.

* The use of mkdir/rmdir/mv in the .zfs/snapshot directory has been
disabled by default both locally and via NFS clients. The
zfs_admin_snapshot module option can be used to re-enable this
functionality.

* LBA weighting is automatically disabled on files and SSDs ensuring
the entire device is used fairly.
* iostat accounting on zvols running on kernels older than Linux 3.19
is no longer supported.

* The known issues preventing swap on zvols for Linux 3.9 and newer
kernels have been resolved. However, deadlocks are still possible for
older kernels.

Module Options

* Changed zfs_arc_c_min default from 4M to 32M to accommodate large
  blocks.
* Added metaslab_aliquot to control how many bytes are written to a
  top-level vdev before moving on to the next one. Increasing this may
  be helpful when using blocks larger than 1M.
* Added spa_slop_shift, see 'reserved space' comment in the 'changes
  to behavior' section.
* Added zfs_admin_snapshot, enable/disable the use of mkdir/rmdir/mv
  in .zfs/snapshot directory.
* Added zfs_arc_lotsfree_percent, throttle I/O when free system
  memory drops below this percentage.
* Added zfs_arc_num_sublists_per_state, used to allow more
  fine-grained locking.
* Added zfs_arc_p_min_shift, used to set a floor on arc_p.
* Added zfs_arc_sys_free, the target number of bytes the ARC should
  leave as free.
* Added zfs_dbgmsg_enable, used to enable the 'dbgmsg' kstat.
* Added zfs_dbgmsg_maxsize, sets the maximum size of the dbgmsg
  buffer.
* Added zfs_max_recordsize, used to control the maximum allowed
  record size.
* Added zfs_arc_meta_strategy, used to select the preferred ARC
  reclaim strategy.
* Removed metaslab_min_alloc_size, it was unused internally due to
  prior changes.
* Removed zfs_arc_memory_throttle_disable, replaced by
  zfs_arc_lotsfree_percent.
* Removed zvol_threads, zvols no longer require a dedicated task
  queue.
* See zfs-module-parameters(5) for complete details on available
  module options.

Bug Fixes

* Improved documentation with many updates, corrections, and
  additions.
* Improved sysv, systemd, initramfs, and dracut support.
* Improved block pointer validation before issuing IO.
* Improved scrub pause heuristics.
* Improved test coverage.
* Improved heuristics for automatic repair when zfs_recover=1 module
  option is set.
* Improved debugging infrastructure via 'dbgmsg' kstat.
* Improved zpool import performance.
* Fixed deadlocks in direct memory reclaim.
* Fixed deadlock on db_mtx and dn_holds.
* Fixed deadlock in dmu_objset_find_dp().
* Fixed deadlock during zfs rollback.
* Fixed kernel panic due to tsd_exit() in ZFS_EXIT.
* Fixed kernel panic when adding a duplicate dbuf to dn_dbufs.
* Fixed kernel panic due to security / ACL creation failure.
* Fixed kernel panic on unmount due to iput taskq.
* Fixed panic due to corrupt nvlist when running utilities.
* Fixed panic on unmount due to not waiting for all znodes to be
  released.
* Fixed panic with zfs clone from different source and target pools.
* Fixed NULL pointer dereference in dsl_prop_get_ds().
* Fixed NULL pointer dereference in dsl_prop_notify_all_cb().
* Fixed NULL pointer dereference in zfsdev_getminor().
* Fixed I/Os are now aggregated across ZIO priority classes.
* Fixed .zfs/snapshot auto-mounting for all supported kernels.
* Fixed 3-digit octal escapes by changing to 4-digit which
  disambiguates the output.
* Fixed hard lockup due to infinite loop in zfs_zget().
* Fixed misreported 'alloc' value for cache devices.
* Fixed spurious hung task watchdog stack traces.
* Fixed direct memory reclaim deadlocks.
* Fixed module loading in zfs import systemd service.
* Fixed intermittent libzfs_init() failure to open /dev/zfs.
* Fixed hot-disk sparing for disk vdevs.
* Fixed system spinning during ARC reclaim.
* Fixed formatting errors in zfs(8).
* Fixed zio pipeline stall by having callers invoke next stage.
* Fixed assertion failed in zrl_tryenter().
* Fixed memory leak in make_root_vdev().
* Fixed memory leak in zpool_in_use().
* Fixed memory leak in libzfs when doing rollback.
* Fixed hold leak in dmu_recv_end_check().
* Fixed refcount leak in bpobj_iterate_impl().
* Fixed misuse of input argument in traverse_visitbp().
* Fixed missing mutex_destroy() calls.
* Fixed integer overflows in dmu_read/dmu_write.
* Fixed verify() failure in zio_done().
* Fixed zio_checksum_error() to only include info for ECKSUM errors.
* Fixed -ESTALE to force lookup on missing NFS file handles.
* Fixed spurious failures from dsl_dataset_hold_obj().
* Fixed zfs compressratio when using with 4k sector size.
* Fixed spurious watchdog warnings in prefetch thread.
* Fixed unfair disk space allocation when vdevs are of unequal size.
* Fixed ashift accounting error writing to cache devices.
* Fixed zdb -d has false positive warning when
  feature@large_blocks=disabled.
* Fixed zdb -h | -i seg fault.
* Fixed force-received full stream into a dataset if it has a
  snapshot.
* Fixed snapshot error handling.
* Fixed 'hangs' while deleting large files.
* Fixed lock contention (rrw_exit) while running a read only load.
* Fixed error message when creating a pool to include all problematic
  devices.
* Fixed Xen virtual block device detection, partitions are now
  created.
* Fixed missing E2BIG error handling in zfs_setprop_error().
* Fixed zpool import assertion in libzfs_import.c.
* Fixed zfs send -nv output to stderr.
* Fixed idle pool potentially running itself out of space.
* Fixed narrow race which allowed read(2) to access beyond fstat(2)'s
  reported end-of-file.
* Fixed support for VPATH builds.
* Fixed double counting of HDR_L2ONLY_SIZE in ARC.
* Fixed 'BUG: Bad page state' warning from kernel due to writeback
  flag.
* Fixed arc_available_memory() to check freemem.
* Fixed arc_memory_throttle() to check pageout.
* Fixed 'zpool create' warning when using zvols in debug builds.
* Fixed loop devices layered on ZFS with 4.1 kernels.
* Fixed zvol contribution to kernel entropy pool.
* Fixed handling of compression flags in arc header.
* Substantial changes to realign code base with illumos.
* Many additional bug fixes.

Signed-off-by: Nathaniel Clark <nathaniel.l.clark@intel.com>
Change-Id: I87c012aec9ec581b10a417d699dafc7d415abf63
Reviewed-on: http://review.whamcloud.com/16399
Tested-by: Jenkins
Reviewed-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
Tested-by: Maloo <hpdd-maloo@intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
@behlendorf
Contributor

Actually, this does appear to be the same as #3867 and therefore may not be completely resolved by ef5b2e1.

behlendorf added a commit to behlendorf/zfs that referenced this issue Dec 16, 2015
This deadlock may manifest itself in slightly different ways but
at the core it is caused by a memory allocation blocking on file-
system reclaim in the zio pipeline.  This is normally impossible
because zio_execute() disables filesystem reclaim by setting
PF_FSTRANS on the thread.  However, kmem cache allocations may
still indirectly block on file system reclaim while holding the
critical vq->vq_lock as shown below.

To resolve this issue zio_buf_alloc_flags() is introduced, which
allows allocation flags to be passed.  This can then be used in
vdev_queue_aggregate() with KM_NOSLEEP when allocating the
aggregate IO buffer.  Since aggregating the IO is purely a
performance optimization we want this to either succeed or fail
quickly.  Trying too hard to allocate this memory under the
vq->vq_lock can negatively impact performance and result in
this deadlock.

* z_wr_iss
zio_vdev_io_start
  vdev_queue_io -> Takes vq->vq_lock
    vdev_queue_io_to_issue
      vdev_queue_aggregate
        zio_buf_alloc -> Waiting on spl_kmem_cache process

* z_wr_int
zio_vdev_io_done
  vdev_queue_io_done
    mutex_lock -> Waiting on vq->vq_lock held by z_wr_iss

* txg_sync
spa_sync
  dsl_pool_sync
    zio_wait -> Waiting on zio being handled by z_wr_int

* spl_kmem_cache
spl_vmalloc
  ...
  evict
    ...
    zfs_inactive
      dmu_tx_wait
        txg_wait_open -> Waiting on txg_sync

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#3808
Issue openzfs#3867
behlendorf added a commit to behlendorf/zfs that referenced this issue Dec 17, 2015
behlendorf added a commit to behlendorf/zfs that referenced this issue Dec 18, 2015
behlendorf added a commit that referenced this issue Dec 24, 2015
This deadlock may manifest itself in slightly different ways but
at the core it is caused by a memory allocation blocking on file-
system reclaim in the zio pipeline.  This is normally impossible
because zio_execute() disables filesystem reclaim by setting
PF_FSTRANS on the thread.  However, kmem cache allocations may
still indirectly block on file system reclaim while holding the
critical vq->vq_lock as shown below.

To resolve this issue zio_buf_alloc_flags() is introduced, which
allows allocation flags to be passed.  This can then be used in
vdev_queue_aggregate() with KM_NOSLEEP when allocating the
aggregate IO buffer.  Since aggregating the IO is purely a
performance optimization we want this to either succeed or fail
quickly.  Trying too hard to allocate this memory under the
vq->vq_lock can negatively impact performance and result in
this deadlock.

* z_wr_iss
zio_vdev_io_start
  vdev_queue_io -> Takes vq->vq_lock
    vdev_queue_io_to_issue
      vdev_queue_aggregate
        zio_buf_alloc -> Waiting on spl_kmem_cache process

* z_wr_int
zio_vdev_io_done
  vdev_queue_io_done
    mutex_lock -> Waiting on vq->vq_lock held by z_wr_iss

* txg_sync
spa_sync
  dsl_pool_sync
    zio_wait -> Waiting on zio being handled by z_wr_int

* spl_kmem_cache
spl_cache_grow_work
  kv_alloc
    spl_vmalloc
      ...
      evict
        zpl_evict_inode
          zfs_inactive
            dmu_tx_wait
              txg_wait_open -> Waiting on txg_sync

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #3808
Closes #3867
ryao pushed a commit to ryao/zfs that referenced this issue Jan 4, 2016
goulvenriou pushed a commit to Alyseo/zfs that referenced this issue Jan 17, 2016
goulvenriou pushed a commit to Alyseo/zfs that referenced this issue Feb 3, 2016
goulvenriou pushed a commit to Alyseo/zfs that referenced this issue Feb 4, 2016
goulvenriou pushed a commit to Alyseo/zfs that referenced this issue Feb 4, 2016
@behlendorf behlendorf modified the milestones: 0.6.5.4, 0.6.5.2 Mar 23, 2016