
Removing L2ARC under IO load causes Kernel OOPS: ZFS hangs after #9337

Closed
h1z1 opened this issue Sep 18, 2019 · 4 comments

Labels: Status: Stale (no recent activity for issue), Type: Defect (incorrect behavior, e.g. crash, hang)

Comments

h1z1 commented Sep 18, 2019

System information

Type Version/Name
Distribution Name CentOS
Distribution Version 7
Linux Kernel 4.20.0
Architecture x86_64
ZFS Version 0.8.0
SPL Version 0.8.0

Describe the problem you're observing

All ZFS volumes are unresponsive.

Describe how to reproduce the problem

  1. Create a pool
  2. Add an L2ARC cache device
  3. Generate load (dd was used, unknown for how long)
    cd /tank; dd if=/dev/zero of=foo.bin bs=2048M
  4. Attempt to remove the L2ARC device (see the sketch after this list)
    zpool remove tank /cache/tank.ARC.zfs
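
For reference, a consolidated shell sketch of the steps above. The device names /dev/sdX and /dev/sdY are placeholders, not taken from the report itself (the original run used a file-backed cache device at /cache/tank.ARC.zfs); substitute real devices before trying this.

    # Create a pool and attach an L2ARC cache device (placeholder disks).
    zpool create tank /dev/sdX
    zpool add tank cache /dev/sdY

    # Generate sustained write load in the background.
    cd /tank && dd if=/dev/zero of=foo.bin bs=2048M &

    # While dd is still writing, remove the cache device.
    zpool remove tank /dev/sdY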

Observe that not only does the newly created pool hang, but any other pools on the same host also stop responding to zfs commands. The OOPS eventually triggers as processes become unresponsive.

Include any warning/errors/backtraces from the system logs

[146841.849635] INFO: task l2arc_feed:3900 blocked for more than 120 seconds.
[146841.849637]       Tainted: P           O      4.20.0+ #1
[146841.849637] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[146841.849638] l2arc_feed      D    0  3900      2 0x80080000
[146841.849640] Call Trace:
[146841.849645]  __schedule+0x2ab/0x880
[146841.849648]  schedule+0x36/0x80
[146841.849649]  schedule_preempt_disabled+0xe/0x10
[146841.849651]  __mutex_lock.isra.11+0x19e/0x4d0
[146841.849652]  ? mutex_lock+0x12/0x40
[146841.849654]  ? remove_wait_queue+0x60/0x60
[146841.849655]  __mutex_lock_slowpath+0x13/0x20
[146841.849656]  mutex_lock+0x34/0x40
[146841.849691]  l2arc_feed_thread+0x177/0x440 [zfs]
[146841.849714]  ? l2arc_evict+0x2c0/0x2c0 [zfs]
[146841.849718]  thread_generic_wrapper+0x74/0x90 [spl]
[146841.849721]  kthread+0x105/0x140
[146841.849726]  ? __thread_exit+0x20/0x20 [spl]
[146841.849727]  ? kthread_bind+0x20/0x20
[146841.849728]  ret_from_fork+0x22/0x40


[146841.849783] INFO: task txg_sync:13523 blocked for more than 120 seconds.
[146841.849784]       Tainted: P           O      4.20.0+ #1
[146841.849784] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[146841.849785] txg_sync        D    0 13523      2 0x80080080
[146841.849786] Call Trace:
[146841.849787]  __schedule+0x2ab/0x880
[146841.849788]  schedule+0x36/0x80
[146841.849790]  cv_wait_common+0x101/0x140 [spl]
[146841.849791]  ? remove_wait_queue+0x60/0x60
[146841.849793]  __cv_wait+0x15/0x20 [spl]
[146841.849822]  spa_config_enter+0xf8/0x110 [zfs]
[146841.849850]  spa_txg_history_fini_io+0x50/0x230 [zfs]
[146841.849877]  txg_sync_thread+0x2df/0x4b0 [zfs]
[146841.849904]  ? txg_init+0x2a0/0x2a0 [zfs]
[146841.849908]  thread_generic_wrapper+0x74/0x90 [spl]
[146841.849910]  kthread+0x105/0x140
[146841.849913]  ? __thread_exit+0x20/0x20 [spl]
[146841.849914]  ? kthread_bind+0x20/0x20
[146841.849915]  ret_from_fork+0x22/0x40


[146841.849916] INFO: task mmp:13524 blocked for more than 120 seconds.
[146841.849917]       Tainted: P           O      4.20.0+ #1
[146841.849917] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[146841.849918] mmp             D    0 13524      2 0x80080080
[146841.849919] Call Trace:
[146841.849920]  __schedule+0x2ab/0x880
[146841.849920]  ? __wake_up_common_lock+0x8c/0xc0
[146841.849921]  schedule+0x36/0x80
[146841.849923]  cv_wait_common+0x101/0x140 [spl]
[146841.849924]  ? remove_wait_queue+0x60/0x60
[146841.849926]  __cv_wait+0x15/0x20 [spl]
[146841.849952]  spa_config_enter+0xf8/0x110 [zfs]
[146841.849979]  vdev_count_leaves+0x25/0x50 [zfs]
[146841.850005]  mmp_thread+0x2d8/0x6d0 [zfs]
[146841.850007]  ? __switch_to_asm+0x34/0x70
[146841.850009]  ? __switch_to_asm+0x40/0x70
[146841.850010]  ? __switch_to_asm+0x34/0x70
[146841.850011]  ? __switch_to_asm+0x40/0x70
[146841.850011]  ? __switch_to_asm+0x34/0x70
[146841.850012]  ? __switch_to_asm+0x40/0x70
[146841.850013]  ? __switch_to_asm+0x34/0x70
[146841.850014]  ? __switch_to_asm+0x40/0x70
[146841.850015]  ? __switch_to_asm+0x34/0x70
[146841.850039]  ? mmp_write_uberblock+0x710/0x710 [zfs]
[146841.850043]  thread_generic_wrapper+0x74/0x90 [spl]
[146841.850045]  kthread+0x105/0x140
[146841.850048]  ? __thread_exit+0x20/0x20 [spl]
[146841.850049]  ? kthread_bind+0x20/0x20
[146841.850050]  ret_from_fork+0x22/0x40


[146841.850161] INFO: task rm:64047 blocked for more than 120 seconds.
[146841.850162]       Tainted: P           O      4.20.0+ #1
[146841.850162] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[146841.850163] rm              D    0 64047  65589 0x00080080
[146841.850163] Call Trace:
[146841.850165]  __schedule+0x2ab/0x880
[146841.850165]  schedule+0x36/0x80
[146841.850168]  cv_wait_common+0x101/0x140 [spl]
[146841.850170]  ? remove_wait_queue+0x60/0x60
[146841.850173]  __cv_wait+0x15/0x20 [spl]
[146841.850200]  spa_config_enter+0xf8/0x110 [zfs]
[146841.850227]  zfs_blkptr_verify+0x382/0x400 [zfs]
[146841.850255]  zio_read+0x42/0xf0 [zfs]
[146841.850258]  ? __kmalloc_node+0x1f5/0x280
[146841.850276]  ? arc_get_data_abd.isra.34+0x51/0x60 [zfs]
[146841.850279]  ? spl_kmem_zalloc+0xdc/0x140 [spl]
[146841.850297]  ? arc_buf_destroy+0x100/0x100 [zfs]
[146841.850315]  arc_read+0x4b4/0x1080 [zfs]
[146841.850317]  ? mutex_lock+0x12/0x40
[146841.850334]  ? dbuf_rele_and_unlock+0x5b0/0x5b0 [zfs]
[146841.850354]  dbuf_read_impl.constprop.23+0x2af/0x7f0 [zfs]
[146841.850373]  dbuf_read+0xe2/0x530 [zfs]
[146841.850392]  ? dbuf_hold_impl+0x3c/0x50 [zfs]
[146841.850413]  dmu_tx_check_ioerr+0x68/0xd0 [zfs]
[146841.850434]  dmu_tx_hold_free_impl+0x145/0x260 [zfs]
[146841.850454]  dmu_tx_hold_free+0x3c/0x50 [zfs]
[146841.850474]  dmu_free_long_range+0x275/0x4e0 [zfs]
[146841.850502]  zfs_rmnode+0x2a0/0x340 [zfs]
[146841.850529]  ? zfs_znode_hold_exit+0x111/0x120 [zfs]
[146841.850557]  zfs_zinactive+0xf6/0x100 [zfs]
[146841.850585]  zfs_inactive+0x88/0x220 [zfs]
[146841.850588]  ? truncate_pagecache+0x59/0x60
[146841.850615]  zpl_evict_inode+0x43/0x60 [zfs]
[146841.850618]  evict+0xd6/0x1b0
[146841.850620]  iput+0x16f/0x1c0
[146841.850621]  do_unlinkat+0x187/0x2b0
[146841.850622]  __x64_sys_unlinkat+0x38/0x60
[146841.850625]  do_syscall_64+0x60/0x190
[146841.850626]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[146841.850627] RIP: 0033:0x7f5687d9874d
[146841.850631] Code: Bad RIP value.
[146841.850631] RSP: 002b:00007fffd779db68 EFLAGS: 00000246 ORIG_RAX: 0000000000000107
[146841.850632] RAX: ffffffffffffffda RBX: 00000000014dffa0 RCX: 00007f5687d9874d
[146841.850633] RDX: 0000000000000000 RSI: 00000000014e00a8 RDI: 0000000000000006
[146841.850633] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[146841.850634] R10: 00000000022646d0 R11: 0000000000000246 R12: 00000000010e005c
[146841.850634] R13: 00007fffd779dcd0 R14: 00000000014dffa0 R15: 0000000000000000


[146841.850636] INFO: task dd:125196 blocked for more than 120 seconds.
[146841.850636]       Tainted: P           O      4.20.0+ #1
[146841.850637] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[146841.850637] dd              D    0 125196  65589 0x00080082
[146841.850638] Call Trace:
[146841.850639]  __schedule+0x2ab/0x880
[146841.850642]  ? spl_kmem_cache_alloc+0x92/0x150 [spl]
[146841.850643]  schedule+0x36/0x80
[146841.850646]  cv_wait_common+0x101/0x140 [spl]
[146841.850647]  ? remove_wait_queue+0x60/0x60
[146841.850649]  __cv_wait+0x15/0x20 [spl]
[146841.850669]  dmu_tx_wait+0xb5/0x390 [zfs]
[146841.850691]  dmu_tx_assign+0x8c/0x470 [zfs]
[146841.850714]  ? dmu_tx_count_write+0x182/0x1c0 [zfs]
[146841.850742]  zfs_write+0x566/0xf00 [zfs]
[146841.850770]  zpl_write_common_iovec+0xa6/0x100 [zfs]
[146841.850796]  zpl_iter_write+0xf6/0x140 [zfs]
[146841.850799]  __vfs_write+0x10d/0x190
[146841.850802]  vfs_write+0xb2/0x1b0
[146841.850802]  ksys_write+0x55/0xc0
[146841.850803]  __x64_sys_write+0x1a/0x20
[146841.850804]  do_syscall_64+0x60/0x190
[146841.850805]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[146841.850806] RIP: 0033:0x7f182a4e8fd0
[146841.850807] Code: Bad RIP value.
[146841.850808] RSP: 002b:00007fff956c1b38 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[146841.850808] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f182a4e8fd0
[146841.850809] RDX: 000000007ffff000 RSI: 00007f17a3ece000 RDI: 0000000000000001
[146841.850809] RBP: 000000007ffff000 R08: ffffffffffffffff R09: 0000000080002003
[146841.850810] R10: 00007fff956c15a0 R11: 0000000000000246 R12: 000000007ffff000
[146841.850810] R13: 00007f17a3ece000 R14: 00007f1823ecd000 R15: 0000000000000000


[146841.850818] INFO: task zpool:116186 blocked for more than 120 seconds.
[146841.850819]       Tainted: P           O      4.20.0+ #1
[146841.850820] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[146841.850820] zpool           D    0 116186  19708 0x00080080
[146841.850821] Call Trace:
[146841.850822]  __schedule+0x2ab/0x880
[146841.850823]  ? discard_slab+0x3a/0x50
[146841.850824]  ? __slab_free+0x2a9/0x360
[146841.850825]  schedule+0x36/0x80
[146841.850825]  schedule_preempt_disabled+0xe/0x10
[146841.850826]  __mutex_lock.isra.11+0x19e/0x4d0
[146841.850827]  ? kmem_cache_free+0x116/0x140
[146841.850830]  ? spl_kmem_cache_free+0x169/0x1c0 [spl]
[146841.850831]  __mutex_lock_slowpath+0x13/0x20
[146841.850832]  mutex_lock+0x34/0x40
[146841.850848]  l2arc_evict+0x28d/0x2c0 [zfs]
[146841.850867]  l2arc_remove_vdev+0x110/0x1b0 [zfs]
[146841.850893]  spa_load_l2cache+0x280/0x4a0 [zfs]
[146841.850921]  ? spa_vdev_remove_aux+0xcb/0x1b0 [zfs]
[146841.850949]  spa_vdev_remove+0x5d4/0x7a0 [zfs]
[146841.850977]  zfs_ioc_vdev_remove+0x50/0x90 [zfs]
[146841.851005]  zfsdev_ioctl+0x85a/0x8c0 [zfs]
[146841.851008]  do_vfs_ioctl+0xaa/0x620
[146841.851010]  ? __audit_syscall_entry+0xdd/0x130
[146841.851012]  ksys_ioctl+0x67/0x90
[146841.851013]  __x64_sys_ioctl+0x1a/0x20
[146841.851014]  do_syscall_64+0x60/0x190
[146841.851015]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[146841.851016] RIP: 0033:0x7fd8b5b698d7
[146841.851017] Code: Bad RIP value.
[146841.851017] RSP: 002b:00007fff93618f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[146841.851018] RAX: ffffffffffffffda RBX: 00000000021092e0 RCX: 00007fd8b5b698d7
[146841.851018] RDX: 00007fff93619370 RSI: 0000000000005a0c RDI: 0000000000000003
[146841.851019] RBP: 00007fff9361c950 R08: 0000000000000080 R09: 0000000000000000
[146841.851019] R10: 00007fff936189e0 R11: 0000000000000246 R12: 00007fff93619370
[146841.851020] R13: 00000000021070f0 R14: 00007fff93618f70 R15: 00000000021197d8


[146841.851021] INFO: task zpool:118505 blocked for more than 120 seconds.
[146841.851021]       Tainted: P           O      4.20.0+ #1
[146841.851022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[146841.851022] zpool           D    0 118505  14831 0x00080080
[146841.851023] Call Trace:
[146841.851025]  __schedule+0x2ab/0x880
[146841.851026]  schedule+0x36/0x80
[146841.851027]  schedule_preempt_disabled+0xe/0x10
[146841.851027]  __mutex_lock.isra.11+0x19e/0x4d0
[146841.851032]  ? nvlist_xalloc.part.16+0x93/0x160 [znvpair]
[146841.851033]  __mutex_lock_slowpath+0x13/0x20
[146841.851034]  mutex_lock+0x34/0x40
[146841.851060]  spa_all_configs+0x40/0x120 [zfs]
[146841.851087]  zfs_ioc_pool_configs+0x1b/0x70 [zfs]
[146841.851115]  zfsdev_ioctl+0x85a/0x8c0 [zfs]
[146841.851118]  do_vfs_ioctl+0xaa/0x620
[146841.851119]  ? __audit_syscall_entry+0xdd/0x130
[146841.851121]  ? handle_mm_fault+0x102/0x220
[146841.851122]  ksys_ioctl+0x67/0x90
[146841.851123]  __x64_sys_ioctl+0x1a/0x20
[146841.851124]  do_syscall_64+0x60/0x190
[146841.851125]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[146841.851126] RIP: 0033:0x7f1baef7b8d7
[146841.851127] Code: Bad RIP value.
[146841.851127] RSP: 002b:00007ffc5a2703b8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[146841.851128] RAX: ffffffffffffffda RBX: 00000000021510f0 RCX: 00007f1baef7b8d7
[146841.851128] RDX: 00007ffc5a2703f0 RSI: 0000000000005a04 RDI: 0000000000000003
[146841.851129] RBP: 00007ffc5a2739d0 R08: ffffffffffffffff R09: 0000000000040000
[146841.851129] R10: 00007ffc5a26fe20 R11: 0000000000000206 R12: 0000000000000000
[146841.851130] R13: 00007ffc5a2703f0 R14: 0000000000406200 R15: 00000000021532b0

Worth noting that I/O to other volumes appears OK; only the ZFS userland commands (zpool and zfs) hang.
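
A sketch of diagnostics worth gathering while the hang is in progress (assumes the magic sysrq key is enabled and root access; PIDs and output will differ per system):

    # Dump stacks of all blocked (D-state) tasks into the kernel log.
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 200

    # List processes stuck in uninterruptible sleep.
    ps -eo pid,stat,comm | awk '$2 ~ /^D/'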

@behlendorf behlendorf added the Type: Defect Incorrect behavior (e.g. crash, hang) label Oct 16, 2019
cdluminate (Contributor) commented

Encountered a similar issue without doing anything. The zpool stores a Debian mirror, and we occasionally discovered it malfunctioning (not responding to nginx/apache read requests).
console_log.txt
ZFS version: 0.8.1-4 (Debian).

h1z1 commented Dec 11, 2019

Sounds almost like #828
#9706 (comment)

Unless I've had too much eggnog already, the interesting thing is...

[146841.850831]  __mutex_lock_slowpath+0x13/0x20
[146841.850832]  mutex_lock+0x34/0x40
[146841.850848]  l2arc_evict+0x28d/0x2c0 [zfs]
[146841.850867]  l2arc_remove_vdev+0x110/0x1b0 [zfs]
[146841.850893]  spa_load_l2cache+0x280/0x4a0 [zfs]
[146841.850921]  ? spa_vdev_remove_aux+0xcb/0x1b0 [zfs]
[146841.850949]  spa_vdev_remove+0x5d4/0x7a0 [zfs]
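
If that lock-inversion reading is right, the two blocked threads should be contending on the same L2ARC device mutex. One way to check on a live system (PIDs 3900 and 116186 are from the trace above and will differ elsewhere; requires root):

    # Stack of the l2arc_feed kernel thread, blocked taking a mutex in l2arc_feed_thread.
    cat /proc/3900/stack

    # Stack of the hung zpool remove, blocked taking a mutex in l2arc_evict.
    cat /proc/116186/stack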

stale bot commented Dec 10, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Dec 10, 2020
@behlendorf behlendorf removed the Status: Stale No recent activity for issue label Dec 28, 2020
stale bot commented Dec 29, 2021

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Dec 29, 2021
@stale stale bot closed this as completed Apr 2, 2022