
Removing L2ARC under IO load causes Kernel OOPS: ZFS hangs after #9337

Closed
h1z1 opened this issue Sep 18, 2019 · 4 comments

Labels: Status: Stale (no recent activity for issue), Type: Defect (incorrect behavior, e.g. crash, hang)

Comments

h1z1 commented Sep 18, 2019

System information

Type Version/Name
Distribution Name CentOS
Distribution Version 7
Linux Kernel 4.20.0
Architecture x86_64
ZFS Version 0.8.0
SPL Version 0.8.0

Describe the problem you're observing

All ZFS volumes are unresponsive.

Describe how to reproduce the problem

  1. Create a pool
  2. Add an L2ARC cache device
  3. Generate load (dd was used, unknown for how long)
    cd /tank; dd if=/dev/zero of=foo.bin bs=2048M
  4. Attempt to remove the L2ARC device (see the sketch after this list)
    zpool remove tank /cache/tank.ARC.zfs
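
For reference, a consolidated shell sketch of the steps above. The device names /dev/sdX and /dev/sdY are placeholders, not taken from the report itself (the original run used a file-backed cache device at /cache/tank.ARC.zfs); substitute real devices before trying this.

    # Create a pool and attach an L2ARC cache device (placeholder disks).
    zpool create tank /dev/sdX
    zpool add tank cache /dev/sdY

    # Generate sustained write load in the background.
    cd /tank && dd if=/dev/zero of=foo.bin bs=2048M &

    # While dd is still writing, remove the cache device.
    zpool remove tank /dev/sdY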

Observe that not only does the newly created pool hang, but any other pools on the same host also stop responding to zfs commands. The OOPS eventually triggers as processes become unresponsive.

Include any warning/errors/backtraces from the system logs

[146841.849635] INFO: task l2arc_feed:3900 blocked for more than 120 seconds.
[146841.849637]       Tainted: P           O      4.20.0+ #1
[146841.849637] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[146841.849638] l2arc_feed      D    0  3900      2 0x80080000
[146841.849640] Call Trace:
[146841.849645]  __schedule+0x2ab/0x880
[146841.849648]  schedule+0x36/0x80
[146841.849649]  schedule_preempt_disabled+0xe/0x10
[146841.849651]  __mutex_lock.isra.11+0x19e/0x4d0
[146841.849652]  ? mutex_lock+0x12/0x40
[146841.849654]  ? remove_wait_queue+0x60/0x60
[146841.849655]  __mutex_lock_slowpath+0x13/0x20
[146841.849656]  mutex_lock+0x34/0x40
[146841.849691]  l2arc_feed_thread+0x177/0x440 [zfs]
[146841.849714]  ? l2arc_evict+0x2c0/0x2c0 [zfs]
[146841.849718]  thread_generic_wrapper+0x74/0x90 [spl]
[146841.849721]  kthread+0x105/0x140
[146841.849726]  ? __thread_exit+0x20/0x20 [spl]
[146841.849727]  ? kthread_bind+0x20/0x20
[146841.849728]  ret_from_fork+0x22/0x40


[146841.849783] INFO: task txg_sync:13523 blocked for more than 120 seconds.
[146841.849784]       Tainted: P           O      4.20.0+ #1
[146841.849784] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[146841.849785] txg_sync        D    0 13523      2 0x80080080
[146841.849786] Call Trace:
[146841.849787]  __schedule+0x2ab/0x880
[146841.849788]  schedule+0x36/0x80
[146841.849790]  cv_wait_common+0x101/0x140 [spl]
[146841.849791]  ? remove_wait_queue+0x60/0x60
[146841.849793]  __cv_wait+0x15/0x20 [spl]
[146841.849822]  spa_config_enter+0xf8/0x110 [zfs]
[146841.849850]  spa_txg_history_fini_io+0x50/0x230 [zfs]
[146841.849877]  txg_sync_thread+0x2df/0x4b0 [zfs]
[146841.849904]  ? txg_init+0x2a0/0x2a0 [zfs]
[146841.849908]  thread_generic_wrapper+0x74/0x90 [spl]
[146841.849910]  kthread+0x105/0x140
[146841.849913]  ? __thread_exit+0x20/0x20 [spl]
[146841.849914]  ? kthread_bind+0x20/0x20
[146841.849915]  ret_from_fork+0x22/0x40


[146841.849916] INFO: task mmp:13524 blocked for more than 120 seconds.
[146841.849917]       Tainted: P           O      4.20.0+ #1
[146841.849917] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[146841.849918] mmp             D    0 13524      2 0x80080080
[146841.849919] Call Trace:
[146841.849920]  __schedule+0x2ab/0x880
[146841.849920]  ? __wake_up_common_lock+0x8c/0xc0
[146841.849921]  schedule+0x36/0x80
[146841.849923]  cv_wait_common+0x101/0x140 [spl]
[146841.849924]  ? remove_wait_queue+0x60/0x60
[146841.849926]  __cv_wait+0x15/0x20 [spl]
[146841.849952]  spa_config_enter+0xf8/0x110 [zfs]
[146841.849979]  vdev_count_leaves+0x25/0x50 [zfs]
[146841.850005]  mmp_thread+0x2d8/0x6d0 [zfs]
[146841.850007]  ? __switch_to_asm+0x34/0x70
[146841.850009]  ? __switch_to_asm+0x40/0x70
[146841.850010]  ? __switch_to_asm+0x34/0x70
[146841.850011]  ? __switch_to_asm+0x40/0x70
[146841.850011]  ? __switch_to_asm+0x34/0x70
[146841.850012]  ? __switch_to_asm+0x40/0x70
[146841.850013]  ? __switch_to_asm+0x34/0x70
[146841.850014]  ? __switch_to_asm+0x40/0x70
[146841.850015]  ? __switch_to_asm+0x34/0x70
[146841.850039]  ? mmp_write_uberblock+0x710/0x710 [zfs]
[146841.850043]  thread_generic_wrapper+0x74/0x90 [spl]
[146841.850045]  kthread+0x105/0x140
[146841.850048]  ? __thread_exit+0x20/0x20 [spl]
[146841.850049]  ? kthread_bind+0x20/0x20
[146841.850050]  ret_from_fork+0x22/0x40


[146841.850161] INFO: task rm:64047 blocked for more than 120 seconds.
[146841.850162]       Tainted: P           O      4.20.0+ #1
[146841.850162] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[146841.850163] rm              D    0 64047  65589 0x00080080
[146841.850163] Call Trace:
[146841.850165]  __schedule+0x2ab/0x880
[146841.850165]  schedule+0x36/0x80
[146841.850168]  cv_wait_common+0x101/0x140 [spl]
[146841.850170]  ? remove_wait_queue+0x60/0x60
[146841.850173]  __cv_wait+0x15/0x20 [spl]
[146841.850200]  spa_config_enter+0xf8/0x110 [zfs]
[146841.850227]  zfs_blkptr_verify+0x382/0x400 [zfs]
[146841.850255]  zio_read+0x42/0xf0 [zfs]
[146841.850258]  ? __kmalloc_node+0x1f5/0x280
[146841.850276]  ? arc_get_data_abd.isra.34+0x51/0x60 [zfs]
[146841.850279]  ? spl_kmem_zalloc+0xdc/0x140 [spl]
[146841.850297]  ? arc_buf_destroy+0x100/0x100 [zfs]
[146841.850315]  arc_read+0x4b4/0x1080 [zfs]
[146841.850317]  ? mutex_lock+0x12/0x40
[146841.850334]  ? dbuf_rele_and_unlock+0x5b0/0x5b0 [zfs]
[146841.850354]  dbuf_read_impl.constprop.23+0x2af/0x7f0 [zfs]
[146841.850373]  dbuf_read+0xe2/0x530 [zfs]
[146841.850392]  ? dbuf_hold_impl+0x3c/0x50 [zfs]
[146841.850413]  dmu_tx_check_ioerr+0x68/0xd0 [zfs]
[146841.850434]  dmu_tx_hold_free_impl+0x145/0x260 [zfs]
[146841.850454]  dmu_tx_hold_free+0x3c/0x50 [zfs]
[146841.850474]  dmu_free_long_range+0x275/0x4e0 [zfs]
[146841.850502]  zfs_rmnode+0x2a0/0x340 [zfs]
[146841.850529]  ? zfs_znode_hold_exit+0x111/0x120 [zfs]
[146841.850557]  zfs_zinactive+0xf6/0x100 [zfs]
[146841.850585]  zfs_inactive+0x88/0x220 [zfs]
[146841.850588]  ? truncate_pagecache+0x59/0x60
[146841.850615]  zpl_evict_inode+0x43/0x60 [zfs]
[146841.850618]  evict+0xd6/0x1b0
[146841.850620]  iput+0x16f/0x1c0
[146841.850621]  do_unlinkat+0x187/0x2b0
[146841.850622]  __x64_sys_unlinkat+0x38/0x60
[146841.850625]  do_syscall_64+0x60/0x190
[146841.850626]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[146841.850627] RIP: 0033:0x7f5687d9874d
[146841.850631] Code: Bad RIP value.
[146841.850631] RSP: 002b:00007fffd779db68 EFLAGS: 00000246 ORIG_RAX: 0000000000000107
[146841.850632] RAX: ffffffffffffffda RBX: 00000000014dffa0 RCX: 00007f5687d9874d
[146841.850633] RDX: 0000000000000000 RSI: 00000000014e00a8 RDI: 0000000000000006
[146841.850633] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[146841.850634] R10: 00000000022646d0 R11: 0000000000000246 R12: 00000000010e005c
[146841.850634] R13: 00007fffd779dcd0 R14: 00000000014dffa0 R15: 0000000000000000


[146841.850636] INFO: task dd:125196 blocked for more than 120 seconds.
[146841.850636]       Tainted: P           O      4.20.0+ #1
[146841.850637] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[146841.850637] dd              D    0 125196  65589 0x00080082
[146841.850638] Call Trace:
[146841.850639]  __schedule+0x2ab/0x880
[146841.850642]  ? spl_kmem_cache_alloc+0x92/0x150 [spl]
[146841.850643]  schedule+0x36/0x80
[146841.850646]  cv_wait_common+0x101/0x140 [spl]
[146841.850647]  ? remove_wait_queue+0x60/0x60
[146841.850649]  __cv_wait+0x15/0x20 [spl]
[146841.850669]  dmu_tx_wait+0xb5/0x390 [zfs]
[146841.850691]  dmu_tx_assign+0x8c/0x470 [zfs]
[146841.850714]  ? dmu_tx_count_write+0x182/0x1c0 [zfs]
[146841.850742]  zfs_write+0x566/0xf00 [zfs]
[146841.850770]  zpl_write_common_iovec+0xa6/0x100 [zfs]
[146841.850796]  zpl_iter_write+0xf6/0x140 [zfs]
[146841.850799]  __vfs_write+0x10d/0x190
[146841.850802]  vfs_write+0xb2/0x1b0
[146841.850802]  ksys_write+0x55/0xc0
[146841.850803]  __x64_sys_write+0x1a/0x20
[146841.850804]  do_syscall_64+0x60/0x190
[146841.850805]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[146841.850806] RIP: 0033:0x7f182a4e8fd0
[146841.850807] Code: Bad RIP value.
[146841.850808] RSP: 002b:00007fff956c1b38 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[146841.850808] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f182a4e8fd0
[146841.850809] RDX: 000000007ffff000 RSI: 00007f17a3ece000 RDI: 0000000000000001
[146841.850809] RBP: 000000007ffff000 R08: ffffffffffffffff R09: 0000000080002003
[146841.850810] R10: 00007fff956c15a0 R11: 0000000000000246 R12: 000000007ffff000
[146841.850810] R13: 00007f17a3ece000 R14: 00007f1823ecd000 R15: 0000000000000000


[146841.850818] INFO: task zpool:116186 blocked for more than 120 seconds.
[146841.850819]       Tainted: P           O      4.20.0+ #1
[146841.850820] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[146841.850820] zpool           D    0 116186  19708 0x00080080
[146841.850821] Call Trace:
[146841.850822]  __schedule+0x2ab/0x880
[146841.850823]  ? discard_slab+0x3a/0x50
[146841.850824]  ? __slab_free+0x2a9/0x360
[146841.850825]  schedule+0x36/0x80
[146841.850825]  schedule_preempt_disabled+0xe/0x10
[146841.850826]  __mutex_lock.isra.11+0x19e/0x4d0
[146841.850827]  ? kmem_cache_free+0x116/0x140
[146841.850830]  ? spl_kmem_cache_free+0x169/0x1c0 [spl]
[146841.850831]  __mutex_lock_slowpath+0x13/0x20
[146841.850832]  mutex_lock+0x34/0x40
[146841.850848]  l2arc_evict+0x28d/0x2c0 [zfs]
[146841.850867]  l2arc_remove_vdev+0x110/0x1b0 [zfs]
[146841.850893]  spa_load_l2cache+0x280/0x4a0 [zfs]
[146841.850921]  ? spa_vdev_remove_aux+0xcb/0x1b0 [zfs]
[146841.850949]  spa_vdev_remove+0x5d4/0x7a0 [zfs]
[146841.850977]  zfs_ioc_vdev_remove+0x50/0x90 [zfs]
[146841.851005]  zfsdev_ioctl+0x85a/0x8c0 [zfs]
[146841.851008]  do_vfs_ioctl+0xaa/0x620
[146841.851010]  ? __audit_syscall_entry+0xdd/0x130
[146841.851012]  ksys_ioctl+0x67/0x90
[146841.851013]  __x64_sys_ioctl+0x1a/0x20
[146841.851014]  do_syscall_64+0x60/0x190
[146841.851015]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[146841.851016] RIP: 0033:0x7fd8b5b698d7
[146841.851017] Code: Bad RIP value.
[146841.851017] RSP: 002b:00007fff93618f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[146841.851018] RAX: ffffffffffffffda RBX: 00000000021092e0 RCX: 00007fd8b5b698d7
[146841.851018] RDX: 00007fff93619370 RSI: 0000000000005a0c RDI: 0000000000000003
[146841.851019] RBP: 00007fff9361c950 R08: 0000000000000080 R09: 0000000000000000
[146841.851019] R10: 00007fff936189e0 R11: 0000000000000246 R12: 00007fff93619370
[146841.851020] R13: 00000000021070f0 R14: 00007fff93618f70 R15: 00000000021197d8


[146841.851021] INFO: task zpool:118505 blocked for more than 120 seconds.
[146841.851021]       Tainted: P           O      4.20.0+ #1
[146841.851022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[146841.851022] zpool           D    0 118505  14831 0x00080080
[146841.851023] Call Trace:
[146841.851025]  __schedule+0x2ab/0x880
[146841.851026]  schedule+0x36/0x80
[146841.851027]  schedule_preempt_disabled+0xe/0x10
[146841.851027]  __mutex_lock.isra.11+0x19e/0x4d0
[146841.851032]  ? nvlist_xalloc.part.16+0x93/0x160 [znvpair]
[146841.851033]  __mutex_lock_slowpath+0x13/0x20
[146841.851034]  mutex_lock+0x34/0x40
[146841.851060]  spa_all_configs+0x40/0x120 [zfs]
[146841.851087]  zfs_ioc_pool_configs+0x1b/0x70 [zfs]
[146841.851115]  zfsdev_ioctl+0x85a/0x8c0 [zfs]
[146841.851118]  do_vfs_ioctl+0xaa/0x620
[146841.851119]  ? __audit_syscall_entry+0xdd/0x130
[146841.851121]  ? handle_mm_fault+0x102/0x220
[146841.851122]  ksys_ioctl+0x67/0x90
[146841.851123]  __x64_sys_ioctl+0x1a/0x20
[146841.851124]  do_syscall_64+0x60/0x190
[146841.851125]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[146841.851126] RIP: 0033:0x7f1baef7b8d7
[146841.851127] Code: Bad RIP value.
[146841.851127] RSP: 002b:00007ffc5a2703b8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[146841.851128] RAX: ffffffffffffffda RBX: 00000000021510f0 RCX: 00007f1baef7b8d7
[146841.851128] RDX: 00007ffc5a2703f0 RSI: 0000000000005a04 RDI: 0000000000000003
[146841.851129] RBP: 00007ffc5a2739d0 R08: ffffffffffffffff R09: 0000000000040000
[146841.851129] R10: 00007ffc5a26fe20 R11: 0000000000000206 R12: 0000000000000000
[146841.851130] R13: 00007ffc5a2703f0 R14: 0000000000406200 R15: 00000000021532b0

Worth noting that I/O to other volumes appears OK; only the ZFS userland commands (zpool and zfs) hang.
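
A sketch of diagnostics worth gathering while the hang is in progress (assumes the magic sysrq key is enabled and root access; PIDs and output will differ per system):

    # Dump stacks of all blocked (D-state) tasks into the kernel log.
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 200

    # List processes stuck in uninterruptible sleep.
    ps -eo pid,stat,comm | awk '$2 ~ /^D/'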

@behlendorf behlendorf added the Type: Defect Incorrect behavior (e.g. crash, hang) label Oct 16, 2019
cdluminate (Contributor) commented

Encountered a similar issue without doing anything. The zpool stores a Debian mirror, and we occasionally discovered it malfunctioning (not responding to nginx/apache read requests).
console_log.txt
ZFS version: 0.8.1-4 (Debian).

h1z1 commented Dec 11, 2019

Sounds almost like #828
#9706 (comment)

Unless I've had too much eggnog already, the interesting thing is...

[146841.850831]  __mutex_lock_slowpath+0x13/0x20
[146841.850832]  mutex_lock+0x34/0x40
[146841.850848]  l2arc_evict+0x28d/0x2c0 [zfs]
[146841.850867]  l2arc_remove_vdev+0x110/0x1b0 [zfs]
[146841.850893]  spa_load_l2cache+0x280/0x4a0 [zfs]
[146841.850921]  ? spa_vdev_remove_aux+0xcb/0x1b0 [zfs]
[146841.850949]  spa_vdev_remove+0x5d4/0x7a0 [zfs]
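
If that lock-inversion reading is right, the two blocked threads should be contending on the same L2ARC device mutex. One way to check on a live system (PIDs 3900 and 116186 are from the trace above and will differ elsewhere; requires root):

    # Stack of the l2arc_feed kernel thread, blocked taking a mutex in l2arc_feed_thread.
    cat /proc/3900/stack

    # Stack of the hung zpool remove, blocked taking a mutex in l2arc_evict.
    cat /proc/116186/stack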

stale bot commented Dec 10, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Dec 10, 2020
@behlendorf behlendorf removed the Status: Stale No recent activity for issue label Dec 28, 2020
stale bot commented Dec 29, 2021

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Dec 29, 2021
@stale stale bot closed this as completed Apr 2, 2022