PANIC: rpool: blkptr at ... DVA 0 has invalid OFFSET 18388167655883276288 #12019

Closed

niziak opened this issue May 10, 2021 · 10 comments
Labels
Status: Stale (No recent activity for issue), Status: Triage Needed (New issue which needs to be triaged), Type: Defect (Incorrect behavior, e.g. crash, hang)

Comments

niziak commented May 10, 2021

System information

Distribution Name: Proxmox
Distribution Version: 6.3-6/2184247e
Linux Kernel: 5.4.106-1-pve
Architecture: x86_64
ZFS Version: 2.0.4-pve1
SPL Version: 2.0.4-pve1

Describe the problem you're observing

The system partially stops responding during normal workload (periodic snapshots, send/receive).
The kernel reports a PANIC, after which every operation on the local filesystems (all on ZFS) hangs. Services already loaded into RAM keep working, but I cannot log in remotely over SSH (because it needs to access the FS).

The node is now idle (all services have been migrated off it), so I can run any tests or experiments. I'm open to suggestions on how to clean up this error.

Questions:

  • It looks like a permanent inconsistency on ZFS (I also saw this error about a month earlier; after that I installed ECC RAM and thought the problem was gone). How can I find which file / volume the block number shown in the panic belongs to? (See the sketch after the pool config below.)
  • There is no problem with the HDDs (scanned for bad blocks; SMART doesn't report anything either).
  • Why does the kernel get stuck and fail to reboot automatically even though panic=30 is set on the kernel cmdline?
  • Could this be related to the ZIL and cache devices on the NVMe disk?
config:
	NAME                                       STATE     READ WRITE CKSUM
	rpool                                      ONLINE       0     0     0
	  mirror-0                                 ONLINE       0     0     0
	    ata-ST2000VN004-2E4164_Z529HN5G-part3  ONLINE       0     0     0
	    ata-ST2000VN004-2E4164_Z529KW0A-part3  ONLINE       0     0     0
	logs	
	  a5005d25-97e3-4541-a60b-16384c03a7bc     ONLINE       0     0     0
	cache
	  nvme0n1p5                                ONLINE       0     0     0
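
Regarding the first question: the tuple in the zdb error lines below, e.g. <617, 1, 0, 3d103>, is <objset, object, level, blkid>, so the objset ID can be mapped back to a dataset. A hedged sketch (the dataset name in the example output is illustrative, not taken from this system):

# List datasets with their objset IDs and find the one matching 617:
zdb -d rpool | grep "ID 617"
# e.g. -> Dataset rpool/data/vm-100-disk-0 [ZVOL], ID 617, ...
# Then dump that object (object 1, per the tuple) for details:
zdb -ddddd rpool/data/vm-100-disk-0 1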

Describe how to reproduce the problem

# zdb -bcsvL rpool
...
zdb_blkptr_cb: Got error 52 reading <617, 1, 0, 3d103> DVA[0]=<0:188145da000:2000> [L0 zvol object] fletcher4 uncompressed unencrypted LE contiguous unique single size=2000L/2000P birth=6373518L/6373518P fill=1 cksum=26fe304c267:9830389637801:1918c2a69fc3597b:47d95a551770185 -- skipping
zdb_blkptr_cb: Got error 52 reading <617, 1, 0, 3d108> DVA[0]=<0:188145e0000:2000> [L0 zvol object] fletcher4 uncompressed unencrypted LE contiguous unique single size=2000L/2000P birth=6373518L/6373518P fill=1 cksum=273614ce481:9c98ab35fb26a:1a562f4282dd3e4c:289de3f37292b2a6 -- skipping
zdb_blkptr_cb: Got error 52 reading <617, 1, 0, 3d106> DVA[0]=<0:188145e2000:2000> [L0 zvol object] fletcher4 uncompressed unencrypted LE contiguous unique single size=2000L/2000P birth=6373518L/6373518P fill=1 cksum=26ab83c199c:9d04818f3c9fa:1a7fa0dfca3401be:75d5f1ccc5ef4ff1 -- skipping
zdb_blkptr_cb: Got error 52 reading <617, 1, 0, 3d10b> DVA[0]=<0:188145ef000:2000> [L0 zvol object] fletcher4 uncompressed unencrypted LE contiguous unique single size=2000L/2000P birth=6373518L/6373518P fill=1 cksum=26ca9b459f8:9ae99bb2b8456:19d1cf6bfb63436e:b3d24edd01d842bf -- skipping
zdb_blkptr_cb: Got error 52 reading <617, 1, 0, 3d10e> DVA[0]=<0:188145f5000:2000> [L0 zvol object] fletcher4 uncompressed unencrypted LE contiguous unique single size=2000L/2000P birth=6373518L/6373518P fill=1 cksum=273c6e561f0:9c1723f11ef67:1a01b288e13086c0:1ff3a858920d3669 -- skipping

...
error: rpool: blkptr at 0x7f684e2d4000 DVA 0 has invalid OFFSET 18388167655883276288
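
As a side note, the offending offset is easier to interpret in hex (a one-liner; any language works, Python shown here):

python3 -c 'print(hex(18388167655883276288))'
# -> 0xff2fe50eef400000

No offset on a 2 TB mirror can be anywhere near that value, and the top bits are all set, which suggests the block pointer contents themselves are damaged (e.g. a few flipped bits) rather than pointing at a real bad region on disk.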

Include any warning/errors/backtraces from the system logs

[Sun May  9 14:34:11 2021]  PANIC: rpool: blkptr at 00000000a44c5bb3 DVA 0 has invalid OFFSET 18388167655883276288
[Sun May  9 14:34:11 2021]  Showing stack for process 4882
[Sun May  9 14:34:11 2021]  CPU: 2 PID: 4882 Comm: txg_sync Tainted: P           O      5.4.106-1-pve #1
[Sun May  9 14:34:11 2021]  Hardware name: System manufacturer System Product Name/TUF GAMING X570-PLUS, BIOS 3603 03/20/2021
[Sun May  9 14:34:11 2021]  Call Trace:
[Sun May  9 14:34:11 2021]  dump_stack+0x6d/0x8b
[Sun May  9 14:34:11 2021]  spl_dumpstack+0x29/0x2b [spl]
[Sun May  9 14:34:11 2021]  vcmn_err.cold.1+0x60/0x94 [spl]
[Sun May  9 14:34:11 2021]  ? find_busiest_group+0x47/0x530
[Sun May  9 14:34:11 2021]  ? spl_kmem_cache_alloc+0x7c/0x770 [spl]
[Sun May  9 14:34:11 2021]  ? put_dec+0x93/0xa0
[Sun May  9 14:34:11 2021]  ? number+0x31f/0x360
[Sun May  9 14:34:11 2021]  zfs_panic_recover+0x6f/0x90 [zfs]
[Sun May  9 14:34:11 2021]  zfs_blkptr_verify_log+0x94/0x100 [zfs]
[Sun May  9 14:34:11 2021]  ? newidle_balance+0x233/0x3c0
[Sun May  9 14:34:11 2021]  ? vdev_default_asize+0x5f/0x90 [zfs]
[Sun May  9 14:34:11 2021]  zfs_blkptr_verify+0x3ab/0x460 [zfs]
[Sun May  9 14:34:11 2021]  zio_read+0x47/0xc0 [zfs]
[Sun May  9 14:34:11 2021]  ? dsl_scan_prefetch_thread+0x290/0x290 [zfs]
[Sun May  9 14:34:11 2021]  scan_exec_io+0x167/0x230 [zfs]
[Sun May  9 14:34:11 2021]  ? scan_io_queue_insert_impl+0xd7/0xe0 [zfs]
[Sun May  9 14:34:11 2021]  dsl_scan_scrub_cb+0x70e/0x770 [zfs]
[Sun May  9 14:34:11 2021]  ? spl_kmem_alloc+0xdc/0x130 [spl]
[Sun May  9 14:34:11 2021]  dsl_scan_visitbp+0x68d/0xcd0 [zfs]
[Sun May  9 14:34:11 2021]  ? arc_hdr_set_compress+0x50/0x50 [zfs]
[Sun May  9 14:34:11 2021]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[Sun May  9 14:34:11 2021]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[Sun May  9 14:34:11 2021]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[Sun May  9 14:34:11 2021]  dsl_scan_visitbp+0x5b6/0xcd0 [zfs]
[Sun May  9 14:34:11 2021]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[Sun May  9 14:34:11 2021]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[Sun May  9 14:34:11 2021]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[Sun May  9 14:34:11 2021]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[Sun May  9 14:34:11 2021]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[Sun May  9 14:34:11 2021]  dsl_scan_visitbp+0x826/0xcd0 [zfs]
[Sun May  9 14:34:11 2021]  dsl_scan_visit_rootbp+0xe8/0x150 [zfs]
[Sun May  9 14:34:11 2021]  dsl_scan_visitds+0x1ae/0x510 [zfs]
[Sun May  9 14:34:11 2021]  ? dnode_rele_and_unlock+0xb6/0xe0 [zfs]
[Sun May  9 14:34:11 2021]  ? dnode_rele+0x3b/0x40 [zfs]
[Sun May  9 14:34:11 2021]  ? dbuf_rele_and_unlock+0x306/0x6a0 [zfs]
[Sun May  9 14:34:11 2021]  ? dsl_dataset_hold_obj+0x68c/0x9e0 [zfs]
[Sun May  9 14:34:11 2021]  ? dbuf_rele+0x3b/0x40 [zfs]
[Sun May  9 14:34:11 2021]  dsl_scan_sync+0x90e/0x1320 [zfs]
[Sun May  9 14:34:11 2021]  spa_sync+0x610/0xfe0 [zfs]
[Sun May  9 14:34:11 2021]  ? mutex_lock+0x12/0x30
[Sun May  9 14:34:11 2021]  ? spa_txg_history_init_io+0x104/0x110 [zfs]
[Sun May  9 14:34:11 2021]  txg_sync_thread+0x2e1/0x4a0 [zfs]
[Sun May  9 14:34:11 2021]  ? txg_thread_exit.isra.13+0x60/0x60 [zfs]
[Sun May  9 14:34:11 2021]  thread_generic_wrapper+0x74/0x90 [spl]
[Sun May  9 14:34:11 2021]  kthread+0x120/0x140
[Sun May  9 14:34:11 2021]  ? __thread_exit+0x20/0x20 [spl]
[Sun May  9 14:34:11 2021]  ? kthread_park+0x90/0x90
[Sun May  9 14:34:11 2021]  ret_from_fork+0x1f/0x40

A similar error from over a month earlier (probably with OpenZFS v0.8.4):

[ 541.783776] PANIC: rpool: blkptr at 000000003a3d1018 DVA 0 has invalid OFFSET 18388167655883276288
[ 541.783871] Showing stack for process 630328
[ 541.783940] CPU: 8 PID: 630328 Comm: zvol Tainted: P O 5.4.106-1-pve #1
[ 541.784027] Hardware name: System manufacturer System Product Name/TUF GAMING X570-PLUS, BIOS 3603 03/20/2021
[ 541.784124] Call Trace:
[ 541.784195] dump_stack+0x6d/0x8b
[ 541.784267] spl_dumpstack+0x29/0x2b [spl]
[ 541.784337] vcmn_err.cold.1+0x60/0x94 [spl]
[ 541.784407] ? prep_new_page+0x129/0x160
[ 541.784471] ? put_dec+0x93/0xa0
[ 541.784533] ? number+0x31f/0x360
[ 541.784659] zfs_panic_recover+0x6f/0x90 [zfs]
[ 541.784787] zfs_blkptr_verify_log+0x94/0x100 [zfs]
[ 541.784859] ? __sg_alloc_table+0x70/0x170
[ 541.784927] ? sg_alloc_table+0x23/0x50
[ 541.784997] ? _cond_resched+0x19/0x30
[ 541.785068] ? mutex_lock+0x12/0x30
[ 541.785168] ? aggsum_add+0x188/0x1b0 [zfs]
[ 541.785288] ? vdev_default_asize+0x5f/0x90 [zfs]
[ 541.785411] zfs_blkptr_verify+0x3ab/0x460 [zfs]
[ 541.785528] zio_read+0x47/0xc0 [zfs]
[ 541.785627] ? arc_read+0x1210/0x1210 [zfs]
[ 541.785734] arc_read+0xb9d/0x1210 [zfs]
[ 541.785854] ? dbuf_rele_and_unlock+0x6a0/0x6a0 [zfs]
[ 541.785922] ? _cond_resched+0x19/0x30
[ 541.786023] dbuf_read_impl.constprop.33+0x2a8/0x6e0 [zfs]
[ 541.786136] dbuf_read+0x1b2/0x510 [zfs]
[ 541.786241] dmu_buf_hold_array_by_dnode+0x10c/0x4a0 [zfs]
[ 541.786352] dmu_read_uio_dnode+0x49/0xf0 [zfs]
[ 541.786464] zvol_read+0x102/0x300 [zfs]
[ 541.786527] taskq_thread+0x2f7/0x4e0 [spl]
[ 541.786584] ? wake_up_q+0x80/0x80
[ 541.786687] ? zvol_write+0x4e0/0x4e0 [zfs]
[ 541.786745] kthread+0x120/0x140
[ 541.786805] ? task_done+0xb0/0xb0 [spl]
[ 541.786863] ? kthread_park+0x90/0x90
[ 541.786922] ret_from_fork+0x1f/0x40
@niziak niziak added Status: Triage Needed New issue which needs to be triaged Type: Defect Incorrect behavior (e.g. crash, hang) labels May 10, 2021
niziak (Author) commented May 10, 2021

dmesg output when a scrub hits this block:

[72476.810319] PANIC: rpool: blkptr at 00000000937f0760 DVA 0 has invalid OFFSET 18388167655883276288
[72476.810407] Showing stack for process 3333
[72476.810468] CPU: 18 PID: 3333 Comm: txg_sync Tainted: P           O      5.4.106-1-pve #1
[72476.810551] Hardware name: System manufacturer System Product Name/TUF GAMING X570-PLUS, BIOS 3603 03/20/2021
[72476.810639] Call Trace:
[72476.810699]  dump_stack+0x6d/0x8b
[72476.810761]  spl_dumpstack+0x29/0x2b [spl]
[72476.810822]  vcmn_err.cold.1+0x60/0x94 [spl]
[72476.810885]  ? spl_kmem_cache_alloc+0x7c/0x770 [spl]
[72476.810949]  ? put_dec+0x93/0xa0
[72476.811006]  ? number+0x31f/0x360
[72476.811098]  zfs_panic_recover+0x6f/0x90 [zfs]
[72476.811189]  zfs_blkptr_verify_log+0x94/0x100 [zfs]
[72476.811253]  ? newidle_balance+0x233/0x3c0
[72476.811314]  ? __switch_to+0x85/0x480
[72476.811401]  ? vdev_default_asize+0x5f/0x90 [zfs]
[72476.811490]  zfs_blkptr_verify+0x3ab/0x460 [zfs]
[72476.811582]  zio_read+0x47/0xc0 [zfs]
[72476.811666]  ? dsl_scan_prefetch_thread+0x290/0x290 [zfs]
[72476.811754]  scan_exec_io+0x167/0x230 [zfs]
[72476.811836]  ? scan_io_queue_insert_impl+0xd7/0xe0 [zfs]
[72476.811920]  dsl_scan_scrub_cb+0x70e/0x770 [zfs]
[72476.811983]  ? spl_kmem_alloc+0xdc/0x130 [spl]
[72476.812066]  dsl_scan_visitbp+0x68d/0xcd0 [zfs]
[72476.812145]  ? arc_hdr_set_compress+0x50/0x50 [zfs]
[72476.812228]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[72476.812310]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[72476.812393]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[72476.812475]  dsl_scan_visitbp+0x5b6/0xcd0 [zfs]
[72476.812559]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[72476.812642]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[72476.812724]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[72476.812807]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[72476.812889]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[72476.812971]  dsl_scan_visitbp+0x826/0xcd0 [zfs]
[72476.813053]  dsl_scan_visit_rootbp+0xe8/0x150 [zfs]
[72476.813136]  dsl_scan_visitds+0x1ae/0x510 [zfs]
[72476.813199]  ? spl_kmem_cache_alloc+0x7c/0x770 [spl]
[72476.813262]  ? _cond_resched+0x19/0x30
[72476.813321]  ? __kmalloc_node+0x1e0/0x330
[72476.813382]  ? spl_kmem_alloc+0xdc/0x130 [spl]
[72476.813445]  ? spl_kmem_alloc+0xdc/0x130 [spl]
[72476.813507]  ? tsd_hash_search.isra.5+0x47/0xa0 [spl]
[72476.813575]  ? tsd_set+0x1a2/0x4f0 [spl]
[72476.813660]  ? rrw_enter_read_impl+0xaf/0x160 [zfs]
[72476.813744]  dsl_scan_sync+0x858/0x1320 [zfs]
[72476.813830]  spa_sync+0x610/0xfe0 [zfs]
[72476.813889]  ? mutex_lock+0x12/0x30
[72476.813974]  ? spa_txg_history_init_io+0x104/0x110 [zfs]
[72476.814063]  txg_sync_thread+0x2e1/0x4a0 [zfs]
[72476.814148]  ? txg_thread_exit.isra.13+0x60/0x60 [zfs]
[72476.814213]  thread_generic_wrapper+0x74/0x90 [spl]
[72476.814276]  kthread+0x120/0x140
[72476.814335]  ? __thread_exit+0x20/0x20 [spl]
[72476.814395]  ? kthread_park+0x90/0x90
[72476.814454]  ret_from_fork+0x1f/0x40

Message from syslogd@pve3 at May 10 14:47:53 ...
 kernel:[72476.810319] PANIC: rpool: blkptr at 00000000937f0760 DVA 0 has invalid OFFSET 18388167655883276288

[72620.215386] INFO: task dp_sync_taskq:641 blocked for more than 120 seconds.
[72620.215582]       Tainted: P           O      5.4.106-1-pve #1
[72620.215651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[72620.215738] dp_sync_taskq   D    0   641      2 0x80004000
[72620.215810] Call Trace:
[72620.215875]  __schedule+0x2e6/0x700
[72620.215937]  schedule+0x33/0xa0
[72620.216004]  cv_wait_common+0x104/0x130 [spl]
[72620.216071]  ? wait_woken+0x80/0x80
[72620.216138]  __cv_wait+0x15/0x20 [spl]
[72620.216254]  dsl_scan_prefetch_thread+0xad/0x290 [zfs]
[72620.216326]  taskq_thread+0x2f7/0x4e0 [spl]
[72620.216391]  ? wake_up_q+0x80/0x80
[72620.216453]  kthread+0x120/0x140
[72620.216517]  ? task_done+0xb0/0xb0 [spl]
[72620.216580]  ? kthread_park+0x90/0x90
[72620.216642]  ret_from_fork+0x1f/0x40
[72620.216710] INFO: task txg_sync:3333 blocked for more than 120 seconds.
[72620.216782]       Tainted: P           O      5.4.106-1-pve #1
[72620.216850] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[72620.216936] txg_sync        D    0  3333      2 0x80004000
[72620.217004] Call Trace:
[72620.217063]  __schedule+0x2e6/0x700
[72620.218696]  schedule+0x33/0xa0
[72620.218774]  vcmn_err.cold.1+0x92/0x94 [spl]
[72620.218845]  ? spl_kmem_cache_alloc+0x7c/0x770 [spl]
[72620.218914]  ? put_dec+0x93/0xa0
[72620.218974]  ? number+0x31f/0x360
[72620.219098]  zfs_panic_recover+0x6f/0x90 [zfs]
[72620.219221]  zfs_blkptr_verify_log+0x94/0x100 [zfs]
[72620.219291]  ? newidle_balance+0x233/0x3c0
[72620.219356]  ? __switch_to+0x85/0x480
[72620.219472]  ? vdev_default_asize+0x5f/0x90 [zfs]
[72620.219592]  zfs_blkptr_verify+0x3ab/0x460 [zfs]
[72620.219706]  zio_read+0x47/0xc0 [zfs]
[72620.219818]  ? dsl_scan_prefetch_thread+0x290/0x290 [zfs]
[72620.219932]  scan_exec_io+0x167/0x230 [zfs]
[72620.220037]  ? scan_io_queue_insert_impl+0xd7/0xe0 [zfs]
[72620.220146]  dsl_scan_scrub_cb+0x70e/0x770 [zfs]
[72620.220215]  ? spl_kmem_alloc+0xdc/0x130 [spl]
[72620.220320]  dsl_scan_visitbp+0x68d/0xcd0 [zfs]
[72620.220420]  ? arc_hdr_set_compress+0x50/0x50 [zfs]
[72620.220526]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[72620.220631]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[72620.220736]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[72620.220841]  dsl_scan_visitbp+0x5b6/0xcd0 [zfs]
[72620.220947]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[72620.221053]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[72620.221158]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[72620.221264]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[72620.221368]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[72620.221471]  dsl_scan_visitbp+0x826/0xcd0 [zfs]
[72620.221576]  dsl_scan_visit_rootbp+0xe8/0x150 [zfs]
[72620.221682]  dsl_scan_visitds+0x1ae/0x510 [zfs]
[72620.221750]  ? spl_kmem_cache_alloc+0x7c/0x770 [spl]
[72620.221817]  ? _cond_resched+0x19/0x30
[72620.221881]  ? __kmalloc_node+0x1e0/0x330
[72620.221947]  ? spl_kmem_alloc+0xdc/0x130 [spl]
[72620.222014]  ? spl_kmem_alloc+0xdc/0x130 [spl]
[72620.222082]  ? tsd_hash_search.isra.5+0x47/0xa0 [spl]
[72620.222152]  ? tsd_set+0x1a2/0x4f0 [spl]
[72620.222264]  ? rrw_enter_read_impl+0xaf/0x160 [zfs]
[72620.222372]  dsl_scan_sync+0x858/0x1320 [zfs]
[72620.222485]  spa_sync+0x610/0xfe0 [zfs]
[72620.222548]  ? mutex_lock+0x12/0x30
[72620.222661]  ? spa_txg_history_init_io+0x104/0x110 [zfs]
[72620.222784]  txg_sync_thread+0x2e1/0x4a0 [zfs]
[72620.222909]  ? txg_thread_exit.isra.13+0x60/0x60 [zfs]
[72620.222983]  thread_generic_wrapper+0x74/0x90 [spl]
[72620.223060]  kthread+0x120/0x140
[72620.223129]  ? __thread_exit+0x20/0x20 [spl]
[72620.223198]  ? kthread_park+0x90/0x90
[72620.223261]  ret_from_fork+0x1f/0x40
[72620.223388] INFO: task zfs:27691 blocked for more than 120 seconds.
[72620.223459]       Tainted: P           O      5.4.106-1-pve #1
[72620.223528] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[72620.223614] zfs             D    0 27691  27158 0x80004000
[72620.223682] Call Trace:
[72620.223741]  __schedule+0x2e6/0x700
[72620.223803]  schedule+0x33/0xa0
[72620.223863]  io_schedule+0x16/0x40
[72620.223927]  cv_wait_common+0xb5/0x130 [spl]
[72620.223991]  ? wait_woken+0x80/0x80
[72620.224056]  __cv_wait_io+0x18/0x20 [spl]
[72620.224169]  txg_wait_synced_impl+0xc9/0x110 [zfs]
[72620.224285]  txg_wait_synced+0x10/0x40 [zfs]
[72620.224403]  dsl_sync_task_common+0x1b5/0x290 [zfs]
[72620.224514]  ? redact_check.isra.13+0x150/0x150 [zfs]
[72620.224624]  ? receive_cksum+0x30/0x30 [zfs]
[72620.224731]  ? redact_check.isra.13+0x150/0x150 [zfs]
[72620.224841]  ? receive_cksum+0x30/0x30 [zfs]
[72620.224960]  dsl_sync_task+0x1a/0x20 [zfs]
[72620.225069]  dmu_recv_begin+0x1e8/0x3a0 [zfs]
[72620.225194]  zfs_ioc_recv_impl+0x136/0x1030 [zfs]
[72620.225261]  ? dbs_update_util_handler+0x1b/0xa0
[72620.225326]  ? update_load_avg+0x63b/0x660
[72620.225394]  ? nvs_native_nvlist+0x85/0xa0 [znvpair]
[72620.225464]  ? nvs_operation+0x175/0x310 [znvpair]
[72620.225587]  zfs_ioc_recv+0x18b/0x330 [zfs]
[72620.225653]  ? security_capable+0x3f/0x60
[72620.225717]  ? ns_capable_common+0x2f/0x50
[72620.225780]  ? capable+0x19/0x20
[72620.225840]  ? _cond_resched+0x19/0x30
[72620.225904]  ? __kmalloc_node+0x1e0/0x330
[72620.225971]  ? spl_kmem_alloc+0xdc/0x130 [spl]
[72620.226091]  zfsdev_ioctl_common+0x5b2/0x820 [zfs]
[72620.226157]  ? __kmalloc_node+0x267/0x330
[72620.226223]  ? spl_kmem_free+0x33/0x40 [spl]
[72620.226343]  zfsdev_ioctl+0x54/0xe0 [zfs]
[72620.226408]  do_vfs_ioctl+0xa9/0x640
[72620.226471]  ? __vfs_read+0x29/0x40
[72620.226532]  ? vfs_read+0x99/0x160
[72620.226593]  ksys_ioctl+0x67/0x90
[72620.226654]  __x64_sys_ioctl+0x1a/0x20
[72620.226717]  do_syscall_64+0x57/0x190
[72620.226786]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[72620.226861] RIP: 0033:0x7f5e1841c427
[72620.226929] Code: Bad RIP value.
[72620.226992] RSP: 002b:00007ffc44040cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[72620.227081] RAX: ffffffffffffffda RBX: 00007ffc44044450 RCX: 00007f5e1841c427
[72620.227158] RDX: 00007ffc44040ea0 RSI: 0000000000005a1b RDI: 0000000000000006
[72620.227231] RBP: 00007ffc44045490 R08: 0000000000000003 R09: 00007f5e184e7dc0
[72620.227304] R10: 0000560d4c8bc010 R11: 0000000000000246 R12: 00007ffc44040ea0
[72620.227377] R13: 00007ffc4404b2f0 R14: 0000560d4c8c5c20 R15: 0000000000000000
[72741.044921] INFO: task dp_sync_taskq:641 blocked for more than 241 seconds.
[72741.044999]       Tainted: P           O      5.4.106-1-pve #1
[72741.045068] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[72741.045155] dp_sync_taskq   D    0   641      2 0x80004000
[72741.045224] Call Trace:
[72741.045290]  __schedule+0x2e6/0x700
[72741.045354]  schedule+0x33/0xa0
[72741.045419]  cv_wait_common+0x104/0x130 [spl]
[72741.045484]  ? wait_woken+0x80/0x80
[72741.045548]  __cv_wait+0x15/0x20 [spl]
[72741.045663]  dsl_scan_prefetch_thread+0xad/0x290 [zfs]
[72741.045735]  taskq_thread+0x2f7/0x4e0 [spl]
[72741.045800]  ? wake_up_q+0x80/0x80
[72741.045862]  kthread+0x120/0x140
[72741.045925]  ? task_done+0xb0/0xb0 [spl]
[72741.045988]  ? kthread_park+0x90/0x90
[72741.046051]  ret_from_fork+0x1f/0x40
[72741.046118] INFO: task txg_sync:3333 blocked for more than 241 seconds.
[72741.046189]       Tainted: P           O      5.4.106-1-pve #1
[72741.046257] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[72741.046343] txg_sync        D    0  3333      2 0x80004000
[72741.046411] Call Trace:
[72741.046470]  __schedule+0x2e6/0x700
[72741.046531]  schedule+0x33/0xa0
[72741.046594]  vcmn_err.cold.1+0x92/0x94 [spl]
[72741.046662]  ? spl_kmem_cache_alloc+0x7c/0x770 [spl]
[72741.046729]  ? put_dec+0x93/0xa0
[72741.046789]  ? number+0x31f/0x360
[72741.046907]  zfs_panic_recover+0x6f/0x90 [zfs]
[72741.047028]  zfs_blkptr_verify_log+0x94/0x100 [zfs]
[72741.047096]  ? newidle_balance+0x233/0x3c0
[72741.047162]  ? __switch_to+0x85/0x480
[72741.047277]  ? vdev_default_asize+0x5f/0x90 [zfs]
[72741.047396]  zfs_blkptr_verify+0x3ab/0x460 [zfs]
[72741.047509]  zio_read+0x47/0xc0 [zfs]
[72741.047619]  ? dsl_scan_prefetch_thread+0x290/0x290 [zfs]
[72741.047732]  scan_exec_io+0x167/0x230 [zfs]
[72741.047837]  ? scan_io_queue_insert_impl+0xd7/0xe0 [zfs]
[72741.047946]  dsl_scan_scrub_cb+0x70e/0x770 [zfs]
[72741.048014]  ? spl_kmem_alloc+0xdc/0x130 [spl]
[72741.048119]  dsl_scan_visitbp+0x68d/0xcd0 [zfs]
[72741.048218]  ? arc_hdr_set_compress+0x50/0x50 [zfs]
[72741.048324]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[72741.048443]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[72741.048553]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[72741.048663]  dsl_scan_visitbp+0x5b6/0xcd0 [zfs]
[72741.048776]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[72741.048887]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[72741.048995]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[72741.049100]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[72741.049206]  dsl_scan_visitbp+0x2ac/0xcd0 [zfs]
[72741.049311]  dsl_scan_visitbp+0x826/0xcd0 [zfs]
[72741.049417]  dsl_scan_visit_rootbp+0xe8/0x150 [zfs]
[72741.049524]  dsl_scan_visitds+0x1ae/0x510 [zfs]
[72741.049592]  ? spl_kmem_cache_alloc+0x7c/0x770 [spl]
[72741.049660]  ? _cond_resched+0x19/0x30
[72741.049724]  ? __kmalloc_node+0x1e0/0x330
[72741.049789]  ? spl_kmem_alloc+0xdc/0x130 [spl]
[72741.049858]  ? spl_kmem_alloc+0xdc/0x130 [spl]
[72741.049926]  ? tsd_hash_search.isra.5+0x47/0xa0 [spl]
[72741.049996]  ? tsd_set+0x1a2/0x4f0 [spl]
[72741.050108]  ? rrw_enter_read_impl+0xaf/0x160 [zfs]
[72741.050216]  dsl_scan_sync+0x858/0x1320 [zfs]
[72741.050331]  spa_sync+0x610/0xfe0 [zfs]
[72741.050395]  ? mutex_lock+0x12/0x30
[72741.050507]  ? spa_txg_history_init_io+0x104/0x110 [zfs]
[72741.050624]  txg_sync_thread+0x2e1/0x4a0 [zfs]
[72741.050734]  ? txg_thread_exit.isra.13+0x60/0x60 [zfs]
[72741.050805]  thread_generic_wrapper+0x74/0x90 [spl]
[72741.050872]  kthread+0x120/0x140
[72741.050935]  ? __thread_exit+0x20/0x20 [spl]
[72741.050998]  ? kthread_park+0x90/0x90
[72741.051061]  ret_from_fork+0x1f/0x40
[72741.051186] INFO: task zfs:27691 blocked for more than 241 seconds.
[72741.051257]       Tainted: P           O      5.4.106-1-pve #1
[72741.052876] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[72741.052975] zfs             D    0 27691  27158 0x80004000
[72741.053046] Call Trace:
[72741.053107]  __schedule+0x2e6/0x700
[72741.053171]  schedule+0x33/0xa0
[72741.053234]  io_schedule+0x16/0x40
[72741.053297]  cv_wait_common+0xb5/0x130 [spl]
[72741.053361]  ? wait_woken+0x80/0x80
[72741.053426]  __cv_wait_io+0x18/0x20 [spl]
[72741.053534]  txg_wait_synced_impl+0xc9/0x110 [zfs]
[72741.053643]  txg_wait_synced+0x10/0x40 [zfs]
[72741.053753]  dsl_sync_task_common+0x1b5/0x290 [zfs]
[72741.053860]  ? redact_check.isra.13+0x150/0x150 [zfs]
[72741.053966]  ? receive_cksum+0x30/0x30 [zfs]
[72741.054069]  ? redact_check.isra.13+0x150/0x150 [zfs]
[72741.054174]  ? receive_cksum+0x30/0x30 [zfs]
[72741.054282]  dsl_sync_task+0x1a/0x20 [zfs]
[72741.054384]  dmu_recv_begin+0x1e8/0x3a0 [zfs]
[72741.054504]  zfs_ioc_recv_impl+0x136/0x1030 [zfs]
[72741.054571]  ? dbs_update_util_handler+0x1b/0xa0
[72741.054637]  ? update_load_avg+0x63b/0x660
[72741.054705]  ? nvs_native_nvlist+0x85/0xa0 [znvpair]
[72741.054774]  ? nvs_operation+0x175/0x310 [znvpair]
[72741.054892]  zfs_ioc_recv+0x18b/0x330 [zfs]
[72741.054959]  ? security_capable+0x3f/0x60
[72741.055024]  ? ns_capable_common+0x2f/0x50
[72741.055087]  ? capable+0x19/0x20
[72741.055148]  ? _cond_resched+0x19/0x30
[72741.055210]  ? __kmalloc_node+0x1e0/0x330
[72741.055276]  ? spl_kmem_alloc+0xdc/0x130 [spl]
[72741.055391]  zfsdev_ioctl_common+0x5b2/0x820 [zfs]
[72741.055457]  ? __kmalloc_node+0x267/0x330
[72741.055523]  ? spl_kmem_free+0x33/0x40 [spl]
[72741.055639]  zfsdev_ioctl+0x54/0xe0 [zfs]
[72741.055703]  do_vfs_ioctl+0xa9/0x640
[72741.055765]  ? __vfs_read+0x29/0x40
[72741.055827]  ? vfs_read+0x99/0x160
[72741.055888]  ksys_ioctl+0x67/0x90
[72741.055949]  __x64_sys_ioctl+0x1a/0x20
[72741.056012]  do_syscall_64+0x57/0x190
[72741.056074]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[72741.056141] RIP: 0033:0x7f5e1841c427
[72741.056207] Code: Bad RIP value.
[72741.056267] RSP: 002b:00007ffc44040cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[72741.056353] RAX: ffffffffffffffda RBX: 00007ffc44044450 RCX: 00007f5e1841c427
[72741.056433] RDX: 00007ffc44040ea0 RSI: 0000000000005a1b RDI: 0000000000000006
[72741.056506] RBP: 00007ffc44045490 R08: 0000000000000003 R09: 00007f5e184e7dc0
[72741.056579] R10: 0000560d4c8bc010 R11: 0000000000000246 R12: 00007ffc44040ea0
[72741.056654] R13: 00007ffc4404b2f0 R14: 0000560d4c8c5c20 R15: 0000000000000000
[72741.056733] INFO: task zfs:40728 blocked for more than 120 seconds.
[72741.056806]       Tainted: P           O      5.4.106-1-pve #1
[72741.056875] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[72741.056961] zfs             D    0 40728  40726 0x80004000
[72741.057029] Call Trace:
[72741.057088]  __schedule+0x2e6/0x700
[72741.057150]  schedule+0x33/0xa0
[72741.057210]  io_schedule+0x16/0x40
[72741.057274]  cv_wait_common+0xb5/0x130 [spl]
[72741.057338]  ? wait_woken+0x80/0x80
[72741.057402]  __cv_wait_io+0x18/0x20 [spl]
[72741.057518]  txg_wait_synced_impl+0xc9/0x110 [zfs]
[72741.057631]  txg_wait_synced+0x10/0x40 [zfs]
[72741.057743]  dsl_sync_task_common+0x1b5/0x290 [zfs]
[72741.057850]  ? redact_check.isra.13+0x150/0x150 [zfs]
[72741.057956]  ? receive_cksum+0x30/0x30 [zfs]
[72741.058059]  ? redact_check.isra.13+0x150/0x150 [zfs]
[72741.058163]  ? receive_cksum+0x30/0x30 [zfs]
[72741.058273]  dsl_sync_task+0x1a/0x20 [zfs]
[72741.058375]  dmu_recv_begin+0x1e8/0x3a0 [zfs]
[72741.058492]  zfs_ioc_recv_impl+0x136/0x1030 [zfs]
[72741.058560]  ? spl_kvmalloc+0x7e/0xc0 [spl]
[72741.058627]  ? spl_kmem_alloc_impl+0x125/0x130 [spl]
[72741.058697]  ? nvs_operation+0x175/0x310 [znvpair]
[72741.058813]  zfs_ioc_recv+0x18b/0x330 [zfs]
[72741.058879]  ? ns_capable_common+0x2f/0x50
[72741.058942]  ? capable+0x19/0x20
[72741.059002]  ? _cond_resched+0x19/0x30
[72741.059064]  ? __kmalloc_node+0x1e0/0x330
[72741.059130]  ? spl_kmem_alloc+0xdc/0x130 [spl]
[72741.059242]  zfsdev_ioctl_common+0x5b2/0x820 [zfs]
[72741.059308]  ? __kmalloc_node+0x267/0x330
[72741.059374]  ? spl_kmem_free+0x33/0x40 [spl]
[72741.059488]  zfsdev_ioctl+0x54/0xe0 [zfs]
[72741.059552]  do_vfs_ioctl+0xa9/0x640
[72741.059615]  ? handle_mm_fault+0xc9/0x1f0
[72741.059677]  ksys_ioctl+0x67/0x90
[72741.059738]  __x64_sys_ioctl+0x1a/0x20
[72741.059801]  do_syscall_64+0x57/0x190
[72741.059863]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[72741.059929] RIP: 0033:0x7fac38d17427
[72741.059991] Code: Bad RIP value.
[72741.060051] RSP: 002b:00007ffef3c4b598 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[72741.060136] RAX: ffffffffffffffda RBX: 00007ffef3c4ecf0 RCX: 00007fac38d17427
[72741.060209] RDX: 00007ffef3c4b740 RSI: 0000000000005a1b RDI: 0000000000000006
[72741.060281] RBP: 00007ffef3c4fd30 R08: 0000000000000003 R09: 00007fac38de2dc0
[72741.060354] R10: 000055f083f21010 R11: 0000000000000246 R12: 00007ffef3c4b740
[72741.060432] R13: 00007ffef3c55b90 R14: 000055f083f26580 R15: 0000000000000000

deem0n commented May 16, 2021

I see a similar message on FreeBSD 13.

A scrub results in some errors, but reports the pool as ONLINE:

  pool: macabu
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 140K in 00:00:11 with 0 errors on Mon May 17 00:06:56 2021
config:

	NAME        STATE     READ WRITE CKSUM
	macabu      ONLINE       0     0     0
	  raidz1-0  ONLINE       1     1     0
	    ada0p1  ONLINE       0     0     0
	    ada5p1  ONLINE       1     1     0
	    ada3p1  ONLINE       1     1     0

errors: No known data errors

/var/log/messages:

May 16 22:14:19 macabu kernel: Solaris: WARNING: macabu: blkptr at 0xfffffe00c58db9c0 DVA 1 has invalid OFFSET 288243918394605568
May 17 00:06:40 macabu kernel: Solaris: WARNING: macabu: blkptr at 0xfffffe00c58db9c0 DVA 0 has invalid OFFSET 288243918394605568
May 17 00:06:40 macabu ZFS[23512]: vdev I/O failure, zpool=macabu path=/dev/ada5p1 offset=96081306135728128 size=4096 error=5
May 17 00:06:40 macabu ZFS[24632]: vdev I/O failure, zpool=macabu path=/dev/ada3p1 offset=96081306135728128 size=4096 error=5
May 17 00:06:40 macabu ZFS[25672]: vdev I/O failure, zpool=macabu path= offset=288243918394605568 size=4096 error=5
May 17 00:06:40 macabu ZFS[27307]: vdev I/O failure, zpool=macabu path=/dev/ada3p1 offset=96081306135728128 size=4096 error=5
May 17 00:06:40 macabu ZFS[28981]: vdev I/O failure, zpool=macabu path=/dev/ada5p1 offset=96081306135728128 size=4096 error=5
May 17 00:06:40 macabu ZFS[30088]: vdev I/O failure, zpool=macabu path= offset=288243918394605568 size=4096 error=5
May 17 00:06:56 macabu kernel: Solaris: WARNING: macabu: blkptr at 0xfffffe012a4639c0 DVA 1 has invalid OFFSET 288243918394605568
May 17 00:06:56 macabu kernel: Solaris: WARNING: macabu: blkptr at 0xfffffe012a4639c0 DVA 0 has invalid OFFSET 288243918394605568

niziak (Author) commented May 24, 2021

@deem0n You have a clear situation: errors on the underlying devices. Mine are fine; my errors appear spurious. I think my HW is not stable.

niziak (Author) commented May 24, 2021

Sometimes, during a zdb -AAA -bbbcsvL rpool run, I receive Got error 52 reading. I dug into the code: this translates to the checksum error ECKSUM (see the snippet below). So it looks like some HW component is not stable? It is desktop-based HW:

  • Ryzen 9
  • X570 chipset
  • 2 x 32 GB ECC RAM

I also have a single NVMe disk used as L2ARC (cache) and ZIL (log). Perhaps the NVMe device produces the errors? Is the L2ARC on NVMe used during zdb checks?
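
For reference, the mapping from error 52 to a checksum failure comes from the errno compatibility header in the ZFS sources, where ECKSUM is aliased to EBADE (errno 52 on Linux). The exact header path varies between versions:

/* e.g. include/os/linux/spl/sys/errno.h (path varies by ZFS version) */
#define ECKSUM	EBADE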

I also found that the ZFS driver code doesn't really handle PANIC. It contains leftover development code which halts the thread, causing the I/O lock-ups seen later:

printk(KERN_EMERG "PANIC: %s\n", msg);
spl_dumpstack();

/* Halt the thread to facilitate further debugging */
set_current_state(TASK_UNINTERRUPTIBLE);
while (1)
  schedule();

rincebrain (Contributor) commented May 24, 2021

I believe zdb does not use ARC or L2ARC, no.

And what do you mean, "as it looks like"? Are you expecting it to actually kill the system when triggering? Because if that's what you want, I believe the tunable spl_panic_halt may be relevant to your interests.
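
A hedged sketch of enabling that tunable at runtime (the sysfs path assumes a standard Linux SPL build; whether it actually takes effect on this code path is what the next comments explore):

# With the tunable set, a ZFS PANIC escalates to a real kernel panic,
# so a kernel cmdline setting like panic=30 can reboot the machine.
echo 1 > /sys/module/spl/parameters/spl_panic_halt
# Or persistently, as a module option:
echo "options spl spl_panic_halt=1" >> /etc/modprobe.d/spl.conf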

niziak (Author) commented May 24, 2021

Thanks for pointing me to spl_panic_halt. In my scenario the "PANIC" was handled by vcmn_err, which doesn't contain this code:

if (spl_panic_halt)
	panic("%s", msg);

Is this intentional, or should it perhaps be corrected in SPL?
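
A minimal sketch of what such a correction might look like, mirroring the spl_panic() handling quoted above (the placement inside vcmn_err()'s CE_PANIC branch is an assumption, not the actual patch):

/* Sketch only: make vcmn_err() honor spl_panic_halt like spl_panic() does. */
if (ce == CE_PANIC) {
	printk(KERN_EMERG "PANIC: %s\n", msg);
	spl_dumpstack();

	if (spl_panic_halt)
		panic("%s", msg);	/* real panic; honors the panic= cmdline */

	/* Otherwise halt the thread to facilitate further debugging. */
	set_current_state(TASK_UNINTERRUPTIBLE);
	while (1)
		schedule();
}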

rincebrain (Contributor) commented:

Seems like a not unreasonable patch to me, but that might be deliberate for some reason; feel free to try it and see if the world burns down.

niziak (Author) commented May 25, 2021

PR created: #12120

I know the root cause lies elsewhere (I have ZFS running stably on other systems), but at least this change lets me maintain the system remotely.

deem0n commented May 27, 2021

> @deem0n You have a clear situation: errors on the underlying devices. Mine are fine; my errors appear spurious. I think my HW is not stable.

Hi, interestingly enough I no longer see any vdev I/O failure. I.e. the hardware errors were reported once, but I observed DVA 1 has invalid OFFSET both before the hardware error and after it! I made a full copy of all my files from ZFS with tar and saw only one error in the log: DVA 1 has invalid OFFSET. zpool status shows no errors after that! An attempt to run zpool scrub results in two DVA 1 has invalid OFFSET errors and this final status:

  pool: macabu
 state: ONLINE
  scan: resilvered 140K in 00:01:24 with 0 errors on Mon May 24 14:59:48 2021
config:

	NAME        STATE     READ WRITE CKSUM
	macabu      ONLINE       0     0     0
	  raidz1-0  ONLINE       1     1     0
	    ada0p1  ONLINE       0     0     0
	    ada5p1  ONLINE       1     1     0
	    ada3p1  ONLINE       1     1     0

errors: No known data errors

I assume that my pool has some internal inconsistency which cannot be fixed by a scrub. Should I rebuild the ZFS pool to fix it?
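
One commonly suggested path when metadata damage survives scrubs is to replicate everything to a fresh pool and switch over. A hedged sketch (the snapshot name and target pool are illustrative):

# Take a recursive snapshot and replicate the whole pool:
zfs snapshot -r macabu@migrate
zfs send -R macabu@migrate | zfs receive -F newpool
# Verify the copy, then destroy and recreate the old pool from it.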

stale bot commented May 31, 2022

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale bot added the Status: Stale label May 31, 2022
stale bot closed this as completed Sep 24, 2022