System Freeze when running scrub #14972

Open
frawau opened this issue Jun 11, 2023 · 3 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@frawau

frawau commented Jun 11, 2023

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Ubuntu |
| Distribution Version | 22.04, 23.04 |
| Kernel Version | 6.2.0-20 |
| Architecture | x86_64 |
| OpenZFS Version | 2.1.9 |

Describe the problem you're observing

The system freezes regularly, and it also freezes every time I run a scrub.

My system has been behaving weirdly for some time. I was running Ubuntu 22.04, and the command
"zpool status" was sometimes showing all devices with exactly the same number of checksum errors, an unlikely feat.

I was using a TUF X570-based motherboard, so I decided to update the BIOS and the thing died. I thought I had found the culprit.

I replaced the motherboard, the memory modules and the SATA cables.

After changing all that, the problem still happened.

So I decided to switch to Ubuntu 23.04 with Linux kernel 6.x.

The problem still happens.

I am using 6 WD Red 8TB disks in raidz1 mode:

pool: Universe
state: ONLINE
scan: scrub canceled on Sun Jun 11 13:33:54 2023
config:

    NAME        STATE     READ WRITE CKSUM
    Universe    ONLINE       0     0     0
      raidz1-0  ONLINE       0     0     0
        sda     ONLINE       0     0     0
        sdb     ONLINE       0     0     0
        sdc     ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sde     ONLINE       0     0     0
        sdf     ONLINE       0     0     0

My CPU is an AMD Ryzen 5 3600 6-Core Processor
Motherboard is Micro-Star International Co., Ltd. MS-7D54/MAG X570S TORPEDO MAX (MS-7D54)

SMART indicates that all 6 HDDs are OK.

Describe how to reproduce the problem

zpool scrub Universe

Include any warning/errors/backtraces from the system logs

"""
2023-06-11T13:19:57.393010+07:00 portland zed: eid=18 class=scrub_start pool='Universe'
2023-06-11T13:19:57.439992+07:00 portland systemd[1]: Starting systemd-tmpfiles-clean.service - Cleanup of Temporary Directories...
2023-06-11T13:19:57.457021+07:00 portland systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully.
2023-06-11T13:19:57.457151+07:00 portland systemd[1]: Finished systemd-tmpfiles-clean.service - Cleanup of Temporary Directories.
2023-06-11T13:19:57.459596+07:00 portland systemd[1]: run-credentials-systemd\x2dtmpfiles\x2dclean.service.mount: Deactivated successfully.
2023-06-11T13:20:18.699440+07:00 portland kernel: [ 931.878163] ------------[ cut here ]------------
2023-06-11T13:20:18.699451+07:00 portland kernel: [ 931.878169] rq->clock_update_flags < RQCF_ACT_SKIP
2023-06-11T13:20:18.699452+07:00 portland kernel: [ 931.878173] WARNING: CPU: 8 PID: 0 at kernel/sched/sched.h:1491 update_rq_clock+0x184/0x230
2023-06-11T13:20:18.699453+07:00 portland kernel: [ 931.878181] Modules linked in: tls vhost_net vhost vhost_iotlb tap bridge stp llc cfg80211 binfmt_misc nls_iso8859_1 snd_hda_codec_hdmi intel_rapl_msr snd_hda_intel zfs(PO) intel_rapl_common snd_intel_dspcfg snd_usb_audio snd_intel_sdw_acpi snd_usbmidi_lib zunicode(PO) edac_mce_amd snd_hda_codec snd_rawmidi zzstd(O) kvm_amd snd_hda_core snd_seq_device zlua(O) mc snd_hwdep zavl(PO) kvm snd_pcm icp(PO) snd_timer irqbypass zcommon(PO) snd rapl znvpair(PO) wmi_bmof k10temp ccp soundcore spl(O) joydev mac_hid nfsd auth_rpcgss nfs_acl lockd dm_multipath scsi_dh_rdac scsi_dh_emc grace scsi_dh_alua msr efi_pstore sunrpc dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear nouveau mxm_wmi i2c_algo_bit drm_ttm_helper ttm drm_display_helper cec hid_generic crct10dif_pclmul rc_core crc32_pclmul drm_kms_helper polyval_clmulni syscopyarea polyval_generic sysfillrect
2023-06-11T13:20:18.699454+07:00 portland kernel: [ 931.878246] usbhid ghash_clmulni_intel sysimgblt hid sha512_ssse3 aesni_intel nvme drm crypto_simd r8169 ahci xhci_pci nvme_core video cryptd i2c_piix4 libahci xhci_pci_renesas realtek nvme_common wmi
2023-06-11T13:20:18.699454+07:00 portland kernel: [ 931.878260] CPU: 8 PID: 0 Comm: swapper/8 Tainted: P O 6.2.0-20-generic #20-Ubuntu
2023-06-11T13:20:18.699454+07:00 portland kernel: [ 931.878262] Hardware name: Micro-Star International Co., Ltd. MS-7D54/MAG X570S TORPEDO MAX (MS-7D54), BIOS A.60 04/29/2023
2023-06-11T13:20:18.699455+07:00 portland kernel: [ 931.878264] RIP: 0010:update_rq_clock+0x184/0x230
2023-06-11T13:20:18.699455+07:00 portland kernel: [ 931.878267] Code: 0f b6 25 1c b1 c7 02 41 80 fc 01 0f 87 f9 f7 f1 00 41 83 e4 01 75 15 48 c7 c7 90 ec f5 93 c6 05 fe b0 c7 02 01 e8 5c 2c fb ff <0f> 0b 48 8b 93 40 0a 00 00 8b 83 08 0a 00 00 48 89 93 48 0a 00 00
2023-06-11T13:20:18.699456+07:00 portland kernel: [ 931.878268] RSP: 0018:ffffbf508035ce28 EFLAGS: 00010046
2023-06-11T13:20:18.699456+07:00 portland kernel: [ 931.878270] RAX: 0000000000000000 RBX: ffff9b37bf0316c0 RCX: 0000000000000000
2023-06-11T13:20:18.699457+07:00 portland kernel: [ 931.878271] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
2023-06-11T13:20:18.699471+07:00 portland kernel: [ 931.878272] RBP: ffffbf508035ce48 R08: 0000000000000000 R09: 0000000000000000
2023-06-11T13:20:18.699473+07:00 portland kernel: [ 931.878273] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
2023-06-11T13:20:18.699474+07:00 portland kernel: [ 931.878274] R13: 0000000000000008 R14: 0000000000000008 R15: ffff9b30c09c0000
2023-06-11T13:20:18.699474+07:00 portland kernel: [ 931.878275] FS: 0000000000000000(0000) GS:ffff9b37bf000000(0000) knlGS:0000000000000000
2023-06-11T13:20:18.699475+07:00 portland kernel: [ 931.878276] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2023-06-11T13:20:18.699475+07:00 portland kernel: [ 931.878278] CR2: 00005561beebe928 CR3: 00000001938b6000 CR4: 0000000000350ee0
2023-06-11T13:20:18.699476+07:00 portland kernel: [ 931.878279] Call Trace:
2023-06-11T13:20:18.699477+07:00 portland kernel: [ 931.878281]
2023-06-11T13:20:18.699477+07:00 portland kernel: [ 931.878282] ? arch_scale_freq_tick+0x3a/0x120
2023-06-11T13:20:18.699478+07:00 portland kernel: [ 931.878287] scheduler_tick+0x9a/0x330
2023-06-11T13:20:18.699478+07:00 portland kernel: [ 931.878290] update_process_times+0x89/0xb0
2023-06-11T13:20:18.699478+07:00 portland kernel: [ 931.878293] tick_sched_handle+0x29/0x70
2023-06-11T13:20:18.699479+07:00 portland kernel: [ 931.878296] tick_sched_timer+0x70/0x90
2023-06-11T13:20:18.699479+07:00 portland kernel: [ 931.878298] ? __pfx_tick_sched_timer+0x10/0x10
2023-06-11T13:20:18.699479+07:00 portland kernel: [ 931.878300] __hrtimer_run_queues+0x108/0x280
2023-06-11T13:20:18.699480+07:00 portland kernel: [ 931.878303] hrtimer_interrupt+0xf6/0x250
2023-06-11T13:20:18.699480+07:00 portland kernel: [ 931.878306] __sysvec_apic_timer_interrupt+0x62/0x140
2023-06-11T13:20:18.699481+07:00 portland kernel: [ 931.878309] sysvec_apic_timer_interrupt+0x8d/0xd0
2023-06-11T13:20:18.699481+07:00 portland kernel: [ 931.878313]
2023-06-11T13:20:18.699481+07:00 portland kernel: [ 931.878314]
2023-06-11T13:20:18.699481+07:00 portland kernel: [ 931.878315] asm_sysvec_apic_timer_interrupt+0x1b/0x20
2023-06-11T13:20:18.699482+07:00 portland kernel: [ 931.878318] RIP: 0010:cpuidle_enter_state+0xde/0x6f0
2023-06-11T13:20:18.699482+07:00 portland kernel: [ 931.878322] Code: f3 ce 6c e8 04 d1 42 ff 8b 53 04 49 89 c7 0f 1f 44 00 00 31 ff e8 62 bb 41 ff 80 7d d0 00 0f 85 eb 00 00 00 fb 0f 1f 44 00 00 <45> 85 f6 0f 88 12 02 00 00 4d 63 ee 49 83 fd 09 0f 87 c7 04 00 00
2023-06-11T13:20:18.699482+07:00 portland kernel: [ 931.878323] RSP: 0018:ffffbf508019fe28 EFLAGS: 00000246
2023-06-11T13:20:18.699483+07:00 portland kernel: [ 931.878324] RAX: 0000000000000000 RBX: ffff9b30c47ec000 RCX: 0000000000000000
2023-06-11T13:20:18.699483+07:00 portland kernel: [ 931.878325] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000
2023-06-11T13:20:18.699484+07:00 portland kernel: [ 931.878326] RBP: ffffbf508019fe78 R08: 0000000000000000 R09: 0000000000000000
2023-06-11T13:20:18.699484+07:00 portland kernel: [ 931.878327] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff952d51e0
2023-06-11T13:20:18.699484+07:00 portland kernel: [ 931.878328] R13: 0000000000000002 R14: 0000000000000002 R15: 000000d8f8444d31
2023-06-11T13:20:18.699485+07:00 portland kernel: [ 931.878331] ? cpuidle_enter_state+0xce/0x6f0
2023-06-11T13:20:18.699485+07:00 portland kernel: [ 931.878333] cpuidle_enter+0x2e/0x50
2023-06-11T13:20:18.699485+07:00 portland kernel: [ 931.878335] cpuidle_idle_call+0x153/0x1e0
2023-06-11T13:20:18.699486+07:00 portland kernel: [ 931.878338] do_idle+0x82/0x100
2023-06-11T13:20:18.699486+07:00 portland kernel: [ 931.878339] cpu_startup_entry+0x1d/0x20
2023-06-11T13:20:18.699486+07:00 portland kernel: [ 931.878341] start_secondary+0x122/0x160
2023-06-11T13:20:18.699487+07:00 portland kernel: [ 931.878343] secondary_startup_64_no_verify+0xe5/0xeb
2023-06-11T13:20:18.699487+07:00 portland kernel: [ 931.878348]
2023-06-11T13:20:18.699487+07:00 portland kernel: [ 931.878348] ---[ end trace 0000000000000000 ]---
2023-06-11T13:20:18.699488+07:00 portland kernel: [ 931.881305] BUG: kernel NULL pointer dereference, address: 0000000000000004
2023-06-11T13:20:18.699488+07:00 portland kernel: [ 931.881312] #PF: supervisor read access in kernel mode
2023-06-11T13:20:18.699488+07:00 portland kernel: [ 931.881315] #PF: error_code(0x0000) - not-present page
2023-06-11T13:20:18.699489+07:00 portland kernel: [ 931.881318] PGD 0 P4D 0
2023-06-11T13:20:18.699489+07:00 portland kernel: [ 931.881321] Oops: 0000 [#1] PREEMPT SMP NOPTI
2023-06-11T13:20:18.699489+07:00 portland kernel: [ 931.881324] CPU: 1 PID: 69723 Comm: z_rd_int_1 Tainted: P W O 6.2.0-20-generic #20-Ubuntu
2023-06-11T13:20:18.699490+07:00 portland kernel: [ 931.881328] Hardware name: Micro-Star International Co., Ltd. MS-7D54/MAG X570S TORPEDO MAX (MS-7D54), BIOS A.60 04/29/2023
2023-06-11T13:20:18.699490+07:00 portland kernel: [ 931.881331] RIP: 0010:abd_is_gang+0x0/0x10 [zfs]
2023-06-11T13:20:18.699490+07:00 portland kernel: [ 931.881586] Code: 90 90 90 90 90 90 90 90 90 90 8b 07 c1 e8 05 83 e0 01 31 ff e9 91 25 e6 d1 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 <8b> 07 c1 e8 06 83 e0 01 31 ff e9 71 25 e6 d1 90 90 90 90 90 90 90
2023-06-11T13:20:18.699491+07:00 portland kernel: [ 931.881591] RSP: 0018:ffffbf509558fd38 EFLAGS: 00010202
2023-06-11T13:20:18.699491+07:00 portland kernel: [ 931.881594] RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000000
2023-06-11T13:20:18.699491+07:00 portland kernel: [ 931.881596] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004
2023-06-11T13:20:18.699492+07:00 portland kernel: [ 931.881598] RBP: ffffbf509558fd48 R08: 0000000000000000 R09: 0000000000000000
2023-06-11T13:20:18.699492+07:00 portland kernel: [ 931.881601] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9b30e5d11000
2023-06-11T13:20:18.699493+07:00 portland kernel: [ 931.881603] R13: 0000000000000000 R14: ffff9b3314401940 R15: ffff9b3315a33f60
2023-06-11T13:20:18.699493+07:00 portland kernel: [ 931.881605] FS: 0000000000000000(0000) GS:ffff9b37bee40000(0000) knlGS:0000000000000000
2023-06-11T13:20:18.699493+07:00 portland kernel: [ 931.881609] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2023-06-11T13:20:18.699493+07:00 portland kernel: [ 931.881612] CR2: 0000000000000004 CR3: 0000000194bf0000 CR4: 0000000000350ee0
2023-06-11T13:20:18.699494+07:00 portland kernel: [ 931.881615] Call Trace:
2023-06-11T13:20:18.699494+07:00 portland kernel: [ 931.881616]
2023-06-11T13:20:18.699495+07:00 portland kernel: [ 931.881618] ? abd_free+0x1b/0xb0 [zfs]
2023-06-11T13:20:18.699495+07:00 portland kernel: [ 931.881727] vdev_raidz_row_free+0x38/0xa0 [zfs]
2023-06-11T13:20:18.699495+07:00 portland kernel: [ 931.881877] vdev_raidz_map_free+0x29/0x60 [zfs]
2023-06-11T13:20:18.699495+07:00 portland kernel: [ 931.882019] vdev_raidz_map_free_vsd+0x15/0x20 [zfs]
2023-06-11T13:20:18.699502+07:00 portland kernel: [ 931.882152] zio_vdev_io_assess+0x52/0x2f0 [zfs]
2023-06-11T13:20:18.825539+07:00 portland kernel: [ 931.882282] zio_execute+0x92/0xf0 [zfs]
2023-06-11T13:20:18.825556+07:00 portland kernel: [ 931.882406] taskq_thread+0x229/0x400 [spl]
2023-06-11T13:20:18.825556+07:00 portland kernel: [ 931.882420] ? __pfx_default_wake_function+0x10/0x10
2023-06-11T13:20:18.825557+07:00 portland kernel: [ 931.882424] ? __pfx_zio_execute+0x10/0x10 [zfs]
2023-06-11T13:20:18.825558+07:00 portland kernel: [ 931.882547] ? __pfx_taskq_thread+0x10/0x10 [spl]
2023-06-11T13:20:18.825558+07:00 portland kernel: [ 931.882558] kthread+0xe9/0x110
2023-06-11T13:20:18.825559+07:00 portland kernel: [ 931.882562] ? __pfx_kthread+0x10/0x10
2023-06-11T13:20:18.825560+07:00 portland kernel: [ 931.882566] ret_from_fork+0x2c/0x50
2023-06-11T13:20:18.825560+07:00 portland kernel: [ 931.882570]
2023-06-11T13:20:18.825560+07:00 portland kernel: [ 931.882571] Modules linked in: tls vhost_net vhost vhost_iotlb tap bridge stp llc cfg80211 binfmt_misc nls_iso8859_1 snd_hda_codec_hdmi intel_rapl_msr snd_hda_intel zfs(PO) intel_rapl_common snd_intel_dspcfg snd_usb_audio snd_intel_sdw_acpi snd_usbmidi_lib zunicode(PO) edac_mce_amd snd_hda_codec snd_rawmidi zzstd(O) kvm_amd snd_hda_core snd_seq_device zlua(O) mc snd_hwdep zavl(PO) kvm snd_pcm icp(PO) snd_timer irqbypass zcommon(PO) snd rapl znvpair(PO) wmi_bmof k10temp ccp soundcore spl(O) joydev mac_hid nfsd auth_rpcgss nfs_acl lockd dm_multipath scsi_dh_rdac scsi_dh_emc grace scsi_dh_alua msr efi_pstore sunrpc dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear nouveau mxm_wmi i2c_algo_bit drm_ttm_helper ttm drm_display_helper cec hid_generic crct10dif_pclmul rc_core crc32_pclmul drm_kms_helper polyval_clmulni syscopyarea polyval_generic sysfillrect
2023-06-11T13:20:18.825561+07:00 portland kernel: [ 931.882612] usbhid ghash_clmulni_intel sysimgblt hid sha512_ssse3 aesni_intel nvme drm crypto_simd r8169 ahci xhci_pci nvme_core video cryptd i2c_piix4 libahci xhci_pci_renesas realtek nvme_common wmi
2023-06-11T13:20:18.825562+07:00 portland kernel: [ 931.882639] CR2: 0000000000000004
2023-06-11T13:20:18.825562+07:00 portland kernel: [ 931.882642] ---[ end trace 0000000000000000 ]---
2023-06-11T13:20:18.825563+07:00 portland kernel: [ 932.008113] RIP: 0010:abd_is_gang+0x0/0x10 [zfs]
2023-06-11T13:20:18.825563+07:00 portland kernel: [ 932.008244] Code: 90 90 90 90 90 90 90 90 90 90 8b 07 c1 e8 05 83 e0 01 31 ff e9 91 25 e6 d1 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 <8b> 07 c1 e8 06 83 e0 01 31 ff e9 71 25 e6 d1 90 90 90 90 90 90 90
2023-06-11T13:20:18.825564+07:00 portland kernel: [ 932.008249] RSP: 0018:ffffbf509558fd38 EFLAGS: 00010202
2023-06-11T13:20:18.825565+07:00 portland kernel: [ 932.008252] RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000000
2023-06-11T13:20:18.825574+07:00 portland kernel: [ 932.008255] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004
2023-06-11T13:20:18.825575+07:00 portland kernel: [ 932.008257] RBP: ffffbf509558fd48 R08: 0000000000000000 R09: 0000000000000000
2023-06-11T13:20:18.825576+07:00 portland kernel: [ 932.008260] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9b30e5d11000
2023-06-11T13:20:18.825576+07:00 portland kernel: [ 932.008262] R13: 0000000000000000 R14: ffff9b3314401940 R15: ffff9b3315a33f60
2023-06-11T13:20:18.825577+07:00 portland kernel: [ 932.008265] FS: 0000000000000000(0000) GS:ffff9b37bee40000(0000) knlGS:0000000000000000
2023-06-11T13:20:18.825578+07:00 portland kernel: [ 932.008268] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2023-06-11T13:20:18.825579+07:00 portland kernel: [ 932.008270] CR2: 0000000000000004 CR3: 0000000194bf0000 CR4: 0000000000350ee0
2023-06-11T13:20:18.825579+07:00 portland kernel: [ 932.008273] note: z_rd_int_1[69723] exited with irqs disabled
2023-06-11T13:20:19.043091+07:00 portland zed: eid=20 class=checksum pool='Universe' vdev=sdd1 size=28672 offset=23001530368 priority=4 err=0 flags=0x1008b0 bookmark=387:128:0:55891
2023-06-11T13:20:19.518599+07:00 portland zed: eid=21 class=checksum pool='Universe' vdev=sdc1 size=28672 offset=23101927424 priority=4 err=0 flags=0x1008b0 bookmark=387:128:0:59567
2023-06-11T13:20:20.768537+07:00 portland zed: eid=22 class=checksum pool='Universe' vdev=sdc1 size=28672 offset=23366561792 priority=4 err=0 flags=0x1008b0 bookmark=387:128:0:69257

"""
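To compare how checksum events are spread across the disks, the `zed` lines in a log like the one above can be tallied per vdev. The following is only a sketch: it assumes the syslog line layout shown in this report (a ` zed: ` marker followed by `key=value` fields), and the helper names `parse_zed_event` and `checksum_events_per_vdev` are made up for illustration, not part of any ZFS tooling.

```python
import re
from collections import Counter

# key=value pairs; a value is either single-quoted (pool='Universe') or a bare token.
PAIR_RE = re.compile(r"(\w+)=('[^']*'|\S+)")


def parse_zed_event(line):
    """Return the key=value fields of a zed log line, or None for other lines."""
    marker = " zed: "  # assumption: syslog format as seen in this report
    if marker not in line:
        return None
    payload = line.split(marker, 1)[1]
    return {key: value.strip("'") for key, value in PAIR_RE.findall(payload)}


def checksum_events_per_vdev(lines):
    """Count class=checksum events per vdev name."""
    counts = Counter()
    for line in lines:
        event = parse_zed_event(line)
        if event and event.get("class") == "checksum":
            counts[event["vdev"]] += 1
    return counts


# Lines copied from the log in this report.
sample = [
    "2023-06-11T13:19:57.393010+07:00 portland zed: eid=18 class=scrub_start pool='Universe'",
    "2023-06-11T13:20:19.043091+07:00 portland zed: eid=20 class=checksum pool='Universe' vdev=sdd1 size=28672 offset=23001530368 priority=4 err=0 flags=0x1008b0 bookmark=387:128:0:55891",
    "2023-06-11T13:20:19.518599+07:00 portland zed: eid=21 class=checksum pool='Universe' vdev=sdc1 size=28672 offset=23101927424 priority=4 err=0 flags=0x1008b0 bookmark=387:128:0:59567",
    "2023-06-11T13:20:20.768537+07:00 portland zed: eid=22 class=checksum pool='Universe' vdev=sdc1 size=28672 offset=23366561792 priority=4 err=0 flags=0x1008b0 bookmark=387:128:0:69257",
]

for vdev, count in sorted(checksum_events_per_vdev(sample).items()):
    print(vdev, count)
```

On the three checksum events captured here, this counts two for `sdc1` and one for `sdd1`; a longer log would be needed to say anything about the overall distribution.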

@frawau frawau added the Type: Defect Incorrect behavior (e.g. crash, hang) label Jun 11, 2023
@almightiest

Sounds like a bad power supply or a power issue under load to me... If the system also freezes outside of a scrub operation (under heavy I/O, or randomly), then it doesn't seem like a ZFS-specific bug?

@frawau
Author

frawau commented Jun 16, 2023

Thanks.

Good idea. I will replace the power supply, just in case.

@rincebrain
Contributor

rincebrain commented Jun 17, 2023

FWIW, if a raidz stripe fails a checksum and can't be reconstructed, ZFS counts the error against all the disks in the stripe, since it can't know which one is wrong, at least in the single-parity case.
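
A toy model may help illustrate why identical checksum counts on every disk are not surprising. This is a sketch, not the actual ZFS reconstruction code: it assumes a stripe of equal-sized data columns plus one XOR parity column, uses SHA-256 over the concatenated data as a stand-in for the block checksum, and the function names (`xor_blocks`, `checksum`, `blame`) are hypothetical.

```python
import hashlib
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def checksum(data_columns):
    """Stand-in block checksum over the data columns (assumption: SHA-256)."""
    return hashlib.sha256(b"".join(data_columns)).digest()


def blame(columns, expected):
    """Decide which columns get charged with a checksum error.

    `columns` holds the data columns followed by one XOR parity column,
    as read back from the disks. The toy only checks the data checksum;
    it does not separately verify parity.
    """
    data = columns[:-1]
    if checksum(data) == expected:
        return []  # block reads back fine
    # Assume each data column in turn is the bad one and rebuild it from the rest.
    for i in range(len(data)):
        others = columns[:i] + columns[i + 1:]
        rebuilt = xor_blocks(others)
        trial = data[:i] + [rebuilt] + data[i + 1:]
        if checksum(trial) == expected:
            return [i]  # exactly one column was wrong, and we found it
    # No single-column rebuild satisfies the checksum, so with single parity
    # there is no way to tell which disks are at fault: charge all of them.
    return list(range(len(columns)))


# Six columns, mirroring a six-disk raidz1: five data columns plus parity.
data = [bytes([n] * 4) for n in range(1, 6)]
stripe = data + [xor_blocks(data)]
expected = checksum(data)

one_bad = list(stripe)
one_bad[2] = b"\xff" * 4
print(blame(one_bad, expected))  # [2]

two_bad = list(one_bad)
two_bad[4] = b"\xee" * 4
print(blame(two_bad, expected))  # [0, 1, 2, 3, 4, 5]
```

In the second case more went wrong than single parity can repair, so every column is charged once, which would show up as the same CKSUM count on all disks.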
