Describe the problem you're observing
We created a single dataset on a pool consisting of a single vdev with one drive, which is a LUN from an all-flash NVMe disk array. During heavy metadata-intensive I/O on this dataset, the following kernel panic is triggered:
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.783270] list_del corruption, ffff9ed0cc31e028->next is LIST_POISON1 (dead000000000100)
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.783679] kernel BUG at lib/list_debug.c:47!
We've upgraded ZFS from 2.1.0 to 2.1.3 (from zfs-2.1.3-staging), but the problem still occurs randomly.
Describe how to reproduce the problem
The problem was originally triggered by ADF (a quantum chemistry HPC code) running against a Lustre MDT backed by ZFS.
Include any warning/errors/backtraces from the system logs
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.783679] kernel BUG at lib/list_debug.c:47!
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.783815] invalid opcode: 0000 [#1] SMP NOPTI
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.783951] CPU: 22 PID: 4087 Comm: dp_sync_taskq Tainted: P IOE --------- - - 4.18.0-348.2.1.el8_5.x86_64 #1
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.784296] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 7.99 03/11/2021
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.784520] RIP: 0010:__list_del_entry_valid.cold.1+0x12/0x4c
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.784692] Code: ca ff 0f 0b 48 89 c1 4c 89 c6 48 c7 c7 50 71 51 b2 e8 6c ba ca ff 0f 0b 48 89 fe 48 89 c2 48 c7 c7 e0 71 51 b2 e8 58 ba ca ff <0f> 0b 48 c7 c7 90 72 51 b2 e8 4a ba ca ff 0f 0b 48 89 f2 48 89 fe
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.785278] RSP: 0018:ffffaaf3e3ad7bc8 EFLAGS: 00010246
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.785441] RAX: 000000000000004e RBX: ffff9ed0cc31e000 RCX: 0000000000000000
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.785662] RDX: 0000000000000000 RSI: ffff9f2b7fc96818 RDI: ffff9f2b7fc96818
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.785885] RBP: ffff9ed01819ca80 R08: 0000000000000791 R09: 0000000000000000
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.786107] R10: 0000000000000000 R11: 0000022000000050 R12: ffff9ed01819ca80
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.786331] R13: ffff9ed0cc31e028 R14: ffff9ecd9e35b640 R15: ffff9f071cd16c00
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.786554] FS: 0000000000000000(0000) GS:ffff9f2b7fc80000(0000) knlGS:0000000000000000
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.786806] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.786986] CR2: 00007f4e6a4ebe80 CR3: 0000001db7e10001 CR4: 00000000007706e0
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.795043] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.803158] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.811206] PKRU: 55555554
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.819175] Call Trace:
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.827031] dbuf_sync_leaf+0x38d/0x660 [zfs]
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.835029] ? spa_taskq_dispatch_ent+0x64/0xb0 [zfs]
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.842774] ? zio_taskq_dispatch+0x61/0xa0 [zfs]
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.850518] ? zio_issue_async+0xe/0x20 [zfs]
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.858250] dbuf_sync_list+0xcb/0x110 [zfs]
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.866074] dnode_sync+0x3fb/0xa30 [zfs]
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.873841] ? _cond_resched+0x15/0x30
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.881623] sync_dnodes_task+0x71/0xa0 [zfs]
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.889378] taskq_thread+0x2f2/0x540 [spl]
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.897151] ? wake_up_q+0x80/0x80
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.904993] ? taskq_thread_spawn+0x50/0x50 [spl]
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.912730] kthread+0x116/0x130
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.920429] ? kthread_flush_work_fn+0x10/0x10
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.928158] ret_from_fork+0x1f/0x40
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.935950] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) ko2iblnd(OE) mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) lnet(OE) netconsole libcfs(OE) rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_umad(OE) intel_rapl_msr intel_rapl_common isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass zfs(POE) rapl zunicode(POE) zzstd(OE) intel_cstate zlua(OE) zavl(POE) icp(POE) ipmi_si zcommon(POE) znvpair(POE) ipmi_devintf spl(OE) ipmi_msghandler wdat_wdt intel_uncore pcspkr ahci ses libahci enclosure mei_me scsi_transport_sas libata ioatdma joydev i2c_i801 mei lpc_ich dca acpi_power_meter acpi_cpufreq dm_service_time sd_mod sg qla2xxx nvme_fc nvme_fabrics nvme_core crct10dif_pclmul crc32_pclmul t10_pi crc32c_intel i40e ghash_clmulni_intel scsi_transport_fc megaraid_sas ib_ipoib(OE) ib_cm(OE) mlx5_ib(OE) mlx5_core(OE) mlxdevm(OE) ib_uverbs(OE) ib_core(OE) mlx_compat(OE) psample mlxfw tls pci_hyperv_intf dm_multipath sunrpc
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.935982] dm_mirror dm_region_hash dm_log dm_mod
Feb 22 18:16:20 ascratch-mds01 kernel: [16067.013612] ---[ end trace d390799aa3e6b823 ]---
Feb 22 18:16:20 ascratch-mds01 kernel: [16067.162579] RIP: 0010:__list_del_entry_valid.cold.1+0x12/0x4c
Feb 22 18:16:20 ascratch-mds01 kernel: [16067.171458] Code: ca ff 0f 0b 48 89 c1 4c 89 c6 48 c7 c7 50 71 51 b2 e8 6c ba ca ff 0f 0b 48 89 fe 48 89 c2 48 c7 c7 e0 71 51 b2 e8 58 ba ca ff <0f> 0b 48 c7 c7 90 72 51 b2 e8 4a ba ca ff 0f 0b 48 89 f2 48 89 fe
Feb 22 18:16:20 ascratch-mds01 kernel: [16067.189495] RSP: 0018:ffffaaf3e3ad7bc8 EFLAGS: 00010246
Feb 22 18:16:21 ascratch-mds01 kernel: [16067.198505] RAX: 000000000000004e RBX: ffff9ed0cc31e000 RCX: 0000000000000000
Feb 22 18:16:21 ascratch-mds01 kernel: [16067.207696] RDX: 0000000000000000 RSI: ffff9f2b7fc96818 RDI: ffff9f2b7fc96818
Feb 22 18:16:21 ascratch-mds01 kernel: [16067.216685] RBP: ffff9ed01819ca80 R08: 0000000000000791 R09: 0000000000000000
Feb 22 18:16:21 ascratch-mds01 kernel: [16067.225537] R10: 0000000000000000 R11: 0000022000000050 R12: ffff9ed01819ca80
Feb 22 18:16:21 ascratch-mds01 kernel: [16067.234369] R13: ffff9ed0cc31e028 R14: ffff9ecd9e35b640 R15: ffff9f071cd16c00
Feb 22 18:16:21 ascratch-mds01 kernel: [16067.243029] FS: 0000000000000000(0000) GS:ffff9f2b7fc80000(0000) knlGS:0000000000000000
Feb 22 18:16:21 ascratch-mds01 kernel: [16067.251665] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 22 18:16:21 ascratch-mds01 kernel: [16067.260191] CR2: 00007f4e6a4ebe80 CR3: 0000001db7e10001 CR4: 00000000007706e0
Feb 22 18:16:20 ascratch-mds01 kernel: list_del corruption, ffff9ed0cc31e028->next is LIST_POISON1 (dead000000000100)