list_del corruption - ZFS 2.1.0/2.1.3, CentOS 8.5 #13143

doma2203 · 2022-02-22T23:51:03Z

System information

Type	Version/Name
Distribution Name	CentOS
Distribution Version	8.5.2111
Kernel Version	4.18.0-348.2.1.el8_5
Architecture	x86_64
OpenZFS Version	zfs-2.1.0/zfs-2.1.2-27_g1009e609 (both compiled)
Additional information	This problem occurs interchangeably with #13144.

Describe the problem you're observing

We created a single dataset on a pool consisting of a single vdev with one drive, which is a LUN from an all-flash NVMe disk array. During heavy metadata-intensive I/O on this dataset, the following kernel panic is triggered:

Feb 22 18:16:20 ascratch-mds01 kernel: [16066.783270] list_del corruption, ffff9ed0cc31e028->next is LIST_POISON1 (dead000000000100)
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.783679] kernel BUG at lib/list_debug.c:47!

We've upgraded ZFS from 2.1.0 to 2.1.3 (from zfs-2.1.3-staging), but the problem still occurs randomly.
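
For readers unfamiliar with the message: the check that fires lives in the kernel's lib/list_debug.c, and list_del() writes LIST_POISON1 into an entry's next pointer after unlinking it, so this BUG typically means an entry was removed from a list twice (or was used after removal). The block below is a simplified user-space sketch of that mechanism, not the kernel source and not a diagnosis of this crash. The helper names list_init and list_del_checked are made up for illustration, the LIST_POISON2 value is assumed from upstream 4.18, and a 64-bit build is assumed.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct list_head {
    struct list_head *next, *prev;
};

/* LIST_POISON1 matches the value in the log above; LIST_POISON2 is an
 * assumption taken from upstream 4.18 and may differ on this kernel. */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0xdead000000000100ULL)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0xdead000000000200ULL)

/* Hypothetical helper: make an empty circular list. */
static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert entry right after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Simplified version of the sanity check: refuse to unlink an entry
 * whose pointers are poisoned or whose neighbours do not point back. */
static bool list_del_entry_valid(const struct list_head *entry)
{
    if (entry->next == LIST_POISON1) {
        fprintf(stderr, "list_del corruption, %p->next is LIST_POISON1\n",
                (void *)entry);
        return false;
    }
    if (entry->prev == LIST_POISON2) {
        fprintf(stderr, "list_del corruption, %p->prev is LIST_POISON2\n",
                (void *)entry);
        return false;
    }
    if (entry->prev->next != entry || entry->next->prev != entry) {
        fprintf(stderr, "list_del corruption, neighbours of %p disagree\n",
                (void *)entry);
        return false;
    }
    return true;
}

/* Hypothetical helper: unlink and poison. Returns false where the
 * kernel would hit BUG() instead. */
static bool list_del_checked(struct list_head *entry)
{
    if (!list_del_entry_valid(entry))
        return false;
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    entry->next = LIST_POISON1;  /* marks the entry as already removed */
    entry->prev = LIST_POISON2;
    return true;
}
```

In this sketch, calling list_del_checked() twice on the same entry succeeds the first time and fails the second time with the LIST_POISON1 message, which is the same shape of failure reported from dbuf_sync_leaf in the trace below.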

Describe how to reproduce the problem

The problem was originally triggered by ADF (a quantum chemistry HPC code) running on a Lustre MDT backed by ZFS.

Include any warning/errors/backtraces from the system logs

Feb 22 18:16:20 ascratch-mds01 kernel: [16066.783679] kernel BUG at lib/list_debug.c:47!
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.783815] invalid opcode: 0000 [#1] SMP NOPTI
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.783951] CPU: 22 PID: 4087 Comm: dp_sync_taskq Tainted: P          IOE    --------- -  - 4.18.0-348.2.1.el8_5.x86_64 #1
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.784296] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 7.99 03/11/2021
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.784520] RIP: 0010:__list_del_entry_valid.cold.1+0x12/0x4c
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.784692] Code: ca ff 0f 0b 48 89 c1 4c 89 c6 48 c7 c7 50 71 51 b2 e8 6c ba ca ff 0f 0b 48 89 fe 48 89 c2 48 c7 c7 e0 71 51 b2 e8 58 ba ca ff <0f> 0b 48 c7 c7 90 72 51 b2 e8 4a ba ca ff 0f 0b 48 89 f2 48 89 fe
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.785278] RSP: 0018:ffffaaf3e3ad7bc8 EFLAGS: 00010246
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.785441] RAX: 000000000000004e RBX: ffff9ed0cc31e000 RCX: 0000000000000000
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.785662] RDX: 0000000000000000 RSI: ffff9f2b7fc96818 RDI: ffff9f2b7fc96818
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.785885] RBP: ffff9ed01819ca80 R08: 0000000000000791 R09: 0000000000000000
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.786107] R10: 0000000000000000 R11: 0000022000000050 R12: ffff9ed01819ca80
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.786331] R13: ffff9ed0cc31e028 R14: ffff9ecd9e35b640 R15: ffff9f071cd16c00
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.786554] FS:  0000000000000000(0000) GS:ffff9f2b7fc80000(0000) knlGS:0000000000000000
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.786806] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.786986] CR2: 00007f4e6a4ebe80 CR3: 0000001db7e10001 CR4: 00000000007706e0
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.795043] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.803158] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.811206] PKRU: 55555554
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.819175] Call Trace:
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.827031]  dbuf_sync_leaf+0x38d/0x660 [zfs]
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.835029]  ? spa_taskq_dispatch_ent+0x64/0xb0 [zfs]
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.842774]  ? zio_taskq_dispatch+0x61/0xa0 [zfs]
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.850518]  ? zio_issue_async+0xe/0x20 [zfs]
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.858250]  dbuf_sync_list+0xcb/0x110 [zfs]
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.866074]  dnode_sync+0x3fb/0xa30 [zfs]
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.873841]  ? _cond_resched+0x15/0x30
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.881623]  sync_dnodes_task+0x71/0xa0 [zfs]
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.889378]  taskq_thread+0x2f2/0x540 [spl]
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.897151]  ? wake_up_q+0x80/0x80
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.904993]  ? taskq_thread_spawn+0x50/0x50 [spl]
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.912730]  kthread+0x116/0x130
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.920429]  ? kthread_flush_work_fn+0x10/0x10
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.928158]  ret_from_fork+0x1f/0x40
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.935950] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) ko2iblnd(OE) mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) lnet(OE) netconsole libcfs(OE) rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_umad(OE) intel_rapl_msr intel_rapl_common isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass zfs(POE) rapl zunicode(POE) zzstd(OE) intel_cstate zlua(OE) zavl(POE) icp(POE) ipmi_si zcommon(POE) znvpair(POE) ipmi_devintf spl(OE) ipmi_msghandler wdat_wdt intel_uncore pcspkr ahci ses libahci enclosure mei_me scsi_transport_sas libata ioatdma joydev i2c_i801 mei lpc_ich dca acpi_power_meter acpi_cpufreq dm_service_time sd_mod sg qla2xxx nvme_fc nvme_fabrics nvme_core crct10dif_pclmul crc32_pclmul t10_pi crc32c_intel i40e ghash_clmulni_intel scsi_transport_fc megaraid_sas ib_ipoib(OE) ib_cm(OE) mlx5_ib(OE) mlx5_core(OE) mlxdevm(OE) ib_uverbs(OE) ib_core(OE) mlx_compat(OE) psample mlxfw tls pci_hyperv_intf dm_multipath sunrpc
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.935982]  dm_mirror dm_region_hash dm_log dm_mod
Feb 22 18:16:20 ascratch-mds01 kernel: [16067.013612] ---[ end trace d390799aa3e6b823 ]---
Feb 22 18:16:20 ascratch-mds01 kernel: [16067.162579] RIP: 0010:__list_del_entry_valid.cold.1+0x12/0x4c
Feb 22 18:16:20 ascratch-mds01 kernel: [16067.171458] Code: ca ff 0f 0b 48 89 c1 4c 89 c6 48 c7 c7 50 71 51 b2 e8 6c ba ca ff 0f 0b 48 89 fe 48 89 c2 48 c7 c7 e0 71 51 b2 e8 58 ba ca ff <0f> 0b 48 c7 c7 90 72 51 b2 e8 4a ba ca ff 0f 0b 48 89 f2 48 89 fe
Feb 22 18:16:20 ascratch-mds01 kernel: [16067.189495] RSP: 0018:ffffaaf3e3ad7bc8 EFLAGS: 00010246
Feb 22 18:16:21 ascratch-mds01 kernel: [16067.198505] RAX: 000000000000004e RBX: ffff9ed0cc31e000 RCX: 0000000000000000
Feb 22 18:16:21 ascratch-mds01 kernel: [16067.207696] RDX: 0000000000000000 RSI: ffff9f2b7fc96818 RDI: ffff9f2b7fc96818
Feb 22 18:16:21 ascratch-mds01 kernel: [16067.216685] RBP: ffff9ed01819ca80 R08: 0000000000000791 R09: 0000000000000000
Feb 22 18:16:21 ascratch-mds01 kernel: [16067.225537] R10: 0000000000000000 R11: 0000022000000050 R12: ffff9ed01819ca80
Feb 22 18:16:21 ascratch-mds01 kernel: [16067.234369] R13: ffff9ed0cc31e028 R14: ffff9ecd9e35b640 R15: ffff9f071cd16c00
Feb 22 18:16:21 ascratch-mds01 kernel: [16067.243029] FS:  0000000000000000(0000) GS:ffff9f2b7fc80000(0000) knlGS:0000000000000000
Feb 22 18:16:21 ascratch-mds01 kernel: [16067.251665] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 22 18:16:21 ascratch-mds01 kernel: [16067.260191] CR2: 00007f4e6a4ebe80 CR3: 0000001db7e10001 CR4: 00000000007706e0
Feb 22 18:16:20 ascratch-mds01 kernel: list_del corruption, ffff9ed0cc31e028->next is LIST_POISON1 (dead000000000100)

doma2203 · 2022-05-09T10:22:20Z

We decided to close both issues and reformat the filesystem using a supported stack.

doma2203 added the Type: Defect Incorrect behavior (e.g. crash, hang) label Feb 22, 2022

doma2203 mentioned this issue Feb 23, 2022

VERIFY3(sa.sa_magic == SA_MAGIC) failed - ZFS 2.1.0/2.1.3, CentOS 8.5 #13144

Closed

doma2203 closed this as completed May 9, 2022
