list_del corruption - ZFS 2.1.0/2.1.3, CentOS 8.5 #13143

Closed
doma2203 opened this issue Feb 22, 2022 · 1 comment
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)


doma2203 commented Feb 22, 2022

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | CentOS |
| Distribution Version | 8.5.2111 |
| Kernel Version | 4.18.0-348.2.1.el8_5 |
| Architecture | x86_64 |
| OpenZFS Version | zfs-2.1.0/zfs-2.1.2-27_g1009e609 (both compiled) |
| Additional information | This problem occurs interchangeably with #13144. |

Describe the problem you're observing

We created a single dataset on a pool that consists of a single vdev backed by one drive, which is a LUN from an all-flash NVMe disk array. Under heavy metadata-intensive I/O on this dataset, the following kernel panic is triggered:

Feb 22 18:16:20 ascratch-mds01 kernel: [16066.783270] list_del corruption, ffff9ed0cc31e028->next is LIST_POISON1 (dead000000000100)-
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.783679] kernel BUG at lib/list_debug.c:47!
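
For readers unfamiliar with this message: it comes from the kernel's list debugging checks (the BUG line points at lib/list_debug.c). When an entry is removed from a list, its `next` pointer is overwritten with `LIST_POISON1`, so a later attempt to remove the same entry is caught. The following is a minimal userspace sketch of that mechanism, not the kernel's actual code. The `LIST_POISON1` value is the one printed in the log above; the `LIST_POISON2` value, the return codes, and `demo_double_delete()` are assumptions made for illustration. A double removal is only one way to reach this check, and nothing in this report establishes that it is the cause here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

/* LIST_POISON1 matches the value in the log; LIST_POISON2 is an assumed placeholder. */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0xdead000000000100ULL)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0xdead000000000122ULL)

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }

/* Insert n right after head. */
static void list_add(struct list_head *n, struct list_head *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Returns 1 if the entry looks safe to unlink, 0 if corruption is detected. */
static int list_del_entry_valid(const struct list_head *e)
{
    if (e->next == LIST_POISON1) {
        fprintf(stderr, "list_del corruption, %p->next is LIST_POISON1\n",
                (const void *)e);
        return 0;
    }
    if (e->prev == LIST_POISON2)
        return 0;
    if (e->prev->next != e || e->next->prev != e)
        return 0;
    return 1;
}

/* Unlink e and poison its pointers; returns -1 where the kernel would BUG(). */
static int list_del(struct list_head *e)
{
    if (!list_del_entry_valid(e))
        return -1;
    e->next->prev = e->prev;
    e->prev->next = e->next;
    e->next = LIST_POISON1;
    e->prev = LIST_POISON2;
    return 0;
}

/* Hypothetical demo: delete the same entry twice; the second call is rejected. */
int demo_double_delete(void)
{
    struct list_head head, node;
    list_init(&head);
    list_add(&node, &head);
    int first = list_del(&node);   /* succeeds and poisons node */
    assert(first == 0);
    (void)first;
    return list_del(&node);        /* trips the LIST_POISON1 check */
}
```

Calling `demo_double_delete()` from a `main()` prints a message of the same shape as the kernel's and returns the sketch's error code instead of panicking.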

We upgraded ZFS from 2.1.0 to 2.1.3 (built from zfs-2.1.3-staging), but the problem still occurs randomly.

Describe how to reproduce the problem

The problem was originally triggered by ADF (a quantum chemistry HPC code) running against a Lustre MDT backed by ZFS.

Include any warning/errors/backtraces from the system logs

Feb 22 18:16:20 ascratch-mds01 kernel: [16066.783679] kernel BUG at lib/list_debug.c:47!
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.783815] invalid opcode: 0000 [#1] SMP NOPTI
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.783951] CPU: 22 PID: 4087 Comm: dp_sync_taskq Tainted: P          IOE    --------- -  - 4.18.0-348.2.1.el8_5.x86_64 #1
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.784296] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 7.99 03/11/2021
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.784520] RIP: 0010:__list_del_entry_valid.cold.1+0x12/0x4c
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.784692] Code: ca ff 0f 0b 48 89 c1 4c 89 c6 48 c7 c7 50 71 51 b2 e8 6c ba ca ff 0f 0b 48 89 fe 48 89 c2 48 c7 c7 e0 71 51 b2 e8 58 ba ca ff <0f> 0b 48 c7 c7 90 72 51 b2 e8 4a ba ca ff 0f 0b 48 89 f2 48 89 fe
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.785278] RSP: 0018:ffffaaf3e3ad7bc8 EFLAGS: 00010246
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.785441] RAX: 000000000000004e RBX: ffff9ed0cc31e000 RCX: 0000000000000000
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.785662] RDX: 0000000000000000 RSI: ffff9f2b7fc96818 RDI: ffff9f2b7fc96818
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.785885] RBP: ffff9ed01819ca80 R08: 0000000000000791 R09: 0000000000000000
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.786107] R10: 0000000000000000 R11: 0000022000000050 R12: ffff9ed01819ca80
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.786331] R13: ffff9ed0cc31e028 R14: ffff9ecd9e35b640 R15: ffff9f071cd16c00
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.786554] FS:  0000000000000000(0000) GS:ffff9f2b7fc80000(0000) knlGS:0000000000000000
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.786806] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.786986] CR2: 00007f4e6a4ebe80 CR3: 0000001db7e10001 CR4: 00000000007706e0
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.795043] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.803158] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.811206] PKRU: 55555554
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.819175] Call Trace:
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.827031]  dbuf_sync_leaf+0x38d/0x660 [zfs]
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.835029]  ? spa_taskq_dispatch_ent+0x64/0xb0 [zfs]
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.842774]  ? zio_taskq_dispatch+0x61/0xa0 [zfs]
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.850518]  ? zio_issue_async+0xe/0x20 [zfs]
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.858250]  dbuf_sync_list+0xcb/0x110 [zfs]
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.866074]  dnode_sync+0x3fb/0xa30 [zfs]
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.873841]  ? _cond_resched+0x15/0x30
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.881623]  sync_dnodes_task+0x71/0xa0 [zfs]
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.889378]  taskq_thread+0x2f2/0x540 [spl]
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.897151]  ? wake_up_q+0x80/0x80
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.904993]  ? taskq_thread_spawn+0x50/0x50 [spl]
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.912730]  kthread+0x116/0x130
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.920429]  ? kthread_flush_work_fn+0x10/0x10
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.928158]  ret_from_fork+0x1f/0x40
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.935950] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) ko2iblnd(OE) mgs(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) lnet(OE) netconsole libcfs(OE) rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_umad(OE) intel_rapl_msr intel_rapl_common isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass zfs(POE) rapl zunicode(POE) zzstd(OE) intel_cstate zlua(OE) zavl(POE) icp(POE) ipmi_si zcommon(POE) znvpair(POE) ipmi_devintf spl(OE) ipmi_msghandler wdat_wdt intel_uncore pcspkr ahci ses libahci enclosure mei_me scsi_transport_sas libata ioatdma joydev i2c_i801 mei lpc_ich dca acpi_power_meter acpi_cpufreq dm_service_time sd_mod sg qla2xxx nvme_fc nvme_fabrics nvme_core crct10dif_pclmul crc32_pclmul t10_pi crc32c_intel i40e ghash_clmulni_intel scsi_transport_fc megaraid_sas ib_ipoib(OE) ib_cm(OE) mlx5_ib(OE) mlx5_core(OE) mlxdevm(OE) ib_uverbs(OE) ib_core(OE) mlx_compat(OE) psample mlxfw tls pci_hyperv_intf dm_multipath sunrpc
Feb 22 18:16:20 ascratch-mds01 kernel: [16066.935982]  dm_mirror dm_region_hash dm_log dm_mod
Feb 22 18:16:20 ascratch-mds01 kernel: [16067.013612] ---[ end trace d390799aa3e6b823 ]---
Feb 22 18:16:20 ascratch-mds01 kernel: [16067.162579] RIP: 0010:__list_del_entry_valid.cold.1+0x12/0x4c
Feb 22 18:16:20 ascratch-mds01 kernel: [16067.171458] Code: ca ff 0f 0b 48 89 c1 4c 89 c6 48 c7 c7 50 71 51 b2 e8 6c ba ca ff 0f 0b 48 89 fe 48 89 c2 48 c7 c7 e0 71 51 b2 e8 58 ba ca ff <0f> 0b 48 c7 c7 90 72 51 b2 e8 4a ba ca ff 0f 0b 48 89 f2 48 89 fe
Feb 22 18:16:20 ascratch-mds01 kernel: [16067.189495] RSP: 0018:ffffaaf3e3ad7bc8 EFLAGS: 00010246
Feb 22 18:16:21 ascratch-mds01 kernel: [16067.198505] RAX: 000000000000004e RBX: ffff9ed0cc31e000 RCX: 0000000000000000
Feb 22 18:16:21 ascratch-mds01 kernel: [16067.207696] RDX: 0000000000000000 RSI: ffff9f2b7fc96818 RDI: ffff9f2b7fc96818
Feb 22 18:16:21 ascratch-mds01 kernel: [16067.216685] RBP: ffff9ed01819ca80 R08: 0000000000000791 R09: 0000000000000000
Feb 22 18:16:21 ascratch-mds01 kernel: [16067.225537] R10: 0000000000000000 R11: 0000022000000050 R12: ffff9ed01819ca80
Feb 22 18:16:21 ascratch-mds01 kernel: [16067.234369] R13: ffff9ed0cc31e028 R14: ffff9ecd9e35b640 R15: ffff9f071cd16c00
Feb 22 18:16:21 ascratch-mds01 kernel: [16067.243029] FS:  0000000000000000(0000) GS:ffff9f2b7fc80000(0000) knlGS:0000000000000000
Feb 22 18:16:21 ascratch-mds01 kernel: [16067.251665] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 22 18:16:21 ascratch-mds01 kernel: [16067.260191] CR2: 00007f4e6a4ebe80 CR3: 0000001db7e10001 CR4: 00000000007706e0
Feb 22 18:16:20 ascratch-mds01 kernel: list_del corruption, ffff9ed0cc31e028->next is LIST_POISON1 (dead000000000100)
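
A note on reading the register dump: R13 holds the address named in the corruption message (ffff9ed0cc31e028), and RBX (ffff9ed0cc31e000) differs from it only in the low byte. If RBX points at the start of the object that embeds the list node, which is an assumption the trace does not confirm, the node sits 0x28 bytes into that object. The helper name `node_offset` below is made up for this sketch of the arithmetic.

```c
#include <stdint.h>

/* Byte distance from an assumed object base address to the embedded list node. */
static uint64_t node_offset(uint64_t node_addr, uint64_t base_addr)
{
    return node_addr - base_addr;
}

/* Values copied from the register dump: R13 (node) and RBX (assumed base). */
static const uint64_t R13_NODE = 0xffff9ed0cc31e028ULL;
static const uint64_t RBX_BASE = 0xffff9ed0cc31e000ULL;
```

With a debug build of the module, an offset like `node_offset(R13_NODE, RBX_BASE)` could be compared against structure layouts to guess which list is involved, though that step is outside what this report shows.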

doma2203 commented May 9, 2022

We decided to close both issues and reformat the filesystem to the supported stack.

doma2203 closed this as completed May 9, 2022