Kernel crash when ZFS 2.2 features are enabled, but works for ZFS 2.0 #15984

Open
SystemKeeper opened this issue Mar 11, 2024 · 5 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

@SystemKeeper

System information

Type                   Version/Name
Distribution Name      Ubuntu
Distribution Version   22.04.4
Kernel Version         6.5.0-25-generic (HWE Kernel)
Architecture           x86_64
OpenZFS Version        zfs-2.1.5-1ubuntu6~22.04.2 / zfs-kmod-2.2.0-0ubuntu1~23.10.1

Describe the problem you're observing

We run multiple machines with LXD (5.20) as ephemeral GitHub Actions runners, which results in a high rate of container creation and deletion. The containers run on a ZFS filesystem that was created by LXD. After setting up another machine, we noticed that it crashed after about 16 hours of use.
After comparing the machines (all are set up identically, or should be), we noticed that on the working machines the ZFS dataset was created before ZFS 2.2 (I guess it was 2.0 or 2.1), while on the newest machine it was created with ZFS 2.2.

After this discovery we destroyed the ZFS pool and recreated it like this:

truncate -s 512G /var/snap/lxd/common/lxd/disks/default_legacy.img
zpool create -m none -O compression=on -o compatibility=openzfs-2.0-linux default_legacy /var/snap/lxd/common/lxd/disks/default_legacy.img
zpool set autotrim=on default_legacy

Since that change, the server has run without issue.

The feature differences between the non-working and working pools were these (the lines below are from the non-working pool):

default  unsupported@org.openzfs:zilsaxattr          readonly                                    local
default  unsupported@com.delphix:head_errlog         readonly                                    local
default  unsupported@org.openzfs:blake3              inactive                                    local
default  unsupported@com.fudosecurity:block_cloning  readonly                                    local
default  unsupported@com.klarasystems:vdev_zaps_v2   readonly                                    local
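
A comparison like the one above can be scripted. This is a minimal sketch, assuming the output of `zpool get all <pool>` has been saved to a file for each pool. The function names and file names are hypothetical. Properties whose value is `disabled` are skipped, so only features that are present on one pool and absent on the other are reported.

```shell
#!/bin/sh
# active_features FILE: print the sorted names of pool features found in
# saved `zpool get all <pool>` output, ignoring features that are "disabled".
# Both "feature@..." and "unsupported@..." property names are recognised.
active_features() {
    awk '$2 ~ /^(feature|unsupported)@/ && $3 != "disabled" {
             sub(/^[^@]*@/, "", $2)   # strip the "feature@" / "unsupported@" prefix
             print $2
         }' "$1" | sort -u
}

# features_only_in A B: print features that are active in A but not in B.
features_only_in() {
    a=$(mktemp) && b=$(mktemp) || return 1
    active_features "$1" > "$a"
    active_features "$2" > "$b"
    comm -23 "$a" "$b"
    rm -f "$a" "$b"
}

# Usage (file names are placeholders):
#   zpool get all default        > default.props
#   zpool get all default_legacy > legacy.props
#   features_only_in default.props legacy.props
```

Comparing saved files rather than live pools keeps the check usable when the two pools live on different machines.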

In the syslog I found this page fault and NULL pointer dereference:

Mar  2 01:45:06 garm3 kernel: [40773.506304] BUG: unable to handle page fault for address: ffffadd4e4636000
Mar  2 01:45:06 garm3 kernel: [40773.506603] #PF: supervisor write access in kernel mode
Mar  2 01:45:06 garm3 kernel: [40773.506838] #PF: error_code(0x0002) - not-present page
Mar  2 01:45:06 garm3 kernel: [40773.507064] PGD 100000067 P4D 100000067 PUD 7ef220067 PMD a39af3067 PTE 0
Mar  2 01:45:06 garm3 kernel: [40773.507292] Oops: 0002 [#1] PREEMPT SMP NOPTI
Mar  2 01:45:06 garm3 kernel: [40773.507516] CPU: 18 PID: 2064375 Comm: fuse-overlayfs Tainted: P           O       6.5.0-21-generic #21~22.04.1-Ubuntu
Mar  2 01:45:06 garm3 kernel: [40773.507737] Hardware name: ASUS System Product Name/Pro WS 565-ACE, BIOS 9901 10/13/2022
Mar  2 01:45:06 garm3 kernel: [40773.507944] RIP: 0010:memcpy+0x8/0x10
Mar  2 01:45:06 garm3 kernel: [40773.508153] Code: 09 c2 48 89 d0 49 f7 e1 49 01 d0 eb c8 cc cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 d1 <f3> a4 e9 31 81 01 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
Mar  2 01:45:06 garm3 kernel: [40773.508587] RSP: 0018:ffffadd4052239c0 EFLAGS: 00010286
Mar  2 01:45:06 garm3 kernel: [40773.508809] RAX: ffffadd4e4627490 RBX: ffff955018c42e38 RCX: 00000000000113d8
Mar  2 01:45:06 garm3 kernel: [40773.509031] RDX: 000000000001ff48 RSI: ffffadd4ec244bb8 RDI: ffffadd4e4636000
Mar  2 01:45:06 garm3 kernel: [40773.509254] RBP: ffffadd405223a00 R08: 0000000000000000 R09: 0000000000000000
Mar  2 01:45:06 garm3 kernel: [40773.509476] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9566547ea000
Mar  2 01:45:06 garm3 kernel: [40773.509699] R13: 000000000001ff48 R14: ffffadd4e4627490 R15: ffffadd4ec236000
Mar  2 01:45:06 garm3 kernel: [40773.509925] FS:  00007f8ec10de740(0000) GS:ffff956b6ee80000(0000) knlGS:0000000000000000
Mar  2 01:45:06 garm3 kernel: [40773.510152] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar  2 01:45:06 garm3 kernel: [40773.510376] CR2: ffffadd4e4636000 CR3: 0000000eaa2dc000 CR4: 0000000000750ee0
Mar  2 01:45:06 garm3 kernel: [40773.510611] PKRU: 55555554
Mar  2 01:45:06 garm3 kernel: [40773.510836] Call Trace:
Mar  2 01:45:06 garm3 kernel: [40773.511057]  <TASK>
Mar  2 01:45:06 garm3 kernel: [40773.511278]  ? show_regs+0x6d/0x80
Mar  2 01:45:06 garm3 kernel: [40773.511498]  ? __die+0x24/0x80
Mar  2 01:45:06 garm3 kernel: [40773.511716]  ? page_fault_oops+0x99/0x1b0
Mar  2 01:45:06 garm3 kernel: [40773.511935]  ? kernelmode_fixup_or_oops+0xb2/0x140
Mar  2 01:45:06 garm3 kernel: [40773.512152]  ? __bad_area_nosemaphore+0x1a5/0x2c0
Mar  2 01:45:06 garm3 kernel: [40773.512368]  ? srso_alias_return_thunk+0x5/0x7f
Mar  2 01:45:06 garm3 kernel: [40773.512583]  ? __wake_up_common_lock+0x8b/0xd0
Mar  2 01:45:06 garm3 kernel: [40773.512797]  ? bad_area_nosemaphore+0x16/0x30
Mar  2 01:45:06 garm3 kernel: [40773.513007]  ? do_kern_addr_fault+0x7b/0xa0
Mar  2 01:45:06 garm3 kernel: [40773.513217]  ? exc_page_fault+0x10d/0x1b0
Mar  2 01:45:06 garm3 kernel: [40773.513429]  ? asm_exc_page_fault+0x27/0x30
Mar  2 01:45:06 garm3 kernel: [40773.513642]  ? memcpy+0x8/0x10
Mar  2 01:45:06 garm3 kernel: [40773.513855]  ? zil_lwb_commit+0x5f/0x340 [zfs]
Mar  2 01:45:06 garm3 kernel: [40773.514241]  zil_lwb_write_issue+0x7d/0x920 [zfs]
Mar  2 01:45:06 garm3 kernel: [40773.514570]  zil_commit_writer+0x91/0x140 [zfs]
Mar  2 01:45:06 garm3 kernel: [40773.514888]  zil_commit_impl+0x64/0xf0 [zfs]
Mar  2 01:45:06 garm3 kernel: [40773.515204]  zil_commit+0x3d/0x80 [zfs]
Mar  2 01:45:06 garm3 kernel: [40773.515515]  zfs_write+0xaad/0xca0 [zfs]
Mar  2 01:45:06 garm3 kernel: [40773.515839]  zpl_iter_write+0x118/0x160 [zfs]
Mar  2 01:45:06 garm3 kernel: [40773.516148]  vfs_write+0x254/0x440
Mar  2 01:45:06 garm3 kernel: [40773.516354]  __x64_sys_pwrite64+0xa6/0xd0
Mar  2 01:45:06 garm3 kernel: [40773.516558]  do_syscall_64+0x5b/0x90
Mar  2 01:45:06 garm3 kernel: [40773.516762]  ? srso_alias_return_thunk+0x5/0x7f
Mar  2 01:45:06 garm3 kernel: [40773.516963]  ? do_syscall_64+0x67/0x90
Mar  2 01:45:06 garm3 kernel: [40773.517164]  ? srso_alias_return_thunk+0x5/0x7f
Mar  2 01:45:06 garm3 kernel: [40773.517367]  ? syscall_exit_to_user_mode+0x37/0x60
Mar  2 01:45:06 garm3 kernel: [40773.517566]  ? srso_alias_return_thunk+0x5/0x7f
Mar  2 01:45:06 garm3 kernel: [40773.517766]  ? do_syscall_64+0x67/0x90
Mar  2 01:45:06 garm3 kernel: [40773.517965]  ? srso_alias_return_thunk+0x5/0x7f
Mar  2 01:45:06 garm3 kernel: [40773.518158]  ? exit_to_user_mode_prepare+0x9b/0xb0
Mar  2 01:45:06 garm3 kernel: [40773.518350]  ? srso_alias_return_thunk+0x5/0x7f
Mar  2 01:45:06 garm3 kernel: [40773.518539]  ? syscall_exit_to_user_mode+0x37/0x60
Mar  2 01:45:06 garm3 kernel: [40773.518726]  ? srso_alias_return_thunk+0x5/0x7f
Mar  2 01:45:06 garm3 kernel: [40773.518911]  ? do_syscall_64+0x67/0x90
Mar  2 01:45:06 garm3 kernel: [40773.519087]  ? do_syscall_64+0x67/0x90
Mar  2 01:45:06 garm3 kernel: [40773.519254]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Mar  2 01:45:06 garm3 kernel: [40773.519419] RIP: 0033:0x7f8ec11f39aa
Mar  2 01:45:06 garm3 kernel: [40773.519601] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb ba 0f 1f 00 f3 0f 1e fa 49 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 12 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5e c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
Mar  2 01:45:06 garm3 kernel: [40773.519946] RSP: 002b:00007ffcd4d0f148 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
Mar  2 01:45:06 garm3 kernel: [40773.520126] RAX: ffffffffffffffda RBX: 0000000000002000 RCX: 00007f8ec11f39aa
Mar  2 01:45:06 garm3 kernel: [40773.520306] RDX: 0000000000002000 RSI: 00007f8ec0fdc060 RDI: 00000000000002e0
Mar  2 01:45:06 garm3 kernel: [40773.520485] RBP: 0000000000000000 R08: 0000000000002000 R09: 000000000000000e
Mar  2 01:45:06 garm3 kernel: [40773.520661] R10: 0000000000002000 R11: 0000000000000246 R12: 00007ffcd4d0f3d8
Mar  2 01:45:06 garm3 kernel: [40773.520835] R13: 0000000000000000 R14: 00007ffcd4d0f278 R15: 0000000000000000
Mar  2 01:45:06 garm3 kernel: [40773.521012]  </TASK>
Mar  2 01:45:06 garm3 kernel: [40773.521184] Modules linked in: tls unix_diag overlay nf_conntrack_netlink xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE xfrm_user xfrm_algo xt_addrtype nft_compat veth nft_masq nft_chain_nat bridge stp llc zfs(PO) spl(O) ebtable_filter ebtables ip6table_raw ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_raw iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter nf_tables nfnetlink vhost_vsock vmw_vsock_virtio_transport_common vhost vhost_iotlb vsock binfmt_misc intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm irqbypass rapl eeepc_wmi wmi_bmof ccp k10temp mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear raid1 crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ast ghash_clmulni_intel mfd_aaeon drm_shmem_helper asus_wmi aesni_intel
Mar  2 01:45:06 garm3 kernel: [40773.521263]  video drm_kms_helper ledtrig_audio sparse_keymap crypto_simd platform_profile nvme cryptd drm igb nvme_core i2c_piix4 ahci dca nvme_common i2c_algo_bit xhci_pci libahci xhci_pci_renesas wmi gpio_amdpt
Mar  2 01:45:06 garm3 kernel: [40773.523585] CR2: ffffadd4e4636000
Mar  2 01:45:06 garm3 kernel: [40773.523820] ---[ end trace 0000000000000000 ]---
Mar  2 01:45:06 garm3 kernel: [40773.524057] RIP: 0010:memcpy+0x8/0x10
Mar  2 01:45:06 garm3 kernel: [40773.524296] Code: 09 c2 48 89 d0 49 f7 e1 49 01 d0 eb c8 cc cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 d1 <f3> a4 e9 31 81 01 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
Mar  2 01:45:06 garm3 kernel: [40773.524801] RSP: 0018:ffffadd4052239c0 EFLAGS: 00010286
Mar  2 01:45:06 garm3 kernel: [40773.525060] RAX: ffffadd4e4627490 RBX: ffff955018c42e38 RCX: 00000000000113d8
Mar  2 01:45:06 garm3 kernel: [40773.525324] RDX: 000000000001ff48 RSI: ffffadd4ec244bb8 RDI: ffffadd4e4636000
Mar  2 01:45:06 garm3 kernel: [40773.525587] RBP: ffffadd405223a00 R08: 0000000000000000 R09: 0000000000000000
Mar  2 01:45:06 garm3 kernel: [40773.525852] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9566547ea000
Mar  2 01:45:06 garm3 kernel: [40773.526122] R13: 000000000001ff48 R14: ffffadd4e4627490 R15: ffffadd4ec236000
Mar  2 01:45:06 garm3 kernel: [40773.526390] FS:  00007f8ec10de740(0000) GS:ffff956b6ee80000(0000) knlGS:0000000000000000
Mar  2 01:45:06 garm3 kernel: [40773.526660] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar  2 01:45:06 garm3 kernel: [40773.526931] CR2: ffffadd4e4636000 CR3: 0000000eaa2dc000 CR4: 0000000000750ee0
Mar  2 01:45:06 garm3 kernel: [40773.527205] PKRU: 55555554
Mar  2 01:45:06 garm3 kernel: [40773.527479] note: fuse-overlayfs[2064375] exited with irqs disabled


Mar  2 01:45:08 garm3 kernel: [40775.600880] overlayfs: fs on '/home/runner/.local/share/containers/storage/overlay/compat2342864860/lower1' does not support file handles, falling back to xino=off.
Mar  2 01:45:09 garm3 kernel: [40775.919960] BUG: kernel NULL pointer dereference, address: 0000000000000020
Mar  2 01:45:09 garm3 kernel: [40775.920527] #PF: supervisor read access in kernel mode
Mar  2 01:45:09 garm3 kernel: [40775.921066] #PF: error_code(0x0000) - not-present page
Mar  2 01:45:09 garm3 kernel: [40775.921445] PGD 0 P4D 0
Mar  2 01:45:09 garm3 kernel: [40775.921798] Oops: 0000 [#2] PREEMPT SMP NOPTI
Mar  2 01:45:09 garm3 kernel: [40775.922173] CPU: 17 PID: 2038119 Comm: z_wr_int_h Tainted: P      D    O       6.5.0-21-generic #21~22.04.1-Ubuntu
Mar  2 01:45:09 garm3 kernel: [40775.922540] Hardware name: ASUS System Product Name/Pro WS 565-ACE, BIOS 9901 10/13/2022
Mar  2 01:45:09 garm3 kernel: [40775.922909] RIP: 0010:list_add+0x1/0x20 [spl]
Mar  2 01:45:09 garm3 kernel: [40775.923304] Code: 89 1c 24 5b 41 5c 41 5d 5d 31 c0 31 d2 31 f6 31 ff e9 53 24 c4 d9 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 <48> 8b 16 48 89 e5 e8 a4 ff ff ff 5d 31 d2 31 f6 31 ff e9 28 24 c4
Mar  2 01:45:09 garm3 kernel: [40775.924070] RSP: 0018:ffffadd4107afc18 EFLAGS: 00010046
Mar  2 01:45:09 garm3 kernel: [40775.924445] RAX: ffffadd4e4635000 RBX: 0000000000000000 RCX: 0000000000000000
Mar  2 01:45:09 garm3 kernel: [40775.924818] RDX: 0000000000000000 RSI: 0000000000000020 RDI: ffffadd4e4635018
Mar  2 01:45:09 garm3 kernel: [40775.925189] RBP: ffffadd4107afc40 R08: 0000000000000000 R09: 0000000000000000
Mar  2 01:45:09 garm3 kernel: [40775.925563] R10: 0000000000000000 R11: 0000000000000000 R12: ffff954cea463400
Mar  2 01:45:09 garm3 kernel: [40775.925957] R13: ffff954cc190b920 R14: ffff954cc190b800 R15: ffff954cc190b8b8
Mar  2 01:45:09 garm3 kernel: [40775.926329] FS:  0000000000000000(0000) GS:ffff956b6ee40000(0000) knlGS:0000000000000000
Mar  2 01:45:09 garm3 kernel: [40775.926706] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar  2 01:45:09 garm3 kernel: [40775.927079] CR2: 0000000000000020 CR3: 0000000e67d5e000 CR4: 0000000000750ee0
Mar  2 01:45:09 garm3 kernel: [40775.927452] PKRU: 55555554
Mar  2 01:45:09 garm3 kernel: [40775.927816] Call Trace:
Mar  2 01:45:09 garm3 kernel: [40775.928175]  <TASK>
Mar  2 01:45:09 garm3 kernel: [40775.928528]  ? show_regs+0x6d/0x80
Mar  2 01:45:09 garm3 kernel: [40775.928877]  ? __die+0x24/0x80
Mar  2 01:45:09 garm3 kernel: [40775.929219]  ? page_fault_oops+0x99/0x1b0
Mar  2 01:45:09 garm3 kernel: [40775.929560]  ? do_user_addr_fault+0x31d/0x6b0
Mar  2 01:45:09 garm3 kernel: [40775.929903]  ? exc_page_fault+0x83/0x1b0
Mar  2 01:45:09 garm3 kernel: [40775.930228]  ? asm_exc_page_fault+0x27/0x30
Mar  2 01:45:09 garm3 kernel: [40775.930549]  ? list_add+0x1/0x20 [spl]
Mar  2 01:45:09 garm3 kernel: [40775.930864]  ? list_add+0xc/0x20 [spl]
Mar  2 01:45:09 garm3 kernel: [40775.931172]  ? spl_cache_shrink+0x27/0xc0 [spl]
Mar  2 01:45:09 garm3 kernel: [40775.931481]  spl_cache_flush+0x66
Mar  4 13:59:10 garm3 multipathd[764]: --------start up--------
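
In kernel call traces like the ones above, frames prefixed with `?` are addresses the unwinder found on the stack but could not confirm as part of the call chain. This is a small sketch for pulling only the unmarked frames out of a saved syslog excerpt. The function name and the assumption that each line has the `kernel: [timestamp]` prefix shown above are mine.

```shell
#!/bin/sh
# reliable_frames FILE: print call-trace frames that are not marked with "?".
# Expects syslog lines of the form "... kernel: [12345.678901]  symbol+0x1/0x2 [mod]".
reliable_frames() {
    awk '/kernel: \[/ {
             # Drop everything up to and including the "[timestamp]" prefix.
             sub(/^.*kernel: \[ *[0-9.]+\] +/, "")
             # Keep lines whose first word looks like symbol+0xOFFSET.
             if ($1 != "?" && $1 ~ /^[A-Za-z_][A-Za-z0-9_.]*\+0x[0-9a-f]+/)
                 print
         }' "$1"
}

# Usage (file name is a placeholder); filter for ZFS frames only:
#   reliable_frames oops.log | grep '\[zfs\]'
```

Applied to the first oops above, this leaves the chain from `zil_lwb_write_issue` up through `zfs_write` and the syscall entry, which is the part worth quoting when reporting the bug downstream.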

The only other kernel output I was able to capture was this screenshot (I could not scroll back or capture it any other way, sorry):
[Image: Screenshot 2024-03-04 at 06 59 13]

I'm not sure whether this is helpful, since this is far from my expertise, but maybe it makes sense to someone here.

@SystemKeeper SystemKeeper added the Type: Defect Incorrect behavior (e.g. crash, hang) label Mar 11, 2024
@rincebrain
Contributor

Go yell at Ubuntu until they include #15634.

@Gendra13

> Go yell at Ubuntu until they include #15634.

Well, considering that their bundled kernel modules don't even include the dirty-dnode fix, you're gonna need to yell very loudly.

@SystemKeeper
Author

Thanks for the feedback! So I guess we can close this here and I start yelling at Ubuntu? 😅

@rincebrain
Contributor

> > Go yell at Ubuntu until they include #15634.
>
> Well, considering that their bundled kernel modules don't even include the dirty-dnode fix, you're gonna need to yell very loudly.

Last I knew, didn't they add that cherry-pick but just not cut a new kernel package for it?

> Thanks for the feedback! So I guess we can close this here and I start yelling at Ubuntu? 😅

I could be wrong, but this really looks like that bug, so yeah. You could try building 2.2.3 locally and running it to confirm that your problem goes away.

@clhedrick

clhedrick commented Mar 12, 2024

It's pretty easy to update the ZFS kernel modules. Check out the source, then run the following (the first command installs the tools needed to build):

sudo apt install build-essential autoconf automake libtool gawk alien fakeroot dkms libblkid-dev uuid-dev libudev-dev libssl-dev zlib1g-dev libaio-dev libattr1-dev libelf-dev linux-headers-generic python3 python3-dev python3-setuptools python3-cffi libffi-dev python3-packaging git libcurl4-openssl-dev debhelper-compat dh-python po-debconf python3-all-dev python3-sphinx parallel

git clone https://github.com/openzfs/zfs.git
cd ./zfs
git checkout zfs-2.2.3
sh autogen.sh
./configure
make -s -j$(nproc)

$(nproc) expands to the number of available processors; replace it with a smaller number if you want to give the build fewer.

When it's finished, copy

./module/zfs.ko
./module/spl.ko

to /lib/modules/6.5.0-25-generic/kernel/zfs/

Then reboot. This assumes you're updating from 2.2.x to 2.2.3. I haven't tried going from 2.1.x; I think the .ko files are different.
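
After the reboot it is worth confirming which module version is actually loaded, since the original report lists different userland and kmod versions. This is a minimal sketch under the assumption that version strings look like the ones in the report (for example `zfs-kmod-2.2.0-0ubuntu1~23.10.1`). The helper names are hypothetical.

```shell
#!/bin/sh
# upstream_version STRING: print the first X.Y.Z triple found in STRING,
# e.g. "zfs-kmod-2.2.0-0ubuntu1~23.10.1" gives "2.2.0".
upstream_version() {
    printf '%s\n' "$1" |
        sed -n 's/^[^0-9]*\([0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\).*$/\1/p'
}

# same_series A B: succeed if both strings share the same major.minor series.
same_series() {
    sa=$(upstream_version "$1" | cut -d. -f1,2)
    sb=$(upstream_version "$2" | cut -d. -f1,2)
    [ -n "$sa" ] && [ "$sa" = "$sb" ]
}

# Usage on a machine with the zfs module loaded:
#   loaded=$(cat /sys/module/zfs/version)
#   upstream_version "$loaded"            # should report 2.2.3 after the update
#   same_series "$loaded" "2.2.3" || echo "unexpected module series: $loaded"
```

Running `zfs version` prints both the userland and kernel-module versions, which can be fed to the same helpers.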
