PANIC at zfs_znode.c zfs_znode_sa_init() #10971

Closed
nerdcorenet opened this issue Sep 23, 2020 · 34 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@nerdcorenet

System information

Type: Linux VM
Distribution Name: Ubuntu
Distribution Version: 20.04.1 LTS (Focal Fossa)
Linux Kernel: 5.4.0-48-generic
Architecture: x86_64
ZFS Version: 0.8.3-1ubuntu12.4
SPL Version: 0.8.3-1ubuntu12.4

Describe the problem you're observing

A PANIC event is logged in dmesg

Describe how to reproduce the problem

Unsure

Include any warning/errors/backtraces from the system logs

VERIFY(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl)) failed
PANIC at zfs_znode.c:335:zfs_znode_sa_init()
Showing stack for process 300061
CPU: 0 PID: 300061 Comm: BackupPC_nightl Tainted: P      D    O      5.4.0-48-generic #52-Ubuntu
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
 dump_stack+0x6d/0x9a
 spl_dumpstack+0x29/0x2b [spl]
 spl_panic+0xd4/0x102 [spl]
 ? atomic_sub_return.constprop.0+0xd/0x20 [zfs]
 ? do_raw_spin_unlock+0x9/0x10 [zfs]
 ? __raw_spin_unlock+0x9/0x10 [zfs]
 ? dmu_buf_replace_user+0x75/0xa0 [zfs]
 ? dmu_buf_set_user+0x13/0x20 [zfs]
 ? dmu_buf_set_user_ie+0x15/0x20 [zfs]
 zfs_znode_sa_init.isra.0+0xde/0xf0 [zfs]
 zfs_znode_alloc+0x102/0x570 [zfs]
 ? atomic_cmpxchg+0x16/0x30 [zfs]
 ? _cond_resched+0x19/0x30
 ? do_raw_spin_unlock+0x9/0x10 [zfs]
 ? __raw_spin_unlock+0x9/0x10 [zfs]
 ? aggsum_add+0xca/0xe0 [zfs]
 ? atomic_sub_return.constprop.0+0xd/0x20 [zfs]
 ? do_raw_spin_unlock+0x9/0x10 [zfs]
 ? __raw_spin_unlock+0x9/0x10 [zfs]
 ? dbuf_read_impl+0x426/0x610 [zfs]
 ? atomic_sub_return.constprop.0+0xd/0x20 [zfs]
 ? atomic64_add+0x12/0x20 [zfs]
 ? dbuf_read+0x102/0x5f0 [zfs]
 ? arc_space_consume+0x54/0xe0 [zfs]
 ? do_raw_spin_unlock+0x9/0x10 [zfs]
 ? do_raw_spin_unlock+0x9/0x10 [zfs]
 ? __raw_spin_unlock+0x9/0x10 [zfs]
 ? dnode_rele_and_unlock+0x68/0xc0 [zfs]
 ? atomic_cmpxchg+0x16/0x30 [zfs]
 ? do_raw_spin_unlock+0x9/0x10 [zfs]
 ? RW_WRITE_HELD+0xd/0x30 [zfs]
 ? atomic_sub_return.constprop.0+0xd/0x20 [zfs]
 ? atomic_dec_return+0x9/0x10 [zfs]
 zfs_zget+0x24a/0x290 [zfs]
 zfs_dirent_lock+0x41a/0x5a0 [zfs]
 zfs_dirlook+0x90/0x2b0 [zfs]
 zfs_lookup+0x202/0x3b0 [zfs]
 zpl_lookup+0x94/0x210 [zfs]
 ? __switch_to_asm+0x40/0x70
 ? __switch_to_asm+0x34/0x70
 ? __switch_to_asm+0x40/0x70
 ? __switch_to_asm+0x34/0x70
 __lookup_slow+0x92/0x160
 lookup_slow+0x3b/0x60
 walk_component+0x1da/0x360
 ? link_path_walk.part.0+0x6d/0x550
 path_lookupat.isra.0+0x80/0x230
 ? kmem_cache_free+0x288/0x2b0
 filename_lookup+0xae/0x170
 ? __check_object_size+0x13f/0x150
 ? strncpy_from_user+0x4c/0x150
 user_path_at_empty+0x3a/0x50
 vfs_statx+0x7d/0xe0
 __do_sys_newlstat+0x3e/0x80
 __x64_sys_newlstat+0x16/0x20
 do_syscall_64+0x57/0x190
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fef462e755a
Code: ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e fa 41 89 f8 48 89 f7 48 89 d6 41 83 f8 01 77 2d b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 06 c3 0f 1f 44 00 00 48 8b 15 01 a9 0d 00 f7
RSP: 002b:00007ffc2c409ad8 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
RAX: ffffffffffffffda RBX: 0000561fac283270 RCX: 00007fef462e755a
RDX: 0000561fab6d84b8 RSI: 0000561fab6d84b8 RDI: 0000561fabe924f0
RBP: 0000561fab6d82a0 R08: 0000000000000001 R09: aaaaaaaaaaaaaaab
R10: 0000561fac2df8d0 R11: 0000000000000246 R12: 0000561fac2df8d8
R13: 0000561fabe924f0 R14: 0000561fa980559e R15: 0000000000000000
@nerdcorenet nerdcorenet added Status: Triage Needed New issue which needs to be triaged Type: Defect Incorrect behavior (e.g. crash, hang) labels Sep 23, 2020
@Lakr233

Lakr233 commented Oct 22, 2020

This issue is also happening to me. Any process accessing the "trigger file" hangs there forever.

Message from syslogd@qaq-server at Oct 22 20:48:15 ...
 kernel:[  762.892839] VERIFY(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl)) failed

Message from syslogd@qaq-server at Oct 22 20:48:15 ...
 kernel:[  762.892844] PANIC at zfs_znode.c:335:zfs_znode_sa_init()

Additionally, here is my system spec:

zfs-0.8.3-1ubuntu12.4
zfs-kmod-0.8.3-1ubuntu12.4
Linux qaq-server 5.4.0-52-generic #57-Ubuntu SMP Thu Oct 15 10:57:00 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

@behlendorf behlendorf removed the Status: Triage Needed New issue which needs to be triaged label Nov 3, 2020
@lathiat

lathiat commented Dec 2, 2020

I just started hitting this on Ubuntu Hirsute (development release) in the last couple of days for some unclear reason. The stacks all show code related to SA, and for whatever reason it was happening with multiple Chrome/Electron apps trying to access their "Cache" dir specifically - but different instances of the cache dir in different paths (e.g. ~/.cache/google-chrome/Default/Cache and ~/.config/Mattermost/Cache). Those processes stay hung forever and I can't strace/gdb them or even ls that same directory while the task is stuck, presumably due to a lock or similar.

I had zfs-dkms installed; I removed that and went back to the version built with the kernel in Ubuntu, and it's working OK, but that version is 0.8.4-1ubuntu11 whereas zfs-dkms was 0.8.4-1ubuntu16. They added quite a lot of patches in "ubuntu13" for Linux 5.9 compatibility as part of https://bugs.launchpad.net/bugs/1899826. However, given the other reporters were on stable versions, it seems more likely they are seeing the same effect but possibly from a different cause.

Just reverting to the 0.8.4-1ubuntu11 code resolved it for me. I will try installing zfs-dkms of the same version to see if it happens there, in case it's some quirk of the DKMS build versus the build that happens in the Ubuntu kernel packages.

Happy to try to debug if anyone has suggestions on what to look at. I'm a reasonably competent programmer and debugger, and very familiar with ZFS from an admin perspective and various internals, but not super familiar with the code-base as a whole. I can also try the native (upstream) version and see whether it hits there or whether it's specific to the Ubuntu patches.

Also opened here: https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/1906476

Note: I'm not really expecting support for the Ubuntu-patched version here; rather, this is the only Google hit for that error, so I wanted to contribute information in case it helps others, and I'm happy to help debug if that also helps.

Dec  2 12:36:42 optane kernel: [   72.857033] VERIFY(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl)) failed
Dec  2 12:36:42 optane kernel: [   72.857036] PANIC at zfs_znode.c:335:zfs_znode_sa_init()
Dec  2 12:36:42 optane kernel: [   72.857037] Showing stack for process 19744
Dec  2 12:36:42 optane kernel: [   72.857038] CPU: 3 PID: 19744 Comm: ThreadPoolForeg Tainted: P           OE     5.8.18-acso #1
Dec  2 12:36:42 optane kernel: [   72.857039] Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1 WIFI-BK/Z97X-Gaming G1 WIFI-BK, BIOS F8 09/19/2015
Dec  2 12:36:42 optane kernel: [   72.857039] Call Trace:
Dec  2 12:36:42 optane kernel: [   72.857044]  dump_stack+0x74/0x95
Dec  2 12:36:42 optane kernel: [   72.857053]  spl_dumpstack+0x29/0x2b [spl]
Dec  2 12:36:42 optane kernel: [   72.857057]  spl_panic+0xd4/0xfc [spl]
Dec  2 12:36:42 optane kernel: [   72.857101]  ? sa_cache_constructor+0x27/0x50 [zfs]
Dec  2 12:36:42 optane kernel: [   72.857103]  ? _cond_resched+0x19/0x40
Dec  2 12:36:42 optane kernel: [   72.857105]  ? mutex_lock+0x12/0x40
Dec  2 12:36:42 optane kernel: [   72.857129]  ? dmu_buf_set_user_ie+0x54/0x80 [zfs]
Dec  2 12:36:42 optane kernel: [   72.857167]  zfs_znode_sa_init+0xe0/0xf0 [zfs]
Dec  2 12:36:42 optane kernel: [   72.857205]  zfs_znode_alloc+0x101/0x700 [zfs]
Dec  2 12:36:42 optane kernel: [   72.857229]  ? arc_buf_fill+0x270/0xd30 [zfs]
Dec  2 12:36:42 optane kernel: [   72.857232]  ? __cv_init+0x42/0x60 [spl]
Dec  2 12:36:42 optane kernel: [   72.857260]  ? dnode_cons+0x28f/0x2a0 [zfs]
Dec  2 12:36:42 optane kernel: [   72.857262]  ? _cond_resched+0x19/0x40
Dec  2 12:36:42 optane kernel: [   72.857263]  ? _cond_resched+0x19/0x40
Dec  2 12:36:42 optane kernel: [   72.857264]  ? mutex_lock+0x12/0x40
Dec  2 12:36:42 optane kernel: [   72.857288]  ? aggsum_add+0x153/0x170 [zfs]
Dec  2 12:36:42 optane kernel: [   72.857292]  ? spl_kmem_alloc_impl+0xd8/0x110 [spl]
Dec  2 12:36:42 optane kernel: [   72.857316]  ? arc_space_consume+0x54/0xe0 [zfs]
Dec  2 12:36:42 optane kernel: [   72.857341]  ? dbuf_read+0x4a0/0xb50 [zfs]
Dec  2 12:36:42 optane kernel: [   72.857342]  ? _cond_resched+0x19/0x40
Dec  2 12:36:42 optane kernel: [   72.857343]  ? mutex_lock+0x12/0x40
Dec  2 12:36:42 optane kernel: [   72.857372]  ? dnode_rele_and_unlock+0x5a/0xc0 [zfs]
Dec  2 12:36:42 optane kernel: [   72.857373]  ? _cond_resched+0x19/0x40
Dec  2 12:36:42 optane kernel: [   72.857374]  ? mutex_lock+0x12/0x40
Dec  2 12:36:42 optane kernel: [   72.857400]  ? dmu_object_info_from_dnode+0x84/0xb0 [zfs]
Dec  2 12:36:42 optane kernel: [   72.857433]  zfs_zget+0x1c3/0x270 [zfs]
Dec  2 12:36:42 optane kernel: [   72.857457]  ? dmu_buf_rele+0x3a/0x40 [zfs]
Dec  2 12:36:42 optane kernel: [   72.857493]  zfs_dirent_lock+0x349/0x680 [zfs]
Dec  2 12:36:42 optane kernel: [   72.857530]  zfs_dirlook+0x90/0x2a0 [zfs]
Dec  2 12:36:42 optane kernel: [   72.857566]  ? zfs_zaccess+0x10c/0x480 [zfs]
Dec  2 12:36:42 optane kernel: [   72.857600]  zfs_lookup+0x202/0x3b0 [zfs]
Dec  2 12:36:42 optane kernel: [   72.857635]  zpl_lookup+0xca/0x1e0 [zfs]
Dec  2 12:36:42 optane kernel: [   72.857639]  path_openat+0x6a2/0xfe0
Dec  2 12:36:42 optane kernel: [   72.857641]  do_filp_open+0x9b/0x110
Dec  2 12:36:42 optane kernel: [   72.857645]  ? __check_object_size+0xdb/0x1b0
Dec  2 12:36:42 optane kernel: [   72.857647]  ? __alloc_fd+0x46/0x170
Dec  2 12:36:42 optane kernel: [   72.857649]  do_sys_openat2+0x217/0x2d0
Dec  2 12:36:42 optane kernel: [   72.857650]  ? do_sys_openat2+0x217/0x2d0
Dec  2 12:36:42 optane kernel: [   72.857651]  do_sys_open+0x59/0x80
Dec  2 12:36:42 optane kernel: [   72.857652]  __x64_sys_openat+0x20/0x30
Dec  2 12:36:42 optane kernel: [   72.857654]  do_syscall_64+0x48/0xc0
Dec  2 12:36:42 optane kernel: [   72.857656]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Dec  2 12:36:42 optane kernel: [   72.857657] RIP: 0033:0x7f9e3e7f62b4
Dec  2 12:36:42 optane kernel: [   72.857659] Code: 24 20 eb 8f 66 90 44 89 54 24 0c e8 b6 f4 ff ff 44 8b 54 24 0c 44 89 e2 48 89 ee 41 89 c0 bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 34 44 89 c7 89 44 24 0c e8 08 f5 ff ff 8b 44
Dec  2 12:36:42 optane kernel: [   72.857659] RSP: 002b:00007f9e2a84aa10 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
Dec  2 12:36:42 optane kernel: [   72.857661] RAX: ffffffffffffffda RBX: 00007f9e2a84b070 RCX: 00007f9e3e7f62b4
Dec  2 12:36:42 optane kernel: [   72.857661] RDX: 0000000000000002 RSI: 0000239c0c6ddf00 RDI: 00000000ffffff9c
Dec  2 12:36:42 optane kernel: [   72.857662] RBP: 0000239c0c6ddf00 R08: 0000000000000000 R09: 00007ffc92524080
Dec  2 12:36:42 optane kernel: [   72.857662] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000002
Dec  2 12:36:42 optane kernel: [   72.857663] R13: 00007f9e2a84b070 R14: 0000239c0d73c5c0 R15: 0000000000008061
Dec  2 12:36:42 optane kernel: [   72.858063] VERIFY(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl)) failed


@migrax

migrax commented Dec 22, 2020

I have found the same problem. Going back to 0.8.4-1ubuntu11 fixes it for new files. If I move the old Chrome cache to a different name, the problem disappears, but if I try to remove it, list it, etc., there is some persistent corruption in the filesystem that triggers the panic.

@lathiat

lathiat commented Jan 15, 2021

I hit this problem again today, but now without zfs-dkms. After upgrading my kernel from 5.8.0-29-generic to 5.8.0-36-generic, my Google Chrome Cache directory is broken again; I had to rename it and then reboot to get out of the problem.

Curiously, I found a similar report from 2016(??) here: https://bbs.archlinux.org/viewtopic.php?id=217204

The renamed directories still exist, if any developers have ideas about anything I can do to try to debug or understand the issue.

@tilgovi

tilgovi commented Jan 18, 2021

Having a similar problem. Same traceback, different files. Just started with the Ubuntu 5.8.0-36 kernel. Unfortunately, booting the old kernel doesn't seem to make the existing files accessible, either. I'm a bit worried and would love to help find the root cause and make sure I don't lose more data here.

@tilgovi

tilgovi commented Jan 18, 2021

@migrax when you say that rolling back "fixes it for new files", do you have a reliable way to reproduce this? I only found that this problem occurred with some files, but could not figure out which ones or why.

@lathiat

lathiat commented Jan 18, 2021

I had the same thing: at a certain package version the problem started happening. If you roll back to a kernel/package without the issue, existing files are still broken, but it stops creating new broken files. That's my experience too.

From my naive attempt to read through the code, I think something is getting corrupted on disk that then causes the PANIC() when trying to read a file. Once that panic happens, a lock is left held that stops other access to that file and, I suspect, maybe some other unrelated files that share some resource. If you reboot, sometimes files that seemed broken are accessible again, but the main problem file is still broken, and once you try to access that file it gets stuck on a lock that then blocks access to other things. But I might be wrong about it blocking access to other things.

In the kernel trace you first see this PANIC():
VERIFY(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl)) failed
PANIC at zfs_znode.c:335:zfs_znode_sa_init()

And then some hung task reports later.

@lathiat

lathiat commented Jan 18, 2021

linux-image-5.8.0-29-generic: working
linux-image-5.8.0-36-generic: broken

When the issue first hit, I had zfs-dkms installed; I removed that and went back to the version built with the kernel in Ubuntu, and it was working OK. That version was 0.8.4-1ubuntu11 whereas zfs-dkms was 0.8.4-1ubuntu16.

The problem has now repeated, as the 5.8.0-36-generic kernel has now picked up 0.8.4-1ubuntu16:

lathiat@optane ~/src/zfs[zfs-2.0-release]$ sudo modinfo /lib/modules/5.8.0-29-generic/kernel/zfs/zfs.ko|grep version
version: 0.8.4-1ubuntu11
srcversion: 75AFF98E9A918357B9D8C8D

lathiat@optane ~/src/zfs[zfs-2.0-release]$ sudo modinfo /lib/modules/5.8.0-36-generic/kernel/zfs/zfs.ko|grep version
version: 0.8.4-1ubuntu16
srcversion: 8A5C7E4F91E160085378C8C
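
For anyone comparing several installed kernels at once, the same modinfo check can be looped over every module on disk (a small sketch, using the same .ko path as in the commands above):

for ko in /lib/modules/*/kernel/zfs/zfs.ko; do
    printf '%s: ' "$ko"                           # which kernel's module this is
    modinfo "$ko" | awk '/^version:/ {print $2}'  # the zfs version bundled with it
done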

I don't have a good quick/easy reproducer, but just using my desktop for a day or two means I am likely to hit the issue after a while.

I tried to install the upstream zfs-dkms package for 2.0 to see if I can bisect the issue on upstream versions, but it breaks my boot for some reason I cannot quite figure out. I will continue to experiment and see if I can bisect which version broke it. Looking at the Ubuntu changelog, I'd say the fix for https://bugs.launchpad.net/bugs/1899826 to backport the 5.9 and 5.10 compatibility patches is a prime suspect. I'll copy this info to the Ubuntu Launchpad bug and see if I can chase someone internally at Canonical to pick it up if I don't have enough time to continue the debugging.

Side note: I sort of know what I'm doing in that I'm a Linux software engineer, dabble in kernel stuff, and am a very long-time, deeply knowledgeable ZFS user at a user-space level, but my code-level knowledge of ZFS is very basic, so don't mistake any confidence for actually having real knowledge :)

@tilgovi

tilgovi commented Jan 18, 2021

I ran into this on the 0.8.4-1ubuntu16 packaged with the 5.8.0-36 kernel. I was able to use my zsys snapshots to get back to a good state from before I upgraded.

> Side note: I sort of know what I'm doing in that I'm a Linux software engineer, dabble in kernel stuff, and am a very long-time, deeply knowledgeable ZFS user at a user-space level, but my code-level knowledge of ZFS is very basic, so don't mistake any confidence for actually having real knowledge :)

Not too different here :). The significant changes came in 0.8.4-1ubuntu13.

https://git.launchpad.net/ubuntu/+source/zfs-linux/commit/?h=import/0.8.4-1ubuntu13&id=d2d7f811767bd92eca03244b38fe2a54b321d867

@lathiat

lathiat commented Jan 18, 2021

zfs-2.0.1 is in hirsute-proposed so I am going to try that. There's a reasonable chance it will have fixed it, since those patches are probably dropped.

@tilgovi

tilgovi commented Jan 18, 2021

Yeah, all those patches were dropped, which means the issue is either fixed or exists upstream.

❯ git diff --summary applied/0.8.4-1ubuntu13 applied/2.0.1-1ubuntu1 debian/patches
 delete mode 100644 debian/patches/4000-mount-encrypted-dataset-fix.patch
 delete mode 100644 debian/patches/4520-Linux-5.8-compat-__vmalloc.patch
 delete mode 100644 debian/patches/4521-enable-risc-v-isa.patch
 delete mode 100644 debian/patches/4700-Fix-DKMS-build-on-arm64-with-PREEMPTION-and-BLK_CGRO.patch
 create mode 100644 debian/patches/4701-enable-ARC-FILL-LOCKED-flag.patch
 delete mode 100644 debian/patches/4710-Use-percpu_counter-for-obj_alloc-counter-of-Linux-ba.patch
 delete mode 100644 debian/patches/4720-Linux-5.7-compat-Include-linux-sched.h-in-spl-sys-mu.patch
 delete mode 100644 debian/patches/4800-Linux-5.9-compat-add-linux-blkdev.h-include.patch
 delete mode 100644 debian/patches/4801-Linux-5.9-compat-NR_SLAB_RECLAIMABLE.patch
 delete mode 100644 debian/patches/4802-Linux-5.9-compat-make_request_fn-replaced-with-submi.patch
 delete mode 100644 debian/patches/4803-Increase-Supported-Linux-Kernel-to-5.9.patch
 delete mode 100644 debian/patches/4804-Linux-5.10-compat-frame.h-renamed-objtool.h.patch
 delete mode 100644 debian/patches/4805-Linux-5.10-compat-percpu_ref-added-data-member.patch
 delete mode 100644 debian/patches/4806-Linux-5.10-compat-check_disk_change-removed.patch
 delete mode 100644 debian/patches/4807-Linux-5.10-compat-revalidate_disk_size-added.patch
 delete mode 100644 debian/patches/4808-Linux-5.10-compat-misc.patch
 delete mode 100644 debian/patches/git_fix_dependency_loop_encryption1.patch
 delete mode 100644 debian/patches/git_fix_dependency_loop_encryption2.patch

@tilgovi

tilgovi commented Apr 11, 2021

I have not run into this issue since 2.0.2.

@tilgovi

tilgovi commented May 9, 2021

Still running smoothly. I think this can be closed.

@decisionpreneur

The issue has reappeared :(

2021 May 16 21:19:09 laptop VERIFY(0 == sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl)) failed                                                                                                                   
2021 May 16 21:19:09 laptop PANIC at zfs_znode.c:339:zfs_znode_sa_init()

Linux laptop 5.8.0-45-generic

zfs-2.0.4
zfs-kmod-2.0.4

@decisionpreneur

Is there anything I can do to provide more debug info needed for the fix?

@mhosken

mhosken commented Jul 26, 2021

Me too, on 2.0.2 with Ubuntu kernel 5.13.0-12-generic.

@chaostya

chaostya commented Sep 7, 2021

Same issue here. Hangs spotted in skypeforlinux, MS Teams, VS Code, IntelliJ IDEA, and Firefox.

user@user-laptop:~$ sudo modinfo /lib/modules/5.13.0-14-generic/kernel/zfs/zfs.ko
filename: /lib/modules/5.13.0-14-generic/kernel/zfs/zfs.ko
version: 2.0.3-8ubuntu6
srcversion: EEFC177471F615FA0A30B6B

Sample stack:

INFO: task skypeforlinux:5627 blocked for more than 362 seconds.
      Tainted: P           O      5.13.0-14-generic #14-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:skypeforlinux   state:D stack:    0 pid: 5627 ppid:  4583 flags:0x00004000
Call Trace:
 __schedule+0x268/0x680
 schedule+0x4f/0xc0
 spl_panic+0xfa/0xfc [spl]
 ? queued_spin_unlock+0x9/0x10 [zfs]
 ? do_raw_spin_unlock+0x9/0x10 [zfs]
 ? __raw_spin_unlock+0x9/0x10 [zfs]
 ? dmu_buf_replace_user+0x65/0x80 [zfs]
 ? dmu_buf_set_user+0x13/0x20 [zfs]
 ? dmu_buf_set_user_ie+0x15/0x20 [zfs]
 zfs_znode_sa_init+0xd9/0xe0 [zfs]
 zfs_znode_alloc+0x101/0x560 [zfs]
 ? dmu_buf_unlock_parent+0x5d/0x90 [zfs]
 ? do_raw_spin_unlock+0x9/0x10 [zfs]
 ? dbuf_read_impl.constprop.0+0x316/0x3e0 [zfs]
 ? dbuf_rele_and_unlock+0x13b/0x4f0 [zfs]
 ? __cond_resched+0x1a/0x50
 ? __raw_callee_save___native_queued_spin_unlock+0x15/0x23
 ? queued_spin_unlock+0x9/0x10 [zfs]
 ? __cond_resched+0x1a/0x50
 ? down_read+0x13/0x90
 ? __raw_callee_save___native_queued_spin_unlock+0x15/0x23
 ? queued_spin_unlock+0x9/0x10 [zfs]
 ? do_raw_spin_unlock+0x9/0x10 [zfs]
 ? __raw_callee_save___native_queued_spin_unlock+0x15/0x23
 ? dmu_object_info_from_dnode+0x8e/0xa0 [zfs]
 zfs_zget+0x237/0x280 [zfs]
 zfs_dirent_lock+0x42a/0x570 [zfs]
 zfs_dirlook+0x91/0x2a0 [zfs]
 zfs_lookup+0x1fb/0x3f0 [zfs]
 zpl_lookup+0xcb/0x230 [zfs]
 ? step_into+0xf1/0x260
 __lookup_slow+0x84/0x150
 walk_component+0x141/0x1b0
 ? path_init+0x2c1/0x3f0
 path_lookupat+0x6e/0x1c0
 ? schedule+0x4f/0xc0
 filename_lookup+0xbb/0x1c0
 ? __check_object_size.part.0+0x128/0x150
 ? __check_object_size+0x1c/0x20
 ? strncpy_from_user+0x44/0x150
 user_path_at_empty+0x59/0x90
 ? make_kuid+0x13/0x20
 do_faccessat+0x7f/0x1e0
 __x64_sys_access+0x1d/0x20
 do_syscall_64+0x61/0xb0
 ? do_syscall_64+0x6e/0xb0
 ? do_syscall_64+0x6e/0xb0
 ? exit_to_user_mode_prepare+0x95/0xb0
 ? syscall_exit_to_user_mode+0x27/0x50
 ? do_syscall_64+0x6e/0xb0
 ? do_syscall_64+0x6e/0xb0
 ? syscall_exit_to_user_mode+0x27/0x50
 ? __x64_sys_access+0x1d/0x20
 ? do_syscall_64+0x6e/0xb0
 ? do_syscall_64+0x6e/0xb0
 ? do_syscall_64+0x6e/0xb0
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f195f16983b
RSP: 002b:00007f19367f95e8 EFLAGS: 00000206 ORIG_RAX: 0000000000000015
RAX: ffffffffffffffda RBX: 000055aac1120ea8 RCX: 00007f195f16983b
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055aac1184690
RBP: 00007f19367fb810 R08: 0000000000000000 R09: 00007f19367fb780
R10: 0000000000000000 R11: 0000000000000206 R12: 000055aac1120de8
R13: 00007f19367fc520 R14: 000055aac1120f50 R15: 000000000000000c

@lkishalmi

Happens here as well. I fear that this will eventually render my computer unusable.
Kernel: Ubuntu 5.13.0, zfs 0.8.3-1ubuntu12.12

@lathiat

lathiat commented Sep 26, 2021

I believe I have tracked down the cause of this issue to be an Ubuntu-specific ZFS patch and have a reliable reproducer. Full details in https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/1906476

I am not aware at this time of any good way to "fix" the issue on an existing dataset. For now I've just been moving the files into a "broken" directory and trying not to access them.

@emesterhazy

Is there an actual fix for this yet that I can apply? I'm concerned about data corruption, and of course my system is pretty much unusable due to this. I read through the launchpad chatter but it's not clear exactly what I should do to fix this today :\

Should I be downgrading to zfsutils-linux=2.0.2-1ubuntu5.2 ?

Kernel: 5.13.0-7614-generic
Module: zfs-kmod-2.0.3-8ubuntu6

@lathiat

lathiat commented Oct 7, 2021

If you're using the zfs-dkms package it's fixed in:

  • Hirsute (21.04): 2.0.2-1ubuntu5.2 and later
  • Impish (21.10 development beta): 2.0.6-1ubuntu2 and later

The kernel builds ZFS into a module at the time of the kernel release. New kernels are released on a regular 3-week cadence, but one hasn't yet been released that incorporates this fix. So for now you can install zfs-dkms to build your own module from the updated source (assuming your zfs-dkms package is one of the above two versions). Within 3 weeks or so there should be an updated kernel incorporating the fix in the pre-built zfs module.
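
A minimal sketch of that route on Ubuntu (only the package and command names are given; the exact version you end up with depends on your release and enabled pockets):

sudo apt update
apt-cache policy zfs-dkms        # check the candidate is one of the fixed versions above
sudo apt install zfs-dkms        # DKMS builds the module against your installed kernel(s)
sudo reboot
modinfo zfs | grep '^version'    # after reboot, confirm the running module's version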

As best I can tell, it will only affect you if you have an encrypted dataset.

@lathiat

lathiat commented Oct 7, 2021

Hirsute's current kernel is 5.11.0-37, which does not have the fix. Hopefully -38 will.
Impish's current kernel is 5.13.0-16, which does not have the fix either. Hopefully -17 will.

You can verify the zfs version included in your currently running kernel with "modinfo zfs".

@emesterhazy

@lathiat Thank you for this info. I couldn't use 2.0.2-1ubuntu5.2 since it seems to only support kernels up to 5.10. I went ahead and installed zfs 2.0.6 from source using DKMS, and it looks like I still can't remove files that were previously affected by this, at least not without causing the same zfs_znode_sa_init() panic.

Does this indeed mean that there is permanent corruption of affected files? I saw different opinions on this in the launchpad discussion.

Current zfs versions:
zfs-2.0.6-1
zfs-kmod-2.0.6-1

@lathiat

lathiat commented Oct 8, 2021

Yeah, for me there is permanent corruption that I can't fix and that scrub doesn't find. I had to move all those files to an unused directory.

Others are having the issue only on boot. I think basically what happens is that the data is corrupted when loaded into the ARC, and then that data may or may not get flushed back to disk. For some people it happens on boot and I think it never gets flushed to disk, because their whole / is encrypted; for me only /home is encrypted, so the rest of the system keeps working, and maybe that gives it an opportunity to end up back on disk.

I don't currently have a solution (other than just moving them out of the way into /home/broken) to get rid of the broken files.

@pharshalp

pharshalp commented Oct 18, 2021

Looks like the fix has been uploaded to the proposed channel for Ubuntu 21.10:

https://launchpad.net/ubuntu/+source/linux/5.13.0-20.20

  • PANIC at zfs_znode.c:335:zfs_znode_sa_init() // VERIFY(0 ==
    sa_handle_get_from_db(zfsvfs->z_os, db, zp, SA_HDL_SHARED, &zp->z_sa_hdl))
    failed (LP: #1906476)
    • debian/dkms-versions -- Update zfs to latest version

@corny

corny commented Oct 19, 2021

After upgrading my system from Ubuntu 21.04 (openzfs 2.0.2, Linux 5.11) to 21.10 (openzfs 2.0.6, Linux 5.13.0-19) my system is also affected by this issue.

@lathiat

lathiat commented Oct 19, 2021

The fixed kernel is now released. Please upgrade your kernel to 5.13.0-20 and reboot. And try not to use ZFS with the Kernel at all.

If you still get the errors after the new kernel, it means the corruption got written to the FS, and there is no known way to fix that currently. You have to figure out which files are broken and move them somewhere they won't be accessed. Scrub does not identify it.
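
For example, a hedged sketch of that workaround, following the rename approach reported earlier in this thread (the Chrome cache path is only an example; substitute whichever path is wedging for you). Renaming in place reportedly avoids reading the damaged attributes:

cd ~/.cache/google-chrome/Default
mv Cache Cache.broken    # rename in place so nothing walks into it
mkdir Cache              # optional: let the application start a fresh cache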

@pharshalp

pharshalp commented Oct 19, 2021

@lathiat when you say "try not to use ZFS with the Kernel at all", are you implying that it will always be safer to install and use the zfs-dkms package instead?

@lathiat

lathiat commented Oct 20, 2021

> @lathiat when you say "try not to use ZFS with the Kernel at all", are you implying that it will always be safer to install and use the zfs-dkms package instead?

No, I meant just don't use the broken kernel release, as corrupt data can get committed to disk. With the latest kernel on Impish it's all good; no need for the DKMS package now.

@pharshalp

pharshalp commented Oct 20, 2021

Thanks for the response.

I understand that zpool scrub isn't going to show any errors for this type of corruption. So, to check if any of the files in a given directory were corrupted, would it be sufficient to run sudo find . -exec stat {} + and check if the command returns without getting stuck at any of the files?
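
For what it's worth, here is a rough sketch of the scan I have in mind (my own guess, assuming bash; nobody here has verified it, and the 3-second threshold is arbitrary). Each stat runs in the background so one wedged file cannot block the loop, and anything still in uninterruptible sleep ("D" state, like the hung tasks shown above) gets flagged:

find . -xdev -print0 | while IFS= read -r -d '' f; do
    stat -- "$f" >/dev/null 2>&1 &
    pid=$!
    sleep 3
    # A stat stuck on a broken inode sits in D state and never exits.
    state=$(ps -o stat= -p "$pid" 2>/dev/null)
    case "$state" in
        D*) printf 'possibly broken: %s\n' "$f" ;;
    esac
done
# Caveats: find itself can wedge on a broken directory entry, and a stuck
# probe leaves a kernel lock held, so later probes may hang or false-positive;
# the scan may need a reboot between runs.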

@openzfs openzfs deleted a comment from hideout Oct 27, 2021
@alek-p
Contributor

alek-p commented Oct 27, 2021

This problem is caused by a patch that we don't have; Ubuntu has released a fix for this, see https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/1906476

If you hit this, upgrade to kernel 5.13.0-20 or later.

@alek-p alek-p closed this as completed Oct 27, 2021
@deepio

deepio commented May 18, 2022

Seems like there is a regression in kernel 5.17.5. I got this bug after upgrading to Pop!_OS 22.04; it wasn't a problem before on 20.04.

@deepio

deepio commented May 18, 2022

> sec thing helps to be near!!.

I don't know what this means @ineo00048
