zfs send/receive coredump with docker dataset #13605
Comments
I downgraded to zfs 2.1.4 for kernel 5.15.50 and the issue still occurs.
Can you share a stacktrace from the core dump? It's difficult to speculate what might be going wrong if it's not readily reproducible, and I couldn't immediately reproduce it. Edit: Sorry, I just found the incomplete stacktrace at the bottom of the coredump info paste. That's...not the most helpful. Hm.
Can you share another example stacktrace from another dump or two? My default guess if it's very inconsistent like that and crashing in a SIMD-accelerated checksum function would be something is messing up FPU save/restore. If you run, say, |
3 more stacktraces: Nr. 1
Nr. 2
Nr. 3
I tried it, and openssl did not crash while I ran several zfs send/receive commands, which did dump core. Hardware is:
with 64 GB of ECC RAM
It looks like sometimes it's crashing in the scalar fletcher4 call (since there's two different callsites in I don't see any obviously related fixes or bug reports against master, quickly looking, so not something to easily cherrypick a fix on top of 2.1.x for, I think. Try running the |
The core dumps started to happen on 23. June. Nothing before that date. I realized that I changed CFLAGS in /etc/makepkg.conf around that time to use Now I changed that to Can it be that zfs-utils are breaking with PS PPS
Spicy. It makes sense that it would be compiling userland that would matter, since A) that's what's crashing here and B) as I mentioned in my reply in #13202, compiling with -march=znver3 in the kernel wouldn't buy you using most of the interesting instructions and optimizations, because it's not safe to just use them without planning and explicit guards around things (and those are expensive, in terms of time spent, so you'd really only want to use them where it's a huge benefit), so the kernel passes lots of flags to tell the compiler not to do it in random places.
Conceptually, though, userland should be mostly safe to just randomly use them in - unless you end up calling into a block that doesn't properly clean up around itself or assumes some state that isn't true, you shouldn't be burning the world down...
...I wonder if -march=znver3 is compiling the scalar versions into something like the AVX2 versions and someone somewhere is assuming it doesn't need barriers for that version...
Which compiler and version? I assume Clang, since I believe gcc doesn't have an x86-64-v3 option to march unless it's newer than the newest gcc I've tried...
This is with And it has the x86-64-v3 option: |
Ah, added in gcc 11. That tracks. The next step I would do would be to break it down to compiling specific files or subsets with the different CFLAGS and seeing if there's one in particular which, if compiled with -march=znver3, goes bang. (Alternative next steps include running in a debugger to investigate why it goes bang, and disassembling the two different binaries for the fletcher4 objects and seeing if there's something obviously broken.) I'll try to take a look at it when I get a moment.
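When bisecting by CFLAGS like this, it can help to confirm which instruction-set extensions the compiler was actually permitted to use for a given translation unit. The following is only a minimal sketch, not part of ZFS: it relies on the predefined macros that GCC and Clang set for x86 targets (`__SSE2__`, `__AVX__`, `__AVX2__`, `__AVX512F__`), and the names `compiled_isa_mask`, `print_compiled_isa`, and the `ISA_*` constants are hypothetical.

```c
#include <stdio.h>

/* Hypothetical bit flags, one per ISA extension checked in this sketch. */
enum {
    ISA_SSE2    = 1u << 0,
    ISA_AVX     = 1u << 1,
    ISA_AVX2    = 1u << 2,
    ISA_AVX512F = 1u << 3
};

/*
 * Returns a mask of the extensions the compiler was allowed to use when
 * building this file, based on its predefined target macros. This reflects
 * compile-time flags (e.g. -march=...), not what the running CPU supports.
 */
static unsigned compiled_isa_mask(void)
{
    unsigned mask = 0;
#ifdef __SSE2__
    mask |= ISA_SSE2;
#endif
#ifdef __AVX__
    mask |= ISA_AVX;
#endif
#ifdef __AVX2__
    mask |= ISA_AVX2;
#endif
#ifdef __AVX512F__
    mask |= ISA_AVX512F;
#endif
    return mask;
}

/* Prints one line per extension so two builds can be compared by eye. */
static void print_compiled_isa(void)
{
    unsigned mask = compiled_isa_mask();
    printf("SSE2:    %s\n", (mask & ISA_SSE2) ? "yes" : "no");
    printf("AVX:     %s\n", (mask & ISA_AVX) ? "yes" : "no");
    printf("AVX2:    %s\n", (mask & ISA_AVX2) ? "yes" : "no");
    printf("AVX512F: %s\n", (mask & ISA_AVX512F) ? "yes" : "no");
}
```

Calling `print_compiled_isa()` from a small test program built once with each set of CFLAGS shows whether the two builds differ in the extensions the auto-vectoriser may target.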
Reproduced (via simple example from #13620, e.g. Observations:
Workaround issue with GCC 12 until solved upstream. Segfault occurs w/ 'zfs send' otherwise (and very possibly other commands). Bug: openzfs/zfs#13605 Bug: openzfs/zfs#13620 Closes: https://bugs.gentoo.org/856373 Signed-off-by: Sam James <sam@gentoo.org>
Workaround issue with GCC 12 until solved upstream. Segfault occurs w/ 'zfs send' otherwise (and very possibly other commands). Let's backport for older versions to be safe after discussion w/ gyakovlev. Bug: openzfs/zfs#13605 Bug: openzfs/zfs#13620 Closes: https://bugs.gentoo.org/856373 See: 1cbf3fb Signed-off-by: Sam James <sam@gentoo.org>
FWIW, we (mostly @rincebrain) made a bit of progress on this last night:
Question out of curiosity: |
They're pretty different:
Note that with e.g. (This isn't a complete list, but see https://gcc.gnu.org/git/?p=gcc.git;a=blob_plain;f=gcc/config/i386/x86-tune.def for just an example of the things GCC keeps track of per-processor family. This isn't even including the costings and cache sizes.) Especially for the new vectoriser cost model (the "very cheap" one which is enabled by default), it's quite conservative if it, for some reason, thinks it may not be worth it to vectorise. If you're bored, you can try enabling each of the above options manually and see what ends up triggering it. But I wouldn't really bother. It's not AMD specific or anything (see above) but Rich reproduced this on an Intel machine anyhow. (Part of it depends on what instructions the compiler is at liberty to use, but this could've happened in a range of situations really - that's how UB is. Could have even happened with some lower |
FYI to the thread, I have a few different patches which avoid both this problem (the crashing due to unaligned access of something it thought it could assume was aligned) and the compiler assuming it can treat that as aligned at all. Which one, if any, gets merged will, I suppose, depend on the PR review after the branch finishes running through initial tests, assuming GitHub's runners ever manage to not time out...
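For readers unfamiliar with this class of bug: casting a byte pointer to `uint32_t *` and dereferencing it lets the compiler assume the address is 4-byte aligned, and a vectorised build may then emit aligned loads that fault on real data. One general way to avoid that is to load each word with `memcpy`, which is defined for any alignment. The sketch below illustrates only that idea; it is not one of the patches mentioned above and not the ZFS implementation. The names `load_u32` and `fletcher4_sketch` are hypothetical, and trailing bytes beyond a multiple of four are simply ignored here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: read one 32-bit word in native byte order from an
 * address that may not be 4-byte aligned. memcpy makes no alignment
 * assumption, and compilers typically lower it to a single unaligned load.
 */
static inline uint32_t load_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}

/*
 * Fletcher-4 style running sums over 32-bit words, written so that the
 * compiler is never told the buffer is aligned.
 * out[0..3] receive the four accumulators a, b, c, d.
 */
static void fletcher4_sketch(const void *buf, size_t size, uint64_t out[4])
{
    const unsigned char *p = buf;
    const unsigned char *end = p + (size & ~(size_t)3); /* whole words only */
    uint64_t a = 0, b = 0, c = 0, d = 0;

    for (; p < end; p += sizeof(uint32_t)) {
        a += load_u32(p);
        b += a;
        c += b;
        d += c;
    }

    out[0] = a;
    out[1] = b;
    out[2] = c;
    out[3] = d;
}
```

Passing a deliberately misaligned pointer (for example `raw + 1` for some byte array `raw`) should work under any `-march` setting, because no aligned access is ever promised to the compiler.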
Workaround issue with GCC 12 until solved upstream. Segfault occurs w/ 'zfs send' otherwise (and very possibly other commands). Let's backport for older versions to be safe after discussion w/ gyakovlev. Bug: openzfs/zfs#13605 Bug: openzfs/zfs#13620 Closes: https://bugs.gentoo.org/856373 See: 1cbf3fbc336adfdcd122da5b0989c2993de358dc Signed-off-by: Sam James <sam@gentoo.org>
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
On Arch Linux, the zfs-utils package has had a workaround since version 2.1.8-1. From that version on, the PKGBUILD includes the compiler flags
A PR is open to fix this for good in zfs: #13631
This got mooted in #14649, I hope.
This is long fixed. Compiler option " |
This report is for Arch Linux with
installed via
zfs-dkms 2.1.5-1
kernel is either 5.18.7 or 5.15.50. It happens with both.
I experience crashes when trying to send/receive a docker dataset. I use send/receive for my regular backups with many other datasets and have never experienced an issue like this. It seems to be related to this docker dataset. (For a correction, see PS.)
If I repeat the zfs send/receive command often enough, it eventually succeeds, but only after multiple tries and multiple coredumps.
PS
It also happens with 2 other datasets related to the nextcloud docker installation:
Same behaviour. Several coredumps before send/receive finally succeeds.
coredump info: