Corrupt mmap()ed data when reading in parallel on clone #13608

Comments
So, this is quite strange, because the data should be checksummed on every actual disk read, and shouldn't really be mutated for its lifetime thereafter - that is, you shouldn't get an inconsistent view of the world. Once it produces an incorrect read, does it continue to do that forever, or is it entirely transient? It might be interesting to build ZFS with --enable-debug and ensure the …
That sounds like a great test. I appreciate your suggestion. Despite my use of ZFS for almost a decade, I am not very familiar with the internals. I will reproduce on both a debug build and with the …
I tested the existing (non-debug) kernel and set … I compiled …
Looking at the code, it appears that ZFS keeps 2 copies of …
...huh. Yeah, there it goes, on git (482505f). The handwavey parts of the stack trace make me suspicious, let me go try a patch I have...
mmap()ed data when reading in parallel
You changed the title - in experimenting, I only saw it with clones; are you saying you've reproduced it without that?
My apologies, my testing did show it only with memory mapped files, but I have not definitively reproduced without a clone. I was just trying to make the title less unwieldy. For brevity I will put that back.
mmap()ed data when reading in parallel → mmap()ed data when reading in parallel on clone
I realize that it is late to say this, but could you rebuild with --enable-debug --enable-debuginfo --enable-asan --enable-ubsan and try to reproduce this?
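For reference, building with those options would typically look something like the following. This is a sketch assuming a git checkout of the ZFS source tree, not the reporter's exact commands:

```sh
# Sketch only: assumes a git checkout of openzfs/zfs and the usual autotools prerequisites.
git clone https://github.com/openzfs/zfs.git && cd zfs
./autogen.sh
./configure --enable-debug --enable-debuginfo --enable-asan --enable-ubsan
make -j"$(nproc)"
sudo make install
```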
I'll try and do a test with those options some time this week.
I just realized that …
Good to note. I run my own kernel build, so it shouldn't be an issue to get the sanitizer support built in on my test bench.
Great. Keep in mind that the sanitizers have performance and memory overhead, so things will run somewhat slowly (although not as slowly as if you had somehow used valgrind on the kernel) and system memory usage will be a few times more than normal.
You'd also probably need a couple of patches I've given up on upstreaming to make it not flood the logs with useless messages about lz4 and the crypto code using uninited memory; at least that's my recollection from the last time I broke out kASAN.
The last time I used kASAN was before we had LZ4 or encryption in ZFS, so I had been unaware of that.
Sure, I wasn't remarking on it as a criticism or something, just to save a round trip of "hey, I'm out of space because syslog flooded with these :C".
I compiled 5.18.14 with … After over an hour I did get another crash, but surprisingly there were no other log messages from the sanitizers. The only time they output anything at all was when running … The assert is the same as before, but the stack is slightly different:
The sanitizers only applied to the userspace code. You need to enable them in a custom kernel build's configuration. I wish that you had tested using 2.1.6; there are some undefined behavior fixes in that. Confirming it with those fixes in place would at least let us not need to worry about them.
Correct me if I am wrong, but are these Linux config options not what is required? Are there more options I am missing?
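The specific options quoted in that comment are not preserved in this capture. As a point of reference only, the upstream Kconfig symbols for generic KASAN and UBSAN can be checked against a kernel's config roughly like this (the config file location varies by distribution, so the path below is an assumption):

```sh
# Check for the generic (software) KASAN and UBSAN Kconfig symbols.
# On many distributions the config lives in /boot/config-$(uname -r);
# kernels built with CONFIG_IKCONFIG_PROC expose it at /proc/config.gz instead.
grep -E '^CONFIG_(KASAN|KASAN_GENERIC|KASAN_INLINE|KASAN_VMALLOC|UBSAN)=' \
    "/boot/config-$(uname -r)"
```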
I can re-test with 2.1.6 or mainline, but before I do I want to ensure my kernel is configured correctly. Edit: I have missed …
Those would enable KASAN inside the kernel. I noticed that you said that you used … You might want to also enable …
Turns out I had that enabled as well (I configured this kernel weeks ago and only just got around to running it now):
There appears to be one more sanitizer that you can turn on if your kernel and toolchain are new enough: https://www.kernel.org/doc/html/latest/dev-tools/kcsan.html?highlight=sanitizer
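For anyone following along, the sanitizer described in that link is gated behind the CONFIG_KCSAN Kconfig symbol; a quick check against a kernel config might look like this (again, the config path is distribution-dependent):

```sh
# KCSAN (the Kernel Concurrency Sanitizer) is controlled by CONFIG_KCSAN;
# as noted later in the thread, it cannot be enabled together with KASAN.
grep -E '^CONFIG_KCSAN=' "/boot/config-$(uname -r)"
```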
Someone should poke @behlendorf, he and I looked at this for a few minutes because I had a convenient reproducer set up already, but I forget what came of it other than "well that didn't help" a few times. |
Some more test results. I re-tested with …
Now I tried using KCSAN (which requires disabling KASAN) using the same ZFS commit. There is some noise from this sanitizer as just idling the system prints a benign message every few seconds. I was still able to trigger the assert. Not sure how much of the log is relevant, so here is the sanitizer output for the 12 seconds leading up to and just after the panic (at 01:25:12). I can provide a complete dump of the syslog for the system if anyone wants it.
A note - my limited understanding of the problem is that:
So I don't think you're going to trip the sanitizers (since it's a correctly allocated and sized buffer, and I would not be surprised if things were zeroing the source of the buffer before we got to this point...), since it's an unhandled edge case, not a wild pointer or incorrectly sized item. Technically I imagine if you annotated KCSAN enough to know what was going on you might be able to convince it it's a data race? But this is based on vague memories of looking at this when it was first posted, so who knows.
I'd be surprised if the buffer is zeroed before being filled in. In the one case I managed to see the page contents, it was definitely not zeroed, so that lends some credence to that. So if that's the case I would expect this to be treated as an uninitialized read by a sanitizer, which is something they should be able to catch (as long as it understands the memory allocator). This is all just speculation from me, however - the memory mapped code goes over my head, so I'm just collecting data for someone else to figure it out.
My guess would be it doesn't understand the allocator well enough, but I don't know; I've rarely seen sanitizers do useful things with mmap in particular, so my expectations are pretty low.
We have seen a similar issue / stack from Chrome running inside a Docker container:
When a page is faulted in for memory mapped I/O the page lock may be dropped before it has been read and marked up to date. If a buffered read encounters such a page in mappedread() it must wait until the page has been updated. Failure to do so will result in a panic on debug builds and incorrect data on production builds.

The critical part of this change is in mappedread() where pages which are not up to date are now handled. Additionally, it includes the following simplifications.

- zfs_getpage() and zfs_fillpage() could be passed an array of pages. This could be more efficient if it was used but in practice only a single page was ever provided. These interfaces were simplified to acknowledge that.
- update_pages() was modified to correctly set the PG_error bit on a page when it cannot be read by dmu_read().
- Setting PG_error and PG_uptodate was moved to zfs_fillpage() from zpl_readpage_common(). This is consistent with the handling in update_pages() and mappedread().
- Minor additional refactoring to comments and variable declarations to improve readability.
- Add a test case to exercise concurrent buffered, direct, and mmap IO to the same file.
- Reduce the mmap_sync test case default run time.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #13608
Closes #14498
System information
Summary
Sometimes after creating a clone and having multiple processes read the same file(s) from that clone, one process will receive corrupt data from ZFS. The corruption is transient and in-memory only. The pool is always consistent as verified by scrubs, and re-reading the corrupt file succeeds. This is not caused by hardware, as it has been reproduced on at least 5 servers. More details on my testing and how to reproduce follow.
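The "verified by scrubs, and re-reading succeeds" observation can be reproduced with standard commands. This is only a sketch; the pool name "tank" and the file path are placeholders, not taken from the report:

```sh
# Placeholder names: "tank" for the pool, /tank/clone/file.c for the affected file.
# A scrub reports no errors even though a reader just saw corrupt contents,
# and re-reading the same file returns correct data.
zpool scrub tank
zpool wait -t scrub tank        # on recent ZFS; older releases can poll `zpool status` instead
zpool status -v tank
sha256sum /tank/clone/file.c    # matches the known-good checksum on a re-read
```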
Symptoms from Read Corruption
The easiest way to reproduce is to perform a code compile with make running multiple parallel jobs within a Docker container on ZFS, as Docker uses clones (this is detailed below). This reads the gcc binary and its shared libraries multiple times. When a file is corrupted on read, it usually manifests as one of the following errors:

ZFS Versions Tested
I have reproduced on all versions of ZFS between v0.8.2 and v2.1.4 and on Linux v5.3 through v5.16. This also happens on the Ubuntu 20.04.4 stock kernel and distribution-provided ZFS modules, so it is not unique to my kernel configuration.

HW Reproduced On
I have reproduced on at least 5 different machines, all running different hardware, from Intel 4th gen Core, to AMD 3rd gen Threadripper. All the systems use SSD-backed pools except for one with a 10k RPM HDD, so it appears to be possible to reproduce on slower storage. All systems use either a single vdev, or multiple top-level single disk vdevs. It can also be reproduced in a VM using libvirt/KVM.
Reproduction with Docker
The easiest way to reproduce (and the way this was first discovered) is using Docker to compile C/C++ with make. However, Docker is not required to reproduce, as I will explain shortly.

This was originally discovered with my company's codebase, but I have been able to reproduce with open source projects, of which I will provide 2 examples.
Prerequisites: a user named user exists with uid 1000, and can be substituted as needed.

Create a Docker image to build the code (a hypothetical sketch follows below):
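The original Dockerfile is not included in this capture. Purely as an illustration of the kind of build image described, something like the following would work; the base image, package list, and image tag are all assumptions, with only the uid 1000 user taken from the prerequisites above:

```sh
# Hypothetical build image; only the uid-1000 "user" account comes from the
# prerequisites above, everything else (base image, packages, tag) is assumed.
docker build -t zfs-mmap-repro - <<'EOF'
FROM ubuntu:20.04
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y \
    build-essential autoconf automake libtool pkg-config git libncursesw5-dev
RUN useradd -u 1000 -m user
USER user
EOF
```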
Example 1: htop
Prepare for build:
The actual reproducer:
This will build the project over and over again until it fails. Unfortunately, due to this being an intermittent issue, you may have to wait a while. Sometimes it fails within only a few iterations in under a minute; other times it takes hours. I feel that my company's build system is faster at reproducing, but unfortunately I cannot share it.
Eventually, a file from the cloned dataset that Docker creates will be read incorrectly and will result in a failure. It usually manifests as one of the errors shown above.
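The reporter's actual "prepare" and "reproduce" scripts are not preserved in this capture. A minimal sketch of the kind of loop described, reusing the hypothetical zfs-mmap-repro image from the earlier sketch and assuming an htop checkout in the current directory, might look like this; the repository URL, job count, and paths are assumptions:

```sh
# Sketch of the "build until it fails" loop described above; the image name,
# job count, and paths are assumptions, not the reporter's exact script.
git clone https://github.com/htop-dev/htop.git && cd htop
i=0
while docker run --rm -v "$PWD:/src" -w /src zfs-mmap-repro \
        sh -c './autogen.sh && ./configure && make -j8'; do
    i=$((i + 1))
    echo "iteration $i OK; cleaning up and retrying"
    docker run --rm -v "$PWD:/src" -w /src zfs-mmap-repro \
        sh -c 'make distclean || true' >/dev/null
done
echo "build failed after $i successful iterations"
```

Each docker run creates a fresh container, and with Docker's zfs storage driver that means a fresh clone, which matches the conditions described in the summary.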
Example 2: cpputest
Prepare for build:
The reproducer:
Other Cases Tested
This does not appear to be a race condition with the creation of the clone. I inserted a 3 second delay between creation and starting make, and the issue is still reproducible.

Compiling with only 1 process (make -j 1) does not appear to trigger this behaviour. It is possible that it can happen and has simply not happened yet in my limited testing of this case. The minimum parallel job count which has successfully reproduced this is 4. It appears that the more cores the system has, the easier the issue is to reproduce. I have not tested more make jobs than cores.

Running the same Docker-based test on the same hardware using ext4 and overlayfs does not exhibit this issue, so the issue lies with ZFS.

It is also worth noting that I have used several versions of GCC, including versions for cross-compiling, so I do not see the version of GCC being important. I also expect this to be possible to reproduce without GCC, but I have not explored this yet.
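For readers reproducing the ext4/overlayfs comparison above: which backend Docker uses comes down to its storage driver, which can be verified like this (the driver names are standard Docker values; how the reporter switched between them is not shown here):

```sh
# "zfs" on the affected systems, "overlay2" on the ext4/overlayfs control setup.
docker info --format '{{.Driver}}'
```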
Reproduction without Docker
To rule out Docker, I unpacked a Docker image (containing a root filesystem) into a dataset and iterated a short sequence of steps against clones of it, including a chroot into the clone (a sketch of this loop follows below). I was able to reproduce using this method, so this issue is not dependent on Docker. I have used datasets with both legacy and "normal" mountpoints.

To "prove" this is bad data from the filesystem, on one occasion a source code file was read incorrectly and I managed to get a look at the beginning of the file, which looked like this. This is clearly not data from another location in the same file.
I have a system set up to test this right now and I would be happy to run tests, try code changes, or provide debug output.