-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Data corruption with 519851122b1703b8 ("ZFS_IOC_COUNT_FILLED does unnecessary txg_wait_synced()") #14753
Comments
This reverts commit 5198511. It results in data corruption, see: openzfs/zfs#14753 Sponsored by: Rubicon Communications, LLC ("Netgate")
I would suspect the various reproducers from #11900 would work here, if need be (which were, in fact, also package building and sparse copying.) |
Thanks, I was one of the main participants in #11900 and have hit this again with package building (a handful of GCC's internal libraries like I'm going to queue a revert in Gentoo for now. Very much appreciate @mjguzik reporting this so quickly to save me tearing my hair out :) |
cc @behlendorf too as you fixed it last time around and hopefully still have some of this paged in to your brain |
This reverts commit 4b3133e671b958fa2c915a4faf57812820124a7b upstream. See #14753 - possible corruption again, very similar symptoms to the nightmare that was #11900 and same area of code. We can safely revert it as it's an optimisation rather than a bugfix in itself. Keen Gentoo users may remember the following commits (in Gentoo): - 8e5626d - 9fb275f - 95c250a Bug: openzfs/zfs#14753 Bug: openzfs/zfs#11900 Signed-off-by: Sam James <sam@gentoo.org>
i wonder why automated testing did not find , that this change causing corruption. would it make sense to create some repro-case and improve zfs automated testing with such tool/method ? |
I believe Brian added a test case in #12724; that said, it's a race condition, so unless you have the code instrumented to force lockstep ordering of a specific outcome, it's gonna be nondeterministic whether you hit it in a given test run. |
An earlier test we FreeBSD folks did was issuing |
@tonyhutter I'm sure if I do this one or two more times you'll be cross, but uh, "we re-introduced a silent data corruption we previously fixed in 2.1.10" seems like it's worth pinging you and asking for a 2.1.11 reverting 4b3133e. |
This reverts commit 4b3133e. Users identified this commit as a possible source of data corruption: openzfs#14753 Signed-off-by: Tony Hutter <hutter2@llnl.gov>
@rincebrain thanks, we'll do the revert in master (#14761) and then pull it into a new release. |
This reverts commit 4b3133e. Users identified this commit as a possible source of data corruption: openzfs#14753 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Issue openzfs#14753 Closes openzfs#14761
This reverts commit 4b3133e. Users identified this commit as a possible source of data corruption: openzfs#14753 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Issue openzfs#14753 Closes openzfs#14761
This reverts commit 4b3133e. Users identified this commit as a possible source of data corruption: openzfs#14753 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Issue openzfs#14753 Closes openzfs#14761
will a scrub show this? is there another way to detect this kind of corruption? |
A scrub would not, because from ZFS's perspective, it wrote things correctly, it just was at read time that something tried to use the SEEK_HOLE/DATA information to read things and got incorrect results. (I think.) |
This reverts commit 5198511. It results in data corruption, see: openzfs/zfs#14753 Sponsored by: Rubicon Communications, LLC ("Netgate")
This reverts commit 4b3133e671b958fa2c915a4faf57812820124a7b upstream. See #14753 - possible corruption again, very similar symptoms to the nightmare that was #11900 and same area of code. We can safely revert it as it's an optimisation rather than a bugfix in itself. Keen Gentoo users may remember the following commits (in Gentoo): - 8e5626dc90e4e6166c2e296371b6ff5a9d13a8c4 - 9fb275f656de639e25acc9497b70b4cae593d35d - 95c250a3f3986b2bc2091dd3981ff1e1d3de0c73 Bug: openzfs/zfs#14753 Bug: openzfs/zfs#11900 Signed-off-by: Sam James <sam@gentoo.org>
fixed here #14761 |
This is on FreeBSD, but it should not matter.
I don't have a trivial testcase. The workload which runs into it is rather i/o heavy (package building) and the issue randomly pops up. Manifests itself mostly as strip(1) complaining about invalid file format in various ports, trying to build the same port the second time works just fine.
Note things are a little sketchy since there are 2 unrelated data corruption bugs, the other one being block_cloning (fixed since).
That said, grabbing the tree as of 5198511 ("ZFS_IOC_COUNT_FILLED does unnecessary txg_wait_synced()") results in the non-deterministic corruption. Going one commit below makes it go away. I verified a bunch of times by running the wokload for ~1h -- with the problematic commit first problems show up within ~15 minutes or so. Unfortunately due to the nature of the machinery doing the build it is not easy to grab the corrupted data, but this can be worked out.
At the same time, if you can't repro the problem, access to the affected machine can be arranged.
In the meantime seeing as the commit was supposed to be just an optimization it should be reverted.
The pool uses one drive with 0 magic, created with mere 'zpool create'.
The text was updated successfully, but these errors were encountered: