Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data corruption with 519851122b1703b8 ("ZFS_IOC_COUNT_FILLED does unnecessary txg_wait_synced()") #14753

Closed
mjguzik opened this issue Apr 15, 2023 · 12 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@mjguzik
Copy link
Contributor

mjguzik commented Apr 15, 2023

This is on FreeBSD, but it should not matter.

I don't have a trivial testcase. The workload which runs into it is rather i/o heavy (package building) and the issue randomly pops up. Manifests itself mostly as strip(1) complaining about invalid file format in various ports, trying to build the same port the second time works just fine.

Note things are a little sketchy since there are 2 unrelated data corruption bugs, the other one being block_cloning (fixed since).

That said, grabbing the tree as of 5198511 ("ZFS_IOC_COUNT_FILLED does unnecessary txg_wait_synced()") results in the non-deterministic corruption. Going one commit below makes it go away. I verified a bunch of times by running the wokload for ~1h -- with the problematic commit first problems show up within ~15 minutes or so. Unfortunately due to the nature of the machinery doing the build it is not easy to grab the corrupted data, but this can be worked out.

At the same time, if you can't repro the problem, access to the affected machine can be arranged.

In the meantime seeing as the commit was supposed to be just an optimization it should be reverted.

The pool uses one drive with 0 magic, created with mere 'zpool create'.

@mjguzik mjguzik added the Type: Defect Incorrect behavior (e.g. crash, hang) label Apr 15, 2023
@mjguzik
Copy link
Contributor Author

mjguzik commented Apr 15, 2023

@ahrens

freebsd-git pushed a commit to freebsd/freebsd-src that referenced this issue Apr 15, 2023
This reverts commit 5198511.

It results in data corruption, see:
openzfs/zfs#14753

Sponsored by:	Rubicon Communications, LLC ("Netgate")
@rincebrain
Copy link
Contributor

I would suspect the various reproducers from #11900 would work here, if need be (which were, in fact, also package building and sparse copying.)

@thesamesam
Copy link
Contributor

thesamesam commented Apr 16, 2023

Thanks, I was one of the main participants in #11900 and have hit this again with package building (a handful of GCC's internal libraries like libitm.so were installed as garbage).

I'm going to queue a revert in Gentoo for now. Very much appreciate @mjguzik reporting this so quickly to save me tearing my hair out :)

@thesamesam
Copy link
Contributor

cc @behlendorf too as you fixed it last time around and hopefully still have some of this paged in to your brain

gentoo-bot pushed a commit to gentoo/gentoo that referenced this issue Apr 16, 2023
This reverts commit 4b3133e671b958fa2c915a4faf57812820124a7b upstream.

See #14753 - possible corruption again, very similar symptoms to the
nightmare that was #11900 and same area of code.

We can safely revert it as it's an optimisation rather than a bugfix
in itself.

Keen Gentoo users may remember the following commits (in Gentoo):
- 8e5626d
- 9fb275f
- 95c250a

Bug: openzfs/zfs#14753
Bug: openzfs/zfs#11900
Signed-off-by: Sam James <sam@gentoo.org>
@devZer0
Copy link

devZer0 commented Apr 16, 2023

i wonder why automated testing did not find , that this change causing corruption.

would it make sense to create some repro-case and improve zfs automated testing with such tool/method ?

@rincebrain
Copy link
Contributor

I believe Brian added a test case in #12724; that said, it's a race condition, so unless you have the code instrumented to force lockstep ordering of a specific outcome, it's gonna be nondeterministic whether you hit it in a given test run.

@vishwin
Copy link

vishwin commented Apr 17, 2023

An earlier test we FreeBSD folks did was issuing cp -R on a sufficiently large directory tree to another location on the same dataset. It helped scope down what caused block_cloning corruption but not this, probably because cp -R operates sequentially. Package building on the other hand, especially with jobservers involved, does not necessarily operate sequentially.

@rincebrain
Copy link
Contributor

@tonyhutter I'm sure if I do this one or two more times you'll be cross, but uh, "we re-introduced a silent data corruption we previously fixed in 2.1.10" seems like it's worth pinging you and asking for a 2.1.11 reverting 4b3133e.

tonyhutter added a commit to tonyhutter/zfs that referenced this issue Apr 17, 2023
This reverts commit 4b3133e.

Users identified this commit as a possible source of data
corruption:
openzfs#14753

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
@tonyhutter
Copy link
Contributor

@rincebrain thanks, we'll do the revert in master (#14761) and then pull it into a new release.

behlendorf pushed a commit that referenced this issue Apr 18, 2023
This reverts commit 4b3133e.

Users identified this commit as a possible source of data
corruption:
#14753

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Issue #14753 
Closes #14761
tonyhutter added a commit to tonyhutter/zfs that referenced this issue Apr 18, 2023
This reverts commit 4b3133e.

Users identified this commit as a possible source of data
corruption:
openzfs#14753

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Issue openzfs#14753 
Closes openzfs#14761
@tonyhutter tonyhutter mentioned this issue Apr 18, 2023
13 tasks
ixhamza pushed a commit to truenas/zfs that referenced this issue Apr 20, 2023
This reverts commit 4b3133e.

Users identified this commit as a possible source of data
corruption:
openzfs#14753

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Issue openzfs#14753 
Closes openzfs#14761
andrewc12 pushed a commit to andrewc12/openzfs that referenced this issue Apr 30, 2023
This reverts commit 4b3133e.

Users identified this commit as a possible source of data
corruption:
openzfs#14753

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Issue openzfs#14753 
Closes openzfs#14761
@psy0rz
Copy link

psy0rz commented Jun 5, 2023

will a scrub show this? is there another way to detect this kind of corruption?

@rincebrain
Copy link
Contributor

A scrub would not, because from ZFS's perspective, it wrote things correctly, it just was at read time that something tried to use the SEEK_HOLE/DATA information to read things and got incorrect results. (I think.)

bsdjhb pushed a commit to bsdjhb/cheribsd that referenced this issue Jun 21, 2023
This reverts commit 5198511.

It results in data corruption, see:
openzfs/zfs#14753

Sponsored by:	Rubicon Communications, LLC ("Netgate")
gentoo-repo-qa-bot pushed a commit to gentoo-mirror/linux-be that referenced this issue Jul 2, 2023
This reverts commit 4b3133e671b958fa2c915a4faf57812820124a7b upstream.

See #14753 - possible corruption again, very similar symptoms to the
nightmare that was #11900 and same area of code.

We can safely revert it as it's an optimisation rather than a bugfix
in itself.

Keen Gentoo users may remember the following commits (in Gentoo):
- 8e5626dc90e4e6166c2e296371b6ff5a9d13a8c4
- 9fb275f656de639e25acc9497b70b4cae593d35d
- 95c250a3f3986b2bc2091dd3981ff1e1d3de0c73

Bug: openzfs/zfs#14753
Bug: openzfs/zfs#11900
Signed-off-by: Sam James <sam@gentoo.org>
@mjguzik
Copy link
Contributor Author

mjguzik commented Mar 30, 2024

fixed here #14761

@mjguzik mjguzik closed this as completed Mar 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)
Projects
None yet
Development

No branches or pull requests

7 participants