-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Illumos #6214, concurrency race in manipulating ARC buffer flags during L2ARC writeout can cause pool corruption #3757
Comments
We'll definitely want to pull the upstream fix when it's available right away for this issue. The same kind of race should be possible in ZoL but may be very unlikely in practice. |
@behlendorf It requires that the system use L2ARC, which does not get test converage on the buildbot. Consequently, we have no idea how likely/unlikely it is. I think we should revert b9541d6 as a precaution while it is still not in a tagged release. |
Quickly skimming the discussion on https://www.reddit.com/r/sysadmin/comments/1hmvel/zfs_your_worst_experiences/ and the issues entry at https://www.illumos.org/issues/6214 - there's one case of corrupt metadata (haven't found more via google with a quick search, but from 2 years ago - so it clearly can't be attributed to these changes) Are there any pointers to the reports ? What platform did this is appear on (mainly Illumos ? *BSD ? ZOL ?) ? What patchstack ? There were at least 2 subsequent fixes following that referenced commit, which fixed calculation and other problems with the l2arc - but I doubt they change anything in relation to zpool metadata corruption/issue |
@siebenmann thanks for bringing this to our attention. I think that even though this issue is exceptionally unlikely we need treat this as a blocker for 0.6.5 and delay the release. @ryao unfortunately we can't easily revert b9541d6 so we'll need to fix the root cause. Luckily it's my understanding that there's already fix in the works from the illumos guys. |
I believe that the current illumos-devel thread discussing the proposed fix for this starts here, if people want a head start before the fix lands in the illumos-gate repo. This initial illumos-devel message describes where to find the original thread with the problem report and discussion in the ZFS and OmniOS mailing lists. I don't know more than this, @kernelOfTruth, because I haven't been paying particular attention to the discussion of it on either list before now. (What prompted my report here is this strong warning about L2ARCs on omnios-discuss from one of the OmniOS developers.) |
Reintroduce b_compress to the l2arc_buf_hdr_t. In commit b9541d6 the compression flags were moved to the generic b_flags in the arc_buf_hdr_t. This is a problem because l2arc_compress_buf() may manipulate the compression flags and this can only be done safely under the hash lock which is not held. See Illumos 6214 for a detailed analysis of the race. References: https://www.illumos.org/issues/6214 http://cr.illumos.org/~webrev/sensille/6214_zpools_going_south/ Porting Notes: HDR_GET_COMPRESS() macro was removed from arc_buf_info(). In l2arc_read_done() the new ASSERTs were changed to VERIFYs to act as an extra layer of protection for production builds. Issue openzfs#3757
Reintroduce b_compress to the l2arc_buf_hdr_t. In commit b9541d6 the compression flags were moved to the generic b_flags in the arc_buf_hdr_t. This is a problem because l2arc_compress_buf() may manipulate the compression flags and this can only be done safely under the hash lock which is not held. See Illumos 6214 for a detailed analysis of the race. References: https://www.illumos.org/issues/6214 http://cr.illumos.org/~webrev/sensille/6214_zpools_going_south/ Porting Notes: HDR_GET_COMPRESS() macro was removed from arc_buf_info(). Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#3757
This is a just-filed Illumos bug where a concurrency race in manipulating hdr->b_flags between L2ARC writeout and ARC eviction can cause corrupt metadata to get written to the primary pool, Illumos #6214 - zpools going south. I don't know if this applies to ZoL given the different kernel environments, but ZoL does have the base commit that introduced the problem in Illumos in b9541d6.
The text was updated successfully, but these errors were encountered: