Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arc: Drop an incorrect assert #12246

Merged
merged 1 commit into from
Sep 8, 2021
Merged

Conversation

rincebrain
Copy link
Contributor

Motivation and Context

#9897, #12020

Description

Dropped an assert that was, in fact, merely often true when correct operation was happening.

How Has This Been Tested?

I had two reliable reproducers on my Debian buster systems - both of them, without this change, would trip this assert in minutes; with this change, they never did in over an hour of load each, and no obvious bad behavior occurred either. (The two reproducers, just in case this regresses, were "mild/moderate IO from inside a chroot on a zpool over NFS from a sparc64 client (rebuilding mandb, in particular, was quite good at it)" and "disk IO on a file on a zpool using kvm and virtio-blk/virtio-scsi storage". (Some of these constraints may be unnecessarily specific, but that is what I used.))

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Unfortunately, there was an overzealous assertion that was (in pretty
specific circumstances) false, causing failure.

Let's not, and say we did.

Closes: openzfs#9897
Closes: openzfs#12020

Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
@behlendorf behlendorf requested a review from grwilson June 16, 2021 16:33
@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Jun 16, 2021
@behlendorf behlendorf requested a review from ahrens June 17, 2021 20:07
@ahrens
Copy link
Member

ahrens commented Jun 17, 2021

The assertion is based on the understanding that anonymous HDR's do not have DVA's, which I think is one of the defining characteristic of anonymous HDR's. During correct operation, how do we come to have an anonymous HDR with a DVA?

@behlendorf
Copy link
Contributor

I believe we can come to have an anonymous buffer here with a non-empty header in the event that it was overridden after being sync'd out. If you look a few lines down in arc_release() you'll see there's a comment which explicitly calls out this case. I was hoping either you or George could take a look and confirm this is the case, this #9897 (comment) pointed out the original illumos/illumos-gate@dcbf3bd commit ( 6950 ARC should cache compressed data ) which @grwilson authored and you committed to illumos.

@ahrens
Copy link
Member

ahrens commented Jun 24, 2021

@rincebrain I'm still not understanding how we get into this situation. @grwilson and I have been looking at the code trying to see what could be happening here. We noticed some interesting code paths that are taken when using dedup. When you reproduce the problem, are you using dedup (or have you ever used it on this storage pool)?

(Specifically: for dedup'ed-away writes, if the matching block is already in the ARC, then arc_write_done() seems to set b_dva but not call arc_access() (because exists != NULL), and therefore not change b_state out of anon. However, this would seem to happen very commonly (if using dedup), so it may be that something else normally mitigates this unexpected state.)

@rincebrain
Copy link
Contributor Author

@rincebrain I'm still not understanding how we get into this situation. @grwilson and I have been looking at the code trying to see what could be happening here. We noticed some interesting code paths that are taken when using dedup. When you reproduce the problem, are you using dedup (or have you ever used it on this storage pool)?

(Specifically: for dedup'ed-away writes, if the matching block is already in the ARC, then arc_write_done() seems to set b_dva but not call arc_access() (because exists != NULL), and therefore not change b_state out of anon. However, this would seem to happen very commonly (if using dedup), so it may be that something else normally mitigates this unexpected state.)

No DDT entries are present on any pool on my original system, according to zpool status -D - I would be astonished if I had ever enabled dedup on it, but I have a lot of auto snapshotting, so zpool history can't tell me anything useful.

On the testbed I was playing with this on, no, never dedup enabled, ever.

@mmaybee
Copy link
Contributor

mmaybee commented Aug 17, 2021

@grwilson and @ahrens, can we come to a consensus on this proposed change? As @behlendorf has pointed out, there is an explicit clear of the header just below this ASSERT with a comment indicating that the DVA could be non-null in some circumstances (BTW, both code and ASSERT came from George in the same change set). It seems like either the code or the ASSERT have to be in error here.

@blastwave
Copy link

Merely a follow up that I have seen this also on FreeBSD 14.0 CURRENT ( amd64/x86_64 ) and the fix is trivial. Just comment out line 6538 from arc.c and life goes on just fine. Please bear in mind that we only ever see this panic when running a full debug kernel with witness options etc etc etc. Even more rare and strange is that I can trigger the panic repeatedly when I run QEMU on the FreeBSD target system.

panic_assert_freebsd_14_current

@grwilson
Copy link
Member

grwilson commented Sep 2, 2021

I was able to go back through our internal repos to find the origin of the assertion and after looking at various commits, it is clear to me that the assertion was added in error. The code change proposed here is correct.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Sep 2, 2021
@behlendorf behlendorf merged commit 7676ffc into openzfs:master Sep 8, 2021
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Sep 15, 2021
Unfortunately, there was an overzealous assertion that was (in pretty
specific circumstances) false, causing failure.  This assertion was
added in error, so we're removing it.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes openzfs#9897
Closes openzfs#12020
Closes openzfs#12246
rincebrain added a commit to rincebrain/zfs that referenced this pull request Sep 22, 2021
Unfortunately, there was an overzealous assertion that was (in pretty
specific circumstances) false, causing failure.  This assertion was
added in error, so we're removing it.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes openzfs#9897
Closes openzfs#12020
Closes openzfs#12246
datacore-rm pushed a commit to DataCoreSoftware/openzfs that referenced this pull request Oct 4, 2022
Unfortunately, there was an overzealous assertion that was (in pretty
specific circumstances) false, causing failure.  This assertion was
added in error, so we're removing it.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes openzfs#9897
Closes openzfs#12020
Closes openzfs#12246
datacore-rm pushed a commit to DataCoreSoftware/openzfs that referenced this pull request Oct 13, 2022
Unfortunately, there was an overzealous assertion that was (in pretty
specific circumstances) false, causing failure.  This assertion was
added in error, so we're removing it.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes openzfs#9897
Closes openzfs#12020
Closes openzfs#12246
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants