fix(nexus): assertion failure when unpublishing a faulted nexus #604

Merged 1 commit on Jan 6, 2021

jonathan-teh
Contributor

The assertion occurs due to the nvmf subsystem transitioning from a
paused to an inactive state, which is forbidden. Always resume the
nexus in child_retire so that it is usually in the active state, which
avoids the assertion when unpublishing it.

Fixes #549, CAS-549, CAS-606.
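
To illustrate the state problem described above, here is a minimal sketch of a subsystem state machine in which a paused subsystem cannot go directly to inactive. The `SubsystemState` enum, the `Nexus` struct and its methods are hypothetical simplifications for illustration only; they are not the SPDK or Mayastor definitions, and the real code asserts rather than returning an error.

```rust
/// Simplified subsystem states (hypothetical, not the SPDK definitions).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SubsystemState {
    Inactive,
    Active,
    Paused,
}

/// Returns true if a direct transition is allowed in this simplified model.
/// Paused -> Inactive is deliberately absent: it is the forbidden case.
fn can_transition(from: SubsystemState, to: SubsystemState) -> bool {
    use SubsystemState::*;
    matches!(
        (from, to),
        (Inactive, Active) | (Active, Paused) | (Paused, Active) | (Active, Inactive)
    )
}

/// Hypothetical nexus that tracks only the state of its subsystem.
struct Nexus {
    state: SubsystemState,
}

impl Nexus {
    /// Pause the subsystem if the model allows it.
    fn pause(&mut self) {
        if can_transition(self.state, SubsystemState::Paused) {
            self.state = SubsystemState::Paused;
        }
    }

    /// Resume the subsystem if the model allows it.
    fn resume(&mut self) {
        if can_transition(self.state, SubsystemState::Active) {
            self.state = SubsystemState::Active;
        }
    }

    /// Retire a child: pause, handle the faulted child, then always resume,
    /// so the nexus is left active rather than paused.
    fn child_retire(&mut self) {
        self.pause();
        // ... the faulted child would be marked and removed here ...
        self.resume();
    }

    /// Unpublish: succeeds only when the move to Inactive is permitted.
    fn unpublish(&mut self) -> Result<(), String> {
        if !can_transition(self.state, SubsystemState::Inactive) {
            return Err(format!("cannot go from {:?} to Inactive", self.state));
        }
        self.state = SubsystemState::Inactive;
        Ok(())
    }
}
```

In this sketch, a nexus left in `Paused` is rejected by `unpublish`, while one that has gone through `child_retire` ends up `Active` and can be unpublished.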

jonathan-teh self-assigned this on Jan 5, 2021

commit-lint bot commented Jan 5, 2021

Bug Fixes

  • nexus: assertion failure when unpublishing a faulted nexus (957e119)

Contributors

jonathan-teh

@jonathan-teh
Contributor Author

bors try

bors bot pushed a commit that referenced this pull request Jan 5, 2021
Contributor

bors bot commented Jan 5, 2021

try

Build succeeded:

Contributor

@chriswldenyer chriswldenyer left a comment

LGTM but see comment. Maybe mention in the description that IOs now fail when there are no children, and that the timeout test is back in action.

The assertion occurs due to the nvmf subsystem transitioning from a
paused to an inactive state, which is forbidden. Always resume the
nexus in child_retire so that it is usually in the active state, which
avoids the assertion when unpublishing it.

When there is only a single faulted child, the nexus itself is in a
faulted state, which means no IO is possible. In nexus_bdev, fail the
IO if all submissions failed, which also includes the case where no
submissions were made. Refactor the post-IO submission to avoid
repetition. Also include the case where no usable child was found when
doing a round-robin for reads. This also fixes CAS-606.

Repurpose the existing cargo test for a replica that is stopped, then
continued, to test the cases above with a single remote replica.
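
The IO-failure rule and the round-robin read case described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `all_submissions_failed` and `next_read_child` are hypothetical helper names, and child results and usability are modelled as plain slices rather than the actual nexus_bdev types.

```rust
/// Hypothetical rule: the parent IO fails when every child submission failed.
/// `all` is vacuously true on an empty slice, so the case where no
/// submissions were made is also treated as a failure.
fn all_submissions_failed(results: &[Result<(), ()>]) -> bool {
    results.iter().all(|r| r.is_err())
}

/// Hypothetical round-robin selection for reads: starting after `last`,
/// return the index of the next usable child, or None if no child is usable
/// (including when there are no children at all).
fn next_read_child(usable: &[bool], last: usize) -> Option<usize> {
    let n = usable.len();
    // For n == 0 the range is empty, so the modulo is never evaluated.
    (1..=n).map(|off| (last + off) % n).find(|&i| usable[i])
}
```

With these helpers, a caller would fail the read when `next_read_child` returns `None`, and fail any IO for which `all_submissions_failed` is true, instead of leaving it pending.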
@jonathan-teh
Contributor Author

bors merge

Contributor

bors bot commented Jan 6, 2021

Build succeeded:

@bors bors bot merged commit ec43e53 into develop Jan 6, 2021
@bors bors bot deleted the cas-549a branch January 6, 2021 14:01
Development

Successfully merging this pull request may close these issues.

Assertion failure when unpublishing or destroying a nexus in faulted state