CSI: retry claims from client #12113

Merged
merged 1 commit into main on Feb 24, 2022
Conversation

@tgross (Member) commented Feb 23, 2022

Fixes #8609 when combined with #12112
Fixes #11477

When the alloc runner claims a volume, an allocation for a previous
version of the job may still have the volume claimed because it's
still shutting down. In this case we'll receive an error from the
server. Retry on this error until the claim succeeds or until a very
long timeout expires, to give operators a chance to recover broken plugins.

Make the claim hook tolerate temporary RPC failures in general to
prevent errors during client restart or during leadership elections.
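
A minimal sketch of that retry shape (the names `claimWithRetry`, `rpcClaim`, and `errVolumeClaimed`, plus the timeout and backoff values, are all illustrative, not the actual hook code):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errVolumeClaimed stands in for the server error returned while a
// previous allocation still holds the volume claim (hypothetical).
var errVolumeClaimed = errors.New("volume is still claimed")

// claimWithRetry retries the claim RPC until it succeeds or a very
// long deadline expires, giving operators time to recover broken
// plugins. Temporary RPC failures (client restarts, leadership
// elections) are retried the same way.
func claimWithRetry(ctx context.Context, rpcClaim func() error) error {
	ctx, cancel := context.WithTimeout(ctx, 24*time.Hour) // illustrative
	defer cancel()

	backoff := time.Second
	for {
		err := rpcClaim()
		if err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("abandoning volume claim: %w", err)
		case <-time.After(backoff):
			if backoff < time.Minute {
				backoff *= 2 // capped exponential backoff
			}
		}
	}
}

func main() {
	attempts := 0
	err := claimWithRetry(context.Background(), func() error {
		attempts++
		if attempts < 3 {
			return errVolumeClaimed // previous alloc still shutting down
		}
		return nil
	})
	fmt.Printf("claimed after %d attempts, err=%v\n", attempts, err)
}
```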


(Tests don't pass until #12112 is merged; I'll rebase this on that PR)

@shoenig (Member) left a comment

eh, some CSI test failures

@tgross (Member, Author) commented Feb 24, 2022

> eh, some CSI test failures

Yup! This won't pass tests until #12112 is merged and then I'll rebase this PR on that (that's why this is in draft still... not sure why I added reviewers to it last night but maybe I was hoping #12112 was going to get merged by the time anyone looked at it! 😀 )

When the alloc runner claims a volume, an allocation for a previous
version of the job may still have the volume claimed because it's
still shutting down. In this case we'll receive an error from the
server. Retry this error until we succeed or until a very long timeout
expires, to give operators a chance to recover broken plugins.

Make the alloc runner hook tolerant of temporary RPC failures.
@tgross (Member, Author) commented Feb 24, 2022

@shoenig tests pass now (except for the one flake, #12115, which I'm re-running now).

@shoenig (Member) left a comment

LGTM

@tgross merged commit 649f1e3 into main on Feb 24, 2022
tgross added a commit that referenced this pull request Mar 25, 2022
In #12112 and #12113 we solved the problem of races in releasing
volume claims, but there was a case that we missed. During a node
drain with a controller attach/detach, we can hit a race where we call
controller publish before the unpublish has completed. This is
discouraged in the spec but plugins are supposed to handle it
safely. But if the storage provider's API is slow enough and the
plugin doesn't handle the case safely, the volume can get "locked"
into a state where the provider's API won't detach it cleanly.

Check the claim before making any external controller publish RPC
calls so that Nomad is responsible for the canonical information about
whether a volume is currently claimed.
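
To make that ordering concrete, a minimal sketch with entirely hypothetical names (`Volume`, `controllerPublish`, `claimThenPublish`), not Nomad's actual types or RPC handlers:

```go
package main

import (
	"errors"
	"fmt"
)

var errRetryable = errors.New("volume claim conflict, retry later")

// Volume is a hypothetical stand-in for Nomad's CSI volume state; the
// real type lives in Nomad's state store.
type Volume struct {
	ID      string
	claimed string // alloc ID currently holding the claim, if any
}

// controllerPublish stands in for the external ControllerPublishVolume
// RPC to the plugin.
func controllerPublish(v *Volume, allocID string) error {
	v.claimed = allocID
	return nil
}

// claimThenPublish sketches the ordering fix: consult Nomad's own
// claim state (the canonical source of truth) before any external RPC,
// so a slow unpublish can't race the publish.
func claimThenPublish(v *Volume, allocID string) error {
	if v.claimed != "" && v.claimed != allocID {
		// Fail early with a retryable error instead of calling the plugin.
		return fmt.Errorf("volume %s claimed by alloc %s: %w",
			v.ID, v.claimed, errRetryable)
	}
	return controllerPublish(v, allocID)
}

func main() {
	vol := &Volume{ID: "vol1", claimed: "old-alloc"}
	fmt.Println(claimThenPublish(vol, "new-alloc")) // claim conflict error
	vol.claimed = ""
	fmt.Println(claimThenPublish(vol, "new-alloc")) // <nil>
}
```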

This has a couple of side effects that also had to be fixed here:

* Changing the order means that the volume will have a past claim
  without a valid external node ID because it came from the client, and
  this uncovered a separate bug where we didn't assert the external node
  ID was valid before returning it. Fall through to getting the ID from
  the plugins in the state store in this case. We avoided this
  originally because of concerns around plugins getting lost during node
  drain but now that we've fixed that we may want to revisit it in
  future work.
* We should make sure we're handling `FailedPrecondition` cases from
  the controller plugin the same way we handle other retryable cases
  (see the sketch below).
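
A minimal sketch of that classification, assuming a gRPC-based plugin client (the helper `isRetryable` and the exact set of retryable codes are illustrative):

```go
package main

import (
	"errors"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isRetryable reports whether a controller plugin error should be
// retried. FailedPrecondition is classified with the other transient
// codes, per the bullet above; the exact set shown is illustrative.
func isRetryable(err error) bool {
	s, ok := status.FromError(err)
	if !ok {
		return false // not a gRPC status error
	}
	switch s.Code() {
	case codes.Unavailable, codes.Aborted, codes.ResourceExhausted,
		codes.FailedPrecondition:
		return true
	default:
		return false
	}
}

func main() {
	err := status.Error(codes.FailedPrecondition, "volume still attached")
	fmt.Println(isRetryable(err))                       // true
	fmt.Println(isRetryable(errors.New("plain error"))) // false
}
```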
tgross added a commit that referenced this pull request Mar 28, 2022, with the same commit message as above.
tgross added a commit that referenced this pull request Mar 28, 2022, with the same commit message plus one additional bullet:

* Several tests had to be updated because they were assuming we fail
  in a particular order that we're no longer doing.
tgross added a commit that referenced this pull request Mar 29, 2022, again with the same commit message.
@lgfa29 added the backport/1.1.x (backport to 1.1.x release line) and backport/1.2.x (backport to 1.2.x release line) labels Apr 19, 2022
@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators Oct 17, 2022
Labels
backport/1.1.x (backport to 1.1.x release line)
backport/1.2.x (backport to 1.2.x release line)
theme/storage
type/bug