
CSI: volume has exhausted its available writer claims #15415

Closed
ygersie opened this issue Nov 29, 2022 · 12 comments · Fixed by #17996

Comments

@ygersie
Contributor

ygersie commented Nov 29, 2022

Nomad version

1.4.1-ent

Issue

We're still running into the issue of jobs getting stuck pending, waiting for a volume claim to be released, as described here: #12346 (comment). The trigger seems to be related to #8336: multiple jobs were stopped and started at the same time, and the EBS controller threw the error "Attachment point /dev/xvdba is already in use". The allocations ended up in a failed state, and subsequent job stops and starts now incorrectly report that there's already a claim.

I've checked the AWS side, but the volumes are not actually attached to any instances, nor does a nomad volume status report associated allocs (possibly due to GC). After stopping and purging the respective job, the volumes that failed to be claimed look like:

ID                 Name               Plugin ID  Schedulable  Access Mode
zookeeper[0]       zookeeper[0]       aws-ebs    true         single-node-writer
zookeeper[1]       zookeeper[1]       aws-ebs    true         single-node-writer
zookeeper[2]       zookeeper[2]       aws-ebs    true         <none>

The first two volumes throw claim errors and still have "Access Mode" set to single-node-writer, while a volume status reports: No allocations placed. At some point, after multiple job stops, running the job fails to be placed with:

Constraint "CSI volume zookeeper[0] has exhausted its available writer claims and is claimed by a garbage collected allocation 138a0fe3-8f9a-6176-014a-d46b7ee37249; waiting for claim to be released

Reproduction steps

Concurrently spin up multiple allocations at once to trigger issue #8336, then stop/start the jobs that failed.

@lgfa29 lgfa29 added this to Needs Triage in Nomad - Community Issues Triage via automation Dec 1, 2022
@lgfa29 lgfa29 moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Dec 1, 2022
@lgfa29
Contributor

lgfa29 commented Dec 1, 2022

Thanks for the report @ygersie.

We will need some more time to investigate this as it's not immediately clear to me what's going on. I've added this to our backlog so we can triage it more.

@ygersie
Contributor Author

ygersie commented Dec 1, 2022

@lgfa29 thank you. For now I've reduced the controller count from 2 to 1, since the NewDevice method has a mutex that seemingly prevents this issue from occurring. Either way, Nomad should release the volume after the ControllerPublishVolume method fails with an error, which it doesn't seem to do.

@dhung-hashicorp dhung-hashicorp added the hcc/cst Admin - internal label Dec 1, 2022
@Xuxe

Xuxe commented Jan 3, 2023

Hey,
I observed the same issue on Nomad OSS 1.4.3 with the Linode CSI driver.

One workaround that lets me recover from this without re-registering the volume:

I lowered csi_volume_claim_gc_threshold to 3 minutes.
Paired with a restart block of:

    restart {
      delay    = "3m"
      mode     = "delay"
      attempts = 1
      interval = "6m"
    }

Otherwise, fully stopping the task and running "nomad system gc" also seems to do the trick.

@ygersie
Contributor Author

ygersie commented Mar 20, 2023

Still running into this issue, now after the 1.5.1 upgrade. In one of our clusters I have an alloc that fails to be scheduled due to:

Constraint "CSI volume vmstorage-a[9] has exhausted its available writer claims and is claimed by a garbage collected allocation dfc38c1f-2555-0fcd-cb89-fe3c8be7a63e; waiting for claim to be released": 17 nodes excluded by filter

I checked the code, and the volumes_watcher routine should handle releasing the active claim, but this isn't happening. Looking at the current Nomad leader's logs, there was nothing related to this showing up at all. I then decided to trigger a leader failover by shutting down the active Nomad leader. The new leader is spamming the logs hard with the following:

    2023-03-20T11:42:24.564Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=monitoring volume_id=vmstorage-a[9]
  error=
  | 1 error occurred:
  | \t* missing external node ID: Unknown node: ba33df3d-a1cc-6db3-9986-ede682b6c94c
  |

    2023-03-20T11:42:24.569Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=monitoring volume_id=vmstorage-a[9]
  error=
  | 1 error occurred:
  | \t* missing external node ID: Unknown node: ba33df3d-a1cc-6db3-9986-ede682b6c94c
  |

    2023-03-20T11:42:24.574Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=monitoring volume_id=vmstorage-a[9]
  error=
  | 1 error occurred:
  | \t* missing external node ID: Unknown node: ba33df3d-a1cc-6db3-9986-ede682b6c94c
  |

    2023-03-20T11:42:24.579Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=monitoring volume_id=vmstorage-a[9]
  error=
  | 1 error occurred:
  | \t* missing external node ID: Unknown node: ba33df3d-a1cc-6db3-9986-ede682b6c94c
  |

    2023-03-20T11:42:24.584Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=monitoring volume_id=vmstorage-a[9]
  error=
  | 1 error occurred:
  | \t* missing external node ID: Unknown node: ba33df3d-a1cc-6db3-9986-ede682b6c94c

Trying a force detach never works in any of these scenarios either:

$ nomad volume detach vmstorage-a[9] ba33df3d-a1cc-6db3-9986-ede682b6c94c
Error detaching volume: Unexpected response code: 500 (rpc error: Unknown allocation)

And I'm again stuck with many allocations failing to be scheduled; the only way out of this mess is force re-registering the volume. One of our clusters has a couple dozen failed allocs, for which we have to go through each job to figure out which volumes are stuck.

Just for additional info, my upgrade procedure for a cluster is as follows:

  • Upgrade servers 1 by 1.
  • Drain each client 1 by 1, leaving the system jobs running, and wait for each node to be successfully replaced. Only when the plugins on the replacement node are healthy do we sleep for another minute before moving on to the next instance. We have to drain the nodes to prevent zombie containers from being created due to another issue related to CSI plugins (a race on startup), which makes another gigantic mess of the volume and other resource claims.

We only run a single instance of the CSI controller since it looks like the EBS controller was never built to run in HA mode; see my previous comment above.

@ygersie
Contributor Author

ygersie commented Mar 24, 2023

I wonder if the problems here are caused by the unavailability of the CSI driver the moment a drain occurs. I run only a single instance of the EBS CSI controller, as mentioned previously. When I drain a node that runs this controller there's temporary unavailability, meaning any volumes for jobs running on that box can't be released until the controller is back (it should be back within a minute on another node, though).

Just to be sure, I scripted the draining: the script checks whether the CSI controller is running on the node to be drained, and if so marks the node ineligible and stops the allocation to have it moved somewhere else before proceeding with the drain. The last test I did, albeit on a smaller cluster, seems to have gone fine using this sequence.
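A minimal sketch of that drain ordering, assuming the official Go API client (github.com/hashicorp/nomad/api); the controller job ID, node ID, and timings are placeholders rather than the actual script, and the client method signatures should be checked against the api package version in use:

    // drain_sketch.go: mark the node ineligible, move the CSI controller
    // alloc off the node, then enable the drain (system jobs left running).
    package main

    import (
        "fmt"
        "log"
        "time"

        "github.com/hashicorp/nomad/api"
    )

    // Placeholder job ID for the controller plugin job in this sketch.
    const csiControllerJobID = "plugin-aws-ebs-controller"

    func drainNode(client *api.Client, nodeID string) error {
        nodes := client.Nodes()

        // 1. Mark the node ineligible so nothing new lands on it while
        //    allocations are being shuffled around.
        if _, err := nodes.ToggleEligibility(nodeID, false, nil); err != nil {
            return fmt.Errorf("marking node ineligible: %w", err)
        }

        // 2. If the CSI controller runs on this node, stop its allocation so
        //    the scheduler moves it elsewhere before the drain takes the node away.
        allocs, _, err := nodes.Allocations(nodeID, nil)
        if err != nil {
            return fmt.Errorf("listing allocations: %w", err)
        }
        for _, alloc := range allocs {
            if alloc.JobID != csiControllerJobID || alloc.ClientStatus != "running" {
                continue
            }
            if _, err := client.Allocations().Stop(alloc, nil); err != nil {
                return fmt.Errorf("stopping controller alloc %s: %w", alloc.ID, err)
            }
            // Crude stand-in for "wait until the controller is healthy elsewhere".
            time.Sleep(60 * time.Second)
        }

        // 3. Only now enable the drain itself, leaving system jobs running.
        spec := &api.DrainSpec{Deadline: 15 * time.Minute, IgnoreSystemJobs: true}
        if _, err := nodes.UpdateDrain(nodeID, spec, false, nil); err != nil {
            return fmt.Errorf("enabling drain: %w", err)
        }
        return nil
    }

    func main() {
        client, err := api.NewClient(api.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }
        if err := drainNode(client, "<node-id>"); err != nil {
            log.Fatal(err)
        }
    }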

@tgross
Member

tgross commented Mar 24, 2023

Sorry, it's been a bit since I've had a chance to review this and I'm ramping up to help @gulducat work on the other issue you have open around restarts.

I run only a single instance of the EBS CSI controller as mentioned previously. When I drain a node that runs this controller there's temporary unavailability meaning any volumes for jobs running on that box couldn't be released at that point until the controller is back (should be back within a minute on another node though).

That's expected behavior and there's very little Nomad can do about this. The EBS CSI driver is frankly buggy if it can't run more than one instance at a time. We handle this as gracefully as we can by having the claim GC process. We call this out in the CSI Plugins concept docs:

Controller plugins can create and attach volumes anywhere they can communicate with the storage provider's API, so they can usually be run as service jobs. You should always run more than one controller plugin allocation for high availability.

@ygersie
Contributor Author

ygersie commented Mar 24, 2023

@tgross the EBS controller can't run multiple instances because it assigns a device name. If you run a single controller that's not an issue, as the device name assignment is guarded by a mutex, but when you run multiple controllers there is a problem: you end up with conflicting device names when starting multiple tasks at once for the same node.

I understand there's now a GC happening as a trade-off, but the issue is that it wasn't working. See the logs I posted above:

    2023-03-20T11:42:24.564Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=monitoring volume_id=vmstorage-a[9]
  error=
  | 1 error occurred:
  | \t* missing external node ID: Unknown node: ba33df3d-a1cc-6db3-9986-ede682b6c94c
  |

The claims were never released. First there was an issue where the Nomad leader wasn't logging anything; then I decided to force a leader failover by shutting down the active leader, and the new Nomad server started spamming the logs with the above message. The volumes_watcher routine wasn't able to release the claim.

@tgross
Member

tgross commented Mar 28, 2023

the EBS controller can't run multiple instances because it assigns a devicename. If you run a single controller that's not an issue as the devicename assignment is guarded by a mutex but when you run multiple controllers there is a problem as you end up with conflicting devicenames when starting multiple tasks at once for the same node.

That makes sense. Apparently the EBS controller supports some kind of leader election (ref Client Restarts) but I can't find any documentation for it.

I understand there's now a GC happening as a trade-off but the issue is, it wasn't working.

Understood. I suspect that what we've got here is a case where Nomad has multiple RPCs in-flight to the controller for the same volume. We're not supposed to be doing that either (it's a "should" for callers in the spec, not a "must"), but it's clear that at least a few plugins don't tolerate it (even though handling it gracefully is a "must" for them). I've got some thoughts I'm putting together around running volume claim updates through the eval broker, which might help with this whole class of problem.

@tgross
Member

tgross commented Jul 17, 2023

Ok, thanks for your patience all! I'm picking this back up and I did some digging into the EBS plugin's behavior and design. From their design docs:

Chooses the right device name for the volume on the node (more on that below) and issues AttachVolume. TODO: this has complicated idempotency expectations. It cancels previously called ControllerUnpublishVolume that may be still in progress (i.e. AWS is still detaching the volume and Kubernetes now wants the volume to be attached back).

"It's complicated" isn't exactly encouraging. On further investigation I discovered that the EBS plugin does require leader election, and that it leans on k8s-specific sidecar containers to do so (ref controller.yaml#L137). The underlying problem is challenging within the boundaries of the CSI spec, but now we have to compensate for the unfortunate choice of the plugins to lean on k8s-specific APIs.

My next pass at this will be twofold:

  • Serialize the controller publish/unpublish RPCs on a per-plugin basis (see the sketch after this list). This is frankly not ideal, as it's going to impact allocation startup performance quite a bit for jobs that spin up a lot of allocations using the same plugin but different volumes. We might be able to get away with having the serialization key be the plugin ID + client ID, which would be more fine-grained, but I need to do more testing.
  • Use a simple bully selection (lowest client ID) so that we generally send these RPCs to the same controller instance each time, unless it's unavailable.

I have a working branch at csi-concurrent-requests, but haven't done more than some simple smoke testing of that yet. I'm not entirely convinced that concurrent requests are the only problem we're facing here. There may also be some interaction with #17756 and I want to pull on that thread as well. Will update here (and potentially in #17756) as I know more.
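To make the first idea above concrete, here is a minimal sketch of per-plugin serialization of destructive controller RPCs. This is only the shape of the idea, not Nomad's actual implementation; whether the key should be the plugin ID alone or plugin ID + client ID is the open question from the list above.

    // csiSerializer serializes destructive CSI controller RPCs per key, so a
    // plugin (or plugin+client pair) only ever sees one in-flight request.
    package csi

    import "sync"

    type csiSerializer struct {
        mu    sync.Mutex
        locks map[string]*sync.Mutex // one lock per serialization key
    }

    func newCSISerializer() *csiSerializer {
        return &csiSerializer{locks: map[string]*sync.Mutex{}}
    }

    // lockFor returns the mutex for a given key, creating it on first use.
    func (s *csiSerializer) lockFor(key string) *sync.Mutex {
        s.mu.Lock()
        defer s.mu.Unlock()
        if _, ok := s.locks[key]; !ok {
            s.locks[key] = &sync.Mutex{}
        }
        return s.locks[key]
    }

    // do runs fn while holding the per-key lock. Coarse key: pluginID.
    // Finer-grained alternative mentioned above: pluginID + "/" + clientID.
    func (s *csiSerializer) do(pluginID string, fn func() error) error {
        l := s.lockFor(pluginID)
        l.Lock()
        defer l.Unlock()
        return fn() // e.g. the ControllerPublish/ControllerUnpublish RPC call
    }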

tgross added a commit that referenced this issue Jul 19, 2023
The CSI specification says that we "SHOULD" send no more than one in-flight
request per *volume* at a time, with an allowance for losing state
(ex. leadership transitions) which the plugins "SHOULD" handle gracefully. We
mostly successfully serialize node and controller RPCs for the same volume,
except when Nomad clients are lost.
(See also container-storage-interface/spec#512)

These concurrency requirements in the spec fall short because Storage Provider
APIs aren't necessarily safe to call concurrently on the same host. For example,
concurrently attaching AWS EBS volumes to an EC2 instance results in a race for
device names, which results in failure to attach and confused results when
releasing claims. So in practice many CSI plugins rely on k8s-specific sidecars
for serializing storage provider API calls globally. As a result, we have to be
much more conservative about concurrency in Nomad than the spec allows.

This changeset includes two major changes to fix this:
* Add a serializer method to the CSI volume RPC handler. When the
  RPC handler makes a destructive CSI Controller RPC, we send the RPC thru this
  serializer and only one RPC is sent at a time. Any other RPCs in flight will
  block.
* Ensure that requests go to the same controller plugin instance whenever
  possible by sorting by lowest client ID out of the healthy plugin instances.

Fixes: #15415
@tgross
Member

tgross commented Jul 19, 2023

Draft PR is up here: #17996. I've tested it out with AWS EBS and it looks to make things much more stable.

Nomad - Community Issues Triage automation moved this from Needs Roadmapping to Done Jul 20, 2023
tgross added a commit that referenced this issue Jul 20, 2023
The CSI specification says that we "SHOULD" send no more than one in-flight
request per *volume* at a time, with an allowance for losing state
(ex. leadership transitions) which the plugins "SHOULD" handle gracefully. We
mostly successfully serialize node and controller RPCs for the same volume,
except when Nomad clients are lost. (See also
container-storage-interface/spec#512)

These concurrency requirements in the spec fall short because Storage Provider
APIs aren't necessarily safe to call concurrently on the same host even for
_different_ volumes. For example, concurrently attaching AWS EBS volumes to an
EC2 instance results in a race for device names, which results in failure to
attach (because the device name is taken already and the API call fails) and
confused results when releasing claims. So in practice many CSI plugins rely on
k8s-specific sidecars for serializing storage provider API calls globally. As a
result, we have to be much more conservative about concurrency in Nomad than the
spec allows.

This changeset includes four major changes to fix this:
* Add a serializer method to the CSI volume RPC handler. When the RPC handler
  makes a destructive CSI Controller RPC, we send the RPC thru this serializer
  and only one RPC is sent at a time. Any other RPCs in flight will block.
* Ensure that requests go to the same controller plugin instance whenever
  possible by sorting by lowest client ID out of the plugin instances.
* Ensure that requests go to _healthy_ plugin instances only.
* Ensure that requests for controllers can go to a controller on any _live_
  node, not just ones eligible for scheduling (which CSI controllers don't care
  about)

Fixes: #15415
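As an illustration of the instance-selection changes described in the commit message above (a sketch of the idea only, not the code in #17996): filter to healthy controller instances, then sort by client ID so requests consistently prefer the same instance.

    // pickController chooses which controller plugin instance should receive
    // the next controller RPC: healthy instances only, lowest client ID first.
    package csi

    import (
        "errors"
        "sort"
    )

    type controllerInstance struct {
        ClientID string // Nomad client (node) running this controller alloc
        Healthy  bool
    }

    func pickController(instances []controllerInstance) (controllerInstance, error) {
        healthy := make([]controllerInstance, 0, len(instances))
        for _, inst := range instances {
            if inst.Healthy {
                healthy = append(healthy, inst)
            }
        }
        if len(healthy) == 0 {
            return controllerInstance{}, errors.New("no healthy controller plugin instances")
        }
        // Sort by client ID so every caller deterministically prefers the same
        // instance, falling back to the next one only if the first disappears.
        sort.Slice(healthy, func(i, j int) bool {
            return healthy[i].ClientID < healthy[j].ClientID
        })
        return healthy[0], nil
    }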
@tgross
Member

tgross commented Jul 20, 2023

I've just merged #17996 with the fix. It looks like we're going to have a 1.6.1 fairly soon because of another open issue, along with backports, so expect to see this roll out shortly.

@ygersie
Copy link
Contributor Author

ygersie commented Jul 20, 2023

Awesome stuff @tgross, really appreciate you taking another stab at this!

jrasell pushed a commit that referenced this issue Jul 26, 2023