CSI: volume has exhausted its available writer claims #15415
Comments
Thanks for the report @ygersie. We will need some more time to investigate this as it's not immediately clear to me what's going on. I've added this to our backlog so we can triage it more.
@lgfa29 thank you. For now I've reduced the controller count from 2 to 1, since the `NewDevice` method has a mutex that seemingly prevents this issue from occurring. Either way, Nomad should release the volume after the `ControllerPublishVolume` RPC fails with an error, which it doesn't seem to do.
Hey, one workaround that lets me recover from this without re-registering the volume: I lowered `csi_volume_claim_gc_threshold` to 3 minutes.
Otherwise, fully stopping the task and running `nomad system gc` also seems to do the trick.
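For reference, a minimal sketch of that recovery path; the `<volume-id>` placeholder and the exact threshold value are examples, not taken from this cluster:

```sh
# Recovery sketch for stuck CSI volume claims.
#
# In the Nomad *server* agent configuration (HCL), lower the claim GC threshold
# so unreleased claims are reaped sooner:
#
#   server {
#     csi_volume_claim_gc_threshold = "3m"
#   }
#
# Or, after fully stopping the task, force a garbage-collection run by hand:
nomad system gc

# Then confirm the claim has actually been released:
nomad volume status <volume-id>
```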
Still running into this issue, now after the 1.5.1 upgrade. In one of our clusters I have an alloc that fails to be scheduled due to:
I checked the code and the
Trying a force detach never works in any of these scenarios either:
And I'm again stuck with many allocations failing to be scheduled, and the only way out of this mess is force re-registering the volume (sketched below). One of our clusters has a couple dozen failed allocs, and we have to go through each job to figure out which volumes are stuck. Just for additional info, my upgrade procedure for a cluster is as follows:
We only run a single instance of the CSI controller, since it looks like the EBS controller was never built to run in HA mode; see my previous comment above.
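For completeness, the force re-register escape hatch mentioned above looks roughly like this; `<volume-id>` and `volume.hcl` are placeholders, and `-force` drops the stuck claims as part of deregistration:

```sh
# Last-resort recovery sketch: force-deregister the volume to drop its stuck
# claims, then register it again from its original volume specification file.
nomad volume deregister -force <volume-id>
nomad volume register volume.hcl
```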
I wonder if the problems here are caused by the CSI driver being unavailable at the moment a drain occurs. I run only a single instance of the EBS CSI controller, as mentioned previously. When I drain the node that runs this controller, there's temporary unavailability, meaning any volumes for jobs running on that box can't be released until the controller is back (it should be back within a minute on another node, though). Just to be sure, I scripted the draining: the script checks whether the CSI controller is running on the node to be drained, marks the node ineligible, and stops the controller allocation so it moves somewhere else before proceeding with the drain. The last test I did, although on a smaller cluster, seems to have gone fine using this sequence.
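A rough, untested sketch of that drain ordering with the Nomad CLI; the controller job name, node ID, and jq filter are assumptions for this example, not the poster's actual script:

```sh
# Sketch: move the EBS CSI controller off a node before draining it, so the
# controller stays available to release claims during the drain.
NODE_ID="<node-id>"                        # node about to be drained (placeholder)
CONTROLLER_JOB="plugin-aws-ebs-controller" # example controller job name

# 1) Stop new placements on the node.
nomad node eligibility -disable "$NODE_ID"

# 2) If the controller runs on this node, stop its allocation so it is
#    rescheduled elsewhere before the drain starts.
ALLOC_ID=$(nomad job allocs -json "$CONTROLLER_JOB" |
  jq -r --arg n "$NODE_ID" '.[] | select(.NodeID == $n and .ClientStatus == "running") | .ID')
if [ -n "$ALLOC_ID" ]; then
  nomad alloc stop "$ALLOC_ID"
fi

# 3) Once the controller is healthy again on another node, start the drain.
nomad node drain -enable "$NODE_ID"
```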
Sorry, it's been a bit since I've had a chance to review this and I'm ramping up to help @gulducat work on the other issue you have open around restarts.
That's expected behavior and there's very little Nomad can do about this. The EBS CSI driver is frankly buggy if it can't run more than one instance at a time. We handle this as gracefully as we can by having the claim GC process. We call this out in the CSI Plugins concept docs:
@tgross the EBS controller can't run multiple instances because it assigns a device name. If you run a single controller that's not an issue, as the device name assignment is guarded by a mutex; but when you run multiple controllers you end up with conflicting device names when starting multiple tasks at once for the same node. I understand the claim GC is there as a trade-off, but the issue is that it wasn't working. See the logs I posted above:
The claims were never released. First there was an issue where the Nomad leader wasn't logging anything; then I decided to force a leader failover by shutting down the active leader, and then the Nomad server started spamming logs with the message above.
That makes sense. Apparently the EBS controller supports some kind of leader election (ref Client Restarts) but I can't find any documentation for it.
Understood. I suspect that what we've got here is a case where Nomad has multiple RPCs in-flight to the controller for the same volume. We're not supposed to be doing that either (it's a "should" in the spec, not a "must"), but it's clear that at least a few plugins don't tolerate it (even though it's a "must" for them). I've got some thoughts I'm putting together around running volume claim updates thru the eval broker which might help with this whole class of problem.
Ok, thanks for your patience all! I'm picking this back up and I did some digging into the EBS plugin's behavior and design. From their design docs:
"It's complicated" isn't exactly encouraging. On further investigation I discovered that the EBS plugin does require leader election, and that it leans on k8s-specific sidecar containers to do so (ref My next pass at this will be twofold:
I have a working branch at csi-concurrent-requests, but haven't done more than some simple smoke testing of that yet. I'm not entirely convinced that concurrent requests are the only problem we're facing here. There may also be some interaction with #17756 and I want to pull on that thread as well. Will update here (and potentially in #17756) as I know more.
The CSI specification says that we "SHOULD" send no more than one in-flight request per *volume* at a time, with an allowance for losing state (ex. leadership transitions) which the plugins "SHOULD" handle gracefully. We mostly successfully serialize node and controller RPCs for the same volume, except when Nomad clients are lost. (See also container-storage-interface/spec#512)

These concurrency requirements in the spec fall short because Storage Provider APIs aren't necessarily safe to call concurrently on the same host. For example, concurrently attaching AWS EBS volumes to an EC2 instance results in a race for device names, which results in failure to attach and confused results when releasing claims. So in practice many CSI plugins rely on k8s-specific sidecars for serializing storage provider API calls globally. As a result, we have to be much more conservative about concurrency in Nomad than the spec allows.

This changeset includes two major changes to fix this:

* Add a serializer method to the CSI volume RPC handler. When the RPC handler makes a destructive CSI Controller RPC, we send the RPC thru this serializer and only one RPC is sent at a time. Any other RPCs in flight will block.
* Ensure that requests go to the same controller plugin instance whenever possible by sorting by lowest client ID out of the healthy plugin instances.

Fixes: #15415
Draft PR is up here: #17996. I've tested it out with AWS EBS and it looks to make things much more stable.
The CSI specification says that we "SHOULD" send no more than one in-flight request per *volume* at a time, with an allowance for losing state (ex. leadership transitions) which the plugins "SHOULD" handle gracefully. We mostly successfully serialize node and controller RPCs for the same volume, except when Nomad clients are lost. (See also container-storage-interface/spec#512)

These concurrency requirements in the spec fall short because Storage Provider APIs aren't necessarily safe to call concurrently on the same host even for _different_ volumes. For example, concurrently attaching AWS EBS volumes to an EC2 instance results in a race for device names, which results in failure to attach (because the device name is taken already and the API call fails) and confused results when releasing claims. So in practice many CSI plugins rely on k8s-specific sidecars for serializing storage provider API calls globally. As a result, we have to be much more conservative about concurrency in Nomad than the spec allows.

This changeset includes four major changes to fix this:

* Add a serializer method to the CSI volume RPC handler. When the RPC handler makes a destructive CSI Controller RPC, we send the RPC thru this serializer and only one RPC is sent at a time. Any other RPCs in flight will block.
* Ensure that requests go to the same controller plugin instance whenever possible by sorting by lowest client ID out of the plugin instances.
* Ensure that requests go to _healthy_ plugin instances only.
* Ensure that requests for controllers can go to a controller on any _live_ node, not just ones eligible for scheduling (which CSI controllers don't care about)

Fixes: #15415
I've just merged #17996 with the fix. It looks like we're going to have a 1.6.1 fairly soon because of another open issue, along with backports, so expect to see this roll out shortly.
Awesome stuff @tgross, really appreciate you taking another stab at this!
Nomad version
1.4.1-ent
Issue
We're still running into the issue of jobs getting stuck pending, waiting for a volume claim to be released, as described here: #12346 (comment). The trigger seems to be related to #8336. Multiple jobs were stopped/started at the same time and the EBS controller threw an error: `Attachment point /dev/xvdba is already in use`. The allocations were in a failed state, and subsequent job stops and starts now incorrectly report there's already a claim.

I've checked the AWS side but the volumes are not actually mounted to any instances, nor does `nomad volume status` report associated allocs (possibly due to GC). When stopping + purging the respective job, the volumes that failed to be claimed look like:

The first two volumes throw claim errors and have "Access Mode" still set to `single-node-writer`, and a volume status reports `No allocations placed`. At some point after multiple job stops, running the job now fails to be placed with:

Reproduction steps
Try spinning up multiple allocations concurrently to trigger issue #8336. Then stop/start the jobs that failed before.
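A rough sketch of that reproduction, assuming a set of example jobs (`stateful-1` through `stateful-3`) that each claim a different EBS volume and are constrained to the same client node; the job names and file paths are hypothetical:

```sh
# Reproduction sketch: start several volume-claiming jobs at once so the EBS
# controller receives concurrent attach requests for the same EC2 instance,
# then stop and re-run them to hit the stuck-claim error.
for job in stateful-1 stateful-2 stateful-3; do
  nomad job run "jobs/${job}.nomad.hcl" &
done
wait

# After some allocations fail with "Attachment point ... is already in use",
# stop and re-run the affected jobs.
for job in stateful-1 stateful-2 stateful-3; do
  nomad job stop "$job"
  nomad job run "jobs/${job}.nomad.hcl"
done
```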