
CSI: volume has exhausted its available writer claims #15415

Closed
ygersie opened this issue Nov 29, 2022 · 12 comments · Fixed by #17996

Comments

@ygersie
Contributor

ygersie commented Nov 29, 2022

Nomad version

1.4.1-ent

Issue

We're still running into the issue of jobs getting stuck pending, waiting for a volume claim to be released, as described here: #12346 (comment). The trigger seems to be related to #8336: multiple jobs were stopped and started at the same time, and the EBS controller threw the error "Attachment point /dev/xvdba is already in use". The allocations ended up in a failed state, and subsequent job stops and starts now incorrectly report that there's already a claim.

I've checked the AWS side, but the volumes are not actually attached to any instances, nor does a nomad volume status report associated allocs (possibly due to GC). After stopping and purging the respective job, the volumes that failed to be claimed look like:

ID                 Name               Plugin ID  Schedulable  Access Mode
zookeeper[0]       zookeeper[0]       aws-ebs    true         single-node-writer
zookeeper[1]       zookeeper[1]       aws-ebs    true         single-node-writer
zookeeper[2]       zookeeper[2]       aws-ebs    true         <none>

The first two volumes throw claim errors and still have "Access Mode" set to single-node-writer, while a volume status reports: No allocations placed. At some point, after multiple job stops, running the job fails to be placed with:

Constraint "CSI volume zookeeper[0] has exhausted its available writer claims and is claimed by a garbage collected allocation 138a0fe3-8f9a-6176-014a-d46b7ee37249; waiting for claim to be released

Reproduction steps

Concurrently spin up multiple allocations at once to trigger issue #8336, then stop/start the jobs that failed.

@lgfa29 lgfa29 added this to Needs Triage in Nomad - Community Issues Triage via automation Dec 1, 2022
@lgfa29 lgfa29 moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Dec 1, 2022
@lgfa29
Contributor

lgfa29 commented Dec 1, 2022

Thanks for the report @ygersie.

We will need some more time to investigate this as it's not immediately clear to me what's going on. I've added this to our backlog so we can triage it more.

@ygersie
Contributor Author

ygersie commented Dec 1, 2022

@lgfa29 thank you. For now I've reduced the controller count from 2 to 1, since the NewDevice method has a mutex that seemingly prevents this issue from occurring. Either way, Nomad should release the volume after the ControllerPublishVolume method fails with an error, which it doesn't seem to do.

@dhung-hashicorp dhung-hashicorp added the hcc/cst Admin - internal label Dec 1, 2022
@Xuxe

Xuxe commented Jan 3, 2023

Hey,
I observed the same issue on Nomad OSS 1.4.3 with the Linode CSI driver.

One workaround that lets me recover from this without re-registering the volume:

I lowered csi_volume_claim_gc_threshold to 3 minutes.
Paired with a restart block of:

    restart {
      delay    = "3m"
      mode     = "delay"
      attempts = 1
      interval = "6m"
    }

Otherwise, fully stopping the task and running "nomad system gc" also seems to do the trick.

@ygersie
Contributor Author

ygersie commented Mar 20, 2023

Still running into this issue, now after the 1.5.1 upgrade. In one of our clusters I have an alloc that fails to be scheduled due to:

Constraint "CSI volume vmstorage-a[9] has exhausted its available writer claims and is claimed by a garbage collected allocation dfc38c1f-2555-0fcd-cb89-fe3c8be7a63e; waiting for claim to be released": 17 nodes excluded by filter

I checked the code, and the volumes_watcher routine should handle releasing the active claim, but this isn't happening. Looking at the current Nomad leader's logs, there was nothing related to this showing up at all. I then decided to trigger a leader failover by shutting down the active Nomad leader. The new leader is spamming the logs hard with the following:

    2023-03-20T11:42:24.564Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=monitoring volume_id=vmstorage-a[9]
  error=
  | 1 error occurred:
  | \t* missing external node ID: Unknown node: ba33df3d-a1cc-6db3-9986-ede682b6c94c
  |

    2023-03-20T11:42:24.569Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=monitoring volume_id=vmstorage-a[9]
  error=
  | 1 error occurred:
  | \t* missing external node ID: Unknown node: ba33df3d-a1cc-6db3-9986-ede682b6c94c
  |

    2023-03-20T11:42:24.574Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=monitoring volume_id=vmstorage-a[9]
  error=
  | 1 error occurred:
  | \t* missing external node ID: Unknown node: ba33df3d-a1cc-6db3-9986-ede682b6c94c
  |

    2023-03-20T11:42:24.579Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=monitoring volume_id=vmstorage-a[9]
  error=
  | 1 error occurred:
  | \t* missing external node ID: Unknown node: ba33df3d-a1cc-6db3-9986-ede682b6c94c
  |

    2023-03-20T11:42:24.584Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=monitoring volume_id=vmstorage-a[9]
  error=
  | 1 error occurred:
  | \t* missing external node ID: Unknown node: ba33df3d-a1cc-6db3-9986-ede682b6c94c

Trying a force detach never works in any of these scenarios either:

$ nomad volume detach vmstorage-a[9] ba33df3d-a1cc-6db3-9986-ede682b6c94c
Error detaching volume: Unexpected response code: 500 (rpc error: Unknown allocation)

And I'm again stuck with many allocations failing to be scheduled; the only way out of this mess is force re-registering the volume. One of our clusters has a couple dozen failed allocs, for which we have to go through each job to figure out which volumes are stuck.

Just for additional info, my upgrade procedure for a cluster is as follows:

  • Upgrade servers 1 by 1.
  • Drain each client 1 by 1, leaving the system jobs running, and wait for each node to be successfully replaced. Only when the plugins on the replacement node are healthy do we sleep for another minute before moving on to the next instance. We have to drain the nodes to prevent zombie containers from being created due to another issue related to CSI plugins (a race on startup), which makes another gigantic mess of the volume and other resource claims.

We only run a single instance of the CSI controller since it looks like the EBS controller was never built to run in HA mode; see my previous comment above.

@ygersie
Contributor Author

ygersie commented Mar 24, 2023

I wonder if the problems here are caused by the unavailability of the CSI driver the moment a drain occurs. I run only a single instance of the EBS CSI controller, as mentioned previously. When I drain a node that runs this controller there's temporary unavailability, meaning any volumes for jobs running on that box can't be released until the controller is back (it should be back within a minute on another node, though).

Just to be sure, I scripted the draining: the script checks whether the CSI controller is running on the node to be drained, and if so marks the node ineligible and stops the allocation to have it moved somewhere else before proceeding with the drain. The last test I did, albeit on a smaller cluster, seems to have gone fine using this sequence.
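A minimal sketch of that drain ordering, assuming the official Go API client (github.com/hashicorp/nomad/api); the controller job ID, node ID, and timings are placeholders rather than the actual script, and the client method signatures should be checked against the api package version in use:

    // drain_sketch.go: mark the node ineligible, move the CSI controller
    // alloc off the node, then enable the drain (system jobs left running).
    package main

    import (
        "fmt"
        "log"
        "time"

        "github.com/hashicorp/nomad/api"
    )

    // Placeholder job ID for the controller plugin job in this sketch.
    const csiControllerJobID = "plugin-aws-ebs-controller"

    func drainNode(client *api.Client, nodeID string) error {
        nodes := client.Nodes()

        // 1. Mark the node ineligible so nothing new lands on it while
        //    allocations are being shuffled around.
        if _, err := nodes.ToggleEligibility(nodeID, false, nil); err != nil {
            return fmt.Errorf("marking node ineligible: %w", err)
        }

        // 2. If the CSI controller runs on this node, stop its allocation so
        //    the scheduler moves it elsewhere before the drain takes the node away.
        allocs, _, err := nodes.Allocations(nodeID, nil)
        if err != nil {
            return fmt.Errorf("listing allocations: %w", err)
        }
        for _, alloc := range allocs {
            if alloc.JobID != csiControllerJobID || alloc.ClientStatus != "running" {
                continue
            }
            if _, err := client.Allocations().Stop(alloc, nil); err != nil {
                return fmt.Errorf("stopping controller alloc %s: %w", alloc.ID, err)
            }
            // Crude stand-in for "wait until the controller is healthy elsewhere".
            time.Sleep(60 * time.Second)
        }

        // 3. Only now enable the drain itself, leaving system jobs running.
        spec := &api.DrainSpec{Deadline: 15 * time.Minute, IgnoreSystemJobs: true}
        if _, err := nodes.UpdateDrain(nodeID, spec, false, nil); err != nil {
            return fmt.Errorf("enabling drain: %w", err)
        }
        return nil
    }

    func main() {
        client, err := api.NewClient(api.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }
        if err := drainNode(client, "<node-id>"); err != nil {
            log.Fatal(err)
        }
    }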

@tgross
Member

tgross commented Mar 24, 2023

Sorry, it's been a bit since I've had a chance to review this and I'm ramping up to help @gulducat work on the other issue you have open around restarts.

I run only a single instance of the EBS CSI controller as mentioned previously. When I drain a node that runs this controller there's temporary unavailability meaning any volumes for jobs running on that box couldn't be released at that point until the controller is back (should be back within a minute on another node though).

That's expected behavior and there's very little Nomad can do about this. The EBS CSI driver is frankly buggy if it can't run more than one instance at a time. We handle this as gracefully as we can by having the claim GC process. We call this out in the CSI Plugins concept docs:

Controller plugins can create and attach volumes anywhere they can communicate with the storage provider's API, so they can usually be run as service jobs. You should always run more than one controller plugin allocation for high availability.

@ygersie
Contributor Author

ygersie commented Mar 24, 2023

@tgross the EBS controller can't run multiple instances because it assigns a device name. If you run a single controller that's not an issue, as the device name assignment is guarded by a mutex, but when you run multiple controllers there is a problem: you end up with conflicting device names when starting multiple tasks at once for the same node.

I understand there's now a GC happening as a trade-off, but the issue is that it wasn't working. See the logs I posted above:

    2023-03-20T11:42:24.564Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=monitoring volume_id=vmstorage-a[9]
  error=
  | 1 error occurred:
  | \t* missing external node ID: Unknown node: ba33df3d-a1cc-6db3-9986-ede682b6c94c
  |

The claims were never released. First there was an issue where the Nomad leader wasn't logging anything; then I decided to force a leader failover by shutting down the active leader, and the new Nomad server started spamming the logs with the above message. The volumes_watcher routine wasn't able to release the claim.

@tgross
Member

tgross commented Mar 28, 2023

the EBS controller can't run multiple instances because it assigns a devicename. If you run a single controller that's not an issue as the devicename assignment is guarded by a mutex but when you run multiple controllers there is a problem as you end up with conflicting devicenames when starting multiple tasks at once for the same node.

That makes sense. Apparently the EBS controller supports some kind of leader election (ref Client Restarts) but I can't find any documentation for it.

I understand there's now a GC happening as a trade-off but the issue is, it wasn't working.

Understood. I suspect that what we've got here is a case where Nomad has multiple RPCs in-flight to the controller for the same volume. We're not supposed to be doing that either (it's a "should" for callers in the spec, not a "must"), but it's clear that at least a few plugins don't tolerate it (even though handling it gracefully is a "must" for them). I've got some thoughts I'm putting together around running volume claim updates through the eval broker, which might help with this whole class of problem.

@tgross
Member

tgross commented Jul 17, 2023

Ok, thanks for your patience all! I'm picking this back up and I did some digging into the EBS plugin's behavior and design. From their design docs:

Chooses the right device name for the volume on the node (more on that below) and issues AttachVolume. TODO: this has complicated idempotency expectations. It cancels previously called ControllerUnpublishVolume that may be still in progress (i.e. AWS is still detaching the volume and Kubernetes now wants the volume to be attached back).

"It's complicated" isn't exactly encouraging. On further investigation I discovered that the EBS plugin does require leader election, and that it leans on k8s-specific sidecar containers to do so (ref controller.yaml#L137). The underlying problem is challenging within the boundaries of the CSI spec, but now we have to compensate for the unfortunate choice of the plugins to lean on k8s-specific APIs.

My next pass at this will be twofold:

  • Serialize the controller publish/unpublish RPCs on a per-plugin basis (see the sketch after this list). This is frankly not ideal, as it's going to impact allocation startup performance quite a bit for jobs that spin up a lot of allocations using the same plugin but different volumes. We might be able to get away with having the serialization key be the plugin ID + client ID, which would be more fine-grained, but I need to do more testing.
  • Use a simple bully selection (lowest client ID) so that we generally send these RPCs to the same controller instance each time, unless it's unavailable.

I have a working branch at csi-concurrent-requests, but haven't done more than some simple smoke testing of that yet. I'm not entirely convinced that concurrent requests are the only problem we're facing here. There may also be some interaction with #17756 and I want to pull on that thread as well. Will update here (and potentially in #17756) as I know more.
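To make the first idea above concrete, here is a minimal sketch of per-plugin serialization of destructive controller RPCs. This is only the shape of the idea, not Nomad's actual implementation; whether the key should be the plugin ID alone or plugin ID + client ID is the open question from the list above.

    // csiSerializer serializes destructive CSI controller RPCs per key, so a
    // plugin (or plugin+client pair) only ever sees one in-flight request.
    package csi

    import "sync"

    type csiSerializer struct {
        mu    sync.Mutex
        locks map[string]*sync.Mutex // one lock per serialization key
    }

    func newCSISerializer() *csiSerializer {
        return &csiSerializer{locks: map[string]*sync.Mutex{}}
    }

    // lockFor returns the mutex for a given key, creating it on first use.
    func (s *csiSerializer) lockFor(key string) *sync.Mutex {
        s.mu.Lock()
        defer s.mu.Unlock()
        if _, ok := s.locks[key]; !ok {
            s.locks[key] = &sync.Mutex{}
        }
        return s.locks[key]
    }

    // do runs fn while holding the per-key lock. Coarse key: pluginID.
    // Finer-grained alternative mentioned above: pluginID + "/" + clientID.
    func (s *csiSerializer) do(pluginID string, fn func() error) error {
        l := s.lockFor(pluginID)
        l.Lock()
        defer l.Unlock()
        return fn() // e.g. the ControllerPublish/ControllerUnpublish RPC call
    }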

tgross added a commit that referenced this issue Jul 19, 2023
The CSI specification says that we "SHOULD" send no more than one in-flight
request per *volume* at a time, with an allowance for losing state
(ex. leadership transitions) which the plugins "SHOULD" handle gracefully. We
mostly successfully serialize node and controller RPCs for the same volume,
except when Nomad clients are lost.
(See also container-storage-interface/spec#512)

These concurrency requirements in the spec fall short because Storage Provider
APIs aren't necessarily safe to call concurrently on the same host. For example,
concurrently attaching AWS EBS volumes to an EC2 instance results in a race for
device names, which results in failure to attach and confused results when
releasing claims. So in practice many CSI plugins rely on k8s-specific sidecars
for serializing storage provider API calls globally. As a result, we have to be
much more conservative about concurrency in Nomad than the spec allows.

This changeset includes two major changes to fix this:
* Add a serializer method to the CSI volume RPC handler. When the
  RPC handler makes a destructive CSI Controller RPC, we send the RPC thru this
  serializer and only one RPC is sent at a time. Any other RPCs in flight will
  block.
* Ensure that requests go to the same controller plugin instance whenever
  possible by sorting by lowest client ID out of the healthy plugin instances.

Fixes: #15415
@tgross
Member

tgross commented Jul 19, 2023

Draft PR is up here: #17996. I've tested it out with AWS EBS and it looks to make things much more stable.

Nomad - Community Issues Triage automation moved this from Needs Roadmapping to Done Jul 20, 2023
tgross added a commit that referenced this issue Jul 20, 2023
The CSI specification says that we "SHOULD" send no more than one in-flight
request per *volume* at a time, with an allowance for losing state
(ex. leadership transitions) which the plugins "SHOULD" handle gracefully. We
mostly successfully serialize node and controller RPCs for the same volume,
except when Nomad clients are lost. (See also
container-storage-interface/spec#512)

These concurrency requirements in the spec fall short because Storage Provider
APIs aren't necessarily safe to call concurrently on the same host even for
_different_ volumes. For example, concurrently attaching AWS EBS volumes to an
EC2 instance results in a race for device names, which results in failure to
attach (because the device name is taken already and the API call fails) and
confused results when releasing claims. So in practice many CSI plugins rely on
k8s-specific sidecars for serializing storage provider API calls globally. As a
result, we have to be much more conservative about concurrency in Nomad than the
spec allows.

This changeset includes four major changes to fix this:
* Add a serializer method to the CSI volume RPC handler. When the RPC handler
  makes a destructive CSI Controller RPC, we send the RPC thru this serializer
  and only one RPC is sent at a time. Any other RPCs in flight will block.
* Ensure that requests go to the same controller plugin instance whenever
  possible by sorting by lowest client ID out of the plugin instances.
* Ensure that requests go to _healthy_ plugin instances only.
* Ensure that requests for controllers can go to a controller on any _live_
  node, not just ones eligible for scheduling (which CSI controllers don't care
  about)

Fixes: #15415
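As an illustration of the instance-selection changes described in the commit message above (a sketch of the idea only, not the code in #17996): filter to healthy controller instances, then sort by client ID so requests consistently prefer the same instance.

    // pickController chooses which controller plugin instance should receive
    // the next controller RPC: healthy instances only, lowest client ID first.
    package csi

    import (
        "errors"
        "sort"
    )

    type controllerInstance struct {
        ClientID string // Nomad client (node) running this controller alloc
        Healthy  bool
    }

    func pickController(instances []controllerInstance) (controllerInstance, error) {
        healthy := make([]controllerInstance, 0, len(instances))
        for _, inst := range instances {
            if inst.Healthy {
                healthy = append(healthy, inst)
            }
        }
        if len(healthy) == 0 {
            return controllerInstance{}, errors.New("no healthy controller plugin instances")
        }
        // Sort by client ID so every caller deterministically prefers the same
        // instance, falling back to the next one only if the first disappears.
        sort.Slice(healthy, func(i, j int) bool {
            return healthy[i].ClientID < healthy[j].ClientID
        })
        return healthy[0], nil
    }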
@tgross
Member

tgross commented Jul 20, 2023

I've just merged #17996 with the fix. It looks like we're going to have a 1.6.1 fairly soon because of another open issue, along with backports, so expect to see this roll out shortly.

@ygersie
Copy link
Contributor Author

ygersie commented Jul 20, 2023

Awesome stuff @tgross, really appreciate you taking another stab at this!

jrasell pushed a commit that referenced this issue Jul 26, 2023