CSI: improve controller RPC reliability #17996

Merged: 2 commits merged into main on Jul 20, 2023

Conversation

@tgross (Member) commented Jul 19, 2023

The CSI specification says that we "SHOULD" send no more than one in-flight request per volume at a time, with an allowance for losing state (ex. leadership transitions) which the plugins "SHOULD" handle gracefully. We mostly successfully serialize node and controller RPCs for the same volume, except when Nomad clients are lost. (See also container-storage-interface/spec#512)

These concurrency requirements in the spec fall short because Storage Provider APIs aren't necessarily safe to call concurrently on the same host even for different volumes. For example, concurrently attaching AWS EBS volumes to an EC2 instance results in a race for device names, which results in failure to attach (because the device name is taken already and the API call fails) and confused results when releasing claims. So in practice many CSI plugins rely on k8s-specific sidecars for serializing storage provider API calls globally. As a result, we have to be much more conservative about concurrency in Nomad than the spec allows.

This changeset includes four major changes to fix this:

  • Add a serializer method to the CSI volume RPC handler. When the RPC handler makes a destructive CSI Controller RPC, the RPC is sent through this serializer so that only one controller RPC per plugin is in flight at a time; any other RPCs will block. (A minimal sketch of this serializer appears just below.)
  • Ensure that requests go to the same controller plugin instance whenever possible, by sorting by lowest client ID out of the plugin instances.
  • Ensure that requests go to healthy plugin instances only.
  • Ensure that requests for controllers can go to a controller on any live node, not just nodes eligible for scheduling (which CSI controllers don't care about).

Fixes: #15415
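
To make the first change concrete, here is a minimal, self-contained Go sketch of the per-plugin serializer. This is an illustration, not the shipped code: it uses a stand-alone type rather than the real CSI volume RPC handler, and the field names only echo the volumeControllerLock / volumeControllerFutures names quoted in the review threads below.

package csi

import (
	"context"
	"errors"
	"sync"
)

// errNoLeader stands in for Nomad's structs.ErrNoLeader in this sketch.
var errNoLeader = errors.New("no cluster leader")

// controllerSerializer sketches the per-plugin serializer: a mutex guards a
// map of pluginID -> context for the in-flight controller RPC, so only one
// destructive controller RPC runs against a given plugin at a time.
type controllerSerializer struct {
	lock        sync.Mutex
	futures     map[string]context.Context
	shutdownCtx context.Context
}

func (s *controllerSerializer) serializedControllerRPC(pluginID string, fn func() error) error {
	for {
		s.lock.Lock()
		if future, ok := s.futures[pluginID]; ok {
			// another RPC for this plugin is in flight: release the
			// lock, wait for it to finish (or for shutdown), then retry
			s.lock.Unlock()
			select {
			case <-future.Done():
				continue
			case <-s.shutdownCtx.Done():
				return errNoLeader
			}
		}

		// no RPC in flight: claim the slot with a cancellable context
		future, futureDone := context.WithCancel(s.shutdownCtx)
		s.futures[pluginID] = future
		s.lock.Unlock()

		err := fn()

		// re-take the lock and cancel the context *before* clearing it
		// from the map, so no waiter can install a new future until this
		// one is fully closed
		s.lock.Lock()
		futureDone()
		delete(s.futures, pluginID)
		s.lock.Unlock()
		return err
	}
}

In the real implementation this state hangs off the server struct (the v.srv.volumeControllerLock and v.srv.volumeControllerFutures fields quoted in the review below) rather than a separate type, but the locking pattern is the same.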

nomad/client_csi_endpoint.go (outdated review thread, resolved)
@@ -1971,3 +1973,49 @@ func TestCSI_RPCVolumeAndPluginLookup(t *testing.T) {
require.Nil(t, vol)
require.EqualError(t, err, fmt.Sprintf("volume not found: %s", id2))
}

func TestCSI_SerializedControllerRPC(t *testing.T) {
tgross (Member, Author):

I've done some end-to-end testing of this new function with reschedules but I want to follow-up with some additional testing of drains, broken plugin configurations, etc., to make sure that everything is recoverable even in the corner cases.

@gulducat (Member) left a comment:

Looks good! I think it'll cover a lot of our bases, and should also make it easier to debug any future issues.

Mainly though I'm not clear on some math in the test (which makes me feel a little not-smart, since it seems like basic multiplication?), so either something to fix there or I have a facepalm in my future.

Comment on lines +592 to +593
// serializedControllerRPC ensures we're only sending a single controller RPC to
// a given plugin if the RPC can cause conflicting state changes.
gulducat (Member):

I think we're good, but sharing here some of my thought process: For this to work, all such RPCs need to be run on a single server to block in memory. Looking at what I think are the relevant RPCs, it seems like they are all(?) v.srv.forward()ed (to the leader). I suppose this is what you meant by "with an allowance for losing state (ex. leadership transitions)" - i.e. this serialization could be broken by switching leaders? and we can't reasonably avoid that caveat.

tgross (Member, Author):

Right, that's exactly it.

// that we can ensure we've cleared it from the map before allowing
// anyone else to take the lock and write a new one
v.srv.volumeControllerLock.Lock()
futureDone()
gulducat (Member):

I find myself wanting simpler map thread-safety, and a sync.Map seems just about right instead of a manual mutex, but its auto-locking-for-you behavior might not be compatible with your concern here?

tgross (Member, Author):

Yeah we're locking around both access to the map and the construction/cancellation of contexts. With a sync.Map that would end up looking like this:

// spin up a context goroutine whether we need it or not!
future, futureDone := context.WithCancel(v.srv.shutdownCtx)
defer futureDone()

for {
	ok := v.srv.volumeControllerFutures.CompareAndSwap(pluginID, nil, future)
	if ok {
		// success path
		err := fn()
		futureDone()
		v.srv.volumeControllerFutures.CompareAndDelete(pluginID, future)
		return err
	} else {
		// wait path

		// racy! what if it's been deleted?
		waitOn, ok := v.srv.volumeControllerFutures.Load(pluginID)
		if !ok {
			continue // lost the race with the delete, start again
		}
		select {
		case <-waitOn.(context.Context).Done(): // sync.Map stores any, so we have to assert the type
			continue // other goroutine is done
		case <-v.srv.shutdownCh:
			return structs.ErrNoLeader
		}
	}
}

There are a few reasons why this isn't the best solution:

  • A sync.Map doesn't let us compare-and-swap unless we've already constructed the object we're intending to write. So we end up awkwardly constructing the context outside the loop.
  • In the "success path" we need to cancel the context before deleting it from the map, otherwise someone else can start their future before we've closed our context (and any downstream contexts we add in the future). But doing so will unblock the "wait path" before the CompareAndSwap will succeed, potentially leading to that goroutine trying to grab the CompareAndSwap multiple times before it can succeed, which is wasteful.
  • CompareAndSwap doesn't return the existing value, so we need to Load it separately, and it may have been deleted in the meanwhile. This adds more contention around that key.
  • The CompareAndSwap operation returns false when it hasn't been previously set! We could probably hack around that by creating a dummy context that's always treated as "unused" but then we've spun up a goroutine per plugin just to hold that spot.

gulducat (Member):

Nice, thanks for the mock-up and detailed explanation!

// volumewatcher will iterate all the claims at startup to
// detect this and mop up any claims in the NodeDetached state
// (volume GC will run periodically as well)
return structs.ErrNoLeader
gulducat (Member):

Worth wrapping this with any more info about where it came from, or does that happen upstream of here?

@tgross (Member, Author) Jul 20, 2023:

The caller's logger will wrap it sufficiently. "No leader" is fairly foundational and there's nothing we can meaningfully add.

wg.Wait()

// plugin1 RPCs should block each other
must.GreaterEq(t, 150*time.Millisecond, totals["plugin1"])
gulducat (Member):

I see the test passes, but how does 2 * 50 milliseconds add up to > 150 ? "plugin1" should only be blocked behind itself, so "plugin2"'s 50ms shouldn't be included, right?

tgross (Member, Author):

For plugin1: the goroutine on line 2002 runs for 50ms. The goroutine on line 2004 waits at least 50ms for the other to finish and then runs for 50ms, for a total of 100ms. We're summing up the elapsed time of all the goroutines for a given plugin, because that way we're asserting the wait time of the method under test from the perspective of each goroutine and not just the elapsed wall clock time. So it's a minimum of 50ms + 100ms.

The maximum value of 200ms assumes that any overhead is (far) less than 50ms, so it also asserts that we're not accidentally waiting behind plugin2.
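
As a self-contained illustration of that arithmetic (not the actual TestCSI_SerializedControllerRPC), the sketch below stands in for the serializer with a per-plugin mutex and measures each goroutine's completion time from a common start, then sums the totals per plugin; the 150ms/200ms bounds mirror the assertion quoted above.

package csi

import (
	"sync"
	"testing"
	"time"
)

// TestSerializedTimingSketch illustrates the arithmetic discussed above:
// the two serialized plugin1 goroutines finish at roughly 50ms and 100ms,
// so their per-plugin total is at least 150ms, while plugin2 never blocks
// behind plugin1.
func TestSerializedTimingSketch(t *testing.T) {
	locks := map[string]*sync.Mutex{"plugin1": {}, "plugin2": {}}

	var mu sync.Mutex // guards totals
	totals := map[string]time.Duration{}

	start := time.Now()
	var wg sync.WaitGroup
	run := func(pluginID string) {
		defer wg.Done()

		// stand-in for a serialized destructive controller RPC
		locks[pluginID].Lock()
		time.Sleep(50 * time.Millisecond)
		locks[pluginID].Unlock()

		mu.Lock()
		totals[pluginID] += time.Since(start)
		mu.Unlock()
	}

	wg.Add(3)
	go run("plugin1") // finishes around 50ms
	go run("plugin1") // waits behind the other, finishes around 100ms
	go run("plugin2") // around 50ms, independent of plugin1
	wg.Wait()

	// plugin1's goroutines sum to at least 150ms; staying under 200ms
	// shows they never also waited behind plugin2's 50ms
	if got := totals["plugin1"]; got < 150*time.Millisecond || got >= 200*time.Millisecond {
		t.Fatalf("expected plugin1 total in [150ms, 200ms), got %v", got)
	}
}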

@tgross marked this pull request as ready for review July 20, 2023 18:47

@tgross (Member, Author) commented Jul 20, 2023:

In addition to the tests added above, I've done some end-to-end testing here:

  • client restart on a node that has a job that mounts a volume and a controller + node plugin
  • rescheduling a job that mounts a volume
  • draining a node with a job that mounts a volume and a controller plugin on it

@tgross added the labels backport/1.4.x, backport/1.5.x, and backport/1.6.x on Jul 20, 2023
@tgross changed the title from "CSI: serialize controller RPCs per plugin" to "CSI: improve controller RPC reliability" on Jul 20, 2023
@tgross merged commit f529124 into main on Jul 20, 2023
31 checks passed
@tgross deleted the csi-concurrent-requests branch on July 20, 2023 18:51
tgross added two commits that referenced this pull request Jul 20, 2023, each with the same message:

The backport for #17996 was automatically merged by BPA but the build did not
work due to missing imports and changes to the signatures of some of our RPC
structs. Fix the backport.
jrasell pushed a commit that referenced this pull request Jul 26, 2023 (the commit message repeats the PR description above).
Labels: backport/1.4.x (backport to 1.4.x release line), backport/1.5.x (backport to 1.5.x release line), backport/1.6.x (backport to 1.6.x release line), hcc/cst (Admin - internal), theme/storage, type/bug

Linked issue: CSI: volume has exhausted its available writer claims

2 participants