
Parallelize attach operations across different nodes for volumes that allow multi-attach #88678

Merged
1 commit merged on Mar 6, 2020

Conversation

@verult (Contributor) commented Feb 29, 2020

What type of PR is this?
/kind bug

What this PR does / why we need it: This is the improved version of #87258 that fixes the problem discovered in #88355. The differences from the previous PR are:

  • The fix: updated getOperations() and deleteOperations() in the second commit (see the sketch after this list).
  • Includes a unit test that reproduces the scenario which likely caused #88355 (storage e2e tests leaking PDs due to #87258) and verifies the fix.
  • Changed the operationKey string representation to use %s instead of %q.
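
For orientation, here is a minimal, self-contained sketch of the exact-match lookup that the updated getOperations() performs. The key field names follow the diff snippet quoted later in the review; the standalone types and the helper name are illustrative only, not the PR's actual code.

package main

import "fmt"

// Illustrative stand-ins for v1.UniqueVolumeName, types.UniquePodName, and types.NodeName.
type operationKey struct {
	volumeName string
	podName    string
	nodeName   string
}

type operation struct {
	key operationKey
}

// getOperationIndex requires an exact match on all three key fields, so an
// operation on the same volume but a different node is no longer treated as
// the same pending operation.
func getOperationIndex(ops []operation, key operationKey) (int, error) {
	for i, op := range ops {
		if op.key.volumeName == key.volumeName &&
			op.key.podName == key.podName &&
			op.key.nodeName == key.nodeName {
			return i, nil
		}
	}
	return -1, fmt.Errorf("operation %+v not found", key)
}

func main() {
	ops := []operation{
		{key: operationKey{volumeName: "vol-1", nodeName: "node-a"}},
		{key: operationKey{volumeName: "vol-1", nodeName: "node-b"}},
	}
	fmt.Println(getOperationIndex(ops, operationKey{volumeName: "vol-1", nodeName: "node-b"})) // 1 <nil>
}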

Which issue(s) this PR fixes:

Fixes #73972, and partially addresses #88355

Special notes for your reviewer:
Special notes from #87258:
"""
There are 2 commits. The first one is a refactor of the operation key structure in NestedPendingOperations. The second one is the main change.

I'm not a fan of the method signature of Run() and IsOperationPending() (especially the giant comment block above Run()). I'd like to refactor it to use a single generic OperationKey type, but since it could be a controversial change I'm going to have it in a separate PR.
"""

Does this PR introduce a user-facing change?:

For volumes that allow attaches across multiple nodes, attach and detach operations across different nodes are now executed in parallel.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

/sig storage
/assign @jingxu97 @misterikkit @saad-ali

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Feb 29, 2020
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. sig/storage Categorizes an issue or PR as relevant to SIG Storage. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 29, 2020
@verult (Contributor, Author) commented Feb 29, 2020

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. area/kubelet sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 29, 2020
@gnufied (Member) commented Feb 29, 2020

/assign

@saad-ali (Member) commented Mar 4, 2020

/milestone v1.18

@k8s-ci-robot k8s-ci-robot added this to the v1.18 milestone Mar 4, 2020
// Assumes lock has been acquired by caller.
volumeName v1.UniqueVolumeName,
podName types.UniquePodName) {

opIndex := -1
@verult (Contributor, Author):

We should be defensive about indexing on a negative number and return if op isn't found. Update coming soon.

@saad-ali (Member) left a comment:

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 5, 2020
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: saad-ali, verult

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 5, 2020
@verult (Contributor, Author) commented Mar 5, 2020

/hold
@misterikkit is in the middle of review. Feel free to remove the hold when you think it's ready. Thanks!

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 5, 2020
// No match, keep searching
continue
}
volumeNameMatch := previousOp.key.volumeName == key.volumeName
Contributor:

If volumeNameMatch is false, we could skip the remaining checks and continue with the next iteration of the loop.

@verult (Contributor, Author):

The next few condition evaluations should be pretty fast. For the sake of clarity, I have a slight preference for the current form.

// - volumeName exists, podName exists, nodeName empty
// This key conflicts with:
// - the same volumeName and podName
// - the same volumeName, but no podName
Contributor:

Maybe use "empty name" instead of "no name" in all places.
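
Illustrative only, reusing the operationKey struct from the sketch in the PR description: one way to express the conflict rules documented in the code comment above. The helper name is hypothetical, and the fully empty key, which conflicts with nothing, would be special-cased before this check.

// Two keys conflict when they name the same volume and, for both the pod and
// node dimensions, the values either match or at least one side is empty
// (an empty field acts as a wildcard for that dimension).
func keysConflict(a, b operationKey) bool {
	if a.volumeName != b.volumeName {
		return false
	}
	podsConflict := a.podName == b.podName || a.podName == "" || b.podName == ""
	nodesConflict := a.nodeName == b.nodeName || a.nodeName == "" || b.nodeName == ""
	return podsConflict && nodesConflict
}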

klog.V(10).Infof("Operation for volume %q is already running. Can't start detach for %q", attachedVolume.VolumeName, attachedVolume.NodeName)
continue
if util.IsMultiAttachForbidden(attachedVolume.VolumeSpec) {
if rc.attacherDetacher.IsOperationPending(attachedVolume.VolumeName, "" /* podName */, "" /* nodeName */) {
Contributor:

A small thing: change "" to nestedpendingoperations.EmptyNodeName here and in similar places?

@jingxu97 (Contributor), Mar 5, 2020:

Also, to save some code, maybe define podName and nodeName based on util.IsMultiAttachForbidden and then share the same code. The same applies below during attach.

@verult (Contributor, Author), Mar 5, 2020:

> small thing, change "" to nestedpendingoperations.EmptyNodeName and similar places?

The reconciler package actually doesn't import the nestedpendingoperations package. By referring to nestedpendingoperations.EmptyNodeName we might be breaking the abstraction boundary, though ideally the Empty*Name constants should really be moved to a different place. On the other hand, I'm doing a refactor that will largely remove the need to specify empty arguments, so IMO it's OK to leave them here for now.

> also to save some code, maybe define podName and nodeName based on util.IsMultiAttachForbidden, and then share the same code. Same thing applied below during attach.

The log messages are slightly different between the two cases to make it more obvious which branch was followed.

Contributor:

It is a level 10 log. Also, I don't see why that difference matters; you log the volume name and node name, isn't that enough?

@verult (Contributor, Author):

IMO it's still good to keep it because the log messages say slightly different things to more accurately describe what kind of operation is pending.
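
A rough, purely illustrative sketch of the consolidation suggested above: derive the pending-operation key once from whether multi-attach is forbidden, then share a single IsOperationPending check. The function below is hypothetical and uses plain strings in place of the real typed names; the PR keeps the two branches separate so the log messages can differ.

// pendingOpKey picks the key the reconciler would check before detaching.
// Single-attach volumes serialize on the volume alone; multi-attach volumes
// serialize per (volume, node) pair, which is what lets detaches on
// different nodes run in parallel.
func pendingOpKey(volumeName, nodeName string, multiAttachForbidden bool) (vol, pod, node string) {
	if multiAttachForbidden {
		return volumeName, "", ""
	}
	return volumeName, "", nodeName
}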

op.podName == podName {
if op.key.volumeName == key.volumeName &&
op.key.podName == key.podName &&
op.key.nodeName == key.nodeName {
return uint(i), nil
Contributor:

Is the fix to https://github.com/kubernetes/kubernetes/pull/87258/files here? I just want to understand how this causes PD leaks. Do you have a walkthrough of how it happens?

@verult (Contributor, Author):

Added a description here: #88355 (comment)

Let me know if anything is unclear, happy to chat in person if you'd like!

@@ -182,9 +146,16 @@ func (rc *reconciler) reconcile() {
// This check must be done before we do any other checks, as otherwise the other checks

This comment is now out of date. Could you explain the new logic here and/or add comments in the if/else blocks?

// - volumeName empty, podName empty, nodeName empty
// This key does not have any conflicting keys.
// - volumeName exists, podName empty, nodeName empty
// This key conflicts with all other keys with the same volumeName.

I think the wording "conflicting keys" is confusing here. Can we talk instead about whether two keys match?

Comment on lines 53 to 55
// Run adds the concatenation of volumeName and one of podName or nodeName to
// the list of running operations and spawns a new go routine to execute
// OperationFunc inside generatedOperations.

I know you are updating the existing wording, but maybe we can explain this better. Run's job is to run the given operation in a goroutine. If a goroutine is already running with a matching key, then Run returns an error. Callers should not care about whether the inputs are concatenated.

EDIT: Since you are planning to follow-up with a refactor, we can defer comment re-writes to that PR.

@verult (Contributor, Author):

ACK, yeah, some of this will change naturally when the key is made more generic.
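
To make the Run contract described above concrete, here is a simplified, illustrative sketch. The type and helper names are hypothetical, it reuses operationKey and keysConflict from the earlier sketches plus the "fmt" and "sync" imports, and the real implementation additionally tracks completion state and exponential backoff.

type fakePendingOps struct {
	mu   sync.Mutex
	keys []operationKey
}

// Run starts op in a goroutine unless an operation with a conflicting key is
// already running, in which case it returns an error and starts nothing.
func (grm *fakePendingOps) Run(key operationKey, op func() error) error {
	grm.mu.Lock()
	defer grm.mu.Unlock()
	for _, existing := range grm.keys {
		if keysConflict(existing, key) {
			return fmt.Errorf("an operation with key %+v already exists", key)
		}
	}
	grm.keys = append(grm.keys, key)
	go func() {
		defer grm.remove(key)
		_ = op() // in the real code the error feeds the backoff state
	}()
	return nil
}

func (grm *fakePendingOps) remove(key operationKey) {
	grm.mu.Lock()
	defer grm.mu.Unlock()
	for i, k := range grm.keys {
		if k == key {
			grm.keys = append(grm.keys[:i], grm.keys[i+1:]...)
			return
		}
	}
}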


// volumeName, podName, and nodeName collectively form the operation key.
// The following forms of operation keys are supported:
// - volumeName empty, podName empty, nodeName empty

is this a valid input?

@verult (Contributor, Author):

// - the same volumeName and nodeName
// - the same volumeName but no nodeName

// If an operation with the same operationName and a conflicting key exists,

operationName does not seem to be a type or concept that this API exposes.

@verult (Contributor, Author):

Updated, and also made the whole comment block more specific.


// Arrange
grm := NewNestedPendingOperations(false /* exponentialBackOffOnError */)
operation1DoneCh := make(chan interface{}, 0 /* bufferSize */)

Do you leak a goroutine by not closing or writing to this channel? Even for a test, I think it matters a little.

@verult (Contributor, Author):

There are a lot of spots to change because Fatalf() changes control flow. I made a note to include it in the refactor, but please bring it up if I forget.

Fatalf() does not prevent a defer, but let's leave this as is since a lot will change in the refactor.
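
A minimal sketch of the deferred-close pattern under discussion, assuming only the standard testing package and deliberately not using the real OperationFunc signature. t.Fatalf ends the test via runtime.Goexit, so deferred calls still run and the blocked goroutine is released.

func TestExample(t *testing.T) {
	operationDoneCh := make(chan interface{})
	defer close(operationDoneCh) // still runs if a t.Fatalf below ends the test early

	go func() {
		<-operationDoneCh // the fake operation blocks here until the channel is closed
	}()

	// ... exercise the code under test; the goroutine above no longer leaks on
	// failure, because the deferred close unblocks it.
}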


errZ := grm.Run(opZVolumeName, "" /* podName */, "" /* nodeName */, volumetypes.GeneratedOperations{OperationFunc: operationZ})
if errZ != nil {
t.Fatalf("NestedPendingOperations failed. Expected: <no error> Actual: <%v>", errZ)

If I were reading a test log, it would be difficult to tell which of the three Fatalf logs was hit here. Could you replace NestedPendingOperations with operationZ, operation1, etc. in the failure messages?

Do not need to fix the whole file - we can save that for when we refactor.
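
For example, the assertion from the snippet above could read:

if errZ != nil {
	t.Fatalf("operationZ failed. Expected: <no error> Actual: <%v>", errZ)
}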

}
}

func TestOperationExecutor_DetachSingleNodeVolumeConcurrentlyFromDifferentNodes(t *testing.T) {

How does one attach a single-node volume to multiple nodes in the first place, I wonder...?

@verult (Contributor, Author):

I can't think of anything either :) I over-tested a bit here, will remove.

// false, it is not guaranteed that multi-attach is actually supported by the volume type and we must rely on the
// attacher to fail fast in such cases.
// Please see https://github.com/kubernetes/kubernetes/issues/40669 and https://github.com/kubernetes/kubernetes/pull/40148#discussion_r98055047
func IsMultiAttachForbidden(volumeSpec *volume.Spec) bool {

Generally function names are easier to read when worded in the "positive". Currently we have code that reads like, "is it forbidden" and "is it not forbidden", and the double negative is an extra cognitive load.

// Check for persistent volume types which do not fail when trying to multi-attach
if len(volumeSpec.PersistentVolume.Spec.AccessModes) == 0 {
// No access mode specified so we don't know for sure. Let the attacher fail if needed
return false

There are multiple "uncertain" returns from this function, and they don't all return the same value. Why?

@verult (Contributor, Author):

Which inconsistency are you referring to? The true returns are (1) an inline volume with Azure/Cinder, or (2) a PV whose only access mode is ReadWriteOnce.
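
Purely illustrative, following the access-mode rule described in the reply above (the function name is hypothetical and not part of the PR; it assumes the k8s.io/api/core/v1 package imported as v1):

// pvForbidsMultiAttach reports whether a PersistentVolume's access modes rule
// out attaching it to more than one node.
func pvForbidsMultiAttach(pv *v1.PersistentVolume) bool {
	modes := pv.Spec.AccessModes
	if len(modes) == 0 {
		return false // unknown; let the attacher fail if needed
	}
	for _, mode := range modes {
		if mode != v1.ReadWriteOnce {
			return false // any shared access mode means multi-attach may be allowed
		}
	}
	return true // only ReadWriteOnce is listed, so multi-attach is forbidden
}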

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 6, 2020
@misterikkit left a comment:

My only nit is to use t.Run instead of logging the test ID. Other than that, I'd say we can squash commits and get this merged.

@@ -141,7 +151,7 @@ func (grm *nestedPendingOperations) Run(
return NewAlreadyExistsError(opKey)
}

backOffErr := previousOp.expBackoff.SafeToRetry(opKey.String())
backOffErr := previousOp.expBackoff.SafeToRetry(fmt.Sprintf("%+v", opKey))

OMG can we include the exponential backoff package in the refactor? I just read why you even need to pass this string in.

@@ -737,6 +701,9 @@ func Test_NestedPendingOperations_Positive_Issue_88355(t *testing.T) {
// delay after an operation is signaled to finish to ensure it actually
// finishes before running the next operation.
delay = 50 * time.Millisecond

// Replicates the default AttachDetachController reconcile period
reconcilerPeriod = 100 * time.Millisecond

If the test actually depends on this time value matching one in the code under test, then it smells funny to me. However, I think it would be better to address this during the refactor.

EmptyNodeName)
}
if test.expectPass {
testConcurrentOperationsPositive(t, test.testId,

It is better to use t.Run() than to pass the testId into the helper function. e.g.

t.Run(fmt.Sprintf("test %d", test.testId), func(t *testing.T) {
    if test.expectPass...
})

@misterikkit left a comment:

/lgtm

op1ContinueCh <- true
time.Sleep(delay)

for {

That seems like the right behavior to me. Just wanted to make sure that the failure mode did not hang tests.



@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 6, 2020
@verult (Contributor, Author) commented Mar 6, 2020

Weird, verify is shown as failed even though all the tests passed...

/test pull-kubernetes-verify
/test pull-kubernetes-e2e-gce-csi-serial

@verult (Contributor, Author) commented Mar 6, 2020

/hold cancel

@mattjmcnaughton (Contributor):

@verult do you mind sharing a little more about why this PR should be cherry-picked to previous releases? Is it to address the test flakes that you mentioned this change will help with?

Thanks :)

@gnufied (Member) commented Mar 19, 2020

I tend to agree with @mattjmcnaughton on this one. Why do we have to cherry-pick this fix? The PR does not fix something that was broken; it speeds things up (nice to have).

@verult (Contributor, Author) commented Mar 25, 2020

@mattjmcnaughton @gnufied we are seeing many use cases where a user starts up many pods across many nodes, all reading from the same volume. This is especially common among batch workloads (AI/ML use cases, for example). For many users, taking an hour for something that should take no more than a few minutes (at most) is unacceptable and essentially renders this case unusable. Some users are considering Mesos instead, since Mesos handles this smoothly and all attaches happen in parallel.

@mattjmcnaughton (Contributor):

> (quoting @verult's comment above)

Thanks for the extra context :) I defer to whoever is in charge of cherry-picks :)

@tpepper (Member) commented Apr 3, 2020

> (quoting the exchange above)

This feels quite impactful and in my mind could merit cherry-picking. The three cherry-picks, though, first need lgtm and approve from the appropriate OWNERS. Then the @kubernetes/patch-release-team will approve.

@mattjmcnaughton (Contributor):

Thanks for the additional context @tpepper!

@mattjmcnaughton (Contributor):

cc @saad-ali I assigned you to all of the cherry-picks, as you were the approver of the original diff. Thanks :)

k8s-ci-robot added a commit that referenced this pull request Apr 9, 2020
…-upstream-release-1.17

Automated cherry pick of #88678: Parallelize attach operations across different nodes for
k8s-ci-robot added a commit that referenced this pull request Apr 9, 2020
…-upstream-release-1.16

Automated cherry pick of #88678: Parallelize attach operations across different nodes for
k8s-ci-robot added a commit that referenced this pull request Apr 10, 2020
…-upstream-release-1.15

Automated cherry pick of #88678: Parallelize attach operations across different nodes for
Development

Successfully merging this pull request may close these issues.

Slow volume attach for single ReadOnlyMany volume across multiple nodes
9 participants