Parallelize attach operations across different nodes for volumes that allow multi-attach #88678
Conversation
/priority important-longterm
/assign
/milestone v1.18
```go
// Assumes lock has been acquired by caller.
volumeName v1.UniqueVolumeName,
podName types.UniquePodName) {

opIndex := -1
```
We should be defensive about indexing with a negative number and return if the op isn't found. Update coming soon.
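A minimal sketch of the defensive check being suggested; `getOperationIndex` and the `operations` slice are illustrative stand-ins, not the actual nestedpendingoperations code:

```go
package main

import "fmt"

type operationKey struct {
	volumeName, podName, nodeName string
}

// getOperationIndex returns the index of the operation matching key,
// or -1 if no operation is found (illustrative stand-in for the real lookup).
func getOperationIndex(operations []operationKey, key operationKey) int {
	for i, op := range operations {
		if op == key {
			return i
		}
	}
	return -1
}

func main() {
	ops := []operationKey{{volumeName: "vol1", podName: "pod1"}}
	opIndex := getOperationIndex(ops, operationKey{volumeName: "vol2"})
	// Be defensive: return early instead of indexing the slice with -1.
	if opIndex < 0 {
		fmt.Println("operation not found; returning without deleting")
		return
	}
	fmt.Println("found operation at index", opIndex)
}
```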
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: saad-ali, verult. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/hold
```go
// No match, keep searching
continue
}
volumeNameMatch := previousOp.key.volumeName == key.volumeName
```
If volumeNameMatch is false, we could skip it and continue with the next loop iteration.
The next few condition evaluations should be pretty fast. For the sake of clarity I have a slight preference for the current form.
```go
// - volumeName exists, podName exists, nodeName empty
// This key conflicts with:
// - the same volumeName and podName
// - the same volumeName, but no podName
```
maybe use empty name instead of "no name" in all places.
```go
klog.V(10).Infof("Operation for volume %q is already running. Can't start detach for %q", attachedVolume.VolumeName, attachedVolume.NodeName)
continue
if util.IsMultiAttachForbidden(attachedVolume.VolumeSpec) {
	if rc.attacherDetacher.IsOperationPending(attachedVolume.VolumeName, "" /* podName */, "" /* nodeName */) {
```
a small thing, change "" to nestedpendingoperations.EmptyNodeName and similar places?
also to save some code, maybe define podName and nodeName based on util.IsMultiAttachForbidden, and then share the same code. Same thing applied below during attach.
> small thing, change "" to nestedpendingoperations.EmptyNodeName and similar places?

The reconciler package actually doesn't import the nestedpendingoperations package. By referring to nestedpendingoperations.EmptyNodeName we might be breaking the abstraction boundary, though ideally Empty*Names should really be moved to a different place. On the other hand, I'm doing a refactor that will largely remove the need to specify empty arguments, so IMO it's OK to leave them here for now.

> also to save some code, maybe define podName and nodeName based on util.IsMultiAttachForbidden, and then share the same code. Same thing applied below during attach.

The log messages are slightly different between the two cases to make it more obvious which branch was followed.
It is a level 10 log, and I don't see why that difference matters. You log the volumeName and nodeName; isn't that enough?
IMO it's still good to keep it because the log messages say slightly different things to more accurately describe what kind of operation is pending.
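The shared-code refactor suggested above could take roughly this shape; `pendingOpKey` is a hypothetical helper, not a function in the PR, and `util.IsMultiAttachForbidden` would supply the boolean:

```go
package main

import "fmt"

// pendingOpKey computes the pending-operation key parts once, based on
// whether multi-attach is forbidden, so both branches can share a single
// IsOperationPending / Run call site. (Illustrative sketch only.)
func pendingOpKey(nodeName string, multiAttachForbidden bool) (pod, node string) {
	if multiAttachForbidden {
		// Serialize on the volume alone: empty podName and empty nodeName.
		return "", ""
	}
	// Multi-attach allowed: scope the pending operation to this node.
	return "", nodeName
}

func main() {
	pod, node := pendingOpKey("node1", true)
	fmt.Printf("forbidden: podName=%q nodeName=%q\n", pod, node)
	pod, node = pendingOpKey("node1", false)
	fmt.Printf("allowed: podName=%q nodeName=%q\n", pod, node)
	// Either way, the caller makes one IsOperationPending/Run call with
	// (volumeName, pod, node) instead of duplicating both branches.
}
```

The trade-off discussed in the thread: sharing the call site loses the ability to emit branch-specific log messages.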
```go
op.podName == podName {
if op.key.volumeName == key.volumeName &&
	op.key.podName == key.podName &&
	op.key.nodeName == key.nodeName {
	return uint(i), nil
```
Is the fix to https://github.com/kubernetes/kubernetes/pull/87258/files here? I just want to understand how this causes PD leaks. Do you have the workflow of how it happens?
Added a description here: #88355 (comment)
Let me know if anything is unclear, happy to chat in person if you'd like!
```go
@@ -182,9 +146,16 @@ func (rc *reconciler) reconcile() {
// This check must be done before we do any other checks, as otherwise the other checks
```
This comment is now out of date. Could you explain the new logic here and/or add comments in the if/else blocks?
```go
// - volumeName empty, podName empty, nodeName empty
// This key does not have any conflicting keys.
// - volumeName exists, podName empty, nodeName empty
// This key conflicts with all other keys with the same volumeName.
```
I think the use of words "conflicting keys" is confusing here. Can we talk instead about whether two keys match?
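The rules quoted in the comment block can be read as a "match" predicate, per the suggested terminology. This is an illustrative reading of the comment, not the actual nestedpendingoperations code:

```go
package main

import "fmt"

// opKey mirrors the three-part operation key described in the comment block.
type opKey struct {
	volumeName, podName, nodeName string
}

// keysMatch reports whether two operation keys match (i.e. "conflict"):
// the volumeNames must be equal, and the podName/nodeName parts match when
// the values are equal or when either side is empty, so an empty part acts
// as a wildcard covering all values.
func keysMatch(a, b opKey) bool {
	if a.volumeName != b.volumeName {
		return false
	}
	podMatch := a.podName == b.podName || a.podName == "" || b.podName == ""
	nodeMatch := a.nodeName == b.nodeName || a.nodeName == "" || b.nodeName == ""
	return podMatch && nodeMatch
}

func main() {
	volOnly := opKey{volumeName: "vol1"}
	volNode := opKey{volumeName: "vol1", nodeName: "node1"}
	// A volume-only key matches any key with the same volumeName.
	fmt.Println(keysMatch(volOnly, volNode)) // true
	// Keys scoped to different nodes do not match.
	fmt.Println(keysMatch(volNode, opKey{volumeName: "vol1", nodeName: "node2"})) // false
}
```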
```go
// Run adds the concatenation of volumeName and one of podName or nodeName to
// the list of running operations and spawns a new go routine to execute
// OperationFunc inside generatedOperations.
```
I know you are updating the existing wording, but maybe we can explain this better. Run's job is to run the given operation in a goroutine. If a goroutine is already running with a matching key, then Run returns an error. Callers should not care about whether the inputs are concatenated.
EDIT: Since you are planning to follow-up with a refactor, we can defer comment re-writes to that PR.
ACK, yeah some of this will be changed naturally when the key is more generic
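The contract described above, stripped of the concatenation detail, boils down to something like this sketch; `pendingOps` and its `Run` method are illustrative, not the real NestedPendingOperations API:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errAlreadyExists = errors.New("operation with matching key already running")

// pendingOps runs each operation in its own goroutine, refusing to start an
// operation while another with a matching key is still running.
type pendingOps struct {
	mu      sync.Mutex
	running map[string]bool
}

// Run starts op in a goroutine, or returns an error if an operation with a
// matching key is already in flight. Callers never see how keys are built.
func (p *pendingOps) Run(key string, op func()) error {
	p.mu.Lock()
	if p.running[key] {
		p.mu.Unlock()
		return errAlreadyExists
	}
	p.running[key] = true
	p.mu.Unlock()
	go func() {
		defer func() {
			p.mu.Lock()
			delete(p.running, key)
			p.mu.Unlock()
		}()
		op()
	}()
	return nil
}

func main() {
	p := &pendingOps{running: map[string]bool{}}
	block := make(chan struct{})
	_ = p.Run("vol1/node1", func() { <-block }) // first operation stays running
	err := p.Run("vol1/node1", func() {})       // matching key: rejected
	fmt.Println("second Run:", err)
	close(block)
}
```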
```go
// volumeName, podName, and nodeName collectively form the operation key.
// The following forms of operation keys are supported:
// - volumeName empty, podName empty, nodeName empty
```
is this a valid input?
```go
// - the same volumeName and nodeName
// - the same volumeName but no nodeName

// If an operation with the same operationName and a conflicting key exists,
```
`operationName` does not seem to be a type or concept that this API exposes.
Updated, and also made the whole comment block to be more specific.
```go
// Arrange
grm := NewNestedPendingOperations(false /* exponentialBackOffOnError */)
operation1DoneCh := make(chan interface{}, 0 /* bufferSize */)
```
Do you leak a goroutine by not closing or writing to this channel? Even for test, I think it matters a little.
It's a lot of spots to change due to the fact that `Fatalf()` changes control flow. I made a note to include it for the refactor, but please bring it up if I forget.
`Fatalf()` does not prevent a defer, but let's leave this as is since a lot will change in the refactor.
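The leak concern and the `defer` answer can be sketched like this; `runOperation` is an illustrative stand-in for the operation the test spawns:

```go
package main

import "fmt"

// runOperation spawns a goroutine that blocks until doneCh is closed or
// written to, mimicking the test operations discussed above. If the test
// exits without touching doneCh, this goroutine leaks.
func runOperation(doneCh chan interface{}) {
	go func() {
		<-doneCh // blocks forever if doneCh is never closed or written to
		fmt.Println("operation released")
	}()
}

func main() {
	doneCh := make(chan interface{})
	// Fatalf still runs deferred calls, so a deferred close releases the
	// goroutine on every exit path, including test failures.
	defer close(doneCh)
	runOperation(doneCh)
}
```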
```go
errZ := grm.Run(opZVolumeName, "" /* podName */, "" /* nodeName */, volumetypes.GeneratedOperations{OperationFunc: operationZ})
if errZ != nil {
	t.Fatalf("NestedPendingOperations failed. Expected: <no error> Actual: <%v>", errZ)
```
If I were reading a test log, it would be difficult to tell which of the three `Fatalf` logs was hit here. Could you replace `NestedPendingOperations` with `operationZ`, `operation1`, etc. in the failure messages? No need to fix the whole file; we can save that for when we refactor.
```go
}
}

func TestOperationExecutor_DetachSingleNodeVolumeConcurrentlyFromDifferentNodes(t *testing.T) {
```
How does one attach a single node volume to multiple nodes in the first place, I wonder...?
I can't think of anything either :) over-tested a bit here, will remove
pkg/volume/util/util.go (outdated)
```go
// false, it is not guaranteed that multi-attach is actually supported by the volume type and we must rely on the
// attacher to fail fast in such cases.
// Please see https://github.com/kubernetes/kubernetes/issues/40669 and https://github.com/kubernetes/kubernetes/pull/40148#discussion_r98055047
func IsMultiAttachForbidden(volumeSpec *volume.Spec) bool {
```
Generally function names are easier to read when worded in the "positive". Currently we have code that reads like, "is it forbidden" and "is it not forbidden", and the double negative is an extra cognitive load.
pkg/volume/util/util.go (outdated)
```go
// Check for persistent volume types which do not fail when trying to multi-attach
if len(volumeSpec.PersistentVolume.Spec.AccessModes) == 0 {
	// No access mode specified so we don't know for sure. Let the attacher fail if needed
	return false
```
There are multiple "uncertain" returns from this function, and they don't all return the same value. Why?
Which inconsistency are you referring to? The true returns are (1) if it's an inline volume with Azure/Cinder, or (2) if it's a PV with only the ReadWriteOnce access mode
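The access-mode branch under discussion reduces to roughly this; a simplified sketch using plain strings rather than the real `v1.PersistentVolumeAccessMode` type, and omitting the inline Azure Disk/Cinder special cases the thread mentions:

```go
package main

import "fmt"

const readWriteOnce = "ReadWriteOnce"

// multiAttachForbidden is a simplified model of the PV branch of
// IsMultiAttachForbidden: forbid multi-attach only when access modes are
// listed and every one of them is ReadWriteOnce. With no access modes we
// can't be sure, so return false and let the attacher fail fast.
func multiAttachForbidden(accessModes []string) bool {
	if len(accessModes) == 0 {
		// No access mode specified, so we don't know for sure.
		return false
	}
	for _, m := range accessModes {
		if m != readWriteOnce {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(multiAttachForbidden([]string{"ReadWriteOnce"}))                 // true
	fmt.Println(multiAttachForbidden([]string{"ReadWriteOnce", "ReadOnlyMany"})) // false
	fmt.Println(multiAttachForbidden(nil))                                       // false
}
```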
My only nit is to use t.Run instead of logging the test ID. Other than that, I'd say we can squash commits and get this merged.
```go
@@ -141,7 +151,7 @@ func (grm *nestedPendingOperations) Run(
	return NewAlreadyExistsError(opKey)
}

backOffErr := previousOp.expBackoff.SafeToRetry(opKey.String())
backOffErr := previousOp.expBackoff.SafeToRetry(fmt.Sprintf("%+v", opKey))
```
OMG can we include the exponential backoff package in the refactor? I just read why you even need to pass this string in.
```go
@@ -737,6 +701,9 @@ func Test_NestedPendingOperations_Positive_Issue_88355(t *testing.T) {
// delay after an operation is signaled to finish to ensure it actually
// finishes before running the next operation.
delay = 50 * time.Millisecond

// Replicates the default AttachDetachController reconcile period
reconcilerPeriod = 100 * time.Millisecond
```
If the test actually depends on this time value matching one in the code under test, then it smells funny to me. However, I think it would be better to address this during the refactor.
```go
EmptyNodeName)
}
if test.expectPass {
	testConcurrentOperationsPositive(t, test.testId,
```
It is better to use `t.Run()` than to pass the testId into the helper function, e.g.

```go
t.Run(fmt.Sprintf("test %d", test.testId), func(t *testing.T) {
	if test.expectPass...
})
```
… allow multi-attach
/lgtm
```go
op1ContinueCh <- true
time.Sleep(delay)

for {
```
That seems like the right behavior to me. Just wanted to make sure that the failure mode did not hang tests.
Weird, verify is shown as failed even though all the tests passed... /test pull-kubernetes-verify

/hold cancel

@verult do you mind sharing a little more about why this PR should be cherry-picked to previous releases? Is it to address the test flakes that you mentioned this change will help with? Thanks :)

I tend to agree with @mattjmcnaughton on this one. Why do we have to cherry-pick this fix? The PR does not fix something that was broken but speeds things up (nice to have).

@mattjmcnaughton @gnufied we are seeing many use cases where a user starts up many pods across many nodes all reading from the same volume. This is common especially among batch workloads (AI/ML use cases, for example). For many users, taking an hour for something that should take no more than a few minutes (at most) is unacceptable and essentially renders this case unusable. Some users are considering Mesos instead, since this is handled smoothly there and attaches all happen in parallel.

Thanks for the extra context :) I defer to whoever is in charge of cherry-picks :)

This feels quite impactful and in my mind could merit cherry-picking. The three cherry-picks first need lgtm and approve from the appropriate OWNERS. Then the @kubernetes/patch-release-team will approve.

Thanks for the additional context @tpepper!

cc @saad-ali I assigned you to all of the cherry-picks, as you were the approver of the original diff. Thanks :)
…-upstream-release-1.17 Automated cherry pick of #88678: Parallelize attach operations across different nodes for
…-upstream-release-1.16 Automated cherry pick of #88678: Parallelize attach operations across different nodes for
…-upstream-release-1.15 Automated cherry pick of #88678: Parallelize attach operations across different nodes for
What type of PR is this?
/kind bug
What this PR does / why we need it: This is the improved version of #87258 that fixes the problem discovered in #88355. The following is different from the previous PR:

- Fixed `getOperations()` and `deleteOperations()` in the second commit.
- Changed the `operationKey` string representation to use `%s` instead of `%q`.

Which issue(s) this PR fixes:
Fixes #73972, and partially #88355
Special notes for your reviewer:
Special notes from #87258:
"""
There are 2 commits. The first one is a refactor of the operation key structure in NestedPendingOperations. The second one is the main change.
I'm not a fan of the method signature of Run() and IsOperationPending() (especially the giant comment block above Run()). I'd like to refactor it to use a single generic OperationKey type, but since it could be a controversial change I'm going to have it in a separate PR.
"""
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/sig storage
/assign @jingxu97 @misterikkit @saad-ali