
Parallelize attach operations across different nodes for volumes that allow multi-attach #88678

Merged
1 commit merged on Mar 6, 2020

Conversation

@verult (Contributor) commented Feb 29, 2020

What type of PR is this?
/kind bug

What this PR does / why we need it: This is the improved version of #87258 that fixes the problem discovered in #88355. The differences from the previous PR are:

  • The fix: updated getOperations() and deleteOperations() in the second commit (see the sketch after this list).
  • Includes a unit test that reproduces the scenario which likely caused #88355 (storage e2e tests leaking PDs due to #87258) and verifies the fix.
  • Changed the operationKey string representation to use %s instead of %q.
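
For orientation, here is a minimal, self-contained sketch of the exact-match lookup that the updated getOperations() performs. The key field names follow the diff snippet quoted later in the review; the standalone types and the helper name are illustrative only, not the PR's actual code.

package main

import "fmt"

// Illustrative stand-ins for v1.UniqueVolumeName, types.UniquePodName, and types.NodeName.
type operationKey struct {
	volumeName string
	podName    string
	nodeName   string
}

type operation struct {
	key operationKey
}

// getOperationIndex requires an exact match on all three key fields, so an
// operation on the same volume but a different node is no longer treated as
// the same pending operation.
func getOperationIndex(ops []operation, key operationKey) (int, error) {
	for i, op := range ops {
		if op.key.volumeName == key.volumeName &&
			op.key.podName == key.podName &&
			op.key.nodeName == key.nodeName {
			return i, nil
		}
	}
	return -1, fmt.Errorf("operation %+v not found", key)
}

func main() {
	ops := []operation{
		{key: operationKey{volumeName: "vol-1", nodeName: "node-a"}},
		{key: operationKey{volumeName: "vol-1", nodeName: "node-b"}},
	}
	fmt.Println(getOperationIndex(ops, operationKey{volumeName: "vol-1", nodeName: "node-b"})) // 1 <nil>
}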

Which issue(s) this PR fixes:

Fixes #73972, and partially addresses #88355

Special notes for your reviewer:
Special notes from #87258:
"""
There are 2 commits. The first one is a refactor of the operation key structure in NestedPendingOperations. The second one is the main change.

I'm not a fan of the method signature of Run() and IsOperationPending() (especially the giant comment block above Run()). I'd like to refactor it to use a single generic OperationKey type, but since it could be a controversial change I'm going to have it in a separate PR.
"""

Does this PR introduce a user-facing change?:

For volumes that allow attaches across multiple nodes, attach and detach operations across different nodes are now executed in parallel.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

/sig storage
/assign @jingxu97 @misterikkit @saad-ali

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Feb 29, 2020
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. sig/storage Categorizes an issue or PR as relevant to SIG Storage. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 29, 2020
@verult (Contributor, Author) commented Feb 29, 2020

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. area/kubelet sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 29, 2020
@gnufied (Member) commented Feb 29, 2020

/assign

@saad-ali (Member) commented Mar 4, 2020

/milestone v1.18

@k8s-ci-robot k8s-ci-robot added this to the v1.18 milestone Mar 4, 2020
// Assumes lock has been acquired by caller.
volumeName v1.UniqueVolumeName,
podName types.UniquePodName) {

opIndex := -1
@verult (Contributor, Author):

We should be defensive about indexing on a negative number and return if op isn't found. Update coming soon.

@saad-ali (Member) left a comment:

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 5, 2020
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: saad-ali, verult

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 5, 2020
@verult (Contributor, Author) commented Mar 5, 2020

/hold
@misterikkit is in the middle of review. Feel free to remove the hold when you think it's ready. Thanks!

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 5, 2020
// No match, keep searching
continue
}
volumeNameMatch := previousOp.key.volumeName == key.volumeName
Contributor:

If volumeNameMatch is false, we could skip the remaining checks and continue with the next iteration of the loop.

@verult (Contributor, Author):

The next few condition evaluations should be pretty fast. For the sake of clarity, I have a slight preference for the current form.

// - volumeName exists, podName exists, nodeName empty
// This key conflicts with:
// - the same volumeName and podName
// - the same volumeName, but no podName
Contributor:

Maybe use "empty name" instead of "no name" in all places.
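
Illustrative only, reusing the operationKey struct from the sketch in the PR description: one way to express the conflict rules documented in the code comment above. The helper name is hypothetical, and the fully empty key, which conflicts with nothing, would be special-cased before this check.

// Two keys conflict when they name the same volume and, for both the pod and
// node dimensions, the values either match or at least one side is empty
// (an empty field acts as a wildcard for that dimension).
func keysConflict(a, b operationKey) bool {
	if a.volumeName != b.volumeName {
		return false
	}
	podsConflict := a.podName == b.podName || a.podName == "" || b.podName == ""
	nodesConflict := a.nodeName == b.nodeName || a.nodeName == "" || b.nodeName == ""
	return podsConflict && nodesConflict
}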

klog.V(10).Infof("Operation for volume %q is already running. Can't start detach for %q", attachedVolume.VolumeName, attachedVolume.NodeName)
continue
if util.IsMultiAttachForbidden(attachedVolume.VolumeSpec) {
if rc.attacherDetacher.IsOperationPending(attachedVolume.VolumeName, "" /* podName */, "" /* nodeName */) {
Contributor:

A small thing: change "" to nestedpendingoperations.EmptyNodeName here and in similar places?

@jingxu97 (Contributor), Mar 5, 2020:

Also, to save some code, maybe define podName and nodeName based on util.IsMultiAttachForbidden and then share the same code. The same applies below during attach.

@verult (Contributor, Author), Mar 5, 2020:

> small thing, change "" to nestedpendingoperations.EmptyNodeName and similar places?

The reconciler package actually doesn't import the nestedpendingoperations package. By referring to nestedpendingoperations.EmptyNodeName we might be breaking the abstraction boundary, though ideally the Empty*Name constants should really be moved to a different place. On the other hand, I'm doing a refactor that will largely remove the need to specify empty arguments, so IMO it's OK to leave them here for now.

> also to save some code, maybe define podName and nodeName based on util.IsMultiAttachForbidden, and then share the same code. Same thing applied below during attach.

The log messages are slightly different between the two cases to make it more obvious which branch was followed.

Contributor:

It is a level 10 log. Also, I don't see why that difference matters; you log the volume name and node name, isn't that enough?

@verult (Contributor, Author):

IMO it's still good to keep it because the log messages say slightly different things to more accurately describe what kind of operation is pending.
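
A rough, purely illustrative sketch of the consolidation suggested above: derive the pending-operation key once from whether multi-attach is forbidden, then share a single IsOperationPending check. The function below is hypothetical and uses plain strings in place of the real typed names; the PR keeps the two branches separate so the log messages can differ.

// pendingOpKey picks the key the reconciler would check before detaching.
// Single-attach volumes serialize on the volume alone; multi-attach volumes
// serialize per (volume, node) pair, which is what lets detaches on
// different nodes run in parallel.
func pendingOpKey(volumeName, nodeName string, multiAttachForbidden bool) (vol, pod, node string) {
	if multiAttachForbidden {
		return volumeName, "", ""
	}
	return volumeName, "", nodeName
}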

op.podName == podName {
if op.key.volumeName == key.volumeName &&
op.key.podName == key.podName &&
op.key.nodeName == key.nodeName {
return uint(i), nil
Contributor:

Is the fix to https://github.com/kubernetes/kubernetes/pull/87258/files here? I just want to understand how this causes PD leaks. Do you have a walkthrough of how it happens?

@verult (Contributor, Author):

Added a description here: #88355 (comment)

Let me know if anything is unclear, happy to chat in person if you'd like!

@@ -182,9 +146,16 @@ func (rc *reconciler) reconcile() {
// This check must be done before we do any other checks, as otherwise the other checks

This comment is now out of date. Could you explain the new logic here and/or add comments in the if/else blocks?

// - volumeName empty, podName empty, nodeName empty
// This key does not have any conflicting keys.
// - volumeName exists, podName empty, nodeName empty
// This key conflicts with all other keys with the same volumeName.

I think the wording "conflicting keys" is confusing here. Can we talk instead about whether two keys match?

Comment on lines 53 to 55
// Run adds the concatenation of volumeName and one of podName or nodeName to
// the list of running operations and spawns a new go routine to execute
// OperationFunc inside generatedOperations.

I know you are updating the existing wording, but maybe we can explain this better. Run's job is to run the given operation in a goroutine. If a goroutine is already running with a matching key, then Run returns an error. Callers should not care about whether the inputs are concatenated.

EDIT: Since you are planning to follow-up with a refactor, we can defer comment re-writes to that PR.

@verult (Contributor, Author):

ACK, yeah, some of this will change naturally when the key is made more generic.
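
To make the Run contract described above concrete, here is a simplified, illustrative sketch. The type and helper names are hypothetical, it reuses operationKey and keysConflict from the earlier sketches plus the "fmt" and "sync" imports, and the real implementation additionally tracks completion state and exponential backoff.

type fakePendingOps struct {
	mu   sync.Mutex
	keys []operationKey
}

// Run starts op in a goroutine unless an operation with a conflicting key is
// already running, in which case it returns an error and starts nothing.
func (grm *fakePendingOps) Run(key operationKey, op func() error) error {
	grm.mu.Lock()
	defer grm.mu.Unlock()
	for _, existing := range grm.keys {
		if keysConflict(existing, key) {
			return fmt.Errorf("an operation with key %+v already exists", key)
		}
	}
	grm.keys = append(grm.keys, key)
	go func() {
		defer grm.remove(key)
		_ = op() // in the real code the error feeds the backoff state
	}()
	return nil
}

func (grm *fakePendingOps) remove(key operationKey) {
	grm.mu.Lock()
	defer grm.mu.Unlock()
	for i, k := range grm.keys {
		if k == key {
			grm.keys = append(grm.keys[:i], grm.keys[i+1:]...)
			return
		}
	}
}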


// volumeName, podName, and nodeName collectively form the operation key.
// The following forms of operation keys are supported:
// - volumeName empty, podName empty, nodeName empty

is this a valid input?

@verult (Contributor, Author):

// - the same volumeName and nodeName
// - the same volumeName but no nodeName

// If an operation with the same operationName and a conflicting key exists,

operationName does not seem to be a type or concept that this API exposes.

@verult (Contributor, Author):

Updated, and also made the whole comment block more specific.


// Arrange
grm := NewNestedPendingOperations(false /* exponentialBackOffOnError */)
operation1DoneCh := make(chan interface{}, 0 /* bufferSize */)

Do you leak a goroutine by not closing or writing to this channel? Even for a test, I think it matters a little.

@verult (Contributor, Author):

There are a lot of spots to change because Fatalf() changes control flow. I made a note to include it in the refactor, but please bring it up if I forget.

Fatalf() does not prevent a defer, but let's leave this as is since a lot will change in the refactor.
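
A minimal sketch of the deferred-close pattern under discussion, assuming only the standard testing package and deliberately not using the real OperationFunc signature. t.Fatalf ends the test via runtime.Goexit, so deferred calls still run and the blocked goroutine is released.

func TestExample(t *testing.T) {
	operationDoneCh := make(chan interface{})
	defer close(operationDoneCh) // still runs if a t.Fatalf below ends the test early

	go func() {
		<-operationDoneCh // the fake operation blocks here until the channel is closed
	}()

	// ... exercise the code under test; the goroutine above no longer leaks on
	// failure, because the deferred close unblocks it.
}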


errZ := grm.Run(opZVolumeName, "" /* podName */, "" /* nodeName */, volumetypes.GeneratedOperations{OperationFunc: operationZ})
if errZ != nil {
t.Fatalf("NestedPendingOperations failed. Expected: <no error> Actual: <%v>", errZ)

If I were reading a test log, it would be difficult to tell which of the three Fatalf logs was hit here. Could you replace NestedPendingOperations with operationZ, operation1, etc. in the failure messages?

Do not need to fix the whole file - we can save that for when we refactor.
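
For example, the assertion from the snippet above could read:

if errZ != nil {
	t.Fatalf("operationZ failed. Expected: <no error> Actual: <%v>", errZ)
}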

}
}

func TestOperationExecutor_DetachSingleNodeVolumeConcurrentlyFromDifferentNodes(t *testing.T) {

How does one attach a single-node volume to multiple nodes in the first place, I wonder...?

@verult (Contributor, Author):

I can't think of anything either :) I over-tested a bit here, will remove.

// false, it is not guaranteed that multi-attach is actually supported by the volume type and we must rely on the
// attacher to fail fast in such cases.
// Please see https://github.com/kubernetes/kubernetes/issues/40669 and https://github.com/kubernetes/kubernetes/pull/40148#discussion_r98055047
func IsMultiAttachForbidden(volumeSpec *volume.Spec) bool {

Generally function names are easier to read when worded in the "positive". Currently we have code that reads like, "is it forbidden" and "is it not forbidden", and the double negative is an extra cognitive load.

// Check for persistent volume types which do not fail when trying to multi-attach
if len(volumeSpec.PersistentVolume.Spec.AccessModes) == 0 {
// No access mode specified so we don't know for sure. Let the attacher fail if needed
return false

There are multiple "uncertain" returns from this function, and they don't all return the same value. Why?

@verult (Contributor, Author):

Which inconsistency are you referring to? The true returns are (1) an inline volume with Azure/Cinder, or (2) a PV whose only access mode is ReadWriteOnce.
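
Purely illustrative, following the access-mode rule described in the reply above (the function name is hypothetical and not part of the PR; it assumes the k8s.io/api/core/v1 package imported as v1):

// pvForbidsMultiAttach reports whether a PersistentVolume's access modes rule
// out attaching it to more than one node.
func pvForbidsMultiAttach(pv *v1.PersistentVolume) bool {
	modes := pv.Spec.AccessModes
	if len(modes) == 0 {
		return false // unknown; let the attacher fail if needed
	}
	for _, mode := range modes {
		if mode != v1.ReadWriteOnce {
			return false // any shared access mode means multi-attach may be allowed
		}
	}
	return true // only ReadWriteOnce is listed, so multi-attach is forbidden
}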

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 6, 2020
@misterikkit left a comment:

My only nit is to use t.Run instead of logging the test ID. Other than that, I'd say we can squash commits and get this merged.

@@ -141,7 +151,7 @@ func (grm *nestedPendingOperations) Run(
return NewAlreadyExistsError(opKey)
}

backOffErr := previousOp.expBackoff.SafeToRetry(opKey.String())
backOffErr := previousOp.expBackoff.SafeToRetry(fmt.Sprintf("%+v", opKey))

OMG can we include the exponential backoff package in the refactor? I just read why you even need to pass this string in.

@@ -737,6 +701,9 @@ func Test_NestedPendingOperations_Positive_Issue_88355(t *testing.T) {
// delay after an operation is signaled to finish to ensure it actually
// finishes before running the next operation.
delay = 50 * time.Millisecond

// Replicates the default AttachDetachController reconcile period
reconcilerPeriod = 100 * time.Millisecond

If the test actually depends on this time value matching one in the code under test, then it smells funny to me. However, I think it would be better to address this during the refactor.

EmptyNodeName)
}
if test.expectPass {
testConcurrentOperationsPositive(t, test.testId,

It is better to use t.Run() than to pass the testId into the helper function. e.g.

t.Run(fmt.Sprintf("test %d", test.testId), func(t *testing.T) {
    if test.expectPass...
})

@misterikkit left a comment:

/lgtm

op1ContinueCh <- true
time.Sleep(delay)

for {

That seems like the right behavior to me. Just wanted to make sure that the failure mode did not hang tests.



@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 6, 2020
@verult (Contributor, Author) commented Mar 6, 2020

Weird, verify is shown as failed even though all the tests passed...

/test pull-kubernetes-verify
/test pull-kubernetes-e2e-gce-csi-serial

@verult (Contributor, Author) commented Mar 6, 2020

/hold cancel

@mattjmcnaughton (Contributor):

@verult do you mind sharing a little more about why this PR should be cherry-picked to previous releases? Is it to address the test flakes that you mentioned this change will help with?

Thanks :)

@gnufied (Member) commented Mar 19, 2020

I tend to agree with @mattjmcnaughton on this one. Why do we have to cherry-pick this fix? The PR does not fix something that was broken; it speeds things up (nice to have).

@verult (Contributor, Author) commented Mar 25, 2020

@mattjmcnaughton @gnufied we are seeing many use cases where a user starts up many pods across many nodes, all reading from the same volume. This is especially common among batch workloads (AI/ML use cases, for example). For many users, taking an hour for something that should take no more than a few minutes (at most) is unacceptable and essentially renders this case unusable. Some users are considering Mesos instead, since Mesos handles this smoothly and all attaches happen in parallel.

@mattjmcnaughton (Contributor):

> (quoting @verult's comment above)

Thanks for the extra context :) I defer to whoever is in charge of cherry-picks :)

@tpepper (Member) commented Apr 3, 2020

> (quoting the exchange above)

This feels quite impactful and in my mind could merit cherry-picking. The three cherry-picks, though, first need lgtm and approve from the appropriate OWNERS. Then the @kubernetes/patch-release-team will approve.

@mattjmcnaughton (Contributor):

Thanks for the additional context @tpepper!

@mattjmcnaughton (Contributor):

cc @saad-ali I assigned you to all of the cherry-picks, as you were the approver of the original diff. Thanks :)

k8s-ci-robot added a commit that referenced this pull request Apr 9, 2020
…-upstream-release-1.17

Automated cherry pick of #88678: Parallelize attach operations across different nodes for
k8s-ci-robot added a commit that referenced this pull request Apr 9, 2020
…-upstream-release-1.16

Automated cherry pick of #88678: Parallelize attach operations across different nodes for
k8s-ci-robot added a commit that referenced this pull request Apr 10, 2020
…-upstream-release-1.15

Automated cherry pick of #88678: Parallelize attach operations across different nodes for
Development

Successfully merging this pull request may close these issues.

Slow volume attach for single ReadOnlyMany volume across multiple nodes
9 participants