fix nestedPendingOperations mount and umount parallel bug -- minimal change #110951
Conversation
/test pull-kubernetes-e2e-gce-storage-snapshot
}
}
if opIndex != -1 {
return opIndex != -1, opIndex
Done
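For readers without the diff at hand, the suggested `return opIndex != -1, opIndex` form can be sketched as below. This is only an illustration: the `operation` type and the `findOperation` helper are hypothetical stand-ins, not the PR's actual code.

```go
package main

import "fmt"

// operation is a hypothetical, stripped-down stand-in for a pending operation.
type operation struct {
	volumeName string
	podName    string
}

// findOperation reports whether an operation with the given key exists and
// returns its index. The index is -1 when nothing matches.
func findOperation(ops []operation, volumeName, podName string) (bool, int) {
	opIndex := -1
	for i, op := range ops {
		if op.volumeName == volumeName && op.podName == podName {
			opIndex = i
			break
		}
	}
	// A single return replaces the separate "if opIndex != -1" branch.
	return opIndex != -1, opIndex
}

func main() {
	ops := []operation{{"vol-1", "pod-1"}, {"vol-1", "pod-2"}}
	fmt.Println(findOperation(ops, "vol-1", "pod-2"))
	fmt.Println(findOperation(ops, "vol-2", "pod-1"))
}
```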
nodeName := EmptyNodeName
// delay after an operation is signaled to finish to ensure it actually
// finishes before running the next operation.
delay := 600 * time.Millisecond
600 milliseconds may be too long; we can use CompleteFunc instead:
operation1DoneCh := make(chan struct{})
completeFunc := func(c types.CompleteFuncParam) {
	operation1DoneCh <- struct{}{}
}
err1 := grm.Run(volumeName, podName1, nodeName /* nodeName */, volumetypes.GeneratedOperations{OperationFunc: errorFunc, CompleteFunc: completeFunc, OperationName: "umount"})
if err1 != nil {
}
<-operation1DoneCh
Good point, but CompleteFunc is still called before operationComplete, so it cannot guarantee that the previous operation has fully completed. I have reduced the delay and split it into a backoff delay instead. If there is a better way to handle this, I will update the PR again. Thanks.
77c1a1e to 297d727
/sig node
/retest
/retest
Maybe we have to wait for this PR to merge? #110980
/retest
/triage accepted
@itroyano - Both mount and unmount operations already have the volume name in their key, so multiple mount operations on the same volume cannot proceed in parallel. We do, however, allow multiple unmount operations on the same volume to proceed (if the volume is being used by different pods), and that is by design.
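The rule described in this comment can be illustrated with a simplified conflict check. This sketch is not the real nestedPendingOperations logic: the `opKey` type, the `conflicts` function, and the convention that a mount is keyed by volume only (empty pod name) are all assumptions chosen to mirror the behavior described above.

```go
package main

import "fmt"

// opKey is a hypothetical operation key. An empty podName is assumed here to
// mean "applies to the whole volume".
type opKey struct {
	volumeName string
	podName    string
}

// conflicts reports whether two operations may NOT run in parallel under the
// simplified rule: same volume, and either key is volume-wide or both name
// the same pod.
func conflicts(a, b opKey) bool {
	if a.volumeName != b.volumeName {
		return false
	}
	return a.podName == "" || b.podName == "" || a.podName == b.podName
}

func main() {
	mount := opKey{volumeName: "vol-1"} // assumed: keyed by volume only
	unmountPod1 := opKey{volumeName: "vol-1", podName: "pod-1"}
	unmountPod2 := opKey{volumeName: "vol-1", podName: "pod-2"}

	fmt.Println("mount vs mount:", conflicts(mount, mount))
	fmt.Println("unmount pod-1 vs unmount pod-2:", conflicts(unmountPod1, unmountPod2))
	fmt.Println("mount vs unmount pod-1:", conflicts(mount, unmountPod1))
}
```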
297d727 to dcb2e67
/retest
// operation4 override operation1 or operation3, and operation5 will override operation2,
// so finally, only operation4, operation1 or operation3 left
grm.(*nestedPendingOperations).lock.Lock()
defer grm.(*nestedPendingOperations).lock.Unlock()
Minor nit: can we add a comment along the lines of "Since we successfully finished the unmount operation on pod2, it should be removed from the operations array"? IMO it would make this test case easier for future readers to understand.
@gnufied Done
/priority important-soon
dcb2e67 to 593f6c9
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: 249043822, gnufied The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/cherrypick 1.25
@249043822 can you cherry-pick to 1.25, 1.24 and 1.23?
np
…10951-upstream-release-1.25 Automated cherry pick of #110951: fix nestedPendingOperations mount and umount parallel bug
…10951-upstream-release-1.24 Automated cherry pick of #110951: fix nestedPendingOperations mount and umount parallel bug
…10951-upstream-release-1.23 Automated cherry pick of #110951: fix nestedPendingOperations mount and umount parallel bug
…nto 'tke/v1.20.6' (merge request !907) fix nestedPendingOperations mount and umount parallel bug Issue: kubernetes#109047 Cherry Pick: kubernetes#110951 See the issue for details. The volume manager has timing-related bugs that cause every mount operation for a given PV to hang, leaving all pods that use that PV stuck in Creating; only restarting kubelet recovers them.
What type of PR is this?
/kind bug
What this PR does / why we need it:
We have already submitted a refactored version of nestedPendingOperations in #109190, but it is a big change that may bring unknown risks and will take a long time to land. So I think we can fix the bug with a smaller change first and pursue the fuller refactor as a next step.
Which issue(s) this PR fixes:
Fixes #109047
Special notes for your reviewer:
/cc @jingxu97 @gnufied @Dingshujie
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: