Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix nestedPendingOperations mount and umount parallel bug -- minimal change #110951

Merged

Conversation

249043822
Copy link
Member

What type of PR is this?

/kind bug

What this PR does / why we need it:

We have committed the refactor version for nestedPendingOperations #109190,
but it is a big change and may bring unknown risks and would be landed for a long time, so I think we can use a smaller way to fix bug first, and promote a nice refactor in next step.

Which issue(s) this PR fixes:

Fixes #109047

Special notes for your reviewer:

/cc @jingxu97 @gnufied @Dingshujie

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 5, 2022
@Dingshujie
Copy link
Member

/test pull-kubernetes-e2e-gce-storage-snapshot

}
}
if opIndex != -1 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return opIndex != -1, opIndex

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

nodeName := EmptyNodeName
// delay after an operation is signaled to finish to ensure it actually
// finishes before running the next operation.
delay := 600 * time.Millisecond
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

600 Millisecond may be too long, we can use CompleteFunc

operation1DoneCh := make(chan struct{})
completeFunc := func(c types.CompleteFuncParam) {
  operation1DoneCh <- struct{}{}
}
err1 := grm.Run(volumeName, podName1, nodeName /* nodeName */, volumetypes.GeneratedOperations{OperationFunc: errorFunc,CompleteFunc:  completeFunc ,OperationName: "umount"})
if err1 != nil {
}
<- operation1DoneCh

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, but CompleteFunc is still called before operationComplete, so it could not guarantee that the last operation was complete. Now I reduce the delay and split to a backoff delay, if there are other nice handlers, iI will refresh again. thanks

@Dingshujie
Copy link
Member

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Jul 7, 2022
Copy link
Contributor

@yangjunmyfm192085 yangjunmyfm192085 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/retest

@Dingshujie
Copy link
Member

/retest

Copy link
Contributor

@yangjunmyfm192085 yangjunmyfm192085 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we have to wait for this pr to merge?
#110980

Copy link
Contributor

@yangjunmyfm192085 yangjunmyfm192085 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/retest

@xing-yang
Copy link
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 24, 2022
@xing-yang
Copy link
Contributor

/assign @jingxu97 @gnufied

@endocrimes endocrimes added this to Triage in SIG Node PR Triage Jul 26, 2022
@endocrimes endocrimes moved this from Triage to Needs Reviewer in SIG Node PR Triage Jul 26, 2022
@gnufied
Copy link
Member

gnufied commented Aug 22, 2022

@itroyano - Both mount and unmount operation already have volume name in their key and hence multiple mount operations on same volume can't proceed.

We however do allow, multiple unmount operation on same volume to proceed (if they are being used by different pods) and that is by design.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 24, 2022
@249043822 249043822 requested review from gnufied and removed request for jingxu97 August 24, 2022 09:08
@gnufied
Copy link
Member

gnufied commented Aug 24, 2022

/retest

// operation4 override operation1 or operation3, and operation5 will override operation2,
// so finally, only operation4, operation1 or operation3 left
grm.(*nestedPendingOperations).lock.Lock()
defer grm.(*nestedPendingOperations).lock.Unlock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit: can we add a comment that is on the lines of:

"Since we successfully finished unmount operation on pod2, it should be removed from operations array"

IMO for future readers of this test case, it may be easier to understand.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gnufied Done

@gnufied
Copy link
Member

gnufied commented Aug 25, 2022

/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Aug 25, 2022
@gnufied
Copy link
Member

gnufied commented Aug 26, 2022

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 26, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 249043822, gnufied

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 26, 2022
@k8s-ci-robot k8s-ci-robot merged commit 4d0ad07 into kubernetes:master Aug 26, 2022
SIG Node PR Triage automation moved this from Needs Reviewer to Done Aug 26, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Aug 26, 2022
@gnufied
Copy link
Member

gnufied commented Aug 26, 2022

/cherrypick 1.25

@gnufied
Copy link
Member

gnufied commented Aug 29, 2022

@249043822 can you cherry pick to 1.25, 1.24 and 1.23?

@249043822
Copy link
Member Author

@249043822 can you cherry pick to 1.25, 1.24 and 1.23?

np

k8s-ci-robot added a commit that referenced this pull request Sep 7, 2022
…10951-upstream-release-1.25

Automated cherry pick of #110951: fix nestedPendingOperations mount and umount parallel bug
k8s-ci-robot added a commit that referenced this pull request Sep 7, 2022
…10951-upstream-release-1.24

Automated cherry pick of #110951: fix nestedPendingOperations mount and umount parallel bug
k8s-ci-robot added a commit that referenced this pull request Sep 8, 2022
…10951-upstream-release-1.23

Automated cherry pick of #110951: fix nestedPendingOperations mount and umount parallel bug
chenchun pushed a commit to chenchun/kubernetes that referenced this pull request Mar 20, 2024
…nto 'tke/v1.20.6' (merge request !907)

fix nestedPendingOperations mount and umount parallel bug
Issue: kubernetes#109047
Cherry Pick: kubernetes#110951

详细内容见Issue,volume manager中存在某些时序性bug,导致某个PV对应的mount操作都会卡住并且和这个PV相关的pod都卡在Creating,只有重启kubelet才会恢复。
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Development

Successfully merging this pull request may close these issues.

kubelet nestedPendingOperations may leak operation lead same pv not to do mount or umount operation
8 participants