
fix nestedPendingOperations mount and umount parallel bug #109190

Closed

Conversation

249043822
Member

What type of PR is this?

/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #109047

Special notes for your reviewer:

/cc @jingxu97 @gnufied

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Mar 31, 2022
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/test sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 31, 2022
Contributor

@yangjunmyfm192085 left a comment


/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Mar 31, 2022
@249043822
Member Author

/retest

@@ -256,10 +257,18 @@ func (grm *nestedPendingOperations) isOperationExists(key operationKey) (bool, i
previousOp.key.nodeName == key.nodeName

if volumeNameMatch && podNameMatch && nodeNameMatch {
Member


recommend no else

		if volumeNameMatch && podNameMatch && nodeNameMatch {
			if previousOp.operationPending {
				return true, previousOpIndex
			}
			if opIndex == -1 {
				opIndex = previousOpIndex
			}
		}

} else {
	if opIndex == -1 {
		opIndex = previousOpIndex
	}
Member


t0 Operations: {umount, pvname2, pod2, mount, pending=true} {umount, pvname, pod1, pending=true} {umount, pvname, pod2, pending=true}
t1 Operations: {umount, pvname2, pod2, mount, pending=true} {umount, pvname, pod1, pending=false} {umount, pvname, pod2, pending=false} // two umount failed
t2 Operations: {umount, pvname2, pod2, mount, pending=true} {mount, pvname, EmptyUniquePodName, pending=true} {umount, pvname, pod2, pending=false} // add pvname mount
t3 Operations: {umount, pvname2, pod2, mount, pending=true} {mount, pvname, EmptyUniquePodName, pending=false} {umount, pvname, pod2, pending=false} // pvname mount failed
t4 Operations: {umount, pvname2, pod2, mount, pending=true} {umount, pvname, pod2, pending=true} {umount, pvname, pod2, pending=false} // pvname pod2 umount add
t5 Operations: {umount, pvname, pod2, pending=false} {umount, pvname, pod2, pending=true} // pvname2 umount success, replace index 0 with index 2, and delete index 2
t6 Operations: {umount, pvname, pod2, pending=true} // umount pod2 success, delete index 0

Maybe in this case we also leak an operation that can block the PV mount.
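
To make the matching difference concrete, here is a minimal, self-contained sketch. It does not use the real nestedPendingOperations types; opKey, op, and the two lookup helpers are simplified stand-ins that only model how a first-loose-match lookup (treating EmptyUniquePodName as a wildcard) can reuse the wrong slot, while an exact-match-preferred lookup reuses the (volume, pod) slot:

package main

import "fmt"

// Simplified stand-ins for the real operation key and operation record.
type opKey struct{ volumeName, podName string }

type op struct {
	key     opKey
	pending bool
}

const emptyPodName = "" // stands in for EmptyUniquePodName

// firstLooseMatch returns the first reusable (completed) slot whose key
// loosely matches, treating an empty pod name as a wildcard.
func firstLooseMatch(ops []op, k opKey) int {
	for i, o := range ops {
		podMatch := o.key.podName == emptyPodName || k.podName == emptyPodName || o.key.podName == k.podName
		if o.key.volumeName == k.volumeName && podMatch && !o.pending {
			return i
		}
	}
	return -1
}

// exactMatchPreferred prefers the slot whose pod name matches exactly and
// only falls back to the first loose match.
func exactMatchPreferred(ops []op, k opKey) int {
	loose := -1
	for i, o := range ops {
		if o.key.volumeName != k.volumeName || o.pending {
			continue
		}
		if o.key.podName == k.podName {
			return i
		}
		if loose == -1 && (o.key.podName == emptyPodName || k.podName == emptyPodName) {
			loose = i
		}
	}
	return loose
}

func main() {
	// Roughly the pvname entries at t3: a completed mount keyed with an empty
	// pod name and a completed umount keyed with pod2.
	ops := []op{
		{opKey{"pvname", emptyPodName}, false},
		{opKey{"pvname", "pod2"}, false},
	}
	newKey := opKey{"pvname", "pod2"} // the new umount for (pvname, pod2) at t4

	fmt.Println(firstLooseMatch(ops, newKey))     // 0: reuses the EmptyUniquePodName slot
	fmt.Println(exactMatchPreferred(ops, newKey)) // 1: reuses the exact (pvname, pod2) slot
}

With the first-match behaviour, the stale {umount, pvname, pod2} entry is left behind in the slice, which is the leak described in the timeline above.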

Member Author


Thanks for the review; the exact-match check was missing first. I will add a test case for this.

@Dingshujie
Member

	opIndex := -1
	for previousOpIndex, previousOp := range grm.operations {
		volumeNameMatch := previousOp.key.volumeName == key.volumeName

		podNameMatch := previousOp.key.podName == EmptyUniquePodName ||
			key.podName == EmptyUniquePodName ||
			previousOp.key.podName == key.podName

		podNameExactMatch := previousOp.key.podName == key.podName

		nodeNameMatch := previousOp.key.nodeName == EmptyNodeName ||
			key.nodeName == EmptyNodeName ||
			previousOp.key.nodeName == key.nodeName

		nodeNameExactMatch := previousOp.key.nodeName == key.nodeName

		if volumeNameMatch && podNameMatch && nodeNameMatch {
			if previousOp.operationPending {
				return true, previousOpIndex
			}
			if opIndex == -1 || (podNameExactMatch && nodeNameExactMatch) {
				opIndex = previousOpIndex
			}
		}
	}

Prefer to overwrite the operation whose pod name and node name both match exactly; if none is found, overwrite the first matching one.

@249043822 force-pushed the br-detach-fail branch 4 times, most recently from a914014 to d05470a on April 1, 2022 04:56
@249043822
Member Author

/test pull-kubernetes-integration

	nodeNameExactMatch := previousOp.key.nodeName == key.nodeName
	if opIndex == -1 || (podNameExactMatch && nodeNameExactMatch) {
		opIndex = previousOpIndex
	}
}
Member


Thanks for making the fix. Does the following logic seem more direct?

	exactMatch := -1
	nonExactMatch := -1

	for previousOpIndex, previousOp := range grm.operations {
		volumeNameMatch := previousOp.key.volumeName == key.volumeName

		if volumeNameMatch {
			if key.podName == previousOp.key.podName && key.nodeName == previousOp.key.nodeName {
				exactMatch = previousOpIndex
				break
			}

			podNameMatch := previousOp.key.podName == EmptyUniquePodName ||
				key.podName == EmptyUniquePodName ||
				previousOp.key.podName == key.podName

			nodeNameMatch := previousOp.key.nodeName == EmptyNodeName ||
				key.nodeName == EmptyNodeName ||
				previousOp.key.nodeName == key.nodeName

			if podNameMatch && nodeNameMatch {
				nonExactMatch = previousOpIndex
			}
		}
	}
	if exactMatch != -1 {
		return true, exactMatch
	}
	if nonExactMatch != -1 {
		return true, nonExactMatch
	}

Basically we want to pick the exactly matching operation over other ones.

Member

@gnufied Apr 1, 2022


Well, my version of the code does not return the first pending match though (it can be tweaked, but tbh I am okay with either version). One downside of the current approach is that we must iterate through all operations before IsOperationExists can return. This could potentially be expensive on a busy node.

Member Author


Yea, in the worst case there are no pending matching operations at all but there is a non-exact match at the head, so we must still iterate over everything. But I think we don't change the overall O(N) time complexity, because we always have to iterate over the slice to find a matching volume operation when operations for many different volumes exist. Shall we add a TODO to improve the algorithm with a different data structure in the future?

Member


Yeah I wonder if we can change the data structure used here from a slice to something that can accommodate these requirements in a better way. The code is also kinda hard to follow because of underlying complexity (not this PR's fault, it was always hard to follow).
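
Purely as an illustration of the kind of restructuring being wondered about here (hypothetical, not part of this PR, and using simplified stand-in types rather than the real nestedPendingOperations ones), one option is to group operations by volume name so a lookup only scans the operations that belong to the same volume:

package main

import "fmt"

// Hypothetical sketch: group operations by volume name so a lookup only has
// to scan the (usually short) list of operations for that one volume.
type key struct{ volumeName, podName, nodeName string }

type operation struct {
	key     key
	pending bool
}

type operationIndex struct {
	byVolume map[string][]operation // volumeName -> operations for that volume
}

func newOperationIndex() *operationIndex {
	return &operationIndex{byVolume: make(map[string][]operation)}
}

// add records a new pending operation under its volume name.
func (x *operationIndex) add(k key) {
	x.byVolume[k.volumeName] = append(x.byVolume[k.volumeName], operation{key: k, pending: true})
}

// hasPendingConflict reports whether a pending operation that would conflict
// with k already exists, scanning only operations on the same volume.
func (x *operationIndex) hasPendingConflict(k key) bool {
	for _, o := range x.byVolume[k.volumeName] {
		podMatch := o.key.podName == "" || k.podName == "" || o.key.podName == k.podName
		nodeMatch := o.key.nodeName == "" || k.nodeName == "" || o.key.nodeName == k.nodeName
		if o.pending && podMatch && nodeMatch {
			return true
		}
	}
	return false
}

func main() {
	idx := newOperationIndex()
	idx.add(key{"pvname", "pod1", "node1"})

	fmt.Println(idx.hasPendingConflict(key{"pvname", "pod1", "node1"}))  // true: conflicting op on the same volume
	fmt.Println(idx.hasPendingConflict(key{"pvname2", "pod1", "node1"})) // false: different volume, nothing scanned
}

Whatever shape it takes, deletion and slot reuse would still need the same exact-match preference discussed above, so this only sketches the lookup side.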

Member


Agree, we need to refactor this data structure.

Should we propose a KEP to do this?

@@ -41,15 +45,15 @@ import (

// Validate PV/PVC, create and verify writer pod, delete the PVC, and validate the PV's
// phase. Note: the PV is deleted in the AfterEach, not here.
func completeTest(f *framework.Framework, c clientset.Interface, ns string, pv *v1.PersistentVolume, pvc *v1.PersistentVolumeClaim) {
func completeTest(f *framework.Framework, c clientset.Interface, ns string, pv *v1.PersistentVolume, pvc *v1.PersistentVolumeClaim, act func(c clientset.Interface, t *framework.TimeoutContext, ns string, pvc *v1.PersistentVolumeClaim, command string) error) {
// 1. verify that the PV and PVC have bound correctly
Member


Does this test case reproduce the issue reliably?

Member


Hard to say. Certain operation orderings are required to reproduce the problem, and this test case can't guarantee them.

Member Author


Agree, it's not reliable; it only adds e2e test coverage to prevent a regression.

@249043822
Member Author

/test pull-kubernetes-integration

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 3, 2022
@gnufied
Member

gnufied commented Apr 4, 2022

Thanks for refactoring the nested pending operation code. We have failing unit tests though:

{Failed  === RUN   Test_NestedPendingOperations_SecondOpBeforeFirstCompletes/Test_1
    nestedpendingoperations_test.go:894: NestedPendingOperations failed. Expected: <no error> Actual: <Failed to create operation with name {volumeName: podName: nodeName: operationName:}. An operation with that name is already executing.>
    --- FAIL: Test_NestedPendingOperations_SecondOpBeforeFirstCompletes/Test_1 (0.00s)
}

@249043822
Member Author

Thanks for refactoring the nested pending operation code. We have failing unit tests though:

{Failed  === RUN   Test_NestedPendingOperations_SecondOpBeforeFirstCompletes/Test_1
    nestedpendingoperations_test.go:894: NestedPendingOperations failed. Expected: <no error> Actual: <Failed to create operation with name {volumeName: podName: nodeName: operationName:}. An operation with that name is already executing.>
    --- FAIL: Test_NestedPendingOperations_SecondOpBeforeFirstCompletes/Test_1 (0.00s)
}

Sorry, will look

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: 249043822
To complete the pull request process, please assign saad-ali after the PR has been reviewed.
You can assign the PR to them by writing /assign @saad-ali in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@249043822
Member Author

/retest

@249043822
Member Author

@gnufied All the tests pass now, but I have a small nit: when volumeName is empty, which is the case for the verify_volumes_are_attached_per_node operation, if an operation for the same node is still pending, shall we reject the current operation for that node?
From the test case

func TestOperationExecutor_AttachSingleNodeVolumeConcurrentlyToSameNode(t *testing.T) {
we can see that verify_volumes_are_attached_per_node for the same node can run concurrently.

@jingxu97
Contributor

jingxu97 commented Apr 6, 2022

@249043822, thanks a lot for finding the issue and working on the fix.
I was also thinking about this logic and proposed a fix here: #109343; the change is smaller. Will that fix work for your case?

@249043822
Member Author

249043822 commented Apr 7, 2022

@249043822, thanks a lot for finding the issue and working on the fix. I was also thinking about this logic and proposed a fix here: #109343; the change is smaller. Will that fix work for your case?

@jingxu97 thanks for your reply. The fix in #109343 is similar to the first version (d05470a) that I committed; it's really a minimal change for the bug. Talking with @gnufied, Hemant thinks that fix has one downside: we must iterate through all operations before IsOperationExists can return, which could potentially be expensive on a busy node. #109190 (comment)

So I refactored a new fix like this:

if volumeName == EmptyUniqueVolumeName {
	// volumeName is empty, so nodeName should exist (for verify_volumes_are_attached_per_node)
	// to make the opKey unique. If the operation for the same node is still pending,
	// should we reject the current operation?
Contributor


Right now only the verifyVolumeAttached function uses an empty volumeName, to allow it to run in parallel no matter what other volume operations are running. I think we can still move this if part to the beginning of the function.

@249043822
Member Author

/retest

@endocrimes endocrimes moved this from Triage to Done in SIG Node PR Triage May 4, 2022
@Dingshujie
Member

@SergeyKanzhelev @jingxu97 can you review this PR?

@xhejtman

Hello, is there any progress here? The bug is actually a big problem for Nextflow computations, as tasks get stuck in ContainerCreating.

@249043822
Member Author

@gnufied @jingxu97 @Dingshujie As this refactor is a big change and may not make sense to everyone, can we do this work in two steps? First, we can consider landing the smaller change in #110951 to solve this bug; then we can propose a cleaner refactor in a follow-up to complete the work.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 8, 2022
@k8s-ci-robot
Contributor

@249043822: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@xing-yang
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 24, 2022
@249043822
Member Author

/close

@k8s-ci-robot
Contributor

@249043822: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

kubelet nestedPendingOperations may leak operation lead same pv not to do mount or umount operation
8 participants