
kubelet nestedPendingOperations may leak an operation, preventing mount/unmount of the same PV #109047

Closed
Dingshujie opened this issue Mar 26, 2022 · 13 comments · Fixed by #110951
Labels: kind/bug, needs-triage, sig/storage

Comments

Dingshujie (Member) commented Mar 26, 2022

What happened?

In my scenario, several identical CronJobs use the same SFS volume (NFS storage on Huawei Cloud) through the same PVC to run the same training workload. A node therefore schedules many pods that use this PVC at the same time, so many mount and unmount operations for this PVC are executed concurrently. We then found that all pods on one node were stuck in container creating state, waiting for the PVC to be mounted.
Checking the logs, we could not find any operationExecutor.MountVolume started or failed messages such as "operationExecutor.MountVolume started for".

We also checked the volume_manager_total_volumes metric: the volume is added to the desired state of world (DSW).
Analyzing the code: if the error is an expected error (isExpectedError), kubelet does not log anything:

// ignore nestedpendingoperations.IsAlreadyExists and exponentialbackoff.IsExponentialBackoff errors, they are expected.
func isExpectedError(err error) bool {
	return nestedpendingoperations.IsAlreadyExists(err) || exponentialbackoff.IsExponentialBackoff(err)
}

The operationExecutor generates the operation and calls pendingOperations.Run. Run scans the operations slice to see whether the same operation already exists; if it does and its operationPending flag is true, the same operation is currently executing, and Run returns an AlreadyExists error.
The operationKey is composed of volumeName, podName and nodeName.
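
For reference, Run below constructs the key as operationKey{volumeName, podName, nodeName}; a minimal sketch of that struct, assuming the field types match Run's parameters, looks like this:

type operationKey struct {
	volumeName v1.UniqueVolumeName
	podName    volumetypes.UniquePodName
	nodeName   types.NodeName
}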

// MountVolume Func
// Avoid executing mount/map from multiple pods referencing the
// same volume in parallel
podName := nestedpendingoperations.EmptyUniquePodName

// TODO: remove this -- not necessary
if !volumeToMount.PluginIsAttachable && !volumeToMount.PluginIsDeviceMountable {
	// volume plugins which are Non-attachable and Non-deviceMountable can execute mount for multiple pods
	// referencing the same volume in parallel
	podName = util.GetUniquePodName(volumeToMount.Pod)
}

// TODO mount_device
return oe.pendingOperations.Run(
		volumeToMount.VolumeName, podName, "" /* nodeName */, generatedOperations)

So the MountVolume operationKey is {volumeName, EmptyUniquePodName, EmptyNodeName}; EmptyUniquePodName is used to avoid executing mount/map from multiple pods referencing the same volume in parallel.

// UnmountVolume Func
// All volume plugins can execute unmount/unmap for multiple pods referencing the
	// same volume in parallel
	podName := volumetypes.UniquePodName(volumeToUnmount.PodUID)

	return oe.pendingOperations.Run(
		volumeToUnmount.VolumeName, podName, "" /* nodeName */, generatedOperations)

So the UnmountVolume operationKey is {volumeName, podUID, EmptyNodeName}, because all volume plugins can execute unmount/unmap for multiple pods referencing the same volume in parallel.

Now look at the isOperationExists func: podNameMatch is true if previousOp.key.podName equals EmptyUniquePodName, or the current operation's podName equals EmptyUniquePodName, or the two pod names are equal. As a result, a mount operation can match an unmount operation and an unmount operation can match a mount operation.

// This is an internal function and caller should acquire and release the lock
func (grm *nestedPendingOperations) isOperationExists(key operationKey) (bool, int) {

	// If volumeName is empty, operation can be executed concurrently
	if key.volumeName == EmptyUniqueVolumeName {
		return false, -1
	}

	for previousOpIndex, previousOp := range grm.operations {
		volumeNameMatch := previousOp.key.volumeName == key.volumeName

		podNameMatch := previousOp.key.podName == EmptyUniquePodName ||
			key.podName == EmptyUniquePodName ||
			previousOp.key.podName == key.podName

		nodeNameMatch := previousOp.key.nodeName == EmptyNodeName ||
			key.nodeName == EmptyNodeName ||
			previousOp.key.nodeName == key.nodeName

		if volumeNameMatch && podNameMatch && nodeNameMatch {
			return true, previousOpIndex
		}
	}

	return false, -1
}
func (grm *nestedPendingOperations) Run(
	volumeName v1.UniqueVolumeName,
	podName volumetypes.UniquePodName,
	nodeName types.NodeName,
	generatedOperations volumetypes.GeneratedOperations) error {
	grm.lock.Lock()
	defer grm.lock.Unlock()

	opKey := operationKey{volumeName, podName, nodeName}

	opExists, previousOpIndex := grm.isOperationExists(opKey)
	if opExists {
		previousOp := grm.operations[previousOpIndex]
		// Operation already exists
		if previousOp.operationPending {
			// Operation is pending
			return NewAlreadyExistsError(opKey)
		}

		backOffErr := previousOp.expBackoff.SafeToRetry(fmt.Sprintf("%+v", opKey))
		if backOffErr != nil {
			if previousOp.operationName == generatedOperations.OperationName {
				return backOffErr
			}
			// previous operation and new operation are different. reset op. name and exp. backoff
			grm.operations[previousOpIndex].operationName = generatedOperations.OperationName
			grm.operations[previousOpIndex].expBackoff = exponentialbackoff.ExponentialBackoff{}
		}

		// Update existing operation to mark as pending.
		grm.operations[previousOpIndex].operationPending = true
		grm.operations[previousOpIndex].key = opKey
	} else {
		// Create a new operation
		grm.operations = append(grm.operations,
			operation{
				key:              opKey,
				operationPending: true,
				operationName:    generatedOperations.OperationName,
				expBackoff:       exponentialbackoff.ExponentialBackoff{},
			})
	}

	go func() (eventErr, detailedErr error) {
		// Handle unhandled panics (very unlikely)
		defer k8sRuntime.HandleCrash()
		// Handle completion of and error, if any, from operationFunc()
		defer grm.operationComplete(opKey, &detailedErr)
		return generatedOperations.Run()
	}()

	return nil
}
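
As a quick standalone illustration of the matching rules (a simplified sketch, not the kubelet package): an empty podName on either side acts as a wildcard, so a mount keyed with EmptyUniquePodName collides with a per-pod unmount key for the same volume, while the unmount keys of two different pods do not collide with each other.

package main

import "fmt"

const emptyUniquePodName = ""

// podNameMatch mirrors the comparison used in isOperationExists above.
func podNameMatch(previous, next string) bool {
	return previous == emptyUniquePodName || next == emptyUniquePodName || previous == next
}

func main() {
	fmt.Println(podNameMatch("pod2", emptyUniquePodName)) // true: a mount matches pod2's pending unmount
	fmt.Println(podNameMatch("pod1", "pod2"))             // false: unmounts for different pods coexist
}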

Because of isOperationExists, while a mount of a volume is executing, an unmount cannot be added to operations, and while an unmount is executing, a mount cannot be added. But multiple pods referencing the same volume can each add their own unmount operation, and the combination can leak an operation:

t0 Operations: {umount, pvname, pod1, pending=true} {umount, pvname, pod2, pending=true}
t1 Operations: {umount, pvname, pod1, pending=false} {umount, pvname, pod2, pending=true}
t2 Operations: {mount, pvname, EmptyUniquePodName, pending=true} {umount, pvname, pod2, pending=true}
t3 Operations: {mount, pvname, EmptyUniquePodName, pending=false} {umount, pvname, pod2, pending=true}
t4 Operations: {umount, pvname, pod2, pending=true} {umount, pvname, pod2, pending=true}
t5 Operations: {umount, pvname, pod2, pending=false} {umount, pvname, pod2, pending=true}
t6 Operations: {umount, pvname, pod2, pending=true}

t0: two pods start unmounting the volume
t1: pod1's unmount fails; its entry is marked pending=false
t2: pod3 is added, so the operation executor adds a mount operation; it matches index 0 and overrides the unmount entry there
t3: the mount operation fails; its pending flag is set to false
t4: pod2's unmount is added again; it also matches index 0 and overrides that entry, so the cache now holds two identical pod2 unmount entries and two goroutines are running pod2's unmount
t5: one pod2 unmount goroutine fails and marks the index-0 entry pending=false
t6: the other pod2 unmount goroutine succeeds and deletes the index-0 entry
This finally leaves a stale unmount operation in the cache with pending=true. From then on, every mount operation for this volume returns AlreadyExistsError and no CSI call is ever made.
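
To make this sequence concrete, here is a small self-contained Go model of the bookkeeping described above (a simplified sketch, not the real nestedpendingoperations package; the key, the first-match wildcard lookup, the in-place override of a non-pending entry, and the swap-with-tail delete mirror the quoted code, everything else is stripped down). Replaying t0-t6 leaves one stale pending unmount entry behind, after which every new mount is rejected:

package main

import "fmt"

const emptyPod = "" // stands in for EmptyUniquePodName

type key struct{ volume, pod string }

type op struct {
	key     key
	name    string
	pending bool
}

var ops []op

// find mimics isOperationExists: first match wins, an empty podName is a wildcard.
func find(k key) int {
	for i, o := range ops {
		if o.key.volume == k.volume &&
			(o.key.pod == emptyPod || k.pod == emptyPod || o.key.pod == k.pod) {
			return i
		}
	}
	return -1
}

// run mimics Run: reject if the matched entry is pending, otherwise overwrite it in place.
func run(k key, name string) {
	if i := find(k); i >= 0 {
		if ops[i].pending {
			fmt.Printf("AlreadyExists: %s %+v\n", name, k)
			return
		}
		ops[i] = op{key: k, name: name, pending: true}
		return
	}
	ops = append(ops, op{key: k, name: name, pending: true})
}

// complete mimics operationComplete: on failure, mark the first exact match as not
// pending; on success, delete it the way deleteOperation does (swap with the tail).
func complete(k key, failed bool) {
	for i := range ops {
		if ops[i].key == k {
			if failed {
				ops[i].pending = false
			} else {
				ops[i] = ops[len(ops)-1]
				ops = ops[:len(ops)-1]
			}
			return
		}
	}
}

func main() {
	run(key{"pv", "pod1"}, "volume_unmount") // t0: pod1 unmount pending
	run(key{"pv", "pod2"}, "volume_unmount") // t0: pod2 unmount pending
	complete(key{"pv", "pod1"}, true)        // t1: pod1 unmount fails
	run(key{"pv", emptyPod}, "volume_mount") // t2: mount matches index 0 and overrides it
	complete(key{"pv", emptyPod}, true)      // t3: mount fails
	run(key{"pv", "pod2"}, "volume_unmount") // t4: second pod2 unmount also lands in index 0
	complete(key{"pv", "pod2"}, true)        // t5: one pod2 unmount fails, marks index 0 not pending
	complete(key{"pv", "pod2"}, false)       // t6: the other succeeds and deletes index 0
	fmt.Printf("leftover: %+v\n", ops)       // one pod2 unmount stuck at pending=true, owned by no goroutine
	run(key{"pv", emptyPod}, "volume_mount") // every later mount now returns AlreadyExists
}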

I think there are other scenarios that can trigger this leak as well.

Also, when an operation succeeds, deleteOperation moves the tail element into the deleted slot (delete without preserving order), which may increase the probability of this occurring, because isOperationExists always picks the first match.

func (grm *nestedPendingOperations) deleteOperation(key operationKey) {
	// Assumes lock has been acquired by caller.

	opIndex := -1
	for i, op := range grm.operations {
		if op.key.volumeName == key.volumeName &&
			op.key.podName == key.podName &&
			op.key.nodeName == key.nodeName {
			opIndex = i
			break
		}
	}

	if opIndex < 0 {
		return
	}

	// Delete index without preserving order
	grm.operations[opIndex] = grm.operations[len(grm.operations)-1]
	grm.operations = grm.operations[:len(grm.operations)-1]
}

The override mechanism can also render the exponential backoff ineffective, for example when mount and unmount operations for the same volume take turns executing.

What did you expect to happen?

The PV can be mounted and the pods can run.

How can we reproduce it (as minimally and precisely as possible)?

Reproduction steps:
create several identical CronJobs that use the same PVC (backed by the SFS CSI driver)
restart the CSI driver periodically

Anything else we need to know?

No response

Kubernetes version

1.23

Cloud provider

HuaweiCloud

OS version


Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

CSI
Dingshujie added the kind/bug label Mar 26, 2022
k8s-ci-robot added the needs-sig label Mar 26, 2022
k8s-ci-robot (Contributor)

@Dingshujie: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added the needs-triage label Mar 26, 2022
Dingshujie (Member, Author)

/sig storage

k8s-ci-robot added the sig/storage label and removed the needs-sig label Mar 26, 2022
Dingshujie (Member, Author)

/sig node

k8s-ci-robot added the sig/node label Mar 28, 2022
Dingshujie (Member, Author)

@rphillips @ehashman

heartlock (Contributor)

+1, we met the same problem.

maaoBit (Contributor) commented Mar 28, 2022

I don't know much about volumeManager. It seems that there is a problem with the override mechanism.
Can mount and unmount of the same volume be executed at the same time? If it can, mount should not override any unmount operation. Otherwise, the mount operation should override all matched operations, not just the first one.

Dingshujie (Member, Author)

I don't know much about volumeManager. It seems that there is a problem with the override mechanism. Can mount and unmount of the same volume be executed at the same time? If it can, mount should not override any unmount operation.

I also don't know much about the design history, but the comments in the volume manager seem to contradict each other between UnmountDevice and UnmountVolume.

// UnmountVolume 
// All volume plugins can execute unmount/unmap for multiple pods referencing the
// same volume in parallel
podName := volumetypes.UniquePodName(volumeToUnmount.PodUID)

return oe.pendingOperations.Run(volumeToUnmount.VolumeName, podName, "" /* nodeName */, generatedOperations)

// UnmountDevice
// Avoid executing unmount/unmap device from multiple pods referencing
// the same volume in parallel
podName := nestedpendingoperations.EmptyUniquePodName

return oe.pendingOperations.Run(deviceToDetach.VolumeName, podName, "" /* nodeName */, generatedOperations)

We need someone who is familiar with it to explain.

Otherwise, the mount operation should override all matched operations, not just the first one.

Yes, we need to agree on the rules for when an operation is considered to already exist.

jsafrane (Member)

cc @gnufied

249043822 (Member)

Suppose there are the following cases:

  1. one volume's "volume_mount" should block the next same "volume_mount" or "umount_device" and all related pods' "volume_umount"

    exp1:
    Operations: {mount, pvname1, EmptyUniquePodName, pending=true}, then {mount, pvname1, EmptyUniquePodName}, {umount_device, pvname1, EmptyUniquePodName} and {umount, pvname1, pod1} could not be added
    Operations: {mount, pvname1, EmptyUniquePodName, pending=false}, then {mount, pvname1, EmptyUniquePodName}, {umount_device, pvname1, EmptyUniquePodName} and {umount, pvname1, pod1} could be added

  2. one volume's "volume_umount" should block the next same "volume_umount" and its "volume_mount" or "umount_device", but it should not block the related other pods' "volume_umount"
    exp2:
    Operations: {umount, pvname, pod1, pending=true} {umount, pvname, pod2, pending=true}, then {umount, pvname, pod2}, {mount, pvname, EmptyUniquePodName}, {umount_device, pvname, EmptyUniquePodName} could not be added
    Operations: {umount, pvname, pod1, pending=false} {umount, pvname, pod2, pending=true}, then {mount, pvname, EmptyUniquePodName}, {umount_device, pvname, EmptyUniquePodName} could not be added
    Operations: {umount, pvname, pod1, pending=true} {umount, pvname, pod2, pending=false}, then {mount, pvname, EmptyUniquePodName}, {umount_device, pvname, EmptyUniquePodName} could not be added
    Operations: {umount, pvname, pod1, pending=false} {umount, pvname, pod2, pending=false}, then {umount, pvname, pod2}, {mount, pvname, EmptyUniquePodName}, {umount_device, pvname, EmptyUniquePodName} could be added

  3. one volume's "umount_device" should block the next same "umount_device" or "volume_mount" and all related pods' "volume_umount"
    exp3:
    Operations: {umount_device, pvname1, EmptyUniquePodName, pending=true}, then {mount, pvname1, EmptyUniquePodName}, {umount_device, pvname1, EmptyUniquePodName} and {umount, pvname1, pod1} could not be added
    Operations: {umount_device, pvname1, EmptyUniquePodName, pending=false}, then {mount, pvname1, EmptyUniquePodName}, {umount_device, pvname1, EmptyUniquePodName} and {umount, pvname1, pod1} could be added

So we should improve isOperationExists to satisfy case 2, that is, the intermediate state Operations: {mount, pvname, EmptyUniquePodName, pending=true} {umount, pvname, pod2, pending=true} should not occur.

This logic seems to have remained unchanged since PR #28939.
/cc @jingxu97

ehashman added this to Triage in SIG Node Bugs Mar 29, 2022
SergeyKanzhelev (Member)

/remove-sig node

for SIG Storage to triage

k8s-ci-robot removed the sig/node label Mar 30, 2022
SergeyKanzhelev removed this from Triage in SIG Node Bugs Mar 30, 2022
kubernetes deleted a comment Apr 4, 2022
k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Jul 3, 2022
Dingshujie (Member, Author)

/remove-lifecycle stale

chenchun pushed a commit to chenchun/kubernetes that referenced this issue Mar 20, 2024
…nto 'tke/v1.20.6' (merge request !907)

fix nestedPendingOperations mount and umount parallel bug
Issue: kubernetes#109047
Cherry Pick: kubernetes#110951

See the issue for details: the volume manager has a timing bug that causes all mount operations for a given PV to get stuck, leaving every pod that uses this PV stuck in Creating; only restarting the kubelet recovers.