
CSI failed to load vol_data.json because vol_data.json is removed #85280

Closed
fankangbest opened this issue Nov 14, 2019 · 11 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@fankangbest
Contributor

fankangbest commented Nov 14, 2019

What happened:
The pod is deleted with the CSI volume left on the node, and kubelet keeps logging errors:

E1114 11:23:00.248782   32090 reconciler.go:178] operationExecutor.UnmountVolume failed (controllerAttachDetachEnabled true) for volume "my-csi-volume" (UniqueName: "kubernetes.io/csi/hostpath.csi.k8s.io^4e1181c8-06d0-11ea-8e21-08002715efc3") pod "4d3f80f0-06d0-11ea-a936-08002715efc3" (UID: "4d3f80f0-06d0-11ea-a936-08002715efc3") : UnmountVolume.NewUnmounter failed for volume "my-csi-volume" (UniqueName: "kubernetes.io/csi/hostpath.csi.k8s.io^4e1181c8-06d0-11ea-8e21-08002715efc3") pod "4d3f80f0-06d0-11ea-a936-08002715efc3" (UID: "4d3f80f0-06d0-11ea-a936-08002715efc3") : open /var/lib/kubelet/pods/4d3f80f0-06d0-11ea-a936-08002715efc3/volumes/kubernetes.io~csi/pvc-45f3b32f-06d0-11ea-a936-08002715efc3/vol_data.json: no such file or directory
E1114 11:23:00.350949   32090 csi_util.go:81] kubernetes.io/csi: failed to open volume data file [/var/lib/kubelet/pods/4d3f80f0-06d0-11ea-a936-08002715efc3/volumes/kubernetes.io~csi/pvc-45f3b32f-06d0-11ea-a936-08002715efc3/vol_data.json]: open /var/lib/kubelet/pods/4d3f80f0-06d0-11ea-a936-08002715efc3/volumes/kubernetes.io~csi/pvc-45f3b32f-06d0-11ea-a936-08002715efc3/vol_data.json: no such file or directory
E1114 11:23:00.351179   32090 csi_plugin.go:452] kubernetes.io/csi: unmounter failed to load volume data file [/var/lib/kubelet/pods/4d3f80f0-06d0-11ea-a936-08002715efc3/volumes/kubernetes.io~csi/pvc-45f3b32f-06d0-11ea-a936-08002715efc3/mount]: open /var/lib/kubelet/pods/4d3f80f0-06d0-11ea-a936-08002715efc3/volumes/kubernetes.io~csi/pvc-45f3b32f-06d0-11ea-a936-08002715efc3/vol_data.json: no such file or directory
E1114 11:23:00.351228   32090 reconciler.go:178] operationExecutor.UnmountVolume failed (controllerAttachDetachEnabled true) for volume "my-csi-volume" (UniqueName: "kubernetes.io/csi/hostpath.csi.k8s.io^4e1181c8-06d0-11ea-8e21-08002715efc3") pod "4d3f80f0-06d0-11ea-a936-08002715efc3" (UID: "4d3f80f0-06d0-11ea-a936-08002715efc3") : UnmountVolume.NewUnmounter failed for volume "my-csi-volume" (UniqueName: "kubernetes.io/csi/hostpath.csi.k8s.io^4e1181c8-06d0-11ea-8e21-08002715efc3") pod "4d3f80f0-06d0-11ea-a936-08002715efc3" (UID: "4d3f80f0-06d0-11ea-a936-08002715efc3") : open /var/lib/kubelet/pods/4d3f80f0-06d0-11ea-a936-08002715efc3/volumes/kubernetes.io~csi/pvc-45f3b32f-06d0-11ea-a936-08002715efc3/vol_data.json: no such file or directory

After restarting kubelet, an orphaned pod is found:

Orphaned pod "d5c8dcb7-06b7-11ea-a936-08002715efc3" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them.

What you expected to happen:

  1. the volume can be deleted correctly;
  2. the pod cannot be deleted if the volume is left on the node.

How to reproduce it (as minimally and precisely as possible):

  1. create a pod with a CSI volume
  2. touch a file in the volume dir, so that deleting the volume dir fails later:
    touch /var/lib/kubelet/pods/4d3f80f0-06d0-11ea-a936-08002715efc3/volumes/kubernetes.io~csi/pvc-45f3b32f-06d0-11ea-a936-08002715efc3/test
  3. delete the pod
  4. the pod is terminating, and kubelet fails to load the volume data:
E1114 11:19:26.722339   32090 csi_util.go:81] kubernetes.io/csi: failed to open volume data file [/var/lib/kubelet/pods/4d3f80f0-06d0-11ea-a936-08002715efc3/volumes/kubernetes.io~csi/pvc-45f3b32f-06d0-11ea-a936-08002715efc3/vol_data.json]: open /var/lib/kubelet/pods/4d3f80f0-06d0-11ea-a936-08002715efc3/volumes/kubernetes.io~csi/pvc-45f3b32f-06d0-11ea-a936-08002715efc3/vol_data.json: no such file or directory
  5. remove the test file:
    rm /var/lib/kubelet/pods/4d3f80f0-06d0-11ea-a936-08002715efc3/volumes/kubernetes.io~csi/pvc-45f3b32f-06d0-11ea-a936-08002715efc3/test
  6. the pod still cannot be deleted
  7. restart kubelet
  8. the pod is deleted, but the volume directory is left on the node

Anything else we need to know?:
To tear down a volume, removeMountDir() is called:

// removeMountDir cleans the mount dir when dir is not mounted and removed the volume data file in dir
func removeMountDir(plug *csiPlugin, mountPath string) error {
	klog.V(4).Info(log("removing mount path [%s]", mountPath))

	mnt, err := isDirMounted(plug, mountPath)
	if err != nil {
		return err
	}
	if !mnt {
		klog.V(4).Info(log("dir not mounted, deleting it [%s]", mountPath))
		if err := os.Remove(mountPath); err != nil && !os.IsNotExist(err) {
			klog.Error(log("failed to remove dir [%s]: %v", mountPath, err))
			return err
		}
		// remove volume data file as well
		volPath := path.Dir(mountPath)
		dataFile := path.Join(volPath, volDataFileName)
		klog.V(4).Info(log("also deleting volume info data file [%s]", dataFile))
		if err := os.Remove(dataFile); err != nil && !os.IsNotExist(err) {
			klog.Error(log("failed to delete volume data file [%s]: %v", dataFile, err))
			return err
		}

		// remove volume path
		klog.V(4).Info(log("deleting volume path [%s]", volPath))
		if err := os.Remove(volPath); err != nil && !os.IsNotExist(err) {
			klog.Error(log("failed to delete volume path [%s]: %v", volPath, err))
			return err
		}
	}
	return nil
}

When os.Remove(volPath) fails for some reason, the volume path is left on the node, even though the mount path and vol_data.json have already been deleted.
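
One way to avoid stranding the directory without its metadata is to delete vol_data.json only once it is the last entry left in the volume directory. The following is a minimal, standalone sketch of that ordering, not the actual kubelet code; removeMountDirSafely and the example path are hypothetical names:

// Hypothetical standalone sketch, not kubelet code: delete vol_data.json only
// when it is the last entry left in the volume dir, so a failed cleanup never
// strands the dir without its metadata file.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

const volDataFileName = "vol_data.json"

func removeMountDirSafely(mountPath string) error {
	// the mount dir is assumed to be already unmounted at this point
	if err := os.Remove(mountPath); err != nil && !os.IsNotExist(err) {
		return fmt.Errorf("failed to remove mount dir %s: %w", mountPath, err)
	}

	volPath := filepath.Dir(mountPath)
	entries, err := os.ReadDir(volPath)
	if err != nil {
		return err
	}
	// if anything other than vol_data.json is still present (e.g. a stray
	// file like the "test" file above), bail out before touching the data
	// file so a later retry can still construct an unmounter
	for _, e := range entries {
		if e.Name() != volDataFileName {
			return fmt.Errorf("unexpected entry %q in %s, keeping %s", e.Name(), volPath, volDataFileName)
		}
	}

	// vol_data.json is the only entry left; it is now safe to remove it and
	// the volume dir itself
	if err := os.Remove(filepath.Join(volPath, volDataFileName)); err != nil && !os.IsNotExist(err) {
		return err
	}
	if err := os.Remove(volPath); err != nil && !os.IsNotExist(err) {
		return err
	}
	return nil
}

func main() {
	// hypothetical path layout mirroring <volPath>/mount
	if err := removeMountDirSafely("/tmp/example-vol/mount"); err != nil {
		fmt.Println("cleanup failed:", err)
	}
}

With this ordering an error is returned before vol_data.json is touched, so a later unmount retry still finds the metadata it needs.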

Kubelet will try to remove this volume again with NewUnmounter():

func (p *csiPlugin) NewUnmounter(specName string, podUID types.UID) (volume.Unmounter, error) {
	klog.V(4).Infof(log("setting up unmounter for [name=%v, podUID=%v]", specName, podUID))

	unmounter := &csiMountMgr{
		plugin:       p,
		podUID:       podUID,
		specVolumeID: specName,
	}

	// load volume info from file
	dir := unmounter.GetPath()
	dataDir := path.Dir(dir) // dropoff /mount at end
	data, err := loadVolumeData(dataDir, volDataFileName)
	if err != nil {
		klog.Error(log("unmounter failed to load volume data file [%s]: %v", dir, err))
		return nil, err
	}
	unmounter.driverName = csiDriverName(data[volDataKey.driverName])
	unmounter.volumeID = data[volDataKey.volHandle]
	unmounter.csiClientGetter.driverName = unmounter.driverName

	return unmounter, nil
}

NewUnmounter fails to load vol_data.json and cannot construct the unmounter, because vol_data.json has already been deleted.

After restarting kubelet, PodResourcesAreReclaimed() passes for this pod because volume reconstruction failed, so the pod is removed by kubelet while the volume directory is left on disk.
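
A possible direction for the retry path, again only a hedged sketch with hypothetical names rather than the kubelet implementation: let the caller of loadVolumeData distinguish a missing vol_data.json via os.IsNotExist and treat it as "metadata already gone, decide whether the leftover dir can be garbage-collected" instead of failing the unmount forever:

// Hypothetical sketch, not kubelet code: surface "vol_data.json is missing"
// as a distinguishable condition instead of a generic unmounter failure.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

const volDataFileName = "vol_data.json"

// loadVolumeData reads the JSON metadata file from dataDir and preserves the
// original error, so callers can check os.IsNotExist on it.
func loadVolumeData(dataDir, fileName string) (map[string]string, error) {
	b, err := os.ReadFile(filepath.Join(dataDir, fileName))
	if err != nil {
		return nil, err
	}
	data := map[string]string{}
	if err := json.Unmarshal(b, &data); err != nil {
		return nil, err
	}
	return data, nil
}

func main() {
	dataDir := "/tmp/example-vol" // hypothetical volume dir
	if _, err := loadVolumeData(dataDir, volDataFileName); os.IsNotExist(err) {
		// metadata is already gone: rather than failing the unmount forever,
		// a caller could verify the dir is empty and remove it, or flag it
		// for operator cleanup.
		fmt.Println("volume data file missing, leftover dir may need cleanup:", dataDir)
	} else if err != nil {
		fmt.Println("failed to load volume data:", err)
	} else {
		fmt.Println("loaded volume data from", dataDir)
	}
}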

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.3", GitCommit:"5e53fd6bc17c0dec8434817e69b04a25d8ae0ff0", GitTreeState:"archive", BuildDate:"2019-06-25T04:19:35Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.3", GitCommit:"5e53fd6bc17c0dec8434817e69b04a25d8ae0ff0", GitTreeState:"archive", BuildDate:"2019-11-12T11:16:36Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
    NAME="Ubuntu"
    VERSION="18.04.2 LTS (Bionic Beaver)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 18.04.2 LTS"
    VERSION_ID="18.04"
    HOME_URL="https://www.ubuntu.com/"
    SUPPORT_URL="https://help.ubuntu.com/"
    BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
    PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
    VERSION_CODENAME=bionic
    UBUNTU_CODENAME=bionic
  • Kernel (e.g. uname -a):
    Linux ubuntu 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:
@fankangbest fankangbest added the kind/bug Categorizes issue or PR as related to a bug. label Nov 14, 2019
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 14, 2019
@fankangbest
Contributor Author

/assign

@fankangbest
Contributor Author

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 15, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 16, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 17, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@joshimoo

@fankangbest was there any change in this area that got merged?

@xenuser

xenuser commented Oct 25, 2022

Issue still happens for us on RKE 1, Kubernetes v1.19.10 and Longhorn 1.3.1
Is there nobody looking at this issue?

@xenuser

xenuser commented Oct 25, 2022

/reopen

@k8s-ci-robot
Contributor

@xenuser: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vadasambar
Member

This issue still happens in Kubernetes 1.22.15.
