
[Flaking test] CSI Volumes [Driver: pd.csi.storage.gke.io][Serial] [Testpattern: Dynamic PV (default fs)] subPath should unmount if pod is gracefully deleted while kubelet is down #75326

Open
mariantalla opened this Issue Mar 13, 2019 · 5 comments

Comments

@mariantalla
Contributor

mariantalla commented Mar 13, 2019

Which jobs are flaking:

  • ci-kubernetes-e2e-gce-new-master-upgrade-cluster
  • ci-kubernetes-e2e-gce-master-new-downgrade-cluster
  • ci-kubernetes-e2e-gce-new-master-upgrade-master

Which test(s) are flaking:
[sig-storage] CSI Volumes [Driver: pd.csi.storage.gke.io][Serial] [Testpattern: Dynamic PV (default fs)] subPath should unmount if pod is gracefully deleted while kubelet is down [Disruptive][Slow]

Testgrid link:

Reason for failure:

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/testsuites/subpath.go:305
Expected grep stdout to be empty (i.e. no mount found).
Expected
    <string>: /dev/sdb on /var/lib/kubelet/pods/e7cca97f-3b9e-11e9-9081-42010a8a0002/volumes/kubernetes.io~csi/pvc-e2c79d44-3b9e-11e9-9081-42010a8a0002/mount type ext4 (rw,relatime,data=ordered)
    /dev/sdb on /var/lib/kubelet/pods/e7cca97f-3b9e-11e9-9081-42010a8a0002/volumes/kubernetes.io~csi/pvc-e2c79d44-3b9e-11e9-9081-42010a8a0002/mount type ext4 (rw,relatime,data=ordered)
    /dev/sdb on /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/pods/e7cca97f-3b9e-11e9-9081-42010a8a0002/volumes/kubernetes.io~csi/pvc-e2c79d44-3b9e-11e9-9081-42010a8a0002/mount type ext4 (rw,relatime,data=ordered)
    /dev/sdb on /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/pods/e7cca97f-3b9e-11e9-9081-42010a8a0002/volumes/kubernetes.io~csi/pvc-e2c79d44-3b9e-11e9-9081-42010a8a0002/mount type ext4 (rw,relatime,data=ordered)
    
to be empty
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/utils/utils.go:255
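
For reference, here is a rough, local sketch (not the actual e2e utility, which greps the node's mount output remotely, hence "grep stdout" in the message above) of the kind of check this assertion corresponds to: scan the mount table for any entry under the pod's volume directory and expect no matches. The directory passed in below is hypothetical.

```go
// Illustrative sketch only: approximate the "no mount found" check by scanning
// /proc/mounts for entries under a pod volume directory. The real test runs the
// equivalent check on the node itself.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// hasLeftoverMounts reports whether any mount entry references dir.
// dir is a hypothetical example, e.g. "/var/lib/kubelet/pods/<uid>/volumes".
func hasLeftoverMounts(dir string) (bool, error) {
	f, err := os.Open("/proc/mounts")
	if err != nil {
		return false, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Each line looks like: "<device> <mountpoint> <fstype> <options> 0 0".
		fields := strings.Fields(scanner.Text())
		if len(fields) >= 2 && strings.HasPrefix(fields[1], dir) {
			return true, nil
		}
	}
	return false, scanner.Err()
}

func main() {
	leftover, err := hasLeftoverMounts("/var/lib/kubelet/pods")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("leftover mounts found:", leftover)
}
```

In the failing runs above, the equivalent of this check still finds the PD volume mounted (twice, plus the containerized-mounter copies) after the pod was deleted.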

Anything else we need to know:

  • This fails about 15%-18% (up to 23%) of the time in the sig-release dashboards
  • Important to resolve due to the significant flake rate (#75045 and #75275 are higher priority)
  • The same test for a different driver was recently stabilised: #75196
  • triage link (points to jobs on sig-release dashboards only)

/sig storage
/kind flake
/remove-kind failing-test
/priority important-soon

@mariantalla

Contributor Author

mariantalla commented Mar 13, 2019

@msau42 Could you have a look, please? (This is for graceful pod deletion with the pd.csi.storage.gke.io driver.)

Adding it to v1.14 unless we learn otherwise.

/milestone v1.14

@msau42

Member

msau42 commented Mar 13, 2019

gce-new-master-upgrade-master: this test job only upgrades the master, and not the nodes. The fix is only in master/1.14, so we'll need to backport it to 1.13 to fix the flakiness here.

gce-new-master-upgrade-cluster: the failure is due to the cluster upgrade failing, which left the nodes at 1.13, which still has the issue. Fixing the cluster upgrade should address this. We can also backport the fix to 1.13.

gce-master-new-downgrade-cluster: this job downgrades the cluster to 1.13, which doesn't have the fix. We'll need to backport the fix to 1.13 to fix the flakiness.

@msau42

Member

msau42 commented Mar 13, 2019

/assign

@msau42

Member

msau42 commented Mar 14, 2019

@cofyc can you look at cherry-picking the two fixes to 1.13?

@msau42

Member

msau42 commented Mar 21, 2019

I think I figured out the root cause of the CSI hostpath flakiness (not the pd driver flakiness, though). It is the same issue as #72346 (comment). Pod deletion uses IsLikelyNotMountPoint to check that there are no more mounted pod directories, but that check doesn't work for the hostpath CSI driver, since its volume is just a directory in the container rootfs.
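
For anyone following along, a rough sketch (not the kubelet's actual code) of the device-number heuristic that IsLikelyNotMountPoint is based on, Linux-only and with a hypothetical path:

```go
// Sketch of the heuristic: a path whose device number matches its parent's is
// "likely not a mount point". A hostpath CSI volume is just a directory on the
// same device as its parent, so this check cannot tell a volume that still
// needs cleanup apart from an ordinary directory.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

func isLikelyNotMountPoint(path string) (bool, error) {
	stat, err := os.Lstat(path)
	if err != nil {
		return true, err
	}
	parentStat, err := os.Lstat(filepath.Dir(path))
	if err != nil {
		return true, err
	}
	// Same device number as the parent => probably not a separate mount.
	return stat.Sys().(*syscall.Stat_t).Dev == parentStat.Sys().(*syscall.Stat_t).Dev, nil
}

func main() {
	notMnt, err := isLikelyNotMountPoint("/var/lib/kubelet/pods") // hypothetical path
	fmt.Println(notMnt, err)
}
```

That is why the check misbehaves for the hostpath driver while real block-device mounts (like the PD driver's) are detected correctly.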
