GCE PD snapshot tests are flaky #88762
Comments
/sig storage

Generally, cutting a snapshot is quick, but uploading can take longer. Does the provisioner wait until the snapshot is "ready" before returning success?

@kubernetes/sig-storage-test-failures
(But yes, we should increase the timeout, because the overall time will still be longer no matter which order we do things in.)
The provisioner does not wait. If the snapshot is not readyToUse, creating a volume from it will fail. I'll have to check whether there's logic to wait in the e2e test. Regarding GCE PD: at restore time, do we need to download the snapshot from the object store? If so, it could take a while to download as well.
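As an illustration of the kind of wait being discussed, below is a minimal sketch of polling a VolumeSnapshot's `status.readyToUse` field through the dynamic client (the snapshot API is a CRD). The helper name, the `v1beta1` API version, and the polling structure are assumptions for illustration, not the actual e2e framework code, and it presumes a client-go recent enough for context-taking dynamic-client calls.

```go
package e2esnapshot

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/dynamic"
)

// VolumeSnapshot is a CRD, so this sketch reads it via the dynamic client.
var snapshotGVR = schema.GroupVersionResource{
	Group:    "snapshot.storage.k8s.io",
	Version:  "v1beta1", // the snapshot API version current around this issue
	Resource: "volumesnapshots",
}

// WaitForSnapshotReady (hypothetical helper) polls the snapshot until
// status.readyToUse is true, i.e. until the backend finished uploading it.
func WaitForSnapshotReady(c dynamic.Interface, ns, name string, poll, timeout time.Duration) error {
	return wait.PollImmediate(poll, timeout, func() (bool, error) {
		snap, err := c.Resource(snapshotGVR).Namespace(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			if apierrors.IsNotFound(err) {
				return false, err // snapshot is gone, stop waiting
			}
			return false, nil // transient API error, keep polling
		}
		ready, found, err := unstructured.NestedBool(snap.Object, "status", "readyToUse")
		if err != nil || !found {
			return false, nil // status not populated yet
		}
		return ready, nil
	})
}
```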
The volume controller should keep retrying until the snapshot is ready. It looks like the attacher tried to attach while the disk was not ready, which is strange, because I think attach only happens after the volume is successfully provisioned.
@xing-yang AFAIK, most cloud providers' snapshots are currently uploaded to some remote store when taken and downloaded again when restored. GCE PD is one of them.
It's not taking snapshots that slows the test, it's restoring them. Now that I think about it, maybe it's better if the driver returns the volume early, so it does not end up orphaned when the user deletes the corresponding PVC and the provisioner is restarted at the same time. Anyway, I'll extend the timeout in the volume checks.
/assign
Which brings me to an idea: what about other storage backends? Some of them may take even longer than 10 minutes to restore a snapshot, or more than 5 minutes to provision an empty volume and run a pod. There should be some way to configure timeouts (a timeout multiplier?) for https://github.com/kubernetes/kubernetes/tree/master/test/e2e/storage/external
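Nothing like this existed in the external test suite at the time; purely as a hypothetical sketch of what such a multiplier could look like (every name here is invented, not an upstream API):

```go
package external

import "time"

// TimeoutMultiplier is a hypothetical knob a driver definition could set;
// a slow backend might use 2 or 3 to stretch every framework timeout.
var TimeoutMultiplier = 1.0

// scaledTimeout stretches a base timeout by the configured multiplier, so a
// 5-minute pod-start wait becomes 10 or 15 minutes for slow storage.
func scaledTimeout(base time.Duration) time.Duration {
	return time.Duration(float64(base) * TimeoutMultiplier)
}
```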
I opened kubernetes-sigs/gcp-compute-persistent-disk-csi-driver#482 to investigate from the GCE side whether this behavior is expected.
Which jobs are flaking:
gce-cos-master-alpha-features
Which test(s) are flaking:
[sig-storage] CSI Volumes [Driver: pd.csi.storage.gke.io][Serial] [Testpattern: Dynamic PV (default fs)] provisioning should provision storage with snapshot data source [Feature:VolumeSnapshotDataSource]
Testgrid link:
https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-alpha-features
Reason for failure:
It takes 6 minutes to start a pod with a PVC that has been restored from a snapshot, while the test waits only 5 minutes:
csi-attacher sidecar says:
Perhaps the provisioner should wait until the volume is ready to be used? That may not help with this test timeout, as the PVC is provisioned after the Pod is scheduled, so the total time could be the same.
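For reference, the flaking test provisions a PVC whose dataSource points at a VolumeSnapshot; the restore happens while the volume is provisioned, which is the slow step. A sketch against the core/v1 API of that era (names like restored-pvc, my-snapshot, and the StorageClass are placeholders):

```go
package main

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// restoredPVC builds a claim that restores "my-snapshot" into a fresh volume.
// The provisioner performs the restore while creating the volume, and with
// late binding that happens after the pod is scheduled, so pod startup pays
// for the snapshot download.
func restoredPVC(scName string) *v1.PersistentVolumeClaim {
	apiGroup := "snapshot.storage.k8s.io"
	return &v1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "restored-pvc"},
		Spec: v1.PersistentVolumeClaimSpec{
			StorageClassName: &scName,
			AccessModes:      []v1.PersistentVolumeAccessMode{v1.ReadWriteOnce},
			DataSource: &v1.TypedLocalObjectReference{
				APIGroup: &apiGroup,
				Kind:     "VolumeSnapshot",
				Name:     "my-snapshot",
			},
			Resources: v1.ResourceRequirements{
				Requests: v1.ResourceList{
					v1.ResourceStorage: resource.MustParse("5Gi"),
				},
			},
		},
	}
}
```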
All this is caused by my PR to test block snapshots, #88727. It reworked the tests from using `slowPodStartTimeout` to `podStartTimeout`. If snapshots are slow, should all persistent storage tests use `WaitForPodRunningInNamespaceSlow` / `slowPodStartTimeout`?

@msau42 @davidz627
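A sketch of how a test could choose between the two wait variants mentioned above. The helper `waitForRestoredPod` and its selection logic are invented; `WaitForPodRunningInNamespace` and `WaitForPodRunningInNamespaceSlow` are the framework functions in question, with signatures as of roughly this era (check test/e2e/framework/pod/wait.go for the current defaults).

```go
package main

import (
	v1 "k8s.io/api/core/v1"
	clientset "k8s.io/client-go/kubernetes"
	e2epod "k8s.io/kubernetes/test/e2e/framework/pod"
)

// waitForRestoredPod picks the slow wait for pods backed by a
// snapshot-restored PVC, since restore can exceed the default timeout.
func waitForRestoredPod(c clientset.Interface, pod *v1.Pod, restoredFromSnapshot bool) error {
	if restoredFromSnapshot {
		// Uses the framework's slowPodStartTimeout instead of the default
		// podStartTimeout (5 minutes at the time of this issue).
		return e2epod.WaitForPodRunningInNamespaceSlow(c, pod.Name, pod.Namespace)
	}
	return e2epod.WaitForPodRunningInNamespace(c, pod)
}
```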