GCE PD snapshot tests are flaky #88762

Closed
jsafrane opened this issue Mar 3, 2020 · 10 comments · Fixed by #88801
Labels
kind/flake Categorizes issue or PR as related to a flaky test. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

jsafrane commented Mar 3, 2020

Which jobs are flaking:
gce-cos-master-alpha-features

Which test(s) are flaking:
[sig-storage] CSI Volumes [Driver: pd.csi.storage.gke.io][Serial] [Testpattern: Dynamic PV (default fs)] provisioning should provision storage with snapshot data source [Feature:VolumeSnapshotDataSource]

Testgrid link:
https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-alpha-features

Reason for failure:
It takes 6 minutes to start a pod with a PVC that has been restored from a snapshot, while the test waits only 5 minutes:

02:40:22 Scheduled: Successfully assigned provisioning-3030/gcepd-client to bootstrap-e2e-minion-group-66r6
02:44:34 AttachVolume.Attach succeeded for volume "pvc-25bf62e1-6739-4e79-87a1-3b575faa4daf" 
02:46:36 Started container gcepd-client

csi-attacher sidecar says:

I0303 02:40:32.854820       1 csi_handler.go:222] Error processing "csi-1683f88fb840d9231ef604066b4ad499db02dd397e96aab70497b0331f73af8e": failed to attach: rpc error: code = Internal desc = unknown Attach error: failed cloud service attach disk call: googleapi: Error 400: The resource 'projects/k8s-gce-serial-1-5/zones/us-west1-b/disks/pvc-25bf62e1-6739-4e79-87a1-3b575faa4daf' is not ready, resourceNotReady

Perhaps the provisioner should wait until the volume is ready to be used? That may not help with this test timeout, though, as the PVC is provisioned after the Pod is scheduled, so the total time could end up the same.

All this is caused by my PR to test block snapshots, #88727, which switched the tests from using slowPodStartTimeout to podStartTimeout.

If snapshots are slow, should all persistent storage tests use WaitForPodRunningInNamespaceSlow / slowPodStartTimeout?

@msau42 @davidz627
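
For reference, a minimal sketch of the kind of wait WaitForPodRunningInNamespaceSlow performs, written against client-go directly rather than the e2e framework helpers. The 15-minute value, the 5-second poll interval, and the helper name are assumptions for illustration, not the framework's actual constants.

```go
package storagetest

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// illustrativeSlowPodStartTimeout stands in for the framework's slowPodStartTimeout;
// the value here is a guess, not the real constant.
const illustrativeSlowPodStartTimeout = 15 * time.Minute

// waitForPodRunningSlow polls the pod until it reaches Running or the slow
// timeout expires, failing early if the pod terminates first.
func waitForPodRunningSlow(cs kubernetes.Interface, ns, name string) error {
	return wait.PollImmediate(5*time.Second, illustrativeSlowPodStartTimeout, func() (bool, error) {
		pod, err := cs.CoreV1().Pods(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		switch pod.Status.Phase {
		case corev1.PodRunning:
			return true, nil
		case corev1.PodFailed, corev1.PodSucceeded:
			return false, fmt.Errorf("pod %s/%s terminated before reaching Running", ns, name)
		}
		return false, nil
	})
}
```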

@jsafrane jsafrane added the kind/flake Categorizes issue or PR as related to a flaky test. label Mar 3, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Mar 3, 2020
jsafrane commented Mar 3, 2020

/sig storage

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 3, 2020
msau42 commented Mar 3, 2020

Generally, cutting a snapshot is quick, but uploading it can take longer. Does the provisioner wait until the snapshot is "ready" before returning success?

msau42 commented Mar 3, 2020

@kubernetes/sig-storage-test-failures
cc @xing-yang @yuxiangqian

msau42 commented Mar 3, 2020

(But yes, we should increase the timeout because overall time will still be longer no matter which order we take)

xing-yang commented Mar 3, 2020

The provisioner does not wait. If the snapshot is not readyToUse, creating a volume from it will fail. I'll have to check whether there's logic to wait in the e2e test.

Regarding GCE PD: at restore time, do we need to download the snapshot from the object store? If so, it could take a while to download as well.

jingxu97 commented Mar 3, 2020

The volume controller should keep trying until the snapshot is ready:
https://github.com/kubernetes-csi/external-provisioner/blob/efcaee79e47446e38910a3c1a024824387fcf235/pkg/controller/controller.go#L894

It looks like the attacher tried to attach while the disk was not ready. That is strange, because I think attach only happens after the volume is successfully provisioned.
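
For illustration, the readiness gate in the linked controller code amounts to polling the snapshot's status until ReadyToUse is true before provisioning from it. The sketch below uses simplified stand-in types and a hypothetical getter, not the real external-snapshotter API.

```go
package storagetest

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// snapshotStatus is a simplified stand-in for the VolumeSnapshot(Content)
// status fields the provisioner inspects; the real types live in the
// external-snapshotter API packages.
type snapshotStatus struct {
	ReadyToUse *bool
}

// waitForSnapshotReady polls until the snapshot reports ReadyToUse=true.
// The getter function, interval, and timeout are illustrative only.
func waitForSnapshotReady(getStatus func() (*snapshotStatus, error), timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		status, err := getStatus()
		if err != nil {
			return false, err
		}
		// nil or false means the backend is still cutting/uploading the snapshot.
		return status.ReadyToUse != nil && *status.ReadyToUse, nil
	})
}
```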

yuxiangqian commented:

> The provisioner does not wait. If the snapshot is not readyToUse, creating a volume from it will fail. I'll have to check whether there's logic to wait in the e2e test.
>
> Regarding GCE PD: at restore time, do we need to download the snapshot from the object store? If so, it could take a while to download as well.

@xing-yang AFAIK, most cloud providers currently upload snapshots to some remote store when they are taken and download them again when restoring. GCE PD is one of them.

jsafrane commented Mar 4, 2020

It's not taking the snapshot that slows the test down, it's restoring it: CreateVolume(... from snapshot XYZ ...) returns quickly, but the returned volume is not really usable, and attach fails for ~4 minutes until the snapshot is fully restored somewhere in GCE.

Now that I think about it, maybe it's better that the driver returns the volume early, so it does not end up orphaned when the user deletes the corresponding PVC and the provisioner is restarted at the same time.

Anyway, I'll extend the timeout in the volume checks.

/assign

jsafrane commented Mar 4, 2020

> (But yes, we should increase the timeout because overall time will still be longer no matter which order we take)

Which brings me to an idea: what about other storage backends? Some of them may take even longer than 10 minutes to restore a snapshot, or more than 5 minutes to provision an empty volume and run a pod. There should be some way to configure the timeouts (a timeout multiplier?) for https://github.com/kubernetes/kubernetes/tree/master/test/e2e/storage/external
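
To make the timeout-multiplier idea concrete, here is a hypothetical sketch of a per-driver scaling factor applied to the framework's default timeouts; the type and field names below do not exist in the external test suite, they only illustrate the proposal.

```go
package storagetest

import "time"

// driverTimeouts is a hypothetical per-driver knob for the external storage
// test suite; nothing like this exists in the suite as of this issue.
type driverTimeouts struct {
	// Multiplier scales the framework's default timeouts; <= 0 means "use defaults".
	Multiplier float64
}

// scaled applies the multiplier to a default timeout.
func scaled(base time.Duration, t driverTimeouts) time.Duration {
	if t.Multiplier <= 0 {
		return base
	}
	return time.Duration(float64(base) * t.Multiplier)
}

// Example: a backend that needs three times longer to restore snapshots would
// set Multiplier: 3, turning a 5-minute pod-start timeout into 15 minutes.
```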

msau42 commented Mar 4, 2020

I opened kubernetes-sigs/gcp-compute-persistent-disk-csi-driver#482 to investigate from GCE side if this behavior is expected.
