GCE PD snapshot tests are flaky #88762

Closed
jsafrane opened this issue Mar 3, 2020 · 10 comments · Fixed by #88801
Labels
kind/flake Categorizes issue or PR as related to a flaky test. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

jsafrane commented Mar 3, 2020

Which jobs are flaking:
gce-cos-master-alpha-features

Which test(s) are flaking:
[sig-storage] CSI Volumes [Driver: pd.csi.storage.gke.io][Serial] [Testpattern: Dynamic PV (default fs)] provisioning should provision storage with snapshot data source [Feature:VolumeSnapshotDataSource]

Testgrid link:
https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-alpha-features

Reason for failure:
It takes 6 minutes to start a pod with a PVC that has been restored from a snapshot, while the test waits only 5 minutes:

02:40:22 Scheduled: Successfully assigned provisioning-3030/gcepd-client to bootstrap-e2e-minion-group-66r6
02:44:34 AttachVolume.Attach succeeded for volume "pvc-25bf62e1-6739-4e79-87a1-3b575faa4daf" 
02:46:36 Started container gcepd-client

csi-attacher sidecar says:

I0303 02:40:32.854820       1 csi_handler.go:222] Error processing "csi-1683f88fb840d9231ef604066b4ad499db02dd397e96aab70497b0331f73af8e": failed to attach: rpc error: code = Internal desc = unknown Attach error: failed cloud service attach disk call: googleapi: Error 400: The resource 'projects/k8s-gce-serial-1-5/zones/us-west1-b/disks/pvc-25bf62e1-6739-4e79-87a1-3b575faa4daf' is not ready, resourceNotReady

Perhaps the provisioner should wait until the volume is ready to be used? That may not help with this test timeout, though, as the PVC is provisioned after the Pod is scheduled, so the total time could end up the same.

All this is caused by my PR to test block snapshots, #88727, which switched the tests from using slowPodStartTimeout to podStartTimeout.

If snapshots are slow, should all persistent storage tests use WaitForPodRunningInNamespaceSlow / slowPodStartTimeout?

@msau42 @davidz627
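
For reference, a minimal sketch of the kind of wait WaitForPodRunningInNamespaceSlow performs, written against client-go directly rather than the e2e framework helpers. The 15-minute value, the 5-second poll interval, and the helper name are assumptions for illustration, not the framework's actual constants.

```go
package storagetest

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// illustrativeSlowPodStartTimeout stands in for the framework's slowPodStartTimeout;
// the value here is a guess, not the real constant.
const illustrativeSlowPodStartTimeout = 15 * time.Minute

// waitForPodRunningSlow polls the pod until it reaches Running or the slow
// timeout expires, failing early if the pod terminates first.
func waitForPodRunningSlow(cs kubernetes.Interface, ns, name string) error {
	return wait.PollImmediate(5*time.Second, illustrativeSlowPodStartTimeout, func() (bool, error) {
		pod, err := cs.CoreV1().Pods(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		switch pod.Status.Phase {
		case corev1.PodRunning:
			return true, nil
		case corev1.PodFailed, corev1.PodSucceeded:
			return false, fmt.Errorf("pod %s/%s terminated before reaching Running", ns, name)
		}
		return false, nil
	})
}
```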

@jsafrane jsafrane added the kind/flake Categorizes issue or PR as related to a flaky test. label Mar 3, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Mar 3, 2020
jsafrane commented Mar 3, 2020

/sig storage

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 3, 2020
msau42 commented Mar 3, 2020

Generally, cutting a snapshot is quick, but uploading it can take longer. Does the provisioner wait until the snapshot is "ready" before returning success?

msau42 commented Mar 3, 2020

@kubernetes/sig-storage-test-failures
cc @xing-yang @yuxiangqian

msau42 commented Mar 3, 2020

(But yes, we should increase the timeout because overall time will still be longer no matter which order we take)

xing-yang commented Mar 3, 2020

The provisioner does not wait. If the snapshot is not readyToUse, creating a volume from it will fail. I'll have to check whether there's logic to wait in the e2e test.

Regarding GCE PD: at restore time, do we need to download the snapshot from the object store? If so, it could take a while to download as well.

jingxu97 commented Mar 3, 2020

The volume controller should keep trying until the snapshot is ready:
https://github.com/kubernetes-csi/external-provisioner/blob/efcaee79e47446e38910a3c1a024824387fcf235/pkg/controller/controller.go#L894

It looks like the attacher tried to attach while the disk was not ready. That is strange, because I think attach only happens after the volume is successfully provisioned.
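
For illustration, the readiness gate in the linked controller code amounts to polling the snapshot's status until ReadyToUse is true before provisioning from it. The sketch below uses simplified stand-in types and a hypothetical getter, not the real external-snapshotter API.

```go
package storagetest

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// snapshotStatus is a simplified stand-in for the VolumeSnapshot(Content)
// status fields the provisioner inspects; the real types live in the
// external-snapshotter API packages.
type snapshotStatus struct {
	ReadyToUse *bool
}

// waitForSnapshotReady polls until the snapshot reports ReadyToUse=true.
// The getter function, interval, and timeout are illustrative only.
func waitForSnapshotReady(getStatus func() (*snapshotStatus, error), timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		status, err := getStatus()
		if err != nil {
			return false, err
		}
		// nil or false means the backend is still cutting/uploading the snapshot.
		return status.ReadyToUse != nil && *status.ReadyToUse, nil
	})
}
```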

yuxiangqian commented:

> The provisioner does not wait. If the snapshot is not readyToUse, creating a volume from it will fail. I'll have to check whether there's logic to wait in the e2e test.
>
> Regarding GCE PD: at restore time, do we need to download the snapshot from the object store? If so, it could take a while to download as well.

@xing-yang AFAIK, most cloud providers currently upload snapshots to some remote store when they are taken and download them again when restoring. GCE PD is one of them.

jsafrane commented Mar 4, 2020

It's not taking the snapshot that slows the test down, it's restoring it: CreateVolume(... from snapshot XYZ ...) returns quickly, but the returned volume is not really usable, and attach fails for ~4 minutes until the snapshot is fully restored somewhere in GCE.

Now that I think about it, maybe it's better that the driver returns the volume early, so it does not end up orphaned when the user deletes the corresponding PVC and the provisioner is restarted at the same time.

Anyway, I'll extend the timeout in the volume checks.

/assign

jsafrane commented Mar 4, 2020

> (But yes, we should increase the timeout because overall time will still be longer no matter which order we take)

Which brings me to an idea: what about other storage backends? Some of them may take even longer than 10 minutes to restore a snapshot, or more than 5 minutes to provision an empty volume and run a pod. There should be some way to configure the timeouts (a timeout multiplier?) for https://github.com/kubernetes/kubernetes/tree/master/test/e2e/storage/external
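
To make the timeout-multiplier idea concrete, here is a hypothetical sketch of a per-driver scaling factor applied to the framework's default timeouts; the type and field names below do not exist in the external test suite, they only illustrate the proposal.

```go
package storagetest

import "time"

// driverTimeouts is a hypothetical per-driver knob for the external storage
// test suite; nothing like this exists in the suite as of this issue.
type driverTimeouts struct {
	// Multiplier scales the framework's default timeouts; <= 0 means "use defaults".
	Multiplier float64
}

// scaled applies the multiplier to a default timeout.
func scaled(base time.Duration, t driverTimeouts) time.Duration {
	if t.Multiplier <= 0 {
		return base
	}
	return time.Duration(float64(base) * t.Multiplier)
}

// Example: a backend that needs three times longer to restore snapshots would
// set Multiplier: 3, turning a 5-minute pod-start timeout into 15 minutes.
```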

msau42 commented Mar 4, 2020

I opened kubernetes-sigs/gcp-compute-persistent-disk-csi-driver#482 to investigate from GCE side if this behavior is expected.
