Add E2E stress test suite for creation / deletion of VolumeSnapshot resources #95971
Conversation
@chrishenzie: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi @chrishenzie. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
/assign @xing-yang @msau42
Force-pushed from 1ca3a6e to 8429d12.
Updated last commit to incorporate similar changes to here: setup / teardown of resources.
Force-pushed from 8429d12 to 442d0dd.
/test pull-kubernetes-conformance-kind-ipv6-parallel
Force-pushed from 442d0dd to 4ee69f6.
/retest
createPodsAndVolumes := func() {
	for i := 0; i < stressTest.testOptions.NumPods; i++ {
		framework.Logf("Creating resources for pod %v/%v", i, stressTest.testOptions.NumPods-1)
Use %d for i and stressTest.testOptions.NumPods-1
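For reference, a minimal sketch of the suggested change, reusing the variable names from the snippet above:

framework.Logf("Creating resources for pod %d/%d", i, stressTest.testOptions.NumPods-1)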
if _, err := cs.CoreV1().Pods(pod.Namespace).Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
	stressTest.cancel()
	framework.Failf("Failed to create pod-%v [%+v]. Error: %v", i, pod, err)
Use %d for i
if err := e2epod.WaitForPodRunningInNamespace(cs, pod); err != nil {
	stressTest.cancel()
	framework.Failf("Failed to wait for pod-%v [%+v] turn into running status. Error: %v", i, pod, err)
Use %d for i
var errs []error
for _, snapshot := range stressTest.snapshots {
	framework.Logf("Deleting snapshot %v", snapshot.Vs.GetName())
Can you print out the namespace as well (namespace/name)? Use %s.
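A small sketch of what the suggested log line could look like, assuming the snapshot object exposes GetNamespace() alongside GetName():

framework.Logf("Deleting snapshot %s/%s", snapshot.Vs.GetNamespace(), snapshot.Vs.GetName())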
	errs = append(errs, snapshot.CleanupResource())
}
for _, pod := range stressTest.pods {
	framework.Logf("Deleting pod %v", pod.Name)
%v -> %s
	errs = append(errs, e2epod.DeletePodWithWait(cs, pod))
}
for _, volume := range stressTest.volumes {
	framework.Logf("Deleting volume %v", volume.Pvc.GetName())
%v -> %s
	errs = append(errs, volume.CleanupResource())
}
errs = append(errs, tryFunc(stressTest.driverCleanup))
framework.ExpectNoError(errors.NewAggregate(errs), "While cleaning up resources")
While -> while
case <-stressTest.ctx.Done():
	return
default:
	framework.Logf("Pod-%v [%v], Iteration %v/%v", podIndex, pod.Name, j, stressTest.testOptions.NumSnapshots-1)
Use %d for podIndex, use %s for name, and %d for j and NumSnapshots
if err := snapshot.CleanupResource(); err != nil {
	stressTest.cancel()
	framework.Failf("Failed to delete snapshot for pod-%v [%+v]. Error: %v", podIndex, pod, err)
use %d for podIndex
stressTest.snapshots = append(stressTest.snapshots, snapshot)
stressTest.snapshotsMutex.Unlock()
if err := snapshot.CleanupResource(); err != nil { |
For testpatterns.DynamicSnapshotRetain, can you change the policy to Delete before deleting the VolumeSnapshot? This is to make sure we don't leave physical snapshot resources behind after cleaning up.
I think the CleanupResource method should be responsible for it instead of each test case
It looks like this is already handled inside of CleanupResource(): kubernetes/test/e2e/storage/testsuites/snapshottable.go, Lines 498 to 505 in c82d5ee
if boundVsContent.Object["spec"].(map[string]interface{})["deletionPolicy"] != "Delete" {
	// The purpose of this block is to prevent physical snapshotContent leaks.
	// We must update the SnapshotContent to have Delete Deletion policy,
	// or else the physical snapshot content will be leaked.
	boundVsContent.Object["spec"].(map[string]interface{})["deletionPolicy"] = "Delete"
	boundVsContent, err = dc.Resource(SnapshotContentGVR).Update(context.TODO(), boundVsContent, metav1.UpdateOptions{})
	framework.ExpectNoError(err)
}
Sure
Force-pushed from 67c938d to 61ef469.
test/e2e/storage/drivers/in_tree.go (outdated)
@@ -788,6 +788,11 @@ func InitHostPathDriver() testsuites.TestDriver {
	testsuites.CapSingleNodeVolume: true,
	testsuites.CapTopology: true,
},
StressTestOptions: &testsuites.StressTestOptions{
This should be added for csi hostpath, not intree hostpath
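For illustration, a hedged sketch of what adding the options to the CSI hostpath driver definition (rather than the in-tree one) could look like; the values are placeholders, not the ones from the PR:

StressTestOptions: &testsuites.StressTestOptions{
	// Illustrative values only; the PR sets these per driver.
	NumPods:      10,
	NumSnapshots: 10,
},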
// Name must be unique, so let's base it on namespace name
"name": ns + "-" + suffix,
// Name must be unique, so let's base it on namespace name and use GenerateName
"name": names.SimpleNameGenerator.GenerateName(ns + "-" + suffix),
can you create a followup issue to clean this up later?
tsInfo: TestSuiteInfo{
	Name: "snapshottable-stress",
	TestPatterns: []testpatterns.TestPattern{
		testpatterns.DynamicSnapshotDelete,
We should also add raw block support once the pattern is added in the other pr. cc @Jiawei0227
SG
// Check preconditions before setting up namespace via framework below.
ginkgo.BeforeEach(func() {
	driverInfo = driver.GetDriverInfo()
	if driverInfo.StressTestOptions == nil {
Hmm maybe fail? Right now it's really easy to miss that a test case got skipped.
Also can you add similar validation to the volume_stress suite?
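A rough sketch of the fail-instead-of-skip precondition the comment suggests, assuming the driverInfo variable from the snippet above:

ginkgo.BeforeEach(func() {
	driverInfo = driver.GetDriverInfo()
	if driverInfo.StressTestOptions == nil {
		// Fail loudly so a missing configuration is noticed instead of silently skipped.
		framework.Failf("Driver %q does not define StressTestOptions", driverInfo.Name)
	}
})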
	createPodsAndVolumes()
})
f.AddAfterEach("cleanup", func(f *framework.Framework, failed bool) { |
Thanks! Can you reference the bug in the comment so we can followup on it later?
@@ -21,7 +21,8 @@ spec:
  serviceAccountName: csi-gce-pd-controller-sa
  containers:
    - name: csi-snapshotter
      image: gcr.io/gke-release/csi-snapshotter:v2.1.1-gke.0
      # TODO: Replace this with the gke image once available.
I think it's fine to have the OSS tests use the OSS images.
@@ -21,7 +21,8 @@ spec:
  serviceAccountName: csi-gce-pd-controller-sa
  containers:
    - name: csi-snapshotter
      image: gcr.io/gke-release/csi-snapshotter:v2.1.1-gke.0
      # TODO: Replace this with the gke image once available.
      image: k8s.gcr.io/sig-storage/csi-snapshotter:v3.0.2
Can you also update the pdcsi image to v1.0.1-gke.0? There were some fixes related to snapshots I think.
Done
Force-pushed from 61ef469 to 4064950.
Is it possible we need to increase the snapshot timeout or make it configurable? The last failed test logs indicated some snapshots were ready but not all of them.
kubernetes/test/e2e/framework/util.go, Lines 134 to 135 in 396b90f
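One possible shape for making the timeout configurable, sketched with hypothetical names (SnapshotTimeout is not a field in this PR, and the plumbing is assumed, not the framework's actual API):

// Hypothetical sketch: let per-driver options override the default snapshot timeout.
snapshotTimeout := 5 * time.Minute // the framework default referenced above
if stressTest.testOptions.SnapshotTimeout > 0 { // hypothetical field
	snapshotTimeout = stressTest.testOptions.SnapshotTimeout
}
framework.Logf("Waiting up to %v for snapshots to become ready", snapshotTimeout)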
Force-pushed from 4064950 to bc50ef4.
Synced offline with @xing-yang, who found this in the logs:
it seems that GCE only allows for one snapshot to be created for a volume at a time, which is a cause of slowness in this test. I am going to test with fewer snapshots and see if it succeeds in time.
Yes, I think we need to either make the timeout longer or reduce the number of snapshots being created from the same volume at the same time. For example, it took more than 7 minutes for
Force-pushed from bc50ef4 to d3120a4.
@@ -490,6 +495,11 @@ func InitGcePDCSIDriver() testsuites.TestDriver {
	StressTestOptions: &testsuites.StressTestOptions{
		NumPods: 10,
Should we increase the number of pods here?
Maybe we need the snapshot test to use its own set of options, so it won't impact the volume stress test settings
I wanted to rename this to VolumeStressTestOptions, but it seems like we'd need to mark this as deprecated if users depend on this in their testdriver.yaml files. Maybe we can add some custom serialization logic and log a deprecation warning? What is the timeline for fully deprecating something like this?
Maybe just mark this with a TODO and a tracking issue for the time being, adding a custom decoder seems more involved.
How about we create a new VolumeSnapshotStressTestOptions?
Filed #96241 for this work.
Agreed, added. I can rename it to that to be even more specific.
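A rough sketch of the separate options type being discussed; the field names follow the conversation, but the exact shape is an assumption rather than the final API tracked in #96241:

// VolumeSnapshotStressTestOptions would sit alongside StressTestOptions in the
// testsuites package, so snapshot stress settings don't affect volume stress.
type VolumeSnapshotStressTestOptions struct {
	// NumPods is the number of pods (and volumes) to create.
	NumPods int
	// NumSnapshots is the number of snapshots to create per pod/volume.
	NumSnapshots int
}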
Force-pushed from 7272818 to bfc2e29.
/test pull-kubernetes-conformance-kind-ga-only-parallel
Can you squash your commits?
Introduces a new test suite that creates and deletes many VolumeSnapshots simultaneously to test snapshottable storage plugins under load.
Force-pushed from bfc2e29 to fb6bc4f.
/lgtm
@msau42 do you have more comments?
/approve
@xing-yang I noticed that the hostpath stress tests are running longer than the gce tests. That's a bit surprising, as the hostpath driver doesn't do anything and should be much faster. Can you take a look?
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: chrishenzie, msau42
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Maybe because hostpath driver creates 10 snapshots while gce creates 2 per volume? I'll take a look.
Ah I did my math wrong. Yes, I think the hostpath test creates 100 snapshots whereas the gce test creates 40. That makes sense, thanks!
If that is an issue let me know and I can reduce it to maybe 4 pods and 10 snapshots to be consistent.
@chrishenzie I don't think you need to change anything. That's fine. Thanks for the great work!
…-#95971-upstream-release-1.19 Automated cherry pick of #95971: E2E stress test suite for VolumeSnapshots
/sig storage
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR introduces an E2E test suite for stress testing the creation / deletion of VolumeSnapshot objects.
It works by spinning up a set of pods, and launching len(pods) goroutines which repeatedly create and delete VolumeSnapshots. Users can configure the number of pods and number of snapshots by setting NumPods and NumSnapshots in their testdriver.yaml.
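A simplified sketch of that pattern (not the PR's exact code); createSnapshot and deleteSnapshot are placeholders for the suite's snapshot resource helpers:

var wg sync.WaitGroup
for _, pod := range pods {
	wg.Add(1)
	go func(pod *v1.Pod) {
		defer wg.Done()
		for i := 0; i < numSnapshots; i++ {
			// Placeholders standing in for the suite's snapshot helpers.
			snapshot := createSnapshot(pod)
			deleteSnapshot(snapshot)
		}
	}(pod)
}
wg.Wait()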
Which issue(s) this PR fixes:
Fixes #95969
Special notes for your reviewer:
I split this into three distinct changes to make it easier for reviewing. I can squash everything if reviewers prefer.
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
Additional details:
The testdriver.yaml described above can be supplied to the e2e.test binary via -storage.testdriver. It serializes to this struct: kubernetes/test/e2e/storage/testsuites/testdriver.go, Lines 159 to 193 in da941d8
@msau42