Add e2e test for CSI volume limits #80247
Conversation
Force-pushed 0a8fd7a to d28c254
framework.ExpectNoError(err, "failed waiting for PVs to be deleted")
}()

// Create <limit> PVCs and pods. With one pod created above, we are 1 pod over the limit.
I think we need more PVCs per pod, otherwise we will hit the 110-pods-per-node limit first.
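To make the constraint concrete, here is a rough back-of-the-envelope sketch in Go (illustrative only; the 110-pod default comes from this comment and 128 is the volume count tried later in the thread, the variable names are made up):

// Illustrative only: with one volume per pod, a 128-volume limit needs 128 pods,
// which already exceeds the default 110-pods-per-node limit. Packing several
// PVCs into each pod keeps the pod count below that ceiling.
const podsPerNodeLimit = 110
volumeLimit := 128                                     // e.g. what a CSI driver might report
volsPerPod := (volumeLimit / podsPerNodeLimit) + 1     // -> 2 volumes per pod here
numPods := (volumeLimit + volsPerPod - 1) / volsPerPod // -> 64 pods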
Why is it so low???
Anyway, I was not able to even create 129 PVs:
I0717 13:28:50.828] STEP: Waiting for 129 PVCs Bound
I0717 13:29:00.642] Jul 17 13:29:00.642: INFO: 30/129 of PVCs are Bound
I0717 13:29:07.375] Jul 17 13:29:07.375: INFO: 37/129 of PVCs are Bound
I0717 13:29:15.760] Jul 17 13:29:15.759: INFO: 43/129 of PVCs are Bound
I0717 13:29:22.415] Jul 17 13:29:22.415: INFO: 52/129 of PVCs are Bound
I0717 13:29:30.550] Jul 17 13:29:30.549: INFO: 65/129 of PVCs are Bound
I0717 13:29:37.465] Jul 17 13:29:37.465: INFO: 72/129 of PVCs are Bound
I0717 13:29:45.482] Jul 17 13:29:45.482: INFO: 80/129 of PVCs are Bound
I0717 13:29:53.365] Jul 17 13:29:53.365: INFO: 87/129 of PVCs are Bound
...
I0717 13:59:11.115] Jul 17 13:59:11.115: INFO: 105/129 of PVCs are Bound
I0717 13:59:11.116] Jul 17 13:59:11.115: INFO: Unexpected error occurred: timed out waiting for the condition
Are you sure it's not because pods couldn't be scheduled? The PD driver uses delayed binding.
@davidz627: GitHub didn't allow me to request PR reviews from the following users: hantaowang. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@jsafrane need me to run the GCE version or will you handle it? Or is the PR not ready yet?
@davidz627 it's getting run as part of pull-kubernetes-e2e-gce-csi-serial
I checked late binding both with AWS and mock driver, something is very fishy there. Some scheduler predicate allows only 16 unbound PVCs in a single pod to be scheduled, a pod with 17 unbound PVCs is Pending forever. My theory is that GCE and/or Azure predicate don't check […]
Force-pushed a5c811f to 2d2c6b4
Force-pushed 2d2c6b4 to cb5bd05
Force-pushed dedced8 to bcb9687
@davidz627 @msau42, so I tried with 128 volumes and a pod with 128 volumes failed to run:
Is there any off-by-one error here? Or is there any other pod using one slot (logging, metrics, ...) running? It would be great if you could debug the test on your end.
@jsafrane I will look into it
/retest
test/e2e/storage/drivers/in_tree.go (Outdated)
-	return testsuites.GetStorageClass(provisioner, parameters, nil, ns, suffix)
+	return testsuites.GetStorageClass(provisioner, parameters, &delayedBinding, ns, suffix)
do we still want to "turn on delayed binding" for these suites in this PR?
No, we don't. Removed
still here :)
Removed
// And one extra pod with a CSI volume should get Pending with a condition
// that says it's unschedulable because of volume limit.
// BEWARE: the test may create a lot of volumes and it's really slow.
ginkgo.It("should support volume limits [Slow][Serial]", func() {
would this test also be considered [Disruptive], as it would prevent further scheduling of pods using disks onto this node?
IMO, the test does not break anything (e.g. it does not kill kubelet). It just needs to be alone on a node, and that's why we have [Serial].
	ginkgo.Skip(fmt.Sprintf("driver %s does not support volume limits", driverInfo.Name))
}
var dDriver DynamicPVTestDriver
if dDriver = driver.(DynamicPVTestDriver); dDriver == nil {
I thought these checks were all done in some helper function somewhere that's shared between test suites.
Technically the error path is not reachable. Retyping driver to DynamicPVTestDriver should always succeed, but I have a safety check just to be sure.
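For what it's worth, a comma-ok assertion would keep that safety path reachable without risking a panic; a minimal sketch (not the code in this PR):

// Sketch: the single-result form driver.(DynamicPVTestDriver) panics if the
// assertion fails, so the nil check can never trigger. The comma-ok form
// returns false instead and lets the test skip gracefully.
dDriver, ok := driver.(DynamicPVTestDriver)
if !ok {
	ginkgo.Skip(fmt.Sprintf("driver %s does not support dynamic provisioning", driverInfo.Name))
}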
defer l.resource.cleanupResource()

// Prepare cleanup
defer func() {
maybe split some of this cleanup function into helper functions, it's very long and complicated right now
Moved to separate func.
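For readers following along, the extracted cleanup could look roughly like this (a sketch with a hypothetical helper name, assuming the usual e2e framework imports and the era's non-context client calls):

// Hypothetical helper: poll until every PV recorded during the test is gone,
// so the namespace teardown does not race with volume detach/delete.
func waitForAllPVsDeleted(cs clientset.Interface, pvNames sets.String, timeout time.Duration) error {
	return wait.Poll(5*time.Second, timeout, func() (bool, error) {
		remaining := 0
		for _, pvName := range pvNames.List() {
			_, err := cs.CoreV1().PersistentVolumes().Get(pvName, metav1.GetOptions{})
			switch {
			case err == nil:
				remaining++
			case apierrors.IsNotFound(err):
				// already deleted
			default:
				return false, err
			}
		}
		if remaining > 0 {
			e2elog.Logf("Waiting for %d PVs to be deleted", remaining)
			return false, nil
		}
		return true, nil
	})
}

The deferred cleanup can then delete the pods and PVCs and finish with framework.ExpectNoError(err, "failed waiting for PVs to be deleted"), as in the snippet quoted earlier in this thread.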
l.pvNames = sets.NewString()
ginkgo.By("Waiting for all PVCs to get Bound")
err = wait.Poll(5*time.Second, testSlowMultiplier*framework.PVBindingTimeout, func() (bool, error) {
is there an existing helper function for this?
Not for an array of PVCs, only for a single PVC.
Also, I need to collect PV names to make sure they're deleted before the test ends. It takes ages, and the test suite could end before the PVs are detached and deleted.
Moved to separate func.
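A sketch of what such a helper might look like (hypothetical name, assuming the usual e2e framework imports); it also records the bound PV names for the cleanup check discussed above:

// Hypothetical helper: wait until every PVC in the slice is Bound and collect
// the names of the PVs they bound to.
func waitForAllPVCsBound(cs clientset.Interface, ns string, pvcs []*v1.PersistentVolumeClaim, timeout time.Duration) (sets.String, error) {
	pvNames := sets.NewString()
	err := wait.Poll(5*time.Second, timeout, func() (bool, error) {
		bound := 0
		for _, pvc := range pvcs {
			cur, err := cs.CoreV1().PersistentVolumeClaims(ns).Get(pvc.Name, metav1.GetOptions{})
			if err != nil {
				return false, err
			}
			if cur.Status.Phase == v1.ClaimBound {
				bound++
				pvNames.Insert(cur.Spec.VolumeName)
			}
		}
		e2elog.Logf("%d/%d of PVCs are Bound", bound, len(pvcs))
		return bound == len(pvcs), nil
	})
	return pvNames, err
}

The returned set can then feed a PV-deletion wait like the cleanup sketch above.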
l.unschedulablePod, err = l.cs.CoreV1().Pods(l.ns.Name).Create(pod)

ginkgo.By("Waiting for the pod to get unschedulable")
err = wait.Poll(5*time.Second, framework.PodStartTimeout, func() (bool, error) {
helper function?
Found WaitForPodCondition, but it's not that much shorter.
Force-pushed d771ed3 to 29c688d
lgtm modulo the extra code change for delayed binding
test/e2e/storage/drivers/in_tree.go (Outdated)
-	return testsuites.GetStorageClass(provisioner, parameters, nil, ns, suffix)
+	return testsuites.GetStorageClass(provisioner, parameters, &delayedBinding, ns, suffix)
do we still want to turn on delayed binding?
Removed
test/e2e/storage/drivers/in_tree.go (Outdated)
-	return testsuites.GetStorageClass(provisioner, parameters, nil, ns, suffix)
+	return testsuites.GetStorageClass(provisioner, parameters, &delayedBinding, ns, suffix)
still here :)
Force-pushed 6906d21 to 7092875
/test pull-kubernetes-bazel-build
@davidz627, @msau42: I addressed the comments, could you take another look?
couple small comments, lgtm besides those
// Create <limit> PVCs for one gigantic pod.
for i := 0; i < limit; i++ {
	ginkgo.By(fmt.Sprintf("Creating volume %d/%d", i+1, limit))
this ginkgo.By seems like it's going to be pretty spammy, might not be necessary?
moved above loop
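i.e. something along these lines after the change (a sketch, not the exact merged code):

// One By() for the whole batch keeps the log from printing a line per volume.
ginkgo.By(fmt.Sprintf("Creating %d PVC(s)", limit))
for i := 0; i < limit; i++ {
	// ... create PVC number i ...
}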
}

// Create the gigantic pod with all <limit> PVCs
ginkgo.By("Creating pod")
nit: more context on what kind of pod
added
err = e2epod.WaitForPodCondition(l.cs, l.ns.Name, l.unschedulablePod.Name, "Unschedulable", framework.PodStartTimeout, func(pod *v1.Pod) (bool, error) {
	if pod.Status.Phase == v1.PodPending {
		for _, cond := range pod.Status.Conditions {
			matched, _ := regexp.MatchString("max.+volume.+count", cond.Message)
What could be between max, volume, and count besides a space? This regex is pretty prescriptive; it might as well be a substring match.
This is the string it's trying to match, so it's only a space.
I think @jsafrane wanted to use the same approach used in the mock driver.
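For comparison, the substring variant suggested above would be just (a sketch, assuming the Unschedulable condition message contains the literal phrase "max volume count"):

// Sketch of the substring alternative: match the literal phrase instead of a regex.
if pod.Status.Phase == v1.PodPending {
	for _, cond := range pod.Status.Conditions {
		if strings.Contains(cond.Message, "max volume count") {
			return true, nil
		}
	}
}
return false, nil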
if len(nodeList.Items) != 0 {
	nodeName = nodeList.Items[0].Name
} else {
	e2elog.Failf("Unable to find ready and schedulable Node")
nit: inconsistent usage of e2elog and ginkgo. Above we have a ginkgo.Fail("...") and here we have e2elog.Failf("..."), also logging through e2elog.Logf and ginkgo.By.
Replaced one usage of ginkgo.Fail and removed a redundant e2elog.Log. The rest seems consistent with other files (like provisioning.go), where (AFAIU) ginkgo is used to delimit/document the logic of the test and e2elog is used for logging additional information.
Looks like someone is trying to clean this up right now, let's be consistent with what they're doing: #81983
Done
/retest
/test pull-kubernetes-conformance-kind-ipv6 (flaky job)
lgtm, please rebase into fewer descriptive commits. Thanks!
Force-pushed e6cafc1 to 61ee449
The test creates as many PVs and pods as the driver/plugin reports it supports on a single node.
Force-pushed 61ee449 to 90d0991
/test pull-kubernetes-bazel-build
@davidz627, squashed commits and fixed some function calls (e.g., […])
/lgtm
/priority important-soon
/hold cancel
/retest
The test creates as many PVs and pods as the driver reports it supports on a single node.
Tested with AWS, it's really slow, but it works. Let's see how GCE works with 128 volumes :-)
/kind feature
/sig storage
Does this PR introduce a user-facing change?:
@msau42 @davidz627