
Add e2e test for CSI volume limits #80247

Merged: 1 commit merged into kubernetes:master on Aug 29, 2019

Conversation

jsafrane (Member)

The test creates as many PVs and pods as the driver reports to support on a single node.

Tested with AWS; it's really slow, but it works. Let's see how GCE works with 128 volumes :-)

/kind feature
/sig storage

Does this PR introduce a user-facing change?:

NONE

@msau42 @davidz627
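For context, the per-node number the test is built around is the volume limit a CSI driver reports through its CSINode object. A minimal sketch of reading it, assuming the 1.16-era storage.k8s.io/v1beta1 client; cs, nodeName and driverName are placeholders, and this is not necessarily how the merged test obtains the limit:

```go
// Sketch only: read the per-node attachable volume limit a CSI driver reports.
csiNode, err := cs.StorageV1beta1().CSINodes().Get(nodeName, metav1.GetOptions{})
framework.ExpectNoError(err, "failed to get CSINode %s", nodeName)
limit := 0
for _, d := range csiNode.Spec.Drivers {
	// Allocatable.Count is the driver-reported maximum number of volumes on this node.
	if d.Name == driverName && d.Allocatable != nil && d.Allocatable.Count != nil {
		limit = int(*d.Allocatable.Count)
	}
}
framework.Logf("driver %s reports a limit of %d volumes on node %s", driverName, limit, nodeName)
```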

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/storage Categorizes an issue or PR as relevant to SIG Storage. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 17, 2019
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Jul 17, 2019
@jsafrane jsafrane force-pushed the e2e-volume-limits branch 2 times, most recently from 0a8fd7a to d28c254 Compare July 17, 2019 12:25
framework.ExpectNoError(err, "failed waiting for PVs to be deleted")
}()

// Create <limit> PVCs and pods. With the one pod created above, we are one pod over the limit.
Member
I think we need more PVCs per pod, otherwise we will hit the 110-pods-per-node limit first.

@jsafrane (Member Author), Jul 17, 2019

Why is it so low???
Anyway, I was not able to create even 129 PVs:

I0717 13:28:50.828] STEP: Waiting for 129 PVCs Bound
I0717 13:29:00.642] Jul 17 13:29:00.642: INFO: 30/129 of PVCs are Bound
I0717 13:29:07.375] Jul 17 13:29:07.375: INFO: 37/129 of PVCs are Bound
I0717 13:29:15.760] Jul 17 13:29:15.759: INFO: 43/129 of PVCs are Bound
I0717 13:29:22.415] Jul 17 13:29:22.415: INFO: 52/129 of PVCs are Bound
I0717 13:29:30.550] Jul 17 13:29:30.549: INFO: 65/129 of PVCs are Bound
I0717 13:29:37.465] Jul 17 13:29:37.465: INFO: 72/129 of PVCs are Bound
I0717 13:29:45.482] Jul 17 13:29:45.482: INFO: 80/129 of PVCs are Bound
I0717 13:29:53.365] Jul 17 13:29:53.365: INFO: 87/129 of PVCs are Bound

...

I0717 13:59:11.115] Jul 17 13:59:11.115: INFO: 105/129 of PVCs are Bound
I0717 13:59:11.116] Jul 17 13:59:11.115: INFO: Unexpected error occurred: timed out waiting for the condition

Member
Are you sure it's not because the pods couldn't be scheduled? The PD driver uses delayed binding.

@davidz627 (Contributor)

/cc @davidz627 @hantaowang

@k8s-ci-robot (Contributor)

@davidz627: GitHub didn't allow me to request PR reviews from the following users: hantaowang.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @davidz627 @hantaowang

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@davidz627 (Contributor)

@jsafrane do you need me to run the GCE version, or will you handle it? Or is the PR not ready yet?

@msau42 (Member), Jul 17, 2019

@davidz627 it's getting run as part of pull-kubernetes-e2e-gce-csi-serial

@jsafrane (Member Author)

I checked late binding with both AWS and the mock driver, and something is very fishy there. Some scheduler predicate allows only 16 unbound PVCs in a single pod to be scheduled; a pod with 17 unbound PVCs is Pending forever. My theory is that the GCE and/or Azure predicates don't check the provisioner and assume the unbound PVCs are theirs. @bertinatto is looking at it.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 19, 2019
@jsafrane jsafrane force-pushed the e2e-volume-limits branch 2 times, most recently from a5c811f to 2d2c6b4 Compare July 22, 2019 10:36
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 22, 2019
@jsafrane jsafrane force-pushed the e2e-volume-limits branch 2 times, most recently from dedced8 to bcb9687 Compare August 1, 2019 12:23
@jsafrane (Member Author), Aug 1, 2019

@davidz627 @msau42, so I tried with 128 volumes, and a pod with 128 volumes failed to run:

[snip] rpc error: code = Internal desc = unknown Attach error: failed when waiting for zonal op: operation operation-1564668114867-58f0eaf2d9379-022fe473-7f224368 failed (LIMIT_EXCEEDED): Exceeded limit 'maximum_persistent_disks' on resource 'e2e-4c57e689bd-930d0-minion-group-00hf'. Limit: 128.0

Is there an off-by-one error here? Or is there another pod (logging, metrics, ...) using one slot? It would be great if you could debug the test on your end.

@davidz627 (Contributor)

@jsafrane I will look into it

@jsafrane (Member Author)

/retest


return testsuites.GetStorageClass(provisioner, parameters, nil, ns, suffix)
return testsuites.GetStorageClass(provisioner, parameters, &delayedBinding, ns, suffix)
Contributor
do we still want to "turn on delayed binding" for these suites in this PR?

Member Author

No, we don't. Removed

Contributor

still here :)

Member

Removed
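For reference, "turning on delayed binding" here means giving the StorageClass a WaitForFirstConsumer volume binding mode. A minimal standalone sketch of such a class built with the raw storage API; the generated name prefix is a placeholder and this bypasses the testsuites.GetStorageClass helper entirely:

```go
// Illustration only: a StorageClass with delayed (WaitForFirstConsumer) binding.
package storagesketch

import (
	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func delayedBindingClass(provisioner string, parameters map[string]string) *storagev1.StorageClass {
	mode := storagev1.VolumeBindingWaitForFirstConsumer
	return &storagev1.StorageClass{
		ObjectMeta:        metav1.ObjectMeta{GenerateName: "delayed-binding-"},
		Provisioner:       provisioner,
		Parameters:        parameters,
		VolumeBindingMode: &mode, // PVCs stay Pending until a consuming pod is scheduled
	}
}
```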

// And one extra pod with a CSI volume should get Pending with a condition
// that says it's unschedulable because of the volume limit.
// BEWARE: the test may create a lot of volumes and it's really slow.
ginkgo.It("should support volume limits [Slow][Serial]", func() {
Contributor

Would this test also be considered [Disruptive], as it would prevent further scheduling of pods that use disks onto this node?

Member Author

IMO, the test does not break anything (e.g. it does not kill kubelet). It just needs to be alone on a node and that's why we have [Serial].

ginkgo.Skip(fmt.Sprintf("driver %s does not support volume limits", driverInfo.Name))
}
var dDriver DynamicPVTestDriver
if dDriver = driver.(DynamicPVTestDriver); dDriver == nil {
Contributor

I thought these checks were all done in some helper function somewhere that's shared between test suites.

Member Author

Technically the error path is not reachable: the type assertion to DynamicPVTestDriver should always succeed, but I have a safety check just to be sure.
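As an aside, the comma-ok form is the usual way to make such a safety check reachable, since a failed single-value type assertion panics rather than yielding nil. A sketch in the same context as the snippet above (illustration only, not the merged code):

```go
// Sketch only: the comma-ok assertion fails softly instead of panicking,
// so the fallback (here a skip, as used elsewhere in this suite) can run.
dDriver, ok := driver.(DynamicPVTestDriver)
if !ok {
	ginkgo.Skip(fmt.Sprintf("driver %s does not support dynamic provisioning", driverInfo.Name))
}
// dDriver is then used by the rest of the test.
```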

defer l.resource.cleanupResource()

// Prepare cleanup
defer func() {
Contributor

Maybe split some of this cleanup function into helper functions; it's very long and complicated right now.

Member Author

Moved to a separate func.


l.pvNames = sets.NewString()
ginkgo.By("Waiting for all PVCs to get Bound")
err = wait.Poll(5*time.Second, testSlowMultiplier*framework.PVBindingTimeout, func() (bool, error) {
Contributor

is there an existing helper function for this?

Member Author

Not for an array of PVCs, only for a single PVC.

@jsafrane (Member Author), Aug 16, 2019

Plus, I need to collect the PV names to make sure they're deleted before the test ends. It takes ages, and the test suite could end before the PVs are detached and deleted.

Member Author

Moved to a separate func.
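A sketch of what such a helper could look like, based on the discussion above: poll all claims, log progress, and collect the bound PV names so cleanup can later wait for their deletion. The name and signature are hypothetical, and the snippet assumes the same pre-context client-go calls used elsewhere in this test:

```go
// Hypothetical helper: wait until every PVC in the slice is Bound and return
// the set of PV names they bound to.
func waitForAllPVCsBound(cs clientset.Interface, pvcs []*v1.PersistentVolumeClaim, timeout time.Duration) (sets.String, error) {
	pvNames := sets.NewString()
	err := wait.Poll(5*time.Second, timeout, func() (bool, error) {
		bound := 0
		for _, pvc := range pvcs {
			claim, err := cs.CoreV1().PersistentVolumeClaims(pvc.Namespace).Get(pvc.Name, metav1.GetOptions{})
			if err != nil {
				return false, err
			}
			if claim.Status.Phase == v1.ClaimBound {
				bound++
				pvNames.Insert(claim.Spec.VolumeName) // remember the PV for cleanup
			}
		}
		e2elog.Logf("%d/%d of PVCs are Bound", bound, len(pvcs))
		return bound == len(pvcs), nil
	})
	return pvNames, err
}
```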

l.unschedulablePod, err = l.cs.CoreV1().Pods(l.ns.Name).Create(pod)

ginkgo.By("Waiting for the pod to get unschedulable")
err = wait.Poll(5*time.Second, framework.PodStartTimeout, func() (bool, error) {
Contributor

helper function?

Member Author

Found WaitForPodCondition, but it's not that much shorter.

@jsafrane jsafrane force-pushed the e2e-volume-limits branch 2 times, most recently from d771ed3 to 29c688d Compare August 16, 2019 13:02
@davidz627 (Contributor) left a comment

lgtm modulo the extra code change for delayed binding


return testsuites.GetStorageClass(provisioner, parameters, nil, ns, suffix)
return testsuites.GetStorageClass(provisioner, parameters, &delayedBinding, ns, suffix)
Contributor

do we still want to turn on delayed binding?

Member

Removed


return testsuites.GetStorageClass(provisioner, parameters, nil, ns, suffix)
return testsuites.GetStorageClass(provisioner, parameters, &delayedBinding, ns, suffix)
Contributor

still here :)

@bertinatto bertinatto force-pushed the e2e-volume-limits branch 2 times, most recently from 6906d21 to 7092875 Compare August 23, 2019 15:00
@bertinatto (Member)

/test pull-kubernetes-bazel-build

@bertinatto (Member)

@davidz627, @msau42: I addressed the comments, could you take another look?

@davidz627 (Contributor) left a comment

A couple of small comments; lgtm besides those.


// Create <limit> PVCs for one gigantic pod.
for i := 0; i < limit; i++ {
ginkgo.By(fmt.Sprintf("Creating volume %d/%d", i+1, limit))
Contributor

This ginkgo.By seems like it's going to be pretty spammy; it might not be necessary.

Member

Moved above the loop.
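For illustration, a sketch of the loop under discussion with the per-iteration ginkgo.By hoisted above it, as suggested. Field names such as l.pvcs and l.resource.sc, and the 1Gi size, are assumptions for this sketch; the merged test builds its claims via the suite's helpers:

```go
// Sketch only: create <limit> claims from the suite's StorageClass.
ginkgo.By(fmt.Sprintf("Creating %d PVC(s)", limit))
for i := 0; i < limit; i++ {
	pvc := &v1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "volume-limits-", Namespace: l.ns.Name},
		Spec: v1.PersistentVolumeClaimSpec{
			AccessModes:      []v1.PersistentVolumeAccessMode{v1.ReadWriteOnce},
			StorageClassName: &l.resource.sc.Name,
			Resources: v1.ResourceRequirements{
				Requests: v1.ResourceList{v1.ResourceStorage: resource.MustParse("1Gi")},
			},
		},
	}
	pvc, err := l.cs.CoreV1().PersistentVolumeClaims(l.ns.Name).Create(pvc)
	framework.ExpectNoError(err, "failed to create PVC %d/%d", i+1, limit)
	l.pvcs = append(l.pvcs, pvc)
}
```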

}

// Create the gigantic pod with all <limit> PVCs
ginkgo.By("Creating pod")
Contributor

nit: more context on what kind of pod

Member

added
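And a sketch of the "gigantic" pod itself: one pause container with every claim created above attached as a volume. The l.pvcs field and the direct NodeName pinning are illustrative assumptions (the merged test uses the framework's node-selection helpers instead); the pause-image helper comes from test/utils/image:

```go
// Sketch only: a single pod that mounts all <limit> PVCs on the chosen node.
pod := &v1.Pod{
	ObjectMeta: metav1.ObjectMeta{GenerateName: "volume-limits-pod-", Namespace: l.ns.Name},
	Spec: v1.PodSpec{
		NodeName:   nodeName, // assumption: the merged test pins the pod via node selection instead
		Containers: []v1.Container{{Name: "pause", Image: imageutils.GetPauseImageName()}},
	},
}
for i, pvc := range l.pvcs {
	name := fmt.Sprintf("vol%d", i)
	pod.Spec.Volumes = append(pod.Spec.Volumes, v1.Volume{
		Name: name,
		VolumeSource: v1.VolumeSource{
			PersistentVolumeClaim: &v1.PersistentVolumeClaimVolumeSource{ClaimName: pvc.Name},
		},
	})
	pod.Spec.Containers[0].VolumeMounts = append(pod.Spec.Containers[0].VolumeMounts,
		v1.VolumeMount{Name: name, MountPath: fmt.Sprintf("/mnt/vol%d", i)})
}
pod, err := l.cs.CoreV1().Pods(l.ns.Name).Create(pod)
framework.ExpectNoError(err, "failed to create pod with %d volumes", limit)
```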

err = e2epod.WaitForPodCondition(l.cs, l.ns.Name, l.unschedulablePod.Name, "Unschedulable", framework.PodStartTimeout, func(pod *v1.Pod) (bool, error) {
if pod.Status.Phase == v1.PodPending {
for _, cond := range pod.Status.Conditions {
matched, _ := regexp.MatchString("max.+volume.+count", cond.Message)
Contributor

What could be between max, volume, and count besides a space? This regex is pretty prescriptive; it might as well be a substring match.

Member

This is the string it's trying to match, so it's only a space.

I think @jsafrane wanted to use the same approach used in the mock driver.
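For comparison, the substring variant the reviewer suggests would look roughly like this inside the same poll callback, assuming the scheduler's unschedulable message contains the phrase "max volume count" (as the linked string does):

```go
// Sketch only: substring check instead of the regexp, on the PodScheduled condition.
if pod.Status.Phase == v1.PodPending {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == v1.PodScheduled && cond.Status == v1.ConditionFalse &&
			strings.Contains(cond.Message, "max volume count") {
			return true, nil
		}
	}
}
return false, nil
```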

if len(nodeList.Items) != 0 {
nodeName = nodeList.Items[0].Name
} else {
e2elog.Failf("Unable to find ready and schedulable Node")
Contributor

nit: inconsistent usage of e2elog and ginkgo. Above we have ginkgo.Fail("...") and here we have e2elog.Failf("..."); we also log through both e2elog.Logf and ginkgo.By.

Member

Replaced one usage of ginkgo.Fail and removed a redundant e2elog.Log call.

The rest seems consistent with other files (like provisioning.go), where (AFAIU) ginkgo is used to delimit/document the logic of the test and e2elog for logging additional information.

Contributor

Looks like someone is trying to clean this up right now; let's be consistent with what they're doing: #81983

Member

Done

@bertinatto (Member)

/retest

@bertinatto (Member)

/test pull-kubernetes-conformance-kind-ipv6

flaky job

@davidz627 (Contributor), Aug 27, 2019

lgtm, please rebase into fewer, descriptive commits. Thanks!

@bertinatto bertinatto force-pushed the e2e-volume-limits branch 3 times, most recently from e6cafc1 to 61ee449 Compare August 28, 2019 08:20
The test creates as many PVs and pods as the driver/plugin reports to support on a
single node.
@bertinatto (Member)

/test pull-kubernetes-bazel-build

@bertinatto (Member)

@davidz627, I squashed the commits and fixed some function calls (e.g., MakeSecPod) that were moved recently. PTAL

@davidz627 (Contributor)

/lgtm
thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 28, 2019
@davidz627 (Contributor)

/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Aug 28, 2019
@davidz627 (Contributor)

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 28, 2019
@davidz627 (Contributor)

/retest

@k8s-ci-robot k8s-ci-robot merged commit c4ccb62 into kubernetes:master Aug 29, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.16 milestone Aug 29, 2019