CSI storage capacity check #92387

pohly · 2020-06-22T16:24:38Z

What type of PR is this?
/kind feature

What this PR does / why we need it:

This is the part of kubernetes/enhancements#1472 where the scheduler uses the storage capacity information to improve scheduling of pods with unbound volumes.

Special notes for your reviewer:

The API change itself is getting reviewed in PR #91939 and should be merged first. This PR then just adds one commit with the scheduler change.

Does this PR introduce a user-facing change?:

scheduler: optionally check for available storage capacity before scheduling pods which have unbound volumes (alpha feature with the new `CSIStorageCapacity` feature gate, only works for CSI drivers and depends on support for the feature in a CSI driver deployment)

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1472-storage-capacity-tracking
- [Usage]: https://github.com/kubernetes/website/pull/21634
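
For readers unfamiliar with the idea, here is a minimal sketch of the kind of check the release note describes. It is not the code in this PR: the `capacity` type, the `hasEnoughCapacity` function and the plain zone string standing in for node topology are all hypothetical simplifications.

```go
package main

import "fmt"

// capacity is a simplified, hypothetical stand-in for one published capacity
// record: how much storage is still available for one storage class in one
// topology segment.
type capacity struct {
	storageClass string
	zone         string // stand-in for a real node topology selector
	available    int64  // bytes
}

// hasEnoughCapacity reports whether some record for the given storage class
// and zone has room for a volume of the requested size.
func hasEnoughCapacity(caps []capacity, storageClass, zone string, requested int64) bool {
	for _, c := range caps {
		if c.storageClass == storageClass && c.zone == zone && c.available >= requested {
			return true
		}
	}
	// No matching record is treated the same as insufficient capacity.
	return false
}

func main() {
	caps := []capacity{
		{storageClass: "fast", zone: "zone-a", available: 100 << 30},
		{storageClass: "fast", zone: "zone-b", available: 1 << 30},
	}
	for _, zone := range []string{"zone-a", "zone-b", "zone-c"} {
		fmt.Printf("%s: %v\n", zone, hasEnoughCapacity(caps, "fast", zone, 10<<30))
	}
}
```

In this sketch a node in a zone without any capacity record is rejected, which mirrors the "insufficient or missing capacity" cases mentioned in the test discussion further down.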

pohly · 2020-06-22T16:25:28Z

/hold

Let's merge #91939 first, then I'll rebase to resolve the merge conflict in the new commit.

pohly · 2020-06-22T17:11:45Z

/assign @msau42

pohly · 2020-06-22T18:32:07Z

The merge conflict was trivial and only affected a function comment. I've rebased this code here, but will do that one more time to get rid of the API change commits once #91939 is merged.

fedebongio · 2020-06-23T20:11:09Z

/remove-sig api-machinery

msau42 · 2020-06-23T23:09:39Z

pkg/controller/volume/scheduling/scheduler_binder.go

+	driver, err := b.capacityCheck.CSIDriverInformer.Lister().Get(provisioner)
+	if err != nil {
+		if errors.IsNotFound(err) {
+			// Either the provisioner is no CSI driver or the driver does not


typo: "not a CSI driver"

Both sound fine to me, but what do I know about English? 😅

Changed.

msau42 · 2020-06-23T23:14:03Z

pkg/controller/volume/scheduling/scheduler_binder.go

@@ -161,17 +169,31 @@ type volumeBinder struct {
 	bindTimeout time.Duration

 	translator InTreeToCSITranslator
+
+	capacityCheck *CapacityCheck


Eventually when this goes GA, I think this structure should be folded into the main struct

Adding an E2E test showed that this code was actually broken in a very subtle (read: took me two hours to figure out) way: because I wasn't calling the Lister() functions during NewVolumeBinder, the corresponding informers were never started. That didn't matter with the mock objects used during unit testing, but it did for the real ones: presumably, the factory and thus the informers are started sometime after NewVolumeBinder and before the volume binder is invoked. Asking for listers later works, but they never return any data...

The revised code now stores the listers and a boolean in this struct instead of CapacityCheck. I kept it to bundle the related parameters together.

I believe @cofyc has a fix for this in #92684
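
The pitfall described in this thread can be illustrated with a toy model. This is only a sketch under stated assumptions: `factory`, `informer`, `informerFor` and `start` are hypothetical stand-ins written for this example, not the client-go types, and the resource names are made up.

```go
package main

import "fmt"

// informer is a toy stand-in for a shared informer: it only holds data
// once it has been started.
type informer struct {
	started bool
	store   []string
}

// List returns whatever the informer has cached so far.
func (i *informer) List() []string { return i.store }

// factory is a toy stand-in for a shared informer factory.
type factory struct {
	informers map[string]*informer
	backend   map[string][]string // objects that exist "in the API server"
}

// informerFor registers the informer for a resource on first use.
// It does not start the informer.
func (f *factory) informerFor(resource string) *informer {
	if inf, ok := f.informers[resource]; ok {
		return inf
	}
	inf := &informer{}
	f.informers[resource] = inf
	return inf
}

// start starts every informer that is registered at the time of the call.
func (f *factory) start() {
	for resource, inf := range f.informers {
		if !inf.started {
			inf.started = true
			inf.store = f.backend[resource]
		}
	}
}

func main() {
	f := &factory{
		informers: map[string]*informer{},
		backend: map[string][]string{
			"csidrivers":           {"mock.csi.example"},
			"csistoragecapacities": {"capacity-1"},
		},
	}

	// Requested in the constructor, before the factory is started.
	early := f.informerFor("csidrivers")
	f.start()
	// Requested only when first needed, after the factory was started.
	late := f.informerFor("csistoragecapacities")

	fmt.Println(len(early.List()), len(late.List()))
}
```

In this model the late informer is registered but never populated unless `start` is called again, which matches the symptom of listers that work but never return data.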

msau42 · 2020-06-23T23:47:26Z

pkg/controller/volume/scheduling/scheduler_binder_test.go

+		if scenario.shouldFail && err == nil {
+			t.Error("returned success but expected error")
+		}
+		checkReasons(t, reasons, scenario.reasons)


Do we want to validate Pod cache in the success case?

I guess it's useful as an additional check, so I've added it (for both positive and negative cases).

msau42 · 2020-06-23T23:48:30Z

pkg/scheduler/framework/plugins/volumebinding/volume_binding.go

+	var capacityCheck *scheduling.CapacityCheck
+	if utilfeature.DefaultFeatureGate.Enabled(features.CSIStorageCapacity) {
+		capacityCheck = &scheduling.CapacityCheck{
+			CSIDriverInformer:          fh.SharedInformerFactory().Storage().V1().CSIDrivers(),


I think rbac rules for the scheduler need to be updated with these permissions as well.

True. I guess proper E2E testing would have found this. Let me think about how I can add that already now, without depending on the external-provisioner changes and a CSI driver deployment that enables the feature. Probably something involving the mock driver and manually created CSIStorageCapacity objects...

Yes, that works. For the negative case (= volume cannot be created) it's currently waiting to ensure that the pod doesn't start, which makes the test slow. We have other cases like that, and there we decided against relying on error events to shorten the test runtime. Would it perhaps make sense here?

Would the integration test have caught the rbac issue?

The reason to not rely on events wasn't strictly about shortening test time, it was because events are unreliable and could be dropped/garbage collected under high load

> Would the integration test have caught the rbac issue?

Not sure... let's test it... no. The test completes fine even with the new RBAC permission missing.

> The reason to not rely on events wasn't strictly about shortening test time, it was because events are unreliable and could be dropped/garbage collected under high load.

So one cannot rely on them for the test (i.e. not encountering the event must not be a test failure). But if the event is seen, the test could be stopped early. If slow test execution isn't an issue, then this is probably not worth the complexity.

msau42 · 2020-06-23T23:52:54Z

pkg/controller/volume/scheduling/scheduler_binder_test.go

+
+// TestCapacity covers different scenarios involving CSIStorageCapacity objects.
+// Scenarios without those are covered by TestFindPodVolumesWithProvisioning.
+func TestCapacity(t *testing.T) {


Can you also add some integration test cases so we test the full interaction with etcd?

https://github.com/kubernetes/kubernetes/blob/master/test/integration/volumescheduling/volume_binding_test.go

Okay, done.

msau42 · 2020-06-23T23:53:32Z

/assign @cofyc @ahg-g

k8s-ci-robot · 2020-07-07T21:09:22Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pohly, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~api/OWNERS~~ [thockin]
~~pkg/apis/OWNERS~~ [thockin]
~~pkg/controller/volume/scheduling/OWNERS~~ [thockin]
~~pkg/features/OWNERS~~ [thockin]
~~pkg/kubeapiserver/OWNERS~~ [thockin]
~~pkg/printers/OWNERS~~ [thockin]
~~pkg/registry/OWNERS~~ [thockin]
~~pkg/scheduler/OWNERS~~ [thockin]
~~plugin/pkg/auth/authorizer/OWNERS~~ [thockin]
~~staging/src/k8s.io/api/OWNERS~~ [thockin]
~~staging/src/k8s.io/client-go/OWNERS~~ [thockin]
~~test/OWNERS~~ [thockin]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

pohly · 2020-07-07T21:16:13Z

/hold cancel

We (@msau42, @thockin, myself) discussed namespacing of CSIStorageCapacity and decided to go with it for now, with a TODO item in kubernetes/enhancements#1472 (comment) to re-evaluate that choice.

pohly · 2020-07-07T21:17:54Z

/hold

Waiting for CSIStorageCapacity tests in pull-kubernetes-e2e-gce-alpha-features to pass.

msau42 · 2020-07-07T22:16:17Z

test/e2e/storage/csi_mock_volume.go

@@ -944,8 +945,194 @@ var _ = utils.SIGDescribe("CSI mock volume", func() {
 			})
 		}
 	})
+
+	// These tests *only* work on a cluster which has the CSIStorageCapacity feature enabled.
+	ginkgo.Context("CSIStorageCapacity [Feature: CSIStorageCapacity]", func() {


There shouldn't be a whitespace after "Feature".

Temporarily adding a whitespace in the test job so that we can get a run in today. kubernetes/test-infra#18201
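
To show why the whitespace matters, here is a small sketch. The focus pattern is an assumption about the kind of regular expression a test job might pass to Ginkgo, not the actual job configuration, and `matchesFocus` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesFocus reports whether a test name is selected by a focus pattern.
func matchesFocus(pattern, testName string) bool {
	return regexp.MustCompile(pattern).MatchString(testName)
}

func main() {
	// Hypothetical focus pattern without whitespace after "Feature:".
	focus := `\[Feature:CSIStorageCapacity\]`

	withSpace := "CSI mock volume CSIStorageCapacity [Feature: CSIStorageCapacity]"
	withoutSpace := "CSI mock volume CSIStorageCapacity [Feature:CSIStorageCapacity]"

	fmt.Println(matchesFocus(focus, withSpace))    // false
	fmt.Println(matchesFocus(focus, withoutSpace)) // true
}
```

With a pattern like this, a test tagged with the extra space is silently skipped, so either the tag or the job's pattern has to change.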

msau42 · 2020-07-07T23:39:58Z

/retest

msau42 · 2020-07-08T01:18:40Z

/lgtm
/priority important-soon

/hold
Can you update the release note to clarify that this is a new alpha feature with the feature gate name, and that it's only for csi volumes?

We can create CSIStorageCapacity objects manually, therefore we don't need the updated external-provisioner for these tests.

This is similar to the E2E test, it just doesn't need a real cluster.

Setting testParameters.scName had no effect because StorageClassTest.StorageClassName isn't used anywhere. Instead, the storage class name is generated dynamically.

By creating CSIStorageCapacity objects in advance, we get the FailedScheduling pod event if (and only if!) the test is expected to fail because of insufficient or missing capacity. We can use that as an indicator that waiting for pod start can be stopped early. However, because we might not get to see the event under load, we still need the timeout.

pohly · 2020-07-08T06:07:31Z

> Can you update the release note to clarify that this is a new alpha feature with the feature gate name, and that it's only for csi volumes?

Done.

pohly · 2020-07-08T08:44:56Z

/retest

pohly · 2020-07-08T11:55:42Z

/retest

TestSubresourcePatch in pull-kubernetes-integration seems to be flaky.

pohly · 2020-07-08T11:57:28Z

> Waiting for CSIStorageCapacity tests in pull-kubernetes-e2e-gce-alpha-features to pass.

Which they have done in https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/92387/pull-kubernetes-e2e-gce-alpha-features/1280744225868091392/

msau42 · 2020-07-08T17:48:42Z

/lgtm

msau42 · 2020-07-08T18:15:37Z

/hold cancel

fejta-bot · 2020-07-08T19:20:54Z

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

PR kubernetes/kubernetes#92387 is adding an E2E test for this new feature.

k8s-ci-robot requested review from brendandburns and davidopp June 22, 2020 16:28

k8s-ci-robot assigned msau42 Jun 22, 2020

pohly changed the title ~~CSI storage capacity~~ CSI storage capacity check Jun 22, 2020

pohly force-pushed the csi-storage-capacity branch from 56a5ccd to de9a347 Compare June 22, 2020 18:30

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 22, 2020

pohly force-pushed the csi-storage-capacity branch from de9a347 to 1863d41 Compare June 22, 2020 18:36

k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Jun 23, 2020

msau42 reviewed Jun 23, 2020

View reviewed changes

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 7, 2020

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 7, 2020

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 7, 2020

msau42 reviewed Jul 7, 2020

View reviewed changes

k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 8, 2020

pohly added 4 commits July 8, 2020 08:02

CSIStorageCapacity: E2E test with mock driver

567ce87

We can create CSIStorageCapacity objects manually, therefore we don't need the updated external-provisioner for these tests.

CSIStorageCapacity: integration test

cf735a3

This is similar to the E2E test, it just doesn't need a real cluster.

e2e storage: dead code removal and cleanup

bd05791

Setting testParameters.scName had no effect because StorageClassTest.StorageClassName isn't used anywhere. Instead, the storage class name is generated dynamically.

pohly force-pushed the csi-storage-capacity branch from 44fc9cd to 30f9380 Compare July 8, 2020 06:02

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 8, 2020

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 8, 2020

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 8, 2020

k8s-ci-robot merged commit 94a08e1 into kubernetes:master Jul 9, 2020

k8s-ci-robot added this to the v1.19 milestone Jul 9, 2020

michaelkolber pushed a commit to michaelkolber/test-infra that referenced this pull request Jul 10, 2020

gce-alpha-features: add CSIStorageCapacity feature

85843eb

PR kubernetes/kubernetes#92387 is adding an E2E test for this new feature.

github-actions bot mentioned this pull request Jul 14, 2020

Week Ending July 12, 2020 dev-obs/actus#193

Open
