CSI storage capacity check #92387
Conversation
/hold Let's merge #91939 first, then I'll rebase to resolve the merge conflict in the new commit.
/assign @msau42
force-pushed from 56a5ccd to de9a347
The merge conflict was trivial and only affected a function comment. I've rebased this code here, but will do that one more time to get rid of the API change commits once #91939 is merged.
force-pushed from de9a347 to 1863d41
/remove-sig api-machinery
driver, err := b.capacityCheck.CSIDriverInformer.Lister().Get(provisioner)
if err != nil {
	if errors.IsNotFound(err) {
		// Either the provisioner is no CSI driver or the driver does not
typo: "not a CSI driver"
Both sound fine to me, but what do I know about English? 😅
Changed.
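For readers without the full diff: the branch being discussed treats a missing CSIDriver object as "nothing to check". A minimal sketch of how it might continue; everything past the quoted lines (the function shape, the rest of the comment, the return values) is assumed, not quoted from the PR:

```go
// hasEnoughCapacity is a hypothetical wrapper name; only the lister call
// and the IsNotFound handling mirror the snippet quoted above.
func (b *volumeBinder) hasEnoughCapacity(provisioner string) (bool, error) {
	driver, err := b.capacityCheck.CSIDriverInformer.Lister().Get(provisioner)
	if err != nil {
		if errors.IsNotFound(err) {
			// Either the provisioner is not a CSI driver or the driver
			// does not publish capacity information. Either way there is
			// nothing to check, so assume the capacity is sufficient.
			return true, nil
		}
		return false, err
	}
	_ = driver // the real check would go on to inspect the driver spec
	return true, nil
}
```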
@@ -161,17 +169,31 @@ type volumeBinder struct {
	bindTimeout time.Duration

	translator InTreeToCSITranslator

	capacityCheck *CapacityCheck
Eventually, when this goes GA, I think this structure should be folded into the main struct.
Adding an E2E test showed that this code was actually broken in a very subtle (read: took me two hours to figure out) way: because I wasn't calling the Lister() functions during NewVolumeBinder, the corresponding informers were never started. That didn't matter with the mock objects used during unit testing, but it did for the real ones: presumably, the factory and thus the informers are started sometime after NewVolumeBinder and before the volume binder is invoked. Asking for listers later works, but they never return any data...

The revised code now stores the listers and a boolean in this struct instead of in CapacityCheck. I kept CapacityCheck to bundle the related parameters together.
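To make the failure mode concrete: a shared informer is only registered with its factory when something requests it (for example via Lister()) before the factory is started; requesting it afterwards yields a lister that never syncs. A minimal sketch of the fix described above; field names other than CSIDriverInformer are hypothetical:

```go
// Sketch of NewVolumeBinder: request the listers eagerly so that the
// shared informer factory registers (and later starts) the informers.
func NewVolumeBinder(capacityCheck *CapacityCheck /* , other params */) *volumeBinder {
	b := &volumeBinder{}
	if capacityCheck != nil {
		// Calling Lister() here, before the factory is started, is what
		// registers the informers. Doing it lazily in the scheduling path
		// would be too late: the listers would never return any data.
		b.csiDriverLister = capacityCheck.CSIDriverInformer.Lister()
		b.csiStorageCapacityLister = capacityCheck.CSIStorageCapacityInformer.Lister()
		b.capacityCheckEnabled = true
	}
	return b
}
```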
if scenario.shouldFail && err == nil {
	t.Error("returned success but expected error")
}
checkReasons(t, reasons, scenario.reasons)
Do we want to validate the pod cache in the success case?
I guess it's useful as an additional check, so I've added it (for both positive and negative cases).
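A sketch of what that extra assertion might look like in the table-driven test; validatePodCache and the scenario fields are hypothetical stand-ins, not the PR's actual helpers:

```go
if scenario.shouldFail && err == nil {
	t.Error("returned success but expected error")
}
checkReasons(t, reasons, scenario.reasons)

// Additional check (hypothetical helper): the binder's pod cache should
// contain the expected bindings/provisionings on success and stay empty
// for this pod/node pair on failure.
testEnv.validatePodCache(t, node.Name, scenario.pod, scenario.expectedBindings, scenario.expectedProvisions)
```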
var capacityCheck *scheduling.CapacityCheck
if utilfeature.DefaultFeatureGate.Enabled(features.CSIStorageCapacity) {
	capacityCheck = &scheduling.CapacityCheck{
		CSIDriverInformer: fh.SharedInformerFactory().Storage().V1().CSIDrivers(),
I think the RBAC rules for the scheduler need to be updated with these permissions as well.
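For reference, a sketch of the kind of read-only permissions being discussed, expressed with k8s.io/api/rbac/v1 types; the exact resource list and how the rule gets wired into the scheduler's bootstrap policy are my assumptions:

```go
import rbacv1 "k8s.io/api/rbac/v1"

// Read-only access the scheduler would need for the capacity check.
var csiCapacityRule = rbacv1.PolicyRule{
	APIGroups: []string{"storage.k8s.io"},
	Resources: []string{"csidrivers", "csistoragecapacities"},
	Verbs:     []string{"get", "list", "watch"},
}
```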
True. I guess proper E2E testing would have found this. Let me think about how I can add that now, without depending on the external-provisioner changes and a CSI driver deployment that enables the feature. Probably something involving the mock driver and manually created CSIStorageCapacity objects...

Yes, that works. For the negative case (= the volume cannot be created), the test currently waits to ensure that the pod doesn't start, which makes it slow. We have other cases like that, and there we decided against relying on error events to shorten the test runtime. Would it perhaps make sense here?
Would the integration test have caught the RBAC issue?
The reason not to rely on events wasn't strictly about shortening test time; it was because events are unreliable and could be dropped/garbage-collected under high load.
> Would the integration test have caught the RBAC issue?

Not sure... let's test it... no. The test completes fine even with the new RBAC permission missing.

> The reason not to rely on events wasn't strictly about shortening test time; it was because events are unreliable and could be dropped/garbage-collected under high load.

So one cannot rely on them for the test (i.e. not encountering the event must not be a test failure). But if the event is seen, the test could be stopped early. If slow test execution isn't an issue, then this is probably not worth the complexity.
// TestCapacity covers different scenarios involving CSIStorageCapacity objects.
// Scenarios without those are covered by TestFindPodVolumesWithProvisioning.
func TestCapacity(t *testing.T) {
Can you also add some integration test cases so we test the full interaction with etcd?
Okay, done.
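The integration scenarios boil down to creating CSIStorageCapacity objects through the real API server and checking how the binder reacts. A sketch of creating such an object, assuming the v1alpha1 API this feature started with; the names and the quantity are made up:

```go
import (
	"context"

	storagev1alpha1 "k8s.io/api/storage/v1alpha1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func createCapacity(ctx context.Context, client kubernetes.Interface, ns string) error {
	capacity := &storagev1alpha1.CSIStorageCapacity{
		ObjectMeta:       metav1.ObjectMeta{GenerateName: "test-capacity-"},
		StorageClassName: "test-sc",
		// NodeTopology selects the node(s) this capacity applies to.
		NodeTopology: &metav1.LabelSelector{
			MatchLabels: map[string]string{"kubernetes.io/hostname": "node-1"},
		},
		Capacity: resource.NewQuantity(1<<30, resource.BinarySI), // 1Gi
	}
	_, err := client.StorageV1alpha1().CSIStorageCapacities(ns).Create(ctx, capacity, metav1.CreateOptions{})
	return err
}
```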
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pohly, thockin

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
/hold cancel We (@msau42, @thockin, myself) discussed namespacing of CSIStorageCapacity.
/hold Waiting for CSIStorageCapacity tests in pull-kubernetes-e2e-gce-alpha-features.
test/e2e/storage/csi_mock_volume.go
@@ -944,8 +945,194 @@ var _ = utils.SIGDescribe("CSI mock volume", func() {
		})
	}
})

// These tests *only* work on a cluster which has the CSIStorageCapacity feature enabled.
ginkgo.Context("CSIStorageCapacity [Feature: CSIStorageCapacity]", func() {
There shouldn't be whitespace after "Feature".
Temporarily adding the whitespace in the test job so that we can get a run in today: kubernetes/test-infra#18201
Fixed.
/retest
/lgtm /hold
We can create CSIStorageCapacity objects manually; therefore we don't need the updated external-provisioner for these tests.

This is similar to the E2E test, except that it doesn't need a real cluster.

Setting testParameters.scName had no effect because StorageClassTest.StorageClassName isn't used anywhere. Instead, the storage class name is generated dynamically.

By creating CSIStorageCapacity objects in advance, we get the FailedScheduling pod event if (and only if!) the test is expected to fail because of insufficient or missing capacity. We can use that as an indicator that waiting for pod start can be stopped early. However, because we might not get to see the event under load, we still need the timeout.
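A sketch of that early-exit logic; the helper shape and the timings are made up, only the idea (event as accelerator, timeout as the authoritative signal) follows the commit message above:

```go
import (
	"context"
	"errors"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForPodStart polls until the pod is scheduled. A FailedScheduling
// event lets us fail early, but because events may be dropped under load,
// their absence proves nothing; the timeout remains the real arbiter.
func waitForPodStart(ctx context.Context, client kubernetes.Interface, ns, podName string, timeout time.Duration) error {
	return wait.PollImmediate(5*time.Second, timeout, func() (bool, error) {
		pod, err := client.CoreV1().Pods(ns).Get(ctx, podName, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		if pod.Spec.NodeName != "" {
			return true, nil // scheduled, done waiting
		}
		events, err := client.CoreV1().Events(ns).List(ctx, metav1.ListOptions{
			FieldSelector: "involvedObject.name=" + podName + ",reason=FailedScheduling",
		})
		if err == nil && len(events.Items) > 0 {
			return false, errors.New("scheduling failed: " + events.Items[0].Message)
		}
		return false, nil
	})
}
```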
force-pushed from 44fc9cd to 30f9380
Done.
/retest
/retest TestSubresourcePatch in pull-kubernetes-integration seems to be flaky.
Which they have done in https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/92387/pull-kubernetes-e2e-gce-alpha-features/1280744225868091392/
/lgtm
/hold cancel
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.
PR kubernetes/kubernetes#92387 is adding an E2E test for this new feature.
What type of PR is this?
/kind feature
What this PR does / why we need it:
This is the part of kubernetes/enhancements#1472 where the scheduler uses the capacity information to improve scheduling of pods with unbound volumes.
Special notes for your reviewer:
The API change itself is being reviewed in PR #91939 and should be merged first. This PR then just adds one commit with the scheduler change.
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: