Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prioritizing nodes based on volume capacity: API changes #99594

Merged
merged 1 commit into from
Jul 1, 2021

Conversation

cofyc
Copy link
Member

@cofyc cofyc commented Mar 1, 2021

What type of PR is this?

/kind api-change

What this PR does / why we need it:

Based on #96347. This adds api changes.

#96347 must be merged first.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

[KEP]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1845-prioritization-on-volume-capacity

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 1, 2021
@cofyc cofyc changed the title Prioritizing nodes based on volume capacity: API changes WIP: Prioritizing nodes based on volume capacity: API changes Mar 1, 2021
@k8s-ci-robot k8s-ci-robot added do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 1, 2021
@k8s-ci-robot
Copy link
Contributor

@cofyc: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 1, 2021
@k8s-ci-robot k8s-ci-robot added area/test sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 1, 2021
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 2, 2021
@cofyc cofyc force-pushed the kep1845-api branch 4 times, most recently from 3df62be to bedf08a Compare March 2, 2021 10:49
@cofyc
Copy link
Member Author

cofyc commented Mar 3, 2021

/retest

@ahg-g
Copy link
Member

ahg-g commented Jun 24, 2021

you need to update the VolumeBinding defaults in the following testing file for some tests to pass:

I guess you didn't need this because the feature is still disabled.

@ahg-g
Copy link
Member

ahg-g commented Jun 24, 2021

/approve

this is good from the scheduler's POV.

@@ -211,6 +211,11 @@ type VolumeBindingArgs struct {
// Value must be non-negative integer. The value zero indicates no waiting.
// If this value is nil, the default value will be used.
BindTimeoutSeconds int64

// Shape specifies the points defining the score function shape.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add more description to this field? What does each point represent and how does it impact scoring?

It would also be good to indicate that this field requires the alpha VolumeCapacityPriority feature gate to be enabled.

Copy link
Member Author

@cofyc cofyc Jun 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be good to indicate that this field requires the alpha VolumeCapacityPriority feature gate to be enabled.

Is the +featureGate=VolumeCapacityPriority for this purpose?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add more description to this field? What does each point represent and how does it impact scoring?

tried my best, but I'm not good at this, suggestions are welcome!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is helpful.

Can you also describe how volume capacity maps to utilization? Is it based on #pvs, total capacity available to the node?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also describe how volume capacity maps to utilization? Is it based on #pvs, total capacity available to the node?

done in 5d9baa9

the existing bound PVs on the node are excluded as they cannot be shared by other pods like cpu/memory.

@@ -273,6 +273,18 @@ func SetDefaults_VolumeBindingArgs(obj *v1beta1.VolumeBindingArgs) {
if obj.BindTimeoutSeconds == nil {
obj.BindTimeoutSeconds = pointer.Int64Ptr(600)
}
if len(obj.Shape) == 0 && feature.DefaultFeatureGate.Enabled(features.VolumeCapacityPriority) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add unit tests for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

return err
if utilfeature.DefaultFeatureGate.Enabled(features.VolumeCapacityPriority) {
allErrs = append(allErrs, validateFunctionShape(args.Shape, path.Child("shape"))...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we validate the argument even if the feature gate is off? Otherwise someone could pass an invalid argument and we ignore it, and then in the future when we turn on the feature, their config no longer works.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liggitt we don't have "field dropping" for config APIs right? In lieu of that, I think validating the field in all cases and ignoring it at runtime is the best approach.

Another approach of returning an error if the feature is disabled will break rollback/disable scenarios.

Copy link
Member Author

@cofyc cofyc Jun 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In lieu of that, I think validating the field in all cases and ignoring it at runtime is the best approach.

+1 and implemented

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd go a step further and forbid values in this config field if the feature is off... otherwise someone can specify a shape that is syntactically valid but does not behave at all like they want, successfully run the scheduler with this feature off, and then break themselves with the same config when the feature turns on in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we return an error, then if someone has the feature on and sets the config, but then rolls back/disables the feature, then they would have to update their config at that time too otherwise the config will fail validation. Is that the expected approach for config apis?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm... I don't love either direction, but in the rollback case, since they're already modifying the invocation, I think making them comment out the config for the disabled feature is better than letting people put timebomb inert configuration in the file and then activating it on upgrade when the feature becomes active

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems a safer choice, done in 3e788a3

@@ -273,6 +273,18 @@ func SetDefaults_VolumeBindingArgs(obj *v1beta1.VolumeBindingArgs) {
if obj.BindTimeoutSeconds == nil {
obj.BindTimeoutSeconds = pointer.Int64Ptr(600)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see there is a v1 type. Do we need to add defaults there? I notice there are no defaults at all for v1. cc @ahg-g

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the v1 is not for component config, it is for the old policy api, which we will be deprecating in the next release.

@msau42
Copy link
Member

msau42 commented Jun 29, 2021

api review
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 29, 2021
@liggitt
Copy link
Member

liggitt commented Jun 29, 2021

one comment about forbidding this config value in validation if the feature is off, then lgtm

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 30, 2021
@liggitt
Copy link
Member

liggitt commented Jun 30, 2021

/approve
for API changes

/hold for final scheduler lgtm and approval

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 30, 2021
@ahg-g
Copy link
Member

ahg-g commented Jun 30, 2021

can you please squash the commits.

@cofyc
Copy link
Member Author

cofyc commented Jul 1, 2021

can you please squash the commits.

done

@ahg-g
Copy link
Member

ahg-g commented Jul 1, 2021

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 1, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, cofyc, liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cofyc
Copy link
Member Author

cofyc commented Jul 1, 2021

/retest

@ahg-g
Copy link
Member

ahg-g commented Jul 1, 2021

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 1, 2021
@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Jul 1, 2021

@cofyc: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-kubernetes-bazel-test 61592a7c8fd48dd3675240d87f59bc6a86ff759c link /test pull-kubernetes-bazel-test

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit 25bbe2e into kubernetes:master Jul 1, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.22 milestone Jul 1, 2021
@cofyc cofyc deleted the kep1845-api branch July 2, 2021 08:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-review Categorizes an issue or PR as actively needing an API review. approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: API review completed, 1.22
Development

Successfully merging this pull request may close these issues.

7 participants