OCPBUGS-23900: config.openshift.io/v1/scheduler: allow profile customizations for DRA #1738

Merged
openshift-merge-bot merged 8 commits into openshift:master from the kso-new-profile branch on Feb 8, 2024

ingvagabund
Member

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 24, 2024
@openshift-ci-robot

@ingvagabund: This pull request references Jira Issue OCPBUGS-23900, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

For more info: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/
/hold

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jan 24, 2024
@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 24, 2024
Contributor

openshift-ci bot commented Jan 24, 2024

Hello @ingvagabund! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

@openshift-ci openshift-ci bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jan 24, 2024
@ingvagabund
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jan 24, 2024
@openshift-ci-robot

@ingvagabund: This pull request references Jira Issue OCPBUGS-23900, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @kasturinarra

In response to this:

/jira refresh

Contributor

@JoelSpeed JoelSpeed left a comment

Nit, do we have to use the acronym DRA? Why not spell it out ExperimentalDynamicResourceAllocation?

Also, given you've included Experimental in the name, what's the plan for promoting this? How will we transition from ExperimentalDRA to regular DRA?

Is this a tech preview feature?

@ingvagabund ingvagabund changed the title from "OCPBUGS-23900: config.openshift.io/v1/scheduler: add a new profile for DRA" to "OCPBUGS-23900: config.openshift.io/v1/scheduler: allow profile customizations for DRA" Jan 24, 2024
@openshift-ci openshift-ci bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 24, 2024
@ingvagabund
Member Author

Nit, do we have to use the acronym DRA? Why not spell it out ExperimentalDynamicResourceAllocation?

It's true ExperimentalDynamicResourceAllocation is easier to read. On the other hand the name is too long.

Also, given you've included Experimental in the name, what's the plan for promoting this? How will we transition from ExperimentalDRA to regular DRA?
Is this a tech preview feature?

Yes. It's currently an upstream alpha feature. Normally, lifting a feature gate is sufficient for enabling it. However, in this case an additional scheduling plugin needs to be enabled alongside it. So even when a customer/user enables the feature (through TPNoUpgrade), the scheduler operator still needs to take an extra step.

@ingvagabund
Member Author

ingvagabund commented Jan 24, 2024

How will we transition from ExperimentalDRA to regular DRA

That's still to be debated. Either the new scheduling plugin will eventually be enabled by default after GA, at which point the experimental profile/field gets removed, or, if it is not enabled by default after GA, we will promote the profile/field and remove the word experimental.

@JoelSpeed
Contributor

It's true ExperimentalDynamicResourceAllocation is easier to read. On the other hand the name is too long.

Too long by what measure?

Yes. It's currently an upstream alpha feature. Normally, lifting a feature gate is sufficient for enabling it. However, in this case an additional scheduling plugin needs to be enabled alongside it. So even when a customer/user enables the feature (through TPNoUpgrade), the scheduler operator still needs to take an extra step.

Have you considered that the scheduler operator could observe the feature gate being present in the FeatureGate status and auto enable the value based on the feature gate status? At least for the tech preview version of this feature, it would be very obvious what needs to be done when the feature gate is enabled

What do you expect the feature enablement to look like when you GA the feature? Do you expect that users will always opt-in? If that's the case, it's not the first time we've added a "You must set this feature gate, then enable the feature by this field that is only enabled in tech preview clusters".

That's still to be debated. Either the new scheduling plugin will eventually be enabled by default after GA, at which point the experimental profile/field gets removed, or, if it is not enabled by default after GA, we will promote the profile/field and remove the word experimental.

So if it's likely to be enabled by default in the future, perhaps just relying on the feature gate and enabling the feature is acceptable?
If you think we will still want a knob after GA, I think we want to add that now, but use the correct name, and mark it as a TechPreview-only field/enum value.
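
The suggestion above, a knob that already carries its final name but is only accepted on tech preview clusters, could look roughly like the following Go sketch. All names here (`DRAEnablement`, `ProfileCustomizations`, `validate`) are hypothetical illustrations and are not taken from the API merged in this PR. The tech preview check is modeled as a plain boolean parameter, which is also an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DRAEnablement is a hypothetical string enum. Unlike a boolean, it leaves
// room for additional states to be added later.
type DRAEnablement string

const (
	DRAUnset    DRAEnablement = ""
	DRAEnabled  DRAEnablement = "Enabled"
	DRADisabled DRAEnablement = "Disabled"
)

// ProfileCustomizations is a hypothetical container for scheduler profile tweaks.
type ProfileCustomizations struct {
	DynamicResourceAllocation DRAEnablement `json:"dynamicResourceAllocation,omitempty"`
}

// validate rejects unknown enum values and rejects "Enabled" when the
// cluster is not a tech preview cluster.
func validate(pc ProfileCustomizations, techPreview bool) error {
	switch pc.DynamicResourceAllocation {
	case DRAUnset, DRADisabled:
		return nil
	case DRAEnabled:
		if !techPreview {
			return fmt.Errorf("dynamicResourceAllocation=%q requires a tech preview cluster", pc.DynamicResourceAllocation)
		}
		return nil
	default:
		return fmt.Errorf("unsupported dynamicResourceAllocation value %q", pc.DynamicResourceAllocation)
	}
}

func main() {
	var pc ProfileCustomizations
	if err := json.Unmarshal([]byte(`{"dynamicResourceAllocation":"Enabled"}`), &pc); err != nil {
		panic(err)
	}
	fmt.Println("tech preview cluster:", validate(pc, true))
	fmt.Println("regular cluster:", validate(pc, false))
}
```

Promoting the feature later would then only mean relaxing the tech preview check, without renaming the field.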

@ingvagabund
Member Author

Too long by what measure?

Too long for typing by hand. Although, that's probably not a blocking issue here. Lemme do some renaming.

Have you considered that the scheduler operator could observe the feature gate being present in the FeatureGate status and auto enable the value based on the feature gate status? At least for the tech preview version of this feature, it would be very obvious what needs to be done when the feature gate is enabled

Auto-enabling based on TPNoUpgrade was discussed before. We decided to keep a separate step for enabling the feature in kube-scheduler so that dynamic resource allocation can be enabled as a complete set of functionality. The DRA implementation itself is an upstream feature that requires additional components to be deployed, e.g. vendor plugins for NVIDIA, Intel, etc. running on each node. So enabling the dynamicresource scheduling plugin without these additional components may lead to pods staying in the pending state indefinitely.

What do you expect the feature enablement to look like when you GA the feature? Do you expect that users will always opt-in? If that's the case, it's not the first time we've added a "You must set this feature gate, then enable the feature by this field that is only enabled in tech preview clusters".

Users need to deploy third-party plugins before DRA functionality is enabled in the kube-scheduler. That will always be the case for GA until OpenShift starts shipping a generic plugin by default, which might not happen at all. So this extra step of configuring KSO to enable the plugin will always be required.

If you think we will still want a knob after GA, I think we want to add that now, but use the correct name, and mark it as a TechPreview-only field/enum value.

The way the plugin is enabled might change in the future. We might actually advocate for always disabling the plugin by default (even after GA) and enabling it only when the corresponding vendor plugins are installed. Also, the new field might be removed in favor of automatically detecting vendor plugins. That is something to be decided in time.
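
The enablement rule argued for in this comment, where the feature gate alone is not enough and an administrator must also opt in explicitly, can be sketched in Go as follows. This is a minimal illustration and not the scheduler operator's actual code: `schedulerInputs`, `draPluginEnabled` and `extraPlugins` are hypothetical names, and the string-valued setting stands in for whatever field the API ends up exposing. `DynamicResources` is assumed here to be the upstream name of the scheduling plugin.

```go
package main

import "fmt"

// schedulerInputs is a hypothetical summary of what the operator observes.
type schedulerInputs struct {
	FeatureGateEnabled bool   // the DynamicResourceAllocation feature gate is on
	DRASetting         string // "", "Enabled" or "Disabled" (hypothetical knob)
}

// draPluginEnabled returns true only when the feature gate is on AND the
// administrator has explicitly opted in. The gate alone is not sufficient,
// because vendor plugins may not be deployed on the nodes yet.
func draPluginEnabled(in schedulerInputs) bool {
	return in.FeatureGateEnabled && in.DRASetting == "Enabled"
}

// extraPlugins lists the scheduler plugins to enable on top of the defaults.
func extraPlugins(in schedulerInputs) []string {
	if draPluginEnabled(in) {
		return []string{"DynamicResources"}
	}
	return nil
}

func main() {
	cases := []schedulerInputs{
		{FeatureGateEnabled: false, DRASetting: "Enabled"},
		{FeatureGateEnabled: true, DRASetting: ""},
		{FeatureGateEnabled: true, DRASetting: "Enabled"},
	}
	for _, in := range cases {
		fmt.Printf("gate=%v setting=%q -> extra plugins %v\n",
			in.FeatureGateEnabled, in.DRASetting, extraPlugins(in))
	}
}
```

Auto-enabling from the feature gate, as proposed earlier in the thread, would correspond to dropping the second condition in `draPluginEnabled`.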

@ingvagabund ingvagabund force-pushed the kso-new-profile branch 4 times, most recently from 2e10503 to 8a69ff1 on January 24, 2024 15:56
@openshift-ci openshift-ci bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 24, 2024
@ingvagabund ingvagabund force-pushed the kso-new-profile branch 2 times, most recently from 2567a6d to c6c4d46 on January 26, 2024 10:05
ingvagabund and others added 5 commits February 8, 2024 14:18
Co-authored-by: Joel Speed <Joel.speed@hotmail.co.uk>
Co-authored-by: Joel Speed <Joel.speed@hotmail.co.uk>
Co-authored-by: Joel Speed <Joel.speed@hotmail.co.uk>
Contributor

@JoelSpeed JoelSpeed left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 8, 2024
Contributor

openshift-ci bot commented Feb 8, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ingvagabund, JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 8, 2024
@JoelSpeed
Contributor

/override ci/prow/verify-crd-schema

Failures are from pre-existing errors, future us will not be using booleans

Contributor

openshift-ci bot commented Feb 8, 2024

@JoelSpeed: Overrode contexts on behalf of JoelSpeed: ci/prow/verify-crd-schema

In response to this:

/override ci/prow/verify-crd-schema

Failures are from pre-existing errors, future us will not be using booleans

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ingvagabund
Member Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 8, 2024
@ingvagabund
Member Author

/retest-required

Contributor

openshift-ci bot commented Feb 8, 2024

@ingvagabund: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gcp | ccda9db | link | false | /test e2e-gcp |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit f8cee3e into openshift:master Feb 8, 2024
16 of 17 checks passed
@openshift-ci-robot

@ingvagabund: Jira Issue OCPBUGS-23900: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull requests must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-23900 has not been moved to the MODIFIED state.

In response to this:

For more info: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/
/hold

@ingvagabund ingvagabund deleted the kso-new-profile branch February 9, 2024 07:36
@openshift-bot

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-cluster-config-api-container-v4.16.0-202402090739.p0.gf8cee3e.assembly.stream.el9 for distgit ose-cluster-config-api.
All builds following this will include this PR.

@ingvagabund
Member Author

/cherry-pick release-4.15

@openshift-cherrypick-robot

@ingvagabund: new pull request created: #1763

In response to this:

/cherry-pick release-4.15

@openshift-merge-robot
Contributor

Fix included in accepted release 4.16.0-0.nightly-2024-02-17-013806

@openshift-merge-robot
Contributor

Fix included in accepted release 4.16.0-0.nightly-2024-03-05-105513

Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • size/XL: Denotes a PR that changes 500-999 lines, ignoring generated files.

6 participants