
UPSTREAM: <carry>: OCPEDGE-807: add support for cpu limits into management workloads #1902

Merged

Conversation


@eggfoobar eggfoobar commented Feb 27, 2024

Adds support for workload partitioning to use a container's CPU limits. So that the runtime can make better decisions about workload CPU quotas, the CPU limit is passed down as the cpuLimitMilli value in the annotation. CRI-O takes that value and calculates the quota per node, which should support cases where workloads have different CPU period overrides assigned.
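
As a rough illustration of why passing the limit in millicores (rather than a precomputed quota) helps when nodes use different CPU periods, here is a minimal sketch of the usual CFS conversion. The function name `quotaFromMilli` and the constants are assumptions made for this example; the actual calculation lives in CRI-O and may differ.

```go
package main

import "fmt"

const (
	// milliCPUPerCPU is the number of millicores in one full CPU.
	milliCPUPerCPU = 1000
	// defaultPeriodUsec is the common 100ms CFS period (assumed default).
	defaultPeriodUsec = 100000
	// minQuotaUsec is a 1ms floor, as Kubernetes applies in its own conversion.
	minQuotaUsec = 1000
)

// quotaFromMilli (hypothetical helper) converts a CPU limit in millicores
// into a CFS quota in microseconds for the given period.
// A non-positive limit is treated as "no limit" and yields 0.
func quotaFromMilli(cpuLimitMilli, periodUsec int64) int64 {
	if cpuLimitMilli <= 0 {
		return 0
	}
	quota := cpuLimitMilli * periodUsec / milliCPUPerCPU
	if quota < minQuotaUsec {
		quota = minQuotaUsec
	}
	return quota
}

func main() {
	// The same 500m limit yields different quotas under different periods.
	fmt.Println(quotaFromMilli(500, defaultPeriodUsec)) // 50000
	fmt.Println(quotaFromMilli(500, 50000))             // 25000
}
```

Because the quota depends on the period, computing it on the node, where the period override is known, avoids baking one period into the annotation.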

Enhancement Proposal
OCPEDGE-57

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@openshift-ci-robot openshift-ci-robot added the backports/unvalidated-commits Indicates that not all commits come to merged upstream PRs. label Feb 27, 2024
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 27, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 27, 2024
@openshift-ci-robot
Copy link

openshift-ci-robot commented Feb 27, 2024

@eggfoobar: This pull request references OCPEDGE-807 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

@eggfoobar: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.


@eggfoobar
Copy link
Author

/retest-required


@eggfoobar
Copy link
Author

/retest-required


@eggfoobar eggfoobar changed the title [WIP] UPSTREAM: <carry>: OCPEDGE-807: add support for cpu limits into management workloads UPSTREAM: <carry>: OCPEDGE-807: add support for cpu limits into management workloads Mar 27, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 27, 2024
@eggfoobar
Copy link
Author

/hold

Holding until the CRI-O change is merged: cri-o/cri-o#7822

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 27, 2024
Copy link

@ffromani ffromani left a comment


This change also seems to modify how the code detects and manages the QoS class. Is this part needed for handling the limits? If so, could you please explain why?
Would it be possible to move this logic change into its own commit, and is this part covered by existing tests?

// and add a warning annotation
resourceAnnoString, err := json.Marshal(resourceAnno)
if err != nil {
	podAnnotations[workloadAdmissionWarning] = fmt.Sprintf("failed to marshal cpu resources, using fallback: err: %s", err.Error())


Do we want to set a warning as an annotation?

Copy link
Author


Yeah, I wasn't 100% sure what to do here. I wanted to make this easy to identify if it cropped up in the wild, even though it's highly improbable. Maybe a simple log would be good enough, unless we feel firing off an event would be more appropriate.

Copy link
Author


Just for clarity, @ffromani, do you mean the use of annotations for warnings in this webhook in general?


I mean it feels weird to add warning data as an annotation. Is the user supposed to look at annotations to check for warnings? I'm not sure there's a better alternative, though, so this comment is not blocking.
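
For readers following this thread, the pattern under discussion can be sketched roughly as below: marshal the CPU resources into a per-container annotation and, if marshaling fails, write a simpler fallback value plus a warning annotation. Everything named here (`annotationPrefix`, `warningKey`, the `cpuResources` fields and their JSON keys, `setCPUAnnotation`) is hypothetical and chosen for illustration; only the warning message text comes from the diff excerpt above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical annotation keys, for illustration only.
const (
	annotationPrefix = "example.io/resources-"
	warningKey       = "example.io/warning"
)

// cpuResources is an assumed shape for the annotation payload.
type cpuResources struct {
	CPUShares int64 `json:"cpushares"`
	CPULimit  int64 `json:"cpulimit,omitempty"`
}

// setCPUAnnotation writes the JSON-encoded resources for one container.
// On a marshal error it falls back to a shares-only value and records
// a warning annotation so the failure is visible on the pod.
func setCPUAnnotation(annotations map[string]string, container string, res cpuResources) {
	key := annotationPrefix + container
	value, err := json.Marshal(res)
	if err != nil {
		annotations[warningKey] = fmt.Sprintf("failed to marshal cpu resources, using fallback: err: %s", err.Error())
		annotations[key] = fmt.Sprintf(`{"cpushares": %d}`, res.CPUShares)
		return
	}
	annotations[key] = string(value)
}

func main() {
	annotations := map[string]string{}
	setCPUAnnotation(annotations, "app", cpuResources{CPUShares: 20, CPULimit: 100})
	fmt.Println(annotations[annotationPrefix+"app"])
}
```

With a plain struct of integers, `json.Marshal` is not expected to fail, which fits the author's remark that the error path is highly improbable; a log line or event would be alternative ways to surface it.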


@eggfoobar
Copy link
Author

eggfoobar commented Mar 27, 2024

This change also seems to modify how the code detects and manages the QoS class. Is this part needed for handling the limits? If so, could you please explain why? Would it be possible to move this logic change into its own commit, and is this part covered by existing tests?

Great questions. I was hoping to simplify the code here and use the existing library code for computing QoS. This code was originally here because, at the time, there seemed to be a QoS helper only for v1.Pod, and we needed to use core.Pod for the webhook. I noticed there was a core.Pod package available and opted to offload the logic to that library. More info in this comment at the time.

The tests should cover this change; I'll double-check when I have a moment. As for moving this to its own commit, I kept it as one because of the UPSTREAM rule, but thinking about it again, I don't see why that would be a problem. Let me split it, and if it becomes an issue we can just squash it.

Edit:

After going through the code again, I realized one of the concerns brought up in the original comment on how the defaulter behaves was not accounted for in the new QoS package. I reverted that code and kept this commit to just the cpu limit change.
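
For context on what "computing QoS" involves, here is a heavily simplified sketch of the standard Kubernetes classification rules (Guaranteed, Burstable, BestEffort). The types and the `qosClass` function are assumptions written for this example, not the webhook's code or the library's API. The sketch treats an unset request as equal to its limit, mirroring the Kubernetes API defaulting rule; whether that is the exact defaulter concern mentioned above is not stated in this thread.

```go
package main

import "fmt"

// resources holds simplified CPU and memory amounts; 0 means "unset".
type resources struct {
	CPUMilli    int64
	MemoryBytes int64
}

// container is a simplified stand-in for a pod container spec.
type container struct {
	Requests resources
	Limits   resources
}

// qosClass (hypothetical helper) classifies a pod from its containers:
//   - BestEffort: no container sets any CPU or memory request or limit
//   - Guaranteed: every container sets CPU and memory limits, and
//     requests equal limits (unset requests default to the limits)
//   - Burstable: everything else
func qosClass(containers []container) string {
	anySet := false
	guaranteed := true
	for _, c := range containers {
		req, lim := c.Requests, c.Limits
		if req != (resources{}) || lim != (resources{}) {
			anySet = true
		}
		// Assumed defaulting: an unset request takes the limit's value.
		if req.CPUMilli == 0 {
			req.CPUMilli = lim.CPUMilli
		}
		if req.MemoryBytes == 0 {
			req.MemoryBytes = lim.MemoryBytes
		}
		if lim.CPUMilli == 0 || lim.MemoryBytes == 0 || req != lim {
			guaranteed = false
		}
	}
	switch {
	case !anySet:
		return "BestEffort"
	case guaranteed:
		return "Guaranteed"
	default:
		return "Burstable"
	}
}

func main() {
	limitsOnly := []container{{Limits: resources{CPUMilli: 500, MemoryBytes: 1 << 20}}}
	fmt.Println(qosClass(limitsOnly))
}
```

A classifier that skipped the defaulting step would label the limits-only pod above as Burstable, which shows how sensitive the result is to when defaulting runs relative to the check.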

@eggfoobar
Copy link
Author

/retest-required

@Tal-or
Copy link

Tal-or commented Apr 11, 2024

LGTM

Copy link
Member

@soltysh soltysh left a comment


/remove-label backports/unvalidated-commits
/label backports/validated-commits
/approve

@openshift-ci openshift-ci bot added backports/validated-commits Indicates that all commits come to merged upstream PRs. approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed backports/unvalidated-commits Indicates that not all commits come to merged upstream PRs. labels Apr 11, 2024
Copy link

@ffromani ffromani left a comment


/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 15, 2024
Copy link

openshift-ci bot commented Apr 15, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: eggfoobar, ffromani, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 6b4d6cb and 2 for PR HEAD c652a1d in total

@eggfoobar
Copy link
Author

/retest-required

2 similar comments
@eggfoobar
Copy link
Author

/retest-required

@eggfoobar
Copy link
Author

/retest-required

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 5fa1806 and 1 for PR HEAD c652a1d in total

@eggfoobar
Copy link
Author

/retest-required

3 similar comments
@eggfoobar
Copy link
Author

/retest-required

@eggfoobar
Copy link
Author

/retest-required

@eggfoobar
Copy link
Author

/retest-required

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 0fdcb8e and 0 for PR HEAD c652a1d in total

@openshift-ci-robot
Copy link

/hold

Revision c652a1d was retested 3 times: holding

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 17, 2024
@eggfoobar
Copy link
Author

/unhold

Both errors are due to quota slices

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 17, 2024
@eggfoobar
Copy link
Author

/retest-required

Issue seems to have been resolved, retesting

@eggfoobar
Copy link
Author

/retest-required

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 6506f5b and 2 for PR HEAD c652a1d in total

Copy link

openshift-ci bot commented Apr 18, 2024

@eggfoobar: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit f3e484f into openshift:master Apr 18, 2024
19 checks passed
@eggfoobar eggfoobar deleted the wrk-prt-support-cpu-limits branch April 30, 2024 12:59
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. backports/validated-commits Indicates that all commits come to merged upstream PRs. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

5 participants