Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CNF-7610: Mixed CPUs plugin deployment support #853

Conversation

Tal-or
Copy link
Contributor

@Tal-or Tal-or commented Nov 15, 2023

This PR contains set of changes that are needed for MixedCPUs feature support:

  • Bootstrap the new API used for the feature activation
  • Append the shared cpus as part of Kubelet's reservedSystemCPUs
  • Deploy all the different configuration files that are needed for the mixed-cpus feature activation
  • Unit tests

The feature is optional and off by default.

E2E tests will be added in a separate PR since this PR alone doesn't provide any added functionality and depends on other changes.

Related to:

EP: openshift/enhancements@76ea48a

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 15, 2023
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Nov 15, 2023

@Tal-or: This pull request references CNF-7610 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set.

In response to this:

This PR contains set of changes that are needed for MixedCPUs feature support:

  • Bootstrap the new API used for the feature activation
  • Append the shared cpus as part of Kubelet's reservedSystemCPUs
  • Deploy all the different configuration files that are needed for the mixed-cpus feature activation
  • Unit tests
  • WIP: E2E tests

The feature is optional and off by default.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 15, 2023
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Nov 15, 2023

@Tal-or: This pull request references CNF-7610 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set.

In response to this:

This PR contains set of changes that are needed for MixedCPUs feature support:

  • Bootstrap the new API used for the feature activation
  • Append the shared cpus as part of Kubelet's reservedSystemCPUs
  • Deploy all the different configuration files that are needed for the mixed-cpus feature activation
  • Unit tests
  • WIP: E2E tests

The feature is optional and off by default.

EP: openshift/enhancements@76ea48a

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Contributor

@ffromani ffromani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we add validation already? we need to check how the shared pool composes with other pools. E.g. (purely random thoughts): not overlap with isolated or offlined, potentially a subset of reserved? not overlapping ALSO with reserved?

@Tal-or
Copy link
Contributor Author

Tal-or commented Nov 20, 2023

Did we add validation already? we need to check how the shared pool composes with other pools. E.g. (purely random thoughts): not overlap with isolated or offlined, potentially a subset of reserved? not overlapping ALSO with reserved?

It should not overlap with any of the sets.
The fact that we're merging reserved + shared later on under Kubelet's systemReserved is an implementation detail.

@Tal-or Tal-or force-pushed the mixed_cpu_plugin_deployment_support branch from 65d1442 to 4cf8654 Compare November 20, 2023 16:26
@Tal-or Tal-or changed the title WIP: CNF-7610: Mixed CPUs plugin deployment support CNF-7610: Mixed CPUs plugin deployment support Nov 20, 2023
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 20, 2023
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Nov 20, 2023

@Tal-or: This pull request references CNF-7610 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set.

In response to this:

This PR contains set of changes that are needed for MixedCPUs feature support:

  • Bootstrap the new API used for the feature activation
  • Append the shared cpus as part of Kubelet's reservedSystemCPUs
  • Deploy all the different configuration files that are needed for the mixed-cpus feature activation
  • Unit tests

The feature is optional and off by default.

E2E tests will be added in a separate PR since this PR alone doesn't provide any added functionality and depends on other changes.

Related to:

  1. CNF-8326: advertise shared cpus for mixed cpus feature kubernetes#1795
  2. CNF-8809: admission: add new admission for handling shared cpus request kubernetes#1799

EP: openshift/enhancements@76ea48a

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Nov 20, 2023

@Tal-or: This pull request references CNF-7610 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set.

In response to this:

This PR contains set of changes that are needed for MixedCPUs feature support:

  • Bootstrap the new API used for the feature activation
  • Append the shared cpus as part of Kubelet's reservedSystemCPUs
  • Deploy all the different configuration files that are needed for the mixed-cpus feature activation
  • Unit tests

The feature is optional and off by default.
E2E tests will be added in a separate PR since this PR alone doesn't provide any added functionality and depends on other changes.

Related to:

  1. CNF-8326: advertise shared cpus for mixed cpus feature kubernetes#1795
  2. CNF-8809: admission: add new admission for handling shared cpus request kubernetes#1799

EP: openshift/enhancements@76ea48a

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Nov 21, 2023

@Tal-or: This pull request references CNF-7610 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set.

In response to this:

This PR contains set of changes that are needed for MixedCPUs feature support:

  • Bootstrap the new API used for the feature activation
  • Append the shared cpus as part of Kubelet's reservedSystemCPUs
  • Deploy all the different configuration files that are needed for the mixed-cpus feature activation
  • Unit tests

The feature is optional and off by default.

E2E tests will be added in a separate PR since this PR alone doesn't provide any added functionality and depends on other changes.

Related to:

EP: openshift/enhancements@76ea48a

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Tal-or Tal-or force-pushed the mixed_cpu_plugin_deployment_support branch 4 times, most recently from 05f045a to 40b7803 Compare November 28, 2023 07:17
@Tal-or
Copy link
Contributor Author

Tal-or commented Nov 29, 2023

depends on #858

@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Dec 3, 2023

@Tal-or: This pull request references CNF-7610 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.15.0" version, but no target version was set.

In response to this:

This PR contains set of changes that are needed for MixedCPUs feature support:

  • Bootstrap the new API used for the feature activation
  • Append the shared cpus as part of Kubelet's reservedSystemCPUs
  • Deploy all the different configuration files that are needed for the mixed-cpus feature activation
  • Unit tests

The feature is optional and off by default.

E2E tests will be added in a separate PR since this PR alone doesn't provide any added functionality and depends on other changes.

Related to:

EP: openshift/enhancements@76ea48a

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Tal-or
Copy link
Contributor Author

Tal-or commented Dec 3, 2023

/test e2e-gcp-pao

@Tal-or
Copy link
Contributor Author

Tal-or commented Dec 5, 2023

/test e2e-gcp-pao
Flaky

@@ -0,0 +1,21 @@
package __mixedcpus_test
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have already a 10_* suite, so this should be 11_mixedcpus_... IMO.

We also need to check (separately, but realized just now) we don't have hidden assumptions about the numerical prefix being single-digit. And (separately ofc) we probably should also rename the suites like 01_, 02_, ... 09_, 10_...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to 11 thanks

Copy link
Contributor

@ffromani ffromani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

looks very good. Non-binding LGTM, want to have another pass later and give chance to others to review.

Copy link
Contributor

openshift-ci bot commented Dec 6, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani, Tal-or

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Tal-or
Copy link
Contributor Author

Tal-or commented Dec 7, 2023

/test e2e-gcp-pao

validates that shared cpus are not overlapping
with any of the other CPU sets

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
@Tal-or Tal-or force-pushed the mixed_cpu_plugin_deployment_support branch from 4ecf57f to 2de3e30 Compare December 7, 2023 10:02
@Tal-or
Copy link
Contributor Author

Tal-or commented Dec 7, 2023

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 7, 2023
@Tal-or
Copy link
Contributor Author

Tal-or commented Dec 7, 2023

Need to reprovision my cluster to validate FG integration

Copy link
Contributor

@ffromani ffromani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

as discussed offline, the missing comments about e2e tests will be handled as followup PR

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 7, 2023
@Tal-or
Copy link
Contributor Author

Tal-or commented Dec 7, 2023

  Unexpected error:
      <wait.errInterrupted>: 
      timed out waiting for the condition
      {
          cause: <*errors.errorString | 0xc0002bd6b0>{
              s: "timed out waiting for the condition",
          },
      }
  occurred
  In [It] at: /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/cpu_management.go:298 @ 12/07/23 11:28:00.169

The lane is falling consistently maybe the issue is real. working to figure it out

@ffromani
Copy link
Contributor

ffromani commented Dec 7, 2023

  Unexpected error:
      <wait.errInterrupted>: 
      timed out waiting for the condition
      {
          cause: <*errors.errorString | 0xc0002bd6b0>{
              s: "timed out waiting for the condition",
          },
      }
  occurred
  In [It] at: /go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/cpu_management.go:298 @ 12/07/23 11:28:00.169

The lane is falling consistently maybe the issue is real. working to figure it out

,Reason:Unschedulable,Message:0/6 nodes are available: 1 Insufficient cpu, 2 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/6 nodes are available: 1 No preemption victims found for incoming pod, 5 Preemption is not helpful for scheduling..,},},Message:,Reason:,HostIP:,PodIP:,StartTime:<nil>,ContainerStatuses:[]ContainerStatus{},QOSClass:Guaranteed,InitContainerStatuses:[]ContainerStatus{},NominatedNodeName:,PodIPs:[]PodIP{},EphemeralContainerStatuses:[]ContainerStatus{},Resize:,},}

suggests there could be a pod being deleted and not yet gone, or that in general the cluster state is polluted somehow when the tests runs.

@Tal-or
Copy link
Contributor Author

Tal-or commented Dec 7, 2023

/test e2e-gcp-pao

@Tal-or
Copy link
Contributor Author

Tal-or commented Dec 7, 2023

There's an issue with the PR:
We added a new annotation to crio under the allowed annotation section:
2375df3#diff-e763c8f403f44cd298b79a200633f6e687c478a00bda33b0d85e0add10eeb60bR21

but since the existing CRIO version that is running in the CI is not familiar with this new annotation (PR is till open: cri-o/cri-o#7502) it would failed to validate the high-performance runtime and hence will ignore it: https://github.com/cri-o/cri-o/blob/1530af133d95de9cf4e5502c2f7dba4faa286a04/pkg/config/config.go#L1235
Missing the high-performance runtime configuration will cause the test the request for lb disable to fail

Bottom line:
We should merge CRIO patch first.

@ffromani
Copy link
Contributor

ffromani commented Dec 7, 2023

[...]

Bottom line: We should merge CRIO patch first.

...and let the CI infra consume the updated package once released. Is that feasible under the current timeline constraints?

[edit] the only other option coming to mind is to merge this functionality disabled by default, merge the e2e tests but make them SKIP as early as possible.
It's a risk. So it is not having this code merged. Let's find directions in the appropriate (offline, sadly) forums [/edit]

Feature gates sealing should be perfect and provide 0
changes to configuration in case mixed CPUs feature gates is off.

Hence, we don\'t want to add the `cpu-shared.crio.io` annotation
to CRI-O allowed annotation in case it's off.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
E2E tests that verifies the deployment done by PAO controller.
Mainly checks of the configuration files integrity.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
@Tal-or Tal-or force-pushed the mixed_cpu_plugin_deployment_support branch from 2de3e30 to 4e30eab Compare December 7, 2023 17:03
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Dec 7, 2023
@Tal-or
Copy link
Contributor Author

Tal-or commented Dec 7, 2023

There's an issue with the PR: We added a new annotation to crio under the allowed annotation section: 2375df3#diff-e763c8f403f44cd298b79a200633f6e687c478a00bda33b0d85e0add10eeb60bR21

but since the existing CRIO version that is running in the CI is not familiar with this new annotation (PR is till open: cri-o/cri-o#7502) it would failed to validate the high-performance runtime and hence will ignore it: https://github.com/cri-o/cri-o/blob/1530af133d95de9cf4e5502c2f7dba4faa286a04/pkg/config/config.go#L1235 Missing the high-performance runtime configuration will cause the test the request for lb disable to fail

Bottom line: We should merge CRIO patch first.

I removed the annotation in case the FG disabled. This should remove the change from the allowed_annotation configured for crio and now tests should pass with no issues.

@yanirq
Copy link
Contributor

yanirq commented Dec 7, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 7, 2023
@Tal-or
Copy link
Contributor Author

Tal-or commented Dec 7, 2023

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 7, 2023
@Tal-or
Copy link
Contributor Author

Tal-or commented Dec 7, 2023

/test e2e-hypershift

Copy link
Contributor

openshift-ci bot commented Dec 7, 2023

@Tal-or: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 570bbfd into openshift:master Dec 7, 2023
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants