
OCPVE-379: fix: avoid checking resources for BestEffort pods #28006

Conversation

eggfoobar
Contributor

@eggfoobar eggfoobar commented Jun 27, 2023

The reason the flakes have been occurring is that some tests around the network-operator create DaemonSets and Deployments that do not set resource requests. This change adds a skip list for workloads that are acceptable as BestEffort, and adds a code comment to be explicit about what BestEffort means in a workload-partitioned cluster. This does not affect the CPU affinity for workload partitioning.
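The skip-list approach described above could be sketched roughly as follows. This is a hypothetical illustration, not the code from this PR: the exception entries, function names, and the simplified pod dictionary shape are all assumptions made for the example (the real check lives in the OpenShift test suite and uses different names and types).

```python
# Hypothetical sketch of a BestEffort skip list for resource checks.
# The (namespace, name) entries below are placeholders, not real workloads.
BEST_EFFORT_EXCEPTIONS = {
    ("example-operator-namespace", "example-daemonset"),
}

def is_best_effort(pod):
    """A pod is BestEffort when no container sets any requests or limits."""
    return all(
        not c.get("resources", {}).get("requests")
        and not c.get("resources", {}).get("limits")
        for c in pod["containers"]
    )

def should_check_resources(namespace, name, pod):
    """Skip the resource-annotation check only for allow-listed BestEffort pods."""
    if is_best_effort(pod) and (namespace, name) in BEST_EFFORT_EXCEPTIONS:
        return False
    return True
```

Using an explicit exclude list (rather than skipping every BestEffort pod) keeps the test strict for workloads that are expected to carry resource requests.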

Test-Grid Flakes: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.14-informing#periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-cpu-partitioning&include-filter-by-regex=CPU%20Partitioning

Ref Resources on the network operator:
DaemonSet
Deployment

/assign @deads2k

@eggfoobar eggfoobar changed the title fix: avoid checking resources for BestEffort pods OCPVE-379: fix: avoid checking resources for BestEffort pods Jun 27, 2023
@openshift-ci-robot

openshift-ci-robot commented Jun 27, 2023

@eggfoobar: This pull request references OCPVE-379 which is a valid Jira issue.

In response to this:

The reason the flakes have been occurring is that some tests around the network-operator create DaemonSets and Deployments that do not set resource requests. Since BestEffort is something we don't alter during the admission webhook, I think it makes sense to skip BestEffort pods for resource checks. This does not affect the CPU affinity for workload partitioning; it will just set the default cpushare time.

Test-Grid Flakes: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.14-informing#periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-cpu-partitioning&include-filter-by-regex=CPU%20Partitioning

Ref Resources on the network operator:
DaemonSet
Deployment

/assign @deads2k

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 27, 2023
@openshift-ci openshift-ci bot requested review from csrwng and mfojtik June 27, 2023 17:57
@eggfoobar eggfoobar force-pushed the fix-cpu-partitioning-flakes branch from c062f0f to 6359b46 Compare June 27, 2023 19:10
@deads2k
Contributor

deads2k commented Jun 27, 2023

The point of CPU affinity is to keep the platform from taking CPU from the workloads. Our platform pods, even BestEffort ones, need to be controlled for that purpose.

@eggfoobar
Contributor Author

Oh sorry @deads2k, I wasn't clear about what's happening: the desired CPU affinity is not being violated here. Two things happen for workload partitioning, a CPU affinity change and a CPU share change.

The flake was around the CPU share itself. We checked that pods carrying the workload-partitioning annotations also had their resource requests moved to the custom resource field, so this would flake on tests that spun up BestEffort Deployments and DaemonSets: the default behavior is to apply the workload annotation with the minimum of 2 CPU shares and to not add or change resource requests, since none are present at admission time.

Unless we want to require that all platform pods contain resource requests, this is safe to skip, since the CPU affinity is still respected and the workload gets a default CPU share value of 2. What do you think?
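The admission-time behavior described above can be sketched like this. This is a simplified, hypothetical model, not the webhook's actual code: the annotation key is modeled loosely on OpenShift's workload-partitioning annotations, and the `cpu_millis` request field and function name are assumptions made for the example.

```python
def mutate_for_partitioning(pod):
    """Hypothetical sketch of the admission behavior described above.

    Every container gets a workload-partitioning annotation, so CPU
    affinity is always applied. Containers with a CPU request get shares
    derived from that request; BestEffort containers get the minimum of
    2 shares, and their (absent) resource requests are left untouched.
    """
    annotations = pod.setdefault("annotations", {})
    for c in pod["containers"]:
        # Simplified: the request is assumed to already be in millicores.
        cpu_millis = c.get("resources", {}).get("requests", {}).get("cpu_millis", 0)
        shares = max(2, cpu_millis * 1024 // 1000)
        annotations["resources.workload.openshift.io/" + c["name"]] = {"cpushares": shares}
    return pod
```

The point the comment makes is visible in the sketch: a BestEffort pod still receives the annotation (so affinity applies), just with the floor value of 2 shares.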

@eggfoobar
Contributor Author

eggfoobar commented Jun 29, 2023

Adding some more context around CPU shares. A CPU share is a relative amount of time a process is granted on a CPU; in Kubernetes, 1 CPU = 1024 CPU shares.

When there is no contention for CPU time, nothing happens and a process is allowed to use more. When there is contention, CPU shares are enforced, and each process gets its proportional percentage of CPU time, e.g. two processes with 512 shares each will each get 50% of the CPU time.

In Kubernetes, when a pod has a QoS class of BestEffort, the kubelet automatically gives it the default CPU share of 2, meaning it can be scheduled pretty much anywhere since its share is so small, and under contention it will get the smallest amount of time on the CPU. If you need a guaranteed minimum, consider supplying resources.requests for your pod.
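The request-to-shares conversion described above can be written out as a small function. This sketch is modeled on the kubelet's millicore-to-cpu.shares conversion as I understand it (an assumption about upstream behavior, not code from this PR); the constant and function names are local to the example.

```python
MIN_SHARES = 2          # kubelet minimum, also the BestEffort default
SHARES_PER_CPU = 1024   # 1 CPU == 1024 cgroup cpu.shares
MILLI_PER_CPU = 1000    # millicores per CPU

def milli_cpu_to_shares(milli_cpu):
    """Convert a CPU request in millicores to cgroup cpu.shares,
    clamped to the kubelet's minimum of 2."""
    if milli_cpu <= 0:
        return MIN_SHARES  # no request (BestEffort) -> minimum shares
    return max(MIN_SHARES, milli_cpu * SHARES_PER_CPU // MILLI_PER_CPU)
```

So a 500m request maps to 512 shares, a full CPU to 1024, and no request at all to the floor of 2.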

Normally this is not much of a problem. However, when using the workload pinning feature (which lets cluster admins limit the number of CPUs the platform pods have access to), the probability of CPU time contention goes up and resources.requests become more important for platform pods. It is strongly recommended that everything but your least important pods supply adequate resources.requests; otherwise a pod may be starved for CPU time with the default 2 CPU shares. In other words, on a highly utilized CPU in a CPU-pinned cluster, such a pod would get only 2/1024 * 100 ≈ 0.19% of the CPU time.
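The proportional arithmetic behind that 0.19% figure can be checked with a one-liner. This is a generic illustration of cgroup share math, with names invented for the example, assuming full contention where every competing group is runnable.

```python
def cpu_time_fraction(shares, competing_shares):
    """Under full contention, a cgroup's CPU time is its shares divided
    by the total runnable shares on the CPU."""
    return shares / (shares + sum(competing_shares))

# Two equal 512-share processes split the CPU 50/50.
# A 2-share BestEffort pod next to 1022 shares of other work gets
# 2/1024 of the CPU, i.e. roughly 0.2% of the time.
```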

@eggfoobar eggfoobar force-pushed the fix-cpu-partitioning-flakes branch from 6359b46 to 748ea47 Compare July 5, 2023 14:31
@openshift-ci-robot

openshift-ci-robot commented Jul 5, 2023

@eggfoobar: This pull request references OCPVE-379 which is a valid Jira issue.

In response to this:

The reason the flakes have been occurring is that some tests around the network-operator create DaemonSets and Deployments that do not set resource requests. This change adds a skip list for workloads that are acceptable as BestEffort, and adds a code comment to be explicit about what BestEffort means in a workload-partitioned cluster. This does not affect the CPU affinity for workload partitioning.

Test-Grid Flakes: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.14-informing#periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-cpu-partitioning&include-filter-by-regex=CPU%20Partitioning

Ref Resources on the network operator:
DaemonSet
Deployment

/assign @deads2k

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@eggfoobar eggfoobar force-pushed the fix-cpu-partitioning-flakes branch from 748ea47 to c503d91 Compare July 5, 2023 14:43
Signed-off-by: ehila <ehila@redhat.com>

feat: update to use exclude list instead of blanket ignore on qos

Signed-off-by: ehila <ehila@redhat.com>
@eggfoobar eggfoobar force-pushed the fix-cpu-partitioning-flakes branch from c503d91 to 0057332 Compare July 5, 2023 15:08
@deads2k
Contributor

deads2k commented Jul 5, 2023

The networking team ack'd these. The CPU slice is extremely small, and bugs were opened so there's long-term tracking as well.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 5, 2023
@openshift-ci
Contributor

openshift-ci bot commented Jul 5, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, eggfoobar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 5, 2023
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 861d0fd and 2 for PR HEAD 0057332 in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 30b9ae2 and 1 for PR HEAD 0057332 in total

@eggfoobar
Contributor Author

/retest-required

@openshift-ci
Contributor

openshift-ci bot commented Jul 6, 2023

@eggfoobar: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-single-node-serial 0057332 link false /test e2e-aws-ovn-single-node-serial
ci/prow/e2e-openstack-ovn 0057332 link false /test e2e-openstack-ovn
ci/prow/e2e-aws-ovn-single-node-upgrade 0057332 link false /test e2e-aws-ovn-single-node-upgrade
ci/prow/e2e-aws-ovn-cgroupsv2 0057332 link false /test e2e-aws-ovn-cgroupsv2
ci/prow/e2e-metal-ipi-ovn-ipv6 0057332 link false /test e2e-metal-ipi-ovn-ipv6

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 21f0791 into openshift:master Jul 6, 2023
17 of 22 checks passed
@eggfoobar eggfoobar deleted the fix-cpu-partitioning-flakes branch August 8, 2023 19:50