Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

e2e: fix CPU manager methods to be more flexible to different CPU topology #98373

Merged

Conversation

cynepco3hahue
Copy link

What type of PR is this?

/kind bug
/kind failing-test

What this PR does / why we need it:

  • fix the issue when the test runs on the node with the single CPU
  • fix the issue when the CPU topology has only one core per socket, it can
    be easily reproduced by configuring VM with multi NUMA, but when each socket
    has only one core

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


Signed-off-by: Artyom Lukianov alukiano@redhat.com

…ology

- fix the issue when the test runs on the node with the single CPU
- fix the issue when the CPU topology has only one core per socket, it can
  be easily reproduced by configuring VM with multi NUMA, but when each socket
  has only one core

Signed-off-by: Artyom Lukianov <alukiano@redhat.com>
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 25, 2021
@k8s-ci-robot
Copy link
Contributor

@cynepco3hahue: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 25, 2021
@k8s-ci-robot
Copy link
Contributor

Hi @cynepco3hahue. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jan 25, 2021
@cynepco3hahue
Copy link
Author

/test pull-kubernetes-node-kubelet-serial-cpu-manager

@cynepco3hahue
Copy link
Author

/test pull-kubernetes-node-kubelet-serial-topology-manager

@k8s-ci-robot
Copy link
Contributor

@cynepco3hahue: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/test pull-kubernetes-node-kubelet-serial-cpu-manager

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Copy link
Contributor

@cynepco3hahue: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/test pull-kubernetes-node-kubelet-serial-topology-manager

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added area/test sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 25, 2021
@SergeyKanzhelev SergeyKanzhelev moved this from Triage to PRs - Review in progress in SIG Node CI/Test Board Jan 25, 2021
@SergeyKanzhelev
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 26, 2021
@@ -283,7 +283,9 @@ func runGuPodTest(f *framework.Framework) {
cpu1 = cpuList[1]
} else if isMultiNUMA() {
cpuList = cpuset.MustParse(getCoreSiblingList(0)).ToSlice()
cpu1 = cpuList[1]
if len(cpuList) > 1 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is inside the isMultiNUMA branch, meaning, there are more than 1 cpu, correct? Not following why it may be just 1.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it calls getCoreSiblingList(0) to get all cores that placed under the same NUMA node as the CPU 0, so if the NUMA node has only one core, that unusual but possible(with fake NUMA for example), it will return the list with only one element.

@@ -359,7 +364,9 @@ func runMultipleGuNonGuPods(f *framework.Framework, cpuCap int64, cpuAlloc int64
cpu1 = cpuList[1]
} else if isMultiNUMA() {
cpuList = cpuset.MustParse(getCoreSiblingList(0)).ToSlice()
cpu1 = cpuList[1]
if len(cpuList) > 1 {
cpu1 = cpuList[1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

barely understand the logic of this test. Comment from the common sense: is the default value 1 be working in later call to cpuset.NewCPUSet(cpu1)? Curious why default value is not 0.

If nobody else will comment with knowledge of these tests I will take a deeper look later

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the test assumes that the CPU 0 is a reserved one, and it some logic under the CPU manager that takes CPUs by NUMA nodes, for example, if NUMA node 0 has enough CPUs to satisfy the container request it will use CPUs from the NUMA node 0, otherwise, it will pass to the next NUMA node
give me know if you need a better explanation

@cynepco3hahue
Copy link
Author

/retest

Copy link
Contributor

@ffromani ffromani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the patch per se is correct, but the main factor is these tests don't really want to run on environments which don't have enough core to actuallty guarantee a meanongful environment.
Otherwise, the test actually runs, but in environments so constrainted that actually gives us no signal.
This is the reason why in the topology manager e2e tests we have explicit check about running on a system with enough resources (and truth to be told, this check can probably be made a bit smarter) https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/topology_manager_test.go#L375

I think that in a followup PR we should add a similar check here. I can help with that.

@@ -313,6 +315,10 @@ func runNonGuPodTest(f *framework.Framework, cpuCap int64) {

ginkgo.By("checking if the expected cpuset was assigned")
expAllowedCPUsListRegex = fmt.Sprintf("^0-%d\n$", cpuCap-1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the assumption that cpuCap is a single digit number?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, it passed to the method as cpuCap int64

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as long as this assumption is OK

@SergeyKanzhelev
Copy link
Member

/lgtm

I suggest to follow @fromanirh suggestion on checking the test host specs in the beginning of the test. Maybe follow up PR

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 2, 2021
@ffromani
Copy link
Contributor

ffromani commented Feb 2, 2021

/lgtm

I suggest to follow @fromanirh suggestion on checking the test host specs in the beginning of the test. Maybe follow up PR

I think the cpumanager (and to a lesser extent the topology manager) e2e testsuite can use some improvements and cleanups and I'd be happy to help with this effort.

@cynepco3hahue
Copy link
Author

I think the cpumanager (and to a lesser extent the topology manager) e2e testsuite can use some improvements and cleanups and I'd be happy to help with this effort.

Yes, it can be great!

@SergeyKanzhelev SergeyKanzhelev moved this from PRs - Review in progress to PRs - Reviewer lgtm'd in SIG Node CI/Test Board Feb 8, 2021
@mrunalp
Copy link
Contributor

mrunalp commented Feb 24, 2021

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cynepco3hahue, mrunalp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 24, 2021
@k8s-ci-robot k8s-ci-robot merged commit 6235ad8 into kubernetes:master Feb 24, 2021
SIG Node CI/Test Board automation moved this from PRs - Reviewer lgtm'd to Done Feb 24, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Feb 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

5 participants