Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix discovering numa distance when node ids are not starting from 0 or it's ids are not sequential #113788

Merged

Conversation

PiotrProkop
Copy link
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #113781

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 9, 2022
@k8s-ci-robot
Copy link
Contributor

Hi @PiotrProkop. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 9, 2022
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 9, 2022
@mkumatag
Copy link
Member

mkumatag commented Nov 9, 2022

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 9, 2022
@klueska
Copy link
Contributor

klueska commented Nov 9, 2022

/ok-to-test

@klueska
Copy link
Contributor

klueska commented Nov 9, 2022

/triage accepted
/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 9, 2022
@klueska
Copy link
Contributor

klueska commented Nov 9, 2022

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 9, 2022
@klueska
Copy link
Contributor

klueska commented Nov 9, 2022

/milestone v1.26

@klueska
Copy link
Contributor

klueska commented Nov 9, 2022

/cc dchen1107 derekwaynecarr
This bug was introduced as part of #112914

It will cause the kubelet to crash on any system that has more than 1 NUMA node (i.e. almost any machine nowadays)

@PiotrProkop
Copy link
Contributor Author

PiotrProkop commented Nov 9, 2022

Just to clarify it will fail if NUMA nodes ids are not sequential e.g node0, node1 starting from 0 or if the single NUMA node on the system is not node0. I have added unit tests to cover those cases.

@liggitt
Copy link
Member

liggitt commented Nov 9, 2022

I'll defer to node reviewers on this specific change, but I labeled the bug as a regression/release-blocker and added it to the milestone

@klueska
Copy link
Contributor

klueska commented Nov 9, 2022

Thanks @liggitt is that label / milestone not applicable to this PR itself? Or just the bug?

@liggitt
Copy link
Member

liggitt commented Nov 9, 2022

Thanks @liggitt is that label / milestone not applicable to this PR itself? Or just the bug?

if this is accepted as the fix, it will be added to this PR (that's what I'm deferring to node folks on)

@dims
Copy link
Member

dims commented Nov 9, 2022

bug in alpha feature that did not get caught in our CI jobs. we need this in 1.26 itself!

/milestone v1.26

@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Nov 9, 2022
@dims
Copy link
Member

dims commented Nov 9, 2022

thanks @PiotrProkop

@klueska
Copy link
Contributor

klueska commented Nov 9, 2022

if this is accepted as the fix, it will be added to this PR (that's what I'm deferring to node folks on)

Great, thanks. Yes. I have already approved it as a sig-node-approver:
https://github.com/kubernetes/kubernetes/blob/master/OWNERS_ALIASES#L222...L229

@mkumatag
Copy link
Member

mkumatag commented Nov 9, 2022

bug in alpha feature that did not get caught in our CI jobs. we need this in 1.26 itself!

/milestone v1.26

well, we didn't enable any alpha feature gates.. this was just happening in our regular CI test on ppc64le platform with single NUMA node(with id 1)

@PiotrProkop
Copy link
Contributor Author

/hold
I will add a check not to discover NUMA distance, if proper topology-option is not set, just to be double sure.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 9, 2022
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 9, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: klueska, PiotrProkop

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…r their ids are not sequential

Signed-off-by: PiotrProkop <pprokop@nvidia.com>
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
@klueska
Copy link
Contributor

klueska commented Nov 9, 2022

@PiotrProkop Thanks for the update.

Now that the cadvisor update has landed this is much cleaner and much more protected behind the feature gate.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 9, 2022
@PiotrProkop
Copy link
Contributor Author

Thanks, for the review.
/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 9, 2022
@PiotrProkop
Copy link
Contributor Author

/retest
The error seems unrelated:

Nov  9 17:22:21.469: INFO: Waiting 10s to retry failed RunHostCmd: error running /home/prow/go/src/k8s.io/kubernetes/_output/bin/kubectl --server=https://127.0.0.1:35701/ --kubeconfig=/root/.kube/kind-test-config --namespace=statefulset-6261 exec ss-1 -- /bin/sh -x -c mv -v /tmp/index.html /usr/local/apache2/htdocs/ || true:
        Command stdout:
        stderr:
        Error from server (NotFound): pods "ss-1" not found
        error:
        exit status 1

@k8s-ci-robot k8s-ci-robot merged commit 623376b into kubernetes:master Nov 9, 2022
SIG Node PR Triage automation moved this from Triage to Done Nov 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Development

Successfully merging this pull request may close these issues.

Latest k8s fails with kublet panic runtime error
7 participants