
Fix Bugs in CPUManager distribute NUMA policy option #106599

Merged
merged 13 commits into kubernetes:master on Dec 10, 2021

Conversation

klueska
Contributor

@klueska klueska commented Nov 22, 2021

What type of PR is this?

/kind bug

What this PR does / why we need it:

Fix bugs in CPUManager distribute NUMA policy introduced for 1.23 in #105631

This includes:

  • Fixing #106571 ([Flaky Test][sig-node] kubernetes-unit-test TestTakeByTopologyNUMADistributed is being Flaky) by rounding the mean and stddev calculations to the nearest 1000th. Without this, rounding errors could lead to different results, causing flaky unit tests.
  • Making sure we balance any remainder CPUs we want to allocate across all NUMA nodes (not just the NUMA nodes in the distribution set).
  • Fixing a bug in the map.Keys() and map.Values() implementations. We were appending to lists that had been preallocated with a specific size, leading to a bunch of leading zeros in these lists. When passed to the mean/stddev functions, this caused obvious issues.
  • Updating the calculation for the minimum number of NUMA nodes required to satisfy the allocation (see the sketch after this list). Previously we would calculate a value larger than the true minimum, causing the algorithm to think it couldn't distribute CPUs properly and fall back to the packed algorithm unnecessarily.
  • Fixing an accounting bug in the NUMA distribution algorithm. Without this fix, the algorithm may decide to allocate "remainder" CPUs from a NUMA node that has no more CPUs to allocate. Moreover, it was only considering allocation of remainder CPUs from NUMA nodes such that each NUMA node in the remainderSet could only allocate 1 (i.e. 'cpuGroupSize') more CPUs. With these two issues in play, one could end up with an accounting error where not enough CPUs were allocated by the time the algorithm runs to completion.
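
To make the fourth bullet concrete: the corrected minimum is plain ceiling division over the per-node capacity. The sketch below is illustrative only; the function name and signature are hypothetical, not the actual kubelet code:

```go
package main

import "fmt"

// minNUMANodesNeeded returns the smallest number of NUMA nodes that can
// satisfy a request for numCPUs, assuming cpusPerNode allocatable CPUs per
// node (plain ceiling division). An overestimate here makes the algorithm
// wrongly conclude it cannot distribute the CPUs and fall back to the
// packed allocator.
func minNUMANodesNeeded(numCPUs, cpusPerNode int) int {
	return (numCPUs + cpusPerNode - 1) / cpusPerNode
}

func main() {
	fmt.Println(minNUMANodesNeeded(76, 20)) // 4: 76 CPUs fit in 4 nodes of 20 CPUs each
}
```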

I have added / updated a number of unit tests to verify that each of these bugs is indeed fixed. That said, the code could still benefit from more extensive testing. We will make sure to include that before promoting this feature to beta.

Which issue(s) this PR fixes:

Fixes #106571

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 22, 2021
@klueska
Contributor Author

klueska commented Nov 22, 2021

/sig node

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 22, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: klueska

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 22, 2021
@klueska
Contributor Author

klueska commented Nov 22, 2021

/cc @fromanirh @swatisehgal

@ffromani
Contributor

/triage accepted
It seems to me an extremely low-risk fix whose gains greatly exceed the risks, and since we are still before test freeze, we should have this in. I'll check ASAP whether my understanding is correct.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 22, 2021
@ffromani
Contributor

It would be nice, for documentation purposes, to have some more details on the scenario that triggers this bug, because it's surprising we cannot trigger it with the unit tests. This area of the code is easily testable (and because of this it has comprehensive testing). Besides this, LGTM.

@klueska
Contributor Author

klueska commented Nov 22, 2021

/retest

@swatisehgal
Contributor

The changes look reasonable and I have no objections, but I agree with @fromanirh: it would be helpful to pinpoint what exactly caused this and how it can be reproduced.
@klueska Can you share the CPU distribution of the live machine where this issue occurred, and what the CPU request looked like, for better understanding?

@swatisehgal
Contributor

/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 23, 2021
@klueska
Contributor Author

klueska commented Nov 23, 2021

To actually pinpoint this bug, I did, in fact, reproduce the exact failure scenario in a unit test. But I had to add a new machine that matched the live machine in order to do this. I was hoping not to add the new machine, since for all intents and purposes it is similar to the topoDualSocketMultiNumaPerSocketHT we already have; the main difference is that it has 256 CPUs instead of 80, and 8 NUMA nodes instead of 4. As such, I was hoping to just reproduce this issue on the existing topoDualSocketMultiNumaPerSocketHT, but wasn't able to. If you think it's worthwhile, I can add this new machine with the test to reproduce this issue to the PR.

@ffromani
Contributor

ffromani commented Nov 23, 2021

To actually pinpoint this bug, I did, in fact, reproduce the exact failure scenario in a unit test. But I had to add a new machine that matched the live machine in order to do this. I was hoping not to add the new machine, since for all intents and purposes it is similar to the topoDualSocketMultiNumaPerSocketHT we already have; the main difference is that it has 256 CPUs instead of 80, and 8 NUMA nodes instead of 4. As such, I was hoping to just reproduce this issue on the existing topoDualSocketMultiNumaPerSocketHT, but wasn't able to. If you think it's worthwhile, I can add this new machine with the test to reproduce this issue to the PR.

I think it is important indeed to have this machine, because it allows us to exercise a flow and verify the bug, which (surprisingly?) we cannot do with """just""" 80 cores. But if this is the case, it can wait for a follow-up PR.
[EDIT]
Most notably, we can add extra tests when we graduate this feature to beta.

@ffromani
Contributor

/lgtm
Codewise it looks fine: the risk is low to none, it fixes an actual bug, and we have clarity about the test scenario and a path forward.
It won't merge automatically because the status is pending: "Not mergeable. Must be in milestone v1.23."

@ffromani
Contributor

ffromani commented Nov 24, 2021

I'm not done yet. There are at least 4 more commits that will come on top of this. I'll let you know when it's ready for review again.

Sure, and just to be clear, I meant my own review needs to be deeper. Issues like the wrong stddev computation may happen (EDIT: meaning: can slip past the review), but still let me try harder.

Without this fix, the algorithm may decide to allocate "remainder" CPUs from a
NUMA node that has no more CPUs to allocate. Moreover, it was only considering
allocation of remainder CPUs from NUMA nodes such that each NUMA node in the
remainderSet could only allocate 1 (i.e. 'cpuGroupSize') more CPUs. With these
two issues in play, one could end up with an accounting error where not enough
CPUs were allocated by the time the algorithm runs to completion.

The updated algorithm will now omit any NUMA nodes that have 0 CPUs left from
the set of NUMA nodes considered for allocating remainder CPUs. Additionally,
we now consider *all* combinations of nodes from the remainder set of size
1..len(remainderSet). This allows us to find a better solution if allocating
CPUs from a smaller set leads to a more balanced allocation. Finally, we loop
through all NUMA nodes 1-by-1 in the remainderSet until all remainder CPUs have
been accounted for and allocated. This ensures that we will not hit an
accounting error later on because we explicitly remove CPUs from the remainder
set until there are none left.

A follow-on commit adds a set of unit tests that will fail before these
changes, but succeed after them.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
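
For reviewers following along, here is a self-contained sketch of the remainder-set enumeration the commit message above describes: filter out NUMA nodes with no CPUs left, then try every subset size from 1 up to the number of candidates. The helper names and availability numbers are illustrative assumptions, not the actual code in pkg/kubelet/cm/cpumanager:

```go
package main

import "fmt"

// combinations returns all subsets of nodes with exactly k elements.
func combinations(nodes []int, k int) [][]int {
	if k == 0 {
		return [][]int{{}}
	}
	if len(nodes) < k {
		return nil
	}
	var out [][]int
	// Subsets that include nodes[0], then subsets that exclude it.
	for _, rest := range combinations(nodes[1:], k-1) {
		out = append(out, append([]int{nodes[0]}, rest...))
	}
	return append(out, combinations(nodes[1:], k)...)
}

func main() {
	// CPUs still available on each NUMA node in the combo after the even
	// distribution (illustrative numbers, not taken from the PR).
	available := map[int]int{0: 0, 1: 2, 2: 1, 3: 3}

	// Omit nodes with 0 CPUs left, as the fix requires.
	var candidates []int
	for node, avail := range available {
		if avail > 0 {
			candidates = append(candidates, node)
		}
	}

	// Consider every remainder-set size from 1 up to len(candidates), since
	// a smaller set can yield a more balanced allocation.
	for size := 1; size <= len(candidates); size++ {
		for _, remainderSet := range combinations(candidates, size) {
			fmt.Println("candidate remainderSet:", remainderSet)
			// ... evaluate the balance (stddev) of this set and keep the best ...
		}
	}
}
```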
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 24, 2021
@klueska
Contributor Author

klueska commented Nov 24, 2021

Sure, and just to be clear, I meant my own review needs to be deeper. Issues like the wrong stddev computation may happen (EDIT: meaning: can slip past the review), but still let me try harder.

Actually the stddev calculation itself was correct -- the list of values passed into it was incorrect because of this:

 func (m mapIntInt) Values() []int {
-   values := make([]int, len(m))
+   var values []int
    for _, v := range m {
        values = append(values, v)
    }

The values arrays were then much longer than they should have been (front-padded with 0s) because we first initialized them to the size of len(m) and then appended to them.

Performing a stddev calculation on one of these lists then gave us a skewed result because of all the leading 0s. The result looked correct on the surface unless you double-checked the inputs and the results closely.
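
The effect is easy to reproduce in isolation. A minimal standalone demonstration (not the Kubernetes code itself):

```go
package main

import "fmt"

func main() {
	m := map[int]int{0: 5, 1: 7}

	// Buggy: make() preallocates len(m) zero elements, and append() then
	// adds the real values after them.
	buggy := make([]int, len(m))
	for _, v := range m {
		buggy = append(buggy, v)
	}
	fmt.Println(buggy) // [0 0 5 7] (map order may vary); leading zeros skew the stddev

	// Fixed: start from a nil slice (or make([]int, 0, len(m))).
	var fixed []int
	for _, v := range m {
		fixed = append(fixed, v)
	}
	fmt.Println(fixed) // [5 7]
}
```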

@klueska
Contributor Author

klueska commented Nov 24, 2021

I have all of the code updates added now, but will rework the PR a bit and add more unit tests. Feel free to take a cursory look to begin with. The changes are best reviewed commit-by-commit with a detailed reading of each commit message.

Before Change:
"test" description="ensure bestRemainder chosen with NUMA nodes that have enough CPUs to satisfy the request"
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[0 1] distribution=8 remainder=2 available=[-1 -1 0 6] balance=2.915
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[0 2] distribution=8 remainder=2 available=[-1 0 -1 6] balance=2.915
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[0 3] distribution=8 remainder=2 available=[5 -1 0 0] balance=2.345
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[1 2] distribution=8 remainder=2 available=[0 -1 -1 6] balance=2.915
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[1 3] distribution=8 remainder=2 available=[0 -1 0 5] balance=2.345
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[2 3] distribution=8 remainder=2 available=[0 0 -1 5] balance=2.345
"bestCombo found" distribution=8 bestCombo=[0 1 2 3] bestRemainder=[0 3]

--- FAIL: TestTakeByTopologyNUMADistributed (0.01s)
    --- FAIL: TestTakeByTopologyNUMADistributed/ensure_bestRemainder_chosen_with_NUMA_nodes_that_have_enough_CPUs_to_satisfy_the_request (0.00s)
        cpu_assignment_test.go:867: unexpected error [accounting error, not enough CPUs allocated, remaining: 1]

After Change:
"test" description="ensure bestRemainder chosen with NUMA nodes that have enough CPUs to satisfy the request"
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[3] distribution=8 remainder=2 available=[0 0 0 4] balance=1.732
"bestCombo found" distribution=8 bestCombo=[0 1 2 3] bestRemainder=[3]

SUCCESS

Signed-off-by: Kevin Klues <kklues@nvidia.com>
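
As an aside for readers of these logs: the balance value matches the population standard deviation of the available list, rounded to the nearest 1000th per the rounding fix above. For example, available=[0 0 0 4] has mean 1 and stddev sqrt(3) ≈ 1.732. A minimal sketch of that computation, with illustrative names rather than the actual kubelet helpers:

```go
package main

import (
	"fmt"
	"math"
)

// balance computes the population standard deviation of xs, rounding the
// mean and the stddev to the nearest 1000th to avoid the flaky-test
// rounding errors described earlier in this PR.
func balance(xs []int) float64 {
	var sum float64
	for _, x := range xs {
		sum += float64(x)
	}
	mean := math.Round(sum/float64(len(xs))*1000) / 1000

	var variance float64
	for _, x := range xs {
		variance += (float64(x) - mean) * (float64(x) - mean)
	}
	variance /= float64(len(xs))

	return math.Round(math.Sqrt(variance)*1000) / 1000
}

func main() {
	fmt.Println(balance([]int{0, 0, 0, 4}))  // 1.732, as in the "After Change" log
	fmt.Println(balance([]int{0, 0, -1, 5})) // 2.345
}
```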
We observed this exact allocation attempt on a live cluster and saw the
algorithm fail with an accounting error. This test was added to verify that
this case is now handled by the updates to the algorithm and that we don't
regress from it in the future.

"test" description="ensure previous failure encountered on live machine has been fixed (1/1)"
"combo remainderSet balance" combo=[2 4 6] remainderSet=[2 4 6] distribution=9 remainder=1 available=[14 2 4 4 0 3 4 1] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[2 4] distribution=9 remainder=1 available=[0 3 4 1 14 2 4 4] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[2 6] distribution=9 remainder=1 available=[1 14 2 4 4 0 3 4] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[4 6] distribution=9 remainder=1 available=[1 3 4 0 14 2 4 4] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[2] distribution=9 remainder=1 available=[4 0 3 4 1 14 2 4] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[4] distribution=9 remainder=1 available=[3 4 0 14 2 4 4 1] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[6] distribution=9 remainder=1 available=[1 13 2 4 4 1 3 4] balance=3.606
"bestCombo found" distribution=9 bestCombo=[2 4 6] bestRemainder=[6]

Signed-off-by: Kevin Klues <kklues@nvidia.com>
@klueska
Contributor Author

klueska commented Nov 24, 2021

OK. PR is now ready for review.

@ehashman ehashman moved this from Triage to Waiting on Author in SIG Node PR Triage Nov 24, 2021
Contributor

@ffromani ffromani left a comment


Initial review. Looks good overall, and the series is nicely laid out to be as easy to review as possible.
However, a number of issues fixed here could have been caught in the initial review (of mine), so I want to deep-dive into the algorithm and the changes.

I think we aim for 1.23.1 anyway (out of necessity), so timing should not be too pressing a factor in this regard.

Last but not least, a fair number of utility functions have been added. This is fine, and I don't think it's time yet to generalize them and/or move them outside the cpumanager package, but some focused unit testing of these can improve the code quality/reliability even more, at very little cost (4008ea0 comes to mind).

@klueska klueska changed the title Fix Bug in CPUManager distribute NUMA policy option Fix Bugs in CPUManager distribute NUMA policy option Nov 26, 2021
@klueska
Contributor Author

klueska commented Nov 26, 2021

Thanks @fromanirh. I agree that more unit testing (especially of the utility functions) will be useful here. We have plans to move all of this logic to a subpackage (i.e. cpuassignment) to help clean some of this up in general. Maybe it makes sense to focus on the correctness of the algorithm for now and then move the code to a subpackage / add unit tests for the utility functions in a follow-on PR.

@ffromani
Contributor

Thanks @fromanirh. I agree that more unit testing (especially of the utility functions) will be useful here. We have plans to move all of this logic to a subpackage (i.e. cpuassignment) to help clean some of this up in general. Maybe it makes sense to focus on the correctness of the algorithm for now and then move the code to a subpackage / add unit tests for the utility functions in a follow-on PR.

On one hand, I do agree that it makes sense to add these tests as part of the cleanup and code movement, which I also like and agree with, as prep work to graduate to beta or in general later.
On the other hand, testing these utility functions also contributes to the correctness of the algorithm, because we get more coverage and more trust in the algorithm's building blocks.

Overall, considering again that we aim for 1.23.1, adding more content to this PR is probably not a good idea, so it's better to defer these additions to the future cleanup PR, backporting fixes and their minimal test coverage if needed.

@ffromani
Contributor

ffromani commented Dec 8, 2021

update: I expect to complete my review by (end of) December 10, worst case.

@ffromani
Contributor

/lgtm
All my comments have been addressed, and we have plans we all agree on for further cleanup and improvements. For example, I think we should review the logging to let cpumanager emit the attempts like we do in the tests; but this needs discussion (finding the right log level doesn't seem too easy) and there is no point discussing it in the context of this PR.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 10, 2021
@klueska
Contributor Author

klueska commented Dec 10, 2021

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 10, 2021
@swatisehgal
Contributor

/lgtm

Agree with @ffromani that the PR is nicely laid out! Thanks for addressing the review comments and adding additional tests.

@k8s-ci-robot k8s-ci-robot merged commit 1b0d83f into kubernetes:master Dec 10, 2021
SIG Node PR Triage automation moved this from Waiting on Author to Done Dec 10, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Dec 10, 2021
k8s-ci-robot added a commit that referenced this pull request Dec 10, 2021
…599-upstream-release-1.23

Automated cherry pick of #106599: Fix Bugs in CPUManager distribute NUMA policy option