
Fix Bugs in CPUManager distribute NUMA policy option #106599

Merged
merged 13 commits into kubernetes:master on Dec 10, 2021

Conversation

klueska
Contributor

@klueska klueska commented Nov 22, 2021

What type of PR is this?

/kind bug

What this PR does / why we need it:

Fix bugs in CPUManager distribute NUMA policy introduced for 1.23 in #105631

This includes:

  • Fixing #106571 ([Flaky Test][sig-node] kubernetes-unit-test TestTakeByTopologyNUMADistributed is being Flaky) by rounding the mean and stddev calculations to the nearest 1000th. Without this, rounding errors could lead to different results, causing flaky unit tests.
  • Making sure we balance any remainder CPUs we want to allocate across all NUMA nodes (not just the NUMA nodes in the distribution set).
  • Fixing a bug in the map.Keys() and map.Values() implementations. We were appending to lists that had been preallocated with a specific size, leading to a bunch of leading zeros in these lists. When passed to the mean/stddev functions, this caused obvious issues.
  • Updating the calculation for the minimum number of NUMA nodes required to satisfy the allocation (see the sketch after this list). Previously we would calculate a value larger than the true minimum, causing the algorithm to think it couldn't distribute CPUs properly and fall back to the packed algorithm unnecessarily.
  • Fixing an accounting bug in the NUMA distribution algorithm. Without this fix, the algorithm may decide to allocate "remainder" CPUs from a NUMA node that has no more CPUs to allocate. Moreover, it was only considering allocation of remainder CPUs from NUMA nodes such that each NUMA node in the remainderSet could only allocate 1 (i.e. 'cpuGroupSize') more CPUs. With these two issues in play, one could end up with an accounting error where not enough CPUs were allocated by the time the algorithm runs to completion.
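
To make the fourth bullet concrete: the corrected minimum is plain ceiling division over the per-node capacity. The sketch below is illustrative only; the function name and signature are hypothetical, not the actual kubelet code:

```go
package main

import "fmt"

// minNUMANodesNeeded returns the smallest number of NUMA nodes that can
// satisfy a request for numCPUs, assuming cpusPerNode allocatable CPUs per
// node (plain ceiling division). An overestimate here makes the algorithm
// wrongly conclude it cannot distribute the CPUs and fall back to the
// packed allocator.
func minNUMANodesNeeded(numCPUs, cpusPerNode int) int {
	return (numCPUs + cpusPerNode - 1) / cpusPerNode
}

func main() {
	fmt.Println(minNUMANodesNeeded(76, 20)) // 4: 76 CPUs fit in 4 nodes of 20 CPUs each
}
```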

I have added / updated a number of unit tests to verify that each of these bugs is indeed fixed. That said, the code could still benefit from more extensive testing. We will make sure to include that before promoting this feature to beta.

Which issue(s) this PR fixes:

Fixes #106571

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 22, 2021
@klueska
Contributor Author

klueska commented Nov 22, 2021

/sig node

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 22, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: klueska

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 22, 2021
@klueska
Contributor Author

klueska commented Nov 22, 2021

/cc @fromanirh @swatisehgal

@ffromani
Contributor

/triage accepted
It seems to me an extremely low-risk fix whose gains greatly exceed the risks, and since we are still before test freeze, we should have this in. I'll check ASAP whether my understanding is correct.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 22, 2021
@ffromani
Contributor

It would be nice, for documentation purposes, to have some more details on the scenario that triggers this bug, because it's surprising we cannot trigger it with the unit tests. This area of the code is easily testable (and because of this it has comprehensive testing). Besides this, LGTM.

@klueska
Contributor Author

klueska commented Nov 22, 2021

/retest

@swatisehgal
Contributor

The changes look reasonable and I have no objections, but I agree with @fromanirh: it would be helpful to pinpoint what exactly caused this and how it can be reproduced.
@klueska Can you share the CPU distribution of the live machine where this issue occurred, and what the CPU request looked like, for better understanding?

@swatisehgal
Contributor

/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 23, 2021
@klueska
Contributor Author

klueska commented Nov 23, 2021

To actually pinpoint this bug, I did, in fact, reproduce the exact failure scenario in a unit test. But I had to add a new machine that matched the live machine in order to do this. I was hoping not to add the new machine, since for all intents and purposes it is similar to the topoDualSocketMultiNumaPerSocketHT we already have; the main difference is that it has 256 CPUs instead of 80, and 8 NUMA nodes instead of 4. As such, I was hoping to just reproduce this issue on the existing topoDualSocketMultiNumaPerSocketHT, but wasn't able to. If you think it's worthwhile, I can add this new machine with the test to reproduce this issue to the PR.

@ffromani
Contributor

ffromani commented Nov 23, 2021

To actually pinpoint this bug, I did, in fact, reproduce the exact failure scenario in a unit test. But I had to add a new machine that matched the live machine in order to do this. I was hoping not to add the new machine, since for all intents and purposes it is similar to the topoDualSocketMultiNumaPerSocketHT we already have; the main difference is that it has 256 CPUs instead of 80, and 8 NUMA nodes instead of 4. As such, I was hoping to just reproduce this issue on the existing topoDualSocketMultiNumaPerSocketHT, but wasn't able to. If you think it's worthwhile, I can add this new machine with the test to reproduce this issue to the PR.

I think it is important indeed to have this machine, because it allows us to exercise a flow and verify the bug, which (surprisingly?) we cannot do with """just""" 80 cores. But if this is the case, it can wait for a follow-up PR.
[EDIT]
Most notably, we can add extra tests when we graduate this feature to beta.

@ffromani
Contributor

/lgtm
Codewise it looks fine: the risk is low to none, it fixes an actual bug, and we have clarity about the test scenario and a path forward.
It won't merge automatically because the status is pending: "Not mergeable. Must be in milestone v1.23."

@ffromani
Contributor

ffromani commented Nov 24, 2021

I'm not done yet. There are at least 4 more commits that will come on top of this. I'll let you know when it's ready for review again.

Sure, and just to be clear, I meant my own review needs to be deeper. Issues like the wrong stddev computation may happen (EDIT: meaning: can slip past the review), but still let me try harder.

Without this fix, the algorithm may decide to allocate "remainder" CPUs from a
NUMA node that has no more CPUs to allocate. Moreover, it was only considering
allocation of remainder CPUs from NUMA nodes such that each NUMA node in the
remainderSet could only allocate 1 (i.e. 'cpuGroupSize') more CPUs. With these
two issues in play, one could end up with an accounting error where not enough
CPUs were allocated by the time the algorithm runs to completion.

The updated algorithm will now omit any NUMA nodes that have 0 CPUs left from
the set of NUMA nodes considered for allocating remainder CPUs. Additionally,
we now consider *all* combinations of nodes from the remainder set of size
1..len(remainderSet). This allows us to find a better solution if allocating
CPUs from a smaller set leads to a more balanced allocation. Finally, we loop
through all NUMA nodes 1-by-1 in the remainderSet until all remainder CPUs have
been accounted for and allocated. This ensures that we will not hit an
accounting error later on because we explicitly remove CPUs from the remainder
set until there are none left.

A follow-on commit adds a set of unit tests that will fail before these
changes, but succeed after them.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
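
For reviewers following along, here is a self-contained sketch of the remainder-set enumeration the commit message above describes: filter out NUMA nodes with no CPUs left, then try every subset size from 1 up to the number of candidates. The helper names and availability numbers are illustrative assumptions, not the actual code in pkg/kubelet/cm/cpumanager:

```go
package main

import "fmt"

// combinations returns all subsets of nodes with exactly k elements.
func combinations(nodes []int, k int) [][]int {
	if k == 0 {
		return [][]int{{}}
	}
	if len(nodes) < k {
		return nil
	}
	var out [][]int
	// Subsets that include nodes[0], then subsets that exclude it.
	for _, rest := range combinations(nodes[1:], k-1) {
		out = append(out, append([]int{nodes[0]}, rest...))
	}
	return append(out, combinations(nodes[1:], k)...)
}

func main() {
	// CPUs still available on each NUMA node in the combo after the even
	// distribution (illustrative numbers, not taken from the PR).
	available := map[int]int{0: 0, 1: 2, 2: 1, 3: 3}

	// Omit nodes with 0 CPUs left, as the fix requires.
	var candidates []int
	for node, avail := range available {
		if avail > 0 {
			candidates = append(candidates, node)
		}
	}

	// Consider every remainder-set size from 1 up to len(candidates), since
	// a smaller set can yield a more balanced allocation.
	for size := 1; size <= len(candidates); size++ {
		for _, remainderSet := range combinations(candidates, size) {
			fmt.Println("candidate remainderSet:", remainderSet)
			// ... evaluate the balance (stddev) of this set and keep the best ...
		}
	}
}
```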
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 24, 2021
@klueska
Contributor Author

klueska commented Nov 24, 2021

Sure, and just to be clear, I meant my own review needs to be deeper. Issues like the wrong stddev computation may happen (EDIT: meaning: can slip past the review), but still let me try harder.

Actually the stddev calculation itself was correct -- the list of values passed into it was incorrect because of this:

 func (m mapIntInt) Values() []int {
-   values := make([]int, len(m))
+   var values []int
    for _, v := range m {
        values = append(values, v)
    }

The values arrays were then much longer than they should have been (front-padded with 0s) because we first initialized them to the size of len(m) and then appended to them.

Performing a stddev calculation on one of these lists then gave us a skewed result because of all the leading 0s. The result looked correct on the surface unless you double-checked the inputs and the results closely.
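
The effect is easy to reproduce in isolation. A minimal standalone demonstration (not the Kubernetes code itself):

```go
package main

import "fmt"

func main() {
	m := map[int]int{0: 5, 1: 7}

	// Buggy: make() preallocates len(m) zero elements, and append() then
	// adds the real values after them.
	buggy := make([]int, len(m))
	for _, v := range m {
		buggy = append(buggy, v)
	}
	fmt.Println(buggy) // [0 0 5 7] (map order may vary); leading zeros skew the stddev

	// Fixed: start from a nil slice (or make([]int, 0, len(m))).
	var fixed []int
	for _, v := range m {
		fixed = append(fixed, v)
	}
	fmt.Println(fixed) // [5 7]
}
```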

@klueska
Contributor Author

klueska commented Nov 24, 2021

I have all of the code updates added now, but will rework the PR a bit and add more unit tests. Feel free to take a cursory look to begin with. The changes are best reviewed commit-by-commit with a detailed reading of each commit message.

Before Change:
"test" description="ensure bestRemainder chosen with NUMA nodes that have enough CPUs to satisfy the request"
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[0 1] distribution=8 remainder=2 available=[-1 -1 0 6] balance=2.915
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[0 2] distribution=8 remainder=2 available=[-1 0 -1 6] balance=2.915
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[0 3] distribution=8 remainder=2 available=[5 -1 0 0] balance=2.345
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[1 2] distribution=8 remainder=2 available=[0 -1 -1 6] balance=2.915
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[1 3] distribution=8 remainder=2 available=[0 -1 0 5] balance=2.345
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[2 3] distribution=8 remainder=2 available=[0 0 -1 5] balance=2.345
"bestCombo found" distribution=8 bestCombo=[0 1 2 3] bestRemainder=[0 3]

--- FAIL: TestTakeByTopologyNUMADistributed (0.01s)
    --- FAIL: TestTakeByTopologyNUMADistributed/ensure_bestRemainder_chosen_with_NUMA_nodes_that_have_enough_CPUs_to_satisfy_the_request (0.00s)
        cpu_assignment_test.go:867: unexpected error [accounting error, not enough CPUs allocated, remaining: 1]

After Change:
"test" description="ensure bestRemainder chosen with NUMA nodes that have enough CPUs to satisfy the request"
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[3] distribution=8 remainder=2 available=[0 0 0 4] balance=1.732
"bestCombo found" distribution=8 bestCombo=[0 1 2 3] bestRemainder=[3]

SUCCESS

Signed-off-by: Kevin Klues <kklues@nvidia.com>
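
As an aside for readers of these logs: the balance value matches the population standard deviation of the available list, rounded to the nearest 1000th per the rounding fix above. For example, available=[0 0 0 4] has mean 1 and stddev sqrt(3) ≈ 1.732. A minimal sketch of that computation, with illustrative names rather than the actual kubelet helpers:

```go
package main

import (
	"fmt"
	"math"
)

// balance computes the population standard deviation of xs, rounding the
// mean and the stddev to the nearest 1000th to avoid the flaky-test
// rounding errors described earlier in this PR.
func balance(xs []int) float64 {
	var sum float64
	for _, x := range xs {
		sum += float64(x)
	}
	mean := math.Round(sum/float64(len(xs))*1000) / 1000

	var variance float64
	for _, x := range xs {
		variance += (float64(x) - mean) * (float64(x) - mean)
	}
	variance /= float64(len(xs))

	return math.Round(math.Sqrt(variance)*1000) / 1000
}

func main() {
	fmt.Println(balance([]int{0, 0, 0, 4}))  // 1.732, as in the "After Change" log
	fmt.Println(balance([]int{0, 0, -1, 5})) // 2.345
}
```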
We observed this exact allocation attempt on a live cluster and saw the
algorithm fail with an accounting error. This test was added to verify that
this case is now handled by the updates to the algorithm and that we don't
regress from it in the future.

"test" description="ensure previous failure encountered on live machine has been fixed (1/1)"
"combo remainderSet balance" combo=[2 4 6] remainderSet=[2 4 6] distribution=9 remainder=1 available=[14 2 4 4 0 3 4 1] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[2 4] distribution=9 remainder=1 available=[0 3 4 1 14 2 4 4] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[2 6] distribution=9 remainder=1 available=[1 14 2 4 4 0 3 4] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[4 6] distribution=9 remainder=1 available=[1 3 4 0 14 2 4 4] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[2] distribution=9 remainder=1 available=[4 0 3 4 1 14 2 4] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[4] distribution=9 remainder=1 available=[3 4 0 14 2 4 4 1] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[6] distribution=9 remainder=1 available=[1 13 2 4 4 1 3 4] balance=3.606
"bestCombo found" distribution=9 bestCombo=[2 4 6] bestRemainder=[6]

Signed-off-by: Kevin Klues <kklues@nvidia.com>
@klueska
Contributor Author

klueska commented Nov 24, 2021

OK. PR is now ready for review.

@ehashman ehashman moved this from Triage to Waiting on Author in SIG Node PR Triage Nov 24, 2021
Contributor

@ffromani ffromani left a comment


Initial review. Looks good overall, and the series is nicely laid out to be as easy to review as possible.
However, a number of issues fixed here could have been caught in the initial review (of mine), so I want to deep-dive into the algorithm and the changes.

I think we aim for 1.23.1 anyway (out of necessity), so timing should not be too pressing a factor in this regard.

Last but not least, a fair number of utility functions have been added. This is fine, and I don't think it's time yet to generalize them and/or move them outside the cpumanager package, but some focused unit testing of these can improve the code quality/reliability even more, at very little cost (4008ea0 comes to mind).

@klueska klueska changed the title Fix Bug in CPUManager distribute NUMA policy option Fix Bugs in CPUManager distribute NUMA policy option Nov 26, 2021
@klueska
Contributor Author

klueska commented Nov 26, 2021

Thanks @fromanirh. I agree that more unit testing (especially of the utility functions) will be useful here. We have plans to move all of this logic to a subpackage (i.e. cpuassignment) to help clean some of this up in general. Maybe it makes sense to focus on the correctness of the algorithm for now and then move the code to a subpackage / add unit tests for the utility functions in a follow-on PR.

@ffromani
Contributor

Thanks @fromanirh. I agree that more unit testing (especially of the utility functions) will be useful here. We have plans to move all of this logic to a subpackage (i.e. cpuassignment) to help clean some of this up in general. Maybe it makes sense to focus on the correctness of the algorithm for now and then move the code to a subpackage / add unit tests for the utility functions in a follow-on PR.

On one hand, I do agree that it makes sense to add these tests as part of the cleanup and code movement, which I also like and agree with, as prep work to graduate to beta or in general later.
On the other hand, testing these utility functions also contributes to the correctness of the algorithm, because we get more coverage and more trust in the algorithm's building blocks.

Overall, considering again that we aim for 1.23.1, adding more content to this PR is probably not a good idea, so it's better to defer these additions to the future cleanup PR, backporting fixes and their minimal test coverage if needed.

@ffromani
Contributor

ffromani commented Dec 8, 2021

update: I expect to complete my review by (end of) December 10, worst case.

@ffromani
Contributor

/lgtm
All my comments have been addressed, and we have plans we all agree on for further cleanup and improvements. For example, I think we should review the logging to let cpumanager emit the attempts like we do in the tests; but this needs discussion (finding the right log level doesn't seem too easy) and there is no point discussing it in the context of this PR.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 10, 2021
@klueska
Contributor Author

klueska commented Dec 10, 2021

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 10, 2021
@swatisehgal
Contributor

/lgtm

Agree with @ffromani that the PR is nicely laid out! Thanks for addressing the review comments and adding additional tests.

@k8s-ci-robot k8s-ci-robot merged commit 1b0d83f into kubernetes:master Dec 10, 2021
SIG Node PR Triage automation moved this from Waiting on Author to Done Dec 10, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Dec 10, 2021
k8s-ci-robot added a commit that referenced this pull request Dec 10, 2021
…599-upstream-release-1.23

Automated cherry pick of #106599: Fix Bugs in CPUManager distribute NUMA policy option