Fix Bugs in CPUManager distribute NUMA policy option #106599
/sig node |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: klueska The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/cc @fromanirh @swatisehgal |
/triage accepted |
It would be nice, for documentation purposes, to have some more details on the scenario which triggers this bug, because it's surprising we cannot trigger this with the unit tests. This area of the code is easily testable (and because of this it has comprehensive testing). Besides this, LGTM. |
/retest |
The changes look reasonable and I have no objections but I agree with @fromanirh, it would be helpful to pinpoint what exactly caused this and how it can be reproduced. |
/priority important-soon |
To actually pinpoint this bug, I did, in fact, reproduce the exact failure scenario in a unit test. But I had to add a new machine that matched the live machine in order to do this. I was hoping to not add the new machine (since for all intents and purposes it is similar to the |
I think it is important indeed to have this machine, because it allows us to exercise a flow and verify the bug, which (surprisingly?) we cannot do with """just""" 80 cores. But if this is the case, it can wait for a follow-up PR. |
/lgtm |
Sure, and just to be clear, I meant my own review needs to be deeper. Issues like the wrong stddev computation may happen (EDIT: meaning: can slip past the review), but still let me try harder. |
Without this fix, the algorithm may decide to allocate "remainder" CPUs from a NUMA node that has no more CPUs to allocate. Moreover, it was only considering allocation of remainder CPUs from NUMA nodes such that each NUMA node in the remainderSet could only allocate 1 (i.e. 'cpuGroupSize') more CPUs. With these two issues in play, one could end up with an accounting error where not enough CPUs were allocated by the time the algorithm runs to completion.

The updated algorithm will now omit any NUMA nodes that have 0 CPUs left from the set of NUMA nodes considered for allocating remainder CPUs. Additionally, we now consider *all* combinations of nodes from the remainder set of size 1..len(remainderSet). This allows us to find a better solution if allocating CPUs from a smaller set leads to a more balanced allocation. Finally, we loop through all NUMA nodes 1-by-1 in the remainderSet until all remainder CPUs have been accounted for and allocated. This ensures that we will not hit an accounting error later on because we explicitly remove CPUs from the remainder set until there are none left.

A follow-on commit adds a set of unit tests that will fail before these changes, but succeed after them.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
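The remainder handling described in this commit message can be sketched roughly as below. This is a simplified illustration and not the kubelet code: the names `combinations`, `remainderCandidates`, and `spreadRemainder` are hypothetical, and free CPUs are modeled as a plain map from NUMA node ID to a count rather than as the CPUManager's real topology types. It also assumes the remainder is a multiple of the group size.

```go
package main

import "fmt"

// combinations returns every k-element subset of items, in input order.
func combinations(items []int, k int) [][]int {
	var out [][]int
	var rec func(start int, cur []int)
	rec = func(start int, cur []int) {
		if len(cur) == k {
			out = append(out, append([]int(nil), cur...)) // copy before storing
			return
		}
		for i := start; i < len(items); i++ {
			rec(i+1, append(cur, items[i]))
		}
	}
	rec(0, nil)
	return out
}

// remainderCandidates (hypothetical helper) omits nodes with 0 free CPUs and
// returns all subsets of the surviving nodes of size len(eligible) down to 1.
func remainderCandidates(combo []int, free map[int]int) [][]int {
	var eligible []int
	for _, node := range combo {
		if free[node] > 0 {
			eligible = append(eligible, node)
		}
	}
	var sets [][]int
	for k := len(eligible); k >= 1; k-- {
		sets = append(sets, combinations(eligible, k)...)
	}
	return sets
}

// spreadRemainder (hypothetical helper) walks the remainder set node by node,
// taking groupSize CPUs at a time, until the whole remainder is accounted for.
func spreadRemainder(set []int, free map[int]int, remainder, groupSize int) (map[int]int, error) {
	taken := map[int]int{}
	for remainder > 0 {
		progressed := false
		for _, node := range set {
			if remainder <= 0 {
				break
			}
			if free[node]-taken[node] < groupSize {
				continue // this node cannot give another group
			}
			taken[node] += groupSize
			remainder -= groupSize
			progressed = true
		}
		if !progressed {
			return nil, fmt.Errorf("not enough CPUs in remainder set, remaining: %d", remainder)
		}
	}
	return taken, nil
}

func main() {
	// Assumed state after the even distribution: only node 3 has CPUs left.
	free := map[int]int{0: 0, 1: 0, 2: 0, 3: 6}
	for _, set := range remainderCandidates([]int{0, 1, 2, 3}, free) {
		taken, err := spreadRemainder(set, free, 2, 1)
		fmt.Println(set, taken, err)
	}
}
```

With the assumed input, only node 3 survives the filter, so it is the only remainder set tried and both remainder CPUs come from it. An exhausted set surfaces as an explicit error from `spreadRemainder` instead of an accounting mismatch at the end of the run.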
Actually the
The values arrays were then much longer than they should have been (front padded with 0s) because we first initialized them to the size of Performing a |
I have all of the code updates added now, but will rework the PR a bit and add more unit tests. Feel free to take a cursory look to begin with. The changes are best reviewed commit-by-commit with a detailed reading of each commit message. |
Before Change:
"test" description="ensure bestRemainder chosen with NUMA nodes that have enough CPUs to satisfy the request"
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[0 1] distribution=8 remainder=2 available=[-1 -1 0 6] balance=2.915
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[0 2] distribution=8 remainder=2 available=[-1 0 -1 6] balance=2.915
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[0 3] distribution=8 remainder=2 available=[5 -1 0 0] balance=2.345
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[1 2] distribution=8 remainder=2 available=[0 -1 -1 6] balance=2.915
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[1 3] distribution=8 remainder=2 available=[0 -1 0 5] balance=2.345
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[2 3] distribution=8 remainder=2 available=[0 0 -1 5] balance=2.345
"bestCombo found" distribution=8 bestCombo=[0 1 2 3] bestRemainder=[0 3]

--- FAIL: TestTakeByTopologyNUMADistributed (0.01s)
    --- FAIL: TestTakeByTopologyNUMADistributed/ensure_bestRemainder_chosen_with_NUMA_nodes_that_have_enough_CPUs_to_satisfy_the_request (0.00s)
        cpu_assignment_test.go:867: unexpected error [accounting error, not enough CPUs allocated, remaining: 1]

After Change:
"test" description="ensure bestRemainder chosen with NUMA nodes that have enough CPUs to satisfy the request"
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[3] distribution=8 remainder=2 available=[0 0 0 4] balance=1.732
"bestCombo found" distribution=8 bestCombo=[0 1 2 3] bestRemainder=[3]

SUCCESS

Signed-off-by: Kevin Klues <kklues@nvidia.com>
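The balance values in this log are numerically consistent with a population standard deviation over the per-node available CPU counts (the review thread also mentions a stddev computation). The sketch below recomputes three of them. The function name `balance` is a hypothetical stand-in, and the formula is inferred from the logged numbers rather than taken from the kubelet source.

```go
package main

import (
	"fmt"
	"math"
)

// balance returns the population standard deviation of the per-node CPU
// counts; a lower value means the remaining CPUs are spread more evenly.
// The formula is an inference from the logged values, not the kubelet code.
func balance(available []int) float64 {
	if len(available) == 0 {
		return 0
	}
	sum := 0
	for _, v := range available {
		sum += v
	}
	mean := float64(sum) / float64(len(available))
	var squares float64
	for _, v := range available {
		d := float64(v) - mean
		squares += d * d
	}
	return math.Sqrt(squares / float64(len(available)))
}

func main() {
	// Availability vectors copied from the log above.
	for _, a := range [][]int{{-1, -1, 0, 6}, {5, -1, 0, 0}, {0, 0, 0, 4}} {
		fmt.Printf("available=%v balance=%.3f\n", a, balance(a))
	}
	// Prints balance=2.915, balance=2.345 and balance=1.732 respectively.
}
```

Read this way, the negative entries in the "before" vectors look like CPUs taken from nodes that had none left. Those vectors can still score a low balance, which would explain how the old code selected a remainder set it could not actually satisfy.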
We witnessed this exact allocation attempt in a live cluster and witnessed the algorithm fail with an accounting error. This test was added to verify that this case is now handled by the updates to the algorithm and that we don't regress from it in the future.

"test" description="ensure previous failure encountered on live machine has been fixed (1/1)"
"combo remainderSet balance" combo=[2 4 6] remainderSet=[2 4 6] distribution=9 remainder=1 available=[14 2 4 4 0 3 4 1] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[2 4] distribution=9 remainder=1 available=[0 3 4 1 14 2 4 4] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[2 6] distribution=9 remainder=1 available=[1 14 2 4 4 0 3 4] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[4 6] distribution=9 remainder=1 available=[1 3 4 0 14 2 4 4] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[2] distribution=9 remainder=1 available=[4 0 3 4 1 14 2 4] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[4] distribution=9 remainder=1 available=[3 4 0 14 2 4 4 1] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[6] distribution=9 remainder=1 available=[1 13 2 4 4 1 3 4] balance=3.606
"bestCombo found" distribution=9 bestCombo=[2 4 6] bestRemainder=[6]

Signed-off-by: Kevin Klues <kklues@nvidia.com>
OK. PR is now ready for review. |
Initial review. Looks good overall, and the series is nicely laid out to be as easy to review as possible.
However, a number of issues fixed here could have been caught in the initial review (of mine), so I want to deep dive into the algorithm and the changes.
I think we aim for 1.23.1 anyway (out of necessity), so timing should not be too pressing a factor in this regard.
Last but not least, a fair amount of utility functions have been added. This is fine, and I don't think it's time yet to generalize them and/or move them outside the cpumanager package, but some focused unit testing of these can improve the code quality/reliability even more, at very little cost (4008ea0 comes to mind).
Thanks @fromanirh. I agree that more unit testing (especially of the utility functions) will be useful here. We have plans to move all of this logic to a subpackage (i.e. |
On one hand, I do agree that it makes sense to add these tests as part of the cleanup and code movement, which I also like and agree with, as prep work to graduate to beta or in general later. Overall, adding more content to this PR, considering again that we aim for 1.23.1, is probably not a good idea, so it is better to defer these additions to the future cleanup PR, backporting fixes and their minimal test coverage if needed. |
update: I expect to complete my review by (end of) December 10, worst case. |
/lgtm |
/hold cancel |
/lgtm Agree with @fromani that the PR is nicely laid out! Thanks for addressing the review comments and adding additional tests. |
…599-upstream-release-1.23 Automated cherry pick of #106599: Fix Bugs in CPUManager distribute NUMA policy option
What type of PR is this?
/kind bug
What this PR does / why we need it:
Fix bugs in CPUManager distribute NUMA policy introduced for 1.23 in #105631
This includes:
I have added / updated a number of unit tests to verify that each of these bugs is indeed fixed. That said, the code could still benefit from more extensive testing. We will make sure to include that before promoting this feature to beta.
Which issue(s) this PR fixes:
Fixes #106571
Does this PR introduce a user-facing change?