Fix bug in TopologyManager with merging hints when NUM_NUMA > 2 #108052

klueska · 2022-02-10T21:13:58Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

Before this fix, hint permutations such as:

	permutation: [{0011 true} {0101 true}]

Could result in merged hints of:

	mergedHint: {0001 true}

This was possible because both hints in the permutation contain a "preferred"
allocation (i.e. the full set of NUMA nodes set in the affinity bitmask are
required to satisfy the allocation). With this in place, the simplified logic
we had simply kept the merged hint as preferred as well.

However, what we really want is to ensure that the merged hint is only
preferred if true alignment of all resources is possible (i.e. if all hints
in the permutation are preferred AND their affinities are exactly equal).

The only exception to this is if no topology information is provided by a
given hint provider. In this case, we assume alignment doesn't matter and only
consider the resources that actually have hints provided for them.

This changes the semantics of permutations of the form:

	permutation: [{0111 true} {0011 true}]

To now result in the merged hint of:

	mergedHint: {0011 false}

Instead of:

	mergedHint: {0011 true}

This is arguably how it should always have been though (because a hint should
not be preferred if true alignment isn't possible), and two tests have had to
change to accommodate these new semantics.

This new algorithm also ensures that the first merged hint from the set of
generated preferred hints below gets chosen (the old algorithm would
have naively chosen the second because it was narrower).

	[{0111 true} {0111 true}] --> {0111 true}
	[{1011 true} {0111 true}] --> {0011 true}

This PR changes the merge function to implement the updated logic, adds a
test to verify it is functioning correctly, and updates the two tests mentioned
above to adjust to the new semantics.

A follow-up PR updates the logic to ensure that the (now larger) set of
non-preferred hints are prioritized in a less naive fashion: #108154

For example, it will ensure that the first hint below is chosen instead the second
(which the existing naive algorithm will choose):

	[{0111 true} {0011 true}] --> {0011 false}
	[{0111 true} {1001 true}] --> {0001 false}

Does this PR introduce a user-facing change?

Fixes bug in TopologyManager for ensuring aligned allocations on machines with more than 2 NUMA nodes

k8s-ci-robot · 2022-02-10T21:14:05Z

@klueska: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

klueska · 2022-02-10T21:14:17Z

/sig node

klueska · 2022-02-10T21:14:30Z

/assign @fromanirh @swatisehgal

k8s-ci-robot · 2022-02-10T21:16:19Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: klueska

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/kubelet/cm/topologymanager/OWNERS~~ [klueska]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Before this fix, hint permutations such as: permutation: [{11 true} {0101 true}] Could result in merged hints of: mergedHint: {01 true} This was possible because both hints in the permutation container a "preferred" allocation (i.e. the full set of NUMA nodes set in the affinity bitmask are *required* to satisfy the allocation). With this in place, the simplified logic we had simply kept the merged hint as preferred as well. However, what we really want is to ensure that the merged hint is only preferred if *true* alignment of all resources is possible (i.e. if all hints in the permutation are preferred AND their affinities are exactly equal). The only exception to this is if *no* topology information is provided by a given hint provider. In this case, we assume alignment doesn't matter and only consider the resources that actually have hints provided for them. This changes the semantics of permutations of the form: permutation: [{111 true} {011 true}] To now result in the merged hint of: mergedHint: {011 false} Instead of: mergedHint: {011 true} This is arguably how it should always have been though (because a hint should not be preferred if true alignment isn't possible), and two tests have had to change to accomodate these new semantics. This commit changes the merge function to implement the updated logic, adds a test to verify it is functioning correctly, and updates the two tests mentioned above to adjust to the new semantics. Signed-off-by: Kevin Klues <kklues@nvidia.com>

klueska · 2022-02-11T08:21:06Z

We may need to follow-up to ensure that "better" hints are now generated in the case of the best-effort strategy (because most cases will resolve to 0001 now if no preferred hints are found).

ffromani · 2022-02-11T09:22:43Z

the only thing is that is not obvious to me is:


This changes the semantics of permutations of the form:

	permutation: [{111 true} {011 true}]

To now result in the merged hint of:

	mergedHint: {011 false}

Instead of:

	mergedHint: {011 true}

I thought that { 0 1 1 true} is a valid merged hint for the permutation {111 true} {011 true} because I was working under the assumption that the most restrictive hint is pretty much always the more correct. While I do realize that this is actually an approximation of the desired behaviour (perhaps a good one, but never the less an approximation), I'm having a hard time imagining a case on which { 0 1 1 true } should not be preferred. Do we have some concrete cases that can help here my understanding?

Besides that, LGTM. I'll have another pass shortly and most likely add the tag.

klueska · 2022-02-11T09:40:58Z

If there are any concerns at all with changing those semantics, then we shouldn't rush things.

To answer your question though, imagine this set of possible permutations:

	[{0111 true} {0111 true}]
	[{1011 true} {0111 true}]

With the old logic, the first will result in a merged hint of:

	{0111 true}

And the second will result in:

	{0011 true}

The one we want is the first one (i.e. {0111 true}) because this gives us perfect alignment across both resources. However, the current algorithm will choose {0011 true} because it appears that the allocation can be satisfied on just 2 NUMA nodes instead of 3.

The reason we prefer {0111 true} over {0011 true} is because both of the original hints were preferred themselves. This means that the NUMA affinity set for them is actually necessary to allocate the full set of resources requested for them. By
choosing {0011 true} instead, only partial alignment will be guaranteed (when we actually could have done better).

ffromani · 2022-02-11T09:48:00Z

If there are any concerns at all with changing those semantics, then we shouldn't rush things.

To answer your question though, imagine this set of possible permutations:
	[{0111 true} {0111 true}]
	[{1011 true} {0111 true}]
With the old logic, the first will result in a merged hint of:
	{0111 true}
And the second will result in:
	{0011 true}
The one we want is the first one (i.e. {0111 true}) because this gives us perfect alignment across both resources. However, the current algorithm will choose {0011 true} because it appears that the allocation can be satisfied on just 2 NUMA nodes instead of 3.

The reason we prefer {0111 true} over {0011 true} is because both of the original hints were preferred themselves. This means that the NUMA affinity set for them is actually necessary to allocate the full set of resources requested for them. By choosing {0011 true} instead, only partial alignment will be guaranteed (when we actually could have done better).

Thanks, this helps. Your example makes a lot of sense. I was thinking about more general cases and I'm realizing more and more I was a bit blindsided by the short-circuit of the approximation I mentioned above. Your change fixes a clear bug and the end behaviour, albeit perhaps a bit less intuitive, looks indeed more correct.

LGTM, but putting hold to let other reviewers chime in. Feel free to remove when you see fit.

ffromani · 2022-02-11T09:48:45Z

/lgtm
/hold

klueska · 2022-02-11T09:55:35Z

The other option (to keep the original semantics) would be to not check for equality of all hints in the permutation to decide if the merged hint should be preferred. And instead check that all hints in the permutation can be "anded" with the hint that has the least number of bits set (and have the result still equal that 'minimal' affinity).

The implications of this though, are that there is no way to know for sure that perfect alignment was satisfied (which is what the restricted policy is supposed to guarantee).

klueska · 2022-02-11T10:10:38Z

I think the new semantics are the more "correct" way of interpreting the merged hints, but (as I mentioned before) it means that we will almost always end up with a {0001 false} hint as our bestHint if full alignment can't be satisfied by the best-effort policy.

While technically correct (because it's best-effort), we could do better if more information were encoded into the set of unpreferred merged hints (i.e. indicating that a given unpreferred hint was more favored over another).

For example:

	[{0111 true} {0011 true}] --> {0011 false}
	[{0111 true} {1001 true}] --> {0001 false}

We would actually rather have the first, but the second will be chosen.

swatisehgal · 2022-02-11T11:16:53Z

I was having a hard time breaking the mental model too given that we always preferred a more narrower selection but based on the discussion above, I think it makes sense as we are optimizing the selection of merged hint to identify the most appropriate hint in order to optimize the NUMA alignment of resources. Thanks for the detailed explanation @klueska.

I think the explanation in the comment section (#108052 (comment) and #108052 (comment)) captures the rationale behind this change much better so would be better to capture that in the PR description as well as the commit message.

Even though this is a Topology Manager internal semantic change, I am conscious that Topology Manager is a Beta feature and a changelog entry might not be sufficient here.

I sense that this change could potentially cause confusion when it comes to the expected behavior as the way hints are merged ultimately influences the NUMA node selection. It will invalidate the content in Topology Manager blog post which I assume is a reference point for anyone who is looking to understand the internal workings of Topology Manager. Even though there is a note that indicates that the article is outdated we should do our due diligence to communicate this change to the rest of the SIG Node community. Sending an email to SIG Node mailing list will be a good starting point.

/lgtm

klueska · 2022-02-11T12:20:21Z

I'm happy to send an email update, but this truly is a bug fix (so anyone relying on the old semantics for some reason, really shouldn't have been). In terms of rejecting or admitting pods, this semantic change only affects what pods will now be allowed in by the restricted policy -- and the ones it was letting in before this change probably shouldn't have been. The intent of restricted was always to only allow pods in if full alignment could be satisfied (it just might be that full alignment may pull resources from more than 1 NUMA node).

The only "incorrect" info in the blog now is the details of the Merge algorithm, which has been modified by this PR to check for equality across all hints to determine if the merged hint should be interpreted as preferred or not.
https://kubernetes.io/blog/2020/04/01/kubernetes-1-18-feature-topoloy-manager-beta/#policy-merge

ffromani · 2022-02-11T12:28:40Z

/hold cancel
It seems we all reached consensus and the discussion was helpful. I feel confident to remove the hold.

swatisehgal · 2022-02-11T13:03:41Z

I'm happy to send an email update, but this truly is a bug fix (so anyone relying on the old semantics for some reason, really shouldn't have been). In terms of rejecting or admitting pods, this semantic change only affects what pods will now be allowed in by the restricted policy -- and the ones it was letting in before this change probably shouldn't have been. The intent of restricted was always to only allow pods in if full alignment could be satisfied (it just might be that full alignment may pull resources from more than 1 NUMA node).

The only "incorrect" info in the blog now is the details of the Merge algorithm, which has been modified by this PR to check for equality across all hints to determine if the merged hint should be interpreted as preferred or not. https://kubernetes.io/blog/2020/04/01/kubernetes-1-18-feature-topoloy-manager-beta/#policy-merge

Thanks @klueska. Totally understand and agree that this fixes a bug, just trying to be proactive here.

klueska · 2022-02-11T16:30:28Z

I've decided to hold off on cherry-picking these until we can see if there is a good way to do what I mention in #108052 (comment) about favoring one unpreferred allocation over another.

klueska · 2022-02-16T10:15:46Z

I've updated the description to the PR as suggested and will send out a summary email of the changes once we merge the follow-up PR here: #108154

k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Feb 10, 2022

k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 10, 2022

k8s-ci-robot assigned ffromani and swatisehgal Feb 10, 2022

k8s-ci-robot requested review from dchen1107 and yujuhong February 10, 2022 21:14

k8s-ci-robot added the area/kubelet label Feb 10, 2022

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 10, 2022

klueska force-pushed the fix-topology-manager branch from 3e35197 to 155562d Compare February 10, 2022 22:10

k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Feb 11, 2022

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 11, 2022

k8s-ci-robot merged commit e24b533 into kubernetes:master Feb 11, 2022

k8s-ci-robot added this to the v1.24 milestone Feb 11, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix bug in TopologyManager with merging hints when NUM_NUMA > 2 #108052

Fix bug in TopologyManager with merging hints when NUM_NUMA > 2 #108052

klueska commented Feb 10, 2022 •

edited

k8s-ci-robot commented Feb 10, 2022

klueska commented Feb 10, 2022

klueska commented Feb 10, 2022

k8s-ci-robot commented Feb 10, 2022

klueska commented Feb 11, 2022

ffromani commented Feb 11, 2022 •

edited

klueska commented Feb 11, 2022 •

edited

ffromani commented Feb 11, 2022

ffromani commented Feb 11, 2022

klueska commented Feb 11, 2022 •

edited

klueska commented Feb 11, 2022

swatisehgal commented Feb 11, 2022 •

edited

klueska commented Feb 11, 2022 •

edited

ffromani commented Feb 11, 2022

swatisehgal commented Feb 11, 2022

klueska commented Feb 11, 2022

klueska commented Feb 16, 2022

Fix bug in TopologyManager with merging hints when NUM_NUMA > 2 #108052

Fix bug in TopologyManager with merging hints when NUM_NUMA > 2 #108052

Conversation

klueska commented Feb 10, 2022 • edited

What type of PR is this?

What this PR does / why we need it:

Does this PR introduce a user-facing change?

k8s-ci-robot commented Feb 10, 2022

klueska commented Feb 10, 2022

klueska commented Feb 10, 2022

k8s-ci-robot commented Feb 10, 2022

klueska commented Feb 11, 2022

ffromani commented Feb 11, 2022 • edited

klueska commented Feb 11, 2022 • edited

ffromani commented Feb 11, 2022

ffromani commented Feb 11, 2022

klueska commented Feb 11, 2022 • edited

klueska commented Feb 11, 2022

swatisehgal commented Feb 11, 2022 • edited

klueska commented Feb 11, 2022 • edited

ffromani commented Feb 11, 2022

swatisehgal commented Feb 11, 2022

klueska commented Feb 11, 2022

klueska commented Feb 16, 2022

klueska commented Feb 10, 2022 •

edited

ffromani commented Feb 11, 2022 •

edited

klueska commented Feb 11, 2022 •

edited

klueska commented Feb 11, 2022 •

edited

swatisehgal commented Feb 11, 2022 •

edited

klueska commented Feb 11, 2022 •

edited