Update TopologyManager algorithm for selecting "best" non-preferred hint #108154

klueska · 2022-02-16T09:45:15Z

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

For the 'single-numa' and 'restricted' TopologyManager policies, pods are only
admitted if all of their containers have perfect alignment across the set of
resources they are requesting. The best-effort policy, on the other hand,
will prefer allocations that have perfect alignment, but fall back to a
non-preferred alignment if perfect alignment can't be achieved.

The existing algorithm of how to choose the best hint from the set of
"non-preferred" hints is fairly naive and often results in choosing a
sub-optimal hint. It works fine in cases where all resources would end up
coming from a single NUMA node (even if it's not the same NUMA node), but
breaks down as soon as multiple NUMA nodes are required for the "best"
alignment. We will never be able to achieve perfect alignment with these
non-preferred hints, but we should try and do something more intelligent than
simply choosing the hint with the narrowest mask.

In an ideal world, we would have the TopologyManager return a set of
"resource-relative" hints (as opposed to a common hint for all resources as is
done today). Each resource-relative hint would indicate how many other
resources could be aligned to it on a given NUMA node, and a hint provider
would use this information to allocate its resources in the most aligned way
possible. There are likely some edge cases to consider here, but such an
algorithm would allow us to do partial-perfect-alignment of "some" resources,
even if all resources could not be perfectly aligned.

Unfortunately, supporting something like this would require a major redesign to
how the TopologyManager interacts with its hint providers (as well as how
those hint providers make decisions based on the hints they get back).

That said, we can still do better than the naive algorithm we have today, and
this patch provides a mechanism to do so.

We start by looking at the set of hints passed into the TopologyManager for
each resource and generate a list of the minimum number of NUMA nodes
required to satisfy an allocation for a given resource. In other words, each entry
in this list contains the minNUMAAffinity.Count() for a given resource.

Once we have this list, we find the maximum minNUMAAffinity.Count() from
the list and mark that as the bestNonPreferredAffinityCount that we would like
to have associated with whatever "bestHint" we ultimately generate. The intuition
is that we would like to (at the very least) get alignment for those resources
that require multiple NUMA nodes to satisfy their allocation. If we can't
quite get there, then we should try to come as close to it as possible.
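A minimal Go sketch of this step, for illustration only. It models each hint's affinity as a plain `uint64` bitmask instead of the kubelet's own hint and bitmask types, and the names `minAffinityCount` and `maxOfMinAffinityCounts` are hypothetical. Skipping resources that supply no hints is an assumption of the sketch.

```go
package main

import (
	"fmt"
	"math/bits"
)

// minAffinityCount returns the smallest number of NUMA nodes covered by any
// of one resource's hint masks. The bool is false when there are no hints.
func minAffinityCount(masks []uint64) (int, bool) {
	if len(masks) == 0 {
		return 0, false
	}
	lowest := bits.OnesCount64(masks[0])
	for _, m := range masks[1:] {
		if c := bits.OnesCount64(m); c < lowest {
			lowest = c
		}
	}
	return lowest, true
}

// maxOfMinAffinityCounts returns the maximum of the per-resource minimums,
// i.e. the value the description calls bestNonPreferredAffinityCount.
// Assumption: resources without hints do not contribute.
func maxOfMinAffinityCounts(perResource [][]uint64) int {
	highest := 0
	for _, masks := range perResource {
		if c, ok := minAffinityCount(masks); ok && c > highest {
			highest = c
		}
	}
	return highest
}

func main() {
	resourceA := []uint64{0b00000010, 0b00001010} // fits on one NUMA node
	resourceB := []uint64{0b00001010, 0b00100010} // needs at least two
	fmt.Println(maxOfMinAffinityCounts([][]uint64{resourceA, resourceB})) // prints 2
}
```

The resource that can fit on a single node contributes a minimum of 1, the other contributes 2, so the target count is driven by the more constrained resource.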

For example, consider a machine where we have 8 NUMA nodes with 32
CPUs per NUMA node and 2 GPUs attached only to each odd numbered
NUMA node (e.g. the DGX-A100 server provided by NVIDIA).

| Socket | NUMA Node | CPU(s) | GPU(s) |
|--------|-----------|--------|--------|
| 0 | 0 | 0-15,128-143 | |
| 0 | 1 | 16-31,144-159 | /dev/nvidia2, /dev/nvidia3 |
| 0 | 2 | 32-47,160-175 | |
| 0 | 3 | 48-63,176-191 | /dev/nvidia0, /dev/nvidia1 |
| 1 | 4 | 64-79,192-207 | |
| 1 | 5 | 80-95,208-223 | /dev/nvidia6, /dev/nvidia7 |
| 1 | 6 | 96-111,224-239 | |
| 1 | 7 | 112-127,240-255 | /dev/nvidia4, /dev/nvidia5 |

Assuming a machine with no resources allocated yet, if a user were to request 32
CPUs (which can fit on one NUMA node) and 4 GPUs (which requires at least 2
NUMA nodes), then you would expect one of the following affinity masks to win out
as the "best" affinity mask since it encodes the alignment required for all 4 GPUs
to be allocated (even though the 32 CPUs only require a single NUMA node):

Bits:  7 6 5 4 3 2 1 0
      {0 0 0 0 1 0 1 0}
      {0 0 1 0 0 0 1 0}
      {1 0 0 0 0 0 1 0}
      {0 0 1 0 1 0 0 0}
      {1 0 0 0 1 0 0 0}
      {1 0 1 0 0 0 0 0}

However, with the existing algorithm, none of these affinity masks will be
considered, and a naive result of {00000010} will be returned since that is the
"narrowest" alignment that the allocation of all CPUs with some subset of GPUs
can be satisfied with. In effect, the old algorithm lets the least constrained
resource influence the hint generation most heavily, when what we actually want
is for the most constrained resource to do so.
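As a sanity check on the example, the six two-node masks listed above can be generated mechanically. The following Go sketch is illustrative only: `gpuNodes` and `masksWithin` are hypothetical names, and the 8-bit mask is a stand-in for the kubelet's bitmask type.

```go
package main

import (
	"fmt"
	"math/bits"
)

// gpuNodes marks the NUMA nodes that have GPUs attached in the example
// machine: nodes 1, 3, 5 and 7.
const gpuNodes uint8 = 0b10101010

// masksWithin returns every mask that sets exactly count bits, all of them
// inside allowed, in ascending numeric order.
func masksWithin(allowed uint8, count int) []uint8 {
	var out []uint8
	for m := 1; m <= 0xFF; m++ {
		mask := uint8(m)
		if mask&^allowed == 0 && bits.OnesCount8(mask) == count {
			out = append(out, mask)
		}
	}
	return out
}

func main() {
	// 4 GPUs at 2 GPUs per node need exactly two GPU-bearing NUMA nodes.
	for _, m := range masksWithin(gpuNodes, 2) {
		fmt.Printf("%08b\n", m)
	}
}
```

This yields the same six masks as the list above, though in ascending numeric order instead of the order shown there.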

To achieve this, the new algorithm proceeds as follows once
we have calculated the bestNonPreferredAffinityCount as described above:

If the mergedHint and bestHint are both non-preferred, then try and find a hint
whose affinity count is as close to (but not higher than) the
bestNonPreferredAffinityCount as possible. To do this we need to consider the
following cases and react accordingly:

  1. bestHint.NUMANodeAffinity.Count() >  bestNonPreferredAffinityCount
  2. bestHint.NUMANodeAffinity.Count() == bestNonPreferredAffinityCount
  3. bestHint.NUMANodeAffinity.Count() <  bestNonPreferredAffinityCount

For case (1), the current bestHint is larger than the
bestNonPreferredAffinityCount, so updating to any narrower mergedHint is
preferred over staying where we are.

For case (2), the current bestHint is equal to the
bestNonPreferredAffinityCount, so we would like to stick with what we have
unless the current mergedHint is also equal to
bestNonPreferredAffinityCount and it is narrower.

For case (3), the current bestHint is less than
bestNonPreferredAffinityCount, so we would like to creep back up to
bestNonPreferredAffinityCount as close as we can. There are three cases to
consider here:

  3a. mergedHint.NUMANodeAffinity.Count() >  bestNonPreferredAffinityCount
  3b. mergedHint.NUMANodeAffinity.Count() == bestNonPreferredAffinityCount
  3c. mergedHint.NUMANodeAffinity.Count() <  bestNonPreferredAffinityCount

For case (3a), we just want to stick with the current bestHint because
choosing a new hint that is greater than bestNonPreferredAffinityCount would
be counter-productive.

For case (3b), we want to immediately update bestHint to the current
mergedHint, making it now equal to bestNonPreferredAffinityCount.

For case (3c), we know that both the current bestHint and the current
mergedHint are less than bestNonPreferredAffinityCount, so we want to
choose one that brings us back up as close to bestNonPreferredAffinityCount
as possible. There are three cases to consider here:

  3ca. mergedHint.NUMANodeAffinity.Count() >  bestHint.NUMANodeAffinity.Count()
  3cb. mergedHint.NUMANodeAffinity.Count() <  bestHint.NUMANodeAffinity.Count()
  3cc. mergedHint.NUMANodeAffinity.Count() == bestHint.NUMANodeAffinity.Count()

For case (3ca), we want to immediately update bestHint to mergedHint
because that will bring us closer to the (higher) value of
bestNonPreferredAffinityCount.

For case (3cb), we want to stick with the current bestHint because choosing
the current mergedHint would strictly move us further away from the
bestNonPreferredAffinityCount.

Finally, for case (3cc), we know that the current bestHint and the current
mergedHint are equal, so we simply choose the narrower of the 2.
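The case analysis above can be condensed into a single comparison function. The Go sketch below is an illustration under stated assumptions and not the code in this PR: masks are plain `uint64` values, `pickNonPreferred` and `isNarrower` are hypothetical names, and "narrower" is assumed to mean fewer bits set, with ties broken in favor of the numerically smaller mask.

```go
package main

import (
	"fmt"
	"math/bits"
)

// isNarrower reports whether mask a is narrower than mask b.
// Assumption: fewer bits wins; on a tie the numerically smaller mask wins.
func isNarrower(a, b uint64) bool {
	ca, cb := bits.OnesCount64(a), bits.OnesCount64(b)
	if ca != cb {
		return ca < cb
	}
	return a < b
}

// pickNonPreferred returns the mask to keep as bestHint, given the current
// best, a newly merged candidate, and the target affinity count.
func pickNonPreferred(best, candidate uint64, target int) uint64 {
	bc, cc := bits.OnesCount64(best), bits.OnesCount64(candidate)
	switch {
	case bc > target: // case 1: any narrower candidate is an improvement
		if isNarrower(candidate, best) {
			return candidate
		}
		return best
	case bc == target: // case 2: only a narrower candidate at the target wins
		if cc == target && isNarrower(candidate, best) {
			return candidate
		}
		return best
	}
	// case 3: best is below the target
	switch {
	case cc > target: // 3a: overshooting the target is counter-productive
		return best
	case cc == target: // 3b: candidate hits the target exactly
		return candidate
	case cc > bc: // 3ca: candidate is closer to the target
		return candidate
	case cc < bc: // 3cb: candidate is further from the target
		return best
	}
	// 3cc: equal counts, keep the narrower mask
	if isNarrower(candidate, best) {
		return candidate
	}
	return best
}

func main() {
	best := uint64(0b00000010)
	for _, c := range []uint64{0b00001010, 0b00100010} {
		best = pickNonPreferred(best, c, 2)
	}
	fmt.Printf("%08b\n", best) // prints 00001010
}
```

Starting from the single-node mask, the first two-node candidate is adopted under case (3b); the second is rejected under case (2) because it is not narrower.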

This PR implements this algorithm for the case where we must choose from a set
of non-preferred hints and provides a set of unit-tests to verify its
correctness.

Does this PR introduce a user-facing change?

Improved algorithm for selecting "best" non-preferred hint in the TopologyManager

k8s-ci-robot · 2022-02-16T09:46:22Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: klueska

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/kubelet/cm/topologymanager/OWNERS~~ [klueska]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

klueska · 2022-02-16T11:37:34Z

/sig node
/assign @fromanirh @swatisehgal

swatisehgal · 2022-02-16T12:40:06Z

/triage accepted
/priority important-longterm

k8s-triage-robot · 2022-02-16T14:47:06Z

Unknown CLA label state. Rechecking for CLA labels.

Send feedback to sig-contributor-experience at kubernetes/community.

/check-cla
/easycla

swatisehgal · 2022-02-17T10:11:17Z

This PR significantly improves the selection process of the best hints in case of non-preferred hints when multiple NUMA nodes are required for the "best" topology alignment. We now have a way more intelligent selection process rather than selecting the narrowest topology hint!

I appreciate the detailed explanation both in the PR description and as comments in the code which helped a lot in the review process.

The PR looks good to me and I am happy to give it an LGTM but while reviewing this, I couldn't stop myself from thinking that if we maintained a list of mergedHints sorted by the corresponding NUMANodeAffinity.Count() we would have made the besthint evaluation logic even more streamlined. Maintaining a sorted list naturally gives us the narrowestHint on top and for the non-preferred hints we essentially have to run a binary search to identify a hint with bestNonPreferredAffinityCount. We might need two lists, one to capture all preferred hints and another one with non-preferred ones. WDYT?

klueska · 2022-02-17T10:49:40Z

The current logic never builds a list of mergedHints at all (such a list could get very large depending on how many different permutations you have to walk through). It also never generates a full list of permutations -- it just iterates through each permutation, calculating the current mergedHint and tracking the current bestHint. I'm worried moving to a model that generates (and stores) full lists for these may cause other unforeseen performance problems.
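To make the memory argument concrete, here is a rough Go sketch of the streaming approach described in this comment. It is not the kubelet implementation: `forEachPermutation`, `merge`, and `bestMerged` are hypothetical names, masks are plain `uint64` values, the comparison rule is passed in as a placeholder, and skipping merged masks with no common NUMA node is an assumption.

```go
package main

import (
	"fmt"
	"math/bits"
)

// forEachPermutation calls visit once for every way of picking one mask from
// each provider's list. Only the current permutation is held in memory.
func forEachPermutation(hints [][]uint64, visit func(perm []uint64)) {
	perm := make([]uint64, len(hints))
	var walk func(i int)
	walk = func(i int) {
		if i == len(hints) {
			visit(perm)
			return
		}
		for _, h := range hints[i] {
			perm[i] = h
			walk(i + 1)
		}
	}
	walk(0)
}

// merge ANDs the masks of one permutation, starting from the all-nodes mask.
func merge(perm []uint64, all uint64) uint64 {
	merged := all
	for _, m := range perm {
		merged &= m
	}
	return merged
}

// bestMerged tracks a single running best while streaming over permutations.
func bestMerged(hints [][]uint64, all uint64, better func(candidate, best uint64) bool) (best uint64, visited int) {
	found := false
	forEachPermutation(hints, func(perm []uint64) {
		visited++
		merged := merge(perm, all)
		if merged == 0 { // assumption: no common NUMA node, skip
			return
		}
		if !found || better(merged, best) {
			best, found = merged, true
		}
	})
	return best, visited
}

func main() {
	hints := [][]uint64{
		{0b01, 0b10, 0b11}, // hypothetical provider 1
		{0b10, 0b11},       // hypothetical provider 2
	}
	// Placeholder rule only; the selection logic is the subject of this PR.
	fewerBits := func(candidate, best uint64) bool {
		return bits.OnesCount64(candidate) < bits.OnesCount64(best)
	}
	best, visited := bestMerged(hints, 0b11, fewerBits)
	fmt.Printf("visited %d permutations, kept %02b\n", visited, best)
}
```

The working set is one permutation slice plus one running best, regardless of how many permutations exist; a stored, sorted list would instead grow with the product of the per-provider hint counts.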

swatisehgal · 2022-02-17T11:41:35Z

> The current logic never builds a list of mergedHints at all (such a list could get very large depending on how many different permutations you have to walk through). It also never generates a full list of permutations -- it just iterates through each permutation, calculating the current mergedHint and tracking the current bestHint. I'm worried moving to a model that generates (and stores) full lists for these may cause other unforeseen performance problems.

Yeah, I was suggesting that we move to storing the merged hint (evaluated by performing the bitwise AND of a cross product entry) rather than just evaluating by iterating over them.
Hmm, I understand, you do have a valid concern as in order to store mergedhints we need to maintain a sorted list with size equal to product of number of hints from each hintprovider, which (like you said) could get very large and cause performance issues.

I am happy with the PR, going to put on hold so that other reviewers can provide their input.

/lgtm
/hold

klueska · 2022-02-17T12:38:20Z

One thing I could do to maybe help write more comprehensive unit tests is to factor out the new logic into a standalone function and then test explicit inputs to that function. I struggled to write comprehensive tests because of the way the bestNonPreferredAffinityCount is calculated across the different resource types.

pacoxu · 2022-03-01T06:48:47Z

/lgtm
feel free to /unhold

klueska · 2022-03-01T14:42:24Z

@swatisehgal , @fromanirh , @pacoxu
More extensive unit tests now added.

ffromani · 2022-03-01T14:53:16Z

> @swatisehgal , @fromanirh , @pacoxu More extensive unit tests now added.

thanks Kevin, will review ASAP
EDIT: ETA March 2 morning.

swatisehgal

Just noticed a duplicate test but other than that looks good. Please remove that and I will add an LGTM.

pkg/kubelet/cm/topologymanager/policy_test.go

Signed-off-by: Kevin Klues <kklues@nvidia.com>

swatisehgal · 2022-03-01T17:36:36Z

Thanks @klueska for adding comprehensive tests. Your effort is greatly appreciated!
/lgtm

swatisehgal · 2022-03-01T18:27:21Z

/test pull-kubernetes-e2e-kind-ipv6

pkg/kubelet/cm/topologymanager/policy.go

ffromani · 2022-03-02T08:25:17Z

pkg/kubelet/cm/topologymanager/policy.go

+	// Finally, for case (3cc), we know that the current bestHint and the
+	// candidate hint are equal, so we simply choose the narrower of the 2.
+
+	// Case 1


nit, but still: the above explanation was just great, so I can't help but wonder if it would have been even better to intermix it with the actual code (kinda literate-ish programming style) instead of having first a pretty long (and very informative) explanation and the chunk of code afterwards.

I actually had it split across them all originally, and I found it a bit harder to follow (even as the author of it). By putting it up front you get the chance to read through it all without being interrupted by the code in between. Let me see if I can come up with some middle ground that still flows well.

I've been wondering myself, and I don't want to slow down this PR needlessly, so up to you!

pkg/kubelet/cm/topologymanager/policy_test.go

ffromani

This is a great PR and it is worth merging for the great commit message and the code cleanup alone. I've added a bunch of minor comments, but they are suggestions to improve even further rather than requests, and they by no means require a re-upload.

The only real question I have is the following.
There is the main commit message, which goes to great lengths in explaining the rationale for the change, and the description is indeed great. If we grok the basic premise, the rest of the code is very clear and follows very smoothly.

The devil is in the premise itself, I mean specifically here:

> We start by looking at the set of hints passed into the TopologyManager for
> each resource and generate a list of the minimum number of NUMA nodes required
> to satisfy an allocation for a given resource. Each entry in this list then
> contains the minNUMAAffinity.Count() for a given resource. Once we have this
> list, we find the maximum minNUMAAffinity.Count() from the list and mark
> that as the bestNonPreferredAffinityCount that we would like to have
> associated with whatever "bestHint" we ultimately generate. The intuition being
> that we would like to (at the very least) get alignment for those resources
> that require multiple NUMA nodes to satisfy their allocation. If we can't
> quite get there, then we should try to come as close to it as possible.

(emphasis added)
The only problem here is being on the same page about the intuition. From my own past experience, I can think of a few examples, but I'm not sure I'm seeing the very same picture you're referring to.

An example would be great to make sure we immediately get the same intuition and to make the otherwise flawless explanation really complete.
No need to re-upload, a GH comment is fine here.

ffromani · 2022-03-02T09:54:19Z

/lgtm
for the reasons in #108154 (review)
feel free to unhold anytime

klueska · 2022-03-02T10:05:32Z

Thanks for the reviews everyone. Given the only comments are minor nits, I will leave the code as is to avoid requiring one more round of reviews. I will add the example @fromanirh requested to the PR description though. Thanks again.

/unhold

klueska · 2022-03-02T11:01:18Z

Description updated.

ffromani · 2022-03-02T11:03:33Z

> Description updated.

very helpful. Thanks!

ffromani · 2022-03-02T11:13:12Z

/test pull-kubernetes-e2e-gce-ubuntu-containerd
all failures seem to be unrelated

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 16, 2022

k8s-ci-robot requested review from derekwaynecarr and pacoxu February 16, 2022 09:46

klueska mentioned this pull request Feb 16, 2022

Fix bug in TopologyManager with merging hints when NUM_NUMA > 2 #108052

Merged

k8s-ci-robot assigned ffromani and swatisehgal Feb 16, 2022

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 16, 2022

SergeyKanzhelev added this to Triage in SIG Node PR Triage Feb 16, 2022

k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Feb 17, 2022

klueska force-pushed the fix-topology-manager branch from 886a9b3 to a226cdd Compare February 28, 2022 20:37

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 28, 2022

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 1, 2022

klueska force-pushed the fix-topology-manager branch from a226cdd to e7160eb Compare March 1, 2022 14:41

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 1, 2022

klueska force-pushed the fix-topology-manager branch from e7160eb to 4ac43b0 Compare March 1, 2022 14:45

swatisehgal reviewed Mar 1, 2022

pkg/kubelet/cm/topologymanager/policy_test.go (outdated)

Add extensive unit testing for TopologyManager hint generation algorithm

e370b73

Signed-off-by: Kevin Klues <kklues@nvidia.com>

klueska force-pushed the fix-topology-manager branch from 4ac43b0 to e370b73 Compare March 1, 2022 17:30

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 1, 2022

ffromani reviewed Mar 2, 2022

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 2, 2022

k8s-ci-robot merged commit 422001d into kubernetes:master Mar 2, 2022

SIG Node PR Triage automation moved this from Needs Approver to Done Mar 2, 2022

k8s-ci-robot added this to the v1.24 milestone Mar 2, 2022

github-actions bot mentioned this pull request Mar 22, 2022

Week Ending March 6, 2022 dev-obs/actus#394

Open
