Update GetTopologyHints() for TopologyManager Hint Providers to return a map #80569

Merged

Conversation

Contributor

@klueska klueska commented Jul 25, 2019

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change

/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:

    At present, there is no way for a hint provider to return distinct hints
    for different resource types via a call to GetTopologyHints(). This
    means that hint providers that govern multiple resource types (e.g. the
    devicemanager) must do some sort of "pre-merge" on the hints they
    generate for each resource type before passing them back to the
    TopologyManager.

    This seems counter-intuitive, since there is no practical reason that a
    "pre-merge" should be necessary -- it just happens to be necessary
    because of the way the current interface is designed.

    It would be better to allow a hint provider to pass back raw hints for
    each resource type, and allow the TopologyManager to merge them using
    a single unified strategy.

    This patch makes a simple change to the GetTopologyHints() interface
    to allow this to occur.

    Moreover, this change allows the TopologyManager to recognize which
    resource type a set of hints originated from, should this information
    become useful in the future.
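
For illustration, a minimal sketch of the interface shape described above. It assumes a plain `uint64` stand-in for `SocketMask`, omits the pod and container arguments, and uses a hypothetical `fakeDeviceManager` with made-up resource names; none of this is the actual kubelet code.

```go
package main

import "fmt"

// SocketMask stands in for the kubelet's socket bitmask type. A plain
// uint64 is an assumption made to keep this sketch self-contained.
type SocketMask uint64

// TopologyHint carries the two fields discussed in this PR.
type TopologyHint struct {
	SocketAffinity SocketMask
	Preferred      bool
}

// HintProvider sketches the new return shape: raw hints keyed by resource
// name. The pod and container arguments are omitted for brevity.
type HintProvider interface {
	GetTopologyHints() map[string][]TopologyHint
}

// fakeDeviceManager is a hypothetical provider governing two resource types.
type fakeDeviceManager struct{}

func (fakeDeviceManager) GetTopologyHints() map[string][]TopologyHint {
	return map[string][]TopologyHint{
		"example.com/gpu": {
			{SocketAffinity: 1, Preferred: true},  // socket 0 only
			{SocketAffinity: 3, Preferred: false}, // sockets 0 and 1
		},
		"example.com/nic": {
			{SocketAffinity: 1, Preferred: true}, // socket 0 only
		},
	}
}

func main() {
	var provider HintProvider = fakeDeviceManager{}
	// No pre-merge: each resource type keeps its own hint list.
	for resource, hints := range provider.GetTopologyHints() {
		fmt.Printf("%s: %d hint(s)\n", resource, len(hints))
	}
}
```

Keying by resource name is what lets the TopologyManager apply one merge strategy to every list, regardless of which provider produced it.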

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

This implements the change proposed in kubernetes/enhancements#1131

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 25, 2019
@klueska klueska force-pushed the upstream-get-topology-hints-map branch from c06dcff to 24270d9 Compare July 25, 2019 09:01
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 25, 2019
@klueska klueska force-pushed the upstream-get-topology-hints-map branch 3 times, most recently from efed402 to 8ad06c6 Compare July 25, 2019 09:24
@fejta-bot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

Contributor

@mattjmcnaughton mattjmcnaughton left a comment

Overall, this lgtm. I agree with your rationale and don't see any downsides.

I'm going to wait to officially mark "lgtm" until the dependent diffs are merged. Could I trouble you to ping me when they are?

In addition, looks like there are some small gofmt/golint errors to resolve when you get a sec.

@klueska klueska force-pushed the upstream-get-topology-hints-map branch from 8ad06c6 to 605284d Compare July 26, 2019 10:13
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 26, 2019
@klueska klueska force-pushed the upstream-get-topology-hints-map branch 2 times, most recently from 5bf09d8 to 878df97 Compare July 26, 2019 11:00
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 26, 2019
continue
}

if len(hints[resource]) == 0 {
Contributor

In what case would we end up with no hints for a resource? Wouldn't that be blocked at the scheduler level?

Contributor Author

The semantics we agreed to enforce are:

  1. nil == don't care --> (1,1: true)
  2. {} == care but has no alignment --> (1,1: false)
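
A minimal sketch of these two rules as they would apply to one resource's hint list on the TopologyManager side. The `normalize` helper and the `uint64` stand-in for `SocketMask` are assumptions for illustration, and the mapping shown is the best-effort interpretation only.

```go
package main

import "fmt"

// SocketMask is a uint64 stand-in for the kubelet's bitmask type (assumption).
type SocketMask uint64

type TopologyHint struct {
	SocketAffinity SocketMask
	Preferred      bool
}

// normalize is a hypothetical helper that applies the two rules above to the
// hints returned for a single resource:
//   nil   -> don't care:                one preferred any-socket hint
//   empty -> care but has no alignment: one non-preferred any-socket hint
// Any other list is passed through unchanged.
func normalize(hints []TopologyHint, anySocket SocketMask) []TopologyHint {
	if hints == nil {
		return []TopologyHint{{SocketAffinity: anySocket, Preferred: true}}
	}
	if len(hints) == 0 {
		return []TopologyHint{{SocketAffinity: anySocket, Preferred: false}}
	}
	return hints
}

func main() {
	// Two sockets with both bits set, i.e. the "(1,1)" mask.
	const anySocket SocketMask = 3
	fmt.Println(normalize(nil, anySocket))
	fmt.Println(normalize([]TopologyHint{}, anySocket))
	fmt.Println(normalize([]TopologyHint{{SocketAffinity: 1, Preferred: true}}, anySocket))
}
```

The distinction relies on Go treating a nil slice and an empty non-nil slice differently under `== nil`, even though both have length zero.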

Contributor Author

@klueska klueska Jul 28, 2019

I don't want to hardcode the (1,1: true) or the (1,1: false) inside the hintprovider, because different policies might want to encode the "don't care" and "care but has no alignment" differently.

The current semantics are admittedly "best-effort" specific and I'd imagine they would be pushed down into the logic of your Merge() abstraction.

Contributor

Agreed, this logic should be pushed down into the merge abstraction.
I am investigating the best way to do that. For "nil == don't care" we can just remove the resource from allProviderHints (a better name would be allResourcesHints). For "{} == care but has no alignment", I don't understand in what case we would have a resource with empty hints. Can you elaborate on this?

Contributor

It sounds to me that the hints shouldn't be nil if the plugin cares about Topology as there has to be some hint even if it is not preferred.

A nil hint would indicate to me that there were not enough resources to calculate the hint and there should have been an error.

continue
}

allProviderHints = append(allProviderHints, hints[resource])
Contributor

So you are iterating over all the resources just to convert them from a map[string][]TopologyHint to a [][]TopologyHint. Why not just keep it as a map[string][]TopologyHint? In my merge abstraction I was expecting each permutation to be a map[string]TopologyHint. I was thinking it might be useful in the future to know which resource a hint belongs to. One case for that is GPU direct: I would be able to write a policy that aligns the GPU PCI address with the NIC PCI address, both on the same NUMA node as the cpus.

Contributor Author

I am fine keeping it as a map if that makes your Merge() abstraction better. I was just trying to change the code from its current state as little as possible.

Contributor

Is this due to time constraints for the release? I am not sure when Kubernetes 1.16 is released, and I doubt the merge abstraction will be merged into this release (there are other patches that should be merged first). If that is the case, we can keep the code as is, but we will need to revisit this later.

Contributor Author

@klueska klueska Jul 29, 2019

It's more due to ease of reviewability. The smaller the change, the easier it is to review and verify its correctness relative to the existing code.

Contributor

@moshe010 You could introduce the change in Topology Manager as part of your PR and explain the reasoning

Contributor

yes, sure I will do the changes in my PR

Contributor

On second thought, for the current policies (best effort and strict) I don't need the merge signature to take a map; that is only needed for more advanced policies in the future. Because we are under time constraints for 1.16, I will make the merge signature take a slice and we will revisit it later.

@@ -164,16 +164,35 @@ func (m *manager) calculateAffinity(pod v1.Pod, container v1.Container) Topology
// Get the TopologyHints from a provider.
hints := provider.GetTopologyHints(pod, container)

- // If hints is nil, overwrite 'hints' with a preferred any-socket affinity.
+ // If hints is nil, insert a single, preferred any-socket hint into allProviderHints.
Contributor

Why is this needed? An alternative would be to not include it in allProviderHints.
For example, if I have a cpu provider and a device plugin provider and only the cpu provider has hints, I would only add those to allProviderHints. Also, in line 177, when you iterate per resource, a resource that has no hints should not be included in allProviderHints. This is also related to my comment below that it is better to have allProviderHints as a map[string][]TopologyHint, so that we keep a map of the resources that have hints.

Contributor Author

If we keep it as a map below, then it's probably fine to remove this. As of now, having a {1,1:true} is the same as not having a hint at all (at least for the best effort policy). I can see how it might cause problems though when integrating your Merge() abstraction. At what level in this logic do you pass things down to the Merge()?

Contributor Author

I pictured you passing the entire hints map into merge and having it pop out the single TopologyHint, with the current logic becoming the implementation of the best-effort merge policy.

Contributor

My intention was as in the POC code [1]. You can see that I am passing a permutation as a map[string]TopologyHint, and from the permutation I create the merged hint.
I assumed that I would only have resources with hints, but as you mention above:
nil - don't care
{} - care but has no alignment.
So I wonder, in what situation under the best-effort policy would we have a resource with empty hints?

[1] https://github.com/moshe010/kubernetes/blob/numa_toplogy_new/pkg/kubelet/cm/topologymanager/topology_manager.go#L182-L183

Contributor Author

Right now I have this happening in the device manager when the set of available devices is less than the set of requested devices for a specific resource type:
1a2b03b#diff-02911c3198a5635aba854ec2d6844cfdR47

This can happen if some devices become unhealthy after the scheduler has admitted the pod, but before the device manager has done its allocation (which happens after the call to GetTopologyHints()).

Thinking about this more though, this really should return an error and not an empty list. Maybe we need to modify the interface again to allow us to return errors here.

With that said, in the future, I see the need for {} becoming important once we introduce the GetPreferredAllocations() call discussed in kubernetes/enhancements#1121

Once that is in place, we will need a way of encoding that there are no preferred allocations, and that we'd prefer to block admission of the pod until a preferred allocation can be calculated.

Contributor

Regarding the case of devices becoming unhealthy after the scheduler has admitted the pod, I agree with you that it should be an error and not an empty list. We will probably need to change the interface.

Regarding GetPreferredAllocations(), I understand you want to use {} to indicate that there are no preferred allocations. Instead of returning {}, I think we should generate hints for all allocations and mark the good ones as preferred (if there are no preferred allocations, all the hints will have preferred=false). It is up to the policy to decide whether to admit or not. So we can use the best-effort policy to pick a non-preferred one, and we can use the strict policy to fail. We can even add another policy that is strict about the GPU preferred allocation and best-effort for other resources like cpu and nic (it is like best-effort, but it checks that the GPU hint is preferred). This can be done once we pass all the resources' hints as a map and implement the merge abstraction.
What do you think?
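
A rough sketch of the admit decision described above, where the policy rather than the hint provider decides what a non-preferred merged hint means. The `Policy` interface and its `CanAdmit` method are hypothetical names for this illustration, not the kubelet's API.

```go
package main

import "fmt"

// SocketMask is a uint64 stand-in for the kubelet's bitmask type (assumption).
type SocketMask uint64

type TopologyHint struct {
	SocketAffinity SocketMask
	Preferred      bool
}

// Policy is a hypothetical interface for this sketch: given the merged hint,
// decide whether the pod may be admitted.
type Policy interface {
	Name() string
	CanAdmit(merged TopologyHint) bool
}

// bestEffort admits the pod even when the merged hint is not preferred.
type bestEffort struct{}

func (bestEffort) Name() string               { return "best-effort" }
func (bestEffort) CanAdmit(TopologyHint) bool { return true }

// strict admits the pod only when the merged hint is preferred.
type strict struct{}

func (strict) Name() string                      { return "strict" }
func (strict) CanAdmit(merged TopologyHint) bool { return merged.Preferred }

func main() {
	// A merged hint spanning both sockets that no provider marked preferred.
	merged := TopologyHint{SocketAffinity: 3, Preferred: false}
	for _, p := range []Policy{bestEffort{}, strict{}} {
		fmt.Printf("%s admits: %v\n", p.Name(), p.CanAdmit(merged))
	}
}
```

A per-resource policy such as "strict for GPU, best-effort for everything else" would need the resource name alongside each hint, which is the argument for passing a map into the merge step.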

Contributor Author

Yes, of course. You are right. That is actually how it is designed and exactly what I outlined in the proposal / feedback document. Not sure what I was thinking when I wrote this this morning.

So yeah, not sure exactly when we would have a „care but no hints“.

Probably need to rethink this.

Contributor

Yeah I agree here that we should always return hints, and look at the case where the resources have become unavailable and we no longer have enough to satisfy the request.

Should the topology manager error and fail the pod? Or should it delegate that to the relevant providers, ie. when device manager comes to do actual allocation it will fail at this stage?

Contributor

I have created a PR to extend GetTopologyHints to return an error; see #81687

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 30, 2019
@klueska klueska force-pushed the upstream-get-topology-hints-map branch from 878df97 to d35d03b Compare August 14, 2019 13:39
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 14, 2019
Contributor

@ConnorDoyle ConnorDoyle left a comment

  1. Perhaps I missed something in the logic, but could you explain why each resource name is associated with a slice of hints (instead of just one hint per resource name?) It looks like in the only implementation a single element slice is always returned. What's the use case for multiple?

  2. We talked (slack dm) about amending the return type so that the resource name is part of the topology hint instead of being associated via the map. Combining with suggestion (1) above, how about the following:

type TopologyHint struct {
    Resource v1.ResourceName
    SocketAffinity SocketMask
    Preferred bool
}

func GetTopologyHints() []TopologyHint

@moshe010
Contributor

  1. Perhaps I missed something in the logic, but could you explain why each resource name is associated with a slice of hints (instead of just one hint per resource name?) It looks like in the only implementation a single element slice is always returned. What's the use case for multiple?
  2. We talked (slack dm) about amending the return type so that the resource name is part of the topology hint instead of being associated via the map. Combining with suggestion (1) above, how about the following:
type TopologyHint struct {
    Resource v1.ResourceName
    SocketAffinity SocketMask
    Preferred bool
}

func GetTopologyHints() []TopologyHint

I hope I understand the question; this is my understanding.
Here we want to generate, per resource, all the available hints. For example, on a machine with 2 NUMA nodes the cpu hints can be ([(1,0), preferred:=true], [(0,1), preferred:=true], [(1,1), preferred:=false]) if the request can be fulfilled with cpus on each NUMA node. If it can only be fulfilled on NUMA1, we will return ([(0,1), preferred:=true]).

The merging of hints is done later, when we iterate over all combinations of the resources' hints and select the best merged hint we can find.
Example:
I requested 2 cpus, 2 gpus and 2 VFs.
Let's say for cpu GetTopologyHints() returns
"cpu": ([(1,0), preferred:=true], [(0,1), preferred:=true], [(1,1), preferred:=false]) (because it can fulfill the request with cpus on each NUMA node).
Let's say for gpu GetTopologyHints() returns (option 1: all gpus on NUMA0; option 2: some gpus on NUMA0 and some on NUMA1)
"gpu": ([(1,0), preferred:=true], [(1,1), preferred:=false])
Let's say for vf GetTopologyHints() returns all on NUMA0
"vf": ([(1,0), preferred:=true])
Later on we will create the permutations:

  1. {"cpu": [(1,0), preferred:=true], "gpu": [(1,0), preferred:=true], "vf": [(1,0), preferred:=true]}
  2. {"cpu": [(0,1), preferred:=true], "gpu": [(1,0), preferred:=true], "vf": [(1,0), preferred:=true]}
  3. {"cpu": [(1,1), preferred:=false], "gpu": [(1,0), preferred:=true], "vf": [(1,0), preferred:=true]}
  4. {"cpu": [(1,0), preferred:=true], "gpu": [(1,1), preferred:=false], "vf": [(1,0), preferred:=true]}
  5. {"cpu": [(0,1), preferred:=true], "gpu": [(1,1), preferred:=false], "vf": [(1,0), preferred:=true]}
  6. {"cpu": [(1,1), preferred:=false], "gpu": [(1,1), preferred:=false], "vf": [(1,0), preferred:=true]}

For each permutation we will calculate the merged hint, and at the end we will select the best merged hint.

With the merge abstraction we can control how we generate the merged hint.

Does this make sense?
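
A minimal sketch of the permutation walk in the example above, using the same cpu/gpu/vf hints. The merge rule (bitwise AND of the masks, preferred only if every input is preferred) and the selection rule (preferred first, then the narrowest mask) are assumptions for illustration, as is the encoding in which bit 0 is NUMA0, so (1,0) is 1, (0,1) is 2 and (1,1) is 3.

```go
package main

import (
	"fmt"
	"math/bits"
)

// SocketMask is a uint64 stand-in for the kubelet's bitmask type (assumption).
type SocketMask uint64

type TopologyHint struct {
	SocketAffinity SocketMask
	Preferred      bool
}

// mergePermutation ANDs the masks of one hint per resource. The result is
// preferred only if every hint in the permutation is preferred.
func mergePermutation(perm []TopologyHint, all SocketMask) TopologyHint {
	merged := TopologyHint{SocketAffinity: all, Preferred: true}
	for _, h := range perm {
		merged.SocketAffinity &= h.SocketAffinity
		merged.Preferred = merged.Preferred && h.Preferred
	}
	return merged
}

// better reports whether a should replace b: a preferred hint beats a
// non-preferred one, and otherwise the mask with fewer bits set wins.
func better(a, b TopologyHint) bool {
	if a.Preferred != b.Preferred {
		return a.Preferred
	}
	return bits.OnesCount64(uint64(a.SocketAffinity)) < bits.OnesCount64(uint64(b.SocketAffinity))
}

// bestHint walks every permutation (one hint per resource), merges it, and
// keeps the best non-empty result. It also returns the permutation count.
func bestHint(perResource [][]TopologyHint, all SocketMask) (TopologyHint, int) {
	best := TopologyHint{SocketAffinity: all, Preferred: false}
	count := 0
	var walk func(i int, perm []TopologyHint)
	walk = func(i int, perm []TopologyHint) {
		if i == len(perResource) {
			count++
			merged := mergePermutation(perm, all)
			// An empty mask means the hints share no socket; skip it.
			if merged.SocketAffinity != 0 && better(merged, best) {
				best = merged
			}
			return
		}
		for _, h := range perResource[i] {
			walk(i+1, append(perm, h))
		}
	}
	walk(0, nil)
	return best, count
}

func main() {
	const all SocketMask = 3 // both NUMA nodes
	perResource := [][]TopologyHint{
		{{1, true}, {2, true}, {3, false}}, // cpu
		{{1, true}, {3, false}},            // gpu
		{{1, true}},                        // vf
	}
	best, n := bestHint(perResource, all)
	fmt.Printf("permutations=%d best=%+v\n", n, best)
}
```

The walk takes a slice of slices rather than a map, matching the slice-based merge signature agreed earlier in this thread; a map-based variant would only need to carry the resource name alongside each hint.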

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 15, 2019
@klueska
Contributor Author

klueska commented Aug 16, 2019

@ConnorDoyle

  1. Perhaps I missed something in the logic, but could you explain why each resource name is associated with a slice of hints (instead of just one hint per resource name?)

For the same reason that the CPUManager returns a slice of hints for just the CPU resource. It returns one hint for every possible mask that it can satisfy its allocation from. The reason for the map is so that we can handle the case where the devicemanager manages multiple resource types and wants to return a slice of hints for each of them.

Please see the next PR where this is done:
https://github.com/kubernetes/kubernetes/pull/80570/files/06e214f73ca30e40d0b7b2a80a0db66f36c3c673..a15d8626e1f5fc9e66f061ae084f483322760f55#diff-02911c3198a5635aba854ec2d6844cfdR30

It looks like in the only implementation a single element slice is always returned. What's the use case for multiple?

Which part of the implementation are you referring to? This PR modifies the CPUManager to return a []TopologyHint exactly as before, but now wrapped in a single element map[string][]TopologyHint, indexed by string(v1.ResourceCPU).

  2. We talked (slack dm) about amending the return type so that the resource name is part of the topology hint instead of being associated via the map. Combining with suggestion (1) above, how about the following:
type TopologyHint struct {
    Resource v1.ResourceName
    SocketAffinity SocketMask
    Preferred bool
}

func GetTopologyHints() []TopologyHint

On slack, the actual suggestion was for:

type TopologyHints struct {
    resource v1.ResourceName
    hints []TopologyHint
}
type TopologyHint struct {
    SocketAffinity SocketMask
    Preferred bool
}

func GetTopologyHints() []TopologyHints

Which still has a []TopologyHint per resource name, it's just wrapped in a struct instead of indexed through a map.
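
To make the equivalence concrete, here is a small sketch that converts between the two shapes. The exported field names, the use of `string` instead of `v1.ResourceName`, and the `fromMap` and `toMap` helpers are assumptions made to keep the example self-contained.

```go
package main

import (
	"fmt"
	"sort"
)

// SocketMask is a uint64 stand-in for the kubelet's bitmask type (assumption).
type SocketMask uint64

type TopologyHint struct {
	SocketAffinity SocketMask
	Preferred      bool
}

// TopologyHints is the struct-wrapped form: one entry per resource name.
type TopologyHints struct {
	Resource string
	Hints    []TopologyHint
}

// fromMap converts the map form into the struct-wrapped form. Entries are
// sorted by resource name so the output order is deterministic.
func fromMap(m map[string][]TopologyHint) []TopologyHints {
	names := make([]string, 0, len(m))
	for name := range m {
		names = append(names, name)
	}
	sort.Strings(names)
	out := make([]TopologyHints, 0, len(m))
	for _, name := range names {
		out = append(out, TopologyHints{Resource: name, Hints: m[name]})
	}
	return out
}

// toMap converts the struct-wrapped form back into the map form.
func toMap(list []TopologyHints) map[string][]TopologyHint {
	m := make(map[string][]TopologyHint, len(list))
	for _, entry := range list {
		m[entry.Resource] = entry.Hints
	}
	return m
}

func main() {
	asMap := map[string][]TopologyHint{
		"cpu": {{1, true}, {2, true}, {3, false}},
		"gpu": {{1, true}},
	}
	asList := fromMap(asMap)
	fmt.Println(len(asList), len(toMap(asList)))
}
```

Either form keeps a full []TopologyHint per resource name, so the choice between them affects only how callers look hints up, not what information is returned.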

At present, there is no way for a hint provider to return distinct hints
for different resource types via a call to GetTopologyHints(). This
means that hint providers that govern multiple resource types (e.g. the
devicemanager) must do some sort of "pre-merge" on the hints they
generate for each resource type before passing them back to the
TopologyManager.

This patch changes the GetTopologyHints() interface to allow a hint
provider to pass back raw hints for each resource type, and allow the
TopologyManager to merge them using a single unified strategy.

This change also allows the TopologyManager to recognize which
resource type a set of hints originated from, should this information
become useful in the future.
@klueska klueska force-pushed the upstream-get-topology-hints-map branch from d35d03b to 4fdd52b Compare August 16, 2019 06:06
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Aug 16, 2019
@klueska
Contributor Author

klueska commented Aug 16, 2019

/retest

@ConnorDoyle
Contributor

@moshe010 and @klueska, thanks for clarifying / reminding me about the semantics of the return type.

Since this is internal API only, I agree we can change the signature of GetTopologyHints in a follow-up since what's here is functionally equivalent.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 19, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ConnorDoyle, klueska

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 19, 2019
@ConnorDoyle
Contributor

/retest

@k8s-ci-robot k8s-ci-robot merged commit ddd45b7 into kubernetes:master Aug 20, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.16 milestone Aug 20, 2019
@klueska klueska deleted the upstream-get-topology-hints-map branch August 29, 2019 12:52
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

7 participants