New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update GetTopologyHints() for TopologyManager Hint Providers to return a map #80569
Update GetTopologyHints() for TopologyManager Hint Providers to return a map #80569
Conversation
c06dcff
to
24270d9
Compare
efed402
to
8ad06c6
Compare
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this lgtm. I agree with your rationale and don't see any downsides.
I'm going to wait to officially mark "lgtm" until the dependent diffs are merged. Could I trouble you to ping me when they are?
In addition, looks like there are some small gofmt
/golint
errors to resolve when you get a sec.
8ad06c6
to
605284d
Compare
5bf09d8
to
878df97
Compare
continue | ||
} | ||
|
||
if len(hints[resource]) == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In what case we will get to a situation that we don't any hint for resource? isn't that will blocked on the scheduler level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The semantics we agreed to enforce are:
- nil == don't care --> (1,1: true)
- {} == care but has no alignment --> (1,1: false)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to hardcode the (1,1: true) or the (1,1: false) inside the hintprovider, because different policies might want to encode the "don't care" and "care but has no alignment" differently.
The current semantics are admittedly "best-effort" specific and I'd imagine they would be pushed down into the logic of your Merge()
abstraction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, this logic should be pushed down to the merge abstraction.
I am investigating what is the best way to do that. So for "nil == don't care" we can just remove the resource from the allproviderhints (better name would be allresourceshints). and for "{} == care has no alignment" I don't understand in what case we will have a resource with empty hints. Can you elaborate on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sounds to me that the hints shouldn't be nil if the plugin cares about Topology as there has to be some hint even if it is not preferred.
A nil hint would indicate to me that there were not enough resources to calculate the hint and there should have been an error.
continue | ||
} | ||
|
||
allProviderHints = append(allProviderHints, hints[resource]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you going over all the resources just to change them from a map[string][]TopologyHint to [][]TopologyHint. Why not just keeping it as a map[string][]TopologyHint. In my merge abstraction I was expecting the permutation to be map[string]TopologyHint. I was thinking that it might be useful in the future that we will know what is the resource this hint is belong too. One case for that is GPU direct. I will be able to write a policy that align GPU PCI address with NIC PCI address which are have on the same NUMA with the cpus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fine keeping it as a map if that makes your Merge()
abstraction better. I was just tring to change the code from its current state as little as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this due time constraints for the release? not sure when kubernetes 1.16 is release, and I doubt the merge abstraction will be merged into this release. (there are other patches that should be merged first). If that the case we can keep the code as is, but we will need to revisit this later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's more due to ease of reviewability. The smaller the change, the easier it is to review and verify its correctness relative to the existing code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@moshe010 You could introduce the change in Topology Manager as part of your PR and explain the reasoning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, sure I will do the changes in my PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on second thought, for the current policies (best effort and strict) I don't need merge signature to be map only for more advance policies in the future. Because we are in time constraints for 1.16 I will make the merge signature to be slice and we will revisit it late.
@@ -164,16 +164,35 @@ func (m *manager) calculateAffinity(pod v1.Pod, container v1.Container) Topology | |||
// Get the TopologyHints from a provider. | |||
hints := provider.GetTopologyHints(pod, container) | |||
|
|||
// If hints is nil, overwrite 'hints' with a preferred any-socket affinity. | |||
// If hints is nil, insert a single, preferred any-socket hint into allProviderHints. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this is need? alternative will be to not include it in allProviderHints.
For example if I have cpu provider and device plugin provider and only cpu provider has hint, I will only add them in the allProviderHints. also in Line 177 when you are going per resource if that resource don't have hint is should not be included in the allProviderHints. This is also related to my comment below that it better to have allProviderHints as map[string][] TopologyHint, so we will keep a map of resources that have hints.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we keep it as a map below, then it's probably fine to remove this. As of now, having a {1,1:true} is the same as not having a hint at all (at least for the best effort policy). I can see how it might cause problems though when integrating your Merge()
abstraction. At what level in this logic do you pass things down to the Merge()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pictured you passing the entire hints map into merge and having it pop out the single TopologyHint
, with the current logic becoming the implementation of the best-effort
merge policy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So my intention was like in the POC code [1]. you see that I am passing a permutation of map[string]TopologyHint and from permutation I create the merged hint.
I assumed that I will have only resources with hints, but like you mention above:
nil - as don't care
{} - as care but not align.
So I wonder in what situation in the best-effort policy we will have resource with empty hints?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now I have this happening in the device manager when the set of available devices is less than the set of requested devices for a specific resource type:
1a2b03b#diff-02911c3198a5635aba854ec2d6844cfdR47
This can happen if some devices becomes unhealthy after the scheduler has admitted the pod, but before the device manager has done its allocation (which happens after the call to
GetTopologyHints()
).
Thinking about his more though, this really this should return an error
and not an empty list. Maybe we need to modify the interface again to allow us to return errors here.
With that said, in the future, I see the need for {}
becoming important once we introduce the GetPreferredAllocations()
call discussed in kubernetes/enhancements#1121
Once that is in place, we will need a way of encoding that there are no preferred allocations, and that we'd prefer to block admission of the pod until a preferred allocation can be calculated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regarding the case of devices becomes unhealthy after the scheduler has admitted the pod I agree with you it should be error and not empty list. We should probably need to change the interface.
Regrading the GetPreferredAllocations(), so I understand you want to use {} to indicate that there are no preferred allocations. So instead of returning {} I think we should generate hints of all allocation and mark the good ones as preferred. (if there are not preferred allocation all the hints will be preferred =false) It up to the policy to decide if to admit or not. So we can use best-effort policy top pick not preferred one. we can use strict policy to failed. We can even add another policy that strict the GPU preferred allocation and best effort other resources like cpu and nic. (it like best-effort but with checking the GPU hint is preferred) This can be done once we pass the all resources hints as a map and implement merge abstraction
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, of course. You are right. That is actually how it is designed and exactly what I outlined in the proposal / feedback document. Not sure what I was thinking when I wrote this this morning.
So yeah, not sure exactly when we would have a „care but no hints“.
Probably need to rethink this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I agree here that we should always return hints, and look at the case where the resources have become unavailable and we no longer have enough to satisfy the request.
Should the topology manager error and fail the pod? Or should it delegate that to the relevant providers, ie. when device manager comes to do actual allocation it will fail at this stage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have create PR to extend GetTopologyHints to return error see #81687
878df97
to
d35d03b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
Perhaps I missed something in the logic, but could you explain why each resource name is associated with a slice of hints (instead of just one hint per resource name?) It looks like in the only implementation a single element slice is always returned. What's the use case for multiple?
-
We talked (slack dm) about amending the return type so that the resource name is part of the topology hint instead of being associated via the map. Combining with suggestion (1) above, how about the following:
type TopologyHint struct {
Resource v1.ResourceName
SocketAffinity SocketMask
Preferred bool
}
func GetTopologyHints() []TopologyHint
I hope I understand the question, but this is my understanding the merged of hints is done later on when we iterate on all combination of resources hints and select the best merged hint we can find.
With the merge abstraction we can control who we can generate the merged hint. Does this make sense? |
For the same reason that the Please see the next PR where this is done:
Which part of the implementation are you referring to? This PR modifies the
On slack, the actual suggestion was for:
Which still has a |
At present, there is no way for a hint provider to return distinct hints for different resource types via a call to GetTopologyHints(). This means that hint providers that govern multiple resource types (e.g. the devicemanager) must do some sort of "pre-merge" on the hints it generates for each resource type before passing them back to the TopologyManager. This patch changes the GetTopologyHints() interface to allow a hint provider to pass back raw hints for each resource type, and allow the TopologyManager to merge them using a single unified strategy. This change also allows the TopologyManager to recognize which resource type a set of hints originated from, should this information become useful in the future.
d35d03b
to
4fdd52b
Compare
/retest |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ConnorDoyle, klueska The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
What type of PR is this?
/kind api-change
What this PR does / why we need it:
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
This implements the change proposed in kubernetes/enhancements#1131