Node Topology Manager #693

Closed · lmdaly opened this issue Jan 17, 2019 · 171 comments

@lmdaly (Contributor) commented Jan 17, 2019

Enhancement Description

@lmdaly (Contributor, Author) commented Jan 17, 2019

/sig node
/kind feature
cc @ConnorDoyle @balajismaniam @nolancon

@vishh (Member) commented Feb 4, 2019

I can help inform this design based on learning from Borg. So count me in as a reviewer/approver.

@jeremyeder (Member) commented Feb 11, 2019

I can help inform this design based on learning from Borg. So count me in as a reviewer/approver.

Is there any public documentation on how this feature works in borg?

@vishh (Member) commented Feb 11, 2019

@spiffxp (Member) commented Feb 24, 2019

FYI @claurence

This tracking issue and KEP (#781) did not make it in time for the v1.14 enhancements freeze or the extended deadline. I appreciate that you opened these before the deadlines, but they didn't seem to get sufficient review or sign-off. This will need to go through the exception process.

Until we decide whether this is worth the exception, I'm inclined to put a hold on all PRs associated with this enhancement.

ref: kubernetes/kubernetes#72828

@jiayingz (Member) commented Feb 25, 2019

@claurence commented Feb 27, 2019

@lmdaly I see y'all have 1.14 listed in the description as the alpha milestone. Since there wasn't a merged implementable KEP, this issue is not being tracked for 1.14. If there are intentions for it to be included in that release, please submit an exception request.

@lmdaly (Contributor, Author) commented Mar 5, 2019

@lmdaly I see y'all have 1.14 listed in the description as the alpha milestone. Since there wasn't a merged implementable KEP, this issue is not being tracked for 1.14. If there are intentions for it to be included in that release, please submit an exception request.

@claurence the KEP is now merged (the KEP had previously been merged in the community repo; this was just to move it to the new enhancements repo per the new guidelines). Do we still need to submit an exception request to get this issue tracked for 1.14?

@resouer (Member) commented Mar 11, 2019

After reading the design & WIP PRs thoroughly, I have concerns that the current implementation is not as generic as the original topology design we proposed in #781. It currently reads more like node-level NUMA topology.

I left some comments for further discussion here: kubernetes/kubernetes#74345 (comment)

@k82cn (Member) commented Mar 20, 2019

the current implementation is not generic

I share the same concern on that :) How about other cases, e.g. links between devices (NVLink for GPUs)?

@lmdaly (Contributor, Author) commented Mar 25, 2019

@resouer @k82cn The initial proposal deals only with aligning the decisions made by the CPU Manager and Device Manager to ensure proximity of devices to the CPU the container runs on. Satisfying inter-device affinity was a non-goal of the proposal.

If, however, the current implementation blocks the addition of inter-device affinity in the future, then I am happy to change the implementation once I understand how it does so.

@klueska (Contributor) commented Apr 5, 2019

I think the main issue I see with the current implementation and the ability to support inter-device affinity is the following:

To support inter-device affinity you normally need to first figure out which devices you would like to allocate to a container before deciding what socket affinity you would like the container to have.

For example, with Nvidia GPUs, for optimal connectivity, you first need to find and allocate the set of GPUs with the most connected NVLINKs before determining what socket affinity that set has.

From what I can tell in the current proposal, the assumption is that these operations happen in reverse order, i.e. the socket affinity is decided before doing the allocation of devices.
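To make the ordering concrete, here is a minimal Go sketch (hypothetical types and toy data, not the actual device plugin API) of a device-first selection: the preferred socket only falls out after the GPU pair has been chosen.

package main

import "fmt"

// gpu is a hypothetical view of a device and its point-to-point links.
type gpu struct {
    id      int
    socket  int
    nvlinks map[int]int // peer GPU id -> number of NVLinks to that peer
}

// pickPair chooses the pair of GPUs with the most NVLinks between them.
// The preferred socket is only known after the pair has been chosen.
func pickPair(gpus []gpu) (best [2]int, socket int) {
    maxLinks := -1
    for _, g := range gpus {
        for peer, links := range g.nvlinks {
            if links > maxLinks {
                maxLinks = links
                best = [2]int{g.id, peer}
                socket = g.socket
            }
        }
    }
    return best, socket
}

func main() {
    // Toy data: two NVLink-connected pairs sitting on different sockets.
    gpus := []gpu{
        {id: 2, socket: 0, nvlinks: map[int]int{3: 2}},
        {id: 6, socket: 1, nvlinks: map[int]int{7: 2}},
    }
    pair, socket := pickPair(gpus)
    fmt.Printf("allocate GPUs %v, prefer socket %d\n", pair, socket)
}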

@ConnorDoyle (Member) commented Apr 5, 2019

That’s not necessarily true @klueska. If the topology hints were extended to encode point-to-point device topology, the Device Manager could consider that when reporting socket affinity. In other words, cross device topology wouldn’t need to leak out of the scope of the device manager. Does that seem feasible?

@klueska (Contributor) commented Apr 5, 2019

Maybe I'm confused about the flow somehow. This is how I understand it:

  1. At initialization, device plugins (not the devicemanager) register themselves with the topologymanager so it can issue callbacks on it at a later time.

  2. When a pod is submitted the kubelet calls the lifecycle.PodAdmitHandler on the topologymanager.

  3. The lifecycle.PodAdmitHandler calls GetTopologyHints on each registered device plugin

  4. It then merges these hints to produce a consolidated TopologyHint associated with the pod

  5. If it decided to admit the pod, it returns successfully from lifecycle.PodAdmitHandler storing the consolidated TopologyHint for the pod in a local state store

  6. At some point in the future, the cpumanager and the devicemanager call GetAffinity(pod) on the topology manager to retrieve the TopologyHint associated with the pod

  7. The cpumanager uses this TopologyHint to allocate a CPU

  8. The devicemanager uses this TopologyHint to allocate a set of devices

  9. Initialization of the pod continues...

If this is correct, I guess I'm struggling with what happens between the point in time when the device plugin reports its TopologyHints and the time when the devicemanager does the actual allocation.

If these hints are meant to encode "preferences" for allocation, then I think what you are saying is to have a structure more like:

type TopologyHints struct {
    hints []struct {
        SocketID int
        DeviceIDs []int
    }
}

Where we pass not only a list of socket affinity preferences, but also how those socket affinity preferences pair with allocatable GPU preferences.

If this is the direction you are thinking, then I think we could make it work, but we would need to somehow coordinate between the cpumanager and the devicemanager to make sure they "accepted" the same hint when making their allocations.

Is there something in place that allows this already that I missed?

@eloyekunle (Member) commented Apr 6, 2019

@klueska

I think what happens, making some minor corrections to your flow, is:

  1. At initialization, device plugins register themselves with the devicemanager so it can issue callbacks on it at a later time.

  2. The lifecycle.PodAdmitHandler calls GetTopologyHints on each topology-aware component in the Kubelet, currently devicemanager and cpumanager.

In this case, what will be represented as topology-aware in the Kubelet are the cpumanager and the devicemanager. The topology manager is only intended to coordinate allocations between topology-aware components.

For this:

but we would need to somehow coordinate between the cpumanager and the devicemanager to make sure they "accepted" the same hint when making their allocations.

This is what the topologymanager itself was introduced to achieve. From one of the earlier drafts,

These components should coordinate in order to avoid cross NUMA assignments. The problems related to this coordination are tricky; cross domain requests such as “An exclusive core on the same NUMA node as the assigned NIC” involves both CNI and the CPU manager. If the CPU manager picks first, it may select a core on a NUMA node without an available NIC and vice-versa.
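As a rough illustration of that coordination (a sketch only, using a simplified bitmask rather than the kubelet's real socketmask type), the merge can be thought of as an intersection of each provider's acceptable sockets:

package main

import "fmt"

// socketMask is a simplified stand-in for the kubelet's socket bitmask.
type socketMask uint64

// mergeHints intersects the socket preferences reported by each
// topology-aware component (e.g. cpumanager, devicemanager), so that
// only sockets acceptable to every provider survive.
func mergeHints(hints []socketMask) socketMask {
    merged := socketMask(^uint64(0))
    for _, h := range hints {
        merged &= h
    }
    return merged
}

func main() {
    cpuHint := socketMask(0b10) // cpumanager: only socket 1 works
    nicHint := socketMask(0b11) // device side: either socket works
    fmt.Printf("merged mask: %02b\n", mergeHints([]socketMask{cpuHint, nicHint}))
}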

@klueska (Contributor) commented Apr 9, 2019

I see.

So the devicemanager and cpumanager both implement GetTopologyHints() as well as call GetAffinity(), avoiding direct interaction between the topologymanager and any underlying device plugins. Looking more closely at the code, I see that the devicemanager simply delegates control to the plugins to help fill in TopologyHints, which makes more sense in the end anyway.

Circling back to the original question / issue I raised though....

From Nvidia's perspective, I think we can make everything work with this proposed flow, assuming more information is added to the TopologyHints struct (and consequently the device plugin interface) to report point-to-point link information in the future.

However, I think starting with a SocketMask as the primary data structure for advertising socket affinity may limit our ability to expand TopologyHints with point-to-point information in the future without breaking the existing interface. The primary reason being that (at least in the case of Nvidia GPUs) the preferred socket depends on which GPUs are actually going to be allocated in the end.

For example, consider the figure below, when attempting to allocate 2 GPUs to a pod with optimal connectivity:

[Figure: screenshot (2019-04-09) of an 8-GPU node topology; GPU pairs (2, 3) and (6, 7) are each joined by 2 NVLinks and attached to sockets 0 and 1 respectively]

The GPU combinations of (2, 3) and (6, 7) both have 2 NVLINKs and reside on the same PCIe bus. They should therefore be considered equal candidates when attempting to allocate 2 GPUs to a pod. Depending on which combination is chosen, however, a different socket will obviously be preferred as (2, 3) is connected to socket 0 and (6, 7) is connected to socket 1.

This information will somehow need to be encoded in the TopologyHints struct so that the devicemanager can perform one of these desired allocations in the end (i.e. whichever one the topologymanager consolidates the hints down to). Likewise, the dependency between the preferred device allocations and the preferred socket will need to be encoded in TopologyHints so that the cpumanager can allocate CPUs from the correct socket.

A potential solution specific to Nvidia GPUs for this example would look something like:

type TopologyHint struct {
    SocketID int
    DeviceIDs []int
}

type TopologyHints []TopologyHint

devicemanagerhints := &TopologyHints{
    {SocketID: 0, DeviceIDs: []int{2, 3}},
    {SocketID: 1, DeviceIDs: []int{6, 7}},
}

cpumanagerhints := &TopologyHints{
    {SocketID: 1},
}

Where the topologymanager would consolidate these hints to return {SocketID: 1, DeviceIDs: []int{6, 7}} as the preferred hint when the devicemanager and cpumanager later call GetAffinity().
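For illustration, here is a hedged sketch of that consolidation step (a hypothetical helper reusing the TopologyHint type above, not an existing kubelet function): keep the first device-manager hint whose SocketID the cpumanager also accepts.

// consolidate is a hypothetical helper: it returns the first hint from the
// devicemanager whose SocketID also appears in the cpumanager's hints.
func consolidate(deviceHints, cpuHints TopologyHints) (TopologyHint, bool) {
    allowed := map[int]bool{}
    for _, h := range cpuHints {
        allowed[h.SocketID] = true
    }
    for _, h := range deviceHints {
        if allowed[h.SocketID] {
            return h, true
        }
    }
    return TopologyHint{}, false
}

// With the hints above, consolidate(*devicemanagerhints, *cpumanagerhints)
// yields {SocketID: 1, DeviceIDs: []int{6, 7}}.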

While this may or may not provide a generic enough solution for all accelerators, replacing SocketMask in the TopologyHints struct with something structured more like the following would allow us to expand each individual hint with more fields in the future:

Note that GetTopologyHints() still returns TopologyHints, while GetAffinity() has been modified to return a single TopologyHint rather than TopologyHints.

type TopologyHint struct {
    SocketID int
}

type TopologyHints []TopologyHint

&TopologyHints{
    {SocketID: 0},
    {SocketID: 1},
}

type HintProvider interface {
	GetTopologyHints(pod v1.Pod, container v1.Container) TopologyHints
}

type Store interface {
	GetAffinity(podUID string, containerName string) TopologyHint
}
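To show the shape of that interface in use, a toy provider might look like the following (hypothetical, for illustration only; assumes v1 is k8s.io/api/core/v1):

// fakeProvider is a toy HintProvider that reports both sockets as
// acceptable for every container (illustration only).
type fakeProvider struct{}

func (fakeProvider) GetTopologyHints(pod v1.Pod, container v1.Container) TopologyHints {
    return TopologyHints{
        {SocketID: 0},
        {SocketID: 1},
    }
}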

Thoughts?

@k-wiatrzyk (Contributor) commented Nov 2, 2020

I have sent a Call for Exception mail as the deadline is today (just in case): https://groups.google.com/g/kubernetes-sig-node/c/lsy0fO6cBUY/m/_ujNkQtCCQAJ

@k-wiatrzyk (Contributor) commented Nov 4, 2020

@kikisdeliveryservice @klueska @annajung
The Call for Exception was approved, you can find the confirmation here: https://groups.google.com/g/kubernetes-sig-release/c/otE2ymBKeMA/m/ML_HMQO7BwAJ

@kikisdeliveryservice (Member) commented Nov 5, 2020

Thanks @k-wiatrzyk & @klueska. Updated the tracking sheet.

cc: @annajung

@kikisdeliveryservice (Member) commented Nov 11, 2020

Hey @k-wiatrzyk @klueska

Looks like kubernetes/kubernetes#92967 is approved but needs a rebase.

Just a reminder that Code Freeze is coming up in 2 days on Thursday, November 12th. All PRs must be merged by that date, otherwise an Exception is required.

Best,
Kirsten

@kikisdeliveryservice (Member) commented Nov 12, 2020

The PR merged, updating sheet - yay! 😸

@annajung annajung removed this from the v1.20 milestone Jan 7, 2021
@fejta-bot commented Apr 7, 2021

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@klueska (Contributor) commented Apr 8, 2021

I am going to mark this enhancement as complete.
We can open a new one when we start more actively working on moving the TopologyManager to GA.

@klueska (Contributor) commented Apr 8, 2021

/close

@k8s-ci-robot (Contributor) commented Apr 8, 2021

@klueska: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@SergeyKanzhelev (Member) commented Apr 8, 2021

I am going to mark this enhancement as complete.
We can open a new one when we start more actively working on moving the TopologyManager to GA.

My understanding of the process is that the Enhancement issue stays open until GA. That way, long-lived beta enhancements can be found without parsing documents.

@annajung what's the right process?

@annajung (Member) commented Apr 8, 2021

Yes, that's correct; we keep all issues open until GA. The related KEP is also mapped to this issue, so you should not create a new issue for the same enhancement.

I am going to reopen the issue; if there are any concerns, please feel free to reach out. Thanks!

@annajung annajung reopened this Apr 8, 2021
@klueska (Contributor) commented Apr 9, 2021

I am the current owner of this feature, but not the original poster of this issue (so I don't have edit rights to the description). The original poster does not work on Kubernetes anymore (which is why I thought it might be better to open a new issue going forward). If there's a way to give me edit rights to this issue, that would be great.

@annajung (Member) commented Apr 9, 2021

Hey @klueska, right now there isn't a good way to provide edit access unless you're a SIG lead. However, as a workaround, I have seen others create a new comment in the issue that contains the latest description and use it to keep track of the latest status. If you do that, I can add a link to that comment in the description so everyone knows to look there instead of at the stale description. In addition to myself, you can always contact the SIG lead who owns the enhancement to make changes as well.

I know this isn't ideal, but I don't think there is a better way right now given the limitations of GitHub permissions. However, I think they're rolling out custom roles(?) soon, and maybe once that happens we can try to improve this. Feel free to reach out in the #enhancements Slack channel as well to get more feedback from the enhancements subproject team.

@fejta-bot commented May 9, 2021

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@fejta-bot commented Jun 8, 2021

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot (Contributor) commented Jun 8, 2021

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
