
Flag to not advertise NUMA information #320

Open
blackgold opened this issue Jan 29, 2021 · 11 comments
Labels
enhancement New feature or request

Comments

@blackgold

What would you like to be added?

Flag to not advertise NUMA information

What is the use case for this feature / enhancement?

The logic to generate placement hints in the topology manager is exponential in the number of NUMA nodes.
When we have 8 NUMA nodes it takes a very long time.
We are not using NUMA information for RDMA, so it would be helpful if the device plugin did not send it (configurable by a flag).
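The blow-up is easy to see: the Topology Manager merges hints by walking the cross-product of every provider's hint list, and with n NUMA nodes a provider can advertise up to 2^n - 1 distinct affinity masks. A minimal sketch of the combinatorics (illustrative names, not kubelet code):

```go
package main

import "fmt"

// combinations returns how many hint permutations the merge step must
// evaluate: the product of each provider's hint-list length.
func combinations(hintCounts []int) int {
	total := 1
	for _, c := range hintCounts {
		total *= c
	}
	return total
}

func main() {
	// 2 NUMA nodes, 2 providers: at most 3 masks each -> 9 permutations.
	fmt.Println(combinations([]int{3, 3}))
	// 8 NUMA nodes, 2 providers: up to 255 masks each -> 65025 permutations.
	fmt.Println(combinations([]int{255, 255}))
}
```

Every extra provider that advertises NUMA hints multiplies the count again, which is why suppressing hints from plugins that don't need alignment helps.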

@killianmuldoon
Collaborator

@blackgold This is really interesting - have you got numbers for how long the TM calculation is taking? Does it impact your container startup time? It would be really helpful to understand the impact of the Topology calculation.

@zshi-redhat
Collaborator

@blackgold I assume you also have other device plugin instances running in the same cluster that require NUMA advertising, correct? So disabling the NUMA policy in kubelet is not an option here.

@blackgold
Author

@blackgold This is really interesting - have you got numbers for how long the TM calculation is taking? Does it impact your container startup time? It would be really helpful to understand the impact of the Topology calculation.

It takes more than 20 minutes. The job controller kills jobs that stay in the pending state for more than 20 minutes after binding to a node.
I will try to add some logs in kubelet to time it.

@blackgold
Author

@blackgold I assume you also have other device plugin instances running in the same cluster that require NUMA advertising, correct? So disabling the NUMA policy in kubelet is not an option here.

Ack. We have a GPU device plugin advertising topology information, so we cannot disable it in kubelet.
Jobs requiring fewer than 8 GPUs don't request RDMA resources, so we need it enabled in kubelet for that case.

@killianmuldoon
Collaborator

@blackgold Is this an 8 NUMA zone node? I didn't realize the Topology Manager calculation could take so long - any extra information on the setup and config would be great.

@zshi-redhat this seems like a must-have for these sorts of situations. Do you think it would work as a cmd flag, i.e. daemonset-wide (but not necessarily cluster-wide), or would it be better to have it as a per-pool config (which would allow TM active for SR-IOV on some pools but not on others)?

@blackgold
Author

Yup, it's an 8 NUMA zone node: 8 GPUs, 8 RDMA devices and 255 CPUs.

https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/topologymanager/policy.go#L142
Here the size of allProviderHints is 10x239.

When I timed it, it took 220 seconds to generate the permutations from [8,0] to [8,96], roughly 22944 function calls. 220 seconds seems like a lot for that many function calls. Need to debug more.
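For reference, the linked code recursively walks the cross-product of all providers' hint lists, invoking a callback once per complete permutation. A rough sketch of that pattern (names here are illustrative, not the actual kubelet identifiers):

```go
package main

import "fmt"

// iterate recursively builds every permutation that takes one hint from
// each provider's list, calling cb once per complete permutation.
func iterate(hints [][]string, current []string, cb func([]string)) {
	if len(current) == len(hints) {
		cb(current)
		return
	}
	// Branch once per hint of the next provider.
	for _, h := range hints[len(current)] {
		iterate(hints, append(current, h), cb)
	}
}

func main() {
	calls := 0
	// Two providers with 2 and 3 hints -> 2*3 = 6 permutations.
	iterate([][]string{{"a", "b"}, {"x", "y", "z"}}, nil, func([]string) { calls++ })
	fmt.Println(calls)
}
```

The recursion itself is cheap per call, so if 22944 calls take 220 seconds the cost is likely in the per-permutation merge work rather than the traversal.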

@zshi-redhat
Collaborator

@zshi-redhat this seems like a must-have for these sorts of situations. Do you think it would work as a cmd flag, i.e. daemonset-wide (but not necessarily cluster-wide), or would it be better to have it as a per-pool config (which would allow TM active for SR-IOV on some pools but not on others)?

I think having a per-pool config would allow more flexibility and ultimately solve any related issues. For example, one device plugin instance could serve several resource pools, with NUMA enabled for some pools but not the others.
If we only have a CLI option, then the user would need to run multiple instances of the device plugin, each using a different NUMA CLI config.

For this particular case, my understanding is that the GPU is advertised by a different device plugin (which may not be SR-IOV), so having a CLI option would be enough.
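If the per-pool route were taken, one way it could look is an opt-out field alongside each pool in the plugin's resource list. Note that `numaAware` below is hypothetical, not an existing sriov-network-device-plugin option; it is only a sketch of the shape such a config might take:

```json
{
  "resourceList": [
    {
      "resourceName": "rdma_pool",
      "selectors": { "vendors": ["15b3"] },
      "numaAware": false
    },
    {
      "resourceName": "sriov_pool",
      "selectors": { "vendors": ["8086"] }
    }
  ]
}
```

With a shape like this, pools default to advertising NUMA topology and only explicitly opted-out pools would omit it.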

@adrianchiris
Contributor

adrianchiris commented Feb 2, 2021

Was an issue filed against the topology manager? Maybe the algorithm can be improved.

@blackgold
Author

Was an issue filed against the topology manager? Maybe the algorithm can be improved.
Not yet. @klueska

If you think it's reasonable to control this using a CLI option, I can send out an MR.

@zshi-redhat
Collaborator

Was an issue filed against the topology manager? Maybe the algorithm can be improved.
Not yet. @klueska

If you think it's reasonable to control this using a CLI option, I can send out an MR.

I'm fine with using a CLI option; this is aligned with the discussion we had in #320 and the resource mgmt meeting - to have a featureGate for features that may need to be enabled/disabled. I think NUMA could be one example of this.

/cc @killianmuldoon @ahalim-intel @adrianchiris @martinkennelly

@killianmuldoon
Collaborator

@zshi-redhat I think a feature gate is a good idea here for sure, but we should think about implementing per-pool NUMA awareness (default on, opt out for a specific pool) for advanced cases where SR-IOV topology may not be important (one NIC per node, multi-resource NUMA constraints).
