[KEP] Topology Manager - Update KEP for pod spec extension #1674
Conversation
* Separate Pod-level topology policy from node-level topology policy.
* Extend Pod spec to describe a Pod-level topology policy.
* Add node-level `dynamic` policy.
* Add Pod-level `pod-level-single-numa-node` policy.

Signed-off-by: sw.han <sw.han@samsung.com>
Signed-off-by: Byonggon Chun <bg.chun@samsung.com>
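For context, here is a rough sketch of the policy names this PR adds next to the node-level policies the Topology Manager already supports. The constant names and their grouping are assumptions for illustration, not the actual source layout:

```go
// Sketch only: constant names and placement are assumed for illustration.
package topologypolicy

const (
	// Existing node-level Topology Manager policies (configured on the kubelet).
	PolicyNone           = "none"
	PolicyBestEffort     = "best-effort"
	PolicyRestricted     = "restricted"
	PolicySingleNumaNode = "single-numa-node"

	// Added by this KEP update.
	// Node-level: presumably lets the node honor whatever pod-level policy a Pod declares.
	PolicyDynamic = "dynamic"
	// Pod-level: align all of the Pod's resources on a single NUMA node.
	PolicyPodLevelSingleNumaNode = "pod-level-single-numa-node"
)
```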
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Hi @swhan91. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: swhan91. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
I signed it
/ok-to-test
Signed-off-by: Byonggon Chun <bg.chun@samsung.com>
@swhan91
AlexeyPerevalov left a comment:
It looks good to me in general.
> … feature gate is disabled.
> 2. Add a functional Topology Manager that queries hint providers in order
>    to compute a preferred socket mask for each container.
I think the word "socket" is inappropriate here. I remember it was a mess (socket vs. NUMA node).
It seems clearer to change it to "NUMA node".
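For readers following this thread, here is a much-simplified sketch of the behaviour the quoted lines describe: the Topology Manager asks each hint provider for a hint and intersects the NUMA affinity masks (the "socket mask" in the older wording). The types and the one-hint-per-provider simplification are illustrative, not the kubelet's real interfaces.

```go
package main

import "fmt"

// TopologyHint is a simplified stand-in for the kubelet's hint type:
// the set of NUMA nodes a resource request could be satisfied from,
// plus whether the provider considers that placement preferred.
type TopologyHint struct {
	NUMANodes map[int]bool
	Preferred bool
}

// mergeHints intersects one hint per provider (CPU Manager, Device
// Manager, ...) into a single affinity for the container.
func mergeHints(hints []TopologyHint) TopologyHint {
	if len(hints) == 0 {
		return TopologyHint{NUMANodes: map[int]bool{}, Preferred: true}
	}
	merged := TopologyHint{NUMANodes: map[int]bool{}, Preferred: hints[0].Preferred}
	for n := range hints[0].NUMANodes {
		merged.NUMANodes[n] = true
	}
	for _, h := range hints[1:] {
		for n := range merged.NUMANodes {
			if !h.NUMANodes[n] {
				delete(merged.NUMANodes, n) // deleting while ranging is safe in Go
			}
		}
		merged.Preferred = merged.Preferred && h.Preferred
	}
	return merged
}

func main() {
	cpuHint := TopologyHint{NUMANodes: map[int]bool{0: true, 1: true}, Preferred: false}
	devHint := TopologyHint{NUMANodes: map[int]bool{0: true}, Preferred: true}
	merged := mergeHints([]TopologyHint{cpuHint, devHint})
	fmt.Println(merged.NUMANodes, merged.Preferred) // map[0:true] false
}
```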
> ### New Component: Topology Manager
>
> This proposal is focused on a new component in the Kubelet called the …
Misprint: "and and".
> * The Topology Manager reads the `v1core.PodSpec.Topology.Policy` field to coordinate assignment of resources for a Pod.
>
> ## Practical challenges
> A non-NUMA-aware scheduler may bind a pod to a node where the topology requirements cannot be satisfied. The Topology Manager will reject pod admission with a TopologyError, and the pod will end up in a `Terminated` state on the node.
That problem exists in the current implementation with per-node-defined policies.
Technically, that problem originates from the concept of a pod and its lifecycle.
In general, Pods remain until a human or controller process explicitly removes them.
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-lifetime
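To illustrate the failure mode being discussed, here is a minimal sketch under assumed names (hypothetical `admitPod` helper and simplified policy check; not the kubelet's actual admission code):

```go
package main

import (
	"errors"
	"fmt"
)

// errTopologyAffinity stands in for the admission error the quoted section
// calls a TopologyError; the exact error type is not modeled here.
var errTopologyAffinity = errors.New("resources cannot be aligned as the topology policy requires")

// admitPod mimics the decision: single-NUMA-node style policies only admit
// the pod when every requested resource fits on one NUMA node.
func admitPod(policy string, fitsSingleNUMANode bool) error {
	switch policy {
	case "single-numa-node", "pod-level-single-numa-node":
		if !fitsSingleNUMANode {
			return errTopologyAffinity
		}
	}
	return nil
}

func main() {
	// A scheduler that is not NUMA-aware may still bind the pod to this node,
	// so the rejection only surfaces at kubelet admission time.
	if err := admitPod("pod-level-single-numa-node", false); err != nil {
		fmt.Println("admission rejected:", err) // the pod then sits in a Terminated state on the node
	}
}
```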
> The node-level topology policy takes precedence over the pod-level topology policy.
> The node-level policies will be deprecated later, because server utilization is better with dynamic policies.
>
> #### Topology Policy in Pod Spec
> The value of the `v1core.PodSpec.Topology.Policy` field indicates the desired topology of the Pod's resource assignment.
If only the policy is planned right now, why not simplify it in the following way:
`v1core.PodSpec.nodeTopologyPolicy`
I think that if we have a plan to extend Topology (from my point of view, just "Topology" is too general, since there are already Topology* fields in PodSpec), we need to mention it here; in that context it will be clear why such a hierarchy was chosen.
I see what you are talking about.
We don't have a plan to introduce topology.cpuManagerPolicy or topology.cpuSiblings now,
so it seems introducing `v1core.PodSpec.topologyPolicy` is a better way.
(As for nodeTopologyPolicy, I think it is a little bit weird because we are talking about the topology of a Pod, not a node.)
…o topologyPolicy

The proposed topology field was hierarchical (`v1core.PodSpec.Topology.Policy`). It is changed to `v1core.PodSpec.topologyPolicy` because there is no need for a hierarchy at the moment.

Signed-off-by: sw.han <sw.han@samsung.com>
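As a rough sketch of the two shapes discussed in this thread (illustrative struct names only, not the actual Kubernetes API types):

```go
package v1sketch

// Original proposal: a nested struct under PodSpec holding the policy.
type Topology struct {
	Policy string `json:"policy,omitempty"`
}

type podSpecNested struct {
	Topology *Topology `json:"topology,omitempty"`
	// ...other PodSpec fields elided...
}

// After this commit: a single flat field, since no other topology.* settings
// (e.g. a hypothetical topology.cpuManagerPolicy) are planned right now.
type podSpecFlat struct {
	TopologyPolicy string `json:"topologyPolicy,omitempty"`
	// ...other PodSpec fields elided...
}
```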
> 1. Using a controller that guarantees to bind a Pod to a node where the desired topology of the Pod's resource assignment can be satisfied.
>    1. A new object is required to describe the desired topology; the object can be a new workload type `TopologySet` or a CRD.
> 2. A Topology-Aware Scheduler that would consider the topology of available resources on nodes.
>    1. A custom scheduler could be provisioned to take care of the scheduling for high-performance applications. Detection of a node's resource topology can be based on Node Feature Discovery with the usage of user-specified features.
Regarding using Node Feature Discovery here, a couple of questions: my understanding is that a custom scheduler should take into account not only the node resource topology (e.g. "this node has two NUMA nodes") but also the node resource availability (e.g. "this node has two NUMA nodes, with 4 cores free on NUMA node 0 and 6 cores free on NUMA node 1") at the moment of the scheduling decision, right?
Right, taking node resource availability into account would be very important during scheduling; it could prevent assigning a pod to a node where no single NUMA node has enough CPUs even though together they do.
Example (CPU only, but there are also other resources):
- pod request: 4 exclusive CPUs from one NUMA node
- NUMA 1: 3 free CPUs, NUMA 2: 2 free CPUs; together: 5 free CPUs
Additionally, the topologyPolicy field specified in the Pod spec could be used to inform the scheduler whether it should look at NUMA resource availability or not, depending on the selected policy.
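To make the CPU example above concrete, here is a minimal sketch of the per-NUMA check a topology-aware filter would have to perform (hypothetical function and data source, not an existing scheduler plugin):

```go
package main

import "fmt"

// fitsOnSingleNUMANode reports whether any single NUMA node still has enough
// free exclusive CPUs. The freeCPUsPerNUMA data would have to come from
// something fresher than static node labels, as discussed below.
func fitsOnSingleNUMANode(requestedCPUs int, freeCPUsPerNUMA map[int]int) bool {
	for _, free := range freeCPUsPerNUMA {
		if free >= requestedCPUs {
			return true
		}
	}
	return false
}

func main() {
	// Pod requests 4 exclusive CPUs from one NUMA node.
	// NUMA 1: 3 free, NUMA 2: 2 free -> 5 in total, but no single node fits.
	fmt.Println(fitsOnSingleNUMANode(4, map[int]int{1: 3, 2: 2})) // false: filter the node out
}
```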
Yep, hence my second question: is NFD really a good place to expose this information? Reasons I'm asking:
- AFAIU node feature labels aren't supposed to change very often, while the node allocation surely changes very frequently, so I wonder if this is the right approach to convey this information. Maybe we should move in the direction of the Prometheus node_exporter?
- Did you already plan how to extract the current allocation values (e.g. how many cores per NUMA node are actually allocated) and expose them?
Last: while important in the big picture, I'm not sure this KEP PR is the right place to discuss this sub-topic; should we move it to another venue?
Here is an issue created to discuss this: kubernetes/kubernetes#90204
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@swhan91: PR needs rebase.