
Conversation


@swhan91 swhan91 commented Apr 9, 2020

  • Pod Spec extension to describe the topology policy
  • Improve the Topology Manager to allow binding Pods with different topology policies to the same node.
  • Introduce the pod-level-single-numa policy
  • Deprecation plan for the topology policy flag
  • A way to mitigate the issue caused by non-NUMA-aware Pod scheduling

Signed-off-by: sw.han <sw.han@samsung.com>

swhan91 and others added 2 commits April 8, 2020 18:31
* Separate Pod-level topology policy from node-level topology policy.
* Extend Pod spec to describe a Pod-level topology policy.
* Add Node-level `dynamic` policy.
* Add Pod-level `pod-level-single-numa-node` policy.

Signed-off-by: sw.han <sw.han@samsung.com>
Signed-off-by: Byonggon Chun <bg.chun@samsung.com>
@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 9, 2020
@k8s-ci-robot
Contributor

Hi @swhan91. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Apr 9, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: swhan91
To complete the pull request process, please assign derekwaynecarr
You can assign the PR to them by writing /assign @derekwaynecarr in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Apr 9, 2020
@k-wiatrzyk
Contributor

I signed it

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 9, 2020
@bg-chun
Member

bg-chun commented Apr 9, 2020

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 9, 2020
Signed-off-by: Byonggon Chun <bg.chun@samsung.com>
@bg-chun
Member

bg-chun commented Apr 9, 2020

@swhan91
Can you rewrite the summary of the PR?
Something like this:

  • Pod Spec extension to describe the topology policy
  • Improve the Topology Manager to allow binding Pods with different topology policies to the same node.
  • Introduce the pod-level-single-numa policy
  • Deprecation plan for the topology policy flag
  • A way to mitigate the issue caused by non-NUMA-aware Pod scheduling

Contributor

@AlexeyPerevalov AlexeyPerevalov left a comment


It looks good to me in general.

1. […] feature gate is disabled.
2. Add a functional Topology Manager that queries hint providers in order to compute a preferred socket mask for each container.
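For illustration, here is a minimal sketch of the mechanism in the quoted excerpt: each hint provider (e.g. CPU Manager, Device Manager) returns a preferred NUMA-node bitmask for a container, and the Topology Manager intersects them to compute the container's preferred affinity. Names and types are assumptions for illustration, not the actual Kubelet code.

```go
// numaMask is a bitmask of NUMA node IDs: bit i set means NUMA node i is acceptable.
// Assumed type for illustration; not the actual Kubelet code.
type numaMask uint64

// mergeHints intersects per-provider preferences. A zero result means no common
// set of NUMA nodes satisfies every provider, which a strict policy would treat
// as grounds for rejecting the container.
func mergeHints(hints []numaMask) numaMask {
    merged := ^numaMask(0) // start from "any NUMA node"
    for _, h := range hints {
        merged &= h
    }
    return merged
}
```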
Contributor


I think the word socket is inappropriate here. I remember it was a mess (socket vs. NUMA node).

Author


It seems clear to change this to NUMA node.


### New Component: Topology Manager

This proposal is focused on a new component in the Kubelet called the Topology Manager.
Contributor


Misprint: "and and".

* The Topology Manager reads the `v1core.PodSpec.Topology.Policy` field to coordinate the assignment of resources for a Pod.

## Practical challenges
A non-NUMA-aware scheduler may bind a pod to a node where its topology requirements cannot be satisfied. The Topology Manager will then reject the pod at admission with a TopologyError, and the pod will end up in a `Terminated` state on that node.
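To make that rejection path concrete, here is a hedged sketch of how admission fails when no placement satisfies the topology policy on the node the scheduler chose. The type, field, and reason names are assumptions for illustration, not the actual Kubelet implementation.

```go
// PodAdmitResult mirrors the kind of result a Kubelet admit handler returns;
// field and reason names are assumed for illustration.
type PodAdmitResult struct {
    Admit   bool
    Reason  string
    Message string
}

// admitPod rejects the pod when no NUMA placement satisfies the topology policy.
// Because the scheduler is not NUMA-aware, this can only happen after the pod
// has already been bound to the node, leaving it terminated there.
func admitPod(podName string, placementFound bool) PodAdmitResult {
    if !placementFound {
        return PodAdmitResult{
            Admit:   false,
            Reason:  "TopologyAffinityError", // assumed reason string
            Message: "resources cannot be allocated with the desired topology for pod " + podName,
        }
    }
    return PodAdmitResult{Admit: true}
}
```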
Contributor


That problem already exists in the current implementation with per-node policies.

Member


Technically, that problem originates from the concept of a Pod and its lifecycle.
In general, Pods remain until a human or a controller process explicitly removes them.
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-lifetime

The node-level topology policy takes precedence over the pod-level topology policy.
The node-level policies will be deprecated later, because servers achieve better utilization with dynamic policies.
#### Topology Policy in Pod Spec
The value of the `v1core.PodSpec.Topology.Policy` field indicates the desired topology of the Pod's resource assignment.
Contributor


If only the policy is planned right now, why not simplify it in the following way:
`v1core.PodSpec.nodeTopologyPolicy`

Contributor


I think that if we have a plan to extend `Topology` (from my point of view, just `Topology` is too general, since there are already `Topology*` fields in PodSpec), we need to mention it here; in this context it will be clear why such a hierarchy was chosen.

Member


I see what you are talking about.
We don't have a plan to introduce `topology.cpuManagerPolicy` or `topology.cpuSiblings` now,
so introducing `v1core.PodSpec.topologyPolicy` seems the better way.
(As for `nodeTopologyPolicy`, I think it is a little bit weird because we are talking about the topology of a Pod, not a node.)

…o topologyPolicy

The proposed topology field was hierarchical (`v1core.PodSpec.Topology.Policy`).
It has been changed to `v1core.PodSpec.topologyPolicy`
because there is no need for a hierarchy at the moment.

Signed-off-by: sw.han <sw.han@samsung.com>
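For readability, a hypothetical sketch of what the extended API type could look like after this rename. The type name, field placement, and tags are assumptions for illustration; `pod-level-single-numa-node` is the policy proposed in this PR.

```go
// Hypothetical types illustrating the proposed v1core.PodSpec.topologyPolicy
// field; this is not the actual Kubernetes API definition.
type TopologyPolicy string

const (
    // PodLevelSingleNUMANode is the pod-level policy proposed in this KEP.
    PodLevelSingleNUMANode TopologyPolicy = "pod-level-single-numa-node"
)

type PodSpec struct {
    // ...existing PodSpec fields elided...

    // TopologyPolicy indicates the desired topology of the Pod's resource
    // assignment. When unset, the node-level topology policy applies.
    // +optional
    TopologyPolicy *TopologyPolicy `json:"topologyPolicy,omitempty"`
}
```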
1. Using a controller that guarantees to bind a Pod to a node where the desired topology of the Pod's resource assignment can be satisfied.
   1. A new object is required to describe the desired topology; the object could be a new workload type `TopologySet` or a CRD.
2. A topology-aware scheduler that would consider the topology of available resources on nodes.
   1. A custom scheduler could be provisioned to take care of scheduling for high-performance applications. Detection of a node's resource topology could be based on Node Feature Discovery with the use of user-specified features.
Contributor


Regarding using Node Feature Discovery here, a couple of questions: my understanding is that a custom scheduler should take into account not only the node resource topology (e.g. "this node has two NUMA nodes") but also the node resource availability (e.g. "this node has two NUMA nodes, with 4 cores free on NUMA node 0 and 6 cores free on NUMA node 1") at the moment of the scheduling decision, right?

Contributor

@k-wiatrzyk k-wiatrzyk Apr 15, 2020


Right, taking node resource availability into account would be very important during scheduling; this could prevent assigning a pod to a node where no single NUMA node has enough CPUs even though together they do.
Example here (only CPU, but there are also other resources):

  • pod request: 4 exclusive CPUs from one NUMA node
  • NUMA 1: 3 free CPUs, NUMA 2: 2 free CPUs, together: 5 free CPUs

Additionally, the topologyPolicy field specified in the Pod Spec could be used to inform the scheduler whether it should look at per-NUMA resource availability, depending on the selected policy.
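A minimal sketch of the per-NUMA availability check this example implies; the helper name is hypothetical and not part of any existing scheduler plugin.

```go
// fitsSingleNUMANode reports whether any single NUMA node has enough free
// exclusive CPUs for the request; the total free across nodes is not sufficient.
func fitsSingleNUMANode(freeCPUsPerNUMA []int, requestedCPUs int) bool {
    for _, free := range freeCPUsPerNUMA {
        if free >= requestedCPUs {
            return true
        }
    }
    return false
}
```

With the numbers above, `fitsSingleNUMANode([]int{3, 2}, 4)` returns false even though 5 CPUs are free in total, so a NUMA-aware scheduler would skip this node.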

Contributor


Yep, hence my second question: is NFD really a good place to expose this information? Reasons I'm asking:

  1. AFAIU, node-feature labels aren't supposed to change very often, while node allocation surely changes very frequently, so I wonder if this is the right approach to convey this information. Maybe we should move in the direction of the Prometheus node_exporter?
  2. Did you already plan how to extract the current allocation values (e.g. how many cores per NUMA node are actually allocated) and expose them?

Last: while important in the big picture, I'm not sure this KEP PR is the right place to discuss this sub-topic; should we move it to another venue?

Contributor


Here is an issue created to discuss this:
kubernetes/kubernetes#90204

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 15, 2020
@k8s-ci-robot
Contributor

@swhan91: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@swhan91 swhan91 closed this Jul 16, 2020
RomanBednar pushed a commit to RomanBednar/enhancements that referenced this pull request Jan 31, 2025