KEP-3022: min domains in Pod Topology Spread

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

A new field minDomains is introduced to PodSpec.TopologySpreadConstraint[*] to limit the minimum number of topology domains. minDomains can be used only when whenUnsatisfiable=DoNotSchedule.

Motivation

Pod Topology Spread has a maxSkew parameter, which controls the degree to which Pods may be unevenly distributed. However, there is no way to control the number of domains over which Pods should be spread. In some cases, users want to force Pods to spread over a minimum number of domains and, if not enough domains are present yet, make the cluster-autoscaler provision them.

Goals

  • Users can specify minDomains to limit the number of domains when using WhenUnsatisfiable=DoNotSchedule.

Non-Goals

  • Add new field to limit the maximum number of topology domains.
  • Allow users to use minDomains in a best-effort manner with WhenUnsatisfiable=ScheduleAnyway.

Proposal

User Story

I am using cluster autoscaler and I want to force spreading a deployment over at least 5 Nodes.

Design Details

Users can define a minimum number of domains with the minDomains parameter. This parameter only applies when whenUnsatisfiable=DoNotSchedule.

Pod Topology Spread has the semantics of "global minimum", which means the minimum number of pods that match the label selector in a topology domain.

However, the global minimum is only calculated for the nodes that exist and match the node affinity. In other words, if a topology domain was scaled down to zero (for example, because of low utilization), this topology domain is unknown to the scheduler, thus it's not considered in the global minimum calculations.

The new minDomains field can help with this problem.

When the number of domains with matching topology keys is less than minDomains, Pod Topology Spread treats "global minimum" as 0; otherwise, "global minimum" is equal to the minimum number of matching pods on a domain.

As a result, when the number of domains is less than minDomains, the scheduler doesn't schedule a matching Pod onto Nodes in domains that already hold maxSkew or more matching Pods.

minDomains is an optional parameter. If minDomains is nil, the constraint behaves as if MinDomains is equal to 1.
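The rule above can be sketched in Go as follows. This is a minimal illustration, not the scheduler's actual code: the function name effectiveGlobalMin and the map-based input are assumptions made for the example.

```go
package main

// effectiveGlobalMin returns the "global minimum" used in the skew calculation.
// matchingPods maps each known topology domain to its number of matching Pods.
// minDomains is nil when the field is unset, which behaves like 1.
// (Hypothetical helper for illustration only.)
func effectiveGlobalMin(matchingPods map[string]int, minDomains *int32) int {
	required := int32(1)
	if minDomains != nil {
		required = *minDomains
	}
	// Fewer known domains than required: treat the global minimum as 0.
	if int32(len(matchingPods)) < required {
		return 0
	}
	// Otherwise use the smallest count among the known domains.
	lowest := -1
	for _, n := range matchingPods {
		if lowest == -1 || n < lowest {
			lowest = n
		}
	}
	if lowest == -1 {
		return 0
	}
	return lowest
}
```

With three zones holding 2/2/2 matching Pods, the helper returns 0 when minDomains is 5 and 2 when minDomains is unset.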

API

A new optional parameter called MinDomains is introduced to PodSpec.TopologySpreadConstraint[*].

type TopologySpreadConstraint struct {
	// ......
	// MinDomains indicates a minimum number of eligible domains.
	// When the number of eligible domains with matching topology keys is less than minDomains,
	// Pod Topology Spread treats "global minimum" as 0, and then the calculation of Skew is performed.
	// And when the number of eligible domains with matching topology keys equals or is greater than minDomains,
	// this value has no effect on scheduling.
	// As a result, when the number of eligible domains is less than minDomains,
	// scheduler won't schedule more than maxSkew Pods to those domains.
	// If value is nil, the constraint behaves as if MinDomains is equal to 1.
	// Valid values are integers greater than 0.
	// When value is not nil, WhenUnsatisfiable must be DoNotSchedule.
	//
	// For example, in a 3-zone cluster, MaxSkew is set to 2, MinDomains is set to 5 and pods with the same
	// labelSelector spread as 2/2/2:
	// +-------+-------+-------+
	// | zone1 | zone2 | zone3 |
	// +-------+-------+-------+
	// |  P P  |  P P  |  P P  |
	// +-------+-------+-------+
	// The number of domains is less than 5(MinDomains), so "global minimum" is treated as 0.
	// In this situation, new pod with the same labelSelector cannot be scheduled,
	// because computed skew will be 3(3 - 0) if new Pod is scheduled to any of the three zones,
	// it will violate MaxSkew.
	//
	// This is an alpha field and requires enabling MinDomainsInPodTopologySpread feature gate.
	// +optional
	MinDomains *int32
}
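The two validation rules stated in the field comment (values must be greater than 0, and a non-nil value requires DoNotSchedule) could be checked along these lines. This is a sketch under assumptions: constraintSketch and validateMinDomains are hypothetical names, and the real validation in kube-apiserver is structured differently.

```go
package main

import "errors"

// constraintSketch is a trimmed-down, hypothetical stand-in for
// TopologySpreadConstraint that carries only the fields needed here.
type constraintSketch struct {
	MinDomains        *int32
	WhenUnsatisfiable string
}

// validateMinDomains applies the two rules from the field comment:
// a set value must be > 0 and requires WhenUnsatisfiable=DoNotSchedule.
func validateMinDomains(c constraintSketch) error {
	if c.MinDomains == nil {
		// Unset is always valid; the constraint behaves as if MinDomains is 1.
		return nil
	}
	if *c.MinDomains <= 0 {
		return errors.New("minDomains must be greater than 0")
	}
	if c.WhenUnsatisfiable != "DoNotSchedule" {
		return errors.New("minDomains can only be set when whenUnsatisfiable is DoNotSchedule")
	}
	return nil
}
```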

Implementation details

In the Filter of Pod Topology Spread, the current filtering criterion is

('existing matching num' + 'if self-match (1 or 0)' - 'global min matching num') <= 'maxSkew'
  • existing matching num denotes the number of current existing matching Pods on the domain.
  • if self-match denotes whether the labels of the Pod match the selector of the constraint.
  • global min matching num denotes the minimum number of matching Pods.

For whenUnsatisfiable: DoNotSchedule, Pod Topology Spread will treat global min matching num as 0 when the number of domains with matching topology keys is less than minDomains.

We can calculate the number of domains with matching topology keys in PreFilter, along with the calculation of TpPairToMatchNum. This extra calculation doesn't increase the complexity of the preFilter logic. Pod Topology Spread will be able to use the number of domains to determine the value of global min matching num when we calculate filtering criteria.
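As a rough sketch of this flow, the domain count and the minimum match count can be gathered in one pass and then reused for every per-node check. The names domainStats, summarize, and passesFilter are hypothetical and chosen for illustration; they do not correspond to the plugin's actual types.

```go
package main

// domainStats is a hypothetical summary gathered in a single pass over the
// per-domain match counts, similar in spirit to what PreFilter computes.
type domainStats struct {
	domains  int // number of domains with a matching topology key
	minMatch int // smallest number of matching Pods in any domain
}

// summarize counts the domains and tracks the minimum in the same loop,
// so the extra domain count adds no additional pass.
func summarize(matchNum map[string]int) domainStats {
	stats := domainStats{}
	first := true
	for _, n := range matchNum {
		stats.domains++
		if first || n < stats.minMatch {
			stats.minMatch = n
			first = false
		}
	}
	return stats
}

// passesFilter evaluates the filtering criterion for one candidate domain:
// existing + selfMatch - globalMin <= maxSkew, with globalMin forced to 0
// when there are fewer domains than minDomains.
func passesFilter(existing int, selfMatch bool, stats domainStats, maxSkew, minDomains int) bool {
	globalMin := stats.minMatch
	if stats.domains < minDomains {
		globalMin = 0
	}
	self := 0
	if selfMatch {
		self = 1
	}
	return existing+self-globalMin <= maxSkew
}
```

For example, with three nodes holding 2/2/1 matching Pods and maxSkew 1, the node holding 1 Pod passes when minDomains is 1 but is rejected when minDomains is 4.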

How user stories are addressed

Users can set minDomains together with whenUnsatisfiable: DoNotSchedule to achieve this.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      foo: bar
  replicas: 10
  template:
    metadata:
      labels:
        foo: bar
    spec:
      containers:
        - name: nginx
          image: nginx:1.14.2
          ports:
            - containerPort: 80
      topologySpreadConstraints:
        - maxSkew: 2
          minDomains: 5
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              foo: bar

Consider the case where we have 3 Nodes onto which Pods can be scheduled.

6 Pods will be scheduled onto those Nodes (2 per Node, because the global minimum is treated as 0 and maxSkew is 2), and the remaining 4 Pods can only be scheduled once 2 more Nodes join the cluster.

With this flow, the deployment will be spread over at least 5 Nodes while respecting the maxSkew constraint.
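The arithmetic of this example can be reproduced with a small simulation. It is only a sketch: scheduleSketch is a hypothetical helper that places Pods first-fit, whereas the real scheduler also scores Nodes, so the exact per-Node distribution may differ even though the number of schedulable Pods is the same.

```go
package main

// scheduleSketch tries to place `replicas` self-matching Pods, one at a time,
// onto the Nodes whose current matching-Pod counts are given in counts.
// It applies the DoNotSchedule filter with minDomains and returns how many
// Pods could be placed. counts is updated in place. (Hypothetical helper.)
func scheduleSketch(counts []int, replicas, maxSkew, minDomains int) int {
	placed := 0
	for i := 0; i < replicas; i++ {
		// Global minimum is 0 while there are fewer domains than minDomains.
		globalMin := 0
		if len(counts) > 0 && len(counts) >= minDomains {
			globalMin = counts[0]
			for _, c := range counts {
				if c < globalMin {
					globalMin = c
				}
			}
		}
		scheduled := false
		for n := range counts {
			// Filter criterion: existing + 1 (self-match) - globalMin <= maxSkew.
			if counts[n]+1-globalMin <= maxSkew {
				counts[n]++
				scheduled = true
				break
			}
		}
		if !scheduled {
			// Identical Pods that follow would be rejected for the same reason.
			break
		}
		placed++
	}
	return placed
}
```

Starting from three empty Nodes with maxSkew 2 and minDomains 5, a call with 10 replicas places 6 Pods. After two empty Nodes are appended to the slice, a second call with the remaining 4 replicas places all of them.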

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

To ensure this feature is rolled out with high quality, the following tests are mandatory:

  • Unit Tests: All core changes must be covered by unit tests.
  • Integration Tests / E2E Tests: Tests to ensure the behavior of this feature must be covered by either integration tests or e2e tests.
  • Benchmark Tests: We can accept a slight performance overhead for users who use this feature, but it shouldn't impose a penalty on users who don't. We will verify this with benchmark tests.
Unit tests
  • k8s.io/kubernetes/pkg/scheduler/framework/plugins/podtopologyspread: 2024-01-28 - 87.3%
  • k8s.io/kubernetes/pkg/api/pod: 2024-01-28 - 74.8%
  • k8s.io/kubernetes/pkg/apis/core/validation: 2024-01-28 - 83.9%
Integration tests

  • test: https://github.com/kubernetes/kubernetes/blob/c6064489223862fe1888fcbe0656ab1087461adb/test/integration/scheduler/filters/filters_test.go#L1349
  • k8s-triage: https://storage.googleapis.com/k8s-triage/index.html?sig=scheduling&test=TestPodTopologySpreadFilter

e2e tests

N/A

--

This feature doesn't introduce any new API endpoints and doesn't interact with other components. So, E2E tests don't add extra value over integration tests.

Graduation Criteria

Alpha (v1.24):

  • Add new parameter MinDomains to TopologySpreadConstraint and feature gating.
  • Filter extension point implementation.
  • Implement all tests mentioned in the Test Plan.

Beta (v1.25):

  • This feature will be enabled by default as a Beta feature in v1.25.

GA (v1.30):

  • No significant issues are reported against this feature for a certain length of time.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: MinDomainsInPodTopologySpread
    • Components depending on the feature gate: kube-scheduler, kube-apiserver
Does enabling the feature change any default behavior?

No.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

The feature can be disabled in Alpha and Beta versions by restarting kube-apiserver and kube-scheduler with the feature gate turned off. In Stable versions, users can opt out by not setting the pod.spec.topologySpreadConstraints.minDomains field.

What happens if we reenable the feature if it was previously rolled back?

Scheduling of new Pods is affected.

Are there any tests for feature enablement/disablement?

No. We have only done the manual testing described under "Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?".

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

It shouldn't impact already running workloads. It's an opt-in feature, and users need to set pod.spec.topologySpreadConstraints.minDomains field to use this feature.

When this feature is disabled by the feature flag, the pod.spec.topologySpreadConstraints.minDomains field of already created Pods is preserved, but the same field on newly created Pods is silently dropped.
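The preserve-or-drop behavior described above can be sketched as follows. This is an illustration modeled only on that description, not the actual kube-apiserver code; spreadConstraint, minDomainsInUse, and dropDisabledMinDomains are hypothetical names.

```go
package main

// spreadConstraint is a hypothetical, trimmed-down constraint type that
// carries only the MinDomains field.
type spreadConstraint struct {
	MinDomains *int32
}

// minDomainsInUse reports whether any constraint already sets MinDomains.
func minDomainsInUse(cs []spreadConstraint) bool {
	for _, c := range cs {
		if c.MinDomains != nil {
			return true
		}
	}
	return false
}

// dropDisabledMinDomains clears MinDomains on the incoming constraints when
// the feature gate is off, unless the existing object (oldCs) already uses
// the field. For a newly created Pod, oldCs is nil, so the field is dropped.
func dropDisabledMinDomains(newCs, oldCs []spreadConstraint, gateEnabled bool) {
	if gateEnabled || minDomainsInUse(oldCs) {
		return
	}
	for i := range newCs {
		newCs[i].MinDomains = nil
	}
}
```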

What specific metrics should inform a rollback?
  • A spike in the metric schedule_attempts_total{result="error|unschedulable"} when pods using this feature are added.
  • A spike in the metric plugin_execution_duration_seconds{plugin="PodTopologySpread"} or scheduling_algorithm_duration_seconds when pods using this feature are added.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Yes. The behavior is changed as expected.

Test scenario:

  1. Start kube-apiserver v1.24, where the MinDomains feature is disabled.
  2. Create three nodes and pods spread across the nodes as 2/2/1.
  3. Create a new Pod that has a TopologySpreadConstraint with maxSkew 1, topologyKey kubernetes.io/hostname, and minDomains 4 (larger than the number of domains (= 3)).
  4. The Pod created in (3) is scheduled because MinDomains is disabled.
  5. Delete the Pod created in (3).
  6. Recreate kube-apiserver at v1.25, where the MinDomains feature is enabled.
  7. Create the same Pod as in (3).
  8. The Pod created in (7) isn't scheduled because MinDomains is enabled and minDomains is larger than the number of domains (= 3).
  9. Delete the Pod created in (7).
  10. Recreate kube-apiserver at v1.24, where the MinDomains feature is disabled.
  11. Create the same Pod as in (3).
  12. The Pod created in (11) is scheduled because MinDomains is disabled.
  13. Delete the Pod created in (11).
  14. Recreate kube-apiserver at v1.25, where the MinDomains feature is enabled.
  15. Create the same Pod as in (3).
  16. The Pod created in (15) isn't scheduled because MinDomains is enabled and minDomains is larger than the number of domains (= 3).
  17. Delete the Pod created in (15).
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

The operator can query for pods that have the pod.spec.topologySpreadConstraints.minDomains field set. After adopting minDomains in some Pods, they can also confirm that minDomains affects scheduling by observing an increase in plugin_evaluation_total{plugin="PodTopologySpread",extension_point="Filter"}.

How can someone using this feature know that it is working for their instance?
  • Other (treat as last resort)
    • Details: The MinDomains feature in the Pod Topology Spread plugin doesn't produce any logs, events, or pod status updates of its own. If a Pod using pod.spec.topologySpreadConstraints.minDomains is successfully assigned a Node, nodeName will be updated. If not, the PodScheduled condition will be false and an event will be recorded with a detailed message describing the reason, including the failed filters. (The Pod Topology Spread plugin could be one of them.)
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
  • Metric plugin_execution_duration_seconds{plugin="PodTopologySpread"} <= 100ms at the 90th percentile.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Component exposing the metric: kube-scheduler
      • Metric name: plugin_execution_duration_seconds{plugin="PodTopologySpread"}
      • Metric name: schedule_attempts_total{result="error|unschedulable"}
Are there any missing metrics that would be useful to have to improve observability of this feature?

No.

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

Yes.

  • API type(s): Pod
  • Estimated increase in size: the new field pod.spec.topologySpreadConstraints[*].minDomains adds about 4 bytes (int32)
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No. No performance degradation of the scheduler is expected.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

The scheduler has to process the MinDomains parameter, which may result in a small increase in CPU usage.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

The feature isn't affected because the Pod Topology Spread plugin doesn't communicate with kube-apiserver or etcd during the Filter phase.

What are other known failure modes?

N/A

What steps should be taken if SLOs are not being met to determine the problem?
  • Check plugin_execution_duration_seconds{plugin="PodTopologySpread"} to see if latency increased.
    • In this case, the metric directly indicates that the feature is slow.
    • You should stop using MinDomains in your Pods and may need to disable the MinDomains feature with the feature flag MinDomainsInPodTopologySpread.
  • Check schedule_attempts_total{result="error|unschedulable"} to see if the number of attempts increased.
    • In this case, your use of MinDomains may be incorrect or not appropriate for your cluster.

Implementation History

  • 2021-11-02: Initial KEP sent for review
  • 2022-01-14: Initial KEP is merged.
  • 2022-03-16: The implementation PRs are merged.
  • 2022-05-03: The MinDomains feature is released as an alpha feature in the Kubernetes v1.24 release.
  • 2022-06-23: The KEP is updated to move the MinDomains feature to beta in the Kubernetes v1.25 release.
  • 2022-07-16: The feature gate is changed to be enabled by default.
  • 2024-01-15: The KEP is updated to move the MinDomains feature to GA in the Kubernetes v1.30 release.

Drawbacks

Alternatives

Support minDomains in ScheduleAnyway as well

When the number of domains with matching topology keys is less than minDomains and whenUnsatisfiable equals ScheduleAnyway, Pod Topology Spread will give low scores to Nodes in domains that already hold maxSkew or more matching Pods.

In Pod Topology Spread, the higher the score from Score, the lower the normalized score calculated by NormalizeScore. So, Pod Topology Spread should give high scores to non-preferred Nodes in Score.

When the number of domains with matching topology keys is less than minDomains, Pod Topology Spread doubles the score for the constraint in Score (so that the normalized score will be lower) if this criterion is met:

('existing matching num' + 'if self-match (1 or 0)' - 'global min matching num') > 'maxSkew'
  • existing matching num denotes the number of current existing matching Pods on the domain.
  • if self-match denotes whether the labels of the Pod match the selector of the constraint.
  • global min matching num denotes the minimum number of matching Pods.

We decided not to support minDomains with ScheduleAnyway for the following reasons:

  • To support this, we need to calculate the number of domains with matching topology keys and the minimum number of matching Pods in preScore, as in preFilter, so that Pod Topology Spread can decide how to evaluate Nodes based on them.

    This extra calculation may affect the performance of preScore, because the current preScore only sees Nodes that have passed the Filter, but to calculate these values Pod Topology Spread needs to see all Nodes (including Nodes that haven't passed the Filter).

  • minDomains is supported mainly for the above user story, which relies on the cluster autoscaler.

    The scoring results of the scheduler don't affect the cluster-autoscaler. So, it is not worth supporting this at the cost of performance degradation.

Infrastructure Needed (Optional)