- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
A new field minDomains is introduced to PodSpec.TopologySpreadConstraint[*] to limit
the minimum number of topology domains.
minDomains can be used only when whenUnsatisfiable=DoNotSchedule.
Pod Topology Spread has a maxSkew parameter, which controls the degree to which Pods may be unevenly distributed.
But, there isn't a way to control the number of domains over which we should spread.
In some cases, users want to force spreading Pods over a minimum number of domains and, if there aren't enough already present, make the cluster-autoscaler provision them.
- Users can specify minDomains to limit the number of domains when using WhenUnsatisfiable=DoNotSchedule.
- Add new field to limit the maximum number of topology domains.
- Users can use it in a best-effort manner with WhenUnsatisfiable=ScheduleAnyway.
I am using cluster autoscaler and I want to force spreading a deployment over at least 5 Nodes.
Users can define a minimum number of domains with the minDomains parameter.
This parameter only applies when whenUnsatisfiable=DoNotSchedule.
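The restriction above, together with the rule from the API comment that valid values are integers greater than 0, can be sketched as a validation function. This is a minimal sketch and not the actual kube-apiserver validation code; the `constraint` type and `validateMinDomains` function are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// constraint is a hypothetical, trimmed-down stand-in for
// TopologySpreadConstraint; only the fields needed here are included.
type constraint struct {
	MinDomains        *int32
	WhenUnsatisfiable string
}

// validateMinDomains applies the two rules described in this KEP:
// a non-nil minDomains must be greater than 0, and it may only be
// combined with whenUnsatisfiable=DoNotSchedule.
func validateMinDomains(c constraint) error {
	if c.MinDomains == nil {
		return nil // an unset field is always valid
	}
	if *c.MinDomains <= 0 {
		return errors.New("minDomains must be greater than 0")
	}
	if c.WhenUnsatisfiable != "DoNotSchedule" {
		return errors.New("minDomains requires whenUnsatisfiable=DoNotSchedule")
	}
	return nil
}

func main() {
	five := int32(5)
	// Accepted: minDomains together with DoNotSchedule.
	fmt.Println(validateMinDomains(constraint{MinDomains: &five, WhenUnsatisfiable: "DoNotSchedule"}))
	// Rejected: minDomains together with ScheduleAnyway.
	fmt.Println(validateMinDomains(constraint{MinDomains: &five, WhenUnsatisfiable: "ScheduleAnyway"}))
}
```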
Pod Topology Spread has the semantics of "global minimum", which means the minimum number of pods that match the label selector in a topology domain.
However, the global minimum is only calculated for the nodes that exist and match the node affinity. In other words, if a topology domain was scaled down to zero (for example, because of low utilization), this topology domain is unknown to the scheduler, thus it's not considered in the global minimum calculations.
The new minDomains field can help with this problem.
When the number of domains with matching topology keys is less than minDomains,
Pod Topology Spread treats "global minimum" as 0; otherwise, "global minimum"
is equal to the minimum number of matching pods on a domain.
As a result, when the number of domains is less than minDomains, the scheduler doesn't schedule a matching Pod to Nodes in domains that already have maxSkew or more matching Pods.
minDomains is an optional parameter. If minDomains is nil, the constraint behaves as if MinDomains is equal to 1.
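The rule above can be written compactly. The notation is not part of the KEP and is introduced here only for illustration: D is the set of eligible domains known to the scheduler for the topology key, count(d) is the number of matching Pods in domain d, and selfMatch is 1 if the incoming Pod matches the constraint's selector and 0 otherwise.

```latex
\text{globalMin} =
\begin{cases}
0, & \text{if } |D| < \text{minDomains},\\[4pt]
\min\limits_{d \in D} \text{count}(d), & \text{otherwise.}
\end{cases}
\qquad
\text{If } |D| < \text{minDomains}: \quad
\text{count}(d) + \text{selfMatch} \le \text{maxSkew}
```

With the default minDomains = 1, the first case can apply only when no eligible domain exists, so the behavior for existing constraints is unchanged.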
A new optional parameter called MinDomains is introduced to PodSpec.TopologySpreadConstraint[*].
type TopologySpreadConstraint struct {
......
// MinDomains indicates a minimum number of eligible domains.
// When the number of eligible domains with matching topology keys is less than minDomains,
// Pod Topology Spread treats "global minimum" as 0, and then the calculation of Skew is performed.
// And when the number of eligible domains with matching topology keys is equal to or greater than minDomains,
// this value has no effect on scheduling.
// As a result, when the number of eligible domains is less than minDomains,
// scheduler won't schedule more than maxSkew Pods to those domains.
// If value is nil, the constraint behaves as if MinDomains is equal to 1.
// Valid values are integers greater than 0.
// When value is not nil, WhenUnsatisfiable must be DoNotSchedule.
//
// For example, in a 3-zone cluster, MaxSkew is set to 2, MinDomains is set to 5 and pods with the same
// labelSelector spread as 2/2/2:
// +-------+-------+-------+
// | zone1 | zone2 | zone3 |
// +-------+-------+-------+
// |  P P  |  P P  |  P P  |
// +-------+-------+-------+
// The number of domains is less than 5(MinDomains), so "global minimum" is treated as 0.
// In this situation, new pod with the same labelSelector cannot be scheduled,
// because computed skew will be 3(3 - 0) if new Pod is scheduled to any of the three zones,
// it will violate MaxSkew.
//
// This is an alpha field and requires enabling MinDomainsInPodTopologySpread feature gate.
// +optional
MinDomains *int32
}
In the Filter of Pod Topology Spread, the current filtering criterion is
('existing matching num' + 'if self-match (1 or 0)' - 'global min matching num') <= 'maxSkew'
- existing matching num denotes the number of existing matching Pods on the domain.
- if self-match denotes whether the labels of the Pod match the selector of the constraint.
- global min matching num denotes the minimum number of matching Pods.
For whenUnsatisfiable: DoNotSchedule, Pod Topology Spread will treat global min matching num as 0
when the number of domains with matching topology keys is less than minDomains.
We can calculate the number of domains with matching topology keys in PreFilter, along with the calculation of TpPairToMatchNum.
This extra calculation doesn't increase the complexity of the preFilter logic.
Pod Topology Spread will be able to use the number of domains to determine the value of global min matching num when we calculate filtering criteria.
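The interaction between the domain count gathered in PreFilter and the Filter criterion can be sketched as follows. This is a simplified illustration and not the plugin's actual code: `preFilterState`, `globalMin`, and `fits` are hypothetical names, and a single constraint is assumed. The example data is the 2/2/2 three-zone case from the API comment above.

```go
package main

import "fmt"

// preFilterState is a hypothetical, simplified version of the data
// gathered in PreFilter for one constraint: the number of matching
// Pods per topology domain.
type preFilterState struct {
	matchNum map[string]int // domain -> number of matching Pods
}

// globalMin returns the "global min matching num". When fewer than
// minDomains domains are known, it is treated as 0.
func (s preFilterState) globalMin(minDomains int) int {
	if len(s.matchNum) < minDomains {
		return 0
	}
	lowest := -1
	for _, n := range s.matchNum {
		if lowest == -1 || n < lowest {
			lowest = n
		}
	}
	if lowest == -1 {
		return 0 // no domains at all
	}
	return lowest
}

// fits evaluates the Filter criterion for placing one more Pod
// on the given domain.
func (s preFilterState) fits(domain string, selfMatch bool, maxSkew, minDomains int) bool {
	self := 0
	if selfMatch {
		self = 1
	}
	return s.matchNum[domain]+self-s.globalMin(minDomains) <= maxSkew
}

func main() {
	s := preFilterState{matchNum: map[string]int{"zone1": 2, "zone2": 2, "zone3": 2}}
	fmt.Println(s.fits("zone1", true, 2, 5)) // false: skew is 3 - 0 = 3 > 2
	fmt.Println(s.fits("zone1", true, 2, 1)) // true: skew is 3 - 2 = 1 <= 2
}
```

Counting domains is just the size of the per-domain map that PreFilter already builds, which is why the extra bookkeeping is cheap.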
Users can set MinDomains and whenUnsatisfiable: DoNotSchedule to achieve it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      foo: bar
  replicas: 10
  template:
    metadata:
      labels:
        foo: bar
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
      topologySpreadConstraints:
      - maxSkew: 2
        minDomains: 5
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            foo: bar

Consider the case where there are 3 Nodes that Pods can be scheduled to.
6 Pods will be scheduled to those Nodes, and the remaining 4 Pods can only be scheduled when 2 more Nodes join the cluster.
With this flow, the deployment will be spread over at least 5 Nodes while respecting the maxSkew constraint.
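The numbers in this story can be checked with a small simulation. This is only a sketch under simplifying assumptions: every Pod matches the selector, each Node is its own domain, and Pods are placed greedily on the least-loaded Node that passes the Filter criterion (the real scheduler chooses among feasible Nodes by scoring). The `schedule` function is a hypothetical helper, not scheduler code.

```go
package main

import "fmt"

// schedule greedily places up to `pending` Pods onto the given Nodes,
// applying the Filter criterion with self-match = 1. It returns the
// number of Pods placed. counts maps Node name -> matching Pods and is
// updated in place.
func schedule(counts map[string]int, pending, maxSkew, minDomains int) int {
	placed := 0
	for placed < pending {
		// The global minimum is 0 while fewer than minDomains domains exist.
		globalMin := 0
		if len(counts) >= minDomains {
			globalMin = -1
			for _, n := range counts {
				if globalMin == -1 || n < globalMin {
					globalMin = n
				}
			}
			if globalMin == -1 {
				globalMin = 0 // no Nodes at all
			}
		}
		// Pick the least-loaded Node that passes the Filter criterion.
		target := ""
		for name, n := range counts {
			if n+1-globalMin <= maxSkew && (target == "" || n < counts[target]) {
				target = name
			}
		}
		if target == "" {
			break // every Node is filtered out; remaining Pods stay pending
		}
		counts[target]++
		placed++
	}
	return placed
}

func main() {
	nodes := map[string]int{"node1": 0, "node2": 0, "node3": 0}
	first := schedule(nodes, 10, 2, 5)
	fmt.Println("placed with 3 Nodes:", first) // 6

	// Two more Nodes join, for example provisioned by the cluster autoscaler.
	nodes["node4"], nodes["node5"] = 0, 0
	fmt.Println("placed after scale-up:", schedule(nodes, 10-first, 2, 5)) // 4
}
```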
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
To ensure this feature is rolled out with high quality, the following tests are mandatory:
- Unit Tests: All core changes must be covered by unit tests.
- Integration Tests / E2E Tests: Tests to ensure the behavior of this feature must be covered by either integration tests or e2e tests.
- Benchmark Tests: We can tolerate a slight performance overhead for users of this feature, but it shouldn't impose a penalty on users who are not using it. We will verify this by designing some benchmark tests.
- k8s.io/kubernetes/pkg/scheduler/framework/plugins/podtopologyspread: 2024-01-28 - 87.3%
- k8s.io/kubernetes/pkg/api/pod: 2024-01-28 - 74.8%
- k8s.io/kubernetes/pkg/apis/core/validation: 2024-01-28 - 83.9%
- test: https://github.com/kubernetes/kubernetes/blob/c6064489223862fe1888fcbe0656ab1087461adb/test/integration/scheduler/filters/filters_test.go#L1349
- k8s-triage: https://storage.googleapis.com/k8s-triage/index.html?sig=scheduling&test=TestPodTopologySpreadFilter
N/A. This feature doesn't introduce any new API endpoints and doesn't interact with other components, so e2e tests don't add extra value over integration tests.
- Add new parameter MinDomains to TopologySpreadConstraint and feature gating.
- Filter extension point implementation.
- Implement all tests mentioned in the Test Plan.
- This feature will be enabled by default as a Beta feature in v1.25.
- No particular issue is reported to this feature for a certain length of time.
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: MinDomainsInPodTopologySpread
  - Components depending on the feature gate: kube-scheduler, kube-apiserver
No.
The feature can be disabled in Alpha and Beta versions
by restarting kube-apiserver and kube-scheduler with feature-gate off.
In terms of Stable versions, users can choose to opt-out by not setting the
pod.spec.topologySpreadConstraints.minDomains field.
Scheduling of new Pods is affected.
No. We've only done the manual testing described in "Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?".
It shouldn't impact already running workloads. It's an opt-in feature,
and users need to set pod.spec.topologySpreadConstraints.minDomains field to use this feature.
When this feature is disabled by the feature flag, the already created Pod's pod.spec.topologySpreadConstraints.minDomains field is preserved,
but, the newly created Pod's pod.spec.topologySpreadConstraints.minDomains field is silently dropped.
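The preserve-or-drop behavior described above can be sketched as follows. This is a minimal illustration that assumes the field is cleared on write unless the stored object already uses it; the types and the `dropMinDomains` function are hypothetical and are not the actual kube-apiserver code.

```go
package main

import "fmt"

// Hypothetical, trimmed-down types used only for illustration.
type constraint struct {
	MinDomains *int32
}

type podSpec struct {
	TopologySpreadConstraints []constraint
}

// minDomainsInUse reports whether any constraint in spec sets minDomains.
func minDomainsInUse(spec *podSpec) bool {
	if spec == nil {
		return false
	}
	for _, c := range spec.TopologySpreadConstraints {
		if c.MinDomains != nil {
			return true
		}
	}
	return false
}

// dropMinDomains clears minDomains from newSpec when the feature gate is
// disabled, unless the existing object (oldSpec) already uses the field.
// oldSpec is nil when the Pod is being created.
func dropMinDomains(newSpec, oldSpec *podSpec, gateEnabled bool) {
	if gateEnabled || minDomainsInUse(oldSpec) {
		return
	}
	for i := range newSpec.TopologySpreadConstraints {
		newSpec.TopologySpreadConstraints[i].MinDomains = nil
	}
}

func main() {
	four := int32(4)

	// Newly created Pod with the gate disabled: the field is dropped.
	created := &podSpec{TopologySpreadConstraints: []constraint{{MinDomains: &four}}}
	dropMinDomains(created, nil, false)
	fmt.Println(created.TopologySpreadConstraints[0].MinDomains == nil) // true

	// Existing Pod that already uses the field: the field is preserved.
	existing := &podSpec{TopologySpreadConstraints: []constraint{{MinDomains: &four}}}
	updated := &podSpec{TopologySpreadConstraints: []constraint{{MinDomains: &four}}}
	dropMinDomains(updated, existing, false)
	fmt.Println(updated.TopologySpreadConstraints[0].MinDomains != nil) // true
}
```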
- A spike on metric schedule_attempts_total{result="error|unschedulable"} when pods using this feature are added.
- A spike on metric plugin_execution_duration_seconds{plugin="PodTopologySpread"} or scheduling_algorithm_duration_seconds when pods using this feature are added.
Yes. The behavior is changed as expected.
Test scenario:
1. start kube-apiserver v1.24 where MinDomains feature is disabled.
2. create three nodes and pods spread across nodes as 2/2/1
3. create new Pod that has a TopologySpreadConstraints: maxSkew is 1, topologyKey is kubernetes.io/hostname, and minDomains is 4 (larger than the number of domains (= 3)).
4. the Pod created in (3) is scheduled because MinDomains is disabled.
5. delete the Pod created in (3).
6. recreate kube-apiserver v1.25 where MinDomains feature is enabled.
7. create the same Pod as (3).
8. the Pod created in (7) isn't scheduled because MinDomains is enabled and minDomains is larger than the number of domains (= 3).
9. delete the Pod created in (7).
10. recreate kube-apiserver v1.24 where MinDomains feature is disabled.
11. create the same Pod as (3).
12. the Pod created in (11) is scheduled because MinDomains is disabled.
13. delete the Pod created in (11).
14. recreate kube-apiserver v1.25 where MinDomains feature is enabled.
15. create the same Pod as (3).
16. the Pod created in (15) isn't scheduled because MinDomains is enabled and minDomains is larger than the number of domains (= 3).
17. delete the Pod created in (15).
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
The operator can query pods with pod.spec.topologySpreadConstraints.minDomains field set.
After adopting minDomains in some Pods, they can confirm that minDomains affects scheduling
by observing an increase in plugin_evaluation_total{plugin="PodTopologySpread",extension_point="Filter"}.
- Other (treat as last resort)
  - Details: The MinDomains feature in the Pod Topology Spread plugin doesn't itself produce any logs, events, or pod status updates. If a Pod using pod.spec.topologySpreadConstraints.minDomains was successfully assigned a Node, nodeName will be updated. If not, the PodScheduled condition will be false and an event will be recorded with a detailed message describing the reason, including the failed filters. (Pod Topology Spread plugin could be one of them.)
- Metric plugin_execution_duration_seconds{plugin="PodTopologySpread"} <= 100ms at the 90th percentile.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Component exposing the metric: kube-scheduler
    - Metric name: plugin_execution_duration_seconds{plugin="PodTopologySpread"}
    - Metric name: schedule_attempts_total{result="error|unschedulable"}
Are there any missing metrics that would be useful to have to improve observability of this feature?
No.
No.
No.
No.
No.
Describe them, providing:
- API type(s): Pod
- Estimated increase in size: new field .Spec.topologySpreadConstraint.MinDomains, about 4 bytes (int32)
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No. No performance degradation of the scheduler is expected.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
The scheduler has to process the MinDomains parameter, which may result in a small increase in CPU usage.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
The feature isn't affected because Pod Topology Spread plugin doesn't communicate with kube-apiserver or etcd during Filter phase.
N/A
- Check plugin_execution_duration_seconds{plugin="PodTopologySpread"} to see if latency increased.
  - In this case, the metrics show that the feature itself is slow.
  - You should stop using MinDomains in your Pods and may need to disable the MinDomains feature with the feature flag MinDomainsInPodTopologySpread.
- Check schedule_attempts_total{result="error|unschedulable"} to see if the number of attempts increased.
  - In this case, your use of MinDomains may be incorrect or not appropriate for your cluster.
- 2021-11-02: Initial KEP sent for review
- 2022-01-14: Initial KEP is merged.
- 2022-03-16: The implementation PRs are merged.
- 2022-05-03: The MinDomain feature is released as alpha feature with Kubernetes v1.24 release.
- 2022-06-23: KEP is updated so that the MinDomain feature is moving to beta with Kubernetes v1.25 release.
- 2022-07-16: The feature gate is changed to be enabled by default.
- 2024-01-15: KEP is updated so that the MinDomain feature is moving to GA with Kubernetes v1.30 release.
When the number of domains with matching topology keys is less than minDomains and whenUnsatisfiable is ScheduleAnyway,
Pod Topology Spread will give low scores to Nodes in domains that already have maxSkew or more matching Pods.
In Pod Topology Spread, the higher the score from Score, the lower the normalized score calculated by NormalizeScore. So, Pod Topology Spread should give high scores to non-preferred Nodes in Score.
When the number of domains with matching topology keys is less than minDomains,
Pod Topology Spread doubles the score for the constraint in Score (so that the normalized score will be lower) if this criterion is met:
('existing matching num' + 'if self-match (1 or 0)' - 'global min matching num') > 'maxSkew'
- existing matching num denotes the number of existing matching Pods on the domain.
- if self-match denotes whether the labels of the Pod match the selector of the constraint.
- global min matching num denotes the minimum number of matching Pods.
We decided not to support minDomains with ScheduleAnyway for the following reasons:
- To support this, we need to calculate the number of domains with matching topology keys and the minimum number of matching Pods in preScore, as in preFilter, so that Pod Topology Spread can decide how to evaluate Nodes. This extra calculation may affect the performance of preScore, because the current preScore only sees Nodes that have passed the Filter, but to calculate these values Pod Topology Spread needs to see all Nodes (including Nodes that haven't passed the Filter).
- minDomains is supported mainly for the above user story, which uses the cluster autoscaler. The scheduler's scoring results don't affect the cluster-autoscaler, so it is not worth supporting at the cost of performance degradation.