- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
The pod topology spread feature allows users to define the group of pods over which spreading is applied using a `LabelSelector`. This means the user should know the exact label key and value when defining the pod spec.

This KEP proposes a complementary field to `LabelSelector` named `MatchLabelKeys` in `TopologySpreadConstraint`, which represents a set of label keys only. The scheduler will use those keys to look up label values from the incoming pod; those key-value labels are ANDed with `LabelSelector` to identify the group of existing pods over which the spreading skew will be calculated.
The main use case this new way of identifying pods enables is constraining the skew calculation to the revision level in Deployments during rolling upgrades.
PodTopologySpread is widely used in production environments, especially in service-type workloads that employ Deployments. However, it currently has a limitation that manifests during rolling updates and causes the deployment to end up out of balance (98215, 105661, k8s-pod-topology spread is not respected after rollout).
The root cause is that PodTopologySpread constraints allow defining a key-value label selector, which applies to all pods in a Deployment irrespective of their owning ReplicaSet. As a result, when a new revision is rolled out, spreading applies across pods from both the old and new ReplicaSets, and so by the time the new ReplicaSet is completely rolled out and the old one is scaled down, the actual spreading we are left with may not match expectations, because the deleted pods from the older ReplicaSet leave a skewed distribution for the remaining pods.
Currently, users are given two solutions to this problem. The first is to add a revision label to the Deployment and update it manually at each rolling upgrade (both the label on the podTemplate and the selector in the podTopologySpread constraint); the second is to deploy a descheduler to re-balance the pod distribution. The former isn't user friendly and requires manual tuning, which is error prone, while the latter requires installing and maintaining an extra controller. This KEP proposes a native way to maintain pod balance after a rolling upgrade in Deployments that use PodTopologySpread.
- Allow users to define PodTopologySpread constraints such that they apply only within the boundaries of a Deployment revision during rolling upgrades.
When users apply a rolling update to a deployment that uses PodTopologySpread, the spread should be respected only within the new revision, not across all revisions of the deployment.
In most scenarios, users can use the label keyed with `pod-template-hash` added automatically by the Deployment controller to distinguish between different revisions in a single Deployment. But for more complex scenarios (e.g., a topology spread associating two deployments at the same time), users are responsible for providing common labels to identify which pods should be grouped.

In addition to using `pod-template-hash` added by the Deployment controller, users can also provide a customized key in `MatchLabelKeys` to identify which pods should be grouped. If so, the user needs to ensure that it is correct and not duplicated with other unrelated workloads.
A new field named `MatchLabelKeys` will be added to `TopologySpreadConstraint`.
Currently, when scheduling a pod, the `LabelSelector` defined in the pod is used to identify the group of pods over which spreading will be calculated. `MatchLabelKeys` adds another constraint to how this group of pods is identified: the scheduler will use those keys to look up label values from the incoming pod, and those key-value labels are ANDed with `LabelSelector` to select the group of existing pods over which spreading will be calculated.
A new field named `MatchLabelKeys` will be introduced to `TopologySpreadConstraint`:
```go
type TopologySpreadConstraint struct {
	MaxSkew           int32
	TopologyKey       string
	WhenUnsatisfiable UnsatisfiableConstraintAction
	LabelSelector     *metav1.LabelSelector

	// MatchLabelKeys is a set of pod label keys to select the pods over which
	// spreading will be calculated. The keys are used to lookup values from the
	// incoming pod labels, those key-value labels are ANDed with `LabelSelector`
	// to select the group of existing pods over which spreading will be calculated
	// for the incoming pod. Keys that don't exist in the incoming pod labels will
	// be ignored.
	MatchLabelKeys []string
}
```
Examples of use are as follows:

```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  matchLabelKeys:
  - app
  - pod-template-hash
```
The scheduler plugin `PodTopologySpread` will obtain the labels from the pod labels by the keys in `matchLabelKeys`. The obtained labels will be merged into the `labelSelector` of `topologySpreadConstraints` to filter and group pods. The pods belonging to the same group will be part of the spreading in `PodTopologySpread`.
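To illustrate the lookup-and-merge behavior, here is a simplified, self-contained Go sketch. This is not the actual plugin code: selectors are modeled as plain exact-match label maps rather than `metav1.LabelSelector`, and the function names are hypothetical.

```go
package main

import "fmt"

// mergeMatchLabelKeys looks up each key in matchLabelKeys on the incoming
// pod's labels and ANDs the resulting key-value pairs into the selector.
// Keys missing from the incoming pod's labels are ignored.
func mergeMatchLabelKeys(selector map[string]string, matchLabelKeys []string, podLabels map[string]string) map[string]string {
	merged := make(map[string]string, len(selector)+len(matchLabelKeys))
	for k, v := range selector {
		merged[k] = v
	}
	for _, key := range matchLabelKeys {
		if value, ok := podLabels[key]; ok {
			merged[key] = value
		}
	}
	return merged
}

// matches reports whether an existing pod's labels satisfy the merged selector.
func matches(selector, podLabels map[string]string) bool {
	for k, v := range selector {
		if podLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	// Incoming pod carries the Deployment-controller-added pod-template-hash.
	incoming := map[string]string{"foo": "bar", "pod-template-hash": "abc123"}
	selector := mergeMatchLabelKeys(map[string]string{"foo": "bar"}, []string{"pod-template-hash"}, incoming)

	samePod := map[string]string{"foo": "bar", "pod-template-hash": "abc123"}
	oldPod := map[string]string{"foo": "bar", "pod-template-hash": "zzz999"}
	fmt.Println(matches(selector, samePod)) // true: same revision is counted
	fmt.Println(matches(selector, oldPod))  // false: old ReplicaSet is excluded
}
```

Under this sketch, pods from an older ReplicaSet carry a different `pod-template-hash` value and therefore fall outside the merged selector, which is exactly why spreading is confined to the new revision.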
Finally, the feature will be guarded by a new feature flag. If the feature is disabled, the field `matchLabelKeys` is preserved if it was already set in the persisted Pod object; otherwise it is silently dropped. Moreover, kube-scheduler will ignore the field and continue to behave as before.
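The drop-on-disable behavior described above can be sketched as follows. This is a hypothetical simplification of the kube-apiserver field-dropping pattern: the types and the `gateEnabled` parameter are stand-ins, not the real API machinery.

```go
package main

import "fmt"

// Minimal stand-in types; the real API types live in k8s.io/api.
type TopologySpreadConstraint struct {
	MatchLabelKeys []string
}

type PodSpec struct {
	TopologySpreadConstraints []TopologySpreadConstraint
}

// dropDisabledMatchLabelKeys clears MatchLabelKeys when the feature gate is
// off, unless the old (persisted) spec already used the field, in which case
// it is preserved.
func dropDisabledMatchLabelKeys(newSpec, oldSpec *PodSpec, gateEnabled bool) {
	if gateEnabled || matchLabelKeysInUse(oldSpec) {
		return
	}
	for i := range newSpec.TopologySpreadConstraints {
		newSpec.TopologySpreadConstraints[i].MatchLabelKeys = nil
	}
}

func matchLabelKeysInUse(spec *PodSpec) bool {
	if spec == nil {
		return false
	}
	for _, c := range spec.TopologySpreadConstraints {
		if len(c.MatchLabelKeys) > 0 {
			return true
		}
	}
	return false
}

func main() {
	// New pod, gate off, no prior object: the field is silently dropped.
	created := &PodSpec{TopologySpreadConstraints: []TopologySpreadConstraint{{MatchLabelKeys: []string{"pod-template-hash"}}}}
	dropDisabledMatchLabelKeys(created, nil, false)
	fmt.Println(len(created.TopologySpreadConstraints[0].MatchLabelKeys)) // 0: dropped

	// Update to a pod that already had the field persisted: it is preserved.
	old := &PodSpec{TopologySpreadConstraints: []TopologySpreadConstraint{{MatchLabelKeys: []string{"pod-template-hash"}}}}
	updated := &PodSpec{TopologySpreadConstraints: []TopologySpreadConstraint{{MatchLabelKeys: []string{"pod-template-hash"}}}}
	dropDisabledMatchLabelKeys(updated, old, false)
	fmt.Println(len(updated.TopologySpreadConstraints[0].MatchLabelKeys)) // 1: preserved
}
```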
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
- `k8s.io/kubernetes/pkg/scheduler/framework/plugins/podtopologyspread`: `06-07` - `86%`
- `k8s.io/kubernetes/pkg/scheduler/framework/plugins/podtopologyspread/plugin.go`: `06-07` - `73.1%`
These cases will be added in the existing integration tests:
- Feature gate enable/disable tests
- `MatchLabelKeys` in `TopologySpreadConstraint` works as expected
- Verify no significant performance degradation
- `k8s.io/kubernetes/test/integration/scheduler/filters/filters_test.go`: https://storage.googleapis.com/k8s-triage/index.html?test=TestPodTopologySpreadFilter
- `k8s.io/kubernetes/test/integration/scheduler/scoring/priorities_test.go`: https://storage.googleapis.com/k8s-triage/index.html?test=TestPodTopologySpreadScoring
- `k8s.io/kubernetes/test/integration/scheduler_perf/scheduler_perf_test.go`: https://storage.googleapis.com/k8s-triage/index.html?test=BenchmarkPerfScheduling
These cases will be added in the existing e2e tests:
- Feature gate enable/disable tests
- `MatchLabelKeys` in `TopologySpreadConstraint` works as expected
- `k8s.io/kubernetes/test/e2e/scheduling/predicates.go`: https://storage.googleapis.com/k8s-triage/index.html?sig=scheduling
- `k8s.io/kubernetes/test/e2e/scheduling/priorities.go`: https://storage.googleapis.com/k8s-triage/index.html?sig=scheduling
- Feature implemented behind feature gate.
- Unit and integration tests passed as designed in TestPlan.
- Feature is enabled by default
- Benchmark tests passed, and there is no performance degradation.
- Update documents to reflect the changes.
- No negative feedback.
- Update documents to reflect the changes.
In the event of an upgrade, kube-apiserver will start to accept and store the field `MatchLabelKeys`.
In the event of a downgrade, kube-scheduler will ignore `MatchLabelKeys` even if it was set.
N/A
- Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: `MatchLabelKeysInPodTopologySpread`
  - Components depending on the feature gate: `kube-scheduler`, `kube-apiserver`
No.
The feature can be disabled in Alpha and Beta versions by restarting kube-apiserver and kube-scheduler with the feature gate off. One caveat is that pods that used the feature will continue to have the `MatchLabelKeys` field set even after the feature gate is disabled; however, kube-scheduler will not take the field into account.
Newly created pods need to follow this policy when scheduling. Old pods will not be affected.
No. Unit tests exercising the feature gate switch itself will be added.
It won't impact already running workloads because it is an opt-in feature in scheduler. But during a rolling upgrade, if some apiservers have not enabled the feature, they will not be able to accept and store the field "MatchLabelKeys" and the pods associated with these apiservers will not be able to use this feature. As a result, pods belonging to the same deployment may have different scheduling outcomes.
- If the metric `schedule_attempts_total{result="error|unschedulable"}` increased significantly after pods using this feature are added.
- If the metric `plugin_execution_duration_seconds{plugin="PodTopologySpread"}` increased to higher than 100ms at the 90th percentile after pods using this feature are added.
Yes, it was tested manually by following the steps below, and it was working as intended.
- Create a v1.26 Kubernetes cluster with 3 nodes where the `MatchLabelKeysInPodTopologySpread` feature is disabled.
- Deploy a Deployment with this YAML:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 12
  selector:
    matchLabels:
      foo: bar
  template:
    metadata:
      labels:
        foo: bar
    spec:
      restartPolicy: Always
      containers:
      - name: nginx
        image: nginx:1.14.2
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            foo: bar
        matchLabelKeys:
        - pod-template-hash
```
- Pods spread across nodes as 4/4/4.
- Update the deployment nginx image to `nginx:1.15.0`.
- Pods spread across nodes as 5/4/3.
- Delete the deployment nginx.
- Upgrade the Kubernetes cluster to v1.27 (at the master branch) with `MatchLabelKeysInPodTopologySpread` enabled.
- Deploy a deployment nginx like step 2.
- Pods spread across nodes as 4/4/4.
- Update the deployment nginx image to `nginx:1.15.0`.
- Pods spread across nodes as 4/4/4.
- Delete the deployment nginx.
- Downgrade the Kubernetes cluster to v1.26 with the `MatchLabelKeysInPodTopologySpread` feature enabled.
- Deploy a deployment nginx like step 2.
- Pods spread across nodes as 4/4/4.
- Update the deployment nginx image to `nginx:1.15.0`.
- Pods spread across nodes as 4/4/4.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
An operator can query pods that have the `pod.spec.topologySpreadConstraints.matchLabelKeys` field set to determine if the feature is in use by workloads.
- Other (treat as last resort)
  - Details: We can determine if this feature is being used by checking Deployments that have only `MatchLabelKeys` set in `TopologySpreadConstraint` and no `LabelSelector`. These Deployments will strictly adhere to TopologySpread after both deployment and rolling upgrades if the feature is being used.
Metric `plugin_execution_duration_seconds{plugin="PodTopologySpread"}` <= 100ms at the 90th percentile.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name: `plugin_execution_duration_seconds{plugin="PodTopologySpread"}`
  - Metric name: `schedule_attempts_total{result="error|unschedulable"}`
  - Component exposing the metric: kube-scheduler
Are there any missing metrics that would be useful to have to improve observability of this feature?
Yes. It would be helpful to have metrics showing which plugins affect the scheduler's decisions in the Filter/Score phases. There is a related issue: kubernetes/kubernetes#110643. It is a large effort and still in progress.
No.
No.
No.
No.
No.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Yes. There is additional work: the scheduler will use the keys in `matchLabelKeys` to look up label values from the pod and AND them with `LabelSelector`.
This may result in a very small increase in scheduling latency, which directly contributes to the pod-startup-latency SLO.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
If the API server and/or etcd is not available, this feature will not be available. This is because the scheduler needs to update the scheduling results to the pod via the API server/etcd.
N/A
- Check the metric `plugin_execution_duration_seconds{plugin="PodTopologySpread"}` to determine if the latency has increased. If it has, this feature may have increased scheduling latency. You can disable the feature `MatchLabelKeysInPodTopologySpread` to see if it's the cause of the increased latency.
- Check the metric `schedule_attempts_total{result="error|unschedulable"}` to determine if the number of failed attempts has increased. If it has, determine the cause of the failure from the pod's events. If it's caused by the plugin `PodTopologySpread`, you can analyze the problem further by looking at the scheduler log.
- 2022-03-17: Initial KEP
- 2022-06-08: KEP merged
- 2023-01-16: Graduate to Beta
Use `pod.generateName` to distinguish new/old pods that belong to different revisions of the same workload in the scheduler plugin. It was decided not to support this for the following reason: the scheduler needs to remain general-purpose, and scheduler plugins shouldn't give special treatment to any particular labels/fields.