- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
The pod topology spread feature allows users to define the group of pods over which spreading is applied using a `LabelSelector`. This means the user should know the exact label key and value when defining the pod spec.

This KEP proposes a complementary field to `LabelSelector` named `MatchLabelKeys` in `TopologySpreadConstraint`, which represents a set of label keys only. The scheduler will use those keys to look up label values from the incoming pod; those key-value labels are ANDed with `LabelSelector` to identify the group of existing pods over which the spreading skew will be calculated.
The main use case this new way of identifying pods enables is constraining the skew calculation to the revision level in Deployments during rolling upgrades.
PodTopologySpread is widely used in production environments, especially in service-type workloads that employ Deployments. However, it currently has a limitation that manifests during rolling updates and causes the deployment to end up out of balance (98215, 105661, k8s-pod-topology spread is not respected after rollout).
The root cause is that PodTopologySpread constraints allow defining a key-value label selector, which applies to all pods in a Deployment irrespective of their owning ReplicaSet. As a result, when a new revision is rolled out, spreading applies across pods from both the old and new ReplicaSets, and so by the time the new ReplicaSet is completely rolled out and the old one is scaled down, the actual spreading we are left with may not match expectations, because the deleted pods from the older ReplicaSet leave a skewed distribution for the remaining pods.
Currently, users are given two solutions to this problem. The first is to add a revision label to the Deployment and update it manually at each rolling upgrade (both the label on the podTemplate and the selector in the podTopologySpread constraint); the second is to deploy a descheduler to re-balance the pod distribution. The former isn't user friendly and requires manual tuning, which is error prone, while the latter requires installing and maintaining an extra controller. This KEP proposes a native way to maintain pod balance after a rolling upgrade in Deployments that use PodTopologySpread.
- Allow users to define PodTopologySpread constraints such that they apply only within the boundaries of a Deployment revision during rolling upgrades.
When users apply a rolling update to a deployment that uses PodTopologySpread, the spread should be respected only within the new revision, not across all revisions of the deployment.
In most scenarios, users can use the label keyed with `pod-template-hash` added automatically by the Deployment controller to distinguish between different revisions in a single Deployment. But for more complex scenarios (e.g., a topology spread associating two deployments at the same time), users are responsible for providing common labels to identify which pods should be grouped.

In addition to using `pod-template-hash` added by the Deployment controller, users can also provide a customized key in `MatchLabelKeys` to identify which pods should be grouped. If so, the user needs to ensure that it is correct and not duplicated with other unrelated workloads.
A new field named `MatchLabelKeys` will be added to `TopologySpreadConstraint`.
Currently, when scheduling a pod, the `LabelSelector` defined in the pod is used to identify the group of pods over which spreading will be calculated. `MatchLabelKeys` adds another constraint to how this group of pods is identified: the scheduler will use those keys to look up label values from the incoming pod, and those key-value labels are ANDed with `LabelSelector` to select the group of existing pods over which spreading will be calculated.
A new field named `MatchLabelKeys` will be introduced to `TopologySpreadConstraint`:
```go
type TopologySpreadConstraint struct {
	MaxSkew           int32
	TopologyKey       string
	WhenUnsatisfiable UnsatisfiableConstraintAction
	LabelSelector     *metav1.LabelSelector

	// MatchLabelKeys is a set of pod label keys to select the pods over which
	// spreading will be calculated. The keys are used to lookup values from the
	// incoming pod labels, those key-value labels are ANDed with `LabelSelector`
	// to select the group of existing pods over which spreading will be calculated
	// for the incoming pod. Keys that don't exist in the incoming pod labels will
	// be ignored.
	MatchLabelKeys []string
}
```
Examples of use are as follows:

```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  matchLabelKeys:
  - app
  - pod-template-hash
```
The scheduler plugin `PodTopologySpread` will obtain the labels from the pod labels by the keys in `matchLabelKeys`. The obtained labels will be merged into the `labelSelector` of `topologySpreadConstraints` to filter and group pods. The pods belonging to the same group will be part of the spreading in `PodTopologySpread`.
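To illustrate the lookup-and-merge behavior, here is a simplified, self-contained Go sketch. This is not the actual plugin code: selectors are modeled as plain exact-match label maps rather than `metav1.LabelSelector`, and the function names are hypothetical.

```go
package main

import "fmt"

// mergeMatchLabelKeys looks up each key in matchLabelKeys on the incoming
// pod's labels and ANDs the resulting key-value pairs into the selector.
// Keys missing from the incoming pod's labels are ignored.
func mergeMatchLabelKeys(selector map[string]string, matchLabelKeys []string, podLabels map[string]string) map[string]string {
	merged := make(map[string]string, len(selector)+len(matchLabelKeys))
	for k, v := range selector {
		merged[k] = v
	}
	for _, key := range matchLabelKeys {
		if value, ok := podLabels[key]; ok {
			merged[key] = value
		}
	}
	return merged
}

// matches reports whether an existing pod's labels satisfy the merged selector.
func matches(selector, podLabels map[string]string) bool {
	for k, v := range selector {
		if podLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	// Incoming pod carries the Deployment-controller-added pod-template-hash.
	incoming := map[string]string{"foo": "bar", "pod-template-hash": "abc123"}
	selector := mergeMatchLabelKeys(map[string]string{"foo": "bar"}, []string{"pod-template-hash"}, incoming)

	samePod := map[string]string{"foo": "bar", "pod-template-hash": "abc123"}
	oldPod := map[string]string{"foo": "bar", "pod-template-hash": "zzz999"}
	fmt.Println(matches(selector, samePod)) // true: same revision is counted
	fmt.Println(matches(selector, oldPod))  // false: old ReplicaSet is excluded
}
```

Under this sketch, pods from an older ReplicaSet carry a different `pod-template-hash` value and therefore fall outside the merged selector, which is exactly why spreading is confined to the new revision.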
Finally, the feature will be guarded by a new feature flag. If the feature is disabled, the field `matchLabelKeys` is preserved if it was already set in the persisted Pod object; otherwise it is silently dropped. Moreover, kube-scheduler will ignore the field and continue to behave as before.
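The drop-on-disable behavior described above can be sketched as follows. This is a hypothetical simplification of the kube-apiserver field-dropping pattern: the types and the `gateEnabled` parameter are stand-ins, not the real API machinery.

```go
package main

import "fmt"

// Minimal stand-in types; the real API types live in k8s.io/api.
type TopologySpreadConstraint struct {
	MatchLabelKeys []string
}

type PodSpec struct {
	TopologySpreadConstraints []TopologySpreadConstraint
}

// dropDisabledMatchLabelKeys clears MatchLabelKeys when the feature gate is
// off, unless the old (persisted) spec already used the field, in which case
// it is preserved.
func dropDisabledMatchLabelKeys(newSpec, oldSpec *PodSpec, gateEnabled bool) {
	if gateEnabled || matchLabelKeysInUse(oldSpec) {
		return
	}
	for i := range newSpec.TopologySpreadConstraints {
		newSpec.TopologySpreadConstraints[i].MatchLabelKeys = nil
	}
}

func matchLabelKeysInUse(spec *PodSpec) bool {
	if spec == nil {
		return false
	}
	for _, c := range spec.TopologySpreadConstraints {
		if len(c.MatchLabelKeys) > 0 {
			return true
		}
	}
	return false
}

func main() {
	// New pod, gate off, no prior object: the field is silently dropped.
	created := &PodSpec{TopologySpreadConstraints: []TopologySpreadConstraint{{MatchLabelKeys: []string{"pod-template-hash"}}}}
	dropDisabledMatchLabelKeys(created, nil, false)
	fmt.Println(len(created.TopologySpreadConstraints[0].MatchLabelKeys)) // 0: dropped

	// Update to a pod that already had the field persisted: it is preserved.
	old := &PodSpec{TopologySpreadConstraints: []TopologySpreadConstraint{{MatchLabelKeys: []string{"pod-template-hash"}}}}
	updated := &PodSpec{TopologySpreadConstraints: []TopologySpreadConstraint{{MatchLabelKeys: []string{"pod-template-hash"}}}}
	dropDisabledMatchLabelKeys(updated, old, false)
	fmt.Println(len(updated.TopologySpreadConstraints[0].MatchLabelKeys)) // 1: preserved
}
```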
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
- `k8s.io/kubernetes/pkg/scheduler/framework/plugins/podtopologyspread`: `06-07` - `86%`
- `k8s.io/kubernetes/pkg/scheduler/framework/plugins/podtopologyspread/plugin.go`: `06-07` - `73.1%`
These cases will be added in the existing integration tests:
- Feature gate enable/disable tests
- `MatchLabelKeys` in `TopologySpreadConstraint` works as expected
- Verify no significant performance degradation
- `k8s.io/kubernetes/test/integration/scheduler/filters/filters_test.go`: https://storage.googleapis.com/k8s-triage/index.html?test=TestPodTopologySpreadFilter
- `k8s.io/kubernetes/test/integration/scheduler/scoring/priorities_test.go`: https://storage.googleapis.com/k8s-triage/index.html?test=TestPodTopologySpreadScoring
- `k8s.io/kubernetes/test/integration/scheduler_perf/scheduler_perf_test.go`: https://storage.googleapis.com/k8s-triage/index.html?test=BenchmarkPerfScheduling
These cases will be added in the existing e2e tests:
- Feature gate enable/disable tests
- `MatchLabelKeys` in `TopologySpreadConstraint` works as expected
- `k8s.io/kubernetes/test/e2e/scheduling/predicates.go`: https://storage.googleapis.com/k8s-triage/index.html?sig=scheduling
- `k8s.io/kubernetes/test/e2e/scheduling/priorities.go`: https://storage.googleapis.com/k8s-triage/index.html?sig=scheduling
- Feature implemented behind feature gate.
- Unit and integration tests passed as designed in TestPlan.
- Feature is enabled by default
- Benchmark tests passed, and there is no performance degradation.
- Update documents to reflect the changes.
- No negative feedback.
- Update documents to reflect the changes.
In the event of an upgrade, kube-apiserver will start to accept and store the field `MatchLabelKeys`.
In the event of a downgrade, kube-scheduler will ignore `MatchLabelKeys` even if it was set.
N/A
- Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: `MatchLabelKeysInPodTopologySpread`
  - Components depending on the feature gate: `kube-scheduler`, `kube-apiserver`
No.
The feature can be disabled in Alpha and Beta versions by restarting kube-apiserver and kube-scheduler with the feature gate off. One caveat is that pods that used the feature will continue to have the `MatchLabelKeys` field set even after the feature gate is disabled; however, kube-scheduler will not take the field into account.
Newly created pods need to follow this policy when scheduling. Old pods will not be affected.
No. Unit tests exercising the feature gate switch itself will be added.
It won't impact already running workloads because it is an opt-in feature in scheduler. But during a rolling upgrade, if some apiservers have not enabled the feature, they will not be able to accept and store the field "MatchLabelKeys" and the pods associated with these apiservers will not be able to use this feature. As a result, pods belonging to the same deployment may have different scheduling outcomes.
- If the metric `schedule_attempts_total{result="error|unschedulable"}` increased significantly after pods using this feature are added.
- If the metric `plugin_execution_duration_seconds{plugin="PodTopologySpread"}` increased to higher than 100ms at the 90th percentile after pods using this feature are added.
Yes, it was tested manually by following the steps below, and it was working as intended.
- Create a v1.26 Kubernetes cluster with 3 nodes where the `MatchLabelKeysInPodTopologySpread` feature is disabled.
- Deploy a Deployment with this YAML:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 12
  selector:
    matchLabels:
      foo: bar
  template:
    metadata:
      labels:
        foo: bar
    spec:
      restartPolicy: Always
      containers:
      - name: nginx
        image: nginx:1.14.2
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            foo: bar
        matchLabelKeys:
        - pod-template-hash
```
- Pods spread across nodes as 4/4/4.
- Update the deployment nginx image to `nginx:1.15.0`.
- Pods spread across nodes as 5/4/3.
- Delete the deployment nginx.
- Upgrade the Kubernetes cluster to v1.27 (at the master branch) with `MatchLabelKeysInPodTopologySpread` enabled.
- Deploy a deployment nginx like step 2.
- Pods spread across nodes as 4/4/4.
- Update the deployment nginx image to `nginx:1.15.0`.
- Pods spread across nodes as 4/4/4.
- Delete the deployment nginx.
- Downgrade the Kubernetes cluster to v1.26 with the `MatchLabelKeysInPodTopologySpread` feature enabled.
- Deploy a deployment nginx like step 2.
- Pods spread across nodes as 4/4/4.
- Update the deployment nginx image to `nginx:1.15.0`.
- Pods spread across nodes as 4/4/4.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
An operator can query pods that have the `pod.spec.topologySpreadConstraints.matchLabelKeys` field set to determine if the feature is in use by workloads.
- Other (treat as last resort)
  - Details: We can determine if this feature is being used by checking Deployments that have only `MatchLabelKeys` set in `TopologySpreadConstraint` and no `LabelSelector`. These Deployments will strictly adhere to TopologySpread after both deployment and rolling upgrades if the feature is being used.
Metric `plugin_execution_duration_seconds{plugin="PodTopologySpread"}` <= 100ms at the 90th percentile.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name: `plugin_execution_duration_seconds{plugin="PodTopologySpread"}`
  - Metric name: `schedule_attempts_total{result="error|unschedulable"}`
  - Component exposing the metric: kube-scheduler
Are there any missing metrics that would be useful to have to improve observability of this feature?
Yes. It would be helpful to have metrics showing which plugins affect the scheduler's decisions in the Filter/Score phases. There is a related issue: kubernetes/kubernetes#110643. It is a large effort and still in progress.
No.
No.
No.
No.
No.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Yes. There is additional work: the scheduler will use the keys in `matchLabelKeys` to look up label values from the pod and AND them with `LabelSelector`.
This may result in a very small increase in scheduling latency, which directly contributes to the pod-startup-latency SLO.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
If the API server and/or etcd is not available, this feature will not be available. This is because the scheduler needs to update the scheduling results to the pod via the API server/etcd.
N/A
- Check the metric `plugin_execution_duration_seconds{plugin="PodTopologySpread"}` to determine if the latency has increased. If it has, this feature may have increased scheduling latency. You can disable the feature `MatchLabelKeysInPodTopologySpread` to see if it's the cause of the increased latency.
- Check the metric `schedule_attempts_total{result="error|unschedulable"}` to determine if the number of failed attempts has increased. If it has, determine the cause of the failure from the pod's events. If it's caused by the plugin `PodTopologySpread`, you can analyze the problem further by looking at the scheduler log.
- 2022-03-17: Initial KEP
- 2022-06-08: KEP merged
- 2023-01-16: Graduate to Beta
Use `pod.generateName` to distinguish new/old pods that belong to different revisions of the same workload in the scheduler plugin. It was decided not to support this for the following reason: the scheduler needs to remain general-purpose, and scheduler plugins shouldn't give special treatment to any particular labels/fields.