- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Alternatives
- Implementation History
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- (R) Graduation criteria is in place
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
By default, pod affinity/anti-affinity constraints are calculated against
the pods in the same namespace. The Spec allows users to expand that via
the PodAffinityTerm.Namespaces list.
This KEP proposes two related features:
- Adding NamespaceSelector to PodAffinityTerm so that users can specify the set of namespaces using a label selector.
- Adding a new quota scope named CrossNamespaceAffinity that allows operators to limit which namespaces are allowed to have pods that use affinity/anti-affinity across namespaces.
The pod affinity/anti-affinity API allows using this feature across namespaces via a static list, which works if the user knows the namespace names in advance.
However, there are cases where the namespaces are not known beforehand, for which there is no way to use pod affinity/anti-affinity. Allowing users to specify the set of namespaces using a namespace selector addresses this problem.
Since NamespaceSelector makes cross-namespace pod affinity easier to use, giving operators a knob to control it is important to limit the potential abuse of this feature, as described in the risks section.
- Allow users to dynamically select the set of namespaces considered when using pod affinity/anti-affinity
- Allow limiting which namespaces can have pods with cross namespace pod affinity
I am running a SaaS service where the workload for each customer is placed in a separate namespace. The workloads require 1:1 pod-to-node placement. I want to use pod anti-affinity across all customer namespaces to achieve that, as sketched below.
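A minimal sketch of such a constraint, assuming (hypothetically) that every customer namespace carries a tenant: customer label and that the workload pods are labeled app: saas-workload:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: workload-0
  namespace: customer-a
  labels:
    app: saas-workload
spec:
  containers:
  - name: app
    image: registry.example.com/saas-workload:latest # hypothetical image
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: saas-workload
        # Hypothetical label assumed to be set on every customer namespace.
        namespaceSelector:
          matchLabels:
            tenant: customer
        # Enforces at most one matching pod per node across all selected namespaces.
        topologyKey: kubernetes.io/hostname
```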
Using a namespace selector will make it easier for users to specify affinity constraints across a large number of namespaces. The initial implementation of pod affinity/anti-affinity suffered from performance challenges; however, over releases 1.17 - 1.20 we significantly improved it.
We currently have integration and clusterloader benchmarks that evaluate the extreme cases of all pods having affinity/anti-affinity constraints to/against each other. Those benchmarks show that the scheduler is able to achieve maximum throughput (i.e., the kube-apiserver QPS limit).
NamespaceSelector will allow users to select all namespaces (see previous discussion here). This may cause a security concern: a pod with an anti-affinity constraint can block pods from all other namespaces from getting scheduled in a failure domain.
We will address this concern by introducing a new quota scope named CrossNamespaceAffinity
that operators can use to limit which namespaces are allowed to have pods with affinity terms
that set the existing namespaces field or the proposed namespaceSelector field.
Using this new scope, operators can prevent certain namespaces (foo-ns in the example below)
from having pods that use cross-namespace pod affinity by creating a resource quota object in
that namespace with CrossNamespaceAffinity scope and hard limit of 0:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: disable-cross-namespace-affinity
  namespace: foo-ns
spec:
  hard:
    pods: "0"
  scopeSelector:
    matchExpressions:
    - scopeName: CrossNamespaceAffinity
      operator: Exists
```
If operators want to disallow using namespaces and namespaceSelector by default, and
only allow it for specific namespaces, they could configure CrossNamespaceAffinity
as a limited resource by setting the kube-apiserver flag --admission-control-config-file
to the path of the following configuration file:
```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: "ResourceQuota"
  configuration:
    apiVersion: apiserver.config.k8s.io/v1
    kind: ResourceQuotaConfiguration
    limitedResources:
    - resource: pods
      matchScopes:
      - scopeName: CrossNamespaceAffinity
        operator: Exists
```
With the above configuration, pods can use namespaces and namespaceSelector only
if the namespace where they are created has a resource quota object with the
CrossNamespaceAffinity scope and a hard limit equal to the number of pods that are
allowed to use those fields.
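For illustration, a sketch of what such an allowing quota could look like in a hypothetical namespace bar-ns that is permitted to run up to ten pods using these fields:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: allow-cross-namespace-affinity
  namespace: bar-ns
spec:
  hard:
    # Up to 10 pods in this namespace may set namespaces/namespaceSelector
    # in their affinity terms.
    pods: "10"
  scopeSelector:
    matchExpressions:
    - scopeName: CrossNamespaceAffinity
      operator: Exists
```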
For more protection, admission webhooks such as Gatekeeper can be used to further restrict the use of this field.
We are aware of two Kubernetes projects that could be impacted by this change: the descheduler and the cluster autoscaler. The cluster autoscaler should automatically consume the change since it imports the scheduler code; the descheduler, however, does not and needs to be changed to take this feature into account. We will open an issue to inform that project about this update.
Add NamespaceSelector field to PodAffinityTerm:
```go
type PodAffinityTerm struct {
	// A label query over the set of namespaces that the term applies to.
	// The term is applied to the union of the namespaces selected by this field
	// and the ones listed in the namespaces field.
	// nil selector and empty namespaces list means "this pod's namespace"
	// An empty selector ({}) means all namespaces.
	NamespaceSelector *metav1.LabelSelector
}
```
As indicated in the comment, the scheduler will consider the union of the namespaces
specified in the existing Namespaces field and the ones selected by the new
NamespaceSelector field. NamespaceSelector is ignored when set to nil.
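For illustration, a minimal sketch of a pod that combines both fields (the namespace names and the team: storage label are hypothetical); the term applies to the union of team-b and every namespace labeled team: storage:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
  namespace: team-a
spec:
  containers:
  - name: web
    image: registry.example.com/web:latest # hypothetical image
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: cache
        # Statically listed namespaces...
        namespaces:
        - team-b
        # ...unioned with all namespaces selected by this (hypothetical) label.
        namespaceSelector:
          matchLabels:
            team: storage
        topologyKey: topology.kubernetes.io/zone
```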
We will do two precomputations at PreFilter/PreScore:
- The names of the namespaces selected by the NamespaceSelector are computed. This set will be used by Filter/Score to match against existing pods' namespaces.
- A snapshot of the labels of the incoming pod's namespace is taken. This will be used to match against the anti-affinity constraints of existing pods.
The precomputations are necessary for:
- Performance.
- Ensuring consistent behavior if namespace labels are added/removed during the scheduling cycle of a pod.
Finally, the feature will be guarded by a new feature flag. If the feature is
disabled, the field NamespaceSelector is preserved if it was already set in
the persisted Pod object; otherwise it is silently dropped. Moreover, kube-scheduler
will ignore the field and continue to behave as before.
With regard to adding the CrossNamespaceAffinity quota scope, the one design aspect
worth noting is that it will be rolled out over multiple releases, similarly to NamespaceSelector:
when the feature is disabled, the new value will be tolerated in updates of objects
already containing the new value, but will not be allowed to be added on create or update.
- Unit and integration tests covering:
- core changes
- correctness for namespace addition/removal; specifically, label updates should be taken into account, but not in the middle of a scheduling cycle
- feature gate enabled/disabled
- Benchmark Tests:
- evaluate performance for the case where the selector matches a large number of pods
in a large number of namespaces. The evaluation shows that using NamespaceSelector has no
impact on performance, summarized as follows:
- compares the affinity performance of a workload that puts all pods in one namespace (without a namespace selector) against splitting them across 100 namespaces and using a namespace selector
- tests both required and preferred, and for each affinity and anti-affinity
- measures the performance (latency and throughput) of scheduling 1000 pods on 5k nodes with 5k existing pods (4k in case of required anti-affinity)
- (see kubernetes/kubernetes#101329 for details).
- Benchmark tests showing no performance problems
- No user complaints regarding performance/correctness.
- Still no complaints regarding performance.
- Allowing time for feedback
In the event of an upgrade, kube-apiserver will start accepting NamespaceSelector and the new CrossNamespaceAffinity quota scope.
In the event of a downgrade, kube-scheduler will ignore NamespaceSelector even if it was set.
N/A
This section must be completed when targeting alpha to a release.
- How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
- Feature gate name: PodAffinityNamespaceSelector
- Components depending on the feature gate: kube-scheduler, kube-apiserver
- Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control plane?
- Will enabling / disabling the feature require downtime or reprovisioning of a node?
- Does enabling the feature change any default behavior? No.
- Can the feature be disabled once it has been enabled (i.e., can we roll back the enablement)? Yes. One caveat is that pods that used the feature will continue to have the NamespaceSelector field set even after disabling; however, kube-scheduler will not take the field into account.
- What happens if we reenable the feature if it was previously rolled back? It should continue to work as expected.
- Are there any tests for feature enablement/disablement? Yes, unit tests exist.
This section must be completed when targeting beta graduation to a release.
- How can a rollout fail? Can it impact already running workloads? It shouldn't impact already running workloads. This is an opt-in feature since users need to explicitly set the NamespaceSelector parameter in the pod spec. If the feature is disabled, the field is preserved if it was already set in the persisted pod object; otherwise it is silently dropped.
- What specific metrics should inform a rollback?
- A spike in the metric schedule_attempts_total{result="error|unschedulable"} when pods using this feature are added.
- A spike in plugin_execution_duration_seconds{plugin="InterPodAffinity"}.
- Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? Manually tested successfully.
- Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? No.
This section must be completed when targeting beta graduation to a release.
- How can an operator determine if the feature is in use by workloads? The operator can query pods with the NamespaceSelector field set in pod affinity terms.
- How can someone using this feature know that it is working for their instance?
- Other (treat as last resort)
- Details: inter-pod affinity as a feature doesn't trigger pod status updates on its own; none of the scheduler's filters/scores do. If a pod using affinity was successfully assigned a node, nodeName will be updated; if not, the PodScheduled condition will be false and an event will be recorded with a detailed message describing the reason, including the failed filters (inter-pod affinity could be one of them).
- What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Component exposing the metric: kube-scheduler
- Metric name: pod_scheduling_duration_seconds
- Metric name: plugin_execution_duration_seconds{plugin="InterPodAffinity"}
- Metric name: schedule_attempts_total{result="error|unschedulable"}
- Other (treat as last resort)
- Details:
- What are the reasonable SLOs (Service Level Objectives) for the above SLIs?
- 99% of pod scheduling latency is within x minutes
- 99% of InterPodAffinity plugin executions are within x milliseconds
- x% of schedule_attempts_total are successful
- Are there any missing metrics that would be useful to have to improve observability of this feature? No.
This section must be completed when targeting beta graduation to a release.
- Does this feature depend on any specific services running in the cluster? No
For alpha, this section is encouraged: reviewers should consider these questions and attempt to answer them.
For beta, this section is required: reviewers must answer these questions.
For GA, this section is required: approvers should be able to confirm the previous answers based on experience in the field.
- Will enabling / using this feature result in any new API calls? No.
- Will enabling / using this feature result in introducing new API types? No.
- Will enabling / using this feature result in any new calls to the cloud provider? No.
- Will enabling / using this feature result in increasing size or count of the existing API objects? Yes, if users set the NamespaceSelector field.
- Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? It may impact scheduling latency if the feature is used.
- Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? If the NamespaceSelector field is set, the scheduler will have to process it, which will result in some increase in CPU usage.
The Troubleshooting section currently serves the Playbook role. We may consider
splitting it into a dedicated Playbook document (potentially with some monitoring
details). For now, we leave it here.
This section must be completed when targeting beta graduation to a release.
- How does this feature react if the API server and/or etcd is unavailable?
Running workloads will not be impacted, but pods that are not scheduled yet will not get assigned nodes.
- What are other known failure modes? N/A
- What steps should be taken if SLOs are not being met to determine the problem? Check plugin_execution_duration_seconds{plugin="InterPodAffinity"} to see if latency has increased. Note that latency increases with the number of existing pods.
Another alternative is to limit the API to "all namespaces" using a dedicated flag or a special token in the Namespaces list, like "*" (see here for previous discussion).
While this limits the API surface, it makes the API slightly messy and does not support use cases where only a select set of namespaces needs to be considered. Moreover, a label selector is consistent with how pods are selected in the same API.
- 2021-01-11: Initial KEP sent for review
- 2021-02-10: Remove the restriction on empty namespace selector
- 2021-04-26: Graduate the feature to Beta
- 2022-01-08: Graduate the feature to Stable