KEP-2249: Namespace Selector For Pod Affinity

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

By default, pod affinity/anti-affinity constraints are calculated against the pods in the same namespace. The Spec allows users to expand that via the PodAffinityTerm.Namespaces list.

This KEP proposes two related features:

  1. Adding NamespaceSelector to PodAffinityTerm so that users can specify the set of namespaces using a label selector.
  2. Adding a new quota scope named CrossNamespaceAffinity that allows operators to limit which namespaces are allowed to have pods that use affinity/anti-affinity across namespaces.

Motivation

The pod affinity/anti-affinity API allows applying constraints across namespaces using a static list of namespace names, which works if the user knows the names in advance.

However, there are cases where the namespaces are not known beforehand, and in those cases there is currently no way to use pod affinity/anti-affinity across them. Allowing users to specify the set of namespaces using a namespace selector addresses this problem.

Since NamespaceSelector expands the ability to use cross-namespace pod affinity, giving operators a knob to control it is important to limit the potential abuse of this feature, as described in the risks section.

Goals

  • Allow users to dynamically select the set of namespaces considered when using pod affinity/anti-affinity
  • Allow limiting which namespaces can have pods with cross-namespace pod affinity

Proposal

User Stories (Optional)

Story 1

I am running a SaaS service where the workload for each customer is placed in a separate namespace. The workloads require 1:1 pod-to-node placement. I want to use pod anti-affinity across all customer namespaces to achieve that.
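
A minimal sketch of such a pod (the names, labels, and image are hypothetical): an empty namespaceSelector ({}) matches all namespaces, and topologyKey: kubernetes.io/hostname ensures at most one matching pod per node.

apiVersion: v1
kind: Pod
metadata:
  name: customer-workload        # hypothetical name
  namespace: customer-a          # hypothetical customer namespace
  labels:
    app: saas-workload
spec:
  containers:
  - name: main
    image: registry.example.com/saas-workload:latest  # hypothetical image
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: saas-workload
        # An empty selector ({}) matches all namespaces, so the constraint
        # applies across every customer namespace.
        namespaceSelector: {}
        topologyKey: kubernetes.io/hostname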

Risks and Mitigations

Performance

Using a namespace selector will make it easier for users to specify affinity constraints across a large number of namespaces. The initial implementation of pod affinity/anti-affinity suffered from performance challenges; however, over releases 1.17 to 1.20 we significantly improved it.

We currently have integration and clusterloader benchmarks that evaluate the extreme cases of all pods having affinity/anti-affinity constraints to/against each other. Those benchmarks show that the scheduler is able to achieve maximum throughput (i.e., the API server QPS limit).

Security/Abuse

NamespaceSelector will allow users to select all namespaces. This may raise a security concern: a pod with an anti-affinity constraint can block pods from all other namespaces from getting scheduled in a failure domain.

We will address this concern by introducing a new quota scope named CrossNamespaceAffinity that operators can use to limit which namespaces are allowed to have pods with affinity terms that set the existing namespaces field or the proposed namespaceSelector field.

Using this new scope, operators can prevent certain namespaces (foo-ns in the example below) from having pods that use cross-namespace pod affinity by creating a resource quota object in that namespace with the CrossNamespaceAffinity scope and a hard limit of 0:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: disable-cross-namespace-affinity
  namespace: foo-ns
spec:
  hard:
    pods: "0"
  scopeSelector:
    matchExpressions:
    - scopeName: CrossNamespaceAffinity
      operator: Exists

If operators want to disallow using namespaces and namespaceSelector by default, and only allow it for specific namespaces, they could configure CrossNamespaceAffinity as a limited resource by setting the kube-apiserver flag --admission-control-config-file to the path of the following configuration file:

apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: "ResourceQuota"
  configuration:
    apiVersion: apiserver.config.k8s.io/v1
    kind: ResourceQuotaConfiguration
    limitedResources:
    - resource: pods
      matchScopes:
      - scopeName: CrossNamespaceAffinity
        operator: Exists

With the above configuration, pods can use namespaces and namespaceSelector only if the namespace where they are created has a resource quota object with the CrossNamespaceAffinity scope and a hard limit equal to the number of pods that are allowed to use those fields.
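
For example, a quota like the following (the namespace name and limit are illustrative) would allow up to five pods in bar-ns to use cross-namespace affinity terms:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: allow-cross-namespace-affinity
  namespace: bar-ns          # illustrative namespace name
spec:
  hard:
    # At most 5 pods in this namespace may set namespaces/namespaceSelector.
    pods: "5"
  scopeSelector:
    matchExpressions:
    - scopeName: CrossNamespaceAffinity
      operator: Exists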

For more protection, admission webhooks like Gatekeeper can be used to further restrict the use of these fields.

External Dependencies

We are aware of two Kubernetes projects that could be impacted by this change: the descheduler and the cluster autoscaler. The cluster autoscaler should automatically pick up the change since it imports the scheduler code; the descheduler, however, does not, and needs to be changed to take this feature into account. We will open an issue to inform the project about this update.

Design Details

Add NamespaceSelector field to PodAffinityTerm:

type PodAffinityTerm struct {
    // A label query over the set of namespaces that the term applies to.
    // The term is applied to the union of the namespaces selected by this field
    // and the ones listed in the namespaces field.
    // A nil selector and an empty namespaces list means "this pod's namespace".
    // An empty selector ({}) matches all namespaces.
    NamespaceSelector *metav1.LabelSelector
}

As indicated in the comment, the scheduler will consider the union of the namespaces specified in the existing Namespaces field and the ones selected by the new NamespaceSelector field. NamespaceSelector is ignored when set to nil.
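
To illustrate the union semantics, a sketch (namespace names, labels, and the image are hypothetical) where the term applies to ns-b plus every namespace labeled team=payments:

apiVersion: v1
kind: Pod
metadata:
  name: example-pod          # hypothetical name
  namespace: ns-a
spec:
  containers:
  - name: main
    image: registry.k8s.io/pause:3.9
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: cache       # hypothetical label on the target pods
        # The term applies to the union of the namespaces listed here
        # and the namespaces matched by namespaceSelector.
        namespaces: ["ns-b"]
        namespaceSelector:
          matchLabels:
            team: payments
        topologyKey: topology.kubernetes.io/zone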

We will do two precomputations at PreFilter/PreScore:

  • The names of the namespaces selected by the NamespaceSelector are computed. This set will be used by Filter/Score to match against existing pods' namespaces.
  • A snapshot of the labels of the incoming pod's namespace. This will be used to match against the anti-affinity constraints of existing pods.

The precomputations are necessary for:

  • Performance.
  • Consistent behavior if namespace labels are added/removed during the scheduling cycle of a pod.

Finally, the feature will be guarded by a new feature gate. If the feature is disabled, the NamespaceSelector field is preserved if it was already set in the persisted Pod object; otherwise it is silently dropped. Moreover, kube-scheduler will ignore the field and continue to behave as before.

With regard to adding the CrossNamespaceAffinity quota scope, the one design aspect worth noting is that it will be rolled out over multiple releases, similar to NamespaceSelector: when the feature is disabled, the new value will be tolerated on updates of objects that already contain it, but will not be allowed to be added on create or update.

Test Plan

  • Unit and integration tests covering:
    • core changes
    • correctness for namespace addition/removal; specifically, label updates should be taken into account, but not in the middle of a scheduling cycle
    • feature gate enabled/disabled
  • Benchmark Tests:
    • evaluate performance for the case where the selector matches a large number of pods in a large number of namespaces. The evaluation shows that using NamespaceSelector has no impact on performance, summarized as follows:
      • compares affinity performance for a workload that puts all pods in one namespace (without a namespace selector) vs. splitting them across 100 namespaces and using a namespace selector
      • tests both required and preferred terms, for both affinity and anti-affinity
      • measures the performance (latency and throughput) of scheduling 1000 pods on 5k nodes with 5k existing pods (4k in case of required anti-affinity)
      • (see kubernetes/kubernetes#101329 for details).

Graduation Criteria

Alpha -> Beta Graduation

  • Benchmark tests showing no performance problems
  • No user complaints regarding performance/correctness.

Beta -> GA Graduation

  • Still no complaints regarding performance.
  • Allowing time for feedback

Upgrade / Downgrade Strategy

In the event of an upgrade, kube-apiserver will start accepting NamespaceSelector and the new CrossNamespaceAffinity quota scope.

In the event of a downgrade, kube-scheduler will ignore NamespaceSelector even if it was set.

Version Skew Strategy

N/A

Production Readiness Review Questionnaire

Feature Enablement and Rollback

This section must be completed when targeting alpha to a release.

  • How can this feature be enabled / disabled in a live cluster?

    • Feature gate (also fill in values in kep.yaml)
      • Feature gate name: PodAffinityNamespaceSelector
      • Components depending on the feature gate: kube-scheduler, kube-apiserver
  • Does enabling the feature change any default behavior? No.

  • Can the feature be disabled once it has been enabled (i.e., can we roll back the enablement)? Yes. One caveat is that pods that used the feature will continue to have the NamespaceSelector field set even after disabling; however, kube-scheduler will not take the field into account.

  • What happens if we reenable the feature if it was previously rolled back? It should continue to work as expected.

  • Are there any tests for feature enablement/disablement? Yes, unit tests exist.

Rollout, Upgrade and Rollback Planning

This section must be completed when targeting beta graduation to a release.

  • How can a rollout fail? Can it impact already running workloads? It shouldn't impact already running workloads. This is an opt-in feature since users need to explicitly set the NamespaceSelector field in the pod spec. If the feature is disabled, the field is preserved if it was already set in the persisted pod object; otherwise it is silently dropped.

  • What specific metrics should inform a rollback?

    • A spike in the metric schedule_attempts_total{result="error|unschedulable"} when pods using this feature are added.
    • A spike in plugin_execution_duration_seconds{plugin="InterPodAffinity"}.
  • Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? Manually tested successfully.

  • Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? No.

Monitoring Requirements

This section must be completed when targeting beta graduation to a release.

  • How can an operator determine if the feature is in use by workloads? The operator can query pods with the NamespaceSelector field set in pod affinity terms.

  • How can someone using this feature know that it is working for their instance?

  • Other (treat as last resort)
    • Details: inter-pod affinity doesn't trigger pod status updates on its own; none of the scheduler's filters/scores do. If a pod using affinity was successfully assigned a node, nodeName will be updated; if not, the PodScheduled condition will be false and an event will be recorded with a detailed message describing the reason, including the failed filters (inter-pod affinity could be one of them).
  • What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

    • Metrics
      • Component exposing the metric: kube-scheduler
        • Metric name: pod_scheduling_duration_seconds
        • Metric name: plugin_execution_duration_seconds{plugin="InterPodAffinity"}
        • Metric name: schedule_attempts_total{result="error|unschedulable"}
    • Other (treat as last resort)
      • Details:
  • What are the reasonable SLOs (Service Level Objectives) for the above SLIs?

    • 99% of pod scheduling latency is within x minutes
    • 99% of InterPodAffinity plugin executions are within x milliseconds
    • x% of schedule_attempts_total are successful
  • Are there any missing metrics that would be useful to have to improve observability of this feature? No.

Dependencies

This section must be completed when targeting beta graduation to a release.

  • Does this feature depend on any specific services running in the cluster? No

Scalability

For alpha, this section is encouraged: reviewers should consider these questions and attempt to answer them.

For beta, this section is required: reviewers must answer these questions.

For GA, this section is required: approvers should be able to confirm the previous answers based on experience in the field.

  • Will enabling / using this feature result in any new API calls? No.

  • Will enabling / using this feature result in introducing new API types? No.

  • Will enabling / using this feature result in any new calls to the cloud provider? No.

  • Will enabling / using this feature result in increasing size or count of the existing API objects? Yes, if users set the NamespaceSelector field.

  • Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? May impact scheduling latency if the feature is used.

  • Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? If the NamespaceSelector field is set, the scheduler will have to process it, which will result in some increase in CPU usage.

Troubleshooting

The Troubleshooting section currently serves the Playbook role. We may consider splitting it into a dedicated Playbook document (potentially with some monitoring details). For now, we leave it here.

This section must be completed when targeting beta graduation to a release.

  • How does this feature react if the API server and/or etcd is unavailable?

Running workloads will not be impacted, but pods that are not scheduled yet will not get assigned nodes.

  • What are other known failure modes? N/A

  • What steps should be taken if SLOs are not being met to determine the problem?

Check plugin_execution_duration_seconds{plugin="InterPodAffinity"} to see if latency has increased. Note that latency increases with the number of existing pods.

Alternatives

Another alternative is to limit the API to "all namespaces" using a dedicated flag or a special token, like "*", in the Namespaces list.

While this limits the API surface, it makes the API slightly messy and rules out use cases where only a select set of namespaces needs to be considered. Moreover, a label selector is consistent with how pods are selected in the same API.

Implementation History

  • 2021-01-11: Initial KEP sent for review
  • 2021-02-10: Remove the restriction on empty namespace selector
  • 2021-04-26: Graduate the feature to Beta
  • 2022-01-08: Graduate the feature to Stable