
Node affinity for even spread of pods across multiple availability zones #68981

Closed
raravena80 opened this issue Sep 22, 2018 · 25 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@raravena80

/kind feature
/sig scheduling

What happened:

Currently, nodeAffinity allows you to select multiple availability zones to schedule your pods onto, but there's no guarantee that you will end up with a pod in each selected availability zone. Or, if the algorithm inherently does this, it doesn't seem to be documented.

More information here:

https://stackoverflow.com/questions/52457455/multizone-kubernetes-cluster-and-affinity-how-to-distribute-application-per-zon

What you expected to happen:

To have nodeAffinity support spreading your pods evenly across different availability zones.

Perhaps a key like this:

    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
            - key: kubernetes.io/e2e-az-name
              operator: In
              values:
              - e2e-az1
              - e2e-az2
            spreadPodsEvenly: "true" # <=== HERE (proposed field)

How to reproduce it (as minimally and precisely as possible):

Standard NodeAffinity case:

    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
            - key: kubernetes.io/e2e-az-name
              operator: In
              values:
              - e2e-az1
              - e2e-az2

In this case there is no guarantee that the scheduler will spread the pods like this (with either the required or preferred method):

  • 4 pods:

    • 2 pods in e2e-az1
    • 2 pods in e2e-az2
  • 5 pods:

    • 3 pods in e2e-az1
    • 2 pods in e2e-az2

Because of that, you could end up in a situation like this:

  • 4 pods

    • 3 pods in e2e-az1
    • 1 pod in e2e-az2
  • 4 pods

    • 0 pods in e2e-az1
    • 4 pods in e2e-az2

Or just the podAntiAffinity approach described here:

    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: component
              operator: In
              values:
              - app
          topologyKey: failure-domain.beta.kubernetes.io/zone

There's a chicken-and-egg problem when doing re-deploys: it works for the initial deployment, but it doesn't work in subsequent deployments because pods with the same label already exist in the given availability zone.

Anything else we need to know?:

This could also be a feature in podAntiAffinity that allows you to spread pods across different availability zones.

Environment:

  • Kubernetes version (use kubectl version): latest
  • Cloud provider or hardware configuration: AWS, GCP, Azure, etc.
  • OS (e.g. from /etc/os-release): Any
  • Kernel (e.g. uname -a): Any that supports k8s.
  • Install tools:
  • Others:
@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Sep 22, 2018
@achalshant

Hey, I'd like to work on this. What are my next steps?

@embano1
Member

embano1 commented Oct 2, 2018

Please see the following issues/KEP for similar questions. It's probably not that easy to implement, as day-2 operations (scale up/down, rolling updates) also have to be considered here, and the scheduler is not always involved (e.g. when deleting some replicas during scale-down), which IMHO is based on a timestamp (could be wrong, did not check the code).

#4301
#40358
#41394
kubernetes/community#2045

@Huang-Wei
Member

@raravena80 as you and @embano1 mentioned, "rolling update" is the key issue here:

There's a chicken-and-egg problem when doing re-deploys: it works for the initial deployment, but it doesn't work in subsequent deployments because pods with the same label already exist in the given availability zone.

If rolling updates are not a concern for you, maybe the Recreate strategy type is a better fit:

apiVersion: apps/v1
kind: Deployment
...
spec:
...
  strategy:
    type: Recreate
...

For the case where the number of replicas equals the number of topology domains (e.g. 2 replicas in 2 zones), I'm pretty confident that Recreate + {nodeAffinity|podAntiAffinity} + requiredDuringSchedulingIgnoredDuringExecution would work (see the sketch below).

But if the number of replicas does not equal the number of topology domains (e.g. 4 replicas in 2 zones), we have to use {nodeAffinity|podAntiAffinity} + preferredDuringSchedulingIgnoredDuringExecution. In this case, I'm not sure if or how existing replicas in a topology domain weigh in during the scheduler's Prioritize phase. Maybe it ends up with 3 pods / 1 pod, or 4 pods / 0 pods, or evenly split. Please give it a try and let me know.
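For reference, a minimal sketch of the Recreate + podAntiAffinity + requiredDuringSchedulingIgnoredDuringExecution combination described above, assuming 2 replicas in 2 zones; the deployment name, labels, and image are illustrative placeholders, not taken from this thread:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp                  # hypothetical name, for illustration only
    spec:
      replicas: 2                  # assumes replicas == number of zones
      strategy:
        type: Recreate             # sidesteps the rolling-update chicken-and-egg problem
      selector:
        matchLabels:
          component: app
      template:
        metadata:
          labels:
            component: app
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: component
                    operator: In
                    values:
                    - app
                topologyKey: failure-domain.beta.kubernetes.io/zone
          containers:
          - name: app
            image: myapp:latest    # placeholder image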

@raravena80
Author

Agree, I think Recreate is a good workaround if the cluster operator is not concerned about rolling updates.

@moonek
Contributor

moonek commented Oct 29, 2018

This feature is highly needed for production-grade operation.
As far as I'm aware, there is currently no way to guarantee zone spread in situations where frequent rolling updates and HPA are mixed.
I don't want to recreate the service while it is in operation.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 27, 2019
@GMartinez-Sisti

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 27, 2019
@ErikLundJensen

A workaround is to set the number of pods to at least 3 in the ReplicaSet. That way you will get at least one pod in each zone, also when using rolling deployments.

@sbutt

sbutt commented Mar 6, 2019

A workaround is to set the number of pods to at least 3 in the ReplicaSet. That way you will get at least one pod in each zone, also when using rolling deployments.

How does that guarantee that pods are scheduled in different availability zones?

@ErikLundJensen

ErikLundJensen commented Mar 6, 2019

See my description in:
#56539 (comment)

It still does not solve the issue when scaling down pods; however, with this solution you can do a rolling deployment where at least one pod keeps running in each zone.

@rfrink

rfrink commented May 2, 2019

Has anybody deployed Kafka across different availability zones? I'm getting some sort of socket connectivity errors at startup across the different datacenters, even though the path seems to be open for all ZooKeeper and Kafka sockets. Thanks!

@Huang-Wei
Member

FYI: we're developing an alpha feature called "Even Pods Spread". Hopefully that will resolve this issue. The earliest available release is 1.15.

KEP: even-pods-spreading.md
Development Issue: #77284

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 1, 2019
@raravena80
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 1, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 30, 2019
@raravena80
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 30, 2019
@zedtux

zedtux commented Dec 15, 2019

Any news on this please?

@Huang-Wei
Member

@zedtux You can check out this: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/

@zedtux

zedtux commented Dec 16, 2019

Thank you @Huang-Wei, looks good. I'll give it a try after upgrading my cluster from 1.15.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 15, 2020
@raravena80
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 15, 2020
@zedtux

zedtux commented Mar 16, 2020

/remove-lifecycle stale

@Huang-Wei
Member

@raravena80 I think this issue can be resolved by using the PodTopologySpread feature (beta in 1.18): you can define a NodeAffinity/NodeSelector spec along with TopologySpreadConstraints, so the Pods can be scheduled in an absolutely even manner (maxSkew=1) or a relatively even manner (maxSkew >= 2). A sketch follows below.
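A minimal sketch of that combination, using the zone topology key and zone names from earlier in this thread; the pod name, labels, and image are illustrative placeholders, not taken from this thread:

    apiVersion: v1
    kind: Pod
    metadata:
      name: myapp                  # hypothetical name, for illustration only
      labels:
        component: app
    spec:
      topologySpreadConstraints:
      - maxSkew: 1                 # absolutely even spread across zones
        topologyKey: failure-domain.beta.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            component: app
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/e2e-az-name
                operator: In
                values:
                - e2e-az1
                - e2e-az2
      containers:
      - name: app
        image: myapp:latest        # placeholder image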

@raravena80
Author

@Huang-Wei yes that works. I'll close this. Thanks!

@zedtux

zedtux commented Apr 8, 2020

https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/
