
Node affinity for even spread of pods across multiple availability zones #68981

Closed
raravena80 opened this issue Sep 22, 2018 · 25 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@raravena80

/kind feature
/sig scheduling

What happened:

Currently, nodeAffinity allows you to select multiple availability zones to schedule your pods onto, but there's no guarantee that you will end up with a pod in each selected availability zone. Or, if the algorithm inherently does this, it doesn't seem to be documented.

More information here:

https://stackoverflow.com/questions/52457455/multizone-kubernetes-cluster-and-affinity-how-to-distribute-application-per-zon

What you expected to happen:

To have nodeAffinity support spreading your pods evenly across different availability zones.

Perhaps a key like this:

    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
            - key: kubernetes.io/e2e-az-name
              operator: In
              values:
              - e2e-az1
              - e2e-az2
            spreadPodsEvenly: "true" # <=== HERE (proposed field)

How to reproduce it (as minimally and precisely as possible):

Standard NodeAffinity case:

    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
            - key: kubernetes.io/e2e-az-name
              operator: In
              values:
              - e2e-az1
              - e2e-az2

In this case there is no guarantee that the scheduler will spread the pods like this (with either the required or preferred method):

  • 4 pods:

    • 2 pods in e2e-az1
    • 2 pods in e2e-az2
  • 5 pods:

    • 3 pods in e2e-az1
    • 2 pods in e2e-az2

Because of that, you could end up in a situation like this:

  • 4 pods

    • 3 pods in e2e-az1
    • 1 pod in e2e-az2
  • 4 pods

    • 0 pods in e2e-az1
    • 4 pods in e2e-az2

Or just the podAntiAffinity approach described here:

    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: component
              operator: In
              values:
              - app
          topologyKey: failure-domain.beta.kubernetes.io/zone

There's a chicken-and-egg problem when doing re-deploys: it works for the initial deployment, but it doesn't work in subsequent deployments because pods with the same label already exist in the given availability zone.

Anything else we need to know?:

This could also be a feature in podAntiAffinity that allows you to spread pods across different availability zones.

Environment:

  • Kubernetes version (use kubectl version): latest
  • Cloud provider or hardware configuration: AWS, GCP, Azure, etc.
  • OS (e.g. from /etc/os-release): Any
  • Kernel (e.g. uname -a): Any that supports k8s.
  • Install tools:
  • Others:
@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Sep 22, 2018
@achalshant

Hey, I'd like to work on this. What are my next steps?

@embano1
Member

embano1 commented Oct 2, 2018

Please see the following issues/KEP for similar questions. It's probably not that easy to implement, as day-2 operations (scale up/down, rolling updates) also have to be considered here, and the scheduler is not always involved (e.g. when deleting some replicas during scale-down), which IMHO is based on a timestamp (could be wrong, did not check the code).

#4301
#40358
#41394
kubernetes/community#2045

@Huang-Wei
Member

@raravena80 as you and @embano1 mentioned, "rolling update" is the key issue here:

There's a chicken-and-egg problem when doing re-deploys: it works for the initial deployment, but it doesn't work in subsequent deployments because pods with the same label already exist in the given availability zone.

If rolling updates are not a concern for you, maybe the Recreate strategy type is a better fit:

apiVersion: apps/v1
kind: Deployment
...
spec:
...
  strategy:
    type: Recreate
...

For the case where the number of replicas equals the number of topology domains (e.g. 2 replicas in 2 zones), I'm pretty confident that Recreate + {nodeAffinity|podAntiAffinity} + requiredDuringSchedulingIgnoredDuringExecution would work (see the sketch below).

But if the number of replicas does not equal the number of topology domains (e.g. 4 replicas in 2 zones), we have to use {nodeAffinity|podAntiAffinity} + preferredDuringSchedulingIgnoredDuringExecution. In this case, I'm not sure if or how existing replicas in a topology domain weigh in during the scheduler's Prioritize phase. Maybe it ends up with 3 pods / 1 pod, or 4 pods / 0 pods, or evenly split. Please give it a try and let me know.
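For reference, a minimal sketch of the Recreate + podAntiAffinity + requiredDuringSchedulingIgnoredDuringExecution combination described above, assuming 2 replicas in 2 zones; the deployment name, labels, and image are illustrative placeholders, not taken from this thread:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp                  # hypothetical name, for illustration only
    spec:
      replicas: 2                  # assumes replicas == number of zones
      strategy:
        type: Recreate             # sidesteps the rolling-update chicken-and-egg problem
      selector:
        matchLabels:
          component: app
      template:
        metadata:
          labels:
            component: app
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: component
                    operator: In
                    values:
                    - app
                topologyKey: failure-domain.beta.kubernetes.io/zone
          containers:
          - name: app
            image: myapp:latest    # placeholder image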

@raravena80
Author

Agree, I think Recreate is a good workaround if the cluster operator is not concerned about rolling updates.

@moonek
Contributor

moonek commented Oct 29, 2018

This feature is highly needed for production-grade operation.
As far as I'm aware, there is currently no way to guarantee zone spread in situations where frequent rolling updates and HPA are mixed.
I don't want to recreate the service while it is in operation.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 27, 2019
@GMartinez-Sisti

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 27, 2019
@ErikLundJensen

A workaround is to set the number of pods to at least 3 in the ReplicaSet. That way you will get at least one pod in each zone, also when using rolling deployments.

@sbutt

sbutt commented Mar 6, 2019

A workaround is to set the number of pods to at least 3 in the ReplicaSet. That way you will get at least one pod in each zone, also when using rolling deployments.

How does that guarantee that pods are scheduled in different availability zones?

@ErikLundJensen

ErikLundJensen commented Mar 6, 2019

See my description in:
#56539 (comment)

It still does not solve the issue when scaling down pods; however, with this solution you can do a rolling deployment where at least one pod keeps running in each zone.

@rfrink

rfrink commented May 2, 2019

Has anybody deployed Kafka across different availability zones? I'm getting some sort of socket connectivity errors at startup across the different datacenters, even though the path seems to be open for all ZooKeeper and Kafka sockets. Thanks!

@Huang-Wei
Member

FYI: we're developing an alpha feature called "Even Pods Spread". Hopefully that will resolve this issue. The earliest available release is 1.15.

KEP: even-pods-spreading.md
Development Issue: #77284

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 1, 2019
@raravena80
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 1, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 30, 2019
@raravena80
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 30, 2019
@zedtux

zedtux commented Dec 15, 2019

Any news on this please?

@Huang-Wei
Member

@zedtux You can check out this: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/

@zedtux

zedtux commented Dec 16, 2019

Thank you @Huang-Wei, looks good. I'll give it a try after upgrading my cluster from 1.15.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 15, 2020
@raravena80
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 15, 2020
@zedtux

zedtux commented Mar 16, 2020

/remove-lifecycle stale

@Huang-Wei
Member

@raravena80 I think this issue can be resolved by using the PodTopologySpread feature (beta in 1.18): you can define a NodeAffinity/NodeSelector spec along with TopologySpreadConstraints, so the Pods can be scheduled in an absolutely even manner (maxSkew=1) or a relatively even manner (maxSkew >= 2). A sketch follows below.
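A minimal sketch of that combination, using the zone topology key and zone names from earlier in this thread; the pod name, labels, and image are illustrative placeholders, not taken from this thread:

    apiVersion: v1
    kind: Pod
    metadata:
      name: myapp                  # hypothetical name, for illustration only
      labels:
        component: app
    spec:
      topologySpreadConstraints:
      - maxSkew: 1                 # absolutely even spread across zones
        topologyKey: failure-domain.beta.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            component: app
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/e2e-az-name
                operator: In
                values:
                - e2e-az1
                - e2e-az2
      containers:
      - name: app
        image: myapp:latest        # placeholder image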

@raravena80
Author

@Huang-Wei yes that works. I'll close this. Thanks!

@zedtux

zedtux commented Apr 8, 2020

https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/
