Node affinity for even spread of pods across multiple availability zones #68981
Comments
Hey, I'd like to work on this. What are my next steps? |
Please see the following issues/KEP for similar questions. It's probably not that easy to implement, as day-2 operations (scale up/down, rolling update) also have to be considered here, and the scheduler is not always involved (e.g. deleting some replicas during scale-down), which IMHO is based on a timestamp (could be wrong, did not check the code). |
@raravena80 as you and @embano1 mentioned, "rolling update" is the key issue here:
If rolling or not is not a concern of yours, maybe strategy type Recreate can help:

apiVersion: apps/v1
kind: Deployment
...
spec:
  ...
  strategy:
    type: Recreate
  ...

For the case that number of replicas == number of topology domains (e.g. 2 replicas in 2 zones), I'm pretty confident that {nodeAffinity|podAntiAffinity} + requiredDuringSchedulingIgnoredDuringExecution works. But if number of replicas != number of topology domains (e.g. 4 replicas in 2 zones), we have to use {nodeAffinity|podAntiAffinity} + preferredDuringSchedulingIgnoredDuringExecution. In this case, I'm not that sure if/how existing replicas in a topology domain weigh in the scheduler's Prioritize phase. Maybe it ends up with 3 pods / 1 pod, 4 pods / 0 pods, or spread evenly. Please give it a try and let me know. |
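For reference, a minimal sketch of the preferred podAntiAffinity variant mentioned above, assuming the pods carry the label app: my-app (a placeholder) and nodes carry the pre-1.17 zone label failure-domain.beta.kubernetes.io/zone:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: my-app                                   # assumed pod label
        topologyKey: failure-domain.beta.kubernetes.io/zone

With this, the scheduler prefers (but does not guarantee) placing replicas of my-app in different zones.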
Agree, I think |
This feature is highly needed for operational-grade use. |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
A workaround is to set the number of pods to at least 3 in the ReplicaSet. That way you will get at least one pod in each zone -- also when using rolling deployments. |
How does that guarantee that pods are scheduled in different availability zones? |
See my description in: It still does not solve the issue when scaling down pods; however, with this solution you can do rolling deployments with at least one pod running in each zone. |
Has anybody deployed Kafka across different availability zones? We're seeing socket connectivity errors at startup across different datacenters, yet it seems the path is open for all ZooKeeper and Kafka sockets. Thanks! |
FYI: we're developing an alpha feature called "Even Pods Spread", which should resolve this issue. The earliest available release is 1.15. KEP: even-pods-spreading.md |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
Any news on this please? |
@zedtux You can check out this: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/ |
Thank you @Huang-Wei, looks good, I'll give it a try after upgrading my cluster from 1.15. |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
/remove-lifecycle stale |
@raravena80 I think this issue can be resolved by using the feature PodTopologySpread (beta in 1.18) - you can define a NodeAffinity/NodeSelector spec along with the TopologySpreadConstraints, so the Pods can be scheduled in an absolutely even manner (maxSkew=1) or a relatively even one (maxSkew >= 2). |
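To illustrate that combination, here is a minimal sketch of a pod spec that pairs topologySpreadConstraints with a nodeAffinity zone selector; the app label, zone names, and image are placeholders:

kind: Pod
apiVersion: v1
metadata:
  name: example
  labels:
    app: my-app                       # placeholder label; must match the labelSelector below
spec:
  topologySpreadConstraints:
  - maxSkew: 1                        # 1 = absolute even spread; >= 2 allows a relative spread
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["zone-a", "zone-b"]   # placeholder zone names
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9

The nodeAffinity restricts scheduling to the listed zones, while the spread constraint keeps the per-zone pod counts within maxSkew of each other.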
@Huang-Wei yes that works. I'll close this. Thanks! |
/kind feature
/sig scheduling
What happened:
Currently, nodeAffinity allows you to select multiple availability zones to schedule your pods in, but there's no guarantee that you will end up with at least one pod in each availability zone. Or if the algorithm inherently does this, it doesn't seem to be documented.
More information here:
https://stackoverflow.com/questions/52457455/multizone-kubernetes-cluster-and-affinity-how-to-distribute-application-per-zon
What you expected to happen:
To have NodeAffinity support spreading your pods across different availability zones.
Perhaps a key like this:
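Purely as a hypothetical illustration of the idea (the operator shown does not exist in the Kubernetes API), such a key could look like:

nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
    - matchExpressions:
      - key: failure-domain.beta.kubernetes.io/zone
        operator: SpreadEvenly        # hypothetical operator; not part of the real Kubernetes API
        values:
        - us-east-1a
        - us-east-1b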
How to reproduce it (as minimally and precisely as possible):
Standard NodeAffinity case:
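For example, a typical multi-zone nodeAffinity sketch (zone names are placeholders) would be:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - us-east-1a
          - us-east-1b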
In this case there are no guarantees that the scheduler will distribute the pods like this (with either the required or preferred method):
4 pods:
5 pods:
Because of that you could end up in a situation like this:
4 pods
4 pods
or just the podAntiAffinity approach described here,
where there's a chicken-and-egg problem when doing re-deploys: it works initially, but it doesn't work in subsequent deployments because pods with the same label already exist in the given availability zone.
Anything else we need to know?:
This could also be a feature in podAntiAffinity that allows you to schedule pods across different availability zones.
Environment:
Kubernetes version (use kubectl version): latest
Kernel (e.g. uname -a): Any that supports k8s.