Pods not being evenly scheduled across worker nodes #105220
/sig scheduling |
I also realized that if I create a deployment with 1000 replicas, pods are evenly distributed:
$ kubectl create deployment --replicas=1000 --image=k8s.gcr.io/pause sleep
$ kubectl get pods -o go-template --template='{{range .items}}{{if eq .status.phase "Running"}}{{.spec.nodeName}}{{"\n"}}{{end}}{{end}}' --all-namespaces | awk '{nodes[$1]++} END{for (n in nodes) print n": "nodes[n]}'
ip-10-0-187-21.eu-west-3.compute.internal: 59
ip-10-0-139-16.eu-west-3.compute.internal: 134
ip-10-0-210-1.eu-west-3.compute.internal: 126
ip-10-0-146-3.eu-west-3.compute.internal: 125
ip-10-0-156-89.eu-west-3.compute.internal: 128
ip-10-0-134-116.eu-west-3.compute.internal: 35
ip-10-0-218-198.eu-west-3.compute.internal: 126
ip-10-0-168-121.eu-west-3.compute.internal: 125
ip-10-0-182-174.eu-west-3.compute.internal: 125
ip-10-0-199-68.eu-west-3.compute.internal: 127
ip-10-0-223-121.eu-west-3.compute.internal: 32
ip-10-0-187-122.eu-west-3.compute.internal: 135 |
A recap of the conclusions we already have: this regressed with #102925, which changed the scoring of the balanced allocation plugin.
Ah, that's good to have confirmed. Pods within a deployment get an extra spreading score, which explains why deployment replicas spread evenly while standalone pods don't. Discussing solutions:
|
cc @ahg-g @Huang-Wei |
Opened the partial reverts for 1.21 and 1.20: |
Is there any negative impact to reverting the changes made to the balanced plugin? Looking at #102925, I agree that we shouldn't have made the change to the balanced plugin, but I'm just wondering what the rationale was.
Another thought: adding pod count to the balanced resource calculation, in addition to CPU and memory? |
The node would get a score of 0 from that plugin. Maybe reverting is a valid solution too, but only if we also reduce the non-zero requests; this reduces the chances of nodes getting the zero score.
I don't think so. It's hard for users to estimate how many pods would fit on a node. This might lead to more undesired behaviors. |
Reducing the non-zero requests sgtm. But perhaps the other question is how much impact balanced allocation should have compared to least/most allocated. I also feel we are not using the scoring weights enough to solve these types of issues.
Each node already defines the max number of pods, and each pod consumes 1, so there is nothing new to be estimated. But yeah, since each pod has a fixed request of 1, we are basically scoring on the node's max-pod limit, which is usually fixed for all nodes. |
What I'm saying is that most users probably don't optimize the number of pods to tailor their workloads.
At least it doesn't seem to be too strong a signal compared to Spreading. Note that after #101946, it tops out at 50. If that's the case, all nodes had 100% utilization, and thus a score of 0. Then reducing the non-zero request is a win-win-win for the three scores :) |
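For anyone following along, here is a minimal sketch of the non-zero defaulting being discussed, modeled on pkg/scheduler/util (the function name is illustrative; the constants are the defaults at the time of this issue, and are exactly what a reduction would change):

const (
	defaultMilliCPURequest int64 = 100               // 0.1 core
	defaultMemoryRequest   int64 = 200 * 1024 * 1024 // 200MB
)

// nonzeroRequests returns the requests used for scoring: a pod that declares
// nothing is charged the defaults, which is why enough best-effort pods can
// drive a node's assumed utilization to 100%.
func nonzeroRequests(milliCPU, memBytes int64) (int64, int64) {
	if milliCPU == 0 {
		milliCPU = defaultMilliCPURequest
	}
	if memBytes == 0 {
		memBytes = defaultMemoryRequest
	}
	return milliCPU, memBytes
}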
I'm fine with reducing the default requests, esp. the memory value. Another idea is to set the non-zero requests dynamically. For example, suppose the initial default request for a resource is M; as time goes on, when the number of best-effort pods on a node reaches some number N, lower the default value to M/2. When the number of best-effort pods drops below N, the value gets restored to M. Regarding the type of resources, we may apply the dynamics to non-compressible resources (memory) only. |
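A hypothetical sketch of that idea (the function and its parameters are invented for illustration; this is not existing scheduler code):

// dynamicDefaultMemory returns the default memory request to charge a
// best-effort pod on a given node: the initial default m while the node hosts
// fewer than n best-effort pods, and m/2 once it reaches n. Restricting this
// to memory reflects the suggestion to apply it only to non-compressible
// resources.
func dynamicDefaultMemory(m int64, n, bestEffortPodsOnNode int) int64 {
	if bestEffortPodsOnNode >= n {
		return m / 2
	}
	return m
}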
That sounds kind of hard to configure. Can you explain a bit more why you think it would be a good idea? |
Some do because they want to optimize IP usage. But again, this doesn't address the problem here.
But we keep tuning the scores returned by the plugins without a reference for which should be stronger than which. We should try to rank all plugins by importance and weight them accordingly.
Reducing the default requests will help, but if there is a bunch of pods that make actual, large enough requests, then we are back to the same issue. I think another thing we probably need to do is make the default cpu and memory close enough to the ratio used in common machine types.
Assuming common machine types, I think we need to roughly double the memory, then reduce both by whichever factor we want (like .01 CPU and 40MB memory, assuming 10%). |
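To make the arithmetic explicit (taking the 100m CPU / 200MB memory defaults as the starting point): the current defaults imply roughly 2GB of memory per core, while common general-purpose machine types are closer to 4GB per core. Doubling memory gives 100m / 400MB, which matches that ratio, and scaling both down to 10% yields the 10m CPU (.01) and 40MB memory mentioned above.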
In that case the Filter would kick in.
I don't think this is a case of bad weights. We had one score topping out at the same time that the other score was hitting its lowest value. Optimizing the weights is a longer discussion that requires a lot of experimentation. Maybe we can prioritize it for 1.24.
SGTM. Maybe it's safer to start at 20%? |
No, it wouldn't for the ones that don't have requests. Basically, the pods with requests lower the available resources, negating the fact that the non-zero requests got lower. |
My key idea is to differentiate the default requests per node when a node is obviously overutilized, so that we can prevent the symptom described in this issue.
We have to admit the limitations and applicability of the current rule/weight-based scoring. We "thought" one score plugin should be weighted higher/lower than another, but it's not always satisfying for different workloads and clusters. In the long run, we may turn to building an adaptive machine-learning model to complete the scoring job, inspired by projects like adaptdl. Maybe we can leverage some industry practices to improve the entire scheduler scoring area. |
Worker node allocatable resources are:
allocatable:
  attachable-volumes-aws-ebs: "25"
  cpu: 3500m
  ephemeral-storage: "115470533646"
  hugepages-1Gi: "0"
  hugepages-2Mi: "0"
  memory: 14783292Ki
  pods: "250" |
My comment was in response to the comment quoted above.
Did we actually try to use it? I feel we didn't.
I also feel that the choices are fairly limited and we can reach a reasonable ranking without the complexity of ML; also, an ML model is only as good as the data ("ground truth") you feed it... |
We started to deviate from the problem at hand. Is everyone ok with this?
I agree that we should match common machine types' ratios. But I would vote 20% to start. |
What is the difference between 10% and 20%? Do we actually need to consider pods that don't declare requests in the balanced utilization score? |
The main reason to be more conservative is that the same non-zero values are used for the 3 scores. |
Unless you are suggesting we decouple them? |
Yes, I am suggesting we treat the balanced score differently from the others. As I mentioned above, reducing the values will basically shift the problem not solve it.
In a sense balanced serves a different purpose which is also evident from it not being part of the common score plugin we now have. |
So, in summary, the suggested solution is to use the original requests in the balanced allocation score. @damemi WDYT? Could you take that? |
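A rough sketch of what that decoupling could look like (the helper and its parameters are hypothetical; the point is only that the balanced score stops charging best-effort pods the defaults):

// scoringRequest returns the request value a score plugin should use for one
// resource. Least/most-allocated keep the defaulted ("non-zero") request, but
// the balanced score sees a best-effort pod as requesting nothing, so it can
// no longer make a fully packed node look perfectly balanced.
func scoringRequest(declared, defaulted int64, forBalanced bool) int64 {
	if declared > 0 || forBalanced {
		return declared
	}
	return defaulted
}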
We seem to have hit the same problem even on 1.19.14 / 1.19.15. @SenatorSupes has managed to find the offending commit here - f7b2ca5 |
The linked PRs are not merged yet. Unfortunately, I don't think there will be another 1.19 release. |
@damemi Do you mind if I pick this up? |
/assign @ahmad-diaa |
I've done some additional tests consisting of deploying a bunch of pause pods with requests in a single namespace:
# Number of pods in the workload's namespace
rsevilla@wonderland ~ $ kubectl get pod -n node-density-bbe06b64-991a-4a74-8d9d-75aa23f45415 --no-headers | wc -l
446
# All of the deployed pods have resource requests configured
rsevilla@wonderland ~ $ kubectl get pod -n node-density-bbe06b64-991a-4a74-8d9d-75aa23f45415 node-density-1 -o jsonpath="{.spec.containers[*].resources}"
{"requests":{"cpu":"1m","memory":"10Mi"}}
# Worker nodes total
$ kubectl describe node -l node-role.kubernetes.io/worker | grep -E "(^Name:|^Non-terminated)"
Name: ip-10-0-147-142.eu-west-3.compute.internal
Non-terminated Pods: (249 in total)
Name: ip-10-0-158-24.eu-west-3.compute.internal
Non-terminated Pods: (249 in total)
Name: ip-10-0-187-55.eu-west-3.compute.internal
Non-terminated Pods: (25 in total)
Name: ip-10-0-218-220.eu-west-3.compute.internal
Non-terminated Pods: (31 in total)
# Number of pods per node in the workload's namespace
rsevilla@wonderland ~ $ kubectl get pod -n node-density-bbe06b64-991a-4a74-8d9d-75aa23f45415 -o wide --no-headers | awk '{node[$7]++ }END{ for (n in node) print n": "node[n]; }'
ip-10-0-147-142.eu-west-3.compute.internal: 218
ip-10-0-187-55.eu-west-3.compute.internal: 5
ip-10-0-158-24.eu-west-3.compute.internal: 223
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"archive", BuildDate:"2021-07-22T00:00:00Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.0-rc.0+af080cb", GitCommit:"af080cb8d127b31307ed3622992c05a4b59f15ba", GitTreeState:"clean", BuildDate:"2021-09-17T18:36:43Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
According to the comments here, I thought that setting pod requests would help get these pods scheduled properly; however, as shown above, that is not happening. |
It's not just about the pods that are being scheduled, but also the pods that already exist in the cluster. 249 - 218 is 31 pods. That could be enough to make the BalancedAllocation plugin return a 100% score, if none of those pods have requests. |
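For reference, a simplified sketch of why full assumed utilization maximizes the balanced score, modeled on the two-resource form of the BalancedAllocation scorer (the real plugin handles more resources and edge cases):

// balancedScore scores a node on how close its CPU and memory utilization
// fractions are to each other. A node whose assumed utilization is 100% on
// both axes has zero difference and gets the maximum score, which is how
// best-effort pods (charged the default non-zero requests) can make an
// already-full node look the most attractive.
func balancedScore(cpuFraction, memFraction float64) int64 {
	if cpuFraction > 1 {
		cpuFraction = 1
	}
	if memFraction > 1 {
		memFraction = 1
	}
	diff := cpuFraction - memFraction
	if diff < 0 {
		diff = -diff
	}
	return int64((1 - diff) * 100) // MaxNodeScore == 100
}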
@alculquicondor hello from 2023 :-) I just hit the uneven node balancing issue due to the high default for CPU requests: #122131 |
What happened:
After a straightforward scale test consisting of creating several hundred standalone pods (sleep) on a small cluster (9 worker nodes), I realized that the pods are not evenly scheduled across the nodes.
The test was executed without any LimitRange, and the created pods don't have any requests either.
What you expected to happen:
Pods are evenly spread across all worker nodes.
How to reproduce it (as minimally and precisely as possible):
Number of pods in nodes before executing the test:
Create 1000 pods:
for i in {1..1000}; do kubectl run --image=k8s.gcr.io/pause sleep-${i}; done
Check Running pods per node (using the kubectl/awk one-liner shown in the comments above):
As shown above, some nodes ran out of room to run more pods (max-pods is set to 250) while there are other nodes with far fewer pods.
Anything else we need to know?:
Environment:
Kubernetes version (use kubectl version):