
Pods not being evenly scheduled across worker nodes #105220

Closed
rsevilla87 opened this issue Sep 23, 2021 · 31 comments · Fixed by #105845
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@rsevilla87

What happened:

After a straightforward scale test consisting of creating several hundred standalone pods (sleep) on a small cluster (9 worker nodes), I realized that the pods are not evenly scheduled across the nodes.

The test was executed without any LimitRange, and the created pods don't have any resource requests either.

What you expected to happen:

Pods are evenly spread across all worker nodes.

How to reproduce it (as minimally and precisely as possible):

Number of pods in nodes before executing the test:

$ kubectl get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-134-116.eu-west-3.compute.internal   Ready    master   10d     v1.22.0-rc.0+75ee307
ip-10-0-139-16.eu-west-3.compute.internal    Ready    worker   7h48m   v1.22.0-rc.0+75ee307
ip-10-0-146-3.eu-west-3.compute.internal     Ready    worker   7h48m   v1.22.0-rc.0+75ee307
ip-10-0-156-89.eu-west-3.compute.internal    Ready    worker   10d     v1.22.0-rc.0+75ee307
ip-10-0-168-121.eu-west-3.compute.internal   Ready    worker   7h48m   v1.22.0-rc.0+75ee307
ip-10-0-182-174.eu-west-3.compute.internal   Ready    worker   7h47m   v1.22.0-rc.0+75ee307
ip-10-0-187-122.eu-west-3.compute.internal   Ready    worker   10d     v1.22.0-rc.0+75ee307
ip-10-0-187-21.eu-west-3.compute.internal    Ready    master   10d     v1.22.0-rc.0+75ee307
ip-10-0-199-68.eu-west-3.compute.internal    Ready    worker   3d10h   v1.22.0-rc.0+75ee307
ip-10-0-210-1.eu-west-3.compute.internal     Ready    worker   7h48m   v1.22.0-rc.0+75ee307
ip-10-0-218-198.eu-west-3.compute.internal   Ready    worker   7h47m   v1.22.0-rc.0+75ee307
ip-10-0-223-121.eu-west-3.compute.internal   Ready    master   10d     v1.22.0-rc.0+75ee307
$ kubectl get pods -o go-template --template='{{range .items}}{{if eq .status.phase "Running"}}{{.spec.nodeName}}{{"\n"}}{{end}}{{end}}' --all-namespaces | awk '{nodes[$1]++ }END{ for (n in nodes) print n": "nodes[n]}'
ip-10-0-187-21.eu-west-3.compute.internal: 59 <- master node not schedulable
ip-10-0-139-16.eu-west-3.compute.internal: 23
ip-10-0-210-1.eu-west-3.compute.internal: 15
ip-10-0-146-3.eu-west-3.compute.internal: 14
ip-10-0-156-89.eu-west-3.compute.internal: 17
ip-10-0-134-116.eu-west-3.compute.internal: 35 <- master node not schedulable
ip-10-0-218-198.eu-west-3.compute.internal: 15
ip-10-0-168-121.eu-west-3.compute.internal: 14
ip-10-0-182-174.eu-west-3.compute.internal: 14
ip-10-0-199-68.eu-west-3.compute.internal: 15
ip-10-0-223-121.eu-west-3.compute.internal: 32 <- master node not schedulable
ip-10-0-187-122.eu-west-3.compute.internal: 24  

Create 1000 pods:
for i in {1..1000}; do kubectl run --image=k8s.gcr.io/pause sleep-${i}; done

Check Running pods per node:

$ kubectl get pods -o go-template --template='{{range .items}}{{if eq .status.phase "Running"}}{{.spec.nodeName}}{{"\n"}}{{end}}{{end}}' --all-namespaces | awk '{nodes[$1]++ }END{ for (n in nodes) print n": "nodes[n]}'
ip-10-0-187-21.eu-west-3.compute.internal: 59 <- master node not schedulable
ip-10-0-139-16.eu-west-3.compute.internal: 224
ip-10-0-210-1.eu-west-3.compute.internal: 78
ip-10-0-146-3.eu-west-3.compute.internal: 71
ip-10-0-156-89.eu-west-3.compute.internal: 250
ip-10-0-134-116.eu-west-3.compute.internal: 35 <- master node not schedulable
ip-10-0-218-198.eu-west-3.compute.internal: 76
ip-10-0-168-121.eu-west-3.compute.internal: 71
ip-10-0-182-174.eu-west-3.compute.internal: 75
ip-10-0-199-68.eu-west-3.compute.internal: 56
ip-10-0-223-121.eu-west-3.compute.internal: 32 <- master node not schedulable
ip-10-0-187-122.eu-west-3.compute.internal: 250

As shown above, some nodes ran out of room to run more pods (max-pods is set to 250) while there are other nodes with far fewer pods.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"archive", BuildDate:"2021-03-30T00:00:00Z", GoVersion:"go1.16", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.0-rc.0+75ee307", GitCommit:"75ee3073266f07baaba5db004cde0636425737cf", GitTreeState:"clean", BuildDate:"2021-09-04T12:16:28Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
AWS using m5.xlarge worker nodes
@rsevilla87 rsevilla87 added the kind/bug Categorizes issue or PR as related to a bug. label Sep 23, 2021
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 23, 2021
@rsevilla87
Author

/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 23, 2021
@rsevilla87
Author

rsevilla87 commented Sep 23, 2021

cc: @alculquicondor @damemi @jtaleric

@rsevilla87
Author

rsevilla87 commented Sep 23, 2021

I also realized that if I create a deployment with 1000 replicas, pods are evenly distributed:

$ kubectl create  deployment --replicas=1000 --image=k8s.gcr.io/pause sleep
$ kubectl get pods -o go-template --template='{{range .items}}{{if eq .status.phase "Running"}}{{.spec.nodeName}}{{"\n"}}{{end}}{{end}}' --all-namespaces | awk '{nodes[$1]++ }END{ for (n in nodes) print n": "nodes[n]}'
ip-10-0-187-21.eu-west-3.compute.internal: 59
ip-10-0-139-16.eu-west-3.compute.internal: 134
ip-10-0-210-1.eu-west-3.compute.internal: 126
ip-10-0-146-3.eu-west-3.compute.internal: 125
ip-10-0-156-89.eu-west-3.compute.internal: 128
ip-10-0-134-116.eu-west-3.compute.internal: 35
ip-10-0-218-198.eu-west-3.compute.internal: 126
ip-10-0-168-121.eu-west-3.compute.internal: 125
ip-10-0-182-174.eu-west-3.compute.internal: 125
ip-10-0-199-68.eu-west-3.compute.internal: 127
ip-10-0-223-121.eu-west-3.compute.internal: 32
ip-10-0-187-122.eu-west-3.compute.internal: 135

@alculquicondor
Member

alculquicondor commented Sep 24, 2021

A recap of the conclusions we already have:

This regressed with #102925, which changed the NodeResourcesBalancedAllocation and NodeResourcesMostAllocated scores. However, the fix does what it intended: ensure that a node is not considered underutilized when using NodeResourcesMostAllocated. Of course, this causes the opposite problem when using NodeResourcesLeastAllocated.

I also realized that if I create a deployment with 1000 replicas, pods are evenly distributed:

Ah, that's good to have confirmed. Pods within a deployment have an extra spreading score. It looks like the score that NodeResourcesBalancedAllocation provides is not as strong as the spreading score 🥳. This is probably thanks to #101946.

Discussing solutions:

  • We should do a partial revert of Fix Node Resources plugins score when there are pods with no requests #102925 in 1.20 and 1.21, as they don't have Support extended resource in NodeResourcesBalancedAllocation plugin #101946. By partial I mean that we should leave the fix on NodeResourcesMostAllocated.
  • For 1.22 and beyond, I think we should reduce the "default requests" that the scheduler implicitly adds when scoring. They are arguably too big. I would suggest 10 to 20% of the current numbers:
    // DefaultMilliCPURequest defines default milli cpu request number.
    DefaultMilliCPURequest int64 = 100 // 0.1 core
    // DefaultMemoryRequest defines default memory request size.
    DefaultMemoryRequest int64 = 200 * 1024 * 1024 // 200 MB

    This should reduce the chances of the scheduler estimating 100% resources allocated. In reality, most production clusters have characteristics that your test cluster probably doesn't have: bigger nodes and a smaller number of pods per node.
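
For illustration, here is a minimal sketch of how such implicit defaults affect scoring when a pod declares no requests. The helper below is hypothetical and only mirrors the idea, not the scheduler's actual code:

package main

import "fmt"

// Illustrative copies of the defaults quoted above.
const (
	defaultMilliCPURequest int64 = 100               // 0.1 core
	defaultMemoryRequest   int64 = 200 * 1024 * 1024 // 200 MB
)

// nonZeroRequests is a hypothetical helper: when a pod declares no CPU or
// memory request, score it as if it had requested the defaults above.
func nonZeroRequests(declaredMilliCPU, declaredMemoryBytes int64) (int64, int64) {
	cpu, mem := declaredMilliCPU, declaredMemoryBytes
	if cpu == 0 {
		cpu = defaultMilliCPURequest
	}
	if mem == 0 {
		mem = defaultMemoryRequest
	}
	return cpu, mem
}

func main() {
	// A best-effort pod (no requests) is scored as 100m CPU and 200 MB of memory.
	cpu, mem := nonZeroRequests(0, 0)
	fmt.Printf("scored as %dm CPU, %d bytes of memory\n", cpu, mem)
}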

@alculquicondor
Member

cc @ahg-g @Huang-Wei

@ahg-g
Member

ahg-g commented Sep 24, 2021

Is there any negative impact on reverting the changes made to the balanced plugin? Looking at #102925, I agree that we shouldn't have made the change for the balanced plugin, but I'm just wondering what the rationale was.

This should reduce the chances of the scheduler estimating 100% resources allocated. In reality, most production clusters have characteristics that your test cluster probably doesn't have: bigger nodes and a smaller number of pods per node.

Another thought: adding pod count to the balanced resource calculation in addition to cpu and memory?

@alculquicondor
Member

alculquicondor commented Sep 24, 2021

Is there any negative impact on reverting the changes made to the balanced plugin?

The node would get a score of 0 for NodeResourcesBalancedAllocation which will hurt utilization when trying to bin-pack.

Maybe reverting is a valid solution too, but only if we reduce the non-zero requests. This reduces the chances of nodes getting the zero score.

Another thought: adding pod count to the balanced resource calculation in addition to cpu and memory?

I don't think so. It's hard for users to estimate how many pods they would fit in a node. This might lead to more undesired behaviors.

@ahg-g
Member

ahg-g commented Sep 24, 2021

Reducing the non-zero requests sgtm. But perhaps the other question is how much impact balanced allocation should have compared to least/most allocated. I also feel we are not using the scoring weights enough to solve these types of issues.

I don't think so. It's hard for users to estimate how many pods they would fit in a node. This might lead to more undesired behaviors.

Each node already defines the max number of pods, and each pod consumes 1, so there is nothing new to be estimated. But yeah, since each pod has a fixed request of 1, we are basically scoring on the node's max pod limit, which is usually fixed for all nodes.

@alculquicondor
Member

Each node already defines the max number of pods, and each pod consumes 1

What I'm saying is that most users probably don't optimize the number of pods to tailor their workloads.

But perhaps the other question is how much impact should balanced allocation have compared to least/most allocated.

At least it doesn't seem to be too strong of a signal compared to Spreading. Note that after #101946, it tops at 50.
I think the problem in the scenario above is that all nodes already have a big number of pods (at least 56, for non-masters). With the current defaults, this is equivalent to 5.6 CPUs, which is likely greater than the allocatable CPU (@rsevilla87, please confirm).

If that's the case, all nodes had 100% utilization, thus 0 score for NodeResourcesLeastAllocated. Then the fact that this was behaving as badly in 1.22 as in previous versions without #101946 makes more sense. Essentially there is only one score at play, NodeResourcesBalancedAllocation.

Then, reducing the non-zero request is a win-win-win for the three scores :)
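
To make the interplay concrete, here is a rough sketch of the two scores in a simplified two-resource form (not the actual plugin implementations, which differ in details such as resource weights and clamping), fed with the 100m/200MB defaults and the per-node pod counts quoted in this issue. Once the implied CPU request exceeds the allocatable CPU, the CPU term of NodeResourcesLeastAllocated bottoms out and the balanced score dominates:

package main

import (
	"fmt"
	"math"
)

const maxNodeScore = 100.0

// clamp caps a requested/allocatable fraction at 1.0.
func clamp(f float64) float64 {
	if f > 1 {
		return 1
	}
	return f
}

// leastAllocatedScore favors nodes with more free resources (simplified:
// average of the free CPU and memory fractions).
func leastAllocatedScore(cpuFrac, memFrac float64) float64 {
	return ((1 - clamp(cpuFrac)) + (1 - clamp(memFrac))) / 2 * maxNodeScore
}

// balancedAllocationScore favors nodes whose CPU and memory fractions are
// close to each other (simplified two-resource form).
func balancedAllocationScore(cpuFrac, memFrac float64) float64 {
	return (1 - math.Abs(clamp(cpuFrac)-clamp(memFrac))) * maxNodeScore
}

func main() {
	// 56 best-effort pods scored with the 100m / 200MB defaults on a node
	// with 3500m CPU and 14783292Ki memory allocatable (values from this issue).
	cpuFrac := 56.0 * 100 / 3500                              // ~1.6, clamped to 1.0
	memFrac := 56.0 * 200 * 1024 * 1024 / (14783292.0 * 1024) // ~0.78
	fmt.Printf("LeastAllocated:     %.1f\n", leastAllocatedScore(cpuFrac, memFrac))
	fmt.Printf("BalancedAllocation: %.1f\n", balancedAllocationScore(cpuFrac, memFrac))
}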

@Huang-Wei
Member

I'm fine with reducing the default requests, esp. the memory value.

Another idea is to set the non-zero requests dynamically. For example, suppose the initial default request for a resource is M. As time goes on, when the number of best-effort pods on a node reaches a number N, make the default value M/2. When the number of best-effort pods drops below N, the value gets restored to M.

Regarding the type of resources, we may apply the dynamics to non-compressible resources (memory) only.
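
As a purely illustrative sketch of that idea: the threshold N, the halving rule, and every name below are hypothetical, and nothing like this exists in the scheduler today.

package main

import "fmt"

const (
	defaultMemoryRequest   int64 = 200 * 1024 * 1024 // M: initial default, 200 MB
	bestEffortPodThreshold       = 50                // N: hypothetical per-node threshold
)

// dynamicDefaultMemory illustrates the proposal: once a node already hosts N
// best-effort pods, halve the implicit memory request used for scoring; below
// N, use the full default M.
func dynamicDefaultMemory(bestEffortPodsOnNode int) int64 {
	if bestEffortPodsOnNode >= bestEffortPodThreshold {
		return defaultMemoryRequest / 2
	}
	return defaultMemoryRequest
}

func main() {
	fmt.Println(dynamicDefaultMemory(10)) // 209715200 (full default M)
	fmt.Println(dynamicDefaultMemory(60)) // 104857600 (halved, M/2)
}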

@alculquicondor
Member

That sounds kind of hard to configure. Can you explain a bit more why you think it would be a good idea?

@ahg-g
Member

ahg-g commented Sep 24, 2021

What I'm saying is that most users probably don't optimize the number of pods to tailor their workloads.

Some do because they want to optimize IP usage. But again, this doesn't address the problem here.

At least it doesn't seem to be too strong of a signal compared to Spreading. Note that after #101946, it tops at 50.

But we keep tuning the scores returned by the plugins without a reference for which should be stronger than which. We should try to rank all plugins by importance and weight them accordingly.

Then, reducing the non-zero request is a win-win-win for the three scores :)

Reducing the default requests will help, but if there are a bunch of pods that make actual large enough requests, then we are back to the same issue. I think another thing we probably need to do is make the default cpu and memory close enough to the ratio used in common machine types.

I'm fine with reducing the default requests, esp. the memory value.

Assuming common machine types, I think we need to roughly double the memory, then reduce both by whichever value we want (like .01 CPU and 40MB memory assuming 10%)
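
To make that concrete against the constants quoted earlier, the "double the memory, then take 10%" suggestion would land at something like the following hypothetical values (illustrative only, not what the scheduler ships):

// Hypothetical revised defaults under the "double the memory, then take 10%" idea
// (illustrative values only).
const (
	// DefaultMilliCPURequest defines default milli cpu request number.
	DefaultMilliCPURequest int64 = 10 // 0.01 core (10% of the current 100m)
	// DefaultMemoryRequest defines default memory request size.
	DefaultMemoryRequest int64 = 40 * 1024 * 1024 // 40 MB (10% of a doubled 400 MB)
)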

@alculquicondor
Member

but if there are a bunch of pods that make actual large enough requests

In that case the Filter would kick in.

But we keep tuning the score returned by the plugins without reference which should be stronger than which.

I don't think this is a case of bad weights. We had one score topping out at the same time that the other score was hitting its lowest value.

Optimizing the weights is a longer discussion which requires a lot of experimentation. Maybe we can prioritize it for 1.24.

Assuming common machine types, I think we need to roughly double the memory, then reduce both by whichever value we want (like .01 CPU and 40MB memory assuming 10%)

SGTM. Maybe it's safer to start at 20%?

@ahg-g
Member

ahg-g commented Sep 24, 2021

In that case the Filter would kick in.

No, it wouldn't for the ones that don't have requests. Basically, the pods with requests lower the available resources, negating the fact that the non-zero requests got lower.

@Huang-Wei
Member

That sounds kind of hard to configure. Can you explain a bit more why you think it would be a good idea?

My key idea is to differentiate the default requests across nodes when a node is obviously overutilized, so that it can prevent the symptom described in this issue.

Optimizing the weights is a longer discussion which requires a lot of experimentation. Maybe we can prioritize it for 1.24.

We have to acknowledge the limitations of the current rule/weight-based scoring and where it applies. We "thought" one score plugin should be weighted higher or lower than another, but it's not always satisfying for different workloads and clusters. In the long run, we may turn to building an adaptive machine learning model to handle the scoring job, inspired by projects like adaptdl. Maybe we can leverage some industry practices to improve the entire scheduler scoring area.

@rsevilla87
Author

rsevilla87 commented Sep 24, 2021

With the current defaults, this is equivalent to 5.6 CPUs, which is likely greater than the allocatable CPU (@rsevilla87, please confirm).

Worker node allocatable resources are:

  allocatable:
    attachable-volumes-aws-ebs: "25"
    cpu: 3500m
    ephemeral-storage: "115470533646"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 14783292Ki
    pods: "250"
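
A quick back-of-the-envelope against those numbers and the 100m / 200MB scheduler defaults discussed above (a sketch, using only figures already quoted in this issue):

package main

import "fmt"

func main() {
	// Worker-node allocatable, from the output above.
	allocatableMilliCPU := int64(3500)
	allocatableMemoryKi := int64(14783292) // ~14.1 GiB

	// Implicit per-pod defaults applied to pods without requests.
	defaultMilliCPU := int64(100)
	defaultMemoryKi := int64(200 * 1024) // 200 MB expressed in Ki

	// Roughly how many request-less pods it takes for the scheduler's implied
	// usage to reach the allocatable amount.
	fmt.Println("pods to saturate CPU:   ", allocatableMilliCPU/defaultMilliCPU) // 35
	fmt.Println("pods to saturate memory:", allocatableMemoryKi/defaultMemoryKi) // 72
}

With at least 56 pods already running on every non-master worker, the implied CPU request alone exceeds the 3.5-CPU allocatable, which matches the earlier hypothesis that these nodes were being scored as fully utilized.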

@ahg-g
Member

ahg-g commented Sep 26, 2021

I don't think this is a case of bad weights. We had one score topping out at the same time that the other score was hitting its lowest value.

My comment was in response to: "At least it doesn't seem to be too strong of a signal compared to Spreading. Note that after #101946, it tops at 50."

We have to acknowledge the limitations of the current rule/weight-based scoring and where it applies

Did we actually try to use it? I feel we didn't.

In the long run, we may turn to building an adaptive machine learning model to handle the scoring job.

I also feel that the choices are fairly limited and we can reach a reasonable ranking without the complexity of ML; also, an ML model is only as good as the data ("ground truth") you feed it...

@alculquicondor
Member

We started to deviate from the problem at hand. Is everyone ok with this?

Assuming common machine types, I think we need to roughly double the memory, then reduce both by whichever value we want (like .01 CPU and 40MB memory assuming 10%)

I agree that we should match common machine types' ratios. But I would vote 20% to start.

@ahg-g
Member

ahg-g commented Sep 27, 2021

What is the difference between 10% and 20%? Do we actually need to consider pods that don't declare requests in the balanced utilization score?

@alculquicondor
Member

The main reason to be more conservative is that the same non-zero values are used for the 3 scores.

@alculquicondor
Member

Unless you are suggesting we decouple NodeResourcesLeastAllocated and NodeResourcesMostAllocated from the balanced one, which would then only use declared requests. I'm fine with that, but it might be harder to reason about how the scores play together.

@ahg-g
Member

ahg-g commented Sep 27, 2021

Yes, I am suggesting we treat the balanced score differently from the others. As I mentioned above, reducing the values will basically shift the problem, not solve it.

but it might be harder to reason about how the scores play together.

In a sense, balanced serves a different purpose, which is also evident from it not being part of the common score plugin we now have.

@alculquicondor
Member

So, in summary, the suggested solution is to use the original requests in NodeResourcesBalancedAllocation instead of the nonzero ones.

@damemi WDYT? could you take that?
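
For illustration, here is a sketch of the difference in how requests would be summed for the balanced score under that suggestion. The types and helpers are hypothetical; this is not the actual patch that resolved this issue.

package main

import "fmt"

// podRequests is a simplified stand-in for a pod's declared requests.
type podRequests struct {
	milliCPU int64
	memory   int64
}

// Illustrative defaults, as quoted earlier in the thread.
const (
	defaultMilliCPU int64 = 100
	defaultMemory   int64 = 200 * 1024 * 1024
)

// requestedForBalanced sums requests the way the suggestion above describes:
// declared values only, with no implicit defaults for best-effort pods.
func requestedForBalanced(pods []podRequests) (cpu, mem int64) {
	for _, p := range pods {
		cpu += p.milliCPU
		mem += p.memory
	}
	return cpu, mem
}

// requestedWithNonZero shows the current behavior for comparison: missing
// requests are replaced with the defaults before scoring.
func requestedWithNonZero(pods []podRequests) (cpu, mem int64) {
	for _, p := range pods {
		c, m := p.milliCPU, p.memory
		if c == 0 {
			c = defaultMilliCPU
		}
		if m == 0 {
			m = defaultMemory
		}
		cpu += c
		mem += m
	}
	return cpu, mem
}

func main() {
	pods := []podRequests{{0, 0}, {0, 0}, {250, 512 * 1024 * 1024}}
	fmt.Println(requestedForBalanced(pods))  // only the declared 250m / 512Mi count
	fmt.Println(requestedWithNonZero(pods)) // each best-effort pod adds 100m / 200MB
}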

@ShashankGirish

Opened the partial reverts for 1.21 and 1.20:

We seem to have hit the same problem even on 1.19.14 / 1.19.15.

@SenatorSupes has managed to find the offending commit here - f7b2ca5

@alculquicondor
Member

The linked PRs are not merged yet. Unfortunately, I don't think there will be another 1.19 release.

@ahmad-diaa
Contributor

ahmad-diaa commented Sep 30, 2021

@damemi Do you mind if I pick this up?

@alculquicondor
Member

/assign @ahmad-diaa
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 4, 2021
@rsevilla87
Author

I've done some additional tests consisting of deploying a bunch of pause pods with requests in a single namespace:

# Number of pods in the workload's namespace
rsevilla@wonderland ~ $ kubectl get pod -n node-density-bbe06b64-991a-4a74-8d9d-75aa23f45415 --no-headers  | wc -l
446

# All of the deployed pods have resource requests configured
rsevilla@wonderland ~ $ kubectl get pod -n node-density-bbe06b64-991a-4a74-8d9d-75aa23f45415 node-density-1 -o jsonpath="{.spec.containers[*].resources}"
{"requests":{"cpu":"1m","memory":"10Mi"}}

# Total non-terminated pods per worker node
$ kubectl describe node -l node-role.kubernetes.io/worker | grep -E "(^Name:|^Non-terminated)"
Name:               ip-10-0-147-142.eu-west-3.compute.internal
Non-terminated Pods:                                      (249 in total)
Name:               ip-10-0-158-24.eu-west-3.compute.internal
Non-terminated Pods:                                      (249 in total)
Name:               ip-10-0-187-55.eu-west-3.compute.internal
Non-terminated Pods:                                      (25 in total)
Name:               ip-10-0-218-220.eu-west-3.compute.internal
Non-terminated Pods:                                      (31 in total)

# Number of pods per node in the workload's namespace
rsevilla@wonderland ~ $ kubectl get pod -n node-density-bbe06b64-991a-4a74-8d9d-75aa23f45415 -o wide --no-headers | awk '{node[$7]++ }END{ for (n in node) print n": "node[n]; }'
ip-10-0-147-142.eu-west-3.compute.internal: 218
ip-10-0-187-55.eu-west-3.compute.internal: 5
ip-10-0-158-24.eu-west-3.compute.internal: 223
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"archive", BuildDate:"2021-07-22T00:00:00Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.0-rc.0+af080cb", GitCommit:"af080cb8d127b31307ed3622992c05a4b59f15ba", GitTreeState:"clean", BuildDate:"2021-09-17T18:36:43Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}

According to the comments here, I thought that setting pod requests would help get these pods scheduled properly; however, as shown above, this is not happening.

@alculquicondor
Member

It's not just about the pods that are being scheduled, but also the pods that already exist in the cluster.

249 - 218 = 31 pods. That could be enough to make the BalancedAllocation plugin return a 100% score, if none of those pods have requests.

@zerkms
Contributor

zerkms commented Dec 1, 2023

For 1.22 and beyond, I think we should reduce the "default requests" that the scheduler implicitly adds when scoring. They are arguably too big. I would suggest 10 to 20% of the current numbers

@alculquicondor hello from 2023 :-) I just hit the uneven node balancing issue due to the high default CPU request: #122131
