
Considering PVC as a resource for balanced resource utilization in the scheduler #58232

Closed
abhgupta opened this issue Jan 12, 2018 · 14 comments · Fixed by #60525
Labels
kind/feature · sig/scheduling · sig/storage

Comments

@abhgupta

Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature

What happened:
The scheduler currently does not have a priority function to properly spread out pods that request PVCs. In the absence of such a priority function, scheduling of pods with PVCs could be skewed resulting in:

  • some nodes exhaust their allocatable PVC limits, while still having enough schedulable cpu/memory resources
  • some nodes have enough available PVC mount points, but have exhausted their schedulable cpu/memory resources

If the above happens, we get into a state of inefficient node utilization.

What you expected to happen:
PVCs should be considered as a resource within the balanced resource utilization (BRU) priority function. Instead of the current algorithm in the BRU priority function, we could use the standard deviation, which allows more than two resources to be balanced across nodes. The inputs to the standard deviation calculation would be the scheduled-to-capacity ratios for each resource (memory, CPU, and PVC).

Alternatively, a separate priority function could be introduced just for PVCs - whichever makes more sense.
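As a rough worked example (hypothetical numbers): if node A's memory, CPU, and PVC ratios are 0.5, 0.5, and 0.5 while node B's are 0.9, 0.2, and 0.4, the standard deviation is 0 for A and roughly 0.29 for B, so A would be preferred as the more balanced node.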

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:
The PVC resource should also be considered in the "least requested" priority function.

Environment:

  • Kubernetes version (use kubectl version): Any
  • Cloud provider or hardware configuration: AWS/Azure/GCP
  • OS (e.g. from /etc/os-release): Any
  • Kernel (e.g. uname -a): Any
  • Install tools: N/A
  • Others:
@k8s-ci-robot added the needs-sig and kind/feature labels on Jan 12, 2018
@abhgupta
Author

cc @derekwaynecarr @aveshagarwal

@abhgupta
Author

/sig scheduling

@k8s-ci-robot added the sig/scheduling label and removed the needs-sig label on Jan 12, 2018
@resouer
Contributor

resouer commented Jan 18, 2018

But a PVC is not bound to a node, so I'm not sure how the spreading here would work. @abhgupta, mind explaining a little bit?

@timothysc
Member

timothysc commented Jan 18, 2018

@abhgupta are you just looking for count-leveling, so that distributions of PVCs are not lumpy? If so, that seems pretty trivial.

@k82cn
Member

k82cn commented Jan 22, 2018

some nodes exhaust their allocatable PVC limits

@abhgupta, any more detail about PVC limits? My understanding is that PVCs are NOT ALWAYS tied to a node.

@ravisantoshgudimetla
Contributor

ravisantoshgudimetla commented Feb 8, 2018

@k82cn - IIUC, what @abhgupta wants is a node with balanced CPU, memory, and number of volumes attached.
I think we can modify the balanced resource allocation priority to compute the standard deviation or variance of (CPU fraction, memory fraction, (volumes currently mounted on the node + requested) / max volumes allowed) for every node. The node with the lowest standard deviation gets the highest priority.
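A minimal sketch of that scoring idea, assuming the three fractions have already been computed for a node; the function names and the 0-10 score range are illustrative, not the scheduler's actual priority code:

```go
package main

import (
	"fmt"
	"math"
)

// balancedScore sketches the proposed priority: take the per-resource
// utilization fractions (requested/allocatable) for a node, compute their
// standard deviation, and map a lower deviation to a higher score
// (0..10, matching the scheduler's priority range at the time).
func balancedScore(fractions ...float64) int {
	if len(fractions) == 0 {
		return 0
	}
	mean := 0.0
	for _, f := range fractions {
		mean += f
	}
	mean /= float64(len(fractions))

	variance := 0.0
	for _, f := range fractions {
		variance += (f - mean) * (f - mean)
	}
	variance /= float64(len(fractions))

	// Fractions are in [0,1], so the standard deviation stays well below 1;
	// inverting it gives the highest score to the most balanced node.
	return int((1 - math.Sqrt(variance)) * 10)
}

func main() {
	// Hypothetical node: 60% of CPU requested, 55% of memory, 5 of 16 volumes
	// in use plus 1 requested by the incoming pod.
	cpuFraction, memFraction := 0.60, 0.55
	volumeFraction := float64(5+1) / float64(16)
	fmt.Println("score:", balancedScore(cpuFraction, memFraction, volumeFraction))
}
```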

@k82cn
Member

k82cn commented Feb 8, 2018

number of volumes attached.

Any impact if we have a bigger number: performance, or max mount points? If it's max mount points, that raises the interesting question of how many pods can run on each node :)

Another case in my mind is the DFS (distributed file system) client issue: if many Pods read/write data through a DFS mount point, the connection may become a bottleneck.

No objection, I'd just like to know the details of the use case before we build a priority for it :).

@k82cn
Member

k82cn commented Feb 8, 2018

/cc @bsalamat @kubernetes/sig-storage-feature-requests

@k8s-ci-robot added the sig/storage label on Feb 8, 2018
@abhgupta
Author

abhgupta commented Feb 8, 2018

@ravisantoshgudimetla Thanks! That exactly captures the gist of my proposal. We could ignore PVs (for storage classes) that do not have an upper limit; such PVs could simply be handled by the "least requested" priority function.

If I understand what @k82cn was highlighting, there are multiple scheduling issues with PVCs:

  1. Imbalance in the number of PVs mounted (by pods) on each node
  • This is the low-hanging fruit that I wish to target.
  2. Imbalance in the network I/O for data reads/writes from all pods on a node to PVs
  • This is an important issue, but it is more complicated and I wouldn't want to club it into this issue.

@msau42
Member

msau42 commented Feb 8, 2018

This is also related to #24317. Bigger nodes can have higher limits on the max number of volumes attached. An even count spreading assumes that all nodes are the same size.

@ravisantoshgudimetla
Contributor

ravisantoshgudimetla commented Feb 8, 2018

@k82cn

Any impact if we have a bigger number, performance or max mount points ? If max mount points, that's interesting point that how many pods will run in each node :)

I think as of now we are limited to 39 volumes for AWS and 16 for GCE, which are hardcoded from the scheduling perspective (users can override them via an environment variable), so the numbers aren't that high. I can create a benchmark to test this.
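For illustration, a sketch of how such a per-provider cap with an environment-variable override can be resolved; the 39/16 defaults and the KUBE_MAX_PD_VOLS variable are the ones referenced here, but treat the exact names and structure as an assumption rather than the predicate's actual code:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Defaults discussed in this thread; the authoritative values live in the
// scheduler's volume-count predicate.
const (
	defaultMaxEBSVolumes   = 39 // AWS
	defaultMaxGCEPDVolumes = 16 // GCE
)

// maxVolumes returns the per-node volume cap for a provider, honoring the
// environment-variable override mentioned above (KUBE_MAX_PD_VOLS; assumed
// name here, check the scheduler source for the authoritative one).
func maxVolumes(provider string) int {
	if v := os.Getenv("KUBE_MAX_PD_VOLS"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	switch provider {
	case "aws":
		return defaultMaxEBSVolumes
	case "gce":
		return defaultMaxGCEPDVolumes
	default:
		// No known cap for other providers in this sketch.
		return defaultMaxGCEPDVolumes
	}
}

func main() {
	fmt.Println("aws cap:", maxVolumes("aws"))
	fmt.Println("gce cap:", maxVolumes("gce"))
}
```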

@msau42 - Thanks for pointing to the issue where node sizes vary. I think as long as I rely on the value that comes from the ConfigMap (#53461 (comment)), we should be good to go.

Bigger nodes can have higher limits on the max number of volumes attached. An even count spreading assumes that all nodes are the same size.

I think this statement holds for memory and CPU as well. AFAIU, a balanced node is one whose memory utilization (requested/capacity) and CPU utilization are close to each other, meaning the variance is low irrespective of node size. Won't the same be true for volumes attached? Or is your statement based on not using the ConfigMap, which carries the cloud-provider-specific value?

@bsalamat
Member

bsalamat commented Feb 8, 2018

PVCs should be considered as a resource within the balanced resource utilization (BRU) priority function.

Are you using PVCs and attached volumes interchangeably here? AFAIK, PVCs should first be processed and result in an attached volume before the Pod can be scheduled.

@msau42
Member

msau42 commented Feb 8, 2018

Right now the scheduler does not cache the number of volumes attached per node. The predicate just walks through all Pods and calculates the volume counts per node. In general, there is also a lot of churn in this area:

  • Volume count predicate is being redesigned to handle variable limits per node
  • Most volume plugins are starting to be migrated to the CSI plugin interface, which has a VolumeAttachment object that the scheduler may be able to leverage better. CSI may also be enhanced to better support reporting the node limits.
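To make the "walks through all Pods" point concrete, here is a rough sketch of that kind of per-node count; the function and its dedup-by-claim approach are illustrative, not the predicate's actual code:

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
)

// countUniqueVolumes approximates what the volume-count predicate does today:
// walk every pod already assigned to the node and tally its PVC-backed
// volumes, de-duplicating by claim so two pods sharing a PVC count once.
func countUniqueVolumes(nodeName string, pods []*v1.Pod) int {
	seen := map[string]struct{}{}
	for _, pod := range pods {
		if pod.Spec.NodeName != nodeName {
			continue
		}
		for _, vol := range pod.Spec.Volumes {
			if vol.PersistentVolumeClaim == nil {
				continue
			}
			// Claims are namespaced, so key on namespace/name.
			key := pod.Namespace + "/" + vol.PersistentVolumeClaim.ClaimName
			seen[key] = struct{}{}
		}
	}
	return len(seen)
}
```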

@k82cn
Member

k82cn commented Feb 9, 2018

CSI may also be enhanced to better support reporting the node limits.

+1, it's also better to enhance the scheduler for general use, e.g. via CSI, instead of hardcoding :).

Overall, this looks like a reasonable feature to me :).

k8s-github-robot pushed a commit that referenced this issue Mar 31, 2018
Automatic merge from submit-queue (batch tested with PRs 54997, 61869, 61816, 61909, 60525). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Balanced resource allocation priority to include volume count on nodes.

Scheduler balanced resource allocation priority to include volume count on nodes.

/cc @aveshagarwal @abhgupta




**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #58232


**Release note**:

```release-note
Balanced resource allocation priority in scheduler to include volume count on node 
```