
Considering PVC as a resource for balanced resource utilization in the scheduler #58232

Closed
abhgupta opened this issue Jan 12, 2018 · 14 comments · Fixed by #60525
Labels
kind/feature · sig/scheduling · sig/storage

Comments

@abhgupta

Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature

What happened:
The scheduler currently does not have a priority function to properly spread out pods that request PVCs. In the absence of such a priority function, scheduling of pods with PVCs could be skewed resulting in:

  • some nodes exhaust their allocatable PVC limits, while still having enough schedulable cpu/memory resources
  • some nodes have enough available PVC mount points, but have exhausted their schedulable cpu/memory resources

If the above happens, we get into a state of inefficient node utilization.

What you expected to happen:
PVCs should be considered as a resource within the balanced resource utilization (BRU) priority function. Instead of the current algorithm in the BRU priority function, we could use the standard deviation, which allows more than two resources to be balanced across nodes. The inputs to the standard deviation calculation would be the scheduled-to-capacity ratios for each resource (memory, CPU, and PVC).

Alternatively, a separate priority function could be introduced just for PVCs - whichever makes more sense.
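As a rough worked example (hypothetical numbers): if node A's memory, CPU, and PVC ratios are 0.5, 0.5, and 0.5 while node B's are 0.9, 0.2, and 0.4, the standard deviation is 0 for A and roughly 0.29 for B, so A would be preferred as the more balanced node.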

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:
The PVC resource should also be considered in the "least requested" priority function.

Environment:

  • Kubernetes version (use kubectl version): Any
  • Cloud provider or hardware configuration: AWS/Azure/GCP
  • OS (e.g. from /etc/os-release): Any
  • Kernel (e.g. uname -a): Any
  • Install tools: N/A
  • Others:
@k8s-ci-robot added the needs-sig and kind/feature labels on Jan 12, 2018
@abhgupta
Author

cc @derekwaynecarr @aveshagarwal

@abhgupta
Author

/sig scheduling

@k8s-ci-robot added the sig/scheduling label and removed the needs-sig label on Jan 12, 2018
@resouer
Contributor

resouer commented Jan 18, 2018

But a PVC is not bound to a node, so I'm not sure how the spreading here would work. @abhgupta, mind explaining a little bit?

@timothysc
Member

timothysc commented Jan 18, 2018

@abhgupta are you just looking for count-leveling, so that distributions of PVCs are not lumpy? If so, that seems pretty trivial.

@k82cn
Member

k82cn commented Jan 22, 2018

some nodes exhaust their allocatable PVC limits

@abhgupta, any more detail about PVC limits? My understanding is that PVCs are NOT ALWAYS tied to a node.

@ravisantoshgudimetla
Contributor

ravisantoshgudimetla commented Feb 8, 2018

@k82cn - IIUC, what @abhgupta wants is a node with balanced CPU, memory, and number of volumes attached.
I think we can modify the balanced resource allocation priority to compute the standard deviation or variance of (CPU fraction, memory fraction, (volumes currently mounted on the node + requested) / max volumes allowed) for every node. The node with the lowest standard deviation gets the highest priority.
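A minimal sketch of that scoring idea, assuming the three fractions have already been computed for a node; the function names and the 0-10 score range are illustrative, not the scheduler's actual priority code:

```go
package main

import (
	"fmt"
	"math"
)

// balancedScore sketches the proposed priority: take the per-resource
// utilization fractions (requested/allocatable) for a node, compute their
// standard deviation, and map a lower deviation to a higher score
// (0..10, matching the scheduler's priority range at the time).
func balancedScore(fractions ...float64) int {
	if len(fractions) == 0 {
		return 0
	}
	mean := 0.0
	for _, f := range fractions {
		mean += f
	}
	mean /= float64(len(fractions))

	variance := 0.0
	for _, f := range fractions {
		variance += (f - mean) * (f - mean)
	}
	variance /= float64(len(fractions))

	// Fractions are in [0,1], so the standard deviation stays well below 1;
	// inverting it gives the highest score to the most balanced node.
	return int((1 - math.Sqrt(variance)) * 10)
}

func main() {
	// Hypothetical node: 60% of CPU requested, 55% of memory, 5 of 16 volumes
	// in use plus 1 requested by the incoming pod.
	cpuFraction, memFraction := 0.60, 0.55
	volumeFraction := float64(5+1) / float64(16)
	fmt.Println("score:", balancedScore(cpuFraction, memFraction, volumeFraction))
}
```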

@k82cn
Member

k82cn commented Feb 8, 2018

number of volumes attached.

Any impact if we have a bigger number: performance, or max mount points? If it's max mount points, that raises the interesting question of how many pods can run on each node :)

Another case in my mind is the DFS (distributed file system) client issue: if many Pods read/write data through a DFS mount point, the connection may become a bottleneck.

No objection, I'd just like to know the details of the use case before we build a priority for it :).

@k82cn
Member

k82cn commented Feb 8, 2018

/cc @bsalamat @kubernetes/sig-storage-feature-requests

@k8s-ci-robot added the sig/storage label on Feb 8, 2018
@abhgupta
Author

abhgupta commented Feb 8, 2018

@ravisantoshgudimetla Thanks! That exactly captures the gist of my proposal. We could ignore PVs (for storage classes) that do not have an upper limit; such PVs could simply be handled by the "least requested" priority function.

If I understand what @k82cn was highlighting, there are multiple scheduling issues with PVCs:

  1. Imbalance in the number of PVs mounted (by pods) on each node
  • This is the low-hanging fruit that I wish to target.
  2. Imbalance in the network I/O for data reads/writes from all pods on a node to PVs
  • This is an important issue, but it is more complicated and I wouldn't want to club it into this issue.

@msau42
Member

msau42 commented Feb 8, 2018

This is also related to #24317. Bigger nodes can have higher limits on the max number of volumes attached. An even count spreading assumes that all nodes are the same size.

@ravisantoshgudimetla
Contributor

ravisantoshgudimetla commented Feb 8, 2018

@k82cn

Any impact if we have a bigger number, performance or max mount points ? If max mount points, that's interesting point that how many pods will run in each node :)

I think as of now we are limited to 39 volumes for AWS and 16 for GCE, which are hardcoded from the scheduling perspective (users can override them via an environment variable), so the numbers aren't that high. I can create a benchmark to test this.
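For illustration, a sketch of how such a per-provider cap with an environment-variable override can be resolved; the 39/16 defaults and the KUBE_MAX_PD_VOLS variable are the ones referenced here, but treat the exact names and structure as an assumption rather than the predicate's actual code:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Defaults discussed in this thread; the authoritative values live in the
// scheduler's volume-count predicate.
const (
	defaultMaxEBSVolumes   = 39 // AWS
	defaultMaxGCEPDVolumes = 16 // GCE
)

// maxVolumes returns the per-node volume cap for a provider, honoring the
// environment-variable override mentioned above (KUBE_MAX_PD_VOLS; assumed
// name here, check the scheduler source for the authoritative one).
func maxVolumes(provider string) int {
	if v := os.Getenv("KUBE_MAX_PD_VOLS"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	switch provider {
	case "aws":
		return defaultMaxEBSVolumes
	case "gce":
		return defaultMaxGCEPDVolumes
	default:
		// No known cap for other providers in this sketch.
		return defaultMaxGCEPDVolumes
	}
}

func main() {
	fmt.Println("aws cap:", maxVolumes("aws"))
	fmt.Println("gce cap:", maxVolumes("gce"))
}
```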

@msau42 - Thanks for pointing to the issue where node sizes vary. I think as long as I rely on the value that comes from the ConfigMap (#53461 (comment)), we should be good to go.

Bigger nodes can have higher limits on the max number of volumes attached. An even count spreading assumes that all nodes are the same size.

I think this statement holds for memory and CPU as well. AFAIU, a balanced node is one whose memory utilization (requested/capacity) and CPU utilization are close to each other, meaning the variance is low irrespective of node size. Won't the same be true for volumes attached? Or is your statement based on not using the ConfigMap, which carries the cloud-provider-specific value?

@bsalamat
Member

bsalamat commented Feb 8, 2018

PVCs should be considered as a resource within the balanced resource utilization (BRU) priority function.

Are you using PVCs and attached volumes interchangeably here? AFAIK, PVCs should first be processed and result in an attached volume before the Pod can be scheduled.

@msau42
Member

msau42 commented Feb 8, 2018

Right now the scheduler does not cache the number of volumes attached per node. The predicate just walks through all Pods and calculates the volume counts per node. In general, there is also a lot of churn in this area:

  • Volume count predicate is being redesigned to handle variable limits per node
  • Most volume plugins are starting to be migrated to the CSI plugin interface, which has a VolumeAttachment object that the scheduler may be able to leverage better. CSI may also be enhanced to better support reporting the node limits.
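To make the "walks through all Pods" point concrete, here is a rough sketch of that kind of per-node count; the function and its dedup-by-claim approach are illustrative, not the predicate's actual code:

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
)

// countUniqueVolumes approximates what the volume-count predicate does today:
// walk every pod already assigned to the node and tally its PVC-backed
// volumes, de-duplicating by claim so two pods sharing a PVC count once.
func countUniqueVolumes(nodeName string, pods []*v1.Pod) int {
	seen := map[string]struct{}{}
	for _, pod := range pods {
		if pod.Spec.NodeName != nodeName {
			continue
		}
		for _, vol := range pod.Spec.Volumes {
			if vol.PersistentVolumeClaim == nil {
				continue
			}
			// Claims are namespaced, so key on namespace/name.
			key := pod.Namespace + "/" + vol.PersistentVolumeClaim.ClaimName
			seen[key] = struct{}{}
		}
	}
	return len(seen)
}
```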

@k82cn
Member

k82cn commented Feb 9, 2018

CSI may also be enhanced to better support reporting the node limits.

+1, it's also better to enhance the scheduler for general use, e.g. via CSI, instead of hardcoding :).

Overall, this looks like a reasonable feature to me :).

k8s-github-robot pushed a commit that referenced this issue Mar 31, 2018
Automatic merge from submit-queue (batch tested with PRs 54997, 61869, 61816, 61909, 60525). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Balanced resource allocation priority to include volume count on nodes.

Scheduler balanced resource allocation priority to include volume count on nodes.

/cc @aveshagarwal @abhgupta




**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #58232


**Release note**:

```release-note
Balanced resource allocation priority in scheduler to include volume count on node 
```