
Volume attach limit doesn't work for WaitForFirstConsumer Volume binding mode #73724

Closed
sarjeet2013 opened this issue Feb 4, 2019 · 7 comments · Fixed by #73863
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@sarjeet2013 commented Feb 4, 2019

What happened:

The WaitForFirstConsumer volume binding mode doesn't take the per-node volume attach limit into account.

What you expected to happen:

The WaitForFirstConsumer volume binding mode should respect the attach limit, just as the Immediate binding mode does.

How to reproduce it (as minimally and precisely as possible):

In the CSI driver, configure the volume attach limit to 1 per node on a 3-node cluster, then create 5 pods, each with its own PV/PVC (a sketch of the objects involved follows the results below).

  • With Immediate volume binding mode, only 3 pods with 3 PV/PVCs run.
  • With WaitForFirstConsumer binding mode, all 5 pods with 5 PV/PVCs run on the 3-node cluster. (Specifically, 4 pods were scheduled on one node and 1 on another; the third node had no pods scheduled.)
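
For concreteness, here is a minimal sketch of the objects involved, written with the k8s.io/api Go types of that era; the driver name example.csi.driver.io and the object names are placeholders for whatever the actual setup used.

```go
// Sketch only: a delayed-binding StorageClass for a CSI driver that reports
// an attach limit of 1 per node, plus one of the PVCs used by the test pods.
// With 5 such PVC/pod pairs on a 3-node cluster, at most 3 pods should run
// once the limit is respected.
package repro

import (
	corev1 "k8s.io/api/core/v1"
	storagev1 "k8s.io/api/storage/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func reproObjects() (*storagev1.StorageClass, *corev1.PersistentVolumeClaim) {
	wffc := storagev1.VolumeBindingWaitForFirstConsumer

	sc := &storagev1.StorageClass{
		ObjectMeta:        metav1.ObjectMeta{Name: "csi-delayed"},
		Provisioner:       "example.csi.driver.io", // placeholder CSI driver name
		VolumeBindingMode: &wffc,                   // WaitForFirstConsumer
	}

	scName := sc.Name
	pvc := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "data-0"},
		Spec: corev1.PersistentVolumeClaimSpec{
			StorageClassName: &scName,
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			Resources: corev1.ResourceRequirements{ // PVC resources field as in the 1.13-era API
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse("1Gi"),
				},
			},
		},
	}
	return sc, pvc
}
```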

Anything else we need to know?:

I was able to reproduce it with a minimal node count and configuration, and also hit this in a stress case with a volume limit of 100 and >500 pods running on a 5-node cluster.

Environment:

  • Kubernetes version (use kubectl version):

➜ ~ kubectl version
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-30T21:39:38Z", GoVersion:"go1.11.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.0", GitCommit:"ddf47ac13c1a9483ea035a79cd7c10005ff21a6d", GitTreeState:"clean", BuildDate:"2018-12-03T20:56:12Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration: On-prem Linux server

  • OS (e.g. from /etc/os-release): CentOS 7

  • Kernel (e.g. uname -a): 3.10.0

  • Install tools:

  • Others:

@sarjeet2013 sarjeet2013 added the kind/bug Categorizes issue or PR as related to a bug. label Feb 4, 2019
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Feb 4, 2019
@sarjeet2013 (Author)

cc @msau42

@sarjeet2013 (Author)

/sig storage

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 4, 2019
@msau42 (Member) commented Feb 4, 2019

/assign @gnufied

Currently the predicate skips counting PVCs that are not bound. It should look up the StorageClass to get the provisioner/driver name.

This solution will require a StorageClass object even if you don't use dynamic provisioning.
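
Roughly, that lookup could be sketched as below; the function name driverNameForPVC and the lister parameters are illustrative, not the actual scheduler predicate code.

```go
// Sketch only: resolve which CSI driver a PVC's volume belongs to. For a
// bound PVC the driver comes from the PV's CSI source; for an unbound PVC
// (e.g. WaitForFirstConsumer binding still in progress) fall back to the
// StorageClass provisioner name, as suggested above.
package sketch

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	corelisters "k8s.io/client-go/listers/core/v1"
	storagelisters "k8s.io/client-go/listers/storage/v1"
)

func driverNameForPVC(pvc *v1.PersistentVolumeClaim,
	pvLister corelisters.PersistentVolumeLister,
	scLister storagelisters.StorageClassLister) (string, error) {

	// Bound PVC: the PV already records the driver.
	if pvc.Spec.VolumeName != "" {
		pv, err := pvLister.Get(pvc.Spec.VolumeName)
		if err != nil {
			return "", err
		}
		if pv.Spec.CSI == nil {
			return "", fmt.Errorf("PV %s is not a CSI volume", pv.Name)
		}
		return pv.Spec.CSI.Driver, nil
	}

	// Unbound PVC: use the StorageClass provisioner as the driver name.
	scName := pvc.Spec.StorageClassName
	if scName == nil || *scName == "" {
		return "", fmt.Errorf("PVC %s/%s has no StorageClass", pvc.Namespace, pvc.Name)
	}
	sc, err := scLister.Get(*scName)
	if err != nil {
		return "", err // e.g. StorageClass missing; see the discussion below
	}
	return sc.Provisioner, nil
}
```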

@msau42 (Member) commented Feb 5, 2019

I initially thought it wouldn't be so simple to count unbound PVCs of other pods, since binding is still in progress, but I think it's actually fine. We only count other pods that have been assigned to the same node. Those pods are assigned to the node AND still unbound for one of these reasons:

  • The pod has actually been scheduled to the node and someone manually came in and unbound the PVC: unlikely, and an unsupported sequence since we implemented StorageProtection
  • The Pod was initially created with nodeName already set, bypassing the scheduler
  • The Pod is going to be scheduled (it's been assumed), and it's currently going through binding. This is normal with delayed binding.

Either way, the fact that the Pod's nodeName is set to this node means that the pod should count towards the node's limit.
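
A sketch of the counting this implies (names are illustrative, and resolveDriver stands in for a lookup like the one sketched earlier): every pod whose spec.nodeName is already set to this node contributes its PVC-backed volumes for the driver in question, whether or not those PVCs have finished binding, and a volume shared by several pods is only counted once.

```go
// Sketch only: count the distinct PVC-backed volumes already "spoken for"
// on a node for a given driver. nodePods are the pods whose spec.nodeName
// is set to this node (scheduled or assumed); resolveDriver returns
// ok=false when the driver cannot be determined.
package sketch

import v1 "k8s.io/api/core/v1"

func volumesUsedOnNode(nodePods []*v1.Pod, driver string,
	resolveDriver func(namespace, claimName string) (string, bool)) int {

	seen := map[string]bool{}
	for _, pod := range nodePods {
		for _, vol := range pod.Spec.Volumes {
			if vol.PersistentVolumeClaim == nil {
				continue // only PVC-backed volumes count against the limit
			}
			claim := vol.PersistentVolumeClaim.ClaimName
			name, ok := resolveDriver(pod.Namespace, claim)
			if !ok || name != driver {
				continue
			}
			// Key by namespace/claim so a volume shared by several pods on
			// the same node is only counted once.
			seen[pod.Namespace+"/"+claim] = true
		}
	}
	return len(seen)
}
```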

@gnufied (Member) commented Feb 8, 2019

Hmm, so after opening the PR that fixes this, I realize that from the StorageClass we don't really know whether the volume we are evaluating is a CSI volume or not. I think a similar problem exists for the existing predicates, though. For example, the AzureDisk predicate does not know whether the PVC it is over-counting is of the AzureDisk type or not.

So, I suspect that those 4 hardcoded predicates are counting all delayed-binding volumes.

@msau42 (Member) commented Feb 8, 2019

if the storageclass.provisioner name is not found, can we just ignore it?

@gnufied (Member) commented Feb 8, 2019

I think we can ignore it. The PR currently does that.
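
To make "ignore it" concrete, here is a sketch of the call-site behavior, reusing the hypothetical driverNameForPVC helper from the earlier sketch: any failure to resolve the provisioner/driver simply means the volume is not counted.

```go
// Sketch only: how "just ignore it" looks at the call site. Any lookup
// failure (PVC not found, empty storageClassName, StorageClass deleted, ...)
// means the volume is skipped rather than counted or treated as an error.
package sketch

import (
	corelisters "k8s.io/client-go/listers/core/v1"
	storagelisters "k8s.io/client-go/listers/storage/v1"
)

func resolverIgnoringUnknown(pvcLister corelisters.PersistentVolumeClaimLister,
	pvLister corelisters.PersistentVolumeLister,
	scLister storagelisters.StorageClassLister) func(namespace, claimName string) (string, bool) {

	return func(namespace, claimName string) (string, bool) {
		pvc, err := pvcLister.PersistentVolumeClaims(namespace).Get(claimName)
		if err != nil {
			return "", false // PVC not found: ignore
		}
		driver, err := driverNameForPVC(pvc, pvLister, scLister)
		if err != nil {
			return "", false // provisioner/driver unknown: ignore
		}
		return driver, true
	}
}
```

The effect, as the exchange above suggests, is to err on the side of not counting a volume rather than failing scheduling for a claim whose driver cannot be determined.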
