
Add an optional flag to kubelet for volumes outside of kubelet control #81805

Conversation

@matthias50 (Contributor) commented Aug 22, 2019

Allows kubelet to account for volumes, other than root, which
exist on the node when reporting number of allowable volume mounts.

What type of PR is this?

/kind bug

What this PR does / why we need it:
Allows a cluster operator to inform kubelet how many volumes other than the root volume exist on a node. This covers use cases such as creating a separate volume for container storage. Without this information, kubelet reports more volume mounts as available than are actually possible, since it assumes the only volume in use is the root volume.
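
To illustrate the idea (a minimal sketch, not the actual diff in this PR; the helper name is made up), the adjustment amounts to subtracting the operator-declared count from whatever limit the volume plugin reports:

// adjustedAttachLimit sketches the effect of --num-non-kube-volumes: the
// operator-declared count of volumes mounted outside kubelet's control is
// subtracted from the plugin-reported attach limit, clamped at zero.
func adjustedAttachLimit(pluginLimit, numNonKubeVolumes int64) int64 {
	if numNonKubeVolumes >= pluginLimit {
		return 0
	}
	return pluginLimit - numNonKubeVolumes
}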

Which issue(s) this PR fixes:
Fixes #81533

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Add an optional kubelet flag, --num-non-kube-volumes, which allows one to specify the number of volumes, other than root, that are mounted to the node outside of kubelet's control. This information allows kubelet to accurately report the number of available volume mounts.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

@k8s-ci-robot (Contributor)

Welcome @matthias50!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the release-note, kind/bug, cncf-cla: yes, size/M, needs-sig, and needs-priority labels on Aug 22, 2019
@k8s-ci-robot (Contributor)

Hi @matthias50. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-ok-to-test label on Aug 22, 2019
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: matthias50
To complete the pull request process, please assign derekwaynecarr
You can assign the PR to them by writing /assign @derekwaynecarr in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the area/kubelet and sig/node labels and removed the needs-sig label on Aug 22, 2019
@matthias50 (Contributor, Author)

/sig storage

@k8s-ci-robot added the sig/storage label on Aug 22, 2019
@matthias50 (Contributor, Author)

/assign @derekwaynecarr

@dashpole (Contributor)

/ok-to-test

Is the limit you are hitting specific to EBS volumes, or is it a limit on the number of mounts?

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label on Aug 22, 2019
@matthias50 (Contributor, Author)

/ok-to-test

Is the limit you are hitting specific to EBS volumes, or is it a limit on the number of mounts?

Correct, the number of EBS volumes.

Allows kubelet to account for volumes, other than root, which
exist on the node when reporting number of allowable volume mounts.
@matthias50 force-pushed the matthias50/add-kubelet-vols-flag branch from 71b2431 to b8c0c79 on August 22, 2019 20:59
@matthias50 (Contributor, Author)

/test pull-kubernetes-bazel-test

@dashpole (Contributor)

You didn't answer my question. Is the limit specific to EBS, or is it generic to all volume types?

@matthias50 (Contributor, Author)

You didn't answer my question. Is the limit specific to EBS, or is it generic to all volume types?

Sorry, my particular use case is EBS, but the idea of a limit is generic to all volume types. Regardless of the volume type, it is possible that some volume or volumes have been mounted to the node by a node bootstrap process that kubelet doesn't know anything about. This PR allows one to tell kubelet about this fact so it reports the limit correctly.

@gnufied (Member) commented Aug 22, 2019

There are a couple of issues with this proposal.

a. It does not work for CSI volumes or when in-tree to CSI migration is enabled for EBS.
b. For CSI volumes in general, the plan is to override the limits in the drivers themselves rather than in kubelet.

But this might be an okay "in the meanwhile" solution. cc @msau42 @leakingtapan @justinsb

@dashpole (Contributor)

Thanks @gnufied. I suspected this is an issue we would want to solve per-CSI-driver...

@msau42 (Member) commented Aug 22, 2019

@kubernetes/sig-storage-pr-reviews

@leakingtapan commented Aug 23, 2019

Yep. The EBS CSI driver's static volume limit is implemented here. And here is a similar request to make it configurable.

But even this approach doesn't solve 100% of the attach limit issue on EC2 instances. See #80967. So which problem are you facing and trying to solve?

@matthias50 (Contributor, Author)

But even this approach doesn't solve 100% of the attach limit issue on EC2 instances. See #80967. So which problem are you facing and trying to solve?

From my perspective, it allows for solving 100% of the attach limit issues for EC2. At least it solves 100% of the problems we are facing :). For background, our clusters have a heterogeneous mix of instances: non-Nitro instances, and Nitro instances with and without instance storage. Here is what we want to have happen and how we will get it with this PR in place:

Non-Nitro: 2 EBS volumes, root and container storage. Pass --num-non-kube-volumes=1 so there can be a max of 38 volumes for PVCs.
Nitro w/o instance storage: 2 EBS volumes, root and container storage. Pass --num-non-kube-volumes=1 so there can be a max of 24 volumes for PVCs.
Nitro w/ instance storage: 1 EBS volume, root. Container storage is on instance storage. Pass --num-non-kube-volumes=0 (or don't pass anything, as this is the default), so there are 25 volumes for PVCs.

As I see it, it also allows for solving #80967 by letting tooling outside of Kubernetes account for the presence or absence of non-root volumes. That tooling has to be aware of them anyway, because it created the volumes in the first place, so it is in a good position to tell kubelet about the extra volumes if they exist.

This should also allow operators on other clouds, such as Azure, to handle the same issue in the same way: by passing a value for --num-non-kube-volumes that corresponds to the number of volumes attached to the node by their bootstrap process.

@matthias50 (Contributor, Author)

Yep. The EBS CSI driver's static volume limit is implemented here. And here is a similar request to make it configurable.

But even this approach doesn't solve 100% of the attach limit issue on EC2 instances. See #80967. So which problem are you facing and trying to solve?

So I took another look at #80967, and the earlier proposal I made doesn't address that issue. I think, as a short-term solution, instead of taking some number of volumes as a flag to kubelet and subtracting, we could take a kubelet-level equivalent of KUBE_MAX_PD_VOLS, which would act as an override if set. Then,

func VolumeLimits(volumePluginListFunc func() []volume.VolumePluginWithAttachLimits, // typically Kubelet.volumePluginMgr.ListVolumePluginWithLimits
becomes something like this:

// VolumeLimits returns a Setter that updates the volume limits on the node.
func VolumeLimits(volumePluginListFunc func() []volume.VolumePluginWithAttachLimits, // typically Kubelet.volumePluginMgr.ListVolumePluginWithLimits
	kubeNodeMaxPDVols int64,
) Setter {
	return func(node *v1.Node) error {
		if node.Status.Capacity == nil {
			node.Status.Capacity = v1.ResourceList{}
		}
		if node.Status.Allocatable == nil {
			node.Status.Allocatable = v1.ResourceList{}
		}

		pluginWithLimits := volumePluginListFunc()
		for _, volumePlugin := range pluginWithLimits {
			attachLimits, err := volumePlugin.GetVolumeLimits()
			if err != nil {
				klog.V(4).Infof("Error getting volume limit for plugin %s", volumePlugin.GetPluginName())
				continue
			}
			for limitKey, value := range attachLimits {
				// kubeNodeMaxPDVols == -1 means no override was set; any other
				// value replaces the limit reported by the plugin.
				if kubeNodeMaxPDVols != -1 {
					value = kubeNodeMaxPDVols
				}
				node.Status.Capacity[v1.ResourceName(limitKey)] = *resource.NewQuantity(value, resource.DecimalSI)
				node.Status.Allocatable[v1.ResourceName(limitKey)] = *resource.NewQuantity(value, resource.DecimalSI)
			}
		}
		return nil
	}
}

Could something like this be an okay "in the meanwhile" solution?
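
One way the kubelet-side override could be read, assuming it mirrors the scheduler's KUBE_MAX_PD_VOLS environment variable and uses the same -1 "unset" sentinel as the snippet above (sketch only; the helper name is made up):

import (
	"os"
	"strconv"

	"k8s.io/klog"
)

// kubeNodeMaxPDVolsFromEnv returns the override to feed into VolumeLimits,
// or -1 when KUBE_MAX_PD_VOLS is unset or invalid.
func kubeNodeMaxPDVolsFromEnv() int64 {
	raw, ok := os.LookupEnv("KUBE_MAX_PD_VOLS")
	if !ok {
		return -1
	}
	n, err := strconv.ParseInt(raw, 10, 64)
	if err != nil || n <= 0 {
		klog.Warningf("ignoring invalid KUBE_MAX_PD_VOLS value %q", raw)
		return -1
	}
	return n
}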

@gnufied (Member) commented Aug 28, 2019

Any solution we propose must work for both CSI and in-tree volumes. At this point, I am afraid we can't have a solution that works for one but not for the other.

So, assuming this is implemented for both CSI and in-tree volumes, and assuming the volume type supports such limits, will this create problems if a user is using two different types of volumes on the node? Usually each plugin can freely specify its own limits, but an environment variable like KUBE_MAX_PD_VOLS in kubelet would override the limits for all volume types. Is that desirable? My guess is that even an "in the meanwhile" solution has to be supported for the rest of the k8s 1.x lifecycle, so we have to be careful.

@matthias50 (Contributor, Author)

Any solution we propose must work for both CSI and in-tree volumes. At this point, I am afraid we can't have a solution that works for one but not for the other.

So, assuming this is implemented for both CSI and in-tree volumes, and assuming the volume type supports such limits, will this create problems if a user is using two different types of volumes on the node? Usually each plugin can freely specify its own limits, but an environment variable like KUBE_MAX_PD_VOLS in kubelet would override the limits for all volume types. Is that desirable? My guess is that even an "in the meanwhile" solution has to be supported for the rest of the k8s 1.x lifecycle, so we have to be careful.

Got it. What about passing kubelet a list of mappings from each volume type's limit key to a value? For instance, we could pass a comma-separated list such as:

--kube-node-max-pd-vols-map=attachable-volumes-aws-ebs=38,attachable-volumes-some-other-type=15

This worms its way through Kubelet till we end up with:

// VolumeLimits returns a Setter that updates the volume limits on the node.
func VolumeLimits(volumePluginListFunc func() []volume.VolumePluginWithAttachLimits, // typically Kubelet.volumePluginMgr.ListVolumePluginWithLimits
	kubeNodeMaxPDVolsMap map[string]int64,
) Setter {
	return func(node *v1.Node) error {
		if node.Status.Capacity == nil {
			node.Status.Capacity = v1.ResourceList{}
		}
		if node.Status.Allocatable == nil {
			node.Status.Allocatable = v1.ResourceList{}
		}

		pluginWithLimits := volumePluginListFunc()
		for _, volumePlugin := range pluginWithLimits {
			attachLimits, err := volumePlugin.GetVolumeLimits()
			if err != nil {
				klog.V(4).Infof("Error getting volume limit for plugin %s", volumePlugin.GetPluginName())
				continue
			}
			for limitKey, value := range attachLimits {
				// Apply the operator-supplied override for this limit key
				// (e.g. attachable-volumes-aws-ebs), if one was provided.
				override, overrideExists := kubeNodeMaxPDVolsMap[limitKey]
				if overrideExists {
					value = override
				}
				node.Status.Capacity[v1.ResourceName(limitKey)] = *resource.NewQuantity(value, resource.DecimalSI)
				node.Status.Allocatable[v1.ResourceName(limitKey)] = *resource.NewQuantity(value, resource.DecimalSI)
			}
		}
		return nil
	}
}
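
For reference, parsing the proposed comma-separated flag value into that map could look roughly like this (a sketch; the helper name and error handling are illustrative, not part of this PR):

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMaxPDVolsMap turns a value like
// "attachable-volumes-aws-ebs=38,attachable-volumes-some-other-type=15"
// into the kubeNodeMaxPDVolsMap used above.
func parseMaxPDVolsMap(raw string) (map[string]int64, error) {
	out := map[string]int64{}
	if raw == "" {
		return out, nil
	}
	for _, pair := range strings.Split(raw, ",") {
		kv := strings.SplitN(pair, "=", 2)
		if len(kv) != 2 {
			return nil, fmt.Errorf("expected key=value, got %q", pair)
		}
		limit, err := strconv.ParseInt(strings.TrimSpace(kv[1]), 10, 64)
		if err != nil {
			return nil, fmt.Errorf("invalid limit for %q: %v", kv[0], err)
		}
		out[strings.TrimSpace(kv[0])] = limit
	}
	return out, nil
}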

I tried this out, and for the in-tree AWS plugin it works. As for the CSI case, it seems that would require a similar but different solution: somehow we need to provide this information to the DaemonSet pod running on each of the nodes. As @gnufied mentioned in kubernetes-sigs/aws-ebs-csi-driver#347, perhaps this could be solved by having multiple DaemonSets with different command line arguments.

Regardless of the approach for CSI, that seems like a totally separate case. Can we instead treat these as separate problems and use something like this for in-tree volume drivers?

@matthias50 (Contributor, Author)

@gnufied This issue is causing us pain: we have to do custom builds of k8s components every time we update Kubernetes to keep from running into production issues where too many volumes are mounted to nodes. Can I get some feedback on the above approach of separating this into an in-tree solution and an out-of-tree solution? Perhaps we could bring some unity here by using a similarly named flag for the CSI provider, as you suggested here: kubernetes-sigs/aws-ebs-csi-driver#347

I'm open to other approaches if someone has a better idea, but right now we seem to be stuck doing local builds of k8s binaries every time we want to update to a new version of Kubernetes.

@rmt commented Oct 29, 2019

@gnufied, @leakingtapan, is there a path forward here, yet?

I also consider KUBE_MAX_PD_VOLS (as it stands today) to be an interim hack. If no "proper" solution is imminent, then I would consider extending KUBE_MAX_PD_VOLS to kubelet (and not just kube-controller-manager) a small, reasonable improvement on it.

@bertinatto (Member)

Just FYI, once CSI migration is GA (it's beta now, but disabled by default), the CSI driver will be the source for the volume limits value.

I updated the CSI docs to recommend that CSI drivers allow customization of volume limits: https://kubernetes-csi.github.io/docs/volume-limits.html
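
For context, the CSI-side mechanism those docs describe is the driver reporting max_volumes_per_node from NodeGetInfo, which kubelet then publishes as the node's limit. A driver that makes the limit operator-configurable might look roughly like this sketch (the nodeServer type, its fields, and the default value are illustrative, not any particular driver's code):

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// nodeServer sketches a CSI node service whose advertised attach limit can
// be overridden by the operator (for example via a driver command-line flag).
type nodeServer struct {
	nodeID            string
	volumeAttachLimit int64 // 0 means: fall back to the driver's computed default
}

// NodeGetInfo is where a CSI driver reports its per-node volume limit;
// kubelet copies MaxVolumesPerNode into the node's allocatable limits.
func (ns *nodeServer) NodeGetInfo(ctx context.Context, req *csi.NodeGetInfoRequest) (*csi.NodeGetInfoResponse, error) {
	limit := ns.volumeAttachLimit
	if limit == 0 {
		limit = 25 // placeholder default; a real driver derives this from the instance type
	}
	return &csi.NodeGetInfoResponse{
		NodeId:            ns.nodeID,
		MaxVolumesPerNode: limit,
	}, nil
}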

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale and needs-rebase labels on Feb 25, 2020
@k8s-ci-robot (Contributor)

@matthias50: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Mar 26, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
area/kubelet, cncf-cla: yes, kind/bug, lifecycle/rotten, needs-priority, needs-rebase, ok-to-test, release-note, sig/node, sig/storage, size/M

Successfully merging this pull request may close these issues.

Too many EBS volumes mounted to a node
9 participants