Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Priority & Preemption should be restricted-by-default #65557

Closed
jberkus opened this Issue Jun 27, 2018 · 37 comments

Comments

@jberkus
Copy link

jberkus commented Jun 27, 2018

/kind bug
/area security
/sig scheduling
/priority important-soon

Currently, an admin who has Priority & Preemption enabled (as they are by default) and has ResourceQuotaScopeSelectors disabled (as they are by default) has no way to restrict regular users of Kubernetes from creating as many system-node-critical and system-cluster-critical pods as they care to. This can result in creating a denial-of-service if users abuse priorities in competition with each other.

By 1.12, we could have ResourceQuotaScopeSelectors enabled by default and add a default restriction on the use of system-* priorities. However, that kind of leaves 1.11 users in the lurch. What can we do to restrict the ability of regular users to abuse system-* priorities that can be added to 1.11.1?

/assign @bsalamat
/cc @derekwaynecarr @vikaschoudhary16 @liggitt

@liggitt

This comment has been minimized.

Copy link
Member

liggitt commented Jun 28, 2018

That's also been true of pods in general since 1.0 (unlimited unless a resource quota is in place). The tricky bit here is that this is a change to an existing system by which cluster admins can partition resources. @derekwaynecarr can chime in with the intended approach for making it easy for an admin who wants to limit this to indicate that cluster-wide. There may already be an issue tracking this, but he can confirm

@jberkus

This comment has been minimized.

Copy link
Author

jberkus commented Jun 28, 2018

AFAIK, though, it hasn't been previously possible for two misbehaving users in a preemption race to accidentally shut down kubedns, or similar. That's what I'd like to see some tools to protect against.

@bsalamat

This comment has been minimized.

Copy link
Contributor

bsalamat commented Jun 28, 2018

The risk here is that users may maliciously or inadvertently create a large number of pods at system priority levels that would cause some of the system critical pods running at system-cluster-critical priority to get preempted.

Our plan to prevent the risk was to introduce Pod Priority support in ResourceQuota when moving Pod Priority and Preemption to Beta. The feature was added to ResourceQuota, but unfortunately, it is gated by default in 1.11. Kubernetes users can still enable the scopeSelector feature of ResourceQuota and use it to remove the potential DoS risk.

If users do not want to enable the feature, an alternative approach is to set regular ResourceQuota (without the scopeSelector) to limit users from creating unlimited number of pods in a cluster. This approach is not as effective as using scopeSelector, but still prevents users from creating a large number of pods at system priority that would fill up a cluster and could cause preemption of some system components.

A third alternative is to run all the system components with system-node-critical priority. Many of the system components already run with system-node-critical priority and therefore are not preempted, but there are some system components, namely, kube-dns, that run with system-cluster-critical. Changing their priority to system-node-critical can prevent them from getting preempted. This approach is probably the least effective approach, as other user workloads are still susceptible to preemption by a large number of pods created at system priority levels.

@bsalamat

This comment has been minimized.

Copy link
Contributor

bsalamat commented Jun 28, 2018

What I would suggest to do for 1.11.1 is to change our admission controller to allow the use of system-node-critical and system-cluster-critical priorities only in kube-system namespace. Then our current (1.11) ResourceQuota can be used to limit users from creating pods in kube-system namespace. With that users won't be able to use the two system PriorityClasses and the risk of abusing them would be mitigated.

@derekwaynecarr

This comment has been minimized.

Copy link
Member

derekwaynecarr commented Jun 28, 2018

@bsalamat I discussed this with Avesh and Ravi today, I like the admission controller approach outlined above but ask that we have a mechanism to configure a set of namespace prefixes that are restricted.

For example, kube-, openshift-, vendor-*, etc that may also require use of those priorities but do not collocate in the same namespace as kube-system.

@smarterclayton

This comment has been minimized.

Copy link
Contributor

smarterclayton commented Jun 28, 2018

The feature was gated because it was a new field on a GA API and we do not promote new fields directly to beta on GA APIs. I just want to be clear that the bar for new APIs doesn't get lowered because another thing is going to beta.

The admission controller suggestion is reasonable, although derek's is probably more general.

@aveshagarwal

This comment has been minimized.

Copy link
Member

aveshagarwal commented Jun 28, 2018

If there is an agreement what Derek suggested above, I can work on a PR to priority admission plugin to address the issue. @bsalamat sounds good?

@davidopp

This comment has been minimized.

Copy link
Member

davidopp commented Jun 28, 2018

The admission controller idea is great.

We should update the release note to reflect the change.

@bsalamat

This comment has been minimized.

Copy link
Contributor

bsalamat commented Jun 28, 2018

I am fine with expanding the permitted namespaces from kube-system to a configurable set of namespaces in our priority admission controller.

@bsalamat

This comment has been minimized.

Copy link
Contributor

bsalamat commented Jun 28, 2018

I would like to add that the default permitted namespace of the admission controller should be kube-system. If a configuration is provided, it will override the default and may expand the namespaces.

@liggitt

This comment has been minimized.

Copy link
Member

liggitt commented Jun 28, 2018

I would like to add that the default permitted namespace of the admission controller should be kube-system. If a configuration is provided, it will override the default and may expand the namespaces.

agreed. that maintains parity with the pre-API alpha critical pod annotation use

@aveshagarwal

This comment has been minimized.

Copy link
Member

aveshagarwal commented Jun 28, 2018

So for final plan, as agreed, @bsalamat will send an initial PR to restrict system priority classes to kube-system namespace for 1.11.1. After that I will work on a follow-up PR to make namespace restriction configurable in priority admission plugin.

@jberkus

This comment has been minimized.

Copy link
Author

jberkus commented Jun 28, 2018

@bsalamat please let @foxish know when you'll have this fix, we'll be scheduling 1.11.1 based on it. Thanks!

@bsalamat

This comment has been minimized.

Copy link
Contributor

bsalamat commented Jun 28, 2018

@jberkus @foxish I just sent the above PR against the master branch. I will cherrypick it into 1.11 branch once it is reviewed.

@jberkus

This comment has been minimized.

Copy link
Author

jberkus commented Jun 28, 2018

Cool, thanks!

@derekwaynecarr

This comment has been minimized.

Copy link
Member

derekwaynecarr commented Jun 28, 2018

@jberkus to clarify, please hold 1.11.1 on both PRs
#65557 (comment)

@luxas

This comment has been minimized.

Copy link
Member

luxas commented Jun 29, 2018

Does the proposed solution work when the Priority admission controller is disabled (as it is by default)?
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubeapiserver/options/plugins.go#L131-L139
Should e.g. kubeadm start doing --enable-admission-plugins=Priority in v1.11.1 to fix this until it's enabled by default (or similar) in v1.12?

@k8s-github-robot

This comment has been minimized.

Copy link
Contributor

k8s-github-robot commented Jun 29, 2018

[MILESTONENOTIFIER] Milestone Issue: Up-to-date for process

@bsalamat @jberkus

Issue Labels
  • sig/scheduling: Issue will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move issue out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.
Help
@aveshagarwal

This comment has been minimized.

Copy link
Member

aveshagarwal commented Jun 29, 2018

Does the proposed solution work when the Priority admission controller is disabled (as it is by default)?

No. As P&P is enabled by default, my understanding is that Priority admission plugin should be enabled by default too.

@liggitt

This comment has been minimized.

Copy link
Member

liggitt commented Jun 29, 2018

No. As P&P is enabled by default, my understanding is that Priority admission plugin should be enabled by default too.

Any cluster that specifies --admission-control determines for itself whether the Priority admission plugin is enabled. Any cluster that is using a value from previous releases is unlikely to have that admission plugin enabled.

@aveshagarwal

This comment has been minimized.

Copy link
Member

aveshagarwal commented Jun 29, 2018

Here is a follow up PR: #65643

@luxas

This comment has been minimized.

Copy link
Member

luxas commented Jun 29, 2018

As P&P is enabled by default

Hmm, the admission controller isn't: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubeapiserver/options/plugins.go#L131-L139.
Expected or not?

In this case, kubeadm doesn't use --admission-control but instead relies on the default list.
My question is, would you recommend that the P&P admission controller is enabled in v1.11?

@bsalamat

This comment has been minimized.

Copy link
Contributor

bsalamat commented Jun 30, 2018

Priority admission controller must be enabled by default in 1.11. My understanding was that it was enabled even in 1.10 (I am on mobile and cannot check right now), but the feature gate flag was preventing the admission controller from resolving priority values. In 1.11, the feature gate is enabled and the admission controller resolves priorities.

Regardless, if the admission controller is disabled, priority classes will no be resolved to the integer values. In other words, no pod will have any priority (or you can say all pods will have the same priority). So, the issue that we are discussing here won't exist.

@liggitt

This comment has been minimized.

Copy link
Member

liggitt commented Jun 30, 2018

if the admission controller is disabled, priority classes will no be resolved to the integer values. In other words, no pod will have any priority (or you can say all pods will have the same priority). So, the issue that we are discussing here won't exist.

If the admission plugin is disabled, can't any integer priority value be set?

k8s-github-robot pushed a commit that referenced this issue Jul 2, 2018

Kubernetes Submit Queue
Merge pull request #65593 from bsalamat/priority_admission
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Limit usage of system critical priority classes to the system namespace

**What this PR does / why we need it**:
Changes Priority admission controller to limit usage of system critical priority classes to the system namespace. This change is needed to mitigate the risk of creating many pods at system critical priority levels that could cause preemption of system critical components.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #

ref/ #65557

**Special notes for your reviewer**:

**Release note**:

```release-note
Limit the usage of system-node-critical and system-cluster-critical priority classes to kube-system namespace.
```

/sig scheduling
@bsalamat

This comment has been minimized.

Copy link
Contributor

bsalamat commented Jul 2, 2018

If the admission plugin is disabled, can't any integer priority value be set?

Good point, users would be able to set priority values manually in that case.

@bsalamat

This comment has been minimized.

Copy link
Contributor

bsalamat commented Jul 2, 2018

This is the cherrypick fo the PR into 1.11: #65717

@bsalamat

This comment has been minimized.

Copy link
Contributor

bsalamat commented Jul 2, 2018

Hmm, the admission controller isn't: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubeapiserver/options/plugins.go#L131-L139.
Expected or not?

My understanding is that, these are plugins which are disabled by default and Priority is not among them.

@liggitt

This comment has been minimized.

Copy link
Member

liggitt commented Jul 2, 2018

Any plugin not in the defaultOnPlugins list is disabled by default

@luxas

This comment has been minimized.

Copy link
Member

luxas commented Jul 2, 2018

Which means we probably have a bit broader problem wrt this as it's not enabled by default. Should we do that in v1.11.1 as well if it always was the intention in order to not let the user set any priority value (as the feature gate is enabled)?

@liggitt

This comment has been minimized.

Copy link
Member

liggitt commented Jul 2, 2018

Which means we probably have a bit broader problem wrt this as it's not enabled by default. Should we do that in v1.11.1 as well if it always was the intention in order to not let the user set any priority value (as the feature gate is enabled)?

Yes. #65722

@bsalamat

This comment has been minimized.

Copy link
Contributor

bsalamat commented Jul 2, 2018

Any plugin not in the defaultOnPlugins list is disabled by default

I have not been able to find any list called defaultOnPlugins. Could you please send a link?

@liggitt

This comment has been minimized.

Copy link
Member

liggitt commented Jul 2, 2018

defaultOnPlugins := sets.NewString(
lifecycle.PluginName, //NamespaceLifecycle
limitranger.PluginName, //LimitRanger
serviceaccount.PluginName, //ServiceAccount
setdefault.PluginName, //DefaultStorageClass
resize.PluginName, //PersistentVolumeClaimResize
defaulttolerationseconds.PluginName, //DefaultTolerationSeconds
mutatingwebhook.PluginName, //MutatingAdmissionWebhook
validatingwebhook.PluginName, //ValidatingAdmissionWebhook
resourcequota.PluginName, //ResourceQuota
)

@bsalamat

This comment has been minimized.

Copy link
Contributor

bsalamat commented Jul 2, 2018

Thanks! The function name confused me.

@liggitt

This comment has been minimized.

Copy link
Member

liggitt commented Jul 10, 2018

addresses by #65768 and #65717

/close

@ravisantoshgudimetla

This comment has been minimized.

Copy link
Contributor

ravisantoshgudimetla commented Jul 10, 2018

@liggitt, I think we should let #65643 merge before closing this issue.

@liggitt

This comment has been minimized.

Copy link
Member

liggitt commented Jul 10, 2018

the critical issue was protecting the feature. #65768 and #65717 do that effectively. if we want configuration to let it be re-expanded in a controlled way, that's fine, but the critical issue is resolved

@ravisantoshgudimetla

This comment has been minimized.

Copy link
Contributor

ravisantoshgudimetla commented Jul 10, 2018

Ok, I have created another issue to track the enhancement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.