Automatically add system critical priority classes at cluster bootstrapping #60519

bsalamat · 2018-02-27T19:46:15Z

What this PR does / why we need it:
We had two PriorityClasses that were hardcoded and special-cased in our code base. These two priority classes never existed in the API server; the Priority admission controller had code to resolve these two names. This PR removes the hardcoded PriorityClasses and adds code to create them automatically when the API server starts.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #60178

ref/ #57471

Special notes for your reviewer:

Release note:

Automatically add system critical priority classes at cluster bootstrapping.

/sig scheduling

bsalamat · 2018-02-27T19:52:58Z

cc/ @liggitt @derekwaynecarr @vikaschoudhary16 @ravisantoshgudimetla

liggitt · 2018-02-28T03:03:56Z

pkg/kubelet/types/pod_update.go

@@ -168,7 +168,7 @@ func IsCriticalPodBasedOnPriority(ns string, priority int32) bool {
 	if ns != kubeapi.NamespaceSystem {


unrelated to this PR, but critical pods based on priority should not be limited to the kube-system namespace... that namespace should not have any special significance as far as the API is concerned.

As you said, that's unrelated to this PR, but the K8s docs clearly state that critical pods can exist only in the system namespace.

K8s docs clearly state that critical pods can exist only in the system namespace

that is left over from critical pods being determined from the uncontrollable pod annotation. that limitation should not apply to the priority field

I will file a separate issue for this.

filed: #60596

liggitt · 2018-02-28T05:09:58Z

plugin/pkg/admission/priority/admission.go

-		return admission.NewForbidden(a, fmt.Errorf("the name of the priority class is a reserved name for system use only: %v", pc.Name))
+	// API server adds system critical priority classes at bootstrapping. We should
+	// not enforce restrictions on adding system level priority classes for API server.
+	if userInfo := a.GetUserInfo(); userInfo == nil || userInfo.GetName() != user.APIServerUser {


this isn't an appropriate use of this username. if you want to require additional authorization to set certain priority classes, pass in an authorizer and perform the authorization check (see PodSecurityPolicyPlugin for an example of obtaining and using the authorizer). The API server always has superuser permissions when making API calls to itself, and will pass any authorization check, but cluster-admins should be able to modify these objects as well.

I also find it odd to do this enforcement in admission... why not in the rest handler?

I looked at PodSecurityPolicyPlugin. I am not sure if the authorization done there is enough though. We don't want to allow any user-defined priority classes to start with "system-" or have a value in the system range (>1 billion).
Performing the check in the rest handler can be done too if there is a way to obtain userInfo.

Could you let the rest handler have the map of allowed system priorities and only allow those to be created? On upgrade, any newly allowed system priority classes would be bootstrapped by a new apiserver using the loopback connection, which would always speak to a rest handler that agreed on what system priority classes were allowed.

Then you don't need to worry about userinfo at all. Anyone authorized to create priority classes can create the system priority classes the apiserver expects to be bootstrapped.

That's actually a very good idea. I changed the PR accordingly. The new changes are under a new commit.

liggitt · 2018-02-28T05:10:37Z

plugin/pkg/admission/priority/admission.go

+		}
+		if strings.HasPrefix(pc.Name, scheduling.SystemPriorityClassPrefix) {
+			return admission.NewForbidden(a, fmt.Errorf("priority class names with '"+scheduling.SystemPriorityClassPrefix+"' prefix are reserved for system use only"))
+		}
 	}
 	// If the new PriorityClass tries to be the default priority, make sure that no other priority class is marked as default.
 	if pc.GlobalDefault {


unrelated, but it still appears we're trying to enforce a singleton GlobalDefault... I thought we dropped that

We never dropped it. We just added code to handle the case of having more than one global default when they are added due to a race in HA clusters.

ah, that means admission can reject updates that swap which one is the default

apply a file containing two priority classes, a and b (default)

apply an update of the file with a (default) and b (non-default)

admission rejects if the update to a happens first. it would also reject if the update to b happened first but the watch event didn't make it into the informer-fed cache the admission plugin consulted.

this guard seems more problematic than helpful to me

I agree that there are races causing one to add more than one global default, but are you suggesting that it is better to allow people to add as many global default priority classes as they like?

if we cannot guarantee a single default (which we can't), and we have a deterministic and documented behavior when there's more than one (which we do), then breaking legitimate updates in order to half-accomplish correctness doesn't seem worthwhile

I see your point. We can remove it and document the behavior. I will do that in a separate PR to ensure we are not mixing the two.
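The swap scenario described in this thread can be reproduced with a small sketch. It is not the admission plugin's code: `cache`, `admitUpdate`, and the simplified `PriorityClass` type are hypothetical stand-ins, and the guard is reduced to "reject if another class in the cache is already the default".

```go
package main

import (
	"errors"
	"fmt"
)

// PriorityClass is a simplified stand-in for the real API type.
type PriorityClass struct {
	Name          string
	GlobalDefault bool
}

// cache stands in for the informer-fed cache the admission plugin consults.
type cache map[string]PriorityClass

var errDefaultExists = errors.New("another priority class is already marked as global default")

// admitUpdate mimics a singleton guard: an update that sets GlobalDefault is
// rejected while a different class in the cache is still the default.
func admitUpdate(c cache, pc PriorityClass) error {
	if pc.GlobalDefault {
		for name, existing := range c {
			if name != pc.Name && existing.GlobalDefault {
				return errDefaultExists
			}
		}
	}
	c[pc.Name] = pc
	return nil
}

func main() {
	// Initial file: a (non-default) and b (default).
	c := cache{
		"a": {Name: "a"},
		"b": {Name: "b", GlobalDefault: true},
	}
	// Updated file: a (default) and b (non-default). The update to a lands first.
	fmt.Println("update a first:", admitUpdate(c, PriorityClass{Name: "a", GlobalDefault: true}))
	// Only after b has been demoted, and the cache has seen it, does a pass.
	fmt.Println("update b:", admitUpdate(c, PriorityClass{Name: "b"}))
	fmt.Println("update a again:", admitUpdate(c, PriorityClass{Name: "a", GlobalDefault: true}))
}
```

The outcome depends purely on the order in which the two updates arrive, which is the reviewer's objection to keeping the guard.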

derekwaynecarr · 2018-02-28T06:38:12Z

pkg/registry/scheduling/rest/storage_scheduling.go

@@ -49,6 +63,64 @@ func (p RESTStorageProvider) v1alpha1Storage(apiResourceConfigSource serverstora
 	return storage
 }

+func (p RESTStorageProvider) PostStartHook() (string, genericapiserver.PostStartHookFunc, error) {
+	return PostStartHookName, AddSystemPriorityClasses(), nil


can i delete them? what happens if i do? will they come back only on restart?

If they are deleted, no more new Pods with these priority classes can be created. These PriorityClasses are added back on API server restart.
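A minimal sketch of that restart behavior, assuming an in-memory `fakeStore` in place of the real storage client. The helper name `addSystemPriorityClasses` and the numeric values are assumptions made for illustration, not the PR's actual hook.

```go
package main

import "fmt"

// PriorityClass is a simplified stand-in for the real API type.
type PriorityClass struct {
	Name  string
	Value int32
}

// systemPriorityClasses lists the classes the API server is expected to
// bootstrap. The values are assumptions made for this sketch.
var systemPriorityClasses = []PriorityClass{
	{Name: "system-node-critical", Value: 2000001000},
	{Name: "system-cluster-critical", Value: 2000000000},
}

// fakeStore is a hypothetical in-memory replacement for the storage client.
type fakeStore map[string]PriorityClass

// addSystemPriorityClasses creates every missing system class and leaves
// existing ones untouched, so running it on every start is safe.
func addSystemPriorityClasses(s fakeStore) (created int) {
	for _, pc := range systemPriorityClasses {
		if _, exists := s[pc.Name]; exists {
			continue
		}
		s[pc.Name] = pc
		created++
	}
	return created
}

func main() {
	s := fakeStore{}
	fmt.Println("first start, created:", addSystemPriorityClasses(s))
	delete(s, "system-node-critical") // someone deletes one of the classes
	fmt.Println("after restart, created:", addSystemPriorityClasses(s))
}
```

Between the deletion and the next restart the class is simply absent, which is the window in which new pods referencing it cannot be created.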

I think it also means no new critical pods could be created. Can we avoid that? Say something like Delete would return error, if priorityclass name is one of systemclustercritical or systemnodecritical? Not sure if that takes flexibility away from user.

Looks like we don't have a validation mechanism in our REST layer for delete operations. So, I am not sure if it is possible to prevent deletion of these priority classes.

Thanks for the information. I did a quick search in the codebase and it seems there are a few things like immortalNamespaces (basically 'kube-system', 'default', etc.), where at the admission plugin level we prohibit users from deleting them (https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/admission/plugin/namespace/lifecycle/admission.go#L77). Wondering if we could do the same thing, but I think it can be a different admission controller from the priority AdmissionController.

@liggitt You mean somewhere here? - https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/scheduling/priorityclass/registry.go#L81

inside the impl of Delete (which the priorityclassstorage would need to override: put the guard in, then delegate), like:

func (r *REST) Delete(ctx genericapirequest.Context, name string, options *metav1.DeleteOptions) (runtime.Object, bool, error) {
	// if name is in set of expected system priority classes, forbid and return
	return r.store.Delete(ctx, name, options)
}

Understood. Thanks @liggitt. Unrelated to this change, when do we use this mechanism vs. the one I mentioned earlier for kube-system? (Is this mechanism only for top-level objects?)

Thanks, @liggitt and @ravisantoshgudimetla. I added the check.

thanks for updating.
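For readers following along, the guard-then-delegate pattern suggested above can be sketched in a self-contained form. The `deleter` interface, `memStore`, and the trimmed-down `Delete` signature are hypothetical simplifications; the real method takes a context and delete options and returns a runtime object.

```go
package main

import (
	"errors"
	"fmt"
)

// deleter is a hypothetical, minimal view of the underlying generic store.
type deleter interface {
	Delete(name string) error
}

// memStore is an in-memory stand-in used only for this sketch.
type memStore map[string]struct{}

func (m memStore) Delete(name string) error {
	if _, ok := m[name]; !ok {
		return errors.New("not found: " + name)
	}
	delete(m, name)
	return nil
}

// systemPriorityClassNames is the set of classes that must not be deleted.
var systemPriorityClassNames = map[string]bool{
	"system-cluster-critical": true,
	"system-node-critical":    true,
}

// REST wraps the store and overrides Delete to add the guard.
type REST struct {
	store deleter
}

// Delete forbids deleting system priority classes and delegates otherwise.
func (r *REST) Delete(name string) error {
	if systemPriorityClassNames[name] {
		return fmt.Errorf("%s is a system priority class and cannot be deleted", name)
	}
	return r.store.Delete(name)
}

func main() {
	r := &REST{store: memStore{"system-node-critical": {}, "batch-low": {}}}
	fmt.Println(r.Delete("system-node-critical")) // rejected by the guard
	fmt.Println(r.Delete("batch-low"))            // delegated to the store
}
```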

liggitt · 2018-03-01T14:30:50Z

pkg/apis/scheduling/validation/validation.go

+	allErrs = append(allErrs, apivalidation.ValidateObjectMeta(&pc.ObjectMeta, false, apivalidation.NameIsDNSSubdomain, field.NewPath("metadata"))...)
+	// If the priorityClass starts with a system prefix, it must be one of the
+	// predefined system priority classes.
+	if strings.HasPrefix(pc.Name, scheduling.SystemPriorityClassPrefix) && !scheduling.IsKnownSystemPriorityClass(pc) {


since both the name and value are immutable, comparison to the bootstrap priority classes should only be done on create.

update should not concern itself with these checks, since during rolling upgrades or in downgraded apiserver cases, you might need to update/delete a system priority class bootstrapped by a newer apiserver.

That is already the case. ValidatePriorityClassUpdate is called to validate updates.

strategy.ValidateUpdate calls this function

That's right. I will remove the call.
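The create/update split discussed here might look roughly like the following. This is a sketch under stated assumptions: `validateCreate`, `validateUpdate`, the `knownSystemClasses` map with its values, and the class name `system-future-critical` are all made up for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// PriorityClass is a simplified stand-in for the real API type.
type PriorityClass struct {
	Name        string
	Value       int32
	Description string
}

// knownSystemClasses is what this apiserver version bootstraps (values assumed).
var knownSystemClasses = map[string]int32{
	"system-cluster-critical": 2000000000,
	"system-node-critical":    2000001000,
}

// validateCreate compares "system-" names against the bootstrap set.
func validateCreate(pc PriorityClass) error {
	if !strings.HasPrefix(pc.Name, "system-") {
		return nil
	}
	if want, ok := knownSystemClasses[pc.Name]; !ok || want != pc.Value {
		return fmt.Errorf("%q with value %d is not a known system priority class", pc.Name, pc.Value)
	}
	return nil
}

// validateUpdate only enforces that name and value stay unchanged. It does
// not consult the bootstrap set.
func validateUpdate(newPC, oldPC PriorityClass) error {
	if newPC.Name != oldPC.Name || newPC.Value != oldPC.Value {
		return errors.New("name and value are immutable")
	}
	return nil
}

func main() {
	// Hypothetical class that a newer apiserver bootstrapped.
	old := PriorityClass{Name: "system-future-critical", Value: 2000002000}
	updated := old
	updated.Description = "edited through an older apiserver"

	fmt.Println("create on older apiserver:", validateCreate(old))
	fmt.Println("update on older apiserver:", validateUpdate(updated, old))
}
```

Because update validation never looks at the bootstrap set, an older apiserver can still modify an object it would not have allowed anyone to create.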

liggitt · 2018-03-01T14:32:04Z

pkg/apis/scheduling/helpers.go

+	for _, spc := range systemPriorityClasses {
+		spcCopy := *spc
+		spcCopy.Description = ""
+		if reflect.DeepEqual(spcCopy, pcCopy) {


only compare name and value. descriptions, annotations, labels, etc, we don't care about.

also, if the value mismatches, surfacing a better error message is important ("system priority class %s may only have value %d", etc)

The reason I used DeepEqual was to make it resilient to changes of fields in PriorityClass. Anyway, I changed it to compare the three fields only, but I didn't add more logic to report errors for each priority class. Reporting better errors would make the logic hard to maintain. These priority classes are supposed to be created by the API server automatically and not by humans.

Actually, adding more detailed error does not make the function much different. Let me add it.

Done. PTAL.
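A rough sketch of a field-by-field comparison with a descriptive error, in the spirit of the review comment. It compares only name and value; the lower-case function name, the simplified type, and the numeric values are assumptions for this example rather than the code in the PR.

```go
package main

import "fmt"

// PriorityClass is a simplified stand-in for the real API type.
type PriorityClass struct {
	Name        string
	Value       int32
	Description string
}

// systemPriorityClasses holds the bootstrap classes (values assumed).
var systemPriorityClasses = []PriorityClass{
	{Name: "system-node-critical", Value: 2000001000, Description: "node critical"},
	{Name: "system-cluster-critical", Value: 2000000000, Description: "cluster critical"},
}

// isKnownSystemPriorityClass matches on name, then checks the value.
// Descriptions, labels, and annotations are deliberately ignored.
func isKnownSystemPriorityClass(pc PriorityClass) (bool, error) {
	for _, spc := range systemPriorityClasses {
		if spc.Name != pc.Name {
			continue
		}
		if spc.Value != pc.Value {
			return false, fmt.Errorf("value of %v PriorityClass must be %v", spc.Name, spc.Value)
		}
		return true, nil
	}
	return false, fmt.Errorf("%v is not a known system priority class", pc.Name)
}

func main() {
	ok, err := isKnownSystemPriorityClass(PriorityClass{Name: "system-node-critical", Value: 5})
	fmt.Println(ok, err)

	ok, err = isKnownSystemPriorityClass(PriorityClass{Name: "system-node-critical", Value: 2000001000, Description: "anything"})
	fmt.Println(ok, err)
}
```

Returning the error alongside the boolean lets the caller surface the expected value directly in the validation message.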

liggitt · 2018-03-01T21:36:59Z

pkg/apis/scheduling/helpers.go

+			if spc.Value != pc.Value {
+				return false, fmt.Errorf("value of %v PrioityClass must be %v", spc.Name, spc.Value)
+			}
+			if spc.GlobalDefault != pc.GlobalDefault {


I don't think we care about this either... just the name and value. If they want to make the default for pods in their cluster "system-node-critical", that shouldn't matter to us.

Setting such a high priority as default could cause eviction of system critical components of a cluster. We should prevent catastrophic human errors as much as possible.

liggitt · 2018-03-01T21:38:19Z

pkg/apis/scheduling/validation/validation.go

+	} else if pc.Value > scheduling.HighestUserDefinablePriority {
+		// Similarly, if the Value is in the system range, it must be one of the
+		// predefined system priority classes.
+		if is, err := scheduling.IsKnownSystemPriorityClass(pc); !is {


since it doesn't start with the SystemPriorityClassPrefix, there's no need to check this here

liggitt · 2018-03-01T21:38:36Z

pkg/apis/scheduling/validation/validation.go

+	// predefined system priority classes.
+	if strings.HasPrefix(pc.Name, scheduling.SystemPriorityClassPrefix) {
+		if is, err := scheduling.IsKnownSystemPriorityClass(pc); !is {
+			allErrs = append(allErrs, field.Forbidden(field.NewPath("metadata", "Name"), "priority class names with '"+scheduling.SystemPriorityClassPrefix+"' prefix are reserved for system use only. error: "+err.Error()))


"name" should match the json field

ravisantoshgudimetla · 2018-03-03T05:00:51Z

pkg/registry/scheduling/priorityclass/storage/storage.go

@@ -56,3 +61,14 @@ var _ rest.ShortNamesProvider = &REST{}
 func (r *REST) ShortNames() []string {
 	return []string{"pc"}
 }
+
+// Delete ensures that system priority classes are not deleted.


Should we have a guard for update as well? Otherwise we would be able to update a system priority class to a lower value and then create pods that have a higher priority than the system priority.

Update validation already prevents changing value

Ohh I missed that, it seems it is part of https://github.com/kubernetes/kubernetes/pull/60519/files#diff-35601e9483eab62205294e45c6df7712R71. Thanks.

bsalamat · 2018-03-05T22:37:48Z

@liggitt @aveshagarwal @ravisantoshgudimetla @derekwaynecarr
Any other comments on this PR?

ravisantoshgudimetla

LGTM.

derekwaynecarr · 2018-03-06T04:30:01Z

/lgtm

liggitt · 2018-03-06T17:45:53Z

/lgtm

liggitt · 2018-03-06T17:46:09Z

(needs squash)

bsalamat · 2018-03-06T18:08:01Z

squashed commits.
Thanks all again for your reviews.

bsalamat · 2018-03-21T17:25:06Z

@liggitt Could you please re-lgtm this after squash?

bsalamat · 2018-03-22T18:18:28Z

ping

liggitt · 2018-03-22T20:14:46Z

/lgtm

k8s-ci-robot · 2018-03-22T20:14:54Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, derekwaynecarr, liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~hack/OWNERS~~ [liggitt]
~~pkg/apis/OWNERS~~ [liggitt]
~~pkg/kubelet/OWNERS~~ [derekwaynecarr]
~~pkg/registry/OWNERS~~ [liggitt]
~~pkg/scheduler/OWNERS~~ [bsalamat]
~~plugin/pkg/admission/OWNERS~~ [derekwaynecarr]
~~test/e2e/scheduling/OWNERS~~ [bsalamat,liggitt]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

fejta-bot · 2018-03-22T23:35:52Z

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

fejta-bot · 2018-03-23T02:23:54Z

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

tengqm · 2018-03-23T03:12:49Z

/test pull-kubernetes-integration

bsalamat · 2018-03-26T17:17:52Z

/retest

fejta-bot · 2018-03-27T02:38:51Z

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

k8s-github-robot · 2018-03-27T06:20:05Z

Automatic merge from submit-queue (batch tested with PRs 60519, 61099, 61218, 61166, 61714). If you want to cherry-pick this change to another branch, please follow the instructions here.

k8s-ci-robot requested review from aveshagarwal, davidopp, deads2k, derekwaynecarr, krousey and mtaufen February 27, 2018 19:47

k8s-github-robot added the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label Feb 27, 2018

bsalamat force-pushed the auto_prio_class branch from c9b7564 to d1eff44 Compare February 27, 2018 19:48

bsalamat force-pushed the auto_prio_class branch 2 times, most recently from fe3f75a to b7359ae Compare February 27, 2018 20:02

liggitt reviewed Feb 28, 2018

derekwaynecarr reviewed Feb 28, 2018

bsalamat mentioned this pull request Feb 28, 2018

Critical pods should not be limited to kube-system namespace #60596

Closed

bsalamat force-pushed the auto_prio_class branch 2 times, most recently from b5a18e4 to 2192ec8 Compare March 1, 2018 08:59

liggitt reviewed Mar 1, 2018

bsalamat force-pushed the auto_prio_class branch 3 times, most recently from 0291abb to 494dabf Compare March 1, 2018 21:31

liggitt reviewed Mar 1, 2018

bsalamat force-pushed the auto_prio_class branch from 5a21b7d to 8f82acd Compare March 3, 2018 00:35

Auto-create system critical prioity classes at API server startup

ebda958

bsalamat force-pushed the auto_prio_class branch from 8f82acd to f66b952 Compare March 3, 2018 00:50

ravisantoshgudimetla reviewed Mar 3, 2018

ravisantoshgudimetla approved these changes Mar 5, 2018

k8s-ci-robot assigned derekwaynecarr Mar 6, 2018

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 6, 2018

k8s-ci-robot assigned liggitt Mar 6, 2018

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 6, 2018

bsalamat added 2 commits March 6, 2018 10:06

autogenerated files

515ba9e

Allow system critical priority classes in API validation

9592a9e

bsalamat force-pushed the auto_prio_class branch from f66b952 to 9592a9e Compare March 6, 2018 18:07

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 6, 2018

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 22, 2018

k8s-github-robot merged commit 71050b6 into kubernetes:master Mar 27, 2018
