Automatically add system critical priority classes at cluster bootstrapping #60519

Merged
k8s-merge-robot merged 3 commits into kubernetes:master on Mar 27, 2018

@bsalamat
Contributor

bsalamat commented Feb 27, 2018

What this PR does / why we need it:
We had two PriorityClasses that were hardcoded and special cased in our code base. These two priority classes never existed in API server. Priority admission controller had code to resolve these two names. This PR removes the hardcoded PriorityClasses and adds code to create these PriorityClasses automatically when API server starts.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #60178

ref/ #57471

Special notes for your reviewer:

Release note:

Automatically add system critical priority classes at cluster bootstrapping.

/sig scheduling

@bsalamat

Contributor

bsalamat commented Feb 27, 2018

@@ -168,7 +168,7 @@ func IsCriticalPodBasedOnPriority(ns string, priority int32) bool {
if ns != kubeapi.NamespaceSystem {

@liggitt

liggitt Feb 28, 2018

Member

unrelated to this PR, but critical pods based on priority should not be limited to the kube-system namespace... that namespace should not have any special significance as far as the API is concerned.

@bsalamat

bsalamat Feb 28, 2018

Contributor

As you said, that's unrelated to this PR, but the K8s docs clearly state that critical pods can exist only in the system namespace.

@liggitt

liggitt Feb 28, 2018

Member

K8s docs clearly state that critical pods can only exist in system namespace only

that is left over from critical pods being determined from the uncontrollable pod annotation. that limitation should not apply to the priority field

@bsalamat

bsalamat Feb 28, 2018

Contributor

I will file a separate issue for this.

@bsalamat

bsalamat Feb 28, 2018

Contributor

filed: #60596

return admission.NewForbidden(a, fmt.Errorf("the name of the priority class is a reserved name for system use only: %v", pc.Name))
// API server adds system critical priority classes at bootstrapping. We should
// not enforce restrictions on adding system level priority classes for API server.
if userInfo := a.GetUserInfo(); userInfo == nil || userInfo.GetName() != user.APIServerUser {

@liggitt

liggitt Feb 28, 2018

Member

this isn't an appropriate use of this username. if you want to require additional authorization to set certain priority classes, pass in an authorizer and perform the authorization check (see PodSecurityPolicyPlugin for an example of obtaining and using the authorizer). The API server always has superuser permissions when making API calls to itself, and will pass any authorization check, but cluster-admins should be able to modify these objects as well.

I also find it odd to do this enforcement in admission... why not in the rest handler?

@bsalamat

bsalamat Mar 1, 2018

Contributor

I looked at PodSecurityPolicyPlugin. I am not sure if the authorization done there is enough though. We don't want to allow any user-defined priority classes to start with "system-" or have a value in the system range (>1 billion).
Performing the check in the rest handler can be done too if there is a way to obtain userInfo.

@liggitt

liggitt Mar 1, 2018

Member

Could you let the rest handler have the map of allowed system priorities and only allow those to be created? On upgrade, any newly allowed system priority classes would be bootstrapped by a new apiserver using the loopback connection, which would always speak to a rest handler that agreed on what system priority classes were allowed.

@liggitt

liggitt Mar 1, 2018

Member

Then you don't need to worry about userinfo at all. Anyone authorized to create priority classes can create the system priority classes the apiserver expects to be bootstrapped.

@bsalamat

bsalamat Mar 1, 2018

Contributor

That's actually a very good idea. I changed the PR accordingly. The new changes are under a new commit.
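
The approach agreed on in this thread can be sketched roughly as follows: the handler owns a map of expected system priority classes and checks creates against it, without looking at who is making the request. This is only an illustration, not the PR's code. The names `allowedSystemPriorities` and `validateCreate` are hypothetical, and the class names and values in the map are assumptions; the thread itself only says that system names start with "system-" and that the system range is above 1 billion.

```go
package main

import (
	"fmt"
	"strings"
)

// Assumed constants for illustration only.
const (
	systemPrefix         = "system-"
	highestUserDefinable = int32(1000000000)
)

// allowedSystemPriorities is the hypothetical map owned by the REST handler.
var allowedSystemPriorities = map[string]int32{
	"system-cluster-critical": 2000000000,
	"system-node-critical":    2000001000,
}

// validateCreate accepts a "system-" name only if it is an expected system
// class with the expected value, and keeps all other classes out of the
// system value range. It never inspects the identity of the requester.
func validateCreate(name string, value int32) error {
	if strings.HasPrefix(name, systemPrefix) {
		want, ok := allowedSystemPriorities[name]
		if !ok {
			return fmt.Errorf("priority class names with %q prefix are reserved for system use only", systemPrefix)
		}
		if value != want {
			return fmt.Errorf("system priority class %s may only have value %d", name, want)
		}
		return nil
	}
	if value > highestUserDefinable {
		return fmt.Errorf("maximum allowed value of a user defined priority is %d", highestUserDefinable)
	}
	return nil
}

func main() {
	fmt.Println(validateCreate("system-node-critical", 2000001000)) // <nil>
	fmt.Println(validateCreate("system-node-critical", 5))
	fmt.Println(validateCreate("system-mine", 5))
	fmt.Println(validateCreate("batch-high", 1500000000))
}
```

Because the check depends only on the object, an apiserver bootstrapping over the loopback connection and a cluster-admin creating the same class by hand are treated identically.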

}
if strings.HasPrefix(pc.Name, scheduling.SystemPriorityClassPrefix) {
return admission.NewForbidden(a, fmt.Errorf("priority class names with '"+scheduling.SystemPriorityClassPrefix+"' prefix are reserved for system use only"))
}
}
// If the new PriorityClass tries to be the default priority, make sure that no other priority class is marked as default.
if pc.GlobalDefault {

@liggitt

liggitt Feb 28, 2018

Member

unrelated, but it still appears we're trying to enforce a singleton GlobalDefault... I thought we dropped that

@bsalamat

bsalamat Feb 28, 2018

Contributor

We never dropped it. We just added code to handle the case of having more than one global default when they are added due to race in HA clusters.

@liggitt

liggitt Feb 28, 2018

Member

ah, that means admission can reject updates that swap which one is the default

  1. apply a file containing two priority classes, a and b (default)
  2. apply an update of the file with a (default) and b (non-default)

admission rejects if the update to a happens first. it would also reject if the update to b happened first but the watch event didn't make it into the informer-fed cache the admission plugin consulted.

this guard seems more problematic than helpful to me

@bsalamat

bsalamat Feb 28, 2018

Contributor

I agree that there are races causing one to add more than one global default, but are you suggesting that it is better to allow people to add as many global default priority classes as they like?

@liggitt

liggitt Feb 28, 2018

Member

if we cannot guarantee a single default (which we can't), and we have a deterministic and documented behavior when there's more than one (which we do), then breaking legitimate updates in order to half-accomplish correctness doesn't seem worthwhile

@bsalamat

bsalamat Feb 28, 2018

Contributor

I see your point. We can remove it and document the behavior. I will do that in a separate PR to ensure we are not mixing the two.
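
To make "deterministic behavior when there's more than one" concrete, here is a minimal sketch of picking a single default out of several classes marked GlobalDefault. The thread does not restate the documented rule, so the tie-break used here (lowest value wins, then name) is an assumption, and `PriorityClass` and `pickDefault` are hypothetical stand-ins rather than the real types.

```go
package main

import "fmt"

// PriorityClass is a trimmed, hypothetical stand-in for the API type.
type PriorityClass struct {
	Name          string
	Value         int32
	GlobalDefault bool
}

// pickDefault returns the class to treat as the default, or nil when no
// class is marked GlobalDefault. The tie-break (lowest value, then name)
// is an assumption made for this sketch.
func pickDefault(classes []PriorityClass) *PriorityClass {
	var best *PriorityClass
	for i := range classes {
		pc := &classes[i]
		if !pc.GlobalDefault {
			continue
		}
		if best == nil || pc.Value < best.Value ||
			(pc.Value == best.Value && pc.Name < best.Name) {
			best = pc
		}
	}
	return best
}

func main() {
	classes := []PriorityClass{
		{Name: "a", Value: 100, GlobalDefault: true},
		{Name: "b", Value: 50, GlobalDefault: true},
		{Name: "c", Value: 10, GlobalDefault: false},
	}
	fmt.Println(pickDefault(classes).Name) // b
}
```

With a rule like this, the result does not depend on the order in which the two defaults were written, which is why rejecting the swap at admission buys little.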

@@ -49,6 +63,64 @@ func (p RESTStorageProvider) v1alpha1Storage(apiResourceConfigSource serverstora
return storage
}
func (p RESTStorageProvider) PostStartHook() (string, genericapiserver.PostStartHookFunc, error) {
return PostStartHookName, AddSystemPriorityClasses(), nil

@derekwaynecarr

derekwaynecarr Feb 28, 2018

Member

can i delete them? what happens if i do? will they come back only on restart?

@bsalamat

bsalamat Feb 28, 2018

Contributor

If they are deleted, no more new Pods with these priority classes can be created. These PriorityClasses are added back on API server restart.

@ravisantoshgudimetla

ravisantoshgudimetla Mar 2, 2018

Contributor

I think it also means no new critical pods could be created. Can we avoid that? Say something like Delete would return error, if priorityclass name is one of systemclustercritical or systemnodecritical? Not sure if that takes flexibility away from user.

@bsalamat

bsalamat Mar 2, 2018

Contributor

Looks like we don't have a validation mechanism in our REST layer for delete operations. So, I am not sure if it is possible to prevent deletion of these priority classes.

@ravisantoshgudimetla

ravisantoshgudimetla Mar 2, 2018

Contributor

Thanks for the information. I did a quick search in the codebase and it seems there are a few things like immortalNamespaces (basically 'kube-system', 'default', etc.), where at the admission plugin level we prohibit users from deleting them (https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/admission/plugin/namespace/lifecycle/admission.go#L77). I wonder if we could do the same thing, but I think it could be a different admission controller from the priority admission controller.

@liggitt

liggitt Mar 2, 2018

Member

inside the impl of Delete (which the priorityclassstorage would need to override), put the guard in, then delegate, like:

func (r *REST) Delete(ctx genericapirequest.Context, name string, options *metav1.DeleteOptions) (runtime.Object, bool, error) {
	// if name is in set of expected system priority classes, forbid and return
	return r.store.Delete(ctx, name, options)
}

@ravisantoshgudimetla

ravisantoshgudimetla Mar 2, 2018

Contributor

Understood. Thanks @liggitt. Unrelated to this change, when do we use this mechanism vs. the one I mentioned earlier for kube-system? (Is this mechanism only for top-level objects?)

@bsalamat

bsalamat Mar 2, 2018

Contributor

Thanks, @liggitt and @ravisantoshgudimetla. I added the check.

@derekwaynecarr

derekwaynecarr Mar 6, 2018

Member

thanks for updating.

allErrs = append(allErrs, apivalidation.ValidateObjectMeta(&pc.ObjectMeta, false, apivalidation.NameIsDNSSubdomain, field.NewPath("metadata"))...)
// If the priorityClass starts with a system prefix, it must be one of the
// predefined system priority classes.
if strings.HasPrefix(pc.Name, scheduling.SystemPriorityClassPrefix) && !scheduling.IsKnownSystemPriorityClass(pc) {

@liggitt

liggitt Mar 1, 2018

Member

since both the name and value are immutable, comparison to the bootstrap priority classes should only be done on create.

update should not concern itself with these checks, since during rolling upgrades or in downgraded apiserver cases, you might need to update/delete a system priority class bootstrapped by a newer apiserver.

@bsalamat

bsalamat Mar 1, 2018

Contributor

That is already the case. ValidatePriorityClassUpdate is called to validate updates.

@liggitt

liggitt Mar 1, 2018

Member

strategy.ValidateUpdate calls this function

@bsalamat

bsalamat Mar 1, 2018

Contributor

That's right. I will remove the call.
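
The split between create and update validation described above might look like the following sketch. The function names, the trimmed `PriorityClass` type, and the class values are hypothetical; the point is only that the comparison against the bootstrap classes runs on create, while update checks immutability and nothing else.

```go
package main

import (
	"fmt"
	"strings"
)

// PriorityClass is a trimmed, hypothetical stand-in for the API type.
type PriorityClass struct {
	Name        string
	Value       int32
	Description string
}

// knownSystemClasses holds the classes this apiserver version expects
// (names and values are assumptions for illustration).
var knownSystemClasses = map[string]int32{
	"system-cluster-critical": 2000000000,
	"system-node-critical":    2000001000,
}

// validateCreate compares "system-" classes against the bootstrap set.
func validateCreate(pc PriorityClass) error {
	if !strings.HasPrefix(pc.Name, "system-") {
		return nil
	}
	if want, ok := knownSystemClasses[pc.Name]; !ok || want != pc.Value {
		return fmt.Errorf("%s is not a known system priority class with value %d", pc.Name, pc.Value)
	}
	return nil
}

// validateUpdate only enforces immutability of name and value. It does not
// consult knownSystemClasses, so a class bootstrapped by a newer apiserver
// can still be updated by an older one.
func validateUpdate(newPC, oldPC PriorityClass) error {
	if newPC.Name != oldPC.Name {
		return fmt.Errorf("name is immutable")
	}
	if newPC.Value != oldPC.Value {
		return fmt.Errorf("value is immutable")
	}
	return nil
}

func main() {
	// A class that only a newer apiserver knows about (hypothetical name).
	future := PriorityClass{Name: "system-future-critical", Value: 2000002000}
	fmt.Println(validateCreate(future))

	edited := future
	edited.Description = "bootstrapped by a newer apiserver"
	fmt.Println(validateUpdate(edited, future)) // <nil>
}
```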

for _, spc := range systemPriorityClasses {
spcCopy := *spc
spcCopy.Description = ""
if reflect.DeepEqual(spcCopy, pcCopy) {

@liggitt

liggitt Mar 1, 2018

Member

only compare name and value. descriptions, annotations, labels, etc, we don't care about.

also, if the value mismatches, surfacing a better error message is important ("system priority class %s may only have value %d", etc)

@bsalamat

bsalamat Mar 1, 2018

Contributor

The reason I used DeepEqual was to make it resilient to changes of fields in PriorityClass. Anyway, I changed it to compare only the three fields, but I didn't add more logic to report errors for each priority class. Reporting better errors would make the logic hard to maintain. These priority classes are supposed to be created by the API server automatically and not by humans.

@bsalamat

bsalamat Mar 1, 2018

Contributor

Actually, adding more detailed error does not make the function much different. Let me add it.

@bsalamat

bsalamat Mar 1, 2018

Contributor

Done. PTAL.

if spc.Value != pc.Value {
return false, fmt.Errorf("value of %v PrioityClass must be %v", spc.Name, spc.Value)
}
if spc.GlobalDefault != pc.GlobalDefault {

@liggitt

liggitt Mar 1, 2018

Member

I don't think we care about this either... just the name and value. If they want to make the default for pods in their cluster "system-node-critical", that shouldn't matter to us.

@bsalamat

bsalamat Mar 1, 2018

Contributor

Setting such a high priority as default could cause eviction of system critical components of a cluster. We should prevent catastrophic human errors as much as possible.

} else if pc.Value > scheduling.HighestUserDefinablePriority {
// Similarly, if the Value is in the system range, it must be one of the
// predefined system priority classes.
if is, err := scheduling.IsKnownSystemPriorityClass(pc); !is {

@liggitt

liggitt Mar 1, 2018

Member

since it doesn't start with the SystemPriorityClassPrefix, there's no need to check this here

// predefined system priority classes.
if strings.HasPrefix(pc.Name, scheduling.SystemPriorityClassPrefix) {
if is, err := scheduling.IsKnownSystemPriorityClass(pc); !is {
allErrs = append(allErrs, field.Forbidden(field.NewPath("metadata", "Name"), "priority class names with '"+scheduling.SystemPriorityClassPrefix+"' prefix are reserved for system use only. error: "+err.Error()))

@liggitt

liggitt Mar 1, 2018

Member

"name" should match the json field

// Similarly, if the Value is in the system range, it must be one of the
// predefined system priority classes.
if is, err := scheduling.IsKnownSystemPriorityClass(pc); !is {
allErrs = append(allErrs, field.Forbidden(field.NewPath("Value"), fmt.Sprintf("maximum allowed value of a user defined priority is %v. Error: %v", scheduling.HighestUserDefinablePriority, err)))

@liggitt

liggitt Mar 1, 2018

Member

"value" should match the json field

})
// This test verifies that when a higher priority pod is created and no node with
// enough resources is found, scheduler preempts a lower priority pod to schedule

@liggitt

liggitt Mar 1, 2018

Member

update comment... this test doesn't check preemption

pod := createPausePod(f, pausePodConfig{
Name: fmt.Sprintf("pod%d-%v", i, spc),
PriorityClassName: spc,
Resources: &v1.ResourceRequirements{

@liggitt

liggitt Mar 1, 2018

Member

resources are unspecified... remove?

@bsalamat

bsalamat Mar 1, 2018

Contributor

All done. PTAL.

@bsalamat

Contributor

bsalamat commented Mar 1, 2018

/retest

for _, spc := range systemPriorityClasses {
if spc.Name == pc.Name {
if spc.Value != pc.Value {
return false, fmt.Errorf("value of %v PrioityClass must be %v", spc.Name, spc.Value)

@ravisantoshgudimetla

ravisantoshgudimetla Mar 2, 2018

Contributor

nit: s/PrioityClass/PriorityClass

@bsalamat

bsalamat Mar 2, 2018

Contributor

done

return false, fmt.Errorf("value of %v PrioityClass must be %v", spc.Name, spc.Value)
}
if spc.GlobalDefault != pc.GlobalDefault {
return false, fmt.Errorf("globalDefault of %v PrioityClass must be %v", spc.Name, spc.GlobalDefault)

@ravisantoshgudimetla

ravisantoshgudimetla Mar 2, 2018

Contributor

nit: s/PrioityClass/PriorityClass

@bsalamat

bsalamat Mar 2, 2018

Contributor

done

glog.Infof("created PriorityClass %s with value %v", pc.Name, pc.Value)
}
} else {
// Unable to get the priority class for reasons other than "not found".

@ravisantoshgudimetla

ravisantoshgudimetla Mar 2, 2018

Contributor

nit: Logging it here might be helpful

@bsalamat

bsalamat Mar 2, 2018

Contributor

sure.

}
}
}
glog.Infof("all system priority classes are created successfully.")

@ravisantoshgudimetla

ravisantoshgudimetla Mar 2, 2018

Contributor

created successfully or they already exist.

@@ -23,9 +23,17 @@ const (
// that do not specify any priority class and there is no priority class
// marked as default.
DefaultPriorityWhenNoDefaultClassExists = 0

@aveshagarwal

aveshagarwal Mar 2, 2018

Member

Having it not be part of any priority class object does not sound right to me. Why treat the lowest or default value differently from the others?

@liggitt

liggitt Mar 2, 2018

Member

0 isn't the lowest... you can have negative values. 0 is just the default.

@aveshagarwal

aveshagarwal Mar 2, 2018

Member

yeah sure, I did not mean that, I just meant to have a class associated with each priority class value.

@liggitt

liggitt Mar 2, 2018

Member

when a pod does not specify a priorityClass, and there are no priority class objects, defaulting to 0 makes sense to me.

@aveshagarwal

aveshagarwal Mar 2, 2018

Member

Conceptually i think it'd be better to default to a default priority class (with whatever value to it) not to some integer value.

@aveshagarwal

aveshagarwal Mar 2, 2018

Member

Anyway I am ok with defaulting at API level but then I am wondering why we need DefaultPriorityWhenNoDefaultClassExists?

@aveshagarwal

aveshagarwal Mar 2, 2018

Member

I am assuming that when you say it's being defaulted to 0, that happens somewhere in https://github.com/kubernetes/api/blob/master/core/v1/generated.pb.go based on https://github.com/kubernetes/api/blob/master/core/v1/generated.proto#L3192. But as I said, in that case, I don't think we need any new constant like this DefaultPriorityWhenNoDefaultClassExists.

@bsalamat

bsalamat Mar 2, 2018

Contributor

Priority field is an *int32, not int32. DefaultPriorityWhenNoDefaultClassExists is needed to set the default.

@aveshagarwal

aveshagarwal Mar 2, 2018

Member

I am not sure how else to make the case: if we are not defaulting to 0 at the API level, it would be better to default based on a priority class, to keep it in sync with the other cases where the priority of a pod is pulled from a priority class. Again, if we are defaulting to 0 at the API level, that is a different thing and that's perfectly fine, but then we would not need DefaultPriorityWhenNoDefaultClassExists.

@bsalamat

bsalamat Mar 2, 2018

Contributor

@aveshagarwal
Three points:

  1. The concern of this PR is something else. It would be better to have a separate issue to discuss such matters. I dislike the fact that we mix discussions of unrelated matters and slow down the flow of development. I have heard the same complaints from other colleagues as well. To be clear, I am not referring to you. This is something that many of us do and I may have done too. We should not mix discussions.
  2. We should be backward compatible and support Pods that do not have any priority class. So, we must not assume that all Pods have a priority class.
  3. Priority resolution happens in admission controller. So does priority resolution for pods without any PriorityClass.
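
Points 2 and 3 can be illustrated with a sketch of priority resolution at admission time. Everything here is a simplified, hypothetical stand-in for the real types: `Pod`, `PriorityClass`, and `resolvePriority` are illustrative names, and the sketch assumes at most one class is marked GlobalDefault for brevity. It shows why a constant is needed when the field is a pointer: a nil Priority means "not resolved yet", which is different from a resolved value of 0.

```go
package main

import "fmt"

// Mirrors the constant discussed in the thread (value 0).
const defaultPriorityWhenNoDefaultClassExists int32 = 0

// PriorityClass and Pod are trimmed, hypothetical stand-ins for the API types.
type PriorityClass struct {
	Name          string
	Value         int32
	GlobalDefault bool
}

type Pod struct {
	PriorityClassName string
	Priority          *int32 // nil until admission resolves it
}

// resolvePriority fills in pod.Priority. Pods without a class name get the
// global default class's value if one exists, else the constant above.
// Pods naming a class that does not exist are rejected.
func resolvePriority(pod *Pod, classes []PriorityClass) error {
	value := defaultPriorityWhenNoDefaultClassExists
	found := pod.PriorityClassName == ""
	for _, pc := range classes {
		if pod.PriorityClassName == "" && pc.GlobalDefault {
			value = pc.Value
			break
		}
		if pod.PriorityClassName != "" && pc.Name == pod.PriorityClassName {
			value, found = pc.Value, true
			break
		}
	}
	if !found {
		return fmt.Errorf("no PriorityClass with name %s was found", pod.PriorityClassName)
	}
	pod.Priority = &value
	return nil
}

func main() {
	legacy := &Pod{} // a pod created without any priority class
	if err := resolvePriority(legacy, nil); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(*legacy.Priority) // 0
}
```
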
@bsalamat

Contributor

bsalamat commented Mar 2, 2018

/retest

@aveshagarwal

Member

aveshagarwal commented Mar 2, 2018

it lgtm.

@bsalamat

Contributor

bsalamat commented Mar 2, 2018

/retest

func (r *REST) Delete(ctx genericapirequest.Context, name string, options *metav1.DeleteOptions) (runtime.Object, bool, error) {
for _, spc := range scheduling.SystemPriorityClasses() {
if name == spc.Name {
return nil, false, fmt.Errorf("%v is a system prioirty class and cannot be deleted", name)

@liggitt

liggitt Mar 3, 2018

Member

priority

@liggitt

liggitt Mar 3, 2018

Member

does this need to be an API error (e.g. NewForbidden or NewBadRequest)?

@bsalamat

bsalamat Mar 3, 2018

Contributor

Right. Good catch. Fixed it.

@@ -56,3 +61,14 @@ var _ rest.ShortNamesProvider = &REST{}
func (r *REST) ShortNames() []string {
return []string{"pc"}
}
// Delete ensures that system priority classes are not deleted.

@ravisantoshgudimetla

ravisantoshgudimetla Mar 3, 2018

Contributor

Should we have a guard for update as well? Otherwise we will be able to update a system priority class to a lower value and then create pods that have a higher priority than the system priority.

@liggitt

liggitt Mar 3, 2018

Member

Update validation already prevents changing value

@bsalamat

Contributor

bsalamat commented Mar 5, 2018

@derekwaynecarr

Member

derekwaynecarr commented Mar 6, 2018

/lgtm

@liggitt

Member

liggitt commented Mar 6, 2018

/lgtm

@liggitt

Member

liggitt commented Mar 6, 2018

(needs squash)

@k8s-ci-robot k8s-ci-robot removed the lgtm label Mar 6, 2018

@bsalamat

Contributor

bsalamat commented Mar 6, 2018

squashed commits.
Thanks all again for your reviews.

@bsalamat

Contributor

bsalamat commented Mar 21, 2018

@liggitt Could you please re-lgtm this after squash?

@bsalamat

Contributor

bsalamat commented Mar 22, 2018

ping

@liggitt

Member

liggitt commented Mar 22, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Mar 22, 2018

@k8s-ci-robot

Contributor

k8s-ci-robot commented Mar 22, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, derekwaynecarr, liggitt

@fejta-bot

fejta-bot commented Mar 22, 2018

/retest

@fejta-bot

fejta-bot commented Mar 23, 2018

/retest

@tengqm

Contributor

tengqm commented Mar 23, 2018

/test pull-kubernetes-integration

@bsalamat

Contributor

bsalamat commented Mar 26, 2018

/retest

@fejta-bot

fejta-bot commented Mar 27, 2018

/retest

@k8s-merge-robot

Contributor

k8s-merge-robot commented Mar 27, 2018

Automatic merge from submit-queue (batch tested with PRs 60519, 61099, 61218, 61166, 61714). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-merge-robot k8s-merge-robot merged commit 71050b6 into kubernetes:master Mar 27, 2018

14 checks passed

Submit Queue: Queued to run github e2e tests a second time.
cla/linuxfoundation: bsalamat authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-cross: Skipped
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-e2e-gke: Skipped
pull-kubernetes-e2e-kops-aws: Job succeeded.
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce: Job succeeded.
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.