Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kube-scheduler: Support extender managed extended resources in kube-scheduler #60332

Merged
merged 1 commit into from Feb 28, 2018

Conversation

@yguo0905
Copy link
Contributor

yguo0905 commented Feb 23, 2018

What this PR does / why we need it:

This is the continuation of #58851.

  • This PR adds new extender configurations in scheduler policy config.
    • A set of extended resource names can be specified in an extender config. They are the resources that are managed by the extender. The scheduler will only send pods to the extender if the pod requests at least one of the extended resources in the set.
    • An IgnoredByScheduler flag can be set along with each of such resources. If this flag is set to true, the scheduler will not check the resource in the PodFitsResources predicate.
  • This PR also changes the default behavior of the PodFitsResources predicate. Now, by default, PodFitsResources will ignore the extended resources that are not in node status. This is required to support extender managed extended resources (including cluster-level resources) on node. Note that in kube-scheduler we override the default behavior by not ignoring such missing extended resources.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #53616 #58850

Special notes for your reviewer:

Release note:

Support extender managed extended resources in kube-scheduler
@yguo0905

This comment has been minimized.

Copy link
Contributor Author

yguo0905 commented Feb 23, 2018

/assign @bsalamat @vishh

@yguo0905

This comment has been minimized.

Copy link
Contributor Author

yguo0905 commented Feb 23, 2018

/retest

@vishh vishh added this to the v1.10 milestone Feb 24, 2018

@yguo0905

This comment has been minimized.

Copy link
Contributor Author

yguo0905 commented Feb 24, 2018

/retest

@vishh

This comment has been minimized.

Copy link
Member

vishh commented Feb 24, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Feb 24, 2018

// PredicateMetadataProducer that creates predicate metadata with the provided
// ignoredExtendedResources.
func RegisterPredicateMetadataProducerWithIgnoredExtendedResources(ignoredExtendedResources sets.String) {
RegisterPredicateMetadataProducer("PredicateWithIgnoredExtendedResources", func(pm *predicateMetadata) {

This comment has been minimized.

@yguo0905

yguo0905 Feb 24, 2018

Author Contributor

I changed the predicate name from PodFitsResources to PredicateWithIgnoredExtendedResources.

IIUC, the "precomp" function that sets the ignoredExtendedResources will override the one for the same predicate created here:

predicates.RegisterPredicateMetadataProducer(policy.Name, precomputationFunction)

Is this the case? And I don't see why predicateMetadataProducers is not a slice here:

var predicateMetadataProducers = make(map[string]PredicateMetadataProducer)

@yguo0905 yguo0905 changed the title [Scheduler] Support extender managed extended resources [WIP] kube-scheduler: Support extender managed extended resources Feb 24, 2018

@yguo0905 yguo0905 force-pushed the yguo0905:sched branch from 424cad7 to fbe6654 Feb 25, 2018

@k8s-github-robot k8s-github-robot removed the lgtm label Feb 25, 2018

@yguo0905 yguo0905 force-pushed the yguo0905:sched branch from fbe6654 to ae891d9 Feb 25, 2018

@@ -95,6 +134,7 @@ func (pfactory *PredicateMetadataFactory) GetMetadata(pod *v1.Pod, nodeNameToInf
podRequest: GetResourceRequest(pod),
podPorts: schedutil.GetContainerPorts(pod),
matchingAntiAffinityTerms: matchingTerms,
ignoredExtendedResources: sets.NewString(),

This comment has been minimized.

@bsalamat

bsalamat Feb 25, 2018

Contributor

Why is this needed? Can't it be default (nil)?

This comment has been minimized.

@yguo0905

yguo0905 Feb 25, 2018

Author Contributor

Not needed. Will remove this.

This comment has been minimized.

@yguo0905

yguo0905 Feb 26, 2018

Author Contributor

Done.

// Missing extended resources will be ignored by default unless specified
// in predicate metadata. Note that kube-scheduler sets this flag to false
// in predicate metadata.
ignoreMissingExtendedResources := true

This comment has been minimized.

@bsalamat

bsalamat Feb 25, 2018

Contributor

The value of this variable is confusing. I think default value of "true" is a backward incompatible change. Besides, the value of predicateMeta.ignoreMissingExtendedResources is false by default. So, I guess it makes sense to make the default value false.

This comment has been minimized.

@yguo0905

yguo0905 Feb 25, 2018

Author Contributor

Thanks for looking into this over the weekend!

Yes, this is a backward incompatible change but I haven't come up with a promising solution yet.

In PodFitsResources, we cast algorithm.PredicateMetadata to predicate.predicateMetadata and then use the metadata, including ignoreMissingExtendedResources in this PR.

if predicateMeta, ok := meta.(*predicateMetadata); ok {
podRequest = predicateMeta.podRequest

This is how the predicate is invoked in pod admission on node.

fit, reasons, err := predicates.GeneralPredicates(pod, nil, nodeInfo)
Here we pass nil as the predicate metadata, which means we use the defaults in pod admission on node. That's why I set ignoreMissingExtendedResources default to true.

Ideally, kubelet should create a predicate metadata with ignoreMissingExtendedResources set to true and pass it to the predicate. But this is not possible because of the type cast in the predicate function.

I think the type cast should be removed, and the metadata should be accessed via the algorithm.PredicateMetadata interface. kube-scheduler/node/clusterautoscalar can provide their own metadata implementation. But this will be a larger change. Maybe I'm missing some thing. WDYT?

This comment has been minimized.

@bsalamat

bsalamat Feb 26, 2018

Contributor

I see. So, for now please add a comment and explain that this default value is chosen so that the kubelet code works fine and add the details.
My only concern right now is that this is backward incompatible from the kubelet side. It is however, backward compatible on the scheduler side, because if scheduler's policy config does not exists or misses this field, the default value of the field will be false and overrides the default value provided here.

This comment has been minimized.

@yguo0905

yguo0905 Feb 26, 2018

Author Contributor

Done.

Besides scheduler and kubelet, cluster autoscaler and damonset controller also use this predicate. But I don't see any negative impacts from this change. This is also the behavior before 1.8 #53616.

@vishh @jiayingz

This comment has been minimized.

@yguo0905

yguo0905 Feb 27, 2018

Author Contributor

I created algorithm.PredicateMetadataProducer in kubelet using predicates.NewPredicateMetadataFactory().

It turned out that NewPredicateMetadataFactory's podLister argument is not used at all so I can just use this function in kubelet.

Now, the default value of predicateMeta.ignoreMissingExtendedResources is false.

podName string
extenders []algorithm.SchedulerExtender

expectedBinderType string

This comment has been minimized.

@bsalamat

bsalamat Feb 25, 2018

Contributor

Why don't you check the pointer to the extender instead of its string type?

This comment has been minimized.

@yguo0905

yguo0905 Feb 25, 2018

Author Contributor

I don't have access to the pointer to the default binder.

This comment has been minimized.

@bsalamat

bsalamat Feb 26, 2018

Contributor

Ah, right. Never mind.

}, {
sendPod: podWithID(testBindingPodName, ""),
algo: mockScheduler{testNode.Name, nil},
expectBind: testBinding, // testBinding has a name different than testBindingPodName

This comment has been minimized.

@bsalamat

bsalamat Feb 25, 2018

Contributor

I am confused. Why is the name different?

This comment has been minimized.

@yguo0905

yguo0905 Feb 25, 2018

Author Contributor

This is confusing and not that useful. I already added full tests in TestGetBinderFunc in pkg/scheduler/factory/factory_test.go so this should have been removed.

@yguo0905 yguo0905 force-pushed the yguo0905:sched branch from ae891d9 to 98f7563 Feb 25, 2018

@yguo0905 yguo0905 changed the title [WIP] kube-scheduler: Support extender managed extended resources kube-scheduler: Support extender managed extended resources in kube-scheduler Feb 25, 2018

@yguo0905

This comment has been minimized.

Copy link
Contributor Author

yguo0905 commented Feb 25, 2018

/hold cancel

// Missing extended resources will be ignored by default unless a predicate
// metadata is provided.
//
// This predicate is used by scheduler, kubelet, cluster autoscaler and

This comment has been minimized.

@bsalamat

bsalamat Feb 26, 2018

Contributor

@dchen1107 @mwielgus
This is a backward incompatible change in kubelet and autoscaler, but it applies only to Pods that have extended resources missing on the node. Do you guys approve such a change?

@@ -606,8 +606,16 @@ func (kl *Kubelet) setNodeStatusMachineInfo(node *v1.Node) {
}

for _, removedResource := range removedDevicePlugins {
glog.V(2).Infof("Remove capacity for %s", removedResource)
delete(node.Status.Capacity, v1.ResourceName(removedResource))
glog.V(2).Infof("Set capacity for %s to 0 on device removal", removedResource)

This comment has been minimized.

@yguo0905

yguo0905 Feb 27, 2018

Author Contributor

@jiayingz, please take a look.

This comment has been minimized.

@vikaschoudhary16

vikaschoudhary16 Feb 27, 2018

Member

should allocatable also be set to 0 ?

This comment has been minimized.

@yguo0905

yguo0905 Feb 27, 2018

Author Contributor

Allocatable will be set to capacity at line 644.

@jiayingz

This comment has been minimized.

Copy link
Member

jiayingz commented Feb 27, 2018

/lgtm for the device plugin related change.

@yguo0905

This comment has been minimized.

Copy link
Contributor Author

yguo0905 commented Feb 27, 2018

@bsalamat and @vishh. I changed the default behavior for cluster autoscaler back to what it does for now.

I keep ignoreMissingExtendedResources as a flag rather than a list of resources that are required. I think ff we need to support new extended resources in cluster autoscaler, it's very likely that we will add the support in its cloud provider on a case by case basis.

@yguo0905 yguo0905 force-pushed the yguo0905:sched branch from 1a55f8d to 770bfa9 Feb 27, 2018

@yguo0905

This comment has been minimized.

Copy link
Contributor Author

yguo0905 commented Feb 27, 2018

Per discussion with @vishh offline, I removed the ignoreMissingExtendedResource flag. Instead, we remove the requests of the extended resources from pod if the resources are missing from node status in kubelet.


// Remove the requests of the extended resources that are missing in the
// node info. This is required to support cluster-level resources, which
// are extended resources unknown to nodes.

This comment has been minimized.

@vishh

vishh Feb 27, 2018

Member

There is an issue where if a pod was manually (or incorrectly) bound to a node where a node level extended resource it requires is not found, then kubelet will not fail admission. We should document this as known issue and fix it with resource APIs.

This comment has been minimized.

@yguo0905

yguo0905 Feb 27, 2018

Author Contributor

Will do.

@@ -39,7 +39,7 @@ func init() {
// Register functions that extract metadata used by predicates and priorities computations.
factory.RegisterPredicateMetadataProducerFactory(
func(args factory.PluginFactoryArgs) algorithm.PredicateMetadataProducer {
return predicates.NewPredicateMetadataFactory()
return predicates.NewPredicateMetadataFactory(args.PodLister)

This comment has been minimized.

@vishh

vishh Feb 27, 2018

Member

Is that argument even necessary? @bsalamat can this be removed?

This comment has been minimized.

@yguo0905

yguo0905 Feb 27, 2018

Author Contributor

I think this can be removed. If yes, I can do that in a separate PR in 1.11.

@vishh

This comment has been minimized.

Copy link
Member

vishh commented Feb 27, 2018

/lgtm
/approve
@yguo0905 please update public documentation with the repercussions of this PR. It enables cluster level resources, but with a few caveats (lack of support for daemonsets, requires scheduler extender config, etc.)

@k8s-ci-robot k8s-ci-robot added the lgtm label Feb 27, 2018

),
expectedPod: makeTestPod(
v1.ResourceList{},
v1.ResourceList{"foo.com/bar": quantity},

This comment has been minimized.

@jiayingz

jiayingz Feb 27, 2018

Member

Hmm, shouldn't this resource request be removed?

This comment has been minimized.

@yguo0905

yguo0905 Feb 27, 2018

Author Contributor

The request is in Limits which is skipped in this logic. I will refine this test once the overall design is LGTM'd.

This comment has been minimized.

@jiayingz

jiayingz Feb 27, 2018

Member

Oh I see. So this is limit and we leave limit unchanged. For extended resources, we actually require limits to equal to requests and in devicemanager, we actually examine resource Limits during resource allocation. Can we make sure to also remove limits for non-existing extended resources?

This comment has been minimized.

@jiayingz

jiayingz Feb 27, 2018

Member

Never mind. Whether we have Limit or not doesn't affect GeneralPredicate evaluation.

@jiayingz

This comment has been minimized.

Copy link
Member

jiayingz commented Feb 27, 2018

/lgtm

@bsalamat

This comment has been minimized.

Copy link
Contributor

bsalamat commented Feb 27, 2018

/lgtm

@yguo0905 yguo0905 force-pushed the yguo0905:sched branch from 770bfa9 to 7918be1 Feb 27, 2018

@k8s-github-robot k8s-github-robot removed the lgtm label Feb 27, 2018

@vishh

This comment has been minimized.

Copy link
Member

vishh commented Feb 28, 2018

Can you squash the commits @yguo0905 ?

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Feb 28, 2018

Support cluster-level extended resources in kubelet and kube-scheduler
Co-authored-by: Yang Guo <ygg@google.com>
Co-authored-by: Chun Chen <chenchun.feed@gmail.com>

@yguo0905 yguo0905 force-pushed the yguo0905:sched branch from 7918be1 to 8d88050 Feb 28, 2018

@k8s-github-robot k8s-github-robot removed the lgtm label Feb 28, 2018

@vishh

This comment has been minimized.

Copy link
Member

vishh commented Feb 28, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Feb 28, 2018

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

k8s-ci-robot commented Feb 28, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, jiayingz, vishh, yguo0905

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-github-robot

This comment has been minimized.

Copy link
Contributor

k8s-github-robot commented Feb 28, 2018

Automatic merge from submit-queue (batch tested with PRs 60236, 60332, 57375, 60451, 57408). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 1aee9fd into kubernetes:master Feb 28, 2018

13 checks passed

Submit Queue Queued to run github e2e tests a second time.
Details
cla/linuxfoundation yguo0905 authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-cross Skipped
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-gke Skipped
pull-kubernetes-e2e-kops-aws Job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce Job succeeded.
Details
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-unit Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.