KEP: New Resource API proposal #2265

Closed
wants to merge 5 commits into
base: master
from

@vikaschoudhary16
Member

vikaschoudhary16 commented Jun 14, 2018

@k8s-ci-robot k8s-ci-robot requested review from dchen1107 and idvoretskyi Jun 14, 2018

@vikaschoudhary16 vikaschoudhary16 changed the title from "Kep resource api proposal" to "KEP: New Resource API proposal" Jun 14, 2018

@idvoretskyi

Member

idvoretskyi commented Jun 14, 2018

/uncc

@k8s-ci-robot k8s-ci-robot removed the request for review from idvoretskyi Jun 14, 2018

@vikaschoudhary16

Member Author

vikaschoudhary16 commented Jun 18, 2018

```yaml
operator: "GtEq"
values:
- "30G"
```

@resouer

resouer Jun 21, 2018

Member

Would you mind adding another YAML example showing how a Pod references a ResourceClass?

@jiayingz

jiayingz Jun 28, 2018

Member

good idea. will do.

@jdumars jdumars added this to Backlog in KEP Tracking Jun 25, 2018

@vikaschoudhary16 vikaschoudhary16 force-pushed the vikaschoudhary16:kep-resource-api-proposal branch from a6047a0 to 1672a1b Jun 25, 2018

@vishh
Member

vishh left a comment

  1. Other than per-device quota, I feel most other use cases are ahead of their time.
  2. If Quota is the main problem to tackle right now, then does it require a new set of Resource APIs or can it be solved via [admission extensions](https://docs.google.com/a/google.com/document/d/1Lpuw-tjm252W4oGqBvcFAFUZkwFlpZ2i0EgpCwialN0/edit?disco=AAAAB_xFEyY) or some other way?
  3. If you'd like to test the waters with other use cases, this proposal should ideally be implemented via extensions. If extensions are inadequate, we should try to address extension gaps.
## User Stories
### As a cluster operator:
- Nodes in my cluster have GPU hardware from different generations. I want to classify GPU nodes into one of three categories (silver, gold, and platinum) depending on the launch timeline of the GPU family, e.g., Kepler K20, K80, Pascal P40, P100, Volta V100. I want to charge each of the three categories differently, and I want to offer my clients three GPU rates/classes to choose from.<br/>
**Motivation:** As time progresses in a cluster's lifecycle, new, advanced, high-performance, expensive GPU variants get added to the cluster nodes while older variants continue to co-exist. Some workloads strictly want the latest GPUs, and others are fine with older GPUs. But since there is a wide range of types, managing granularity at the level of each GPU type would be both hard and confusing. Grouping them into a few broad categories is more convenient to manage.<br/>

@vishh

vishh Jun 25, 2018

Member

Are there any users today that need this feature from kubernetes?

@adohe

adohe Jun 28, 2018

Member

Also, for this user story, I think we could use NodeAffinity to choose different GPU types.

@jiayingz

jiayingz Jun 28, 2018

Member

I think NodeAffinity shares a similar problem with node taints: cluster administrators cannot apply proper access control to restrict a user pod from using them. As mentioned at the very beginning, the goal of this proposal is "to better support non-native compute resources on kubernetes". We want to allow users to request them as compute resources, and allow administrators to control their access through resource quota.

@vikaschoudhary16

vikaschoudhary16 Jun 28, 2018

Author Member

This is in line with the general QoS model. We might like to experiment with this model in OpenShift. /cc @derekwaynecarr
I am wondering how NodeAffinity can be tied to the usage metrics that will be needed to charge per usage.

@vishh

vishh Jun 28, 2018

Member

https://docs.google.com/a/google.com/document/d/1Lpuw-tjm252W4oGqBvcFAFUZkwFlpZ2i0EgpCwialN0/edit?disco=AAAAB_xFEyk

Without knowing for sure that real users will benefit from it, I don't see why we'd solve this problem.

@adohe

adohe Jul 1, 2018

Member

I see your point, and I have always thought a more general resource API model should be better than label-based solutions. :)
👍

@derekwaynecarr

derekwaynecarr Jul 10, 2018

Member

if we are going to build a feature, we should have a clear user identified that we know will consume the feature to guide the use-case. i think we need to evaluate the feature relative to other ideas seen in the ecosystem.

for the similar use-case of "specify gpu attributes such as gpu type and memory requirements for deployment in heterogenous GPU clusters", nvidia appears to enable this by carrying two API fields on the pod spec.

see:
https://developer.nvidia.com/kubernetes-gpu
https://github.com/NVIDIA/kubernetes/blob/875873bec8f104dd87eea1ce123e4b81ff9691d7/pkg/apis/core/types.go#L2576
https://github.com/NVIDIA/kubernetes/blob/875873bec8f104dd87eea1ce123e4b81ff9691d7/pkg/apis/core/types.go#L2389

it would be good to evaluate resource class versus this other approach.

@vishh

vishh Jul 11, 2018

Member

I'd first like to understand if gpu type and memory requirements are a real user concern today in the first place even before considering possible solutions.
There are users who are sufficiently happy with using node selectors and most users today seem to bind pods to specific gpu types either for cost or specific memory requirements.

@vikaschoudhary16

vikaschoudhary16 Jul 11, 2018

Author Member

@derekwaynecarr we have scalability concerns with the kubernetes-gpu approach. Selector computation would be done for each compute resource in the scheduler cache. OTOH, with resource classes, there will be fewer resource classes than compute resources.
Another concern is portability.

@jiayingz

jiayingz Jul 12, 2018

Member

Thanks a lot for the pointers, @derekwaynecarr
As @vikaschoudhary16 mentioned, the main difference in nvidia's non-upstream solution is that it changes the current resource requirement API of the container spec. In our proposal, we explicitly list this as a non-goal for the following reasons:

  1. In a large cluster, computing operators like “greater than” and “less than” at pod creation can be a very slow operation and is not scalable. It can cause scaling issues on the scheduler side.
  2. Non-primary compute resources usually lack standard resource properties. Although there are benefits to allowing users to express their resource metadata requirements directly in their container spec, it may also compromise workload portability in the longer term.
  3. Resource quota control will become harder.
  4. We may consider the resource requirement API change as a possible extension orthogonal to the ResourceClass proposal.

By introducing ResourceClass as an additional resource abstraction layer, users can express their special resource requirements through a high-level portable name, and cluster admins can configure compute resources properly in different environments to meet such requirements. We feel this helps promote portability and separation of concerns while still maintaining API compatibility.

**How Resource classes can solve this:** As the operator/admin, I create three resource classes: GPU-Platinum, GPU-Gold, GPU-Silver. Since resource classes are quota controlled, end users will be able to request a resource class only if quota is allocated to them.

- I want a mechanism where it is possible to offer a group of devices, which are co-located on a single node and share a common property, as a single resource that can be requested in the pod container spec. For example, N GPU units interconnected by NVLink or N CPU cores on the same NUMA node.<br/>
**Motivation:** Increased performance because of local reference. Local reference also helps better use of cache<br/>

@vishh

vishh Jun 25, 2018

Member

This use case is unclear to me. What does local reference mean?

@jiayingz

jiayingz Jun 28, 2018

Member

@vikaschoudhary16 s/local reference/local access?

@vikaschoudhary16

vikaschoudhary16 Jun 28, 2018

Author Member

for example, "local" cache from the same NUMA node in case of cores.

@vishh

vishh Jun 28, 2018

Member

This intersects with topology awareness heavily. I think Resource Class (if it exists) should restrict itself to policy like allowing only certain shapes (2 GPU with max of 16 CPUs, ...). The topology aspect as currently planned is expected to be covered by QoS (or an additional application performance class API if necessary). Don't combine them both.

@jiayingz

jiayingz Jun 28, 2018

Member

What this proposal focuses on is a building block that allows guaranteed metadata-aware resource scheduling by surfacing resource metadata to the scheduler. I think what kind of metadata people want to surface should be left to HW vendors, resource providers, or infrastructure admins, based on different HW properties, platform environments, and workload requirements. We can provide best-practice guidelines and scaling results so that people can make the right decisions. Node-level best-effort topology-aware scheduling may allow better scaling, but I don't think we want to take an opinionated position here.

**How Resource classes can solve this:** Property/attribute which forms the grouping can be advertised in the device attributes and then a resource can be created to form a grouped super-resource based on that property.<br/>
**Can this be solved without resource classes:** No

- I want to have quota control on the devices at the granularity of device properties. For example, I want to have a separate quota for ECC enabled GPUs. I want to prevent a specific user from using more than ‘N’ ECC enabled GPUs overall at the namespace level.<br/>

@vishh

vishh Jun 25, 2018

Member

ECC is probably not a good example. Device types might be more common.
I'd like to explicitly identify if the number of dimensions that we need to support for quota is one or N ideally based on concrete user feedback.

@jiayingz

jiayingz Jun 28, 2018

Member

It should be N. We want to provide a general framework to support different types of devices like GPU, high performance NIC, FPGA, etc.

@vikaschoudhary16

vikaschoudhary16 Jun 28, 2018

Author Member

agree on Device type. Actually, ECC enabled GPU will be a different ComputeResource as mentioned in the sections below.

> I'd like to explicitly identify if the number of dimensions that we need to support for quota is one or N ideally based on concrete user feedback.

"number of dimensions" not clear to me?

@vishh

vishh Jun 28, 2018

Member

> It should be N. We want to provide a general framework to support different types of devices like GPU, high performance NIC, FPGA, etc.

I see the goal, but is there a real world use case for N? I only see the need for 1 dimension now.

- In my cluster, I have many different classes (different capabilities) of a device type (e.g., NICs). End users' expectations are met as long as a device has a very small subset of these capabilities. I want a mechanism where end users can request devices that satisfy their minimum expectations.
A few nodes are connected to the data network over 40 Gig NICs and others are connected over normal 1 Gig NICs. I want end user pods to be able to request data network connectivity with high network performance, while in the default case data network connectivity is offered via normal 1 Gbps NICs.<br/>

@vishh

vishh Jun 25, 2018

Member

This feels like a niche use case. Why can't the existing labels+affinity features work for this use case?
Also, why not build policies to restrict access via admission plugins rather than adding a new core resource?

@jiayingz

jiayingz Jun 28, 2018

Member

I think if people need to build and deploy various admission plugins to restrict access on different HW with different properties, that indicates the need for a general framework to support that use case.

**Can this be solved without resource classes:** Taints and tolerations can help in steering pods, but the problem is that there is no way today to have access control over the use of tolerations. Therefore, if there are multiple users, it is not possible to control which tolerations are allowed.<br/>
**How Resource classes can solve this:** I can define a ResourceClass for the high-performance NIC with minimum bandwidth requirements, and make sure only users with proper quota can use such resources.

- I want to be able to utilize different 'types' of a HW resource without losing workload portability when moving from one cluster to another. There can be Nvidia GPUs on one cluster and AMD GPUs on another cluster. This is an example of different ‘types’ of a HW resource (GPU). I want to offer GPUs to be consumed under the same portable name, as long as their capabilities are almost the same. If pods consume these GPUs through a generic resource class name, workloads can be migrated from one cluster to another transparently.<br/>

@vishh

vishh Jun 25, 2018

Member

This feels like a dream given the state of SW today based on my experience. For example, Tensorflow struggles working seamlessly across compute types (CPU, GPU, etc) and sub-architectures (Skylake, V100, AMD).
I feel we need to wait a bit for the world to evolve for this use case to become valid in k8s.

@jiayingz

jiayingz Jun 28, 2018

Member

For GPUs from different vendors, agree their properties can be quite different currently, although I wonder whether the difference is less significant for certain workloads like video decoding. For high-performance nic, I think user experience is perhaps less diversified. I also feel promoting portability is always a strong motivation on kubernetes,

**Motivation:** I want minimum guaranteed compute performance<br/>
**Can this be solved without resource classes:**<br/>
- Yes, using node labels and NodeLabelSelectors.
Problem: Same problem of lack of access control on using labelselectors at user level as with the use of tolerations.

@vishh

vishh Jun 25, 2018

Member

Wouldn't quota take care of access control to an extent?

@vikaschoudhary16

vikaschoudhary16 Jun 28, 2018

Author Member

Can you please elaborate on how quota can be used with labels?

@jiayingz

jiayingz Jun 28, 2018

Member

Yes. That is why we would like to introduce ResourceClass that fits naturally with resource quota.

@derekwaynecarr

derekwaynecarr Jul 10, 2018

Member

@vishh if I understand the proposed alternative, it's basically treating it as an opaque resource by convention? The user still needs to couple the opaque resource consumption with the device consumption, and that really can't be done until scheduling, no?

@vishh

vishh Jul 11, 2018

Member

@derekwaynecarr if we can assume 1-1 mapping between the opaque resource and actual device, then we don't have to be concerned with scheduling right?
I'm not sure if clobbering resource requests in a webhook is possible though.

- Yes, using node labels and NodeLabelSelectors.
Problem: Same problem of lack of access control on using labelselectors at user level as with the use of tolerations.
- OR, Instead of using resource class, provide flexibility to query resource properties directly in pod container resource requests.
Problem: In a large cluster, computing operators like “greater than”, “less than” at pod creation can be a very slow operation and is not scalable.

@vishh

vishh Jun 25, 2018

Member

What is the scale we are targeting? If generic scheduling features don't scale, then it's a problem that needs to be tackled separately.
cc @bsalamat

@jiayingz

jiayingz Jun 28, 2018

Member

Not sure I fully understand your comment here. What this paragraph means is that if we extend container resource request API to directly specify their metadata requirements, scheduler needs to do the label selection matching on all of the compute resources in the cluster. But with ResourceClass, scheduler can cache compute resource to ResourceClass matching in its NodeInfo cache, and so the current PodFitsNodeResource evaluation will mostly stay the same without introducing new scaling concerns.
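
A rough sketch of that caching idea follows, under stated assumptions. The `classMatcher`, `nodeCache`, and `buildCache` names are hypothetical and are not taken from the scheduler code base. The sketch only shows that selector evaluation can happen when the cache is built, so the per-pod fit check reduces to a count comparison. It ignores already-allocated devices and devices that match more than one class.

```go
package main

import "fmt"

// classMatcher is a hypothetical predicate: does a device with these
// properties belong to the ResourceClass?
type classMatcher func(props map[string]string) bool

// nodeCache holds, per ResourceClass name, how many devices on the node match.
type nodeCache map[string]int

// buildCache evaluates every class selector against every device once,
// for example when the node's devices or the set of classes change.
func buildCache(devices []map[string]string, classes map[string]classMatcher) nodeCache {
	cache := nodeCache{}
	for name, match := range classes {
		for _, props := range devices {
			if match(props) {
				cache[name]++
			}
		}
	}
	return cache
}

// fits is the per-pod check: a plain count comparison with no selector work.
func (c nodeCache) fits(class string, requested int) bool {
	return c[class] >= requested
}

func main() {
	devices := []map[string]string{
		{"speed": "40G"},
		{"speed": "1G"},
	}
	classes := map[string]classMatcher{
		// "fast-nic" is an illustrative class name.
		"fast-nic": func(p map[string]string) bool { return p["speed"] == "40G" },
	}
	cache := buildCache(devices, classes)
	fmt.Println("one fast-nic fits:", cache.fits("fast-nic", 1))
	fmt.Println("two fast-nic fit:", cache.fits("fast-nic", 2))
}
```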

@vishh

vishh Jun 28, 2018

Member

Ah got it. So this is not really a point justifying the need of better resource APIs. It is about the internal design of such a new API.

**How Resource classes can solve this:**
The Kubernetes scheduler is the central place to map container resource requests expressed through ResourceClass names to the underlying qualified physical resources, which automatically supports metadata aware resource scheduling.

- As a data scientist, I want my workloads to use advanced compute resources available in the running clusters without understanding the underlying hardware configuration details. I want the same workload to run on either on-prem Kubernetes clusters or on cloud, without changing its pod spec. When a new hardware driver comes out, I hope all the required resource configurations are handled properly by my cluster operators and things will just continue to work for any of my existing workloads.<br/>

@vishh

vishh Jun 25, 2018

Member

This use case seems too vague. TBH, the workloads that consume additional HW are specialized enough that it requires developer maintenance and cluster admins may not be able to homogenize different environments.

@jiayingz

jiayingz Jun 28, 2018

Member

I tend to disagree. I think a big value of Kubernetes is that it allows separation of concerns, so that application developers can focus on their own software while the underlying infrastructure is taken care of by cluster admins.

@vikaschoudhary16

vikaschoudhary16 Jun 28, 2018

Author Member

@vishh quoting Henry from Ebay in the support of workload portability:

HI Vish, Bobby, it is not exactly this requirement. Currently we ask developers to submit resource specifications for GPU using name of the cards to our data center :

"accelerator": {
  "type": "gpu",
  "quantity": "1",
  "labels": {
    "product": "nvidia",
    "family": "tesla",
    "model": "m40"
  }
}

But when we go to other cloud such as Google or AWS they may not have the same cards.

So I was wondering if we could offer resource such as CUDA cores and memory as resource specifications rather actual name and type of the cards.

However, different cards such as AMD vs NVIDIA was not the goal because we know program code against NVIDIA cards will not work well if run with AMD cards.

@vishh

vishh Jun 28, 2018

Member

I'll wait for Henry to respond to that thread. I think you all should solicit feedback from real users (I'm thinking ML WG, SIG BIG Data, etc.) to figure out if this is really feasible. No user that I have spoken to is ready to consume this level of sophistication today.

@jiayingz

jiayingz Jun 28, 2018

Member

I think we are proposing an infrastructure building block here whose target users are infrastructure admins and developers who want to make their systems easier to use by hiding the underlying hardware details from end users.

@RenaudWasTaken

RenaudWasTaken Jul 1, 2018

Member

> As a data scientist, I want my workloads to use advanced compute resources available in the running clusters without understanding the underlying hardware configuration details.

Yes and no. But mostly no. As mentioned by @vishh, this is too vague, and a lot of what this statement covers is outside of K8s' scope.

  • It might make sense when your users only request one GPU (but in that case, is ECC on/off a HW config?)
  • When a user requests more than one GPU, they should at the very least be able to specify whether the GPUs are linked through NVLINK

> I want the same workload to run on either on-prem Kubernetes clusters or on cloud, without changing its pod spec.

What's blocking this today or what might be blocking this in the future?

> When a new hardware driver comes out, I hope all the required resource configurations are handled properly by my cluster operators and things will just continue to work for any of my existing workloads.

How is this in the scope of Resource Classes?

- I want an easy and extensible mechanism to export my resource to Kubernetes. I want to be able to roll out new hardware features to the users who require those features without breaking users who are using old versions of hardware.<br/>
**Motivation:** enables more compute resources and their advanced features on Kubernetes<br/>
**Can this be solved without resource classes:**<br/>
Yes, Using node labels and NodeLabelSelectors.<br/>

@vishh

vishh Jun 25, 2018

Member

Again seems pretty vague. Is this a real need today? The example mentioned below also doesn't seem realistic.
Are there use cases outside of GPUs?

@jiayingz

jiayingz Jun 28, 2018

Member

This user story is mostly motivated by some past discussions on device plugin feature requests, e.g., as @kad mentioned in kubernetes/kubernetes#59109 (comment)
I like the explicit API model that once ResourceClass is in place, Kubelet can pass ResourceClass name to a device plugin, and the device plugin can map that ResourceClass name to the special underlying resource metadata requirements.

@vikaschoudhary16

vikaschoudhary16 Jun 28, 2018

Author Member

Another example could be different types of CPU cores, i.e., isolated cores, NUMA-affined cores, hyperthreaded cores, etc.

@vishh

vishh Jun 28, 2018

Member

For kubernetes/kubernetes#59109 (comment) wouldn't Pod annotations suffice?

> Another example could be different types of CPU cores, i.e., isolated cores, NUMA-affined cores, hyperthreaded cores, etc.

I don't think we are decided yet on what level of CPU specifics we want to expose to users.

@vikaschoudhary16

vikaschoudhary16 Jun 29, 2018

Author Member

After cpu manager's static policy, IMO, supporting features like isolated cores is a natural progression.
/cc @jeremyeder @kad

@derekwaynecarr

derekwaynecarr Jul 10, 2018

Member

i think this feature is orthogonal to cpus and confuses the discussion.

authors:
- "@vikaschoudhary16"
- "@jiayingz"
owning-sig: sig-node

@vishh

vishh Jun 25, 2018

Member

I'd add sig-scheduling as well since there is quite some impact to scheduling, quota, etc.

@jiayingz

jiayingz Jun 28, 2018

Member

Looking at the KEP template and existing ones, I think it expects one owning sig. But agree a big part of this proposal is on scheduler side, and we should add it as participating-sigs.

@vikaschoudhary16

vikaschoudhary16 Jun 28, 2018

Author Member

done

@ConnorDoyle

Member

ConnorDoyle commented Jun 27, 2018

/cc

@k8s-ci-robot k8s-ci-robot requested a review from ConnorDoyle Jun 27, 2018

@vikaschoudhary16 vikaschoudhary16 force-pushed the vikaschoudhary16:kep-resource-api-proposal branch 2 times, most recently from 22a9847 to bbf8f0c Jun 28, 2018

@vikaschoudhary16

Member Author

vikaschoudhary16 commented Jun 28, 2018

@vishh

> Other than per-device quota, I feel most other use cases are ahead of their time.

Workload portability is a feedback driven real use-case.

@vikaschoudhary16 vikaschoudhary16 force-pushed the vikaschoudhary16:kep-resource-api-proposal branch from bbf8f0c to a6af163 Jun 29, 2018

@jiayingz

Member

jiayingz commented Jun 29, 2018

@k8s-ci-robot k8s-ci-robot requested a review from kad Jun 29, 2018

@k8s-ci-robot

Contributor

k8s-ci-robot commented Jun 29, 2018

@jiayingz: GitHub didn't allow me to request PR reviews from the following users: hsaputra, bart0sh, fabiand.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @RenaudWasTaken @hsaputra @kad @bart0sh @fabiand

@k8s-ci-robot k8s-ci-robot requested a review from RenaudWasTaken Jun 29, 2018


To maintain and propagate ResourceClass to ComputeResource binding information, the scheduler will need to record this information in a newly introduced ContainerSpec field, similar to the existing NodeName field, and Kubelet will need to consume this information. During the initial implementation, we propose to encode the ResourceClass to the underlying compute resource binding information in a new `AllocatedComputeResources` field in ContainerSpec.
```golang
AllocatedComputeResources map[string]AllocatedResourceList

@rkamudhan

rkamudhan Jul 24, 2018

@vikaschoudhary16 How will this AllocatedComputeResources be similar to the AllocatedDeviceIDs map[v1.ResourceName][]types.UID field in ContainerSpec? I am asking in order to understand how this can be used by SRIOV network device plugins to pass information from the device plugin to CNI.

spec:
  resourceName: "nvidia.com/nvidia-gpu"
  labelSelector:
  - matchExpressions:
@k82cn

k82cn Jul 27, 2018

Member

Does this still mean label selector, e.g. select nodes whose memory is GtEq than 15G?

@jiayingz

jiayingz Jul 27, 2018

Member

It means select nvidia.com/gpu ComputeResource with memory GtEq than 15G. I.e., it is a resource-level unit, not a node-level one. A resource can have smaller granularity than a node (e.g., special CPUs with different hardware properties) or bigger granularity than a node (e.g., a cluster-level resource).

```
The above resource class will select all the nvidia-gpus which have memory greater
than or equal to 30 GB.
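
The selection rule described here can be sketched in Go. This is only an illustration: the `Device` and `Requirement` types, the `parseGigabytes` helper, the `matches` function, and the `memory` property key are hypothetical names and not part of the proposed API. The quantity parsing handles only the plain `<n>G` form used in the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Device is a hypothetical stand-in for one device inside a ComputeResource.
type Device struct {
	Name       string
	Properties map[string]string // for example "memory": "32G"
}

// Requirement mirrors one matchExpressions entry; only "GtEq" is handled here.
type Requirement struct {
	Key      string
	Operator string
	Value    string
}

// parseGigabytes converts a quantity such as "30G" into a whole number.
func parseGigabytes(q string) (int64, error) {
	return strconv.ParseInt(strings.TrimSuffix(q, "G"), 10, 64)
}

// matches reports whether the device satisfies every requirement.
func matches(d Device, reqs []Requirement) bool {
	for _, r := range reqs {
		have, ok := d.Properties[r.Key]
		if !ok || r.Operator != "GtEq" {
			return false
		}
		got, errGot := parseGigabytes(have)
		want, errWant := parseGigabytes(r.Value)
		if errGot != nil || errWant != nil || got < want {
			return false
		}
	}
	return true
}

func main() {
	reqs := []Requirement{{Key: "memory", Operator: "GtEq", Value: "30G"}}
	devices := []Device{
		{Name: "gpu-a", Properties: map[string]string{"memory": "16G"}},
		{Name: "gpu-b", Properties: map[string]string{"memory": "30G"}},
		{Name: "gpu-c", Properties: map[string]string{"memory": "40G"}},
	}
	for _, d := range devices {
		fmt.Printf("%s matches: %v\n", d.Name, matches(d, reqs))
	}
}
```

In this sketch both the 30G and the 40G device match the class, while the 16G device does not.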

@k82cn

k82cn Jul 27, 2018

Member

Two questions about GtEq, taking this as an example:

  1. Regarding ResourceClass, are 30G and 40G also selected?
  2. If a node has 45G, should we schedule another pod (nvidia.high.mem: 1) to the node? We have already committed memory greater than or equal to 30 GB to the first pod; if that pod uses more memory, we cannot guarantee the second one.

BTW, what is the feature's interaction with QoS?

@jiayingz

jiayingz Jul 27, 2018

Member

Yes, both a 30G GPU and a 40G GPU would match this ResourceClass request. Currently for extended resources, allocation is done at the device level, which is why the current ComputeResource struct just contains a list of devices. We also don't allow over-commit on extended resources, i.e., their allocation policy is always guaranteed.

when a ComputeResource matches multiple ResourceClasses with different Priority values, the scheduler will choose those with the highest Priority.
Supporting multiple ResourceClass matching also makes it easy to ensure that existing pods requesting resources through raw resource name can continue to be scheduled properly when administrators add ResourceClass in a cluster. To guarantee this, the scheduler may just consider raw resource as a special ResourceClass with empty resource metadata constraints and priority higher than any resource class.
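
A minimal sketch of the highest-priority rule, assuming a simple integer `Priority`. The `ResourceClass` struct here carries only the two fields needed for the illustration, and `bindableClasses` is a hypothetical helper name. The sketch does not model the raw-resource special case.

```go
package main

import "fmt"

// ResourceClass is reduced to the two fields this illustration needs.
type ResourceClass struct {
	Name     string
	Priority int
}

// bindableClasses takes the classes a ComputeResource matches and returns
// the names of those that share the highest Priority value.
func bindableClasses(matched []ResourceClass) []string {
	if len(matched) == 0 {
		return nil
	}
	best := matched[0].Priority
	for _, rc := range matched[1:] {
		if rc.Priority > best {
			best = rc.Priority
		}
	}
	var names []string
	for _, rc := range matched {
		if rc.Priority == best {
			names = append(names, rc.Name)
		}
	}
	return names
}

func main() {
	// The priority values are made up for the example.
	matched := []ResourceClass{
		{Name: "GPU-Silver", Priority: 1},
		{Name: "GPU-Gold", Priority: 5},
	}
	fmt.Println(bindableClasses(matched))
}
```

With these values, a device that matches both classes is offered only under GPU-Gold.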

Because a ComputeResource can match multiple ResourceClasses, Scheduler and Kubelet need to ensure a consistent view on ComputeResource to ResourceClass request binding. Let us consider an example to illustrate this problem. Suppose a node has two ComputeResources, CR1 and CR2, that have the same raw resource name but different sets of properties. Suppose they both satisfy the property constraints of ResourceClass RC1, but only CR2 satisfies the property constraints of another ResourceClass RC2. Suppose a Pod requesting RC1 is scheduled first. Because the RC1 resource request can be satisfied by either CR1 or CR2, it is important for the scheduler to record the binding information and propagate it to Kubelet, and Kubelet should honor this binding instead of making its own binding decision. This way, when another Pod comes in that requests RC2, the scheduler can determine whether Pod can fit on the node or not, depending on whether the previous RC1 request is bound to CR1 or CR2.
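
The CR1/CR2 example can be made concrete with a small sketch. The `node` type and its `bind` and `fits` methods are hypothetical, and each ComputeResource is assumed to hold exactly one free unit.

```go
package main

import "fmt"

// node tracks free units per ComputeResource and, per ResourceClass,
// which ComputeResources satisfy its property constraints.
type node struct {
	free    map[string]int
	matches map[string][]string
}

// newNode builds the scenario from the text: RC1 is satisfied by CR1 and
// CR2, RC2 only by CR2. One free unit each is an assumption of this sketch.
func newNode() *node {
	return &node{
		free:    map[string]int{"CR1": 1, "CR2": 1},
		matches: map[string][]string{"RC1": {"CR1", "CR2"}, "RC2": {"CR2"}},
	}
}

// bind records that a request for class rc is bound to ComputeResource cr.
// It fails if cr does not satisfy rc or has no free unit left.
func (n *node) bind(rc, cr string) bool {
	for _, candidate := range n.matches[rc] {
		if candidate == cr && n.free[cr] > 0 {
			n.free[cr]--
			return true
		}
	}
	return false
}

// fits reports whether one more request for class rc can be satisfied.
func (n *node) fits(rc string) bool {
	for _, cr := range n.matches[rc] {
		if n.free[cr] > 0 {
			return true
		}
	}
	return false
}

func main() {
	a := newNode()
	a.bind("RC1", "CR1")
	fmt.Println("RC1 bound to CR1, RC2 fits:", a.fits("RC2"))

	b := newNode()
	b.bind("RC1", "CR2")
	fmt.Println("RC1 bound to CR2, RC2 fits:", b.fits("RC2"))
}
```

When the RC1 request is bound to CR1, a later RC2 request still fits. When it is bound to CR2, it does not, which is why the scheduler and Kubelet have to agree on the recorded binding.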

@k82cn

k82cn Jul 27, 2018

Member

> Because the RC1 resource request can be satisfied by either CR1 or CR2, it is important for the scheduler to record the binding information and propagate it to Kubelet, and Kubelet should honor this binding instead of making its own binding decision.

Is there any enhancement to current behaviour between scheduler and kubelet?

@jiayingz

jiayingz Jul 27, 2018

Member

This can be used to ensure more guaranteed scheduling behavior between the scheduler and kubelet for special resource scheduling requests, as the scheduler now keeps better track of the availability of such special resources and would not over-allocate.

@jiayingz

Member

jiayingz commented Jul 30, 2018

@jiayingz

Member

jiayingz commented Jul 30, 2018

@jiayingz jiayingz referenced this pull request Jul 31, 2018

Closed

New Resource API Proposal #607

scenario as new resource properties are introduced into the system. Therefore we
support this behavior by default. To also provide an easy way for cluster admins
to reserve expensive compute resources and control their access with resource
quota, we propose to include a Priority field in ResourceClass API.

@ConnorDoyle

ConnorDoyle Aug 7, 2018

Member

Could we clarify the use cases that require non-overlapping resource classes?

  1. It seems this effect is achievable anyway if cluster admins design their resource class specs properly.
  2. With the described priority mechanism, to answer the question "why doesn't resource X on node Y match class Z?" users potentially have to inspect every resource class.

Possible fields we may consider to add later include:
- `DeviceUnits resource.Quantity`. This field can be used to support fractional
resource or infinite resource. In a more advanced use case, a device plugin may

@ipuustin

ipuustin Aug 15, 2018

Infinite resources might come in handy when a "default" non-countable resource should be assigned to a container. This could be used, for example, to make a DP set a node-specific environment variable for a container.

@RobertKrawitz

Contributor

RobertKrawitz commented Aug 15, 2018

Another use for infinite resources would be for metrics, if it's desired to know how much of a resource is being used but without intent of imposing a limit. For (non-)random example (not specifically applicable to this), using filesystem quotas to measure storage use by setting an effectively infinite quota.

**How Resource classes can solve this:**<br/>
Vendors can use the DevicePlugin API to propagate new hardware features and provide a best-practice ResourceClass spec for consuming their new hardware or new hardware features on Kubernetes. Vendors don’t need to worry that supporting this new hardware will break existing use cases on old hardware, because the Kubernetes scheduler takes the resource metadata into account during pod scheduling, so only pods that explicitly request this new hardware through the corresponding ResourceClass name will be allocated such resources.<br/>

- I want a mechanism that makes it possible to offer a group of devices, which are co-located on a single node and share a common property, as a single resource that can be requested in the pod container spec. For example, N GPU units interconnected by NVLink, or N CPU cores on the same NUMA node.<br/>

@ipuustin

ipuustin Aug 16, 2018

This could very well be the frontend to previously-discussed CPU pool concept. It would just be configuration which would set the property (such as "these cpu cores belong to the AVX pool") and not any physical device property (such as the shared NUMA node). I think we need to keep the primary resources in mind for this proposal too, even if they are not part of the scope yet.

@k8s-ci-robot

Contributor

k8s-ci-robot commented Aug 18, 2018

@vikaschoudhary16: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-community-verify | 55ecd0a | link | /test pull-community-verify |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

**Motivation:** Empower enterprise customers to consume and manage non-primary resources easily, similar to how they consume and manage primary resources today.<br/>
**Can this be solved without resource classes:** Without ResourceClass, people would rely on `NodeLabels`, `NodeAffinity`, `Taints`, and `Tolerations` to steer workloads to the appropriate nodes, or build their own [non-upstream solutions](https://github.com/NVIDIA/kubernetes/blob/875873bec8f104dd87eea1ce123e4b81ff9691d7/pkg/apis/core/types.go#L2576) to let users specify their resource-specific metadata requirements. Workloads would have a different experience consuming non-primary compute resources on k8s. As time goes on and more non-upstream solutions are deployed, the user experience becomes fragmented across different environments. Furthermore, `NodeLabels` and `Taints` were designed as node-level properties. They can't support multiple types of compute resources on a single node, and they don't integrate well with resource quota. Even with the recent [Pod Scheduling Policy proposal](https://github.com/kubernetes/community/pull/1937), cluster admins can either allow or deny pods in a namespace to specify a `NodeAffinity` or `Toleration`, but cannot assign different quota to different namespaces.<br/>
**How Resource classes can solve this:** I, operator/admin, create different ResourceClasses for different types of GPUs. User workloads can request different types of GPUs in their `ContainerSpec` resource requests/limits through the corresponding ResourceClass name, in the same way as they request primary resources. Since resource classes are quota controlled, end users will be able to consume the requested GPUs only if they have enough quota.<br/>
**Similar use case for network devices:** A cluster can have different types of high-performance NICs and/or InfiniBand cards, with different performance and cost. E.g., some nodes may have 40 Gig high-performance NICs and some may have 10 Gig high-performance NICs. Some devices may support RDMA and some may not. Different workloads may want to use different types of high-performance network access devices depending on their performance and cost tradeoff.<br/>

@edwarnicke

edwarnicke Aug 21, 2018

The 'devicey-ness' of NICs is usually not the only consideration. Nobody cares about a hardware NIC without caring about what it's connected to and the services provided over that connection. Or, to put it more clearly: the characteristics of the NIC itself are only a small part of the puzzle for network devices. :)

@vikaschoudhary16

vikaschoudhary16 Aug 21, 2018

Author Member

Agreed, and you are free to use any attributes you want to characterize a NIC :).

@edwarnicke

edwarnicke Aug 21, 2018

@vikaschoudhary16 Yep, I like the attributes approach. I'm still trying to understand how attributes of different 'types' are handled. Perhaps a more realistic example for network devices could be added here?

- key: "speed"
  operator: "Gt"
  values:
  - "40GBPS"

@edwarnicke

edwarnicke Aug 21, 2018

Question to test my own comprehension here. Could one reasonably have:

  - key: "networkservice"
    operator: "Eq"
    values:
    - "radio-network"

@vikaschoudhary16

vikaschoudhary16 Aug 21, 2018

Author Member

Yes, the key can be any attribute name advertised by the device plugin.

@edwarnicke

edwarnicke Aug 21, 2018

@vikaschoudhary16 OK, so what entity has to understand that "radio-network" is of type string, instead of type network bandwidth, or other type?

@fabiand

fabiand Sep 5, 2018

This way we could separate the physical NIC resource from the logical network "red", correct?

My only concern is that we might have a rather long list of properties for certain virtual/overlay network cases, where there are tens of thousands of networks.

But that might be so special that it's solved differently.

- key: "speed"
  operator: "Gt"
  values:
  - "40GBPS"

@edwarnicke

edwarnicke Aug 21, 2018

Also... "40GBPS" is not a simple number, how do you propose handling units wrt comparisons? What entity has to understand the units?

@vikaschoudhary16

vikaschoudhary16 Aug 21, 2018

Author Member

It will be similar to how millicore and memory units are already handled in existing code.

@edwarnicke

edwarnicke Aug 21, 2018

@vikaschoudhary16 Right... but some entity has to understand the units. I imagine there are many many kinds of units one might support. What entity has to understand these units? Effectively we are implicitly introducing a 'type' here by adding units... where before in the Device Plugin API we had an int. I'm just curious how new 'types' get added... and how we handle collisions of unit abbreviations.

@fabiand

fabiand Sep 5, 2018

You are right, the scheduler or constraint solver will need to be aware of the units (any unit used) in order to be able to interpret the matchExpression.
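To illustrate what "being aware of the units" could involve, here is a minimal Python sketch of a unit-aware comparison. The `UNITS` registry, its suffixes and dimensions, and the `evaluate` helper are all assumptions made for illustration; the proposal does not define them, and the existing Kubernetes quantity parser does not know a suffix such as `GBPS`.

```python
import re
from decimal import Decimal

# Hypothetical unit registry: suffix -> (dimension, multiplier to base unit).
# Suffixes are case-sensitive so that abbreviations are less likely to collide.
UNITS = {
    "": ("count", Decimal(1)),
    "G": ("bytes", Decimal(10) ** 9),
    "MBPS": ("bits_per_second", Decimal(10) ** 6),
    "GBPS": ("bits_per_second", Decimal(10) ** 9),
}

_QUANTITY = re.compile(r"(\d+(?:\.\d+)?)([A-Za-z]*)")

OPERATORS = {
    "Gt": lambda a, b: a > b,
    "GtEq": lambda a, b: a >= b,
    "Eq": lambda a, b: a == b,
}


def parse_quantity(text):
    """Split '40GBPS' into a dimension and an amount in base units."""
    match = _QUANTITY.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a quantity: {text!r}")
    number, suffix = match.groups()
    if suffix not in UNITS:
        raise ValueError(f"unknown unit suffix: {suffix!r}")
    dimension, multiplier = UNITS[suffix]
    return dimension, Decimal(number) * multiplier


def evaluate(operator, device_value, wanted_value):
    """Compare a device attribute against a selector value, refusing to
    compare quantities of different dimensions."""
    device_dim, device_amount = parse_quantity(device_value)
    wanted_dim, wanted_amount = parse_quantity(wanted_value)
    if device_dim != wanted_dim:
        raise ValueError(f"cannot compare {device_dim} with {wanted_dim}")
    return OPERATORS[operator](device_amount, wanted_amount)
```

In this sketch the registry is the single place where new unit "types" would be added, which is also where abbreviation collisions would have to be resolved.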

@booxter

booxter commented Aug 28, 2018

I am trying to understand how the proposal fits scenarios with network resources (NICs), and I have some comments / questions below.

  1. A node (or device) may not be universally connectable to any data center network. Let's say a NIC is able to connect to a "red" network but not a "blue" one. In this case, I would expect the device to be tagged with an attribute that describes its adjacency to the "red" network. The RC object would then refer to that attribute (perhaps of list type, using the In operator) to pick a device with the expected adjacency. Is that how it's supposed to be used? If so, there may be a problem with what seems to be a push against overlapping resource classes. AFAIU every NIC device will be tagged with adjacency, and some NIC devices may also have additional characteristics that separate them from the broader "connected-to-red-network" class (perhaps performance characteristics, whether formulated in terms of bandwidth or as "golden" or "silver" classes). If overlapping RCs are not supported, then one can't have both a generic "red" RC and a high-performing "speedy-red" RC, which reduces the usefulness of the feature for NIC resource classification. (I imagine the issue is not as pressing for other classes of devices, which may not have a universal, nearly mandatory attribute to select on.)

  2. For NICs, perhaps the most important quantifiable characteristic will be bandwidth. For devices that are directly backed by a hardware entity (like an SR-IOV PF), the current proposal seems to work fine, since each entity has a limited and non-shareable bandwidth. But for devices that have no 1-to-1 mapping to a physical entity (let's say it's an OVS DP that connects multiple virtual devices to a single physical NIC), bandwidth is a shared resource that describes the total bandwidth of all virtual devices connected through the NIC. One could model these devices by splitting the total bandwidth between a limited number of devices (e.g., if you have a total of 10Gbps for a NIC, you create 10 virtual devices with 1Gbps each), but that won't fit a case where someone needs a single virtual device with 5Gbps allocated to it. (Again, this probably doesn't affect traditional cases like GPUs, where there is a clear 1-to-1 mapping between a pod and a device.) Is this scenario being looked into within the scope of this proposal? If not, are there plans to consider it in future work?
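The second scenario can be sketched in a few lines of Python. This is only an illustration of the modeling gap, assuming bandwidth is tracked in whole Gbps; `fits_partitioned` and `BandwidthPool` are hypothetical names and not part of the proposal.

```python
def fits_partitioned(device_sizes_gbps, request_gbps):
    """Static partitioning: the NIC is pre-cut into fixed-size virtual
    devices, and a request must fit into a single one of them."""
    return any(size >= request_gbps for size in device_sizes_gbps)


class BandwidthPool:
    """Shared pool: bandwidth is carved out of the NIC total on demand."""

    def __init__(self, total_gbps):
        self.free_gbps = total_gbps

    def allocate(self, request_gbps):
        if request_gbps > self.free_gbps:
            return False
        self.free_gbps -= request_gbps
        return True
```

With ten pre-cut 1Gbps devices, `fits_partitioned([1] * 10, 5)` is false even though the NIC has 10Gbps in total, while `BandwidthPool(10).allocate(5)` succeeds.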

Thanks in advance for answers, and thanks for working on the proposal.

@fabiand

fabiand left a comment

Nice proposal. Just my 2ct added.

spec:
  resourceName: "nvidia.com/gpu"
  resourceSelector:
  - matchExpressions:

@fabiand

fabiand Sep 5, 2018

These attributes are the attributes exposed from the DP to kubelet?

Possible fields we may consider to add later include:
- `AutoProvisionConfig`. This field can be used to specify resource auto provisioning config in different cloud environments.
- `Scope`. Indicate whether it maps to node level resource or cluster level resource. For cluster level resource, scheduler, Kubelet, and cluster autoscaler can skip the PodFitsResources predicate evaluation. This allows consistent resource predicate evaluation among these components.
- `ResourceRequestParameters`. This field can be used to indicate special resource request parameters that device plugins may need in order to perform special configuration on their devices before they are consumed by workload pods requesting this resource.

@fabiand

fabiand Sep 5, 2018

This would be nice: parameters to requests. But IIUC it was explicitly excluded from this proposal, correct?

Note we intentionally leave these fields out of the initial design to limit the scope

@fabiand

fabiand Sep 5, 2018

… Ah yes, to answer myself.

- Another option is that Kubelet can evict the pods that are allocated with a non-existing ComputeResource. Although simple, this approach may disturb long-running workloads during device plugin upgrade.
- To support a less disruptive model, upon a resource property change, Kubelet can still export capacity under the old ComputeResource name for the devices used by active pods, and export capacity under the new matching ComputeResource name for devices not in use. That particular node finishes its transition only when those pods finish running. This approach avoids counting a resource multiple times and simplifies the scheduler's resource accounting. One potential downside is that the transition may take quite a long time if long-running pods are using the resource on the node. In that case, cluster admins can still drain the node at a convenient time to speed up the transition. Note that this approach does add some code complexity to the Kubelet DeviceManager component.
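The less disruptive option in the last bullet can be sketched as follows. This is a minimal Python illustration, assuming each device is a record with an old name, a new name, and an in-use flag; `exported_capacity` and the field names are hypothetical.

```python
from collections import Counter


def exported_capacity(devices):
    """Export each device under exactly one ComputeResource name: the old
    name while an active pod still uses it, the new name otherwise."""
    capacity = Counter()
    for device in devices:
        name = device["old_name"] if device["in_use"] else device["new_name"]
        capacity[name] += 1
    return dict(capacity)
```

Because every device is counted under exactly one name, the total exported capacity always equals the number of devices, which is what avoids the multiple counting mentioned in the bullet.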

We propose to start with the first option, i.e., device property change requires

@fabiand
```

On the other hand, a cluster admin may want to allow pods requesting nvidia-p100 to use ECC P100 GPUs if they are idle, and rely on scheduler preemption to reassign those devices to higher-priority pods requesting nvidia-p100-ecc. Such use cases require scheduler support for matching a ComputeResource to multiple qualified ResourceClasses.
We feel this model

@fabiand

fabiand Sep 5, 2018

There are too many line breaks in the next three lines.

@fabiand

fabiand commented Sep 5, 2018

Is there also a KEP to track gRPC changes to support additional device types better (i.e. NICs)?

@justaugustus

Member

justaugustus commented Oct 13, 2018

/kind kep

@justaugustus

Member

justaugustus commented Nov 20, 2018

REMINDER: KEPs are moving to k/enhancements on November 30. Please attempt to merge this KEP before then to signal consensus.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.

@justaugustus

Member

justaugustus commented Dec 1, 2018

KEPs have moved to k/enhancements.
This PR will be closed and any additional changes to this KEP should be submitted to k/enhancements.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.
/close

@k8s-ci-robot

Contributor

k8s-ci-robot commented Dec 1, 2018

@justaugustus: Closed this PR.

@RobertKrawitz RobertKrawitz referenced this pull request Mar 6, 2019

Closed

REQUEST: New membership for RobertKrawitz #573

6 of 6 tasks complete