KEP: New Resource API proposal #2265

vikaschoudhary16 · 2018-06-14T07:18:00Z

@derekwaynecarr @vishh @dchen1107 @jiayingz

idvoretskyi · 2018-06-14T08:36:12Z

/uncc

vikaschoudhary16 · 2018-06-18T11:27:58Z

resouer · 2018-06-21T13:17:18Z

keps/sig-node/00014-resource-api.md

+          operator: "GtEq"
+          values:
+            - "30G"
+```


Mind to add another yaml of how Pod reference ResourceClass?

good idea. will do.

vishh

Other than per-device quota, I feel most other use cases are ahead of their time.
If quota is the main problem to tackle right now, does it require a new set of Resource APIs, or can it be solved via [admission extensions](https://docs.google.com/a/google.com/document/d/1Lpuw-tjm252W4oGqBvcFAFUZkwFlpZ2i0EgpCwialN0/edit?disco=AAAAB_xFEyY) or some other way?
If you'd like to test the waters with other use cases, this proposal should ideally be implemented via extensions. If extensions are inadequate, we should try to address extension gaps.

vishh · 2018-06-25T23:02:26Z

keps/sig-node/00014-resource-api.md

+## Use Stories
+### As a cluster operator:
+- Nodes in my cluster has GPU HW from different generations. I want to classify GPU nodes into one of the three categories, silver, gold and platinum depending upon the launch timeline of the GPU family eg: Kepler K20, K80, Pascal P40, P100, Volta V100. I want to charge each of the three categories differently. I want to offer my clients 3 GPU rates/classes to choose from.<br/>
+**Motivation:** As time progresses in a cluster lifecycle, new advanced, high performance, expensive variants of GPUs gets added to the cluster nodes. At the same time older variants also co-exist. There are workloads which strictly wants latest GPUs and also there are workloads which are fine with older GPUs. But since there is a wide range of types, it will be hard to manage and confusing at the same time to have granularity at each GPU type. Grouping into few broad categories will be convenient to manage.<br/>


Are there any users today that need this feature from kubernetes?

Also, for this user story, I think we could use NodeAffinity to choose different GPU types.

I think NodeAffinity shares a similar problem with node taints: cluster administrators cannot apply proper access control to restrict a user pod from using them. As mentioned at the very beginning, the goal of this proposal is "to better support non-native compute resources on kubernetes". We want to allow users to request them as compute resources, and allow administrators to control their access through resource quota.
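The access-control argument can be illustrated with a minimal sketch, assuming a per-namespace quota keyed by resource class name. `NamespaceQuota`, `admit`, and the class names are hypothetical and not part of any Kubernetes API. The default-deny behavior for classes without a quota entry follows the KEP text's statement that users can request a class "only if quota is allocated".

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class NamespaceQuota:
    """Hypothetical per-namespace quota keyed by resource class name."""

    hard: dict[str, int]                                  # limits set by the admin
    used: dict[str, int] = field(default_factory=dict)    # current consumption

    def admit(self, requests: dict[str, int]) -> bool:
        """Admit a pod only if every requested class stays within its limit."""
        for name, amount in requests.items():
            limit = self.hard.get(name, 0)  # assumption: no quota entry, no access
            if self.used.get(name, 0) + amount > limit:
                return False
        # Charge usage only after all requests have passed the check.
        for name, amount in requests.items():
            self.used[name] = self.used.get(name, 0) + amount
        return True
```

A node label or toleration has no counterpart to `hard` here, which is the gap the comment above points at.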

This is in line with the general QoS model. We might like to experiment with this model in OpenShift. /cc @derekwaynecarr
I am wondering how NodeAffinity can be tied to the usage metrics that will be needed to charge per usage.

https://docs.google.com/a/google.com/document/d/1Lpuw-tjm252W4oGqBvcFAFUZkwFlpZ2i0EgpCwialN0/edit?disco=AAAAB_xFEyk

Without knowing for sure real users will benefit from it, i don't see why we'd solve this problem.

I see your point, and I have always thought a more general resource API model would be better than label-based solutions. :)
👍

if we are going to build a feature, we should have a clear user identified that we know will consume the feature to guide the use-case. i think we need to evaluate the feature relative to other ideas seen in the ecosystem.

for the similar use-case of "specify gpu attributes such as gpu type and memory requirements for deployment in heterogenous GPU clusters", nvidia appears to enable this by carrying two API fields on the pod spec.

see:
https://developer.nvidia.com/kubernetes-gpu
https://github.com/NVIDIA/kubernetes/blob/875873bec8f104dd87eea1ce123e4b81ff9691d7/pkg/apis/core/types.go#L2576
https://github.com/NVIDIA/kubernetes/blob/875873bec8f104dd87eea1ce123e4b81ff9691d7/pkg/apis/core/types.go#L2389

it would be good to evaluate resource class versus this other approach.

I'd first like to understand if gpu type and memory requirements are a real user concern today in the first place even before considering possible solutions.
There are users who are sufficiently happy with using node selectors and most users today seem to bind pods to specific gpu types either for cost or specific memory requirements.

@derekwaynecarr we have scalability concerns with the kubernetes-gpu approach: selector computation would be done for each compute resource in the scheduler cache. With resource classes, on the other hand, there will be fewer resource classes than compute resources.
Another concern is portability.

Thanks a lot for the pointers, @derekwaynecarr
As @vikaschoudhary16 mentioned, looking at nvidia's non-upstream solution, the main difference is that they changed the current resource requirement API of the container spec. In our proposal, we explicitly mentioned that this is a non-goal for the following reasons. First, in a large cluster, evaluating operators like “greater than” and “less than” at pod creation can be very slow and does not scale; it can cause scaling issues on the scheduler side. Second, non-primary compute resources usually lack standard resource properties. Although there are benefits to allowing users to express their resource metadata requirements directly in their container spec, it may also compromise workload portability in the longer term. Third, resource quota control will become harder. Fourth, we may consider the resource requirement API change as a possible extension orthogonal to the ResourceClass proposal. By introducing ResourceClass as an additional resource abstraction layer, users can express their special resource requirements through a high-level portable name, and cluster admins can configure compute resources properly in different environments to meet such requirements. We feel this helps promote portability and separation of concerns, while still maintaining API compatibility.
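To make the operator discussion concrete, here is a minimal sketch of evaluating ResourceClass-style constraints against a device's properties. The `Eq` and `GtEq` operator names and the `"30G"` value format come from the KEP excerpts quoted in this thread; the `Memory` key, the `parse_quantity` helper, and the decimal suffix table are assumptions made for illustration and do not reproduce the real `resource.Quantity` grammar.

```python
# Hypothetical decimal suffixes; the real quantity grammar is richer.
_SUFFIXES = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_quantity(text: str) -> int:
    """Parse strings such as "30G" or "16" into a plain integer."""
    if text and text[-1] in _SUFFIXES:
        return int(text[:-1]) * _SUFFIXES[text[-1]]
    return int(text)


def matches(properties: dict, constraints: list) -> bool:
    """Return True if the device properties satisfy every constraint."""
    for c in constraints:
        actual = properties.get(c["key"])
        if actual is None:
            return False  # a missing property never satisfies a constraint
        if c["operator"] == "Eq":
            ok = actual in c["values"]
        elif c["operator"] == "GtEq":
            ok = all(parse_quantity(actual) >= parse_quantity(v) for v in c["values"])
        else:
            raise ValueError(f"unsupported operator: {c['operator']}")
        if not ok:
            return False
    return True
```

With a ResourceClass layer, a check like `matches` would run when a class or device changes. If constraints lived in the container spec, it would run for every pod against every candidate device.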

vishh · 2018-06-25T23:03:15Z

keps/sig-node/00014-resource-api.md

+**How Resource classes can solve this:** I, operator/admin, creates three resource classes: GPU-Platinum, GPU-Gold, GPU-Silver. Now since resource classes are quota controlled, end-user will be able to request resource classes only if quota is allocated.
+
+- I want a mechanism where it is possible to offer a group of devices, which are co-located on a single node and shares a common property, as a single resource that can be requested in pod container spec. Example, N GPU units interconnected by NVLink or N cpu cores on same NUMA node.<br/>
+**Motivation:** Increased performance because of local reference. Local reference also helps better use of cache<br/>


This use case is unclear to me. What does local reference mean?

@vikaschoudhary16 s/local reference/local access?

for example, "local" cache from the same NUMA node in case of cores.

This intersects with topology awareness heavily. I think Resource Class (if it exists) should restrict itself to policy like allowing only certain shapes (2 GPU with max of 16 CPUs, ...). The topology aspect as currently planned is expected to be covered by QoS (or an additional application performance class API if necessary). Don't combine them both.

What this proposal focuses on is a building block that allows guaranteed metadata-aware resource scheduling by surfacing resource metadata to the scheduler. I think what kind of metadata people want to surface should be left to HW vendors, resource providers, or infrastructure admins, based on different HW properties, platform environments, and workload requirements. We can provide best-practice guidelines and scaling results to help people make the right decisions. Node-level best-effort topology-aware scheduling may allow better scaling, but I don't think we want to take an opinionated position here.

vishh · 2018-06-25T23:04:44Z

keps/sig-node/00014-resource-api.md

+**How Resource classes can solve this:**  Property/attribute which forms the grouping can be advertised in the device attributes and then a resource can be created to form a grouped super-resource based on that property.<br/>
+**Can this be solved without resource classes:** No
+
+- I want to have quota control on the devices at the granularity of device properties. For example, I want to have a separate quota for ECC enabled GPUs. I want a specific user to not let use more than ‘N’ number of ECC enabled GPUs overall at namespace level.<br/>


ECC is probably not a good example. Device types might be more common.
I'd like to explicitly identify if the number of dimensions that we need to support for quota is one or N ideally based on concrete user feedback.

It should be N. We want to provide a general framework to support different types of devices like GPU, high performance NIC, FPGA, etc.

agree on Device type. Actually, ECC enabled GPU will be a different ComputeResource as mentioned in the sections below.

> I'd like to explicitly identify if the number of dimensions that we need to support for quota is one or N ideally based on concrete user feedback.

"number of dimensions" not clear to me?

> It should be N. We want to provide a general framework to support different types of devices like GPU, high performance NIC, FPGA, etc.

I see the goal, but is there a real world use case for N? I only see the need for 1 dimension now.

vishh · 2018-06-25T23:08:06Z

keps/sig-node/00014-resource-api.md

+- In my cluster, I have many different classes (different capabilities) of a device type (ex: NICs). End user’s expectations are met as long as device has a very small subset of these capabilities. I want a mechanism where end user can request devices which satisfies their minimum expectation.
+Few nodes are connected to data network over 40 Gig NICs and others are connected over normal 1 Gig NICs. I want end user pods to be able to request
+data network connectivity with high network performance while
+in default case, data network connectivity is offered via normal 1 Gbps NICs.<br/>


This feels like a niche use case. Why can't the existing labels+affinity features work for this use case?
Also, why not build policies to restrict access via admission plugins rather than adding a new core resource?

I think if people need to build and deploy various admission plugins to restrict access on different HW with different properties, that indicates the need for a general framework to support that use case.

vishh · 2018-06-25T23:10:09Z

keps/sig-node/00014-resource-api.md

+**Can this be solved without resource classes:** Taints and tolerations can help in steering pods but the problem in that there is no way today to have access control over use of tolerations and therefore if multiple users are there, it is not possible to have control on allowed tolerations.<br/>
+**How Resource classes can solve this:** I can define a ResourceClass for the high-performance NIC with minimum bandwidth requirements, and makes sure only users with proper quota can use such resources.
+
+- I want to be able to utilize different 'types' of a HW resource while not losing workload portability when moving from one cluster to another. There can be Nvidia GPUs on one cluster and AMD GPUs on another cluster. This is example of different ‘types’ of a HW resource(GPU). I want to offer GPUs to be consumed under a same portable name, as long as their capabilities are almost same. If pods are consuming these GPUs with a generic resource class name, workload can be migrated from one cluster to another transparently.<br/>


This feels like a dream given the state of SW today based on my experience. For example, Tensorflow struggles working seamlessly across compute types (CPU, GPU, etc) and sub-architectures (Skylake, V100, AMD).
I feel we need to wait a bit for the world to evolve for this use case to become valid in k8s.

For GPUs from different vendors, I agree their properties can currently be quite different, although I wonder whether the difference is less significant for certain workloads like video decoding. For high-performance NICs, I think the user experience is perhaps less diverse. I also feel that promoting portability is always a strong motivation on Kubernetes.

vishh · 2018-06-25T23:12:00Z

keps/sig-node/00014-resource-api.md

+  **Motivation:** I want minimum guaranteed compute performance<br/>
+  **Can this be solved without resource classes:**<br/>
+  - Yes, using node labels and NodeLabelSelectors.
+    Problem: Same problem of lack of access control on using labelselectors at user level as with the use of tolerations.


Wouldn't quota take care of access control to an extent?

Can you please elaborate on how quota can be used with labels?

Yes. That is why we would like to introduce ResourceClass that fits naturally with resource quota.

https://docs.google.com/a/google.com/document/d/1Lpuw-tjm252W4oGqBvcFAFUZkwFlpZ2i0EgpCwialN0/edit?disco=AAAAB_xFEyY

@vishh if i understand the proposed alternative, it's basically treating it as an opaque resource by convention? the user still needs to couple the opaque resource consumption with the device consumption, and that really can't be done until scheduling, no?

@derekwaynecarr if we can assume 1-1 mapping between the opaque resource and actual device, then we don't have to be concerned with scheduling right?
I'm not sure if clobbering resource requests in a webhook is possible though.

vishh · 2018-06-25T23:12:51Z

keps/sig-node/00014-resource-api.md

+  - Yes, using node labels and NodeLabelSelectors.
+    Problem: Same problem of lack of access control on using labelselectors at user level as with the use of tolerations.
+  - OR, Instead of using resource class, provide flexibility to query resource properties directly in pod container resource requests.
+    Problem: In a large cluster, computing operators like “greater than”, “less than” at pod creation can be a very slow operation and is not scalable.


What is the scale we are targeting? If generic scheduling features don't scale, then it's a problem that needs to be tackled separately.
cc @bsalamat

Not sure I fully understand your comment here. What this paragraph means is that if we extend container resource request API to directly specify their metadata requirements, scheduler needs to do the label selection matching on all of the compute resources in the cluster. But with ResourceClass, scheduler can cache compute resource to ResourceClass matching in its NodeInfo cache, and so the current PodFitsNodeResource evaluation will mostly stay the same without introducing new scaling concerns.
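The caching idea in this reply can be sketched as follows. This is a simplified model, not the scheduler's actual code: `NodeInfoCache`, `add_compute_resource`, and `pod_fits` are hypothetical names, and class constraints are reduced to plain predicates. It assumes, as the KEP text quoted later in this thread states, that a ComputeResource's capacity is counted under every matching ResourceClass name.

```python
from collections import defaultdict


class NodeInfoCache:
    """Hypothetical stand-in for the scheduler's per-node resource cache."""

    def __init__(self, classes):
        # classes: {class_name: predicate(properties) -> bool}
        self.classes = classes
        # node name -> class name -> allocatable device count
        self.allocatable = defaultdict(lambda: defaultdict(int))

    def add_compute_resource(self, node, properties, capacity):
        """Evaluate class predicates once, when the resource is registered."""
        for name, predicate in self.classes.items():
            if predicate(properties):
                self.allocatable[node][name] += capacity

    def pod_fits(self, node, requests):
        """Per-pod check is a lookup by class name; no selectors are evaluated."""
        return all(self.allocatable[node][name] >= n for name, n in requests.items())
```

The per-pod path in `pod_fits` never touches device properties, which is why the existing fit evaluation can stay mostly unchanged under this model.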

Ah got it. So this is not really a point justifying the need of better resource APIs. It is about the internal design of such a new API.

vishh · 2018-06-25T23:19:17Z

keps/sig-node/00014-resource-api.md

+  **How Resource classes can solve this:**
+The Kubernetes scheduler is the central place to map container resource requests expressed through ResourceClass names to the underlying qualified physical resources, which automatically supports metadata aware resource scheduling.
+
+- As a data scientist, I want my workloads to use advanced compute resources available in the running clusters without understanding the underlying hardware configuration details. I want the same workload to run on either on-prem Kubernetes clusters or on cloud, without changing its pod spec. When a new hardware driver comes out, I hope all the required resource configurations are handled properly by my cluster operators and things will just continue to work for any of my existing workloads.<br/>


This use case seems too vague. TBH, the workloads that consume additional HW are specialized enough that it requires developer maintenance and cluster admins may not be able to homogenize different environments.

I tend to disagree. I think a big value of Kubernetes is that it allows separation of concerns: application developers can focus on their own software, with the underlying infrastructure taken care of by cluster admins.

@vishh quoting Henry from Ebay in the support of workload portability:

Hi Vish, Bobby, it is not exactly this requirement. Currently we ask developers to submit resource specifications for GPU using name of the cards to our data center:

"accelerator": {
  "type": "gpu",
  "quantity": "1",
  "labels": {
    "product": "nvidia",
    "family": "tesla",
    "model": "m40"
  }
}

But when we go to other cloud such as Google or AWS they may not have the same cards.

So I was wondering if we could offer resource such as CUDA cores and memory as resource specifications rather actual name and type of the cards.

However, different cards such as AMD vs NVIDIA was not the goal because we know program code against NVIDIA cards will not work well if run with AMD cards.

I'll wait for Henry to respond to that thread. I think you all should solicit feedback from real users (I'm thinking ML WG, SIG BIG Data, etc.) to figure out if this is really feasible. No user that I have spoken to is ready to consume this level of sophistication today.

I think we are proposing an infrastructure building block here whose target users are infrastructure admins and developers who want to make their systems easier to use by hiding the underlying hardware details from end users.

> As a data scientist, I want my workloads to use advanced compute resources available in the running clusters without understanding the underlying hardware configuration details.

Yes and no, but mostly no. As mentioned by @vishh, this is too vague, and a lot of what this statement covers is outside of K8s' scope.

It might make sense when your users only request one GPU (but in that case, is ECC on/off a HW config?).

When a user requests more than one GPU, they should at the very least be able to specify whether the GPUs are linked through NVLink.

> I want the same workload to run on either on-prem Kubernetes clusters or on cloud, without changing its pod spec.

What's blocking this today or what might be blocking this in the future?

> When a new hardware driver comes out, I hope all the required resource configurations are handled properly by my cluster operators and things will just continue to work for any of my existing workloads.

How is this in the scope of Resource Classes?

vishh · 2018-06-25T23:20:53Z

keps/sig-node/00014-resource-api.md

+- I want an easy and extensible mechanism to export my resource to Kubernetes. I want to be able to roll out new hardware features to the users who require those features without breaking users who are using old versions of hardware.<br/>
+**Motivation:** enables more compute resources and their advanced features on Kubernetes<br/>
+**Can this be solved without resource classes:**<br/>
+Yes, Using node labels and NodeLabelSelectors.<br/>


Again seems pretty vague. Is this a real need today? The example mentioned below also doesn't seem realistic.
Are there use cases outside of GPUs?

This user story is mostly motivated by some past discussions on device plugin feature requests, e.g., as @kad mentioned in kubernetes/kubernetes#59109 (comment)
I like the explicit API model that once ResourceClass is in place, Kubelet can pass ResourceClass name to a device plugin, and the device plugin can map that ResourceClass name to the special underlying resource metadata requirements.

Another example could be different types of CPU cores, i.e., isolated cores, NUMA-affined cores, hyperthreaded cores, etc.

For kubernetes/kubernetes#59109 (comment) wouldn't Pod annotations suffice?

> Another example could be different types of CPU cores, i.e., isolated cores, NUMA-affined cores, hyperthreaded cores, etc.

I don't think we are decided yet on what level of CPU specifics we want to expose to users.

After cpu manager's static policy, IMO, supporting features like isolated cores is a natural progression.
/cc @jeremyeder @kad

i think this feature is orthogonal to cpus and confuses the discussion.

vishh · 2018-06-25T23:27:52Z

keps/sig-node/00014-resource-api.md

+authors:
+  - "@vikaschoudhary16"
+  - "@jiayingz"
+owning-sig: sig-node


I'd add sig-scheduling as well since there is quite some impact to scheduling, quota, etc.

Looking at the KEP template and existing ones, I think it expects one owning sig. But agree a big part of this proposal is on scheduler side, and we should add it as participating-sigs.

ConnorDoyle · 2018-06-27T00:25:33Z

/cc

vikaschoudhary16 · 2018-06-28T11:10:05Z

@vishh

> Other than per-device quota, I feel most other use cases are ahead of their time.

Workload portability is a real use case driven by user feedback.

jiayingz · 2018-06-29T06:07:34Z

/cc @RenaudWasTaken @hsaputra @kad @bart0sh @fabiand

k8s-ci-robot · 2018-06-29T06:07:36Z

@jiayingz: GitHub didn't allow me to request PR reviews from the following users: hsaputra, bart0sh, fabiand.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.


jiayingz · 2018-07-30T16:52:40Z

On Mon, Jul 30, 2018 at 3:33 AM Renaud Gaubert wrote:

In keps/sig-node/00014-resource-api.md:

+to reserve expensive compute resources and control their access with resource
+quota, we propose to include a Priority field in ResourceClass API.
+By default, the value of this field is set to zero, but cluster admins can set
+it to a higher value, which would prevent its matching compute resources from
+being matched by lower priority ResourceClasses. i.e.,
+when a ComputeResource matches multiple ResourceClasses with different Priority values, the scheduler will choose those with the highest Priority.
+Supporting multiple ResourceClass matching also makes it easy to ensure that existing pods requesting resources through raw resource name can continue to be scheduled properly when administrators add ResourceClass in a cluster. To guarantee this, the scheduler may just consider raw resource as a special ResourceClass with empty resource metadata constraints.
+
+Because a ComputeResource can match multiple ResourceClasses, Scheduler and Kubelet need to ensure a consistent view on ComputeResource to ResourceClass request binding. Let us consider an example to illustrate this problem. Suppose a node has two ComputeResources, CR1 and CR2, that have the same raw resource name but different sets of properties. Suppose they both satisfy the property constraints of ResourceClass RC1, but only CR2 satisfies the property constraints of another ResourceClass RC2. Suppose a Pod requesting RC1 is scheduled first. Because the RC1 resource request can be satisfied by either CR1 or CR2, it is important for the scheduler to record the binding information and propagate it to Kubelet, and Kubelet should honor this binding instead of making its own binding decision.
+This way, when another Pod comes in that requests RC2, the scheduler can determine whether Pod can fit on the node or not, depending on whether the previous RC1 request is bound to CR1 or CR2.
+
+To maintain and propagate ResourceClass to ComputeResource binding information, the scheduler will need to record this information in a newly introduced ContainerSpec field, similar to the existing NodeName field, and Kubelet will need to consume this information. During the initial implementation, we propose to encode the ResourceClass to the underlying compute resource binding information in a new `AllocatedDeviceIDs map[v1.ResourceName][]types.UID` field in ContainerSpec. Adding this field has been discussed as a possible solution to support other use cases, such as third-party resource monitoring and network device plugins. For the purpose to support ResourceClass, we will extend the scheduler NodeInfo cache to store ResourceClass to the matching ComputeResource information on the node. For a given ComputeResource, its capacity will be reflected in NodeInfo.allocatableResource with all matching ResourceClass names. This way, the current node resource fitness evaluation will stay most the same. After a pod is bound to a node, the scheduler will choose the requested number of devices from the matching ComputeResource on the node, and record this information in the mentioned new field. After that, it increases the NodeInfo.requestedResource for all of the matching ResourceClass names of that ComputeResource. Note that if the AllocatedDeviceIDs field is pre-specified, scheduler should honor this binding instead of overwriting it, similar to how it handles pre-specified NodeName.
+
+A main reason we propose to have the scheduler make and record device level
+scheduling decision is so that the scheduler can maintain accurate resource acounting information.
+The matching from a ResourceClass to the underlying compute resources may change
+from two kinds of updates.
+First, cluster admins may want to add, delete, or modify a ResourceClass by adding or removing some metadata constraints or changing its priority. For ResourceClass add/delete/update handling, as long as scheduler has already assigned the pod to a node with a ComputeResource, it doesn't matter whether the old ResourceClass would be valid or not

> I pretty much disagree here. If we consider that Resource Class should represent a billable resource then: As a user when I submit a deployment using a certain Resource Class I expect that the underlying pods created by that deployment will always refer to the same kind of device. That Resource Class should not suddenly refer to something else or cost more when I update the number of replicas (or for whatever reason a pod might be created).

For billing use case, the billable resource is at ResourceClass level, not on physical device. If the underlying hardware devices are different enough that we want to charge them differently, we would then want to create different ResourceClasses for them and assign different resource quotas.

> IMHO, edits should not be possible and Deletes should only be possible if all references to that Resource Class have been deleted.

We may start by not allowing edits at the beginning, but I think totally disallowing them would hurt the user experience quite a bit. Consider, e.g., a cluster that has a ResourceClass defined to match different types of the same hardware: it would be quite inconvenient if cluster operators had to delete and recreate the ResourceClass to add a new type of hardware. For the deletion case, I agree we may consider not allowing deletion unless all running pods no longer refer to that ResourceClass. However, this doesn't seem to affect the scheduler logic much, given that we need to support ResourceClass addition anyway, which requires the scheduler to update its cached info on ComputeResource to ResourceClass matching.
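The CR1/CR2 example from the KEP text quoted in this reply can be worked through in a few lines. This sketch assumes each ComputeResource exposes exactly one device, which the KEP text does not state. `MATCHES`, `free_units`, and `fits` are hypothetical names.

```python
# Which ComputeResources satisfy each ResourceClass, per the quoted example.
MATCHES = {"RC1": ["CR1", "CR2"], "RC2": ["CR2"]}


def free_units(capacity, bindings):
    """Subtract recorded (class, compute resource) bindings from capacity."""
    free = dict(capacity)
    for _cls, cr in bindings:
        free[cr] -= 1
    return free


def fits(cls, capacity, bindings):
    """A class request fits if any matching ComputeResource has a free unit."""
    free = free_units(capacity, bindings)
    return any(free[cr] > 0 for cr in MATCHES[cls])
```

With `capacity = {"CR1": 1, "CR2": 1}`, a later RC2 request fits when the earlier RC1 pod was bound to CR1 and does not fit when it was bound to CR2. The scheduler can only answer that question if the binding is recorded and honored by the Kubelet.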


jiayingz · 2018-07-30T17:02:23Z

On Mon, Jul 30, 2018 at 4:02 AM Renaud Gaubert wrote:

In keps/sig-node/00014-resource-api.md:

+        - key: "Type"
+          operator: "Eq"
+          values:
+            - "nvidia-tesla-p100"
+        - key: "Nvlink"
+          operator: "Eq"
+          values:
+            - "true"
+```
+
+Now we face the question that whether the scheduler should allow Pods requesting
+"nvidia-p100" to land on a node in this new GPU node groups. So far, we have
+received different feedbacks on this question. In some use cases, users would
+like to have minimum matching behavior that as long as the underlying hardware
+matches the minimal requirements specified through ResourceClass contraints,
+they want to allow Pods to be scheduled on the hardware. On the other hand, some users desire to reserve expensive hardware resources for users who explicitly request them.

> > to hide underlying hardware difference to workloads that don't care about such difference
>
> Please highlight in the document which workloads and which hardware specifically this requirement is for. Your code is usually already compiled for a specific hardware.
>
> > Currently people are doing this through taint and toleration
>
> I agree that Resource Class needs a mechanism to signal that a Compute Resource should belong to a Resource Class. Priority just doesn't seem to be the right model. It's not an intuitive model, it's completely error prone and dangerous and absolutely not at the right granularity level. I'd much rather we explore different simpler models:
> - Overlapping Resource Class (multiple RC maps to the same Compute Resource) with the ability to remove Comp Res from an RC
> - Manually assign CRs to an RC when there is overlap
>
> More generally if we imagine Resource Class as a way to bill my users, I expect them to be able to handle overlap. If I want to see how many GPUs I have then I can just list the Compute Resources.

Could you add to the agenda to discuss whether we should omit priority field in the initial design in tomorrow's sig-node meeting? I think we are open to not adding this field during the initial phase if people have concerns on its usage and the use case is not considered as must-have. We can explore better way to support that use case after initial phase. Looks like you have some models in mind that you think would work better. Could you write down your design in a document with end-to-end support in more detail? Right now, it is hard to tell how your proposed models would work end-to-end.


ConnorDoyle · 2018-08-07T17:57:27Z

keps/sig-node/00014-resource-api.md

+scenario as new resource properties are introduced into the system. Therefore we
+support this behavior by default. To also provide an easy way for cluster admins
+to reserve expensive compute resources and control their access with resource
+quota, we propose to include a Priority field in ResourceClass API.


Could we clarify the use cases that require non-overlapping resource classes?

It seems this effect is achievable anyway if cluster admins design their resource class specs properly.

With the described priority mechanism, to answer the question "why doesn't resource X on node Y match class Z?" users potentially have to inspect every resource class.
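The debuggability concern can be seen in a minimal sketch of the priority rule as the KEP text describes it: when a resource matches several classes, only those with the highest priority keep it. `effective_classes` and the class name `nvidia-p100-nvlink` are hypothetical. `nvidia-p100` and the `Type`/`Nvlink` keys come from the KEP excerpt quoted earlier in this thread.

```python
def effective_classes(properties, classes):
    """Return the names of the classes a resource ends up belonging to.

    classes: list of (name, priority, predicate) tuples.
    """
    candidates = [(name, prio) for name, prio, pred in classes if pred(properties)]
    if not candidates:
        return []
    top = max(prio for _name, prio in candidates)
    # Lower-priority matches are dropped, even though their predicates hold.
    return sorted(name for name, prio in candidates if prio == top)
```

Because the result for one class depends on `max` over all candidates, explaining why a resource is missing from class Z requires looking at every other class that matches it.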

ipuustin · 2018-08-15T14:14:20Z

keps/sig-node/00014-resource-api.md

+
+Possible fields we may consider to add later include:
+- `DeviceUnits resource.Quantity`. This field can be used to support fractional
+  resource or infinite resource. In a more advanced use case, a device plugin may


Infinite resources might come in handy when a "default" non-countable resource should be assigned to a container. This could be used, for example, to make a device plugin set a node-specific environment variable in a container.

RobertKrawitz · 2018-08-15T14:33:37Z

Another use for infinite resources would be for metrics, if it's desired to know how much of a resource is being used but without intent of imposing a limit. For (non-)random example (not specifically applicable to this), using filesystem quotas to measure storage use by setting an effectively infinite quota.

ipuustin · 2018-08-16T13:51:24Z

keps/sig-node/00014-resource-api.md

+**How Resource classes can solve this:**<br/>
+Vendors can use DevicePlugin API to propagate new hardware features, and provide best-practice ResourceClass spec to consume their new hardware or new hardware features on Kubernetes. Vendors don’t need to worry supporting this new hardware would break existing use cases on old hardware because the Kubernetes scheduler takes the resource metadata into account during pod scheduling, and so only pods that explicitly request this new hardware through the corresponding ResourceClass name will be allocated with such resources.</br>
+
+- I want a mechanism where it is possible to offer a group of devices, which are co-located on a single node and share a common property, as a single resource that can be requested in pod container spec. Example, N GPU units interconnected by NVLink or N cpu cores on same NUMA node.<br/>


This could very well be the frontend to previously-discussed CPU pool concept. It would just be configuration which would set the property (such as "these cpu cores belong to the AVX pool") and not any physical device property (such as the shared NUMA node). I think we need to keep the primary resources in mind for this proposal too, even if they are not part of the scope yet.

k8s-ci-robot · 2018-08-18T05:06:06Z

@vikaschoudhary16: The following test failed, say /retest to rerun them all:

Test name	Commit	Details	Rerun command
pull-community-verify	`55ecd0a`	link	`/test pull-community-verify`

edwarnicke · 2018-08-21T13:30:39Z

keps/sig-node/00014-resource-api.md

+**Motivation:** Empower enterprise customers to consume and manage non-primary resources easily, similar to how they consume and manage primary resources today.<br/>
+**Can this be solved without resource classes:** Without ResourceClass, people would rely on `NodeLabels`, `NodeAffinity`, `Taints`, and `Tolerations` to steer workloads to the appropriate nodes, or build their own [non-upstream solutions](https://github.com/NVIDIA/kubernetes/blob/875873bec8f104dd87eea1ce123e4b81ff9691d7/pkg/apis/core/types.go#L2576) to allow users to specify their resource specific metadata requirements. Workloads would have different experience on consuming non-primary compute resources on k8s. As time goes and more non-upstream solutions were deployed, user experience becomes fragmented across different environments. Furthermore, `NodeLabels` and `Taints` were designed as node level properties. They can't support multiple types of compute resources on a single node, and don't integrate well with resource quota. Even with the recent [Pod Scheduling Policy proposal](https://github.com/kubernetes/community/pull/1937), cluster admins can either allow or deny pods in a namespace to specify a `NodeAffinity` or `Toleration`, but cannot assign different quota to different namespaces.<br/>
+**How Resource classes can solve this:** I, operator/admin, create different ResourceClasses for different types of GPUs. User workloads can request different types of GPUs in their `ContainerSpec` resource requests/limits through the corresponding ResourceClass name, in the same way as they request primary resources. Now since resource classes are quota controlled, end-user will be able to consume the requested GPUs only if they have enough quota.<br/>
+**Similar use case for network devices:** A cluster can have different types of high-performance NICs and/or infiniband cards, with different performance and cost. E.g., some nodes may have 40 Gig high-performance NICs and some may have 10 Gig high-performance NICs. Some devices may support RDMA and some may not. Different workloads may desire to use different type of high-network access devices depending on their performance and cost tradeoff.</br>


The 'devicey-ness' of NICs is usually not the only consideration. Nobody cares about a hardware NIC without caring about what it's connected to and the services being provided over that connection. Or to put it more clearly: the characteristics of the NIC itself are only a small part of the puzzle for network devices. :)

Agreed, and you are free to use any attributes you want to characterize a NIC :).

@vikaschoudhary16 Yep, I like the attributes approach. I'm still trying to understand how attributes of different 'types' are handled. Perhaps a more realistic example for network devices could be added here?

edwarnicke · 2018-08-21T13:38:21Z

keps/sig-node/00014-resource-api.md

+        - key: "speed"
+          operator: "Gt"
+          values:
+            - "40GBPS"


Question to test my own comprehension here. Could one reasonably have:

key: "networkservice"
operator: "Eq"
values:
- "radio-network"

yes, key can be any attribute name advertised by device plugin.

@vikaschoudhary16 OK, so what entity has to understand that "radio-network" is of type string, instead of type network bandwidth, or other type?

This way we could separate the resource phy nic from the logical network "red", correct?

My only concern is that we might have a rather long list of properties for certain virtual/overlay network cases, where there are tens of thousands of networks.

But that might be so special that it's solved differently.
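The typing question in this thread can be illustrated with a small evaluator for a single match expression. This is a sketch only: it assumes the operator decides how a value is interpreted (`Eq` and `In` compare strings, `Gt` and `GtEq` compare plain numbers), which the proposal does not state. The function name and the attribute dictionary are hypothetical.

```python
def match_expression(expr, device_attrs):
    """Evaluate one {key, operator, values} expression against a device.

    Assumption: the operator implies the type. "Eq"/"In" treat values as
    strings; "Gt"/"GtEq" treat them as unit-less numbers.
    """
    key, op, values = expr["key"], expr["operator"], expr["values"]
    if key not in device_attrs:
        return False
    actual = device_attrs[key]
    if op == "Eq":
        return str(actual) == values[0]
    if op == "In":
        return str(actual) in values
    if op in ("Gt", "GtEq"):
        a, b = float(actual), float(values[0])
        return a > b if op == "Gt" else a >= b
    raise ValueError(f"unsupported operator: {op!r}")


nic = {"networkservice": "radio-network", "speed": "40"}
radio = {"key": "networkservice", "operator": "Eq", "values": ["radio-network"]}
```

With this interpretation, `"radio-network"` never needs a declared type, because only the `Eq` and `In` operators are applied to it. Applying `Gt` to it fails when converting to a number, which is one place where the type question surfaces.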

edwarnicke · 2018-08-21T13:39:04Z

keps/sig-node/00014-resource-api.md

+        - key: "speed"
+          operator: "Gt"
+          values:
+            - "40GBPS"


Also... "40GBPS" is not a simple number, how do you propose handling units wrt comparisons? What entity has to understand the units?

it will be similar to how millicore units and memory units are handled in existing code already.

@vikaschoudhary16 Right... but some entity has to understand the units. I imagine there are many many kinds of units one might support. What entity has to understand these units? Effectively we are implicitly introducing a 'type' here by adding units... where before in the Device Plugin API we had an int. I'm just curious how new 'types' get added... and how we handle collisions of unit abbreviations.

You are right, the scheduler or constraint solver will need to be aware of the units (any unit used) in order to be able to interpret the matchExpression.
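As an illustration of what "being aware of the units" could involve, the sketch below parses values such as `40GBPS` through a small unit table before comparing them. The `UNITS` table, the decimal multipliers, and the function names are assumptions made for this example; they are not the parsing rules Kubernetes uses for CPU or memory quantities.

```python
import re

# Hypothetical unit registry. Whoever evaluates the expression must own a
# table like this, and each abbreviation can carry only one meaning.
UNITS = {"BPS": 1, "KBPS": 10**3, "MBPS": 10**6, "GBPS": 10**9}

_QUANTITY = re.compile(r"^(\d+)([A-Za-z]+)$")


def parse_quantity(text):
    """Convert a string such as '40GBPS' to an integer number of base units."""
    m = _QUANTITY.match(text)
    if m is None:
        raise ValueError(f"not a quantity: {text!r}")
    number, unit = m.groups()
    unit = unit.upper()
    if unit not in UNITS:
        raise ValueError(f"unknown unit: {unit!r}")
    return int(number) * UNITS[unit]


def greater_than(actual, threshold):
    """'Gt' comparison after normalizing both sides to base units."""
    return parse_quantity(actual) > parse_quantity(threshold)
```

Under these assumptions, `greater_than("100GBPS", "40GBPS")` holds while `greater_than("40000MBPS", "40GBPS")` does not, since both sides normalize to the same value. An unregistered suffix raises an error rather than being compared as a string.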

booxter · 2018-08-28T16:23:38Z

I am trying to understand how the proposal fits scenarios with network resources (NICs), and I have some comments / questions below.

A node (or device) may not be universally connectable to any data center network. Let's say a NIC is able to connect to a "red" network but not a "blue" one. In this case, I would expect that the device will be tagged with an attribute that describes its adjacency to the "red" network. Then the RC object would refer to the attribute (perhaps of list type, using the In operator) to pick a particular device with the expected adjacency. Is this how it's supposed to be used? If so, there may be a problem with what seems to be a push against overlapping resource classes. AFAIU every NIC device will be tagged with adjacency, and some NIC devices may also have additional characteristics that separate them from the broader "connected-to-red-network" class. (Perhaps it's performance characteristics, whether formulated in terms of bandwidth or as "golden" or "silver" classes.) If overlapping RCs are not supported, then one can't have both a generic "red" RC and a high-performing "speedy-red" RC. That reduces the usefulness of the feature for NIC resource classification. (I imagine that the issue is not as pressing for other classes of devices where they may not have a universal nearly-mandatory attribute to select with.)

For NICs, perhaps the most important quantifiable characteristic will be bandwidth. For devices that are directly backed by a hardware entity (like SR-IOV PF), the current proposal seems to work fine since each entity has a limited and non-shareable bandwidth. But for devices that have no 1-to-1 mapping to a physical entity (let's say it's OVS DP that connects multiple virtual devices to a single physical NIC), bandwidth is a shared resource that describes the total bandwidth of all virtual devices connected through the NIC. One could model their devices by splitting the total bandwidth between a limited number of devices (e.g. if you have a total of 10Gbps for a NIC, you create 10 virtual devices with 1Gbps each) but it won't fit a case where someone needs a single virtual device with 5Gbps allocated to it. (Again, this probably doesn't affect traditional cases like GPUs where there is a clear 1-to-1 mapping between a pod and a device.) Is this scenario being looked into in the scope of this proposal? If not, are there plans to consider it in future work?

Thanks in advance for answers, and thanks for working on the proposal.
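The shared-bandwidth scenario can be sketched by contrasting two accounting models. Both are hypothetical illustrations and neither is part of the proposal: the first pre-splits a NIC into fixed-size virtual devices and gives a pod exactly one of them, the second treats the NIC's bandwidth as a divisible pool.

```python
def fixed_slices_can_serve(total_gbps, slice_gbps, request_gbps):
    """Countable model: the NIC is pre-split into equal virtual devices,
    and a pod receives exactly one device."""
    slices = total_gbps // slice_gbps
    return slices > 0 and request_gbps <= slice_gbps


class SharedPool:
    """Divisible model: bandwidth is handed out in arbitrary amounts
    until the NIC's total is exhausted."""

    def __init__(self, total_gbps):
        self.free_gbps = total_gbps

    def allocate(self, request_gbps):
        if request_gbps > self.free_gbps:
            return False
        self.free_gbps -= request_gbps
        return True
```

For a 10Gbps NIC split into 1Gbps slices, `fixed_slices_can_serve(10, 1, 5)` is false even though half the NIC would suffice, while `SharedPool(10)` can satisfy two 5Gbps requests before refusing a third.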

fabiand

Nice proposal. Just my 2ct added.

fabiand · 2018-09-03T12:35:00Z

keps/sig-node/00014-resource-api.md

+        - key: "speed"
+          operator: "Gt"
+          values:
+            - "40GBPS"


This way we could separate the resource phy nic from the logical network "red", correct?

My only concern is that we might have a rather long list of properties for certain virtual/overlay network cases, where there are tens of thousands of networks.

But that might be so special that it's solved differently.

fabiand · 2018-09-03T12:35:29Z

keps/sig-node/00014-resource-api.md

+spec:
+  resourceName: "nvidia.com/gpu"
+  resourceSelector:
+    - matchExpressions:


Are these attributes the ones exposed from the DP to the kubelet?

fabiand · 2018-09-03T12:36:34Z

keps/sig-node/00014-resource-api.md

+        - key: "speed"
+          operator: "Gt"
+          values:
+            - "40GBPS"


You are right, the scheduler or constraint solver will need to be aware of the units (any unit used) in order to be able to interpret the matchExpression.

fabiand · 2018-09-03T12:37:38Z

keps/sig-node/00014-resource-api.md

+Possible fields we may consider to add later include:
+- `AutoProvisionConfig`. This field can be used to specify resource auto provisioning config in different cloud environments.
+- `Scope`. Indicate whether it maps to node level resource or cluster level resource. For cluster level resource, scheduler, Kubelet, and cluster autoscaler can skip the PodFitsResources predicate evaluation. This allows consistent resource predicate evaluation among these components.
+- `ResourceRequestParameters`. This field can be used to indicate special resource request prameters that device plugins may need to perform special configurations on their devices to be consumed by workload pods requesting this resource.


This would be nice: parameters for requests. But IIUIC it was explicitly excluded from this proposal, correct?

fabiand · 2018-09-03T12:37:50Z

keps/sig-node/00014-resource-api.md

+- `Scope`. Indicate whether it maps to node level resource or cluster level resource. For cluster level resource, scheduler, Kubelet, and cluster autoscaler can skip the PodFitsResources predicate evaluation. This allows consistent resource predicate evaluation among these components.
+- `ResourceRequestParameters`. This field can be used to indicate special resource request prameters that device plugins may need to perform special configurations on their devices to be consumed by workload pods requesting this resource.
+
+Note we intentially leave these fields out of the initial design to limit the scope


… Ah yes, to answer myself.

fabiand · 2018-09-03T12:41:46Z

keps/sig-node/00014-resource-api.md

+- Another option is that Kubelet can evict the pods that are allocated with a non-existing ComputeResource. Although simple, this approach may disturb long-running workloads during device plugin upgrade.
+- To support a less disruptive model, upon resource property change, Kubelet can still export capacity at old ComputeResource name for the devices used by active pods, and exports capacity at new matching ComputeResource name for devices not in use. Only when those pods finish running, that particular node finishes its transition. This approach avoids resource multiple counting and simplifies the scheduler resource accounting. One potential downside is that the transition may take quite long process if there are long running pods using the resource on the nodes. In that case, cluster admins can still drain the node at convenient time to speed up the transition. Note that this approach does add certain code complexity on Kubelet DeviceManager component.
+
+We propose to start with the first option, i.e., device property change requires


fabiand · 2018-09-03T12:43:40Z

keps/sig-node/00014-resource-api.md

+```
+
+On the other hand, cluster admin may want to allow pods requesting nvidia-p100 to use ecc p100 GPUs if they are idle, but relies on scheduler preemption to re-assign those devices to pods requesting nvidia-p100-ecc and with higher priority. Such use cases require the scheduler support on matching a ComputeResource to multiple qualified ResourceClasses.
+We feel this model 


There are too many line breaks in the next three lines.

fabiand · 2018-09-05T08:58:30Z

Is there also a KEP to track gRPC changes to support additional device types better (i.e. NICs)?

justaugustus · 2018-10-13T03:34:53Z

/kind kep

justaugustus · 2018-11-20T04:42:53Z

REMINDER: KEPs are moving to k/enhancements on November 30. Please attempt to merge this KEP before then to signal consensus.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.

justaugustus · 2018-12-01T08:05:39Z

KEPs have moved to k/enhancements.
This PR will be closed and any additional changes to this KEP should be submitted to k/enhancements.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.
/close

k8s-ci-robot · 2018-12-01T08:05:40Z

@justaugustus: Closed this PR.

angao · 2019-07-25T08:20:54Z

I want to know the current status of this proposal; I did not find it in the k/enhancements repo. We have some requirements that are similar to this one. We are concerned about whether Kubernetes has a plan to implement device-based scheduling, instead of having devicemanager randomly select devices for a Pod. We want to implement requirements such as scheduling based on GPU models.
/reopen @k82cn @vishh

k8s-ci-robot · 2019-07-25T08:22:56Z

@angao: You can't reopen an issue/PR unless you authored it or you are a collaborator.

jiayingz · 2019-07-25T17:29:15Z

@angao this proposal is currently put on hold. I am not aware of any plan to implement device-based scheduling in k8s.

angao · 2019-07-26T01:59:31Z

@jiayingz thx. I wonder if we can try to implement device-specific allocation by modifying the devicemanager interface instead of the current random allocation. Other pieces, such as the API, can be implemented via CRD. This way, we can maybe alleviate such problems.
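A rough sketch of what device-specific allocation could look like, in contrast to picking any free device. The function name, the `free` mapping of device ID to GPU model, and the deterministic ordering are all assumptions for illustration; this is not the devicemanager interface.

```python
def pick_devices(free, count, wanted_model):
    """Select `count` free devices of the requested model, or None.

    `free` maps a hypothetical device ID to its GPU model. Sorting makes
    the choice deterministic instead of arbitrary.
    """
    eligible = sorted(dev for dev, model in free.items() if model == wanted_model)
    if len(eligible) < count:
        return None
    return eligible[:count]


free_gpus = {"gpu0": "k80", "gpu1": "v100", "gpu2": "v100"}
```

Here `pick_devices(free_gpus, 2, "v100")` selects `gpu1` and `gpu2`, while a request for two `k80` devices is refused because only one is free.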

jiayingz · 2019-07-27T00:16:49Z

@angao there has been some effort on extending device plugin API for topology aware scheduling, but not sure whether this is something you are looking for.

Reserve KEP number for new resource api proposal

ef8916b

k8s-ci-robot requested review from dchen1107 and idvoretskyi June 14, 2018 07:18

vikaschoudhary16 changed the title ~~Kep resource api proposal~~ KEP: New Resource API proposal Jun 14, 2018

This was referenced Jun 14, 2018

Reserve KEP number for new resource api proposal #2264

Closed

Add New Resource API proposal #782

Closed

k8s-ci-robot removed the request for review from idvoretskyi June 14, 2018 08:36

k8s-ci-robot assigned dchen1107 and derekwaynecarr Jun 18, 2018

resouer reviewed Jun 21, 2018

vikaschoudhary16 force-pushed the kep-resource-api-proposal branch from a6047a0 to 1672a1b Compare June 25, 2018 11:47

vishh reviewed Jun 25, 2018

k8s-ci-robot requested a review from ConnorDoyle June 27, 2018 00:25

vikaschoudhary16 force-pushed the kep-resource-api-proposal branch 2 times, most recently from 22a9847 to bbf8f0c Compare June 28, 2018 11:07

KEP: New Resource API

a6af163

vikaschoudhary16 force-pushed the kep-resource-api-proposal branch from bbf8f0c to a6af163 Compare June 29, 2018 01:55

k8s-ci-robot requested a review from kad June 29, 2018 06:07

k8s-ci-robot requested a review from RenaudWasTaken June 29, 2018 06:07

jiayingz mentioned this pull request Jul 31, 2018

New Resource API Proposal kubernetes/enhancements#607

Closed

ConnorDoyle reviewed Aug 7, 2018

jiayingz mentioned this pull request Aug 9, 2018

Is sharing GPU to multiple containers feasible? kubernetes/kubernetes#52757

Open

ipuustin reviewed Aug 15, 2018

ipuustin reviewed Aug 16, 2018

jiayingz mentioned this pull request Aug 16, 2018

KEP: Support Device Monitoring #2454

Merged

Some changes to reflect the recent POR

55ecd0a

edwarnicke reviewed Aug 21, 2018

flx42 mentioned this pull request Aug 30, 2018

The nvidia-device-plugin label GPU node automatically? NVIDIA/k8s-device-plugin#70

Closed

fabiand reviewed Sep 5, 2018

k8s-ci-robot added the kind/kep label Oct 13, 2018

k8s-ci-robot closed this Dec 1, 2018

RobertKrawitz mentioned this pull request Mar 6, 2019

REQUEST: New membership for RobertKrawitz kubernetes/org#573

Closed
