Device Plugin Beta work tracker #56649

jiayingz · 2017-11-30T21:47:15Z

/kind feature

What happened:
The DevicePlugins feature was introduced in 1.8 Kubernetes release as an alpha feature.
Since then, we have seen good development work on stabling the feature as well as early adopters on using the framework to support device resources in Kubernetes.
This issue is to track the work requirements of entering this feature into beta.
FYI, here is list of requirements from the OSS Kubernetes feature stage doc: https://github.com/kubernetes/community/blob/master/contributors/devel/api_changes.md#alpha-beta-and-stable-versions
"""
Beta level:
Object Versioning: API version name contains beta (e.g. v2beta3)
Availability: in official Kubernetes releases, and enabled by default
Audience: users interested in providing feedback on features
Completeness: all API operations, CLI commands, and UI support should be implemented; end-to-end tests complete; the API has had a thorough API review and is thought to be complete, though use during beta may frequently turn up API issues not thought of during review
Upgradeability: the object schema and semantics may change in a later software release; when this happens, an upgrade path will be documented; in some cases, objects will be automatically converted to the new version; in other cases, a manual upgrade may be necessary; a manual upgrade may require downtime for anything relying on the new feature, and may require manual conversion of objects to the new version; when manual conversion is necessary, the project will provide documentation on the process
Cluster Reliability: since the feature has e2e tests, enabling the feature via a flag should not create new bugs in unrelated features; because the feature is new, it may have minor bugs
Support: the project commits to complete the feature, in some form, in a subsequent Stable version; typically this will happen within 3 months, but sometimes longer; releases should simultaneously support two consecutive versions (e.g. v1beta1 and v1beta2; or v1beta2 and v1) for at least one minor release cycle (typically 3 months) so that users have enough time to upgrade and migrate objects
Recommended Use Cases: in short-lived testing clusters; in production clusters as part of a short-lived evaluation of the feature in order to provide feedback
"""
I think we met most of the requirements here, with the commit that if we need to make incompatible API change in the future, we will document the upgrade path and if necessary, may support multiple API versions in a couple of releases.

Things that I think would be nice to have to enter beta (entering beta may not necessary depends on their completeness)

Regarding to the thorough API review requirement, I think we had good discussions during the design proposal but there have been some recent discussions on whether we should use registration vs probe model for the general Kubelet plugin communication. I think we should have a general discussion on this topic with larger audience like people working on CSI plugins. I will create a FR issue to track this discussion soon.
We would like to see multiple device plugin implementations. We know some people have started working on device plugin implementations for Solarflare, FPGA, and Infiniband, and probably want to keep closer track on their progresses.

We would like to target this FR to 1.10. Please let us know your thoughts or add things that you think are critical to finish. Thanks!

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

Kubernetes version (use kubectl version):
Cloud provider or hardware configuration:
OS (e.g. from /etc/os-release):
Kernel (e.g. uname -a):
Install tools:
Others:

The text was updated successfully, but these errors were encountered:

jiayingz · 2017-11-30T22:39:24Z

/sig node

lichuqiang · 2017-12-01T02:05:16Z

I think we had good discussions during the design proposal but there have been some recent discussions on whether we should use registration vs probe model for the general Kubelet plugin communication.

I'm a fan of probe model. I do feel it unfriendly to plugin developers to let them care about the re-registration after kubelet's restart. And have received complaint from my colleague.

ScorpioCPH · 2017-12-01T03:37:23Z

Hi, for cluster reliability, I think we need more test cases, including scale and soak tests, to simulate more production scenarios.

cmluciano · 2017-12-08T04:29:29Z

can we consolidate this with #53497 ?

jiayingz · 2017-12-08T20:59:15Z

Sure. Will do this with Vish after KubeCon.

RenaudWasTaken · 2018-01-10T23:34:30Z

Sorry to bring up this topic a bit late (feature freeze is the 22nd of Jan).
Here are the features we (NVIDIA) consider important or blocker before entering beta:

[BLOCKER] Per container allocation
Add labels during device plugin registration (e.g: gpu-family: tesla)
Add annotations to a container for beta and add an option to CRI for GA to support OCI hooks

Things that are not explicitly related to the device plugin:

Daemonset restart policy: OnFailure

We're also wondering if we should tie up extended resources quotas

jiayingz · 2018-01-11T00:03:23Z

Thanks a lot for your input, @RenaudWasTaken

For per container allocation, I think we should be able to implement it in 1.10. I think we should be able to support ER quota in 1.10 as well.

I think allowing device plugins to export device properties is an important feature, especially when we introduce ResourceClass, but I don't think it is a beta blocker.

For "Add annotations to a container for beta and add an option to CRI for GA to support OCI hooks", this seems outside the scope of device plugin. There are different ways to implement them (if they have to be implemented) on Kubernetes. We haven't seen strong use cases indicating that they have to go through device plugin.

RenaudWasTaken · 2018-01-11T00:22:55Z

For "Add annotations to a container for beta and add an option to CRI for GA to support OCI hooks", this seems outside the scope of device plugin.

Sorry, I meant from the device plugin.
i.e: the device plugin's AllocateResponse returns container annotations for beta

Product wise this would allow us to have the nvidia-runtime be supported by CRI-O (which allows hooks through container annotations) as well as docker + nvidia as the default runtime.
BTW That would allow a convergence towards a single device plugin in beta.

As for GA this would be a different discussion but it would be part of converging towards using any runtime and have them support the nvidia-hooks.

RenaudWasTaken · 2018-01-11T22:09:13Z

I think allowing device plugins to export device properties is an important feature, especially when we introduce ResourceClass, but I don't think it is a beta blocker.

Not saying it's a beta blocker but it would be useful to have labels on homogeneous nodes about what devices are present. Especially:

GPU memory
GPU compute capability
...

It's also not a big change, it just requires consensus

@vishh

Automatic merge from submit-queue (batch tested with PRs 58184, 59307, 58172). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. Add annotations to the device plugin API **What this PR does / why we need it**: **Which issue(s) this PR fixes** : Related to #56649 but does not fix it This adds the ability for the device plugins to annotate containers. Product wise, this allows the NVIDIA device plugin to support CRI-O (which allows hooks through container annotations). **Special notes for your reviewer**: /area hw-accelerators /cc @vishh @jiayingz @vikaschoudhary16 I'm wondering if it would make sense to fire a blank call to `newContainerAnnotations` at the start of the deviceplugin to get Annotations that are forbidden. Current behavior is that any Annotations that conflicts with Kubelet will be overwritten by Kubelet. **Release note**: ```release-note NONE ```

jiayingz · 2018-02-21T23:41:22Z

As a heads up, I am going to send a PR soon to change DevicePlugins feature to beta.

After introducing the feature in 1.8, we have seen good development work on improving the feature and received many valuable feedbacks from early adopters on how we may extend the framework to support various kinds of resources on Kubernetes. To us, entering beta means:
a) We think we have done enough early experiments during alpha phase to get the confidence that this a useful feature that we are willing to carry on to beta and to GA.
b) We think the current DevicePlugins is ready to be tried on production. It has good e2e and e2e_node test coverage. Some early adopters have tried it on production and didn't report any major breakage.

Very likely we will keep seeing new feature requests coming in after beta, as more people try to use DevicePlugins for their specific use cases. Some of these features may even require incompatible API changes. When this happens, the required complexity to support multiple device plugin API versions on Kubelet side should be manageable and mostly local to the devicemanager component. It should also be manageable for a device plugin to support multiple API versions. As an example, in GoogleCloudPlatform/container-engine-accelerators#55, I extended gke gpu device plugin to support both v1alpha and v1beta1 versions.

For reference, here is list of changes we made on device plugin during 1.10:

Support resource quota on extended resources (Issue 46639). Owner: @lichuqiang
Added PreStartContainer grpc function to allow device plugins to perform per container start operations. Owner: @vikaschoudhary16
Add annotations to the device plugin API. Owner: @RenaudWasTaken
Track unhealthy devices in devicemanager, and have node status capacity to reflect raw capacity but allocatable to reflect healthy device capacity. Owner: @vikaschoudhary16
Stub device plugin based e2e_node test Owner: @ScorpioCPH

There are a few pending PRs that would be nice to get into 1.10 but we don't consider as beta blocker:

Determine and implement a general Kubelet plugin communication establish model and use it in device plugin. Owner: @vikaschoudhary16
Uses common checkpointing mechanism in device plugin (PR 56040 Issue 56159) Owner: @vikaschoudhary16
Device plugin stress test. Owner: @ScorpioCPH

Please let us know if anyone has objections on entering the feature to beta in 1.10. Thanks!

ScorpioCPH · 2018-02-22T02:32:22Z

@jiayingz Hi, for stress test, here is a PR #59024 :)

jiayingz · 2018-02-22T06:23:14Z

/cc @vishh @dchen1107 @RenaudWasTaken @vikaschoudhary16 @mindprince @derekwaynecarr @ConnorDoyle @jeremyeder @cmluciano @sameo @ScorpioCPH @lichuqiang @kad @fabiand

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. DevicePlugins feature is beta in 1.10 release **What this PR does / why we need it**: Graduates DevicePlugins feature to beta. **Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*: Fixes #56649 **Special notes for your reviewer**: **Release note**: ```release-note DevicePlugins feature graduates to beta. ```

ScorpioCPH · 2018-02-27T05:59:59Z

@jiayingz Should we reopen this tracing issue?

jiayingz · 2018-02-27T06:42:03Z

I think this issue can be closed with #60170. We can open separate issues to track future device plugin changes.

k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 30, 2017

k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 30, 2017

k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Nov 30, 2017

k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 30, 2017

RenaudWasTaken mentioned this issue Jan 11, 2018

Add annotations to the device plugin API #58172

Merged

vikaschoudhary16 mentioned this issue Jan 17, 2018

Extend Device Plugin api to support resource attributes #58388

Closed

k8s-ci-robot added the area/hw-accelerators label Feb 12, 2018

jiayingz mentioned this issue Feb 21, 2018

DevicePlugins feature is beta in 1.10 release #60170

Merged

k8s-github-robot closed this as completed in #60170 Feb 25, 2018

swatisehgal mentioned this issue Oct 2, 2022

Add Device Manager to kubelet kubernetes/enhancements#3573

Closed

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Device Plugin Beta work tracker #56649

Device Plugin Beta work tracker #56649

jiayingz commented Nov 30, 2017

jiayingz commented Nov 30, 2017

lichuqiang commented Dec 1, 2017 •

edited

ScorpioCPH commented Dec 1, 2017

cmluciano commented Dec 8, 2017

jiayingz commented Dec 8, 2017

RenaudWasTaken commented Jan 10, 2018 •

edited

jiayingz commented Jan 11, 2018

RenaudWasTaken commented Jan 11, 2018

RenaudWasTaken commented Jan 11, 2018

jiayingz commented Feb 21, 2018 •

edited

ScorpioCPH commented Feb 22, 2018

jiayingz commented Feb 22, 2018

ScorpioCPH commented Feb 27, 2018

jiayingz commented Feb 27, 2018

Device Plugin Beta work tracker #56649

Device Plugin Beta work tracker #56649

Comments

jiayingz commented Nov 30, 2017

jiayingz commented Nov 30, 2017

lichuqiang commented Dec 1, 2017 • edited

ScorpioCPH commented Dec 1, 2017

cmluciano commented Dec 8, 2017

jiayingz commented Dec 8, 2017

RenaudWasTaken commented Jan 10, 2018 • edited

jiayingz commented Jan 11, 2018

RenaudWasTaken commented Jan 11, 2018

RenaudWasTaken commented Jan 11, 2018

jiayingz commented Feb 21, 2018 • edited

ScorpioCPH commented Feb 22, 2018

jiayingz commented Feb 22, 2018

ScorpioCPH commented Feb 27, 2018

jiayingz commented Feb 27, 2018

lichuqiang commented Dec 1, 2017 •

edited

RenaudWasTaken commented Jan 10, 2018 •

edited

jiayingz commented Feb 21, 2018 •

edited