Device Plugin Beta work tracker #56649
Comments
/sig node |
I'm a fan of the probe model. I feel it is unfriendly to plugin developers to make them handle re-registration after a kubelet restart, and I have already received a complaint about it from my colleague. |
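To illustrate the burden described in the comment above, here is a minimal Go sketch of the bookkeeping a plugin needs when re-registration is its own responsibility. It assumes the plugin can detect a kubelet restart by noticing the kubelet socket disappear and reappear; the `registrar` type, the `observe` method, and the `register` callback are hypothetical names for illustration and are not part of the device plugin API.

```go
package main

import (
	"fmt"
	"os"
)

// registrar tracks whether this plugin believes it is registered.
// register is a hypothetical callback standing in for the real Register call.
type registrar struct {
	registered bool
	register   func() error
}

// observe is called on every poll with whether the kubelet socket exists.
// A missing socket clears the registered flag; a present socket triggers
// registration only if the plugin is not currently registered.
func (r *registrar) observe(socketPresent bool) error {
	if !socketPresent {
		r.registered = false
		return nil
	}
	if r.registered {
		return nil
	}
	if err := r.register(); err != nil {
		return err
	}
	r.registered = true
	return nil
}

// socketPresent reports whether a file exists at path (assumption: the
// caller passes the kubelet's registration socket path).
func socketPresent(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	calls := 0
	r := &registrar{register: func() error { calls++; return nil }}

	// Simulated polls: socket up, still up, gone (restart), back again.
	for _, present := range []bool{true, true, false, true} {
		if err := r.observe(present); err != nil {
			fmt.Println("register failed:", err)
		}
	}
	fmt.Println("register calls:", calls)
}
```

A restart that completes between two polls would go unnoticed by this sketch, which is part of why a probe model driven by the kubelet is attractive.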
Hi, for cluster reliability, I think we need more test cases, including scale and soak tests, to simulate more production scenarios. |
can we consolidate this with #53497 ? |
Sure. Will do this with Vish after KubeCon. |
Sorry to bring up this topic a bit late (feature freeze is the 22nd of Jan).
Things that are not explicitly related to the device plugin: We're also wondering if we should tie up extended resources quotas |
Thanks a lot for your input, @RenaudWasTaken. For per-container allocation, I think we should be able to implement it in 1.10, and I think we can support ER quota in 1.10 as well. Allowing device plugins to export device properties is an important feature, especially when we introduce ResourceClass, but I don't think it is a beta blocker. "Add annotations to a container for beta and add an option to CRI for GA to support OCI hooks" seems outside the scope of the device plugin. There are different ways to implement these (if they have to be implemented) in Kubernetes, and we haven't seen strong use cases indicating that they have to go through the device plugin. |
Sorry, I meant from the device plugin. Product wise this would allow us to have the nvidia-runtime be supported by CRI-O (which allows hooks through container annotations) as well as docker + nvidia as the default runtime. As for GA this would be a different discussion but it would be part of converging towards using any runtime and have them support the nvidia-hooks. |
Not saying it's a beta blocker but it would be useful to have labels on homogeneous nodes about what devices are present. Especially:
It's also not a big change, it just requires consensus |
Automatic merge from submit-queue (batch tested with PRs 58184, 59307, 58172). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. Add annotations to the device plugin API **What this PR does / why we need it**: **Which issue(s) this PR fixes**: Related to #56649 but does not fix it. This adds the ability for device plugins to annotate containers. Product-wise, this allows the NVIDIA device plugin to support CRI-O (which allows hooks through container annotations). **Special notes for your reviewer**: /area hw-accelerators /cc @vishh @jiayingz @vikaschoudhary16 I'm wondering if it would make sense to fire a blank call to `newContainerAnnotations` at the start of the device plugin to get the annotations that are forbidden. The current behavior is that any annotation that conflicts with one set by the Kubelet is overwritten by the Kubelet. **Release note**: ```release-note NONE ```
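The overwrite behavior mentioned in the PR description above can be pictured with a small Go sketch. This is not the Kubelet's actual code: `mergeAnnotations` and `conflictingKeys` are hypothetical helpers, and the `example.com/...` annotation keys are made up for illustration. The sketch only assumes what the description states, namely that on a key conflict the Kubelet's value wins.

```go
package main

import (
	"fmt"
	"sort"
)

// mergeAnnotations returns a new map holding the plugin's annotations with
// the kubelet's annotations applied on top, so the kubelet wins on conflict.
func mergeAnnotations(plugin, kubelet map[string]string) map[string]string {
	merged := make(map[string]string, len(plugin)+len(kubelet))
	for k, v := range plugin {
		merged[k] = v
	}
	for k, v := range kubelet {
		merged[k] = v
	}
	return merged
}

// conflictingKeys lists, in sorted order, the plugin keys that the kubelet
// also sets. A plugin could use such a list to learn which keys to avoid.
func conflictingKeys(plugin, kubelet map[string]string) []string {
	var keys []string
	for k := range plugin {
		if _, ok := kubelet[k]; ok {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	return keys
}

func main() {
	plugin := map[string]string{
		"example.com/hook":  "gpu",
		"example.com/owner": "plugin",
	}
	kubelet := map[string]string{"example.com/owner": "kubelet"}

	merged := mergeAnnotations(plugin, kubelet)
	fmt.Println(merged["example.com/hook"], merged["example.com/owner"])
	fmt.Println(conflictingKeys(plugin, kubelet))
}
```

Reporting the conflicting keys back to the plugin is one way to realize the "blank call" idea raised in the description, since the plugin would learn up front which annotations are silently replaced.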
As a heads up, I am going to send a PR soon to change the DevicePlugins feature to beta. After introducing the feature in 1.8, we have seen good development work on improving the feature and received much valuable feedback from early adopters on how we may extend the framework to support various kinds of resources on Kubernetes. To us, entering beta means: Very likely we will keep seeing new feature requests coming in after beta, as more people try to use DevicePlugins for their specific use cases. Some of these features may even require incompatible API changes. When this happens, the complexity required to support multiple device plugin API versions on the Kubelet side should be manageable and mostly local to the devicemanager component. It should also be manageable for a device plugin to support multiple API versions. As an example, in GoogleCloudPlatform/container-engine-accelerators#55, I extended the GKE GPU device plugin to support both the v1alpha and v1beta1 versions. For reference, here is the list of changes we made to the device plugin during 1.10:
There are a few pending PRs that would be nice to get into 1.10 but that we don't consider beta blockers:
Please let us know if anyone has objections to moving the feature to beta in 1.10. Thanks! |
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>. DevicePlugins feature is beta in 1.10 release **What this PR does / why we need it**: Graduates DevicePlugins feature to beta. **Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*: Fixes #56649 **Special notes for your reviewer**: **Release note**: ```release-note DevicePlugins feature graduates to beta. ```
@jiayingz Should we reopen this tracking issue? |
I think this issue can be closed with #60170. We can open separate issues to track future device plugin changes. |
/kind feature
What happened:
The DevicePlugins feature was introduced in the Kubernetes 1.8 release as an alpha feature.
Since then, we have seen good development work on stabilizing the feature, as well as early adopters using the framework to support device resources in Kubernetes.
This issue tracks the work required to move this feature to beta.
FYI, here is the list of requirements from the OSS Kubernetes feature stage doc: https://github.com/kubernetes/community/blob/master/contributors/devel/api_changes.md#alpha-beta-and-stable-versions
"""
Beta level:
Object Versioning: API version name contains beta (e.g. v2beta3)
Availability: in official Kubernetes releases, and enabled by default
Audience: users interested in providing feedback on features
Completeness: all API operations, CLI commands, and UI support should be implemented; end-to-end tests complete; the API has had a thorough API review and is thought to be complete, though use during beta may frequently turn up API issues not thought of during review
Upgradeability: the object schema and semantics may change in a later software release; when this happens, an upgrade path will be documented; in some cases, objects will be automatically converted to the new version; in other cases, a manual upgrade may be necessary; a manual upgrade may require downtime for anything relying on the new feature, and may require manual conversion of objects to the new version; when manual conversion is necessary, the project will provide documentation on the process
Cluster Reliability: since the feature has e2e tests, enabling the feature via a flag should not create new bugs in unrelated features; because the feature is new, it may have minor bugs
Support: the project commits to complete the feature, in some form, in a subsequent Stable version; typically this will happen within 3 months, but sometimes longer; releases should simultaneously support two consecutive versions (e.g. v1beta1 and v1beta2; or v1beta2 and v1) for at least one minor release cycle (typically 3 months) so that users have enough time to upgrade and migrate objects
Recommended Use Cases: in short-lived testing clusters; in production clusters as part of a short-lived evaluation of the feature in order to provide feedback
"""
I think we have met most of the requirements here, with the commitment that if we need to make an incompatible API change in the future, we will document the upgrade path and, if necessary, support multiple API versions for a couple of releases.
Things that I think would be nice to have before entering beta (entering beta does not necessarily depend on their completion)
We would like to target this FR to 1.10. Please let us know your thoughts or add things that you think are critical to finish. Thanks!