Add DisableAcceleratorUsageMetrics Feature Gate #91930

Merged

Conversation

RenaudWasTaken
Contributor

@RenaudWasTaken RenaudWasTaken commented Jun 9, 2020

Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>

What type of PR is this?
/kind feature

What this PR does / why we need it: Adds the "DisableAcceleratorUsageMetrics" Feature Gate.

TL;DR: Kubelet collects GPU metrics whenever it detects that the NVIDIA driver is present on the node.

  1. Kubelet should no longer collect metrics from devices. The expected path is for device vendors to use the pod resources API and expose metrics through their own metrics container. The in-Kubelet path exists only for legacy reasons and we are deprecating it.

  2. Furthermore, because Kubelet holds an open handle on the NVIDIA driver, infrastructure operations on the driver (e.g., removing or updating it) break. In other words, no action on the NVIDIA driver can be taken without killing the kubelet.

See google/cadvisor#2574 for more details

Special notes for your reviewer:

This change does not remove the handle cadvisor has on the NVIDIA driver. To do so, we also need to wait for cadvisor to cut a new release and then re-vendor it, so that google/cadvisor#2574 is included.

/cc @dashpole

Does this PR introduce a user-facing change?:

Adds the ability to disable Accelerator/GPU metrics collected by Kubelet

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
None. Does this require a KEP / Doc / ... ?

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 9, 2020
@RenaudWasTaken
Contributor Author

/sig-node

@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 9, 2020
@fejta-bot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@RenaudWasTaken RenaudWasTaken force-pushed the DisableAcceleratorUsageMetrics branch 2 times, most recently from 93ecdea to fe462d5 Compare June 9, 2020 18:12
@RenaudWasTaken RenaudWasTaken force-pushed the DisableAcceleratorUsageMetrics branch 3 times, most recently from 0441f3b to b12cd14 Compare June 10, 2020 04:22
@RenaudWasTaken
Contributor Author

/test pull-kubernetes-e2e-kind-ipv6

@RenaudWasTaken
Contributor Author

/remove-kind api-change

@k8s-ci-robot k8s-ci-robot removed the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label Jun 10, 2020
@RenaudWasTaken
Contributor Author

/retest

@dchen1107
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 16, 2020
@RenaudWasTaken
Contributor Author

/kind api-change

@k8s-ci-robot k8s-ci-robot added the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label Jun 18, 2020
@derekwaynecarr
Member

/lgtm
/approve

will move back in milestone if sig-release approves.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 13, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, derekwaynecarr, RenaudWasTaken

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@derekwaynecarr derekwaynecarr added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Jul 13, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jul 13, 2020
Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 13, 2020
@derekwaynecarr
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 13, 2020
@RenaudWasTaken
Contributor Author

/retest

@dchen1107
Member

/lgtm again!

@palnabarun
Member

palnabarun commented Jul 21, 2020

Since the exception request was approved and the PR is already LGTM'ed and Approved by the owning SIG, adding the PR back into the v1.19 milestone.

/milestone v1.19

@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone Jul 21, 2020
@RenaudWasTaken
Contributor Author

/retest

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@nikhita
Member

nikhita commented Jul 22, 2020

/retest

@nikhita
Member

nikhita commented Jul 22, 2020 via email

@nikhita
Member

nikhita commented Jul 22, 2020

/retest

@nikhita
Member

nikhita commented Jul 22, 2020 via email

@k8s-ci-robot k8s-ci-robot merged commit ae7dce7 into kubernetes:master Jul 22, 2020
@palnabarun
Member

It merged! 🚀

@RenaudWasTaken
Contributor Author

Wooooooooooooooooooooo!

Thanks for running the retest command 😛!

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.