Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable AcceleratorUsage Metrics initial kep #1868

Conversation

RenaudWasTaken
Copy link

@RenaudWasTaken RenaudWasTaken commented Jun 18, 2020

What type of PR is this?
/kind feature

What this PR does / why we need it: Adds the "Disable AcceleratorUsage Metrics" KEP

TLDR: Add a FeatureGate to Disable the AcceleratorUsage Metrics collected by Kubelet.
Kubelet should no longer be collecting metrics from devices as the path decided by Sig-node is to use the pod resources API and have device vendors expose metrics through their own metrics container.

Some context that I left out of the KEP, because Kubelet has an open handle on the NVIDIA driver, this breaks any infrastructure interactions (e.g: Removing or Updating the driver) with the NVIDIA driver. In other words any actions related to the NVIDIA driver cannot be taken without killing the kubelet.

/cc @dashpole
/sig node
/stage alpha

Does this PR introduce a user-facing change?: Yes!
Relevant enhancement issue: #1867

Finally note that I'm still fairly new with the process and might have overlooked some fields or mis-answered them because of a lack of understanding :)

Signed-off-by: Renaud Gaubert rgaubert@nvidia.com

@k8s-ci-robot k8s-ci-robot added stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status kind/feature Categorizes issue or PR as related to a new feature. sig/node Categorizes an issue or PR as relevant to SIG Node. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 18, 2020
@RenaudWasTaken
Copy link
Author

/cc @dchen1107 @derekwaynecarr

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 18, 2020
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. labels Jun 18, 2020
@derekwaynecarr
Copy link
Member

/assign

### Graduation Criteria

#### Alpha Graduation
* Feature Flag is present.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The awkward part of this process is that enabling this feature flag is recommended by all deployers at this phase. I think we should document and promote immediate usage of this flag even if it’s alpha.

@brancz has sig-monitoring encountered a similar scenario? Usage of these metrics actually impairs the ability to consume what is measured.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A situation exactly like this hasn't come up, but generally speaking, since we have the metrics stability framework in place and all these metrics are alpha metrics, theoretically by the metrics stability framework you wouldn't even have to have a feature flag, but you could just remove them.

Copy link
Member

@derekwaynecarr derekwaynecarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will let @dashpole review and remove the hold, but I agree with this change and only wish we work hard to promote enabling the alpha feature gate as a best practice given how it impacts usage of the device it monitors. I cannot think of a similar situation like this prior.

/approve
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 19, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, RenaudWasTaken

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 19, 2020
Copy link
Member

@dchen1107 dchen1107 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, but will leave @dashpole to take a look.

@RenaudWasTaken RenaudWasTaken force-pushed the 1867-disable-accelerator-usage-metrics branch from 0009496 to 21dc414 Compare June 19, 2020 02:45

#### Alpha -> Beta Graduation

* Sufficient heads up has been given (1 year) to users.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, we can move to Beta after a single release, but should give a long-ish period before moving to GA. Most users won't notice this until it is enabled by default, and thus if we want to give users time to adapt and migrate to the daemonset, it would be between Beta and GA, where they can explicitly opt-out of the deprecation.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of doing that! Is that something others would be in favor of :) ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping @derekwaynecarr @dchen1107

Thanks !

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@RenaudWasTaken
Copy link
Author

The PR is ready to go, @dashpole should we figure out the beta graduation strategy in the next release?
This would allow us to:

  • Merge the KEP
  • File for the exception process
  • Merge the Feature PR

What do you think?

Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>
@RenaudWasTaken RenaudWasTaken force-pushed the 1867-disable-accelerator-usage-metrics branch from 21dc414 to 6b4b9d0 Compare June 30, 2020 21:37
@dashpole
Copy link
Contributor

dashpole commented Jul 6, 2020

sounds great!
/lgtm
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 6, 2020
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants