Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow show hidden metrics in kube-apiserver #84292

Merged

Conversation

RainbowMango
Copy link
Member

@RainbowMango RainbowMango commented Oct 24, 2019

What type of PR is this?
/kind feature

What this PR does / why we need it:
Add a flag to show hidden metrics.

Refer to stability framework kep:

On a subsequent release (when the metric's deprecatedVersion is equal to current_kubernetes_version - 1)), a deprecated metric will become a hidden metric. Unlike their deprecated counterparts, hidden metrics will no longer be automatically registered to the metrics endpoint (hence hidden). However, they can be explicitly enabled through a command line flag on the binary (i.e. '--enable-hidden-metrics=really_deprecated_metric'). This is to provide cluster admins an escape hatch to properly migrate off of a deprecated metric, if they were not able to react to the earlier deprecation warnings. Hidden metrics should be deleted after one release.

Which issue(s) this PR fixes:

Special notes for your reviewer:
It's a lightweight implementation against the KEP.

Does this PR introduce a user-facing change?:

New flag `--show-hidden-metrics-for-version` in kube-apiserver can be used to show all hidden metrics that deprecated in the previous minor release.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-instrumentation/20190404-kubernetes-control-plane-metrics-stability.md

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 24, 2019
@k8s-ci-robot k8s-ci-robot added area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 24, 2019
@fedebongio
Copy link
Contributor

/assign @logicalhan

Copy link
Member

@logicalhan logicalhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/cc @ehashman

This looks reasonable to me.

@lavalamp, does this flag look okay to you?

@ehashman
Copy link
Member

There's no tests included in this PR, we should really include a unit test at minimum.

I'm assuming this is sort of a follow-up to #81970.

@logicalhan we previously discussed that we wanted this to be a feature flag rather than a CLI option as adding a CLI option removes the possibility of removing it/changing schema in the future, and the KEP specifies that we should be able to list specific hidden metrics to enable/disable (rather than a blanket disable for all of them). I think for now it would make more sense to go with the feature flag approach as I previously looked at?

@ehashman
Copy link
Member

/sig instrumentation

@k8s-ci-robot k8s-ci-robot added the sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. label Oct 24, 2019
@@ -153,6 +154,9 @@ func (s *ServerRunOptions) Flags() (fss cliflag.NamedFlagSets) {
fs.BoolVar(&s.AllowPrivileged, "allow-privileged", s.AllowPrivileged,
"If true, allow privileged containers. [default=false]")

fs.BoolVar(&s.ShowHiddenMetrics, "show-hidden-metrics", s.ShowHiddenMetrics,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is to prevent an admin from:

  1. in version X, turn this on to get back deprecated metric M
  2. in version Y, not turn this off, and therefore fail to notice metric N is being deprecated
  3. in version Z, metric N is removed with (effectively) no warning.

To prevent this scenario, I suggest forcing the user to specify the version for which they want to show metrics as the value of this flag rather than "true".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the suggestion, it seems reasonable to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good idea (and I also wouldn't mind getting rid of the KEP specification to be able to turn on/off individual metrics and just make it binary-wide for hidden) so perhaps we should update the KEP to include this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I think we should be able to turning off individual metrics, since we have had memory leak issues in the past. These don't tend to manifest except under certain cardinality conditions, so it's possible that metrics like that will get sneak by and get released at some point. I'm thinking of that as a beta/GA feature though. Always good to have that safety net of being able to turn off a memory leak without having to patch and release kubernetes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the suggestion, it seems reasonable to me.

+1
It's a really good point!

@logicalhan
Copy link
Member

@logicalhan we previously discussed that we wanted this to be a feature flag rather than a CLI option as adding a CLI option removes the possibility of removing it/changing schema in the future, and the KEP specifies that we should be able to list specific hidden metrics to enable/disable (rather than a blanket disable for all of them). I think for now it would make more sense to go with the feature flag approach as I previously looked at?

I was thinking maybe we should just separate the functionality. A generic toggle for deprecated/hidden metrics but we should still eventually have the ability to disable metric[s] on demand. I want the safety net to be able to get rid of a metric memory leak without having to release a new version of kubernetes.

@ehashman
Copy link
Member

So from this discussion I'm hearing we should specify two things:

  • what baseline version we want base unhiding metrics against
  • what specific metrics we want to unhide

Sounds like we now need two CLI flags... can we have a quick design discussion somewhere, maybe on the KEP, before moving forward with this?

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 24, 2019
@RainbowMango
Copy link
Member Author

@ehashman

what specific metrics we want to unhide

If I understand correctly, here @logicalhan want to hide specific metrics, not unhide. :)

I doubt if it is convincing enough to add a flag(hide metrics) just to prevent memory leak?

@RainbowMango
Copy link
Member Author

From this discussion, personally I think we should provide a feature gate as in #81970 which can provide stability guarantees for the flag(s).

e.g

  • in 1.17 flag(alpha) feature gate(default off)
  • in 1.18 flag(beta) feature gate(default on)
  • in 1.19 flag(stable) feature gate(removed)

@logicalhan
Copy link
Member

From this discussion, personally I think we should provide a feature gate as in #81970 which can provide stability guarantees for the flag(s).

e.g

  • in 1.17 flag(alpha) feature gate(default off)
  • in 1.18 flag(beta) feature gate(default on)
  • in 1.19 flag(stable) feature gate(removed)

This is basically not at all the conclusion I am coming to. How are you arriving at feature gates from this conversation?

@logicalhan
Copy link
Member

I doubt if it is convincing enough to add a flag(hide metrics) just to prevent memory leak?

We have already had to do this: #74636. We should not require a new release of kubernetes in order to turn off a bad metric.

@RainbowMango
Copy link
Member Author

How are you arriving at feature gates from this conversation?

I like @lavalamp 's idea. I am a little worried we will lose the possibility of improving this flag when we get another good idea.

Am I over-cautious?

@logicalhan
Copy link
Member

@lavalamp’s idea (as I understand it from talking to him about it today) is that the flag takes in a string parameter, a version for which we will show what would otherwise be hidden metrics. This eliminates the danger of what happens when someone forgets to untoggle a flag after migrating a set of deprecated metrics.

Being able to hide metrics is something which we can address orthogonally, using another flag.

@RainbowMango
Copy link
Member Author

@lavalamp’s idea (as I understand it from talking to him about it today) is that the flag takes in a string parameter, a version for which we will show what would otherwise be hidden metrics. This eliminates the danger of what happens when someone forgets to untoggle a flag after migrating a set of deprecated metrics.

Being able to hide metrics is something which we can address orthogonally, using another flag.

OK. Copy that. as well as

We have already had to do this: #74636. We should not require a new release of kubernetes in order to turn off a bad metric.

I will push a change according @lavalamp's comments.

@RainbowMango
Copy link
Member Author

Note that we will also need this flag implemented in the rest of the kube binaries, not just the apiserver (e.g. kubelet, scheduler, etc.)

Yeah, I know that. I intended to deal with them after this PR. Because they will share the validation logic from the component base.

// TODO(RainbowMango): move it to genericoptions before next flag comes.
mfs := fss.FlagSet("metrics")
mfs.StringVar(&s.ShowHiddenMetricsForVersion, "show-hidden-metrics-for-version", s.ShowHiddenMetricsForVersion,
"The previous version(x.y) for which you want to show hidden metrics. Only the previous minor version is allowed.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this version? Which formats are accepted? What does the last sentence mean?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not --show-hidden-metrics without any argument?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This kind of flag is incompatible by design to the previous version on each version change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not --show-hidden-metrics without any argument?

See the discussion above:
#84292 (comment)

Also updated to KEP:
why not --show-hidden-metrics

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, got that.

Another point: I have not read the KEP. I have no clue what this flag does. What does "show" mean? What does "hidden" mean? This must be super clear from the flag description.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was the question I asked here on the KEP update: kubernetes/enhancements#1358 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why didn't we stick with the original plan of allowing enabling a specific deprecated metric?

Because we want to reserve the ability to disable a metric (regardless of whether it is stable or not) at application boot and that functionality is orthogonal to metrics being auto-disabled by metrics stability framework.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original plan dictates providing a mechanism for turning off individual metrics via the command line.

..all metrics should be able to be individually disabled by the cluster administrator, regardless of stability class. By default, all non-deprecated metrics will be automatically registered to the metrics endpoint unless explicitly blacklisted via a command line flag (i.e. '--disable-metrics=somebrokenmetric,anothermetric')

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest: The previous version for which you want to show hidden metrics. Only the previous minor version is meaningful, other values will be ignored. The format is <major>.<minor>, e.g.: 1.16

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the discussion, I think it makes sense to add a comment here explaining the purpose of this (to avoid giving the admin a surprise at v N+2). This could also be said in the help text, e.g.: The purpose of this format is make sure you have the opportunity to notice if the next release hides additional metrics, rather than being surprised when they are permanently removed in the release after that.

@RainbowMango RainbowMango force-pushed the pr_add_metrics_flag_to_apiserver branch from 22dd973 to 1290698 Compare November 12, 2019 03:16
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 12, 2019
@RainbowMango
Copy link
Member Author

I suggest: The previous version for which you want to show hidden metrics. Only the previous minor version is meaningful, other values will be ignored. The format is ., e.g.: 1.16

New push with a minor change: other values will be ignored --> other values will not be allowed according to the logic of validation.

@lavalamp
Copy link
Member

Does "not be allowed" mean apiserver won't start? I can't decide if that's good or not.

@liggitt
Copy link
Member

liggitt commented Nov 12, 2019

silently ignoring an explicit option they asked for seems not good, especially for something like this where that would mean alerting/monitoring configurations could be making incorrect decisions based on absent metrics


targetVersion, err := semver.Parse(pathVersion)
if err != nil {
return fmt.Errorf("specified --show-hidden-metrics-for-version '%v' not follows the version format x.y", targetVersionStr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/not follows/does not follow/

return fmt.Errorf("specified --show-hidden-metrics-for-version '%v' not follows the version format x.y", targetVersionStr)
}

maxAllowedVersion, err := semver.Make(fmt.Sprintf("%d.%d.0", currentVersion.Major, currentVersion.Minor))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you don't need any of this semver code. Only the previous minor version is allowed, so you can print it to a string and compare directly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that can handle the above error condition, too. The error message can say, --show-hidden-metrics-for-version must be omitted or have the value "%v". Only the previous minor version is allowed.

@lavalamp
Copy link
Member

OK, if @liggitt agrees then I'm sold, but that means the error handling code can be drastically simplified.

@RainbowMango RainbowMango force-pushed the pr_add_metrics_flag_to_apiserver branch from 1290698 to b2fbdee Compare November 13, 2019 02:33
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 13, 2019
@RainbowMango
Copy link
Member Author

OK, if @liggitt agrees then I'm sold, but that means the error handling code can be drastically simplified.

Thanks, much clear now.

@brancz
Copy link
Member

brancz commented Nov 13, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 13, 2019
return nil
}

validVersionStr := fmt.Sprintf("%d.%d", currentVersion.Major, currentVersion.Minor-1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we're ever going to have a v2, but if we did, it would print 2.-1 which is not right :)

Feel free to fix in a follow-up if you want.

@lavalamp
Copy link
Member

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: brancz, lavalamp, logicalhan, RainbowMango

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 13, 2019
@k8s-ci-robot k8s-ci-robot merged commit 21df24c into kubernetes:master Nov 13, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Nov 13, 2019
@RainbowMango RainbowMango mentioned this pull request Nov 14, 2019
5 tasks
@RainbowMango RainbowMango deleted the pr_add_metrics_flag_to_apiserver branch May 30, 2020 09:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants