Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add transformation_success_total and transformation_last_status metrics. #70715

Merged
merged 1 commit into from May 28, 2019

Conversation

immutableT
Copy link
Contributor

@immutableT immutableT commented Nov 6, 2018

Add transformation_success_total and transformation_last_status metrics to apiserver/pkg/storage/value/metrics.go

What type of PR is this?
/kind feature

What this PR does / why we need it:
When managing KMS Encryption feature within a sizable deployment, answering the following monitoring/health questions becomes important:

  1. How many requests is my KMS-Plugin handling?
  2. What is the ratio between successful and failed KMS requests?
  3. Show me all clusters where the last request to KMS-Plugin failed.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

The transformer_failures_total metric is deprecated in favor of transformation_operation_total. The old metric will continue to be populated but will be removed in a future release.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 6, 2018
@k8s-ci-robot k8s-ci-robot added area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 6, 2018
@immutableT
Copy link
Contributor Author

/assign @lavalamp

@immutableT
Copy link
Contributor Author

/assign @liggitt

@@ -41,6 +41,23 @@ var (
},
[]string{"transformation_type"},
)
transformerSuccessTotal = prometheus.NewCounterVec(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three metrics:

  1. total transformation counter
  2. a counter of either successes or failures
  3. some measure of latency, not sure the best way to format it off the top of my head.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be missing something in your comment.
Including the changes in this PR, we will have the following metrics:

transformation_success_total (this PR)
transformation_failures_total
transformation_latencies_microseconds
transformation_last_status (this PR)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I failed to expand the github diff. success_total + failures_total + latencies are sufficient. I'd slightly prefer if you added a total usage counter rather than a success counter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in retrospect, I would have define a single counter - operations_total with two labels (success and failure).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in retrospect, I would have define a single counter - operations_total with two labels (success and failure).

It's not too late. If we're adding a metric here, let's add the one we want.

The current metric doesn't provide info about which resource type is being transformed (the feature is not limited just to Secret resources). Is that important to include?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added transformation_operations_total with two labels: transformation_type and status.
I left transformation_failures_total, despite its being redundant - it may be already used.

With respect to including resource as label; how would we do it? Today, transformers, are not aware of what resource is being transformed. We could:

  1. Infer the resource type from value.Context - it contains etcd key, which in turn contains the resource type (ex. /registry/secrets/default/dev-db-secret-0166 - second part is our resource type).
  2. Pass resource type into TransformTo/FromStorage from the store.go.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 would be my preference, but I'm ok with deferring that until there is demand for it

@timothysc timothysc removed their request for review November 7, 2018 16:18
@immutableT
Copy link
Contributor Author

/ok-to-test

@k8s-ci-robot
Copy link
Contributor

@immutableT: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@immutableT
Copy link
Contributor Author

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 7, 2018
@immutableT
Copy link
Contributor Author

@lavalamp PTAL.
I removed the transformer_last_status metric.

},
[]string{"transformation_type", "status"},
)
// Deprecated, use transformerOperationsTotal instead.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note the deprecation in the help text so callers are aware?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed the pattern we used in k8s.io/client-go/util/workqueue/metrics.go, and added deprecated prefix to the variable.
Also added deprecation notice to the help text.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the deprecation notice be added to line 59 then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can go either way.
Tools like Intellij/Goland provide visual indication of the deprecation when the notice is placed at the variable declaration level.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but if it is added to the help it will be output in '/metrics' endpoint which is kinda too nice, imo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.
Done.

@liggitt
Copy link
Member

liggitt commented Jan 2, 2019

add deprecation to release note

@immutableT
Copy link
Contributor Author

I assume that deprecation notice is added here as a separate PR:
https://github.com/kubernetes/website/blob/master/content/en/docs/setup/release/notes.md
after this PR is merged, correct?

@immutableT
Copy link
Contributor Author

/retest

@liggitt
Copy link
Member

liggitt commented Jan 16, 2019

I assume that deprecation notice is added here as a separate PR:
https://github.com/kubernetes/website/blob/master/content/en/docs/setup/release/notes.md
after this PR is merged, correct?

it goes in the PR description inside the ```release-note block

squash, then lgtm

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Jan 16, 2019
prometheus.CounterOpts{
Namespace: namespace,
Subsystem: subsystem,
Name: "transformation_failures_total",
Help: "Total number of failed transformation operations.",
Help: "Deprecated, use transformerOperationsTotal instead. Total number of failed transformation operations.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deprecated description could follow this KEP , for example:

Help: "(Deprecated) Cumulative number of Docker operation timeout by operation type.",

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What @danielqsj commented is the correct deprecation procedure. We should do it consistently in this PR.

@brancz
Copy link
Member

brancz commented Mar 11, 2019

besides some consistency on the deprecation process, this looks good from instrumentation side

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 6, 2019
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 14, 2019
@immutableT
Copy link
Contributor Author

@awly PTAL

@brancz
Copy link
Member

brancz commented May 15, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 15, 2019
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 15, 2019
@@ -20,6 +20,10 @@ import (
"sync"
"time"

"google.golang.org/grpc/status"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove empty line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@@ -106,23 +121,37 @@ func RegisterMetrics() {
registerMetrics.Do(func() {
prometheus.MustRegister(transformerLatencies)
prometheus.MustRegister(deprecatedTransformerLatencies)
prometheus.MustRegister(transformerFailuresTotal)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove all the newlines here too

deprecatedTransformerLatencies.WithLabelValues(transformationType).Observe(sinceInMicroseconds(start))
default:
deprecatedTransformerFailuresTotal.WithLabelValues(transformationType).Inc()
st, ok := status.FromError(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could just do

transformerOperationsTotal.WithLabelValues(transformationType, status.Code(err)).Inc()

outside the switch. It'll handle nil errors too.

@@ -0,0 +1,97 @@
/*
Copyright 2017 The Kubernetes Authors.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2019

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still see 2017 here

"apiserver_storage_transformation_operations_total",
"apiserver_storage_transformation_failures_total",
},
want: `
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be latencies? Or will testutil.GatherAndCompare do a substring match?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it testuilt does will only compare the metrics explicitly requested.
Not sure if I can reliably test latencies using method, so leaving them out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks for clarifying

@immutableT
Copy link
Contributor Author

/test pull-kubernetes-e2e-gce-100-performance

@@ -20,6 +20,8 @@ import (
"sync"
"time"

"google.golang.org/grpc/status"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@awly PTAL

@@ -20,6 +20,10 @@ import (
"sync"
"time"

"google.golang.org/grpc/status"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@@ -0,0 +1,97 @@
/*
Copyright 2017 The Kubernetes Authors.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

"apiserver_storage_transformation_operations_total",
"apiserver_storage_transformation_failures_total",
},
want: `
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it testuilt does will only compare the metrics explicitly requested.
Not sure if I can reliably test latencies using method, so leaving them out.

@immutableT
Copy link
Contributor Author

/test pull-kubernetes-node-e2e

@zouyee
Copy link
Member

zouyee commented May 16, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 16, 2019
if err != nil {
transformerFailuresTotal.WithLabelValues(transformationType).Inc()
return
transformerOperationsTotal.WithLabelValues(transformationType, status.Code(err).String()).Inc()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the cost of status.Code(err).String(), and are the possible values bounded?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The possible values are bound to these.
I would classify the cost as negligible, see here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liggitt PTAL

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liggitt
I ran benchmarks against the RecordTransformation function and it completes within approximately 420 ns for a generic error and in 540 ns for a status error.
With respect to the possible values, they are bound to 17 possible values, but in the context of KMS Plugin we would expect OK, Cancelled, Unknown, DeadlineExceeded, NotFound, PermissionsDenied, ResourceExhausted and FailedPrecondition.

@liggitt
Copy link
Member

liggitt commented May 28, 2019

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: immutableT, liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 28, 2019
@k8s-ci-robot k8s-ci-robot merged commit 3c3c1b1 into kubernetes:master May 28, 2019
@tpepper
Copy link
Member

tpepper commented May 30, 2019

@kubernetes/sig-api-machinery-pr-reviews @kubernetes/sig-instrumentation-pr-reviews can somebody comment on the validity of the cherry pick to 1.14 here? Ie: we don't cherry pick features, so is this mismarked, or the CP invalid?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants