Add transformation_success_total and transformation_last_status metrics. #70715
/assign @lavalamp
/assign @liggitt
@@ -41,6 +41,23 @@ var (
},
[]string{"transformation_type"},
)
transformerSuccessTotal = prometheus.NewCounterVec(
Three metrics:
- total transformation counter
- a counter of either successes or failures
- some measure of latency, not sure the best way to format it off the top of my head.
I may be missing something in your comment.
Including the changes in this PR, we will have the following metrics:
transformation_success_total (this PR)
transformation_failures_total
transformation_latencies_microseconds
transformation_last_status (this PR)
Sorry, I failed to expand the GitHub diff. success_total + failures_total + latencies are sufficient. I'd slightly prefer it if you added a total usage counter rather than a success counter.
Yes, in retrospect, I would have defined a single counter - operations_total with two labels (success and failure).
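As a rough illustration of the single-counter idea, here is a stdlib-only stand-in for a labeled counter. It is a sketch, not the PR's code: the real metric would be a prometheus.NewCounterVec, and `labeledCounter`, `inc`, `get`, and the label values `to_storage`, `success`, and `failure` are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// labels identifies one time series of the counter.
type labels struct {
	transformationType string
	status             string
}

// labeledCounter is a hypothetical, stdlib-only stand-in for a
// Prometheus CounterVec keyed by two label values.
type labeledCounter struct {
	mu     sync.Mutex
	counts map[labels]int
}

func newLabeledCounter() *labeledCounter {
	return &labeledCounter{counts: make(map[labels]int)}
}

// inc adds one to the series selected by the two label values.
func (c *labeledCounter) inc(transformationType, status string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[labels{transformationType, status}]++
}

// get returns the current value of one series (0 if never incremented).
func (c *labeledCounter) get(transformationType, status string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[labels{transformationType, status}]
}

func main() {
	ops := newLabeledCounter()
	ops.inc("to_storage", "success")
	ops.inc("to_storage", "success")
	ops.inc("to_storage", "failure")

	// Failures and the overall total both come from the same counter.
	fmt.Println(ops.get("to_storage", "failure"))
	fmt.Println(ops.get("to_storage", "success") + ops.get("to_storage", "failure"))
}
```

With one counter, a separate failures counter becomes derivable data: it is just the series whose status value is not a success.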
> Yes, in retrospect, I would have defined a single counter - operations_total with two labels (success and failure).
It's not too late. If we're adding a metric here, let's add the one we want.
The current metric doesn't provide info about which resource type is being transformed (the feature is not limited just to Secret resources). Is that important to include?
Added transformation_operations_total with two labels: transformation_type and status.
I left transformation_failures_total in place, even though it is now redundant, because it may already be in use.
With respect to including the resource as a label: how would we do it? Today, transformers are not aware of which resource is being transformed. We could:
1. Infer the resource type from value.Context: it contains the etcd key, which in turn contains the resource type (e.g., in /registry/secrets/default/dev-db-secret-0166 the second segment is the resource type).
2. Pass the resource type into TransformTo/FromStorage from store.go.
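The first option above could look roughly like the following. This is only a sketch under the assumption that keys always have the shape /<prefix>/<resource>/... as in the example key; `resourceFromKey` is a hypothetical helper, not an existing apiserver function.

```go
package main

import (
	"fmt"
	"strings"
)

// resourceFromKey is a hypothetical helper: it returns the second path
// segment of an etcd key, e.g. "secrets" for
// "/registry/secrets/default/dev-db-secret-0166".
// It returns "" when the key has fewer than two segments.
func resourceFromKey(key string) string {
	parts := strings.Split(strings.TrimPrefix(key, "/"), "/")
	if len(parts) < 2 {
		return ""
	}
	return parts[1]
}

func main() {
	fmt.Println(resourceFromKey("/registry/secrets/default/dev-db-secret-0166"))
}
```

Real keys may not always follow this layout (for example, an extra group segment would shift the position), so this is a heuristic at best; passing the resource type explicitly avoids that guesswork.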
2 would be my preference, but I'm ok with deferring that until there is demand for it
/ok-to-test
@immutableT: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.
/ok-to-test
@lavalamp PTAL.
},
[]string{"transformation_type", "status"},
)
// Deprecated, use transformerOperationsTotal instead.
note the deprecation in the help text so callers are aware?
I followed the pattern we used in k8s.io/client-go/util/workqueue/metrics.go, and added deprecated prefix to the variable.
Also added deprecation notice to the help text.
Shouldn't the deprecation notice be added to line 59 then?
I can go either way.
Tools like Intellij/Goland provide visual indication of the deprecation when the notice is placed at the variable declaration level.
Yes, but if it is added to the help text it will be output at the '/metrics' endpoint, which is kinda nice too, imo.
Makes sense.
Done.
add deprecation to release note
I assume that deprecation notice is added here as a separate PR:
/retest
it goes in the PR description inside the ```release-note block
squash, then lgtm
prometheus.CounterOpts{
Namespace: namespace,
Subsystem: subsystem,
Name: "transformation_failures_total",
Help: "Total number of failed transformation operations.",
Help: "Deprecated, use transformerOperationsTotal instead. Total number of failed transformation operations.",
The deprecated description could follow this KEP, for example:
Help: "(Deprecated) Cumulative number of Docker operation timeout by operation type.",
What @danielqsj commented is the correct deprecation procedure. We should do it consistently in this PR.
besides some consistency in the deprecation process, this looks good from the instrumentation side
190c9e3 to 1067d76
@awly PTAL
9fc8216 to 6411222
/lgtm
6411222 to 98b1d43
@@ -20,6 +20,10 @@ import (
"sync"
"time"

"google.golang.org/grpc/status"

remove empty line
done.
@@ -106,23 +121,37 @@ func RegisterMetrics() {
registerMetrics.Do(func() {
prometheus.MustRegister(transformerLatencies)
prometheus.MustRegister(deprecatedTransformerLatencies)
prometheus.MustRegister(transformerFailuresTotal)

remove all the newlines here too
deprecatedTransformerLatencies.WithLabelValues(transformationType).Observe(sinceInMicroseconds(start))
default:
deprecatedTransformerFailuresTotal.WithLabelValues(transformationType).Inc()
st, ok := status.FromError(err)
You could just do
transformerOperationsTotal.WithLabelValues(transformationType, status.Code(err).String()).Inc()
outside the switch. It'll handle nil errors too.
@@ -0,0 +1,97 @@
/*
Copyright 2017 The Kubernetes Authors.
2019
Done.
Still see 2017 here
"apiserver_storage_transformation_operations_total", | ||
"apiserver_storage_transformation_failures_total", | ||
}, | ||
want: ` |
Should there be latencies? Or will testutil.GatherAndCompare do a substring match?
Yes, testutil will only compare the metrics explicitly requested.
I'm not sure I can reliably test latencies using this method, so I'm leaving them out.
Makes sense, thanks for clarifying
/test pull-kubernetes-e2e-gce-100-performance
28076e7 to 7074a4f
7074a4f to 90c9421
@@ -20,6 +20,8 @@ import (
"sync"
"time"

"google.golang.org/grpc/status"
@awly PTAL
/test pull-kubernetes-node-e2e
/lgtm
if err != nil {
transformerFailuresTotal.WithLabelValues(transformationType).Inc()
return
transformerOperationsTotal.WithLabelValues(transformationType, status.Code(err).String()).Inc()
what's the cost of status.Code(err).String(), and are the possible values bounded?
@liggitt PTAL
@liggitt
I ran benchmarks against the RecordTransformation function; it completes in approximately 420 ns for a generic error and 540 ns for a status error.
With respect to the possible values, they are bounded to 17 possible values, but in the context of the KMS plugin we would expect OK, Canceled, Unknown, DeadlineExceeded, NotFound, PermissionDenied, ResourceExhausted, and FailedPrecondition.
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: immutableT, liggitt
The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@kubernetes/sig-api-machinery-pr-reviews @kubernetes/sig-instrumentation-pr-reviews can somebody comment on the validity of the cherry pick to 1.14 here? I.e., we don't cherry-pick features, so is this mismarked, or is the cherry pick invalid?
Add transformation_success_total and transformation_last_status metrics to apiserver/pkg/storage/value/metrics.go
What type of PR is this?
/kind feature
What this PR does / why we need it:
When managing KMS Encryption feature within a sizable deployment, answering the following monitoring/health questions becomes important:
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: