Add transformation_success_total and transformation_last_status metrics. #70715

Merged
merged 1 commit into from May 28, 2019

Conversation

@immutableT
Contributor

commented Nov 6, 2018

Add transformation_success_total and transformation_last_status metrics to apiserver/pkg/storage/value/metrics.go

What type of PR is this?
/kind feature

What this PR does / why we need it:
When managing the KMS encryption feature within a sizable deployment, answering the following monitoring/health questions becomes important:

  1. How many requests is my KMS-Plugin handling?
  2. What is the ratio between successful and failed KMS requests?
  3. Show me all clusters where the last request to KMS-Plugin failed.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

The transformer_failures_total metric is deprecated in favor of transformation_operation_total. The old metric will continue to be populated but will be removed in a future release.
@immutableT

Contributor Author

commented Nov 6, 2018

/assign @lavalamp

@immutableT

Contributor Author

commented Nov 6, 2018

/assign @liggitt

@@ -41,6 +41,23 @@ var (
},
[]string{"transformation_type"},
)
transformerSuccessTotal = prometheus.NewCounterVec(

@lavalamp

lavalamp Nov 6, 2018

Member

Three metrics:

  1. total transformation counter
  2. a counter of either successes or failures
  3. some measure of latency, not sure the best way to format it off the top of my head.

@immutableT

immutableT Nov 7, 2018

Author Contributor

I may be missing something in your comment.
Including the changes in this PR, we will have the following metrics:

transformation_success_total (this PR)
transformation_failures_total
transformation_latencies_microseconds
transformation_last_status (this PR)

@lavalamp

lavalamp Nov 7, 2018

Member

Sorry I failed to expand the github diff. success_total + failures_total + latencies are sufficient. I'd slightly prefer if you added a total usage counter rather than a success counter.

@immutableT

immutableT Nov 7, 2018

Author Contributor

Yes, in retrospect, I would have defined a single counter - operations_total with two labels (success and failure).

@liggitt

liggitt Dec 30, 2018

Member

Yes, in retrospect, I would have defined a single counter - operations_total with two labels (success and failure).

It's not too late. If we're adding a metric here, let's add the one we want.

The current metric doesn't provide info about which resource type is being transformed (the feature is not limited just to Secret resources). Is that important to include?

@immutableT

immutableT Dec 30, 2018

Author Contributor

Added transformation_operations_total with two labels: transformation_type and status.
I left transformation_failures_total, despite its being redundant - it may be already used.
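
As a rough illustration of why one counter with a status label can replace separate success and failure counters, here is a stdlib-only sketch. It does not use the Prometheus client; `opCounter`, `labelKey`, and the label values passed in `main` are hypothetical stand-ins for a CounterVec keyed by transformation_type and status, and "OK" is assumed to be the success status.

```go
package main

import (
	"fmt"
	"sync"
)

// labelKey identifies one time series: a (transformation_type, status) pair.
type labelKey struct {
	transformationType string
	status             string
}

// opCounter is a hypothetical, stdlib-only stand-in for a counter vector
// with the labels "transformation_type" and "status".
type opCounter struct {
	mu     sync.Mutex
	counts map[labelKey]float64
}

func newOpCounter() *opCounter {
	return &opCounter{counts: make(map[labelKey]float64)}
}

// Inc adds one to the series selected by the two label values.
func (c *opCounter) Inc(transformationType, status string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[labelKey{transformationType, status}]++
}

// Total sums every series: "how many requests were handled?".
func (c *opCounter) Total() float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	var total float64
	for _, n := range c.counts {
		total += n
	}
	return total
}

// SuccessRatio returns successes divided by total, or 0 if nothing was recorded.
func (c *opCounter) SuccessRatio() float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	var ok, total float64
	for k, n := range c.counts {
		total += n
		if k.status == "OK" {
			ok += n
		}
	}
	if total == 0 {
		return 0
	}
	return ok / total
}

func main() {
	c := newOpCounter()
	c.Inc("to_storage", "OK")
	c.Inc("to_storage", "OK")
	c.Inc("from_storage", "OK")
	c.Inc("from_storage", "Unknown")
	fmt.Printf("total=%v success_ratio=%v\n", c.Total(), c.SuccessRatio())
}
```

Both the total and the success/failure ratio fall out of the same labeled series, which is what makes a separate failures counter redundant.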

With respect to including the resource as a label: how would we do it? Today, transformers are not aware of which resource is being transformed. We could:

  1. Infer the resource type from value.Context - it contains the etcd key, which in turn contains the resource type (e.g. /registry/secrets/default/dev-db-secret-0166, where the second segment is the resource type).
  2. Pass resource type into TransformTo/FromStorage from the store.go.
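
Option 1 could look roughly like the sketch below. It only assumes the key layout shown in the example above; `resourceFromKey` is a hypothetical helper, not an existing apiserver function, and keys with a different prefix or a group-qualified layout would need extra handling.

```go
package main

import (
	"fmt"
	"strings"
)

// resourceFromKey returns the second path segment of an etcd key such as
// "/registry/secrets/default/dev-db-secret-0166". It returns "" when the
// key has fewer than two segments.
func resourceFromKey(key string) string {
	parts := strings.Split(strings.Trim(key, "/"), "/")
	if len(parts) < 2 {
		return ""
	}
	return parts[1]
}

func main() {
	fmt.Println(resourceFromKey("/registry/secrets/default/dev-db-secret-0166"))
}
```

Parsing the key couples the metric to the storage layout, which is one reason passing the resource type in explicitly (option 2) may be the more robust choice.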

@liggitt

liggitt Jan 2, 2019

Member

2 would be my preference, but I'm ok with deferring that until there is demand for it

@timothysc timothysc removed their request for review Nov 7, 2018

@immutableT

Contributor Author

commented Nov 7, 2018

/ok-to-test

@k8s-ci-robot

Contributor

commented Nov 7, 2018

@immutableT: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@immutableT

Contributor Author

commented Nov 7, 2018

/ok-to-test

@immutableT

Contributor Author

commented Nov 7, 2018

@lavalamp PTAL.
I removed the transformer_last_status metric.

},
[]string{"transformation_type", "status"},
)
// Deprecated, use transformerOperationsTotal instead.

@liggitt

liggitt Jan 2, 2019

Member

note the deprecation in the help text so callers are aware?

@immutableT

immutableT Jan 2, 2019

Author Contributor

I followed the pattern we used in k8s.io/client-go/util/workqueue/metrics.go and added a deprecated prefix to the variable name.
I also added a deprecation notice to the help text.

@logicalhan

logicalhan Jan 2, 2019

Contributor

Shouldn't the deprecation notice be added to line 59 then?

@immutableT

immutableT Jan 2, 2019

Author Contributor

I can go either way.
Tools like IntelliJ/GoLand provide a visual indication of the deprecation when the notice is placed at the variable declaration level.

@logicalhan

logicalhan Jan 2, 2019

Contributor

Yes, but if it is added to the help text, it will also be output at the '/metrics' endpoint, which is kinda nice too, imo.

@immutableT

immutableT Jan 3, 2019

Author Contributor

Makes sense.
Done.

@liggitt

Member

commented Jan 2, 2019

add deprecation to release note

@immutableT

Contributor Author

commented Jan 2, 2019

I assume that deprecation notice is added here as a separate PR:
https://github.com/kubernetes/website/blob/master/content/en/docs/setup/release/notes.md
after this PR is merged, correct?

@immutableT

Contributor Author

commented Jan 3, 2019

/retest

@liggitt

Member

commented Jan 16, 2019

I assume that deprecation notice is added here as a separate PR:
https://github.com/kubernetes/website/blob/master/content/en/docs/setup/release/notes.md
after this PR is merged, correct?

it goes in the PR description inside the ```release-note block

squash, then lgtm

@brancz

Member

commented Mar 11, 2019

besides some consistency on the deprecation process, this looks good from instrumentation side

@immutableT immutableT force-pushed the immutableT:kube-apiserver-metrics branch from 190c9e3 to 1067d76 May 6, 2019

@k8s-ci-robot k8s-ci-robot added size/L and removed size/S labels May 14, 2019

@immutableT

Contributor Author

commented May 14, 2019

@awly PTAL

@immutableT immutableT force-pushed the immutableT:kube-apiserver-metrics branch from 9fc8216 to 6411222 May 15, 2019

@brancz

Member

commented May 15, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label May 15, 2019

@immutableT immutableT force-pushed the immutableT:kube-apiserver-metrics branch from 6411222 to 98b1d43 May 15, 2019

@k8s-ci-robot k8s-ci-robot removed the lgtm label May 15, 2019

@@ -20,6 +20,10 @@ import (
"sync"
"time"

"google.golang.org/grpc/status"

@awly

awly May 15, 2019

Contributor

remove empty line

@immutableT

immutableT May 15, 2019

Author Contributor

done.

@@ -106,23 +121,37 @@ func RegisterMetrics() {
registerMetrics.Do(func() {
prometheus.MustRegister(transformerLatencies)
prometheus.MustRegister(deprecatedTransformerLatencies)
prometheus.MustRegister(transformerFailuresTotal)

@awly

awly May 15, 2019

Contributor

remove all the newlines here too

deprecatedTransformerLatencies.WithLabelValues(transformationType).Observe(sinceInMicroseconds(start))
default:
deprecatedTransformerFailuresTotal.WithLabelValues(transformationType).Inc()
st, ok := status.FromError(err)

@awly

awly May 15, 2019

Contributor

You could just do

transformerOperationsTotal.WithLabelValues(transformationType, status.Code(err).String()).Inc()

outside the switch. It'll handle nil errors too.

@@ -0,0 +1,97 @@
/*
Copyright 2017 The Kubernetes Authors.

@awly

awly May 15, 2019

Contributor

2019

@immutableT

immutableT May 15, 2019

Author Contributor

Done.

@awly

awly May 15, 2019

Contributor

Still see 2017 here

"apiserver_storage_transformation_operations_total",
"apiserver_storage_transformation_failures_total",
},
want: `

@awly

awly May 15, 2019

Contributor

Should there be latencies? Or will testutil.GatherAndCompare do a substring match?

@immutableT

immutableT May 15, 2019

Author Contributor

Yes, testutil will only compare the metrics explicitly requested.
I'm not sure I can reliably test latencies using this method, so I'm leaving them out.

@awly

awly May 15, 2019

Contributor

Makes sense, thanks for clarifying
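
The "only compare what was requested" behavior discussed in this thread can be pictured with a small stdlib-only sketch. This is not how testutil is implemented (it works on the text exposition format); `compareSelected` and the latency metric name used in `main` are hypothetical and serve only to show that unnamed metrics are ignored.

```go
package main

import "fmt"

// compareSelected checks got against want only for the named metrics.
// Metrics that appear in got but are not named are ignored.
func compareSelected(got, want map[string]float64, names ...string) error {
	for _, name := range names {
		g, gok := got[name]
		w, wok := want[name]
		if gok != wok || g != w {
			return fmt.Errorf("metric %q: got %v (present=%t), want %v (present=%t)",
				name, g, gok, w, wok)
		}
	}
	return nil
}

func main() {
	got := map[string]float64{
		"apiserver_storage_transformation_operations_total": 3,
		"apiserver_storage_transformation_failures_total":   1,
		"hypothetical_latency_metric":                       123, // not compared below
	}
	want := map[string]float64{
		"apiserver_storage_transformation_operations_total": 3,
		"apiserver_storage_transformation_failures_total":   1,
	}
	err := compareSelected(got, want,
		"apiserver_storage_transformation_operations_total",
		"apiserver_storage_transformation_failures_total")
	fmt.Println("mismatch:", err)
}
```

Leaving the latency metric out of the requested names keeps the test deterministic, since observed durations vary from run to run.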

@immutableT

Contributor Author

commented May 15, 2019

/test pull-kubernetes-e2e-gce-100-performance

@immutableT immutableT force-pushed the immutableT:kube-apiserver-metrics branch from 28076e7 to 7074a4f May 15, 2019

@immutableT immutableT force-pushed the immutableT:kube-apiserver-metrics branch from 7074a4f to 90c9421 May 15, 2019

@@ -20,6 +20,8 @@ import (
"sync"
"time"

"google.golang.org/grpc/status"

@immutableT

immutableT May 15, 2019

Author Contributor

@awly PTAL

@immutableT

Contributor Author

commented May 15, 2019

/test pull-kubernetes-node-e2e

@zouyee

Member

commented May 16, 2019

/lgtm

if err != nil {
transformerFailuresTotal.WithLabelValues(transformationType).Inc()
return
transformerOperationsTotal.WithLabelValues(transformationType, status.Code(err).String()).Inc()

@liggitt

liggitt May 17, 2019

Member

what's the cost of status.Code(err).String(), and are the possible values bounded?

@immutableT

immutableT May 17, 2019

Author Contributor

The possible values are bound to these.
I would classify the cost as negligible, see here

@immutableT

immutableT May 20, 2019

Author Contributor

@liggitt PTAL

@immutableT

immutableT May 28, 2019

Author Contributor

@liggitt
I ran benchmarks against the RecordTransformation function: it completes in approximately 420 ns for a generic error and in approximately 540 ns for a status error.
With respect to the possible values, they are limited to 17 codes, but in the context of the KMS plugin we would expect OK, Canceled, Unknown, DeadlineExceeded, NotFound, PermissionDenied, ResourceExhausted and FailedPrecondition.

@liggitt

Member

commented May 28, 2019

/approve

@k8s-ci-robot

Contributor

commented May 28, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: immutableT, liggitt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 3c3c1b1 into kubernetes:master May 28, 2019

20 checks passed

cla/linuxfoundation immutableT authorized
pull-kubernetes-bazel-build Job succeeded.
pull-kubernetes-bazel-test Job succeeded.
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
pull-kubernetes-e2e-gce Job succeeded.
pull-kubernetes-e2e-gce-100-performance Job succeeded.
pull-kubernetes-e2e-gce-csi-serial Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
pull-kubernetes-e2e-gce-storage-slow Skipped.
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
pull-kubernetes-typecheck Job succeeded.
pull-kubernetes-verify Job succeeded.
pull-publishing-bot-validate Skipped.
tide In merge pool.

@tpepper

Contributor

commented May 30, 2019

@kubernetes/sig-api-machinery-pr-reviews @kubernetes/sig-instrumentation-pr-reviews can somebody comment on the validity of the cherry pick to 1.14 here? I.e., we don't cherry pick features, so is this mismarked, or is the CP invalid?
