Improve volume operation metrics #75750

msau42 · 2019-03-26T23:58:29Z

What type of PR is this?
/kind cleanup

What this PR does / why we need it:

Extends the histogram buckets to account for storage systems that take longer to provision.
It's not possible to calculate an error rate with the current error count metric, because there's no "total" count to compare against. This PR adds a new "storage_operation_status_count" metric with a "status" dimension whose values are "success" or "fail-unknown", which makes it possible to compute an error rate.
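To illustrate why a per-status count is enough to derive an error rate, here is a minimal sketch in plain Go. The `statusCounts` type and `errorRate` function are hypothetical names used only for this example; they stand in for the samples of a counter with a "status" label and are not part of the PR.

```go
package main

import "fmt"

// statusCounts is a hypothetical stand-in for the samples of a counter such as
// storage_operation_status_count, keyed by the value of its "status" label.
type statusCounts map[string]float64

// errorRate returns the fraction of operations whose status is not "success".
// The second return value is false when no operations were recorded, because
// the rate is undefined without a total.
func errorRate(c statusCounts) (float64, bool) {
	var total, failed float64
	for status, n := range c {
		total += n
		if status != "success" {
			failed += n
		}
	}
	if total == 0 {
		return 0, false
	}
	return failed / total, true
}

func main() {
	counts := statusCounts{"success": 3, "fail-unknown": 1}
	if rate, ok := errorRate(counts); ok {
		fmt.Printf("error rate: %.2f\n", rate)
	}
}
```

With an error-only counter, the `total` term above is not available, which is the gap the description points out.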

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Adds a new "storage_operation_status_count" metric for kube-controller-manager and kubelet to count success and error statuses.

msau42 · 2019-03-27T00:00:37Z

/assign @gnufied
@kubernetes/sig-instrumentation-pr-reviews
/priority important-soon

msau42 · 2019-03-27T00:03:26Z

pkg/volume/util/metrics.go

 	},
-	[]string{"volume_plugin", "operation_name"},
+	[]string{"volume_plugin", "operation_name", "status"},


Another thought I had was also having an "error code" dimension. Then we could potentially have queries to exclude certain types of errors.

But I'm not sure if we can get that level of granularity from the volume operations, which often just do "fmt.Errorf" without any error code.

x13n · 2019-03-27

You can export "status" or "status_code", with some value representing success (e.g. "OK" or "success", as you have now), and treat everything else as an error. In this PR you could label all errors as, e.g., "unknown" and do some refactoring afterwards to surface particular error types without changing the metric definition.
I'm against having both a boolean "success"/"failure" AND an error code in another label, though.

msau42 · 2019-03-27T00:16:08Z

/retest

msau42 · 2019-03-27T03:13:49Z

/hold
still need to update our e2e test

gnufied · 2019-03-27T19:14:57Z

pkg/volume/util/metrics.go

 		}
+		storageOperationMetric.WithLabelValues(plugin, operationName, status).Observe(timeTaken)


Will this cause existing dashboards to break, now that this metric will suddenly include errors as well? Do we need to capture this in a changelog or something?

msau42 · 2019-03-27

I have a release note with an action required.

Dashboards shouldn't break; however, now that the metrics capture more cases than before, the semantics of the values may change, so any alerts based on this metric may need to be adjusted.

msau42 · 2019-04-01

After some further thought, I'm going to leave the latency histogram as is (only counting successes) and add a new "status_count" metric to track the number of successes and errors.

brancz · 2019-04-01T09:54:39Z

pkg/volume/util/metrics.go

 	},
-	[]string{"volume_plugin", "operation_name"},
+	[]string{"volume_plugin", "operation_name", "status_code"},


Is status_code really appropriate here? When I read that, I expect HTTP-code-like values, but here we're just getting "success"/"unknown", right? In that case I would expect this label to be either "status" or "result".

msau42 · 2019-04-01

"status" sounds good to me.

brancz · 2019-04-03T09:23:26Z

Looks good from sig-instrumentation side.

/lgtm

k8s-ci-robot · 2019-04-03T21:05:33Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msau42

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/volume/util/OWNERS~~ [msau42]
~~test/e2e/storage/OWNERS~~ [msau42]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

msau42 · 2019-04-03T21:06:02Z

Extended the existing e2e test for success counting and added a new e2e test for the failure counting. @gnufied ptal

msau42 · 2019-04-03T21:07:56Z

/hold cancel

gnufied · 2019-04-04T19:05:52Z

test/e2e/storage/volume_metrics.go

+		updatedStorageMetrics := getControllerStorageMetrics(updatedControllerMetrics)
+
+		Expect(len(updatedStorageMetrics.statusMetrics)).ToNot(Equal(0), "Error fetching c-m updated storage metrics")
+		verifyMetricCount(storageOpMetrics, updatedStorageMetrics, "volume_provision", true)


Doesn't this fail at provision time itself? Do we have to create a pod and stuff?

msau42 · 2019-04-04

We need to create the Pod to handle delayed binding cases, where provisioning is not triggered until a Pod is created.
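The check in this test compares storage metrics captured before and after the operation. The sketch below shows one way such a comparison could work; it is not the actual verifyMetricCount helper, whose implementation does not appear in this thread. The `snapshot` type, the `countIncreased` function, and the reading of the final boolean as "expect a failure" are all assumptions made for illustration.

```go
package main

import "fmt"

// statusKey and snapshot are hypothetical types: a snapshot holds the value of
// a status counter for each (operation, status) pair at one point in time.
type statusKey struct{ operation, status string }
type snapshot map[statusKey]int64

// countIncreased reports whether the counter for the given operation grew
// between two snapshots. When expectFailure is true it looks at the
// "fail-unknown" series, otherwise at the "success" series.
func countIncreased(before, after snapshot, operation string, expectFailure bool) bool {
	status := "success"
	if expectFailure {
		status = "fail-unknown"
	}
	k := statusKey{operation, status}
	return after[k] > before[k]
}

func main() {
	before := snapshot{statusKey{"volume_provision", "fail-unknown"}: 2}
	after := snapshot{statusKey{"volume_provision", "fail-unknown"}: 3}
	fmt.Println(countIncreased(before, after, "volume_provision", true))
}
```

With delayed binding, the "after" snapshot would only differ once a Pod has triggered provisioning, which is why the test creates one.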

gnufied · 2019-04-04T20:31:18Z

/lgtm

k8s-ci-robot requested review from rootfs and saad-ali March 26, 2019 23:58

k8s-ci-robot assigned gnufied Mar 27, 2019

msau42 commented Mar 27, 2019

msau42 force-pushed the metrics branch from 6418a64 to f190316 Compare March 27, 2019 01:07

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 27, 2019

msau42 force-pushed the metrics branch from f190316 to 185c42e Compare March 27, 2019 18:02

gnufied reviewed Mar 27, 2019

brancz reviewed Apr 1, 2019

msau42 force-pushed the metrics branch 2 times, most recently from d973119 to 53f17ec Compare April 3, 2019 06:12

k8s-ci-robot assigned brancz Apr 3, 2019

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 3, 2019

Improve volume operation metrics

33bf81f

Add e2e tests

db472c8

msau42 force-pushed the metrics branch from 53f17ec to db472c8 Compare April 3, 2019 21:05

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 3, 2019

k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Apr 3, 2019

k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-action-required Denotes a PR that introduces potentially breaking changes that require user action. labels Apr 3, 2019

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 3, 2019

gnufied reviewed Apr 4, 2019

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 4, 2019

k8s-ci-robot merged commit f25fa0e into kubernetes:master Apr 5, 2019

msau42 mentioned this pull request Apr 6, 2019

Automated cherry pick of #75750: Improve volume operation metrics #76222

Merged

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Apr 6, 2019

k8s-ci-robot added a commit that referenced this pull request May 2, 2019

Merge pull request #76222 from msau42/automated-cherry-pick-of-#75750…

8ffc4b2

…-upstream-release-1.14 Automated cherry pick of #75750: Improve volume operation metrics

msau42 mentioned this pull request Dec 16, 2019

Add prometheus metrics to CSI external-provisioner kubernetes-csi/external-provisioner#386

Closed

msau42 deleted the metrics branch July 22, 2021 21:18
