Improve volume operation metrics #75750

Merged
merged 2 commits into from Apr 5, 2019

msau42 (Member) commented Mar 26, 2019

What type of PR is this?
/kind cleanup

What this PR does / why we need it:

  • Extend the histogram buckets to account for storage systems that take longer to provision.
  • The current error count metric has no "total" count to compare against, so an error rate cannot be computed from it. This PR adds a new "storage_operation_status_count" metric with a "status" dimension whose values are "success" or "fail-unknown", which makes it possible to compute an error rate.
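
As a rough illustration of why a per-status count is enough to derive an error rate, here is a minimal stdlib-only sketch. The `opKey` and `statusCounts` types and the `errorRate` function are hypothetical stand-ins for a labeled counter; they are not the code in this PR, and "example-plugin" is a made-up plugin name.

```go
package main

import "fmt"

// opKey mirrors the three label dimensions described above.
// The type and its field names are hypothetical.
type opKey struct {
	plugin, operation, status string
}

// statusCounts is an in-memory stand-in for a labeled counter such as
// storage_operation_status_count.
type statusCounts map[opKey]float64

// errorRate returns failures / total for one plugin and operation.
// The second result is false when nothing has been recorded.
func errorRate(c statusCounts, plugin, operation string) (float64, bool) {
	var total, failed float64
	for k, v := range c {
		if k.plugin != plugin || k.operation != operation {
			continue
		}
		total += v
		// Anything other than "success" is treated as a failure.
		if k.status != "success" {
			failed += v
		}
	}
	if total == 0 {
		return 0, false
	}
	return failed / total, true
}

func main() {
	c := statusCounts{}
	c[opKey{"example-plugin", "volume_provision", "success"}] += 3
	c[opKey{"example-plugin", "volume_provision", "fail-unknown"}] += 1
	if rate, ok := errorRate(c, "example-plugin", "volume_provision"); ok {
		fmt.Printf("error rate: %.2f\n", rate) // prints "error rate: 0.25"
	}
}
```

With only an error counter, the denominator in `errorRate` would be unavailable, which is the gap the new metric is meant to close.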

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Adds a new "storage_operation_status_count" metric for kube-controller-manager and kubelet to count success and error statuses.

msau42 (Member, Author) commented Mar 27, 2019

/assign @gnufied
@kubernetes/sig-instrumentation-pr-reviews
/priority important-soon

 },
-[]string{"volume_plugin", "operation_name"},
+[]string{"volume_plugin", "operation_name", "status"},

msau42 (Member, Author) commented Mar 27, 2019

Another thought I had was also having an "error code" dimension. Then we could potentially have queries to exclude certain types of errors.

But I'm not sure if we can get that level of granularity from the volume operations, which often just do "fmt.Errorf" without any error code.

x13n (Member) commented Mar 27, 2019

You can export "status" or "status_code", with some value representing success (e.g. "OK" or "success" as you have now) and treat everything else as an error. In this PR you could label all errors as e.g. "unknown" and do some refactoring afterwards to surface particular error types without changing the metric definition.
I'm against having both boolean "success"/"failure" AND some error code in another label though.
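
One way to read this suggestion is a single label whose value is derived from the returned error, with every failure collapsed into one value for now. The sketch below assumes the "success" and "fail-unknown" values from the PR description; the function and constant names, and `errExample`, are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	statusSuccess     = "success"
	statusFailUnknown = "fail-unknown"
)

// errExample is a placeholder failure used only for this illustration.
var errExample = errors.New("provisioning failed")

// statusLabel maps an operation result to a single "status" label value.
func statusLabel(err error) string {
	if err == nil {
		return statusSuccess
	}
	// Every error collapses into one value for now. More specific values
	// could be returned here later without changing the metric definition.
	return statusFailUnknown
}

// isFailure treats every value other than the success value as an error.
func isFailure(status string) bool {
	return status != statusSuccess
}

func main() {
	fmt.Println(statusLabel(nil))        // prints "success"
	fmt.Println(statusLabel(errExample)) // prints "fail-unknown"
}
```

Because consumers only compare against the success value, new failure values added later would still count as errors in existing queries.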

msau42 (Member, Author) commented Mar 27, 2019

/retest

msau42 force-pushed the msau42:metrics branch from 6418a64 to f190316 on Mar 27, 2019

msau42 (Member, Author) commented Mar 27, 2019

/hold
still need to update our e2e test

msau42 force-pushed the msau42:metrics branch from f190316 to 185c42e on Mar 27, 2019

}
storageOperationMetric.WithLabelValues(plugin, operationName, status).Observe(timeTaken)

gnufied (Member) commented Mar 27, 2019

Will this cause existing dashboards to break, since this metric will now suddenly include errors as well? Do we need to capture this in a changelog or something?

msau42 (Member, Author) commented Mar 27, 2019

I have a release note with an action required.

Dashboards shouldn't break. However, now that the metrics capture more cases than before, the semantics of the values may change, so any alerts based on this metric may need to be adjusted.

msau42 (Member, Author) commented Apr 1, 2019

After some further thought, I'm going to leave the latency histogram as is (only counting successes) and add a new "status_count" metric to track the number of successes and errors.
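
The split described here (latency for successes only, a count for every outcome) might look roughly like the following. This is a stdlib-only sketch: `recorder`, `completionHook`, and `errExample` are hypothetical names, and plain maps stand in for the real histogram and counter.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errExample is a placeholder failure used only for this illustration.
var errExample = errors.New("provisioning failed")

// recorder is a hypothetical in-memory stand-in for the two metrics:
// a latency histogram and a status counter.
type recorder struct {
	latencies map[string][]float64 // seconds, recorded for successes only
	counts    map[string]int       // keyed by "plugin/operation/status"
}

func newRecorder() *recorder {
	return &recorder{latencies: map[string][]float64{}, counts: map[string]int{}}
}

// completionHook starts a timer and returns a callback to run when the
// operation finishes.
func (r *recorder) completionHook(plugin, operation string) func(err error) {
	start := time.Now()
	return func(err error) {
		key := plugin + "/" + operation
		status := "success"
		if err != nil {
			status = "fail-unknown"
		} else {
			// Latency keeps its original meaning: successful operations only.
			r.latencies[key] = append(r.latencies[key], time.Since(start).Seconds())
		}
		// The count is incremented for every outcome.
		r.counts[key+"/"+status]++
	}
}

func main() {
	r := newRecorder()
	r.completionHook("example-plugin", "volume_provision")(nil)
	r.completionHook("example-plugin", "volume_provision")(errExample)
	fmt.Println(len(r.latencies["example-plugin/volume_provision"]), r.counts)
}
```

Keeping the histogram untouched means existing latency dashboards and alerts keep their meaning, while the separate counter carries the failure information.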

 },
-[]string{"volume_plugin", "operation_name"},
+[]string{"volume_plugin", "operation_name", "status_code"},

brancz (Member) commented Apr 1, 2019

Is status_code really appropriate here? When I read that, I expect HTTP code-like values, but here we're just getting "success"/"unknown", right? In that case I would expect this label to be either status or result.

msau42 (Member, Author) commented Apr 1, 2019

"status" sounds good to me

msau42 force-pushed the msau42:metrics branch 2 times, most recently from d973119 to 53f17ec on Apr 2, 2019

brancz (Member) commented Apr 3, 2019

Looks good from sig-instrumentation side.

/lgtm

msau42 force-pushed the msau42:metrics branch from 53f17ec to db472c8 on Apr 3, 2019

k8s-ci-robot added size/L and removed lgtm size/S labels on Apr 3, 2019

k8s-ci-robot (Contributor) commented Apr 3, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msau42

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

msau42 (Member, Author) commented Apr 3, 2019

Extended the existing e2e test for success counting and added a new e2e test for the failure counting. @gnufied ptal

msau42 (Member, Author) commented Apr 3, 2019

/hold cancel

updatedStorageMetrics := getControllerStorageMetrics(updatedControllerMetrics)

Expect(len(updatedStorageMetrics.statusMetrics)).ToNot(Equal(0), "Error fetching c-m updated storage metrics")
verifyMetricCount(storageOpMetrics, updatedStorageMetrics, "volume_provision", true)

gnufied (Member) commented Apr 4, 2019

Doesn't this fail at provision time itself? Do we have to create pod and stuff?

msau42 (Member, Author) commented Apr 4, 2019

We need to create the Pod to handle delayed binding cases, where provisioning is not triggered until a Pod is created.
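
The test fragment under review compares a metrics snapshot taken before the operation with one taken after it. A minimal sketch of that kind of before/after check follows, assuming counters are keyed by "operation/status"; `countDelta` and the key format are hypothetical and do not describe the actual `verifyMetricCount` helper.

```go
package main

import "fmt"

// countDelta returns how much a counter sample grew between two snapshots.
// A key missing from either snapshot counts as zero.
func countDelta(before, after map[string]int64, key string) int64 {
	return after[key] - before[key]
}

func main() {
	before := map[string]int64{"volume_provision/fail-unknown": 2}
	after := map[string]int64{
		"volume_provision/fail-unknown": 3,
		"volume_provision/success":      1,
	}
	// A failure-counting test would expect the failure sample to have grown.
	fmt.Println(countDelta(before, after, "volume_provision/fail-unknown") > 0) // prints "true"
}
```

With delayed binding, the "after" snapshot only changes once a Pod has triggered provisioning, which is why the test has to create one first.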

gnufied (Member) commented Apr 4, 2019

/lgtm

k8s-ci-robot added the lgtm label on Apr 4, 2019

k8s-ci-robot merged commit f25fa0e into kubernetes:master on Apr 5, 2019

17 checks passed

cla/linuxfoundation: msau42 authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-conformance-image-test: Skipped.
pull-kubernetes-cross: Skipped.
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-100-performance: Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-godeps: Skipped.
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
pull-kubernetes-local-e2e: Skipped.
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.
pull-publishing-bot-validate: Skipped.
tide: In merge pool.

k8s-ci-robot added a commit that referenced this pull request May 2, 2019
Merge pull request #76222 from msau42/automated-cherry-pick-of-#75750…
…-upstream-release-1.14

Automated cherry pick of #75750: Improve volume operation metrics