Improve volume operation metrics #75750

Merged on Apr 5, 2019 (2 commits)

msau42 (Member) commented Mar 26, 2019

What type of PR is this?
/kind cleanup

What this PR does / why we need it:

  • Extend the histogram buckets to be able to account for storage systems that take longer to provision
  • The current error count metric cannot be used to calculate an error rate, because there is no "total" count to compare it against. This PR adds a new "storage_operation_status_count" metric with a "status" dimension whose values are "success" or "fail-unknown", which makes it possible to compute an error rate.
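
As a rough illustration of the second point, here is a minimal sketch of how an error rate falls out once successes and failures are counted under the same metric. It uses only the Go standard library; `statusKey`, `statusCounts`, `errorRate`, and the plugin name `example-plugin` are hypothetical stand-ins, not code from this PR (the real metric is a Prometheus counter).

```go
package main

import "fmt"

// statusKey mirrors the label set discussed in this PR.
// The type itself is hypothetical and exists only for this sketch.
type statusKey struct {
	plugin, operation, status string
}

// statusCounts is an in-memory stand-in for storage_operation_status_count.
type statusCounts map[statusKey]float64

// errorRate returns failures / (successes + failures) for one plugin and
// operation. The second result is false when nothing was recorded, because
// a rate is undefined without a total.
func errorRate(c statusCounts, plugin, operation string) (float64, bool) {
	success := c[statusKey{plugin, operation, "success"}]
	failed := c[statusKey{plugin, operation, "fail-unknown"}]
	total := success + failed
	if total == 0 {
		return 0, false
	}
	return failed / total, true
}

func main() {
	counts := statusCounts{}
	counts[statusKey{"example-plugin", "volume_provision", "success"}] += 9
	counts[statusKey{"example-plugin", "volume_provision", "fail-unknown"}] += 1

	if rate, ok := errorRate(counts, "example-plugin", "volume_provision"); ok {
		fmt.Printf("error rate: %.2f\n", rate)
	}
}
```

With only an error counter, the denominator in `errorRate` would be unavailable, which is the gap the new metric is meant to close.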

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Adds a new "storage_operation_status_count" metric for kube-controller-manager and kubelet to count success and error statuses.

@k8s-ci-robot k8s-ci-robot added release-note-action-required Denotes a PR that introduces potentially breaking changes that require user action. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 26, 2019
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. approved Indicates a PR has been approved by an approver from all required OWNERS files. sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 26, 2019
msau42 (Member, Author) commented Mar 27, 2019

/assign @gnufied
@kubernetes/sig-instrumentation-pr-reviews
/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 27, 2019
  },
- []string{"volume_plugin", "operation_name"},
+ []string{"volume_plugin", "operation_name", "status"},
Member Author

Another thought I had was also having an "error code" dimension. Then we could potentially have queries to exclude certain types of errors.

But I'm not sure if we can get that level of granularity from the volume operations, which often just do "fmt.Errorf" without any error code.

Member

You can export "status" or "status_code", with some value representing success (e.g. "OK" or "success" as you have now) and treat everything else as an error. In this PR you could label all errors as e.g. "unknown" and do some refactoring afterwards to surface particular error types without changing the metric definition.
I'm against having both boolean "success"/"failure" AND some error code in another label though.

msau42 (Member, Author) commented Mar 27, 2019

/retest

msau42 (Member, Author) commented Mar 27, 2019

/hold
still need to update our e2e test

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 27, 2019
}
storageOperationMetric.WithLabelValues(plugin, operationName, status).Observe(timeTaken)
Member

Will this cause existing dashboards to break, now that this metric will suddenly include errors as well? Do we need to capture this in a changelog or something?

Member Author

I have a release note with an action required.

Dashboards shouldn't break. However, now that the metrics capture more cases than before, the semantics of the values may change, so any alerts based on this metric may need to be adjusted.

Member Author

After some further thought, I'm going to leave the latency histogram as is (only counting successes) and add a new "status_count" metric to track the number of successes and errors.

  },
- []string{"volume_plugin", "operation_name"},
+ []string{"volume_plugin", "operation_name", "status_code"},
Member

Is status_code really appropriate here? When I read that, I expect HTTP-code-like values, but here we're just getting "success"/"unknown", right? In that case I would expect this label to be either status or result.

Member Author

"status" sounds good to me

@msau42 msau42 force-pushed the metrics branch 2 times, most recently from d973119 to 53f17ec Compare April 3, 2019 06:12
brancz (Member) commented Apr 3, 2019

Looks good from sig-instrumentation side.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 3, 2019
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 3, 2019
k8s-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msau42

@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Apr 3, 2019
msau42 (Member, Author) commented Apr 3, 2019

Extended the existing e2e test for success counting and added a new e2e test for the failure counting. @gnufied ptal

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-action-required Denotes a PR that introduces potentially breaking changes that require user action. labels Apr 3, 2019
msau42 (Member, Author) commented Apr 3, 2019

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 3, 2019
updatedStorageMetrics := getControllerStorageMetrics(updatedControllerMetrics)

Expect(len(updatedStorageMetrics.statusMetrics)).ToNot(Equal(0), "Error fetching c-m updated storage metrics")
verifyMetricCount(storageOpMetrics, updatedStorageMetrics, "volume_provision", true)
Member

Doesn't this fail at provision time itself? Do we have to create pod and stuff?

Member Author

We need to create the Pod to handle delayed binding cases, where provisioning is not triggered until a Pod is created.

gnufied (Member) commented Apr 4, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 4, 2019
@k8s-ci-robot k8s-ci-robot merged commit f25fa0e into kubernetes:master Apr 5, 2019
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Apr 6, 2019
k8s-ci-robot added a commit that referenced this pull request May 2, 2019
…-upstream-release-1.14

Automated cherry pick of #75750: Improve volume operation metrics
@msau42 msau42 deleted the metrics branch July 22, 2021 21:18
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
5 participants