Add metrics to all major gce operations (latency, errors) #44510

bowei · 2017-04-14T21:04:06Z

Add metrics to all major gce operations {latency, errors}

The new metrics are:

  cloudprovider_gce_api_request_duration_seconds{request, region, zone}
  cloudprovider_gce_api_request_errors{request, region, zone}
 
`request` is the specific function that is used.
`region` is the target region (Will be "<n/a>" if not applicable)
`zone` is the target zone (Will be "<n/a>" if not applicable)

Note: this fixes some issues with the previous implementation of
metrics for disks:
- Time duration tracked was of the initial API call, not the entire
  operation.
- Metrics label tuple would have resulted in many independent
  histograms stored, one for each disk. (Did not aggregate well).
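
For illustration, here is a minimal sketch of how the `{request, region, zone}` tuple could be filled in, with the `"<n/a>"` placeholder substituted for a dimension that does not apply. The helper `labelValues` and the request names are hypothetical and not taken from this PR; only the placeholder value comes from the description above.

```go
package main

import "fmt"

// unusedLabel mirrors the "<n/a>" placeholder described in the PR text.
const unusedLabel = "<n/a>"

// labelValues is a hypothetical helper. It returns the (request, region, zone)
// tuple and substitutes the placeholder for any dimension left empty.
func labelValues(request, region, zone string) []string {
	if region == "" {
		region = unusedLabel
	}
	if zone == "" {
		zone = unusedLabel
	}
	return []string{request, region, zone}
}

func main() {
	// A zonal request has no region; a global request has neither.
	// The request names here are made up for the example.
	fmt.Println(labelValues("disk_attach", "", "us-central1-b"))
	fmt.Println(labelValues("route_create", "", ""))
}
```

Keeping every series on the same three label names, even when a dimension is unused, means all requests can share one metric family.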

k8s-reviewable · 2017-04-14T21:04:13Z

This change is

bowei · 2017-04-14T21:05:30Z

/assign @saad-ali @nicksardo

bowei · 2017-04-15T17:45:14Z

@k8s-bot non-cri e2e test this

bowei · 2017-04-18T17:20:02Z

@saad-ali -- assign to someone for review?

gnufied · 2017-04-18T18:33:02Z

/assign @gnufied

saad-ali · 2017-04-18T18:39:37Z

Thanks for the fixes here @bowei. @gnufied from the storage-sig will review the storage changes.

CC @brancz to review on behalf of sig-instrumentation; @kubernetes/sig-storage-pr-reviews as FYI

gnufied · 2017-04-18T18:47:26Z

cc @brancz

gnufied · 2017-04-19T00:21:16Z

@bowei can you explain -

"Metrics label tuple would have resulted in many independent
histograms stored, one for each disk. (Did not aggregate well)." ?

The old metrics only had namespace as a dimension, and this PR replaces that with zone and node as dimensions. So the old PR will likely create 1 time series for 1 metric, but this PR will create 2000 time series for 1 metric.

Also --

"- Time duration tracked was of the initial API call, not the entire
operation."

What do you mean by the entire operation? Do you mean we capture error and success metrics separately?

bowei · 2017-04-19T07:31:04Z

If you look at the metrics created before this PR:

bowei@e2e-test-bowei-master ~ $ curl localhost:10252/metrics 2>/dev/null | grep _disk_ | grep Inf   
gce_attach_disk_duration_seconds_bucket{namespace="e2e-test-bowei-dynamic-pvc-02c15253-1f0f-11e7-8e35-42010a280002",le="+Inf"} 1
gce_attach_disk_duration_seconds_bucket{namespace="e2e-test-bowei-dynamic-pvc-1c0aba8f-1f0f-11e7-8e35-42010a280002",le="+Inf"} 1
gce_attach_disk_duration_seconds_bucket{namespace="e2e-test-bowei-dynamic-pvc-de1c20ab-1f0e-11e7-8e35-42010a280002",le="+Inf"} 1
... (many entries)

You will end up keeping a histogram for each PVC that is created.

The number of separate stats kept by Prometheus is the number of unique label tuples for the given metric. For disk, I aggregate based on (zone, node), which means the number of series tracked will be the same as the number of nodes in the cluster.

After this change, we see the following, aggregating per node.

bowei@e2e-test-bowei-master ~ $ curl localhost:10252/metrics 2>/dev/null | grep cloud | grep Inf
cloudprovider_gce_disk_duration_seconds_bucket{function="attach",node="e2e-test-bowei-minion-group-3wd9",zone="us-central1-b",le="+Inf"} 1
cloudprovider_gce_disk_duration_seconds_bucket{function="attach",node="e2e-test-bowei-minion-group-52ql",zone="us-central1-b",le="+Inf"} 2
cloudprovider_gce_routes_duration_seconds_bucket{function="create",le="+Inf"} 41
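
The cardinality argument above can be sketched with the standard library alone: the number of histograms kept equals the number of distinct label tuples observed. The function `countSeries` and the sample label values are illustrative assumptions, not code or data from this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// countSeries returns the number of distinct label tuples, which is the
// number of separate series a Prometheus-style registry would keep.
func countSeries(observations [][]string) int {
	seen := make(map[string]struct{})
	for _, labels := range observations {
		// Join with a NUL byte so that tuples cannot collide by accident.
		seen[strings.Join(labels, "\x00")] = struct{}{}
	}
	return len(seen)
}

func main() {
	// Old scheme: the only label is a per-PVC namespace (made-up values),
	// so every new PVC adds a series.
	perNamespace := [][]string{{"pvc-a"}, {"pvc-b"}, {"pvc-c"}, {"pvc-d"}}

	// New scheme: (function, node, zone). Repeated attaches to the same
	// node land in the same series.
	perNode := [][]string{
		{"attach", "node-1", "us-central1-b"},
		{"attach", "node-2", "us-central1-b"},
		{"attach", "node-2", "us-central1-b"},
		{"attach", "node-1", "us-central1-b"},
	}

	fmt.Println(countSeries(perNamespace), countSeries(perNode))
}
```

With four attach operations in each scheme, the namespace labelling keeps four series while the per-node labelling keeps two.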

Most of the GCP operations that change state return immediately after the change has been submitted, but the completion of the API call is signaled by polling the API for the state change. When you attach the duration timer to the HTTP transaction context, you will be timing the API submission, not the duration of the state change. We can track completion time of the HTTP request/response but it seems to be more interesting to track the time taken by the overall state change.

bowei · 2017-04-19T18:06:25Z

@gnufied responded

gnufied · 2017-04-19T20:21:22Z

@bowei I don't have a Prometheus install handy, but depending on the metrics scraper, some dimensions (such as node) are automatically added when metrics are collected. It is entirely possible I am wrong. Can you please check?

bowei · 2017-04-19T20:25:45Z

@gnufied: the Prometheus scraper will add the node the metric was scraped from, which in this case will be the master node where the controller was run, not the target of the operation (e.g. the node that the PV is being mounted on).

gnufied · 2017-04-21T13:13:58Z

@bowei I think we have a bunch of parallel threads going on about metric naming and dimensions. I opened a PR for AWS here - #43477 in which @justinsb proposed that we report the API action as a dimension rather than as part of the metric name.

You have also dropped namespace from the reported metrics. I am not sure the e2e tests are a good example, since they create a lot of ephemeral namespaces.

There is also a design proposal which should be amended if we are going to change metric names - https://github.com/kubernetes/community/blob/master/contributors/design-proposals/cloudprovider-storage-metrics.md

Can we have a quick call sometime today and discuss this offline? I would ask @justinsb to be present as well, so that we can discuss a common naming scheme for all cloudprovider metrics.

gnufied · 2017-04-21T13:30:29Z

@bowei @justinsb does 1:00 PM EST work for you guys?

brancz · 2017-04-24T08:26:23Z

Sorry I was on vacation, is there a summary of your offline discussion, or what was the outcome? Do you want me to hold off on reviewing for now?

gnufied · 2017-04-24T16:17:29Z

@chauncey (Jonathan), @bowei, @justinsb and I discussed this offline. The following things were finalized:

- As agreed previously, the API action name will be moved to a dimension and will not be part of the metric name.
- The actual meaning of metrics might be subtly different between cloudproviders. For example, aws_create_disk could just mean the time it took to submit the API request, whereas gce_create_disk could mean the time it took to submit the API request plus the time spent waiting for the API action to take effect. These low-level metrics shouldn't be used to measure the relative performance of cloudproviders (namely, whether creating a disk is faster in AWS or GCE). It was agreed that it is up to cloudprovider maintainers to decide how they are going to emit metrics.
- For now, we are going to drop the namespace requirement from the metrics. Namespace information isn't available at most call sites anyway. It was discussed that we may amend metric emission to pass Context objects, which can hold additional dimensions.

@justinsb @bowei let me know if I missed anything.

bowei · 2017-04-24T17:47:15Z

Thanks @gnufied -- that is a good summary. Are you going to update the metrics doc as well?

bowei · 2017-04-24T17:48:22Z

@gnufied -- oh, did you guys decide on the specific naming?

"cloudprovider_{gce,aws,...}_{disk, }"?

Or something else...

gnufied · 2017-04-24T18:18:13Z

@bowei we didn't really have a chance to get consensus around the metric name. But I have updated the design proposal kubernetes/community#507 to use cloudprovider_gce_api_duration_seconds as a name and api_request as a dimension.

bowei · 2017-04-26T22:38:00Z

@gnufied please take a look, I updated the metrics to fit with the discussed metrics proposal

gnufied

Two small nits but looks good.

gnufied · 2017-04-27T00:29:09Z

pkg/cloudprovider/providers/gce/metrics.go

+}
+
+// Value for an unused label in the metric dimension.
+const unusedLabel = "<n/a>"


Can we rename it to unusedMetricLabel ?

gnufied · 2017-04-27T00:32:07Z

pkg/cloudprovider/providers/gce/metrics.go

+// Observe the result of a API call.
+func (mc *metricContext) Observe(err error) error {
+	apiMetrics.latency.WithLabelValues(mc.attributes...).Observe(
+		time.Now().Sub(mc.start).Seconds())


time.Since(mc.start).Seconds() seems simpler?

gnufied · 2017-04-27T03:59:32Z

@bowei let's wait for @brancz to have a look by tomorrow. It all looks good to me.

brancz · 2017-04-27T12:31:30Z

Same thing applies here as I mentioned here.


bowei · 2017-04-27T19:49:55Z

@brancz @gnufied -- fixed the naming as requested

gnufied · 2017-04-27T19:55:54Z

/lgtm

k8s-github-robot · 2017-04-27T19:58:21Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bowei, gnufied

Needs approval from an approver in each of these OWNERS Files:

~~pkg/cloudprovider/providers/gce/OWNERS~~ [bowei]

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

k8s-github-robot · 2017-04-27T23:14:57Z

Automatic merge from submit-queue (batch tested with PRs 44124, 44510)

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 14, 2017

k8s-github-robot assigned ghost and markturansky Apr 14, 2017

k8s-github-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Apr 14, 2017

bowei force-pushed the gce-metrics branch from 9039803 to 580992e Compare April 14, 2017 21:04

k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 14, 2017

k8s-ci-robot assigned nicksardo and saad-ali Apr 14, 2017

bowei force-pushed the gce-metrics branch 5 times, most recently from 41846b2 to 9ee5e0a Compare April 15, 2017 04:07

ghost removed their assignment Apr 18, 2017

k8s-ci-robot assigned gnufied Apr 18, 2017

saad-ali assigned brancz Apr 18, 2017

gnufied mentioned this pull request Apr 21, 2017

Start recording cloud provider metrics for AWS #43477

Merged

bowei force-pushed the gce-metrics branch from 9ee5e0a to 7758102 Compare April 26, 2017 22:34

bowei force-pushed the gce-metrics branch from 7758102 to 3f16302 Compare April 26, 2017 22:40

gnufied reviewed Apr 27, 2017

bowei force-pushed the gce-metrics branch 2 times, most recently from 3e369b2 to 862facc Compare April 27, 2017 01:00

bowei force-pushed the gce-metrics branch from 862facc to ee847eb Compare April 27, 2017 19:49

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 27, 2017

k8s-github-robot merged commit 09747e6 into kubernetes:master Apr 27, 2017

bowei deleted the gce-metrics branch September 5, 2017 20:15

huzhengchuan mentioned this pull request Oct 18, 2017

Supports monitoring metrics for network/instance operations in openstack cloud provider #50751

Closed
