Metric Instrumentation Framework #1767

torredil · 2023-09-29T15:35:26Z

What is this PR about? / Why do we need it?

This PR implements a framework that can be utilized to instrument driver internals and AWS API calls. Collected metrics are exposed via a Prometheus endpoint.

Changes:

Metrics package: For clearer separation of concerns and encapsulating metric-related functionality, metrics collection has been moved from cloud to a dedicated metrics package.
Singleton Metric Recorder: A singleton metricRecorder has been implemented to provide an organized way of recording and managing metrics across the driver. The recorder takes care of registering the metrics only once throughout the driver's lifetime and provides exported methods to record operation durations, errors, throttle events, and values.

The singleton is initialized lazily when Recorder() is called for the first time, which would be during driver startup if metrics are enabled via the enableMetrics parameter in our Helm chart:
```
if options.ServerOptions.HttpEndpoint != "" {
	r := metrics.Recorder()
	r.InitializeHttpHandler(options.ServerOptions.HttpEndpoint, "/metrics")
}
```
Any subsequent calls to Recorder() simply return the already initialized instance.
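
For readers unfamiliar with the pattern, lazy one-time initialization like this is commonly done in Go with sync.Once. The following is only a minimal sketch under that assumption, not the code in this PR: the counts field and the IncreaseCount / Count methods are hypothetical stand-ins so the example has shared state to show.

```go
package main

import (
	"fmt"
	"sync"
)

// metricRecorder is a stand-in for the driver's recorder. The counts map is
// hypothetical and only exists so the example has state to share.
type metricRecorder struct {
	mu     sync.Mutex
	counts map[string]int
}

var (
	once     sync.Once
	instance *metricRecorder
)

// Recorder returns the process-wide recorder, creating it on first use.
// sync.Once guarantees the initializer runs exactly once, even when
// several goroutines call Recorder() concurrently.
func Recorder() *metricRecorder {
	once.Do(func() {
		instance = &metricRecorder{counts: make(map[string]int)}
	})
	return instance
}

// IncreaseCount is a hypothetical recording method.
func (m *metricRecorder) IncreaseCount(name string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.counts[name]++
}

// Count returns how many times name has been recorded (hypothetical helper).
func (m *metricRecorder) Count(name string) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.counts[name]
}

func main() {
	// Both calls go through the same instance, so the counts accumulate.
	Recorder().IncreaseCount("AttachVolume")
	Recorder().IncreaseCount("AttachVolume")
	fmt.Println(Recorder() == Recorder(), Recorder().Count("AttachVolume"))
}
```

Any package can call Recorder() without the instance being threaded through constructors, which is the property the description relies on.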

As an alternative, dependency injection (DI) was also considered, but having to explicitly pass the metric recorder instance everywhere it is needed limits the scope of what can be instrumented. We could technically inject the recorder into the controller and node services, but (as a theoretical example) if future maintainers or users were to introduce a batching framework, or wanted to instrument deeply integrated code in the driver, they'd run into a wall. The singleton approach allows for simply importing the metrics package wherever needed and gives a consistent interface for metric recording throughout the driver. It also removes the possibility of registering metrics multiple times, along with other discrepancies.
HTTP Handler Initialization: The metrics package now takes care of setting up the HTTP server for metrics via InitializeHttpHandler. This abstracts the process of starting the metric server away from main and lets us remove the cloud import there as well.

What testing is done?

make test
Manual testing:

$ kubectl logs ebs-csi-controller-79f798ffbc-kxx6d -n kube-system

I0929 14:58:45.071800       1 driver.go:77] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.23.0"
I0929 14:58:45.071968       1 metadata.go:85] "retrieving instance data from ec2 metadata"
I0929 14:58:45.072450       1 metrics.go:134] "Metric server listening" address="0.0.0.0:3301" path="/metrics"

$ helm upgrade --install aws-ebs-csi-driver --namespace kube-system ./charts/aws-ebs-csi-driver --values ./charts/aws-ebs-csi-driver/values.yaml --set controller.enableMetrics=true

$ kubectl port-forward ebs-csi-controller-79f798ffbc-kxx6d 3301:3301 -n kube-system &

$ kubectl apply -f examples/kubernetes/dynamic-provisioning/manifests

$ curl 127.0.0.1:3301/metrics

# HELP cloudprovider_aws_api_request_duration_seconds [ALPHA] Latency of AWS API calls
# TYPE cloudprovider_aws_api_request_duration_seconds histogram
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="0.005"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="0.01"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="0.025"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="0.05"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="0.1"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="0.25"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="0.5"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="1"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="2.5"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="5"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="10"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="+Inf"} 1
cloudprovider_aws_api_request_duration_seconds_sum{request="AttachVolume"} 0.38017901
cloudprovider_aws_api_request_duration_seconds_count{request="AttachVolume"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.005"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.01"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.025"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.05"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.1"} 2
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.25"} 3
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.5"} 3
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="1"} 3
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="2.5"} 3
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="5"} 3
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="10"} 3
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="+Inf"} 3
cloudprovider_aws_api_request_duration_seconds_sum{request="DescribeInstances"} 0.270916661
cloudprovider_aws_api_request_duration_seconds_count{request="DescribeInstances"} 3
...

github-actions · 2023-09-29T15:42:54Z

Code Coverage Diff

File	Old Coverage	New Coverage	Delta
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/cloud/cloud.go	80.7%	80.9%	0.2
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/metrics/metrics.go	Does not exist	72.0%

torredil · 2023-09-29T15:44:42Z

/test pull-aws-ebs-csi-driver-test-helm-chart

AndrewSirenko · 2023-09-29T20:24:34Z

Can we add a PR Description?

Maybe something like:

Is this a bug fix or adding new feature?
Feature

What is this PR about? / Why do we need it?
Today, the EBS CSI Driver exposes AWS API metrics via a Prometheus endpoint.

However, currently we only expose AWS API Call latency, not driver operation latency numbers.

This PR will let the driver expose metrics for CSI RPC call latency (i.e. the time from the initiation of the ControllerPublishVolume RPC call, which triggers the attachment, to the moment the volume's "attached" state is confirmed).

Furthermore, this PR decouples the metrics implementation from the cloud package, instead instantiating a metricsRecorder singleton for the life of the driver.

What testing is done?
??

pkg/metrics/metrics.go

ConnorJC3

Top level comment: this current iteration feels weird to me.

It essentially enforces an "operation" style recording of metrics, and tries to stuff everything into that.

Does it really make sense to have one metric for every possible "duration" that could occur within the EBS CSI Driver?

This is just me spitballing, but I wonder if it would make sense for the caller to supply their own name of what the metric should be.

We could have a metric for csi_rpc_duration, for k8s_api_duration, for mkfs_duration, etc. Stuffing all those into a generic duration_seconds bucket feels very wrong and is antithetical to standard Prometheus design.

pkg/metrics/metrics.go

torredil · 2023-10-06T18:19:53Z

/test pull-aws-ebs-csi-driver-test-e2e-external-eks-windows

AndrewSirenko · 2023-10-06T18:57:05Z

/lgtm

AndrewSirenko

/lgtm

ConnorJC3

/lgtm

but consider the suggestions below:

pkg/metrics/metrics.go

pkg/metrics/metrics_test.go

Signed-off-by: Eddie Torres <torredil@amazon.com>

AndrewSirenko · 2023-10-10T18:26:24Z

/approve

k8s-ci-robot · 2023-10-10T18:26:31Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AndrewSirenko

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [AndrewSirenko]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ConnorJC3 · 2023-10-11T14:38:25Z

/lgtm

torredil · 2023-10-11T15:42:59Z

/test pull-aws-ebs-csi-driver-test-e2e-external-eks-windows

AndrewSirenko · 2023-10-11T17:08:40Z

/retest

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 29, 2023

k8s-ci-robot requested review from AndrewSirenko and ConnorJC3 September 29, 2023 15:35

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 29, 2023

AndrewSirenko reviewed Sep 29, 2023

torredil marked this pull request as draft September 29, 2023 22:01

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 29, 2023

torredil changed the title ~~Add metrics package~~ Metric Instrumentation Framework Sep 29, 2023

torredil force-pushed the metrics-0927364 branch 2 times, most recently from 4356a23 to 9b92246 Compare September 30, 2023 03:18

torredil marked this pull request as ready for review September 30, 2023 05:02

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 30, 2023

k8s-ci-robot requested a review from AndrewSirenko September 30, 2023 05:03

rdpsin reviewed Oct 3, 2023

torredil force-pushed the metrics-0927364 branch from 9b92246 to ceccd6b Compare October 3, 2023 15:34

rdpsin reviewed Oct 3, 2023

torredil force-pushed the metrics-0927364 branch from ceccd6b to 0afaf1f Compare October 4, 2023 15:47

ConnorJC3 reviewed Oct 5, 2023

torredil force-pushed the metrics-0927364 branch from 0afaf1f to d65c3eb Compare October 6, 2023 17:40

k8s-ci-robot assigned AndrewSirenko Oct 6, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 6, 2023

AndrewSirenko reviewed Oct 9, 2023

ConnorJC3 reviewed Oct 9, 2023

k8s-ci-robot assigned ConnorJC3 Oct 9, 2023

torredil force-pushed the metrics-0927364 branch from d65c3eb to ae2798b Compare October 9, 2023 14:57

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 9, 2023

Metric instrumentation framework

0f8c191

Signed-off-by: Eddie Torres <torredil@amazon.com>

torredil force-pushed the metrics-0927364 branch from ae2798b to 0f8c191 Compare October 9, 2023 15:06

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 10, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 11, 2023

k8s-ci-robot merged commit b7a6060 into kubernetes-sigs:master Oct 11, 2023
17 checks passed
