Metric Instrumentation Framework #1767

Merged
merged 1 commit into from
Oct 11, 2023

Member

@torredil torredil commented Sep 29, 2023

What is this PR about? / Why do we need it?

This PR implements a framework that can be utilized to instrument driver internals and AWS API calls. Collected metrics are exposed via a Prometheus endpoint.

Changes:

  1. Metrics package: To separate concerns more clearly and encapsulate metric-related functionality, metrics collection has been moved from cloud to a dedicated metrics package.

  2. Singleton Metric Recorder: A singleton metricRecorder has been implemented to provide an organized way of recording and managing metrics across the driver. The recorder takes care of registering the metrics only once throughout the driver's lifetime and provides exported methods to record operation durations, errors, throttle events, and values.

    The singleton is initialized lazily when Recorder() is called for the first time (which happens during driver startup if metrics are enabled via the enableMetrics parameter in our Helm chart):

    // Start the metrics server only when an HTTP endpoint is configured.
    if options.ServerOptions.HttpEndpoint != "" {
    	r := metrics.Recorder()
    	r.InitializeHttpHandler(options.ServerOptions.HttpEndpoint, "/metrics")
    }
    

    Any subsequent calls to Recorder() simply return the already initialized instance.

    As an alternative solution, DI was also considered, but having to explicitly pass the metric recorder instance everywhere it is needed limits the scope of what can be instrumented. We could technically inject the recorder into the controller and node service, but if, as a theoretical example, future maintainers or users were to introduce a batching framework or wanted to instrument deeply integrated code in the driver, they'd run into a wall. The singleton approach allows for simply importing the metrics package wherever needed and gives a consistent interface for metric recording throughout the driver. It also removes the possibility of registering metrics multiple times, along with other discrepancies.

  3. HTTP Handler Initialization: The metrics package now takes care of setting up the HTTP server for metrics via InitializeHttpHandler. This abstracts the process of starting the metric server away from main and lets us remove the cloud import there as well.


What testing is done?

  • make test
  • Manual testing:
$ kubectl logs ebs-csi-controller-79f798ffbc-kxx6d -n kube-system

I0929 14:58:45.071800       1 driver.go:77] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.23.0"
I0929 14:58:45.071968       1 metadata.go:85] "retrieving instance data from ec2 metadata"
I0929 14:58:45.072450       1 metrics.go:134] "Metric server listening" address="0.0.0.0:3301" path="/metrics"
$ helm upgrade --install aws-ebs-csi-driver --namespace kube-system ./charts/aws-ebs-csi-driver --values ./charts/aws-ebs-csi-driver/values.yaml --set controller.enableMetrics=true

$ kubectl port-forward ebs-csi-controller-79f798ffbc-kxx6d 3301:3301 -n kube-system &

$ kubectl apply -f examples/kubernetes/dynamic-provisioning/manifests

$ curl 127.0.0.1:3301/metrics

# HELP cloudprovider_aws_api_request_duration_seconds [ALPHA] Latency of AWS API calls
# TYPE cloudprovider_aws_api_request_duration_seconds histogram
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="0.005"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="0.01"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="0.025"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="0.05"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="0.1"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="0.25"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="0.5"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="1"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="2.5"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="5"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="10"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="AttachVolume",le="+Inf"} 1
cloudprovider_aws_api_request_duration_seconds_sum{request="AttachVolume"} 0.38017901
cloudprovider_aws_api_request_duration_seconds_count{request="AttachVolume"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.005"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.01"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.025"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.05"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.1"} 2
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.25"} 3
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.5"} 3
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="1"} 3
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="2.5"} 3
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="5"} 3
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="10"} 3
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="+Inf"} 3
cloudprovider_aws_api_request_duration_seconds_sum{request="DescribeInstances"} 0.270916661
cloudprovider_aws_api_request_duration_seconds_count{request="DescribeInstances"} 3
...
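
For reading the output above: Prometheus histogram buckets are cumulative, so each le count includes every faster request. The sketch below, with hypothetical helper names (bucket, perBucket, mean), converts a few of the DescribeInstances counts into per-bucket counts and derives the mean latency from the _sum and _count series:

```go
package main

import "fmt"

// bucket pairs an upper bound (the "le" label) with its cumulative count.
type bucket struct {
	le    string
	count uint64
}

// perBucket converts cumulative counts into the number of observations that
// fell into each individual bucket. Input must be ordered by ascending le.
func perBucket(cumulative []bucket) []bucket {
	out := make([]bucket, len(cumulative))
	var prev uint64
	for i, b := range cumulative {
		out[i] = bucket{le: b.le, count: b.count - prev}
		prev = b.count
	}
	return out
}

// mean divides the _sum series by the _count series, guarding against zero.
func mean(sum float64, count uint64) float64 {
	if count == 0 {
		return 0
	}
	return sum / float64(count)
}

func main() {
	// Subset of the DescribeInstances buckets copied from the scrape above.
	describe := []bucket{{"0.05", 0}, {"0.1", 2}, {"0.25", 3}, {"0.5", 3}}
	for _, b := range perBucket(describe) {
		fmt.Printf("le=%s: %d\n", b.le, b.count)
	}
	fmt.Printf("mean latency: %.4fs\n", mean(0.270916661, 3))
}
```

For the sample data this shows two DescribeInstances calls finishing between 0.05 s and 0.1 s, one between 0.1 s and 0.25 s, and a mean of roughly 0.09 s.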

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 29, 2023
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 29, 2023
@github-actions

github-actions bot commented Sep 29, 2023

Code Coverage Diff

File | Old Coverage | New Coverage | Delta
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/cloud/cloud.go | 80.7% | 80.9% | 0.2
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/metrics/metrics.go | Does not exist | 72.0% |

@torredil
Member Author

/test pull-aws-ebs-csi-driver-test-helm-chart

@AndrewSirenko
Contributor

Can we add a PR Description?

Maybe something like:

Is this a bug fix or adding new feature?
Feature

What is this PR about? / Why do we need it?
Today, the EBS CSI Driver exposes AWS API metrics via a Prometheus endpoint.

However, currently we only expose AWS API Call latency, not driver operation latency numbers.

This PR will let the driver expose metrics for CSI RPC call latency, i.e. the time between the initiation of the ControllerPublishVolume RPC call (which triggers the attachment) and the moment the volume's "attached" state is confirmed.

Furthermore, this PR decouples the metrics implementation from the cloud package, instead instantiating a metricsRecorder singleton for the life of the driver.

What testing is done?
??

pkg/metrics/metrics.go (outdated, resolved)
pkg/metrics/metrics.go (outdated, resolved)
@torredil torredil marked this pull request as draft September 29, 2023 22:01
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 29, 2023
@torredil torredil changed the title Add metrics package Metric Instrumentation Framework Sep 29, 2023
@torredil torredil force-pushed the metrics-0927364 branch 2 times, most recently from 4356a23 to 9b92246 Compare September 30, 2023 03:18
@torredil torredil marked this pull request as ready for review September 30, 2023 05:02
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 30, 2023
pkg/metrics/metrics.go (outdated, resolved)
pkg/metrics/metrics.go (outdated, resolved)
pkg/metrics/metrics.go (outdated, resolved)
Contributor

@ConnorJC3 ConnorJC3 left a comment

Top level comment: this current iteration feels weird to me.

It essentially enforces an "operation" style recording of metrics, and tries to stuff everything into that.

Does it really make sense to have one metric for every possible "duration" that could occur within the EBS CSI Driver?

This is just me spitballing, but I wonder if it would make sense for the caller to supply their own name of what the metric should be.

We could have a metric for csi_rpc_duration, for k8s_api_duration, for mkfs_duration, etc. Stuffing all of those into a generic duration_seconds bucket feels very wrong and is antithetical to standard Prometheus design.

pkg/metrics/metrics.go (outdated, resolved)
pkg/metrics/metrics.go (outdated, resolved)
pkg/metrics/metrics.go (outdated, resolved)
@torredil
Member Author

torredil commented Oct 6, 2023

/test pull-aws-ebs-csi-driver-test-e2e-external-eks-windows

@AndrewSirenko
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 6, 2023
Contributor

@AndrewSirenko AndrewSirenko left a comment

/lgtm

Contributor

@ConnorJC3 ConnorJC3 left a comment

/lgtm

but consider the suggestions below:

pkg/metrics/metrics.go (outdated, resolved)
pkg/metrics/metrics.go (outdated, resolved)
pkg/metrics/metrics_test.go (resolved)
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 9, 2023
Signed-off-by: Eddie Torres <torredil@amazon.com>
@AndrewSirenko
Contributor

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AndrewSirenko

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 10, 2023
@ConnorJC3
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 11, 2023
@torredil
Member Author

/test pull-aws-ebs-csi-driver-test-e2e-external-eks-windows

@AndrewSirenko
Contributor

/retest

@k8s-ci-robot k8s-ci-robot merged commit b7a6060 into kubernetes-sigs:master Oct 11, 2023
17 checks passed
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

5 participants