Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Instrument metrics #645

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

cerberus20
Copy link

image

Copy link

linux-foundation-easycla bot commented Feb 29, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 29, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @cerberus20. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Feb 29, 2024
@dashpole
Copy link

dashpole commented Mar 7, 2024

/assign @dgrisonnet
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 7, 2024
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 26, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 27, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cerberus20
Once this PR has been reviewed and has the lgtm label, please ask for approval from dgrisonnet. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cerberus20
Copy link
Author

@dgrisonnet /ping

@cerberus20
Copy link
Author

@dgrisonnet Can you please help review? thank you.

Comment on lines 51 to 58
apiErrorCount = metrics.NewCounterVec(
&metrics.CounterOpts{
Namespace: "prometheus_adapter",
Subsystem: "prometheus_client",
Name: "api_errors_total",
Help: "Total number of API errors",
},
[]string{"error_code", "path", "server"},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to compute an error rate and produce an alert out of this metric, it is usually better to record all the request including the one that succeed.

A common practice is to have a http_requests_total metric with a code label. In you case I would suggest to have a api_requests_total metric that record all the requests and rename the label to code and set it to the http status code of the request.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack, makes sense, updated with api_requests_total. thanks for the suggestion.

Comment on lines 66 to 74
errRegisterQueryLatency := registry.Register(queryLatency)
if errRegisterQueryLatency != nil {
return nil, errRegisterQueryLatency
}

errRegisterAPIErrorCount := registry.Register(apiErrorCount)
if errRegisterAPIErrorCount != nil {
return nil, errRegisterAPIErrorCount
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we don't necessarily need to differentiate between the errors here, it is usually a common practice in Go to just have one err variable and reuse it for reach Register call.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated, thanks

Comment on lines 96 to 100
if apiErr, wasAPIErr := err.(*client.Error); wasAPIErr {
apiRequestsTotal.With(prometheus.Labels{"code": string(apiErr.Type), "path": endpoint, "server": c.serverName}).Inc()
} else {
apiRequestsTotal.With(prometheus.Labels{"code": "generic", "path": endpoint, "server": c.serverName}).Inc()
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I first looked at your PR I had forgotten that the client errors were wrapped https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/pkg/client/types.go#L27-L31 which is not ideal as it is less explicit to users than a normal http code would be. I feel like we should change some things in order to surface the http code and expose it in the metrics.

I'll be out till the 27th of May, but I'll try to have a look at that again when I am back.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dgrisonnet I updated code to add HTTPStatusCode to APIResponse to expose the http code in the metric. Can you please review? Thank you.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 15, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants