Added requestSloLatencies metric #105890

Merged


pawbana
Contributor

@pawbana pawbana commented Oct 25, 2021

What type of PR is this?

/kind feature

What this PR does / why we need it:

The current apiserver_request_duration_* metrics measure the whole request duration, including time spent processing webhooks. Webhook duration depends mostly on user configuration, so it should not be counted when considering request latency SLOs.

Which issue(s) this PR fixes:

This PR introduces a better metric for measuring request latency SLOs.

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot added the do-not-merge/work-in-progress, size/L, do-not-merge/release-note-label-needed, cncf-cla: yes, do-not-merge/needs-kind, do-not-merge/needs-sig, and needs-triage labels Oct 25, 2021
@k8s-ci-robot
Contributor

Hi @pawbana. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-ok-to-test, needs-priority, area/apiserver, sig/api-machinery, and sig/instrumentation labels and removed the do-not-merge/needs-sig label Oct 25, 2021
@k8s-ci-robot added the release-note-none label and removed the do-not-merge/release-note-label-needed label Oct 26, 2021
@pawbana
Contributor Author

pawbana commented Oct 26, 2021

/cc @marseel
/cc @wojtek-t

@wojtek-t
Member

/ok-to-test

I'm super interested in that metric (as something that really reflects the health of the control plane and is not skewed by badly set up webhooks).
But I would like to get some feedback from sig-apimachinery folks.

@sttts @deads2k - thoughts?

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label Oct 26, 2021
@pawbana pawbana force-pushed the added_request_slo_latency_metric branch 3 times, most recently from 532d6e7 to 5d0a3d7 Compare October 26, 2021 16:13
@k8s-ci-robot added the area/test and sig/testing labels Oct 26, 2021
@pawbana pawbana force-pushed the added_request_slo_latency_metric branch from 0aef7dc to 0c54cfe Compare November 10, 2021 16:18
@@ -263,7 +264,13 @@ func (a *mutatingDispatcher) callAttrMutatingHook(ctx context.Context, h *admiss
}
}

if err := r.Do(ctx).Into(response); err != nil {
wd, ok := endpointsrequest.WebhookDurationFrom(ctx)
Member

How about:

do := func() { err = r.Do(ctx).Into(response) }
if wd, ok := endpointsrequest.WebhookDurationFrom(ctx); ok {
  tmp := do
  // Wrap the original call so the tracker measures it.
  do = func() { wd.AdmitTracker.Track(tmp) }
}
do()
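
The wrapping pattern in this suggestion can be shown in isolation. The following is a minimal sketch, assuming a hypothetical `tracker` type whose `Track` method times a function; the tracker in the PR may be shaped differently.

```go
package main

import (
	"fmt"
	"time"
)

// tracker is a hypothetical stand-in for the PR's duration tracker.
type tracker struct {
	latency time.Duration
	calls   int
}

// Track runs f and adds its wall-clock duration to the running total.
func (t *tracker) Track(f func()) {
	start := time.Now()
	f()
	t.latency += time.Since(start)
	t.calls++
}

// run executes work, wrapping it with the tracker only when one is present.
func run(t *tracker, work func()) {
	do := work
	if t != nil {
		tmp := do // keep the unwrapped function so the closure does not call itself
		do = func() { t.Track(tmp) }
	}
	do()
}

func main() {
	tr := &tracker{}
	run(tr, func() { time.Sleep(10 * time.Millisecond) })
	run(nil, func() {}) // no tracker: the work still runs, untimed
	fmt.Println(tr.calls, tr.latency >= 10*time.Millisecond)
}
```

Copying `do` into `tmp` before reassigning matters: a closure that called `do` directly would capture the variable and recurse into itself.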

}

func WithWebhookDurationAndCustomClock(parent context.Context, c clock.Clock) context.Context {
return WithValue(parent, webhookDurationKey, WebhookDuration{
Member

I would actually store a pointer in the context.

[And then you just return "nil, false" in WebhookDurationFrom below]

Contributor Author

Done

@pawbana pawbana force-pushed the added_request_slo_latency_metric branch 4 times, most recently from b1ff221 to d6bc59e Compare November 12, 2021 14:56
type durationTracker struct {
clock clock.Clock
latency time.Duration
mu *sync.RWMutex
Member

Don't make it a pointer; just make it a sync.Mutex.

[And don't use RWMutex: we don't expect any contention on reads here, and RWMutex is more expensive than Mutex.]

@pawbana pawbana force-pushed the added_request_slo_latency_metric branch 2 times, most recently from f5918c2 to dc4a8df Compare November 12, 2021 19:10
wd, ok := request.WebhookDurationFrom(ctx)
if !ok {
if test.InitContext {
t.Errorf("expected webhook duraiton to be initialized")
Member

nit: duration (same below)

Contributor Author

Thanks, fixed.

@pawbana pawbana force-pushed the added_request_slo_latency_metric branch 2 times, most recently from 7ad0c53 to fc0f7be Compare November 15, 2021 09:46
@@ -468,6 +482,10 @@ func MonitorRequest(req *http.Request, verb, group, version, resource, subresour
}
}
requestLatencies.WithContext(req.Context()).WithLabelValues(reportedVerb, dryRun, group, version, resource, subresource, scope, component).Observe(elapsedSeconds)
wd, ok := request.WebhookDurationFrom(req.Context())
Member

nit:

if wd, ok := ... ; ok {

@@ -468,6 +482,10 @@ func MonitorRequest(req *http.Request, verb, group, version, resource, subresour
}
}
requestLatencies.WithContext(req.Context()).WithLabelValues(reportedVerb, dryRun, group, version, resource, subresource, scope, component).Observe(elapsedSeconds)
wd, ok := request.WebhookDurationFrom(req.Context())
if ok {
requestSloLatencies.WithContext(req.Context()).WithLabelValues(reportedVerb, group, version, resource, subresource, scope).Observe(elapsedSeconds - (wd.AdmitTracker.GetLatency() + wd.ValidateTracker.GetLatency()).Seconds())
Member

nit:

sloLatency := elapsedSeconds - (wd.AdmitTracker.GetLatency() + wd.ValidateTracker.GetLatency()).Seconds()
requestSLOLatencies...

@pawbana pawbana force-pushed the added_request_slo_latency_metric branch from fc0f7be to 2446939 Compare November 15, 2021 10:49
@pawbana pawbana force-pushed the added_request_slo_latency_metric branch from 2446939 to 0afa569 Compare November 15, 2021 11:11
@wojtek-t added the kind/cleanup and priority/important-soon labels and removed the do-not-merge/needs-kind and needs-priority labels Nov 15, 2021
@wojtek-t
Member

/lgtm
/approve

Thanks!

@k8s-ci-robot added the lgtm label Nov 15, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pawbana, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label Nov 15, 2021
@k8s-ci-robot k8s-ci-robot merged commit 1e6f3b5 into kubernetes:master Nov 15, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.23 milestone Nov 15, 2021
@pawbana pawbana deleted the added_request_slo_latency_metric branch December 22, 2021 14:11
aiyengar2 added a commit to aiyengar2/charts that referenced this pull request Oct 24, 2022
…available in <1.23.0-0 clusters

In upstream Kubernetes (specifically in this commit kubernetes/apiserver@9144ea1), starting with k8s 1.23, a new version of the apiserver_request_duration_seconds metric was released, called apiserver_request_slo_duration_seconds.

As noted in the pull request kubernetes/kubernetes#105890, the fundamental difference between the newer and older metric is that the newer metric excludes the processing time spent by webhooks that the API server calls out to when configured as part of a MutatingWebhookConfiguration or ValidatingWebhookConfiguration.

However, since this metric was only introduced in k8s 1.23, this breaks the API Server dashboards in clusters <1.23.0 since that metric does not exist.

Therefore, to fix this issue in older clusters using the latest Monitoring chart, specific logic has been added to the chart to utilize the older version of the metric when the Kubernetes version is detected to be <1.23.
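
The version check described in this commit message can be sketched as follows. This is an illustration in Go rather than the chart's actual template logic, and `requestDurationMetric` is a hypothetical helper; only the two metric names and the 1.23 threshold come from the text above.

```go
package main

import "fmt"

const (
	legacyMetric = "apiserver_request_duration_seconds"
	sloMetric    = "apiserver_request_slo_duration_seconds"
)

// requestDurationMetric is a hypothetical helper that picks the metric a
// dashboard should query, based on the cluster's Kubernetes version.
func requestDurationMetric(major, minor int) string {
	if major > 1 || (major == 1 && minor >= 23) {
		return sloMetric // webhook time excluded; exists from 1.23 on
	}
	return legacyMetric // older clusters only expose the original metric
}

func main() {
	fmt.Println(requestDurationMetric(1, 22))
	fmt.Println(requestDurationMetric(1, 23))
}
```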
aiyengar2 added a commit to aiyengar2/charts that referenced this pull request Oct 25, 2022
Labels
approved, area/apiserver, area/test, cncf-cla: yes, kind/cleanup, lgtm, ok-to-test, priority/important-soon, release-note-none, sig/api-machinery, sig/instrumentation, sig/testing, size/XL, triage/accepted