Update etcdRequestLatency metrics bucket size #107042

kkkkun · 2021-12-15T03:37:39Z

What type of PR is this?

/kind flake
/release-note-none

What this PR does / why we need it:

The apiserver has two latency metrics, etcdRequestLatency and requestLatencies.
However, they use different bucket sizes.

In a large cluster, it is difficult to tell whether a long latency was spent in etcd or in the apiserver. Keeping the bucket sizes consistent would make such problems easier to diagnose.
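
To illustrate why differing bucket layouts make the two histograms hard to compare, here is a minimal Python sketch. The bucket bounds are hypothetical values chosen for illustration, not the actual bounds of either metric, and `bucket_interval` is an illustrative helper, not anything in the apiserver.

```python
import bisect
import math

# Hypothetical bucket upper bounds in seconds; NOT the real Kubernetes values.
COARSE_BOUNDS = [0.005, 0.025, 0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 15.0, 30.0, 60.0]
FINE_BOUNDS = [0.005, 0.025, 0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 1.25, 1.5,
               2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0, 15.0, 20.0, 30.0, 45.0, 60.0]


def bucket_interval(value, bounds):
    """Return the (lower, upper] interval of the histogram bucket holding value."""
    # Upper bounds are inclusive, so find the first bound >= value.
    i = bisect.bisect_left(bounds, value)
    lower = bounds[i - 1] if i > 0 else 0.0
    upper = bounds[i] if i < len(bounds) else math.inf  # overflow goes to +Inf
    return (lower, upper)


if __name__ == "__main__":
    latency = 0.7  # seconds, an assumed observation
    print("coarse:", bucket_interval(latency, COARSE_BOUNDS))
    print("fine:  ", bucket_interval(latency, FINE_BOUNDS))
```

With these assumed layouts, the coarse histogram can only place a 0.7 s observation in (0.5, 1.0], while the fine one narrows it to (0.6, 0.8]. The two metrics therefore bound the same latency differently unless they share bucket bounds.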

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot · 2021-12-15T03:37:47Z

Hi @kkkkun. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

kkkkun · 2021-12-15T03:38:28Z

/release-note-none

dgrisonnet · 2021-12-15T13:00:18Z

These changes look good from my side, although I wonder whether we really need that much granularity for etcd.

That said, the number of buckets allocated to the apiserver latency metric was recently reduced from 40 to 20, so I think it is reasonable to reconsider aligning the buckets of both metrics, since that is what we had in the past. I would still like to hear others' opinions, given the past effort to reduce cardinality: #96754.

/cc @wojtek-t @tkashem

tkashem · 2021-12-15T13:45:16Z

I had a PR to increase the buckets, #94134 (similar to this PR), but then I had to reduce them in #96754.

I agree that keeping the bucket sizes similar makes these metrics more comparable, but we have to weigh that against the performance impact. The cardinality explosion was noticeable; we saw it in OpenShift. That's why I reduced it.

Also, I am not sure how directly comparable they can be: an API call may result in multiple calls to storage. @wojtek-t will know the answer to this.

On a related note, I am working on a PR to add the latency incurred in the storage layer as an annotation in the audit logs. If you mine the audit logs, you will get a clear picture of what portion of the latency is spent in the storage layer, the admission layer, and object transformation.

kkkkun · 2021-12-15T14:08:44Z

PR #94134 has 41 buckets, and PR #96754 has 11 buckets.
This PR has 21 buckets.

aojea · 2021-12-15T14:19:58Z

Do we need more than 11 buckets?

kkkkun · 2021-12-15T14:46:46Z

The bucket size itself is not the issue. We just want to keep the bucket size consistent with requestLatencies.

tkashem · 2021-12-15T15:25:10Z

> PR #94134 has 41 buckets, and PR #96754 has 11 buckets.
> This PR has 21 buckets.

Even that comes with a cost. Please note that type (a label in this metric) is pretty much unbounded: it includes the built-in types plus the CRDs in the cluster. IMO, this is one of the most expensive metrics we have.

It would be nice to see how this change impacts CPU utilization or memory footprint on a cluster, and how it grows with an increasing number of CRDs.
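
The cardinality cost described here can be approximated with a rough series count. The sketch below assumes the standard Prometheus histogram exposition (one `_bucket` series per configured bound, an implicit `+Inf` bucket, plus `_sum` and `_count`) and treats label combinations as the product of operations and types, which is an upper bound. The operation and type counts are made-up inputs, and treating the bucket counts quoted in the thread as configured bounds is also an assumption.

```python
def histogram_series(num_bounds, label_combinations):
    """Approximate number of time series exposed by one Prometheus histogram."""
    # num_bounds configured buckets + implicit +Inf bucket + _sum + _count
    return label_combinations * (num_bounds + 3)


def series_by_bucket_count(bucket_counts, operations, types):
    """Map each bucket count to the estimated series for operations x types."""
    combos = operations * types  # upper bound: not every pair need occur
    return {n: histogram_series(n, combos) for n in bucket_counts}


if __name__ == "__main__":
    # Hypothetical cluster: 6 storage operations, 50 built-in types,
    # and a growing number of CRD types.
    for crds in (0, 100, 500):
        estimate = series_by_bucket_count((11, 21, 41), operations=6, types=50 + crds)
        print(f"crds={crds}: {estimate}")
```

Under these assumptions the series count grows linearly in both the bucket count and the number of types, so any increase in buckets is multiplied across every CRD in the cluster.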

fedebongio · 2021-12-16T21:06:16Z

/assign @jpbetz
/triage accepted

dgrisonnet · 2022-01-18T11:36:09Z

/lgtm
/assign @logicalhan

kkkkun · 2022-02-15T11:47:57Z

@logicalhan Hi, could you please give some advice?

k8s-triage-robot · 2022-05-16T12:38:59Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2022-06-15T13:18:09Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · 2022-07-15T13:56:44Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue or PR with /reopen
Mark this issue or PR as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot · 2022-07-15T13:57:01Z

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue or PR with /reopen

Mark this issue or PR as fresh with /remove-lifecycle rotten

Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

logicalhan · 2022-07-15T18:03:46Z

/reopen
/remove-lifecycle stale

k8s-ci-robot · 2022-07-15T18:04:05Z

@logicalhan: Reopened this PR.

In response to this:

/reopen
/remove-lifecycle stale

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

logicalhan · 2022-07-15T18:04:20Z

/remove-lifecycle rotten

logicalhan

/lgtm
/approve

k8s-ci-robot · 2022-07-15T18:06:08Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kkkkun, logicalhan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~staging/src/k8s.io/apiserver/pkg/endpoints/metrics/OWNERS~~ [logicalhan]
~~staging/src/k8s.io/apiserver/pkg/storage/etcd3/metrics/OWNERS~~ [logicalhan]
~~test/instrumentation/testdata/OWNERS~~ [logicalhan]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-triage-robot · 2022-07-15T18:57:10Z

This PR may require stable metrics review.

Stable metrics are guaranteed to not change. Please review the documentation for the requirements and lifecycle of stable metrics and ensure that your metrics meet these guidelines.

k8s-triage-robot · 2022-07-15T21:47:10Z

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

The PR does not have any do-not-merge/* labels
The PR does not have the needs-ok-to-test label
The PR is mergeable (does not have a needs-rebase label)
The PR is approved (has cncf-cla: yes, lgtm, approved labels)
The PR is failing tests required for merge

You can:

Review the full test history for this PR
Prevent this bot from retesting with /lgtm cancel or /hold
Help make our tests less flaky by following our Flaky Tests Guide

/retest

k8s-triage-robot · 2022-07-16T01:28:08Z

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

The PR does not have any do-not-merge/* labels
The PR does not have the needs-ok-to-test label
The PR is mergeable (does not have a needs-rebase label)
The PR is approved (has cncf-cla: yes, lgtm, approved labels)
The PR is failing tests required for merge

You can:

Review the full test history for this PR
Prevent this bot from retesting with /lgtm cancel or /hold
Help make our tests less flaky by following our Flaky Tests Guide

/retest

k8s-ci-robot · 2022-07-16T04:13:56Z

@kkkkun: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kubernetes-e2e-gce-ubuntu-containerd | `fb372d0` | link | unknown | `/test pull-kubernetes-e2e-gce-ubuntu-containerd` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-triage-robot · 2022-07-16T06:28:08Z

The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass.

This bot retests PRs for certain kubernetes repos according to the following rules:

The PR does not have any do-not-merge/* labels
The PR does not have the needs-ok-to-test label
The PR is mergeable (does not have a needs-rebase label)
The PR is approved (has cncf-cla: yes, lgtm, approved labels)
The PR is failing tests required for merge

You can:

Review the full test history for this PR
Prevent this bot from retesting with /lgtm cancel or /hold
Help make our tests less flaky by following our Flaky Tests Guide

/retest

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Dec 15, 2021

k8s-ci-robot requested review from hongchaodeng and timothysc December 15, 2021 03:37

k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Dec 15, 2021

k8s-ci-robot requested review from tkashem and wojtek-t December 15, 2021 13:00

k8s-ci-robot assigned jpbetz Dec 16, 2021

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 16, 2021

k8s-ci-robot assigned logicalhan and dgrisonnet Jan 18, 2022

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 18, 2022

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 16, 2022

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 15, 2022

k8s-ci-robot closed this Jul 15, 2022

k8s-ci-robot reopened this Jul 15, 2022

k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 15, 2022

logicalhan reviewed Jul 15, 2022

k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. area/stable-metrics Issues or PRs involving stable metrics labels Jul 15, 2022

k8s-ci-robot merged commit e841000 into kubernetes:master Jul 16, 2022

k8s-ci-robot added this to the v1.25 milestone Jul 16, 2022

kkkkun mentioned this pull request Jul 16, 2022

REQUEST: New membership for @kkkkun kubernetes/org#3561

Closed

9 tasks
