Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

apiserver: add a metric exposing etcd database size #89151

Merged
merged 2 commits into from
Mar 31, 2020

Conversation

jingyih
Copy link
Contributor

@jingyih jingyih commented Mar 16, 2020

What type of PR is this?
/kind feature

What this PR does / why we need it:
Adding a metric exposing etcd database file size. Example metric output:

# HELP etcd_db_total_size_in_bytes [ALPHA] Total size of the etcd database file physically allocated in bytes.
# TYPE etcd_db_total_size_in_bytes gauge
etcd_db_total_size_in_bytes{endpoint="http://127.0.0.1:2379"} 585728

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

kube-apiserver backed by etcd3 exports metric showing the database file size.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 16, 2020
@k8s-ci-robot
Copy link
Contributor

Hi @jingyih. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jingyih
Copy link
Contributor Author

jingyih commented Mar 16, 2020

cc @logicalhan @apelisse @wenjiaswe

@k8s-ci-robot k8s-ci-robot added area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 16, 2020
@neolit123
Copy link
Member

/ok-to-test
/priority backlog

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/backlog Higher priority than priority/awaiting-more-evidence. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 16, 2020
@jingyih
Copy link
Contributor Author

jingyih commented Mar 17, 2020

/retest

@answer1991
Copy link
Contributor

cc

@@ -28,7 +28,8 @@ const (
StorageTypeUnset = ""
StorageTypeETCD3 = "etcd3"

DefaultCompactInterval = 5 * time.Minute
DefaultCompactInterval = 5 * time.Minute
DefaultDbMetricPollInterval = 30 * time.Second
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please explain why is the polling interval set to 30s? Just curious if there is a specific reason.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How expensive is the call? If it's cheap, then have you considered using a GaugeFunc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wenjiaswe I don't have a specific reason that leads to 30s. A normal k8s cluster has 3 to 5 etcd servers, so 5 calls every 30s should be negligible. On the other hand I don't feel we need to update this metric very often as its value changes rather slowly.

@logicalhan Using GaugeFunc means we make rpc call to each etcd server when user hits metrics endpoint? This is probably too much latency?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, db size change is usually very slow, that why I was thinking and wonder if we want a longer interval. But I am fine with what it is now as long as it does not impact performance. Thanks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think a GaugeFunc is probably more idiomatic. You can cache the value and not call out to etcd if the value hasn't TTL'd yet (the TTL would be the same as the poll interval currently)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@logicalhan I did not find an easy way to incorporate GaugeFunc with a gauge vector metric?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it turns out it isn't exposed.

Comment on lines 180 to 181
fs.DurationVar(&s.StorageConfig.DbMetricPollInterval, "etcd-db-metric-poll-interval", s.StorageConfig.DbMetricPollInterval,
"The interval of requests to poll etcd and update metrics. 0 disables the metrics collection")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure if we want to add a flag for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unsure neither. Apiserver already has a lot of flags. I was thinking for some users running super heavy workloads, they might want to monitor this metric more often. If so, this flag could be useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI. Today we do have a similar flag: --etcd-count-metric-poll-period

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that it could be useful, most of the time db size is pretty stable, but there are cases where overuse of etcd would cause db size increase more often than others.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/metrics/metric/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/metrics/metric/

Done.

@@ -28,7 +28,8 @@ const (
StorageTypeUnset = ""
StorageTypeETCD3 = "etcd3"

DefaultCompactInterval = 5 * time.Minute
DefaultCompactInterval = 5 * time.Minute
DefaultDbMetricPollInterval = 30 * time.Second
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How expensive is the call? If it's cheap, then have you considered using a GaugeFunc?

@fedebongio
Copy link
Contributor

/assign @logicalhan

@wenjiaswe
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 18, 2020
Copy link
Member

@logicalhan logicalhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@@ -159,6 +159,7 @@ func TestAddFlags(t *testing.T) {
Prefix: "/registry",
CompactionInterval: storagebackend.DefaultCompactInterval,
CountMetricPollPeriod: time.Minute,
DbMetricPollInterval: storagebackend.DefaultDbMetricPollInterval,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: "DB"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you prefer DBSizePollInterval or the current DBMetricPollInterval? I am using DBMetricPollInterval hoping that we can reuse the flag if we add new etcd metrics in future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the goal is to compute multiple metrics, then the name is fine.

@@ -218,13 +229,19 @@ func newETCD3Storage(c storagebackend.Config) (storage.Interface, DestroyFunc, e
return nil, nil, err
}

stopDbMetricMonitor, err := startDbMetricMonitorPerEndpoint(client, c.DbMetricPollInterval)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DB. But shouldn't this name be stopDBSizeMonitor? Are we going to measure more than just the size?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to stopDBSizeMonitor.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 31, 2020
@lavalamp
Copy link
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 31, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jingyih, lavalamp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 31, 2020
@k8s-ci-robot k8s-ci-robot merged commit 0804667 into kubernetes:master Mar 31, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone Mar 31, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/backlog Higher priority than priority/awaiting-more-evidence. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants