apiserver: add a metric exposing etcd database size #89151
Hi @jingyih. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/ok-to-test
03f6548 to 3c8ce5e
/retest
cc
@@ -28,7 +28,8 @@ const (
 	StorageTypeUnset = ""
 	StorageTypeETCD3 = "etcd3"

-	DefaultCompactInterval = 5 * time.Minute
+	DefaultCompactInterval      = 5 * time.Minute
+	DefaultDbMetricPollInterval = 30 * time.Second
Could you please explain why the polling interval is set to 30s? Just curious whether there is a specific reason.
How expensive is the call? If it's cheap, then have you considered using a GaugeFunc?
@wenjiaswe I don't have a specific reason for choosing 30s. A normal k8s cluster has 3 to 5 etcd servers, so at most 5 calls every 30s should be negligible. On the other hand, I don't think we need to update this metric very often, as its value changes rather slowly.
@logicalhan Doesn't using GaugeFunc mean we make an RPC call to each etcd server whenever a user hits the metrics endpoint? That would probably add too much latency.
Yes, the db size usually changes very slowly, which is why I was wondering whether we want a longer interval. But I am fine with the current value as long as it does not impact performance. Thanks.
I still think a GaugeFunc is probably more idiomatic. You can cache the value and not call out to etcd if the value hasn't TTL'd yet (the TTL would be the same as the current poll interval).
@logicalhan I did not find an easy way to use GaugeFunc with a gauge vector metric.
Yeah, it turns out it isn't exposed.
fs.DurationVar(&s.StorageConfig.DbMetricPollInterval, "etcd-db-metric-poll-interval", s.StorageConfig.DbMetricPollInterval,
	"The interval of requests to poll etcd and update metrics. 0 disables the metrics collection")
I'm unsure if we want to add a flag for this.
I am not sure either. The apiserver already has a lot of flags. My thinking was that users running very heavy workloads might want to monitor this metric more often. If so, this flag could be useful.
FYI. Today we do have a similar flag: --etcd-count-metric-poll-period
Agree that it could be useful. Most of the time the db size is pretty stable, but there are cases where overuse of etcd causes the db size to increase more frequently.
s/metrics/metric/
s/metrics/metric/
Done.
/assign @logicalhan
/lgtm
/lgtm
@@ -159,6 +159,7 @@ func TestAddFlags(t *testing.T) {
 	Prefix:                "/registry",
 	CompactionInterval:    storagebackend.DefaultCompactInterval,
 	CountMetricPollPeriod: time.Minute,
+	DbMetricPollInterval:  storagebackend.DefaultDbMetricPollInterval,
Nit: "DB"
Do you prefer DBSizePollInterval or the current DBMetricPollInterval? I am using DBMetricPollInterval hoping that we can reuse the flag if we add new etcd metrics in future.
If the goal is to compute multiple metrics, then the name is fine.
@@ -218,13 +229,19 @@ func newETCD3Storage(c storagebackend.Config) (storage.Interface, DestroyFunc, e
 		return nil, nil, err
 	}

+	stopDbMetricMonitor, err := startDbMetricMonitorPerEndpoint(client, c.DbMetricPollInterval)
DB. But shouldn't this name be stopDBSizeMonitor? Are we going to measure more than just the size?
Renamed to stopDBSizeMonitor.
3c8ce5e to 313d677
313d677 to e15c49f
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jingyih, lavalamp. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind feature
What this PR does / why we need it:
Adding a metric exposing etcd database file size. Example metric output:
Which issue(s) this PR fixes:
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/sig api-machinery