New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve apiserver storage size metric #118812
Conversation
/cc @dgrisonnet |
@@ -238,6 +239,21 @@ func (s *EtcdOptions) ApplyWithStorageFactoryTo(factory serverstorage.StorageFac | |||
return err | |||
} | |||
|
|||
metrics.SetStorageMonitorGetter(func() (monitors []metrics.Monitor, err error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need this monster to avoid dependency cycle
staging/src/k8s.io/apiserver/pkg/storage/etcd3/metrics/metrics.go
Outdated
Show resolved
Hide resolved
da7ffff
to
64d761e
Compare
/retest |
/assign @jpbetz |
ping @deads2k @jpbetz @logicalhan |
ping @dgrisonnet |
dbTotalSize = compbasemetrics.NewGaugeVec( | ||
&compbasemetrics.GaugeOpts{ | ||
Subsystem: "apiserver", | ||
Name: "storage_db_total_size_in_bytes", | ||
Help: "Total size of the storage database file physically allocated in bytes.", | ||
StabilityLevel: compbasemetrics.ALPHA, | ||
}, | ||
[]string{"endpoint"}, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing this metric will be breaking change for some users, it would be safer if we leave a 1 release deprecation period before removing it completely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although this is what was decided when api stability was designed, we've been careful when removing old metrics without giving a grace period to the users.
IMO, we will need to continue being careful when removing alpha metrics and apply deprecation periods until we start moving metrics to beta.
I leave it up to @logicalhan to make a decision here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, but I would argue that old metrics are treated more like Beta and not Alpha. SIG instrumentation should make it explicit and just move them to Beta to avoid making exceptions. For now I treat this as an Alpha metric, thus removing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would deprecate it, people likely depend on this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do
} | ||
|
||
for i, m := range monitors { | ||
server := fmt.Sprintf("etcd-%d", i) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we will be able to correlate this label with any other one from the other metrics since AFAIK this is the only place we have something like that. So wouldn't having an endpoint like before be easier to use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it's third place to do that. It's also done by apiserver health probes and apiserver component status. As for usefulness of endpoints, it's worse as apiserver will expose all etcd members it connects to. For etcd HA clusters thats a 3 endpoints with the same value (+/- time shift). For each API override that's additional 3 endpoints.
The semantic meaning of server is:
etcd-0
the default etcdetcd-1
the first API Resource override, based on order of flags provided to apiserver.etcd-2
the second API Resource override, and so forth.
Of cause it should be documented. Maybe I can include it in metric description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, thank you for the clarification 👍
ping @deads2k @jpbetz @logicalhan |
158a056
to
78ad99b
Compare
78ad99b
to
d7b9dca
Compare
03a59b0
to
70312fe
Compare
Change name to make it compliant with prometheus guidelines. Calculate it on demand instead of periodic to comply with prometheus standards. Replace "endpoint" with "server" label to make it semantically consistent with storage factory
70312fe
to
7a63997
Compare
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
LGTM label has been added. Git tree hash: 7676176268460083184f1d9be59c3ab83caff518
|
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jpbetz, logicalhan, serathius The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/kind feature
/cc @mborsz @logicalhan @dgrisonnet
Most names are just temporary stand-ins, for now I wanted to confirm that making the metric on demand is even possible.
Goal of this PR is make changes to
apiserver_storage_db_total_size_in_bytes
necessary to graduate it in the future:kubernetes/pkg/registry/core/rest/storage_core.go
Lines 421 to 423 in 3e28404