Add describe and limiting of HTTP exposition #110
Conversation
This works by adding an optional describe method to collectors, which returns data in the same format as collect (though hopefully without the samples). If present it is called at registration time. If describe is not present and auto_describe=True is set on the registry, then collect is called instead. This is enabled by default on the default registry, but disabled elsewhere. There's not much point doing extra checks for a transient registry used to push to a pushgateway, particularly if collection is expensive.
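The registration-time logic described above can be sketched with toy code. This is not the real prometheus_client internals; the class and attribute names here are illustrative only.

```python
class ToyRegistry:
    """Toy sketch of the describe/auto_describe registration logic."""

    def __init__(self, auto_describe=False):
        self.auto_describe = auto_describe
        self.known_names = set()

    def register(self, collector):
        if hasattr(collector, 'describe'):
            families = collector.describe()  # cheap: metadata only, no samples
        elif self.auto_describe:
            families = collector.collect()   # fallback: may be expensive
        else:
            families = []                    # collector stays undescribed
        self.known_names.update(f.name for f in families)


class Family:
    """Stand-in for a metric family (name only, no samples)."""

    def __init__(self, name):
        self.name = name


class DescribedCollector:
    def describe(self):
        return [Family('cheap_counter')]


registry = ToyRegistry(auto_describe=True)
registry.register(DescribedCollector())
```

Because the collector implements describe(), collect() is never invoked at registration time, regardless of the auto_describe setting.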
This works by specifying URL parameters called name[]. Only the required collectors will be called, and the output is then filtered down. This depends on the collectors sufficiently implementing describe.
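A rough sketch of the name[] filtering step, using only the standard library. The dict-based family representation is a stand-in for illustration, not the library's actual types.

```python
from urllib.parse import parse_qs


def requested_names(query_string):
    """Parse the name[] parameters out of a scrape URL's query string."""
    return set(parse_qs(query_string).get('name[]', []))


def restrict(families, names):
    """Toy filtering step: keep only families whose name was requested."""
    return [f for f in families if f['name'] in names]


families = [{'name': 'up'}, {'name': 'expensive_histogram'}]
names = requested_names('name[]=up&name[]=process_cpu_seconds_total')
filtered = restrict(families, names)  # only the 'up' family survives
```

In the real implementation only the collectors that (per their describe()) can produce a requested name would be called at all; the filter then drops any extra samples they emit.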
'''Returns object that only collects some metrics.

Returns an object which upon collect() will return
only samples with the given names.
Wasn't the plan to allow regexp matches for names, too? And inverted matches? (i.e. collect all metrics except those matching 'foo.*')?
This is v1, so I didn't add the regexes yet.
Inverted wasn't something I'd planned.
I would imagine it's a common use case: You want a scrape as "normal" as possible, just that one expensive metric with high cardinality should go.
I'm presuming that that'd be handled on the Prometheus side with metric relabelling.
Which would not avoid collecting the metric (which might be expensive) and sending it over the wire (which might be bandwidth limited).
I don't think I've ever come across the case where the cost of collecting, sending, receiving and rejecting the metric was relevant (beyond a slight dip in CPU when the metric was fixed).
If someone asks us for it we can consider it then.
I don't understand. If this is the case, why are we doing client-side restrictions (like in this PR) at all?
The cost has never been relevant when dealing with a high-cardinality metric for normal scraping.
It has been relevant when attempting to do high-frequency scraping of a very limited set of metrics.
Perhaps we are missing each other's point here. Let me try to rephrase:
- I wish to do high-frequency scraping.
- I have identified a single metric that is expensive for any of the following reasons:
  - It comprises 80% of the scrape volume, and I am bandwidth limited.
  - It is expensive to collect().
- Let's assume for the sake of your argument that it's usually the latter and not the former that matters. I could of course hand-pick and whitelist the metrics I'm interested in. But why do that if I can do my high-frequency scraping without problems by just blacklisting the one metric known to be expensive to collect()?
IIRC the mysqld_exporter has exactly that case. So what we do right now is to switch certain metrics on and off on the command line. At SC, we run separate high-frequency collectors with fewer metrics activated. It would be great if we could just tell the collector via URL params to not collect a selected few expensive-to-collect metrics for high-frequency scraping, rather than running separate exporter instances with different command line flags.
If I understand you correctly, you are claiming that the high-frequency scrape case will naturally result in scraping a whitelisted set of metrics. But if I understand the mysqld_exporter use case correctly, it's the opposite: One tends to blacklist known expensive metrics and wants to collect all others with high frequency. @SuperQ please correct me if I'm wrong.
If I understand you correctly, you are claiming that the high-frequency scrape case will naturally result in scraping a whitelisted set of metrics.
Yes. Given the data volumes involved with such systems no other way is practical, and I've never seen any other approach taken. Usually you'd be whitelisting on the order of 10 metrics, and with a blacklisting approach the organic growth of metrics would be sufficient to take out such a system.
To be honest, if you've a system where one metric comprises 80% of data volume you've bigger problems (I came across that exact case once, and greatly decreased the metric cardinality).
But if I understand the mysqld_exporter use case correctly, it's the opposite: One tends to blacklist known expensive metrics and wants to collect all others with high frequency.
That's not my understanding. The mysqld exporter isn't covered by this anyway, it's sufficiently weird that it can't implement describe.
(which is the case for the default registry) then `collect` will be called at
registration time instead of `describe`. If this could cause problems, either
implement a proper `describe`, or if that's not practical have `describe`
return an empty list.
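The escape hatch the quoted doc describes might look like this hypothetical collector (sketched, not taken from the PR):

```python
class UndescribableCollector:
    """Hypothetical collector that cannot enumerate its metrics up front."""

    def describe(self):
        # Empty list: avoids a collect() at registration time, and opts this
        # collector out of restricted (name[]-filtered) scrapes entirely.
        return []

    def collect(self):
        # Expensive discovery and collection would happen here,
        # on full (unrestricted) scrapes only.
        return []
```

The trade-off this choice implies is exactly what the following discussion is about: an empty describe() keeps registration and restricted scrapes cheap, at the cost of the collector's metrics silently missing from any restricted scrape.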
Is the intention that in the latter case, collectors returning an empty list will never be included by a restricted_registry?
Yes. My thinking is that for something like JMX, which can be quite expensive, it's best not to call it at all rather than silently calling it in the background when users think they're only fetching a few cheap counters.
While I understand the resource argument, it's opaque to the user of a /metrics endpoint which collectors describe themselves properly, and thus the behavior might appear quite puzzling. Between "accidentally using a lot of resources" and "accidentally and silently not getting expected metrics" I'd prefer the former.
The expected use case for this is high-frequency polling of a very small number of metrics. Using a lot of resources in such a case would be bad.
But I would notice – because it would defeat the purpose of restricting the metrics.
If some metrics are silently missing, I might not notice. I'm imagining the Postmortem review right now...
You'd be specifying a small number of names by hand, so you'd likely notice.
The logic here is that calling collect() at registration time is problematic, so we should also tend not to call it when the collector has asked us not to include it in this feature.
I think we have reached the same point as in the thread above: Apparently, we see the use case exactly inverted, see there.
However, the point about an expensive collect() at registration time is different. The doc comment reads: "… if that's not practical have describe return an empty list." But then describe is not expensive to call, and collect won't be called at registration time because describe is implemented. No?
OK, after some background processing, I think I got your last point now. Your assumption is essentially that collectors that have problems describing themselves will also be those that are expensive to collect().
That's true for the JMX case, I guess, but not in general.
I see the value of allowing collectors to say "I cannot describe myself but I'm also expensive to collect, so please never collect me if the user has restricted the scrape in any way". But I also see the potential for bad surprises, and I guess the feeling is so strong for me because I see "blacklist a few metrics" as a common use case. If some seemingly unrelated metrics suddenly disappear just because I blacklisted a single one, that feels weird.
I would hope that non-describing collectors will be rare. In direct instrumentation, they should never occur, essentially. It's mostly an exporter thing, so we should encourage clear documentation about metrics not being collected if any restriction is given as a scrape parameter.
You assumption is essentially that collectors that have problems describing themselves will also be those that are expensive to collect().
Yes, or impossible to describe(). JMX is both impossible and expensive. A one-off collect() at startup would be tolerable (ignoring that it breaks with config reloading), but calling it at every collection would be bad.
If suddenly some seemingly unrelated metrics disappear just because I blacklisted a single one, feels weird.
The whitelist approach doesn't really have that issue. Returning an empty describe signals that you don't support this feature.
It's mostly an exporter thing, so we should encourage clear documentation about metrics not being collected if any restriction is given as a scrape parameter.
Yeah, this only comes up with custom collectors.
OK, I'll try to summarize the threads above here:
That sounds about right. Do you have any comments on the spelling of the HTTP API, or thoughts on the restricted_registry? I think the former is fine, but I'm not perfectly happy with the latter.
@SuperQ could you comment on our current use of the hires MySQL exporter? Is that just a few selected metrics we scrape for it, or is it the other way round, i.e. we just excluded the expensive metrics? I've just looked at our actual setup. The normal MySQL exporter is about 6000 data points per scrape, the hires exporter only about 1000. Still, it would be good to get @SuperQ's idea of which approach would be preferable in that real-life scenario. WRT the HTTP API: I would model it after the federation API, i.e. use
For the hires case we select a few specific metrics that we care about: some short spiky gauges, some high-churn counters. With the MySQL exporter, we need to know ahead of time, at the scrape request, which collector functions we need to call. This is to avoid accessing expensive or locking collectors. So I guess it's a little of both.
What I'm thinking will happen with mysqld_exporter is that it'd have URL parameters like
Which would mean we would go for pretty customized behavior anyway. Without the necessity (or ability) to define a universally used API, I have no problem with deferring negative match until somebody shows up with a use case.
Sounds like we're in rough agreement then. Any other comments?
Final call, any comments on this? I intend to submit tomorrow.
This is the first pass on limiting what's returned over HTTP exposition via a URL parameter. We'll likely need to support POST as well in the future to get around URL length limits.
@juliusv @fabxc @beorn7 @SuperQ
Fixes #97