
Add describe and limiting of HTTP exposition #110

Merged (2 commits) on Nov 16, 2016

Conversation

brian-brazil (Contributor):

This is the first pass at limiting what's returned over HTTP exposition via a URL parameter. We'll likely need to support POST as well in future to get around URL length limits.

@juliusv @fabxc @beorn7 @SuperQ

Fixes #97

This works by adding an optional describe method to
collectors, which returns data in the same format as collect (though
hopefully without the samples).
If present it is called at registration time.

If describe is not present and auto_describe=True is set on the
registry, then collect is called instead. This is enabled by default on
the default registry, but disabled elsewhere. There's not much
point doing extra checks for a transient registry used to
push to a pushgateway, particularly if collection is expensive.
This works by specifying URL parameters called name[].
Only the required collectors will be called, and
their output is then filtered down.

This depends on the collectors sufficiently implementing
describe.
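The mechanism above can be sketched with a minimal custom collector; the `QueueCollector` class and the `queue_depth` metric are invented for illustration, not code from this PR:

```python
from prometheus_client import CollectorRegistry, generate_latest
from prometheus_client.core import GaugeMetricFamily

class QueueCollector(object):
    """Custom collector with the optional describe() method."""

    def describe(self):
        # Same format as collect(), but without samples; the registry
        # calls this at registration time instead of collect().
        return [GaugeMetricFamily('queue_depth', 'Items waiting in the queue')]

    def collect(self):
        depth = 42  # stand-in for an expensive measurement
        family = GaugeMetricFamily('queue_depth', 'Items waiting in the queue')
        family.add_metric([], depth)
        return [family]

registry = CollectorRegistry()
registry.register(QueueCollector())  # calls describe(), not collect()
```

Because describe() is cheap, registration never triggers the expensive measurement.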
'''Returns object that only collects some metrics.

Returns an object which upon collect() will return
only samples with the given names.
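A usage sketch of the name-restricted collection this docstring describes, assuming the `restricted_registry` helper added by this PR; the metric names are invented:

```python
from prometheus_client import CollectorRegistry, Gauge, generate_latest

registry = CollectorRegistry()
cheap = Gauge('process_open_fds', 'Open file descriptors', registry=registry)
expensive = Gauge('table_scan_seconds', 'Slow to measure', registry=registry)
cheap.set(7)
expensive.set(3.5)

# Only collectors that registered the requested names are called;
# their output is then filtered down to exactly those samples.
restricted = registry.restricted_registry(['process_open_fds'])
output = generate_latest(restricted)
```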
Member:

Wasn't the plan to allow regexp matches for names, too? And inverted matches? (i.e. collect all metrics except those matching 'foo.*')?

brian-brazil (author):

This is v1, so I didn't add the regexes yet.

Inverted wasn't something I'd planned.

Member:

I would imagine it's a common use case: You want a scrape as "normal" as possible, just that one expensive metric with high cardinality should go.

brian-brazil (author):

I'm presuming that that'd be handled on the Prometheus side with metric relabelling.

Member:

Which would not avoid collecting the metric (which might be expensive) and sending it over the wire (which might be bandwidth limited).

brian-brazil (author):

I don't think I've ever come across the case where the cost of collecting, sending, receiving and rejecting the metric was relevant (beyond a slight dip in CPU when the metric was fixed).

If someone asks us for it we can consider it then.

Member:

I don't understand. If this is the case, why are we doing client-side restrictions (like in this PR) at all?

brian-brazil (author):

The cost has never been relevant when dealing with a high-cardinality metric for normal scraping.

It has been relevant when attempting to do high-frequency scraping of a very limited set of metrics.

Member:

Perhaps we are missing each other's point here. Let me try to rephrase:

  • I wish to do high frequency scraping.
  • I have identified a single metric that is expensive for any of the following reasons:
    • It comprises 80% of the scrape volume, and I am bandwidth limited.
    • It is expensive to collect().
  • Let's assume for the sake of your argument that it's usually the latter and not the former that matters. I could of course hand-pick and whitelist the metrics I'm interested in. But why do that, if I can do my high-frequency scraping without problems by just blacklisting the one metric known to be expensive to collect()?

IIRC the mysqld_exporter has exactly that case. So what we do right now is to switch certain metrics on and off on the command line. At SC, we run separate high-frequency collectors with fewer metrics activated. It would be great if we could just tell the collector via URL params to not collect selected few expensive-to-collect metrics for high-frequency scraping rather than running separate exporter instances with different command line flags.

If I understand you correctly, you are claiming that the high-frequency scrape case will naturally result in scraping a whitelisted set of metrics. But if I understand the mysqld_exporter use case correctly, it's the opposite: One tends to blacklist known expensive metrics and wants to collect all others with high frequency. @SuperQ please correct me if I'm wrong.

brian-brazil (author):

> If I understand you correctly, you are claiming that the high-frequency scrape case will naturally result in scraping a whitelisted set of metrics.

Yes. Given the data volumes involved with such systems no other way is practical, and I've never seen any other approach taken. Usually you'd be whitelisting on the order of 10 metrics, and with a blacklisting approach the organic growth of metrics would be sufficient to take out such a system.

To be honest, if you've a system where one metric comprises 80% of data volume you've bigger problems (I came across that exact case once, and greatly decreased the metric cardinality).

> But if I understand the mysqld_exporter use case correctly, it's the opposite: One tends to blacklist known expensive metrics and wants to collect all others with high frequency.

That's not my understanding. The mysqld exporter isn't covered by this anyway, it's sufficiently weird that it can't implement describe.

(which is the case for the default registry) then `collect` will be called at
registration time instead of `describe`. If this could cause problems, either
implement a proper `describe`, or if that's not practical have `describe`
return an empty list.
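For example, a collector that cannot enumerate its metrics up front might opt out by returning an empty description (a sketch; the collector and metric names are hypothetical):

```python
from prometheus_client import CollectorRegistry, generate_latest
from prometheus_client.core import CounterMetricFamily

class UndescribableCollector(object):
    """JMX-style collector: metric names are only known at scrape time."""

    def describe(self):
        # Empty list: avoids the collect()-at-registration fallback, and
        # opts this collector out of restricted (name-filtered) scrapes.
        return []

    def collect(self):
        family = CounterMetricFamily('dynamic', 'Discovered at scrape time')
        family.add_metric([], 1.0)
        return [family]

registry = CollectorRegistry(auto_describe=True)
registry.register(UndescribableCollector())  # collect() is NOT called here
```

A full scrape still includes the collector; a restricted one skips it entirely, since it registered no names.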
Member:

Is the intention that in the latter case, collectors returning an empty list will never be included by a restricted_registry?

brian-brazil (author):

Yes. My thinking is that for something like JMX, which can be quite expensive, it's best not to call it rather than silently calling it in the background when users think they're only fetching a few cheap counters.

Member:

While I understand the resource argument, it's opaque to the user of a /metrics endpoint which collectors describe themselves properly, and thus the behavior might appear quite puzzling. Between "accidentally using a lot of resources" and "accidentally and silently not getting expected metrics" I'd prefer the former.

brian-brazil (author):

The expected use case of this is high-frequency polling of a very small number of metrics. Using a lot of resources in such a case would be bad.

Member:

But I would notice – because it would defeat the purpose of restricting the metrics.

If some metrics are silently missing, I might not notice. I'm imagining the Postmortem review right now...

brian-brazil (author):

You'd be specifying a small number of names by hand so you'd likely notice.

The logic here is that calling collect() at registration time is problematic, so we should also be tending not to call it when the collector has asked us not to include it in this feature.

Member:

I think we have reached the same point as in the thread above: Apparently, we see the use case exactly inverted, see there.

However, the point about an expensive collect() at registration time is different. The doc comment reads: "… if that's not practical have describe return an empty list." But then describe is not expensive to call, and collect won't be called at registration time because describe is implemented. No?

Member:

OK, after some background processing, I think I got your last point now. Your assumption is essentially that collectors that have problems describing themselves will also be those that are expensive to collect().

That's true for the JMX case, I guess, but not in general.

I see the value of allowing collectors to say "I cannot describe myself but I'm also expensive to collect, so please never collect me if the user has restricted the scrape in any way". But I also see the potential for bad surprises, and I guess the feeling is so strong for me because I see "blacklist a few metrics" as a common use case. If some seemingly unrelated metrics suddenly disappear just because I blacklisted a single one, it feels weird.

Member:

I would hope that non-describing collectors will be rare. In direct instrumentation, they should never occur, essentially. It's mostly an exporter thing, so we should encourage clear documentation about metrics not being collected if any restriction is given as a scrape parameter.

brian-brazil (author):

> Your assumption is essentially that collectors that have problems describing themselves will also be those that are expensive to collect().

Yes, or impossible to describe(). JMX is both impossible and expensive. A one-off collect() at startup would be tolerable (ignoring that that breaks with config reloading), but at every collection would be bad.

> If suddenly some seemingly unrelated metrics disappear just because I blacklisted a single one, feels weird.

The whitelist approach doesn't really have that issue. Returning an empty describe signals that you don't support this feature.

> It's mostly an exporter thing, so we should encourage clear documentation about metrics not being collected if any restriction is given as a scrape parameter.

Yeah, this only comes up with custom collectors.

beorn7 (Member) commented Nov 5, 2016:

OK, I'll try to summarize the threads above here:

  • Collectors that are unable to describe themselves must return an empty description and will be excluded from any restricted collection. (I concur that there is no better solution in the current state of affairs.)
  • There remains disagreement if a whitelisting-only approach is sufficient, or if blacklisting AKA negative matching should be part of the API. Input from others welcome.

brian-brazil (author):

That sounds about right.

Do you have any comments on the spelling of the HTTP API, or thoughts on the restricted_registry? I think the former is fine, but I'm not perfectly happy with the latter.

beorn7 (Member) commented Nov 8, 2016:

@SuperQ could you comment on our current use of the hires MySQL exporter? Is that just a few selected metrics we scrape for it, or is it the other way round, i.e. we just excluded the expensive metrics?

I've just looked at our actual setup. Normal MySQL exporter is about 6000 data points per scrape, the hires exporter only about 1000. Still would be good to get @SuperQ 's idea of what approach would be preferable in that real-life scenario.

WRT the HTTP API: I would model it after the federation API, i.e. use match[] parameters. But the arguments are not PromQL selectors but just regexps for the metric name. Contract would be:

  • Non-supporting targets ignore the parameter.
  • Supporting targets will not collect from any collector that cannot describe itself if at least one match[] parameter is given. From all other collectors, they will return all metrics whose name matches any of the regexps given.
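The proposed match[] contract could be sketched like this; `match_samples` is a hypothetical helper, not code from this PR, and the metric names are illustrative:

```python
import re
from prometheus_client import CollectorRegistry, Gauge

def match_samples(registry, patterns):
    """Keep only samples whose metric name fully matches one of the regexps."""
    regexps = [re.compile(pattern) for pattern in patterns]
    matched = {}
    for metric in registry.collect():
        for sample in metric.samples:
            name, value = sample[0], sample[2]  # (name, labels, value, ...)
            if any(regexp.fullmatch(name) for regexp in regexps):
                matched[name] = value
    return matched

registry = CollectorRegistry()
inflight = Gauge('http_requests_in_flight', 'In-flight requests', registry=registry)
fds = Gauge('process_open_fds', 'Open file descriptors', registry=registry)
inflight.set(3)
fds.set(7)
```

Note this sketches only the name-matching half of the contract; excluding non-describing collectors would still be handled by the registry as discussed above.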

SuperQ (Member) commented Nov 8, 2016:

In the hires case, we select a few specific metrics that we care about. Some short spiky gauges, some high-churn counters. With the MySQL exporter, we need to know ahead of time, at the scrape request, which collector functions we need to call. This is to avoid accessing expensive or locking collectors.

So I guess it's a little of both.

brian-brazil (author):

What I'm thinking will happen with mysqld_exporter is that it'd have URL parameters like collector[], following the pattern set here and on /federate as collector selection is the important thing. For the JMX exporter (if it ever happened) it'd probably be mbean[] or similar.

beorn7 (Member) commented Nov 9, 2016:

Which would mean we would go for a pretty customized behavior anyway. Without the necessity (or ability) to define a universally used API, I have no problem with deferring negative match until somebody shows up with a use case.

brian-brazil (author):

Sounds like we're in rough agreement then. Any other comments?

brian-brazil (author):

Final call, any comments on this? I intend to submit tomorrow.
