
Add describe and limiting of HTTP exposition #110

Merged (2 commits) on Nov 16, 2016

Conversation

brian-brazil (Contributor):

This is the first pass at limiting what's returned over HTTP exposition via a URL parameter. We'll likely need to support POST as well in future to get around URL length limits.

@juliusv @fabxc @beorn7 @SuperQ

Fixes #97

This works by adding an optional describe method to
collectors, which returns data in the same format as collect (though
hopefully without the samples).
If present it is called at registration time.

If describe is not present and auto_describe=True is set on the
registry, then collect is called instead. This is enabled by default on
the default registry, but disabled elsewhere. There's not much
point doing extra checks for a transient registry used to
push to a pushgateway, particularly if collection is expensive.
This works by specifying URL parameters called name[].
Only the required collectors will be called, and
their output is then filtered down.

This depends on the collectors sufficiently implementing
describe.
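The mechanism above can be sketched with a minimal custom collector; the `QueueCollector` class and the `queue_depth` metric are invented for illustration, not code from this PR:

```python
from prometheus_client import CollectorRegistry, generate_latest
from prometheus_client.core import GaugeMetricFamily

class QueueCollector(object):
    """Custom collector with the optional describe() method."""

    def describe(self):
        # Same format as collect(), but without samples; the registry
        # calls this at registration time instead of collect().
        return [GaugeMetricFamily('queue_depth', 'Items waiting in the queue')]

    def collect(self):
        depth = 42  # stand-in for an expensive measurement
        family = GaugeMetricFamily('queue_depth', 'Items waiting in the queue')
        family.add_metric([], depth)
        return [family]

registry = CollectorRegistry()
registry.register(QueueCollector())  # calls describe(), not collect()
```

Because describe() is cheap, registration never triggers the expensive measurement.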
'''Returns object that only collects some metrics.

Returns an object which upon collect() will return
only samples with the given names.
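A usage sketch of the name-restricted collection this docstring describes, assuming the `restricted_registry` helper added by this PR; the metric names are invented:

```python
from prometheus_client import CollectorRegistry, Gauge, generate_latest

registry = CollectorRegistry()
cheap = Gauge('process_open_fds', 'Open file descriptors', registry=registry)
expensive = Gauge('table_scan_seconds', 'Slow to measure', registry=registry)
cheap.set(7)
expensive.set(3.5)

# Only collectors that registered the requested names are called;
# their output is then filtered down to exactly those samples.
restricted = registry.restricted_registry(['process_open_fds'])
output = generate_latest(restricted)
```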
Member:

Wasn't the plan to allow regexp matches for names, too? And inverted matches? (i.e. collect all metrics except those matching 'foo.*')?

brian-brazil (author):

This is v1, so I didn't add the regexes yet.

Inverted wasn't something I'd planned.

Member:

I would imagine it's a common use case: You want a scrape as "normal" as possible, just that one expensive metric with high cardinality should go.

brian-brazil (author):

I'm presuming that that'd be handled on the Prometheus side with metric relabelling.

Member:

Which would not avoid collecting the metric (which might be expensive) and sending it over the wire (which might be bandwidth limited).

brian-brazil (author):

I don't think I've ever come across the case where the cost of collecting, sending, receiving and rejecting the metric was relevant (beyond a slight dip in CPU when the metric was fixed).

If someone asks us for it we can consider it then.

Member:

I don't understand. If this is the case, why are we doing client-side restrictions (like in this PR) at all?

brian-brazil (author):

The cost has never been relevant when dealing with a high-cardinality metric for normal scraping.

It has been relevant when attempting to do high-frequency scraping of a very limited set of metrics.

Member:

Perhaps we are missing each other's point here. Let me try to rephrase:

  • I wish to do high frequency scraping.
  • I have identified a single metric that is expensive for any of the following reasons:
    • It comprises 80% of the scrape volume, and I am bandwidth limited.
    • It is expensive to collect().
  • Let's assume for the sake of your argument that it's usually the latter and not the former that matters. I could of course hand-pick and whitelist the metrics I'm interested in. But why do that, if I can do my high-frequency scraping without problems by just blacklisting the one metric known to be expensive to collect()?

IIRC the mysqld_exporter has exactly that case. So what we do right now is to switch certain metrics on and off on the command line. At SC, we run separate high-frequency collectors with fewer metrics activated. It would be great if we could just tell the collector via URL params to not collect selected few expensive-to-collect metrics for high-frequency scraping rather than running separate exporter instances with different command line flags.

If I understand you correctly, you are claiming that the high-frequency scrape case will naturally result in scraping a whitelisted set of metrics. But if I understand the mysqld_exporter use case correctly, it's the opposite: One tends to blacklist known expensive metrics and wants to collect all others with high frequency. @SuperQ please correct me if I'm wrong.

brian-brazil (author):

> If I understand you correctly, you are claiming that the high-frequency scrape case will naturally result in scraping a whitelisted set of metrics.

Yes. Given the data volumes involved with such systems no other way is practical, and I've never seen any other approach taken. Usually you'd be whitelisting on the order of 10 metrics, and with a blacklisting approach the organic growth of metrics would be sufficient to take out such a system.

To be honest, if you've a system where one metric comprises 80% of data volume you've bigger problems (I came across that exact case once, and greatly decreased the metric cardinality).

> But if I understand the mysqld_exporter use case correctly, it's the opposite: One tends to blacklist known expensive metrics and wants to collect all others with high frequency.

That's not my understanding. The mysqld exporter isn't covered by this anyway, it's sufficiently weird that it can't implement describe.

(which is the case for the default registry) then `collect` will be called at
registration time instead of `describe`. If this could cause problems, either
implement a proper `describe`, or if that's not practical have `describe`
return an empty list.
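For example, a collector that cannot enumerate its metrics up front might opt out by returning an empty description (a sketch; the collector and metric names are hypothetical):

```python
from prometheus_client import CollectorRegistry, generate_latest
from prometheus_client.core import CounterMetricFamily

class UndescribableCollector(object):
    """JMX-style collector: metric names are only known at scrape time."""

    def describe(self):
        # Empty list: avoids the collect()-at-registration fallback, and
        # opts this collector out of restricted (name-filtered) scrapes.
        return []

    def collect(self):
        family = CounterMetricFamily('dynamic', 'Discovered at scrape time')
        family.add_metric([], 1.0)
        return [family]

registry = CollectorRegistry(auto_describe=True)
registry.register(UndescribableCollector())  # collect() is NOT called here
```

A full scrape still includes the collector; a restricted one skips it entirely, since it registered no names.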
Member:

Is the intention that in the latter case, collectors returning an empty list will never be included by a restricted_registry?

brian-brazil (author):

Yes. My thinking is that for something like JMX, which can be quite expensive, it's best not to call it rather than silently calling it in the background when users think they're only fetching a few cheap counters.

Member:

While I understand the resource argument, it's opaque to the user of a /metrics endpoint which collectors describe themselves properly, and thus the behavior might appear quite puzzling. Between "accidentally using a lot of resources" and "accidentally and silently not getting expected metrics" I'd prefer the former.

brian-brazil (author):

The expected use case of this is high-frequency polling of a very small number of metrics. Using a lot of resources in such a case would be bad.

Member:

But I would notice – because it would defeat the purpose of restricting the metrics.

If some metrics are silently missing, I might not notice. I'm imagining the Postmortem review right now...

brian-brazil (author):

You'd be specifying a small number of names by hand so you'd likely notice.

The logic here is that calling collect() at registration time is problematic, so we should also be tending not to call it when the collector has asked us not to include it in this feature.

Member:

I think we have reached the same point as in the thread above: Apparently, we see the use case exactly inverted, see there.

However, the point about an expensive collect() at registration time is different. The doc comment reads: "… if that's not practical have describe return an empty list." But then describe is not expensive to call, and collect won't be called at registration time because describe is implemented. No?

Member:

OK, after some background processing, I think I got your last point now. Your assumption is essentially that collectors that have problems describing themselves will also be those that are expensive to collect().

That's true for the JMX case, I guess, but not in general.

I see the value of allowing collectors to say "I cannot describe myself but I'm also expensive to collect, so please never collect me if the user has restricted the scrape in any way". But I also see the potential for bad surprises, and I guess the feeling is so strong for me because I see "blacklist a few metrics" as a common use case. If some seemingly unrelated metrics suddenly disappear just because I blacklisted a single one, it feels weird.

Member:

I would hope that non-describing collectors will be rare. In direct instrumentation, they should never occur, essentially. It's mostly an exporter thing, so we should encourage clear documentation about metrics not being collected if any restriction is given as a scrape parameter.

brian-brazil (author):

> Your assumption is essentially that collectors that have problems describing themselves will also be those that are expensive to collect().

Yes, or impossible to describe(). JMX is both impossible and expensive. A one-off collect() at startup would be tolerable (ignoring that that breaks with config reloading), but at every collection would be bad.

> If suddenly some seemingly unrelated metrics disappear just because I blacklisted a single one, feels weird.

The whitelist approach doesn't really have that issue. Returning an empty describe signals that you don't support this feature.

> It's mostly an exporter thing, so we should encourage clear documentation about metrics not being collected if any restriction is given as a scrape parameter.

Yeah, this only comes up with custom collectors.

beorn7 (Member) commented Nov 5, 2016:

OK, I'll try to summarize the threads above here:

  • Collectors that are unable to describe themselves must return an empty description and will be excluded from any restricted collection. (I concur that there is no better solution in the current state of affairs.)
  • There remains disagreement if a whitelisting-only approach is sufficient, or if blacklisting AKA negative matching should be part of the API. Input from others welcome.

brian-brazil (author):

That sounds about right.

Do you have any comments on the spelling of the HTTP API, or thoughts on the restricted_registry? I think the former is fine, but I'm not perfectly happy with the latter.

beorn7 (Member) commented Nov 8, 2016:

@SuperQ could you comment on our current use of the hires MySQL exporter? Is that just a few selected metrics we scrape for it, or is it the other way round, i.e. we just excluded the expensive metrics?

I've just looked at our actual setup. Normal MySQL exporter is about 6000 data points per scrape, the hires exporter only about 1000. Still would be good to get @SuperQ 's idea of what approach would be preferable in that real-life scenario.

WRT the HTTP API: I would model it after the federation API, i.e. use match[] parameters. But the arguments are not PromQL selectors but just regexps for the metric name. Contract would be:

  • Non-supporting targets ignore the parameter.
  • Supporting targets will not collect from any collector that cannot describe itself if at least one match[] parameter is given. From all other collectors, they will return all metrics whose name matches any of the regexps given.
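The proposed match[] contract could be sketched like this; `match_samples` is a hypothetical helper, not code from this PR, and the metric names are illustrative:

```python
import re
from prometheus_client import CollectorRegistry, Gauge

def match_samples(registry, patterns):
    """Keep only samples whose metric name fully matches one of the regexps."""
    regexps = [re.compile(pattern) for pattern in patterns]
    matched = {}
    for metric in registry.collect():
        for sample in metric.samples:
            name, value = sample[0], sample[2]  # (name, labels, value, ...)
            if any(regexp.fullmatch(name) for regexp in regexps):
                matched[name] = value
    return matched

registry = CollectorRegistry()
inflight = Gauge('http_requests_in_flight', 'In-flight requests', registry=registry)
fds = Gauge('process_open_fds', 'Open file descriptors', registry=registry)
inflight.set(3)
fds.set(7)
```

Note this sketches only the name-matching half of the contract; excluding non-describing collectors would still be handled by the registry as discussed above.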

SuperQ (Member) commented Nov 8, 2016:

In the hires case, we select a few specific metrics that we care about. Some short spiky gauges, some high-churn counters. With the MySQL exporter, we need to know ahead of time, at the scrape request, which collector functions we need to call. This is to avoid accessing expensive or locking collectors.

So I guess it's a little of both.

brian-brazil (author):

What I'm thinking will happen with mysqld_exporter is that it'd have URL parameters like collector[], following the pattern set here and on /federate as collector selection is the important thing. For the JMX exporter (if it ever happened) it'd probably be mbean[] or similar.

beorn7 (Member) commented Nov 9, 2016:

Which would mean we would go for a pretty customized behavior anyway. Without the necessity (or ability) to define a universally used API, I have no problem with deferring negative match until somebody shows up with a use case.

brian-brazil (author):

Sounds like we're in rough agreement then. Any other comments?

brian-brazil (author):

Final call, any comments on this? I intend to submit tomorrow.
