
Expose histogram_quantile() target bucket lower/upper bounds as series? #5706

Closed
juliusv opened this issue Jun 25, 2019 · 16 comments

@juliusv
Member

juliusv commented Jun 25, 2019

The idea was born out of a conversation at https://twitter.com/juliusvolz/status/1142364117661036544.

To give users a better idea of the quantile error margin of histogram_quantile(), it would be useful to expose the target bucket's lower and upper bounds as two extra series in the same graph as the approximated quantile.

Something like this (using a hacked-up version of histogram_quantile()):

[Screenshot "quantile_buckets": graph showing the approximated quantile together with the target bucket's lower and upper bounds]

  1. Do we want this feature?
  2. Should it be a separate function for just the bucket boundaries, or integrated into histogram_quantile() as an option? E.g. histogram_quantile(φ float, b instant-vector, include_bounds bool, type_label string)

Doing it in the same function would have the advantage that we don't need to calculate the target bucket twice.
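
For concreteness, a rough sketch of how option 2 could look from the user's side (everything below is hypothetical: the extra parameters, the bound label values, and the metric name):

# Hypothetical extended call as per question 2 above (not an existing API):
histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[5m]), true, "bound")
# could return three series per input histogram instead of one, e.g.:
#   {bound="lower"}    0.25   (lower bound of the target bucket)
#   {bound="quantile"} 0.37   (the usual interpolated estimate)
#   {bound="upper"}    0.5    (upper bound of the target bucket)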

@brian-brazil
Contributor

Do we want this feature?

I'm not clear that this is something that belongs in PromQL. If someone wants to do this sort of deeper analysis, the data is already there.

Should it be a separate function for just the bucket boundaries, or integrated into histogram_quantile() as an option?

We presently have only one function (absent) that can return more series than were passed to it, and I'd like to keep it that way. I'm also against overloading functions.

@juliusv
Member Author

juliusv commented Jun 25, 2019

I'm not clear that this is something that belongs in PromQL. If someone wants to do this sort of deeper analysis, the data is already there.

It's hard for the average user to do by themselves, though. If this turned out to be useful for enough users, it would be a shame if they had to implement it themselves (which I think nobody would do), when it's a readily available by-product of calculations already happening in histogram_quantile().
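
To illustrate, here is roughly what a user would have to do today just to see which bucket the quantile falls into (a sketch, with an illustrative metric name; the target bucket then has to be read off by eye, since le is only a string label):

# Cumulative fraction of observations at or below each bucket's upper bound
# (metric name illustrative); the target bucket for the 90th percentile is
# the first le whose fraction reaches 0.9:
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
/ on() group_left()
  sum(rate(http_request_duration_seconds_bucket{le="+Inf"}[5m]))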

But yeah, that's a question of how popular and useful it would be. I do see that it would be somewhat quirky.

@brian-brazil
Contributor

This feels more like a sanity check thing to me, which would be better in a linter that's looking for alerts on series that don't exist, rates of sums etc.

@juliusv
Member Author

juliusv commented Jun 25, 2019

I'd see it as giving a user an ongoing idea of the possible quantile error during operation, though it'd also be useful for the tuning/linting use case you mention.

@brian-brazil
Contributor

That's determinable by inspection, no need for runtime information.

@beorn7
Member

beorn7 commented Jun 25, 2019

That's determinable by inspection

If you want to present a 99th-perc-latency graph or a current 99th-perc-latency value as part of a dashboard, and you also want to show the lower and upper bound of the currently relevant bucket, how would you solve that by "inspection" in practice? I assume users wouldn't like to hardcode the bucketing scheme in their dashboard builder. With the current tooling, you would probably need to export the bucket boundaries as separate time series. Which is pretty crazy.

It would be much easier if Prometheus represented the boundaries as floating point numbers rather than strings. You could do math on them, and use cases like this, even if they are a bit niche, would be solvable in a natural way as a byproduct.
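
To make that concrete (a sketch, with an illustrative metric name): because le is a string, numerically identical boundaries in different spellings are distinct to label matchers, and no numeric comparison on them is possible at all:

# These matchers select different series even though the boundary is numerically the same:
http_request_duration_seconds_bucket{le="0.5"}
http_request_duration_seconds_bucket{le="0.500"}
# And "all buckets with an upper bound below 1s" needs a fragile regex instead of a numeric comparison:
http_request_duration_seconds_bucket{le=~"0\\.[0-9]+"}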

Back in the day, we considered the representation of the boundaries as strings a temporary work-around to get an MVP going ASAP. It's sad that we somehow got stuck with it. Even worse, our MVP is now leaking into OpenMetrics, with the consequences already haunting us.

@brian-brazil
Contributor

If you want to present a 99th-perc-latency graph or a current 99th-perc-latency value as part of a dashboard, and you also want to show the lower and upper bound of the currently relevant bucket, how would you solve that by "inspection" in practice?

My point was more that you can determine the accuracy in general by knowing the histogram buckets, without having to know which bucket is in use. Inconsistently sized buckets (e.g. not arithmetic or geometric growth) in particular could be caught.

@beorn7
Member

beorn7 commented Jun 25, 2019

With the current cost of histograms (one series per bucket), spreading buckets evenly over the range of interest will overwhelm even the beefiest Prometheus servers in many scenarios. (At SoundCloud, teams had to resort to putting buckets around interesting values like the latency mentioned in the SLO. If the current value was far away from those buckets, the larger error range that came with that was just something you had to know about and couldn't visualize in dashboards.)

But even with geometric spacing, you still just need to know whether you are on a ×1.5, ×2, or even ×10 bucketing scheme.
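
To put rough numbers on that (illustrative arithmetic): when the estimate falls into a bucket (l, f·l] of a geometric scheme with factor f, the true quantile lies somewhere in that same bucket, so estimate and truth can be up to a factor of f apart:

# f = 1.5  ->  up to 1.5x apart (roughly +50% / -33%)
# f = 2    ->  up to 2x apart
# f = 10   ->  up to 10x apart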

@harsha070

Hi. I am using Prometheus along with k8s. Response time for my service is usually 10-15s. When I use Prometheus' histogram_quantile(), my P90 is capped at 10.0s. I noticed in my Prometheus config (kubectl describe handler prometheus) that the largest bucket is 10s. I added 25 to the Metrics.Buckets.explicit_buckets.Bounds list, yet my P90 still seems to be capped at 10.0s.

@harsha070

Can someone help me with this?

@beorn7
Member

beorn7 commented Nov 11, 2019

@harsha070 a GitHub issue is not a good place to get support for an only vaguely related configuration problem. It makes more sense to ask questions like this on the prometheus-users mailing list. On the mailing list, more people are available to potentially respond to your question, and the whole community can benefit from the answers provided.

valyala added a commit to VictoriaMetrics/VictoriaMetrics that referenced this issue Dec 11, 2019
…ted percentile from `histogram_quantile` if third arg is passed

Updates prometheus/prometheus#5706
@valyala

valyala commented Dec 11, 2019

FYI, I just added the ability to pass a third arg to histogram_quantile in VictoriaMetrics, so it can be evaluated before deciding whether it is needed in Prometheus.
Try using histogram_quantile(phi, q, boundsLabel) at http://play-grafana.victoriametrics.com:3000/d/4ome8yJmz/node-exporter-on-victoriametrics-demo (graphs can be edited there). Example query:

histogram_quantile(0.9, histogram(process_resident_memory_bytes), "bound")

The query uses the histogram aggregate function from extended PromQL. The histogram function is described in this article.

@juliusv
Member Author

juliusv commented Dec 11, 2019

Hi @valyala, I/we are really happy about anyone building cool products/projects around Prometheus, and I think VictoriaMetrics is making some interesting technical contributions. But at the same time we're seeing a pattern with VM that we see as increasingly problematic:

  • It seems like every GitHub issue, mailing list thread, etc. of the Prometheus OSS community that can be connected thematically in any way to VM is used to push VictoriaMetrics. We might not have direct written-down rules about this (yet) for our channels, because sometimes it's difficult to draw the line and pointing out products can also be helpful, but I don't think it is good etiquette to use OSS community channels to push one's own (incompatible) products at every possible chance. The fact that VM initially marketed itself as simply "The best Prometheus storage" without any nuance wasn't helpful in my perception.

  • IMO more problematically, VM introduces its own implementation of PromQL that is incompatible in multiple ways with native PromQL, while calling itself "Extended PromQL". This implies that it is a compatible superset. I think this is dangerous in that it confuses users about the capabilities of each PromQL variant and their compatibility. I did read https://medium.com/@valyala/evaluating-performance-and-correctness-victoriametrics-response-e27315627e87 and am worried by the post downplaying these compatibility problems as well. But even if "Extended PromQL" were a compatible superset of PromQL, it would still cause fragmentation and interoperability problems in the greater Prometheus space (see your histogram_quantile() example above: anyone on VM now cannot go "back"). This may not be your intention, but it does have echoes of the "embrace, extend, extinguish" strategy that Microsoft used in the 90s (https://en.wikipedia.org/wiki/Embrace,_extend,_and_extinguish). Finally: I am not a lawyer, but do note that "PromQL" is a pending trademark of the Linux Foundation (https://www.linuxfoundation.org/trademark-list/), and my non-lawyerly reading of the LF's trademark usage terms suggests that "Extended PromQL" is a problem, especially when the name is used to communicate compatibility.

So from my perspective: I'm happy to see what comes out of VictoriaMetrics, but maybe be a bit more sensitive about using Prometheus channels to push it at every available opportunity. Secondly, and more importantly, I think the right thing to do would be to at least make "Extended PromQL" a fully compatible superset of PromQL, or rename it completely so it doesn't suggest compatibility. Even if it were compatible, however, that wouldn't completely avoid the interoperability/fragmentation issues, nor the connected trademark issues (without an appropriate rename).

@vearutop

Dear @juliusv, thank you for a very detailed comment that hooked me on this topic.

I was wondering: when you say "It seems like every GitHub issue", are you talking about these 8 issues?

@juliusv
Member Author

juliusv commented Dec 14, 2019

@vearutop Most of those qualify, yes. The mailing list at https://groups.google.com/forum/#!searchin/prometheus-users/from$3Avalyala@gmail.com$20victoriametrics%7Csort:date contains many more examples though. And again, it's not necessarily about each individual comment or post, it's about the pattern, and about pushing users onto a system that is deliberately incompatible, both in its PromQL subset, and in its "Extended PromQL" superset. This means that users can never go back to PromQL-compatible systems without breakage. @valyala has the right to build an incompatible system (naming issues aside), but then the Prometheus OSS community also has the right to discourage the above-mentioned communication behavior on our OSS channels.

@juliusv
Member Author

juliusv commented Dec 14, 2019

Closing and locking this issue as this is not the right place for this discussion (we're in the process of finding it), and the original issue probably won't be implemented anyway.

juliusv closed this as completed Dec 14, 2019
prometheus locked this issue as off-topic and limited conversation to collaborators Dec 14, 2019