Add exponential bucketing to histogram protobuf #149

Merged: 7 commits merged into open-telemetry:main on Jun 28, 2021

Conversation

@yzhuge (Contributor) commented Mar 8, 2021

@yzhuge requested a review from a team as a code owner, March 8, 2021 19:37
@jmacd (Contributor) commented Mar 10, 2021

@oertl please note.

@jmacd (Contributor) left a comment

Very well done! I look forward to another take at open-telemetry/opentelemetry-proto#226. Thank you @yzhuge

@jmacd (Contributor) commented Apr 7, 2021

@open-telemetry/specs-approvers

@reyang (Member) left a comment

LGTM.

@hdost left a comment

The one thing I am left with is: what is the benefit of doing this?
Maybe I am missing something here.
Is it that they are "mergeable" but not universally mergeable?

Or is it mainly:

distributions, which are common in the OTEL target application like response time measurement. Exponential buckets (ie. log scale buckets) need far fewer buckets than linear scale buckets to cover the wide range of a long tail distribution.

I know the Open Questions section mentions this and the options.

Edit: I guess even without that, based on your referenced implementation, it would be possible. I was trying to think about it from the use-case perspective: if I have different servers receiving calls with different latencies, such that they don't produce the same buckets as output, how does that look here? I'd assume, based on the proto, that it would work fine. Or would we in fact still need to require the implementation to define these bounds?

I apologize if this is not germane to this particular topic. I want to make sure I'm understanding the intent/implications.

Review thread on text/0149-exponential-histogram.md (resolved)
@yzhuge (Contributor, Author) commented Apr 19, 2021

@hdost The intention is for OTEL to support log-scale histograms natively and efficiently. Log scale is commonly used for high-dynamic-range data such as response times with long tails.
In this proposal I basically left the mergeability issue open. The good news is that we have a way forward for "universally mergeable" histograms, but as the initial step we just do a generic one (any base is allowed).
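
As a rough illustration of why log-scale buckets are cheap to compute (a sketch only; the method name and the exact rounding convention are not taken from this OTEP), a base-2 exponential bucket index can be derived directly from the logarithm of the value:

```java
// Sketch only: not the OTEP's normative algorithm.
final class ExponentialBuckets {
    // Per-bucket growth factor is base = 2^(2^-scale); buckets satisfy
    // base^index < value <= base^(index + 1) for positive values.
    static int bucketIndex(double value, int scale) {
        double log2 = Math.log(value) / Math.log(2.0);            // log2(value)
        return (int) Math.ceil(log2 * Math.pow(2.0, scale)) - 1;  // log_base(value) = log2(value) * 2^scale
    }
}
```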

@HeinrichHartmann commented Apr 20, 2021

My 2 cents from the sidelines:

  • I like this proposal a lot in general. Log-histograms are a pragmatic and effective way forward. There is theoretical backing from the DDSketch paper, and real-world experience from the DataDog implementation of this method.

  • Configuring histograms can be a high barrier to entry, since most users will not be familiar with the subtleties of the involved bin layout and will not be able to predict value ranges up front very well. This is a constant struggle, e.g. with Prom-Histograms, HDR-Histograms, and to a certain degree DDSketch. We should strive to avoid configuration where possible.

  • If different configurations result in unmergeable histograms, this is a tragedy. AFAIK, this is one of the main regrets of Google's initial histogram implementation in Borgmon. Universal mergeability is a very nice insight that helps here, as long as referenceBase stays fixed.

Fortunately, with log histograms you can get away with universal choices that work well enough for all operational use cases I have seen. Those choices are:

a) Sparse bin layout, with bins covering a very large value range of 10^{+/- 100}
b) referenceBase=2 (or 10, it does not matter, just pick one)
c) referenceScale=-3, resulting in a theoretical max relative error of 5% (in practice this is much lower; see the DDSketch paper).

There is a case to be made for leaving referenceScale configurable, which I can get behind, but setting the default to -3 with a strong recommendation to leave it as-is seems appropriate.
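
A quick sanity check of those numbers (an illustration only, under the assumption that the per-bucket growth factor is referenceBase^(2^referenceScale), which is how the ~5% figure above appears to be derived):

```java
final class RelativeError {
    // Worst-case relative error when a value is reported as its bucket midpoint,
    // assuming the per-bucket growth factor g = referenceBase^(2^referenceScale):
    // error <= (g - 1) / (g + 1).
    static double worstCase(double referenceBase, int referenceScale) {
        double g = Math.pow(referenceBase, Math.pow(2.0, referenceScale));
        return (g - 1.0) / (g + 1.0);
    }
    // worstCase(2.0, -3) is about 0.0433, i.e. within the ~5% quoted above.
}
```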

@yzhuge (Contributor, Author) commented Apr 20, 2021

So far, many open source histogram libs (HDR, DDSketch, etc.) have fixed/configured accuracy and a variable memory footprint. This makes configuration harder and risks "out of memory" errors, which often kill the app.
What I envision is a fixed/configured memory footprint with variable accuracy. For example, New Relic's Distribution metric defaults to a max of 320 buckets, which usually costs only about 1 KB (dense array, 4-byte counters). This gives about 3% relative error for a data max/min contrast of up to 1M (for example, 1 ms to 1 million ms). When the contrast exceeds the limit, it automatically reduces accuracy to keep the memory footprint constant. The rationale is that a less accurate histogram is better than breaking the monitored app.
Of course, such histograms require built-in auto rescaling. New Relic uses the 2-to-1 bucket merge, which is also proposed in the UDDSketch paper and is in internal use at Google.
Looked at another way: regardless of the value range of the data, the histogram provides a constant number of "steps" (ie. buckets) for understanding the distribution, and 320 buckets is enough to visualize the "shape" of the distribution.
New Relic does plan to open source its histogram lib, though no timeline is promised.
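
A minimal sketch of the 2-to-1 merge described above (this is not New Relic's implementation; the array layout and index alignment are simplified): summing adjacent bucket pairs halves the resolution and keeps the bucket count bounded as the observed dynamic range grows.

```java
final class Downscale {
    // Merge adjacent bucket pairs, halving the bucket count (and the
    // resolution) while preserving every recorded count.
    static long[] mergeAdjacentPairs(long[] counts) {
        long[] merged = new long[(counts.length + 1) / 2];
        for (int i = 0; i < counts.length; i++) {
            merged[i / 2] += counts[i];
        }
        return merged;
    }
}
```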

@HeinrichHartmann commented Apr 21, 2021

many open source histogram libs have fixed/configured accuracy and a variable memory footprint. This makes configuration harder and risks "out of memory" errors, which often kill the app.
What I envision is a fixed/configured memory footprint with variable accuracy.

I clearly see the benefits of the "fixed memory / variable accuracy" design. It seems like a good fit here. If we can realize "fixed memory / variable accuracy" histograms with near-zero configuration and full mergeability (!), this is a clear win.

Let me just remark that my experience with "fixed accuracy / variable memory" has been different: (a) Circllhist has zero configuration, and (b) I have never seen them blow up or "kill an app" in practice. So "fixed accuracy / variable memory" should be viable as well.

@jsuereth (Contributor) left a comment

Sorry I didn't approve this before. I'm totally on board with this OTEP.

I think Heinrich raised some good questions that I'd like to see addressed when we formalize this specification and model, but I think this can be done as part of the actual submission into the specification.

Review thread on text/0149-exponential-histogram.md (resolved)
@jmacd (Contributor) commented May 18, 2021

All: we plan to merge this OTEP but need one more approval (@bogdandrutu?). The consensus is to go forward with a fixed referenceBase of 2.

beorn7 added a commit to prometheus/client_model that referenced this pull request Jun 11, 2021
This follows what looks to be the winning proposal in
open-telemetry/oteps#149

See more detail in upcoming commit for prometheus/client_golang.

Signed-off-by: beorn7 <beorn@grafana.com>
beorn7 added a commit to prometheus/client_golang that referenced this pull request Jun 11, 2021

This seems to be what OTel is converging towards; see
open-telemetry/oteps#149 .

I see pros and cons with base-10 vs base-2. They are discussed in
detail in that OTel PR, and the gist of the discussion is pretty much
in line with my design doc. Since the balance is easy to tip here, I
think we should go with base-2 if OTel picks base-2. This also seems
to be in agreement with several proprietary solutions (see again the
discussion on that OTel PR).

The idea to make the number of buckets per power of 2 (or formerly 10)
a power of 2 itself was also sketched out in the design doc
already. It guarantees mergeability of different resolutions. I was
undecided between making it a recommendation or mandatory. Now I think
it should be mandatory, as it has the additional benefit of playing
well with OTel's plans.

Signed-off-by: beorn7 <beorn@grafana.com>
beorn7 added commits to prometheus/client_golang that referenced this pull request Jun 14 and Jun 15, 2021, with the same commit message as above.
@jmacd (Contributor) commented Jun 22, 2021

Responding to @lizthegrey's question above about the possibility of adopting the OpenHistogram libraries and protocol as a shortcut for OpenTelemetry's metrics protocol.

OpenHistogram uses a decimal base with linear sub-buckets, which this OTEP argues is additional complexity compared with a simple exponential histogram.

This OTEP argues in favor of a base-2 histogram with variable scale, with support for lower-resolution histograms and something we may call "perfect subsetting": the idea that we can automatically lower resolution, without introducing error, by a change of scale. OpenHistogram would have to be extended to support a change of scale, where resolutions like 3, 9, and 18 (30x, 10x, and 5x reductions) make sense.
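
To make "perfect subsetting" concrete (a sketch, not text from the OTEP): with base-2 buckets and power-of-two scales, every bucket at a coarser scale is exactly the union of a fixed number of buckets at a finer scale, so re-bucketing a histogram to a lower scale loses no information and introduces no error.

```java
final class PerfectSubsetting {
    // Re-map a bucket index from a finer scale to a coarser one
    // (assumes fromScale >= toScale). Each coarse bucket is exactly the union
    // of 2^(fromScale - toScale) fine buckets, so the mapping is lossless.
    // Arithmetic shift gives floor division for negative indexes as well.
    static int rebucketIndex(int index, int fromScale, int toScale) {
        return index >> (fromScale - toScale);
    }
}
```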

We have at least three vendors and platform providers, who have spoken in this review thread, willing to contribute client code for computing these histograms. The decision facing the OpenTelemetry and OpenMetrics communities -- which histogram to adopt? -- will impact development in the OpenTelemetry collector and Prometheus server far more than in client SDKs, because we will be handling both our old explicit-boundary histogram and the new exponential histogram. This is why I think adopting OpenHistogram probably would not speed up our overall process.

We are missing an approval on this OTEP; @open-telemetry/specs-approvers, please consider reviewing. I want to strongly encourage that we merge this PR as a signal that we have found agreement, and then I would propose that @yzhuge follow up with a PR in the protocol repository. If there are loose ends, such as overflow buckets and whether the zero bucket has finite width (these points have been raised to me privately), then I suggest we discuss them in the follow-up PR.

Thank you @yzhuge.

beorn7 added commits to prometheus/client_golang that referenced this pull request Jun 22 and Jun 23, 2021, with the same commit message as above plus a note that the commits also address a number of outstanding TODOs.
@SergeyKanzhelev self-assigned this Jun 23, 2021
@SergeyKanzhelev (Member) commented
I think this OTEP has reached a critical mass of approvals. Should we give it another day for concerns to be raised? @lizthegrey, do you agree with @jmacd's reasoning?

@jmacd (Contributor) commented Jun 23, 2021

Thank you @SergeyKanzhelev. I agree with leaving this open for comment a little longer, and I think we should merge it either way because it is the result of a valuable engineering discovery process. We could still decide to adopt a base-10 solution following this OTEP, but I don't see enough support for it.

For me, the final signal that base-2 is our best choice came from @beorn7, who has been (quietly) changing the Prometheus sparse histograms prototype being built in https://github.com/prometheus/client_golang/tree/sparsehistogram to use a reference base of 2 and power-of-2 scale, i.e., we have reached agreement with a key figure in the OpenMetrics community.

There are a number of contributors working on this, every one of us representing a vendor, so it's difficult to separate purely technical considerations from business-motivated reasoning in this arena. Let us know what you think, @lizthegrey, thanks!

@postwait commented Jun 23, 2021 via email

@jmacd (Contributor) commented Jun 24, 2021

Thank you @yzhuge.

All, please see open-telemetry/opentelemetry-specification#1776.

@jmacd closed this Jun 24, 2021
@giltene commented Jun 27, 2021

I realize this may seem late in the game, but I'd suggest looking at HdrHistogram's auto-ranging, auto-resizing DoubleHistogram for some mature conceptual work on a close-to-zero-config histogram for positive floating point numbers. A DoubleHistogram can accommodate any value in the positive double-precision floating point range (0...2^1023), and only limits the range of the values actually recorded in a single histogram to not exceed a ratio of e.g. 1 : 2^55 for 2-decimal-point precision (+/- 1% worst-case value error) and e.g. 1 : 2^53 for 3-decimal-point precision (+/- 0.1% worst-case value error). Those ranges are high enough that, e.g., when measurement units end up being as fine-grained as nanoseconds, values as high as billions of years can be tracked in the same histogram as a nanosecond, all while maintaining the required +/- % error rate.

DoubleHistogram was born out of real-world needs and seems to have served some of those well. While configuration options do exist for when one actually wants to control or contain the in-memory data structure size, such that it will error rather than auto-resize to accommodate recorded values, the only required configuration parameter when creating a DoubleHistogram is the precision (stated as a number of decimal points) (see the Java constructor example). And since the most commonly used precision is 2 decimal points (i.e. relative error for values is kept below 1% across the entire covered range), a default of 2 can be used in e.g. a specification if one wants to reach true zero-config. The wire form for HdrHistogram (and DoubleHistogram) includes the actual configuration of the recorded histogram, allowing stored histograms of varying configurations to be read, manipulated, merged, added, etc. into histograms with different precisions and experienced value ranges, all without having the source configuration dictate or dominate the outcome.
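
For readers unfamiliar with the API being referenced, a usage sketch along the lines described above (consult the HdrHistogram documentation for the authoritative signatures; the values recorded here are made up):

```java
import org.HdrHistogram.DoubleHistogram;

public class DoubleHistogramExample {
    public static void main(String[] args) {
        // Only the precision (significant decimal digits) is required;
        // the histogram auto-ranges and auto-resizes around recorded values.
        DoubleHistogram latencies = new DoubleHistogram(2); // ~1% worst-case value error

        latencies.recordValue(0.000437); // e.g. seconds
        latencies.recordValue(1.25);

        double p99 = latencies.getValueAtPercentile(99.0);
        System.out.println("p99 = " + p99);
    }
}
```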

There are probably many possible ways to shave this yak, but HdrHistogram has been at it for almost a decade, with some pretty wide use and exposure, so it may be a good starting point and a good way to get a head start on some of the learnings that often come from attempts to set one-scheme-fits-all solutions.

@lizthegrey (Member) commented
I've been on PTO, please proceed without me.

@jmacd reopened this Jun 28, 2021
@jmacd merged commit a365bcc into open-telemetry:main Jun 28, 2021
@jmacd (Contributor) commented Jun 28, 2021

I meant to merge this PR, not to close it!

jsuereth pushed a commit to jsuereth/oteps that referenced this pull request Sep 7, 2021
* add text/0000-exponential-histogram.md

* rename to match PR#

* fix lint complaint

* more lint fix

* remove statement on inclusive vs. exclusive bounds

Co-authored-by: Joshua MacDonald <jmacd@users.noreply.github.com>
Cori1109 added a commit to Cori1109/client_golang that referenced this pull request Jan 9, 2023, with the same commit message as the beorn7 commits above.
Labels: metrics (Relates to the Metrics API/SDK)