Add exponential bucketing to histogram protobuf #149

Merged: 7 commits merged into open-telemetry:main on Jun 28, 2021

Conversation

@yzhuge (Contributor) commented Mar 8, 2021

@yzhuge requested a review from a team as a code owner, March 8, 2021 19:37
@jmacd (Contributor) commented Mar 10, 2021

@oertl please note.

@jmacd (Contributor) left a comment

Very well done! I look forward to another take at open-telemetry/opentelemetry-proto#226. Thank you @yzhuge

@jmacd (Contributor) commented Apr 7, 2021

@open-telemetry/specs-approvers

@reyang (Member) left a comment

LGTM.

@hdost left a comment

The one thing I am left with is: what is the benefit of doing this?
Maybe I am missing something here.
Is it that they are "mergeable" but not universally mergeable?

Or is it mainly:

distributions, which are common in the OTEL target application like response time measurement. Exponential buckets (ie. log scale buckets) need far fewer buckets than linear scale buckets to cover the wide range of a long tail distribution.

I know the Open Questions section mentions this and the options.

Edit: I guess even without that, based on your referenced implementation, it would be possible. I was trying to think about it from the use-case perspective: if I have different servers receiving calls with different latencies, such that they don't produce the same buckets as output, how does that look here? I'd assume, based on the proto, that it would work fine. Or would we in fact still need to require the implementation to define these bounds?

I apologize if this is not germane to this particular topic. I want to make sure I'm understanding the intent/implications.

Review thread on text/0149-exponential-histogram.md (resolved)
@yzhuge (Contributor, Author) commented Apr 19, 2021

@hdost The intention is for OTEL to support log-scale histograms natively and efficiently. Log scale is commonly used for high-dynamic-range data such as response times with long tails.
In this proposal I basically left the mergeability issue open. The good news is that we have a way forward for "universally mergeable" histograms, but as the initial step we just do a generic one (any base is allowed).
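
As a rough illustration of why log-scale buckets are cheap to compute (a sketch only; the method name and the exact rounding convention are not taken from this OTEP), a base-2 exponential bucket index can be derived directly from the logarithm of the value:

```java
// Sketch only: not the OTEP's normative algorithm.
final class ExponentialBuckets {
    // Per-bucket growth factor is base = 2^(2^-scale); buckets satisfy
    // base^index < value <= base^(index + 1) for positive values.
    static int bucketIndex(double value, int scale) {
        double log2 = Math.log(value) / Math.log(2.0);            // log2(value)
        return (int) Math.ceil(log2 * Math.pow(2.0, scale)) - 1;  // log_base(value) = log2(value) * 2^scale
    }
}
```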

@HeinrichHartmann commented Apr 20, 2021

My 2 cents from the sidelines:

  • I like this proposal a lot in general. Log-histograms are a pragmatic and effective way forward. There is theoretical backing from the DDSketch paper, and real-world experience from the DataDog implementation of this method.

  • Configuring histograms can be a high barrier to entry, since most users will not be familiar with the subtleties of the involved bin layout and will not be able to predict value ranges up front very well. This is a constant struggle, e.g. with Prom-Histograms, HDR-Histograms, and to a certain degree DDSketch. We should strive to avoid configuration where possible.

  • If different configurations result in unmergeable histograms, this is a tragedy. AFAIK, this is one of the main regrets of Google's initial histogram implementation in Borgmon. Universal mergeability is a very nice insight that helps here, as long as referenceBase stays fixed.

Fortunately, with log histograms you can get away with universal choices that work well enough for all operational use cases I have seen. Those choices are:

a) Sparse bin layout, with bins covering a very large value range of 10^{+/- 100}
b) referenceBase=2 (or 10, it does not matter, just pick one)
c) referenceScale=-3, resulting in a theoretical max relative error of 5% (in practice this is much lower; see the DDSketch paper).

There is a case to be made for leaving referenceScale configurable, which I can get behind, but setting the default to -3 with a strong recommendation to leave it as-is seems appropriate.
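
A quick sanity check of those numbers (an illustration only, under the assumption that the per-bucket growth factor is referenceBase^(2^referenceScale), which is how the ~5% figure above appears to be derived):

```java
final class RelativeError {
    // Worst-case relative error when a value is reported as its bucket midpoint,
    // assuming the per-bucket growth factor g = referenceBase^(2^referenceScale):
    // error <= (g - 1) / (g + 1).
    static double worstCase(double referenceBase, int referenceScale) {
        double g = Math.pow(referenceBase, Math.pow(2.0, referenceScale));
        return (g - 1.0) / (g + 1.0);
    }
    // worstCase(2.0, -3) is about 0.0433, i.e. within the ~5% quoted above.
}
```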

@yzhuge (Contributor, Author) commented Apr 20, 2021

So far, many open source histogram libs (HDR, DDSketch, etc.) have fixed/configured accuracy and a variable memory footprint. This makes configuration harder and risks "out of memory" errors, which often kill the app.
What I envision is a fixed/configured memory footprint with variable accuracy. For example, New Relic's Distribution metric defaults to a max of 320 buckets, which usually costs only about 1 KB (dense array, 4-byte counters). This gives about 3% relative error for a data max/min contrast of up to 1M (for example, 1 ms to 1 million ms). When the contrast exceeds the limit, it automatically reduces accuracy to keep the memory footprint constant. The rationale is that a less accurate histogram is better than breaking the monitored app.
Of course, such histograms require built-in auto rescaling. New Relic uses the 2-to-1 bucket merge, which is also proposed in the UDDSketch paper and is in internal use at Google.
Looked at another way: regardless of the value range of the data, the histogram provides a constant number of "steps" (ie. buckets) for understanding the distribution, and 320 buckets is enough to visualize the "shape" of the distribution.
New Relic does plan to open source its histogram lib, though no timeline is promised.
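
A minimal sketch of the 2-to-1 merge described above (this is not New Relic's implementation; the array layout and index alignment are simplified): summing adjacent bucket pairs halves the resolution and keeps the bucket count bounded as the observed dynamic range grows.

```java
final class Downscale {
    // Merge adjacent bucket pairs, halving the bucket count (and the
    // resolution) while preserving every recorded count.
    static long[] mergeAdjacentPairs(long[] counts) {
        long[] merged = new long[(counts.length + 1) / 2];
        for (int i = 0; i < counts.length; i++) {
            merged[i / 2] += counts[i];
        }
        return merged;
    }
}
```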

@HeinrichHartmann commented Apr 21, 2021

many open source histogram libs have fixed/configured accuracy and a variable memory footprint. This makes configuration harder and risks "out of memory" errors, which often kill the app.
What I envision is a fixed/configured memory footprint with variable accuracy.

I clearly see the benefits of the "fixed memory / variable accuracy" design. It seems like a good fit here. If we can realize "fixed memory / variable accuracy" histograms with near-zero configuration and full mergeability (!), this is a clear win.

Let me just remark that my experience with "fixed accuracy / variable memory" has been different: (a) Circllhist has zero configuration, and (b) I have never seen them blow up or "kill an app" in practice. So "fixed accuracy / variable memory" should be viable as well.

@jsuereth (Contributor) left a comment

Sorry I didn't approve this before. I'm totally on board with this OTEP.

I think Heinrich raised some good questions that I'd like to see addressed when we formalize this specification and model, but I think this can be done as part of the actual submission into the specification.

Review thread on text/0149-exponential-histogram.md (resolved)
@jmacd (Contributor) commented May 18, 2021

All: we plan to merge this OTEP but need one more approval (@bogdandrutu?). The consensus is to go forward with a fixed referenceBase of 2.

beorn7 added a commit to prometheus/client_model that referenced this pull request Jun 11, 2021
This follows what looks to be the winning proposal in
open-telemetry/oteps#149

See more detail in upcoming commit for prometheus/client_golang.

Signed-off-by: beorn7 <beorn@grafana.com>
beorn7 added a commit to prometheus/client_golang that referenced this pull request Jun 11, 2021

This seems to be what OTel is converging towards; see
open-telemetry/oteps#149 .

I see pros and cons with base-10 vs base-2. They are discussed in
detail in that OTel PR, and the gist of the discussion is pretty much
in line with my design doc. Since the balance is easy to tip here, I
think we should go with base-2 if OTel picks base-2. This also seems
to be in agreement with several proprietary solutions (see again the
discussion on that OTel PR).

The idea to make the number of buckets per power of 2 (or formerly 10)
a power of 2 itself was also sketched out in the design doc
already. It guarantees mergeability of different resolutions. I was
undecided between making it a recommendation or mandatory. Now I think
it should be mandatory, as it has the additional benefit of playing
well with OTel's plans.

Signed-off-by: beorn7 <beorn@grafana.com>
beorn7 added commits to prometheus/client_golang that referenced this pull request Jun 14 and Jun 15, 2021, with the same commit message as above.
@jmacd (Contributor) commented Jun 22, 2021

Responding to @lizthegrey's question above about the possibility of adopting the OpenHistogram libraries and protocol as a shortcut for OpenTelemetry's metrics protocol.

OpenHistogram uses a decimal base with linear sub-buckets, which this OTEP argues is additional complexity compared with a simple exponential histogram.

This OTEP argues in favor of a base-2 histogram with variable scale, with support for lower-resolution histograms and something we may call "perfect subsetting": the idea that we can automatically lower resolution, without introducing error, by a change of scale. OpenHistogram would have to be extended to support a change of scale, where resolutions like 3, 9, and 18 (30x, 10x, and 5x reductions) make sense.
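
To make "perfect subsetting" concrete (a sketch, not text from the OTEP): with base-2 buckets and power-of-two scales, every bucket at a coarser scale is exactly the union of a fixed number of buckets at a finer scale, so re-bucketing a histogram to a lower scale loses no information and introduces no error.

```java
final class PerfectSubsetting {
    // Re-map a bucket index from a finer scale to a coarser one
    // (assumes fromScale >= toScale). Each coarse bucket is exactly the union
    // of 2^(fromScale - toScale) fine buckets, so the mapping is lossless.
    // Arithmetic shift gives floor division for negative indexes as well.
    static int rebucketIndex(int index, int fromScale, int toScale) {
        return index >> (fromScale - toScale);
    }
}
```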

We have at least three vendors and platform providers, who have spoken in this review thread, willing to contribute client code for computing these histograms. The decision facing the OpenTelemetry and OpenMetrics communities -- which histogram to adopt? -- will impact development in the OpenTelemetry collector and Prometheus server far more than in client SDKs, because we will be handling both our old explicit-boundary histogram and the new exponential histogram. This is why I think adopting OpenHistogram probably would not speed up our overall process.

We are missing an approval on this OTEP; @open-telemetry/specs-approvers, please consider reviewing. I want to strongly encourage that we merge this PR as a signal that we have found agreement, and then I would propose that @yzhuge follow up with a PR in the protocol repository. If there are loose ends, such as overflow buckets and whether the zero bucket has finite width (these points have been raised to me privately), then I suggest we discuss them in the follow-up PR.

Thank you @yzhuge.

beorn7 added commits to prometheus/client_golang that referenced this pull request Jun 22 and Jun 23, 2021, with the same commit message as above plus a note that the commits also address a number of outstanding TODOs.
@SergeyKanzhelev self-assigned this Jun 23, 2021
@SergeyKanzhelev (Member) commented
I think this OTEP has reached a critical mass of approvals. Should we give it another day for concerns to be raised? @lizthegrey, do you agree with @jmacd's reasoning?

@jmacd (Contributor) commented Jun 23, 2021

Thank you @SergeyKanzhelev. I agree with leaving this open for comment a little longer, and I think we should merge it either way because it is the result of a valuable engineering discovery process. We could still decide to adopt a base-10 solution following this OTEP, but I don't see enough support for it.

For me, the final signal that base-2 is our best choice came from @beorn7, who has been (quietly) changing the Prometheus sparse histograms prototype being built in https://github.com/prometheus/client_golang/tree/sparsehistogram to use a reference base of 2 and power-of-2 scale, i.e., we have reached agreement with a key figure in the OpenMetrics community.

There are a number of contributors working on this, every one of us representing a vendor, so it's difficult to separate purely technical considerations from business-motivated reasoning in this arena. Let us know what you think, @lizthegrey, thanks!

@postwait commented Jun 23, 2021 via email

@jmacd (Contributor) commented Jun 24, 2021

Thank you @yzhuge.

All, please see open-telemetry/opentelemetry-specification#1776.

@jmacd closed this Jun 24, 2021
@giltene commented Jun 27, 2021

I realize this may seem late in the game, but I'd suggest looking at HdrHistogram's auto-ranging, auto-resizing DoubleHistogram for some mature conceptual work on a close-to-zero-config histogram for positive floating point numbers. A DoubleHistogram can accommodate any value in the positive double-precision floating point range (0...2^1023), and only limits the range of the values actually recorded in a single histogram to not exceed a ratio of e.g. 1 : 2^55 for 2-decimal-point precision (+/- 1% worst-case value error) and e.g. 1 : 2^53 for 3-decimal-point precision (+/- 0.1% worst-case value error). Those ranges are high enough that, e.g., when measurement units end up being as fine-grained as nanoseconds, values as high as billions of years can be tracked in the same histogram as a nanosecond, all while maintaining the required +/- % error rate.

DoubleHistogram was born out of real-world needs and seems to have served some of those well. While configuration options do exist for when one actually wants to control or contain the in-memory data structure size, such that it will error rather than auto-resize to accommodate recorded values, the only required configuration parameter when creating a DoubleHistogram is the precision (stated as a number of decimal points) (see the Java constructor example). And since the most commonly used precision is 2 decimal points (i.e. relative error for values is kept below 1% across the entire covered range), a default of 2 can be used in e.g. a specification if one wants to reach true zero-config. The wire form for HdrHistogram (and DoubleHistogram) includes the actual configuration of the recorded histogram, allowing stored histograms of varying configurations to be read, manipulated, merged, added, etc. into histograms with different precisions and experienced value ranges, all without having the source configuration dictate or dominate the outcome.
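
For readers unfamiliar with the API being referenced, a usage sketch along the lines described above (consult the HdrHistogram documentation for the authoritative signatures; the values recorded here are made up):

```java
import org.HdrHistogram.DoubleHistogram;

public class DoubleHistogramExample {
    public static void main(String[] args) {
        // Only the precision (significant decimal digits) is required;
        // the histogram auto-ranges and auto-resizes around recorded values.
        DoubleHistogram latencies = new DoubleHistogram(2); // ~1% worst-case value error

        latencies.recordValue(0.000437); // e.g. seconds
        latencies.recordValue(1.25);

        double p99 = latencies.getValueAtPercentile(99.0);
        System.out.println("p99 = " + p99);
    }
}
```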

There are probably many possible ways to shave this yak, but HdrHistogram has been at it for almost a decade, with some pretty wide use and exposure, so it may be a good starting point and a good way to get a head start on some of the learnings that often come from attempts to set one-scheme-fits-all solutions.

@lizthegrey (Member) commented
I've been on PTO, please proceed without me.

@jmacd reopened this Jun 28, 2021
@jmacd merged commit a365bcc into open-telemetry:main Jun 28, 2021
@jmacd (Contributor) commented Jun 28, 2021

I meant to merge this PR, not to close it!

jsuereth pushed a commit to jsuereth/oteps that referenced this pull request Sep 7, 2021
* add text/0000-exponential-histogram.md

* rename to match PR#

* fix lint complaint

* more lint fix

* remove statement on inclusive vs. exclusive bounds

Co-authored-by: Joshua MacDonald <jmacd@users.noreply.github.com>
Cori1109 added a commit to Cori1109/client_golang that referenced this pull request Jan 9, 2023, with the same commit message as the beorn7 commits above.
Labels: metrics (Relates to the Metrics API/SDK)