
Add new metric: t-Digest #2682

Open · derrickburns opened this issue May 5, 2017 · 21 comments

derrickburns commented May 5, 2017

The t-digest is a data structure that provides the expressiveness of both histograms and summaries while overcoming the deficiencies of both: it supports aggregation and yields quantiles with high accuracy.
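For readers unfamiliar with the structure, here is a minimal sketch in Go, loosely following Dunning and Ertl's merging t-digest: observations are clustered into weighted centroids whose permitted size shrinks toward the extreme quantiles, which is what keeps tail estimates accurate. All names and the compression heuristic below are illustrative, not any particular library's API.

```go
// Illustrative t-digest sketch (not a production implementation).
package main

import (
	"fmt"
	"math"
	"sort"
)

// Centroid is a cluster of nearby observations: its mean and its weight.
type Centroid struct {
	Mean  float64
	Count float64
}

// TDigest keeps a bounded list of centroids; Compression (often called
// delta) trades memory for accuracy.
type TDigest struct {
	Compression float64
	centroids   []Centroid
	total       float64
}

// Add buffers the observation as a weight-1 centroid and compresses
// once the buffer grows too large.
func (t *TDigest) Add(x float64) {
	t.centroids = append(t.centroids, Centroid{Mean: x, Count: 1})
	t.total++
	if float64(len(t.centroids)) > 10*t.Compression {
		t.compress()
	}
}

// compress sorts centroids and greedily merges neighbors. The size
// limit 4*n*q*(1-q)/delta is largest at the median (q=0.5) and tiny at
// the tails, so tail quantiles stay sharp.
func (t *TDigest) compress() {
	sort.Slice(t.centroids, func(i, j int) bool {
		return t.centroids[i].Mean < t.centroids[j].Mean
	})
	out := t.centroids[:0:0]
	cum := 0.0 // weight to the left of the current centroid
	for _, c := range t.centroids {
		if len(out) > 0 {
			last := &out[len(out)-1]
			q := (cum + c.Count/2) / t.total
			limit := 4 * t.total * q * (1 - q) / t.Compression
			if last.Count+c.Count <= limit {
				// Merge c into the previous centroid (weighted mean).
				last.Mean += (c.Mean - last.Mean) * c.Count / (last.Count + c.Count)
				last.Count += c.Count
				cum += c.Count
				continue
			}
		}
		out = append(out, c)
		cum += c.Count
	}
	t.centroids = out
}

// Quantile estimates the q-th quantile as the mean of the centroid
// that straddles position q*total (no interpolation, to stay short).
func (t *TDigest) Quantile(q float64) float64 {
	if t.total == 0 {
		return math.NaN()
	}
	t.compress()
	target := q * t.total
	cum := 0.0
	for _, c := range t.centroids {
		cum += c.Count
		if cum >= target {
			return c.Mean
		}
	}
	return t.centroids[len(t.centroids)-1].Mean
}

func main() {
	d := &TDigest{Compression: 100}
	for i := 0; i < 100000; i++ {
		d.Add(math.Exp(float64(i%1000) / 100)) // skewed sample data
	}
	fmt.Printf("p50≈%.1f p99≈%.1f\n", d.Quantile(0.5), d.Quantile(0.99))
}
```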

brian-brazil commented May 5, 2017

Can you provide more information on how this works? In particular, if I take two snapshots of the produced metrics at arbitrary times, can I calculate the distribution of events in that time period?

derrickburns commented May 5, 2017

brian-brazil commented May 5, 2017

> That is what is sent from the client and that is what is maintained on the server.

Prometheus servers only have timeseries; they won't be tracking any other state.

> Combining two snapshots is as simple as running the t-digest algorithm for merging two clusterings.

My question is about comparing snapshots, not combining them.
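To make the combine-versus-compare distinction concrete, a sketch of merging, reusing the illustrative TDigest type above: merging is a cheap one-way fold of centroid lists, and nothing analogous exists in the reverse direction.

```go
// Merge folds other's centroids into t and re-compresses. This is the
// aggregation the t-digest supports. Note there is no inverse: once
// centroids are merged, the observations that arrived between two
// snapshots can no longer be separated out, which is what "comparing
// snapshots" would require.
func (t *TDigest) Merge(other *TDigest) {
	t.centroids = append(t.centroids, other.centroids...)
	t.total += other.total
	t.compress()
}
```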

derrickburns commented May 5, 2017

brian-brazil commented May 5, 2017

If I have snapshots at two different times, I want to know the distribution of events that happened between those times.

These are the semantics that Histograms give us, and the semantics that any proposed improvement must offer.
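A toy illustration of these histogram semantics, with invented bucket bounds and counts: because Prometheus bucket counters are cumulative and only increase, subtracting two scrapes yields exactly the events in the window, which is what expressions like `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[10m]))` rely on.

```go
// Toy demonstration: cumulative histogram buckets support snapshot
// differencing. All numbers are made up for illustration.
package main

import "fmt"

func main() {
	bounds := []float64{0.1, 0.5, 1.0, 5.0} // upper bounds ("le")
	t0 := map[float64]float64{0.1: 100, 0.5: 400, 1.0: 450, 5.0: 500}
	t1 := map[float64]float64{0.1: 130, 0.5: 700, 1.0: 790, 5.0: 800}

	// Because bucket counters only increase, the per-bucket difference
	// is exactly the distribution of events between the two scrapes.
	for _, le := range bounds {
		fmt.Printf("le=%g: %g events in the window\n", le, t1[le]-t0[le])
	}
}
```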

derrickburns commented May 5, 2017

derrickburns commented May 5, 2017

brian-brazil commented May 5, 2017

Wouldn't that mean that the clusters get less precise over time, as old data will dominate?

derrickburns commented May 5, 2017

brian-brazil commented May 5, 2017

That's not really useful, as it only supports one scraper. It's also not resilient to failed scrapes.

I'd suggest watching https://www.youtube.com/watch?v=67Ulrq6DxwA to get an idea of the model we have for metrics.

derrickburns commented May 5, 2017

derrickburns commented May 5, 2017

derrickburns commented May 5, 2017

brian-brazil commented May 5, 2017

> It is no less resilient to failed scrapes than any other metric. If you fail to scrape a gauge, you lose the data. There is no difference.

No, your proposal loses data. For other metrics we only lose granularity.

> In that case, do not reset the digest. Treat it like a counter.

Then it's going to be dominated by older data.

derrickburns commented May 5, 2017

brian-brazil commented May 5, 2017

It's clustering though, so if, 1M requests in, there's an excursion in the next 1k requests, would you spot it?
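Back-of-the-envelope arithmetic on this scenario, using only the 1M and 1k figures from the question: a digest accumulated since process start weights the excursion by its share of all observations, so only the most extreme quantiles can see it.

```go
// Rough arithmetic on brian-brazil's scenario: after 1M requests, a
// burst of 1k anomalous requests is ~0.1% of the digest's weight.
package main

import "fmt"

func main() {
	baseline, burst := 1_000_000.0, 1_000.0
	share := burst / (baseline + burst) // ≈ 0.000999
	fmt.Printf("burst is %.3f%% of total weight\n", share*100)
	// Only quantiles above ~p99.9 can even reach into the burst, so a
	// latency excursion would be invisible at p95 or p99.
	fmt.Printf("burst occupies quantile range above ~p%.1f\n", (1-share)*100)
}
```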

derrickburns commented May 5, 2017

brian-brazil commented May 5, 2017

It would be worse as far as I can tell, as with fixed buckets and counters, only data in the time period in question matters.

With clustering since the start of the process, I don't believe that'll be the case.

derrickburns commented May 5, 2017

brian-brazil commented May 6, 2017

My intent is to see if this approach works within our model; it appears that it doesn't.

Our constraints are our constraints, and what you're proposing is, in my opinion, a worse tradeoff than either our Summary quantiles or our Histograms. It only covers the time since process start, which isn't that useful tactically. That also means it's not actually aggregatable in practice, as instances restart at different times. It also uses notable resources on both the client and the server.

Summary uses notable resources on the client, requires quantile preselection, and is not aggregatable.

Histogram uses notable resources on the server, requires bucket preselection, and is aggregatable.

beorn7 commented May 8, 2017

I have looked at various algorithms recently, including q-digest and t-digest. The general problem with applying them to Prometheus is indeed our scrape and computing model: Prometheus collects the state of metrics at arbitrary intervals. It is then possible to calculate rates (from counters) or quantiles (from histograms) over arbitrary time spans (e.g. what's the 99th percentile over the last 10m?). While q-digest and t-digest support merging, the process is not reversible, i.e. calculating the difference between now and 10m ago is not precise enough to allow the calculations we need for Prometheus.

I still need to look at t-digest in more detail, and I have a few ideas about algorithms that would work for Prometheus, but all of this needs more work before I can write something up. I cannot give an ETA at the moment, in view of my remaining workload.
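A sketch of this irreversibility, reusing the illustrative TDigest from the first comment; `before` and `after` are hypothetical snapshots 10m apart. Compression entangles the window's new observations with older centroids, so there is no analogue of subtracting two cumulative histogram snapshots.

```go
// demoIrreversible shows why "now minus 10m ago" is not computable for
// t-digests the way it is for cumulative histogram buckets.
func demoIrreversible() {
	before := &TDigest{Compression: 100}
	for i := 0; i < 10000; i++ {
		before.Add(float64(i % 100)) // steady baseline latencies
	}
	after := &TDigest{Compression: 100}
	after.Merge(before) // "after" starts as a copy of "before"
	for i := 0; i < 1000; i++ {
		after.Add(1000) // a burst of slow requests in the window
	}
	// Compression has merged the burst into centroids that also carry
	// baseline weight, and has moved centroid means. There is no
	// per-centroid subtraction that recovers the window's distribution.
	fmt.Printf("before: %d centroids, after: %d centroids\n",
		len(before.centroids), len(after.centroids))
}
```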
