Add new metric: t-Digest #2682
Comments
Can you provide more information on how this works? In particular, if I take two snapshots of the produced metrics at arbitrary times, can I calculate the distribution of events in that time period?
Yes. The t-digest can be understood as a clustering of the input data in limited space. If you sort the clusters by their mean value and count the number of items in each cluster, then you get a histogram. The t-digest defines an algorithm for adding new points to the clustering. That algorithm basically either finds an existing cluster and adds the point OR it creates a new cluster. The choice depends on the location of the mean and the current size of the cluster. That same algorithm can be used to merge two clusters.
So, to answer your question, each snapshot is a simple clustering (sorted list of means/counts). That is what is sent from the client and that is what is maintained on the server. Combining two snapshots is as simple as running the tdigest algorithm for merging two clusterings. The result is a new tdigest that can be converted trivially to a histogram OR be queried to get ANY quantile value with high accuracy.
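To make that merging story concrete, here is a toy sketch in Python. It is not the reference t-digest: the helper names `compress` and `merge` are made up for illustration, and a real t-digest bounds cluster counts with a quantile-dependent scale function rather than a fixed limit. It only shows the shape of the data (a sorted list of (mean, count) pairs) and how merging two digests reduces to re-compressing their combined clusters.

```python
# Hypothetical sketch of t-digest-style clustering, NOT the reference
# algorithm. A digest is a sorted list of (mean, count) clusters.

def compress(clusters, max_cluster_size=100):
    """Greedily combine adjacent clusters (sorted by mean) while the
    combined count stays under the size bound."""
    out = []
    for mean, count in sorted(clusters):
        if out and out[-1][1] + count <= max_cluster_size:
            m, c = out[-1]
            total = c + count
            # Weighted mean preserves the total "mass" of the data.
            out[-1] = ((m * c + mean * count) / total, total)
        else:
            out.append((mean, count))
    return out

def merge(digest_a, digest_b, max_cluster_size=100):
    """Merging two digests is just re-compressing their combined clusters."""
    return compress(digest_a + digest_b, max_cluster_size)
```

Note that merging never loses counts: the total count and the count-weighted sum of means are invariant under `compress`, which is why sorted (mean, count) pairs behave like a histogram with movable buckets.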
Derrick
On May 5, 2017, 7:57 AM -0700, Brian Brazil wrote:
> Can you provide more information on how this works? In particular if I take two snapshots of the produced metrics at arbitrary times, can I calculate the distribution of events in that time period?
Prometheus servers only have time series; they won't be tracking any other state.
My question is about comparing snapshots, not combining them.
I am afraid that I don't understand the question. What do you mean "compare"?
Derrick
If I have snapshots at two different times, I want to know the distribution of events that happened between those times. These are the semantics that Histograms give us, and the semantics that any proposed improvement must offer.
The clustering is a snapshot in time. One can index the means and the counts by their offset, so that each is exposed as a single float value per time. However, these individual values cannot be combined in isolation; the entire clustering is needed.
Derrick
The way to think of the clustering from your perspective is that it is a histogram with dynamically changing buckets. So, yes, one can ask for a cumulative count from the t-digest just like one can from a histogram, but the algorithm is different.
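To illustrate the "histogram with dynamically changing buckets" analogy, here is a hedged sketch of querying such a clustering. The helper names are hypothetical, and a real t-digest interpolates within clusters for better accuracy; this coarse version treats each cluster as a point mass at its mean.

```python
# Hypothetical sketch: querying (mean, count) clusters like histogram
# buckets. A real t-digest interpolates within clusters; this does not.

def cumulative_count(clusters, threshold):
    """Approximate count of observations <= threshold by including
    whole clusters whose mean falls at or below it."""
    return sum(count for mean, count in clusters if mean <= threshold)

def quantile(clusters, q):
    """Approximate the q-quantile: walk the clusters in mean order
    until the running count reaches q * total, then return that
    cluster's mean."""
    clusters = sorted(clusters)
    total = sum(count for _, count in clusters)
    target = q * total
    running = 0
    for mean, count in clusters:
        running += count
        if running >= target:
            return mean
    return clusters[-1][0]
```

The key property is that neither query needs bucket boundaries chosen up front: the clusters themselves play the role of buckets, placed wherever the data landed.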
Derrick
Wouldn't that mean that the clusters get less precise over time, as old data will dominate?
For Prometheus, you could treat each sample as a separate clustering, just like you treat a gauge or a counter. So, when the metric is read, the clustering is sent to the server and then it is reset on the client to empty.
In this way, the t-digest is representative of a time period.
Derrick
That's not really useful, as that only supports one scraper. It's also not resilient to failed scrapes. I'd suggest watching https://www.youtube.com/watch?v=67Ulrq6DxwA to get an idea of the model we have for metrics.
It is no less resilient to failed scrapes than any other metric. If you fail to scrape a gauge, you lose the data. There is no difference.
Derrick
But your point about one scraper is valid.
In that case, do not reset the digest. Treat it like a counter.
Derrick
The concern about precision over time is also misplaced. It is no different from a counter in terms of "precision".
Derrick
> It is no less resilient to failed scrapes than any other metric. If you fail to scrape a gauge, you lose the data. There is no difference.

No, your proposal loses data. For other metrics we only lose granularity.

> In that case, then do not reset the digest. Treat it like a counter.

Then it's going to be dominated by older data.
One can take a difference of t-digests in a similar manner to taking differences of cumulative histograms.
Derrick
It's clustering though, so if, 1M requests in, there's an excursion in the next 1k requests, would you spot it?
The result would be no worse than having fixed histogram buckets. That's the point.
Derrick
It would be worse as far as I can tell, as with fixed buckets and counters only data in the time period in question matters. With clustering since the start of the process, I don't believe that'll be the case.
You have constrained the problem such that no limited space algorithm can provide a solution. If that was your intent, then congratulations.
I am looking for a solution to a real problem: providing real quantile information in limited space without prior knowledge of histogram buckets.
If you stopped looking for faults and focused on solutions, then you could advance the utility of Prometheus.
Derrick
My intent is to see if this approach works within our model; it appears that it doesn't. Our constraints are our constraints, and what you're proposing is in my opinion a worse tradeoff than either our Summary quantiles or Histogram.

It's only since the process start, which isn't that useful tactically. That also means it's not actually aggregatable in practice, as instances restart at different times. It also uses notable resources on both the client and server. Summary uses notable resources on the client, requires quantile preselection, and is not aggregatable. Histogram uses notable resources on the server, requires bucket preselection, and is aggregatable.
I have looked at various algorithms recently, including q-digest and t-digest. The general problem with the application for Prometheus is indeed our scrape and computing model: Prometheus collects the state of metrics at arbitrary intervals. It is then possible to calculate rates (from counters) or quantiles (from histograms) over arbitrary time spans (e.g. what's the 99th percentile over the last 10m).

While q-digest and t-digest support merging, the process is not reversible, i.e. calculating the difference between now and 10m ago is not precise enough to allow the calculation we need for Prometheus.

I still need to look at t-digest in more detail, and I have a few ideas about algorithms that would work for Prometheus, but all of this needs more work before I can write something up. I cannot give an ETA at the moment in view of my remaining workload.
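For contrast, the reversibility that fixed cumulative buckets do provide can be sketched as follows. This is a simplified illustration with hypothetical helper names, not PromQL: the real `histogram_quantile` function interpolates within the bucket, whereas this version just returns the bucket's upper bound.

```python
# Hypothetical sketch of why fixed-bucket cumulative histograms support
# "now minus 10m ago": bucket counters can be subtracted per bucket, and
# a quantile can then be read off the differenced histogram.

def histogram_diff(now, earlier):
    """Per-bucket difference of two cumulative histograms, given as
    {upper_bound: cumulative_count} dicts with identical bucket layouts."""
    return {le: now[le] - earlier[le] for le in now}

def histogram_quantile(q, cumulative):
    """Return the first bucket upper bound whose cumulative count
    reaches q * total (no interpolation, for brevity)."""
    bounds = sorted(cumulative)
    target = q * cumulative[bounds[-1]]
    for le in bounds:
        if cumulative[le] >= target:
            return le
    return bounds[-1]
```

Because each bucket's counter only ever increases, the per-bucket difference is itself a valid cumulative histogram for the time window; it is exactly this subtraction step that digest-style summaries do not support precisely.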
derrickburns commented May 5, 2017
The t-digest is a structure that provides the expressiveness of both histograms and summaries while overcoming the deficiencies of both. Specifically, it is a structure that supports aggregation and quantiles with high accuracy.