Improve CKMSQuantiles and address memory leak #755

Merged (3 commits, Jan 30, 2022)

Conversation

@DieBauer (Contributor) commented Jan 22, 2022

CKMSQuantiles was copied from a 2012 implementation whose code admits, in a ‘HACK’ comment, that it has a space leak. This was reported in umbrant/QuantileEstimation#2 but never addressed.
The leak has also been noticed several times in the Prometheus context (#422, #550, #654).
By correctly applying the algorithm from the paper, we fix the leak.

I have added unit tests to show that the class's behaviour is correct.
I have also added a benchmark in the benchmark module showing the difference between the old implementation (moved to the benchmark module) and the new one.

According to my benchmarks, getting a quantile that has ‘seen’ 1 million elements is 440 times faster in the new implementation.
Inserting 1 million elements is 3.5 times faster.

The number of samples needed to keep the accuracy within the error bounds is 80 times smaller than in the previous implementation and, according to manual testing, roughly constant, while in the old implementation it grows with the number of observed items (hence the space leak).
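
For context, these quantile/error pairs are the objectives typically configured on a Summary; a minimal sketch using the public simpleclient API (the metric name and help text are made up):

    import io.prometheus.client.Summary;

    Summary requestLatency = Summary.build()
            .name("request_latency_seconds").help("Request latency in seconds.")
            .quantile(0.50, 0.050)
            .quantile(0.90, 0.010)
            .quantile(0.95, 0.005)
            .quantile(0.99, 0.001)
            .register();
    requestLatency.observe(0.42); // each observation is fed into CKMSQuantiles internally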

New
Q(0,50, 0,050) was 500461,000 (off by 0,000)
Q(0,90, 0,010) was 895809,000 (off by 0,004)
Q(0,95, 0,005) was 947982,000 (off by 0,002)
Q(0,99, 0,001) was 989100,000 (off by 0,001)
Time (ms): 282
# of samples: 41

Old
Q(0,50, 0,050) was 500400,000 (off by 0,000)
Q(0,90, 0,010) was 900874,000 (off by 0,001)
Q(0,95, 0,005) was 950874,000 (off by 0,001)
Q(0,99, 0,001) was 990974,000 (off by 0,001)
Time (ms): 441
# of samples: 3246

While going through the CKMS paper and the Java implementation I have added remarks and snippets
from the paper, to clarify why certain choices are made.

edit: added benchmark results.

Benchmark results

   Benchmark                                             (value)  Mode  Cnt    Score   Error  Units
   CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark       10000  avgt    4    0,476 ± 0,011  ms/op
   CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark      100000  avgt    4    4,794 ± 0,167  ms/op
   CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark     1000000  avgt    4   49,373 ± 2,321  ms/op
   CKMSQuantileBenchmark.ckmsQuantileOldInsertBenchmark    10000  avgt    4    0,398 ± 0,023  ms/op
   CKMSQuantileBenchmark.ckmsQuantileOldInsertBenchmark   100000  avgt    4    6,265 ± 0,253  ms/op
   CKMSQuantileBenchmark.ckmsQuantileOldInsertBenchmark  1000000  avgt    4  184,418 ± 8,570  ms/op
 

   Benchmark                                           Mode  Cnt       Score      Error  Units
   CKMSQuantileBenchmark.ckmsQuantileGetBenchmark      avgt    8     292,048 ±   35,153  ns/op
   CKMSQuantileBenchmark.ckmsQuantileOldGetBenchmark   avgt    4  128559,874 ± 1801,818  ns/op

@fstab (Member) commented Jan 23, 2022

Thanks a lot for putting in the effort and reading through the CKMSQuantiles implementation. This is certainly not an easy piece of code.

The PR contains some refactoring and some changes in functionality. I'm trying to understand the functional changes.

The "hack" you mention in the original implementation is this:

        // NOTE: according to CKMS, this should be count, not size, but this
        // leads
        // to error larger than the error bounds. Leaving it like this is
        // essentially a HACK, and blows up memory, but does "work".
        // int size = count;
        int size = sample.size();
        double minError = size + 1;

Your PR changes it to this:

        int n = count;
        double minError = count;

So basically you enabled the line int size = count; that the original author had commented out.

The other functional changes seem to be in insertBatch(), so I assume the original code has a problem there leading to the "too large errors" that the original author mentioned.

I understand that you aligned the implementation more closely with the pseudo-code in the paper, and wrote tests to prove that this fixes the "too large errors" issue.

I'm wondering: Can you pinpoint what exactly is wrong in the original implementation? Maybe it's not possible to say that, but it would be awesome to know what the error was and how your change fixes it.

@DieBauer (Contributor, Author) commented Jan 25, 2022

Yes, I think the PR can be split into three parts: first, the actual fix of adhering to the algorithm; second, the addition of comments and some refactoring of variable names and unused variables; and third, the addition of a benchmark. I'm not sure where to keep the old implementation, or whether having it as a commit in the git history is enough to drop the 'old' version entirely. Let me know what you think is best.

As for the actual issue of using sample.size() in compress, as far as I can tell it is the following (this becomes a bit hand-wavy, though):

The compress method iterates over the LinkedList iterator returned by sample.listIterator() and, according to some error-bound checks, removes elements. However, calling remove on this iterator changes the underlying LinkedList.
Making the calculation depend on such a 'moving target' causes problems when merging items that belong together.

TL;DR: do not depend on mutable global state.

Example:
Let's say we have an already compressed sample of size 500 (sample.size() == 500), and a new batch of 500 elements is added. After the insertBatch method we have a sample.size() of 1000.

Assume a uniform distribution over [1, 500] and that we are interested in the p99 within 1%.

We start compressing elements by looping over sample.listIterator(). We know that all elements in sample are ordered by value,
so the first sample we encounter has the lowest value.
We check whether we can merge the first and the second element.
We calculate the 'allowableError' for this rank (which is 1, since it is the index in this iterator), returning a value based on the length of the list (0.99 * 1000).
If the answer is positive (see the code for the actual check; there is also a delta in there which suffers from the same flaw we are discussing and makes the situation worse), we sum the g values and remove an element through the iterator. Now the length of the list is one smaller: 999 values.

I don't have this written down in math, but you can convince yourself that this moving sample.size() changes the boundary check for merging the (now first) item with the second one.

We jump right into that merge, with rank 1, and calculate a value based on the length of the list (0.99 * 999).

At some point it is simply not possible for a given item to satisfy an error bound based on this 'moving' sample.size(), even though it should be merged given the original sample size. (I have checked that the sample size does indeed get better, although not as good as it is now, when you pass a fixed size to the allowableError method that does not change during the iteration.)

(Note that there is also a delta in the item which depends on the 'width' and thus indirectly on sample.size().)
In the old implementation this would mean that the two items I{val=996,000, g=1, del=421} and I{val=996,000, g=1, del=401} could never be merged and are thus stuck forever in the sample LinkedList.

All of this is fixed if you use not the rank within the sample list but the rank that the sample represents, and not the sample list size but the fixed number of observed values.
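
To make that concrete, here is a sketch of the error function with the fix applied. The structure and the u/v coefficients follow the original umbrant implementation quoted above; the exact merged code may differ:

    // Sketch: the bound is computed against n, the fixed number of observed values (count),
    // never against sample.size(), which shrinks while compress() removes merged items.
    private double allowableError(int rank, int n) {
        double minError = n + 1;
        for (Quantile q : quantiles) {
            double error;
            if (rank <= q.quantile * n) {
                error = q.u * (n - rank);   // u = 2 * epsilon / (1 - quantile)
            } else {
                error = q.v * rank;         // v = 2 * epsilon / quantile
            }
            if (error < minError) {
                minError = error;
            }
        }
        return minError;
    }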

@fstab (Member) left a comment

Thanks a lot again for the fix and the explanation. I took the time to read the paper, and I think your changes are correct.

I have a couple of minor code review comments, but none of them is related to the algorithm itself.

Please also remove CKMSQuantilesOld, no need to keep that.

  */
- class CKMSQuantiles {
+ public final class CKMSQuantiles { /* public for benchmark-module */
@fstab (Member):

Let's keep it package private; since the benchmark is in the same package, that should work.

@DieBauer (Contributor, Author):

Unfortunately, the benchmark is in a subpackage of client (io.prometheus.client.benchmark), so the class is not accessible unless it is public.

@DieBauer (Contributor, Author):

For now I'll move the benchmark up. Let me know how you want to proceed.

this.quantiles = quantiles;
// hard-coded epsilon of 0.1% to determine the batch size, and default epsilon in case of empty quantiles
double pointOnePercent = 0.001;
if (quantiles.length == 0) { // we need at least one for this algorithm to work
@fstab (Member):

I feel we should rather throw an IllegalArgumentException in that case than use default quantiles. If this constructor is called with an empty array, that is almost certainly a bug and not on purpose.
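
A minimal sketch of that suggestion (illustrative, not the merged code):

    CKMSQuantiles(Quantile... quantiles) {
        if (quantiles == null || quantiles.length == 0) {
            throw new IllegalArgumentException("quantiles must not be empty");
        }
        this.quantiles = quantiles;
    }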

@DieBauer (Contributor, Author):

Agreed. I mainly added this because the comments in Summary state:

  final List<Quantile> quantiles; // Can be empty, but can never be null.

https://github.com/prometheus/client_java/blob/master/simpleclient/src/main/java/io/prometheus/client/Summary.java#L83

I'm not sure what changes for users of Summary that don't have specific quantiles configured, once this restriction is applied.

* g_i is the difference between the lowest
* possible rank of item i and the lowest possible rank of item
* i − 1
*/
public int g;
@fstab (Member):

Again not your code, but g should be final as well.

@DieBauer (Contributor, Author):

This g is written to in the compress method (next.g += prev.g;). I will remove the public modifier from the members in Quantile and Item, though.

@DieBauer (Contributor, Author):

> Thanks a lot again for the fix and the explanation. I took the time to read the paper, and I think your changes are correct.
>
> I have a couple of minor code review comments, but none of them is related to the algorithm itself.
>
> Please also remove CKMSQuantilesOld, no need to keep that.

Thanks for taking the time and effort to review this! I'll address the comments.

Let me know what the git commit strategy is (squash, rebase, ...).

@fstab (Member) commented Jan 29, 2022

I like rebase and squash, so that we have a linear history with a single commit per PR.

@fstab (Member) commented Jan 30, 2022

Thank you!

@fstab (Member) commented Jan 30, 2022

FYI I found a little bug in the million entries test after I merged the PR:

Collections.shuffle(Arrays.asList(shuffle), rand);

Arrays.asList(shuffle) returns a List<int[]>, not a List<Integer>, so the array was never shuffled.
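
For illustration, one way to shuffle a primitive int[] without that pitfall is a manual Fisher-Yates pass (a sketch, not necessarily the fix that was committed):

    import java.util.Random;

    static void shuffle(int[] a, Random rand) {
        // Fisher-Yates: swap each position with a randomly chosen earlier (or same) position.
        for (int i = a.length - 1; i > 0; i--) {
            int j = rand.nextInt(i + 1);
            int tmp = a[i];
            a[i] = a[j];
            a[j] = tmp;
        }
    }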

With an actual random array it turns out that the corner case Quantile(0.0, 0.01) doesn't work.

I fixed the test and changed the constructor of Quantile to make sure the target quantile is strictly > 0.0 and < 1.0, because allowing 0.0 and 1.0 doesn't work.

@fstab (Member) commented Jan 30, 2022

I found another bug, so I reverted the PR on master and moved it to a ckms branch.

The bug is: If you take testGetWithAMillionElements() but remove all quantiles except for the 0.95 quantile (remove them both from the test setup at the beginning of the test as well as from the assertions at the end of the test) then the test fails:

java.lang.AssertionError: quantile=0.95, actual=855053.0, lowerBound=929999.9999999999

I'm not sure at the moment what causes this.

@fstab (Member) commented Jan 30, 2022

If you find what's going wrong, feel free to create a PR for a fix against the ckms branch.

@DieBauer (Contributor, Author):

I have looked into this. In the case of only one quantile, the insertThreshold is too high (hard-coded 500).
This can be mitigated by setting the threshold, and thus compressionIdx, to 1/(2*epsilon) = 25 (as written in the paper); then the test passes.
We could implement this optimization in case of 1 targeted quantile.
Having 2 or more quantiles seems to mitigate this in general.

I have run the same unit-test suite against the 'old' implementation and multiple tests fail, including the edge case you mentioned.

I have a fix for the quantiles of 1.0 and 0.0. These should be allowed. The compress function was too eager and should not have compressed the minimum observed value.
I'll create a PR to address this to the ckms branch.

@DieBauer (Contributor, Author):

I benchmarked the length of the buffer (1/(2*epsilon)) for a given epsilon.

Adding a million elements gives this:

Benchmark                                          (epsilon)  Mode  Cnt    Score    Error  Units
CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark       0.01  avgt    4  151,359 ±  7,421  ms/op
CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark      0.001  avgt    4  174,215 ±  3,788  ms/op
CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark     0.0001  avgt    4  238,256 ± 20,575  ms/op

So with an epsilon of 1 percent the buffer size is 50, while at 0.01 percent it is 5000. At that size, the time it takes to sort the buffer starts to dominate.

The Arrays.sort benchmark looks like this:

Benchmark                   (size)  Mode  Cnt      Score      Error  Units
CKMSQuantileBenchmark.sort      50  avgt    4    182,980 ±    1,044  ns/op
CKMSQuantileBenchmark.sort     500  avgt    4   1988,166 ±  788,125  ns/op
CKMSQuantileBenchmark.sort    5000  avgt    4  30943,722 ± 5770,539  ns/op

Arrays.sort grows super-linearly: sorting 10 times as many elements costs more than 10 times as much.

In the insert benchmark there is also the hidden cost of a smaller epsilon: sample.size() grows faster in order to reach the required precision.

I think it's best to make the buffer size depend on the epsilon, to cater for the case where an epsilon is much bigger than the hard-coded one, and to cap it at 500 so there is not too much performance degradation.
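
Roughly what I have in mind (an illustrative sketch only; the variable names are made up):

    // Derive the insert buffer / compress interval from the smallest requested epsilon,
    // as in the paper (1 / (2 * epsilon)), but cap it at 500 to bound the sorting cost.
    double minEpsilon = 0.001; // assumption: smallest epsilon among the tracked quantiles
    int bufferSize = Math.min(500, Math.max(1, (int) (1.0 / (2 * minEpsilon))));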

@fstab (Member) commented Jan 31, 2022

Shouldn't the algorithm be correct independent of the insertThreshold? The threshold just says how often compress() is called. If the algorithm is correct we should get correct results independent of how often we call compress().

@DieBauer (Contributor, Author):

I'll try to see if I can get a better answer. The insertThreshold indeed only controls when compression happens; whether items can be compressed depends on the number of observed values n in allowableError. n is updated on every insertBatch, and I believe that when batches are inserted more often, and thus the count is updated more often, the error margins are smaller, or closer to reality, than with batches inserted less often.
So compression does indirectly depend on the batch size.

@DieBauer (Contributor, Author) commented Feb 1, 2022

I need to look into this statement in section 6.2, Space Usage for Targeted Quantiles:

> For the random input, the bounds gave guarantees that held with 70% probability to find quantiles that our algorithm found with absolute certainty. For the “hard” input, which attempts to force the worst-case space usage, the probability for the randomized algorithm improved to around 95%, still far short of the low failure rates demanded by network managers. Hence, we do not report further on the results obtained by random sampling for the remainder of this section.

@DieBauer (Contributor, Author) commented Feb 1, 2022

Found it. It was indeed a bug in the implementation, a classic off-by-one error ;).

Thanks for the critical look!

The error was in accounting for a new item in the insertBatch() method. Near the end we do:

            Item newItem = new Item(v, delta);
            it.add(newItem);
            count++;
            item = newItem;

The newly created item is added in the correct position in the list, with it.add. We increase the count of total observed values and then the for-loop takes the next value that should be inserted.
However, the current rank is not increased on insert.

Section 4.1 (Proof) mentions:

> Note that if the new value v is inserted before v_i then r_i increases by 1.

Since g is 1 by default, we can add the newly created item's g, like this: currentRank += item.g;
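
In context, the corrected tail of insertBatch() would look roughly like this (a sketch; the merged diff may differ slightly):

            Item newItem = new Item(v, delta);
            it.add(newItem);
            count++;
            currentRank += newItem.g; // the new value is inserted before the remaining items,
                                      // so their ranks shift by g (g == 1 for a fresh item)
            item = newItem;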

I'll update the PR.

@fstab (Member) commented Feb 1, 2022

Hi, looks like we were both working on it at the same time. I found this as well, and a couple of other issues (subtle rounding errors in the error function, which sounds small, but once a sample is at a wrong position this affects future inserts; doing compress from right to left instead of left to right; etc).

I also added much more fine grained tests that go through the list of samples after each insert() and compress() and make sure that the invariants defined in the paper are satisfied. That way I found some issues that are hard to see if you just test the end result.
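
Such an invariant check could look roughly like this (an illustrative helper, assuming the Item fields and a static f(rank, n) named as in this thread; not the actual test code):

    // After every insert()/compress(), each sample i must satisfy g_i + delta_i <= f(r_i, n),
    // where r_i is the sum of g_j for j <= i and n is the total number of observed values.
    static void assertInvariant(List<Item> samples, int n) {
        int rank = 0;
        for (Item item : samples) {
            rank += item.g;
            if (item.g + item.delta > f(rank, n)) {
                throw new AssertionError("CKMS invariant violated at rank " + rank);
            }
        }
    }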

I also realized that if you read that code, you need to have the paper next to you. So I removed comments that are just paragraphs from the paper, because people will read them in the paper anyway and not in the Javadoc. I also renamed a few things to be aligned with the names in the paper, like allowableError() -> f() or count -> n.

I pushed a commit where my tests are green, but I guess it would be good to have another look. And I did not run any benchmarks.

BTW regarding the 0.0 and 1.0 quantiles: I did not look at this, but the question is what happens if a user just specifies a 0.0 quantile to track the minimum or a 1.0 quantile to track the maximum, and no other quantiles?

@DieBauer (Contributor, Author) commented Feb 2, 2022

Alright, great to hear! I can imagine the amount of work that has to go into testing this properly, which I skipped because of the 'ballpark' figures I got from the high-level tests.

Let me know where I can have a look at your changes!

Regarding defining only a quantile of 0 or 1: yes, these are indeed the edge cases of the minimum and maximum of the value stream.
I'm not sure if anything special happens; of course there is the overhead of going through compress and insert.

@fstab (Member) commented Feb 3, 2022

I thought about the 0 and 1 quantiles again, and it turned out that it was simpler to support this than I thought. I just pushed an update allowing 0 and 1. Thanks a lot for pointing out that this would be useful.

@fstab (Member) commented Feb 3, 2022

I merged it once more, thanks a lot for your contribution!

@DieBauer (Contributor, Author) commented Feb 3, 2022

Nice.

I benchmarked the current implementation: the per-insert cost is constant (within error bounds), while in the old implementation it grew with the number of elements.

     OLD
     Benchmark                                          (value)  Mode  Cnt    Score    Error  Units
     CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark    10000  avgt    4    1,104 ±  0,013  ms/op
     CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark   100000  avgt    4   15,577 ±  2,418  ms/op
     CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark  1000000  avgt    4  366,650 ± 48,997  ms/op
     NEW
     Benchmark                                          (value)  Mode  Cnt    Score    Error  Units
     CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark    10000  avgt    4    1,549 ±  0,360  ms/op
     CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark   100000  avgt    4   16,395 ±  1,138  ms/op
     CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark  1000000  avgt    4  168,386 ± 35,664  ms/op

Lookups are 1 or 2 μs for p95 with 1 million elements.

     Benchmark                                       (value)  Mode  Cnt     Score     Error  Units
     CKMSQuantileBenchmark.ckmsQuantileGetBenchmark    10000  avgt    4  1009,883 ± 127,939  ns/op
     CKMSQuantileBenchmark.ckmsQuantileGetBenchmark   100000  avgt    4  1484,729 ± 136,455  ns/op
     CKMSQuantileBenchmark.ckmsQuantileGetBenchmark  1000000  avgt    4  2245,291 ± 238,582  ns/op

This is because the sample size also increases with the number of elements; see these results:

testInsertBatch(batchSize=128, compressInterval=128, totalNumber=10000)
10000 samples: 77
testInsertBatch(batchSize=128, compressInterval=128, totalNumber=100000)
100000 samples: 92
testInsertBatch(batchSize=128, compressInterval=128, totalNumber=1000000)
1000000 samples: 123

So it is expected that looking up the same percentile in a larger list takes more time.

One small improvement I see is in the f method: you account for double precision rounding and then do a Math.floor followed by a cast to int. This is more expensive than just casting to int, and since we can be sure that u, v, q and r are positive, the result is positive as well, so the outcome does not change.

This change is worthwhile since f is being called in many places (insert, lookup, compress).
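
A tiny example of why dropping Math.floor is safe here (the values are illustrative):

    double x = 2 * 0.005 * 9500 / 0.95;   // a term of the error function f; always >= 0
    int withFloor = (int) Math.floor(x);  // explicit floor, then narrowing cast
    int withoutFloor = (int) x;           // the cast alone truncates toward zero,
                                          // which equals floor() for non-negative values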

Here's a benchmark result after removing the Math.floor call:

CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark    10000  avgt    4    1,119 ± 0,068  ms/op
CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark   100000  avgt    4   11,775 ± 0,698  ms/op
CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark  1000000  avgt    4  133,567 ± 5,849  ms/op

And the f method itself, with count = 10000, r = 9500, and 4 quantiles to loop over:

Benchmark                            Mode  Cnt   Score   Error  Units
CKMSQuantileBenchmark.ckmsQuantileF  avgt    4  37,188 ± 6,681  ns/op
CKMSQuantileBenchmark.ckmsQuantileF  avgt    4  20,500 ± 1,054  ns/op  <= without floor()

I'll create a PR for that.

@fstab (Member) commented Feb 5, 2022

It's released 🎉. Thanks again for this awesome PR!
