Improve CKMSQuantiles and address memory leak #755

Merged (3 commits, Jan 30, 2022)

Conversation

@DieBauer (Contributor) commented Jan 22, 2022

CKMSQuantiles was copied from a 2012 implementation whose code admits, in a ‘HACK’ comment, that it has a space leak. This was reported in umbrant/QuantileEstimation#2 but never addressed.
The leak has also been noticed several times in the Prometheus context (#422, #550, #654).
By correctly applying the algorithm from the paper, we fix the leak.

I have added unit tests to show that the class's behaviour is correct.
I have also added a benchmark in the benchmark module showing the difference between the old implementation (moved to the benchmark module) and the new one.

According to my benchmarks, getting a quantile that has ‘seen’ 1 million elements is 440 times faster in the new implementation.
Inserting 1 million elements is 3.5 times faster.

The number of samples needed to keep the accuracy within the error bounds is 80 times smaller than in the previous implementation and, according to manual testing, roughly constant, while in the old implementation it grows with the number of observed items (hence the space leak).
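
For context, these quantile/error pairs are the objectives typically configured on a Summary; a minimal sketch using the public simpleclient API (the metric name and help text are made up):

    import io.prometheus.client.Summary;

    Summary requestLatency = Summary.build()
            .name("request_latency_seconds").help("Request latency in seconds.")
            .quantile(0.50, 0.050)
            .quantile(0.90, 0.010)
            .quantile(0.95, 0.005)
            .quantile(0.99, 0.001)
            .register();
    requestLatency.observe(0.42); // each observation is fed into CKMSQuantiles internally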

New
Q(0,50, 0,050) was 500461,000 (off by 0,000)
Q(0,90, 0,010) was 895809,000 (off by 0,004)
Q(0,95, 0,005) was 947982,000 (off by 0,002)
Q(0,99, 0,001) was 989100,000 (off by 0,001)
Time (ms): 282
# of samples: 41

Old
Q(0,50, 0,050) was 500400,000 (off by 0,000)
Q(0,90, 0,010) was 900874,000 (off by 0,001)
Q(0,95, 0,005) was 950874,000 (off by 0,001)
Q(0,99, 0,001) was 990974,000 (off by 0,001)
Time (ms): 441
# of samples: 3246

While going through the CKMS paper and the Java implementation I have added remarks and snippets
from the paper, to clarify why certain choices are made.

edit: added benchmark results.

Benchmark results

   Benchmark                                             (value)  Mode  Cnt    Score   Error  Units
   CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark       10000  avgt    4    0,476 ± 0,011  ms/op
   CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark      100000  avgt    4    4,794 ± 0,167  ms/op
   CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark     1000000  avgt    4   49,373 ± 2,321  ms/op
   CKMSQuantileBenchmark.ckmsQuantileOldInsertBenchmark    10000  avgt    4    0,398 ± 0,023  ms/op
   CKMSQuantileBenchmark.ckmsQuantileOldInsertBenchmark   100000  avgt    4    6,265 ± 0,253  ms/op
   CKMSQuantileBenchmark.ckmsQuantileOldInsertBenchmark  1000000  avgt    4  184,418 ± 8,570  ms/op
 

   Benchmark                                           Mode  Cnt       Score      Error  Units
   CKMSQuantileBenchmark.ckmsQuantileGetBenchmark      avgt    8     292,048 ±   35,153  ns/op
   CKMSQuantileBenchmark.ckmsQuantileOldGetBenchmark   avgt    4  128559,874 ± 1801,818  ns/op

@fstab (Member) commented Jan 23, 2022

Thanks a lot for putting in the effort and reading through the CKMSQuantiles implementation. This is certainly not an easy piece of code.

The PR contains some refactoring and some changes in functionality. I'm trying to understand the functional changes.

The "hack" you mention in the original implementation is this:

        // NOTE: according to CKMS, this should be count, not size, but this
        // leads
        // to error larger than the error bounds. Leaving it like this is
        // essentially a HACK, and blows up memory, but does "work".
        // int size = count;
        int size = sample.size();
        double minError = size + 1;

Your PR changes it to this:

        int n = count;
        double minError = count;

So basically you enabled the line int size = count; that the original author had commented out.

The other functional changes seem to be in insertBatch(), so I assume the original code has a problem there leading to the "too large errors" that the original author mentioned.

I understand that you aligned the implementation more closely with the pseudo-code in the paper, and wrote tests to prove that this fixes the "too large errors" issue.

I'm wondering: Can you pinpoint what exactly is wrong in the original implementation? Maybe it's not possible to say that, but it would be awesome to know what the error was and how your change fixes it.

@DieBauer (Contributor, Author) commented Jan 25, 2022

Yes, I think the PR can be split into three parts: first, the actual fix of adhering to the algorithm; second, the addition of comments and some refactoring of variable names and unused variables; and third, the addition of a benchmark. I'm not sure where to keep the old implementation, or whether having it as a commit in the git history is enough to drop the 'old' version entirely. Let me know what you think is best.

As for the actual issue of using sample.size() in compress, as far as I can tell it is the following (this becomes a bit hand-wavy, though):

The compress method iterates over the LinkedList iterator returned by sample.listIterator() and, according to some error-bound checks, removes elements. However, calling remove on this iterator changes the underlying LinkedList.
Making the calculation depend on such a 'moving target' causes problems when merging items that belong together.

TL;DR: do not depend on mutable global state.

Example:
Let's say we have an already compressed sample of size 500 (sample.size() == 500), and a new batch of 500 elements is added. After the insertBatch method we have a sample.size() of 1000.

Assume a uniform distribution over [1, 500] and that we are interested in the p99 within 1%.

We start compressing elements by looping over sample.listIterator(). We know that all elements in sample are ordered by value,
so the first sample we encounter has the lowest value.
We check whether we can merge the first and the second element.
We calculate the 'allowableError' for this rank (which is 1, since it is the index in this iterator), returning a value based on the length of the list (0.99 * 1000).
If the answer is positive (see the code for the actual check; there is also a delta in there which suffers from the same flaw we are discussing and makes the situation worse), we sum the g values and remove an element through the iterator. Now the length of the list is one smaller: 999 values.

I don't have this written down in math, but you can convince yourself that this moving sample.size() changes the boundary check for merging the (now first) item with the second one.

We jump right into that merge, with rank 1, and calculate a value based on the length of the list (0.99 * 999).

At some point it is simply not possible for a given item to satisfy an error bound based on this 'moving' sample.size(), even though it should be merged given the original sample size. (I have checked that the sample size does indeed get better, although not as good as it is now, when you pass a fixed size to the allowableError method that does not change during the iteration.)

(Note that there is also a delta in the item which depends on the 'width' and thus indirectly on sample.size().)
In the old implementation this would mean that the two items I{val=996,000, g=1, del=421} and I{val=996,000, g=1, del=401} could never be merged and are thus stuck forever in the sample LinkedList.

All of this is fixed if you use not the rank within the sample list but the rank that the sample represents, and not the sample list size but the fixed number of observed values.
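
To make that concrete, here is a sketch of the error function with the fix applied. The structure and the u/v coefficients follow the original umbrant implementation quoted above; the exact merged code may differ:

    // Sketch: the bound is computed against n, the fixed number of observed values (count),
    // never against sample.size(), which shrinks while compress() removes merged items.
    private double allowableError(int rank, int n) {
        double minError = n + 1;
        for (Quantile q : quantiles) {
            double error;
            if (rank <= q.quantile * n) {
                error = q.u * (n - rank);   // u = 2 * epsilon / (1 - quantile)
            } else {
                error = q.v * rank;         // v = 2 * epsilon / quantile
            }
            if (error < minError) {
                minError = error;
            }
        }
        return minError;
    }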

@fstab (Member) left a comment

Thanks a lot again for the fix and the explanation. I took the time to read the paper, and I think your changes are correct.

I have a couple of minor code review comments, but none of them is related to the algorithm itself.

Please also remove CKMSQuantilesOld, no need to keep that.

  */
- class CKMSQuantiles {
+ public final class CKMSQuantiles { /* public for benchmark-module */
@fstab (Member):

Let's keep it package private; since the benchmark is in the same package, that should work.

@DieBauer (Contributor, Author):

Unfortunately, the benchmark is in a subpackage of client (io.prometheus.client.benchmark), so the class is not accessible unless it is public.

@DieBauer (Contributor, Author):

For now I'll move the benchmark up. Let me know how you want to proceed.

this.quantiles = quantiles;
// hard-coded epsilon of 0.1% to determine the batch size, and default epsilon in case of empty quantiles
double pointOnePercent = 0.001;
if (quantiles.length == 0) { // we need at least one for this algorithm to work
@fstab (Member):

I feel we should rather throw an IllegalArgumentException in that case than use default quantiles. If this constructor is called with an empty array, that is almost certainly a bug and not on purpose.
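
A minimal sketch of that suggestion (illustrative, not the merged code):

    CKMSQuantiles(Quantile... quantiles) {
        if (quantiles == null || quantiles.length == 0) {
            throw new IllegalArgumentException("quantiles must not be empty");
        }
        this.quantiles = quantiles;
    }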

@DieBauer (Contributor, Author):

Agreed. I mainly added this because the comments in Summary state:

  final List<Quantile> quantiles; // Can be empty, but can never be null.

https://github.com/prometheus/client_java/blob/master/simpleclient/src/main/java/io/prometheus/client/Summary.java#L83

I'm not sure what changes for users of Summary that don't have specific quantiles configured, once this restriction is applied.

* g_i is the difference between the lowest
* possible rank of item i and the lowest possible rank of item
* i − 1
*/
public int g;
@fstab (Member):

Again not your code, but g should be final as well.

@DieBauer (Contributor, Author):

This g is written to in the compress method (next.g += prev.g;). I will remove the public modifier from the members in Quantile and Item, though.

@DieBauer (Contributor, Author):

> Thanks a lot again for the fix and the explanation. I took the time to read the paper, and I think your changes are correct.
>
> I have a couple of minor code review comments, but none of them is related to the algorithm itself.
>
> Please also remove CKMSQuantilesOld, no need to keep that.

Thanks for taking the time and effort to review this! I'll address the comments.

Let me know what the git commit strategy is (squash, rebase, ...).

@fstab (Member) commented Jan 29, 2022

I like rebase and squash, so that we have a linear history with a single commit per PR.

@fstab (Member) commented Jan 30, 2022

Thank you!

@fstab (Member) commented Jan 30, 2022

FYI I found a little bug in the million entries test after I merged the PR:

Collections.shuffle(Arrays.asList(shuffle), rand);

Arrays.asList(shuffle) returns a List<int[]>, not a List<Integer>, so the array was never shuffled.
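
For illustration, one way to shuffle a primitive int[] without that pitfall is a manual Fisher-Yates pass (a sketch, not necessarily the fix that was committed):

    import java.util.Random;

    static void shuffle(int[] a, Random rand) {
        // Fisher-Yates: swap each position with a randomly chosen earlier (or same) position.
        for (int i = a.length - 1; i > 0; i--) {
            int j = rand.nextInt(i + 1);
            int tmp = a[i];
            a[i] = a[j];
            a[j] = tmp;
        }
    }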

With an actual random array it turns out that the corner case Quantile(0.0, 0.01) doesn't work.

I fixed the test and changed the constructor of Quantile to make sure the target quantile is strictly > 0.0 and < 1.0, because allowing 0.0 and 1.0 doesn't work.

@fstab (Member) commented Jan 30, 2022

I found another bug, so I reverted the PR on master and moved it to a ckms branch.

The bug is: If you take testGetWithAMillionElements() but remove all quantiles except for the 0.95 quantile (remove them both from the test setup at the beginning of the test as well as from the assertions at the end of the test) then the test fails:

java.lang.AssertionError: quantile=0.95, actual=855053.0, lowerBound=929999.9999999999

I'm not sure at the moment what causes this.

@fstab (Member) commented Jan 30, 2022

If you find what's going wrong, feel free to create a PR for a fix against the ckms branch.

@DieBauer (Contributor, Author):

I have looked into this. In the case of only one quantile, the insertThreshold is too high (hard-coded 500).
This can be mitigated by setting the threshold, and thus compressionIdx, to 1/(2*epsilon) = 25 (as written in the paper); then the test passes.
We could implement this optimization in case of 1 targeted quantile.
Having 2 or more quantiles seems to mitigate this in general.

I have run the same unit-test suite against the 'old' implementation and multiple tests fail, including the edge case you mentioned.

I have a fix for the quantiles of 1.0 and 0.0. These should be allowed. The compress function was too eager and should not have compressed the minimum observed value.
I'll create a PR to address this to the ckms branch.

@DieBauer (Contributor, Author):

I benchmarked the length of the buffer (1/(2*epsilon)) for a given epsilon.

Adding a million elements gives this:

Benchmark                                          (epsilon)  Mode  Cnt    Score    Error  Units
CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark       0.01  avgt    4  151,359 ±  7,421  ms/op
CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark      0.001  avgt    4  174,215 ±  3,788  ms/op
CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark     0.0001  avgt    4  238,256 ± 20,575  ms/op

So with an epsilon of 1 percent the buffer size is 50, while at 0.01 percent it is 5000. At that size, the time it takes to sort the buffer starts to dominate.

The Arrays.sort benchmark looks like this:

Benchmark                   (size)  Mode  Cnt      Score      Error  Units
CKMSQuantileBenchmark.sort      50  avgt    4    182,980 ±    1,044  ns/op
CKMSQuantileBenchmark.sort     500  avgt    4   1988,166 ±  788,125  ns/op
CKMSQuantileBenchmark.sort    5000  avgt    4  30943,722 ± 5770,539  ns/op

Arrays.sort grows super-linearly: sorting 10 times as many elements costs more than 10 times as much.

In the insert benchmark there is also the hidden cost of a smaller epsilon: sample.size() grows faster in order to reach the required precision.

I think it's best to make the buffer size depend on the epsilon, to cater for the case where an epsilon is much bigger than the hard-coded one, and to cap it at 500 so there is not too much performance degradation.
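
Roughly what I have in mind (an illustrative sketch only; the variable names are made up):

    // Derive the insert buffer / compress interval from the smallest requested epsilon,
    // as in the paper (1 / (2 * epsilon)), but cap it at 500 to bound the sorting cost.
    double minEpsilon = 0.001; // assumption: smallest epsilon among the tracked quantiles
    int bufferSize = Math.min(500, Math.max(1, (int) (1.0 / (2 * minEpsilon))));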

@fstab (Member) commented Jan 31, 2022

Shouldn't the algorithm be correct independent of the insertThreshold? The threshold just says how often compress() is called. If the algorithm is correct we should get correct results independent of how often we call compress().

@DieBauer (Contributor, Author):

I'll try to see if I can get a better answer. The insertThreshold indeed only controls when compression happens; whether items can be compressed depends on the number of observed values n in allowableError. n is updated on every insertBatch, and I believe that when batches are inserted more often, and thus the count is updated more often, the error margins are smaller, or closer to reality, than with batches inserted less often.
So compression does indirectly depend on the batch size.

@DieBauer (Contributor, Author) commented Feb 1, 2022

I need to look into this statement in section 6.2, Space Usage for Targeted Quantiles:

> For the random input, the bounds gave guarantees that held with 70% probability to find quantiles that our algorithm found with absolute certainty. For the “hard” input, which attempts to force the worst-case space usage, the probability for the randomized algorithm improved to around 95%, still far short of the low failure rates demanded by network managers. Hence, we do not report further on the results obtained by random sampling for the remainder of this section.

@DieBauer (Contributor, Author) commented Feb 1, 2022

Found it. It was indeed a bug in the implementation, a classic off-by-one error ;).

Thanks for the critical look!

The error was in accounting for a new item in the insertBatch() method. Near the end we do:

            Item newItem = new Item(v, delta);
            it.add(newItem);
            count++;
            item = newItem;

The newly created item is added in the correct position in the list, with it.add. We increase the count of total observed values and then the for-loop takes the next value that should be inserted.
However, the current rank is not increased on insert.

Section 4.1 (Proof) mentions:

> Note that if the new value v is inserted before v_i then r_i increases by 1.

Since g is 1 by default, we can add the newly created item's g, like this: currentRank += item.g;
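
In context, the corrected tail of insertBatch() would look roughly like this (a sketch; the merged diff may differ slightly):

            Item newItem = new Item(v, delta);
            it.add(newItem);
            count++;
            currentRank += newItem.g; // the new value is inserted before the remaining items,
                                      // so their ranks shift by g (g == 1 for a fresh item)
            item = newItem;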

I'll update the PR.

@fstab (Member) commented Feb 1, 2022

Hi, looks like we were both working on it at the same time. I found this as well, and a couple of other issues (subtle rounding errors in the error function, which sounds small, but once a sample is at a wrong position this affects future inserts; doing compress from right to left instead of left to right; etc).

I also added much more fine grained tests that go through the list of samples after each insert() and compress() and make sure that the invariants defined in the paper are satisfied. That way I found some issues that are hard to see if you just test the end result.
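
Such an invariant check could look roughly like this (an illustrative helper, assuming the Item fields and a static f(rank, n) named as in this thread; not the actual test code):

    // After every insert()/compress(), each sample i must satisfy g_i + delta_i <= f(r_i, n),
    // where r_i is the sum of g_j for j <= i and n is the total number of observed values.
    static void assertInvariant(List<Item> samples, int n) {
        int rank = 0;
        for (Item item : samples) {
            rank += item.g;
            if (item.g + item.delta > f(rank, n)) {
                throw new AssertionError("CKMS invariant violated at rank " + rank);
            }
        }
    }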

I also realized that if you read that code, you need to have the paper next to you. So I removed comments that are just paragraphs from the paper, because people will read them in the paper anyway and not in the Javadoc. I also renamed a few things to be aligned with the names in the paper, like allowableError() -> f() or count -> n.

I pushed a commit where my tests are green, but I guess it would be good to have another look. And I did not run any benchmarks.

BTW regarding the 0.0 and 1.0 quantiles: I did not look at this, but the question is what happens if a user just specifies a 0.0 quantile to track the minimum or a 1.0 quantile to track the maximum, and no other quantiles?

@DieBauer (Contributor, Author) commented Feb 2, 2022

Alright, great to hear! I can imagine the amount of work that has to go into testing this properly, which I skipped because of the 'ballpark' figures I got from the high-level tests.

Let me know where I can have a look at your changes!

Regarding defining only a quantile of 0 or 1: yes, these are indeed the edge cases of the minimum and maximum of the value stream.
I'm not sure if anything special happens; of course there is the overhead of going through compress and insert.

@fstab (Member) commented Feb 3, 2022

I thought about the 0 and 1 quantiles again, and it turned out that it was simpler to support this than I thought. I just pushed an update allowing 0 and 1. Thanks a lot for pointing out that this would be useful.

@fstab (Member) commented Feb 3, 2022

I merged it once more, thanks a lot for your contribution!

@DieBauer (Contributor, Author) commented Feb 3, 2022

Nice.

I benchmarked the current implementation: the per-insert cost is constant (within error bounds), while in the old implementation it grew with the number of elements.

     OLD
     Benchmark                                          (value)  Mode  Cnt    Score    Error  Units
     CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark    10000  avgt    4    1,104 ±  0,013  ms/op
     CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark   100000  avgt    4   15,577 ±  2,418  ms/op
     CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark  1000000  avgt    4  366,650 ± 48,997  ms/op
     NEW
     Benchmark                                          (value)  Mode  Cnt    Score    Error  Units
     CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark    10000  avgt    4    1,549 ±  0,360  ms/op
     CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark   100000  avgt    4   16,395 ±  1,138  ms/op
     CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark  1000000  avgt    4  168,386 ± 35,664  ms/op

Lookups are 1 or 2 μs for p95 with 1 million elements.

     Benchmark                                       (value)  Mode  Cnt     Score     Error  Units
     CKMSQuantileBenchmark.ckmsQuantileGetBenchmark    10000  avgt    4  1009,883 ± 127,939  ns/op
     CKMSQuantileBenchmark.ckmsQuantileGetBenchmark   100000  avgt    4  1484,729 ± 136,455  ns/op
     CKMSQuantileBenchmark.ckmsQuantileGetBenchmark  1000000  avgt    4  2245,291 ± 238,582  ns/op

This is because the sample size also increases with the number of elements; see these results:

testInsertBatch(batchSize=128, compressInterval=128, totalNumber=10000)
10000 samples: 77
testInsertBatch(batchSize=128, compressInterval=128, totalNumber=100000)
100000 samples: 92
testInsertBatch(batchSize=128, compressInterval=128, totalNumber=1000000)
1000000 samples: 123

So it is expected that looking up the same percentile in a larger list takes more time.

One small improvement I see is in the f method: you account for double precision rounding and then do a Math.floor followed by a cast to int. This is more expensive than just casting to int, and since we can be sure that u, v, q and r are positive, the result is positive as well, so the outcome does not change.

This change is worthwhile since f is being called in many places (insert, lookup, compress).
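
A tiny example of why dropping Math.floor is safe here (the values are illustrative):

    double x = 2 * 0.005 * 9500 / 0.95;   // a term of the error function f; always >= 0
    int withFloor = (int) Math.floor(x);  // explicit floor, then narrowing cast
    int withoutFloor = (int) x;           // the cast alone truncates toward zero,
                                          // which equals floor() for non-negative values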

Here's a benchmark result after removing the Math.floor call:

CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark    10000  avgt    4    1,119 ± 0,068  ms/op
CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark   100000  avgt    4   11,775 ± 0,698  ms/op
CKMSQuantileBenchmark.ckmsQuantileInsertBenchmark  1000000  avgt    4  133,567 ± 5,849  ms/op

And the f method itself, with count = 10000, r = 9500, and 4 quantiles to loop over:

Benchmark                            Mode  Cnt   Score   Error  Units
CKMSQuantileBenchmark.ckmsQuantileF  avgt    4  37,188 ± 6,681  ns/op
CKMSQuantileBenchmark.ckmsQuantileF  avgt    4  20,500 ± 1,054  ns/op  <= without floor()

I'll create a PR for that.

@fstab (Member) commented Feb 5, 2022

It's released 🎉. Thanks again for this awesome PR!
