Switch to LogLog-Beta Cardinality estimation #8512
Conversation
The new algorithm uses only one formula and needs no additional bias corrections over the entire range of cardinalities; it is therefore more efficient and simpler to implement. Our simulations show that the accuracy of the new algorithm is as good as or better than that of either HyperLogLog or HyperLogLog++. The sparse representation was kept to provide better accuracy at low cardinalities; however, the linear counting and range estimations are replaced.
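For context, here is a minimal sketch of that single formula (in Go, following the LogLog-Beta paper rather than any code in this PR; the function name, the alpha approximation, and passing beta as a parameter are my own choices). The bias correction is folded into a precision-specific polynomial beta(z) of the number of empty registers z:

```go
package sketch // illustrative only

// betaEstimate sketches the LogLog-Beta cardinality estimate from
// arXiv:1612.02284; it is not the code in this PR.
// registers holds the m = 2^p HLL registers; beta is the precision-specific
// polynomial in z (the count of zero-valued registers) fitted by regression.
func betaEstimate(registers []uint8, beta func(z float64) float64) float64 {
	m := float64(len(registers))
	var z, sum float64
	for _, r := range registers {
		if r == 0 {
			z++
		}
		sum += 1.0 / float64(uint64(1)<<r) // 2^{-register value}
	}
	alpha := 0.7213 / (1 + 1.079/m) // standard HyperLogLog alpha_m approximation for large m
	return alpha * m * (m - z) / (beta(z) + sum)
}
```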
@seiflotfy thanks for this, sorry for the delay. I plan to look at this soon; I was on vacation last week.
No worries @e-dard, I am working on a second implementation that builds on this, which will reduce the space usage by 50% and allow conversion from your version of hyperloglog to the new one. https://github.com/axiomhq/hyperloglog
In general I'm in favour of this approach as it's conceptually simpler than the current state-of-the-art.
However, based on some further testing on datasets with a wide range of cardinalities, I'm concerned that this new approach displays some significantly larger maximum errors.
Cardinalities between 100K and 500K seem to be most likely to be over-estimated. Unfortunately many of our users/customers will have datasets with cardinalities in these ranges.
This could be down to a bug in the implementation or quite possibly my testing tool(!!), so I'm not closing the PR per se yet 😄 .
The results I have are as follows:
| Size | Actual Cardinality | Estimation | Error (%) | Duplication (%) |
|---|---|---|---|---|
| 500 | 500 | 500 | 0.0000% | 0.00% |
| 500 | 500 | 500 | 0.0000% | 0.00% |
| 500 | 390 | 390 | 0.0000% | 22.00% |
| 500 | 390 | 390 | 0.0000% | 22.00% |
| 500 | 95 | 95 | 0.0000% | 81.00% |
| 500 | 95 | 95 | 0.0000% | 81.00% |
| 1000 | 1000 | 1000 | 0.0000% | 0.00% |
| 1000 | 1000 | 1000 | 0.0000% | 0.00% |
| 1000 | 758 | 758 | 0.0000% | 24.20% |
| 1000 | 758 | 758 | 0.0000% | 24.20% |
| 1000 | 222 | 222 | 0.0000% | 77.80% |
| 1000 | 222 | 222 | 0.0000% | 77.80% |
| 5000 | 5000 | 5000 | 0.0000% | 0.00% |
| 5000 | 5000 | 5000 | 0.0000% | 0.00% |
| 5000 | 3769 | 3769 | 0.0000% | 24.62% |
| 5000 | 3769 | 3769 | 0.0000% | 24.62% |
| 5000 | 1016 | 1016 | 0.0000% | 79.68% |
| 5000 | 1016 | 1016 | 0.0000% | 79.68% |
| 10000 | 10000 | 10000 | 0.0000% | 0.00% |
| 10000 | 10000 | 10000 | 0.0000% | 0.00% |
| 10000 | 7523 | 7524 | 0.0133% | 24.77% |
| 10000 | 7523 | 7524 | 0.0133% | 24.77% |
| 10000 | 1947 | 1948 | 0.0513% | 80.53% |
| 10000 | 1947 | 1948 | 0.0513% | 80.53% |
| 100000 | 100000 | 100123 | 0.1228% | 0.00% |
| 100000 | 100000 | 102333 | 2.2798% | 0.00% |
| 100000 | 74973 | 75076 | 0.1372% | 25.03% |
| 100000 | 74973 | 77152 | 2.8243% | 25.03% |
| 100000 | 19926 | 19925 | -0.0050% | 80.07% |
| 100000 | 19926 | 19925 | -0.0050% | 80.07% |
| 250000 | 250000 | 249427 | -0.2297% | 0.00% |
| 250000 | 250000 | 250522 | 0.2084% | 0.00% |
| 250000 | 187501 | 187462 | -0.0208% | 25.00% |
| 250000 | 187501 | 189224 | 0.9106% | 25.00% |
| 250000 | 49728 | 49446 | -0.5703% | 80.11% |
| 250000 | 49728 | 51282 | 3.0303% | 80.11% |
| 500000 | 500000 | 499968 | -0.0064% | 0.00% |
| 500000 | 500000 | 500013 | 0.0026% | 0.00% |
| 500000 | 374699 | 374989 | 0.0773% | 25.06% |
| 500000 | 374699 | 375248 | 0.1463% | 25.06% |
| 500000 | 100405 | 100533 | 0.1273% | 79.92% |
| 500000 | 100405 | 102744 | 2.2765% | 79.92% |
| 1000000 | 1000000 | 1002466 | 0.2460% | 0.00% |
| 1000000 | 1000000 | 1002467 | 0.2461% | 0.00% |
| 1000000 | 750045 | 749366 | -0.0906% | 25.00% |
| 1000000 | 750045 | 749391 | -0.0873% | 25.00% |
| 1000000 | 200081 | 199592 | -0.2450% | 79.99% |
| 1000000 | 200081 | 201210 | 0.5611% | 79.99% |
| 5000000 | 5000000 | 5009290 | 0.1855% | 0.00% |
| 5000000 | 5000000 | 5009291 | 0.1855% | 0.00% |
| 5000000 | 3749786 | 3761091 | 0.3006% | 25.00% |
| 5000000 | 3749786 | 3761091 | 0.3006% | 25.00% |
| 5000000 | 999232 | 1001148 | 0.1914% | 80.02% |
| 5000000 | 999232 | 1001149 | 0.1915% | 80.02% |
| 25000000 | 25000000 | 24994384 | -0.0225% | 0.00% |
| 25000000 | 25000000 | 24994384 | -0.0225% | 0.00% |
| 25000000 | 18748935 | 18720936 | -0.1496% | 25.00% |
| 25000000 | 18748935 | 18720937 | -0.1496% | 25.00% |
| 25000000 | 4998717 | 5007377 | 0.1729% | 80.01% |
| 25000000 | 4998717 | 5007377 | 0.1729% | 80.01% |
| 100000000 | 100000000 | 99684277 | -0.3167% | 0.00% |
| 100000000 | 100000000 | 99684277 | -0.3167% | 0.00% |
| 100000000 | 74989984 | 75057400 | 0.0898% | 25.01% |
| 100000000 | 74989984 | 75057401 | 0.0898% | 25.01% |
| 500000000 | 500000000 | 500475904 | 0.0951% | 0.00% |
| 500000000 | 500000000 | 500475904 | 0.0951% | 0.00% |
| 500000000 | 375000935 | 374940065 | -0.0162% | 25.00% |
| 500000000 | 375000935 | 374940066 | -0.0162% | 25.00% |

| Mean Error | Median Error | Error Variance | Max Error |
|---|---|---|---|
| 0.0041% | 0.0000% | 0.0264 | -0.5703% |
| 0.3820% | 0.0079% | 0.7114 | 3.0303% |
The tool used to create them is here.
It looks like there may be a slight over-estimation bias in this new approach, but it's the larger outliers I'm more concerned about. Duplication (%) refers to how many duplicates there were in the dataset; Estimation and Error (%) refer to the performance of the implementations. The current implementation is the first row for each set, and this PR's implementation is the second row for each set.
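For clarity, the two derived columns can presumably be computed along these lines (an illustrative sketch; the function names are hypothetical and this is not necessarily how the testing tool computes them):

```go
package sketch // illustrative only

// errorPct is the signed relative error of an estimate, in percent.
// actual is the true distinct count, estimate the HLL output.
func errorPct(actual, estimate uint64) float64 {
	return (float64(estimate) - float64(actual)) / float64(actual) * 100
}

// duplicationPct is the fraction of input values that were duplicates,
// in percent. size is the number of values fed in.
func duplicationPct(size, actual uint64) float64 {
	return (float64(size) - float64(actual)) / float64(size) * 100
}
```

For example, the 10000 / 7523 / 7524 row gives (7524 - 7523) / 7523 ≈ 0.0133%, matching the table above.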
The main program to recreate the above results is:
```go
package main

import (
	"os"

	axhll "github.com/axiomhq/influxdb/pkg/estimator/hll"
	"github.com/influxdata/hll-check"
	"github.com/influxdata/influxdb/pkg/estimator/hll"
)

func main() {
	hllcheck.Seed = 123456

	// Existing implementation.
	h1f := func() hllcheck.HLL { return hll.MustNewPlus(16) }

	// Proposed implementation in influxdb#8512.
	h2f := func() hllcheck.HLL { return axhll.MustNewPlus(16) }

	_ = hllcheck.Run(hllcheck.ToHLLFatory(h1f), hllcheck.ToHLLFatory(h2f), os.Stdout)
}
```
@seiflotfy I have github.com/axiomhq/influxdb on the loglogbeta branch.
Let me know what you think (and please feel free to check the correctness of the tool — I swiftly knocked it up...)
In summary, I'm mainly concerned about the situations where this new approach is over-estimating cardinalities by more than 1%.
Thanks @e-dard, indeed the loglog-beta constants are optimised for p=14. I will run the simulations to generate the constants for p=16 and update. When running both at p=14, loglog-beta proves to be pretty good, with a smaller range of error.
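For reference, the precision dependence comes from the way the paper fits the bias term: a polynomial over the zero-register count whose coefficients are obtained by regression for a given p. A rough sketch of that shape (the function name is mine, and the fitted coefficient values are deliberately omitted):

```go
package sketch // illustrative only

import "math"

// betaPoly shows the shape of the bias polynomial as I read arXiv:1612.02284:
// a regression over z (the number of zero-valued registers) and powers of
// log(z+1). The eight coefficients b are fitted separately for each precision,
// which is why the p=14 constants do not carry over to p=16.
func betaPoly(b [8]float64, z float64) float64 {
	zl := math.Log(z + 1)
	poly := b[0] * z
	for i := 1; i < len(b); i++ {
		poly += b[i] * math.Pow(zl, float64(i))
	}
	return poly
}
```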
@seiflotfy I had a feeling it could be that. Cheers.
@seiflotfy yep, finding new constants did it. I would now say the accuracy is equivalent to that of the current implementation. Given this approach is conceptually simpler, I'm happy to merge this. Thanks!
(Current HLL on top result rows, this PR on bottom result rows.)
The new algorithm uses only one formula and needs no additional bias corrections over the entire range of cardinalities; it is therefore more efficient and simpler to implement. The accuracy provided by the new algorithm is as good as or better than the accuracy provided by either HyperLogLog or HyperLogLog++. The sparse HyperLogLog++ representation was kept to provide better accuracy at low cardinalities.
Refer to https://arxiv.org/pdf/1612.02284v2.pdf for more details.
Redis has also recently switched to this implementation (see redis/redis#3677).