Switch to LogLog-Beta Cardinality estimation #8512
Conversation
The new algorithm uses only one formula and needs no additional bias corrections over the entire range of cardinalities; it is therefore more efficient and simpler to implement. Our simulations show that the accuracy of the new algorithm is as good as or better than that of either HyperLogLog or HyperLogLog++. The sparse representation was kept to provide better accuracy at low cardinalities; however, the linear counting and range estimations are replaced.
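For context, here is a minimal sketch of that single formula (in Go, following the LogLog-Beta paper rather than any code in this PR; the function name, the alpha approximation, and passing beta as a parameter are my own choices). The bias correction is folded into a precision-specific polynomial beta(z) of the number of empty registers z:

```go
package sketch // illustrative only

// betaEstimate sketches the LogLog-Beta cardinality estimate from
// arXiv:1612.02284; it is not the code in this PR.
// registers holds the m = 2^p HLL registers; beta is the precision-specific
// polynomial in z (the count of zero-valued registers) fitted by regression.
func betaEstimate(registers []uint8, beta func(z float64) float64) float64 {
	m := float64(len(registers))
	var z, sum float64
	for _, r := range registers {
		if r == 0 {
			z++
		}
		sum += 1.0 / float64(uint64(1)<<r) // 2^{-register value}
	}
	alpha := 0.7213 / (1 + 1.079/m) // standard HyperLogLog alpha_m approximation for large m
	return alpha * m * (m - z) / (beta(z) + sum)
}
```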
@seiflotfy thanks for this, sorry for the delay. I plan to look at this soon; I was on vacation last week.
No worries @e-dard, I am working on a second implementation that builds on this, which will reduce the space usage by 50% and allow conversion from your version of hyperloglog to the new one. https://github.com/axiomhq/hyperloglog
In general I'm in favour of this approach as it's conceptually simpler than the current state-of-the-art.
However, based on some further testing on datasets with a wide range of cardinalities, I'm concerned that this new approach displays some significantly larger maximum errors.
Cardinalities between 100K and 500K seem to be most likely to be over-estimated. Unfortunately many of our users/customers will have datasets with cardinalities in these ranges.
This could be down to a bug in the implementation or quite possibly my testing tool(!!), so I'm not closing the PR per se yet 😄 .
The results I have are as follows:
| Size | Actual Cardinality | Estimation | Error (%) | Duplication (%) |
|---|---|---|---|---|
| 500 | 500 | 500 | 0.0000% | 0.00% |
| 500 | 500 | 500 | 0.0000% | 0.00% |
| 500 | 390 | 390 | 0.0000% | 22.00% |
| 500 | 390 | 390 | 0.0000% | 22.00% |
| 500 | 95 | 95 | 0.0000% | 81.00% |
| 500 | 95 | 95 | 0.0000% | 81.00% |
| 1000 | 1000 | 1000 | 0.0000% | 0.00% |
| 1000 | 1000 | 1000 | 0.0000% | 0.00% |
| 1000 | 758 | 758 | 0.0000% | 24.20% |
| 1000 | 758 | 758 | 0.0000% | 24.20% |
| 1000 | 222 | 222 | 0.0000% | 77.80% |
| 1000 | 222 | 222 | 0.0000% | 77.80% |
| 5000 | 5000 | 5000 | 0.0000% | 0.00% |
| 5000 | 5000 | 5000 | 0.0000% | 0.00% |
| 5000 | 3769 | 3769 | 0.0000% | 24.62% |
| 5000 | 3769 | 3769 | 0.0000% | 24.62% |
| 5000 | 1016 | 1016 | 0.0000% | 79.68% |
| 5000 | 1016 | 1016 | 0.0000% | 79.68% |
| 10000 | 10000 | 10000 | 0.0000% | 0.00% |
| 10000 | 10000 | 10000 | 0.0000% | 0.00% |
| 10000 | 7523 | 7524 | 0.0133% | 24.77% |
| 10000 | 7523 | 7524 | 0.0133% | 24.77% |
| 10000 | 1947 | 1948 | 0.0513% | 80.53% |
| 10000 | 1947 | 1948 | 0.0513% | 80.53% |
| 100000 | 100000 | 100123 | 0.1228% | 0.00% |
| 100000 | 100000 | 102333 | 2.2798% | 0.00% |
| 100000 | 74973 | 75076 | 0.1372% | 25.03% |
| 100000 | 74973 | 77152 | 2.8243% | 25.03% |
| 100000 | 19926 | 19925 | -0.0050% | 80.07% |
| 100000 | 19926 | 19925 | -0.0050% | 80.07% |
| 250000 | 250000 | 249427 | -0.2297% | 0.00% |
| 250000 | 250000 | 250522 | 0.2084% | 0.00% |
| 250000 | 187501 | 187462 | -0.0208% | 25.00% |
| 250000 | 187501 | 189224 | 0.9106% | 25.00% |
| 250000 | 49728 | 49446 | -0.5703% | 80.11% |
| 250000 | 49728 | 51282 | 3.0303% | 80.11% |
| 500000 | 500000 | 499968 | -0.0064% | 0.00% |
| 500000 | 500000 | 500013 | 0.0026% | 0.00% |
| 500000 | 374699 | 374989 | 0.0773% | 25.06% |
| 500000 | 374699 | 375248 | 0.1463% | 25.06% |
| 500000 | 100405 | 100533 | 0.1273% | 79.92% |
| 500000 | 100405 | 102744 | 2.2765% | 79.92% |
| 1000000 | 1000000 | 1002466 | 0.2460% | 0.00% |
| 1000000 | 1000000 | 1002467 | 0.2461% | 0.00% |
| 1000000 | 750045 | 749366 | -0.0906% | 25.00% |
| 1000000 | 750045 | 749391 | -0.0873% | 25.00% |
| 1000000 | 200081 | 199592 | -0.2450% | 79.99% |
| 1000000 | 200081 | 201210 | 0.5611% | 79.99% |
| 5000000 | 5000000 | 5009290 | 0.1855% | 0.00% |
| 5000000 | 5000000 | 5009291 | 0.1855% | 0.00% |
| 5000000 | 3749786 | 3761091 | 0.3006% | 25.00% |
| 5000000 | 3749786 | 3761091 | 0.3006% | 25.00% |
| 5000000 | 999232 | 1001148 | 0.1914% | 80.02% |
| 5000000 | 999232 | 1001149 | 0.1915% | 80.02% |
| 25000000 | 25000000 | 24994384 | -0.0225% | 0.00% |
| 25000000 | 25000000 | 24994384 | -0.0225% | 0.00% |
| 25000000 | 18748935 | 18720936 | -0.1496% | 25.00% |
| 25000000 | 18748935 | 18720937 | -0.1496% | 25.00% |
| 25000000 | 4998717 | 5007377 | 0.1729% | 80.01% |
| 25000000 | 4998717 | 5007377 | 0.1729% | 80.01% |
| 100000000 | 100000000 | 99684277 | -0.3167% | 0.00% |
| 100000000 | 100000000 | 99684277 | -0.3167% | 0.00% |
| 100000000 | 74989984 | 75057400 | 0.0898% | 25.01% |
| 100000000 | 74989984 | 75057401 | 0.0898% | 25.01% |
| 500000000 | 500000000 | 500475904 | 0.0951% | 0.00% |
| 500000000 | 500000000 | 500475904 | 0.0951% | 0.00% |
| 500000000 | 375000935 | 374940065 | -0.0162% | 25.00% |
| 500000000 | 375000935 | 374940066 | -0.0162% | 25.00% |

| Mean Error | Median Error | Error Variance | Max Error |
|---|---|---|---|
| 0.0041% | 0.0000% | 0.0264 | -0.5703% |
| 0.3820% | 0.0079% | 0.7114 | 3.0303% |
The tool used to create them is here.
It looks like there may be a slight over-estimation bias in this new approach, but it's the larger outliers I'm more concerned about. Duplication (%) refers to how many duplicates there were in the dataset; Estimation and Error (%) refer to the performance of the implementations. The current implementation is the first row for each set, and this PR's implementation is the second row for each set.
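For clarity, the two derived columns can presumably be computed along these lines (an illustrative sketch; the function names are hypothetical and this is not necessarily how the testing tool computes them):

```go
package sketch // illustrative only

// errorPct is the signed relative error of an estimate, in percent.
// actual is the true distinct count, estimate the HLL output.
func errorPct(actual, estimate uint64) float64 {
	return (float64(estimate) - float64(actual)) / float64(actual) * 100
}

// duplicationPct is the fraction of input values that were duplicates,
// in percent. size is the number of values fed in.
func duplicationPct(size, actual uint64) float64 {
	return (float64(size) - float64(actual)) / float64(size) * 100
}
```

For example, the 10000 / 7523 / 7524 row gives (7524 - 7523) / 7523 ≈ 0.0133%, matching the table above.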
The main program to recreate the above results is:
```go
package main

import (
	"os"

	axhll "github.com/axiomhq/influxdb/pkg/estimator/hll"
	"github.com/influxdata/hll-check"
	"github.com/influxdata/influxdb/pkg/estimator/hll"
)

func main() {
	hllcheck.Seed = 123456

	// Existing implementation.
	h1f := func() hllcheck.HLL { return hll.MustNewPlus(16) }

	// Proposed implementation in influxdb#8512.
	h2f := func() hllcheck.HLL { return axhll.MustNewPlus(16) }

	_ = hllcheck.Run(hllcheck.ToHLLFatory(h1f), hllcheck.ToHLLFatory(h2f), os.Stdout)
}
```
@seiflotfy I have github.com/axiomhq/influxdb on the loglogbeta branch.
Let me know what you think (and please feel free to check the correctness of the tool — I swiftly knocked it up...)
In summary, I'm mainly concerned about the situations where this new approach is over-estimating cardinalities by more than 1%.
Thanks @e-dard, indeed the loglog-beta constants are optimised for p=14. I will run the simulations to generate the constants for p=16 and update. When running both at p=14, loglog-beta proves to be pretty good, with a smaller range of error.
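For reference, the precision dependence comes from the way the paper fits the bias term: a polynomial over the zero-register count whose coefficients are obtained by regression for a given p. A rough sketch of that shape (the function name is mine, and the fitted coefficient values are deliberately omitted):

```go
package sketch // illustrative only

import "math"

// betaPoly shows the shape of the bias polynomial as I read arXiv:1612.02284:
// a regression over z (the number of zero-valued registers) and powers of
// log(z+1). The eight coefficients b are fitted separately for each precision,
// which is why the p=14 constants do not carry over to p=16.
func betaPoly(b [8]float64, z float64) float64 {
	zl := math.Log(z + 1)
	poly := b[0] * z
	for i := 1; i < len(b); i++ {
		poly += b[i] * math.Pow(zl, float64(i))
	}
	return poly
}
```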
@seiflotfy I had a feeling it could be that. Cheers.
@seiflotfy yep, finding new constants did it. I would now say the accuracy is equivalent to that of the current implementation. Given this approach is conceptually simpler, I'm happy to merge this. Thanks!
(Current HLL on top result rows, this PR on bottom result rows.)
The new algorithm uses only one formula and needs no additional bias corrections over the entire range of cardinalities; it is therefore more efficient and simpler to implement. The accuracy provided by the new algorithm is as good as or better than the accuracy provided by either HyperLogLog or HyperLogLog++. The sparse HyperLogLog++ representation was kept to provide better accuracy at low cardinalities.
Refer to https://arxiv.org/pdf/1612.02284v2.pdf for more details.
Redis has also recently switched to this implementation (see redis/redis#3677).