Switch to LogLog-Beta Cardinality estimation #8512

Merged
merged 2 commits into from
Jul 7, 2017


seiflotfy
Contributor

@seiflotfy seiflotfy commented Jun 20, 2017

The new algorithm uses only one formula and needs no additional bias corrections for the entire range of cardinalities; it is therefore more efficient and simpler to implement. The accuracy provided by the new algorithm is as good as or better than the accuracy provided by either HyperLogLog or HyperLogLog++. The sparse HyperLogLog++ representation was kept to provide better low-cardinality accuracy.

Refer to https://arxiv.org/pdf/1612.02284v2.pdf for more details.

Redis has also recently switched to this implementation (see redis/redis#3677 )

Required for all non-trivial PRs
  • Rebased/mergeable
  • Tests pass
  • CHANGELOG.md updated
  • Sign CLA (if not already signed)

The new algorithm uses only one formula and needs no additional bias corrections for the entire range of cardinalities;
it is therefore more efficient and simpler to implement. Our simulations show that the accuracy provided by the new
algorithm is as good as or better than the accuracy provided by either HyperLogLog or HyperLogLog++. The sparse
representation was kept to provide better low-cardinality accuracy. However, the linear counting and range estimations
are replaced.
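
For readers who have not read the paper, the "one formula" mentioned in the description can be sketched roughly as below. This is a minimal illustration and not the code in this PR: the names `alpha` and `estimate` are hypothetical, and the bias term β is passed in as a function because its polynomial coefficients are fitted per precision and are deliberately not reproduced here.

```go
package main

import (
	"fmt"
	"math"
)

// alpha returns the usual HyperLogLog bias constant for m registers.
func alpha(m float64) float64 {
	switch m {
	case 16:
		return 0.673
	case 32:
		return 0.697
	case 64:
		return 0.709
	}
	return 0.7213 / (1 + 1.079/m)
}

// estimate applies the LogLog-Beta formula
//
//	E = alpha(m) * m * (m - z) / (beta(z) + sum_i 2^(-M[i]))
//
// where M is the register array, m its length, and z the number of
// registers that are still zero. The beta argument is an assumption of
// this sketch: a real implementation hard-codes a polynomial in z and
// log(z+1) whose coefficients depend on the precision.
func estimate(registers []uint8, beta func(z float64) float64) float64 {
	m := float64(len(registers))
	var sum, z float64
	for _, r := range registers {
		if r == 0 {
			z++
		}
		sum += math.Pow(2, -float64(r))
	}
	return alpha(m) * m * (m - z) / (beta(z) + sum)
}

func main() {
	// Toy input: 16 registers, each holding the value 1.
	regs := make([]uint8, 16)
	for i := range regs {
		regs[i] = 1
	}
	// With a zero bias term and no empty registers the formula reduces
	// to the raw HyperLogLog estimate.
	noBias := func(z float64) float64 { return 0 }
	fmt.Printf("%.3f\n", estimate(regs, noBias))
}
```

Because the same expression is used at every cardinality, there is no switch between linear counting and the raw estimate; the β term is what absorbs the small-range bias.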
@jwilder jwilder requested a review from e-dard June 20, 2017 20:14
@e-dard
Contributor

e-dard commented Jun 26, 2017

@seiflotfy thanks for this, sorry for the delay. I plan to look at this soon. I was on vacation last week.

@seiflotfy
Contributor Author

No worries @e-dard. I am working on a second implementation that builds on this one; it will reduce the space usage by 50% and allow conversion from your version of hyperloglog to the new one. https://github.com/axiomhq/hyperloglog

Contributor

@e-dard e-dard left a comment

In general I'm in favour of this approach as it's conceptually simpler than the current state-of-the-art.

However, based on some further testing on datasets with a wide range of cardinalities, I'm concerned that this new approach displays some significantly larger maximum errors.

Cardinalities between 100K and 500K seem to be most likely to be over-estimated. Unfortunately many of our users/customers will have datasets with cardinalities in these ranges.

This could be down to a bug in the implementation or quite possibly my testing tool(!!), so I'm not closing the PR per se yet 😄 .

The results I have are as follows:

Size			Actual Cardinality	Estimation		Error (%)		Duplication (%)
500			500			500			0.0000%			0.00%
500			500			500			0.0000%			0.00%
500			390			390			0.0000%			22.00%
500			390			390			0.0000%			22.00%
500			95			95			0.0000%			81.00%
500			95			95			0.0000%			81.00%
1000			1000			1000			0.0000%			0.00%
1000			1000			1000			0.0000%			0.00%
1000			758			758			0.0000%			24.20%
1000			758			758			0.0000%			24.20%
1000			222			222			0.0000%			77.80%
1000			222			222			0.0000%			77.80%
5000			5000			5000			0.0000%			0.00%
5000			5000			5000			0.0000%			0.00%
5000			3769			3769			0.0000%			24.62%
5000			3769			3769			0.0000%			24.62%
5000			1016			1016			0.0000%			79.68%
5000			1016			1016			0.0000%			79.68%
10000			10000			10000			0.0000%			0.00%
10000			10000			10000			0.0000%			0.00%
10000			7523			7524			0.0133%			24.77%
10000			7523			7524			0.0133%			24.77%
10000			1947			1948			0.0513%			80.53%
10000			1947			1948			0.0513%			80.53%
100000			100000			100123			0.1228%			0.00%
100000			100000			102333			2.2798%			0.00%
100000			74973			75076			0.1372%			25.03%
100000			74973			77152			2.8243%			25.03%
100000			19926			19925			-0.0050%		80.07%
100000			19926			19925			-0.0050%		80.07%
250000			250000			249427			-0.2297%		0.00%
250000			250000			250522			0.2084%			0.00%
250000			187501			187462			-0.0208%		25.00%
250000			187501			189224			0.9106%			25.00%
250000			49728			49446			-0.5703%		80.11%
250000			49728			51282			3.0303%			80.11%
500000			500000			499968			-0.0064%		0.00%
500000			500000			500013			0.0026%			0.00%
500000			374699			374989			0.0773%			25.06%
500000			374699			375248			0.1463%			25.06%
500000			100405			100533			0.1273%			79.92%
500000			100405			102744			2.2765%			79.92%
1000000			1000000			1002466			0.2460%			0.00%
1000000			1000000			1002467			0.2461%			0.00%
1000000			750045			749366			-0.0906%		25.00%
1000000			750045			749391			-0.0873%		25.00%
1000000			200081			199592			-0.2450%		79.99%
1000000			200081			201210			0.5611%			79.99%
5000000			5000000			5009290			0.1855%			0.00%
5000000			5000000			5009291			0.1855%			0.00%
5000000			3749786			3761091			0.3006%			25.00%
5000000			3749786			3761091			0.3006%			25.00%
5000000			999232			1001148			0.1914%			80.02%
5000000			999232			1001149			0.1915%			80.02%
25000000		25000000		24994384		-0.0225%		0.00%
25000000		25000000		24994384		-0.0225%		0.00%
25000000		18748935		18720936		-0.1496%		25.00%
25000000		18748935		18720937		-0.1496%		25.00%
25000000		4998717			5007377			0.1729%			80.01%
25000000		4998717			5007377			0.1729%			80.01%
100000000		100000000		99684277		-0.3167%		0.00%
100000000		100000000		99684277		-0.3167%		0.00%
100000000		74989984		75057400		0.0898%			25.01%
100000000		74989984		75057401		0.0898%			25.01%
500000000		500000000		500475904		0.0951%			0.00%
500000000		500000000		500475904		0.0951%			0.00%
500000000		375000935		374940065		-0.0162%		25.00%
500000000		375000935		374940066		-0.0162%		25.00%


Mean Error		Median Error		Error Variance		Max Error
0.0041%			0.0000%			0.0264			-0.5703%
0.3820%			0.0079%			0.7114			3.0303%

The tool used to create them is here.

It looks like there may be a slight over-estimation bias in this new approach, but it's the larger outliers I'm more concerned about. Duplication refers to the percentage of values in the dataset that were duplicates. Estimation and Error (%) refer to the performance of the implementations. For each pair of rows, the first row is the current implementation and the second row is this PR's implementation.
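
As a reading aid for the table above, here is a small sketch of how the Error (%) and Duplication (%) columns can be reproduced. The function names are hypothetical and this is not the tool's code. One assumption is inferred from the numbers rather than stated anywhere: the published errors match (estimate − actual) / estimate, i.e. they are relative to the estimate and not to the actual cardinality.

```go
package main

import "fmt"

// errorPct returns the signed error as a percentage of the estimate.
// Assumption: dividing by the estimate (not the actual cardinality) is
// inferred from the table, e.g. 100000 vs 102333 gives 2.2798%.
func errorPct(actual, estimate uint64) float64 {
	if estimate == 0 {
		return 0
	}
	return (float64(estimate) - float64(actual)) / float64(estimate) * 100
}

// duplicationPct returns the share of the generated values that were
// duplicates, given the dataset size and its actual cardinality.
func duplicationPct(size, actual uint64) float64 {
	if size == 0 {
		return 0
	}
	return float64(size-actual) / float64(size) * 100
}

func main() {
	// Rows taken from the table above.
	fmt.Printf("%.4f%%\n", errorPct(100000, 102333))
	fmt.Printf("%.4f%%\n", errorPct(49728, 51282))
	fmt.Printf("%.2f%%\n", duplicationPct(500, 390))
}
```

Measured against the actual cardinality instead, the largest outlier (49728 estimated as 51282) would be slightly larger, about 3.1%.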

The main program to recreate the above results is:

package main

import (
	"os"

	axhll "github.com/axiomhq/influxdb/pkg/estimator/hll"
	"github.com/influxdata/hll-check"
	"github.com/influxdata/influxdb/pkg/estimator/hll"
)

func main() {
	hllcheck.Seed = 123456
	// Existing implementation.
	h1f := func() hllcheck.HLL { return hll.MustNewPlus(16) }
	// Proposed implementation in influxdb#8512
	h2f := func() hllcheck.HLL { return axhll.MustNewPlus(16) }

	_ = hllcheck.Run(hllcheck.ToHLLFatory(h1f), hllcheck.ToHLLFatory(h2f), os.Stdout)
}

@seiflotfy I have github.com/axiomhq/influxdb on the loglogbeta branch.

Let me know what you think (and please feel free to check the correctness of the tool — I swiftly knocked it up...)

In summary, I'm mainly concerned about the situations where this new approach is over-estimating cardinalities by more than 1%.

@seiflotfy
Contributor Author

Thanks @e-dard. Indeed, the LogLog-Beta constants are optimised for p=14. I will run the simulations to generate the constants for p=16 and update the PR. When both are run at p=14, LogLog-Beta proves to be pretty good, with a smaller range of error.
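
To put the precision parameter in context, the sketch below relates p to the register count m = 2^p and to the usual HyperLogLog rule of thumb for relative standard error, 1.04/√m. The helper names are hypothetical, and applying the rule of thumb to LogLog-Beta is an assumption made for illustration only.

```go
package main

import (
	"fmt"
	"math"
)

// registers returns the number of registers m = 2^p for precision p.
func registers(p uint) uint64 {
	return 1 << p
}

// stdErrPct returns the commonly cited HyperLogLog relative standard
// error, 1.04/sqrt(m), as a percentage. Using it for LogLog-Beta is an
// assumption of this sketch.
func stdErrPct(p uint) float64 {
	return 1.04 / math.Sqrt(float64(registers(p))) * 100
}

func main() {
	for _, p := range []uint{14, 16} {
		fmt.Printf("p=%d m=%d stderr=%.4f%%\n", p, registers(p), stdErrPct(p))
	}
}
```

This works out to roughly 0.81% at p=14 and 0.41% at p=16, so errors of 2% to 3% at p=16 sit well outside the expected range, which is consistent with the constants having been fitted for a different precision.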

@e-dard
Contributor

e-dard commented Jun 29, 2017

@seiflotfy I had a feeling it could be that. Cheers.

@e-dard
Contributor

e-dard commented Jul 7, 2017

@seiflotfy yep, finding new constants did it. I would now say the accuracy is equivalent to that of our current implementation. Given this approach is conceptually simpler, I'm happy to merge this. Thanks!

⇒  go run cmd/main.go
Size			Actual Cardinality	Estimation		Error (%)		Duplication (%)
500			500			500			0.0000%			0.00%
500			500			500			0.0000%			0.00%
500			390			390			0.0000%			22.00%
500			390			390			0.0000%			22.00%
500			95			95			0.0000%			81.00%
500			95			95			0.0000%			81.00%
1000			1000			1000			0.0000%			0.00%
1000			1000			1000			0.0000%			0.00%
1000			758			758			0.0000%			24.20%
1000			758			758			0.0000%			24.20%
1000			222			222			0.0000%			77.80%
1000			222			222			0.0000%			77.80%
5000			5000			5000			0.0000%			0.00%
5000			5000			5000			0.0000%			0.00%
5000			3769			3769			0.0000%			24.62%
5000			3769			3769			0.0000%			24.62%
5000			1016			1016			0.0000%			79.68%
5000			1016			1016			0.0000%			79.68%
10000			10000			10000			0.0000%			0.00%
10000			10000			10000			0.0000%			0.00%
10000			7523			7524			0.0133%			24.77%
10000			7523			7524			0.0133%			24.77%
10000			1947			1948			0.0513%			80.53%
10000			1947			1948			0.0513%			80.53%
100000			100000			100123			0.1228%			0.00%
100000			100000			99901			-0.0991%		0.00%
100000			74973			75076			0.1372%			25.03%
100000			74973			74925			-0.0641%		25.03%
100000			19926			19925			-0.0050%		80.07%
100000			19926			19925			-0.0050%		80.07%
250000			250000			249427			-0.2297%		0.00%
250000			250000			249243			-0.3037%		0.00%
250000			187501			187462			-0.0208%		25.00%
250000			187501			187293			-0.1111%		25.00%
250000			49728			49446			-0.5703%		80.11%
250000			49728			49512			-0.4363%		80.11%
500000			500000			499968			-0.0064%		0.00%
500000			500000			499948			-0.0104%		0.00%
500000			374699			374989			0.0773%			25.06%
500000			374699			374986			0.0765%			25.06%
500000			100405			100533			0.1273%			79.92%
500000			100405			100310			-0.0947%		79.92%
1000000			1000000			1002466			0.2460%			0.00%
1000000			1000000			1002467			0.2461%			0.00%
1000000			750045			749366			-0.0906%		25.00%
1000000			750045			749506			-0.0719%		25.00%
1000000			200081			199592			-0.2450%		79.99%
1000000			200081			199402			-0.3405%		79.99%
5000000			5000000			5009290			0.1855%			0.00%
5000000			5000000			5009291			0.1855%			0.00%
5000000			3749786			3761091			0.3006%			25.00%
5000000			3749786			3761091			0.3006%			25.00%
5000000			999232			1001148			0.1914%			80.02%
5000000			999232			1001149			0.1915%			80.02%
25000000		25000000		24994384		-0.0225%		0.00%
25000000		25000000		24994384		-0.0225%		0.00%
25000000		18748935		18720936		-0.1496%		25.00%
25000000		18748935		18720937		-0.1496%		25.00%
25000000		4998717			5007377			0.1729%			80.01%
25000000		4998717			5007377			0.1729%			80.01%
100000000		100000000		99684277		-0.3167%		0.00%
100000000		100000000		99684277		-0.3167%		0.00%
100000000		74989984		75057400		0.0898%			25.01%
100000000		74989984		75057401		0.0898%			25.01%
500000000		500000000		500475904		0.0951%			0.00%
500000000		500000000		500475904		0.0951%			0.00%
500000000		375000935		374940065		-0.0162%		25.00%
500000000		375000935		374940066		-0.0162%		25.00%


Mean Error		Median Error		Error Variance		Max Error
0.0041%			0.0000%			0.0264			-0.5703%
-0.0182%		0.0000%			0.0244			-0.4363%

(Current HLL on top result rows, this PR on bottom result rows).

@e-dard e-dard merged commit a432386 into influxdata:master Jul 7, 2017
@e-dard e-dard removed the proposed label Jul 7, 2017
@e-dard e-dard mentioned this pull request Oct 19, 2017