Besides being a compact and pretty speedy HyperLogLog implementation for cardinality counting, this modified HyperLogLog allows intersection and similarity estimation of different HyperLogLogs.
A simple implementation of HyperLogLog (LogLog-Beta to be specific):
- 16 bit registers instead of 6 bit, the new 10 bit are for b-bit signatures
- Similarity function estimates Jaccard indices (a number between 0-1) of 0.01 for set cardinalities on the order of 1e9 with accuracy around 5%
- Intersection applies the Jaccard index on the union of the sets to return the intersecting set cardinality
The work is based on "HyperMinHash: Jaccard index sketching in LogLog space - Yun William Yu, Griffin M. Weber"
sk1 := hyperminhash.New()
sk2 := hyperminhash.New()
for i := 0; i < 10000; i++ {
sk1.Add([]byte(strconv.Itoa(i)))
}
sk1.Cardinality() // 10001 (should be 10000)
for i := 3333; i < 23333; i++ {
sk2.Add([]byte(strconv.Itoa(i)))
}
sk2.Cardinality() // 19977 (should be 20000)
sk1.Similarity(sk2) // 0.284589082 (should be 0.2857326533)
sk1.Intersection(sk2) // 6623 (should be 6667)
sk1.Merge(sk2)
sk1.Cardinality() // 23271 (should be 23333)
Set1 | HLL1 | Set2 | HLL2 | S1 ∪ S2 | HLL1 ∪ HLL2 | S1 ∩ S2 | HLL1 ∩ HLL2 |
---|---|---|---|---|---|---|---|
350 | 352 | 752 | 752 | 831 | 832 | 271 (32.611312%) | 274 (32.932692%) |
746 | 748 | 591 | 590 | 834 | 835 | 503 (60.311751%) | 501 (60.000000%) |
248 | 248 | 789 | 791 | 897 | 899 | 140 (15.607581%) | 144 (16.017798%) |
9 | 9 | 818 | 818 | 824 | 825 | 3 (0.364078%) | 3 (0.363636%) |
408 | 411 | 412 | 408 | 771 | 771 | 49 (6.355383%) | 47 (6.095979%) |
Set1 | HLL1 | Set2 | HLL2 | S1 ∪ S2 | HLL1 ∪ HLL2 | S1 ∩ S2 | HLL1 ∩ HLL2 |
---|---|---|---|---|---|---|---|
2126 | 2138 | 1162 | 1158 | 3063 | 3060 | 225 (7.345739%) | 223 (7.287582%) |
7767 | 7706 | 7054 | 7064 | 8889 | 8887 | 5932 (66.734166%) | 5888 (66.254079%) |
842 | 844 | 5183 | 5135 | 5880 | 5842 | 145 (2.465986%) | 135 (2.310852%) |
6833 | 6791 | 664 | 666 | 7410 | 7345 | 87 (1.174089%) | 89 (1.211709%) |
1814 | 1820 | 6214 | 6169 | 7697 | 7639 | 331 (4.300377%) | 320 (4.189030%) |
Set1 | HLL1 | Set2 | HLL2 | S1 ∪ S2 | HLL1 ∪ HLL2 | S1 ∩ S2 | HLL1 ∩ HLL2 |
---|---|---|---|---|---|---|---|
29667 | 29540 | 88700 | 88167 | 92444 | 91667 | 25923 (28.041842%) | 25036 (27.311901%) |
79242 | 78731 | 30216 | 30137 | 83502 | 82953 | 25956 (31.084285%) | 25995 (31.337022%) |
57830 | 57223 | 79550 | 79194 | 82112 | 81595 | 55268 (67.308067%) | 54684 (67.018812%) |
64610 | 63501 | 21696 | 21729 | 75895 | 74816 | 10411 (13.717636%) | 10083 (13.477064%) |
92204 | 91453 | 96417 | 95556 | 165025 | 163370 | 23596 (14.298440%) | 24130 (14.770154%) |
Set1 | HLL1 | Set2 | HLL2 | S1 ∪ S2 | HLL1 ∪ HLL2 | S1 ∩ S2 | HLL1 ∩ HLL2 |
---|---|---|---|---|---|---|---|
150443 | 149810 | 974366 | 979514 | 1088517 | 1096991 | 36292 (3.334077%) | 37417 (3.410876%) |
156337 | 155347 | 19083 | 19070 | 167353 | 165433 | 8067 (4.820350%) | 8017 (4.846071%) |
800969 | 802044 | 51053 | 51244 | 851388 | 853396 | 634 (0.074467%) | 511 (0.059878%) |
176155 | 174707 | 520111 | 516822 | 570092 | 569289 | 126174 (22.132217%) | 123766 (21.740452%) |
485954 | 481362 | 967341 | 972651 | 1081990 | 1091296 | 371305 (34.316861%) | 376007 (34.455088%) |
Set1 | HLL1 | Set2 | HLL2 | S1 ∪ S2 | HLL1 ∪ HLL2 | S1 ∩ S2 | HLL1 ∩ HLL2 |
---|---|---|---|---|---|---|---|
7132942 | 7150720 | 122116 | 121539 | 7243153 | 7261709 | 11905 (0.164362%) | 12550 (0.172824%) |
8646240 | 8649049 | 1277784 | 1295017 | 9821480 | 9854242 | 102544 (1.044079%) | 99163 (1.006298%) |
4192390 | 4164637 | 2788913 | 2779975 | 4526476 | 4499897 | 2454827 (54.232630%) | 2454356 (54.542493%) |
9803344 | 9826412 | 1705700 | 1715798 | 10255010 | 10262719 | 1254034 (12.228501%) | 1273821 (12.412120%) |
1308849 | 1322604 | 9940327 | 9971519 | 11179030 | 11201850 | 70146 (0.627478%) | 80717 (0.720568%) |
Set1 | HLL1 | Set2 | HLL2 | S1 ∪ S2 | HLL1 ∪ HLL2 | S1 ∩ S2 | HLL1 ∩ HLL2 |
---|---|---|---|---|---|---|---|
13237748 | 13298469 | 57073758 | 57124720 | 59474437 | 59394847 | 10837069 (18.221390%) | 11143669 (18.762013%) |
90757994 | 88576114 | 5717797 | 5701796 | 95061178 | 93016636 | 1414613 (1.488108%) | 1350058 (1.451416%) |
60150663 | 60033013 | 79238333 | 77672994 | 110438475 | 108311818 | 28950521 (26.214162%) | 27666946 (25.543792%) |
30187492 | 30718889 | 37756209 | 37153655 | 67443566 | 66938074 | 500135 (0.741561%) | 447406 (0.668388%) |
53196095 | 53461989 | 48344583 | 47535284 | 93284291 | 91321031 | 8256387 (8.850780%) | 8036467 (8.800237%) |