APPROX_HEAVY_HITTERS: Values that have a percent_share of total rows above min_percent_share #11807

cploonker opened this issue Oct 29, 2018

Presto aggregate function: APPROX_HEAVY_HITTERS(A, min_percent_share, ε, δ) -> MAP(K, V)

A = column of the table; in other words, the entire stream of values.
n = total number of values (rows) in A.
min_percent_share = user-provided parameter. Only values whose share of all processed values is at least this large are returned. In other words, a min_percent_share of 10 means return only those heavy hitters whose occurrence count is at least 10% of the overall volume.
ε = error bound such that counts are overestimated by at most εn. Default value = min_percent_share/200, i.e., 1/(2k); for example, 0.01 when min_percent_share = 2.
δ = probability that a count is overestimated by more than the error bound εn. Default value = 0.01.
MAP(K, V) = map with the heavy-hitter values as keys and their occurrence counts as values.
k = variable used in the referenced paper; min_percent_share = 100/k.

  1. Every value that occurs at least n/k times is in the list: no false negatives.
  2. In the worst case (for the default parameters), a value that occurs at least n/(2k) times but fewer than n/k times can make it into the list with probability δ: false positives are possible (stated compactly below).
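
Stated compactly, this is just a restatement of the two guarantees above via the count-min sketch bound from the referenced notes, writing f_x for the true count of x and \widehat{f}_x for the sketch estimate Count(x):

    f_x \le \widehat{f}_x
    \qquad\text{and}\qquad
    \Pr\bigl[\,\widehat{f}_x > f_x + \varepsilon n\,\bigr] \le \delta,
    \qquad\text{with } \varepsilon = \frac{1}{2k},\; k = \frac{100}{\mathrm{min\_percent\_share}}.

Because the sketch never underestimates, every x with f_x >= n/k is reported (guarantee 1); and except with probability δ, any reported x has f_x >= n/k − εn = n/(2k) (guarantee 2).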

Example use case

Let's say there is a table in which each record represents a visitor and the domain visited. This function can be used to get the top domains by visit count, along with approximate visit counts. It can be even more valuable for finding the top domains per country.

Algorithm

For complete background on the algorithm, refer to the heavy hitters lecture notes at: http://theory.stanford.edu/~tim/s17/l/l2.pdf

Data structures to hold the data

  1. Maintain a standard count-min sketch during the scan of the data set and put all elements into it: https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/frequency/ConservativeAddSketch.java
    b = e/ε, the number of buckets (width) of the count-min sketch, where e ≈ 2.718 is the base of the natural logarithm
    l >= ln(1/δ), the number of hash functions (depth) of the count-min sketch
  2. A counter m of the total number of elements processed so far.
  3. Maintain the heavy-hitter values, i.e., values that occur at least m/k times, where m is the number of values processed so far, in a heap such as: https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/execution/resourceGroups/IndexedPriorityQueue.java (a code sketch of this state follows the list)
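
A minimal sketch of that state, assuming stream-lib's ConservativeAddSketch(depth, width, seed) constructor (worth verifying against the stream-lib version in use) and using a plain java.util.PriorityQueue as a stand-in for Presto's IndexedPriorityQueue; the class and field names are illustrative:

    import com.clearspring.analytics.stream.frequency.ConservativeAddSketch;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.PriorityQueue;

    // Per-aggregation state, sized per the formulas above.
    class ApproxHeavyHittersState {
        static class Candidate {
            final String value;
            final long count;   // snapshot of Count(x) when this entry was pushed
            Candidate(String value, long count) { this.value = value; this.count = count; }
        }

        final double minPercentShare;        // user-provided threshold, e.g. 10 for 10%
        final ConservativeAddSketch sketch;  // 1. count-min sketch (conservative update)
        long m = 0;                          // 2. total elements processed so far
        final Map<String, Long> candidates = new HashMap<>();  // 3. current heavy hitters
        final PriorityQueue<Candidate> heap =                  //    min-heap by count
                new PriorityQueue<>(Comparator.comparingLong(c -> c.count));

        ApproxHeavyHittersState(double minPercentShare, double epsilon, double delta) {
            this.minPercentShare = minPercentShare;
            int width = (int) Math.ceil(Math.E / epsilon);       // b = e/ε buckets
            int depth = (int) Math.ceil(Math.log(1.0 / delta));  // l = ln(1/δ) hash functions
            this.sketch = new ConservativeAddSketch(depth, width, 12345); // fixed seed
        }
    }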

Logic to add elements into the above data structures (sketched in code after the list):

  1. Put the element into the sketch via Inc(x), then read its estimated count via Count(x).
  2. If Count(x) >= m*min_percent_share/100, add or update x in the heavy-hitters heap, where m is the number of rows processed so far.
  3. Whenever m grows to the point that some object x stored in the heap falls below min_percent_share (checkable in O(1) time via Find-Min), delete x from the heap (via Extract-Min). After finishing the pass, output all of the objects in the heap.
  4. If an element that was dropped from the heavy-hitters heap reappears and crosses min_percent_share again, it will re-enter the heap.
  5. The heap contains at most 2k elements at all times, as per the referenced paper. Note k = 100/min_percent_share.
  6. The "no large errors" assumption implies the approximation guarantee: every object x in the heap has true frequency at least n/k − εn = n/(2k) (other objects would be deleted from the heap by the end of the pass). If the count-min sketch makes large errors on a few objects, those objects might erroneously appear in the final output.
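
A sketch of that update path, continuing the hypothetical ApproxHeavyHittersState class above. Superseded heap entries are discarded lazily, which is one way to approximate the paper's Find-Min/Extract-Min behavior with a plain PriorityQueue:

    // Called once per input value x (steps 1-4 of the list above).
    void add(String x) {
        m++;                                   // one more row processed
        sketch.add(x, 1);                      // step 1: Inc(x)
        long est = sketch.estimateCount(x);    //         Count(x)
        long threshold = (long) Math.ceil(m * minPercentShare / 100.0);
        if (est >= threshold) {                // step 2: add/update the candidate
            candidates.put(x, est);
            heap.add(new Candidate(x, est));   // older entries for x become stale
        }
        // Step 3: as m grows, evict candidates that fell below min_percent_share.
        while (!heap.isEmpty()) {
            Candidate top = heap.peek();
            Long live = candidates.get(top.value);
            if (live == null || live != top.count) {
                heap.poll();                   // stale entry; a newer snapshot exists
            } else if (top.count < threshold) {
                heap.poll();                   // step 3: fell below the threshold
                candidates.remove(top.value);  // step 4: it may re-enter later
            } else {
                break;                         // smallest live candidate still qualifies
            }
        }
    }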

Error bounds

Counts are overestimated by at most εn, except with probability at most δ.

Why not top K elements

  1. In Presto, calculations are distributed at the block level. If we tried to maintain a list of the top K elements, an element might fail to make the top-K list within each individual block even though it belongs in the top-K list for the combined data; such elements would be missed by a top_k_elements version of the function. By using min_percent_share instead, each individual block retains every element whose share is above min_percent_share. Moreover, when blocks are merged, an element's combined percent_share can never exceed the highest percent_share it has in any one block, so there is no risk of missing an element whose overall share is above min_percent_share. A sketch of such a merge follows.
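
A hypothetical combine step for two partial states, as Presto would need when merging intermediate aggregation states. It assumes stream-lib's static CountMinSketch.merge (signature worth verifying against the version in use); the method name and shape are illustrative, not Presto's actual accumulator API:

    import com.clearspring.analytics.stream.frequency.CountMinSketch;
    import com.clearspring.analytics.stream.frequency.FrequencyMergeException;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    static Map<String, Long> combine(ApproxHeavyHittersState a, ApproxHeavyHittersState b)
            throws FrequencyMergeException {
        // Cell-wise sum of the two sketches; the merged estimate still never
        // underestimates, since every cell is an upper bound in each input.
        CountMinSketch merged = CountMinSketch.merge(a.sketch, b.sketch);
        long n = a.m + b.m;
        long threshold = (long) Math.ceil(n * a.minPercentShare / 100.0);

        // An element's combined share is a weighted average of its two partial
        // shares, so it exceeds min_percent_share overall only if it exceeds
        // min_percent_share in at least one input; screening the union of the
        // two candidate sets therefore cannot miss a true heavy hitter.
        Set<String> union = new HashSet<>(a.candidates.keySet());
        union.addAll(b.candidates.keySet());

        Map<String, Long> result = new HashMap<>();
        for (String x : union) {
            long est = merged.estimateCount(x);
            if (est >= threshold) {
                result.put(x, est);
            }
        }
        return result;
    }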

Resources:

  1. http://theory.stanford.edu/~tim/s17/l/l2.pdf
  2. This algorithm is also used by Redis to implement top_hitters: https://redislabs.com/blog/count-min-sketch-the-art-and-science-of-estimating-stuff/
  3. https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
rongrong commented:
Can you provide some example use case? It's not clear to me what kind of use case would require approximation on top N, and how using approximation would help with memory footprint. Thanks!

cploonker commented Mar 19, 2019

@rongrong thanks for your attention.

Say we have a huge table in which each row represents a domain and the user visiting it. If I only want to know which domains are visited the most, and how many times each is visited, this function would make that kind of query easy. Imagine I want the same result, but per country.

About memory: as described in the logic above, using a count-min sketch greatly reduces the memory footprint. Here is a link showing that 40 MB of data can be held in a 48 KB count-min sketch: https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/

Hope this answers your question; please feel free to let me know if I can answer any more questions.

jonw73330 commented:
@rongrong - similar to cploonker's feedback:

max_by(,,N) is a reasonably common pattern. For datasets that are very high cardinality (and usually highly skewed), being able to identify the top items is important (e.g., top IDs, words, URLs). The current implementation excludes many workloads from using Presto due to memory constraints. Having a small error rate on these datasets is an acceptable compromise, and seems consistent with the other HLL-based approximate functions.

tompetrillo commented:
I also think this would be useful. Suppose I have a table of errors and the products in which the errors were experienced; then it is very useful to see the top 4 or 5 most commonly occurring errors for each product. Generally speaking, this is a more advanced version of MODE, which is also not available out of the box. There are ways to obtain it, but the approximation dramatically improves speed. I would like this.

tompetrillo commented:
I would add that I think the name should change: APPROX should come at the end of the function name. Something like MOST_FREQUENT_OCCURING_APPROX has a more obvious meaning.

abhrajitmukherjee commented:
This is a much-needed feature and will be super useful for the data engineers in the field who heavily use Presto.

cploonker commented:
> I would add that I think the name should change: APPROX should come at the end of the function name. Something like MOST_FREQUENT_OCCURING_APPROX has a more obvious meaning.

@tompetrillo, I would let the Presto team decide the function name to align with their naming conventions. I don't have any strong opinion about the name.

tdcmeehan commented:
The above algorithm will work in specific cases where we know the incoming dataset and can carefully tune our epsilon and delta values. In general, however, it seems to be an undue burden to ask that a user give good values for delta and epsilon. The costs of getting it wrong are steep: a saturated CMS will falsely report heavy hitters with high probability. The function could throw once it's saturated to a point, but this increases the overhead and burden of the function. It might also push people to use larger sizes than necessary. The holy grail is an algorithm that adapts to the input, while still preserving lossless merging of intermediate aggregate states as will be necessary in Presto, without data quality compromises for unskewed distributions.

joelbecker commented:
I just want to +1 this. This is a useful pattern that shouldn't require a subquery or additional CTE.
