APPROX_HEAVY_HITTERS: Values that have a percent_share of total rows above min_percent_share #11807

cploonker opened this issue Oct 29, 2018

Presto aggregate function: APPROX_HEAVY_HITTERS(A, min_percent_share, ε, δ) -> MAP(K, V)

A = column of the table; in other words, the entire stream of values.
n = total number of values (rows) in A.
min_percent_share = user-provided parameter. Only values whose share of all processed values is at least this large are returned. In other words, a min_percent_share of 10 means return only those heavy hitters whose occurrence count is at least 10% of the overall volume.
ε = error bound such that counts are overestimated by at most εn. Default value = min_percent_share/200, i.e., 1/(2k); for example, 0.01 when min_percent_share = 2.
δ = probability that a count is overestimated by more than the error bound εn. Default value = 0.01.
MAP(K, V) = map with the heavy-hitter values as keys and their occurrence counts as values.
k = variable used in the referenced paper; min_percent_share = 100/k.

  1. Every value that occurs at least n/k times is in the list: no false negatives.
  2. In the worst case (for the default parameters), a value that occurs at least n/(2k) times but fewer than n/k times can make it into the list with probability δ: false positives are possible (stated compactly below).
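
Stated compactly, this is just a restatement of the two guarantees above via the count-min sketch bound from the referenced notes, writing f_x for the true count of x and \widehat{f}_x for the sketch estimate Count(x):

    f_x \le \widehat{f}_x
    \qquad\text{and}\qquad
    \Pr\bigl[\,\widehat{f}_x > f_x + \varepsilon n\,\bigr] \le \delta,
    \qquad\text{with } \varepsilon = \frac{1}{2k},\; k = \frac{100}{\mathrm{min\_percent\_share}}.

Because the sketch never underestimates, every x with f_x >= n/k is reported (guarantee 1); and except with probability δ, any reported x has f_x >= n/k − εn = n/(2k) (guarantee 2).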

Example use case

Let's say there is a table in which each record represents a visitor and the domain visited. This function can be used to get the top domains by visit count, along with approximate visit counts. It can be even more valuable for finding the top domains per country.

Algorithm

For complete background on the algorithm, refer to the heavy hitters lecture notes at: http://theory.stanford.edu/~tim/s17/l/l2.pdf

Data structures to hold the data

  1. Maintain a standard count-min sketch during the scan of the data set and put all elements into it: https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/frequency/ConservativeAddSketch.java
    b = e/ε, the number of buckets (width) of the count-min sketch, where e ≈ 2.718 is the base of the natural logarithm
    l >= ln(1/δ), the number of hash functions (depth) of the count-min sketch
  2. A counter m of the total number of elements processed so far.
  3. Maintain the heavy-hitter values, i.e., values that occur at least m/k times, where m is the number of values processed so far, in a heap such as: https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/execution/resourceGroups/IndexedPriorityQueue.java (a code sketch of this state follows the list)
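
A minimal sketch of that state, assuming stream-lib's ConservativeAddSketch(depth, width, seed) constructor (worth verifying against the stream-lib version in use) and using a plain java.util.PriorityQueue as a stand-in for Presto's IndexedPriorityQueue; the class and field names are illustrative:

    import com.clearspring.analytics.stream.frequency.ConservativeAddSketch;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.PriorityQueue;

    // Per-aggregation state, sized per the formulas above.
    class ApproxHeavyHittersState {
        static class Candidate {
            final String value;
            final long count;   // snapshot of Count(x) when this entry was pushed
            Candidate(String value, long count) { this.value = value; this.count = count; }
        }

        final double minPercentShare;        // user-provided threshold, e.g. 10 for 10%
        final ConservativeAddSketch sketch;  // 1. count-min sketch (conservative update)
        long m = 0;                          // 2. total elements processed so far
        final Map<String, Long> candidates = new HashMap<>();  // 3. current heavy hitters
        final PriorityQueue<Candidate> heap =                  //    min-heap by count
                new PriorityQueue<>(Comparator.comparingLong(c -> c.count));

        ApproxHeavyHittersState(double minPercentShare, double epsilon, double delta) {
            this.minPercentShare = minPercentShare;
            int width = (int) Math.ceil(Math.E / epsilon);       // b = e/ε buckets
            int depth = (int) Math.ceil(Math.log(1.0 / delta));  // l = ln(1/δ) hash functions
            this.sketch = new ConservativeAddSketch(depth, width, 12345); // fixed seed
        }
    }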

Logic to add elements into the above data structures (sketched in code after the list):

  1. Put the element into the sketch via Inc(x), then read its estimated count via Count(x).
  2. If Count(x) >= m*min_percent_share/100, add or update x in the heavy-hitters heap, where m is the number of rows processed so far.
  3. Whenever m grows to the point that some object x stored in the heap falls below min_percent_share (checkable in O(1) time via Find-Min), delete x from the heap (via Extract-Min). After finishing the pass, output all of the objects in the heap.
  4. If an element that was dropped from the heavy-hitters heap reappears and crosses min_percent_share again, it will re-enter the heap.
  5. The heap contains at most 2k elements at all times, as per the referenced paper. Note k = 100/min_percent_share.
  6. The "no large errors" assumption implies the approximation guarantee: every object x in the heap has true frequency at least n/k − εn = n/(2k) (other objects would be deleted from the heap by the end of the pass). If the count-min sketch makes large errors on a few objects, those objects might erroneously appear in the final output.
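
A sketch of that update path, continuing the hypothetical ApproxHeavyHittersState class above. Superseded heap entries are discarded lazily, which is one way to approximate the paper's Find-Min/Extract-Min behavior with a plain PriorityQueue:

    // Called once per input value x (steps 1-4 of the list above).
    void add(String x) {
        m++;                                   // one more row processed
        sketch.add(x, 1);                      // step 1: Inc(x)
        long est = sketch.estimateCount(x);    //         Count(x)
        long threshold = (long) Math.ceil(m * minPercentShare / 100.0);
        if (est >= threshold) {                // step 2: add/update the candidate
            candidates.put(x, est);
            heap.add(new Candidate(x, est));   // older entries for x become stale
        }
        // Step 3: as m grows, evict candidates that fell below min_percent_share.
        while (!heap.isEmpty()) {
            Candidate top = heap.peek();
            Long live = candidates.get(top.value);
            if (live == null || live != top.count) {
                heap.poll();                   // stale entry; a newer snapshot exists
            } else if (top.count < threshold) {
                heap.poll();                   // step 3: fell below the threshold
                candidates.remove(top.value);  // step 4: it may re-enter later
            } else {
                break;                         // smallest live candidate still qualifies
            }
        }
    }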

Error bounds

Counts are overestimated by at most εn, except with probability at most δ.

Why not top K elements

  1. In Presto, calculations are distributed at the block level. If we tried to maintain a list of the top K elements, an element might fail to make the top-K list within each individual block even though it belongs in the top-K list for the combined data; such elements would be missed by a top_k_elements version of the function. By using min_percent_share instead, each individual block retains every element whose share is above min_percent_share. Moreover, when blocks are merged, an element's combined percent_share can never exceed the highest percent_share it has in any one block, so there is no risk of missing an element whose overall share is above min_percent_share. A sketch of such a merge follows.
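
A hypothetical combine step for two partial states, as Presto would need when merging intermediate aggregation states. It assumes stream-lib's static CountMinSketch.merge (signature worth verifying against the version in use); the method name and shape are illustrative, not Presto's actual accumulator API:

    import com.clearspring.analytics.stream.frequency.CountMinSketch;
    import com.clearspring.analytics.stream.frequency.FrequencyMergeException;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    static Map<String, Long> combine(ApproxHeavyHittersState a, ApproxHeavyHittersState b)
            throws FrequencyMergeException {
        // Cell-wise sum of the two sketches; the merged estimate still never
        // underestimates, since every cell is an upper bound in each input.
        CountMinSketch merged = CountMinSketch.merge(a.sketch, b.sketch);
        long n = a.m + b.m;
        long threshold = (long) Math.ceil(n * a.minPercentShare / 100.0);

        // An element's combined share is a weighted average of its two partial
        // shares, so it exceeds min_percent_share overall only if it exceeds
        // min_percent_share in at least one input; screening the union of the
        // two candidate sets therefore cannot miss a true heavy hitter.
        Set<String> union = new HashSet<>(a.candidates.keySet());
        union.addAll(b.candidates.keySet());

        Map<String, Long> result = new HashMap<>();
        for (String x : union) {
            long est = merged.estimateCount(x);
            if (est >= threshold) {
                result.put(x, est);
            }
        }
        return result;
    }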

Resources:

  1. http://theory.stanford.edu/~tim/s17/l/l2.pdf
  2. This algorithm is also used by Redis to implement top_hitters: https://redislabs.com/blog/count-min-sketch-the-art-and-science-of-estimating-stuff/
  3. https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
rongrong commented:
Can you provide some example use case? It's not clear to me what kind of use case would require approximation on top N, and how using approximation would help with memory footprint. Thanks!

cploonker commented Mar 19, 2019

@rongrong thanks for your attention.

Say we have a huge table in which each row represents a domain and the user visiting it. If I only want to know which domains are visited the most, and how many times each is visited, this function would make that kind of query easy. Imagine I want the same result, but per country.

About memory: as described in the logic above, using a count-min sketch greatly reduces the memory footprint. Here is a link showing that 40 MB of data can be held in a 48 KB count-min sketch: https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/

Hope this answers your question; please feel free to let me know if I can answer any more questions.

jonw73330 commented:
@rongrong - similar to cploonker's feedback:

max_by(,,N) is a reasonably common pattern. For datasets that are very high cardinality (and usually highly skewed), being able to identify the top items is important (e.g., top IDs, words, URLs). The current implementation excludes many workloads from using Presto due to memory constraints. Having a small error rate on these datasets is an acceptable compromise, and seems consistent with the other HLL-based approximate functions.

tompetrillo commented:
I also think this would be useful. Suppose I have a table of errors and the products in which the errors were experienced; then it is very useful to see the top 4 or 5 most commonly occurring errors for each product. Generally speaking, this is a more advanced version of MODE, which is also not available out of the box. There are ways to obtain it, but the approximation dramatically improves speed. I would like this.

tompetrillo commented:
I would add that I think the name should change: APPROX should come at the end of the function name. Something like MOST_FREQUENT_OCCURING_APPROX has a more obvious meaning.

abhrajitmukherjee commented:
This is a much-needed feature and will be super useful for the data engineers in the field who heavily use Presto.

cploonker commented:
> I would add that I think the name should change: APPROX should come at the end of the function name. Something like MOST_FREQUENT_OCCURING_APPROX has a more obvious meaning.

@tompetrillo, I would let the Presto team decide the function name to align with their naming conventions. I don't have any strong opinion about the name.

tdcmeehan commented:
The above algorithm will work in specific cases where we know the incoming dataset and can carefully tune our epsilon and delta values. In general, however, it seems to be an undue burden to ask that a user give good values for delta and epsilon. The costs of getting it wrong are steep: a saturated CMS will falsely report heavy hitters with high probability. The function could throw once it's saturated to a point, but this increases the overhead and burden of the function. It might also push people to use larger sizes than necessary. The holy grail is an algorithm that adapts to the input, while still preserving lossless merging of intermediate aggregate states as will be necessary in Presto, without data quality compromises for unskewed distributions.

joelbecker commented:
I just want to +1 this. This is a useful pattern that shouldn't require a subquery or additional CTE.
