[query] Add reservoir sample aggregator #12812

Merged
merged 3 commits into hail-is:main on Mar 24, 2023

tpoterba
Contributor

CHANGELOG: Fixed bug where Table/MT._calculate_new_partitions returned unbalanced intervals with whole-stage code generation runtime.

@tpoterba
Contributor Author

Suggestion from Patrick: https://en.wikipedia.org/wiki/Combinatorial_number_system could be used to transform the sample into a uniform single integer in [0, n choose k) for testing
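A minimal sketch of that ranking idea, assuming the sampled items are identified by distinct stream indices in 0..n-1. The helper name `rank_subset` is hypothetical and not part of this PR. If the sampler is uniform, the rank of each sample should be uniform over [0, n choose k), which can then be checked with an ordinary uniformity test.

```python
from itertools import combinations
from math import comb


def rank_subset(sample):
    """Map a k-subset of {0, ..., n-1} to an integer in [0, C(n, k)).

    Combinatorial number system: with sorted elements c_1 < ... < c_k,
    rank = C(c_1, 1) + C(c_2, 2) + ... + C(c_k, k).
    """
    return sum(comb(c, i) for i, c in enumerate(sorted(sample), start=1))


# Sanity check: the 3-subsets of {0..5} hit every rank in [0, C(6, 3)) once.
n, k = 6, 3
ranks = sorted(rank_subset(s) for s in combinations(range(n), k))
assert ranks == list(range(comb(n, k)))
```

The rank ignores the order of items inside the sample, so it only tests which items were chosen, not where they sit in the reservoir.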


cb.assign(garbage, other.garbage)
cb.assign(seenSoFar, other.seenSoFar)
cb.assign(garbage, other.garbage)
Collaborator

assigned garbage twice

cb.assign(j, 0)
cb.whileLoop(j < maxSize, {
val x = cb.memoize(rand.invoke[Double]("nextDouble"))
cb.ifx(x * (totalWeightLeft + totalWeightRight) <= totalWeightLeft, {
Collaborator

I think the probabilities need to change as you start pulling items out of the two sides. I think it should be

if (x * (leftSize * totalWeightLeft + rightSize * totalWeightRight) <= leftSize * totalWeightLeft)

Another possibility is to modify the left builder in place, using a weighted generalization of the seqOp:

weightSoFar = totalWeightLeft
rightWeight = totalWeightRight / rightSize
for (j in 0..right.size)
  weightSoFar += rightWeight
  if (left.size < maxSize)
    left.append(right[j])
  else
    if (randDouble() * weightSoFar < rightWeight * maxSize)
      swap right[j] into random position in left

The unweighted sampler maintains the invariant that, at any time, the probability that any item seen so far is in the sample, P(x in S), is maxSize / seenSoFar. The weighted generalization changes that to maxSize * weight(x) / weightSoFar, where weightSoFar is the sum of the weights of all items seen so far.
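For reference, the unweighted invariant can be sketched with the textbook reservoir sampler (Algorithm R). This is an illustrative Python version, not the code this PR generates; `reservoir_sample` and its parameters are hypothetical names.

```python
import random


def reservoir_sample(stream, max_size, rng):
    """Unweighted reservoir sampling (Algorithm R)."""
    sample = []
    seen_so_far = 0
    for item in stream:
        seen_so_far += 1
        if len(sample) < max_size:
            # Reservoir not full yet: every item seen so far is kept.
            sample.append(item)
        elif rng.random() * seen_so_far < max_size:
            # Keep the new item with probability max_size / seen_so_far,
            # evicting a uniformly chosen current member.
            sample[rng.randrange(max_size)] = item
    return sample


rng = random.Random(0)
print(reservoir_sample(range(10), 3, rng))
```

After each step, every item seen so far is in `sample` with probability `max_size / seen_so_far` (or 1 while the reservoir is not yet full).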

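For the combOp, if we just union the two samples together, but give each item from the left the weight totalWeightLeft / leftSize, and each item from the right the weight totalWeightRight / rightSize, then after the weighted sampler runs, the probability that any item originally seen on the left is in the result is the probability it made it into the left sample times the probability the weighted sampler keeps it. With totalWeight = totalWeightLeft + totalWeightRight, that is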

(leftSize / totalWeightLeft) * (maxSize * (totalWeightLeft / leftSize) / totalWeight)
=
maxSize / totalWeight

I'm pretty sure this handles all cases where one or both sides aren't full as well.
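A rough Python sketch of the in-place weighted merge described in the pseudocode above. The function name `merge_samples` and its signature are hypothetical. It assumes the acceptance probability `right_weight * max_size / weight_so_far` stays at or below 1, and the usage below only exercises the case where both sides are full; the not-full cases are not checked here.

```python
import random


def merge_samples(left, total_weight_left, right, total_weight_right,
                  max_size, rng):
    """Merge two reservoir samples with the weighted seqOp generalization."""
    merged = list(left)
    if not right:
        return merged
    weight_so_far = total_weight_left
    # Each right-hand item stands in for this many originally seen items.
    right_weight = total_weight_right / len(right)
    for item in right:
        weight_so_far += right_weight
        if len(merged) < max_size:
            merged.append(item)
        elif rng.random() * weight_so_far < right_weight * max_size:
            # Swap the right-hand item into a uniformly random slot.
            merged[rng.randrange(max_size)] = item
    return merged


rng = random.Random(0)
# Left sample drawn from 6 seen items, right sample from 4 seen items.
print(merge_samples([0, 1, 2], 6, [10, 11, 12], 4, 3, rng))
```

In this configuration a left-hand sample item should survive with probability 3 * (6 / 3) / 10 = 0.6 and a right-hand one with probability 3 * (4 / 3) / 10 = 0.4, matching the maxSize * weight(x) / weightSoFar invariant.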

@danking merged commit d983d0d into hail-is:main on Mar 24, 2023
7 checks passed
3 participants