Add ApproximateCDF aggregator #5570

Merged
danking merged 21 commits into hail-is:master from patrick-schultz:quantiles-kll
Mar 22, 2019
Conversation

@patrick-schultz
Member

This only exposes the raw results to python. I still need to wrap that in some nice convenience functions like approximate_quantile, but I didn't want to let this PR get any longer.


Notes
-----
This method returns a struct collecting two array: `values` and `ranks`. The
Contributor


containing two arrays

aggregator. For example, values=[0,2,5,6,9] and ranks=[0,3,4,5,8,10]
represents the approximation [0,0,0,2,5,6,6,6,9,9], with the value
`values(i)` occupying indices `ranks(i)` (inclusive) to `ranks(i+1)`
(exclusive).
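As a concrete restatement of the semantics in the excerpt above, here is a short Python sketch (the `expand` helper is hypothetical, not part of the PR) that materializes the stream a `(values, ranks)` pair represents:

```python
def expand(values, ranks):
    """Materialize the stream a (values, ranks) summary represents:
    values[i] occupies indices ranks[i] (inclusive) to ranks[i+1] (exclusive)."""
    out = []
    for v, lo, hi in zip(values, ranks, ranks[1:]):
        out.extend([v] * (hi - lo))
    return out

print(expand([0, 2, 5, 6, 9], [0, 3, 4, 5, 8, 10]))
# [0, 0, 0, 2, 5, 6, 6, 6, 9, 9]
```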
Contributor

I think this paragraph will perplex more than it edifies.

I think I would avoid giving the approximation example. Instead I'd say something like:

Suppose we had a collection of values: 0, 0, 1, 1, 2, 3, 10, 11, 12. The
median of this collection is 2. The 25th percentile is 1. The 75th percentile
is 11. We may represent these three quantiles as:

[   1,    2,   11]
[0.25, 0.50, 0.75]

We could also list the rank of the elements instead of their quantile:

[ 1, 2, 11]
[ 2, 4,  7]

This method produces an approximation of the previous representation. The
values are always members of the input, but their ranks may slightly differ from
their true ranks.
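The rank convention in this suggestion (rank = number of elements strictly below a value) can be checked with a quick Python sketch, written here just to illustrate the counting, not taken from the PR:

```python
data = sorted([0, 0, 1, 1, 2, 3, 10, 11, 12])

def rank(x):
    # rank = number of elements strictly less than x
    return sum(1 for v in data if v < x)

print([rank(x) for x in (1, 2, 11)])  # [2, 4, 7], the ranks row above
```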

Contributor

Oh and I guess something about the last rank being the size of the dataset? Hmm. Not sure. I think the above verbiage builds a more helpful intuition but I see that it is mismatched with the actual output.

Member Author

For some context, I think of the approx_cdf aggregator as a low level thing that most users will never use directly. For a separate PR I've been adding approximate quantile aggregators and plotting methods, which are more user friendly endpoints. So I was writing this documentation with the superusers in mind who might want to build new use cases on top of this low level primitive.

Besides the fact that the ranks array is one longer than the values array, I think the intuition that the sketch is a list of approximate quantiles is subtly misleading. Not that it's wrong, exactly, but thinking that way tends to lead me to wrong conclusions.

Not sure if you will find this helpful, but it was clarifying for me to consider the extreme case where we collect only one sample. This implementation doesn't support that case, but if it did, the only thing it could do (and still be mathematically correct), is to collect a single uniform random sample, with weight N (the size of the data). Then the cdf would be {values=[s], ranks=[0,N]}.

The key property of the sketch is that, for any query x, we can return an unbiased estimate of the rank of x. In this case, we return 0 if x≤s and 1 if x>s. Note that this is in fact an unbiased estimate—if s is a uniform random sample, then s<x with probability equal to the quantile of x. If we flip it around and estimate the qth quantile, we will answer s for any q. In terms of the description I gave in the comment, that's because our sketch represents the stream that contains N copies of s. Your example gives the impression that in this case the sketch would try to find a sample close to the median, but that is not the case, nor are we providing any other quantile between 0 and 1 as our guess of the quantile of s.

I agree this could stand to be clarified, and I'll think more about the best thing to say. The best I can say right now is that this way of understanding the sketch is what I've had to keep falling back on when trying to figure out the right way to compute a quantile estimate, or how to draw a CDF plot.

object ApproxCDFCombiner {
def apply[@specialized(Int, Long, Float, Double) T: ClassTag : Ordering](
numLevels: Int, capacity: Int, dummy: T
)(implicit helper: ApproxCDFHelper[T]): ApproxCDFCombiner[T] = new ApproxCDFCombiner[T](
Contributor

you could also add this as another : ApproxCDFHelper since you don't reference the name

59049, 177147, 531441, 1594323, 4782969, 14348907, 43046721, 129140163,
387420489, 1162261467, 3486784401L, 10460353203L, 31381059609L,
94143178827L, 282429536481L, 847288609443L, 2541865828329L, 7625597484987L,
22876792454961L, 68630377364883L, 205891132094649L)
Contributor

verified.
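These constants appear to be consecutive powers of 3 (3^10 through 3^30); that reading is an observation, not something stated in the PR, but it is easy to check:

```python
table = [59049, 177147, 531441, 1594323, 4782969, 14348907, 43046721,
         129140163, 387420489, 1162261467, 3486784401, 10460353203,
         31381059609, 94143178827, 282429536481, 847288609443,
         2541865828329, 7625597484987, 22876792454961, 68630377364883,
         205891132094649]
print(all(v == 3 ** (i + 10) for i, v in enumerate(table)))  # True
```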

@patrick-schultz
Member Author

Addressed. But I'm open to push-back on the best way to document this.

Contributor

@danking danking left a comment


I had these comments queued, didn't get a chance to dive in today. Put it on my calendar for tomorrow.

These represent a summary of the CDF of the distribution of values. In
particular, for any value `x = values(i)` in the summary, we estimate that
there are `ranks(i)` values strictly less than `x`, and that there are
`ranks(i+1)` values less that or equal to `x`. For any value `x` (not
Contributor

less than or equal

Contributor

maybe change variable names to indicate we're no longer speaking of the previous x

Contributor

@danking danking left a comment


Here are some more comments. I'm up to generalCompact, which looks complex and I think I'll pick up again tomorrow.


def compactBufferBackwards(
buf: Array[T], inStart: Int, inEnd: Int,
out: Array[T], outEnd: Int
Contributor

hmm, there's an invariant here that's something like:

assert buf != out || inStart >= outEnd

right? Same for above.

Member Author

They can be, and often are, the same buffer. For example, compactBufferBackwards(array, start, end, array, end) is a common use case. The possibility that the input and output ranges overlap is the only reason for having the separate forwards and backwards compact methods. compactBuffer (forwards) works when the input and output ranges start at the same index.

I can work out and assert the exact invariant if you'd like.

Contributor

I think for the backwards one it's actually inStart <= outEnd. That's sufficient for correctness and for the uses in the algorithm. As long as you start at or before outEnd, i is always ahead of out so it always reads an original value.

Member Author

This copies half of the interval [inStart, inEnd) to the interval [outStart, outEnd), where outStart = outEnd - (inEnd - inStart) / 2. If buf and out are the same array, and inStart < outEnd < inEnd, then the first thing that happens is copying buf(inEnd - 1) or buf(inEnd - 2) (depending on the coin flip), to buf(outEnd - 1), which will overwrite a value in [inStart, inEnd) that we haven't yet copied.

It's safe if either outEnd <= inStart, in which case the input and output intervals are completely disjoint, or outEnd >= inEnd, in which case we are guaranteed to only overwrite input values that have already been copied.
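A Python sketch of the copy order being discussed (an illustration only, not the Scala implementation; the coin flip is passed in explicitly to keep it deterministic):

```python
def compact_buffer_backwards(buf, in_start, in_end, out, out_end, offset):
    # Copies every other element of buf[in_start:in_end) into
    # out[out_end - (in_end - in_start)//2 : out_end), walking right to left,
    # so that when buf is out and out_end >= in_end, each read happens
    # at or ahead of every write and original values are never clobbered.
    # offset is the coin flip: 0 keeps the last element, 1 skips it.
    i, o = in_end - 1 - offset, out_end - 1
    while i >= in_start:
        out[o] = buf[i]
        i -= 2
        o -= 1
    return o + 1  # new start of the compacted run

# common in-place use with outEnd == inEnd
a = list(range(8))
start = compact_buffer_backwards(a, 0, 8, a, 8, offset=0)
print(start, a[start:])  # 4 [1, 3, 5, 7]
```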

val newBot = bot - 1
items(newBot) = t
levels(0) = newBot

Contributor

rogue new line

}

object ApproxCDFCombiner {
def apply[@specialized(Int, Long, Float, Double) T: ClassTag : Ordering : ApproxCDFHelper](
Contributor

Can we have a test that fails if specialization fails on Int, Long, Float, or Double? I suspect the overhead of boxing will be significant, so asserting the time is within some order of magnitude of a certain expected time seems reasonable.

Contributor

Also, have you inspected the byte code to be certain there's no lingering unspecialized values?

Member Author

I haven't inspected the byte code. I was doing some timing tests during development, and it was pretty obvious when I did something that caused boxing. But it was close enough (less than double the time) that a timing test might have too many false positives.

Contributor

mm that's unfortunate

seqOpArgs: Seq[Type])

sealed trait AggOp { }
final case class ApproxCDF() extends AggOp
Contributor

Even if we don't verify the result is "correct", we should have tests that run through the JVM code generation process for each type.

Member Author

I wrote a python test that should be doing that.

Contributor

Ah, sorry, I missed that

* Shift lower levels up to keep items contiguous.
*/
def compactLevel(level: Int, shiftLowerLevels: Boolean = true) {
assert(level < maxNumLevels - 1)
Contributor

what does it mean to compact a level > numLevels? (equivalently: should this be assert(level <= numLevels - 1)?)

if (shiftLowerLevels) {
// shift up values below
if (sizeBelow > 1) {
// copy [bot, a) to [bot+halfSize, b)
Contributor

this is pretty clear from the code, I don't think it's necessary

Member Author

The code doesn't mention the upper endpoints of the intervals, which is why I found this helpful. But happy to remove it if you think it's clear.

Contributor

This is the argument du jour. I think it's pretty clear once I've understood the code. I am OK with a pair of assertions:

assert(levels(0) + sizeBelow == a)
assert(bot + halfSize + sizeBelow == b)

// shift up values below
if (sizeBelow > 1) {
// copy [bot, a) to [bot+halfSize, b)
System.arraycopy(items, levels(0), items, bot + halfSize, sizeBelow)
Contributor

bot == levels(0)

System.arraycopy(items, levels(0), items, bot + halfSize, sizeBelow)
} else {
// only needs to be done if sizeBelow == 1, but doesn't hurt otherwise
items(b - 1) = items(a0)
Contributor

Why not just use array copy to copy the one element?

Contributor

Also, what if there's one element in level 0 and an even number of elements in level 1, then sizeBelow is 1, a0 == a, and we should actually be copying a - 1. Does that never happen?

Member Author

Why not just use array copy to copy the one element?

If you look at the docstring for arraycopy, it does a lot of precondition checking, figuring out if the in and out ranges overlap, etc. The sizeBelow == 1 case is very common, and I suspect (but haven't checked) that avoiding arraycopy is faster in this case.

Does that never happen?

I don't think that can happen, but I also don't think this code should be relying on that. I think replacing a0 with bot fixes it. Agree?

Contributor

yeah bot seems right

/* Compact level `level`, merging the compacted results into level `level+1`.
* Shift lower levels up to keep items contiguous.
*/
def compactLevel(level: Int, shiftLowerLevels: Boolean = true) {
Contributor

We can unit test this method and I think we should do so.

Contributor

We'll need to seed the randomness.

}
}

def merge(other: ApproxCDFCombiner[T], ubOnNumLevels: Int): ApproxCDFCombiner[T] = {
Contributor

I think we should also unit test this method.

Contributor

@danking danking left a comment


x

val items: Array[Int] = Array(7,2,4,1,3,8,0,5,9)
val combiner = new ApproxCDFCombiner(levels, items, 3, 0, 9, rand)
combiner.compactLevel(1, shiftLowerLevels = true)
println(items.mkString(","))
Contributor

if you want this to be reported on error, add it as a message to the assert

Member Author

Oops, just left in debugging cruft.

val combiner = new ApproxCDFCombiner(levels, items, 3, 0, 9, rand)
combiner.compactLevel(1, shiftLowerLevels = true)
println(items.mkString(","))
assert(items.view(1,9) sameElements Array(7,2,4,1,0,5,8,9))
Contributor

I thought level 1 was [levels(i), levels(i+1)), i.e. [3, 6), i.e. 1,3,8 and you're merging into 0,5,9 which I'd expect to result in 7,2,4,0,1,5,8,9?

Member Author

Here's what happened:

The values, split into levels, are

7 2 4 | 1 3 8 | 0 5 9

We start compacting level 1. There are an odd number of elements, so we leave the first one where it is, and conceptually put the rest into a scratch level between levels 1 and 2:

7 2 4 | 1 | 3 8 | 0 5 9

Now we flip a coin, and it comes up 1, meaning we skip the first element, i.e. we take the odd indices, and merge half the level into the level above, giving

7 2 4 | 1 | ? | 0 5 8 9

where we have left an unused spot, so we shift up everything below:

? | 7 2 4 | 1 | 0 5 8 9
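That walkthrough, rephrased as a runnable sketch (a toy model of the level structure, not Hail's contiguous-array implementation; the coin flip is passed in explicitly):

```python
def compact_level(levels, lvl, coin):
    # levels: list of buffers, lowest level first (levels above 0 kept sorted)
    buf = levels[lvl]
    leftover = buf[:len(buf) % 2]   # odd length: the first element stays put
    rest = buf[len(buf) % 2:]
    survivors = rest[coin::2]       # coin = 1 skips the first, keeping odd indices
    levels[lvl] = leftover
    levels[lvl + 1] = sorted(levels[lvl + 1] + survivors)

lv = [[7, 2, 4], [1, 3, 8], [0, 5, 9]]
compact_level(lv, 1, coin=1)
print(lv)  # [[7, 2, 4], [1], [0, 5, 8, 9]]
```

With the coin coming up 1, the 8 survives from level 1 and merges into level 2, and the leftover 1 stays behind, matching the state reached above.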

* aggregator is an array "values" of samples, in increasing order, and an array
* "ranks" of integers less than `n`, in increasing order, such that:
* - ranks.length = values.length + 1
* - ranks(0) = 0
Contributor

why include it?

@danking danking merged commit bb41118 into hail-is:master Mar 22, 2019
@patrick-schultz patrick-schultz deleted the quantiles-kll branch January 2, 2025 13:48
