
Add ApproximateCDF aggregator #5570

Merged
merged 21 commits on Mar 22, 2019

Conversation

@patrick-schultz (Collaborator) commented Mar 8, 2019:

This only exposes the raw results to Python. I still need to wrap that in some nice convenience functions like `approximate_quantile`, but I didn't want to let this PR get any longer.

@patrick-schultz patrick-schultz force-pushed the patrick-schultz:quantiles-kll branch from b194dc6 to 3f8cc9b Mar 13, 2019

patrick-schultz added some commits Mar 13, 2019

Notes
-----
This method returns a struct collecting two array: `values` and `ranks`. The

@danking (Collaborator) commented Mar 14, 2019:
containing two arrays

aggregator. For example, values=[0,2,5,6,9] and ranks=[0,3,4,5,8,10]
represents the approximation [0,0,0,2,5,6,6,6,9,9], with the value
`values(i)` occupying indices `ranks(i)` (inclusive) to `ranks(i+1)`
(exclusive).
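As a quick illustration of that layout, here is a throwaway Python sketch (the `expand` helper is hypothetical, not part of the PR) that rebuilds the approximated multiset from a summary:

```python
# Hypothetical helper (not part of this PR): expand a (values, ranks)
# summary back into the multiset it represents.
def expand(values, ranks):
    out = []
    for i, v in enumerate(values):
        # values[i] occupies indices ranks[i] (inclusive) to ranks[i+1] (exclusive)
        out.extend([v] * (ranks[i + 1] - ranks[i]))
    return out

print(expand([0, 2, 5, 6, 9], [0, 3, 4, 5, 8, 10]))
# -> [0, 0, 0, 2, 5, 6, 6, 6, 9, 9]
```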

@danking (Collaborator) commented Mar 14, 2019:
I think this paragraph will perplex more than it edifies.

I think I would avoid giving the approximation example. Instead I'd say something like:

Suppose we had a collection of values: 0, 0, 1, 1, 2, 3, 10, 11, 12. The
median of this collection is 2. The 25th percentile is 1. The 75th percentile
is 11. We may represent these three quantiles as:

[   1,    2,   11]
[0.25, 0.50, 0.75]

We could also list the rank of the elements instead of their quantile:

[ 1, 2, 11]
[ 2, 4,  7]

This method produces an approximation of the previous representation. The
values are always members of the input, but their ranks may slightly differ from
their true ranks.
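For concreteness, the exact ranks in that example collection can be checked with a few lines of throwaway Python (rank here meaning the count of elements strictly less than x):

```python
# Throwaway check of the exact ranks in the example collection.
def rank(sorted_xs, x):
    """Number of elements strictly less than x."""
    return sum(1 for v in sorted_xs if v < x)

xs = [0, 0, 1, 1, 2, 3, 10, 11, 12]
assert rank(xs, 1) == 2    # 25th percentile
assert rank(xs, 2) == 4    # median
assert rank(xs, 11) == 7   # 75th percentile
```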

@danking (Collaborator) commented Mar 14, 2019:
Oh and I guess something about the last rank being the size of the dataset? Hmm. Not sure. I think the above verbiage builds a more helpful intuition but I see that it is mismatched with the actual output.

@patrick-schultz (Author, Collaborator) commented Mar 15, 2019:
For some context, I think of the approx_cdf aggregator as a low level thing that most users will never use directly. For a separate PR I've been adding approximate quantile aggregators and plotting methods, which are more user friendly endpoints. So I was writing this documentation with the superusers in mind who might want to build new use cases on top of this low level primitive.

Besides the fact that the ranks array is one longer than the values array, I think the intuition that the sketch is a list of approximate quantiles is subtly misleading. Not that it's wrong, exactly, but thinking that way tends to lead me to wrong conclusions.

Not sure if you will find this helpful, but it was clarifying for me to consider the extreme case where we collect only one sample. This implementation doesn't support that case, but if it did, the only thing it could do (and still be mathematically correct), is to collect a single uniform random sample, with weight N (the size of the data). Then the cdf would be {values=[s], ranks=[0,N]}.

The key property of the sketch is that, for any query x, we can return an unbiased estimate of the rank of x. In this case, we return 0 if x≤s and N if x>s. Note that this is in fact an unbiased estimate: if s is a uniform random sample, then s<x with probability equal to the quantile of x. If we flip it around and estimate the qth quantile, we will answer s for any q. In terms of the description I gave in the comment, that's because our sketch represents the stream that contains N copies of s. Your example gives the impression that in this case the sketch would try to find a sample close to the median, but that is not the case, nor are we providing any other quantile between 0 and 1 as our guess of the quantile of s.

I agree this could stand to be clarified, and I'll think more about the best thing to say. The best I can say right now is that this way of understanding the sketch is what I've had to keep falling back on when trying to figure out the right way to compute a quantile estimate, or how to draw a CDF plot.
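That one-sample extreme case can also be checked numerically. This throwaway Python simulation (not PR code) averages the estimate n·1[s < x] over many independent single-sample sketches and confirms the mean is close to the true rank:

```python
import random

# Degenerate one-sample sketch: keep a single uniform sample s with
# weight n, and estimate rank(x) as n if s < x, else 0 (the {values=[s],
# ranks=[0, n]} reading above). Averaged over many sketches, the
# estimate is unbiased.
rng = random.Random(0)
n = 1000
data = list(range(n))   # true rank of the query 500 is exactly 500
trials = 10000
total = 0
for _ in range(trials):
    s = data[rng.randrange(n)]
    if s < 500:
        total += n
mean_estimate = total / trials
assert abs(mean_estimate - 500) < 50  # unbiased: mean is close to the true rank
```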

object ApproxCDFCombiner {
def apply[@specialized(Int, Long, Float, Double) T: ClassTag : Ordering](
numLevels: Int, capacity: Int, dummy: T
)(implicit helper: ApproxCDFHelper[T]): ApproxCDFCombiner[T] = new ApproxCDFCombiner[T](

@danking (Collaborator) commented Mar 14, 2019:
you could also add this as another `: ApproxCDFHelper` context bound, since you don't reference the name

59049, 177147, 531441, 1594323, 4782969, 14348907, 43046721, 129140163,
387420489, 1162261467, 3486784401L, 10460353203L, 31381059609L,
94143178827L, 282429536481L, 847288609443L, 2541865828329L, 7625597484987L,
22876792454961L, 68630377364883L, 205891132094649L)

@danking (Collaborator) commented Mar 14, 2019:
verified.
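For future readers: the visible constants are consecutive powers of 3 (3^10 through 3^30), which is easy to re-check in a few lines of Python:

```python
# The visible constants are consecutive powers of 3 (3**10 .. 3**30).
powers = [3 ** k for k in range(10, 31)]
assert powers[0] == 59049              # 3**10, first visible constant
assert powers[10] == 3486784401        # 3**20
assert powers[-1] == 205891132094649   # 3**30, last visible constant
```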

@patrick-schultz (Author, Collaborator) commented Mar 15, 2019:

Addressed. But I'm open to push-back on the best way to document this.

@danking (Collaborator) left a comment:

I had these comments queued, didn't get a chance to dive in today. Put it on my calendar for tomorrow.

These represent a summary of the CDF of the distribution of values. In
particular, for any value `x = values(i)` in the summary, we estimate that
there are `ranks(i)` values strictly less than `x`, and that there are
`ranks(i+1)` values less that or equal to `x`. For any value `x` (not

@danking (Collaborator) commented Mar 19, 2019:
less than or equal

@danking (Collaborator) commented Mar 19, 2019:
maybe change variable names to indicate we're no longer speaking of the previous x
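One way to read that contract as a query (a hypothetical Python helper, not the PR's code): the estimated count of elements strictly below a query q is ranks(i) for the first i with values(i) >= q, and ranks.last when q exceeds every summary value:

```python
import bisect

# Hypothetical query helper: estimated count of elements strictly below q,
# per the contract quoted above. ranks has one more entry than values, so
# ranks[len(values)] covers queries above every summary value.
def rank_below(values, ranks, q):
    i = bisect.bisect_left(values, q)  # first index with values[i] >= q
    return ranks[i]

values = [0, 2, 5, 6, 9]
ranks = [0, 3, 4, 5, 8, 10]
assert rank_below(values, ranks, 5) == 4    # 0, 0, 0, 2 fall below 5
assert rank_below(values, ranks, 7) == 8    # 0,0,0,2,5,6,6,6 fall below 7
assert rank_below(values, ranks, 10) == 10  # everything falls below 10
```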

@danking
Copy link
Collaborator

left a comment

Here are some more comments. I'm up to generalCompact, which looks complex and I think I'll pick up again tomorrow.


def compactBufferBackwards(
buf: Array[T], inStart: Int, inEnd: Int,
out: Array[T], outEnd: Int

@danking (Collaborator) commented Mar 19, 2019:
hmm, there's an invariant here that's something like:

assert buf != out || inStart >= outEnd

right? Same for above.

@patrick-schultz (Author, Collaborator) commented Mar 19, 2019:
They can be, and often are, the same buffer. For example, compactBufferBackwards(array, start, end, array, end) is a common use case. The possibility that the input and output ranges overlap is the only reason for having the separate forwards and backwards compact methods. compactBuffer (forwards) works when the input and output ranges start at the same index.

I can work out and assert the exact invariant if you'd like.

@danking (Collaborator) commented Mar 19, 2019:
I think for the backwards one it's actually inStart <= outEnd. That's sufficient for correctness and for the uses in the algorithm. As long as you start at or before outEnd, i is always ahead of out so it always reads an original value.

@patrick-schultz (Author, Collaborator) commented Mar 22, 2019:
This copies half of the interval [inStart, inEnd) to the interval [outStart, outEnd), where outStart = outEnd - (inEnd - inStart) / 2. If buf and out are the same array, and inStart < outEnd < inEnd, then the first thing that happens is copying buf(inEnd - 1) or buf(inEnd - 2) (depending on the coin flip) to buf(outEnd - 1), which overwrites a value in [inStart, inEnd) that we haven't yet copied.

It's safe if either outEnd <= inStart, in which case the input and output intervals are completely disjoint, or outEnd >= inEnd, in which case we are guaranteed to only overwrite input values that have already been copied.
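A simplified Python sketch of the backwards variant (hypothetical signature, with the coin flip passed in as a flag) makes the safety argument concrete: iterating from the top, every read happens before its slot can be overwritten whenever outEnd >= inEnd:

```python
# Simplified sketch of backwards compaction (not the PR's implementation).
# Copies every other element of buf[in_start:in_end] so that the surviving
# half ends at out[out_end - 1]. Iterating downward means that when buf is
# out and out_end >= in_end, each read happens before its slot is clobbered.
def compact_buffer_backwards(buf, in_start, in_end, out, out_end, keep_odd):
    i = in_end - (1 if keep_odd else 2)
    o = out_end - 1
    while i >= in_start:
        out[o] = buf[i]
        i -= 2
        o -= 1
    return o + 1  # new start of the compacted output

# common in-place use: output range ends where the input range ends
arr = [10, 20, 30, 40, 50, 60]
new_start = compact_buffer_backwards(arr, 0, 6, arr, 6, keep_odd=True)
assert new_start == 3
assert arr[3:6] == [20, 40, 60]  # odd-index half survives, in order
```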

val newBot = bot - 1
items(newBot) = t
levels(0) = newBot

@danking (Collaborator) commented Mar 19, 2019:
rogue newline

}

object ApproxCDFCombiner {
def apply[@specialized(Int, Long, Float, Double) T: ClassTag : Ordering : ApproxCDFHelper](

@danking (Collaborator) commented Mar 19, 2019:
Can we have a test that fails if specialization fails on Int, Long, Float, or Double? I suspect the overhead of boxing will be significant, so asserting the time is within some order of magnitude of a certain expected time seems reasonable.

@danking (Collaborator) commented Mar 19, 2019:
Also, have you inspected the byte code to be certain there's no lingering unspecialized values?

@patrick-schultz (Author, Collaborator) commented Mar 19, 2019:
I haven't inspected the byte code. I was doing some timing tests during development, and it was pretty obvious when I did something that caused boxing. But it was close enough (less than double the time) that a timing test might have too many false positives.

@danking (Collaborator) commented Mar 19, 2019:
mm that's unfortunate

@@ -15,6 +15,7 @@ case class AggSignature(
seqOpArgs: Seq[Type])

sealed trait AggOp { }
final case class ApproxCDF() extends AggOp

@danking (Collaborator) commented Mar 19, 2019:
Even if we don't verify the result is "correct", we should have tests that run through the JVM code generation process for each type.

@patrick-schultz (Author, Collaborator) commented Mar 19, 2019:
I wrote a python test that should be doing that.

@danking (Collaborator) commented Mar 19, 2019:
Ah, sorry, I missed that

* Shift lower levels up to keep items contiguous.
*/
def compactLevel(level: Int, shiftLowerLevels: Boolean = true) {
assert(level < maxNumLevels - 1)

@danking (Collaborator) commented Mar 19, 2019:
what does it mean to compact a level > numLevels? (equivalently: should this be assert(level <= numLevels - 1)?)

if (shiftLowerLevels) {
// shift up values below
if (sizeBelow > 1) {
// copy [bot, a) to [bot+halfSize, b)

@danking (Collaborator) commented Mar 19, 2019:
this is pretty clear from the code, I don't think it's necessary

@patrick-schultz (Author, Collaborator) commented Mar 19, 2019:
The code doesn't mention the upper endpoints of the intervals, which is why I found this helpful. But happy to remove it if you think it's clear.

@danking (Collaborator) commented Mar 19, 2019:
This is the argument du jour. I think it's pretty clear once I've understood the code. I am OK with a pair of assertions:

assert(levels(0) + sizeBelow == a)
assert(bot + halfSize + sizeBelow == b)
// shift up values below
if (sizeBelow > 1) {
// copy [bot, a) to [bot+halfSize, b)
System.arraycopy(items, levels(0), items, bot + halfSize, sizeBelow)

@danking (Collaborator) commented Mar 19, 2019:
bot == levels(0)

System.arraycopy(items, levels(0), items, bot + halfSize, sizeBelow)
} else {
// only needs to be done if sizeBelow == 1, but doesn't hurt otherwise
items(b - 1) = items(a0)

@danking (Collaborator) commented Mar 19, 2019:
Why not just use array copy to copy the one element?

@danking (Collaborator) commented Mar 19, 2019:
Also, what if there's one element in level 0 and an even number of elements in level 1, then sizeBelow is 1, a0 == a, and we should actually be copying a - 1. Does that never happen?

@patrick-schultz (Author, Collaborator) commented Mar 19, 2019:
Why not just use array copy to copy the one element?

If you look at the docstring for arraycopy, it does a lot of precondition checking, figuring out if the in and out ranges overlap, etc. The sizeBelow == 1 case is very common, and I suspect (but haven't checked) that avoiding arraycopy is faster in this case.

Does that never happen?

I don't think that can happen, but I also don't think this code should be relying on that. I think replacing a0 with bot fixes it. Agree?

@danking (Collaborator) commented Mar 19, 2019:
yeah bot seems right

/* Compact level `level`, merging the compacted results into level `level+1`.
* Shift lower levels up to keep items contiguous.
*/
def compactLevel(level: Int, shiftLowerLevels: Boolean = true) {

@danking (Collaborator) commented Mar 19, 2019:
We can unit test this method and I think we should do so.

@danking (Collaborator) commented Mar 19, 2019:
We'll need to seed the randomness.

}
}

def merge(other: ApproxCDFCombiner[T], ubOnNumLevels: Int ): ApproxCDFCombiner[T] = {

@danking (Collaborator) commented Mar 19, 2019:
I think we should also unit test this method.

@danking (Collaborator) left a comment:

x

val items: Array[Int] = Array(7,2,4,1,3,8,0,5,9)
val combiner = new ApproxCDFCombiner(levels, items, 3, 0, 9, rand)
combiner.compactLevel(1, shiftLowerLevels = true)
println(items.mkString(","))

@danking (Collaborator) commented Mar 19, 2019:
if you want this to be reported on error, add it as a message to the assert

@patrick-schultz (Author, Collaborator) commented Mar 20, 2019:

Oops, just left in debugging cruft.

val combiner = new ApproxCDFCombiner(levels, items, 3, 0, 9, rand)
combiner.compactLevel(1, shiftLowerLevels = true)
println(items.mkString(","))
assert(items.view(1,9) sameElements Array(7,2,4,1,0,5,8,9))

@danking (Collaborator) commented Mar 19, 2019:

I thought level 1 was [levels(i), levels(i+1)), i.e. [3, 6), i.e. 1,3,8 and you're merging into 0,5,9 which I'd expect to result in 7,2,4,0,1,5,8,9?

@patrick-schultz (Author, Collaborator) commented Mar 20, 2019:

Here's what happened:

The values, split into levels, are

7 2 4 | 1 3 8 | 0 5 9

We start compacting level 1. There are an odd number of elements, so we leave the first one where it is, and conceptually put the rest into a scratch level between levels 1 and 2:

7 2 4 | 1 | 3 8 | 0 5 9

Now we flip a coin, and it comes up 1, meaning we skip the first element, i.e. we take the odd indices, and merge half the level into the level above, giving

7 2 4 | 1 | ? | 0 5 8 9

where we have left an unused spot, so we shift up everything below:

? | 7 2 4 | 1 | 0 5 8 9
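The walkthrough above can be reproduced with a couple of throwaway Python helpers (not the PR's internals), with the coin fixed to 1:

```python
# Throwaway reproduction of the compaction walkthrough, coin fixed to 1.
def half(run, coin):
    """Keep every other element of a sorted run; coin == 1 keeps odd indices."""
    return [v for i, v in enumerate(run) if i % 2 == coin]

level1 = [1, 3, 8]
level2 = [0, 5, 9]
stay, scratch = level1[:1], level1[1:]   # odd count: first element stays put
assert stay == [1]
assert half(scratch, 1) == [8]           # coin came up 1: take odd indices
merged = sorted(half(scratch, 1) + level2)
assert merged == [0, 5, 8, 9]            # half of level 1 merged into level 2
# final layout after shifting lower levels up: ? | 7 2 4 | 1 | 0 5 8 9
```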
* aggregator is an array "values" of samples, in increasing order, and an array
* "ranks" of integers less than `n`, in increasing order, such that:
* - ranks.length = values.length + 1
* - ranks(0) = 0

@danking (Collaborator) commented Mar 22, 2019:

why include it?

@danking danking merged commit bb41118 into hail-is:master Mar 22, 2019

1 check passed: hail-ci-0-1 (successful build)