

Add ApproximateCDF aggregator #5570

Merged
merged 21 commits into hail-is:master on Mar 22, 2019
Conversation


@patrick-schultz patrick-schultz commented Mar 8, 2019

This only exposes the raw results to python. I still need to wrap that in some nice convenience functions like approximate_quantile, but I didn't want to let this PR get any longer.

Notes
-----
This method returns a struct collecting two array: `values` and `ranks`. The

@danking danking Mar 14, 2019

containing two arrays

aggregator. For example, values=[0,2,5,6,9] and ranks=[0,3,4,5,8,10]
represents the approximation [0,0,0,2,5,6,6,6,9,9], with the value
`values(i)` occupying indices `ranks(i)` (inclusive) to `ranks(i+1)`
(exclusive).
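A minimal Python sketch (illustration only, not part of the PR) of how this values/ranks encoding expands back into the multiset it represents, using the example above:

```python
def expand(values, ranks):
    """Expand a values/ranks CDF summary into the multiset it encodes.

    values[i] occupies indices ranks[i] (inclusive) to ranks[i+1] (exclusive),
    so ranks must be one element longer than values.
    """
    assert len(ranks) == len(values) + 1
    out = []
    for i, v in enumerate(values):
        out.extend([v] * (ranks[i + 1] - ranks[i]))
    return out

print(expand([0, 2, 5, 6, 9], [0, 3, 4, 5, 8, 10]))
# [0, 0, 0, 2, 5, 6, 6, 6, 9, 9]
```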

@danking danking Mar 14, 2019

I think this paragraph will perplex more than it edifies.

I think I would avoid giving the approximation example. Instead I'd say something like:

Suppose we had a collection of values: 0, 0, 1, 1, 2, 3, 10, 11, 12. The
median of this collection is 2. The 25th percentile is 1. The 75th percentile
is 11. We may represent these three quantiles as:

[   1,    2,   11]
[0.25, 0.50, 0.75]

We could also list the rank of the elements instead of their quantile:

[ 1, 2, 11]
[ 2, 4,  7]

This method produces an approximation of the previous representation. The
values are always members of the input, but their ranks may slightly differ from
their true ranks.
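The numbers in this suggestion can be checked with a quick Python sketch (illustration only), computing each rank as the count of elements strictly below the value:

```python
data = [0, 0, 1, 1, 2, 3, 10, 11, 12]

def rank(x, data):
    # rank of x = number of elements strictly less than x
    return sum(1 for v in data if v < x)

for x in (1, 2, 11):
    print(x, rank(x, data))
# 1 -> 2, 2 -> 4, 11 -> 7, matching the ranks listed above
```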


@danking danking Mar 14, 2019

Oh and I guess something about the last rank being the size of the dataset? Hmm. Not sure. I think the above verbiage builds a more helpful intuition but I see that it is mismatched with the actual output.


@patrick-schultz patrick-schultz Mar 15, 2019

For some context, I think of the approx_cdf aggregator as a low level thing that most users will never use directly. For a separate PR I've been adding approximate quantile aggregators and plotting methods, which are more user friendly endpoints. So I was writing this documentation with the superusers in mind who might want to build new use cases on top of this low level primitive.

Besides the fact that the ranks array is one longer than the values array, I think the intuition that the sketch is a list of approximate quantiles is subtly misleading. Not that it's wrong, exactly, but thinking that way tends to lead me to wrong conclusions.

Not sure if you will find this helpful, but it was clarifying for me to consider the extreme case where we collect only one sample. This implementation doesn't support that case, but if it did, the only thing it could do (and still be mathematically correct) is to collect a single uniform random sample, with weight N (the size of the data). Then the cdf would be {values=[s], ranks=[0,N]}.

The key property of the sketch is that, for any query x, we can return an unbiased estimate of the rank of x. In this case, we return 0 if x≤s and N if x>s. Note that this is in fact an unbiased estimate: if s is a uniform random sample, then s<x with probability equal to the quantile of x, so the expected estimate is N times the quantile of x, i.e. the true rank. If we flip it around and estimate the qth quantile, we will answer s for any q. In terms of the description I gave in the comment, that's because our sketch represents the stream that contains N copies of s. Your example gives the impression that in this case the sketch would try to find a sample close to the median, but that is not the case, nor are we claiming any particular quantile between 0 and 1 as our guess of the quantile of s.

I agree this could stand to be clarified, and I'll think more about the best thing to say. The best I can say right now is that this way of understanding the sketch is what I've had to keep falling back on when trying to figure out the right way to compute a quantile estimate, or how to draw a CDF plot.
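The unbiasedness in the single-sample case can be checked empirically. A Python sketch (illustration only; dataset and names are hypothetical) that keeps one uniform sample s with weight N and averages the resulting rank estimates:

```python
import random

random.seed(0)

data = list(range(1000))   # hypothetical dataset; the true rank of x is x itself
N = len(data)
x = 300                    # query point; true rank is 300

def estimate_rank(x, data):
    # one-sample sketch: a single uniform sample s with weight N,
    # i.e. cdf = {values=[s], ranks=[0, N]}
    s = random.choice(data)
    return 0 if x <= s else N

trials = 20000
mean = sum(estimate_rank(x, data) for _ in range(trials)) / trials
print(mean)  # each estimate is 0 or N, but the average is close to 300
```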

object ApproxCDFCombiner {
def apply[@specialized(Int, Long, Float, Double) T: ClassTag : Ordering](
numLevels: Int, capacity: Int, dummy: T
)(implicit helper: ApproxCDFHelper[T]): ApproxCDFCombiner[T] = new ApproxCDFCombiner[T](

@danking danking Mar 14, 2019

you could also add this as another : ApproxCDFHelper since you don't reference the name

59049, 177147, 531441, 1594323, 4782969, 14348907, 43046721, 129140163,
387420489, 1162261467, 3486784401L, 10460353203L, 31381059609L,
94143178827L, 282429536481L, 847288609443L, 2541865828329L, 7625597484987L,
22876792454961L, 68630377364883L, 205891132094649L)

@danking danking Mar 14, 2019

verified.
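These constants are easy to machine-check; a quick Python sketch (illustration only) confirming that the excerpt above lists consecutive powers of 3 starting at 3^10:

```python
table = [
    59049, 177147, 531441, 1594323, 4782969, 14348907, 43046721, 129140163,
    387420489, 1162261467, 3486784401, 10460353203, 31381059609,
    94143178827, 282429536481, 847288609443, 2541865828329, 7625597484987,
    22876792454961, 68630377364883, 205891132094649,
]
# entry i should equal 3 ** (10 + i)
assert all(v == 3 ** (10 + i) for i, v in enumerate(table))
print("verified:", len(table), "entries")
```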


@patrick-schultz patrick-schultz commented Mar 15, 2019

Addressed. But I'm open to push-back on the best way to document this.


@danking danking left a comment

I had these comments queued, didn't get a chance to dive in today. Put it on my calendar for tomorrow.

These represent a summary of the CDF of the distribution of values. In
particular, for any value `x = values(i)` in the summary, we estimate that
there are `ranks(i)` values strictly less than `x`, and that there are
`ranks(i+1)` values less that or equal to `x`. For any value `x` (not

@danking danking Mar 19, 2019

less than or equal


@danking danking Mar 19, 2019

maybe change variable names to indicate we're no longer speaking of the previous x


@danking danking left a comment

Here are some more comments. I'm up to generalCompact, which looks complex and I think I'll pick up again tomorrow.


def compactBufferBackwards(
buf: Array[T], inStart: Int, inEnd: Int,
out: Array[T], outEnd: Int

@danking danking Mar 19, 2019

hmm, there's an invariant here that's something like:

assert buf != out || inStart >= outEnd

right? Same for above.


@patrick-schultz patrick-schultz Mar 19, 2019

They can be, and often are, the same buffer. For example, compactBufferBackwards(array, start, end, array, end) is a common use case. The possibility that the input and output ranges overlap is the only reason for having the separate forwards and backwards compact methods. compactBuffer (forwards) works when the input and output ranges start at the same index.

I can work out and assert the exact invariant if you'd like.


@danking danking Mar 19, 2019

I think for the backwards one it's actually inStart <= outEnd. That's sufficient for correctness and for the uses in the algorithm. As long as you start at or before outEnd, i is always ahead of out so it always reads an original value.


@patrick-schultz patrick-schultz Mar 22, 2019

This copies half of the interval [inStart, inEnd) to the interval [outStart, outEnd), where outStart = outEnd - (inEnd - inStart) / 2. If buf and out are the same array, and inStart < outEnd < inEnd, then the first thing that happens is copying buf(inEnd - 1) or buf(inEnd - 2) (depending on the coin flip), to buf(outEnd - 1), which will overwrite a value in [inStart, inEnd) that we haven't yet copied.

It's safe if either outEnd <= inStart, in which case the input and output intervals are completely disjoint, or outEnd >= inEnd, in which case we are guaranteed to only overwrite input values that have already been copied.
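A small Python sketch (illustration only, not the PR's Scala) of the backwards variant, showing why in-place use is safe when outEnd >= inEnd: the read pointer always stays at or below the write pointer's source, so no uncopied input is overwritten:

```python
def compact_buffer_backwards(buf, in_start, in_end, out, out_end, coin):
    # Copy every other element of buf[in_start:in_end], starting from the top
    # and working down, into out ending at out_end (exclusive).
    # coin in {0, 1} selects which half of the elements survives.
    i = in_end - 1 - coin   # read pointer, steps down by 2
    o = out_end - 1         # write pointer, steps down by 1
    while i >= in_start:
        out[o] = buf[i]
        i -= 2
        o -= 1
    return o + 1            # start of the compacted output interval

a = [1, 2, 3, 4, 5, 6]
# in place, with out_end == in_end: half of [0, 6) survives at the top
start = compact_buffer_backwards(a, 0, 6, a, 6, coin=0)
print(start, a[start:])  # 3 [2, 4, 6]
```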

val newBot = bot - 1
items(newBot) = t
levels(0) = newBot

Copy link
Collaborator

@danking danking Mar 19, 2019

rogue new line

}

object ApproxCDFCombiner {
def apply[@specialized(Int, Long, Float, Double) T: ClassTag : Ordering : ApproxCDFHelper](

@danking danking Mar 19, 2019

Can we have a test that fails if specialization fails on Int, Long, Float, or Double? I suspect the overhead of boxing will be significant, so asserting the time is within some order of magnitude of a certain expected time seems reasonable.


@danking danking Mar 19, 2019

Also, have you inspected the byte code to be certain there's no lingering unspecialized values?


@patrick-schultz patrick-schultz Mar 19, 2019

I haven't inspected the byte code. I was doing some timing tests during development, and it was pretty obvious when I did something that caused boxing. But it was close enough (less than double the time) that a timing test might have too many false positives.


@danking danking Mar 19, 2019

mm that's unfortunate

@@ -15,6 +15,7 @@ case class AggSignature(
seqOpArgs: Seq[Type])

sealed trait AggOp { }
final case class ApproxCDF() extends AggOp

@danking danking Mar 19, 2019

Even if we don't verify the result is "correct", we should have tests that run through the JVM code generation process for each type.


@patrick-schultz patrick-schultz Mar 19, 2019

I wrote a python test that should be doing that.


@danking danking Mar 19, 2019

Ah, sorry, I missed that

* Shift lower levels up to keep items contiguous.
*/
def compactLevel(level: Int, shiftLowerLevels: Boolean = true) {
assert(level < maxNumLevels - 1)

@danking danking Mar 19, 2019

what does it mean to compact a level > numLevels? (equivalently: should this be assert(level <= numLevels - 1)?)

if (shiftLowerLevels) {
// shift up values below
if (sizeBelow > 1) {
// copy [bot, a) to [bot+halfSize, b)

@danking danking Mar 19, 2019

this is pretty clear from the code, I don't think it's necessary


@patrick-schultz patrick-schultz Mar 19, 2019

The code doesn't mention the upper endpoints of the intervals, which is why I found this helpful. But happy to remove it if you think it's clear.


@danking danking Mar 19, 2019

This is the argument du jour. I think it's pretty clear once I've understood the code. I am OK with a pair of assertions:

assert(levels(0) + sizeBelow == a)
assert(bot + halfSize + sizeBelow == b)

// shift up values below
if (sizeBelow > 1) {
// copy [bot, a) to [bot+halfSize, b)
System.arraycopy(items, levels(0), items, bot + halfSize, sizeBelow)

@danking danking Mar 19, 2019

bot == levels(0)

System.arraycopy(items, levels(0), items, bot + halfSize, sizeBelow)
} else {
// only needs to be done if sizeBelow == 1, but doesn't hurt otherwise
items(b - 1) = items(a0)

@danking danking Mar 19, 2019

Why not just use array copy to copy the one element?


@danking danking Mar 19, 2019

Also, what if there's one element in level 0 and an even number of elements in level 1, then sizeBelow is 1, a0 == a, and we should actually be copying a - 1. Does that never happen?


@patrick-schultz patrick-schultz Mar 19, 2019

Why not just use array copy to copy the one element?

If you look at the docstring for arraycopy, it does a lot of precondition checking, figuring out if the in and out ranges overlap, etc. The sizeBelow == 1 case is very common, and I suspect (but haven't checked) that avoiding arraycopy is faster in this case.

Does that never happen?

I don't think that can happen, but I also don't think this code should be relying on that. I think replacing a0 with bot fixes it. Agree?


@danking danking Mar 19, 2019

yeah bot seems right

/* Compact level `level`, merging the compacted results into level `level+1`.
* Shift lower levels up to keep items contiguous.
*/
def compactLevel(level: Int, shiftLowerLevels: Boolean = true) {

@danking danking Mar 19, 2019

We can unit test this method and I think we should do so.


@danking danking Mar 19, 2019

We'll need to seed the randomness.

}
}

def merge(other: ApproxCDFCombiner[T], ubOnNumLevels: Int): ApproxCDFCombiner[T] = {

@danking danking Mar 19, 2019

I think we should also unit test this method.


@danking danking left a comment

x

val items: Array[Int] = Array(7,2,4,1,3,8,0,5,9)
val combiner = new ApproxCDFCombiner(levels, items, 3, 0, 9, rand)
combiner.compactLevel(1, shiftLowerLevels = true)
println(items.mkString(","))

@danking danking Mar 19, 2019

if you want this to be reported on error, add it as a message to the assert


@patrick-schultz patrick-schultz Mar 20, 2019

Oops, just left in debugging cruft.

val combiner = new ApproxCDFCombiner(levels, items, 3, 0, 9, rand)
combiner.compactLevel(1, shiftLowerLevels = true)
println(items.mkString(","))
assert(items.view(1,9) sameElements Array(7,2,4,1,0,5,8,9))

@danking danking Mar 19, 2019

I thought level 1 was [levels(i), levels(i+1)), i.e. [3, 6), i.e. 1,3,8 and you're merging into 0,5,9 which I'd expect to result in 7,2,4,0,1,5,8,9?


@patrick-schultz patrick-schultz Mar 20, 2019

Here's what happened:

The values, split into levels, are

7 2 4 | 1 3 8 | 0 5 9

We start compacting level 1. There are an odd number of elements, so we leave the first one where it is, and conceptually put the rest into a scratch level between levels 1 and 2:

7 2 4 | 1 | 3 8 | 0 5 9

Now we flip a coin, and it comes up 1, meaning we skip the first element, i.e. we take the odd indices, and merge half the level into the level above, giving

7 2 4 | 1 | ? | 0 5 8 9

where we have left an unused spot, so we shift up everything below:

? | 7 2 4 | 1 | 0 5 8 9
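The walkthrough above can be reproduced with a small Python sketch (illustration only, a simplified model of the Scala combiner, with the coin flip passed in rather than random):

```python
def compact_level_sketch(items, levels, level, coin):
    # levels[i]..levels[i+1] delimit level i; level 0 sits at the bottom of the array
    lo, hi, top = levels[level], levels[level + 1], levels[level + 2]
    buf = sorted(items[lo:hi])
    keep = []
    if len(buf) % 2 == 1:           # odd size: leave the first element in place
        keep, buf = [buf[0]], buf[1:]
    survivors = buf[coin::2]        # coin flip picks even or odd positions
    merged = sorted(items[hi:top] + survivors)  # merge into the level above
    freed = len(buf) - len(survivors)
    # shift everything below up to keep items contiguous at the top
    return [None] * freed + items[:lo] + keep + merged + items[top:]

items = [7, 2, 4, 1, 3, 8, 0, 5, 9]
out = compact_level_sketch(items, [0, 3, 6, 9], level=1, coin=1)
print(out)  # [None, 7, 2, 4, 1, 0, 5, 8, 9]
```

With coin=1 this reproduces the test's expected `items.view(1,9) == [7,2,4,1,0,5,8,9]`.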

* aggregator is an array "values" of samples, in increasing order, and an array
* "ranks" of integers less than `n`, in increasing order, such that:
* - ranks.length = values.length + 1
* - ranks(0) = 0

@danking danking Mar 22, 2019

why include it?

@danking danking merged commit bb41118 into hail-is:master Mar 22, 2019
1 check passed