Add ApproximateCDF aggregator #5570
Conversation
b194dc6 to 3f8cc9b (Compare)
```
Notes
-----
This method returns a struct collecting two arrays: `values` and `ranks`. The
aggregator. For example, values=[0,2,5,6,9] and ranks=[0,3,4,5,8,10]
represents the approximation [0,0,0,2,5,6,6,6,9,9], with the value
`values(i)` occupying indices `ranks(i)` (inclusive) to `ranks(i+1)`
(exclusive).
```
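As a concrete illustration of the quoted representation, here is a small plain-Python sketch (my own, not the Hail implementation; `expand` is a hypothetical helper name) that decodes a `(values, ranks)` pair back into the approximated sorted list:

```python
def expand(values, ranks):
    """Expand a (values, ranks) summary into the approximated sorted list.

    values[i] occupies indices ranks[i] (inclusive) to ranks[i+1] (exclusive),
    so ranks has one more entry than values and ranks[-1] is the total count.
    """
    out = []
    for i, v in enumerate(values):
        out.extend([v] * (ranks[i + 1] - ranks[i]))
    return out

# The example from the docstring:
print(expand([0, 2, 5, 6, 9], [0, 3, 4, 5, 8, 10]))
# [0, 0, 0, 2, 5, 6, 6, 6, 9, 9]
```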
I think this paragraph will perplex more than it edifies.
I think I would avoid giving the approximation example. Instead I'd say something like:
Suppose we had a collection of values: 0, 0, 1, 1, 2, 3, 10, 11, 12. The median of this collection is 2. The 25th percentile is 1. The 75th percentile is 11. We may represent these three quantiles as:

```
[   1,    2,   11]
[0.25, 0.50, 0.75]
```

We could also list the rank of the elements instead of their quantile:

```
[ 1, 2, 11]
[ 2, 4,  7]
```

This method produces an approximation of the previous representation. The values are always members of the input, but their ranks may slightly differ from their true ranks.
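A short Python sketch (my own, not part of the PR) makes the rank representation above concrete: the rank of a value is the number of elements strictly less than it.

```python
data = sorted([0, 0, 1, 1, 2, 3, 10, 11, 12])

def rank(x, data):
    # rank of x = number of elements strictly less than x
    return sum(1 for v in data if v < x)

# The three quantile values from the comment, with their ranks:
print([rank(v, data) for v in [1, 2, 11]])
# [2, 4, 7]
```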
Oh and I guess something about the last rank being the size of the dataset? Hmm. Not sure. I think the above verbiage builds a more helpful intuition but I see that it is mismatched with the actual output.
For some context, I think of the approx_cdf aggregator as a low level thing that most users will never use directly. For a separate PR I've been adding approximate quantile aggregators and plotting methods, which are more user friendly endpoints. So I was writing this documentation with the superusers in mind who might want to build new use cases on top of this low level primitive.
Besides the fact that the ranks array is one longer than the values array, I think the intuition that the sketch is a list of approximate quantiles is subtly misleading. Not that it's wrong, exactly, but thinking that way tends to lead me to wrong conclusions.
Not sure if you will find this helpful, but it was clarifying for me to consider the extreme case where we collect only one sample. This implementation doesn't support that case, but if it did, the only thing it could do (and still be mathematically correct), is to collect a single uniform random sample, with weight N (the size of the data). Then the cdf would be {values=[s], ranks=[0,N]}.
The key property of the sketch is that, for any query x, we can return an unbiased estimate of the rank of x. In this case, we return 0 if x ≤ s and N if x > s (that is, normalized rank 0 or 1). Note that this is in fact an unbiased estimate: if s is a uniform random sample, then s < x with probability equal to the quantile of x. If we flip it around and estimate the qth quantile, we will answer s for any q. In terms of the description I gave in the comment, that's because our sketch represents the stream that contains N copies of s. Your example gives the impression that in this case the sketch would try to find a sample close to the median, but that is not the case, nor are we providing any other quantile between 0 and 1 as our guess of the quantile of s.
I agree this could stand to be clarified, and I'll think more about the best thing to say. The best I can say right now is that this way of understanding the sketch is what I've had to keep falling back on when trying to figure out the right way to compute a quantile estimate, or how to draw a CDF plot.
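To back up that intuition numerically, here is a small simulation (my own sketch, not PR code) of the one-sample extreme case: a single uniform sample s, given weight N, yields a rank estimate that is unbiased over the random choice of s.

```python
import random

random.seed(0)
data = list(range(100))  # N = 100 distinct values
N = len(data)
x = 30                   # query value; its true rank is 30

# One-sample sketch: keep a single uniform random sample s with weight N.
# The rank estimate for x is 0 if x <= s, else N.
trials = 200_000
total = 0
for _ in range(trials):
    s = random.choice(data)
    total += 0 if x <= s else N
estimate = total / trials
print(estimate)  # averages out close to the true rank, 30
```

Each individual estimate is wildly wrong (always 0 or 100), but the average converges to the true rank, which is exactly the unbiasedness property described above.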
```scala
object ApproxCDFCombiner {
  def apply[@specialized(Int, Long, Float, Double) T: ClassTag : Ordering](
    numLevels: Int, capacity: Int, dummy: T
  )(implicit helper: ApproxCDFHelper[T]): ApproxCDFCombiner[T] = new ApproxCDFCombiner[T](
```
You could also add this as another context bound, `: ApproxCDFHelper`, since you don't reference the name.
```scala
  59049, 177147, 531441, 1594323, 4782969, 14348907, 43046721, 129140163,
  387420489, 1162261467, 3486784401L, 10460353203L, 31381059609L,
  94143178827L, 282429536481L, 847288609443L, 2541865828329L, 7625597484987L,
  22876792454961L, 68630377364883L, 205891132094649L)
```

Addressed. But I'm open to push-back on the best way to document this.
danking left a comment
I had these comments queued, didn't get a chance to dive in today. Put it on my calendar for tomorrow.
```
These represent a summary of the CDF of the distribution of values. In
particular, for any value `x = values(i)` in the summary, we estimate that
there are `ranks(i)` values strictly less than `x`, and that there are
`ranks(i+1)` values less than or equal to `x`. For any value `x` (not
```
maybe change variable names to indicate we're no longer speaking of the previous x
danking left a comment
Here are some more comments. I'm up to generalCompact, which looks complex and I think I'll pick up again tomorrow.
```scala
def compactBufferBackwards(
  buf: Array[T], inStart: Int, inEnd: Int,
  out: Array[T], outEnd: Int
```
hmm, there's an invariant here that's something like:

```scala
assert(buf != out || inStart >= outEnd)
```

right? Same for above.
They can be, and often are, the same buffer. For example, compactBufferBackwards(array, start, end, array, end) is a common use case. The possibility that the input and output ranges overlap is the only reason for having the separate forwards and backwards compact methods. compactBuffer (forwards) works when the input and output ranges start at the same index.
I can work out and assert the exact invariant if you'd like.
I think for the backwards one it's actually inStart <= outEnd. That's sufficient for correctness and for the uses in the algorithm. As long as you start at or before outEnd, i is always ahead of out so it always reads an original value.
This copies half of the interval [inStart, inEnd) to the interval [outStart, outEnd), where outStart = outEnd - (inEnd - inStart) / 2. If buf and out are the same array, and inStart < outEnd < inEnd, then the first thing that happens is copying buf(inEnd - 1) or buf(inEnd - 2) (depending on the coin flip), to buf(outEnd - 1), which will overwrite a value in [inStart, inEnd) that we haven't yet copied.
It's safe if either outEnd <= inStart, in which case the input and output intervals are completely disjoint, or outEnd >= inEnd, in which case we are guaranteed to only overwrite input values that have already been copied.
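A toy Python version of the backwards compaction (my own sketch; names and the coin-flip handling are simplified relative to the Scala) shows why `outEnd >= inEnd` makes the in-place case safe: the read index starts at or ahead of the write index and moves down twice as fast, so every read sees an original value.

```python
def compact_backwards(buf, in_start, in_end, out, out_end, offset=0):
    # Copy every other element of buf[in_start:in_end), starting from the
    # top, into out[..:out_end), filling downward. `offset` (0 or 1) stands
    # in for the coin flip that picks which half survives.
    i = in_end - 1 - offset
    o = out_end - 1
    while i >= in_start:
        out[o] = buf[i]  # when out is buf and out_end >= in_end, i >= o here
        i -= 2
        o -= 1

buf = [0, 1, 2, 3, 4, 5, 6, 7]
compact_backwards(buf, 0, 8, buf, 8)  # in place: out_end == in_end
print(buf[4:])  # [1, 3, 5, 7] -- the surviving half, compacted at the top
```

Running the same call with `out_end` strictly between `in_start` and `in_end` would overwrite not-yet-read input, which is exactly the unsafe overlap described above.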
```scala
val newBot = bot - 1
items(newBot) = t
levels(0) = newBot
```
```scala
object ApproxCDFCombiner {
  def apply[@specialized(Int, Long, Float, Double) T: ClassTag : Ordering : ApproxCDFHelper](
```
Can we have a test that fails if specialization fails on Int, Long, Float, or Double? I suspect the overhead of boxing will be significant, so asserting the time is within some order of magnitude of a certain expected time seems reasonable.
Also, have you inspected the byte code to be certain there's no lingering unspecialized values?
I haven't inspected the byte code. I was doing some timing tests during development, and it was pretty obvious when I did something that caused boxing. But it was close enough (less than double the time) that a timing test might have too many false positives.
```scala
  seqOpArgs: Seq[Type])
```

```scala
sealed trait AggOp { }
final case class ApproxCDF() extends AggOp
```
Even if we don't verify the result is "correct", we should have tests that run through the JVM code generation process for each type.
I wrote a python test that should be doing that.
```scala
 * Shift lower levels up to keep items contiguous.
 */
def compactLevel(level: Int, shiftLowerLevels: Boolean = true) {
  assert(level < maxNumLevels - 1)
```
what does it mean to compact a level > numLevels? (equivalently: should this be assert(level <= numLevels - 1)?)
```scala
if (shiftLowerLevels) {
  // shift up values below
  if (sizeBelow > 1) {
    // copy [bot, a) to [bot+halfSize, b)
```
this is pretty clear from the code, I don't think it's necessary
The code doesn't mention the upper endpoints of the intervals, which is why I found this helpful. But happy to remove it if you think it's clear.
This is the argument du jour. I think it's pretty clear once I've understood the code. I am OK with a pair of assertions:

```scala
assert(levels(0) + sizeBelow == a)
assert(bot + halfSize + sizeBelow == b)
```

```scala
// shift up values below
if (sizeBelow > 1) {
  // copy [bot, a) to [bot+halfSize, b)
  System.arraycopy(items, levels(0), items, bot + halfSize, sizeBelow)
} else {
  // only needs to be done if sizeBelow == 1, but doesn't hurt otherwise
  items(b - 1) = items(a0)
```
Why not just use array copy to copy the one element?
Also, what if there's one element in level 0 and an even number of elements in level 1, then sizeBelow is 1, a0 == a, and we should actually be copying a - 1. Does that never happen?
> Why not just use array copy to copy the one element?

If you look at the docstring for arraycopy, it does a lot of precondition checking, figuring out if the in and out ranges overlap, etc. The sizeBelow == 1 case is very common, and I suspect (but haven't checked) that avoiding arraycopy is faster in this case.

> Does that never happen?

I don't think that can happen, but I also don't think this code should be relying on that. I think replacing a0 with bot fixes it. Agree?
```scala
/* Compact level `level`, merging the compacted results into level `level+1`.
 * Shift lower levels up to keep items contiguous.
 */
def compactLevel(level: Int, shiftLowerLevels: Boolean = true) {
```
We can unit test this method and I think we should do so.
We'll need to seed the randomness.
```scala
  }
}
```

```scala
def merge(other: ApproxCDFCombiner[T], ubOnNumLevels: Int): ApproxCDFCombiner[T] = {
```
I think we should also unit test this method.
```scala
val items: Array[Int] = Array(7,2,4,1,3,8,0,5,9)
val combiner = new ApproxCDFCombiner(levels, items, 3, 0, 9, rand)
combiner.compactLevel(1, shiftLowerLevels = true)
println(items.mkString(","))
```
if you want this to be reported on error, add it as a message to the assert
Oops, just left in debugging cruft.
```scala
val combiner = new ApproxCDFCombiner(levels, items, 3, 0, 9, rand)
combiner.compactLevel(1, shiftLowerLevels = true)
println(items.mkString(","))
assert(items.view(1,9) sameElements Array(7,2,4,1,0,5,8,9))
```
I thought level 1 was [levels(i), levels(i+1)), i.e. [3, 6), i.e. 1,3,8 and you're merging into 0,5,9 which I'd expect to result in 7,2,4,0,1,5,8,9?
Here's what happened. The values, split into levels, are

```
7 2 4 | 1 3 8 | 0 5 9
```

We start compacting level 1. There are an odd number of elements, so we leave the first one where it is, and conceptually put the rest into a scratch level between levels 1 and 2:

```
7 2 4 | 1 | 3 8 | 0 5 9
```

Now we flip a coin, and it comes up 1, meaning we skip the first element, i.e. we take the odd indices, and merge half the level into the level above, giving

```
7 2 4 | 1 | ? | 0 5 8 9
```

where we have left an unused spot, so we shift up everything below:

```
? | 7 2 4 | 1 | 0 5 8 9
```
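The walkthrough above can be simulated in a few lines of plain Python (my own sketch, not the Scala implementation; `compact_level_1` is a hypothetical name and only handles compacting level 1). With the coin fixed to 1, it reproduces the layout from the test's assertion:

```python
def compact_level_1(items, levels, coin):
    # levels = [bot, l1, l2, top]: boundaries of levels 0, 1, 2 in `items`
    bot, l1, l2, top = levels
    level1 = items[l1:l2]
    keep = []
    if len(level1) % 2 == 1:
        # odd size: the first element stays behind in level 1
        keep = [level1[0]]
        level1 = level1[1:]
    surviving = level1[coin::2]                 # coin flip: even or odd indices
    merged = sorted(surviving + items[l2:top])  # merge half into level 2
    compacted = items[bot:l1] + keep + merged
    # shift everything up so items stay contiguous at the top
    pad = (top - bot) - len(compacted)
    return [None] * pad + compacted

items = [7, 2, 4, 1, 3, 8, 0, 5, 9]
out = compact_level_1(items, [0, 3, 6, 9], coin=1)
print(out[1:])  # [7, 2, 4, 1, 0, 5, 8, 9]
```

The `None` in slot 0 corresponds to the `?` above: freed capacity at the bottom of the buffer.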
```scala
 * aggregator is an array "values" of samples, in increasing order, and an array
 * "ranks" of integers less than `n`, in increasing order, such that:
 * - ranks.length = values.length + 1
 * - ranks(0) = 0
```
This only exposes the raw results to python. I still need to wrap that in some nice convenience functions like approximate_quantile, but I didn't want to let this PR get any longer.