faster bin transform #1225
Conversation
Force-pushed from 5e2cd64 to 3c876fc
Looks like I’ve introduced a regression or two (probably the first and last bin). Getting close though!
There’s still some polishing we could do to maybeBin—for example, we could re-implement quantization binning for numeric data instead of using bisection, and maybe we could change the default reducer for data to be a no-op when we detect that there’s no other channel defined in options to reference it. And we should review the changes to maybeBin closely, since it’s really easy to introduce errors in the edge cases. (Fortunately we have a lot of tests!) But, pretty excited about this! The 1M test now renders in ~250 ms, down from ~15 s, a 60× improvement. 🚀
Ah, this breaks one-dimensional cumulative binning (e.g., …)
@@ -74,7 +83,7 @@ function binn(
   gx, // optionally group on x (exclusive with bx and gy)
   gy, // optionally group on y (exclusive with by and gx)
   {
-    data: reduceData = reduceIdentity,
+    data: reduceData = reduceIdentity, // TODO avoid materializing when unused?
Not a single one of the tests seems to be using the return value of reduceIdentity.reduce.
Yes, it’s rare, but changing it would break backwards compatibility. You can do it like this:
Plot.rectY(data, {...Plot.binX(), title: D => D.length})
You could detect these by looking at the passed-in options, or maybe this.channels, and seeing if any of the channel definitions there do not correspond to channels produced by the bin transform. If any such channel is found, then reduceData needs to default to reduceIdentity instead of “reduceNone” (i.e., produce undefined).
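One way to sketch that detection (a hypothetical sketch, not Plot’s actual internals: `defaultDataReducer`, `outputChannels`, and the placeholder reducer shapes below are all assumptions):

```javascript
// Placeholder reducer shapes for the sketch (not Plot's real definitions).
const reduceIdentity = {reduce: (I, X) => I.map((i) => X[i])}; // materializes binned data
const reduceNone = {reduce: () => undefined}; // proposed no-op ("none") reducer

// Default the data reducer to a no-op unless some channel in the caller's
// options is NOT one produced by the bin transform itself -- only such an
// extra channel (e.g. title) can reference the materialized bin data.
function defaultDataReducer(options, outputChannels) {
  const produced = new Set(outputChannels); // e.g. ["x1", "x2", "y"]
  const extra = Object.keys(options).some(
    (key) => key !== "data" && !produced.has(key)
  );
  return extra ? reduceIdentity : reduceNone;
}
```

With this, the `title: D => D.length` example above would still get materialized bin data, while the common case pays nothing.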
Force-pushed from 1d2243f to 8634456
export async function bin1m() {
  return Plot.plot({
    marks: [Plot.rectY(dates, Plot.binX({y: "count", data: "first"}))]
Reading this, it feels like a hack. In the future we might want a "null" data reducer to convey the meaning (with no added performance benefit, since it would call a null reducer, {reduce: () => {}}).
Yes, we should add the null
and/or "none"
reducer in the future.
function binfilter([{x0, x1}, set]) {
  return [x0, x1, set.size ? (I) => I.filter(set.has, set) : binempty];

// non-cumulative distribution
function bin1(E, T, V) {
Here's a bin quantizer that can be used in lieu of bin1 if we use the "number" ticks. However, due to floating-point rounding, we need to undershoot and then correct course… this is really not my most beautiful piece of code, and I fear it might create more problems.
function binq(E, T, V) {
if (T.length < 2) return bin1(E, T, V); // degenerate case
const a = T[0];
const b = (1 + 1e-12) / (T[1] - T[0]);
return (I) => {
const B = E.map(() => []);
for (const i of I) {
let j = Math.floor(b * (V[i] - a));
if (T[j] > V[i]) j++;
B[j]?.push(i);
}
return B;
};
}
In my tests it's about 30% faster, so maybe worth a shot (later).
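To make the trade-off concrete in isolation, here is a standalone sketch (hypothetical names, not the PR’s code): `T` is a uniformly spaced threshold array, `V` the values, `I` the indices to bin, bins are half-open `[T[j], T[j+1])`, and out-of-range indices are dropped.

```javascript
// Bisection-based binning: O(log n) per value.
function binByBisect(T, V) {
  return (I) => {
    const B = Array.from({length: T.length - 1}, () => []);
    for (const i of I) {
      let lo = 0, hi = T.length; // find rightmost j with T[j] <= V[i]
      while (lo < hi) {
        const mid = (lo + hi) >>> 1;
        if (T[mid] <= V[i]) lo = mid + 1;
        else hi = mid;
      }
      B[lo - 1]?.push(i); // out-of-range indices are silently dropped
    }
    return B;
  };
}

// Quantization-based binning: O(1) per value, with a correction step in
// case floating-point rounding lands the division one bin off.
function binByQuantize(T, V) {
  const a = T[0], step = T[1] - T[0];
  return (I) => {
    const B = Array.from({length: T.length - 1}, () => []);
    for (const i of I) {
      let j = Math.floor((V[i] - a) / step);
      if (T[j] > V[i]) j--; // overshot due to rounding
      else if (T[j + 1] <= V[i]) j++; // undershot due to rounding
      B[j]?.push(i);
    }
    return B;
  };
}
```

This sketch uses a two-sided correction instead of the `(1 + 1e-12)` fudge factor above; either way, the quantizer assumes uniformly spaced thresholds, which is why it only applies to "number" ticks.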
I think we’ll want to use the value returned by tickIncrement directly here (like we do in d3.bin) rather than “rediscovering” it as T[1] - T[0]. But yes, I suggest we defer this optimization to later.
Not only faster, but also much easier to read than the previous binfilter approach.
This also sidesteps a subtle issue we had with d3-bin (evidenced by the modification of the shorthandBinRectY test). Although the default bin domain is "extent", when we passed d3.extent as the domain it was not recognized as being the default (https://github.com/d3/d3-array/blob/191aa03f0519593e938f5a0cae545617866103e2/src/bin.js#L37), and nice was not applied, which is why we had only 3 bins instead of the now-correct 6. (wrong analysis, too subtle ;-) )
I investigated this; it was recognizing the domain function correctly as extent. Instead, the change in behavior is because we’ve changed the logic (in a way that I think is still valid). Under the old logic, this happened:
Whereas under the new logic:
In other words, under the new logic we compute the tick increment before we extend the domain. That is possible because we are computing the extended (niced) ticks directly, rather than first nicing the domain and then recomputing the tick increment.
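The new ordering can be illustrated in isolation. This is a sketch, not the PR’s code: `niceThresholds` is a hypothetical name, and `tickIncrement` below is a simplified, positive-only rendition of d3’s tick-increment heuristic.

```javascript
// Simplified tick increment: pick a step of 1, 2, 5, or 10 times a power of
// ten from the raw extent (d3.tickIncrement encodes sub-unit steps as
// negative reciprocals; this sketch just returns a positive step).
function tickIncrement(start, stop, count) {
  const step = (stop - start) / Math.max(1, count);
  const power = Math.floor(Math.log10(step));
  const error = step / 10 ** power;
  const factor =
    error >= Math.sqrt(50) ? 10 :
    error >= Math.sqrt(10) ? 5 :
    error >= Math.sqrt(2) ? 2 : 1;
  return factor * 10 ** power;
}

// New ordering: compute the increment from the raw [min, max] extent first,
// then extend (nice) the thresholds outward to multiples of that increment,
// instead of nicing the domain and recomputing the increment from it.
function niceThresholds(min, max, count) {
  const step = tickIncrement(min, max, count);
  const i0 = Math.floor(min / step);
  const i1 = Math.ceil(max / step);
  const T = [];
  for (let i = i0; i <= i1; ++i) T.push(i * step); // multiply, don't accumulate
  return T;
}
```

Multiplying `i * step` rather than accumulating `x += step` keeps the thresholds exact multiples of the increment, matching how d3 generates ticks.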
* bin 1m test
* faster binning
* fix first and last bin
* fix first and last bin, again
* fix last bin, again
* bypass slow data reducer
* data reducer is required
* fix single-value bin
* fix 1d cumulative
Fixes #454. The main ideas are:
I haven’t implemented two-dimensional binning yet, but it should be possible without impacting the performance of one-dimensional binning. I also haven’t implemented cumulative binning, but I don’t anticipate any major challenges doing so. These have now been implemented. Notes here: https://observablehq.com/d/0fb511ca14875b15