faster bin transform #1225

Merged
mbostock merged 9 commits into main from mbostock/faster-bin on Jan 17, 2023

Conversation

mbostock (Member) commented Jan 15, 2023

Fixes #454. The main ideas are:

  • Instead of binning everything, then separately grouping (and faceting), and lastly intersecting the bins with the groups (i.e., binfilter), bin each group separately after grouping. I chose to “eject” from d3.bin for flexibility.
  • Coerce dates to numbers (and use a typed array). This makes a dramatic improvement to the speed of bisection.

I haven’t implemented two-dimensional binning yet, but it should be possible without impacting the performance of one-dimensional binning. I also haven’t implemented cumulative binning, but I don’t anticipate any major challenges doing so. Update: these have now been implemented.

Notes here: https://observablehq.com/d/0fb511ca14875b15
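To illustrate the second idea, here is a minimal sketch of the coercion, assuming hypothetical `dates` and numeric `thresholds` arrays (not the actual Plot internals):

```js
import {bisectRight} from "d3-array";

// Coercing Date objects to numbers in a typed array once, up front, means
// each bisection compares floats instead of repeatedly calling Date#valueOf.
const values = Float64Array.from(dates, Number);
const bins = Array.from({length: thresholds.length - 1}, () => []);
for (let i = 0; i < values.length; ++i) {
  const j = bisectRight(thresholds, values[i]) - 1;
  if (j >= 0 && j < bins.length) bins[j].push(i); // out-of-domain values are dropped
}
```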

mbostock (Member Author)

Looks like I’ve introduced a regression or two (probably the first and last bin). Getting close though!

@mbostock mbostock requested a review from Fil January 16, 2023 05:01
@mbostock mbostock marked this pull request as ready for review January 16, 2023 05:02
mbostock (Member Author)

There’s still some polishing we could do to maybeBin—for example, we could re-implement quantization binning for numeric data instead of using bisection, and maybe we could change the default reducer for data to be a no-op when we detect that there’s no other channel defined in options to reference it. And we should review the changes to maybeBin closely, since it’s really easy to introduce errors in the edge cases. (Fortunately we have a lot of tests!)

But, pretty excited about this! The 1M test now renders in ~250 ms, down from ~15 s, a 60× improvement. 🚀

mbostock (Member Author) commented Jan 16, 2023

Ah, this breaks one-dimensional cumulative binning (e.g., x: {value: "carat", cumulative: true}). I’ll need to fix that, and we should have a test since it isn’t currently tested. Update: fixed! 👍
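For reference, the one-dimensional cumulative case looks like this in Plot’s bin options (assuming a diamonds dataset, as in the standard carat example):

```js
Plot.rectY(diamonds, Plot.binX({y: "count"}, {x: "carat", cumulative: true}))
```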

```diff
@@ -74,7 +83,7 @@ function binn(
   gx, // optionally group on x (exclusive with bx and gy)
   gy, // optionally group on y (exclusive with by and gx)
   {
-    data: reduceData = reduceIdentity,
+    data: reduceData = reduceIdentity, // TODO avoid materializing when unused?
```
Fil (Contributor)

Not a single one of the tests seems to be using the return value of reduceIdentity.reduce.

mbostock (Member Author)

Yes, it’s rare, but changing it would break backwards compatibility. You can do it like this:

```js
Plot.rectY(data, {...Plot.binX(), title: (D) => D.length})
```

You could detect these by looking at the passed-in options, or maybe this.channels, and seeing if any of the channel definitions there do not correspond to channels produced by the bin transform. If any such channel is found, then reduceData needs to default to reduceIdentity instead of “reduceNone” (i.e., produce undefined).
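A rough sketch of that detection, with hypothetical names rather than Plot’s actual internals:

```js
// Channels produced by the bin transform itself (illustrative, not exhaustive).
const produced = new Set(["x", "x1", "x2", "y", "y1", "y2", "fill", "stroke"]);

// If any requested channel is not produced by the transform, it may reference
// the binned data (like title above), so default to materializing the data.
function defaultDataReducer(options) {
  const extra = Object.keys(options).some((key) => !produced.has(key));
  return extra ? reduceIdentity : reduceNone; // reduceNone produces undefined
}
```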

@mbostock mbostock force-pushed the mbostock/faster-bin branch from 1d2243f to 8634456 Compare January 17, 2023 17:56

```js
export async function bin1m() {
  return Plot.plot({
    marks: [Plot.rectY(dates, Plot.binX({y: "count", data: "first"}))]
  });
}
```
Fil (Contributor)

Reading this, it feels like a hack. In the future we might want a "null" data reducer to convey the meaning explicitly (with no added performance benefit, since it would just call a no-op reducer, {reduce: () => {}}).

mbostock (Member Author)

Yes, we should add the null and/or "none" reducer in the future.
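A minimal sketch of such a reducer, following the shape suggested above (hypothetical, not an implemented Plot API):

```js
// A "none" data reducer: yields undefined for every bin, so no per-bin
// subset of the input data is ever materialized.
const reduceNone = {reduce: () => undefined};
```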

```js
function binfilter([{x0, x1}, set]) {
  return [x0, x1, set.size ? (I) => I.filter(set.has, set) : binempty];
}

// non-cumulative distribution
function bin1(E, T, V) {
```
Fil (Contributor) commented Jan 17, 2023

Here's a bin quantizer that can be used in lieu of bin1 if we use the "number" ticks. However, due to floating-point rounding, we need to undershoot and then correct course… this is really not the most beautiful piece of code, and I fear it might create more problems.

```js
function binq(E, T, V) {
  if (T.length < 2) return bin1(E, T, V); // degenerate case: fall back to bisection
  const a = T[0]; // first threshold
  const b = (1 + 1e-12) / (T[1] - T[0]); // inverse tick increment, nudged for rounding
  return (I) => {
    const B = E.map(() => []); // one bucket of indices per bin
    for (const i of I) {
      let j = Math.floor(b * (V[i] - a)); // quantize the value to a bin index
      if (T[j] > V[i]) j++; // correct course after the floating-point undershoot
      B[j]?.push(i); // values outside the bin domain are dropped
    }
    return B;
  };
}
```

In my tests it's about 30% faster, so maybe worth a shot (later).

mbostock (Member Author)

I think we’ll want to use the value returned by tickIncrement directly here (like we do in d3.bin) rather than “rediscovering” it as T[1] - T[0]. But yes, I suggest we defer this optimization to later.
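A sketch of that suggestion, assuming the extent and bin count are at hand (hypothetical lo, hi, and count):

```js
import {tickIncrement} from "d3-array";

// Use the exact increment rather than rediscovering it as T[1] - T[0],
// which is subject to floating-point error. (For fractional steps,
// tickIncrement returns a negative inverse, elided here.)
const step = tickIncrement(lo, hi, count);
const b = (1 + 1e-12) / step; // inverse increment for the quantizer above
```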

Fil (Contributor) left a comment

Not only faster, but also much easier to read than the previous approach (binfilter).

This also sidesteps a subtle issue we had with d3-bin (evidenced by the modification of the shorthandBinRectY test). Although the default bin domain is "extent", when we passed d3.extent as the domain it was not recognized as being the default (https://github.com/d3/d3-array/blob/191aa03f0519593e938f5a0cae545617866103e2/src/bin.js#L37), and nice was not applied. Which is why we had only 3 bins instead of the now-correct 6. (Update: wrong analysis, too subtle ;-))

mbostock (Member Author) commented Jan 17, 2023

> Although the default bin domain is "extent", when we passed d3.extent as the domain it was not recognized as being the default

I investigated this; it was recognizing the domain function correctly as extent. Instead, the change in behavior is because we’ve changed the logic (in a way that I think is still valid).

Under the old logic, this happened:

  1. The default thresholdAuto (capped thresholdScott) suggests 4 bins.
  2. The nice domain of [154.83, 179.37] is extended to [150, 180] (d3.nice(154.83, 179.37, 4)).
  3. The subsequent ticks are [150, 160, 170, 180] (d3.ticks(150, 180, 4)).

Whereas under the new logic:

  1. The default thresholdAuto (capped thresholdScott) suggests 4 bins.
  2. The tick increment is computed as 5 (d3.tickIncrement(154.83, 179.37, 4)).
  3. The subsequent extended ticks are [150, 155, 160, 165, 170, 175, 180].

In other words, under the new logic we compute the tick increment before we extend the domain. That is possible because we are computing the extended (niced) ticks directly, rather than first nicing the domain and then recomputing the tick increment.
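Spelled out with d3, the new computation amounts to this (a sketch using the values above, not the exact code in the PR):

```js
import {tickIncrement} from "d3-array";

const step = tickIncrement(154.83, 179.37, 4); // 5
const x0 = Math.floor(154.83 / step) * step; // 150
const x1 = Math.ceil(179.37 / step) * step; // 180
const ticks = [];
for (let x = x0; x <= x1; x += step) ticks.push(x);
// ticks = [150, 155, 160, 165, 170, 175, 180]
```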

@mbostock mbostock merged commit b773d87 into main Jan 17, 2023
@mbostock mbostock deleted the mbostock/faster-bin branch January 17, 2023 22:25
@Fil Fil mentioned this pull request Jan 18, 2023
chaichontat pushed a commit to chaichontat/plot that referenced this pull request Jan 14, 2024
* bin 1m test

* faster binning

* fix first and last bin

* fix first and last bin, again

* fix last bin, again

* bypass slow data reducer

* data reducer is required

* fix single-value bin

* fix 1d cumulative
Linked issue: Could binning millions of values be faster? #454