Use columnar data #451

Fil · 2021-07-09T21:54:54Z

We detect a columnar table by its column accessor. In that case, we accept it as "already arrayified" and use the column accessor, rather than the field accessor, to construct any channel defined by a string. Furthermore, we tweak range so that it uses the table's indices method (if present) rather than iterating.

With this, I think we're using Arquero in an optimal way, while still allowing channels to be defined with function accessors that run on the table iterator.

closes #449

Build and example at https://observablehq.com/d/eab191ed35920c7c

The “proof” that it's working is given by inspecting:

Plot.valueof(table, "age") === table.column("age") // true
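
For readers following along, here is a minimal sketch of the detection described above. It is not the code from this PR: the `table` object is a hypothetical stand-in for a columnar table, and `isColumnar` and this simplified `valueof` are illustrative names only.

```javascript
// Hypothetical stand-in for a columnar table. Only the `column` accessor
// matters for detection; the iterator yields row objects.
const table = {
  columns: {age: [31, 45, 27], name: ["Ann", "Bob", "Cy"]},
  column(name) {
    return this.columns[name];
  },
  *[Symbol.iterator]() {
    const n = this.columns.age.length;
    for (let i = 0; i < n; ++i) {
      yield {age: this.columns.age[i], name: this.columns.name[i]};
    }
  }
};

// Detect columnar data by the presence of a column accessor (assumption:
// a function named `column` is a sufficient signal).
function isColumnar(data) {
  return data != null && typeof data.column === "function";
}

// Simplified valueof: a string channel on columnar data reuses the column
// as-is; anything else falls back to mapping over the rows.
function valueof(data, value) {
  if (typeof value === "string") {
    return isColumnar(data) ? data.column(value) : Array.from(data, (d) => d[value]);
  }
  if (typeof value === "function") return Array.from(data, value);
  return value;
}

const ages = valueof(table, "age"); // same object as table.column("age")
const doubled = valueof(table, (d) => d.age * 2); // function accessor on row objects
```

The string channel returns the column without copying, while the function accessor still runs on row objects produced by the iterator.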

Fil · 2021-07-09T22:56:59Z

OK, this doesn't fully work; in particular, we can't use these columns for the title option. The reason is that Plot's code for the title option checks whether the title channel value L is nonempty for each index i, by calling nonempty(L[i]). But arquero's columns do not have a getter for L[i], only L.get(i), so L[i] is always undefined.
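
A small sketch of the failure mode, under assumptions: `Column` is a hypothetical class that only mimics the get(i)-without-[i] shape described above, and `nonempty` is a simplified stand-in for Plot's helper, which may differ in detail.

```javascript
// Hypothetical column: values are reachable through get(i) only.
class Column {
  constructor(values) {
    this.data = values;
  }
  get(i) {
    return this.data[i];
  }
}

// Simplified emptiness check (assumption: null, undefined and "" are empty).
const nonempty = (x) => x != null && `${x}` !== "";

const L = new Column(["Ann", "", "Cy"]);
const index = [0, 1, 2];

// L[i] is undefined for every i, so every title is filtered out.
const viaBrackets = index.filter((i) => nonempty(L[i]));

// L.get(i) reads the actual values, so only the empty string is dropped.
const viaGet = index.filter((i) => nonempty(L.get(i)));
```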

There might be ways to fix this from Plot, but I think this is a case where we would prefer to patch arquero and add a getter to the column prototype?

Fil · 2021-07-10T00:11:02Z

ok it was not that much of a complication:
return "get" in data ? new Proxy(data, {get: (_, i) => i in _ ? _[i] : _.get(i)}) : data;
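
To make the one-liner above easier to try in isolation, here is a self-contained sketch. The `Column` class and the `indexable` name are hypothetical; only the Proxy expression follows the line above.

```javascript
// Hypothetical column with get(i) but no indexed properties.
class Column {
  constructor(values) {
    this.data = values;
  }
  get length() {
    return this.data.length;
  }
  get(i) {
    return this.data[i];
  }
}

// Wrap anything that has a `get` method so that unknown property reads,
// including numeric indices, are forwarded to get(). Existing properties
// and methods are served from the target unchanged.
function indexable(data) {
  return "get" in data
    ? new Proxy(data, {
        get: (target, key) => (key in target ? target[key] : target.get(key))
      })
    : data;
}

const L = indexable(new Column(["Ann", "", "Cy"]));
const first = L[0]; // forwarded to get("0"); property keys arrive as strings
const last = L.get(2); // existing methods still work through the proxy
const plain = [1, 2, 3];
const untouched = indexable(plain); // arrays have no get(), returned as-is
```

Note that the trap receives property keys as strings (or symbols), so this relies on get() tolerating a string index.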

src/mark.js

mbostock

Thanks for this, Fil. It’s a good effort, but it raises subtle concerns; I’m not yet comfortable landing this. Unfortunately I don’t have an actionable recommendation yet on how we should proceed, but I want to share my thoughts so far.

My main concern is that Arquero support is currently a leaky abstraction. It could have a downstream ripple effect because Arquero data goes through different code paths. And this will be a particular problem for custom marks and transforms.

For example, arrayify(data) no longer converts the Arquero Table instance into an array; arrayify can now return something that is not an array. Plot transforms are passed data, which means that they will now potentially receive an Arquero table instance, too, which could cause them to break. For example the stack transform expects there to be a data.length property, but Arquero only implements table.size.

plot/src/transforms/stack.js

Line 66 in 20d118c

const n = data.length;

This problem is masked somewhat in this PR because valueof and range also support data being an Arquero Table, but we can’t control arbitrary code in transforms, so this makes Arquero support a leaky abstraction.
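
A sketch of how this can go wrong in third-party code. The `table` object and the `runningTotal` transform are hypothetical; the point is only that a transform written against arrays reads data.length, which a table exposing size does not provide.

```javascript
// Hypothetical table-like object: it has `size` but no `length`.
const table = {size: 3};

// A custom transform written with arrays in mind.
function runningTotal(data, Y) {
  const n = data.length; // undefined when data is the table
  const out = [];
  let sum = 0;
  // `0 < undefined` is false, so the loop body never runs for the table.
  for (let i = 0; i < n; ++i) out.push((sum += Y[i]));
  return out;
}

const Y = [1, 2, 3];
const fromArray = runningTotal([{}, {}, {}], Y); // [1, 3, 6]
const fromTable = runningTotal(table, Y); // [] with no error raised
```

The table case fails silently rather than throwing, which is what makes this kind of leak hard to notice.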

In the motivating issue #449, note that I wanted the data to be table.indices(), not the Arquero table itself. This means that the data can be a valid array. However, if the data is an index rather than the Arquero table, it’ll make it harder to define channels as functions since those functions will now be passed an integer index, not a row object.
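
The trade-off can be sketched as follows. The `table` here is a hypothetical stand-in that exposes column(name) and indices(); it is not Arquero itself, and the shapes are assumptions for illustration.

```javascript
// Hypothetical columnar table stand-in.
const columns = {age: [31, 45, 27], height: [1.6, 1.8, 1.7]};
const table = {
  column: (name) => columns[name],
  indices: () => Uint32Array.from(columns.age, (_, i) => i)
};

// With the index as data, marks and transforms see a real array.
const data = Array.from(table.indices()); // [0, 1, 2]

// A function channel now receives an integer index, so it has to look the
// value up in a column itself.
const age = table.column("age");
const x = data.map((i) => age[i]);

// A row-style accessor no longer works: d is a number, not a row object.
const broken = data.map((d) => d.age);
```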

I also wonder: if the data is an Arquero table, then could channels be defined as table expressions to avoid instantiating row objects (i.e., for derived columns rather than for pre-existing named columns)?

I’m going to think about this some more.

Fil · 2021-07-11T08:13:12Z

At least it helps clarify a few things.

- Maybe what we need from arquero (or could build as a wrapper around the arquero table format, with Proxy) is to expose a more "array-like" API: accessing data with [i], having a length, and possibly a column API that returns an object with the same "array-like" characteristics.
- The n = data.length line in the stack transform should probably be Y.length (independently of this particular use case).
- The removal of field() seems a good change, also independently of this use case. **DONE in #453, "Don’t promote string channel values to functions".**
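
As a rough illustration of the first point, a Proxy-based wrapper could look like the sketch below. Everything here is hypothetical: the `table` stand-in assumes a `size` property, an `object(i)` method returning a row object, and a `column(name)` method; `arrayLike` is an illustrative name, not an existing API.

```javascript
// Hypothetical columnar table stand-in.
const columns = {age: [31, 45, 27], name: ["Ann", "Bob", "Cy"]};
const table = {
  size: 3,
  column: (name) => columns[name],
  object: (i) => ({age: columns.age[i], name: columns.name[i]})
};

// Expose `length` and numeric indexing on top of the table, and pass every
// other property through unchanged.
function arrayLike(target) {
  return new Proxy(target, {
    get(t, key) {
      if (key === "length") return t.size;
      if (typeof key === "string" && /^\d+$/.test(key)) return t.object(+key);
      const value = t[key];
      return typeof value === "function" ? value.bind(t) : value;
    }
  });
}

const data = arrayLike(table);
const n = data.length; // 3, read from table.size
const row = data[1]; // {age: 45, name: "Bob"}, built on demand
const ages = data.column("age"); // column access still works
```

Row objects are created only when an index is read, so code that sticks to columns never pays for them.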

Fil · 2021-07-13T13:55:51Z

I've added details in the demo https://observablehq.com/d/eab191ed35920c7c about what needs to be "proxied" for an arquero table to have an API that is closer to an array. Still pretty much inconclusive at this point.

mbostock · 2021-07-13T14:37:17Z

I believe that Proxy is fairly slow, so I wonder if this is still faster than the existing (iterator over row objects) approach. And also whether Arquero is a performance bottleneck in any case, compared to rendering. I was also considering using table.array instead of table.column, but then we’re making a copy of the column which would be nice to avoid if possible.
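
One way to check this would be a micro-benchmark along the following lines. This is only a sketch: `Column` is a hypothetical stand-in for a column with get(i), the timings depend heavily on the engine and data size, and no particular outcome is claimed here.

```javascript
// Hypothetical column backed by a typed array.
class Column {
  constructor(values) {
    this.data = values;
  }
  get(i) {
    return this.data[i];
  }
}

const n = 1e5;
const column = new Column(Float64Array.from({length: n}, (_, i) => i));
const proxied = new Proxy(column, {
  get: (target, key) => (key in target ? target[key] : target.get(key))
});

// Time a function and log the elapsed milliseconds.
function time(label, f) {
  const start = Date.now();
  const result = f();
  console.log(label, `${Date.now() - start} ms`);
  return result;
}

// Direct get(i) calls on the column.
const sumGet = time("column.get(i)", () => {
  let s = 0;
  for (let i = 0; i < n; ++i) s += column.get(i);
  return s;
});

// Indexed reads through the Proxy.
const sumProxy = time("proxy[i]", () => {
  let s = 0;
  for (let i = 0; i < n; ++i) s += proxied[i];
  return s;
});

// Copy the column to a plain array first, then read by index.
const sumCopy = time("copy, then array[i]", () => {
  const a = Array.from(column.data);
  let s = 0;
  for (let i = 0; i < n; ++i) s += a[i];
  return s;
});
```

All three loops compute the same sum, so the comparison only measures access cost (plus the copy in the last case).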

Anyway, speaking of performance, one optimization I know we want to make is binning of many values (a few million, say). This currently uses bisection in D3, but I suspect we could do something faster (quantization) when the thresholds are uniformly spaced. I’ll file an issue for that.
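
A sketch of the idea, not of D3's or Plot's actual code: with uniformly spaced thresholds, the bin index can be computed arithmetically instead of searched for. The `bisectRight` below is a local reimplementation for comparison, and `quantizeBin` is a hypothetical name.

```javascript
// Binary search: number of thresholds less than or equal to x.
function bisectRight(thresholds, x) {
  let lo = 0;
  let hi = thresholds.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (thresholds[mid] <= x) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Quantization: constant-time bin index for n uniform bins over [x0, x1],
// clamped so that x1 itself falls in the last bin.
function quantizeBin(x, x0, x1, n) {
  const i = Math.floor(((x - x0) * n) / (x1 - x0));
  return Math.max(0, Math.min(n - 1, i));
}

const x0 = 0;
const x1 = 100;
const n = 10;

// Interior thresholds 10, 20, ..., 90.
const thresholds = Array.from({length: n - 1}, (_, i) => x0 + ((i + 1) * (x1 - x0)) / n);

const values = [0, 5, 10, 55, 99.9, 100];
const byBisect = values.map((x) => bisectRight(thresholds, x));
const byQuantize = values.map((x) => quantizeBin(x, x0, x1, n));
```

Both approaches agree on these sample values. With thresholds that are not exactly representable in floating point, values very close to a boundary could land in adjacent bins, so a real implementation would need to decide how to handle that.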

Fil · 2023-03-23T09:29:51Z

superseded by #1324

Fil requested a review from mbostock July 9, 2021 21:54

Fil marked this pull request as ready for review July 9, 2021 22:14

mbostock reviewed Jul 10, 2021

mbostock requested changes Jul 10, 2021

mbostock mentioned this pull request Jul 12, 2021

Don’t promote string channel values to functions. #453

Merged

Fil marked this pull request as draft July 13, 2021 10:16

rebase

afc9e38

Fil force-pushed the fil/arquero-optimize branch from fe5ae25 to afc9e38 Compare July 13, 2021 13:29

mbostock mentioned this pull request Jul 13, 2021

Could binning millions of values be faster? #454

Closed

mbostock added the question Further information is needed label Aug 11, 2021

mbostock force-pushed the main branch from 0cc0de9 to 322718f Compare December 8, 2021 15:56

Fil closed this Mar 23, 2023

Fil deleted the fil/arquero-optimize branch March 23, 2023 09:29
