-
Notifications
You must be signed in to change notification settings - Fork 178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use columnar data #451
Use columnar data #451
Conversation
OK this doesn't fully work, in particular we can't use these columns for the title option. The reason is that Plot's code for the title option checks if the title channel L value for each index i is nonempty, by calling There might be ways to fix this from Plot, but I think this is a case where we would prefer to patch arquero and add a getter to the column prototype? |
This comment has been minimized.
This comment has been minimized.
ok it was not that much of a complication: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this, Fil. It’s a good effort, but it raises subtle concerns; I’m not yet comfortable landing this. Unfortunately I don’t have an actionable recommendation yet on how we should proceed, but I want to share my thoughts so far.
My main concern is that Arquero support is currently a leaky abstraction. It could have a downstream ripple effect because Arquero data goes through different code paths. And this will be a particular problem for custom marks and transforms.
For example, arrayify(data) no longer converts the Arquero Table instance into an array; arrayify can now return something that is not an array. Plot transforms are passed data, which means that they will now potentially receive an Arquero table instance, too, which could cause them to break. For example the stack transform expects there to be a data.length property, but Arquero only implements table.size.
Line 66 in 20d118c
const n = data.length; |
This problem is masked somewhat in this PR because valueof
and range
also support data being an Arquero Table, but we can’t control arbitrary code in transforms, so this makes Arquero support a leaky abstraction.
In the motivating issue #449, note that I wanted the data to be table.indices(), not the Arquero table itself. This means that the data can be a valid array. However, if the data is an index rather than the Arquero table, it’ll make it harder to define channels as functions since those functions will now be passed an integer index, not a row object.
I also wonder: if the data is an Arquero table, then could channels be defined as table expressions to avoid instantiating row objects (i.e., for derived columns rather than for pre-existing named columns)?
I’m going to think about this some more.
At least it helps clarify a few things.
|
fe5ae25
to
afc9e38
Compare
I've added details in the demo https://observablehq.com/d/eab191ed35920c7c to what needs to be "proxied" for an arquero table to have an API that is closer to an array. Still pretty much inconclusive at this point. |
I believe that Proxy is fairly slow, so I wonder if this is still faster than the existing (iterator over row objects) approach. And also whether Arquero is a performance bottleneck in any case, compared to rendering. I was also considering using table.array instead of table.column, but then we’re making a copy of the column which would be nice to avoid if possible. Anyway, speaking of performance, one optimization I know we want to make is binning of many values (a few millions, say.) This currently uses bisection in D3 but I suspect we could do something faster (quantization) when the thresholds are uniformly-spaced. I’ll file an issue for that. |
superseded by #1324 |
We detect a columnar table by its column accessor. In that case, we accept it as "already arrayified", and use the column accessor to construct any channel defined by a string, rather than the field accessor. Furthermore, we tweak the range so that it uses the indices method (if present), rather than iterate.
With this, I think we're using arquero in an optimal way, while still allowing to define channels with function accessors running on the table iterator.
closes #449
Build and example at https://observablehq.com/d/eab191ed35920c7c
The “proof” that it's working is given by inspecting: