Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use columnar data #451

Closed
wants to merge 1 commit into from
Closed

Use columnar data #451

wants to merge 1 commit into from

Conversation

Fil
Copy link
Contributor

@Fil Fil commented Jul 9, 2021

We detect a columnar table by its column accessor. In that case, we accept it as "already arrayified", and use the column accessor to construct any channel defined by a string, rather than the field accessor. Furthermore, we tweak the range so that it uses the indices method (if present), rather than iterate.

With this, I think we're using arquero in an optimal way, while still allowing to define channels with function accessors running on the table iterator.

closes #449

Build and example at https://observablehq.com/d/eab191ed35920c7c

The “proof” that it's working is given by inspecting:

Plot.valueof(table, "age") === table.column("age") // true

@Fil Fil requested a review from mbostock July 9, 2021 21:54
@Fil Fil marked this pull request as ready for review July 9, 2021 22:14
@Fil
Copy link
Contributor Author

Fil commented Jul 9, 2021

OK this doesn't fully work, in particular we can't use these columns for the title option. The reason is that Plot's code for the title option checks if the title channel L value for each index i is nonempty, by calling nonempty(L[i]). But arquero's columns do not have a getter for L[i], only L.get(i), so L[i] is always undefined.

There might be ways to fix this from Plot, but I think this is a case where we would prefer to patch arquero and add a getter to the column prototype?

@Fil

This comment has been minimized.

@Fil
Copy link
Contributor Author

Fil commented Jul 10, 2021

ok it was not that much of a complication:
return "get" in data ? new Proxy(data, {get: (_, i) => i in _ ? _[i] : _.get(i)}) : data;

src/mark.js Outdated Show resolved Hide resolved
Copy link
Member

@mbostock mbostock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this, Fil. It’s a good effort, but it raises subtle concerns; I’m not yet comfortable landing this. Unfortunately I don’t have an actionable recommendation yet on how we should proceed, but I want to share my thoughts so far.

My main concern is that Arquero support is currently a leaky abstraction. It could have a downstream ripple effect because Arquero data goes through different code paths. And this will be a particular problem for custom marks and transforms.

For example, arrayify(data) no longer converts the Arquero Table instance into an array; arrayify can now return something that is not an array. Plot transforms are passed data, which means that they will now potentially receive an Arquero table instance, too, which could cause them to break. For example the stack transform expects there to be a data.length property, but Arquero only implements table.size.

const n = data.length;

This problem is masked somewhat in this PR because valueof and range also support data being an Arquero Table, but we can’t control arbitrary code in transforms, so this makes Arquero support a leaky abstraction.

In the motivating issue #449, note that I wanted the data to be table.indices(), not the Arquero table itself. This means that the data can be a valid array. However, if the data is an index rather than the Arquero table, it’ll make it harder to define channels as functions since those functions will now be passed an integer index, not a row object.

I also wonder: if the data is an Arquero table, then could channels be defined as table expressions to avoid instantiating row objects (i.e., for derived columns rather than for pre-existing named columns)?

I’m going to think about this some more.

@Fil
Copy link
Contributor Author

Fil commented Jul 11, 2021

At least it helps clarify a few things.

  • Maybe what we need from arquero (or could build as a wrapper around the arquero table format, with Proxy) is to expose a more "array-like" API: accessing data with [i], having a length, and possibly a column API that returns an object with the same "array-like" characteristics.
  • The n = data.length line in the stack transform should probably be Y.length (independently of this particular use case).
  • The removal of field() seems a good change, also independently of this use case. **DONE IN Don’t promote string channel values to functions. #453 **

@Fil
Copy link
Contributor Author

Fil commented Jul 13, 2021

I've added details in the demo https://observablehq.com/d/eab191ed35920c7c to what needs to be "proxied" for an arquero table to have an API that is closer to an array. Still pretty much inconclusive at this point.

@mbostock
Copy link
Member

mbostock commented Jul 13, 2021

I believe that Proxy is fairly slow, so I wonder if this is still faster than the existing (iterator over row objects) approach. And also whether Arquero is a performance bottleneck in any case, compared to rendering. I was also considering using table.array instead of table.column, but then we’re making a copy of the column which would be nice to avoid if possible.

Anyway, speaking of performance, one optimization I know we want to make is binning of many values (a few millions, say.) This currently uses bisection in D3 but I suspect we could do something faster (quantization) when the thresholds are uniformly-spaced. I’ll file an issue for that.

@Fil
Copy link
Contributor Author

Fil commented Mar 23, 2023

superseded by #1324

@Fil Fil closed this Mar 23, 2023
@Fil Fil deleted the fil/arquero-optimize branch March 23, 2023 09:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Arquero shorthand
2 participants