Support Apache Arrow tables in database clients #320

Open
domoritz opened this issue Nov 11, 2022 · 6 comments

@domoritz
Contributor

The call `yield batch.value.toArray();` introduces a copy that may not be needed. As soon as Arrow is supported as an output format, it would be good to remove this call.
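
To make the concern concrete, here is a minimal sketch. It assumes a hypothetical `FakeBatch` class as a stand-in for a columnar record batch (it is not the Apache Arrow API, and `rowsFromBatches` / `columnsFromBatches` are illustrative names, not functions from the DuckDB client). It contrasts a generator that materializes each batch into row objects with one that yields the columnar batch unchanged.

```javascript
// Hypothetical stand-in for a columnar record batch; NOT the Apache Arrow API.
class FakeBatch {
  constructor(columns) {
    this.columns = columns; // e.g. {x: Float64Array, y: Float64Array}
    this.numRows = Object.values(columns)[0]?.length ?? 0;
  }

  // Materializes one object per row, copying every value out of the columns.
  toArray() {
    const names = Object.keys(this.columns);
    return Array.from({length: this.numRows}, (_, i) =>
      Object.fromEntries(names.map((name) => [name, this.columns[name][i]]))
    );
  }
}

// Copying variant: every batch is converted to an array of row objects.
function* rowsFromBatches(batches) {
  for (const batch of batches) yield batch.toArray();
}

// Pass-through variant: the consumer receives the columnar batch as-is.
function* columnsFromBatches(batches) {
  for (const batch of batches) yield batch;
}

const batches = [
  new FakeBatch({x: Float64Array.of(1, 2), y: Float64Array.of(10, 20)})
];
const [rows] = rowsFromBatches(batches);
const [passed] = columnsFromBatches(batches);

console.log(rows.length); // 2 row objects were allocated
console.log(passed === batches[0]); // true: no copy was made
```

The pass-through variant is only useful if downstream consumers understand the columnar format, which is what the rest of this issue is about.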

@mbostock changed the title from "Remove toArray call in DuckDB" to "Support Apache Arrow tables in database clients" on Nov 11, 2022
@mbostock
Member

Retitled this issue to describe the more generic problem: we want to support Apache Arrow tables as a tabular data representation throughout database clients, SQL cells, and data table cells.

@domoritz
Contributor Author

apache/arrow#34939 adds an indexed access proxy for Arrow but the performance isn't great compared to properly adopting Arrow. It would be great to have Arrow support throughout the different clients and cells.
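
For readers unfamiliar with the idea, the general technique of an indexed access proxy can be sketched as follows. This is not the implementation from apache/arrow#34939; it assumes a hypothetical `indexedRows` helper over a plain object of columns, and only illustrates why each `table[i]` lookup carries overhead: every access goes through a trap and builds a fresh row object.

```javascript
// Hypothetical helper: wraps {columnName: arrayLike} in an array-like proxy.
// NOT the Apache Arrow API; a sketch of the general technique only.
function indexedRows(columns) {
  const names = Object.keys(columns);
  const length = names.length ? columns[names[0]].length : 0;

  // Builds a row object on demand from the underlying columns.
  const rowAt = (i) =>
    Object.fromEntries(names.map((name) => [name, columns[name][i]]));

  return new Proxy({length}, {
    get(target, key) {
      // Integer-like string keys are treated as row indexes.
      if (typeof key === "string" && /^\d+$/.test(key)) {
        const i = Number(key);
        return i < length ? rowAt(i) : undefined;
      }
      return Reflect.get(target, key);
    }
  });
}

const table = indexedRows({
  a: Int32Array.of(1, 2, 3),
  b: ["x", "y", "z"]
});

console.log(table.length); // 3
console.log(table[1]); // a new {a, b} object is built on each access
```

Code that reads `table[i]` in a hot loop pays for the trap and the allocation on every iteration, whereas code that reads the columns directly does not.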

@domoritz
Contributor Author

Now that Arrow is used in a lot more places, I think it may be a good time to revisit this issue. The extra copies are introducing extra overhead in many places and I think it would be super awesome if we could just pass Arrow columns directly into Plot (observablehq/plot#191) without it making extra copies.

@mbostock
Member

FWIW, Framework’s DuckDBClient (as of 1.3) returns Apache Arrow tables without materializing array-of-objects. So there’s that.

@domoritz
Contributor Author

Oh nice. I guess you can't just remove the toArray call here for backwards compatibility?

How good is Arrow/columnar data support in Plot these days?

@mbostock
Member

That’s correct, it wouldn’t be backwards-compatible so I don’t think we are likely to change the behavior in Observable notebooks any time soon. (But eventually we’ll have a way to version control the Observable standard library, and port improvements from Observable Framework back to notebooks.)

Plot uses columnar data internally, so I would rate support as excellent, but we don’t yet have the shorthand syntax so it’s cumbersome to avoid materializing the array-of-objects — you have to pass the column vectors in yourself for each channel. observablehq/plot#191 covers making the syntax more convenient.
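
As a rough illustration of passing column vectors in per channel, the sketch below builds a channel options object that references existing columns directly. The `columns` object and the `channelsFromColumns` helper are hypothetical, and the commented-out `Plot.dot` call assumes that Plot accepts an array-like `{length}` object as data when every channel is given as an array; check the Plot documentation before relying on that.

```javascript
// Hypothetical columnar query result (column name -> values); NOT the Arrow API.
const columns = {
  weight: Float64Array.of(3.1, 4.2, 5.3),
  height: Float64Array.of(50, 55, 61)
};

// Hypothetical helper: maps channel names to column vectors without copying.
function channelsFromColumns(columns, mapping) {
  return Object.fromEntries(
    Object.entries(mapping).map(([channel, name]) => [channel, columns[name]])
  );
}

const options = channelsFromColumns(columns, {x: "weight", y: "height"});
const data = {length: columns.weight.length};

// Assumption: with Observable Plot loaded, the mark could be created as
//   Plot.dot(data, options).plot()
// so that no array-of-objects is materialized.
console.log(options.x === columns.weight); // true: the channel shares the column
```

Writing this mapping by hand for every channel is the cumbersome part that the shorthand syntax proposed in observablehq/plot#191 is meant to remove.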
