scikit-learn and other libraries (e.g. VegaFusion) are working on adding support for the standard Python dataframe interchange protocol so that they can consume data from any dataframe library without first converting it to pandas, see e.g.:
Is there any reason not to support the dataframe protocol on every table expression?
It's not clear based on the linked issue or this one how implementing __dataframe__ only for in-memory tables would be useful, since you wouldn't be able to rely on the protocol being implemented for an arbitrary expression, resulting in all sorts of annoying checks everywhere in downstream code.
That said, is there any expectation about the performance characteristics of the dataframe interchange protocol?
Would it be a problem for our implementation of the protocol to trigger arbitrary computation? A user would have to trigger compute regardless to get a DataFrame, pyarrow Table, etc. out.
There are also a bunch of things in the protocol that simply do not make sense for ibis, like buffers and memory layout.
I'd suggest that we implement __dataframe__ like this to start:
Indeed, you are right. I assumed that restricting to in-memory tables would avoid accidental memory exhaustion, but this is not guaranteed for all expressions (some expressions might yield results bigger than the source data).
A generic implementation based on pyarrow or pandas is probably the way to go, depending on which library has the most complete coverage of the spec. Note that the spec is not yet finalized and all existing implementations are still lacking as far as I know.
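To make the delegation idea concrete, here is a minimal sketch, not the ibis implementation. It assumes a hypothetical lazy expression class (`LazyTableExpr`) whose `__dataframe__` first triggers execution and then forwards to whatever in-memory object comes back. `MaterializedTable` is a stand-in for a pyarrow Table or pandas DataFrame, and both class names are invented for illustration. The `nan_as_null` and `allow_copy` parameters follow the interchange protocol's `__dataframe__` signature.

```python
class MaterializedTable:
    """Stand-in for an in-memory result (e.g. a pyarrow Table or pandas DataFrame)."""

    def __init__(self, columns):
        self.columns = columns

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        # A real library would return its interchange object here;
        # this stand-in just echoes its inputs so the flow is visible.
        return {"columns": self.columns, "allow_copy": allow_copy}


class LazyTableExpr:
    """Hypothetical lazy table expression (assumption, not the ibis API)."""

    def __init__(self, compute):
        self._compute = compute  # callable that produces the in-memory result
        self.executions = 0      # counts how often compute was triggered

    def execute(self):
        self.executions += 1
        return self._compute()

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        # Calling the protocol triggers computation, then delegates to the
        # materialized object's own __dataframe__ implementation.
        result = self.execute()
        return result.__dataframe__(nan_as_null=nan_as_null, allow_copy=allow_copy)


expr = LazyTableExpr(lambda: MaterializedTable({"a": [1, 2, 3]}))
interchange = expr.__dataframe__()
```

With this shape, nothing is computed until a consumer asks for the interchange object. That matches the point made above that a user would have to trigger compute anyway to get a DataFrame or pyarrow Table out.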
ogrisel changed the title from "feat: __dataframe__ protocol support for memory backed tables" to "feat: __dataframe__ protocol support" on Jun 2, 2023
Is your feature request related to a problem?
scikit-learn and other libraries (e.g. VegaFusion) are working on adding support for the standard Python dataframe interchange protocol so that they can consume data from any dataframe library without first converting it to pandas, see e.g.:
Describe the solution you'd like
Expose the standard __dataframe__ method on any memory-backed Table (e.g. those created by ibis.memtable).

This feature could start with the DuckDB backend and later be generalized to other backends with in-memory data (e.g. pandas, polars, ...).
Note: the dataframe protocol was mentioned in this discussion:
But I figured it would be worth opening a dedicated feature request specifically for this need.
What version of ibis are you running?
5.1.0 (master branch)
What backend(s) are you using, if any?
DuckDB