feat: __dataframe__ protocol support #6343

ogrisel · 2023-05-31T08:58:32Z

Is your feature request related to a problem?

scikit-learn and other libraries (e.g. VegaFusion) are working on adding support for the standard Python dataframe interchange protocol in order to be able to consume data from any dataframe library without having to first convert to pandas, see e.g.:

Support other dataframes like polars and pyarrow not just pandas scikit-learn/scikit-learn#25896

Describe the solution you'd like

Expose the standard __dataframe__ method on any memory-backed Table (e.g. those created by ibis.memtable).

This feature could start with the DuckDB backend and later be generalized to other backends with in-memory data (e.g. pandas, polars, ...).

Note: the dataframe protocol was mentioned in this discussion:

User experience feedback using Ibis as a general dataframe library #4542

But I figured it would be worth opening a dedicated feature request specifically for this need.

What version of ibis are you running?

5.1.0 (master branch)

What backend(s) are you using, if any?

DuckDB

Code of Conduct

I agree to follow this project's Code of Conduct


cpcloud · 2023-06-02T11:49:14Z

@ogrisel Thanks for the issue!

Is there any reason not to support the dataframe protocol on every table expression?

It's not clear based on the linked issue or this one how implementing __dataframe__ only for in-memory tables would be useful, since you wouldn't be able to rely on the protocol being implemented for an arbitrary expression, resulting in all sorts of annoying checks everywhere in downstream code.

That said, is there any expectation about the performance characteristics of the dataframe interchange protocol?

Would it be a problem for our implementation of the protocol to trigger arbitrary computation? A user would have to trigger computation regardless to get a pandas DataFrame, a pyarrow Table, etc. out.

There are also a bunch of things in the protocol that simply do not make sense for ibis, like buffers and memory layout.

I'd suggest that we implement __dataframe__ like this to start:

class Table(Expr):
	...  # existing APIs

	def __dataframe__(self):
		return self.to_pyarrow().__dataframe__()

and I guess hope that no one is calling expr.__dataframe__() in a tight loop!
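To make the cost of that delegation concrete, here is a minimal dependency-free sketch. The classes `MaterializedTable` and `LazyTable`, the method `to_materialized`, and the helper `consume` are all hypothetical stand-ins (they are not ibis or pyarrow APIs); only the `__dataframe__(nan_as_null, allow_copy)` signature is taken from the interchange protocol. It shows that a consumer only needs to check for `__dataframe__`, and that each call re-executes the expression.

```python
class MaterializedTable:
    """Hypothetical stand-in for an in-memory result such as a pyarrow Table."""

    def __init__(self, columns):
        self.columns = columns  # dict mapping column name -> list of values

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        # A real library would return an interchange object here; the sketch
        # returns itself so that it stays dependency-free.
        return self


class LazyTable:
    """Hypothetical stand-in for a lazy table expression."""

    def __init__(self, compute):
        self._compute = compute  # callable that produces the column data
        self.executions = 0      # counts how often the expression was executed

    def to_materialized(self):
        # Assumed analogue of materializing an expression (e.g. to_pyarrow()).
        self.executions += 1
        return MaterializedTable(self._compute())

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        # Delegate: execute the expression, then hand off to the result.
        return self.to_materialized().__dataframe__(
            nan_as_null=nan_as_null, allow_copy=allow_copy
        )


def consume(obj):
    """Downstream-style consumer that relies only on __dataframe__ being present."""
    if not hasattr(obj, "__dataframe__"):
        raise TypeError("object does not implement the dataframe interchange protocol")
    return obj.__dataframe__().columns


expr = LazyTable(lambda: {"x": [1, 2, 3]})
first = consume(expr)
second = consume(expr)
print(first, expr.executions)
```

In this sketch every call to `consume` runs the underlying computation again, which is why calling `__dataframe__` in a tight loop would be expensive unless the result is cached by the caller.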

ogrisel · 2023-06-02T12:10:08Z

Indeed, you are right. I assumed that restricting to in-memory tables would avoid accidental memory exhaustion, but this is not guaranteed for all expressions (some expressions might yield bigger results than the source data).

A generic implementation based on pyarrow or pandas is probably the way to go, depending on which library has the most complete coverage of the spec. Note that the spec is not yet finalized and all existing implementations are still lacking as far as I know.

ogrisel added the feature Features or general enhancements label May 31, 2023

cpcloud added the ecosystem External projects or activities label Jun 2, 2023

ogrisel changed the title ~~feat: __dataframe__ protocol support for memory backed tables~~ feat: __dataframe__ protocol support Jun 2, 2023

cpcloud mentioned this issue Jun 16, 2023

feat(api): add __dataframe__ implementation #6464

Merged

gforsyth closed this as completed in #6464 Jun 16, 2023
