feat: __dataframe__ protocol support #6343

Closed
1 task done
ogrisel opened this issue May 31, 2023 · 2 comments · Fixed by #6464
Labels
ecosystem External projects or activities feature Features or general enhancements

Comments

@ogrisel
Contributor

ogrisel commented May 31, 2023

Is your feature request related to a problem?

scikit-learn and other libraries (e.g. VegaFusion) are working on adding support for the standard Python dataframe interchange protocol, in order to be able to consume data from any dataframe library without first converting to pandas, see e.g.:

Describe the solution you'd like

Expose the standard __dataframe__ method on any memory-backed Table (e.g. those created by ibis.memtable).

This feature could start with the DuckDB backend and later be generalized to other backends with in-memory data (e.g. pandas, polars, ...).

Note: the dataframe protocol was mentioned in this discussion:

But I figured it would be worth opening a dedicated feature request specifically for this need.

What version of ibis are you running?

5.1.0 (master branch)

What backend(s) are you using, if any?

DuckDB

Code of Conduct

  • I agree to follow this project's Code of Conduct
@ogrisel ogrisel added the feature Features or general enhancements label May 31, 2023
@cpcloud
Member

cpcloud commented Jun 2, 2023

@ogrisel Thanks for the issue!

Is there any reason not to support the dataframe protocol on every table expression?

It's not clear based on the linked issue or this one how implementing __dataframe__ only for in-memory tables would be useful, since you wouldn't be able to rely on the protocol being implemented for an arbitrary expression, resulting in all sorts of annoying checks everywhere in downstream code.
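To illustrate the kind of check downstream code would need if only some tables implemented the protocol, here is a minimal sketch. `InMemoryTable`, `LazyTable`, and `consume` are hypothetical stand-ins made up for this example; they are not ibis or scikit-learn APIs.

```python
class InMemoryTable:
    """Hypothetical table that implements the interchange protocol."""

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        # A real implementation would return an interchange object;
        # a plain string keeps the sketch self-contained.
        return "interchange object"


class LazyTable:
    """Hypothetical table expression without __dataframe__."""

    def execute(self):
        return "materialized result"


def consume(table):
    # Downstream code has to branch on whether the protocol is available.
    if hasattr(table, "__dataframe__"):
        return table.__dataframe__()
    # Fallback path: library-specific materialization.
    return table.execute()


print(consume(InMemoryTable()))
print(consume(LazyTable()))
```

If every table expression implemented `__dataframe__`, the `hasattr` branch and the fallback would not be needed.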

That said, is there any expectation about the performance characteristics of the dataframe interchange protocol?

Would it be a problem for our implementation of the protocol to trigger arbitrary computation? A user would have to trigger computation regardless to get a DataFrame, pyarrow Table, etc. out.

There are also a bunch of things in the protocol that simply do not make sense for ibis, like buffers and memory layout.

I'd suggest that we implement __dataframe__ like this to start:

class Table(Expr):
    ...  # existing APIs

    def __dataframe__(self):
        return self.to_pyarrow().__dataframe__()

and I guess hope that no one is calling expr.__dataframe__() in a tight loop!
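The tight-loop concern can be made concrete with a small sketch. It assumes nothing about ibis internals: `FakeExpr` and `FakeArrowTable` are hypothetical stand-ins for a table expression and a materialized pyarrow Table, and the `executions` counter exists only to show that each `__dataframe__()` call re-runs the computation. The `nan_as_null` and `allow_copy` parameters follow the method signature given in the interchange protocol spec.

```python
class FakeArrowTable:
    """Hypothetical stand-in for a materialized pyarrow Table."""

    def __init__(self, rows):
        self.rows = rows

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        # Placeholder for a real interchange object.
        return {"num_rows": len(self.rows), "allow_copy": allow_copy}


class FakeExpr:
    """Hypothetical lazy expression that computes on to_pyarrow()."""

    def __init__(self, rows):
        self._rows = rows
        self.executions = 0  # counts how often the query was run

    def to_pyarrow(self):
        self.executions += 1  # stands in for running the backend query
        return FakeArrowTable(list(self._rows))

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        # Delegate to the materialized table, as in the suggestion above.
        return self.to_pyarrow().__dataframe__(
            nan_as_null=nan_as_null, allow_copy=allow_copy
        )


expr = FakeExpr([(1, "a"), (2, "b")])
for _ in range(3):
    expr.__dataframe__()
print(expr.executions)  # 3: one backend execution per call
```

A consumer that needs the data more than once would probably want to hold on to the returned interchange object rather than call `__dataframe__()` repeatedly.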

@cpcloud cpcloud added the ecosystem External projects or activities label Jun 2, 2023
@ogrisel
Contributor Author

ogrisel commented Jun 2, 2023

Indeed, you are right. I assumed that restricting to in-memory tables would avoid accidental memory exhaustion, but this is not guaranteed for all expressions (some expressions might yield bigger results than the source data).
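As a small illustration of an expression whose result is larger than its inputs, consider a cross join. This sketch uses plain Python lists rather than ibis tables, so the names `left`, `right`, and `cross` are purely illustrative.

```python
from itertools import product

left = [(i,) for i in range(3)]   # 3 source rows
right = [(j,) for j in range(4)]  # 4 source rows

# A cross join pairs every left row with every right row.
cross = [l + r for l, r in product(left, right)]

print(len(left) + len(right))  # 7 source rows in total
print(len(cross))              # 12 result rows
```

The result grows with the product of the input sizes, so even small in-memory sources can produce a result that does not fit in memory.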

A generic implementation based on pyarrow or pandas is probably the way to go, depending on which library has the most complete coverage of the spec. Note that the spec is not yet finalized and all existing implementations are still lacking as far as I know.

@ogrisel ogrisel changed the title from "feat: __dataframe__ protocol support for memory backed tables" to "feat: __dataframe__ protocol support" Jun 2, 2023