feat(python): Implement Arrow PyCapsule Interface for Series/DataFrame export #17676
Conversation
@@ -19,6 +19,8 @@ impl Drop for ArrowArrayStream {
    }
}

unsafe impl Send for ArrowArrayStream {}
Arrow-rs also implements this: https://github.com/apache/arrow-rs/blob/6d4e2f2ceaf423031b0bc72f54c547dd77a0ddbb/arrow-array/src/ffi_stream.rs#L100

Have you seen #14208?
Codecov report: all modified and coverable lines are covered by tests ✅

@@ Coverage Diff @@
##             main   #17676      +/-   ##
==========================================
+ Coverage   80.47%   80.50%   +0.03%
==========================================
  Files        1503     1503
  Lines      197115   197100      -15
  Branches     2794     2804      +10
==========================================
+ Hits       158628   158684      +56
+ Misses      37973    37896      -77
- Partials      514      520       +6
I ended up vendoring that code as part of this PR. I added a test for DataFrame export as well, so this should be good to review.
I am wondering if that should be added to polars-core or somewhere else instead of py-polars (i.e. the code must be copied into each downstream project like py-polars or r-polars unless it is included in the polars crate).

I'm not familiar with the polars internals, but I'm pretty sure that
df = pl.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})
out = pa.table(PyCapsuleStreamHolder(df.__arrow_c_stream__(None)))
assert df.shape == out.shape
assert df.schema.names() == out.schema.names
You could drop `df` just now and make sure that the recreated `df2` below still gets the expected contents (instead of crashing or whatever else).
I updated the test to not hold a bare capsule, but rather call the underlying object's `__arrow_c_stream__` method. I'm not sure what you're suggesting for this test, since I need to check below that `df` and `df2` are equal. Are you suggesting that after that I should drop `df` again? That isn't possible when this utility class doesn't hold bare capsules.
a = pl.Series("a", [1, 2, 3, None])
out = pa.chunked_array(PyCapsuleSeriesHolder(a.__arrow_c_stream__(None)))
out_arr = out.combine_chunks()
Same idea here (drop `a` before doing things with `out`).
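For intuition, this suggestion checks that the exported stream owns its data independently of the producer object. A plain-Python stand-in can sketch the ownership pattern (`FakeStreamCapsule` is hypothetical and not part of polars or pyarrow; a real Arrow stream capsule holds the exported buffers the same way):

```python
class FakeStreamCapsule:
    # Hypothetical stand-in for an Arrow C stream capsule: it keeps a
    # strong reference to the exported data, so the data stays alive
    # even after the producer-side name is dropped.
    def __init__(self, data):
        self._data = data

    def read_all(self):
        return list(self._data)


values = [1, 2, 3, None]
capsule = FakeStreamCapsule(values)
del values  # drop the producer, as the review suggests
assert capsule.read_all() == [1, 2, 3, None]
```

If the capsule did not own the data, the read after `del` is exactly where a use-after-free or wrong contents would show up.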
from typing import Any


class PyCapsuleStreamHolder:
This is put in a helper file because it's used by tests both in this PR and in https://github.com/pola-rs/polars/pull/17693/files. Let me know if there's a better place to put this test helper.
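A minimal sketch of such a holder (hypothetical; the actual helper in the PR may differ): it wraps a capsule and exposes nothing but the `__arrow_c_stream__` protocol method, so a consumer like `pa.table()` is forced through the PyCapsule Interface rather than any other export path.

```python
class PyCapsuleStreamHolder:
    """Hold an Arrow C stream PyCapsule and expose only the
    __arrow_c_stream__ protocol method (test-helper sketch)."""

    def __init__(self, capsule: object) -> None:
        self._capsule = capsule

    def __arrow_c_stream__(self, requested_schema: object = None) -> object:
        # A real stream capsule may only be consumed once; this holder
        # simply hands it to whoever asks.
        return self._capsule
```

`pa.table(PyCapsuleStreamHolder(df.__arrow_c_stream__(None)))` then exercises only the protocol path.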
Thank you Kyle, I've left some comments.
py-polars/src/interop/arrow/to_py.rs
    series: &'py Series,
    py: Python<'py>,
) -> PyResult<Bound<'py, PyCapsule>> {
    let field = series.field().to_arrow(CompatLevel::oldest());
I do think this should be `newest`, otherwise we trigger a copy, whereas the consumer should decide if they want to cast to a datatype they can support.
Why is `requested_schema` not used? I think it, rather than `CompatLevel`, should decide what schema is used (e.g. `LargeString` or `Utf8View`). In the future, imo it can replace `CompatLevel`.
> Why is `requested_schema` not used?

Does the protocol allow for this?
> Why is `requested_schema` not used? Does the protocol allow for this?

https://arrow.apache.org/docs/dev/format/CDataInterface/PyCapsuleInterface.html#schema-requests
Right, then I agree `requested_schema` should be respected, and if none is given we can default to `newest`.
There's been discussion about this in apache/arrow#39689. To be able to pass in a `requested_schema` argument, the consumer needs to know the schema of the producer's existing Arrow data. Only then can it know whether it needs to ask the producer to cast to a different type.

I believe I summarized the consensus in apache/arrow#39689 (comment), but while waiting for confirmation, I think it would be best for us to leave `requested_schema` and schema negotiation to a follow-up PR, if that's ok.
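The negotiation being discussed can be modeled as a tiny decision function (names here are illustrative, not polars API): honor `requested_schema` when the consumer supplies one, otherwise export with the producer's newest native types.

```python
def choose_export_schema(native_schema, requested_schema=None):
    """Toy model of PyCapsule schema negotiation: the consumer's
    request wins; with no request, the producer defaults to its
    newest (zero-copy) representation. Schema names are illustrative."""
    if requested_schema is not None:
        return requested_schema
    return native_schema


assert choose_export_schema("utf8_view") == "utf8_view"
assert choose_export_schema("utf8_view", requested_schema="large_utf8") == "large_utf8"
```

The hard part noted above is that the consumer cannot form a sensible request without first seeing the producer's schema, which is why this is deferred.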
FWIW, I'm curious about whether it's possible to implement Series/DataFrame importing from PyCapsule. And if it is possible, can we migrate the current FFI interfaces (
What would be the benefit of that? (I'm in the camp of "if it ain't broke, don't fix it". ;) )
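For context, consumer-side support for the protocol is just a duck-typed check, which is what would let an importer prefer capsules and only fall back to older FFI entry points when needed (sketch; `supports_arrow_stream` is a hypothetical helper, not a polars function):

```python
def supports_arrow_stream(obj) -> bool:
    # Any object exposing __arrow_c_stream__ can be consumed via the
    # Arrow PyCapsule Interface, regardless of which library made it.
    return hasattr(obj, "__arrow_c_stream__")


class Exporter:
    # Minimal producer for illustration; a real implementation returns
    # a PyCapsule named "arrow_array_stream".
    def __arrow_c_stream__(self, requested_schema=None):
        return object()


assert supports_arrow_stream(Exporter())
assert not supports_arrow_stream(42)
```

This is the library-agnostic dispatch that a dedicated per-library FFI entry point cannot offer.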
Alright. Thanks a lot @kylebarron. Once all is in, can you follow up with an update to the user guide? We have a section on Arrow C interop, which should expose the capsule method as well.
Can you point me to where this is? Do you mean this paragraph? https://docs.pola.rs/user-guide/ecosystem/#apache-arrow

It isn't released yet, but it is this page: https://github.com/pola-rs/polars/blob/main/docs/user-guide/misc/arrow.md
I see those APIs from #17696 were just added, but I'd personally argue to deprecate them. The PyCapsule Interface should be a strict improvement over those APIs.

Regardless, I'll make a docs PR to add to that page.
Progress towards #12530.

I added one minimal test for the Series export and it appears to work. I added a test for DataFrame stream export and it works as well. You can pass a `polars.DataFrame` to `pa.table()` and it'll just work.