feat(python): Implement Arrow PyCapsule Interface for Series/DataFrame export #17676
Conversation
@@ -19,6 +19,8 @@ impl Drop for ArrowArrayStream {
    }
}

unsafe impl Send for ArrowArrayStream {}
Arrow-rs also implements this: https://github.com/apache/arrow-rs/blob/6d4e2f2ceaf423031b0bc72f54c547dd77a0ddbb/arrow-array/src/ffi_stream.rs#L100

Have you seen #14208?
Codecov report: all modified and coverable lines are covered by tests ✅

@@ Coverage Diff @@
##             main   #17676      +/-   ##
==========================================
+ Coverage   80.47%   80.50%   +0.03%
==========================================
  Files        1503     1503
  Lines      197115   197100      -15
  Branches     2794     2804      +10
==========================================
+ Hits       158628   158684      +56
+ Misses      37973    37896      -77
- Partials      514      520       +6
I ended up vendoring that code as part of this PR. I added a test for DataFrame export as well, so this should be good to review.
I am wondering if that should be added to polars-core or somewhere else instead of py-polars (i.e. the code must be copied into each downstream project like py-polars or r-polars unless it is included in the polars crate).

I'm not familiar with the polars internals, but I'm pretty sure that
df = pl.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})
out = pa.table(PyCapsuleStreamHolder(df.__arrow_c_stream__(None)))
assert df.shape == out.shape
assert df.schema.names() == out.schema.names
You could drop `df` just now and make sure that the recreated `df2` below still gets the expected contents (instead of crashing or whatever else).
I updated the test to not hold a bare capsule, but rather call the underlying object's `__arrow_c_stream__` method. I'm not sure what you're suggesting for this test, since I need to check below that `df` and `df2` are equal. Are you suggesting that after that I should drop `df` again? That isn't possible when this utility class doesn't hold bare capsules.
a = pl.Series("a", [1, 2, 3, None])
out = pa.chunked_array(PyCapsuleSeriesHolder(a.__arrow_c_stream__(None)))
out_arr = out.combine_chunks()
Same idea here (drop `a` before doing things with `out`).
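For intuition, this suggestion checks that the exported stream owns its data independently of the producer object. A plain-Python stand-in can sketch the ownership pattern (`FakeStreamCapsule` is hypothetical and not part of polars or pyarrow; a real Arrow stream capsule holds the exported buffers the same way):

```python
class FakeStreamCapsule:
    # Hypothetical stand-in for an Arrow C stream capsule: it keeps a
    # strong reference to the exported data, so the data stays alive
    # even after the producer-side name is dropped.
    def __init__(self, data):
        self._data = data

    def read_all(self):
        return list(self._data)


values = [1, 2, 3, None]
capsule = FakeStreamCapsule(values)
del values  # drop the producer, as the review suggests
assert capsule.read_all() == [1, 2, 3, None]
```

If the capsule did not own the data, the read after `del` is exactly where a use-after-free or wrong contents would show up.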
from typing import Any


class PyCapsuleStreamHolder:
This is put in a helper file because it's used by tests both in this PR and in https://github.com/pola-rs/polars/pull/17693/files. Let me know if there's a better place to put this test helper.
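A minimal sketch of such a holder (hypothetical; the actual helper in the PR may differ): it wraps a capsule and exposes nothing but the `__arrow_c_stream__` protocol method, so a consumer like `pa.table()` is forced through the PyCapsule Interface rather than any other export path.

```python
class PyCapsuleStreamHolder:
    """Hold an Arrow C stream PyCapsule and expose only the
    __arrow_c_stream__ protocol method (test-helper sketch)."""

    def __init__(self, capsule: object) -> None:
        self._capsule = capsule

    def __arrow_c_stream__(self, requested_schema: object = None) -> object:
        # A real stream capsule may only be consumed once; this holder
        # simply hands it to whoever asks.
        return self._capsule
```

`pa.table(PyCapsuleStreamHolder(df.__arrow_c_stream__(None)))` then exercises only the protocol path.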
Thank you Kyle, I've left some comments.
py-polars/src/interop/arrow/to_py.rs
    series: &'py Series,
    py: Python<'py>,
) -> PyResult<Bound<'py, PyCapsule>> {
    let field = series.field().to_arrow(CompatLevel::oldest());
I do think this should be `newest`, otherwise we trigger a copy, whereas the consumer should decide if they want to cast to a datatype they can support.
Why is `requested_schema` not used? I think it, rather than `CompatLevel`, should decide what schema is used (e.g. `LargeString` or `Utf8View`). In the future, imo it can replace `CompatLevel`.
> Why is `requested_schema` not used?

Does the protocol allow for this?
> Why is `requested_schema` not used? Does the protocol allow for this?

https://arrow.apache.org/docs/dev/format/CDataInterface/PyCapsuleInterface.html#schema-requests
Right, then I agree `requested_schema` should be respected, and if none is given we can default to `newest`.
There's been discussion about this in apache/arrow#39689. To be able to pass in a `requested_schema` argument, the consumer needs to know the schema of the producer's existing Arrow data. Only then can it know whether it needs to ask the producer to cast to a different type.

I believe I summarized the consensus in apache/arrow#39689 (comment), but while waiting for confirmation, I think it would be best for us to leave `requested_schema` and schema negotiation to a follow-up PR, if that's ok.
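The negotiation being discussed can be modeled as a tiny decision function (names here are illustrative, not polars API): honor `requested_schema` when the consumer supplies one, otherwise export with the producer's newest native types.

```python
def choose_export_schema(native_schema, requested_schema=None):
    """Toy model of PyCapsule schema negotiation: the consumer's
    request wins; with no request, the producer defaults to its
    newest (zero-copy) representation. Schema names are illustrative."""
    if requested_schema is not None:
        return requested_schema
    return native_schema


assert choose_export_schema("utf8_view") == "utf8_view"
assert choose_export_schema("utf8_view", requested_schema="large_utf8") == "large_utf8"
```

The hard part noted above is that the consumer cannot form a sensible request without first seeing the producer's schema, which is why this is deferred.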
FWIW, I'm curious about whether it's possible to implement Series/DataFrame importing from PyCapsule. And if it is possible, can we migrate the current FFI interfaces (
What would be the benefit of that? (I'm in the camp of "if it ain't broke, don't fix it". ;) )
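For context, consumer-side support for the protocol is just a duck-typed check, which is what would let an importer prefer capsules and only fall back to older FFI entry points when needed (sketch; `supports_arrow_stream` is a hypothetical helper, not a polars function):

```python
def supports_arrow_stream(obj) -> bool:
    # Any object exposing __arrow_c_stream__ can be consumed via the
    # Arrow PyCapsule Interface, regardless of which library made it.
    return hasattr(obj, "__arrow_c_stream__")


class Exporter:
    # Minimal producer for illustration; a real implementation returns
    # a PyCapsule named "arrow_array_stream".
    def __arrow_c_stream__(self, requested_schema=None):
        return object()


assert supports_arrow_stream(Exporter())
assert not supports_arrow_stream(42)
```

This is the library-agnostic dispatch that a dedicated per-library FFI entry point cannot offer.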
Alright. Thanks a lot @kylebarron. Once all is in, can you follow up with an update to the user guide? We have a section on Arrow C interop, which should expose the capsule method as well.
Can you point me to where this is? Do you mean this paragraph? https://docs.pola.rs/user-guide/ecosystem/#apache-arrow

It isn't released yet, but it is this page: https://github.com/pola-rs/polars/blob/main/docs/user-guide/misc/arrow.md
I see those APIs from #17696 were just added, but I'd personally argue to deprecate them. The PyCapsule Interface should be a strict improvement over those APIs.

Regardless, I'll make a docs PR to add to that page.
Progress towards #12530.

I added one minimal test for the Series export and it appears to work. I added a test for DataFrame stream export and it works as well. You can pass a `polars.DataFrame` to `pa.table()` and it'll just work.