feat(python): DataFrame interchange protocol implementation #5662
This is a great initiative and really awesome if we can make this work. <3
Does it return a
Should we see a I can do the Rust side of these things and expose them to Python. So just put me to work here. :)
Yes, basically we'd want an iterable over the chunks of a Series / DataFrame, where each chunk is a separate Series / DataFrame. I tried to explain this better in #5670.
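To make the request concrete, here is a minimal sketch of such a chunk iterable. `ToyChunkedSeries` and its methods are hypothetical names chosen for illustration; they are not Polars API, and plain Python lists stand in for Arrow memory chunks.

```python
from typing import Iterator, List, Sequence


class ToyChunkedSeries:
    """Hypothetical stand-in for a chunked Series; not part of the Polars API."""

    def __init__(self, name: str, chunks: Sequence[Sequence[int]]) -> None:
        self.name = name
        # Each inner list plays the role of one contiguous memory chunk.
        self._chunks: List[List[int]] = [list(c) for c in chunks]

    def num_chunks(self) -> int:
        return len(self._chunks)

    def get_chunks(self) -> Iterator["ToyChunkedSeries"]:
        # Yield every chunk as its own single-chunk series.
        for chunk in self._chunks:
            yield ToyChunkedSeries(self.name, [chunk])

    def to_list(self) -> List[int]:
        # Flatten all chunks into one list of values.
        return [value for chunk in self._chunks for value in chunk]


series = ToyChunkedSeries("a", [[1, 2, 3], [4, 5]])
chunk_values = [chunk.to_list() for chunk in series.get_chunks()]
```

Each element produced by `get_chunks()` is itself a series with exactly one chunk, which is the shape the interchange protocol expects from its chunk iterators.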
I'm not sure here; I think it's an issue with the protocol itself, not any shortcomings of Polars. I'll probably make an issue in their repo to try and clarify.
@ritchie46
If we use bit/bytemasks, we'd need a way to access these.
We use bitmasks for everything. We can access them by calling
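For readers unfamiliar with bitmasks, the following is a rough sketch of how an Arrow-style validity bitmap can be read once its bytes are available. It assumes the Arrow convention (bits packed least-significant-bit first, a set bit meaning "valid"); the helper names `is_valid` and `null_count` are made up for this example and say nothing about how Polars exposes the mask.

```python
def is_valid(bitmap: bytes, index: int, offset: int = 0) -> bool:
    """Return True if element `index` is non-null according to `bitmap`."""
    bit = index + offset
    # Arrow packs validity bits least-significant-bit first within each byte.
    return (bitmap[bit // 8] >> (bit % 8)) & 1 == 1


def null_count(bitmap: bytes, length: int, offset: int = 0) -> int:
    """Count the nulls among `length` elements starting at `offset`."""
    return sum(not is_valid(bitmap, i, offset) for i in range(length))


# Five values [1, None, 3, 4, None]: bits 0, 2 and 3 are set.
validity = bytes([0b00001101])
flags = [is_valid(validity, i) for i in range(5)]
```

The `offset` argument matters for sliced columns, where the first logical element does not start at bit 0 of the mask.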
@ritchie46 There's just some stuff left for the 'buffer' class:
Is there already functionality available for these?
For your interest, we just added support for the protocol in pyarrow as well (apache/arrow#14804). It could actually also be an option to use pyarrow internally in polars for this, since the polars <-> pyarrow memory conversion should be cheap. I know pyarrow is not a required dependency of polars, so it's certainly not ideal, but I still wanted to mention it as something that could be considered: it avoids having to maintain two implementations that basically implement the protocol to/from arrow-compatible memory. It could be an interesting idea to factor it out into an independent small package without a dependency on pyarrow, if that's a blocker.
Thanks for the input! I'll definitely check that out; it might indeed be a better method going forward, since I expect the protocol to change quite heavily in the future. I don't think the pyarrow dependency is a big issue here.
Has there been any thinking about implementing the DataFrame interchange protocol for LazyFrames? It would be really cool to abstract away the difference between DataFrame and LazyFrame, and just have Polars do
I agree. Let's at least use it to fill in the blanks.
I don't think a query plan/promise/computation graph should be treated as a table.
Officially, we can't rely on Otherwise, I think it's worth finishing what I started, since I'm almost there.
Closing in favor of #6581
Resolves #2249
Basically I copied Pandas' homework and am adjusting it for the Polars project. Note that some parts of the protocol are still unclear and Pandas has not fully implemented everything yet.
I expect to finish this over the next 3 weeks or so. Some parts seem to require Rust functionality that is currently NOT exposed to Python, so we should do those in separate PRs.
My knowledge of chunks and buffers is limited, so I'll definitely need some help to finish this!
Progress
DataFrame
__dataframe__
metadata
num_columns
num_rows
num_chunks
column_names
get_column
get_column_by_name
get_columns
select_columns
select_columns_by_name
get_chunks
Column
size
offset
dtype
describe_categorical - needs Implement Series.cat.categories and Series.cat.ordered #5671
describe_null
null_count
metadata
num_chunks
get_chunks
get_buffers
Buffer
bufsize
ptr - needs Access memory address of start of chunk #6320
__dlpack__
__dlpack_device__
from_dataframe -> Out of scope for this PR
Blockers:
Series.cat.categories and Series.cat.ordered #5671
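As an illustration of the Buffer items in the checklist above, here is a minimal sketch of what `bufsize`, `ptr` and `__dlpack_device__` are meant to convey. It is not the implementation from this PR: `ToyBuffer` is a hypothetical class, and the standard-library `array` module stands in for Arrow memory so that the example needs no third-party packages.

```python
import array
import ctypes
from typing import Optional, Tuple


class ToyBuffer:
    """Hypothetical buffer exposing the attributes from the checklist above."""

    def __init__(self, data: array.array) -> None:
        # Keep a reference so the memory stays alive while the pointer is in use.
        self._data = data

    @property
    def bufsize(self) -> int:
        # Buffer size in bytes.
        return len(self._data) * self._data.itemsize

    @property
    def ptr(self) -> int:
        # Memory address of the first byte of the buffer.
        address, _length = self._data.buffer_info()
        return address

    def __dlpack_device__(self) -> Tuple[int, Optional[int]]:
        # 1 is the DLPack device type for CPU; the device id is None here.
        return (1, None)


buffer = ToyBuffer(array.array("q", [10, 20, 30]))
# A consumer can rebuild the values from nothing but the address and byte size.
raw = ctypes.string_at(buffer.ptr, buffer.bufsize)
roundtrip = array.array("q")
roundtrip.frombytes(raw)
```

The round trip shows why `ptr` depends on being able to read the start address of a chunk: a consumer receives only the address and the byte length, and interprets the bytes using the column's dtype.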