feat(python): DataFrame interchange protocol implementation #5662

Closed
wants to merge 10 commits into from


stinodego
Member

@stinodego stinodego commented Nov 29, 2022

Resolves #2249

Basically I copied Pandas' homework and am adjusting it for the Polars project. Note that some parts of the protocol are still unclear and Pandas has not fully implemented everything yet.

I expect to finish this over the next 3 weeks or so. Some parts seem to require Rust functionality that is not yet exposed to Python, so we should handle those in separate PRs.

My knowledge of chunks and buffers is limited, so I'll definitely need some help to finish this!

Progress

DataFrame

  • __dataframe__
  • metadata
  • num_columns
  • num_rows
  • num_chunks
  • column_names
  • get_column
  • get_column_by_name
  • get_columns
  • select_columns
  • select_columns_by_name
  • get_chunks
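As a rough illustration of the DataFrame-level methods in the checklist above, here is a minimal sketch in plain Python. `SketchDataFrame` is a hypothetical stand-in, not the class in this PR, and it returns plain lists where the real protocol returns `Column` objects.

```python
from typing import Dict, Iterable, List, Sequence


class SketchDataFrame:
    """Hypothetical stand-in for an interchange object; not Polars' actual class."""

    def __init__(self, columns: Dict[str, List[object]]) -> None:
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self._columns = dict(columns)

    def num_columns(self) -> int:
        return len(self._columns)

    def num_rows(self) -> int:
        # An empty frame has zero rows.
        return len(next(iter(self._columns.values()), []))

    def column_names(self) -> Iterable[str]:
        return list(self._columns)

    def get_column(self, i: int) -> List[object]:
        # Assumption: a plain list stands in for a protocol Column object.
        return self._columns[list(self._columns)[i]]

    def get_column_by_name(self, name: str) -> List[object]:
        return self._columns[name]

    def select_columns(self, indices: Sequence[int]) -> "SketchDataFrame":
        names = list(self._columns)
        return SketchDataFrame({names[i]: self._columns[names[i]] for i in indices})

    def select_columns_by_name(self, names: Sequence[str]) -> "SketchDataFrame":
        return SketchDataFrame({n: self._columns[n] for n in names})


df = SketchDataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
subset = df.select_columns_by_name(["b"])
```

Selecting columns returns a new object of the same kind, so a consumer can narrow the frame before touching any buffers.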

Column

Buffer

from_dataframe -> Out of scope for this PR

Blockers:

@github-actions bot added the enhancement (New feature or an improvement of an existing feature) and python (Related to Python Polars) labels Nov 29, 2022
@ritchie46
Member

This is a great initiative and really awesome if we can make this work. <3

get_chunks
There doesn't seem to be any functionality on the Python side for iterating over chunks / getting chunk information. Does this exist on the Rust side?

Does it return a List<Series>/List<DF>? A Series can consist of several chunks, so we could transpose the several chunks in a single Series to a List<Series>. This is something we already do with DataFrame when we call flatten_df on the Rust side.

The protocol definition is unclear here. It asks for the size of the 'current' chunk, but there does not seem to be a way to specify what the current chunk is.

Should we see a Series as a chunk?

I can do the Rust side of these things and expose them to Python. So just put me to work here. :)

@stinodego
Member Author

get_chunks

Does it return a List<Series>/List<DF>? A Series can consist of several chunks, so we could transpose the several chunks in a single Series to a List<Series>. This is something we already do with DataFrame when we call flatten_df on the Rust side.

Yes, basically we'd want an iterable over the chunks of a Series / DataFrame, where each chunk is a separate Series / DataFrame. I tried to explain this better in #5670. flatten_df indeed looks similar!
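The "iterable over chunks, where each chunk is itself a Series" idea can be sketched in plain Python. `ChunkedColumn` is a hypothetical class used only for illustration; the `get_chunks(n_chunks=None)` signature follows the protocol, but re-chunking into a multiple of the existing chunk count is deliberately left out.

```python
from typing import Iterator, List, Optional


class ChunkedColumn:
    """Hypothetical column made of one or more contiguous chunks."""

    def __init__(self, chunks: List[List[int]]) -> None:
        self._chunks = chunks

    def num_chunks(self) -> int:
        return len(self._chunks)

    def size(self) -> int:
        # Total number of elements across all chunks.
        return sum(len(chunk) for chunk in self._chunks)

    def get_chunks(self, n_chunks: Optional[int] = None) -> Iterator["ChunkedColumn"]:
        # Assumption: only the existing chunking is supported in this sketch.
        if n_chunks is not None and n_chunks != self.num_chunks():
            raise NotImplementedError("re-chunking is left out of this sketch")
        for chunk in self._chunks:
            # Each chunk is re-wrapped as a single-chunk column of the same type.
            yield ChunkedColumn([chunk])


col = ChunkedColumn([[1, 2], [3], [4, 5, 6]])
chunk_sizes = [chunk.size() for chunk in col.get_chunks()]
```

Because every yielded object is a single-chunk column, calling `size()` on it gives an unambiguous per-chunk size.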

The protocol definition is unclear here. It asks for the size of the 'current' chunk, but there does not seem to be a way to specify what the current chunk is.

Should we see a Series as a chunk?

I'm not sure here; I think it's an issue with the protocol itself, not any shortcomings of Polars. I'll probably make an issue in their repo to try and clarify.

@stinodego
Member Author

stinodego commented Dec 1, 2022

@ritchie46
For Column.describe_null, I need to know how null values are handled for various dtypes. Do we use a bitmask for all types? The options are:

    NON_NULLABLE
        Non-nullable column.
    USE_NAN
        Use explicit float NaN value.
    USE_SENTINEL 
        Sentinel value besides NaN.
    USE_BITMASK 
        The bit is set/unset representing a null on a certain position.
    USE_BYTEMASK
        The byte is set/unset representing a null on a certain position.

If we use bit/bytemasks, we'd need a way to access these.
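For reference, the options above map to an integer enum in the protocol specification, and `describe_null` returns a `(kind, value)` pair. The sketch below assumes an Arrow-style validity bitmap, where a set bit means "valid" and therefore a `0` bit marks a null; the function name is hypothetical.

```python
import enum
from typing import Tuple


class ColumnNullType(enum.IntEnum):
    """Null representations, with the integer values used by the protocol spec."""

    NON_NULLABLE = 0
    USE_NAN = 1
    USE_SENTINEL = 2
    USE_BITMASK = 3
    USE_BYTEMASK = 4


def describe_null_for_validity_bitmap() -> Tuple[ColumnNullType, int]:
    # Assumption: Arrow-style validity bitmap (1 = valid), so the bit value
    # that indicates a null is 0.
    return ColumnNullType.USE_BITMASK, 0


kind, null_bit = describe_null_for_validity_bitmap()
```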

@ritchie46
Member

We use bitmasks for everything. We can access them by calling is_not_null on a Series or an Expression.
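To make the bitmask idea concrete, here is a small sketch of packing per-element validity into a bitmap. It assumes the Arrow convention (least-significant bit first, 1 = valid) and uses hypothetical helper names; it does not show how Polars stores or exposes its bitmaps.

```python
from typing import Optional, Sequence


def pack_validity(values: Sequence[Optional[object]]) -> bytes:
    """Pack 'is not null' flags into a bitmap, least-significant bit first."""
    bitmap = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def is_valid(bitmap: bytes, i: int) -> bool:
    """Return True if element i is marked valid (bit set) in the bitmap."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))


values = [1, None, 3, 4, None, None, 7, 8, 9]
bitmap = pack_validity(values)
```

Nine elements need two bytes; the unused high bits of the last byte stay zero.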

@stinodego
Member Author

stinodego commented Dec 2, 2022

@ritchie46 There's just some stuff left for the 'buffer' class:

  • Is it possible to get a pointer (integer) to the memory address where the underlying data for a Series starts?
  • We also need to get the size of the data, which is easy enough (number of elements * bits per element). Only string data is more complicated. Do we have something for this already, maybe? (Series.estimated_size is not useful as we're not interested in validity and other metadata). -> Found other solution
  • Also, for string data, we'd need to be able to find the offsets of each element. -> Found other solution

Is there already functionality available for these?
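The three needs above (a start address, a byte size, and string offsets) can be illustrated with the standard library alone. This sketch assumes an Arrow-like string layout (one contiguous UTF-8 data buffer plus 64-bit offsets) and uses `array.array.buffer_info()` to read an address; the helper names are hypothetical and this is not the solution the PR ended up using. Note that a byte size for fixed-width data is the bit width divided by eight, times the element count.

```python
import array
from typing import List, Tuple


def fixed_width_bufsize(n_elements: int, bit_width: int) -> int:
    """Size in bytes of a fixed-width data buffer (e.g. bit_width=64 for int64)."""
    return n_elements * bit_width // 8


def encode_strings(values: List[str]) -> Tuple[array.array, array.array]:
    """Build a contiguous UTF-8 data buffer and an offsets buffer with len(values) + 1 entries."""
    data = array.array("B")          # unsigned bytes
    offsets = array.array("q", [0])  # signed 64-bit offsets
    for value in values:
        data.frombytes(value.encode("utf-8"))
        offsets.append(len(data))
    return data, offsets


def pointer_and_size(buf: array.array) -> Tuple[int, int]:
    """Return (memory address, size in bytes); valid only while buf is alive and unchanged."""
    address, length = buf.buffer_info()
    return address, length * buf.itemsize


data, offsets = encode_strings(["ab", "", "héllo"])
data_ptr, data_size = pointer_and_size(data)
offsets_ptr, offsets_size = pointer_and_size(offsets)

# Element i spans data[offsets[i]:offsets[i + 1]].
third = data[offsets[2]:offsets[3]].tobytes().decode("utf-8")
```

The offsets buffer has one more entry than there are strings, so empty strings show up as two equal consecutive offsets.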

@jorisvandenbossche

For your interest, we just added support for the protocol in pyarrow as well (apache/arrow#14804)


It could also be an option to use pyarrow internally in polars for this, since the polars <-> pyarrow memory conversion should be cheap. I know pyarrow is not a required dependency of polars, so it's certainly not ideal, but I still wanted to mention it as something to consider: it avoids having to maintain two implementations that both essentially implement the protocol to/from Arrow-compatible memory. If the pyarrow dependency is a blocker, it could be an interesting idea to factor this out into a small independent package that does not depend on pyarrow.

@stinodego
Member Author

Thanks for the input! I'll definitely check that out. It might indeed be a better approach going forward, since I expect the protocol to change quite heavily in the future.

I don't think the pyarrow dependency is a big issue here.

@ianmcook

ianmcook commented Jan 26, 2023

Has there been any thinking about implementing the DataFrame interchange protocol for LazyFrames? It would be really cool to abstract away the difference between DataFrame and LazyFrame, and just have Polars do collect() when the LazyFrame needs to be materialized.

@ritchie46
Member

Thanks for the input! I'll definitely check that out. It might indeed be a better approach going forward, since I expect the protocol to change quite heavily in the future.

I don't think the pyarrow dependency is a big issue here.

I agree. Let's at least use it to fill in the blanks.

@ritchie46
Member

Has there been any thinking about implementing the DataFrame interchange protocol for LazyFrames? It would be really cool to abstract away the difference between DataFrame and LazyFrame, and just have Polars do collect() when the LazyFrame needs to be materialized.

I don't think a query plan/promise/computation graph should be treated as a table.

@stinodego
Member Author

Officially, we can't rely on pyarrow because Categoricals are not zero copy. Is there a way we can make that conversion zero copy?

Otherwise, I think it's worth finishing what I started, since I'm almost there.

@stinodego
Member Author

Closing in favor of #6581

@stinodego closed this Jan 30, 2023
@stinodego deleted the dataframe-protocol branch February 22, 2023 18:37
@stinodego added the A-interchange (Area: Python dataframe interchange protocol) label Sep 7, 2023
Successfully merging this pull request may close these issues.

Implement the Array API
4 participants