feat(python): DataFrame interchange protocol implementation #5662

stinodego · 2022-11-29T00:49:53Z

Resolves #2249

Basically I copied Pandas' homework and am adjusting it for the Polars project. Note that some parts of the protocol are still unclear and Pandas has not fully implemented everything yet.

I expect to finish this over the next 3 weeks or so. There seem to be some parts that require some of the Rust API that is currently NOT available in Python to be exposed to the Python side, so we should do those in separate PRs.

My knowledge of chunks and buffers is limited, so I'll definitely need some help to finish this!

Progress

`DataFrame`

`Column`

`Buffer`

bufsize
ptr - needs "Access memory address of start of chunk" (#6320)
__dlpack__
__dlpack_device__

`from_dataframe` -> Out of scope for this PR

Blockers:

ritchie46 · 2022-11-29T08:10:06Z

This is a great initiative and really awesome if we can make this work. <3

> get_chunks
> There doesn't seem to be any functionality on the Python side for iterating over chunks / getting chunk information. Does this exist on the Rust side?

Does it return a List<Series>/List<DF>? A Series can consist of several chunks, so we could transpose the several chunks in a single Series to a List<Series>. This is something we already do with DataFrame when we call flatten_df on the Rust side.

> The protocol definition is unclear here. It asks for the size of the 'current' chunk, but there does not seem to be a way to specify what the current chunk is.

Should we see a Series as a chunk?

I can do the rust side of these things and expose them to python. So just put me to work here. :)

stinodego · 2022-11-30T00:35:11Z

> get_chunks
>
> Does it return a List<Series>/List<DF>? A Series can consist of several chunks, so we could transpose the several chunks in a single Series to a List<Series>. This is something we already do with DataFrame when we call flatten_df on the Rust side.

Yes, basically we'd want an iterable over the chunks of a Series / DataFrame, where each chunk is a separate Series / DataFrame. I tried to explain this better in #5670. flatten_df indeed looks similar!

> The protocol definition is unclear here. It asks for the size of the 'current' chunk, but there does not seem to be a way to specify what the current chunk is.
>
> Should we see a Series as a chunk?

I'm not sure here; I think it's an issue with the protocol itself, not any shortcomings of Polars. I'll probably make an issue in their repo to try and clarify.
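To make the "iterable over the chunks, where each chunk is a separate Series" idea concrete, here is a minimal sketch. `ToyColumn` is a hypothetical stand-in that stores chunks as plain Python lists; it is not a Polars class. The method names `get_chunks`, `num_chunks` and `size` follow my reading of the interchange protocol's `Column` interface.

```python
from typing import Iterator, List, Optional, Sequence


class ToyColumn:
    """Hypothetical stand-in for a chunked Series; not a Polars class."""

    def __init__(self, chunks: Sequence[Sequence[int]]) -> None:
        # Each inner list plays the role of one contiguous chunk.
        self._chunks: List[List[int]] = [list(c) for c in chunks]

    def num_chunks(self) -> int:
        return len(self._chunks)

    def size(self) -> int:
        # Total number of elements across all chunks.
        return sum(len(c) for c in self._chunks)

    def get_chunks(self, n_chunks: Optional[int] = None) -> Iterator["ToyColumn"]:
        """Yield every chunk as its own single-chunk column."""
        if n_chunks is not None and n_chunks != self.num_chunks():
            # The protocol allows subdividing chunks; this sketch leaves that out.
            raise NotImplementedError("re-chunking is not part of this sketch")
        for chunk in self._chunks:
            yield ToyColumn([chunk])


col = ToyColumn([[1, 2, 3], [4, 5]])
chunk_sizes = [c.size() for c in col.get_chunks()]
```

Under this reading, `size()` called on a yielded chunk reports that chunk's length, which is one possible interpretation of the "current chunk" wording.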

stinodego · 2022-12-01T22:43:01Z

@ritchie46
For Column.describe_null, I need to know how null values are handled for various dtypes. Do we use a bitmask for all types? The options are:

    NON_NULLABLE
        Non-nullable column.
    USE_NAN
        Use explicit float NaN value.
    USE_SENTINEL 
        Sentinel value besides NaN.
    USE_BITMASK 
        The bit is set/unset representing a null on a certain position.
    USE_BYTEMASK
        The byte is set/unset representing a null on a certain position.

If we use bit/bytemasks, we'd need a way to access these.
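For reference, the options listed above could be modelled as an enum, and a bitmask-based column would report itself roughly as sketched below. The integer values follow my reading of the protocol specification, and `describe_null_for_bitmask` is a hypothetical helper name, not something from this PR.

```python
import enum
from typing import Tuple


class ColumnNullType(enum.IntEnum):
    """Null representations a column can advertise via describe_null."""

    NON_NULLABLE = 0
    USE_NAN = 1
    USE_SENTINEL = 2
    USE_BITMASK = 3
    USE_BYTEMASK = 4


def describe_null_for_bitmask() -> Tuple[ColumnNullType, int]:
    # The second element is the bit value that marks a null.
    # 0 is assumed here, matching an Arrow-style validity bitmap
    # in which a set bit means "valid".
    return ColumnNullType.USE_BITMASK, 0


kind, null_bit = describe_null_for_bitmask()
```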

ritchie46 · 2022-12-02T08:50:13Z

We use bitmasks for everything. We can access them by calling is_not_null on a Series or an Expression.
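As an illustration of what such a bitmask encodes, the sketch below packs per-element validity into bytes, least-significant bit first, which is the Arrow convention. It uses plain Python values with `None` as the null marker and is an assumption-laden toy, not how Polars builds its masks internally; the boolean result of is_not_null carries the same information one element at a time.

```python
from typing import Optional, Sequence


def pack_validity(values: Sequence[Optional[int]]) -> bytes:
    """Pack validity into a bitmask, least-significant bit first."""
    out = bytearray((len(values) + 7) // 8)  # round up to whole bytes
    for i, v in enumerate(values):
        if v is not None:
            out[i // 8] |= 1 << (i % 8)  # set bit = valid
    return bytes(out)


def is_valid(mask: bytes, i: int) -> bool:
    """Read the validity bit for element i."""
    return bool((mask[i // 8] >> (i % 8)) & 1)


mask = pack_validity([1, None, 3, 4, None, 6, 7, 8, None, 10])
```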

stinodego · 2022-12-02T14:24:37Z

@ritchie46 There's just some stuff left for the 'buffer' class:

Is it possible to get a pointer (integer) to the memory address where the underlying data for a Series starts?
We also need to get the size of the data, which is easy enough (number of elements * bits per element). Only string data is more complicated. Do we have something for this already, maybe? (Series.estimated_size is not useful as we're not interested in validity and other metadata). -> Found other solution
~~Also, for string data, we'd need to be able to find the offsets of each element.~~ -> Found other solution

Is there already functionality available for these?
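To illustrate the two quantities asked about here, the following sketch uses only the standard library: `array.array.buffer_info()` returns the memory address and element count of a fixed-width buffer, and a UTF-8 string column is modelled as one data buffer plus cumulative offsets. The helper names are hypothetical, and none of this touches Polars memory; it only shows the arithmetic.

```python
from array import array
from typing import List, Sequence, Tuple


def fixed_width_buffer(values: Sequence[int]) -> Tuple[int, int, array]:
    """Return (ptr, bufsize in bytes, owner) for signed 64-bit integers."""
    data = array("q", values)  # 'q' = signed 64-bit integer
    ptr, n_items = data.buffer_info()  # (address, number of elements)
    # The owner is returned so the memory stays alive while ptr is in use.
    return ptr, n_items * data.itemsize, data


def string_buffers(values: Sequence[str]) -> Tuple[bytes, List[int]]:
    """Return (data buffer, offsets) for a UTF-8 string column."""
    encoded = [v.encode("utf-8") for v in values]
    offsets = [0]
    for b in encoded:
        # Each offset is the running total of bytes written so far.
        offsets.append(offsets[-1] + len(b))
    return b"".join(encoded), offsets


ptr, bufsize, owner = fixed_width_buffer([1, 2, 3])
data, offsets = string_buffers(["a", "héllo", ""])
```

For string data the buffer size is the last offset rather than number of elements times element width, which is why it needs separate handling.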

jorisvandenbossche · 2023-01-25T15:01:12Z

For your interest, we just added support for the protocol in pyarrow as well (apache/arrow#14804)

It could also be an option to use pyarrow internally in polars for this, since the polars <-> pyarrow memory conversion should be cheap. I know pyarrow is not a required dependency of polars, so it's certainly not ideal, but I still wanted to mention it as something that could be considered: it avoids having to maintain two implementations that basically implement the protocol to/from Arrow-compatible memory. If the pyarrow dependency is a blocker, it could be an interesting idea to factor this out into a small independent package that does not depend on pyarrow.

stinodego · 2023-01-25T15:04:50Z

Thanks for the input! I'll definitely check that out, might indeed be a better method going forward, since I expect the protocol to change quite heavily in the future.

I don't think the pyarrow dependency is a big issue here.

ianmcook · 2023-01-26T15:13:12Z

Has there been any thinking about implementing the DataFrame interchange protocol for LazyFrames? It would be really cool to abstract away the difference between DataFrame and LazyFrame, and just have Polars do collect() when the LazyFrame needs to be materialized.

ritchie46 · 2023-01-26T15:39:05Z

> Thanks for the input! I'll definitely check that out, might indeed be a better method going forward, since I expect the protocol to change quite heavily in the future.
>
> I don't think the pyarrow dependency is a big issue here.

I agree. Let's at least use it to fill in the blanks.

ritchie46 · 2023-01-26T15:39:48Z

> Has there been any thinking about implementing the DataFrame interchange protocol for LazyFrames? It would be really cool to abstract away the difference between DataFrame and LazyFrame, and just have Polars do collect() when the LazyFrame needs to be materialized.

I don't think a query plan/promise/computation graph should be treated as a table.

stinodego · 2023-01-27T21:05:34Z

Officially, we can't rely on pyarrow because Categoricals are not zero copy. Is there a way we can make that conversion zero copy?

Otherwise, I think it's worth finishing what I started, since I'm almost there.

stinodego · 2023-01-30T19:44:40Z

Closing in favor of #6581

github-actions bot added the enhancement ("New feature or an improvement of an existing feature") and python ("Related to Python Polars") labels Nov 29, 2022

This was referenced Nov 29, 2022

Implement Series.get_chunks #5670

Closed

Implement Series.cat.categories and Series.cat.ordered #5671

Closed

stinodego force-pushed the dataframe-protocol branch from c75b178 to b3d17cd Compare November 29, 2022 23:47

stinodego force-pushed the dataframe-protocol branch from c729ede to d3f377e Compare December 2, 2022 12:33

stinodego force-pushed the dataframe-protocol branch from 0388e69 to d0bb23f Compare December 2, 2022 18:37

stinodego force-pushed the dataframe-protocol branch from daec322 to c208069 Compare January 12, 2023 20:13

stinodego mentioned this pull request Jan 19, 2023

Access memory address of start of chunk #6320

Closed

stinodego self-assigned this Jan 20, 2023

stinodego force-pushed the dataframe-protocol branch from c208069 to bad7c36 Compare January 25, 2023 14:42

stinodego added 9 commits January 27, 2023 22:01

WIP

faaa567

Fix dataframe get_chunks

73f2681

Finish size/offset

614a7c4

WIP from_dataframe

8d4290c

WIP

2cd9ac4

WIP

b5b9c28

Add utils test

91f6bfa

WIP

432af87

Add buffer unittests

c4d2448

WIP

3124142

stinodego force-pushed the dataframe-protocol branch from bad7c36 to 3124142 Compare January 27, 2023 22:33

stinodego mentioned this pull request Jan 30, 2023

feat(python): Implement DataFrame Interchange Protocol through pyarrow #6581

Merged

stinodego closed this Jan 30, 2023

stinodego deleted the dataframe-protocol branch February 22, 2023 18:37

stinodego added the A-interchange ("Area: Python dataframe interchange protocol") label Sep 7, 2023
