ENH: Implement DataFrame interchange protocol #46141

Merged
merged 49 commits into from Apr 27, 2022

vnlitvinov
Contributor

@vnlitvinov vnlitvinov commented Feb 24, 2022

Do note that this PR is currently work-in-progress, mostly to facilitate the discussion on how the implementation should be going.

It also vendors the exchange spec and exchange tests, which aren't yet merged at the consortium, so I'll keep updating the vendored copies as the discussion goes there.

More tests are still to be added, as well as implementations of some cases (a lot of non-central cases are NotImplemented for now, as I've built this upon the prototype).

  • closes #xxxx (Replace xxxx with the Github issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@pep8speaks

pep8speaks commented Feb 24, 2022

Hello @vnlitvinov! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-04-24 10:03:07 UTC

@vnlitvinov
Contributor Author

cc @jreback for preliminary feedback

pandas/core/frame.py (outdated, resolved)
pandas/core/frame.py (outdated, resolved)
@jreback jreback added the Compat pandas objects compatability with Numpy or Python functions label Feb 27, 2022
Contributor

@jreback jreback left a comment

Didn't look in detail, but here are some top-level organizational comments.

pandas/api/exchange/dataframe_protocol.py (outdated, resolved)
pandas/api/exchange/implementation.py (outdated, resolved)
pandas/api/exchange/implementation.py (outdated, resolved)
pandas/tests/api/conftest.py (outdated, resolved)
pandas/tests/api/test_protocol.py (outdated, resolved)
@github-actions
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Mar 30, 2022
@vnlitvinov
Contributor Author

@jreback @jbrockmendel I've responded to your comments, but I suggest holding off on re-reading the PR for now: I'm in the middle of improving it further, and I'll comment when it's ready for review again.

Thanks again for your feedback!

@vnlitvinov
Contributor Author

Okay, I think logic-wise it's ready to be reviewed.

I still need to make CI happy about code style etc., but I don't expect a lot of changes for that.

@vnlitvinov
Contributor Author

> This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Please remove the stale label

@vnlitvinov
Contributor Author

vnlitvinov commented Mar 31, 2022

This PR finally passes the code checks (and new functionality passes newly added tests which cover at least the basic usage of the API) on my end, so I'm marking this PR as "ready for review".

@vnlitvinov vnlitvinov changed the title from "[WIP] DataFrame exchange protocol" to "ENH: Implement DataFrame exchange protocol" Mar 31, 2022
@vnlitvinov vnlitvinov marked this pull request as ready for review March 31, 2022 17:35
@vnlitvinov vnlitvinov force-pushed the df-xchg branch 4 times, most recently from 98bfab4 to a681598 March 31, 2022 21:12
@vnlitvinov
Contributor Author

CI failures look like flaky tests, not related to my changes.

So I consider this PR ready for reviewing, ping @jbrockmendel @jreback

pandas/tests/exchange/test_impl.py (resolved)
        c_arrow_dtype_f_str,
        "=",
    )
elif is_string_dtype(dtype):
Contributor

If a DataFrame column has object dtype, is_string_dtype returns True and control flows into this branch. Since the spec does not require support for object dtype, should we raise a TypeError when df.__dataframe__() is called?

Contributor Author

That's a very good question, I think I'll stub it with a NotImplementedError for now - I think it fits better than TypeError as there is no error on user side, but a missing spec entry...

Contributor Author

I did a little more research, and I'm no longer sure how to properly distinguish a column of str from something more complex, other than checking every entry in the column via isinstance(), which feels wrong... I'm adding a TODO instead.
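
As a rough illustration of the per-entry check described above, here is a minimal sketch. A plain Python list stands in for an object-dtype column, `holds_only_strings` is a hypothetical helper that does not appear in the PR, and treating `None` as a missing value is an assumption made for the example.

```python
from typing import Iterable


def holds_only_strings(values: Iterable[object]) -> bool:
    """Return True when every non-missing entry is a str.

    This is an O(n) scan: each element has to be inspected individually,
    because an object column can hold arbitrary Python objects.
    """
    # Assumption: None marks a missing value and is skipped.
    return all(isinstance(v, str) for v in values if v is not None)


string_column = ["a", "b", None, "d"]      # strings plus a missing value
mixed_column = ["a", 1, {"nested": True}]  # also "object" data, but not strings

print(holds_only_strings(string_column))  # True
print(holds_only_strings(mixed_column))   # False
```

The cost grows with the number of rows, which is why running such a scan eagerly for every column is unattractive.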

Contributor

Why don't we check isinstance(df.dtype, object) in df.__dataframe__() and, if it is True, raise NotImplementedError?

Contributor Author

Because strings are usually stored as objects, aren't they? This would effectively block strings altogether.
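
A small sketch of why a dtype-level check is too coarse here. `FakeDtype` is a hypothetical stand-in for a column dtype (it is not a pandas or NumPy class); its `kind` attribute only mimics NumPy's kind codes, where "i" means signed integer and "O" means Python object.

```python
class FakeDtype:
    """Hypothetical stand-in for a column dtype; `kind` mimics NumPy kind codes."""

    def __init__(self, kind: str) -> None:
        self.kind = kind  # "i" = signed integer, "O" = Python object


int_dtype = FakeDtype("i")
object_dtype = FakeDtype("O")

# isinstance(..., object) is True for every Python value, so it cannot
# tell an object-dtype column apart from an integer one.
isinstance_results = [isinstance(d, object) for d in (int_dtype, object_dtype)]

# A kind-based check does single out object dtype.
kind_results = [d.kind == "O" for d in (int_dtype, object_dtype)]

print(isinstance_results)  # [True, True]
print(kind_results)        # [False, True]
```

Even the kind-based variant would reject ordinary string columns whenever strings are stored as objects, which is the concern raised in the reply above.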

Contributor

Should this check be put in df.__dataframe__() to get an error before playing around with the dataframe implementing the protocol?

Contributor

Vasily's answer:

I cannot find this comment somewhere where I can write an answer, so I'm going to type it as a general comment.

I think this check should be delayed as much as possible because it's potentially scanning all the items in the column, which is a heavy operation while a user might just be needing some small amount of information (or might be wanting to get some particular column but not this string/object one).

Contributor

But what if the user plays around for a long time with a df that has an object-dtype column, without touching df.dtype, and only gets the error after a while? I think that is a debatable question, and I would like to hear other opinions on it.

Contributor Author

The protocol is mostly for exchanging the dataframe between certain libraries, not for some user to play around with.

I'm imagining the use case like "someone wants to plot some graphs for a few columns of a dataframe backed by library X, so they request matplotlib to show a graph; matplotlib then imports the dataframe using the protocol and shows the requested columns, it doesn't care about other columns or anything else". In this case it would be harmful to the end user of the scenario to check if any column could be represented.
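
The deferred check argued for above can be sketched as follows. `LazyFrame` and `SUPPORTED_SCALARS` are hypothetical names used only for illustration and are not the classes added by the PR; only the method names `column_names` and `get_column_by_name` are borrowed from the protocol as used in this thread.

```python
from typing import Dict, List

# Assumption for the sketch: these scalar types count as "representable".
SUPPORTED_SCALARS = (int, float, str, bool, type(None))


class LazyFrame:
    """Hypothetical stand-in for an interchange object (not the pandas class)."""

    def __init__(self, columns: Dict[str, List[object]]) -> None:
        # Construction is cheap: nothing is scanned or validated here.
        self._columns = columns

    def column_names(self) -> List[str]:
        return list(self._columns)

    def get_column_by_name(self, name: str) -> List[object]:
        values = self._columns[name]
        # The per-element check runs only for the column that was requested.
        if not all(isinstance(v, SUPPORTED_SCALARS) for v in values):
            raise NotImplementedError(f"column {name!r} holds unsupported objects")
        return values


frame = LazyFrame(
    {
        "x": [1, 2, 3],
        "y": [0.5, 1.5, 2.5],
        "payload": [{"a": 1}, ["b"], object()],
    }
)

# A plotting consumer that needs only "x" and "y" never triggers the check
# on "payload", so the unsupported column does not get in its way.
plotted = {name: frame.get_column_by_name(name) for name in ("x", "y")}

try:
    frame.get_column_by_name("payload")
except NotImplementedError as exc:
    print(f"raised only on access: {exc}")
```

With an eager check in the constructor, the same consumer would fail before it could read the two columns it actually needs.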

pandas/core/exchange/column.py (outdated, resolved)
pandas/core/exchange/column.py (resolved)
pandas/core/exchange/column.py (resolved)
pandas/core/exchange/column.py (resolved)
pandas/core/exchange/buffer.py (resolved)
Contributor

@jreback jreback left a comment

Generally looks good; a few small comments.

pandas/core/exchange/buffer.py (outdated, resolved)
pandas/core/exchange/column.py (resolved)
pandas/core/exchange/dataframe_protocol.py (resolved)
@jorisvandenbossche jorisvandenbossche changed the title from "ENH: Implement DataFrame exchange protocol" to "ENH: Implement DataFrame interchange protocol" Apr 14, 2022
pandas/core/exchange/dataframe.py (outdated, resolved)
pandas/core/exchange/column.py (outdated, resolved)
pandas/core/exchange/column.py (outdated, resolved)
pandas/core/exchange/column.py (outdated, resolved)
pandas/core/exchange/column.py (outdated, resolved)
Signed-off-by: Vasily Litvinov <vasilij.n.litvinov@intel.com>
Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>
@dchigarev dchigarev left a comment

The implementation looks fine to me overall; I left a couple of minor comments.

pandas/core/exchange/from_dataframe.py (outdated, resolved)
pandas/core/exchange/from_dataframe.py (outdated, resolved)
pandas/tests/exchange/test_impl.py (outdated, resolved)
@YarShev
Contributor

YarShev commented Apr 19, 2022

@vnlitvinov, I see no answers to my questions above. Please take a look at them.

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>
Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>
@vnlitvinov
Contributor Author

@YarShev I hope I've answered all of them now. I'm sorry I somehow missed that you had added more responses to the initial review.

@jorisvandenbossche should I rename the subpackage pandas.core.exchange to pandas.core.interchange to align with the new PR title?

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>
Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>
Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>
pandas/tests/exchange/test_utils.py (resolved)
pandas/tests/exchange/test_spec_conformance.py (outdated, resolved)
pandas/tests/exchange/test_spec_conformance.py (outdated, resolved)
pandas/core/exchange/from_dataframe.py (resolved)
pandas/core/exchange/dataframe.py (resolved)
@vnlitvinov
Contributor Author

@YarShev

> Should this check be put in df.__dataframe__() to get an error before playing around with the dataframe implementing the protocol?

I cannot find this comment somewhere where I can write an answer, so I'm going to type it as a general comment.

I think this check should be delayed as much as possible because it's potentially scanning all the items in the column, which is a heavy operation while a user might just be needing some small amount of information (or might be wanting to get some particular column but not this string/object one).

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>
@YarShev
Contributor

YarShev commented Apr 24, 2022

> @YarShev
>
> > Should this check be put in df.__dataframe__() to get an error before playing around with the dataframe implementing the protocol?
>
> I cannot find this comment somewhere where I can write an answer, so I'm going to type it as a general comment.
>
> I think this check should be delayed as much as possible because it's potentially scanning all the items in the column, which is a heavy operation while a user might just be needing some small amount of information (or might be wanting to get some particular column but not this string/object one).

This is about handling string and object dtype. Let's continue the discussion there (link.)

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>
Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>
@jreback jreback added this to the 1.5 milestone Apr 26, 2022
@jreback
Contributor

jreback commented Apr 26, 2022

should there be tests that the protocol is round-trippable? e.g.

tm.assert_frame_equal(df, pd.api.exchange.from_dataframe(df.__dataframe__()))

for some/most of possible dfs? (e.g. empty, various types), if they have a non-range index they should raise? what about non-string columns names?

can certainly do this in another PR as well.

@vnlitvinov
Contributor Author

There already are a few:

@pytest.mark.parametrize("data", [("ordered", True), ("unordered", False)])
def test_categorical_dtype(data):
    df = pd.DataFrame({"A": (test_data_categorical[data[0]])})
    col = df.__dataframe__().get_column_by_name("A")
    assert col.dtype[0] == DtypeKind.CATEGORICAL
    assert col.null_count == 0
    assert col.describe_null == (ColumnNullType.USE_SENTINEL, -1)
    assert col.num_chunks() == 1
    assert col.describe_categorical == {
        "is_ordered": data[1],
        "is_dictionary": True,
        "mapping": {0: "a", 1: "d", 2: "e", 3: "s", 4: "t"},
    }
    tm.assert_frame_equal(df, from_dataframe(df.__dataframe__()))

and

@pytest.mark.parametrize(
    "data", [int_data, uint_data, float_data, bool_data, datetime_data]
)
def test_dataframe(data):
    df = pd.DataFrame(data)
    df2 = df.__dataframe__()
    assert df2.num_columns() == NCOLS
    assert df2.num_rows() == NROWS
    assert list(df2.column_names()) == list(data.keys())
    indices = (0, 2)
    names = tuple(list(data.keys())[idx] for idx in indices)
    tm.assert_frame_equal(
        from_dataframe(df2.select_columns(indices)),
        from_dataframe(df2.select_columns_by_name(names)),
    )

Maybe I should extend the second one and take a subset of pandas DataFrame using same indices and compare it with the one obtained via protocol...
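
The categorical test above asserts a null sentinel of -1 and a code-to-label mapping. As a rough illustration of how a consumer might interpret those two pieces of metadata, here is a minimal sketch in plain Python. `decode_codes` and the sample `codes` list are hypothetical and are not taken from the PR; only the sentinel value and the mapping come from the test.

```python
from typing import Dict, List, Optional

SENTINEL = -1  # from describe_null == (ColumnNullType.USE_SENTINEL, -1)
mapping: Dict[int, str] = {0: "a", 1: "d", 2: "e", 3: "s", 4: "t"}


def decode_codes(codes: List[int]) -> List[Optional[str]]:
    """Map integer codes to category labels; the sentinel becomes None."""
    return [None if code == SENTINEL else mapping[code] for code in codes]


codes = [4, 2, 3, 4, -1, 1, 0]  # illustrative codes, not taken from the PR
decoded = decode_codes(codes)
print(decoded)  # ['t', 'e', 's', 't', None, 'd', 'a']
```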

@vnlitvinov
Contributor Author

> if they have a non-range index they should raise? what about non-string columns names?
>
> can certainly do this in another PR as well.

I would rather make it in a separate PR, as this one is already big...

@jreback
Contributor

jreback commented Apr 27, 2022

> > if they have a non-range index they should raise? what about non-string columns names?
> > can certainly do this in another PR as well.
>
> I would rather make it in a separate PR, as this one is already big...

No, for sure; please create a TODO issue (and PRs)!

thanks for all of this @vnlitvinov and @YarShev for all the review!

Labels
Compat pandas objects compatability with Numpy or Python functions
6 participants