Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): Add DataFrame.iter_columns #12653

Merged
merged 3 commits into from
Nov 27, 2023

Conversation

MarcoGorelli
Copy link
Collaborator

@MarcoGorelli MarcoGorelli commented Nov 23, 2023

closes #12630

Demo:

In [1]: df = pl.DataFrame({'a': [1, 2]})

In [2]: [i for i in df]
<ipython-input-2-30769f5b8b68>:1: DeprecationWarning: `__iter__` is deprecated. This function will raise in a future version of Polars.
Instead, use:
- `DataFrame.iter_columns` to iterate over columns
- `DataFrame.columns` to iterate over column names
- `DataFrame.iter_rows` to iterate over rows

  [i for i in df]
Out[2]: 
[shape: (2,)
 Series: 'a' [i64]
 [
        1
        2
 ]]

EDIT

demo above it outdated, this PR now just adds iter_columns without deprecating anything

@github-actions github-actions bot added deprecation Add a deprecation warning to outdated functionality python Related to Python Polars labels Nov 23, 2023
@MarcoGorelli MarcoGorelli force-pushed the iter_columns branch 2 times, most recently from 66d9a6c to f07236c Compare November 23, 2023 19:45
@MarcoGorelli MarcoGorelli marked this pull request as ready for review November 23, 2023 19:46
@stinodego
Copy link
Member

I'm in favor of adding the explicit iter_columns, but I don't see the need to deprecate __iter__.

@MarcoGorelli MarcoGorelli marked this pull request as draft November 24, 2023 09:41
@MarcoGorelli MarcoGorelli changed the title depr(python): deprecate DataFrame.__iter__() in favour of DataFrame.iter_columns feat(python): add DataFrame.iter_columns Nov 24, 2023
@github-actions github-actions bot added the enhancement New feature or an improvement of an existing feature label Nov 24, 2023
@MarcoGorelli MarcoGorelli removed the deprecation Add a deprecation warning to outdated functionality label Nov 24, 2023
@MarcoGorelli MarcoGorelli marked this pull request as ready for review November 24, 2023 09:48
@ritchie46
Copy link
Member

I'm in favor of adding the explicit iter_columns, but I don't see the need to deprecate __iter__.

I think it is worth the deprecation as iteration over a 2D data structure is ambiguous.

@stinodego
Copy link
Member

stinodego commented Nov 24, 2023

I think it is worth the deprecation as iteration over a 2D data structure is ambiguous.

In that sense it would match what we do with our implementation of __bool__. I guess that's consisent in a way.

Overall, I think removal will make DataFrame function less as a Python user would expect.

For example, I expect collection-type objects to implement __iter__. Iterating a dict is ambiguous but Python chooses to let you iterate the keys by default. You can still iterate keys or values explicitly by using keys() or values(). Sensible defaults are good to have.

You could argue also __len__ should be deprecated because it's ambiguous. In the end, we're trying to create something usable, and these shortcuts are useful when wrangling data in a notebook. Unfortunately that can be at odds with enforcing an explicit, readable API...

Anyway, that's my two cents and I'll leave it up to you to decide. The PR in its current form can be merged regardless 👍

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Nov 24, 2023

I'd also strongly suggest not deprecating __iter__ (and I also like adding the explicit iter_columns ;)

Being able to directly iterate a collection-like structure is definitely expected for Python objects. Given that DataFrames are column-native, it feels reasonable/expected to iterate over them as Series. It's also not really ambiguous - if we returned Python-native lists here (instead of Series) then there might be some (minor) potential for confusion, but that's not the case.

@ritchie46
Copy link
Member

Ok, let's be conservative for now. Can we revert the deprecation and think about it a little bit more?

@MarcoGorelli
Copy link
Collaborator Author

yup - there's no deprecation in this PR at the moment, it just adds iter_columns (which I think people are on board with)

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Nov 25, 2023

Actually... I'd make one addition/change; iter_rows returns py-native data, and I suggest we do the same for iter_columns, while also having an iter_series (which __iter__ could then call, or vice-versa). Alternatively (probably cleaner) add an "as_series" param, as we have for as_dict, but default to False (and eventually we follow-up with a breaking change to switch the as_dict default to False as well, which I think @stinodego already has planned ;)

Then, in either case, we will be consistent in our default behaviour for both row/col iteration types (returning py-native data when iterating from python), and Series iteration becomes explicit instead of implicit. Should cover all the bases pretty cleanly.

@ritchie46
Copy link
Member

I am not very fond of materializing to python on iter_columns. We have zero copy iter_columns and we have a column datastructure, the Series.

We don't have a row datastructure and we cannot do that zero-copy anyways, so it makes sense to me that our row datastructure differs. Materializing columns to python is also much more a performance footgun than rows in many cases as most dataframes are longer than wide.

@stinodego
Copy link
Member

I agree with Ritchie on this one - I don't think the discussion of Python native objects vs Polars objects applies here.

I'm going to go ahead and merge this. Nice to have an explicit way of iterating columns!

@stinodego stinodego merged commit 8aebb03 into pola-rs:main Nov 27, 2023
16 checks passed
@stinodego stinodego changed the title feat(python): add DataFrame.iter_columns feat(python): Add DataFrame.iter_columns Nov 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataFrame is iterable, but breaks expectations (just like pandas)
4 participants