Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): auto-determine index/columns/values columns in pivot if one is left out, deprecate passing arguments positionally #12125

Closed
wants to merge 10 commits into from

Conversation

mcrumiller
Copy link
Contributor

@mcrumiller mcrumiller commented Oct 31, 2023

THIS IS A BREAKING CHANGE

Resolves #12087

Since index and columns are both required, there's no reason why values can't be optional. To do so requires moving values to the 3rd argument, which is a breaking change unless we wanted to make all arguments keyword args, or have the user supply None manually. It makes sense to me that values should be optional, as I think the most common usage is to use all remaining columns.

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars labels Oct 31, 2023
@stinodego stinodego changed the title feat(python): make values optional parameter to pivot feat(python)!: make values optional parameter to pivot Oct 31, 2023
@github-actions github-actions bot added the breaking Change that breaks backwards compatibility label Oct 31, 2023
@MarcoGorelli
Copy link
Collaborator

Thanks for working on this

I think this is going to break too much code, especially without a deprecation period. Not totally sure about what to suggest instead though, other than keyword-only args. Maybe they could all be optional, and if you specify 2, the third is implied?

@mcrumiller
Copy link
Contributor Author

@MarcoGorelli no problem. I've moved the values parameter back to the front and made it required, but if the user passes None, all available columns are used.

@MarcoGorelli
Copy link
Collaborator

Not sure about values=None having to be passed explicitly

Why not

keyword-only args. Maybe they could all be optional, and if you specify 2, the third is implied?

@mcrumiller
Copy link
Contributor Author

@MarcoGorelli yeah that's perfectly reasonable, just implemented.

@mcrumiller mcrumiller changed the title feat(python)!: make values optional parameter to pivot feat(python)!: auto-determine index/columns/values columns in pivot if one is left out Nov 12, 2023
@MarcoGorelli
Copy link
Collaborator

you'll need test for if less than 2 are specified, and deprecate_nonkeyword_arguments

@mcrumiller
Copy link
Contributor Author

Hi @MarcoGorelli,

Sorry for not being thorough on this, been busy. New commit adds deprecation warning and also an error test for when < 2/3 args are supplied.

@mcrumiller
Copy link
Contributor Author

Hi @MarcoGorelli thanks for taking a close look at this and catching my stupid mistakes.

@alexander-beedie I made a very minor change to _expand_selectors where Nones are simply ignored instead of being added as elements to the list output. This doesn't cause any problems in any tests but I wanted your input on this since that's your domain.

Copy link
Collaborator

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I'm on-board with the suggestion - it's not actually breaking, as there would be a deprecation period, and the arguments would become keyword-only (so, nobody's code would silently produce something different)

If I read df.pivot("a", "b", "c"), I have no idea what to expect (esp. as the order of arguments differs from pandas.DataFrame.pivot!), so I welcome keyword-only args

And determining the third argument from the 2 specified ones seems like a good ergonomic improvement

So, I'll approve, but this is pending

@MarcoGorelli MarcoGorelli changed the title feat(python)!: auto-determine index/columns/values columns in pivot if one is left out feat(python): auto-determine index/columns/values columns in pivot if one is left out, make arguments keyword-only Dec 4, 2023
@MarcoGorelli MarcoGorelli changed the title feat(python): auto-determine index/columns/values columns in pivot if one is left out, make arguments keyword-only feat(python): auto-determine index/columns/values columns in pivot if one is left out, deprecate passing arguments positionally Dec 4, 2023
@mcrumiller
Copy link
Contributor Author

@stinodego this one has been sitting around for a while. Are you still on-board with this one and, if so, do you have any suggested changes?

@MarcoGorelli
Copy link
Collaborator

MarcoGorelli commented Jan 25, 2024

Looks like there's not much appetite for this one, and on the pandas side I've seen complaints that keyword-only arguments here are cumbersome (pandas-dev/pandas#51359 (comment))

I'd suggest closing for now then, it can always be reconsidered in the future

@stinodego
Copy link
Member

I admit I haven't really looked at this yet. First I'd have to figure out what pivot does exactly 😄 but I can come back to this soon.

@mcrumiller
Copy link
Contributor Author

mcrumiller commented Jan 25, 2024

First I'd have to figure out what pivot does exactly 😄

Pivot is a fancier unstack operation with aggregation, if that helps. I think the documentation could be improved with better column names to align directly with the keywords to show what's happening. Here is my improved example that I think lends clarity :

import polars as pl

df = pl.DataFrame({
    # The unique values of this column determine the output "index" column
    # Think of this as the "group_by" column
    "index": ["one", "one", "two", "two", "three", "three",],

    # Each unique value becomes a new output column, so our output columns
    # will be "x", "y", and "z"
    "columns1": ["x", "x", "x", "y", "y", "y"],
    "columns2": ["z", "z", "z", "z", "z", "z"],

    # The values in this column must be aggregated somehow
    "values": [1, 2, 3, 4, 5, 6],
})

# df:
# shape: (6, 4)
# ┌───────┬──────────┬──────────┬────────┐
# │ index ┆ columns1 ┆ columns2 ┆ values │
# │ ---   ┆ ---      ┆ ---      ┆ ---    │
# │ str   ┆ str      ┆ str      ┆ i64    │
# ╞═══════╪══════════╪══════════╪════════╡
# │ one   ┆ x        ┆ z        ┆ 1      │
# │ one   ┆ x        ┆ z        ┆ 2      │
# │ two   ┆ x        ┆ z        ┆ 3      │
# │ two   ┆ y        ┆ z        ┆ 4      │
# │ three ┆ y        ┆ z        ┆ 5      │
# │ three ┆ y        ┆ z        ┆ 6      │
# └───────┴──────────┴──────────┴────────┘

out = df.pivot(
    values="values",
    columns=["columns1", "columns2"],
    index=["index"],
    aggregate_function="sum",
)

# out:
# shape: (3, 4)
# ┌───────┬──────┬──────┬─────┐
# │ index ┆ x    ┆ y    ┆ z   │
# │ ---   ┆ ---  ┆ ---  ┆ --- │
# │ str   ┆ i64  ┆ i64  ┆ i64 │
# ╞═══════╪══════╪══════╪═════╡
# │ one   ┆ 3*   ┆ null ┆ 3   │   * aggregation where index = "one" and columns1 = "x"
# │ two   ┆ 3    ┆ 4*   ┆ 7   │   * aggregation where index = "two" and columns1 = "y"
# │ three ┆ null ┆ 11   ┆ 11* │   * aggregation where index = "three" and columns2 = "z"
# └───────┴──────┴──────┴─────┘

FYI this just led me to find a bug, whereby pivot can great a dataframe with duplicate column names (#13994).

@mcrumiller mcrumiller closed this Feb 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Change that breaks backwards compatibility enhancement New feature or an improvement of an existing feature python Related to Python Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make values in pl.pivot() optional
3 participants