Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(rust, python)!: Formalize list aggregation difference between groupbys, selection and window functions #6487

Merged
merged 5 commits into from Jan 27, 2023

Conversation

ritchie46
Copy link
Member

@ritchie46 ritchie46 commented Jan 27, 2023

This is quite a serious breaking change, but one that has to be made. This was a serious inconsistency in the query engine regarding expressions.

Context

Expressions should adhere to the following rules regarding their context.

  1. In the selection context expressions work on columns.
  2. In the aggregation/groupby context expressions work on the groups.

Within this you should be able to reason what an expression does. E.g. an expression used in select should explain to you how it should behave in agg as well.

This was not the case.

A col().list() turned a i64 column into a list<i64>.

Based on this, I would assume that an aggregation with col().list() turns a i64 in a list<list<i64>>. The outer list being the groups, the inner list being the list aggregation passed by the user. This was not the case.

But now it is:

Selection

df = pl.DataFrame({
    "group": [1, 2, 2, 3], 
    "value": ["a", "b", "c", "d"]
})

df.select(
    pl.col("value").list()
)
shape: (1, 1)
┌─────────────────────┐
│ value               │
│ ---                 │
│ list[str]           │
╞═════════════════════╡
│ ["a", "b", ... "d"] │
└─────────────────────┘

Groupby

(
    df.groupby("group")
    .agg([
        pl.col("value").alias("values in groups"),
        pl.col("value").list().alias("values in groups + list")
    ])
)
shape: (3, 3)
┌───────┬──────────────────┬─────────────────────────┐
│ group ┆ values in groups ┆ values in groups + list │
│ ---   ┆ ---              ┆ ---                     │
│ i64   ┆ list[str]        ┆ list[list[str]]         │
╞═══════╪══════════════════╪═════════════════════════╡
│ 2     ┆ ["b", "c"]       ┆ [["b", "c"]]            │
│ 1     ┆ ["a"]            ┆ [["a"]]                 │
│ 3     ┆ ["d"]            ┆ [["d"]]                 │
└───────┴──────────────────┴─────────────────────────┘

Window functions

For window functions we must have slightly different mental model. If we don't reduce an aggregation in a window function, e.g. bring down to one element via mean, sum, first etc, we map the groups back to their position in the DataFrame.

For instance a col().sort().over() on a i64 returns the same i64 dtype.
So window function, though very similar to groupby operations, are always a single level less nested than a groupby result.

A col().list().over() will return a list<i64> where the list is the list aggregation and the groups are processed by putting that aggregation on the approriate location in the DataFrame.

This wraps it up. This likely will break some examples in the polars-book so we will have to look at that as well.

Quite breaking, but for the better. :)

Migration

This is easy. Remove any .list() call in a groupby().agg([...]) context.

@github-actions github-actions bot added breaking Change that breaks backwards compatibility fix Bug fix python Related to Python Polars rust Related to Rust Polars labels Jan 27, 2023
@ritchie46 ritchie46 changed the title fix(rust, python)!: Formalize list aggregation difference between groupbys and window functions fix(rust, python)!: Formalize list aggregation difference between groupbys, selection and window functions Jan 27, 2023
@stinodego
Copy link
Member

stinodego commented Jan 27, 2023

In the example you give, shouldn't the groupby result in the values of group 2 each part of an individual list?

shape: (3, 3)
┌───────┬──────────────────┬─────────────────────────┐
│ group ┆ values in groups ┆ values in groups + list │
│ ---   ┆ ---              ┆ ---                     │
│ i64   ┆ list[str]        ┆ list[list[str]]         │
╞═══════╪══════════════════╪═════════════════════════╡
│ 2     ┆ ["b", "c"]       ┆ [["b"], ["c"]]          │
│ 1     ┆ ["a"]            ┆ [["a"]]                 │
│ 3     ┆ ["d"]            ┆ [["d"]]                 │
└───────┴──────────────────┴─────────────────────────┘

Also, interesting change and I agree that it is needed.

@ritchie46
Copy link
Member Author

ritchie46 commented Jan 27, 2023

In the example you give, shouldn't the groupby result in the values of group 2 each part of an individual list?

shape: (3, 3)
┌───────┬──────────────────┬─────────────────────────┐
│ group ┆ values in groups ┆ values in groups + list │
│ ---   ┆ ---              ┆ ---                     │
│ i64   ┆ list[str]        ┆ list[list[str]]         │
╞═══════╪══════════════════╪═════════════════════════╡
│ 2     ┆ ["b", "c"]       ┆ [["b"], ["c"]]          │
│ 1     ┆ ["a"]            ┆ [["a"]]                 │
│ 3     ┆ ["d"]            ┆ [["d"]]                 │
└───────┴──────────────────┴─────────────────────────┘

Also, interesting change and I agree that it is needed.

Nope, that's not what list does in select. You wrap the entire group in a list.

@stinodego
Copy link
Member

stinodego commented Jan 27, 2023

Nope, that's not what list does in select. You wrap the entire group in a list.

Right, makes sense now that I think about it. Thanks!

@ritchie46 ritchie46 merged commit 47752f9 into master Jan 27, 2023
@ritchie46 ritchie46 deleted the list_agg branch January 27, 2023 13:29
@stinodego stinodego added the highlight Highlight this PR in the changelog label Jan 28, 2023
@ritchie46 ritchie46 mentioned this pull request Feb 3, 2023
philss added a commit to philss/explorer that referenced this pull request Mar 9, 2023
We may need some changes from this version to tackle these issues:
- elixir-explorer#498
- elixir-explorer#499

See: pola-rs/polars#6487
And also the release: https://github.com/pola-rs/polars/releases/tag/rs-0.27.0

With this update we also had to change the
`Explorer.DataFrame.describe/2` function to adhere the changes from
Polars. I changed the backend to simplify, since we shouldn't have a
lazy version.
philss added a commit to elixir-explorer/explorer that referenced this pull request Mar 10, 2023
We may need some changes from this version to tackle these issues:
- #498
- #499

See: pola-rs/polars#6487
And also the release: https://github.com/pola-rs/polars/releases/tag/rs-0.27.0

With this update we also had to change the
`Explorer.DataFrame.describe/2` function to adhere the changes from
Polars. I changed the backend to simplify, since we shouldn't have a
lazy version.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Change that breaks backwards compatibility fix Bug fix highlight Highlight this PR in the changelog python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants