Ability to sort an array of structs by struct field #16110

Open
david-waterworth opened this issue May 7, 2024 · 3 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

david-waterworth commented May 7, 2024

Description

There doesn't appear to be a direct way to sort an array of structs by one of the struct's fields. For example:

Given:

df = pl.DataFrame([{"id": 1, "data": [{"key": "A", "value": 2}, {"key": "B", "value": 1}]}])

Sorting each list by the struct's value field should give:

df = pl.DataFrame([{"id": 1, "data": [{"key": "B", "value": 1}, {"key": "A", "value": 2}]}])

There are workarounds. One relies on the side effect that sorting a struct appears to sort by its fields in order, so (re)constructing the struct as {"value": 1, "key": "B"} sorts by value. The other is to explode the list, sort, and then re-group.

Ideally pl.Expr.list.sort would accept a sort key expression, similar to the key argument of Python's sorted function.
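For comparison, this is the plain-Python behaviour being referred to, shown on the same data as a list of dicts. It is only a sketch of the desired semantics, not a Polars API.

```python
from operator import itemgetter

data = [{"key": "A", "value": 2}, {"key": "B", "value": 1}]

# sorted() takes a key function that extracts the value to sort on.
sorted_data = sorted(data, key=itemgetter("value"))

print(sorted_data)
```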

If this is accepted, I'd be inclined to implement it for all sort methods.

david-waterworth added the enhancement label May 7, 2024
@cmdlineluser
Contributor

Polars has a separate Expr.sort_by()

So I'm guessing you'd be asking for a list equivalent of that?

df.with_columns(
   pl.col("data").list.sort_by(
       pl.col("data").list.eval(pl.element().struct["value"])
   )
)
# AttributeError: 'ExprListNameSpace' object has no attribute 'sort_by'

@nameexhaustion
Collaborator

For now, you can use .list.eval(pl.element().sort_by(pl.element().struct.field("value")))

>>> df.with_columns(
...     sorted=pl.col("data").list.eval(
...         pl.element().sort_by(pl.element().struct.field("value"))
...     )
... )
shape: (1, 3)
┌─────┬────────────────────┬────────────────────┐
│ id  ┆ data               ┆ sorted             │
│ --- ┆ ---                ┆ ---                │
│ i64 ┆ list[struct[2]]    ┆ list[struct[2]]    │
╞═════╪════════════════════╪════════════════════╡
│ 1   ┆ [{"A",2}, {"B",1}] ┆ [{"B",1}, {"A",2}] │
└─────┴────────────────────┴────────────────────┘

Author

david-waterworth commented May 8, 2024

Yeah, either works. @cmdlineluser I guess list.sort_by is easier to find (i.e. it's explicit), but the workaround posted by @nameexhaustion works for me as well.
