Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement pl$arg_sort_by() #929

Merged
merged 10 commits into from
Mar 23, 2024
Merged

Implement pl$arg_sort_by() #929

merged 10 commits into from
Mar 23, 2024

Conversation

etiennebacher
Copy link
Collaborator

Related to #204. To finish after #923 is merged.

@etiennebacher
Copy link
Collaborator Author

I'm a bit stuck here, the following crashes the session and I don't know why:

df = pl$DataFrame(
  a = c(0, 1, 1, 0),
  b = c(3, 2, 3, 2)
)

df$with_columns(
  arg_sort_ab = pl$arg_sort_by(c("a", "b"), descending = TRUE)
)

@eitsupi
Copy link
Collaborator

eitsupi commented Mar 23, 2024

I'm a bit stuck here, the following crashes the session and I don't know why:

This is reproduced in Python. The error occurs at the following location:
https://github.com/pola-rs/polars/blob/911abc32fa1865d3e36f834fed585af9f9f0080d/crates/polars-plan/src/dsl/functions/index.rs#L10

>>> import polars as pl
>>> df = pl.DataFrame(
...     {
...         "a": [0, 1, 1, 0],
...         "b": [3, 2, 3, 2],
...     }
... )
>>> df.select(pl.arg_sort_by(pl.col("a")))
shape: (4, 1)
┌─────┐
│ a   │
│ --- │
│ u32 │
╞═════╡
│ 0   │
│ 3   │
│ 1   │
│ 2   │
└─────┘
>>> df.select(pl.arg_sort_by(pl.col("a", "b")))
thread '<unnamed>' panicked at /home/runner/work/polars/polars/crates/polars-plan/src/dsl/functions/index.rs:10:36:
called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("this expression may produce multiple output names"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rstudio/.local/lib/python3.10/site-packages/polars/functions/lazy.py", line 1601, in arg_sort_by
    return wrap_expr(plr.arg_sort_by(exprs, descending))
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("this expression may produce multiple output names"))
>>> df.select(pl.arg_sort_by(pl.col("*")))
thread '<unnamed>' panicked at /home/runner/work/polars/polars/crates/polars-plan/src/dsl/functions/index.rs:10:36:
called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("cannot determine output column without a context for this expression"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rstudio/.local/lib/python3.10/site-packages/polars/functions/lazy.py", line 1601, in arg_sort_by
    return wrap_expr(plr.arg_sort_by(exprs, descending))
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("cannot determine output column without a context for this expression"))
>>> df.select(pl.arg_sort_by(pl.col("^h")))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/rstudio/.local/lib/python3.10/site-packages/polars/dataframe/frame.py", line 8129, in select
    return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)
  File "/home/rstudio/.local/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1937, in collect
    return wrap_df(ldf.collect())
polars.exceptions.ColumnNotFoundError: ^h

@eitsupi
Copy link
Collaborator

eitsupi commented Mar 23, 2024

Looking at the operation of this, only the first column has special meaning (reuse as the result column name).
Therefore, it is not possible to pass multiple column names as the first argument.

There are functions with similar behavior, such as all_horizontal, which does not reuse column names because it names the resulting column as "all".

@etiennebacher
Copy link
Collaborator Author

etiennebacher commented Mar 23, 2024

Therefore, it is not possible to pass multiple column names as the first argument.

I think it is, they have an example like that in py-polars:

import polars as pl

df = pl.DataFrame(
    {
        "a": [0, 1, 1, 0],
        "b": [3, 2, 3, 2],
    }
)

df.select(pl.arg_sort_by(["a", "b"]))

shape: (4, 1)
┌─────┐
│ a   │
│ --- │
│ u32 │
╞═════╡
│ 3   │
│ 0   │
│ 1   │
│ 2   │
└─────┘

@eitsupi
Copy link
Collaborator

eitsupi commented Mar 23, 2024

I pushed a change to allow vectors of length > 1 as the first argument, but in the end I don't think this is a fundamental solution because passing something like pl$col("a", "b") as the first argument still causes a panic in this case.

@etiennebacher
Copy link
Collaborator Author

Thanks for your help

in the end I don't think this is a fundamental solution because passing something like pl$col("a", "b") as the first argument still causes a panic in this case.

I think that also depends on your PR upstream. I'm fine with letting this one as a draft until it's sorted out upstream, or merging this now, as you prefer.

@eitsupi
Copy link
Collaborator

eitsupi commented Mar 23, 2024

I would prefer to merge this PR rather than leave it open.
Could you update tests and complete this PR?

@etiennebacher etiennebacher marked this pull request as ready for review March 23, 2024 10:24
Copy link
Collaborator

@eitsupi eitsupi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@etiennebacher etiennebacher merged commit 71a715c into main Mar 23, 2024
24 of 25 checks passed
@etiennebacher etiennebacher deleted the arg_sort_by branch March 23, 2024 10:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants