Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rename rle() struct fields to len and value #15230

Closed
cmdlineluser opened this issue Mar 22, 2024 · 3 comments · Fixed by #15249
Closed

Rename rle() struct fields to len and value #15230

cmdlineluser opened this issue Mar 22, 2024 · 3 comments · Fixed by #15249
Assignees
Labels
A-api Area: changes to the public API accepted Ready for implementation enhancement New feature or an improvement of an existing feature
Milestone

Comments

@cmdlineluser
Copy link
Contributor

Description

Remove the plural from Series.rle() and Expr.rle() field names.

(Similar to what was done for value_counts: #11462)

Current:

>>> pl.Series([1, 1, 2, 3]).rle().struct.unnest()
shape: (3, 2)
┌─────────┬────────┐
│ lengthsvalues │
│ ------    │
│ i32i64    │
╞═════════╪════════╡
│ 21      │
│ 12      │
│ 13      │
└─────────┴────────┘

Desired:

>>> pl.Series([1, 1, 2, 3]).rle().struct.unnest()
shape: (3, 2)
┌─────────┬────────┐
│ lenvalue  │
│ ------    │
│ i32i64    │
╞═════════╪════════╡
│ 21      │
│ 12      │
│ 13      │
└─────────┴────────┘

(Choosing len to match up with list.len())

@cmdlineluser cmdlineluser added the enhancement New feature or an improvement of an existing feature label Mar 22, 2024
@cmdlineluser
Copy link
Contributor Author

Side note: Would it make sense for rle() to also return the row index? {index, value, len}

The particular use-case being wanting the original row index after performing a .filter()

df = pl.DataFrame({"foo": ["a", "a", "a", "b", "c", "c"]})

df.select(pl.col("foo").rle())
# shape: (3, 1)
# ┌───────────┐
# │ foo       │
# │ ---       │
# │ struct[2] │
# ╞═══════════╡
# │ {3,"a"}   │
# │ {1,"b"}   │
# │ {2,"c"}   │
# └───────────┘

We can calculate it from the length, but it's a little awkward:

(df.select(pl.col("foo").rle())
   .with_columns(
      index = pl.col("foo").struct["lengths"].cum_sum().shift().fill_null(0)
   )
   #.filter(...)
)
# shape: (3, 2)
# ┌───────────┬───────┐
# │ foo       ┆ index │
# │ ---       ┆ ---   │
# │ struct[2] ┆ i32   │
# ╞═══════════╪═══════╡
# │ {3,"a"}   ┆ 0     │
# │ {1,"b"}   ┆ 3     │
# │ {2,"c"}   ┆ 4     │
# └───────────┴───────┘

@stinodego
Copy link
Member

stinodego commented Mar 22, 2024

Agreed on the rename.

I don't think the index should be part of the RLE method by default. It is not an essential part of the RLE definition. Though possibly an include_index parameter would make sense - but please open a separate issue for that.

@stinodego
Copy link
Member

stinodego commented Mar 23, 2024

@cmdlineluser I am tempted to also change the field order of the struct to value/len. That way it matches value_counts. What do you think?

EDIT: Nevermind, it's probably not a good idea as the standard RLE places len before value, e.g. 12W1B12W3B24W1B14W.

@stinodego stinodego added the A-api Area: changes to the public API label Mar 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-api Area: changes to the public API accepted Ready for implementation enhancement New feature or an improvement of an existing feature
Projects
Archived in project
2 participants