Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add new index/range based selector cs.by_index, allow multiple indices for nth #16217

Merged
merged 6 commits into from
May 15, 2024

Conversation

alexander-beedie
Copy link
Collaborator

@alexander-beedie alexander-beedie commented May 14, 2024

Enables some interesting new column selection capabilities using indices (including negative indices) and range objects, as well as their union, intersection, etc (via selector combinatorics).

  • Adds a new by_index selector that works with indices and ranges.
  • Allows nth to take multiple indices.

Examples

import polars.selectors as cs
import polars as pl

df = pl.DataFrame({
    "key": ["abc"],
    **{f"c{i:02}": [0.5 * i] for i in range(100)},
})
# shape: (1, 101)
# ┌─────┬─────┬─────┬─────┬───┬──────┬──────┬──────┬──────┐
# │ key ┆ c00 ┆ c01 ┆ c02 ┆ … ┆ c96  ┆ c97  ┆ c98  ┆ c99  │
# │ --- ┆ --- ┆ --- ┆ --- ┆   ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
# │ str ┆ f64 ┆ f64 ┆ f64 ┆   ┆ f64  ┆ f64  ┆ f64  ┆ f64  │
# ╞═════╪═════╪═════╪═════╪═══╪══════╪══════╪══════╪══════╡
# │ abc ┆ 0.0 ┆ 0.5 ┆ 1.0 ┆ … ┆ 48.0 ┆ 48.5 ┆ 49.0 ┆ 49.5 │
# └─────┴─────┴─────┴─────┴───┴──────┴──────┴──────┴──────┘

Select "key" col and the three first/last numeric columns:
(can freely mix/match indexes and range objects)

df.select(pl.nth([0, 1, 2, 3, -3, -2, -1]))
df.select(cs.by_index(0, 1, 2, 3, -3, -2, -1))
df.select(cs.by_index(range(0,4), range(-3,0)))
# shape: (1, 7)
# ┌─────┬─────┬─────┬─────┬──────┬──────┬──────┐
# │ key ┆ c00 ┆ c01 ┆ c02 ┆ c97  ┆ c98  ┆ c99  │
# │ --- ┆ --- ┆ --- ┆ --- ┆ ---  ┆ ---  ┆ ---  │
# │ str ┆ f64 ┆ f64 ┆ f64 ┆ f64  ┆ f64  ┆ f64  │
# ╞═════╪═════╪═════╪═════╪══════╪══════╪══════╡
# │ abc ┆ 0.0 ┆ 0.5 ┆ 1.0 ┆ 48.5 ┆ 49.0 ┆ 49.5 │
# └─────┴─────┴─────┴─────┴──────┴──────┴──────┘

Select every 10th column:

df.select(cs.by_index(range(0,101,10)))
# shape: (1, 11)
# ┌─────┬─────┬─────┬──────┬───┬──────┬──────┬──────┬──────┐
# │ key ┆ c09 ┆ c19 ┆ c29  ┆ … ┆ c69  ┆ c79  ┆ c89  ┆ c99  │
# │ --- ┆ --- ┆ --- ┆ ---  ┆   ┆ ---  ┆ ---  ┆ ---  ┆ ---  │
# │ str ┆ f64 ┆ f64 ┆ f64  ┆   ┆ f64  ┆ f64  ┆ f64  ┆ f64  │
# ╞═════╪═════╪═════╪══════╪═══╪══════╪══════╪══════╪══════╡
# │ abc ┆ 4.5 ┆ 9.5 ┆ 14.5 ┆ … ┆ 34.5 ┆ 39.5 ┆ 44.5 ┆ 49.5 │
# └─────┴─────┴─────┴──────┴───┴──────┴──────┴──────┴──────┘

Select every 25th numeric column in reverse order:

df.select(cs.by_index(range(101,0,-25)))
# shape: (1, 4)
# ┌──────┬──────┬──────┬─────┐
# │ c75  ┆ c50  ┆ c25  ┆ c00 │
# │ ---  ┆ ---  ┆ ---  ┆ --- │
# │ f64  ┆ f64  ┆ f64  ┆ f64 │
# ╞══════╪══════╪══════╪═════╡
# │ 37.5 ┆ 25.0 ┆ 12.5 ┆ 0.0 │
# └──────┴──────┴──────┴─────┘

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels May 14, 2024
Copy link

codspeed-hq bot commented May 14, 2024

CodSpeed Performance Report

Merging #16217 will not alter performance

Comparing alexander-beedie:index-column-expr (8fb7aa4) with main (253fd24)

Summary

✅ 34 untouched benchmarks

@alexander-beedie alexander-beedie force-pushed the index-column-expr branch 2 times, most recently from 95238fe to 9bcc84a Compare May 14, 2024 12:17
@alexander-beedie alexander-beedie changed the title feat: Add index position support to col, and a new index/range based selector cs.by_index feat: Add ordinal position support to col, and a new index/range based selector cs.by_index May 14, 2024
Copy link

codecov bot commented May 14, 2024

Codecov Report

Attention: Patch coverage is 91.30435% with 8 lines in your changes are missing coverage. Please review.

Project coverage is 80.83%. Comparing base (993cc99) to head (8fb7aa4).
Report is 10 commits behind head on main.

Files Patch % Lines
...tes/polars-plan/src/logical_plan/expr_expansion.rs 91.07% 5 Missing ⚠️
crates/polars-plan/src/dsl/expr.rs 0.00% 1 Missing ⚠️
crates/polars-plan/src/dsl/meta.rs 0.00% 1 Missing ⚠️
crates/polars-plan/src/logical_plan/format.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16217      +/-   ##
==========================================
- Coverage   80.99%   80.83%   -0.16%     
==========================================
  Files        1393     1394       +1     
  Lines      179445   180095     +650     
  Branches     2907     2913       +6     
==========================================
+ Hits       145335   145574     +239     
- Misses      33604    34018     +414     
+ Partials      506      503       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@stinodego
Copy link
Member

We have the nth function for this, right? I'd prefer not to overload the col function if there is no need for it.

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented May 14, 2024

We have the nth function for this, right? I'd prefer not to overload the col function if there is no need for it.

It's not the same -nth can only take a single index, which leads to a profusion of expression objects if you want to select more than a handful:

df.select(pl.col(*range(10)))
df.select(
    pl.nth(0), 
    pl.nth(1), 
    pl.nth(2), 
    pl.nth(3), 
    pl.nth(4), 
    pl.nth(5), 
    pl.nth(6), 
    pl.nth(7), 
    pl.nth(8), 
    pl.nth(9),
)

Could probably move the multi-index support into nth, though given that col already selects by one or more names/dtypes I don't see adding index as a stretch (there's no ambiguity, and it provides some consistency - you don't have to look in different places to select columns, you just use col with a suitable argument for each selection type).

Still, nth wouldn't be an unreasonable home for multi-index support (though arguably it's less discoverable), so... Strong opinion? Want me to switch it over?

Copy link
Member

@ritchie46 ritchie46 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool one! A few comments.

crates/polars-plan/src/dsl/expr.rs Outdated Show resolved Hide resolved
crates/polars-plan/src/logical_plan/expr_expansion.rs Outdated Show resolved Hide resolved
@alexander-beedie alexander-beedie added the A-selectors Area: column selectors label May 14, 2024
@alexander-beedie alexander-beedie force-pushed the index-column-expr branch 3 times, most recently from e8e51a6 to 5b869e5 Compare May 14, 2024 14:27
@stinodego
Copy link
Member

stinodego commented May 14, 2024

Could probably move the multi-index support into nth, though given that col already selects by one or more names/dtypes I don't see adding index as a stretch (there's no ambiguity, and it provides some consistency - you don't have to look in different places to select columns, you just use col with a suitable argument for each selection type).

Still, nth wouldn't be an unreasonable home for multi-index support (though arguably it's less discoverable), so... Strong opinion? Want me to switch it over?

I'm already not super happy with col taking data type inputs - I'd prefer if that were a different function. I don't think nth is a very descriptive name, but at least it has a dedicated purpose. Extending it with multi-column support makes a lot of sense.

We can add a See Also to col linking to nth if it doesn't do so already to help with discovery.

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented May 14, 2024

We can add a See Also to col linking to nth if it doesn't do so already to help with discovery.

Ok, will switch it over to nth in the next few hours✌️

(Update: @stinodego - done).

@alexander-beedie alexander-beedie force-pushed the index-column-expr branch 2 times, most recently from 0dabbb7 to 9c635ab Compare May 14, 2024 18:53
@alexander-beedie alexander-beedie changed the title feat: Add ordinal position support to col, and a new index/range based selector cs.by_index feat: Adds a new index/range based selector cs.by_index, and allows multiple indices for nth May 14, 2024
@alexander-beedie alexander-beedie changed the title feat: Adds a new index/range based selector cs.by_index, and allows multiple indices for nth feat: Adds a new index/range based selector cs.by_index, allows multiple indices for nth May 15, 2024
@alexander-beedie alexander-beedie changed the title feat: Adds a new index/range based selector cs.by_index, allows multiple indices for nth feat: Adds new index/range based selector cs.by_index, allows multiple indices for nth May 15, 2024
Copy link
Member

@stinodego stinodego left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice functionality to have! I left a few comments for the Python side.

py-polars/src/functions/lazy.rs Outdated Show resolved Hide resolved
py-polars/polars/selectors.py Show resolved Hide resolved
py-polars/polars/selectors.py Outdated Show resolved Hide resolved
py-polars/polars/selectors.py Outdated Show resolved Hide resolved
@@ -646,9 +646,9 @@ def last(*columns: str) -> Expr:
return F.col(*columns).last()


def nth(n: int, *columns: str) -> Expr:
def nth(n: int | Sequence[int], *columns: str) -> Expr:
Copy link
Member

@stinodego stinodego May 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pity we have this *columns API, because passing nth(1,2) would feel really good. I think we should deprecate this API (in another PR) and make this a keyword argument, so we can pass nth(1,2, from=['x', 'y', 'z'])

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking the same thing 👌 Can follow-up with a separate improvement / deprecation; I think there are a few other functions with a similar set of params, so we should check and do all at once, if so.

Copy link
Member

@stinodego stinodego left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All right, good to go! :shipit:

@alexander-beedie alexander-beedie merged commit a34ca2c into pola-rs:main May 15, 2024
27 checks passed
@alexander-beedie alexander-beedie deleted the index-column-expr branch May 15, 2024 13:59
@alexander-beedie alexander-beedie changed the title feat: Adds new index/range based selector cs.by_index, allows multiple indices for nth feat: Add new index/range based selector cs.by_index, allow multiple indices for nth May 15, 2024
MarcoGorelli pushed a commit to MarcoGorelli/polars that referenced this pull request May 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-selectors Area: column selectors enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants