Add stable Expr.top_k #16596

Open
MarcoGorelli opened this issue May 30, 2024 · 0 comments
Labels
A-ops (Area: operations) · accepted (Ready for implementation) · enhancement (New feature or an improvement of an existing feature)

Comments

@MarcoGorelli
Collaborator

Description

In #10054, there was a request for a way to answer the query:

For each group 'd', find the rows corresponding to the top k values from column 'b'

One possible API could have been: df.top_k(k=k, by='b', group_by='d'), or, as the OP suggested, df.group_by('d').top_k(k=k, by='b').

The response was that

df.group_by('d').agg(pl.all().top_k(k=1, by='b'))

is enough, and that's what the top_k docs currently suggest.

However, since top_k gives no ordering guarantee, ties in the by column can cause this solution to produce rows that never appeared in the original dataframe: #10054 (comment)
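To make the risk concrete, here is a toy model in plain Python. It is not Polars code and makes no claim about Polars internals; it only assumes, for illustration, that each column is selected independently and that a tie in 'b' happens to be broken differently for two columns. The helper pick_top1 and the sample data are hypothetical.

```python
# Toy model only: plain Python, not how Polars is implemented.
rows = [
    {"d": "x", "b": 5, "a": 1, "c": "p"},
    {"d": "x", "b": 5, "a": 2, "c": "q"},  # ties with the first row on "b"
]


def pick_top1(values, by, prefer_last):
    """Hypothetical helper: value at the largest `by` key; a tie goes to the
    first or the last tied index depending on `prefer_last`."""
    best = max(by)
    tied = [i for i, key in enumerate(by) if key == best]
    return values[tied[-1] if prefer_last else tied[0]]


by = [r["b"] for r in rows]

# Assumption: columns are picked one at a time, and the tie is resolved
# one way for "a" and the other way for "c".
mixed = {
    "d": pick_top1([r["d"] for r in rows], by, prefer_last=False),
    "b": pick_top1([r["b"] for r in rows], by, prefer_last=False),
    "a": pick_top1([r["a"] for r in rows], by, prefer_last=False),
    "c": pick_top1([r["c"] for r in rows], by, prefer_last=True),
}

# The assembled "row" pairs a=1 with c="q", a combination that is in no input row.
row_exists = mixed in rows
```

Under these assumptions row_exists is False, which is the kind of result the issue warns about.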

This was discussed in #15238, and the suggestion is now to introduce a stable Expr.top_k. This would solve the original issue.
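As a rough sketch of the semantics a stable top-k would give for the original query, the plain-Python function below selects whole rows per group and breaks ties by original row order. It is not the proposed Polars API; the name stable_top_k_per_group, its parameters, and the sample data are hypothetical and only meant to illustrate the idea.

```python
from collections import defaultdict


def stable_top_k_per_group(rows, k, by, group_by):
    """Hypothetical helper: for each group, return the k rows with the
    largest `by` value, keeping tied rows in their original order."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by]].append(row)

    result = {}
    for key, members in groups.items():
        # sorted() is stable, also with reverse=True, so rows with equal
        # `by` values keep their original relative order.
        result[key] = sorted(members, key=lambda r: r[by], reverse=True)[:k]
    return result


rows = [
    {"d": "x", "b": 5, "a": 1},
    {"d": "x", "b": 5, "a": 2},  # ties with the row above on "b"
    {"d": "x", "b": 3, "a": 3},
    {"d": "y", "b": 7, "a": 4},
]

top = stable_top_k_per_group(rows, k=1, by="b", group_by="d")
```

Because whole rows are moved together and ties fall back to input order, every returned row is one that exists in the input, and repeated runs give the same answer.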

MarcoGorelli added the enhancement label May 30, 2024
stinodego added the accepted and A-ops labels May 30, 2024