Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a separate class RollingGroupBy #694

Merged
merged 9 commits into from
Jan 13, 2024
Merged

Conversation

etiennebacher
Copy link
Collaborator

Split the GroupBy and RollingGroupBy implementations. This will make it easier to handle the variants of GroupBy (like DynamicGroupBy in #691).

Note that <LazyFrame>$rolling() directly returns a LazyGroupBy object from the Rust method. Should we manually change the class to LazyRollingGroupBy? I don't really see the need for this since the "internals are opaque", but maybe you have an idea about that @eitsupi ?

To be merged before #691

Copy link
Collaborator

@eitsupi eitsupi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, could you add as_polars_df.RPolarsRollingGroupBy?

R/as_polars.R Outdated Show resolved Hide resolved
@eitsupi
Copy link
Collaborator

eitsupi commented Jan 13, 2024

I was thinking that it would be better to make the contents of GroupBy completely LazyGroupBy and call $collect() at the end, but is there any reason why that would not work?

Co-authored-by: eitsupi <50911393+eitsupi@users.noreply.github.com>
@etiennebacher
Copy link
Collaborator Author

I think the way I did it here matches quite well the python implementation: https://github.com/pola-rs/polars/blob/837ff46d2fb7435f3110e15b025af985bb5b6c8a/py-polars/polars/dataframe/group_by.py#L35

Since LazyGroupBy doesn't show the internals when we print it, we still need to have a GroupBy class for the eager implementation, where we can store the info about groups and other arguments. I'm not sure I understand what you have in mind.

@eitsupi
Copy link
Collaborator

eitsupi commented Jan 13, 2024

Since LazyGroupBy doesn't show the internals when we print it, we still need to have a GroupBy class for the eager implementation, where we can store the info about groups and other arguments.

I don't think it is a big deal because basically the GroupBy class object is not intended to be used alone.

@etiennebacher
Copy link
Collaborator Author

We need to store the info (e.g groups) for when we print a GroupBy object

@eitsupi
Copy link
Collaborator

eitsupi commented Jan 13, 2024

We need to store the info (e.g groups) for when we print a GroupBy object

Frankly, I don't see the need for this - could there be a situation where a GroupBy object could be printed as is without aggregation?
This is an "Utility class" and it is quite difficult to envision how it could be used to exist by itself for an extended period of time.
The fact that it has far fewer methods than DataFrame or LazyFrame is also indicative of this.

@etiennebacher
Copy link
Collaborator Author

etiennebacher commented Jan 13, 2024

Same argument for $ungroup() then: is there a situation where the user groups the data but doesn't use $agg() or another method right after? Since $agg() returns a Data/LazyFrame, we could remove the $ungroup() method. But since it's already there, I don't really see the point

@eitsupi
Copy link
Collaborator

eitsupi commented Jan 13, 2024

Same argument for $ungroup() then: is there a situation where the user groups the data but doesn't use $agg() or another method right after? Since $agg() returns a Data/LazyFrame, we could remove the $ungroup() method. But since it's already there, I don't really see the point

In fact ungroup does not exist in Rust and Python. I too think this can be removed.
I just did it because I could implement it.

My basic idea is that there is no need for there to be a GroupBy object itself (for the same reason that Hadley regrets implementing dplyr's group_by as a stand-alone function).
But in fact, the GroupBy object has been present in Polars (probably because it mimics Spark's API), so it is inevitable that it is present here.
However, it should not be recommended to be used as printed, and it is not worth complicating the implementation to support it.

@etiennebacher
Copy link
Collaborator Author

Agreed on this. I thought maybe I was missing something when $ungroup() was implemented. I'll remove it, the GroupBy print method, and simplify the rest of the PR

@eitsupi
Copy link
Collaborator

eitsupi commented Jan 13, 2024

I think ungroup is basically unnecessary, but since the operation is irreversible when GroupBy is created, there is a slight possibility that it could be used like undo if the user accidentally presses Enter in an interactive session.
On the other hand, I think that print has almost no problem even if group is not displayed in the interactive session because the code that created the GroupBy exists in the history.

@eitsupi
Copy link
Collaborator

eitsupi commented Jan 13, 2024

@etiennebacher Sorry, I was completely mistaken. I read the source code you linked to and Python has RollingGroupBy and DynamicGroupBy separately.
I don't know this and RollingGroupBy is not even in the API reference....
Since these classes actually have different methods than GroupBy, it would make perfect sense to implement them as separate classes. My apologies.

@etiennebacher
Copy link
Collaborator Author

Oh I thought the discussion above was simply about the GroupBy - LazyGroupBy interaction (letting aside the Rolling and Dynamic variants). I should have clarified a bit more in the initial post. Thanks ;)

@etiennebacher etiennebacher merged commit 3545ee2 into main Jan 13, 2024
17 checks passed
@etiennebacher etiennebacher deleted the separate-groupby-variants branch January 13, 2024 15:35
@eitsupi
Copy link
Collaborator

eitsupi commented Jan 13, 2024

Yes, I had assumed that only the GroupBy class existed in Python and was concerned that a completely unique class would be introduced here. But in fact this is a copy of Python, so the implementation makes sense. (I'm not a big fan of the way Polars does aggregate calculations, which is reinforced now that I know that LazyFrame and DataFrame have different classes returned by rolling, but that's just my preference).

@eitsupi
Copy link
Collaborator

eitsupi commented Jan 13, 2024

While reading the Rust source code, I noticed that aggregate calculation with GroupBy has been deprecated for years (pola-rs/polars#4946).
Perhaps at some point, given the current complexity, we may as well remove GroupBy and use only LazyGroupBy.
https://docs.rs/polars/0.36.2/polars/prelude/struct.GroupBy.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants