Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add $ungroup() for GroupBy and LazyGroupBy #522

Merged
merged 7 commits into from
Nov 20, 2023

Conversation

eitsupi
Copy link
Collaborator

@eitsupi eitsupi commented Nov 18, 2023

  • Allow GroupBy and LazyGroupBy to be ungrouped
  • Fix as_polars_df.GroupBy

@eitsupi
Copy link
Collaborator Author

eitsupi commented Nov 18, 2023

If Rust Polars were to add features like pola-rs/polars#12562, this workaround would no longer be necessary.

@sorhawell
Copy link
Collaborator

Looks good to me from my phone. I will review tonight.

@sorhawell
Copy link
Collaborator

sorhawell commented Nov 19, 2023

We need to call clone on ExtPtr's if modifying the attributes(), otherwise we break the immutable behavior of the public polars api.

I will add the line.

library(polars)

# immutability except for environment(s), this is what we are used to in R.
x = list("foo") # x is immutable
tracemem(x)
#> [1] "<0x7fb895761c90>"
class(x) = "bar" # if writing to x, a clone will defacto be made
#> tracemem[0x7fb895761c90 -> 0x7fb895826f28]: eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch
y = x # y is a shallow copy of x
class(x) = "foobar" # if write to x, a new clone is made and y is untouched
#> tracemem[0x7fb895826f28 -> 0x7fb8958cc940]: eval eval eval_with_user_handlers withVisible withCallingHandlers doTryCatch tryCatchOne tryCatchList tryCatch try handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers withCallingHandlers process_file <Anonymous> <Anonymous> <Anonymous> <Anonymous> do.call saveRDS withCallingHandlers doTryCatch tryCatchOne tryCatchList doTryCatch tryCatchOne tryCatchList tryCatch
class(x)
#> [1] "foobar"
class(y)
#> [1] "bar"

# but this is not the case for ExtPtr. 
gbx = pl$DataFrame(mtcars)$group_by("cyl")
tracemem(gbx) # not allowed
#> Error in tracemem(gbx): 'tracemem' is not useful for weak reference or external pointer objects
polars:::mem_address(gbx)
#> [1] "0x48c7c243780"
gby = gbx # some user makes a copy and assumes gbx is immutable


# user will call some ungroup method, all seems fine
class(gbx) = "DataFrame"
attr(gbx, "private") = NULL
attributes(gbx)
#> $class
#> [1] "DataFrame"

# and for the surprise behold!
attributes(gby) # attributes to an SEXP ExtPtr are mutable!! dfy was also ungrouped
#> $class
#> [1] "DataFrame"
# I was shocked about this, but came around to believe it is for the better.

# R cannot automatically clone the object ExtPtr is pointing to.
polars:::mem_address(gby)
#> [1] "0x48c7c243780"

# A ExtPtr is one SEXP where zero or more R variables can point to. If zero variables point to one
#ExtPtr then the next gc() will release the ExtPtr.

# The first time and only time the ExtPtr is released it will call desctructor/drop method across
# the foreign interface.

# My take on this is ExtPtr are singletons and are mutable by nature and attributes are associated
# with the SEXP not with the variable. We can disguise this fact andmake ExtPtr appear immutable.
# This is in the DNA of polars api, so it is not much effort.

# any method in the api called "mut" or "in_place" breaks this rule immutability


# ok one more time
gbx = pl$DataFrame(mtcars)$group_by("cyl")
polars:::mem_address(gbx)
#> [1] "0x48c7c242b20"
gby = gbx # some user makes a copy and assumes gbx is immutable


# clone before modifying attributes
gbx = .pr$DataFrame$clone_see_me_macro(gbx)
# this is a new ExtPtr pointing to another DataFrame on rust side
# but in the end it is all pointing to the same Arrow arrays, so cloning is still cheap.
polars:::mem_address(gbx) 
#> [1] "0x48c7c242ba0"
class(gbx) = "DataFrame"
attr(gbx, "private") = NULL
attributes(gbx)
#> $class
#> [1] "DataFrame"

# gby was not affacted
polars:::mem_address(gby) 
#> [1] "0x48c7c242b20"
attributes(gby)
#> $class
#> [1] "GroupBy"
#> 
#> $private
#> $private$groupby_input
#> $private$groupby_input[[1]]
#> [1] "cyl"
#> 
#> 
#> $private$maintain_order
#> [1] FALSE

Created on 2023-11-19 with reprex v2.0.2

@sorhawell

This comment was marked as outdated.

Copy link
Collaborator

@sorhawell sorhawell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eitsupi Me and chatGPT struggled with permissions to push directly to this PR originating in your fork. I have a branch here with suggested changes 67fc0ed

only changes to GroupBy_ungroup() method are mandotory

@eitsupi
Copy link
Collaborator Author

eitsupi commented Nov 19, 2023

You may want to do git push git@github.com:username/projectName.git
See cli/cli#2189

@sorhawell
Copy link
Collaborator

You may want to do git push git@github.com:username/projectName.git See cli/cli#2189

git push git@github.com:eitsupi/r-polars.git
Enumerating objects: 37, done.
Counting objects: 100% (37/37), done.
Delta compression using up to 8 threads
Compressing objects: 100% (19/19), done.
Writing objects: 100% (19/19), 1.95 KiB | 46.00 KiB/s, done.
Total 19 (delta 17), reused 0 (delta 0)
remote: Resolving deltas: 100% (17/17), completed with 17 local objects.
To github.com:eitsupi/r-polars.git
 ! [remote rejected] ungroup2 -> ungroup2 (permission denied)
error: failed to push some refs to 'git@github.com:eitsupi/r-polars.git'

@eitsupi
Copy link
Collaborator Author

eitsupi commented Nov 19, 2023

I think branch name is not correct.

@eitsupi
Copy link
Collaborator Author

eitsupi commented Nov 19, 2023

Pls read the error.
There in no ungroup2 branch on the remote.

@sorhawell
Copy link
Collaborator

@eitsupi I was honestly reading the error msgs :/ I didn't know the meaning of -> operator in git error messages.

But thank you for spelling that out

git push git@github.com:eitsupi/r-polars.git ungroup2:lazyframe-ungroup

@eitsupi
Copy link
Collaborator Author

eitsupi commented Nov 20, 2023

Thanks for the update!

@eitsupi eitsupi merged commit a4d3772 into pola-rs:main Nov 20, 2023
23 checks passed
@eitsupi eitsupi deleted the lazyframe-ungroup branch November 20, 2023 09:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants