S3 methods for DataFrame and LazyFrame: simplest #107

Merged (13 commits) on Apr 16, 2023

@vincentarelbundock vincentarelbundock commented Apr 12, 2023

Issue #104

Notes:

  • S3 methods have @noRd tags because I did not want to pollute the manual. But for some reason skipping nrow and ncol led check() to issue a warning, so I included minimal docs for just those two.

TODO in a different PR:

  • brackets – lots of work!
  • is.null (new method needed)
  • na.omit (new method needed)
  • colnames (new method needed)

This PR implements a few of the main base R S3 methods for DataFrame and LazyFrame objects. For example:

library(rpolars)
d = pl$DataFrame(mtcars)
dl = pl$DataFrame(mtcars)$lazy()

dim(d)
# [1] 32 11

head(dl, 3)
# [1] "polars LazyFrame naive plan: (run ldf$describe_optimized_plan() to see the optimized plan)"
#   SLICE[offset: 0, len: 3]
#     DF ["mpg", "cyl", "disp", "hp"]; PROJECT */11 COLUMNS; SELECTION: "None"

names(d)
#  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
# [11] "carb"

nrow(dl)
# [1] 32

ncol(dl)
# [1] 11

as.matrix(d) |> str()
#  num [1:32, 1:11] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#  - attr(*, "dimnames")=List of 2
#   ..$ : NULL
#   ..$ : chr [1:11] "mpg" "cyl" "disp" "hp" ...
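
For readers unfamiliar with S3 dispatch, here is a minimal sketch of the mechanism in plain base R. It does not use polars at all: `ToyFrame` and `new_toy_frame()` are hypothetical names invented for this illustration, and the real methods in this PR may be written quite differently. In this toy, `nrow()` and `ncol()` need no methods of their own because base R defines them in terms of `dim()`.

```r
# Hypothetical toy class wrapping a data.frame; NOT the polars implementation.
new_toy_frame <- function(data) {
  stopifnot(is.data.frame(data))
  structure(list(data = data), class = "ToyFrame")
}

# S3 methods: base R generics dispatch on the class attribute.
dim.ToyFrame <- function(x) dim(unclass(x)$data)
names.ToyFrame <- function(x) names(unclass(x)$data)
head.ToyFrame <- function(x, n = 6L, ...) {
  new_toy_frame(utils::head(unclass(x)$data, n, ...))
}
as.matrix.ToyFrame <- function(x, ...) as.matrix(unclass(x)$data, ...)

d <- new_toy_frame(mtcars)

dim(d)             # dispatches to dim.ToyFrame
nrow(d)            # base nrow() is dim(x)[1L], so it works via dim.ToyFrame
names(head(d, 3))  # head() returns another ToyFrame
```

The point of the sketch is only that a user can keep calling familiar functions such as `dim()` and `head()` while the class decides what they mean.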

@vincentarelbundock vincentarelbundock marked this pull request as draft April 12, 2023 15:58
@sorhawell
Collaborator

implement attributes and methods instead of LazyFrame()$collect()

I think nrow, for example, cannot be inferred without collect. However, it could be implemented so that only the count is returned, something like the call below. Polars will then optimize the rest away if possible.

x$count()$collect()

@vincentarelbundock vincentarelbundock changed the title [WIP] S3 methods for DataFrame and LazyFrame: simplest S3 methods for DataFrame and LazyFrame: simplest Apr 15, 2023
@vincentarelbundock vincentarelbundock marked this pull request as ready for review April 15, 2023 11:56
@vincentarelbundock
Collaborator Author

I have completed all I planned to achieve in this PR, so feel free to take a look when you find some time. As always, I'm happy to make any changes you'd like.

The cool thing about my two recent PRs is that they allowed me to considerably simplify the Get Started vignette (@grantmcdermott may want to take a look). Here's a rendered version for your convenience:

https://arelbundock.com/polars.html

Here are the main things I changed in the vignette:

  • Added a couple examples of standard base R functions like head(df).
  • The first Polars methods users now encounter are the ultra-simple and "clean" ones that we just merged, like dat$tail() and dat$mean().
  • Introduced method chaining using those same ultra-simple methods: dat$tail(10)$mean(). I used this example to replace the old Hello World, which was cute and funny but overly complicated. I must admit that the first time I saw it, I had a pretty bad reaction to the ugly syntax. In editing, I thought it best for the first examples to be simple and actually useful.
  • Pushed the discussion of data types to the end. It is fundamental and useful, but "Get Started" is also a "sales pitch", and I wanted to give users a payoff as fast as possible.

@sorhawell
Collaborator

sorhawell commented Apr 15, 2023

For what it’s worth, if we didn’t care about overwriting the columns, the previous query could have been written more concisely as:

"For what it’s worth, the previous query could have been written more concisely as"

dat$with_columns(
  pl$col(c("mpg", "hp"))$sum()$over("cyl")$prefix("sum_")
)

# BTW, the following are also possible but not fully stabilized, so I do not recommend them in the tutorial

# currently has a thread-safety user warning, but that could be removed now.
dat$with_columns(
  pl$col(c("mpg", "hp"))$sum()$over("cyl")$map_alias(\(x) paste0("sum_",x))
)

# has not stabilized because it does not work well with wildcards like $all()
pl$set_polars_options(named_exprs = TRUE) # must be activated like this first
dat$with_columns(
  sum_mpg = pl$col("mpg")$sum()$over("cyl"),
  sum_hp  = pl$col("hp")$sum()$over("cyl")
)


# could be fun to also allow this :) as an opt-in feature. All column names would be converted to `pl$col(x)` and be in scope of the method. Not sure it would work well for the lazy API, though.

dat$with_columns(
  sum_mpg = mpg$sum()$over(cyl),
  sum_hp  = hp$sum()$over(cyl)
)
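
As a rough illustration of the "column names in scope" idea, here is a base R analogue that evaluates named expressions with the columns of a plain data.frame visible as variables. It is only a sketch under stated assumptions: `with_columns_nse()` is a hypothetical helper invented here, it works on data.frames rather than polars objects, and `ave()` stands in for the grouped sum that `$sum()$over("cyl")` expresses above.

```r
# Hypothetical helper, base R only; not part of polars.
# Each named argument is captured unevaluated and then evaluated with the
# data.frame's columns in scope, falling back to the caller's environment.
with_columns_nse <- function(df, ...) {
  exprs <- as.list(substitute(list(...)))[-1L]
  stopifnot(length(exprs) > 0L,
            !is.null(names(exprs)),
            all(nzchar(names(exprs))))
  caller <- parent.frame()
  for (nm in names(exprs)) {
    df[[nm]] <- eval(exprs[[nm]], envir = df, enclos = caller)
  }
  df
}

out <- with_columns_nse(
  mtcars,
  sum_mpg = ave(mpg, cyl, FUN = sum),  # per-cyl sum, repeated on each row
  sum_hp  = ave(hp, cyl, FUN = sum)
)

head(out[, c("cyl", "mpg", "sum_mpg", "hp", "sum_hp")])
```

Because the expressions are evaluated eagerly against data that already exists, this toy sidesteps the open question raised above about how such scoping would interact with a lazy API.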

@sorhawell
Collaborator

Halfway through the revised tutorial. It has really been a pleasure to read. I agree with your changes. I will add a few syntax edits.

@grantmcdermott
Collaborator

Halfway through the revised tutorial. It has really been a pleasure to read. I agree with your changes. I will add a few syntax edits.

Agree, lots to like here. Will leave a few comments for you to think about.

Collaborator

@grantmcdermott grantmcdermott left a comment

Lots to like here @vincentarelbundock.

One thing I haven't mentioned as a specific comment, but which I think is good for consistency, is sticking to the hard 80-character wrapping. I know that you're a fellow vim user and this is the hook I always use ;-)

vignettes/polars.Rmd
and/or regular R vectors.
## Methods and pipelines

Although some simple R functions work out of the box on **polars** objects, the full power of Polars is realized via _methods_. Polars methods are accessed using the `$` syntax. For example, to convert Polars `Series` and `DataFrames` back to standard R objects, we use the `$to_r_vector()` and `$as_data_frame()` methods:
Collaborator

It's been a while since I tried this, but is as_data_frame still a tibble-exported (or maybe arrow-exported) function? Just worth thinking ahead to whether users might run into a NAMESPACE clash.

Collaborator Author

There's no clash between exported functions and polars methods, though, right? This PR also adds a proper as.data.frame() S3 method for polars objects, which just calls df$as_data_frame() under the hood.
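
To make the reasoning concrete, here is a small base R sketch of why a `$`-accessed method cannot collide with a same-named exported function, and of an S3 method that simply delegates to it. Everything here is hypothetical: `ToyEnvFrame` and `new_toy_env_frame()` are invented for illustration and are not how polars objects are built.

```r
# Hypothetical toy object: an environment carrying its own methods.
new_toy_env_frame <- function(data) {
  self <- new.env()
  self$as_data_frame <- function() data  # lives on the object, not the search path
  class(self) <- "ToyEnvFrame"
  self
}

# S3 method that just delegates to the object's own `$` method.
as.data.frame.ToyEnvFrame <- function(x, ...) x$as_data_frame()

# A same-named free function, standing in for one exported by another package.
as_data_frame <- function(x) "an unrelated exported function"

toy <- new_toy_env_frame(mtcars)

identical(as.data.frame(toy), mtcars)  # S3 route
identical(toy$as_data_frame(), mtcars) # `$` route, unaffected by the free function
as_data_frame(toy)                     # the free function is a separate binding
```

The `$` lookup happens inside the object, so it never consults the search path where package exports live.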

Collaborator

Cool cool. Just wanted to check.

@vincentarelbundock
Collaborator Author

FYI, I believe I have addressed all the points raised in review so far. Feel free to make more requests if needed.

@sorhawell sorhawell merged commit e9d98de into pola-rs:main Apr 16, 2023
@sorhawell
Collaborator

Many thanks @vincentarelbundock :)

@vincentarelbundock vincentarelbundock deleted the s3methods branch April 23, 2023 12:24