Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Proposal] Complete rewrite based on savvy and rlang #1152

Open
eitsupi opened this issue Jun 27, 2024 · 9 comments
Open

[Proposal] Complete rewrite based on savvy and rlang #1152

eitsupi opened this issue Jun 27, 2024 · 9 comments
Labels
enhancement New feature or request
Milestone

Comments

@eitsupi
Copy link
Collaborator

eitsupi commented Jun 27, 2024

Continued from #942 and #1126

(The following text is copied from https://github.com/eitsupi/neo-r-polars's README)


Motivation

I have been developing r-polars for over a year, and I felt that a significant rewrite was necessary.
r-polars is a clone of py-polars,
but the package structure is currently quite different.
Therefore, it was difficult to keep up with frequent updates.

I thought that now, around the release of Python Polars 1.0.0, is a good time for a complete rewrite, so I decided to try it.

There are several reasons to rewrite r-polars on both the Rust and R sides.

Rust side

  1. Appropriate file division. Due to the limitations of extendr,
    it is not possible to place multiple impl blocks.
    (extendr/extendr#538)
  2. Error handling.
    There is a lot of custom code to use the Result type with extendr, which is quite different from other packages based on extendr.
    (extendr/extendr#650)
  3. Simplify type conversion.
    The code is difficult to follow because it uses a macro called robj_to for type conversion (at least in rust-analyzer).

About 1 and 2, I expect that switching from extendr to savvy
will improve the situation.

For 3, in py-polars and nodejs-polars, a thin Wrap struct wraps other types and processes them with standard From traits etc.,
which I think makes the code cleaner.

R side

  1. The structure of classes.
    In py-polars, the strategy is that classes defined on the Rust side (e.g., PyDataFrame) are wrapped by classes defined on the Python side (e.g., DataFrame).
    In r-polars, a complex strategy is adopted to update classes created by Rust side/extendr (e.g., RPolarsDataFrame) with a lot of custom code.
    (This is also related to the fact that extendr makes associated functions of Rust structs members of R classes. savvy does not mix associated functions and methods.)
  2. S3 methods first.
    This is also related to the Rust side, in the current r-polars, generic functions like as_polars_series were added later,
    so there are several places where type conversion from R to Polars is done on the Rust side, making it difficult to understand where the type conversion is done.
    If type conversion from R to Polars is done with two generic functions, as_polars_series and as_polars_expr, the code will be much simpler and customization from the R side will be possible.
  3. Error handling.
    Currently, r-polars has its own Result type on the R side, and error handling is done through it.
    The backtrace generated that is quite easy to understand, but it is not necessarily easy to use when using polars internally in other packages, such as testthat::expect_error().
  4. Based on rlang.
    Currently, r-polars has no R package dependencies. This is great,
    but that includes a degraded copy of list2()
    instead of the convenient functions in the rlang package.
    rlang is a lightweight R package, and I feel that it is more beneficial to depend on the convenient functions of rlang than to stick to no dependencies.

1 and 3 are also related to the fact that it is built with extendr, and it seems that switching to savvy is appropriate here as well.
If we abandon the current Result type on the R side, it is natural to use rlang for error handling, so from that perspective, it is reasonable to depend on rlang in 4.

@eitsupi eitsupi added the enhancement New feature or request label Jun 27, 2024
@eitsupi eitsupi pinned this issue Jun 27, 2024
@eitsupi eitsupi added this to the Rewrite milestone Jun 27, 2024
@etiennebacher
Copy link
Collaborator

Thanks for all the work, I'll take a look this weekend after #1147

@eitsupi
Copy link
Collaborator Author

eitsupi commented Jun 28, 2024

Significant changes that may affect users are noted in the NEWS file.
https://github.com/eitsupi/neo-r-polars/blob/60502b13e9c21ef38c733abc24477671e0e13a2a/NEWS.md

@eitsupi
Copy link
Collaborator Author

eitsupi commented Jun 30, 2024

I pushed the neo-r-polars' main branch to this repository so that we can start the migration work.
f7316d3

For now, the new next branch lacks developer tools and other settings, and switching to this branch in the local repository of r-polars should detect a large number of diffs due to differences in .gitignore, so I recommend that do not switch to the branch.

@eitsupi
Copy link
Collaborator Author

eitsupi commented Jul 4, 2024

Documentation for translation from Python Polars has been pushed.
I hope it comes across as fairly simple.
https://github.com/pola-rs/r-polars/blob/63790cd069cd736e650a5aabf7900f197490ece9/DEVELOPMENT.md

@etiennebacher
Copy link
Collaborator

Thanks, that is very helpful. I noticed that the class names have changed, e.g from RPolarsExpr to polars_expr. Is there any reason for this? I think the current class names are already clear and unique enough to avoid collisions with classes from other packages.

Similarly, the current Struct is called RPolarsExpr but in the rewrite it is PlRExpr. I like the fact that those names are identical on the R and Rust sides, is it possible to keep them like this?

@etiennebacher
Copy link
Collaborator

Also, related to the post on Mastodon asking for help, I think it would be helpful to have a checklist (either at the top of this issue or in a new one) with more details on the progress of the rewrite. It doesn't have to be super detailed but simply have an idea of how far we are, where should a new contributor focus, etc. For instance something like this:

  • DataFrame
    • translation
    • tests
  • LazyFrame
    • translation
    • tests
  • Expr
    • str
      • translation
      • tests
    • arr
    • ...

@eitsupi
Copy link
Collaborator Author

eitsupi commented Jul 5, 2024

Thanks for taking a look at this. I will create the task list at another epic issue soon as I can, as it is definitely needed.

I noticed that the class names have changed, e.g from RPolarsExpr to polars_expr. Is there any reason for this? I think the current class names are already clear and unique enough to avoid collisions with classes from other packages.

There are two reasons.

  1. To avoid conflicts between the current polars package and the neopolars package. This is also why the names of the structs on the Rust side is something like PlRExpr. This justification disappears when there is no longer a need for coexistence (i.e., when the neopolars package is renamed to polars).
  2. Because the class name of CamelCase is not considered to be very common in R.
    In other words, the current class names such as RPolarsExpr are a compromise due to the restrictions of the extendr that Rust's struct and R's classes cannot be named separately, and the need to avoid class name conflicts with other packages. But if class names can be given only on the R side I feel that the snake_case is more appropriate than the CamelCase.
    Of course, this is debatable.

Similarly, the current Struct is called RPolarsExpr but in the rewrite it is PlRExpr. I like the fact that those names are identical on the R and Rust sides, is it possible to keep them like this?

No, these are actually different things. That is, polars_expr and PlRExpr already exist on the R side, and PlRExpr is an internal class that is not normally touched by the user.
This is the same as Python Polars having two classes: the Expr class and the PyExpr class (from the Rust struct).

In Python:

>>> import polars as pl
>>> pl.lit(1).__class__.__name__
'Expr'
>>> pl.lit(1)._pyexpr.__class__.__name__
'PyExpr'

In R:

> pl$lit(1) |> class()
[1] "polars_expr"   "polars_object"

> pl$lit(1)$`_rexpr` |> class()
[1] "PlRExpr"

Unlike the current RPolarsExpr, PlRExpr is an internal class, so I have tentatively named it PlRExpr because we think it is safe even if it does not contain the word "polars" (although it is an internal class, it is not without risk of collision, so I have not made it RExpr).

@etiennebacher
Copy link
Collaborator

I see, thanks for the explanations.

There are two reasons.

  1. To avoid conflicts between the current polars package and the neopolars package. This is also why the names of the structs on the Rust side is something like PlRExpr. This justification disappears when there is no longer a need for coexistence (i.e., when the neopolars package is renamed to polars).
  2. Because the class name of CamelCase is not considered to be very common in R.
    In other words, the current class names such as RPolarsExpr are a compromise due to the restrictions of the extendr that Rust's struct and R's classes cannot be named separately, and the need to avoid class name conflicts with other packages. But if class names can be given only on the R side I feel that the snake_case is more appropriate than the CamelCase.
    Of course, this is debatable.

I don't think the style of class names is important. To me, the two main requirements is that they must be clear and unique, and I think RPolarsExpr and variants match those requirements. We already break a lot of code at each release, I'd rather not break the class names in addition to that.

Bottom line, I'd rather keep the current RPolars* names. Open to discuss too ofc.


Sidenote: I didn't find conventions for class names (in general, I feel R code lacks conventions anyway). I also didn't find packages with CamelCase class names (can't say I searched a lot though) but some prominent packages do not use snake_case:

class(Matrix::Matrix(0, 3, 2))
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"

Packages using S4 (like those on BioConductor) also usually have camel case.

@eitsupi
Copy link
Collaborator Author

eitsupi commented Jul 7, 2024

I didn't find conventions for class names (in general, I feel R code lacks conventions anyway).

I think the most commonly used style guides are the Tidyverse style guide and Google's Style guide (of course Google publishes Style guides for many languages and I believe they are very popular).

https://style.tidyverse.org/syntax.html#object-names
https://google.github.io/styleguide/Rguide.html

Both of these encourage the use of snake_case for class names.

@etiennebacher etiennebacher changed the title [Proposal] Completely rewritten based on savvy and rlang [Proposal] Complete rewrite based on savvy and rlang Aug 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants