Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tidy + optimize pl$DataFrame, pl$Series #385

Merged
merged 8 commits into from
Oct 2, 2023
Merged

Conversation

sorhawell
Copy link
Collaborator

@sorhawell sorhawell commented Sep 12, 2023

summary of refactoring:

  • remove internal is_DataFrame_data_input() guard from pl$DataFrame. Reason: not needed, error msg further deep in conversions are better. The guard could also prevent some valid conversion some R class which is not a list or a vector.

  • remove pl$DataFrame dead args parallel and via_select. These are not documented and does not do anything, and should have been removed with a previous PR. I would argue this should not count as breaking change to remove some exotic dead args.

  • add result_minimal, as_polars_series uses an error to signal internally no impl was found for a given R class or it does not work. result_minimal is light weight version of result which skips all the error upgrading/conversion to signal this.

  • raise any other error from as_polars_series which was not due to missing implementation.

  • pl$Series drop not needed guard if (inherits(x, "Series")) {... internal conversion will do this also anyways. Remove convert_to_fewer_types() preprocessing which is now covered by as_polars_series.POSIXlt.

This is a follow up to PR #369 that brought optional support for vctrs or any other classed R object via the s3 method as_polars_seri.

When running with debug I see the implementation calls methods left and right and take 1ms! per column. Really only an issue for very wide tables. However, I propose a cleanup of the implementation which also is a bit faster .25ms per small column or so. I have also removed some old code not needed anymore in pl$DataFrame

 > pl$options$debug_polars(TRUE)
 > df = do.call(cbind,lapply(1:100, \(i_) mtcars))
 > dim(df)
 [1]   32 1100
 > system.time(pl$DataFrame(df))
[TIME? ms]
pl$DataFrame() -> [0.797ms]
pl$lit() -> [4.627ms]
  RPolarsErr$plain() -> [0.823ms]
   .pr$RPolarsErr$new() -> [0.159ms]
RPolarsErr$plain() -> [0.7918ms]
.
. {~1100 repeats...}
.
   .pr$RPolarsErr$new() -> [0.149ms]
RPolarsErr$plain() -> [1.21ms]
   .pr$RPolarsErr$new() -> [0.262ms]
RPolarsErr$plain() -> [0.2382ms]
pl$select() -> [0.7279ms]
   .pr$DataFrame$select() ->    user  system elapsed 
  1.205   0.040   1.546 
 
 #with out debug print
> pl$options() -> > pl$options$debug_polars(FALSE)
[1.562ms]
pl$options() -> [0.3068ms]
pl$set_polars_options() -> 
> system.time(pl$DataFrame(df))
   user  system elapsed 
  0.808   0.013   0.825   

and after

> system.time(pl$DataFrame(df))
[9284ms]
pl$DataFrame() -> [9.998ms]
pl$lit() -> [122.3ms]
pl$select() -> [0.5779ms]
   .pr$DataFrame$select() ->    user  system elapsed 
  0.148   0.005   0.149 

Close #336

@sorhawell sorhawell changed the title tidy + optimize pl$DataFrame, pl$Series WIP: tidy + optimize pl$DataFrame, pl$Series Sep 12, 2023
Copy link
Collaborator

@etiennebacher etiennebacher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, always nice to see the internals become less complicated 😄

R/dataframe__frame.R Outdated Show resolved Hide resolved
@etiennebacher
Copy link
Collaborator

etiennebacher commented Sep 13, 2023

While you're refactoring pl$DataFrame(), do you want to address #336 as well or leave it for a future PR?

@sorhawell
Copy link
Collaborator Author

sorhawell commented Sep 14, 2023

it could be some syntactic sugar for $select(pl$col("col1")$cast(new_dtype_1), ... ,... ) that would be ok simple.

If to support direct conversion from any R type to any polars type. That would be a big rework. Since casting is faster than R-polars conversion, it is probably not worth it.

@etiennebacher
Copy link
Collaborator

@sorhawell I took a stab at implementing the arg schema in pl$DataFrame(), I hope it's fine to do it here. If you agree with these changes, I can implement it in LazyFrame as well

@sorhawell sorhawell changed the title WIP: tidy + optimize pl$DataFrame, pl$Series tidy + optimize pl$DataFrame, pl$Series Sep 30, 2023
@sorhawell sorhawell marked this pull request as ready for review September 30, 2023 08:42
Copy link
Collaborator

@etiennebacher etiennebacher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I forgot LazyFrame() was just a simple wrapper. I added a few tests and just waiting for CI to pass, thanks @sorhawell!

@etiennebacher etiennebacher merged commit 68ca441 into main Oct 2, 2023
11 checks passed
@etiennebacher etiennebacher deleted the result_minimal branch October 2, 2023 12:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

How to specify dtypes when I create a DataFrame from scratch ?
2 participants