tidy + optimize pl$DataFrame, pl$Series #385

sorhawell · 2023-09-12T21:12:14Z

summary of refactoring:

remove internal is_DataFrame_data_input() guard from pl$DataFrame. Reason: not needed, error msg further deep in conversions are better. The guard could also prevent some valid conversion some R class which is not a list or a vector.
remove pl$DataFrame dead args parallel and via_select. These are not documented and does not do anything, and should have been removed with a previous PR. I would argue this should not count as breaking change to remove some exotic dead args.
add result_minimal, as_polars_series uses an error to signal internally no impl was found for a given R class or it does not work. result_minimal is light weight version of result which skips all the error upgrading/conversion to signal this.
raise any other error from as_polars_series which was not due to missing implementation.
pl$Series drop not needed guard if (inherits(x, "Series")) {... internal conversion will do this also anyways. Remove convert_to_fewer_types() preprocessing which is now covered by as_polars_series.POSIXlt.

This is a follow up to PR #369 that brought optional support for vctrs or any other classed R object via the s3 method as_polars_seri.

When running with debug I see the implementation calls methods left and right and take 1ms! per column. Really only an issue for very wide tables. However, I propose a cleanup of the implementation which also is a bit faster .25ms per small column or so. I have also removed some old code not needed anymore in pl$DataFrame

 > pl$options$debug_polars(TRUE)
 > df = do.call(cbind,lapply(1:100, \(i_) mtcars))
 > dim(df)
 [1]   32 1100
 > system.time(pl$DataFrame(df))
[TIME? ms]
pl$DataFrame() -> [0.797ms]
pl$lit() -> [4.627ms]
  RPolarsErr$plain() -> [0.823ms]
   .pr$RPolarsErr$new() -> [0.159ms]
RPolarsErr$plain() -> [0.7918ms]
.
. {~1100 repeats...}
.
   .pr$RPolarsErr$new() -> [0.149ms]
RPolarsErr$plain() -> [1.21ms]
   .pr$RPolarsErr$new() -> [0.262ms]
RPolarsErr$plain() -> [0.2382ms]
pl$select() -> [0.7279ms]
   .pr$DataFrame$select() ->    user  system elapsed 
  1.205   0.040   1.546 
 
 #with out debug print
> pl$options() -> > pl$options$debug_polars(FALSE)
[1.562ms]
pl$options() -> [0.3068ms]
pl$set_polars_options() -> 
> system.time(pl$DataFrame(df))
   user  system elapsed 
  0.808   0.013   0.825

and after

> system.time(pl$DataFrame(df))
[9284ms]
pl$DataFrame() -> [9.998ms]
pl$lit() -> [122.3ms]
pl$select() -> [0.5779ms]
   .pr$DataFrame$select() ->    user  system elapsed 
  0.148   0.005   0.149

Close #336

etiennebacher

Cool, always nice to see the internals become less complicated 😄

R/dataframe__frame.R

etiennebacher · 2023-09-13T20:41:44Z

While you're refactoring pl$DataFrame(), do you want to address #336 as well or leave it for a future PR?

sorhawell · 2023-09-14T13:39:08Z

it could be some syntactic sugar for $select(pl$col("col1")$cast(new_dtype_1), ... ,... ) that would be ok simple.

If to support direct conversion from any R type to any polars type. That would be a big rework. Since casting is faster than R-polars conversion, it is probably not worth it.

etiennebacher · 2023-09-29T17:51:59Z

@sorhawell I took a stab at implementing the arg schema in pl$DataFrame(), I hope it's fine to do it here. If you agree with these changes, I can implement it in LazyFrame as well

…into result_minimal

etiennebacher

Actually, I forgot LazyFrame() was just a simple wrapper. I added a few tests and just waiting for CI to pass, thanks @sorhawell!

tidy + optimize pl$DataFrame, pl$Series

a68382f

sorhawell changed the title ~~tidy + optimize pl$DataFrame, pl$Series~~ WIP: tidy + optimize pl$DataFrame, pl$Series Sep 12, 2023

etiennebacher reviewed Sep 13, 2023

View reviewed changes

R/dataframe__frame.R Outdated Show resolved Hide resolved

etiennebacher added 2 commits September 29, 2023 19:49

implement a schema arg

b47aadd

Merge branch 'main' into result_minimal

5357074

Merge branch 'main' into result_minimal

bfe95c2

sorhawell changed the title ~~WIP: tidy + optimize pl$DataFrame, pl$Series~~ tidy + optimize pl$DataFrame, pl$Series Sep 30, 2023

sorhawell marked this pull request as ready for review September 30, 2023 08:42

etiennebacher added 3 commits October 2, 2023 13:40

add custom schema example for lazyframe

4f8b9ed

Merge branch 'result_minimal' of https://github.com/pola-rs/r-polars …

bdbec75

…into result_minimal

add test for lazyframe

38f2bff

etiennebacher approved these changes Oct 2, 2023

View reviewed changes

bump news

6a7c452

etiennebacher merged commit 68ca441 into main Oct 2, 2023

etiennebacher deleted the result_minimal branch October 2, 2023 12:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tidy + optimize pl$DataFrame, pl$Series #385

tidy + optimize pl$DataFrame, pl$Series #385

sorhawell commented Sep 12, 2023 •

edited by etiennebacher

Loading

etiennebacher left a comment

etiennebacher commented Sep 13, 2023 •

edited

Loading

sorhawell commented Sep 14, 2023 •

edited

Loading

etiennebacher commented Sep 29, 2023

etiennebacher left a comment

tidy + optimize pl$DataFrame, pl$Series #385

tidy + optimize pl$DataFrame, pl$Series #385

Conversation

sorhawell commented Sep 12, 2023 • edited by etiennebacher Loading

etiennebacher left a comment

Choose a reason for hiding this comment

etiennebacher commented Sep 13, 2023 • edited Loading

sorhawell commented Sep 14, 2023 • edited Loading

etiennebacher commented Sep 29, 2023

etiennebacher left a comment

Choose a reason for hiding this comment

sorhawell commented Sep 12, 2023 •

edited by etiennebacher

Loading

etiennebacher commented Sep 13, 2023 •

edited

Loading

sorhawell commented Sep 14, 2023 •

edited

Loading