Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor lit, col, DataFrame, Series #369

Merged
merged 9 commits into from
Aug 31, 2023
Merged

Conversation

sorhawell
Copy link
Collaborator

@sorhawell sorhawell commented Aug 30, 2023

#368 showed some short comming in the conversions in polars

I took it as an occasion to tidy up

tidying

  • rewrite new DataFrame to be derived from pl$select() but where strings are literals.
  • refactor pl$col and pl$lit to rust side and thereby remove alot of calls from rust side to R

new features

  • add public method as_polars_series, such that any user or package extending polars can define how some Robj is converted into a Series
  • add explode = FALSE to pl$date_range. In newer py-polars method is named pl$date_ranges and "explode" is implicitly TRUE. adding explode arg to pl$date_range allow users to use it create a list of ranges.

@sorhawell sorhawell marked this pull request as draft August 30, 2023 12:04
Merge remote-tracking branch 'origin/main' into tidy_up_lit_col_dataframe
# Please enter a commit message to explain why this merge is necessary,
# especially if it merges an updated upstream into a topic branch.
#
# Lines starting with '#' will be ignored, and an empty message aborts
# the commit.
@sorhawell sorhawell changed the title WIP: refactor lit, col, DataFrame refactor lit, col, DataFrame, Series Aug 30, 2023
@sorhawell
Copy link
Collaborator Author

news section is not complete

@sorhawell sorhawell marked this pull request as ready for review August 30, 2023 22:25
Copy link
Collaborator

@etiennebacher etiennebacher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice refactoring, I just have a comment about the new arg explode in date_range().

Otherwise, LGTM (I just had to fix tiny errors in tidypolars but I can confirm there's no breaking change there).

The as_polars_series() also works in tidypolars, this looks nice

R/functions__eager.R Show resolved Hide resolved
src/rust/src/rlib.rs Show resolved Hide resolved
@etiennebacher
Copy link
Collaborator

etiennebacher commented Aug 31, 2023

It might also be important to note that vctrs_rcrd is the "default" class for all objects created with vctrs:

library(dplyr, warn.conflicts = FALSE)
library(ivs)
library(clock)
library(polars)

# a data.frame with {ivs} class + vctrs_rcrd
df <- tribble(
  ~x, ~start, ~end,
  810L,   "2020-12-15",    "2020-12-16"
) |> 
  mutate(foo = iv(start, end)) 

df
#> # A tibble: 1 × 4
#>       x start      end                             foo
#>   <int> <chr>      <chr>                     <iv<chr>>
#> 1   810 2020-12-15 2020-12-16 [2020-12-15, 2020-12-16)
class(df$foo)
#> [1] "ivs_iv"     "vctrs_rcrd" "vctrs_vctr"

# a data.frame with {clock} class + vctrs_rcrd
df2 <- tibble(foo = year_month_day(2019, 1, 30, 9) |> as_naive_time())
df2
#> # A tibble: 1 × 1
#>   foo          
#>   <naive<hour>>
#> 1 2019-01-30T09
class(df2$foo)
#> [1] "clock_naive_time" "clock_time_point" "clock_rcrd"       "vctrs_rcrd"      
#> [5] "vctrs_vctr"

as_polars_series.vctrs_rcrd = function(x, ...) {
  pl$DataFrame(unclass(x))$to_struct()
}

pl$DataFrame(df)
#> shape: (1, 4)
#> ┌─────┬────────────┬────────────┬─────────────────────────────┐
#> │ x   ┆ start      ┆ end        ┆ foo                         │
#> │ --- ┆ ---        ┆ ---        ┆ ---                         │
#> │ i32 ┆ str        ┆ str        ┆ struct[2]                   │
#> ╞═════╪════════════╪════════════╪═════════════════════════════╡
#> │ 810 ┆ 2020-12-15 ┆ 2020-12-16 ┆ {"2020-12-15","2020-12-16"} │
#> └─────┴────────────┴────────────┴─────────────────────────────┘
pl$DataFrame(df2)
#> shape: (1, 1)
#> ┌─────────────────────┐
#> │ foo                 │
#> │ ---                 │
#> │ struct[2]           │
#> ╞═════════════════════╡
#> │ {2.1475e9,430233.0} │
#> └─────────────────────┘

When this is merged, I'd like to experiment a bit in tidypolars and potentially write a vignette about this.

@sorhawell
Copy link
Collaborator Author

sorhawell commented Aug 31, 2023

It appears all vctrs_rcrd class instances will be list of equal sized vectors with unique names. (docs)

That is about the same requirement for arrow/polars struct so there is a fair chance it would work for any vctrs_rcrd.

It seem if polars defines these methods e.g. the user (external package?) can still override them.

#given polars internally keeps this definit in series__trait.R
#   as_polars_series.vctrs_rcrd = function(x, ...) {
#   pl$DataFrame(unclass(x))$to_struct()
# }

library(dplyr, warn.conflicts = FALSE)
library(ivs)
library(polars)
library(tidypolars)
#> Warning: package 'tidypolars' was built under R version 4.3.1
#> Registered S3 method overwritten by 'tidypolars':
#>   method          from  
#>   print.DataFrame polars
t_date <- as.Date("2020-05-05")

test_df <- tibble(id = 1:5, 
                   grp = c("a", "a", "b", "b", "b"),
                   start = rep(t_date+1:5),
                   end = rep(t_date+11:7))

# adding an iv-variable to the dataframe
test_df_iv <- test_df |> 
    mutate(range = ivs::iv(start, end))


pl$DataFrame(test_df_iv)
#> shape: (5, 5)
#> ┌─────┬─────┬────────────┬────────────┬─────────────────────────┐
#> │ id  ┆ grp ┆ start      ┆ end        ┆ range                   │
#> │ --- ┆ --- ┆ ---        ┆ ---        ┆ ---                     │
#> │ i32 ┆ str ┆ date       ┆ date       ┆ struct[2]               │
#> ╞═════╪═════╪════════════╪════════════╪═════════════════════════╡
#> │ 1   ┆ a   ┆ 2020-05-06 ┆ 2020-05-16 ┆ {2020-05-06,2020-05-16} │
#> │ 2   ┆ a   ┆ 2020-05-07 ┆ 2020-05-15 ┆ {2020-05-07,2020-05-15} │
#> │ 3   ┆ b   ┆ 2020-05-08 ┆ 2020-05-14 ┆ {2020-05-08,2020-05-14} │
#> │ 4   ┆ b   ┆ 2020-05-09 ┆ 2020-05-13 ┆ {2020-05-09,2020-05-13} │
#> │ 5   ┆ b   ┆ 2020-05-10 ┆ 2020-05-12 ┆ {2020-05-10,2020-05-12} │
#> └─────┴─────┴────────────┴────────────┴─────────────────────────┘


#user override polars definition, could a external package do that as well?
as_polars_series.vctrs_rcrd = function(x, ...) {
  pl$DataFrame(unclass(x))$to_series(0L) #get first column only
}

pl$DataFrame(test_df_iv)
#> shape: (5, 5)
#> ┌─────┬─────┬────────────┬────────────┬────────────┐
#> │ id  ┆ grp ┆ start      ┆ end        ┆ range      │
#> │ --- ┆ --- ┆ ---        ┆ ---        ┆ ---        │
#> │ i32 ┆ str ┆ date       ┆ date       ┆ date       │
#> ╞═════╪═════╪════════════╪════════════╪════════════╡
#> │ 1   ┆ a   ┆ 2020-05-06 ┆ 2020-05-16 ┆ 2020-05-06 │
#> │ 2   ┆ a   ┆ 2020-05-07 ┆ 2020-05-15 ┆ 2020-05-07 │
#> │ 3   ┆ b   ┆ 2020-05-08 ┆ 2020-05-14 ┆ 2020-05-08 │
#> │ 4   ┆ b   ┆ 2020-05-09 ┆ 2020-05-13 ┆ 2020-05-09 │
#> │ 5   ┆ b   ┆ 2020-05-10 ┆ 2020-05-12 ┆ 2020-05-10 │
#> └─────┴─────┴────────────┴────────────┴────────────┘

Created on 2023-08-31 with reprex v2.0.2

@etiennebacher etiennebacher merged commit 21524b0 into main Aug 31, 2023
11 checks passed
@etiennebacher etiennebacher deleted the tidy_up_lit_col_dataframe branch August 31, 2023 19:10
Comment on lines +39 to +42
#' @export
as_polars_series.POSIXlt = function(x, ...) {
as.POSIXct(x)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is not implemented correctly.

Comment on lines +46 to +48
as_polars_series.vctrs_rcrd = function(x, ...) {
pl$DataFrame(unclass(x))$to_struct()
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not seem tested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants