pl$DataFrame() fails when columns of type "ivs_iv" is present in dataframe #368

Closed
cathblatter opened this issue Aug 28, 2023 · 7 comments

@cathblatter

Hi, I came across an issue with date-range columns from the {ivs} package while using {tidypolars}; at the author's suggestion I'm posting it here.

In brief: pl$DataFrame() fails when an interval column (date range) created with the {ivs} package has been added to the data frame before conversion. The error is:

library(dplyr, warn.conflicts = FALSE)
library(ivs)
library(polars)

t_date <- as.Date("2020-05-05")

test_df <- tibble(id = 1:5, 
                   grp = c("a", "a", "b", "b", "b"),
                   start = rep(t_date+1:5),
                   end = rep(t_date+11:7))

# adding an iv-variable to the dataframe
test_df_iv <- test_df |> 
    mutate(range = ivs::iv(start, end))

pl$DataFrame(test_df_iv)
#> Error: in set_column_from_robj: ShapeMismatch(ErrString("unable to add a column of length 2 to a dataframe of height 5")) 

I'm aware it's a bit of a special column type, and I'm just curious whether you plan to support these types of variables at some point. I find it extremely convenient to use r-polars with large routine data for the time it saves (and such data often contains date ranges).

Let me know if you want me to provide more info. In the meantime 🙌 thanks for the work!

@eitsupi
Collaborator

eitsupi commented Aug 29, 2023

It seems ivs is based on the vctrs package, and the arrow package already supports the vctrs package's classes (regardless of whether the result is the intended Arrow type).

> pak::pak("ivs")
✔ Updated metadata database: 2.90 MB in 6 files.
Updating metadata database ... done
Will install 1 package.
Will download 1 package with unknown size.
+ ivs   0.2.0 [dl]
ℹ Getting 1 pkg with unknown size
Got ivs 0.2.0 (x86_64-pc-linux-gnu-ubuntu-22.04) (412.73 kB)
✔ Downloaded 1 package (412.73 kB) in 3.5s
Installed ivs 0.2.0  (62ms)
✔ 1 pkg + 5 deps: kept 5, added 1, dld 1 (412.73 kB) [18.4s]

> library(dplyr)

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union


> library(ivs)

> t_date <- as.Date("2020-05-05")
 
    test_df <- tibble(id = 1:5, 
                       grp = c("a", "a", "b", "b", "b"),
                       start = rep(t_date+1:5),
                       end = rep(t_date+11:7))
 
    # adding an iv-variable to the dataframe
    test_df_iv <- test_df |> 
        mutate(range = ivs::iv(start, end))

> test_df_iv$range
<iv<date>[5]>
[1] [2020-05-06, 2020-05-16) [2020-05-07, 2020-05-15) [2020-05-08, 2020-05-14) [2020-05-09, 2020-05-13)
[5] [2020-05-10, 2020-05-12)

> test_df_iv$range |> class()
[1] "ivs_iv"     "vctrs_rcrd" "vctrs_vctr"

> test_df_iv |> arrow::as_arrow_table()
Table
5 rows x 5 columns
$id <int32>
$grp <string>
$start <date32[day]>
$end <date32[day]>
$range <<iv<date>[0]>>

> test_df_iv |> arrow::as_arrow_table() |> _$range
ChunkedArray
<<iv<date>[0]>>
[
  -- is_valid: all not null
  -- child 0 type: date32[day]
    [
      2020-05-06,
      2020-05-07,
      2020-05-08,
      2020-05-09,
      2020-05-10
    ]
  -- child 1 type: date32[day]
    [
      2020-05-16,
      2020-05-15,
      2020-05-14,
      2020-05-13,
      2020-05-12
    ]
]

But when I try to convert this to polars I get an error. Perhaps the arrow2 crate does not support this type.

In other words, it's an upstream issue.

> test_df_iv |> arrow::as_arrow_table() |> polars::pl$from_arrow()
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Utf8Error', /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow2-0.17.4/src/ffi/schema.rs:501:39
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at 'explicit panic', src/rdataframe/mod.rs:82:1
Error: Execution halted with the following contexts
   0: In R: in pl$from_arrow:
   0: During function call [polars::pl$from_arrow(arrow::as_arrow_table(test_df_iv))]
   1: user function panicked: from_arrow_record_batches

When I write this data to Parquet and try to read it, DuckDB can read it successfully but Python Polars fails to read it.

In [1]: import polars as pl

In [2]: pl.read_parquet("test.parquet")
---------------------------------------------------------------------------
ArrowErrorException                       Traceback (most recent call last)
Cell In[2], line 1
----> 1 pl.read_parquet("test.parquet")

File ~/.local/lib/python3.10/site-packages/polars/io/parquet/functions.py:132, in read_parquet(source, columns, n_rows, use_pyarrow, memory_map, storage_options, parallel, row_count_name, row_count_offset, low_memory, pyarrow_options, use_statistics, rechunk)
    121     import pyarrow.parquet
    123     return from_arrow(  # type: ignore[return-value]
    124         pa.parquet.read_table(
    125             source_prep,
   (...)
    129         )
    130     )
--> 132 return pl.DataFrame._read_parquet(
    133     source_prep,
    134     columns=columns,
    135     n_rows=n_rows,
    136     parallel=parallel,
    137     row_count_name=row_count_name,
    138     row_count_offset=row_count_offset,
    139     low_memory=low_memory,
    140     use_statistics=use_statistics,
    141     rechunk=rechunk,
    142 )

File ~/.local/lib/python3.10/site-packages/polars/dataframe/frame.py:852, in DataFrame._read_parquet(cls, source, columns, n_rows, parallel, row_count_name, row_count_offset, low_memory, use_statistics, rechunk)
    850 projection, columns = handle_projection_columns(columns)
    851 self = cls.__new__(cls)
--> 852 self._df = PyDataFrame.read_parquet(
    853     source,
    854     columns,
    855     projection,
    856     n_rows,
    857     parallel,
    858     _prepare_row_count_args(row_count_name, row_count_offset),
    859     low_memory=low_memory,
    860     use_statistics=use_statistics,
    861     rechunk=rechunk,
    862 )
    863 return self

ArrowErrorException: OutOfSpec("In <KeyValue@d8>::value(): Invalid utf-8: invalid utf-8 sequence of 1 bytes from index 83")

@eitsupi
Collaborator

eitsupi commented Aug 29, 2023

Even if this bug is resolved, I think dedicated handling is still needed to convert vctrs-based vectors, such as those from the clock and ivs packages, to the intended Arrow type.

@sorhawell
Collaborator

I have a fix for this coming up, but got interrupted. Will update later.

@sorhawell
Collaborator

Currently polars just ignores the vctrs annotations and traits and converts the object as what it is: a list of two vectors. That, however, gives a length mismatch of 2 versus 5. Even if it were possible to import an ivs_iv-classed vector, the methods from the ivs package would not know what to do with polars Series and DataFrames. You might want to switch to the polars pl$date_range instead.
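The 2-versus-5 mismatch can be reproduced without polars or ivs. The following is a minimal base-R sketch: `rcrd_fields` is a hypothetical stand-in for what `unclass()` returns for an `ivs_iv` vector (a named list with `start` and `end`), not the real object.

```r
# Stand-in for unclass(test_df_iv$range): a named list of two Date vectors.
# (Assumption: this mirrors the record layout shown by unclass() in this thread.)
t_date <- as.Date("2020-05-05")
rcrd_fields <- list(start = t_date + 1:5, end = t_date + 11:7)

# Seen as a plain list, the object has one element per field ...
n_fields <- length(rcrd_fields)

# ... while each field holds one value per row of the data frame.
n_per_field <- lengths(rcrd_fields)

# Treating the fields as columns gives the expected number of rows.
as_columns <- as.data.frame(rcrd_fields)

print(n_fields)
print(n_per_field)
print(dim(as_columns))
```

A converter that looks only at the outer list sees a length of 2, while the surrounding data frame has 5 rows, which matches the ShapeMismatch message in the original report.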

I think polars should support vctrs vectors. On the occasion of this issue I have refactored the polars import of Robjs and also added a dependency-injection method, as_polars_series.YourClass, so that any classed Robj can be supported by polars, by tidypolars, or by the end user.
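For readers unfamiliar with this extension pattern, here is a minimal base-R sketch of how an S3 generic lets a third party plug in a conversion for its own class. The names `as_series_sketch` and `my_rcrd` are hypothetical and only illustrate the dispatch mechanism; the sketch does not reproduce the actual polars implementation.

```r
# Hypothetical generic standing in for a converter such as as_polars_series.
as_series_sketch <- function(x, ...) UseMethod("as_series_sketch")

# Fallback used for any class without its own method.
as_series_sketch.default <- function(x, ...) {
  list(kind = "plain", values = unclass(x))
}

# A package or user can add a method for its own class without
# touching the generic: here a record-like class becomes a "struct".
as_series_sketch.my_rcrd <- function(x, ...) {
  list(kind = "struct", values = as.data.frame(unclass(x)))
}

t_date <- as.Date("2020-05-05")
x <- structure(
  list(start = t_date + 1:5, end = t_date + 11:7),
  class = "my_rcrd"
)

res_rcrd <- as_series_sketch(x)     # dispatches to the my_rcrd method
res_plain <- as_series_sketch(1:3)  # falls back to the default method

print(res_rcrd$kind)
print(res_plain$kind)
```

The design point is that the generic's author does not need to know about every class in advance; whoever owns the class supplies the method.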

The code example below is from PR #369. The examples before as_polars_series should also work in polars 0.7.0.

library(dplyr, warn.conflicts = FALSE)
library(ivs)
library(polars)
library(tidypolars)
#> Warning: package 'tidypolars' was built under R version 4.3.1
#> Registered S3 method overwritten by 'tidypolars':
#>   method          from  
#>   print.DataFrame polars
t_date <- as.Date("2020-05-05")

test_df <- tibble(id = 1:5, 
                   grp = c("a", "a", "b", "b", "b"),
                   start = rep(t_date+1:5),
                   end = rep(t_date+11:7))

# adding an iv-variable to the dataframe
test_df_iv <- test_df |> 
    mutate(range = ivs::iv(start, end))

class(test_df_iv$range)
#> [1] "ivs_iv"     "vctrs_rcrd" "vctrs_vctr"
unclass(test_df_iv$range)
#> $start
#> [1] "2020-05-06" "2020-05-07" "2020-05-08" "2020-05-09" "2020-05-10"
#> 
#> $end
#> [1] "2020-05-16" "2020-05-15" "2020-05-14" "2020-05-13" "2020-05-12"


# importing as plain Dates by removing the vctrs attributes
test_df_plain = test_df_iv
test_df_plain[,c("range_1","range_2")] = unclass(test_df_iv$range)
test_df_plain$range = NULL
pl$DataFrame(test_df_plain)
#> shape: (5, 6)
#> ┌─────┬─────┬────────────┬────────────┬────────────┬────────────┐
#> │ id  ┆ grp ┆ start      ┆ end        ┆ range_1    ┆ range_2    │
#> │ --- ┆ --- ┆ ---        ┆ ---        ┆ ---        ┆ ---        │
#> │ i32 ┆ str ┆ date       ┆ date       ┆ date       ┆ date       │
#> ╞═════╪═════╪════════════╪════════════╪════════════╪════════════╡
#> │ 1   ┆ a   ┆ 2020-05-06 ┆ 2020-05-16 ┆ 2020-05-06 ┆ 2020-05-16 │
#> │ 2   ┆ a   ┆ 2020-05-07 ┆ 2020-05-15 ┆ 2020-05-07 ┆ 2020-05-15 │
#> │ 3   ┆ b   ┆ 2020-05-08 ┆ 2020-05-14 ┆ 2020-05-08 ┆ 2020-05-14 │
#> │ 4   ┆ b   ┆ 2020-05-09 ┆ 2020-05-13 ┆ 2020-05-09 ┆ 2020-05-13 │
#> │ 5   ┆ b   ┆ 2020-05-10 ┆ 2020-05-12 ┆ 2020-05-10 ┆ 2020-05-12 │
#> └─────┴─────┴────────────┴────────────┴────────────┴────────────┘

# or make a Struct Series (a struct is pretty close to a DataFrame in a Series)
pl$select(unclass(test_df_iv$range))$to_struct()$alias("range_struct")
#> polars Series: shape: (5,)
#> Series: 'range_struct' [struct[2]]
#> [
#>  {2020-05-06,2020-05-16}
#>  {2020-05-07,2020-05-15}
#>  {2020-05-08,2020-05-14}
#>  {2020-05-09,2020-05-13}
#>  {2020-05-10,2020-05-12}
#> ]

# use polars date_range instead of ivs
test_df_plain$range_1 = NULL
test_df_plain$range_2 = NULL
pl$DataFrame(test_df_plain)$with_columns(
  #as Date
  pl$date_range(
      pl$col("start"), 
      pl$col("end"),
      interval = "1d",
      explode = FALSE
    )$alias("range_as_date_ranges"),
  
  #or some Datetime
  pl$date_range(
      pl$col("start"), 
      pl$col("end"),
      interval = "1d42m5s",
      explode = FALSE
    )$alias("range_as_datetime_ranges")
)
#> shape: (5, 6)
#> ┌─────┬─────┬────────────┬────────────┬────────────────────────────┬──────────────────────────┐
#> │ id  ┆ grp ┆ start      ┆ end        ┆ range_as_date_ranges       ┆ range_as_datetime_ranges │
#> │ --- ┆ --- ┆ ---        ┆ ---        ┆ ---                        ┆ ---                      │
#> │ i32 ┆ str ┆ date       ┆ date       ┆ list[date]                 ┆ list[datetime[μs]]       │
#> ╞═════╪═════╪════════════╪════════════╪════════════════════════════╪══════════════════════════╡
#> │ 1   ┆ a   ┆ 2020-05-06 ┆ 2020-05-16 ┆ [2020-05-06, 2020-05-07, … ┆ [2020-05-06 00:00:00,    │
#> │     ┆     ┆            ┆            ┆ 2020-…                     ┆ 2020-05-07…              │
#> │ 2   ┆ a   ┆ 2020-05-07 ┆ 2020-05-15 ┆ [2020-05-07, 2020-05-08, … ┆ [2020-05-07 00:00:00,    │
#> │     ┆     ┆            ┆            ┆ 2020-…                     ┆ 2020-05-08…              │
#> │ 3   ┆ b   ┆ 2020-05-08 ┆ 2020-05-14 ┆ [2020-05-08, 2020-05-09, … ┆ [2020-05-08 00:00:00,    │
#> │     ┆     ┆            ┆            ┆ 2020-…                     ┆ 2020-05-09…              │
#> │ 4   ┆ b   ┆ 2020-05-09 ┆ 2020-05-13 ┆ [2020-05-09, 2020-05-10, … ┆ [2020-05-09 00:00:00,    │
#> │     ┆     ┆            ┆            ┆ 2020-…                     ┆ 2020-05-10…              │
#> │ 5   ┆ b   ┆ 2020-05-10 ┆ 2020-05-12 ┆ [2020-05-10, 2020-05-11,   ┆ [2020-05-10 00:00:00,    │
#> │     ┆     ┆            ┆            ┆ 2020-05…                   ┆ 2020-05-11…              │
#> └─────┴─────┴────────────┴────────────┴────────────────────────────┴──────────────────────────┘


# But ....
# from the perspective of a package extending polars, or of a user, hand-coding all this could be ugly
# I have added a method to polars::as_polars_series (likely released with polars 0.8.0) that
# users or package maintainers can use to modify/extend how Robjs are converted into Series

# e.g. define a generic conversion for any "vctrs_rcrd"
as_polars_series.vctrs_rcrd = function(x, ...) {
  pl$DataFrame(unclass(x))$to_struct()
}

# now it just works
pl$lit(test_df_iv$range)
#> polars Expr: Series
pl$DataFrame(test_df_iv)
#> shape: (5, 5)
#> ┌─────┬─────┬────────────┬────────────┬─────────────────────────┐
#> │ id  ┆ grp ┆ start      ┆ end        ┆ range                   │
#> │ --- ┆ --- ┆ ---        ┆ ---        ┆ ---                     │
#> │ i32 ┆ str ┆ date       ┆ date       ┆ struct[2]               │
#> ╞═════╪═════╪════════════╪════════════╪═════════════════════════╡
#> │ 1   ┆ a   ┆ 2020-05-06 ┆ 2020-05-16 ┆ {2020-05-06,2020-05-16} │
#> │ 2   ┆ a   ┆ 2020-05-07 ┆ 2020-05-15 ┆ {2020-05-07,2020-05-15} │
#> │ 3   ┆ b   ┆ 2020-05-08 ┆ 2020-05-14 ┆ {2020-05-08,2020-05-14} │
#> │ 4   ┆ b   ┆ 2020-05-09 ┆ 2020-05-13 ┆ {2020-05-09,2020-05-13} │
#> │ 5   ┆ b   ┆ 2020-05-10 ┆ 2020-05-12 ┆ {2020-05-10,2020-05-12} │
#> └─────┴─────┴────────────┴────────────┴─────────────────────────┘
x = test_df_iv$range


# or define a more specialized conversion for ivs_iv, where we specifically use "start" and "end"
as_polars_series.ivs_iv = function(x, ...) {
  pl$DataFrame(unclass(x))$select(
    pl$date_range(
      pl$col("start"), 
      pl$col("end"),
      interval = "1d",
      explode = FALSE
    )$alias("ivs_iv")
  )$to_series()
}

pl$lit(test_df_iv$range)
#> polars Expr: Series[ivs_iv]
pl$DataFrame(test_df_iv)
#> shape: (5, 5)
#> ┌─────┬─────┬────────────┬────────────┬───────────────────────────────────┐
#> │ id  ┆ grp ┆ start      ┆ end        ┆ range                             │
#> │ --- ┆ --- ┆ ---        ┆ ---        ┆ ---                               │
#> │ i32 ┆ str ┆ date       ┆ date       ┆ list[date]                        │
#> ╞═════╪═════╪════════════╪════════════╪═══════════════════════════════════╡
#> │ 1   ┆ a   ┆ 2020-05-06 ┆ 2020-05-16 ┆ [2020-05-06, 2020-05-07, … 2020-… │
#> │ 2   ┆ a   ┆ 2020-05-07 ┆ 2020-05-15 ┆ [2020-05-07, 2020-05-08, … 2020-… │
#> │ 3   ┆ b   ┆ 2020-05-08 ┆ 2020-05-14 ┆ [2020-05-08, 2020-05-09, … 2020-… │
#> │ 4   ┆ b   ┆ 2020-05-09 ┆ 2020-05-13 ┆ [2020-05-09, 2020-05-10, … 2020-… │
#> │ 5   ┆ b   ┆ 2020-05-10 ┆ 2020-05-12 ┆ [2020-05-10, 2020-05-11, 2020-05… │
#> └─────┴─────┴────────────┴────────────┴───────────────────────────────────┘


# final gotcha: select and with_columns unpack a single list input,
# because they expect it to be an argument list
pl$select(test_df_iv$range) # naively converts them to dates
#> shape: (5, 2)
#> ┌────────────┬────────────┐
#> │ start      ┆ end        │
#> │ ---        ┆ ---        │
#> │ date       ┆ date       │
#> ╞════════════╪════════════╡
#> │ 2020-05-06 ┆ 2020-05-16 │
#> │ 2020-05-07 ┆ 2020-05-15 │
#> │ 2020-05-08 ┆ 2020-05-14 │
#> │ 2020-05-09 ┆ 2020-05-13 │
#> │ 2020-05-10 ┆ 2020-05-12 │
#> └────────────┴────────────┘


# to avoid this, a first and only list arg must be wrapped in a list
pl$select(list(test_df_iv$range)) # now converted by as_polars_series.ivs_iv
#> shape: (5, 1)
#> ┌───────────────────────────────────┐
#> │ ivs_iv                            │
#> │ ---                               │
#> │ list[date]                        │
#> ╞═══════════════════════════════════╡
#> │ [2020-05-06, 2020-05-07, … 2020-… │
#> │ [2020-05-07, 2020-05-08, … 2020-… │
#> │ [2020-05-08, 2020-05-09, … 2020-… │
#> │ [2020-05-09, 2020-05-10, … 2020-… │
#> │ [2020-05-10, 2020-05-11, 2020-05… │
#> └───────────────────────────────────┘

Created on 2023-08-31 with reprex v2.0.2
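The final gotcha can be modelled in a few lines of base R. This is only a sketch under assumptions: `unpack_args_sketch` is a hypothetical function that mimics the "a single list argument is treated as the argument list" behaviour described above, and `my_rcrd` is a stand-in class; neither is taken from polars.

```r
# Hypothetical model of an argument collector that unpacks a lone list input.
unpack_args_sketch <- function(...) {
  args <- list(...)
  # A first and only argument that is a list is taken to BE the argument list.
  if (length(args) == 1L && is.list(args[[1L]])) {
    args <- unclass(args[[1L]])
  }
  args
}

# Record-like object: a classed list of two fields, so is.list() is TRUE for it.
t_date <- as.Date("2020-05-05")
rcrd_like <- structure(
  list(start = t_date + 1:5, end = t_date + 11:7),
  class = "my_rcrd"
)

# Passed bare, the record is unpacked into its two fields.
n_bare <- length(unpack_args_sketch(rcrd_like))

# Wrapped in list(), it survives as one classed argument.
wrapped <- unpack_args_sketch(list(rcrd_like))
n_wrapped <- length(wrapped)

print(n_bare)
print(n_wrapped)
print(class(wrapped[[1L]]))
```

Under this model the bare call yields two arguments (the start and end columns), while the wrapped call yields a single argument that still carries its class and can therefore reach a class-specific conversion method.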

@sorhawell
Collaborator

Hi @cathblatter, as of the released polars 0.8.0, which includes PR #369, the example above works. Should we close this issue?

@cathblatter
Author

Confirm this works like a charm - thank you very much @sorhawell & also @etiennebacher for your speedy adjustments🥳

@eitsupi
Collaborator

eitsupi commented Dec 4, 2023

> But when I try to convert this to polars I get an error. Perhaps the arrow2 crate does not support this type.

I think the real problem here is that Polars doesn't support Arrow's extension types. (pola-rs/polars#9112)
I will open a new issue.
