feat: export/import DataFrame as raw vector #1072

eitsupi · 2024-05-03T10:35:46Z

Functions originally implemented on the Rust side for use in inter-process communication are refactored and made available on the R side.
This is useful when implementing asynchronous processing in R (for example, with the mirai package).

I am still considering which functions should be exposed on the R side.

In Python Polars, write_ipc and read_ipc switch between working with in-memory data and working with a file depending on the first argument, but I feel it is better to have one read function that handles both cases and a separate function for each kind of write, as in the R arrow package.

shikokuchuo · 2024-05-03T10:59:26Z

In general I prefer separate functions as you've implemented. Otherwise, use a separate 'file' argument for writing to a file; then you only need a missing() check rather than costly type inference.

This kind of thing makes sense for generics, but when the arguments are different types, I find it cleaner not to overload.
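
As an illustration of the missing() approach described above, here is a minimal base R sketch. The helper name write_bytes and the use of serialize() are assumptions made for the example; this is not the polars API and not code from this PR.

```r
# Hypothetical helper (not part of polars): return a raw vector when `file`
# is not supplied, otherwise write the same bytes to that file.
write_bytes <- function(x, file) {
  bytes <- serialize(x, connection = NULL)  # in-memory raw vector
  if (missing(file)) {
    return(bytes)  # no `file` given: hand the raw vector back to the caller
  }
  writeBin(bytes, file)  # `file` given: write the bytes to disk instead
  invisible(file)
}

raw_out <- write_bytes(mtcars)           # stays in the session
path <- write_bytes(mtcars, tempfile())  # goes to a file
```

The only branching is the missing(file) check, so the function never has to inspect the type of its first argument.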

eitsupi · 2024-05-03T11:15:18Z

@shikokuchuo Thank you for your comment.

I agree with you, but I think there are cases here where it is better to focus on consistency with other language APIs in Polars and with other packages in R.

In Python Polars, polars.read_ipc() and others change their behavior depending on the argument type.
Popular read functions in R, such as data.table::fread(), readr::read_csv(), and arrow::read_ipc_file(), can read data directly from a vector in addition to a file path.

In other words, in both Polars and R, it seems that it is acceptable to have different behavior for read functions depending on the argument type (although these functions are not generic functions, of course).

So my current thinking is that converting a raw vector of Arrow IPC to a DataFrame should be handled by extending pl$read_ipc().
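
To make the idea of a read function that branches on its argument type concrete, here is a small base R sketch. The name read_bytes and the serialize()/unserialize() format are assumptions chosen so the example runs without extra packages; it does not show how pl$read_ipc() is actually implemented.

```r
# Hypothetical reader (not the polars implementation): accept either a raw
# vector or a single file path and branch on the argument type.
read_bytes <- function(source) {
  if (is.raw(source)) {
    # Raw vector: decode the bytes directly
    return(unserialize(source))
  }
  if (is.character(source) && length(source) == 1L) {
    # File path: read the bytes from disk, then decode them
    bytes <- readBin(source, what = "raw", n = file.size(source))
    return(unserialize(bytes))
  }
  stop("`source` must be a raw vector or a single file path")
}

bytes <- serialize(iris, connection = NULL)
path <- tempfile()
writeBin(bytes, path)

from_raw <- read_bytes(bytes)
from_file <- read_bytes(path)
```

Both calls return the same data frame, which is the behavior users of fread() or read_csv() would expect from a single entry point.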

shikokuchuo · 2024-05-03T11:19:27Z

That's fine. My comment is just 'in general' - if there are more important considerations such as API consistency as you point out, don't let me stop you!

eitsupi · 2024-05-03T14:06:34Z

It seems to be working fine with mirai 1.0.0 (released a few hours ago!).

This example is from https://shikokuchuo.net/mirai/articles/databases.html#database-hosting---using-arrow-database-connectivity

library(mirai)

daemons(1)
#> [1] 1

everywhere({
  library(DBI)
  con <<- dbConnect(adbi::adbi("adbcsqlite"), uri = ":memory:")
})

serialization(
  refhook = list(\(x) polars::as_polars_df(x)$to_raw_ipc(future = TRUE),
                polars::pl$read_ipc),
  class = "nanoarrow_array_stream"
)

m <- mirai(dbWriteTableArrow(con, "iris", iris))
call_mirai(m)$data
#> [1] TRUE

m <- mirai(dbReadTableArrow(con, "iris"))
call_mirai(m)$data
#> shape: (150, 5)
#> ┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
#> │ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species   │
#> │ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       │
#> │ f64          ┆ f64         ┆ f64          ┆ f64         ┆ str       │
#> ╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
#> │ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa    │
#> │ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa    │
#> │ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa    │
#> │ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa    │
#> │ 5.0          ┆ 3.6         ┆ 1.4          ┆ 0.2         ┆ setosa    │
#> │ …            ┆ …           ┆ …            ┆ …           ┆ …         │
#> │ 6.7          ┆ 3.0         ┆ 5.2          ┆ 2.3         ┆ virginica │
#> │ 6.3          ┆ 2.5         ┆ 5.0          ┆ 1.9         ┆ virginica │
#> │ 6.5          ┆ 3.0         ┆ 5.2          ┆ 2.0         ┆ virginica │
#> │ 6.2          ┆ 3.4         ┆ 5.4          ┆ 2.3         ┆ virginica │
#> │ 5.9          ┆ 3.0         ┆ 5.1          ┆ 1.8         ┆ virginica │
#> └──────────────┴─────────────┴──────────────┴─────────────┴───────────┘

Created on 2024-05-03 with reprex v2.1.0

shikokuchuo · 2024-05-03T14:50:19Z

Great! I can confirm that the above works.

For me this also works:

serialization(
  refhook = list(\(x) polars::as_polars_df(x)$to_raw_ipc(),
                polars::pl$read_ipc),
  class = "nanoarrow_array_stream"
)

I didn't find the documentation for what that future = TRUE argument means.

Apart from this, is there scope to make it even more ergonomic to have a direct counterpart to the read_ipc() method so there doesn't need to be an anonymous function?

I think once nanoarrow implements its own serialization features, it should also behave like this i.e. not require function(x) ...

eitsupi · 2024-05-03T15:02:05Z

> I didn't find the documentation for what that future = TRUE argument means.

Polars introduced the StringView type earlier than other Arrow implementations and uses it as the default string type, which can cause problems when passing Arrow IPC to other Arrow implementations.
For example, nanoarrow does not yet implement the StringView type (apache/arrow-nanoarrow#367).

Also, as of version 15, the arrow package cannot convert this type to R, so an error occurs when reading such data.

df <- polars::pl$DataFrame(string = "foo")

df$to_raw_ipc(future = FALSE) |>
  arrow::read_ipc_file()
#>   string
#> 1    foo

df$to_raw_ipc(future = TRUE) |>
  arrow::read_ipc_file()
#> Error: cannot handle Array of type <utf8_view>

Created on 2024-05-03 with reprex v2.1.0

This is not a problem when exchanging data between Polars sessions, and it should give a slight performance increase because no extra conversions are needed.

> Apart from this, is there scope to make it even more ergonomic to have a direct counterpart to the read_ipc() method so there doesn't need to be an anonymous function?

I too think something like that would be worth adding, but there is no consensus yet.
Here is a recent discussion: etiennebacher/tidypolars#111

shikokuchuo · 2024-05-03T15:23:23Z

Thanks for exposing these anyway - I think they will be useful for your users if they want to work with parallel / distributed computing. I will add something to the mirai docs early next week.

Or once this is merged / released actually :)

eitsupi · 2024-05-04T03:14:23Z

I will merge this for now. If function names need to be changed, I believe they can be changed later.

feat: export/import DataFrame as raw vector [skip ci]

4efeda7

eitsupi added 4 commits May 3, 2024 12:13

feat: more options [skip ci]

4f50219

refactor: simplify [skip ci]

214bc58

feat: update R side

d281a31

docs(news): add bullets

662a476

eitsupi marked this pull request as ready for review May 3, 2024 13:30

eitsupi requested a review from etiennebacher May 3, 2024 13:30

eitsupi merged commit 32a97c6 into main May 4, 2024
33 of 35 checks passed

eitsupi deleted the to_raw branch May 4, 2024 03:14

eitsupi mentioned this pull request Jun 14, 2024

Support raw yutannihilation/savvy#30

Closed
