Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: export/import DataFrame as raw vector #1072

Merged
merged 5 commits into from
May 4, 2024
Merged

feat: export/import DataFrame as raw vector #1072

merged 5 commits into from
May 4, 2024

Conversation

eitsupi
Copy link
Collaborator

@eitsupi eitsupi commented May 3, 2024

Functions originally implemented on the Rust side for use in inter-process communication are refactored and made available on the R side.
This is useful when implementing asynchronous processing in R (use with the mirai package for example).

I am considering what function should be used on the R side.

Python Polars uses write_ipc and read_ipc to switch between writing to the session and writing to the file depending on the first argument, but I feel that it is better to have the same function for reading and a separate function for writing, as in the R arrow package.

@shikokuchuo
Copy link

shikokuchuo commented May 3, 2024

In general I prefer separate functions as you've implemented. Otherwise a separate 'file' argument if writing a file, then just need a check for missing() rather than doing costly inferral.

This kind of thing makes sense for generics, but when the arguments are different types, I find it cleaner not to overload.

@eitsupi
Copy link
Collaborator Author

eitsupi commented May 3, 2024

@shikokuchuo Thank you for your comment.

I agree with you, but I think there are cases here where it is better to focus on consistency with other language APIs in Polars and with other packages in R.

  1. In Python Polars, polars.read_ipc() and others change their behavior depending on the argument type.
  2. Popular read functions in R, such as data.table::fread(), readr::read_csv(), and arrow::read_ipc(), have the ability to interpret vectors directly as files in addition to file paths.

In other words, in both Polars and R, it seems that it is acceptable to have different behavior for read functions depending on the argument type (although these functions are not generic functions, of course).

So I am thinking here that the function to convert a raw vector of Arrow IPC to a DataFrame should be used by rewriting pl$read_ipc().

@shikokuchuo
Copy link

That's fine. My comment is just 'in general' - if there are more important considerations such as API consistency as you point out, don't let me stop you!

@eitsupi eitsupi marked this pull request as ready for review May 3, 2024 13:30
@eitsupi eitsupi requested a review from etiennebacher May 3, 2024 13:30
@eitsupi
Copy link
Collaborator Author

eitsupi commented May 3, 2024

It seems working fine with mirai 1.0.0 (released few hours ago!).

This example is from https://shikokuchuo.net/mirai/articles/databases.html#database-hosting---using-arrow-database-connectivity

library(mirai)

daemons(1)
#> [1] 1

everywhere({
  library(DBI)
  con <<- dbConnect(adbi::adbi("adbcsqlite"), uri = ":memory:")
})

serialization(
  refhook = list(\(x) polars::as_polars_df(x)$to_raw_ipc(future = TRUE),
                polars::pl$read_ipc),
  class = "nanoarrow_array_stream"
)

m <- mirai(dbWriteTableArrow(con, "iris", iris))
call_mirai(m)$data
#> [1] TRUE

m <- mirai(dbReadTableArrow(con, "iris"))
call_mirai(m)$data
#> shape: (150, 5)
#> ┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
#> │ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species   │
#> │ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       │
#> │ f64          ┆ f64         ┆ f64          ┆ f64         ┆ str       │
#> ╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
#> │ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa    │
#> │ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa    │
#> │ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa    │
#> │ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa    │
#> │ 5.0          ┆ 3.6         ┆ 1.4          ┆ 0.2         ┆ setosa    │
#> │ …            ┆ …           ┆ …            ┆ …           ┆ …         │
#> │ 6.7          ┆ 3.0         ┆ 5.2          ┆ 2.3         ┆ virginica │
#> │ 6.3          ┆ 2.5         ┆ 5.0          ┆ 1.9         ┆ virginica │
#> │ 6.5          ┆ 3.0         ┆ 5.2          ┆ 2.0         ┆ virginica │
#> │ 6.2          ┆ 3.4         ┆ 5.4          ┆ 2.3         ┆ virginica │
#> │ 5.9          ┆ 3.0         ┆ 5.1          ┆ 1.8         ┆ virginica │
#> └──────────────┴─────────────┴──────────────┴─────────────┴───────────┘

Created on 2024-05-03 with reprex v2.1.0

@shikokuchuo
Copy link

Great! I can confirm that the above works.

For me this also works:

serialization(
  refhook = list(\(x) polars::as_polars_df(x)$to_raw_ipc(),
                polars::pl$read_ipc),
  class = "nanoarrow_array_stream"
)

I didn't find the documentation for what that future = TRUE argument means.

Apart from this, is there scope to make it even more ergonomic to have a direct counterpart to the read_ipc() method so there doesn't need to be an anonymous function?

I think once nanoarrow implements its own serialization features, it should also behave like this i.e. not require function(x) ...

@eitsupi
Copy link
Collaborator Author

eitsupi commented May 3, 2024

I didn't find the documentation for what that future = TRUE argument means.

Polars introduced the StringView type earlier than other Arrow implementations and uses it as the default string type, which can cause problems when passing Arrow IPC to other Arrow implementations.
For example, nanoarrow does not yet implement the StringView type yet (apache/arrow-nanoarrow#367).

Also, the arrow package does not support converting this to R as of version 15, so errors occur when loading it.

df <- polars::pl$DataFrame(string = "foo")

df$to_raw_ipc(future = FALSE) |>
  arrow::read_ipc_file()
#>   string
#> 1    foo

df$to_raw_ipc(future = TRUE) |>
  arrow::read_ipc_file()
#> Error: cannot handle Array of type <utf8_view>

Created on 2024-05-03 with reprex v2.1.0

This is not a problem for exchanging data between Polars and should result in a slight performance increase due to the lack of extra conversions.

Apart from this, is there scope to make it even more ergonomic to have a direct counterpart to the read_ipc() method so there doesn't need to be an anonymous function?

I too think something like that would be worth adding, but there is no consensus yet.
Here is a recent discussion.
etiennebacher/tidypolars#111

@shikokuchuo
Copy link

shikokuchuo commented May 3, 2024

Thanks for exposing these anyway - I think they will be useful for your users if they want to work with parallel / distributed computing. I will add something to the mirai docs early next week.

Or once this is merged / released actually :)

@eitsupi
Copy link
Collaborator Author

eitsupi commented May 4, 2024

I will merge this for now. If function names need to be changed, I believe they can be changed later.

@eitsupi eitsupi merged commit 32a97c6 into main May 4, 2024
33 of 35 checks passed
@eitsupi eitsupi deleted the to_raw branch May 4, 2024 03:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants