feat: import_stream internal method for Series to support Arrow C stream interface #1078

eitsupi · 2024-05-05T15:42:05Z

Close #732
Organize functions to import Arrow C Stream, ~~and use that inside of as_polars_series() and as_polars_df() for nanoarrow_array_stream etc.~~
(Edit: This change has been undone due to observed performance degradation)
With the addition of this function, several functions that use the Arrow C Stream interface that existed (but were not used) are no longer needed and will be removed.

I have tried quite a few things, but it seems that the external pointer used when importing streams cannot be from another package and must be provided here.

I am convinced that this function is very dangerous and should not be exported, and that the conversion should basically be done via nanoarrow_array_stream.
In other words, the import_stream method will not be added to the DataFrame; use as_polars_df(<nanoarrow_array_stream>).

~~Running the same benchmarks as in #896, we can confirm that performance has not decreased.~~
(Edit: Running other benchmarks confirms the performance degradation. See comments like #1078 (comment).)

library(adbcdrivermanager)
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#>     timestamp

library(tibble)
library(polars)

polars_info()
#> Polars R package version : 0.16.3.9000
#> Rust Polars crate version: 0.39.2
#>
#> Thread pool size: 16
#>
#> Features:
#> default                    TRUE
#> full_features              TRUE
#> disable_limit_max_threads  TRUE
#> nightly                    TRUE
#> sql                        TRUE
#> rpolars_debug_print       FALSE
#>
#> Code completion: deactivated

db <- adbc_database_init(adbcsqlite::adbcsqlite())
con <- adbc_connection_init(db)

flights <- nycflights13::flights
flights$time_hour <- NULL
flights |>
  write_adbc(con, "flights")

query <- "SELECT * from flights"

bench::mark(
  polars_df_1 = {
    con |>
      read_adbc(query) |>
      as_polars_df()
  },
  arrow_table = {
    con |>
      read_adbc(query) |>
      as_arrow_table()
  },
  tibble = {
    con |>
      read_adbc(query) |>
      as_tibble()
  },
  polars_df_2 = {
    con |>
      read_adbc(query) |>
      as_polars_df()
  },
  check = FALSE,
  min_iterations = 5
)
#> # A tibble: 4 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 polars_df_1    1.23s    1.26s     0.792    4.89MB    0
#> 2 arrow_table    1.21s    1.26s     0.790    1.62MB    0
#> 3 tibble         3.62s    3.87s     0.259   46.52MB    0.388
#> 4 polars_df_2    1.32s    1.41s     0.701   13.16KB    0

^{Created on 2024-05-05 with reprex v2.1.0}

…ow::RecordBatchReader>)` and remove unused internal functions

eitsupi · 2024-05-06T03:47:35Z

The rechunk option of as_polars_df(<arrow::Table>) no longer works as a result of rewriting the contents of as_polars_df() in c6b4ee7.
I ran the benchmark (based on #64 (comment)) with the latest release and c6b4ee7.

In both cases, the case of using nanoarrow_array_stream as the intermediate format, which performs the conversion via the Arrow C Stream interface only, is the fastest, so perhaps it may be worth eliminating the rechunk option and rewriting the code completely to use only the Arrow C Stream.

The latest release

Details

library(polars)
library(nanoarrow)
library(arrow, warn.conflicts = FALSE)

polars_info()
#> Polars R package version : 0.16.3
#> Rust Polars crate version: 0.39.2
#>
#> Thread pool size: 16
#>
#> Features:
#> default                    TRUE
#> full_features              TRUE
#> disable_limit_max_threads  TRUE
#> nightly                    TRUE
#> sql                        TRUE
#> rpolars_debug_print       FALSE
#>
#> Code completion: deactivated
arrow_info()
#> Arrow package version: 15.0.1
#>
#> Capabilities:
#>
#> acero      TRUE
#> dataset    TRUE
#> substrait FALSE
#> parquet    TRUE
#> json       TRUE
#> s3         TRUE
#> gcs        TRUE
#> utf8proc   TRUE
#> re2        TRUE
#> snappy     TRUE
#> gzip       TRUE
#> brotli     TRUE
#> zstd       TRUE
#> lz4        TRUE
#> lz4_frame  TRUE
#> lzo       FALSE
#> bz2        TRUE
#> jemalloc   TRUE
#> mimalloc   TRUE
#>
#> Memory:
#>
#> Allocator jemalloc
#> Current    0 bytes
#> Max        0 bytes
#>
#> Runtime:
#>
#> SIMD Level          avx2
#> Detected SIMD Level avx2
#>
#> Build:
#>
#> C++ Library Version  15.0.1
#> C++ Compiler            GNU
#> C++ Compiler Version 11.4.0

big_at =  do.call(rbind, lapply(1:5, \(x) as_arrow_table(nycflights13::flights)))

bench::mark(
  rechunk = as_polars_df(big_at, rechunk = TRUE),
  not_rechunk = as_polars_df(big_at, rechunk = FALSE),
  via_nanoarrow = as_polars_df(as_nanoarrow_array_stream(big_at)),
  via_r = as_polars_df(as.data.frame(big_at)),
  check = FALSE,
  min_iterations = 5
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 rechunk        178.9ms 193.55ms     5.12     6.55MB    2.05
#> 2 not_rechunk   173.72ms 223.31ms     4.67   229.05KB    2.80
#> 3 via_nanoarrow  159.8ms 164.03ms     5.81     1.75MB    0
#> 4 via_r            1.37s    1.45s     0.671  167.41MB    0.671

^{Created on 2024-05-06 with reprex v2.1.0}

c6b4ee7

Details

library(polars)
library(nanoarrow)
library(arrow, warn.conflicts = FALSE)

polars_info()
#> Polars R package version : 0.16.3.9000
#> Rust Polars crate version: 0.39.2
#>
#> Thread pool size: 16
#>
#> Features:
#> default                    TRUE
#> full_features              TRUE
#> disable_limit_max_threads  TRUE
#> nightly                    TRUE
#> sql                        TRUE
#> rpolars_debug_print       FALSE
#>
#> Code completion: deactivated
arrow_info()
#> Arrow package version: 15.0.1
#>
#> Capabilities:
#>
#> acero      TRUE
#> dataset    TRUE
#> substrait FALSE
#> parquet    TRUE
#> json       TRUE
#> s3         TRUE
#> gcs        TRUE
#> utf8proc   TRUE
#> re2        TRUE
#> snappy     TRUE
#> gzip       TRUE
#> brotli     TRUE
#> zstd       TRUE
#> lz4        TRUE
#> lz4_frame  TRUE
#> lzo       FALSE
#> bz2        TRUE
#> jemalloc   TRUE
#> mimalloc   TRUE
#>
#> Memory:
#>
#> Allocator jemalloc
#> Current    0 bytes
#> Max        0 bytes
#>
#> Runtime:
#>
#> SIMD Level          avx2
#> Detected SIMD Level avx2
#>
#> Build:
#>
#> C++ Library Version  15.0.1
#> C++ Compiler            GNU
#> C++ Compiler Version 11.4.0

big_at =  do.call(rbind, lapply(1:5, \(x) as_arrow_table(nycflights13::flights)))

bench::mark(
  rechunk = as_polars_df(big_at, rechunk = TRUE),
  not_rechunk = as_polars_df(big_at, rechunk = FALSE),
  via_nanoarrow = as_polars_df(as_nanoarrow_array_stream(big_at)),
  via_r = as_polars_df(as.data.frame(big_at)),
  check = FALSE,
  min_iterations = 5
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 rechunk       244.69ms 291.57ms     3.43     9.69MB    0.686
#> 2 not_rechunk   251.56ms 275.39ms     3.42    47.88KB    0.684
#> 3 via_nanoarrow 184.96ms 203.89ms     4.65   126.65KB    0
#> 4 via_r            1.55s    1.77s     0.575  167.48MB    0.920

^{Created on 2024-05-06 with reprex v2.1.0}

eitsupi · 2024-05-06T07:32:39Z

Benchmarking showed no difference in speed between the existing conversion copied from Python Polars and the conversion via the new Arrow C Stream interface, so I changed the conversion from arrow::Table to using C Stream interface and removed functions that were no longer needed in 350689c.

eitsupi · 2024-05-06T07:56:20Z

I don't know why, but it seems that rechunking is automatically done when a Struct type Stream is converted to a Series?

library(nanoarrow)
library(polars)

s_int <- basic_array_stream(list(as_nanoarrow_array(1:5), as_nanoarrow_array(6:10))) |>
  as_polars_series()

s_struct <- basic_array_stream(list(as_nanoarrow_array(mtcars[1:5, ]), as_nanoarrow_array(mtcars[6:10, ]))) |>
  as_polars_series()

s_int$n_chunks()
#> [1] 2
s_struct$n_chunks()
#> [1] 1

^{Created on 2024-05-06 with reprex v2.1.0}

eitsupi · 2024-05-06T08:03:06Z

Using the internal functions of the release version, rechunking has not been done, so something could be wrong.......

r$> library(nanoarrow)

r$> s_struct <- basic_array_stream(list(as_nanoarrow_array(mtcars[1:5, ]), as_nanoarrow_array(mtcars[6:10, ])))

r$> p <- polars:::new_arrow_stream()

r$> nanoarrow_pointer_export(s_struct, p)

r$> polars:::arrow_stream_to_series(p) -> out

r$> out
$ok
polars Series: shape: (10,)
Series: 'df' [struct[11]]
[
        {21.0,6.0,160.0,110.0,3.9,2.62,16.46,0.0,1.0,4.0,4.0}
        {21.0,6.0,160.0,110.0,3.9,2.875,17.02,0.0,1.0,4.0,4.0}
        {22.8,4.0,108.0,93.0,3.85,2.32,18.61,1.0,1.0,4.0,1.0}
        {21.4,6.0,258.0,110.0,3.08,3.215,19.44,1.0,0.0,3.0,1.0}
        {18.7,8.0,360.0,175.0,3.15,3.44,17.02,0.0,0.0,3.0,2.0}
        {18.1,6.0,225.0,105.0,2.76,3.46,20.22,1.0,0.0,3.0,1.0}
        {14.3,8.0,360.0,245.0,3.21,3.57,15.84,0.0,0.0,3.0,4.0}
        {24.4,4.0,146.7,62.0,3.69,3.19,20.0,1.0,0.0,4.0,2.0}
        {22.8,4.0,140.8,95.0,3.92,3.15,22.9,1.0,0.0,4.0,2.0}
        {19.2,6.0,167.6,123.0,3.92,3.44,18.3,1.0,0.0,4.0,4.0}
]

$err
NULL

attr(,"class")
[1] "extendr_result"

r$> out$ok$n_chunks()
[1] 2

eitsupi · 2024-05-06T08:12:17Z

Using the internal functions of the release version, rechunking has not been done, so something could be wrong.......

I think this is because the conversion was done on a chunk-by-chunk basis to Series, but now it is done in chunks, which is not a problem considering that Python Polars also do the conversion in chunks.
https://github.com/pola-rs/polars/blob/ae66acd30bcd453f767639463d89f1a10e70f061/py-polars/src/series/c_interface.rs#L27-L29

etiennebacher

I didn't follow your recent changes very closely, but based on your benchmarks above it seems all conversions have slowed down compared to the release version. Is this expected?

tests/testthat/test-as_polars.R

eitsupi · 2024-05-06T12:21:17Z

Thanks for your review.

your benchmarks above it seems all conversions have slowed down compared to the release version. Is this expected?

I don't know the cause either, but it may be related to the fact that the development version of the benchmark is running in a development container and the release version of the benchmark is using a different, newer container. In any case, I am running the benchmark on a machine that is simultaneously running other software such as a browser, etc. and the results seem to vary quite a bit, so I suspect it is an effect of variability.

…face" This reverts commit 350689c.

…nside the function

eitsupi · 2024-05-06T15:02:45Z

Hmmm, that is really strange the performance of the conversion from nanoarrow_array_stream has degraded.

I added a new feature in e04762c to switch the conversion method from nanoarrow_array_stream, but both are displayed at the same speed.
Could it be a performance measurement issue with the bench package?

Details

library(polars)
library(nanoarrow)
library(arrow, warn.conflicts = FALSE)

polars_info()
#> Polars R package version : 0.16.3.9000
#> Rust Polars crate version: 0.39.2
#> 
#> Thread pool size: 16 
#> 
#> Features:                               
#> default                    TRUE
#> full_features              TRUE
#> disable_limit_max_threads  TRUE
#> nightly                    TRUE
#> sql                        TRUE
#> rpolars_debug_print       FALSE
#> 
#> Code completion: deactivated
arrow_info()
#> Arrow package version: 15.0.1
#> 
#> Capabilities:
#>                
#> acero      TRUE
#> dataset    TRUE
#> substrait FALSE
#> parquet    TRUE
#> json       TRUE
#> s3         TRUE
#> gcs        TRUE
#> utf8proc   TRUE
#> re2        TRUE
#> snappy     TRUE
#> gzip       TRUE
#> brotli     TRUE
#> zstd       TRUE
#> lz4        TRUE
#> lz4_frame  TRUE
#> lzo       FALSE
#> bz2        TRUE
#> jemalloc   TRUE
#> mimalloc   TRUE
#> 
#> Memory:
#>                   
#> Allocator jemalloc
#> Current    0 bytes
#> Max        0 bytes
#> 
#> Runtime:
#>                         
#> SIMD Level          avx2
#> Detected SIMD Level avx2
#> 
#> Build:
#>                            
#> C++ Library Version  15.0.1
#> C++ Compiler            GNU
#> C++ Compiler Version 11.4.0

big_at =  do.call(rbind, lapply(1:5, \(x) as_arrow_table(nycflights13::flights)))

bench::mark(
  not_rechunk = as_polars_df(big_at, rechunk = FALSE),
  rechunk = as_polars_df(big_at, rechunk = TRUE),
  via_nanoarrow_c_stream = as_polars_df(as_nanoarrow_array_stream(big_at, experimental = TRUE)),
  via_nanoarrow = as_polars_df(as_nanoarrow_array_stream(big_at)),
  via_r = as_polars_df(as.data.frame(big_at)),
  check = FALSE,
  min_iterations = 5
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 5 × 6
#>   expression                  min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>             <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 not_rechunk             336.7ms 366.83ms     2.61     8.62MB    0.521
#> 2 rechunk                 238.9ms 278.98ms     3.64     1.12MB    0    
#> 3 via_nanoarrow_c_stream  122.8ms 130.09ms     7.57   159.72KB    1.51 
#> 4 via_nanoarrow           110.3ms 114.75ms     8.06    14.14KB    0    
#> 5 via_r                      1.5s    1.69s     0.589  167.48MB    0.707

Re ArrowTabular, I think it's because they no longer parallelize for the conversion from ArrowTabular.
https://github.com/pola-rs/polars/blob/575e91759f6a6a720cc0e1e50b97afac436a0240/py-polars/src/arrow_interop/to_rust.rs#L73-L87

I think we need to add this parallelization process to the C stream interface.

eitsupi · 2024-05-07T14:12:06Z

In the latest commit e3b9567, an additional experimental = TRUE option allows switching between the traditional implementation and conversion via the Arrow C stream interface.

Details

library(polars)
library(nanoarrow)
library(arrow, warn.conflicts = FALSE)

polars_info()
#> Polars R package version : 0.16.3.9000
#> Rust Polars crate version: 0.39.2
#>
#> Thread pool size: 16
#>
#> Features:
#> default                    TRUE
#> full_features              TRUE
#> disable_limit_max_threads  TRUE
#> nightly                    TRUE
#> sql                        TRUE
#> rpolars_debug_print       FALSE
#>
#> Code completion: deactivated
arrow_info()
#> Arrow package version: 15.0.1
#>
#> Capabilities:
#>
#> acero      TRUE
#> dataset    TRUE
#> substrait FALSE
#> parquet    TRUE
#> json       TRUE
#> s3         TRUE
#> gcs        TRUE
#> utf8proc   TRUE
#> re2        TRUE
#> snappy     TRUE
#> gzip       TRUE
#> brotli     TRUE
#> zstd       TRUE
#> lz4        TRUE
#> lz4_frame  TRUE
#> lzo       FALSE
#> bz2        TRUE
#> jemalloc   TRUE
#> mimalloc   TRUE
#>
#> Memory:
#>
#> Allocator jemalloc
#> Current    0 bytes
#> Max        0 bytes
#>
#> Runtime:
#>
#> SIMD Level          avx2
#> Detected SIMD Level avx2
#>
#> Build:
#>
#> C++ Library Version  15.0.1
#> C++ Compiler            GNU
#> C++ Compiler Version 11.4.0

big_at =  do.call(rbind, lapply(1:5, \(x) as_arrow_table(nycflights13::flights)))

bench::mark(
  not_rechunk = as_polars_df(big_at, rechunk = FALSE),
  not_rechunk_cs = as_polars_df(big_at, rechunk = FALSE, experimental = TRUE),
  rechunk = as_polars_df(big_at, rechunk = TRUE),
  rechunk_cs = as_polars_df(big_at, rechunk = TRUE, experimental = TRUE),
  via_nanoarrow = as_polars_df(as_nanoarrow_array_stream(big_at)),
  via_nanoarrow_cs = as_polars_df(as_nanoarrow_array_stream(big_at), experimental = TRUE),
  via_r = as_polars_df(as.data.frame(big_at)),
  check = FALSE,
  min_iterations = 10
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 7 × 6
#>   expression            min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 not_rechunk      170.49ms 184.87ms     5.03     6.62MB    2.01
#> 2 not_rechunk_cs   227.17ms 294.13ms     3.52     3.44MB    0.705
#> 3 rechunk          170.82ms 192.14ms     5.12     1.28MB    2.05
#> 4 rechunk_cs       277.19ms 326.41ms     2.85    52.02KB    0.571
#> 5 via_nanoarrow    120.08ms 151.54ms     6.56   159.72KB    0
#> 6 via_nanoarrow_cs 164.08ms 350.81ms     3.11    14.14KB    0
#> 7 via_r               1.58s    1.79s     0.491  167.48MB    0.834

^{Created on 2024-05-07 with reprex v2.1.0}

In comparison, the conventional implementation is clearly faster.

However, the conversion via nanoarrow is much faster.......

@etiennebacher Could you please check the performance?

I think it is possible that the slow conversion via the C stream interface could be improved upstream; there is a large demand in Python to implement the C stream interface to remove the pyarrow dependency (pola-rs/polars#12530), which could be improved in the future.
So we can keep both the traditional implementation and the conversion via the C stream interface for now, and remove the traditional implementation in the future when there is no performance difference.

src/rust/src/arrow_interop/mod.rs

etiennebacher · 2024-05-07T21:10:47Z

Here's what I get:

 library(polars)
 library(nanoarrow)
 library(arrow, warn.conflicts = FALSE)

 big_at =  do.call(rbind, lapply(1:5, \(x) as_arrow_table(nycflights13::flights)))
 
 bench::mark(
   not_rechunk = as_polars_df(big_at, rechunk = FALSE),
   not_rechunk_cs = as_polars_df(big_at, rechunk = FALSE, experimental = TRUE),
   rechunk = as_polars_df(big_at, rechunk = TRUE),
   rechunk_cs = as_polars_df(big_at, rechunk = TRUE, experimental = TRUE),
   via_nanoarrow = as_polars_df(as_nanoarrow_array_stream(big_at)),
   via_nanoarrow_cs = as_polars_df(as_nanoarrow_array_stream(big_at), experimental = TRUE),
   via_r = as_polars_df(as.data.frame(big_at)),
   check = FALSE,
   min_iterations = 50
 )
#> # A tibble: 7 × 6
#>   expression            min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 not_rechunk        81.1ms 183.87ms     5.70      6.5MB   4.86  
#> 2 not_rechunk_cs   268.72ms 396.62ms     2.55     3.46MB   0.559 
#> 3 rechunk          184.21ms 202.18ms     4.68     1.28MB   4.32  
#> 4 rechunk_cs       241.51ms 408.14ms     2.54    52.02KB   0.636 
#> 5 via_nanoarrow    159.91ms  202.8ms     4.01   159.72KB   0.0819
#> 6 via_nanoarrow_cs 183.72ms 202.39ms     3.16    14.14KB   0.0644
#> 7 via_r               1.51s    1.57s     0.643  167.48MB  16.1

For some reason, via_r takes 2x longer as before, so if I apply the same to the other results I get results in the same ballpark as with the the last released version. Here are the results for the last released version (those are the same as in my post above, I just put them again for easier comparison):

#> # A tibble: 4 × 6
#>   expression         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 rechunk         84.8ms  116.3ms      7.87    6.43MB   7.26  
#> 2 not_rechunk     73.5ms   96.4ms     10.0   229.05KB   9.25  
#> 3 via_nanoarrow   66.6ms    131ms      3.43    1.75MB   0.0699
#> 4 via_r          566.3ms  707.9ms      1.31  167.41MB  24.9

eitsupi · 2024-05-07T22:20:49Z

@etiennebacher Thanks for checking!
Do you think it is ok to merge since the results are as expected?

…>) works

etiennebacher

Yes I think it's fine, thanks!

NEWS.md

etiennebacher · 2024-05-08T07:58:36Z

R/construction.R

@@ -9,9 +9,11 @@
 #' If schema names or types do not match `x`, the columns will be renamed/recast.
 #' If `NULL` (default), convert columns as is.
 #' @param schema_overrides named list of DataTypes. Cast some columns to the DataType.
+#' @param experimental If `TRUE`, use the Arrow C stream interface.


Can you inherit that param from as_polars_series()?

Good point.
This is a internal function and doesn't have Rd file, so roxygen comment inheritance does not make sense.

tests/testthat/test-arrow-c-interface.R

eitsupi · 2024-05-08T13:12:03Z

Another interesting result. If only zero-copy types (here float64) are included and there are many chunks, it is much faster via the C stream interface, as expected.

library(polars)
library(nanoarrow)
library(arrow, warn.conflicts = FALSE)

big_at =  do.call(rbind, lapply(1:100, \(x) as_arrow_table(mtcars)))

bench::mark(
  not_rechunk = as_polars_df(big_at, rechunk = FALSE),
  not_rechunk_cs = as_polars_df(big_at, rechunk = FALSE, experimental = TRUE),
  rechunk = as_polars_df(big_at, rechunk = TRUE),
  rechunk_cs = as_polars_df(big_at, rechunk = TRUE, experimental = TRUE),
  via_nanoarrow = as_polars_df(as_nanoarrow_array_stream(big_at)),
  via_nanoarrow_cs = as_polars_df(as_nanoarrow_array_stream(big_at), experimental = TRUE),
  via_r = as_polars_df(as.data.frame(big_at)),
  check = FALSE,
  min_iterations = 10
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 7 × 6
#>   expression            min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 not_rechunk      961.75ms    1.04s     0.930   10.28MB     3.81
#> 2 not_rechunk_cs    35.19ms  40.13ms    23.3      3.42MB     3.88
#> 3 rechunk             1.05s    1.12s     0.898    3.25MB     3.59
#> 4 rechunk_cs        41.08ms  46.23ms    20.2     33.41KB     3.68
#> 5 via_nanoarrow     80.15ms 100.23ms     7.83   199.01KB     1.57
#> 6 via_nanoarrow_cs   6.61ms   9.77ms    89.0       4.3KB     1.98
#> 7 via_r             14.42ms  18.84ms    50.4    614.66KB     1.94

^{Created on 2024-05-08 with reprex v2.1.0}

eitsupi added 2 commits May 5, 2024 15:23

refactor: import_stream method for Series

98a81ec

test: add test

16ec40f

eitsupi changed the title ~~refactor: import_stream method for Series~~ refactor: import_stream method for Series, and use that inside of as_polars_series() and as_polars_df() May 5, 2024

eitsupi changed the title ~~refactor: import_stream method for Series, and use that inside of as_polars_series() and as_polars_df()~~ refactor: import_stream method for Series, and use that inside of as_polars_series() and as_polars_df() for nanoarrow_array_stream May 5, 2024

eitsupi added 3 commits May 6, 2024 01:08

test: more tests for nanoarrow

9f095c1

refactor: handle name in rust side

2b976c1

refactor!: use the Arrow C Stream interface inside `as_polars_df(<arr…

c6b4ee7

…ow::RecordBatchReader>)` and remove unused internal functions

eitsupi changed the title ~~refactor: import_stream method for Series, and use that inside of as_polars_series() and as_polars_df() for nanoarrow_array_stream~~ refactor!: import_stream method for Series, and use that inside of as_polars_series() and as_polars_df() May 6, 2024

eitsupi mentioned this pull request May 6, 2024

feat: Support for Binaryview and StringView types apache/arrow-nanoarrow#367

Closed

refactor: simplify type check for struct nanoarrow_array_stream

8de1e21

eitsupi changed the title ~~refactor!: import_stream method for Series, and use that inside of as_polars_series() and as_polars_df()~~ refactor: import_stream method for Series, and use that inside of as_polars_series() and as_polars_df() May 6, 2024

feat: as_polars_df for arrow::RecordBatchReader

bbba938

eitsupi changed the title ~~refactor: import_stream method for Series, and use that inside of as_polars_series() and as_polars_df()~~ refactor!: import_stream method for Series, and use that inside of as_polars_series() and as_polars_df() May 6, 2024

refactor!: rewrite as_polars_df to use Arrow C Stream interface

350689c

eitsupi requested a review from etiennebacher May 6, 2024 07:33

eitsupi added 2 commits May 6, 2024 08:13

docs(news): add bluets

e4fdaad

Merge remote-tracking branch 'upstream/main' into import_stream

f7d917a

eitsupi marked this pull request as ready for review May 6, 2024 08:14

eitsupi requested a review from sorhawell May 6, 2024 08:18

test: ensure auto rechunk

a70fbf6

eitsupi mentioned this pull request May 6, 2024

Building a rust binding R package depends on this r-polars #808

Open

etiennebacher reviewed May 6, 2024

View reviewed changes

tests/testthat/test-as_polars.R Outdated Show resolved Hide resolved

eitsupi mentioned this pull request May 6, 2024

Prevent unnamed non-Field elements in pl$Struct() #1053

Merged

test: add comment about the test case

a543e95

eitsupi changed the title ~~refactor!: import_stream method for Series, and use that inside of as_polars_series() and as_polars_df()~~ refactor!: import_stream method for Series May 6, 2024

eitsupi changed the title ~~refactor!: import_stream method for Series~~ refactor: import_stream method for Series May 6, 2024

eitsupi added 2 commits May 6, 2024 14:19

Revert "refactor!: rewrite as_polars_df to use Arrow C Stream inter…

fa5e600

…face" This reverts commit 350689c.

feat: add the experimental argument to use the import_stream method i…

e04762c

…nside the function

eitsupi added 5 commits May 7, 2024 13:18

feat: re add $from_arrow_record_batches

fe06f29

feat: add experimental option to use C stream interface

2fde92a

test: re add tests for as_polars_df(<ArrowTabular>)

303ea0b

test: update snapshot

89fc55c

chore: auto formatting

e3b9567

docs(news): update NEWS about C stream interface

5a97690

eitsupi mentioned this pull request May 7, 2024

Arrow PyCapsule Interface support pola-rs/polars#12530

Open

3 tasks

eitsupi marked this pull request as ready for review May 7, 2024 14:36

pitrou reviewed May 7, 2024

View reviewed changes

src/rust/src/arrow_interop/mod.rs Outdated Show resolved Hide resolved

refactor: simplify

2b5e8a6

fix: experimental argument for as_polars_df(<nanoarrow_array_stream…

038afb6

…>) works

etiennebacher approved these changes May 8, 2024

View reviewed changes

docs(news): more notes about the Arrow C stream interface

b2277fb

eitsupi changed the title ~~refactor: import_stream method for Series~~ feat: import_stream internal method for Series to support Arrow C stream interface May 8, 2024

eitsupi added 3 commits May 8, 2024 13:31

test: skip judges should be done inside .cases

db284af

refactor: simplify

ccb51e7

test: add more tests for the experimental argument

b6cc521

eitsupi merged commit b42ee0a into main May 8, 2024
31 of 33 checks passed

eitsupi deleted the import_stream branch May 8, 2024 14:28

eitsupi mentioned this pull request May 26, 2024

Implement the Arrow C stream interface for Series #1081

Closed

eitsupi mentioned this pull request Jul 18, 2024

feat(python): Support PyCapsule Interface in DataFrame & Series constructors pola-rs/polars#17693

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: import_stream internal method for Series to support Arrow C stream interface #1078

feat: import_stream internal method for Series to support Arrow C stream interface #1078

eitsupi commented May 5, 2024 •

edited

Loading

eitsupi commented May 6, 2024 •

edited

Loading

eitsupi commented May 6, 2024

eitsupi commented May 6, 2024

eitsupi commented May 6, 2024

eitsupi commented May 6, 2024

etiennebacher left a comment

eitsupi commented May 6, 2024

eitsupi commented May 6, 2024 •

edited

Loading

eitsupi commented May 7, 2024 •

edited

Loading

etiennebacher commented May 7, 2024 •

edited

Loading

eitsupi commented May 7, 2024

etiennebacher left a comment

etiennebacher May 8, 2024

eitsupi May 8, 2024

eitsupi commented May 8, 2024

feat: import_stream internal method for Series to support Arrow C stream interface #1078

feat: import_stream internal method for Series to support Arrow C stream interface #1078

Conversation

eitsupi commented May 5, 2024 • edited Loading

eitsupi commented May 6, 2024 • edited Loading

eitsupi commented May 6, 2024

eitsupi commented May 6, 2024

eitsupi commented May 6, 2024

eitsupi commented May 6, 2024

etiennebacher left a comment

Choose a reason for hiding this comment

eitsupi commented May 6, 2024

eitsupi commented May 6, 2024 • edited Loading

eitsupi commented May 7, 2024 • edited Loading

etiennebacher commented May 7, 2024 • edited Loading

eitsupi commented May 7, 2024

etiennebacher left a comment

Choose a reason for hiding this comment

etiennebacher May 8, 2024

Choose a reason for hiding this comment

eitsupi May 8, 2024

Choose a reason for hiding this comment

eitsupi commented May 8, 2024

eitsupi commented May 5, 2024 •

edited

Loading

eitsupi commented May 6, 2024 •

edited

Loading

eitsupi commented May 6, 2024 •

edited

Loading

eitsupi commented May 7, 2024 •

edited

Loading

etiennebacher commented May 7, 2024 •

edited

Loading