
feat: support as_polars_* for nanoarrow package objects #730

Merged (24 commits from from-nanoarrow into main, Jan 30, 2024)
Conversation

@eitsupi (Collaborator) commented Jan 22, 2024

Part of #497

@eitsupi (Collaborator, Author) commented Jan 22, 2024

@paleolimbot Could you take a look at this?
as_polars_series.nanoarrow_array seems to work in a simple example, but is there anything else I should test?

@paleolimbot (Contributor) left a comment

Very cool!

No need to do so right now, but if I remember correctly, a Series comes in chunks, so you could in theory convert a nanoarrow_array_stream() into one. The nanoarrow release that's about to go out adds an as_nanoarrow_array() method for ChunkedArray for exactly that purpose.

Review thread on src/rust/src/arrow_interop/to_rust.rs (outdated, resolved)
@eitsupi (Collaborator, Author) commented Jan 23, 2024

@paleolimbot Thanks for the review!
Currently the function for converting chunked arrays to Series seems to be written entirely for the arrow package, so it looks like it needs to be extended for nanoarrow_array_stream.

r-polars/R/construction.R

Lines 143 to 189 in 1a6521c

#' Internal function of `as_polars_series()` for `arrow::Array` and `arrow::ChunkedArray` class objects.
#'
#' This is a copy of Python Polars' `arrow_to_pyseries` function.
#' @noRd
#' @return A result that includes RPolarsSeries
arrow_to_rseries_result = function(name, values, rechunk = TRUE) {
  ## must rechunk
  array = coerce_arrow(values)
  # special handling of empty categorical arrays
  if (
    (length(array) == 0L) &&
      is_arrow_dictonary(array) &&
      array$type$value_type %in_list% list(arrow::utf8(), arrow::large_utf8())
  ) {
    res = Ok(pl$lit(c())$cast(pl$Categorical)$to_series())
  } else if (is.null(array$num_chunks)) {
    res = .pr$Series$from_arrow(name, array)
  } else {
    if (array$num_chunks > 1) {
      if (is_arrow_dictonary(array)) {
        res = .pr$Series$from_arrow(name, arrow::as_arrow_array(array))
      } else {
        chunks = array$chunks
        res = .pr$Series$from_arrow(name, chunks[[1]])
        for (chunk in chunks[-1L]) {
          res = and_then(res, \(s) {
            .pr$Series$append_mut(s, unwrap(.pr$Series$from_arrow(name, chunk))) |> map(\(x) s)
          })
        }
        res
      }
    } else if (array$num_chunks == 0L) {
      res = .pr$Series$from_arrow(name, arrow::Array$create(NULL)$cast(array$type))
    } else {
      res = .pr$Series$from_arrow(name, array$chunk(0L))
    }
  }
  if (rechunk) {
    res = res |> map(\(s) {
      wrap_e(s)$rechunk()$to_series()
    })
  }
  res
}

By the way, is there any way to get a nanoarrow type as a string representation?
Something like the following:

> arrow::int16()$ToString()
[1] "int16"

Such functionality would simplify the code in the following area when the arrow package is not installed.

r-polars/R/construction.R

Lines 130 to 133 in 1a6521c

non_ideal_idx_types = list(
  arrow::int8(), arrow::uint8(), arrow::int16(),
  arrow::uint16(), arrow::int32()
)
if (arr$type$index_type %in_list% non_ideal_idx_types) {

From the PR diff:

non_ideal_idx_types = c("int8", "uint8", "int16", "uint16", "int32")
@paleolimbot (Contributor) commented:

Should int64 and uint64 be in here, too?

@eitsupi (Collaborator, Author) replied:

This is a copy of https://github.com/pola-rs/polars/blob/25537c2d3f83422790fde50c6c7971e906f238e4/py-polars/polars/utils/_construction.py#L1790-L1806, and it says:

small integer keys can often not be combined, so let's already cast to the uint32 used by polars

So, it seems intended.

@paleolimbot (Contributor) replied:

I definitely may be misunderstanding the code, but in Arrow you could theoretically have an int64 or uint64 index also (i.e., if you're headed from arrow -> Polars, you might want to cast those too)

@eitsupi (Collaborator, Author) replied:

I have created an issue for this: #752

@eitsupi (Collaborator, Author) commented Jan 29, 2024

Hmmm, I did a little research and it seems that Polars does not have an API for creating Series from array streams.

@paleolimbot (Contributor) replied:
Yes, it's not well-supported at the moment anywhere (I have a PR in to Arrow C++ to add support for ChunkedArray). If there is a way to create a Series from a vector of series, you could use nanoarrow::collect_array_stream() (which will give you a list() of nanoarrow_array).

@eitsupi (Collaborator, Author) commented Jan 29, 2024

Thanks for your response.
Yeah, copying your as_chunked_array.nanoarrow_array_stream seems like a good idea for now.

list_of_arrays = nanoarrow::collect_array_stream(x, validate = FALSE)

if (length(list_of_arrays) < 1L) {
  out = pl$Series(NULL, name = name)
@paleolimbot (Contributor) commented on the lines above:

A zero-size list of arrays should probably keep the type? You can get the type before collect_array_stream() with x$get_schema().

@@ -1252,7 +1252,9 @@ RPolarsSeries$to_frame <- function() .Call(wrap__RPolarsSeries__to_frame, self)

RPolarsSeries$set_sorted_mut <- function(descending) invisible(.Call(wrap__RPolarsSeries__set_sorted_mut, self, descending))

RPolarsSeries$from_arrow <- function(name, array) .Call(wrap__RPolarsSeries__from_arrow, name, array)
RPolarsSeries$from_arrow_array_stream_str <- function(name, robj_str) .Call(wrap__RPolarsSeries__from_arrow_array_stream_str, name, robj_str)
@eitsupi (Collaborator, Author) commented on the diff above:

I was going to use this as the inside of as_polars_series, but now I can't because it causes a segmentation fault.
See #732 (comment).

Comment on lines +211 to +245
as_polars_df.nanoarrow_array_stream = function(x, ...) {
  on.exit(x$release())

  if (!inherits(nanoarrow::infer_nanoarrow_ptype(x$get_schema()), "data.frame")) {
    stop("Can't convert non-struct array stream to RPolarsDataFrame")
  }

  list_of_struct_arrays = nanoarrow::collect_array_stream(x, validate = FALSE)
  if (length(list_of_struct_arrays)) {
    data_cols = list()

    struct_array = list_of_struct_arrays[[1L]]
    list_of_arrays = struct_array$children
    col_names = names(list_of_arrays)

    for (i in seq_along(list_of_arrays)) {
      data_cols[[col_names[i]]] = as_polars_series.nanoarrow_array(list_of_arrays[[i]])
    }

    for (struct_array in list_of_struct_arrays[-1L]) {
      list_of_arrays = struct_array$children
      col_names = names(list_of_arrays)
      for (i in seq_along(list_of_arrays)) {
        .pr$Series$append_mut(data_cols[[col_names[i]]], as_polars_series.nanoarrow_array(list_of_arrays[[i]])) |>
          unwrap("in as_polars_df(<nanoarrow_array_stream>):")
      }
    }

    out = do.call(pl$DataFrame, data_cols)
  } else {
    out = pl$DataFrame() # TODO: support creating 0-row DataFrame
  }

  out
}
@eitsupi (Collaborator, Author) commented:

I think we should add as_polars_df.nanoarrow_array and use that inside of as_polars_df.nanoarrow_array_stream. (concat all DataFrames)

May require the vstack method.
https://stackoverflow.com/questions/71654966/how-can-i-append-or-concatenate-two-dataframes-in-python-polars

@paleolimbot (Contributor) replied:

I am not sure that you want to force a concatenation until absolutely necessary!

@eitsupi (Collaborator, Author) replied:

I am not sure that you want to force a concatenation until absolutely necessary!

Sorry, I didn't understand what you meant.

I am currently concatenating the chunks of each column into a Series and then assembling the DataFrame, but I was wondering whether the order could be reversed: build a DataFrame from each chunk and then concatenate the DataFrames.

@paleolimbot (Contributor) replied:

I see... I think that's fine! I don't know the Polars details well enough. It would be optimal (but not required) if, for example, importing a ChunkedArray composed of 1000 chunks produced a Polars Series that also has 1000 chunks (I think that's a thing).

@eitsupi (Collaborator, Author) commented Jan 31, 2024:

Yes, I will modify the function to control such behavior with the rechunk option like this.

arrow_to_rdf = function(at, schema = NULL, schema_overrides = NULL, rechunk = TRUE) {

@eitsupi (Collaborator, Author) commented Jan 30, 2024

I feel that no further work is needed on the Rust side, so I will merge this as soon as possible and address the remaining review comments in follow-up PRs.
(The NEWS file has not been updated yet; it will be updated once the methods are fully implemented in follow-up PRs.)

@eitsupi eitsupi marked this pull request as ready for review January 30, 2024 15:52
@etiennebacher (Collaborator) left a comment

I'm not familiar with nanoarrow, so I can't tell whether this is correct, but it is tested, so I suppose it's fine.

@eitsupi eitsupi merged commit 7cecb9c into main Jan 30, 2024
31 checks passed
@eitsupi eitsupi deleted the from-nanoarrow branch January 30, 2024 23:21
@eitsupi (Collaborator, Author) commented Jan 30, 2024

At some stage I would like to write an article on exchanging data between packages via Arrow.
Python Polars has ADBC integration (via pyarrow), but we can do it here via nanoarrow. The R arrow package is not required.
