feat: support `as_polars_*` for nanoarrow package objects #730
Conversation
@paleolimbot Could you take a look at this?
Very cool!
No need to do so right now, but if I remember correctly, a `Series` comes in chunks and you could in theory convert a `nanoarrow_array_stream()` into one. The nanoarrow that's about to be released has an `as_nanoarrow_array()` method for a `ChunkedArray` for exactly that purpose.
@paleolimbot Thanks for the review!

Lines 143 to 189 in 1a6521c

By the way, is there any way to get the nanoarrow type as a string representation?

```r
> arrow::int16()$ToString()
[1] "int16"
```

I imagine that such functionality in the following areas would simplify the process when the arrow package is not installed.

Lines 130 to 133 in 1a6521c
```r
non_ideal_idx_types = list(
  arrow::int8(), arrow::uint8(), arrow::int16(),
  arrow::uint16(), arrow::int32()
)
if (arr$type$index_type %in_list% non_ideal_idx_types) {
```
Should int64 and uint64 be in here, too?
This is a copy of https://github.com/pola-rs/polars/blob/25537c2d3f83422790fde50c6c7971e906f238e4/py-polars/polars/utils/_construction.py#L1790-L1806, and it says:

> small integer keys can often not be combined, so let's already cast to the uint32 used by polars

So, it seems intended.
I definitely may be misunderstanding the code, but in Arrow you could theoretically have an int64 or uint64 index also (i.e., if you're headed from arrow -> Polars, you might want to cast those too)
I have created an issue for this #752
Hmmm, I did a little research and it seems that Polars does not have an API for creating Series from array streams.
Yes, it's not well-supported at the moment anywhere (I have a PR in to Arrow C++ to add support for ChunkedArray). If there is a way to create a Series from a vector of series, you could use …
Thanks for your response.
```r
list_of_arrays = nanoarrow::collect_array_stream(x, validate = FALSE)

if (length(list_of_arrays) < 1L) {
  out = pl$Series(NULL, name = name)
```
A zero-size list of arrays should probably keep the type? You can get the type before `collect_array_stream()` with `x$get_schema()`.
```diff
@@ -1252,7 +1252,9 @@ RPolarsSeries$to_frame <- function() .Call(wrap__RPolarsSeries__to_frame, self)

 RPolarsSeries$set_sorted_mut <- function(descending) invisible(.Call(wrap__RPolarsSeries__set_sorted_mut, self, descending))

 RPolarsSeries$from_arrow <- function(name, array) .Call(wrap__RPolarsSeries__from_arrow, name, array)
+RPolarsSeries$from_arrow_array_stream_str <- function(name, robj_str) .Call(wrap__RPolarsSeries__from_arrow_array_stream_str, name, robj_str)
```
I was going to use this as the inside of `as_polars_series`, but now I can't because it causes a segmentation fault. See #732 (comment).
```r
as_polars_df.nanoarrow_array_stream = function(x, ...) {
  on.exit(x$release())

  if (!inherits(nanoarrow::infer_nanoarrow_ptype(x$get_schema()), "data.frame")) {
    stop("Can't convert non-struct array stream to RPolarsDataFrame")
  }

  list_of_struct_arrays = nanoarrow::collect_array_stream(x, validate = FALSE)
  if (length(list_of_struct_arrays)) {
    data_cols = list()

    struct_array = list_of_struct_arrays[[1L]]
    list_of_arrays = struct_array$children
    col_names = names(list_of_arrays)

    for (i in seq_along(list_of_arrays)) {
      data_cols[[col_names[i]]] = as_polars_series.nanoarrow_array(list_of_arrays[[i]])
    }

    for (struct_array in list_of_struct_arrays[-1L]) {
      list_of_arrays = struct_array$children
      col_names = names(list_of_arrays)
      for (i in seq_along(list_of_arrays)) {
        .pr$Series$append_mut(data_cols[[col_names[i]]], as_polars_series.nanoarrow_array(list_of_arrays[[i]])) |>
          unwrap("in as_polars_df(<nanoarrow_array_stream>):")
      }
    }

    out = do.call(pl$DataFrame, data_cols)
  } else {
    out = pl$DataFrame() # TODO: support creating 0-row DataFrame
  }

  out
}
```
I think we should add `as_polars_df.nanoarrow_array` and use that inside of `as_polars_df.nanoarrow_array_stream` (concat all DataFrames). May require the `vstack` method.

https://stackoverflow.com/questions/71654966/how-can-i-append-or-concatenate-two-dataframes-in-python-polars
I am not sure that you want to force a concatenation until absolutely necessary!
> I am not sure that you want to force a concatenation until absolutely necessary!

Sorry, I didn't understand what you meant. Currently each column's chunks are appended into a single Series before the DataFrame is built; I was just wondering whether the order could be reversed, i.e. convert each chunk into a DataFrame first and then concatenate the DataFrames.
I see... I think that's fine! I don't know the Polars details well enough. It would be optimal (but not required) if, for example, when importing a ChunkedArray composed of 1000 chunks, the Polars representation would be a Series that also had 1000 chunks (I think that's a thing).
Yes, I will modify the function to control such behavior with the `rechunk` option, like this.

Line 14 in eb4fbcc

```r
arrow_to_rdf = function(at, schema = NULL, schema_overrides = NULL, rechunk = TRUE) {
```
I feel that no further work is needed on Rust's side, so I will merge this in as soon as possible and address the comments with follow-up PRs.
I'm not familiar with `nanoarrow`, so I can't tell whether this is correct or not, but it is tested, so I suppose it's fine.

At some stage I would like to write an article on exchanging data between packages via Arrow.
Part of #497