
feat: support as_polars_* for nanoarrow package objects #730

Merged (24 commits from from-nanoarrow into main, Jan 30, 2024)
Conversation

@eitsupi (Collaborator) commented Jan 22, 2024

Part of #497

@eitsupi (Collaborator, Author) commented Jan 22, 2024

@paleolimbot Could you take a look at this?
as_polars_series.nanoarrow_array seems to work in a simple example, but is there anything else I should test?

@paleolimbot (Contributor) left a comment

Very cool!

No need to do so right now, but if I remember correctly, a Series comes in chunks, so you could in theory convert a nanoarrow_array_stream() into one. The nanoarrow release that's about to go out adds an as_nanoarrow_array() method for ChunkedArray for exactly that purpose.

Review thread on src/rust/src/arrow_interop/to_rust.rs (outdated, resolved)
@eitsupi (Collaborator, Author) commented Jan 23, 2024

@paleolimbot Thanks for the review!
Currently the function for converting chunked arrays to Series seems to be written entirely for the arrow package, so it looks like it needs to be extended for nanoarrow_array_stream.

r-polars/R/construction.R

Lines 143 to 189 in 1a6521c

#' Internal function of `as_polars_series()` for `arrow::Array` and `arrow::ChunkedArray` class objects.
#'
#' This is a copy of Python Polars' `arrow_to_pyseries` function.
#' @noRd
#' @return A result that includes RPolarsSeries
arrow_to_rseries_result = function(name, values, rechunk = TRUE) {
  ## must rechunk
  array = coerce_arrow(values)
  # special handling of empty categorical arrays
  if (
    (length(array) == 0L) &&
      is_arrow_dictonary(array) &&
      array$type$value_type %in_list% list(arrow::utf8(), arrow::large_utf8())
  ) {
    res = Ok(pl$lit(c())$cast(pl$Categorical)$to_series())
  } else if (is.null(array$num_chunks)) {
    res = .pr$Series$from_arrow(name, array)
  } else {
    if (array$num_chunks > 1) {
      if (is_arrow_dictonary(array)) {
        res = .pr$Series$from_arrow(name, arrow::as_arrow_array(array))
      } else {
        chunks = array$chunks
        res = .pr$Series$from_arrow(name, chunks[[1]])
        for (chunk in chunks[-1L]) {
          res = and_then(res, \(s) {
            .pr$Series$append_mut(s, unwrap(.pr$Series$from_arrow(name, chunk))) |> map(\(x) s)
          })
        }
        res
      }
    } else if (array$num_chunks == 0L) {
      res = .pr$Series$from_arrow(name, arrow::Array$create(NULL)$cast(array$type))
    } else {
      res = .pr$Series$from_arrow(name, array$chunk(0L))
    }
  }
  if (rechunk) {
    res = res |> map(\(s) {
      wrap_e(s)$rechunk()$to_series()
    })
  }
  res
}

By the way, is there any way to get a nanoarrow type as a string representation?
Something like the following:

> arrow::int16()$ToString()
[1] "int16"

Such functionality would simplify the code in the following area when the arrow package is not installed.

r-polars/R/construction.R

Lines 130 to 133 in 1a6521c

non_ideal_idx_types = list(
  arrow::int8(), arrow::uint8(), arrow::int16(),
  arrow::uint16(), arrow::int32()
)
if (arr$type$index_type %in_list% non_ideal_idx_types) {

From the PR diff:

non_ideal_idx_types = c("int8", "uint8", "int16", "uint16", "int32")
@paleolimbot (Contributor) commented:

Should int64 and uint64 be in here, too?

@eitsupi (Collaborator, Author) replied:

This is a copy of https://github.com/pola-rs/polars/blob/25537c2d3f83422790fde50c6c7971e906f238e4/py-polars/polars/utils/_construction.py#L1790-L1806, and it says:

small integer keys can often not be combined, so let's already cast to the uint32 used by polars

So, it seems intended.

@paleolimbot (Contributor) replied:

I definitely may be misunderstanding the code, but in Arrow you could theoretically have an int64 or uint64 index also (i.e., if you're headed from arrow -> Polars, you might want to cast those too)

@eitsupi (Collaborator, Author) replied:

I have created an issue for this: #752

@eitsupi (Collaborator, Author) commented Jan 29, 2024

Hmmm, I did a little research and it seems that Polars does not have an API for creating Series from array streams.

@paleolimbot (Contributor) replied:
Yes, it's not well-supported at the moment anywhere (I have a PR in to Arrow C++ to add support for ChunkedArray). If there is a way to create a Series from a vector of series, you could use nanoarrow::collect_array_stream() (which will give you a list() of nanoarrow_array).

@eitsupi (Collaborator, Author) commented Jan 29, 2024

Thanks for your response.
Yeah, copying your as_chunked_array.nanoarrow_array_stream seems like a good idea for now.

list_of_arrays = nanoarrow::collect_array_stream(x, validate = FALSE)

if (length(list_of_arrays) < 1L) {
  out = pl$Series(NULL, name = name)
@paleolimbot (Contributor) commented on the lines above:

A zero-size list of arrays should probably keep the type? You can get the type before collect_array_stream() with x$get_schema().

@@ -1252,7 +1252,9 @@ RPolarsSeries$to_frame <- function() .Call(wrap__RPolarsSeries__to_frame, self)

RPolarsSeries$set_sorted_mut <- function(descending) invisible(.Call(wrap__RPolarsSeries__set_sorted_mut, self, descending))

RPolarsSeries$from_arrow <- function(name, array) .Call(wrap__RPolarsSeries__from_arrow, name, array)
RPolarsSeries$from_arrow_array_stream_str <- function(name, robj_str) .Call(wrap__RPolarsSeries__from_arrow_array_stream_str, name, robj_str)
@eitsupi (Collaborator, Author) commented on the diff above:

I was going to use this as the inside of as_polars_series, but now I can't because it causes a segmentation fault.
See #732 (comment).

Comment on lines +211 to +245
as_polars_df.nanoarrow_array_stream = function(x, ...) {
  on.exit(x$release())

  if (!inherits(nanoarrow::infer_nanoarrow_ptype(x$get_schema()), "data.frame")) {
    stop("Can't convert non-struct array stream to RPolarsDataFrame")
  }

  list_of_struct_arrays = nanoarrow::collect_array_stream(x, validate = FALSE)
  if (length(list_of_struct_arrays)) {
    data_cols = list()

    struct_array = list_of_struct_arrays[[1L]]
    list_of_arrays = struct_array$children
    col_names = names(list_of_arrays)

    for (i in seq_along(list_of_arrays)) {
      data_cols[[col_names[i]]] = as_polars_series.nanoarrow_array(list_of_arrays[[i]])
    }

    for (struct_array in list_of_struct_arrays[-1L]) {
      list_of_arrays = struct_array$children
      col_names = names(list_of_arrays)
      for (i in seq_along(list_of_arrays)) {
        .pr$Series$append_mut(data_cols[[col_names[i]]], as_polars_series.nanoarrow_array(list_of_arrays[[i]])) |>
          unwrap("in as_polars_df(<nanoarrow_array_stream>):")
      }
    }

    out = do.call(pl$DataFrame, data_cols)
  } else {
    out = pl$DataFrame() # TODO: support creating 0-row DataFrame
  }

  out
}
@eitsupi (Collaborator, Author) commented:

I think we should add as_polars_df.nanoarrow_array and use that inside of as_polars_df.nanoarrow_array_stream. (concat all DataFrames)

May require the vstack method.
https://stackoverflow.com/questions/71654966/how-can-i-append-or-concatenate-two-dataframes-in-python-polars

@paleolimbot (Contributor) replied:

I am not sure that you want to force a concatenation until absolutely necessary!

@eitsupi (Collaborator, Author) replied:

I am not sure that you want to force a concatenation until absolutely necessary!

Sorry, I didn't understand what you meant.

I am currently concatenating the chunks of each column into a Series and then assembling the DataFrame, but I was wondering whether the order could be reversed: build a DataFrame from each chunk and then concatenate the DataFrames.

@paleolimbot (Contributor) replied:

I see... I think that's fine! I don't know the Polars details well enough. It would be optimal (but not required) if, for example, importing a ChunkedArray composed of 1000 chunks produced a Polars Series that also has 1000 chunks (I think that's a thing).

@eitsupi (Collaborator, Author) commented Jan 31, 2024:

Yes, I will modify the function to control such behavior with the rechunk option like this.

arrow_to_rdf = function(at, schema = NULL, schema_overrides = NULL, rechunk = TRUE) {

@eitsupi (Collaborator, Author) commented Jan 30, 2024

I feel that no further work is needed on the Rust side, so I will merge this as soon as possible and address the remaining review comments in follow-up PRs.
(The NEWS file has not been updated yet; it will be updated once the methods are fully implemented in follow-up PRs.)

@eitsupi eitsupi marked this pull request as ready for review January 30, 2024 15:52
@etiennebacher (Collaborator) left a comment

I'm not familiar with nanoarrow, so I can't tell whether this is correct, but it is tested, so I suppose it's fine.

@eitsupi eitsupi merged commit 7cecb9c into main Jan 30, 2024
31 checks passed
@eitsupi eitsupi deleted the from-nanoarrow branch January 30, 2024 23:21
@eitsupi (Collaborator, Author) commented Jan 30, 2024

At some stage I would like to write an article on exchanging data between packages via Arrow.
Python Polars has ADBC integration (via pyarrow), but we can do it here via nanoarrow. The R arrow package is not required.
