feat: some improves of `as_polars_df` #896

eitsupi · 2024-03-05T12:10:59Z

- Simplify as_polars_df(<nanoarrow_array_stream>) (fixes #893, "as_polars_df() for nanoarrow_array_stream seems slow")
- Add as_polars_df(<nanoarrow_array>) (closes #755, "Rewrite as_polars_df.nanoarrow_array")
- Simplify as_polars_df(<data.frame>) and fix some tests and docs.

The benchmark below shows the performance of the conversion from nanoarrow_array_stream: it is slower than as_arrow_table() but faster than as_tibble(), possibly because of the String type conversion. (And I am surprised that as_tibble is so fast.)

library(adbcdrivermanager)
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#>     timestamp
library(tibble)
library(polars)

polars_info()
#> Polars R package version : 0.15.0.9000
#> Rust Polars crate version: 0.38.1
#>
#> Thread pool size: 16
#>
#> Features:
#> default                    TRUE
#> full_features              TRUE
#> disable_limit_max_threads  TRUE
#> nightly                    TRUE
#> sql                        TRUE
#> rpolars_debug_print       FALSE
#>
#> Code completion: deactivated

db <- adbc_database_init(adbcsqlite::adbcsqlite())
con <- adbc_connection_init(db)

flights <- nycflights13::flights
flights$time_hour <- NULL
flights |>
  write_adbc(con, "flights")

query <- "SELECT * from flights"

bench::mark(
  polars_df_1 = {
    con |>
      read_adbc(query) |>
      as_polars_df()
  },
  arrow_table = {
    con |>
      read_adbc(query) |>
      as_arrow_table()
  },
  tibble = {
    con |>
      read_adbc(query) |>
      as_tibble()
  },
  polars_df_2 = {
    con |>
      read_adbc(query) |>
      as_polars_df()
  },
  check = FALSE,
  min_iterations = 5
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 polars_df_1    1.62s    1.78s    0.116     4.94MB   0.162
#> 2 arrow_table    1.25s    1.39s    0.722     1.62MB   0
#> 3 tibble          3.3s    5.39s    0.0938    46.5MB   0.0751
#> 4 polars_df_2    1.58s     3.4s    0.223   149.91KB   0.134

Created on 2024-03-05 with reprex v2.0.2

etiennebacher

Thanks, do you want to bump NEWS to add a note on improved performance?

etiennebacher · 2024-03-05T12:35:56Z

R/as_polars.R

+#' Should match the number of columns in `x` and correspond to each column in `x` by position.
+#' If a column in `x` does not match the name or type at the same position, it will be renamed/recast.


Wow, I didn't notice that it would automatically rename/recast, but that seems like really bad behavior:

library(polars)

df = data.frame(a = 1:3, b = 4:6)
as_polars_df(df, schema = list(b = pl$String, y = pl$Int32))
#> shape: (3, 2)
#> ┌─────┬─────┐
#> │ b   ┆ y   │
#> │ --- ┆ --- │
#> │ str ┆ i32 │
#> ╞═════╪═════╡
#> │ 1   ┆ 4   │
#> │ 2   ┆ 5   │
#> │ 3   ┆ 6   │
#> └─────┴─────┘

This doesn't work in py-polars:

test = pl.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [4, 5, 6],
    },
    schema={"b": pl.String, "y": pl.Int32},
)
ValueError: the given column-schema names do not match the data dictionary

pl.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [4, 5, 6],
    },
    schema_overrides={"b": pl.String, "y": pl.Int32},
)
shape: (3, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ str  │
╞═════╪══════╡
│ 1   ┆ null │
│ 2   ┆ null │
│ 3   ┆ null │
└─────┴──────┘

This wasn't introduced in this PR, so I don't think it's a blocker, but it should definitely be fixed.

The schema argument comes from Python's polars.from_arrow, and I believe it works the same way as it does there, given that it copies the complete logic from that function. (I also think this is bad behavior and could be removed at some point.)

Like these:

>>> import polars as pl
>>> import pyarrow as pa
>>> data = pa.table({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> pl.from_arrow(data, schema=[("b", pl.String), ("y", pl.Int32)])
shape: (3, 2)
┌─────┬─────┐
│ b   ┆ y   │
│ --- ┆ --- │
│ str ┆ i32 │
╞═════╪═════╡
│ 1   ┆ 4   │
│ 2   ┆ 5   │
│ 3   ┆ 6   │
└─────┴─────┘
>>> pl.from_arrow(data, schema={"b": pl.String, "y": pl.Int32})
shape: (3, 2)
┌─────┬─────┐
│ b   ┆ y   │
│ --- ┆ --- │
│ str ┆ i32 │
╞═════╪═════╡
│ 1   ┆ 4   │
│ 2   ┆ 5   │
│ 3   ┆ 6   │
└─────┴─────┘

Hum, so I guess they need to harmonize the behavior upstream between pl.DataFrame and pl.from_arrow then?

It is quite surprising that this happens, especially when the schema is specified as a dict.

In general, the order of entries in a dict is not guaranteed, so processing based on order is not a good idea.
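
To make the concern concrete, here is a minimal sketch in plain Python of what position-based matching does. It is a toy model only, not the polars implementation: `apply_schema_by_position` is a hypothetical helper, columns are modelled as a dict of lists, and the built-in `str` and `int` stand in for `pl.String` and `pl.Int32`.

```python
def apply_schema_by_position(columns, schema):
    """Toy model (hypothetical helper, not polars code): the i-th schema
    entry renames and recasts the i-th column, ignoring the original name."""
    if len(columns) != len(schema):
        raise ValueError("schema must have one entry per column")
    result = {}
    # Pair columns and schema entries purely by position.
    for values, (new_name, cast) in zip(columns.values(), schema.items()):
        result[new_name] = [cast(v) for v in values]
    return result


data = {"a": [1, 2, 3], "b": [4, 5, 6]}
by_position = apply_schema_by_position(data, {"b": str, "y": int})
print(by_position)
```

In this model the values of the original column `a` end up under the name `b` as strings, and the original `b` is renamed to `y`, which mirrors the renaming shown in the outputs above even though the schema is keyed by name.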

> Hum, so I guess they need to harmonize the behavior upstream between pl.DataFrame and pl.from_arrow then?

Not sure about that.
Perhaps the schema argument is intended for Series rather than DataFrame (e.g., a pyarrow.Array has no name). For as_polars_df, the column names are guaranteed to already exist in the arrow::Table, so we can simply remove the schema argument here and use only schema_overrides.

In my opinion, there is no user demand for changing column names in as_polars_df().
So simply removing the schema argument and leaving only schema_overrides would be sufficient.

As for changing types, the schema argument is harder to use than schema_overrides because every column must be specified.
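
For contrast, a name-based override can be sketched as follows. Again this is a toy model in plain Python and not polars code: `apply_overrides_by_name` is a hypothetical helper, and silently ignoring names that are not present in the data is an assumption made for this sketch.

```python
def apply_overrides_by_name(columns, overrides):
    """Toy model (hypothetical helper, not polars code): recast only the
    columns named in `overrides`; names and column order never change."""
    result = {}
    for name, values in columns.items():
        cast = overrides.get(name)
        # Columns without an override are copied through unchanged.
        result[name] = [cast(v) for v in values] if cast else list(values)
    return result


data = {"a": [1, 2, 3], "b": [4, 5, 6]}
overridden = apply_overrides_by_name(data, {"b": str})
print(overridden)
```

Only the column that needs a different type has to be listed, and no column is renamed.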

eitsupi added 2 commits March 5, 2024 11:19

feat: rewrite as_polars_df(<nanoarrow_array_stream>) and add as_polar…

f31f09d

…s_df(<nanoarrow_array>)

refactor: simplify as_polars_df(<data.frame>) and update docs

097af22

eitsupi requested a review from etiennebacher March 5, 2024 12:24

etiennebacher requested changes Mar 5, 2024

etiennebacher approved these changes Mar 5, 2024

docs(news): update about as_polars_df nanoarrow [skip ci]

4fb4b3e

eitsupi marked this pull request as ready for review March 5, 2024 13:28

eitsupi merged commit f56c92a into main Mar 5, 2024

eitsupi deleted the fix-from-nanoarrow branch March 5, 2024 13:32

eitsupi mentioned this pull request Mar 5, 2024

The schema argument of as_polars_df() is needed? #897

Open

eitsupi mentioned this pull request May 5, 2024

feat: import_stream internal method for Series to support Arrow C stream interface #1078

Merged
