-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: some improves of as_polars_df
#896
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, do you want to bump NEWS to add a note on improved performance?
#' Should match the number of columns in `x` and correspond to each column in `x` by position. | ||
#' If a column in `x` does not match the name or type at the same position, it will be renamed/recast. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow I didn't notice that it would automatically rename / recast but that seems like a really bad behavior:
library(polars)
df = data.frame(a = 1:3, b = 4:6)
as_polars_df(df, schema = list(b = pl$String, y = pl$Int32))
#> shape: (3, 2)
#> ┌─────┬─────┐
#> │ b ┆ y │
#> │ --- ┆ --- │
#> │ str ┆ i32 │
#> ╞═════╪═════╡
#> │ 1 ┆ 4 │
#> │ 2 ┆ 5 │
#> │ 3 ┆ 6 │
#> └─────┴─────┘
This doesn't work in py-polars:
test = pl.DataFrame(
{
"a": [1, 2, 3],
"b": [4, 5, 6],
},
schema={"b": pl.String, "y": pl.Int32},
)
ValueError: the given column-schema names do not match the data dictionary
pl.DataFrame(
{
"a": [1, 2, 3],
"b": [4, 5, 6],
},
schema_overrides={"b": pl.String, "y": pl.Int32},
)
shape: (3, 2)
┌─────┬──────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪══════╡
│ 1 ┆ null │
│ 2 ┆ null │
│ 3 ┆ null │
└─────┴──────┘
This wasn't introduced in this PR so I don't think it's a blocker, but it should definitely be fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The schema
argument is from polars.from_arrow
of Python, which I believe works the same way polars.from_arrow
in Python given that it copies the complete logic from that (I also think this is bad behavior and could be removed at some point).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like these:
>>> import polars as pl
>>> import pyarrow as pa
>>> data = pa.table({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> pl.from_arrow(data, schema=[("b", pl.String), ("y", pl.Int32)])
shape: (3, 2)
┌─────┬─────┐
│ b ┆ y │
│ --- ┆ --- │
│ str ┆ i32 │
╞═════╪═════╡
│ 1 ┆ 4 │
│ 2 ┆ 5 │
│ 3 ┆ 6 │
└─────┴─────┘
>>> pl.from_arrow(data, schema={"b": pl.String, "y": pl.Int32})
shape: (3, 2)
┌─────┬─────┐
│ b ┆ y │
│ --- ┆ --- │
│ str ┆ i32 │
╞═════╪═════╡
│ 1 ┆ 4 │
│ 2 ┆ 5 │
│ 3 ┆ 6 │
└─────┴─────┘
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hum, so I guess they need to harmonize the behavior upstream between pl.DataFrame
and pl.from_arrow
then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is quite surprising that this behavior is performed, especially when specified with the dict type.
In general, order in dict is not guaranteed, so processing based on order is not a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hum, so I guess they need to harmonize the behavior upstream between
pl.DataFrame
andpl.from_arrow
then?
Not sure about that.
Perhaps the schema
is intended for Series rather than DataFrame (e.g., pyarrow.Array
or something has no name), since it is guaranteed that the column names already exist in the arrow::Table
for as_polars_df
, so we can simply remove the schema
argument from here and use only schema_override
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, there is no user demand to change column names in as_polars_df()
.
So simply removing the schema
argument and leaving only schema_override
would be sufficient.
In terms of type change, the schema
argument is more difficult to use than schema_override
in that all columns must be specified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, do you want to bump NEWS to add a note on improved performance?
as_polars_df(<nanoarrow_array_stream>)
(Fixas_polars_df()
fornanoarrow_array_stream
seems slow #893)as_polars_df(<nanoarrow_array>)
(Close Rewriteas_polars_df.nanoarrow_array
#755)as_polars_df(<data.frame>)
and fix some test and docs.The performance of the conversion from
nanoarrow_array_stream
is as follows, slower thanas_arrow_table
but faster thanas_tibble
, possibly due to the String type conversion. (And I am surprised thatas_tibble
is so fast)Created on 2024-03-05 with reprex v2.0.2