Large performance drop for string equality between 0.20.5 and 0.20.6 #14589
Labels
A-dtype-string (Area: string data type)
bug (Something isn't working)
P-medium (Priority: medium)
python (Related to Python Polars)
Checks
Reproducible example
Log output
parquet file must be read, statistics not sufficient for predicate. [repeated dozens of times]
shape: (0, 19)
┌──────┬───────┬─────┬──────────┬───┬──────────┬──────┬────────┬────────────────────────────────┐
│ year ┆ month ┆ day ┆ dep_time ┆ … ┆ distance ┆ hour ┆ minute ┆ time_hour                      │
│ ---  ┆ ---   ┆ --- ┆ ---      ┆   ┆ ---      ┆ ---  ┆ ---    ┆ ---                            │
│ i32  ┆ i32   ┆ i32 ┆ i32      ┆   ┆ f64      ┆ f64  ┆ f64    ┆ datetime[μs, America/New_York] │
╞══════╪═══════╪═════╪══════════╪═══╪══════════╪══════╪════════╪════════════════════════════════╡
└──────┴───────┴─────┴──────────┴───┴──────────┴──────┴────────┴────────────────────────────────┘
Issue description
Performance of string equality (and possibly other operations) dropped significantly between 0.20.5 and 0.20.6. The attached example parquet file is 1.7MB; the script duplicates it to increase its size to ~700MB and ~25M rows.
airflights.zip
Performance:
0.20.5: 2.7824645042419434
0.20.6 (new string/binary): 28.014456033706665
0.20.9: 26.180001497268677
Edit: 0.20.10 has same timing as 0.20.9
Expected behavior
Performance in 0.20.9 should be the same as or better than in 0.20.5.
Installed versions