
Large performance drop for string equality between 0.20.5 and 0.20.6 #14589

Closed
etiennebacher opened this issue Feb 19, 2024 · 4 comments
Labels
A-dtype-string (Area: string data type) · bug (Something isn't working) · P-medium (Priority: medium) · python (Related to Python Polars)

Comments

etiennebacher commented Feb 19, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import os
os.environ["POLARS_VERBOSE"] = "1"

import polars as pl
import time

pl.show_versions()

### Create large parquet file
test_pl = pl.read_parquet("airflights.parquet")
for i in range(8):
    test_pl = test_pl.vstack(test_pl)

test_pl = test_pl.sample(fraction=0.999)
test_pl.write_parquet("airflights2.parquet")

### Scan and filter this large parquet file (should return 0 rows)
test_lazy = pl.scan_parquet("airflights2.parquet")

start = time.time()
test_lazy.filter(pl.col("carrier") == "N802UA").collect()
print(time.time() - start)

Log output

parquet file must be read, statistics not sufficient for predicate. [repeated dozens of times]

shape: (0, 19)
┌──────┬───────┬─────┬──────────┬───┬──────────┬──────┬────────┬────────────────────────────────┐
│ year ┆ month ┆ day ┆ dep_time ┆ … ┆ distance ┆ hour ┆ minute ┆ time_hour                      │
│ ---  ┆ ---   ┆ --- ┆ ---      ┆   ┆ ---      ┆ ---  ┆ ---    ┆ ---                            │
│ i32  ┆ i32   ┆ i32 ┆ i32      ┆   ┆ f64      ┆ f64  ┆ f64    ┆ datetime[μs, America/New_York] │
╞══════╪═══════╪═════╪══════════╪═══╪══════════╪══════╪════════╪════════════════════════════════╡
└──────┴───────┴─────┴──────────┴───┴──────────┴──────┴────────┴────────────────────────────────┘

Issue description

Performance of string equality (and possibly other operations) dropped significantly. Here's an example parquet file of 1.7 MB; the script duplicates it to increase its size to ~700 MB and ~25M rows.
airflights.zip

Performance:

  • 0.20.5: 2.7824645042419434
  • 0.20.6 (new string/binary): 28.014456033706665
  • 0.20.9: 26.180001497268677

Edit: 0.20.10 has the same timing as 0.20.9

Expected behavior

Same or better performance in 0.20.9 as in 0.20.5

Installed versions

--------Version info---------
Polars:               0.20.9 [Used also 0.20.5 and 0.20.6 for benchmarking]
Index type:           UInt32
Platform:             Windows-10-10.0.19044-SP0
Python:               3.11.0 (main, Oct 24 2022, 18:26:48) [MSC v.1933 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               2023.6.0
gevent:               <not installed>
hvplot:               0.9.2
matplotlib:           3.7.1
numpy:                1.24.3
openpyxl:             <not installed>
pandas:               2.0.3
pyarrow:              12.0.1
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@etiennebacher etiennebacher added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Feb 19, 2024
@stinodego stinodego added A-dtype-string Area: string data type P-medium Priority: medium and removed needs triage Awaiting prioritization by a maintainer labels Feb 19, 2024
@stinodego
Member

I believe the issue is not the equality operation but the filter operation. I think @orlp is working on this.
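
A minimal sketch of how one might separate the two operations, assuming the airflights2.parquet file produced by the reproducible example above; these timings were not part of the original report and would need to be measured locally:

import time

import polars as pl

df = pl.read_parquet("airflights2.parquet")

# Equality only: materialize the boolean mask without filtering.
start = time.time()
df.select(pl.col("carrier") == "N802UA")
print("equality only:", time.time() - start)

# Filter: the same equality plus gathering the matching rows.
start = time.time()
df.filter(pl.col("carrier") == "N802UA")
print("filter:", time.time() - start)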

@etiennebacher
Author

I don't know; running test_lazy.with_columns(pl.col("carrier") == "N802UA").collect() instead of test_lazy.filter(pl.col("carrier") == "N802UA").collect() also takes ~30 sec.

@ritchie46
Member

You haven't isolated the reading from the performance comparison. I suspect the filter and comparisons are red herrings and it is the reading of string data.

This should be fixed by: #14705

When making performance examples, it is best to isolate the operation you are benchmarking.
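
A minimal sketch of isolating the read from the comparison, again assuming the airflights2.parquet file from the example; this is one way to benchmark the pieces separately, not the setup used in the original report:

import time

import polars as pl

# Time the read on its own.
start = time.time()
df = pl.read_parquet("airflights2.parquet")
print("read:", time.time() - start)

# Time the string-equality filter on the already-loaded DataFrame.
start = time.time()
df.filter(pl.col("carrier") == "N802UA")
print("filter on in-memory data:", time.time() - start)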

@ritchie46
Member

Confirmed that #14705 fixes the pathological regression.
