
Large performance drop for string equality between 0.20.5 and 0.20.6 #14589

Closed
etiennebacher opened this issue Feb 19, 2024 · 4 comments
Labels
A-dtype-string (Area: string data type) · bug (Something isn't working) · P-medium (Priority: medium) · python (Related to Python Polars)

Comments

etiennebacher commented Feb 19, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import os
os.environ["POLARS_VERBOSE"] = "1"

import polars as pl
import time

pl.show_versions()

### Create large parquet file
test_pl = pl.read_parquet("airflights.parquet")
for i in range(8):
    test_pl = test_pl.vstack(test_pl)

test_pl = test_pl.sample(fraction=0.999)
test_pl.write_parquet("airflights2.parquet")

### Scan and filter this large parquet file (should return 0 rows)
test_lazy = pl.scan_parquet("airflights2.parquet")

start = time.time()
test_lazy.filter(pl.col("carrier") == "N802UA").collect()
print(time.time() - start)

Log output

parquet file must be read, statistics not sufficient for predicate. [repeated dozens of times]

shape: (0, 19)
┌──────┬───────┬─────┬──────────┬───┬──────────┬──────┬────────┬────────────────────────────────┐
│ year ┆ month ┆ day ┆ dep_time ┆ … ┆ distance ┆ hour ┆ minute ┆ time_hour                      │
│ ---  ┆ ---   ┆ --- ┆ ---      ┆   ┆ ---      ┆ ---  ┆ ---    ┆ ---                            │
│ i32  ┆ i32   ┆ i32 ┆ i32      ┆   ┆ f64      ┆ f64  ┆ f64    ┆ datetime[μs, America/New_York] │
╞══════╪═══════╪═════╪══════════╪═══╪══════════╪══════╪════════╪════════════════════════════════╡
└──────┴───────┴─────┴──────────┴───┴──────────┴──────┴────────┴────────────────────────────────┘

Issue description

Performance of string equality (and possibly other operations) dropped significantly. Here's an example parquet file of 1.7 MB; the script duplicates it to increase its size to ~700 MB and ~25M rows.
airflights.zip

Performance:

  • 0.20.5: 2.7824645042419434
  • 0.20.6 (new string/binary): 28.014456033706665
  • 0.20.9: 26.180001497268677

Edit: 0.20.10 has the same timing as 0.20.9

Expected behavior

Same or better performance in 0.20.9 as in 0.20.5

Installed versions

--------Version info---------
Polars:               0.20.9 [Used also 0.20.5 and 0.20.6 for benchmarking]
Index type:           UInt32
Platform:             Windows-10-10.0.19044-SP0
Python:               3.11.0 (main, Oct 24 2022, 18:26:48) [MSC v.1933 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               2023.6.0
gevent:               <not installed>
hvplot:               0.9.2
matplotlib:           3.7.1
numpy:                1.24.3
openpyxl:             <not installed>
pandas:               2.0.3
pyarrow:              12.0.1
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@etiennebacher etiennebacher added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Feb 19, 2024
@stinodego stinodego added A-dtype-string Area: string data type P-medium Priority: medium and removed needs triage Awaiting prioritization by a maintainer labels Feb 19, 2024
@stinodego
Member

I believe the issue is not the equality operation but the filter operation. I think @orlp is working on this.
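
A minimal sketch of how one might separate the two operations, assuming the airflights2.parquet file produced by the reproducible example above; these timings were not part of the original report and would need to be measured locally:

import time

import polars as pl

df = pl.read_parquet("airflights2.parquet")

# Equality only: materialize the boolean mask without filtering.
start = time.time()
df.select(pl.col("carrier") == "N802UA")
print("equality only:", time.time() - start)

# Filter: the same equality plus gathering the matching rows.
start = time.time()
df.filter(pl.col("carrier") == "N802UA")
print("filter:", time.time() - start)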

@etiennebacher
Author

I don't know; running test_lazy.with_columns(pl.col("carrier") == "N802UA").collect() instead of test_lazy.filter(pl.col("carrier") == "N802UA").collect() also takes ~30 sec.

@ritchie46
Member

You haven't isolated the reading from the performance comparison. I suspect the filter and comparisons are red herrings and it is the reading of string data.

This should be fixed by: #14705

When making performance examples, it is best to isolate the operation you are benchmarking.
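
A minimal sketch of isolating the read from the comparison, again assuming the airflights2.parquet file from the example; this is one way to benchmark the pieces separately, not the setup used in the original report:

import time

import polars as pl

# Time the read on its own.
start = time.time()
df = pl.read_parquet("airflights2.parquet")
print("read:", time.time() - start)

# Time the string-equality filter on the already-loaded DataFrame.
start = time.time()
df.filter(pl.col("carrier") == "N802UA")
print("filter on in-memory data:", time.time() - start)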

@ritchie46
Member

Confirmed that #14705 fixes the pathological regression.
