performance issue with tpch q7 after dropping columns and using sink_parquet #16694
Open
Labels: bug, needs triage, python
Reproducible example
this is slightly involved but you should be able to copy/paste below after `pip install 'ibis-framework[duckdb]'` in addition to having Polars installed. I am on the latest release of Polars (0.20.31). this breaks down at sf=20, works fine on sf=10

function to generate the data:
run for `sf=10` and `sf=20`:

now we can read the data:
you'll notice Polars by default (and Ibis on the DuckDB/Polars backends) creates the hive-partitioned `sf` and `n` as columns in the data. this was throwing some things off, so in my longer `get_polars_tables` function I dropped those columns:

you can then read in the tables:
perhaps a separate bug but I'll move forward -- at this point the dataframes still have the `n` and `sf` columns, even though they should have been dropped. this does not seem to be an issue in the eager API

now we define q7:
and run it, calling `sink_parquet` on the result:

at `sf=10` it works fine, but at `sf=20` it hangs for a very long time. it also uses 100% CPU while doing this

as I'm writing this it actually did finish -- while `.collect().write_parquet()` takes 1.5s at `sf=20`, the `sink_parquet` call takes ~9s at `sf=10` and ~60s at `sf=20`

I was originally testing this at `sf=50` and `sf=100` so assumed it was hanging forever, particularly compared to the previous numbers I was seeing before I added those drop column calls. I'll still submit this

NOTE: the log output below was too long (`parquet file must be read...`) for GitHub so I deleted a bunch of it, that seems like it'd be the issue though (reading the parquet file(s) a ton of times?)

Log output
Issue description
two potential issues:
noticed columns aren't dropped for LazyFrames when they should be (and are for regular DataFrames)
potential performance issue involving dropping columns + `sink_parquet`

Expected behavior
columns are dropped
no performance issue w/ the above (I can work around this w/ `.collect().write_parquet` it seems)
Installed versions