fix(python): Avoid loading all columns in `read_parquet` when `columns` parameter is specified #15229

itamarst · 2024-03-22T12:32:15Z

Without optimizations, all columns would be loaded regardless of the columns argument, leading to higher CPU and memory usage.


itamarst · 2024-03-22T12:36:13Z

If you think a test to catch future regressions would be helpful, it should be possible to write one with pytest-memray.

stinodego

Thanks for noticing this one. That's a painful miss 😓

I'd be interested in a way to catch these types of bugs if we can do so without overly bloating the CI.

codecov · 2024-03-22T13:06:36Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.26%. Comparing base (474ac34) to head (bd8ef13).
Report is 2 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #15229      +/-   ##
==========================================
- Coverage   81.26%   81.26%   -0.01%     
==========================================
  Files        1355     1355              
  Lines      175676   175676              
  Branches     2518     2518              
==========================================
- Hits       142761   142755       -6     
- Misses      32434    32439       +5     
- Partials      481      482       +1


itamarst · 2024-03-22T13:09:21Z

Given the impact of this (and the minor nature of the semantic change), I'd suggest you merge this without tests and I'll see about the testing in a separate issue. The tests would presumably want to cover a larger set of code anyway.

fix(python): Re-enable optimizations when using read_parquet().

bd8ef13

Without optimizations, all columns would be loaded regardless of the columns argument, leading to higher CPU and memory usage.

github-actions bot added the fix (Bug fix) and python (Related to Python Polars) labels Mar 22, 2024

itamarst changed the title ~~fix(python): Re-enable optimizations when using read_parquet().~~ fix(python): Don't load all columns in read_parquet() if the user asked for only a subset Mar 22, 2024

itamarst marked this pull request as ready for review March 22, 2024 12:42

itamarst requested review from ritchie46, stinodego, c-peters, alexander-beedie and MarcoGorelli as code owners March 22, 2024 12:42

stinodego changed the title ~~fix(python): Don't load all columns in read_parquet() if the user asked for only a subset~~ fix(python): Avoid loading all columns in read_parquet when columns parameter is specified Mar 22, 2024

stinodego approved these changes Mar 22, 2024


stinodego merged commit 6503abc into pola-rs:main Mar 22, 2024
15 checks passed

itamarst deleted the 15098-read-parquet-inefficient branch March 22, 2024 13:11
