Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Native CSV file list reading #16180

Merged
merged 5 commits into from
May 14, 2024
Merged

Conversation

nameexhaustion
Copy link
Collaborator

@nameexhaustion nameexhaustion commented May 13, 2024

This replaces the existing codepath that dispatches multi-CSV file reads to the Union executor.

Copy link

codspeed-hq bot commented May 14, 2024

CodSpeed Performance Report

Merging #16180 will degrade performances by 22.39%

Comparing nameexhaustion:csv-file-list (dfc3bdb) with main (993cc99)

Summary

❌ 1 regressions
✅ 34 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main nameexhaustion:csv-file-list Change
test_groupby_h2oai_q5 7.7 ms 9.9 ms -22.39%

@nameexhaustion nameexhaustion changed the title [wip] Native CSV file list reading feat: Native CSV file list reading May 14, 2024
@nameexhaustion nameexhaustion marked this pull request as ready for review May 14, 2024 07:07
@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels May 14, 2024
Copy link

codecov bot commented May 14, 2024

Codecov Report

Attention: Patch coverage is 93.39623% with 14 lines in your changes are missing coverage. Please review.

Project coverage is 80.93%. Comparing base (81cc802) to head (dfc3bdb).
Report is 5 commits behind head on main.

Files Patch % Lines
...olars-lazy/src/physical_plan/executors/scan/csv.rs 92.37% 9 Missing ⚠️
crates/polars-pipe/src/executors/sources/csv.rs 92.53% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16180      +/-   ##
==========================================
- Coverage   81.03%   80.93%   -0.10%     
==========================================
  Files        1392     1393       +1     
  Lines      178939   179559     +620     
  Branches     2907     2907              
==========================================
+ Hits       144997   145323     +326     
- Misses      33436    33730     +294     
  Partials      506      506              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@nameexhaustion
Copy link
Collaborator Author

failing test is python import timing - I don't think this PR has anything to do with it

crates/polars-lazy/src/physical_plan/executors/scan/csv.rs Outdated Show resolved Hide resolved
crates/polars-lazy/src/physical_plan/executors/scan/csv.rs Outdated Show resolved Hide resolved
}
}

concat_df(out.iter())?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we be sure here the the DataFrames have the same Schema?

If so, we can use concat_df_unchecked. It isn't unsafe, but it will elide schema checks and duplicate name checks.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very good catch, I checked and realized this PR as it is would cause differing schemas to be silently coerced, I will need to do some more fixing

crates/polars-lazy/src/physical_plan/executors/scan/csv.rs Outdated Show resolved Hide resolved
@ritchie46
Copy link
Member

ritchie46 commented May 14, 2024

Looks great! Have a few minor comments.

I suspect this also should be much better in performance when dealing with many files.

@nameexhaustion nameexhaustion marked this pull request as draft May 14, 2024 09:57
@nameexhaustion nameexhaustion marked this pull request as ready for review May 14, 2024 11:40
@ritchie46 ritchie46 merged commit 674a048 into pola-rs:main May 14, 2024
26 checks passed
@nameexhaustion nameexhaustion deleted the csv-file-list branch May 14, 2024 12:12
MarcoGorelli pushed a commit to MarcoGorelli/polars that referenced this pull request May 15, 2024
@c-peters c-peters added the accepted Ready for implementation label May 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

3 participants