Use fsspec.parquet for reading remote Parquet files #385
Codecov Report
✅ All modified and coverable lines are covered by tests.

Coverage Diff
              main     #385     +/-
Coverage    97.77%   97.79%   +0.01%
Files           19       19
Lines         2024     2042      +18
Hits          1979     1997      +18
Misses          45       45
Looks good, just a few questions!
…ng open_parquet_file (#386)
* Initial plan
* Refactor to always use fsspec.parquet.open_parquet_file
  - Import fsspec.parquet at module level (assume always available)
  - Always use open_parquet_file for single file paths and URLs
  - Add _get_storage_options_and_path to determine file options
  - Use smaller block_size for HTTP/HTTPS (remote resources)
  - Keep fallback to pq.read_table for file-like objects, lists, directories, and explicit filesystems
  - Remove obsolete _should_use_fsspec_optimization and _read_with_fsspec_optimization functions
  - Remove test_fsspec_optimization_path_detection test (function no longer exists)
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

…d_path (#387)
* Initial plan
* Add unit tests for _get_storage_options_and_path with Path and UPath cases
* Fix import ordering in test_io.py
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

…y to improve code coverage (#388)
* Initial plan
* Add unit tests for output_names parameter in map_rows to improve code coverage
* Add unit tests for _is_directory function to cover lines 197 and 201 in io.py
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>
Looks good, thank you!
Sorry for being annoying, but I believe this is an example of the AI code we discussed yesterday: the machine doesn't have enough context and generates something other than what we really want to have.
My primary point here is that we can just delete most of the code and replace it with 1) wrapping the input with UPath, and 2) using fsspec.parquet.open_parquet_file(upath.path, fs=upath.fs).
In the rebasing and refactoring, the fs= argument was not being set when a filesystem instance was available.
Not annoying at all! Thanks for the close look. There is a tension between what nested-pandas offers on its own and what LSDB uses from it. For example, the existing […] Now it's probably true that […]
Looks good, thank you for all the work on this. Just one comment, related to the performance regressions: once they are resolved with no loss of support for reading directories (important for Parquet!), just ping me again and I can quickly approve.
Thank you!!!
Change Description
Following the recommendations in this NVIDIA blog post, use fsspec.parquet.open_parquet_file for single Parquet files, while preserving the existing pandas.read_parquet interface for all existing cases.
Closes #365.
Solution Description
As the blog post describes, this optimization avoids the typical read-ahead strategy that works well for local files, in favor of making more precise reads that use less bandwidth.
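To make the bandwidth argument concrete, here is a toy model of targeted reads. It is not fsspec's actual implementation: the helper names `merge_ranges` and `total_bytes`, the file size, and all byte offsets are made up for illustration. The idea is that once the Parquet footer reveals where the needed column chunks live, nearby ranges can be coalesced into a few requests instead of streaming large read-ahead blocks.

```python
def merge_ranges(ranges, max_gap):
    """Coalesce (start, end) byte ranges separated by at most max_gap bytes."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it rather than
            # issuing a separate request.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def total_bytes(ranges):
    """Sum the number of bytes covered by a list of (start, end) ranges."""
    return sum(end - start for start, end in ranges)


# Hypothetical layout, for illustration only.
file_size = 10_000_000
needed_chunks = [(4_000, 54_000), (60_000, 110_000), (5_000_000, 5_050_000)]

requests = merge_ranges(needed_chunks, max_gap=64_000)
targeted = total_bytes(requests)
```

In this toy layout, two requests cover 156,000 bytes of a 10,000,000-byte file, whereas a reader that fetched the whole file would transfer all of it.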
Testing this change using the PANSTARRS catalog:
[Table: results for .head() and .random_sample]