Skip to content

Conversation

gitosaurus
Copy link
Contributor

@gitosaurus gitosaurus commented Oct 7, 2025

Change Description

Following the recommendations in this NVIDIA blog post, use fsspec.parquet.open_parquet_file for single Parquet files, while preserving the existing pandas.read_parquet interface for all existing cases.

Closes #365 .

  • My PR includes a link to the issue that I am addressing

Solution Description

As the blog post describes, this optimization avoids the typical read-ahead strategy that works well for local files, in favor of making more precise reads that use less bandwidth.

Testing this change using the PANSTARRS catalog:

Operation Before After Ratio
.head() 10.4s 6.16s 0.59
mean of one column 3h 3m 1h 52m 0.61
.random_sample 2m 36s 1m 37s 0.62

Code Quality

  • I have read the Contribution Guide
  • My code follows the code style of this project
  • My code builds (or compiles) cleanly without any errors or warnings
  • My code contains relevant comments and necessary documentation

Project-Specific Pull Request Checklists

New Feature Checklist

  • I have added or updated the docstrings associated with my feature using the NumPy docstring format
  • I have updated the tutorial to highlight my new feature (if appropriate)
  • I have added unit/End-to-End (E2E) test cases to cover my new feature
  • My change includes a breaking change
    • My change includes backwards compatibility and deprecation warnings (if possible)

@gitosaurus gitosaurus requested a review from dougbrn October 7, 2025 22:56
Copy link

codecov bot commented Oct 7, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.79%. Comparing base (c5cf5ee) to head (c0f1449).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #385      +/-   ##
==========================================
+ Coverage   97.77%   97.79%   +0.01%     
==========================================
  Files          19       19              
  Lines        2024     2042      +18     
==========================================
+ Hits         1979     1997      +18     
  Misses         45       45              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link

github-actions bot commented Oct 7, 2025

Before [c5cf5ee] After [38cfd83] Ratio Benchmark (Parameter)
4.05±0.4s 3.68±0.2s ~0.91 benchmarks.ReadFewColumnsS3.time_run
841M 1.23G 1.46 benchmarks.ReadFewColumnsS3.peakmem_run
33.3±0.2ms 33.7±1ms 1.01 benchmarks.AssignSingleDfToNestedSeries.time_run
250M 250M 1.00 benchmarks.AssignSingleDfToNestedSeries.peakmem_run
136M 136M 1.00 benchmarks.CountNestedBy.peakmem_run
102M 102M 1.00 benchmarks.NestedFrameAddNested.peakmem_run
107M 107M 1.00 benchmarks.NestedFrameQuery.peakmem_run
11.2±0.1ms 11.2±0.1ms 1.00 benchmarks.NestedFrameQuery.time_run
106M 106M 1.00 benchmarks.NestedFrameReduce.peakmem_run
180M 180M 1.00 benchmarks.ReadFewColumnsHTTPS.peakmem_run

Click here to view all benchmarks.

Copy link
Collaborator

@dougbrn dougbrn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a few questions!

Copilot AI and others added 4 commits October 8, 2025 10:39
…ng open_parquet_file (#386)

* Initial plan

* Refactor to always use fsspec.parquet.open_parquet_file

- Import fsspec.parquet at module level (assume always available)
- Always use open_parquet_file for single file paths and URLs
- Add _get_storage_options_and_path to determine file options
- Use smaller block_size for HTTP/HTTPS (remote resources)
- Keep fallback to pq.read_table for file-like objects, lists, directories, and explicit filesystems
- Remove obsolete _should_use_fsspec_optimization and _read_with_fsspec_optimization functions
- Remove test_fsspec_optimization_path_detection test (function no longer exists)

Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>
…d_path (#387)

* Initial plan

* Add unit tests for _get_storage_options_and_path with Path and UPath cases

Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

* Fix import ordering in test_io.py

Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>
Copilot AI and others added 2 commits October 8, 2025 13:52
…y to improve code coverage (#388)

* Initial plan

* Add unit tests for output_names parameter in map_rows to improve code coverage

Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

* Add unit tests for _is_directory function to cover lines 197 and 201 in io.py

Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>
@gitosaurus gitosaurus requested a review from dougbrn October 8, 2025 21:33
Copy link
Collaborator

@dougbrn dougbrn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thank you!

Copy link
Collaborator

@hombit hombit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for being annoying, but I believe it is an example of AI code we discussed yesterday: the machine doesn't have enough context and generates not what we reall want to have.
My primary point here that we can just delete most of the code and replace it with 1) wrapping input with UPath, 2) using fsspec.parquet.open_parquet_file(upath.path, upath.fs).

In the rebasing and refactoring, the fs= argument was not being
set when a filesystem instance was available.
@gitosaurus
Copy link
Contributor Author

gitosaurus commented Oct 9, 2025

Sorry for being annoying, but I believe it is an example of AI code we discussed yesterday: the machine doesn't have enough context and generates not what we reall want to have. My primary point here that we can just delete most of the code and replace it with 1) wrapping input with UPath, 2) using fsspec.parquet.open_parquet_file(upath.path, upath.fs).

Not annoying at all! Thanks for the close look.

There is a tension between what nested-pandas offers on its own and what LSDB uses from it. For example, the existing read_parquet method promises that its input argument can be (1) a file-like object; (2) a directory; (3) a file path; or (4) a list of the above. fsspec.parquet.open_parquet_file only deals with the single-file case, which is why the other cases were in there.

Now it's probably true that lsdb only uses the file path use case, but it's possible to use nested_pandas outside of lsdb, and nested-pandas is intended to extend the pandas interface, so these other cases need to be supported.

@gitosaurus gitosaurus requested review from dougbrn and hombit October 10, 2025 21:20
Copy link
Collaborator

@dougbrn dougbrn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thank you for all the work on this. Just one comment that is related to the performance regressions, once the regressions are resolved with no loss in support for reading directories (important for parquet!) just ping me again and I can quickly approve

Copy link
Collaborator

@hombit hombit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!!!

@gitosaurus gitosaurus merged commit c0ba0f2 into main Oct 13, 2025
12 checks passed
@gitosaurus gitosaurus deleted the dtj-parquet-io branch October 13, 2025 13:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Use fsspec.parquet.open_parquet_file for I/O

4 participants