Use fsspec.parquet for reading remote Parquet files #385

gitosaurus · 2025-10-07T22:56:24Z

Change Description

Following the recommendations in this NVIDIA blog post, use fsspec.parquet.open_parquet_file for single Parquet files, while preserving the existing pandas.read_parquet interface for all existing cases.

Closes #365 .

My PR includes a link to the issue that I am addressing

Solution Description

As the blog post describes, this optimization avoids the typical read-ahead strategy that works well for local files, in favor of making more precise reads that use less bandwidth.

Testing this change using the PANSTARRS catalog:

Operation	Before	After	Ratio
`.head()`	10.4s	6.16s	0.59
mean of one column	3h 3m	1h 52m	0.61
`.random_sample`	2m 36s	1m 37s	0.62

Code Quality

I have read the Contribution Guide
My code follows the code style of this project
My code builds (or compiles) cleanly without any errors or warnings
My code contains relevant comments and necessary documentation

Project-Specific Pull Request Checklists

New Feature Checklist

I have added or updated the docstrings associated with my feature using the NumPy docstring format
I have updated the tutorial to highlight my new feature (if appropriate)
I have added unit/End-to-End (E2E) test cases to cover my new feature
My change includes a breaking change
- My change includes backwards compatibility and deprecation warnings (if possible)

…quet (#369)

codecov · 2025-10-07T22:57:44Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.79%. Comparing base (c5cf5ee) to head (c0f1449).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #385      +/-   ##
==========================================
+ Coverage   97.77%   97.79%   +0.01%     
==========================================
  Files          19       19              
  Lines        2024     2042      +18     
==========================================
+ Hits         1979     1997      +18     
  Misses         45       45

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

github-actions · 2025-10-07T23:03:47Z

Before [`c5cf5ee`]	After [`38cfd83`]	Ratio	Benchmark (Parameter)
4.05±0.4s	3.68±0.2s	~0.91	benchmarks.ReadFewColumnsS3.time_run
841M	1.23G	1.46	benchmarks.ReadFewColumnsS3.peakmem_run
33.3±0.2ms	33.7±1ms	1.01	benchmarks.AssignSingleDfToNestedSeries.time_run
250M	250M	1.00	benchmarks.AssignSingleDfToNestedSeries.peakmem_run
136M	136M	1.00	benchmarks.CountNestedBy.peakmem_run
102M	102M	1.00	benchmarks.NestedFrameAddNested.peakmem_run
107M	107M	1.00	benchmarks.NestedFrameQuery.peakmem_run
11.2±0.1ms	11.2±0.1ms	1.00	benchmarks.NestedFrameQuery.time_run
106M	106M	1.00	benchmarks.NestedFrameReduce.peakmem_run
180M	180M	1.00	benchmarks.ReadFewColumnsHTTPS.peakmem_run

Click here to view all benchmarks.

dougbrn

Looks good, just a few questions!

src/nested_pandas/nestedframe/io.py

…ng open_parquet_file (#386) * Initial plan * Refactor to always use fsspec.parquet.open_parquet_file - Import fsspec.parquet at module level (assume always available) - Always use open_parquet_file for single file paths and URLs - Add _get_storage_options_and_path to determine file options - Use smaller block_size for HTTP/HTTPS (remote resources) - Keep fallback to pq.read_table for file-like objects, lists, directories, and explicit filesystems - Remove obsolete _should_use_fsspec_optimization and _read_with_fsspec_optimization functions - Remove test_fsspec_optimization_path_detection test (function no longer exists) Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

…d_path (#387) * Initial plan * Add unit tests for _get_storage_options_and_path with Path and UPath cases Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com> * Fix import ordering in test_io.py Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

…y to improve code coverage (#388) * Initial plan * Add unit tests for output_names parameter in map_rows to improve code coverage Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com> * Add unit tests for _is_directory function to cover lines 197 and 201 in io.py Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: gitosaurus <6794831+gitosaurus@users.noreply.github.com>

dougbrn

Looks good, thank you!

hombit

Sorry for being annoying, but I believe it is an example of AI code we discussed yesterday: the machine doesn't have enough context and generates not what we reall want to have.
My primary point here that we can just delete most of the code and replace it with 1) wrapping input with UPath, 2) using fsspec.parquet.open_parquet_file(upath.path, upath.fs).

benchmarks/benchmarks.py

src/nested_pandas/nestedframe/io.py

tests/nested_pandas/nestedframe/test_nestedframe.py

In the rebasing and refactoring, the fs= argument was not being set when a filesystem instance was available.

gitosaurus · 2025-10-09T17:32:25Z

Sorry for being annoying, but I believe it is an example of AI code we discussed yesterday: the machine doesn't have enough context and generates not what we reall want to have. My primary point here that we can just delete most of the code and replace it with 1) wrapping input with UPath, 2) using fsspec.parquet.open_parquet_file(upath.path, upath.fs).

Not annoying at all! Thanks for the close look.

There is a tension between what nested-pandas offers on its own and what LSDB uses from it. For example, the existing read_parquet method promises that its input argument can be (1) a file-like object; (2) a directory; (3) a file path; or (4) a list of the above. fsspec.parquet.open_parquet_file only deals with the single-file case, which is why the other cases were in there.

Now it's probably true that lsdb only uses the file path use case, but it's possible to use nested_pandas outside of lsdb, and nested-pandas is intended to extend the pandas interface, so these other cases need to be supported.

dougbrn

Looks good, thank you for all the work on this. Just one comment that is related to the performance regressions, once the regressions are resolved with no loss in support for reading directories (important for parquet!) just ping me again and I can quickly approve

src/nested_pandas/nestedframe/io.py

hombit

Thank you!!!

Add automatic fsspec optimization for remote storage using fsspec.par…

2dcb31e

…quet (#369)

gitosaurus requested a review from dougbrn October 7, 2025 22:56

dougbrn reviewed Oct 8, 2025

View reviewed changes

src/nested_pandas/nestedframe/io.py Outdated Show resolved Hide resolved

src/nested_pandas/nestedframe/io.py Outdated Show resolved Hide resolved

Copilot AI and others added 4 commits October 8, 2025 10:39

Run pre-commit checks

7bada7a

Obey ruff

154a534

Copilot AI mentioned this pull request Oct 8, 2025

Add unit tests for output_names in map_rows and _is_directory in io.py to improve code coverage #388

Merged

Copilot AI and others added 2 commits October 8, 2025 13:52

Obey ruff

8a33bf5

gitosaurus requested a review from dougbrn October 8, 2025 21:33

dougbrn approved these changes Oct 8, 2025

View reviewed changes

hombit requested changes Oct 9, 2025

View reviewed changes

gitosaurus added 2 commits October 9, 2025 08:28

Plumbing of fs=filesystem was missing, preventing optimization

547720c

In the rebasing and refactoring, the fs= argument was not being set when a filesystem instance was available.

Add explicit fsspec dependency

6e2d025

gitosaurus added 6 commits October 10, 2025 10:08

Run ruff

adde7a7

Resolve specialization problems

94e654d

No longer needed

30cf106

Simplify extraction of storage_options

8e88a8c

Resolve errors found by tests

1dde0c8

Remove now-unnecessary benchmark

62281ee

gitosaurus requested review from dougbrn and hombit October 10, 2025 21:20

dougbrn reviewed Oct 10, 2025

View reviewed changes

src/nested_pandas/nestedframe/io.py Show resolved Hide resolved

gitosaurus added 3 commits October 10, 2025 15:39

Optimize local-directory check.

84d3a09

Smaller block size for S3 too

98ae57e

Okay, s3 not helped by block size

c0f1449

gitosaurus mentioned this pull request Oct 10, 2025

Use fsspec optimized reading for lists of Parquet files #389

Open

3 tasks

gitosaurus requested a review from dougbrn October 11, 2025 00:01

gitosaurus enabled auto-merge (squash) October 11, 2025 00:02

dougbrn approved these changes Oct 11, 2025

View reviewed changes

hombit approved these changes Oct 13, 2025

View reviewed changes

gitosaurus merged commit c0ba0f2 into main Oct 13, 2025
12 checks passed

gitosaurus deleted the dtj-parquet-io branch October 13, 2025 13:10

Use fsspec.parquet for reading remote Parquet files #385

Use fsspec.parquet for reading remote Parquet files #385

Uh oh!

Conversation

gitosaurus commented Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Change Description

Solution Description

Code Quality

Project-Specific Pull Request Checklists

New Feature Checklist

Uh oh!

codecov bot commented Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions bot commented Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dougbrn left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

dougbrn left a comment

Choose a reason for hiding this comment

Uh oh!

hombit left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gitosaurus commented Oct 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dougbrn left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

hombit left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

gitosaurus commented Oct 7, 2025 •

edited

Loading

codecov bot commented Oct 7, 2025 •

edited

Loading

github-actions bot commented Oct 7, 2025 •

edited

Loading

gitosaurus commented Oct 9, 2025 •

edited

Loading