fix: Forward storage_options to parquet metadata reads#1
Open
mattijsdp wants to merge 7 commits into
Open
Conversation
Schema/collection/failure-info reads passed `storage_options` (and `credential_provider`) to the data read via `pl.read_parquet` / `pl.scan_parquet`, but the separate embedded-schema metadata read called `pl.read_parquet_metadata` with no options. Against non-AWS S3-compatible stores reached purely through `storage_options` (lakeFS, MinIO, R2, Tigris, …) the metadata read fell back to the default AWS credential chain and endpoint, breaking typed reads. Thread the storage-related options into all metadata reads in `_storage/parquet.py` via a small `_metadata_read_options` helper. Fixes Quantco#352 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`read_parquet_metadata_schema` and `read_parquet_metadata_collection` read parquet metadata from a (possibly remote) source but accepted no options, so they could not reach non-AWS S3-compatible stores either. Accept and forward `**kwargs` (e.g. `storage_options`, `credential_provider`) to `pl.read_parquet_metadata`, matching `read_parquet`/`scan_parquet`. Add s3-marked regression tests covering both helpers. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…retries Address review feedback on the storage-options forwarding fix: - Match the file's docstring convention for the public metadata helpers: drop the enumerated `storage_options`/`credential_provider` note and use the same terse "passed directly to :meth:`polars.read_parquet_metadata`" wording as `read_parquet`/`scan_parquet`. - Forward `retries` alongside `storage_options`/`credential_provider` in `_metadata_read_options`, since `read_parquet_metadata` accepts it and it is storage-reaching. Clarify in the docstring why the call sites must filter the scan/read kwargs (the narrower `read_parquet_metadata` signature rejects options like `n_rows`/`columns`) instead of forwarding everything. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Shrink `_metadata_read_options` to a one-line comment matching the other private helpers in the module (no oversized docstring). - Extract an `s3_storage_options` fixture (mirrors `s3_tmp_path`, but strips the AWS_* env vars so the store is reachable *only* via `storage_options`) and use it across the schema, collection and failure-info regression tests. - Add a failure-info regression test covering the `scan_failure_info` metadata read, and split the schema test so the typed read and the standalone `read_parquet_metadata_schema` helper are asserted independently. - Drop the end-to-end collection typed-read test: it cannot pass via `storage_options` alone because member discovery goes through fsspec (`url_to_fs`/`fs.exists`), which does not receive `storage_options` -- a separate limitation from the polars metadata read this PR fixes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Documentation should stand on its own, so remove the issue-tracker links from the `s3_storage_options` fixture and the storage-options regression tests, and shrink their docstrings to a single line in line with the surrounding tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes Quantco#352.
Schema.read_parquet/scan_parquet(plus collection and failure-info reads) failed against S3-compatible stores reached viastorage_options. The data read gotstorage_options, but the separate embedded-metadata read didn't — so polars fell back to the default AWS endpoint/credentials.Fix
_storage/parquet.py— forward the storage optionspl.read_parquet_metadataaccepts (storage_options,credential_provider,retries) to every metadata read, via a small_metadata_read_options(kwargs)helper. Only present keys are forwarded, so defaults are unchanged.read_parquet_metadata_schema(schema.py) andread_parquet_metadata_collection(collection/collection.py) now accept and forward**kwargs.Tests
s3-marked regression tests using ans3_storage_optionsfixture that supplies credentials/endpoint only viastorage_options(AWS_* env stripped), so the metadata read fails unless options are forwarded. Verified each fails without the fix and passes with it.pixi run lintclean; full suite green (non-s3 + s3).Scope note
Limited to polars parquet-metadata reads (covers the issue's single-file repro). A
storage_options-only collection read still fails earlier, at fsspec-based member discovery (url_to_fs/glob), which uses a different option vocabulary than polars and needs per-backend translation — a separate, broader change, intentionally not addressed here.Opened against my fork for review before submitting upstream.