
fix: Forward storage_options to parquet metadata reads#1

Open
mattijsdp wants to merge 7 commits into main from fix/parquet-metadata-storage-options


@mattijsdp mattijsdp commented Jun 2, 2026

Summary

Fixes Quantco#352.

Schema.read_parquet/scan_parquet (plus collection and failure-info reads) failed against S3-compatible stores reached via storage_options. The data read received storage_options, but the separate embedded-metadata read did not, so polars fell back to the default AWS endpoint and credentials for that read.

Fix

  • _storage/parquet.py — forward the storage options pl.read_parquet_metadata accepts (storage_options, credential_provider, retries) to every metadata read, via a small _metadata_read_options(kwargs) helper. Only present keys are forwarded, so defaults are unchanged.
  • read_parquet_metadata_schema (schema.py) and read_parquet_metadata_collection (collection/collection.py) now accept and forward **kwargs.

Tests

Added s3-marked regression tests that use an s3_storage_options fixture. The fixture supplies credentials and endpoint only via storage_options (AWS_* env vars stripped), so the metadata read fails unless the options are forwarded. Each test was verified to fail without the fix and pass with it. pixi run lint is clean and the full suite is green (non-s3 and s3).
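The idea behind the fixture can be sketched without pytest, as below. This is not the fixture from the PR: the names aws_env_stripped and build_storage_options are hypothetical, and the storage option keys are assumed to be the object-store style keys that polars accepts for S3.

```python
import os
from collections.abc import Iterator
from contextlib import contextmanager


@contextmanager
def aws_env_stripped() -> Iterator[None]:
    # Temporarily remove every AWS_* variable so that credentials and endpoint
    # cannot reach the reader through the environment.
    saved = {k: v for k, v in os.environ.items() if k.startswith("AWS_")}
    for key in saved:
        del os.environ[key]
    try:
        yield
    finally:
        # Restore the original values once the block exits.
        os.environ.update(saved)


def build_storage_options(endpoint: str, key: str, secret: str) -> dict[str, str]:
    # Assumed key names; adjust them to whatever the target store expects.
    return {
        "aws_endpoint_url": endpoint,
        "aws_access_key_id": key,
        "aws_secret_access_key": secret,
    }
```

Inside `with aws_env_stripped():` a read can only succeed if the dictionary from build_storage_options actually reaches every call that touches the store, which is the property the regression tests rely on.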

Scope note

Limited to polars parquet-metadata reads (covers the issue's single-file repro). A storage_options-only collection read still fails earlier, at fsspec-based member discovery (url_to_fs/glob), which uses a different option vocabulary than polars and needs per-backend translation. That is a separate, broader change and is intentionally not addressed here.
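To illustrate the vocabulary mismatch, here is a rough sketch of what a translation for one backend might involve. It is not part of this PR. The function name is hypothetical, the input keys are assumed to be object-store style polars options, and the output uses the key, secret and client_kwargs arguments of s3fs. Other backends would need their own mapping.

```python
from typing import Any


def polars_to_s3fs_options(storage_options: dict[str, Any]) -> dict[str, Any]:
    # Hypothetical, partial mapping for the S3 backend only.
    fs_kwargs: dict[str, Any] = {}
    if "aws_access_key_id" in storage_options:
        fs_kwargs["key"] = storage_options["aws_access_key_id"]
    if "aws_secret_access_key" in storage_options:
        fs_kwargs["secret"] = storage_options["aws_secret_access_key"]

    # Endpoint and region are passed to the underlying client in s3fs.
    client_kwargs: dict[str, Any] = {}
    if "aws_endpoint_url" in storage_options:
        client_kwargs["endpoint_url"] = storage_options["aws_endpoint_url"]
    if "aws_region" in storage_options:
        client_kwargs["region_name"] = storage_options["aws_region"]
    if client_kwargs:
        fs_kwargs["client_kwargs"] = client_kwargs
    return fs_kwargs


translated = polars_to_s3fs_options(
    {
        "aws_access_key_id": "k",
        "aws_secret_access_key": "s",
        "aws_endpoint_url": "http://localhost:9000",
    }
)
```

Even this small mapping changes key names and nesting, which is why the translation is treated as a separate change.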

Opened against my fork for review before submitting upstream.

mattijsdp and others added 7 commits June 2, 2026 14:16
Schema/collection/failure-info reads passed `storage_options` (and
`credential_provider`) to the data read via `pl.read_parquet` /
`pl.scan_parquet`, but the separate embedded-schema metadata read called
`pl.read_parquet_metadata` with no options. Against non-AWS S3-compatible
stores reached purely through `storage_options` (lakeFS, MinIO, R2, Tigris,
…) the metadata read fell back to the default AWS credential chain and
endpoint, breaking typed reads.

Thread the storage-related options into all metadata reads in
`_storage/parquet.py` via a small `_metadata_read_options` helper.

Fixes Quantco#352

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`read_parquet_metadata_schema` and `read_parquet_metadata_collection` read
parquet metadata from a (possibly remote) source but accepted no options, so
they could not reach non-AWS S3-compatible stores either. Accept and forward
`**kwargs` (e.g. `storage_options`, `credential_provider`) to
`pl.read_parquet_metadata`, matching `read_parquet`/`scan_parquet`.

Add s3-marked regression tests covering both helpers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…retries

Address review feedback on the storage-options forwarding fix:

- Match the file's docstring convention for the public metadata helpers: drop
  the enumerated `storage_options`/`credential_provider` note and use the same
  terse "passed directly to :meth:`polars.read_parquet_metadata`" wording as
  `read_parquet`/`scan_parquet`.
- Forward `retries` alongside `storage_options`/`credential_provider` in
  `_metadata_read_options`, since `read_parquet_metadata` accepts it and it is
  storage-reaching. Clarify in the docstring why the call sites must filter the
  scan/read kwargs (the narrower `read_parquet_metadata` signature rejects
  options like `n_rows`/`columns`) instead of forwarding everything.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Shrink `_metadata_read_options` to a one-line comment matching the other
  private helpers in the module (no oversized docstring).
- Extract an `s3_storage_options` fixture (mirrors `s3_tmp_path`, but strips the
  AWS_* env vars so the store is reachable *only* via `storage_options`) and use
  it across the schema, collection and failure-info regression tests.
- Add a failure-info regression test covering the `scan_failure_info` metadata
  read, and split the schema test so the typed read and the standalone
  `read_parquet_metadata_schema` helper are asserted independently.
- Drop the end-to-end collection typed-read test: it cannot pass via
  `storage_options` alone because member discovery goes through fsspec
  (`url_to_fs`/`fs.exists`), which does not receive `storage_options` -- a
  separate limitation from the polars metadata read this PR fixes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Documentation should stand on its own, so remove the issue-tracker links from
the `s3_storage_options` fixture and the storage-options regression tests, and
shrink their docstrings to a single line in line with the surrounding tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Parquet metadata read ignores storage_options, breaking typed reads from non-AWS S3 stores
