Reimplement parquet (de)serialization #232

@hombit

Description

Feature request

read_parquet

  • Automatically cast struct-list columns to nested. Introduce reject_nesting: bool | list[str] = False, which would exclude columns from being cast. Provide a nice error message if a struct-list column is not "nested", something like "ooh-ooh, please use npd.read_parquet(reject_nesting=["failed_column"]) instead".
  • Allow only engine="pyarrow"
  • Allow only dtype_backend="pyarrow"
  • Pack partially loaded struct-list columns to nested, e.g. loaded with columns=["lc.t", "lc.flux"].

For the last one, there is an important edge case (it exists in Rubin DP1): columns=["flux", "lc.flux"] fails with the current stable pandas. I think we should use pyarrow directly:

import pandas as pd
import pyarrow.parquet as pq

from nested_pandas import NestedDtype, NestedFrame

fname = ...
table = pq.read_pandas(fname, columns=[...], ...)
schema = pq.read_schema(fname)
# Figure out how to pack sub-columns back with schema and table
table = ...
nested_columns = [...]
nf = NestedFrame(table.to_pandas(types_mapper=lambda ty: NestedDtype(ty) if ty in nested_columns else pd.ArrowDtype(ty)))

to_parquet

  • use_nested_dtype: bool = False would cast NestedDtype columns to the corresponding Arrow-backed pandas type (pd.ArrowDtype) before saving.

Before submitting
Please check the following:

  • I have described the purpose of the suggested change, specifying what I need the enhancement to accomplish, i.e. what problem it solves.
  • I have included any relevant links, screenshots, environment information, and data relevant to implementing the requested feature, as well as pseudocode for how I want to access the new functionality.
  • If I have ideas for how the new feature could be implemented, I have provided explanations and/or pseudocode and/or task lists for the steps.

Labels: enhancement (New feature or request)