Closed
Description
Feature request
read_parquet
- Automatically cast struct-list columns to nested. Introduce `reject_nesting: bool | list[str] = False`, which would help to exclude columns from being cast. Provide a nice error message if a struct-list column is not "nested", something like "ooh-ooh, please use `npd.read_parquet(reject_nesting=["failed_column"])` instead".
- Allow `engine="pyarrow"` only.
- Allow `dtype_backend="pyarrow"` only.
- Pack partially loaded struct-list columns to nested, e.g. when loaded with `columns=["lc.t", "lc.flux"]`.
For the last item, there is an important edge case (present in Rubin DP1): `columns=["flux", "lc.flux"]`, which fails with the current stable pandas. I think we should use pyarrow directly:
    import pandas as pd
    import pyarrow.parquet as pq
    from nested_pandas import NestedDtype, NestedFrame

    fname = ...
    table = pq.read_pandas(fname, columns=[...], ...)
    schema = pq.read_schema(fname)
    # Figure out how to pack sub-columns back with schema and table
    table = ...
    nested_columns = [...]
    nf = NestedFrame(table.to_pandas(
        types_mapper=lambda ty: NestedDtype(ty) if ty in nested_columns else pd.ArrowDtype(ty)
    ))

to_parquet
`use_nested_dtype: bool = False` would cast `NestedDtype` to the corresponding Arrow pandas type before saving.
Before submitting
Please check the following:
- I have described the purpose of the suggested change, specifying what I need the enhancement to accomplish, i.e. what problem it solves.
- I have included any relevant links, screenshots, environment information, and data relevant to implementing the requested feature, as well as pseudocode for how I want to access the new functionality.
- If I have ideas for how the new feature could be implemented, I have provided explanations and/or pseudocode and/or task lists for the steps.
Metadata
Labels
enhancement: New feature or request