
ENH: Optional preservation of Sparse columns in Parquet/Feather via Arrow metadata #63049

@antznette1

Summary

Add an opt-in flag that preserves SparseDtype across a Parquet/Feather round trip by storing minimal dtype metadata in the Arrow schema and reconstructing the sparse columns on read. Default behavior does not change.

API

  • Write: DataFrame.to_parquet(..., preserve_sparse=False), DataFrame.to_feather(..., preserve_sparse=False)
  • Read: read_parquet(..., preserve_sparse=False), read_feather(..., preserve_sparse=False)
  • Alternative name for feedback: preserve_extension_arrays.

Behavior

  • Default (False): current behavior unchanged (dense on read).
  • When True: the writer records subtype and fill_value in Arrow field metadata, and the reader reconstructs a SparseArray with the matching SparseDtype.

Implementation sketch

  • Writer: detect SparseDtype columns, attach schema field metadata (e.g., b"pandas.sparse.dtype", b"pandas.sparse.version"), keep physical encoding compatible.
  • Reader: if preserve_sparse=True and metadata present, rebuild sparse columns from dense values + recorded fill_value/subtype.

Tests

  • Parquet and Feather roundtrip.
  • Subtypes: int64/float64/boolean; various fill_values (0, 0.0, False, NaN).
  • Mixed frames (sparse + dense).
  • Verify off-by-default behavior.
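The subtype and fill_value matrix listed above can be checked independently of any file format. The sketch below is a pandas-only illustration with hypothetical helper names (`rebuild`, `check_case`); it assumes "boolean" means the NumPy `bool` subtype and only verifies that dense values plus the recorded (subtype, fill_value) pair are enough to restore an equivalent SparseArray. It does not test the proposed `preserve_sparse` flag, which does not exist yet.

```python
import numpy as np
import pandas as pd

# (subtype, fill_value, values): one non-fill entry per case.
CASES = [
    ("int64", 0, [0, 0, 3, 0]),
    ("float64", 0.0, [0.0, 1.5, 0.0, 0.0]),
    ("bool", False, [False, True, False, False]),
    ("float64", np.nan, [np.nan, 2.0, np.nan, np.nan]),
]


def rebuild(dense, subtype, fill_value):
    """Rebuild a SparseArray from dense values and the recorded dtype parts."""
    return pd.arrays.SparseArray(dense, dtype=pd.SparseDtype(subtype, fill_value))


def check_case(subtype, fill_value, values):
    original = pd.arrays.SparseArray(
        np.array(values, dtype=subtype), fill_value=fill_value
    )
    dense = np.asarray(original)  # what a dense physical encoding would store
    restored = rebuild(dense, subtype, fill_value)
    assert restored.dtype == original.dtype
    assert restored.npoints == original.npoints  # same number of stored values
    np.testing.assert_array_equal(np.asarray(restored), dense)
    return restored


for case in CASES:
    check_case(*case)
```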

Notes

  • Scoped strictly to I/O compatibility (acknowledges the DEPR: SparseDtype #56518 discussion).
  • Backward compatible and opt-in.
  • Namespaced metadata (e.g., pandas.sparse.*).

Request for feedback

  • API flag name (preserve_sparse vs generalized).
  • Metadata keys/placement and interop concerns.
