Reading int96 timestamp in Parquet #427

@linhr

Spark may write timestamps using the deprecated int96 physical type in Parquet files. Currently, Sail cannot read such data correctly, for two reasons:

  1. Arrow reads int96 as a timestamp with nanosecond unit, while Spark expects microsecond unit, so the valid value ranges differ. A 64-bit nanosecond timestamp only covers roughly the years 1677 to 2262, while a microsecond timestamp covers a far wider range.
  2. Schema analysis requests (printSchema()) fail because we cannot convert the Arrow data type (nanosecond unit) back to a Spark data type.
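
To illustrate the range mismatch in item 1, here is a minimal Python sketch (not Sail or Arrow code) that decodes an int96 value by hand. It assumes the common int96 layout of 8 little-endian bytes for nanoseconds within the day followed by 4 little-endian bytes for the Julian day number; all function names are hypothetical and exist only for this example.

```python
import struct
from datetime import date, datetime, timedelta

JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588  # Julian day number of 1970-01-01
MICROS_PER_DAY = 86_400 * 1_000_000
NANOS_PER_DAY = MICROS_PER_DAY * 1_000
I64_MIN, I64_MAX = -(2**63), 2**63 - 1


def decode_int96(raw: bytes) -> tuple[int, int]:
    """Split a 12-byte int96 value into (julian_day, nanos_of_day)."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return julian_day, nanos_of_day


def int96_to_epoch_nanos(raw: bytes) -> int:
    """Nanoseconds since the Unix epoch (may not fit in a signed 64-bit integer)."""
    day, nanos = decode_int96(raw)
    return (day - JULIAN_DAY_OF_UNIX_EPOCH) * NANOS_PER_DAY + nanos


def int96_to_epoch_micros(raw: bytes) -> int:
    """Microseconds since the Unix epoch, truncating sub-microsecond digits."""
    day, nanos = decode_int96(raw)
    return (day - JULIAN_DAY_OF_UNIX_EPOCH) * MICROS_PER_DAY + nanos // 1_000


def fits_in_i64(value: int) -> bool:
    return I64_MIN <= value <= I64_MAX


# Build the int96 encoding of 3000-01-01 00:00:00, a value outside the
# range that a 64-bit nanosecond timestamp can represent.
days_since_epoch = (date(3000, 1, 1) - date(1970, 1, 1)).days
raw = struct.pack("<qi", 0, JULIAN_DAY_OF_UNIX_EPOCH + days_since_epoch)

print(fits_in_i64(int96_to_epoch_nanos(raw)))   # nanosecond unit overflows
print(fits_in_i64(int96_to_epoch_micros(raw)))  # microsecond unit fits
```

Python integers are unbounded, so the sketch checks the 64-bit limits explicitly; a reader that stores the nanosecond value in a real 64-bit integer would overflow or wrap instead.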

We should respect the Spark schema (stored as a metadata key) when reading the Parquet file. Type casting of timestamp seems possible after the recent upstream fix (apache/arrow-rs#7285). So we should be able to handle this after the next Arrow release.
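
As a rough sketch of what "respecting the Spark schema" could look like, the Python snippet below finds the columns that Spark declared as timestamps, which a reader could then cast to microsecond unit. It assumes the schema is stored as a JSON string under the key `org.apache.spark.sql.parquet.row.metadata` in the Parquet key-value metadata, and that the metadata has already been loaded into a plain dict. The helper name is hypothetical, and only top-level fields are inspected.

```python
import json

# Assumed key under which Spark stores its schema in Parquet file metadata.
SPARK_ROW_METADATA_KEY = "org.apache.spark.sql.parquet.row.metadata"


def spark_timestamp_columns(key_value_metadata: dict[str, str]) -> list[str]:
    """Return names of top-level columns that the Spark schema marks as timestamps."""
    schema_json = key_value_metadata.get(SPARK_ROW_METADATA_KEY)
    if schema_json is None:
        return []  # not written by Spark, or the schema key is missing
    schema = json.loads(schema_json)
    return [
        field["name"]
        for field in schema.get("fields", [])
        # Nested types are JSON objects, so they never match these strings.
        if field.get("type") in ("timestamp", "timestamp_ntz")
    ]


# Hand-written example metadata, not taken from a real file.
example_metadata = {
    SPARK_ROW_METADATA_KEY: json.dumps({
        "type": "struct",
        "fields": [
            {"name": "id", "type": "long", "nullable": False, "metadata": {}},
            {"name": "ts", "type": "timestamp", "nullable": True, "metadata": {}},
        ],
    })
}

print(spark_timestamp_columns(example_metadata))
```

In Sail itself this logic would live on the Rust side and would need to walk nested struct, array, and map types as well; the snippet only shows the idea of using the stored schema, rather than the inferred Arrow type, to choose the timestamp unit.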
