Restore DF 51 SchemaAdapter cast behaviour in ParquetOpener#45

Merged
zhuqi-lucas merged 1 commit into massive-com:branch-52 from zhuqi-lucas:fix/opener-schema-cast
Apr 1, 2026

@zhuqi-lucas
Collaborator

Summary

Restore DF 51's SchemaAdapter::map_batch() cast behaviour in the replace_schema step of ParquetOpener.

Problem

DF 52 removed SchemaAdapter and the replace_schema step now does strict type validation via RecordBatch::try_new_with_options. This fails for schema-evolved files where:

  • List inner field names differ (e.g. conditions vs element)
  • List inner field nullability differs (e.g. non-null Int32 vs nullable Int32)
  • Column types differ (e.g. Utf8 vs Date32)

Error:

column types must match schema types, expected List(Int32, field: 'element')
but found List(non-null Int32, field: 'conditions') at column index 1
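
The failure comes down to type equality: a list type carries its inner field, so two list types with the same element type but a different inner field name or nullability do not compare equal. Below is a minimal, dependency-free sketch of that comparison. `Field`, `DataType`, `list_of_int32`, and `strict_check` are simplified, hypothetical stand-ins written for illustration, not the Arrow or DataFusion definitions.

```rust
// Hypothetical, simplified stand-ins for Arrow's Field and DataType.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: Box<DataType>,
    nullable: bool,
}

#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    // The inner field (name, type, nullability) is part of the list type.
    List(Field),
}

/// Builds a list-of-Int32 type with the given inner field name and nullability.
fn list_of_int32(field_name: &str, nullable: bool) -> DataType {
    DataType::List(Field {
        name: field_name.to_string(),
        data_type: Box::new(DataType::Int32),
        nullable,
    })
}

/// Strict validation in the spirit of the check described above:
/// the column type must equal the schema type exactly.
fn strict_check(expected: &DataType, found: &DataType) -> Result<(), String> {
    if expected == found {
        Ok(())
    } else {
        Err(format!(
            "column types must match schema types, expected {:?} but found {:?}",
            expected, found
        ))
    }
}
```

With these definitions, `strict_check(&list_of_int32("element", true), &list_of_int32("conditions", false))` returns an `Err`, even though both sides describe a list of 32-bit integers.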

Fix

In replace_schema, for each column:

  1. If types match → reuse the column as-is (zero-copy, the common fast path)
  2. If types differ and are castable → arrow::compute::cast (handles Utf8→Date32 etc.)
  3. If types still differ after the cast (List/Struct field metadata) → rebuild the array data with into_builder().data_type(target).build()

This matches DF 51's SchemaAdapter::map_batch() + cast_column() pipeline.
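
The three steps above can be sketched without any Arrow dependency. Everything in this block is a hypothetical model for illustration: `Column`, `Values`, the toy `cast`, and `adapt_column` are not the code in opener.rs. In particular, the toy `cast` is assumed to leave list field metadata untouched so that step 3 has something to do.

```rust
use std::sync::Arc;

// Hypothetical, simplified stand-ins; not the real Arrow or DataFusion types.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    Date32,
    // List of Int32; the inner field's name and nullability are part of the type.
    List { field: String, nullable: bool },
}

#[derive(Debug, Clone, PartialEq)]
enum Values {
    Strings(Vec<String>),
    Days(Vec<i32>),
    Lists(Vec<Vec<i32>>),
}

#[derive(Debug, PartialEq)]
struct Column {
    data_type: DataType,
    values: Values,
}

type ColumnRef = Arc<Column>;

/// Days since 1970-01-01 for a Gregorian date (no range validation).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400);
    let mp = (m + 9) % 12; // March = 0
    let doy = (153 * mp + 2) / 5 + d - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146097 + doe - 719468
}

/// Parses "YYYY-MM-DD" into days since the Unix epoch.
fn parse_date(s: &str) -> Result<i32, String> {
    let parts: Vec<i64> = s
        .split('-')
        .map(|p| p.parse::<i64>().map_err(|e| format!("bad date {:?}: {}", s, e)))
        .collect::<Result<_, _>>()?;
    match parts.as_slice() {
        [y, m, d] => Ok(days_from_civil(*y, *m, *d) as i32),
        _ => Err(format!("bad date {:?}", s)),
    }
}

/// Toy cast kernel. For lists it keeps the source's inner-field metadata,
/// modelling the "types still differ after cast" case.
fn cast(col: &Column, target: &DataType) -> Result<Column, String> {
    match (&col.values, target) {
        (Values::Strings(s), DataType::Date32) => {
            let days = s.iter().map(|v| parse_date(v)).collect::<Result<Vec<_>, _>>()?;
            Ok(Column { data_type: DataType::Date32, values: Values::Days(days) })
        }
        (Values::Lists(_), DataType::List { .. }) => Ok(Column {
            data_type: col.data_type.clone(),
            values: col.values.clone(),
        }),
        _ => Err(format!("cannot cast {:?} to {:?}", col.data_type, target)),
    }
}

/// Adapts one column to the target type using the three steps.
fn adapt_column(col: &ColumnRef, target: &DataType) -> Result<ColumnRef, String> {
    // 1. Types match: hand back the same allocation (zero-copy fast path).
    if &col.data_type == target {
        return Ok(Arc::clone(col));
    }
    // 2. Types differ: try the cast.
    let casted = cast(col, target)?;
    if &casted.data_type == target {
        return Ok(Arc::new(casted));
    }
    // 3. Only metadata still differs: keep the values, swap in the target type.
    Ok(Arc::new(Column { data_type: target.clone(), values: casted.values }))
}
```

In this model, a `Utf8` column holding `"1970-01-02"` adapted to `Date32` yields the day value `1`, a column whose type already matches comes back as the same `Arc`, and a list column with inner field `conditions` (non-null) adapted to `element` (nullable) keeps its values and takes on the target type.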

Copilot AI review requested due to automatic review settings March 31, 2026 11:07

Copilot AI left a comment

Pull request overview

Restores DataFusion 51–style schema adaptation behavior in ParquetOpener’s replace_schema step so schema-evolved Parquet files (type changes and nested field metadata differences) can be read by casting/retyping columns instead of failing strict schema validation.

Changes:

  • Adds per-column adaptation in replace_schema: fast-path zero-copy when types match, otherwise attempts arrow::compute::cast, and finally rebuilds array data with the target DataType for nested metadata differences.
  • Extends Arrow imports to include ArrayRef for the adapted array vector.

Comment thread datafusion/datasource-parquet/src/opener.rs
Comment thread datafusion/datasource-parquet/src/opener.rs
Comment thread datafusion/datasource-parquet/src/opener.rs
@zhuqi-lucas force-pushed the fix/opener-schema-cast branch from ea67f67 to 24339a5 on March 31, 2026 12:18
DF 52 removed SchemaAdapter which handled type/field-name mismatches
between file and table schemas. The replace_schema step now:

1. Casts columns via arrow::compute::cast when types differ
   (e.g. Utf8 → Date32 for schema evolution)
2. Rebuilds arrays with target DataType when metadata differs
   (e.g. List inner field name/nullability mismatch)

This does NOT force nullability — that's the caller's responsibility
(e.g. atlas's adapt_table_schema_for_parquet for file columns).

Tests:
- test_utf8_to_date32_schema_evolution
- test_list_field_name_and_nullability_mismatch (quotes_v1 regression)
- test_nullability_mismatch_non_null_to_nullable
@zhuqi-lucas force-pushed the fix/opener-schema-cast branch from 24339a5 to fd292b6 on March 31, 2026 12:40
@zhuqi-lucas merged commit 0a0302b into massive-com:branch-52 on Apr 1, 2026
28 checks passed
2 participants