Skip files outside partition structure in hive-partitioned listing tables #51

Merged
zhuqi-lucas merged 2 commits into branch-52 from fix/skip-non-partition-files-branch52 on Apr 21, 2026
@zhuqi-lucas
Collaborator

Summary

Fix the 'Unable to get field named "year_month"' error on hive-partitioned listing tables when stale files exist in the table root directory (outside any partition_col=value/ path).

Problem

try_into_partitioned_file included root-level files with empty partition_values, which caused downstream errors when queries referenced partition columns. It also caused "Cannot merge statistics with different number of columns", because the root file has a different schema from the partition files.

Root cause

The table root s3://reference-data/vendors/benzinga/gov_trades/ contains:

data.parquet                          ← stale root file (27 cols, no partition)
year_month=2014-01/data.parquet       ← partition files (28 cols)
year_month=2014-02/data.parquet
...
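
The 27 vs 28 column mismatch in this listing is the kind of condition behind the statistics error. As a minimal sketch, assuming a hypothetical `SketchStats` type that holds one null count per column (this is not DataFusion's statistics type or its merge code), merging can only proceed when both sides describe the same number of columns:

```rust
/// Hypothetical per-file statistics for illustration only:
/// one null count per column in the file's schema.
#[derive(Debug, PartialEq)]
struct SketchStats {
    null_counts: Vec<usize>,
}

/// Merge two statistics objects column by column.
/// Fails when the files do not have the same number of columns,
/// mirroring the error message quoted in this PR.
fn merge_stats(a: &SketchStats, b: &SketchStats) -> Result<SketchStats, String> {
    if a.null_counts.len() != b.null_counts.len() {
        return Err(format!(
            "Cannot merge statistics with different number of columns: {} vs {}",
            a.null_counts.len(),
            b.null_counts.len()
        ));
    }
    let null_counts = a
        .null_counts
        .iter()
        .zip(&b.null_counts)
        .map(|(x, y)| x + y)
        .collect();
    Ok(SketchStats { null_counts })
}
```

With a 27-column root file and a 28-column partition file, `merge_stats` returns the error instead of a merged result, which is why keeping the stale root file out of the file list avoids the failure.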

Fix

try_into_partitioned_file now returns Ok(None) for files that do not match the partition structure, and the caller skips them via try_filter_map.
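
The shape of this fix can be sketched with the standard library alone. The names `SketchFile`, `try_into_sketch_file`, and `list_sketch_files` are hypothetical, partition values are assumed to be integers so that a conversion error is possible, and `filter_map(Result::transpose)` on an iterator stands in for `try_filter_map` on a stream. This is not the code in helpers.rs:

```rust
use std::num::ParseIntError;

/// Hypothetical stand-in for a partitioned file entry.
#[derive(Debug, PartialEq)]
struct SketchFile {
    path: String,
    partition_values: Vec<i32>,
}

/// Returns Ok(None) when `path` (relative to the table root) does not sit
/// under `col=value/` directories for every partition column, in order.
/// Returns Err only when a matching value cannot be converted.
fn try_into_sketch_file(
    path: &str,
    partition_cols: &[&str],
) -> Result<Option<SketchFile>, ParseIntError> {
    let segments: Vec<&str> = path.split('/').collect();
    // The last segment is the file name; the rest are directories.
    let dirs = &segments[..segments.len() - 1];
    if dirs.len() < partition_cols.len() {
        return Ok(None); // e.g. a stale file in the table root
    }
    let mut partition_values = Vec::with_capacity(partition_cols.len());
    for (&dir, &col) in dirs.iter().zip(partition_cols) {
        match dir.split_once('=') {
            Some((name, value)) if name == col => {
                partition_values.push(value.parse::<i32>()?)
            }
            _ => return Ok(None), // wrong column name or no `=`
        }
    }
    Ok(Some(SketchFile { path: path.to_string(), partition_values }))
}

/// Caller side: drop the Ok(None) entries, keep files, propagate errors.
fn list_sketch_files(
    paths: &[&str],
    partition_cols: &[&str],
) -> Result<Vec<SketchFile>, ParseIntError> {
    paths
        .iter()
        .map(|path| try_into_sketch_file(path, partition_cols))
        .filter_map(Result::transpose)
        .collect()
}
```

Returning `Ok(None)` keeps "not part of this table" distinct from a genuine failure, so a skipped file never aborts the listing while a bad partition value still surfaces as an error.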

Verified locally

Before: Cannot merge statistics with different number of columns: 27 vs 28
After: SELECT year_month, COUNT(*) GROUP BY year_month works correctly

Tests (5 passing)

  • test_try_into_partitioned_file_valid_partition
  • test_try_into_partitioned_file_root_file_skipped
  • test_try_into_partitioned_file_wrong_partition_name
  • test_try_into_partitioned_file_multiple_partitions
  • test_try_into_partitioned_file_partial_partition_skipped

…bles

When a hive-partitioned listing table contains files in the root
directory (not inside any partition_col=value/ path), these files
have no partition values. Previously they were included with empty
partition_values, causing "Unable to get field named" errors when
queries reference partition columns.

Now try_into_partitioned_file returns None for files that don't
match the partition structure, and the caller skips them.
Copilot AI review requested due to automatic review settings April 21, 2026 03:08

Copilot AI left a comment

Pull request overview

This PR updates hive-partitioned listing table file discovery to skip files that don’t conform to the expected partition_col=value/ directory structure, preventing downstream errors when stale/non-partitioned files exist in the table root.

Changes:

  • Change try_into_partitioned_file to return Result<Option<PartitionedFile>> and return Ok(None) for files outside the partition structure.
  • Update pruned_partition_list to skip non-partitioned files via try_filter_map.
  • Add unit tests covering valid/invalid partition paths; add chrono as a dev-dependency for test ObjectMeta construction.

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 6 comments.

File | Description
datafusion/catalog-listing/src/helpers.rs | Skip nonconforming files during partitioned listing and add tests for the new behavior.
datafusion/catalog-listing/Cargo.toml | Add chrono as a dev-dependency for the new tests.
Comments suppressed due to low confidence (1)

datafusion/catalog-listing/src/helpers.rs:368

  • parse_partitions_for_path can return Some even when it matched fewer path segments than partition_cols (e.g., a file at .../year=2024 with expected partitions year, month). In that case this code will produce fewer partition_values than there are partition columns, which can later cause schema/partition-value mismatches or errors during partition pruning. Consider validating parsed.len() == partition_cols.len() (and returning Ok(None) if not) before constructing partition_values, and add a regression test for this case.
    let partition_values = parsed
        .into_iter()
        .zip(partition_cols)
        .map(|(parsed, (_, datatype))| {
            ScalarValue::try_from_string(parsed.to_string(), datatype)
        })
        .collect::<Result<Vec<_>>>()?;
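
The concern in this suppressed comment comes from how `zip` behaves: it stops at the shorter input, so a short `parsed` list silently yields fewer values than there are partition columns. A minimal sketch of the suggested length check, using a hypothetical `zip_checked` helper over plain string slices (not the actual types in helpers.rs):

```rust
/// Hypothetical helper: pair each partition column with its parsed value,
/// but only when the counts match exactly. Without the length check,
/// `zip` would truncate to the shorter slice and hide the mismatch.
fn zip_checked<'a>(
    parsed: &[&'a str],
    partition_cols: &[&'a str],
) -> Option<Vec<(&'a str, &'a str)>> {
    if parsed.len() != partition_cols.len() {
        return None; // partial partition path: treat the file as nonconforming
    }
    Some(
        partition_cols
            .iter()
            .copied()
            .zip(parsed.iter().copied())
            .collect(),
    )
}
```

For a file under only `year=2024` with expected partitions `year` and `month`, the helper returns `None` rather than a single-element list.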

@zhuqi-lucas force-pushed the fix/skip-non-partition-files-branch52 branch from d42afba to c052327 on April 21, 2026 03:15
@zhuqi-lucas force-pushed the fix/skip-non-partition-files-branch52 branch from c052327 to df2781c on April 21, 2026 03:19
@zhuqi-lucas merged commit b4dbb6a into branch-52 on Apr 21, 2026
58 of 59 checks passed