Skip files outside partition structure in hive-partitioned listing tables #51
Merged
zhuqi-lucas merged 2 commits into `branch-52` on Apr 21, 2026
…bles When a hive-partitioned listing table contains files in the root directory (not inside any partition_col=value/ path), these files have no partition values. Previously they were included with empty partition_values, causing "Unable to get field named" errors when queries reference partition columns. Now try_into_partitioned_file returns None for files that don't match the partition structure, and the caller skips them.
Pull request overview
This PR updates hive-partitioned listing table file discovery to skip files that don’t conform to the expected partition_col=value/ directory structure, preventing downstream errors when stale/non-partitioned files exist in the table root.
Changes:
- Change `try_into_partitioned_file` to return `Result<Option<PartitionedFile>>` and return `Ok(None)` for files outside the partition structure.
- Update `pruned_partition_list` to skip non-partitioned files via `try_filter_map`.
- Add unit tests covering valid/invalid partition paths; add `chrono` as a dev-dependency for test `ObjectMeta` construction.
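The control flow in the first two bullets can be illustrated without DataFusion. The following is a minimal std-only sketch, not the PR's code: `FileEntry`, `try_into_entry`, and `list_partitioned` are hypothetical stand-ins, and the synchronous `filter_map` plus `Result::transpose` only approximates what `try_filter_map` does on the listing stream.

```rust
/// Hypothetical stand-in for a listed file plus its parsed partition values.
#[derive(Debug, PartialEq)]
struct FileEntry {
    path: String,
    partition_values: Vec<String>,
}

/// Hypothetical simplification of the conversion step:
/// - `Ok(Some(_))` for a file under `table_root/col=value/...`
/// - `Ok(None)` for a file that does not match the partition structure
/// - `Err(_)` for a path that is not under the table root at all
fn try_into_entry(
    table_root: &str,
    partition_cols: &[&str],
    path: &str,
) -> Result<Option<FileEntry>, String> {
    let relative = path
        .strip_prefix(table_root)
        .ok_or_else(|| format!("{path} is not under {table_root}"))?;
    let segments: Vec<&str> = relative.trim_start_matches('/').split('/').collect();
    // The last segment is the file name; everything before it is a directory.
    let dirs = &segments[..segments.len() - 1];
    if dirs.len() < partition_cols.len() {
        // Root-level or partially partitioned file: skip it, do not fail.
        return Ok(None);
    }
    let mut values = Vec::with_capacity(partition_cols.len());
    for (dir, col) in dirs.iter().zip(partition_cols) {
        match dir.split_once('=') {
            Some((name, value)) if name == *col => values.push(value.to_string()),
            _ => return Ok(None), // wrong partition name or no `col=value` form
        }
    }
    Ok(Some(FileEntry {
        path: path.to_string(),
        partition_values: values,
    }))
}

/// Caller side: drop `Ok(None)` entries and propagate the first `Err`.
fn list_partitioned(
    table_root: &str,
    partition_cols: &[&str],
    paths: &[&str],
) -> Result<Vec<FileEntry>, String> {
    paths
        .iter()
        .map(|p| try_into_entry(table_root, partition_cols, p))
        .filter_map(|r| r.transpose())
        .collect()
}
```

Returning `Ok(None)` rather than an error keeps "this file is not part of the table" distinct from a genuine failure, so a stale root-level file is ignored while real errors still surface to the caller.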
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| datafusion/catalog-listing/src/helpers.rs | Skip nonconforming files during partitioned listing and add tests for the new behavior. |
| datafusion/catalog-listing/Cargo.toml | Add chrono as a dev-dependency for the new tests. |
Comments suppressed due to low confidence (1)
datafusion/catalog-listing/src/helpers.rs:368
`parse_partitions_for_path` can return `Some` even when it matched fewer path segments than `partition_cols` (e.g., a file at `.../year=2024` with expected partitions `year, month`). In that case this code will produce fewer `partition_values` than there are partition columns, which can later cause schema/partition-value mismatches or errors during partition pruning. Consider validating `parsed.len() == partition_cols.len()` (and returning `Ok(None)` if not) before constructing `partition_values`, and add a regression test for this case.
    let partition_values = parsed
        .into_iter()
        .zip(partition_cols)
        .map(|(parsed, (_, datatype))| {
            ScalarValue::try_from_string(parsed.to_string(), datatype)
        })
        .collect::<Result<Vec<_>>>()?;
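The length check suggested in this comment can be sketched in isolation. The parser below is a hypothetical simplification written to reproduce the behavior the comment describes (it is not DataFusion's `parse_partitions_for_path`), and `complete_partitions` is an assumed helper name for the guard.

```rust
/// Hypothetical simplified parser: it stops at the file name, so it returns
/// `Some` even when fewer directories than partition columns were matched.
fn parse_partitions<'a>(
    relative_path: &'a str,
    partition_cols: &[&str],
) -> Option<Vec<&'a str>> {
    let segments: Vec<&str> = relative_path.split('/').collect();
    // The last segment is the file name; everything before it is a directory.
    let dirs = &segments[..segments.len() - 1];
    let mut values = Vec::new();
    for (dir, col) in dirs.iter().zip(partition_cols) {
        match dir.split_once('=') {
            Some((name, value)) if name == *col => values.push(value),
            _ => return None,
        }
    }
    Some(values)
}

/// Guard in the spirit of the review comment: accept the parse only when
/// every partition column received a value.
fn complete_partitions<'a>(
    relative_path: &'a str,
    partition_cols: &[&str],
) -> Option<Vec<&'a str>> {
    parse_partitions(relative_path, partition_cols)
        .filter(|parsed| parsed.len() == partition_cols.len())
}
```

With partition columns `year, month`, the path `year=2024/data.parquet` parses to a single value under the simplified parser, and the guard turns that partial match into `None` so the caller can skip the file.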
d42afba to c052327
c052327 to df2781c
xudong963 approved these changes Apr 21, 2026
zhuqi-lucas added a commit that referenced this pull request Apr 23, 2026
Summary
Fix `Unable to get field named "year_month"` error on hive-partitioned listing tables when stale files exist in the root directory (outside any `partition_col=value/` path).
Problem
`try_into_partitioned_file` included root-level files with empty `partition_values`, causing downstream errors when queries reference partition columns. This also caused `Cannot merge statistics with different number of columns` since the root file has a different schema than partition files.
Root cause
`s3://reference-data/vendors/benzinga/gov_trades/` has:
Fix
`try_into_partitioned_file` returns `Ok(None)` for files not matching the partition structure. The caller skips them via `try_filter_map`.
Verified locally
Before: `Cannot merge statistics with different number of columns: 27 vs 28`
After: `SELECT year_month, COUNT(*) GROUP BY year_month` works correctly
Tests (5 passing)
- `test_try_into_partitioned_file_valid_partition`
- `test_try_into_partitioned_file_root_file_skipped`
- `test_try_into_partitioned_file_wrong_partition_name`
- `test_try_into_partitioned_file_multiple_partitions`
- `test_try_into_partitioned_file_partial_partition_skipped`