feat: validate DataFile column_indices at commit time (#6414)

Merged
westonpace merged 1 commit into lance-format:main from westonpace:feat-validate-data-files-at-commit on Apr 6, 2026

@westonpace
Member

Summary

  • Add check_column_indices() validation in rust/lance/src/io/commit.rs that rejects non-leaf fields (structs, lists) with real column indices in v2.1+ data files at commit time, preventing cryptic read-time errors
  • Exempt packed structs and blob fields, which legitimately have column indices in v2.1+
  • Wire the check into both the commit_transaction and do_commit_detached_transaction paths

Closes #6412

Test plan

  • test_check_column_indices_rejects_struct_with_column — struct with column_index=0 in v2.1 → error
  • test_check_column_indices_rejects_list_with_column — list with column_index=0 in v2.1 → error
  • test_check_column_indices_allows_correct_v21 — correct indices (non-leaf=-1, leaf>=0) → ok
  • test_check_column_indices_allows_packed_struct — packed struct with real column_index → ok
  • test_check_column_indices_skips_v20 — non-leaf with column_index>=0 in v2.0 → ok (no validation)
  • cargo clippy -p lance --tests -- -D warnings passes
  • cargo fmt --all clean

🤖 Generated with Claude Code

@github-actions github-actions Bot added the enhancement New feature or request label Apr 6, 2026
@westonpace westonpace force-pushed the feat-validate-data-files-at-commit branch 2 times, most recently from 38d4eec to 6f3f6e3 Compare April 6, 2026 14:22
In Lance file format v2.1+, non-leaf fields (structs, lists) no longer
have their own columns — their validity is folded into rep/def levels.
Their column_indices entry should be -1. However, users manually
constructing DataFile messages could incorrectly assign real column
indices, leading to cryptic read-time errors.
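
To make the -1 convention concrete, here is a minimal sketch of how column indices could be laid out for a nested schema in v2.1+. The `Field` type and `expected_column_indices` helper are hypothetical and are not part of Lance. Assigning leaf columns sequentially in depth-first order is an assumption made only for illustration; the message above only requires that non-leaf fields get -1 and leaves get a real index.

```rust
/// Hypothetical, simplified field model (not the real Lance schema type).
/// A field with no children is treated as a leaf.
#[derive(Debug, Clone)]
pub struct Field {
    pub id: i32,
    pub children: Vec<Field>,
}

/// Returns (field_id, column_index) pairs for a v2.1+ style layout:
/// non-leaf fields get -1, leaves get the next free column.
/// Sequential depth-first numbering is an illustrative assumption.
pub fn expected_column_indices(fields: &[Field]) -> Vec<(i32, i32)> {
    fn walk(field: &Field, next_col: &mut i32, out: &mut Vec<(i32, i32)>) {
        if field.children.is_empty() {
            // Leaf: owns a real column.
            out.push((field.id, *next_col));
            *next_col += 1;
        } else {
            // Struct or list: no column of its own in v2.1+.
            out.push((field.id, -1));
            for child in &field.children {
                walk(child, next_col, out);
            }
        }
    }

    let mut out = Vec::new();
    let mut next_col = 0;
    for field in fields {
        walk(field, &mut next_col, &mut out);
    }
    out
}
```

For a schema with a struct (id 0) holding two leaves (ids 1 and 2) followed by a top-level leaf (id 3), this sketch yields `[(0, -1), (1, 0), (2, 1), (3, 2)]`. The mistake the PR guards against is a hand-built entry such as `(0, 0)` for the struct.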

Add check_column_indices() validation that runs at commit time in both
regular and detached commit paths. The validation checks:
- fields and column_indices have matching lengths
- non-leaf fields have column_index=-1
- leaf/packed-struct/blob fields have a real column index (not -1)

Field ids not found in the schema are skipped, since schema evolution
operations (cast, drop column) leave old field ids in existing data
files. Packed structs and blob fields are exempted from the non-leaf
check as they legitimately have column indices even in v2.1+.
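
The rules above can be sketched roughly as follows. This is not the code merged in rust/lance/src/io/commit.rs: `FieldKind`, `DataFileMeta`, and `check_column_indices_sketch` are hypothetical stand-ins, the schema is reduced to a map from field id to kind, and errors are plain strings. Skipping every check for files older than v2.1 is an assumption based on the v2.0 test case in the test plan.

```rust
use std::collections::HashMap;

/// Hypothetical classification of a schema field (not a Lance type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FieldKind {
    Leaf,
    Struct,
    List,
    PackedStruct,
    Blob,
}

impl FieldKind {
    /// Leaves, packed structs, and blobs own a real column in v2.1+.
    fn needs_column(self) -> bool {
        matches!(self, FieldKind::Leaf | FieldKind::PackedStruct | FieldKind::Blob)
    }
}

/// Hypothetical stand-in for the parts of a DataFile that matter here.
pub struct DataFileMeta {
    pub fields: Vec<i32>,
    pub column_indices: Vec<i32>,
    /// (major, minor) file format version.
    pub version: (u32, u32),
}

pub fn check_column_indices_sketch(
    file: &DataFileMeta,
    schema: &HashMap<i32, FieldKind>,
) -> Result<(), String> {
    // Assumption: files older than v2.1 are not validated at all.
    if file.version < (2, 1) {
        return Ok(());
    }
    if file.fields.len() != file.column_indices.len() {
        return Err(format!(
            "fields has {} entries but column_indices has {}",
            file.fields.len(),
            file.column_indices.len()
        ));
    }
    for (&field_id, &col) in file.fields.iter().zip(&file.column_indices) {
        // Field ids left behind by schema evolution are skipped.
        let Some(&kind) = schema.get(&field_id) else {
            continue;
        };
        if kind.needs_column() && col < 0 {
            return Err(format!("field {field_id} ({kind:?}) has no column index"));
        }
        if !kind.needs_column() && col != -1 {
            return Err(format!(
                "non-leaf field {field_id} ({kind:?}) has column index {col}, expected -1"
            ));
        }
    }
    Ok(())
}
```

Looking fields up by id, rather than walking the schema, is what lets stale ids from cast or drop-column operations pass through without a false error.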

Closes lance-format#6412

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@westonpace westonpace force-pushed the feat-validate-data-files-at-commit branch from 6f3f6e3 to bef97e9 Compare April 6, 2026 15:34
@codecov
codecov Bot commented Apr 6, 2026

Codecov Report

❌ Patch coverage is 96.87500% with 6 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
rust/lance/src/io/commit.rs | 96.87% | 3 Missing and 3 partials ⚠️

@westonpace westonpace marked this pull request as ready for review April 6, 2026 16:35
@westonpace westonpace merged commit efc9374 into lance-format:main Apr 6, 2026
28 of 29 checks passed

Development

Successfully merging this pull request may close these issues.

At commit time return an error if the data file metadata is incorrect
