
feat: add DataFile.create helper for building DataFile metadata #6427

Merged
westonpace merged 3 commits into lance-format:main from westonpace:feat-data-file-helper on Apr 8, 2026


@westonpace
Member

Summary

  • Adds DataFile.create(dataset, path, *, base_id=None) classmethod that reads a lance file's metadata and automatically constructs a DataFile with correct field IDs, column indices, file version, and file size
  • Eliminates the need for manual DataFile construction when performing DataReplacement operations
  • Handles packed structs, structural file versions (v2.1+), subset columns, and external base paths
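
The bookkeeping that the helper automates can be illustrated with a small, self-contained model. This is a simplified sketch, not Lance's implementation: `ToyDataFile` and `build_data_file` are hypothetical names, the schema is assumed to be flat (no nested fields or packed structs), and each file column is assumed to map to a dataset field by name.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ToyDataFile:
    """Simplified stand-in for DataFile metadata (not the real Lance class)."""

    path: str
    fields: List[int]          # dataset field ID for each column in the file
    column_indices: List[int]  # position of each field's column inside the file
    file_size_bytes: int
    base_id: Optional[int] = None


def build_data_file(
    dataset_field_ids: Dict[str, int],
    file_columns: List[str],
    path: str,
    file_size_bytes: int,
    base_id: Optional[int] = None,
) -> ToyDataFile:
    """Map each file column to its dataset field ID and its index in the file.

    Assumption: flat schema, columns matched by name. Raises ValueError when
    the file contains a column the dataset schema does not know about.
    """
    fields: List[int] = []
    column_indices: List[int] = []
    for index, name in enumerate(file_columns):
        if name not in dataset_field_ids:
            raise ValueError(f"column {name!r} is not in the dataset schema")
        fields.append(dataset_field_ids[name])
        column_indices.append(index)
    return ToyDataFile(path, fields, column_indices, file_size_bytes, base_id)


# A file holding only the "label" column of a three-column dataset.
schema_ids = {"id": 0, "vector": 1, "label": 2}
subset = build_data_file(schema_ids, ["label"], "new-label.lance", 4096)
```

In this model the subset file carries the dataset-level field ID (`2`) but a file-local column index (`0`), which is the distinction that is easy to get wrong when building the metadata by hand.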

Closes #6413

Test plan

  • test_data_file_create_basic — verifies fields, column_indices, version, file_size for a two-column file
  • test_data_file_create_subset_columns — single column from a multi-column dataset
  • test_data_file_create_end_to_end — full DataReplacement round-trip using the new helper
  • test_data_file_create_unknown_column — error on column not in dataset schema
  • All existing test_table_ops.py tests still pass

🤖 Generated with Claude Code

…lance files

Adds a convenience method `DataFile.create(dataset, path)` that reads a lance
file's metadata and automatically determines field IDs, column indices, file
version, and file size — eliminating the need for manual DataFile construction
when performing DataReplacement operations.

Closes lance-format#6413

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions bot added the enhancement (New feature or request) and python labels on Apr 7, 2026
Move the core logic from the Python binding into a public async method
on Dataset so Rust users can also construct DataFile metadata from
existing lance files. The Python binding is now a thin wrapper.

Also refactors data_file_dir to reuse the new data_file_dir_for_base
helper, removing duplicated base path resolution logic.
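
The shape of that refactor can be sketched in Python for illustration. Everything here is an assumption made for the example: the function signatures, the `FileRef` type, and the idea that a base path table maps `base_id` to a directory are hypothetical and do not reproduce the Rust code in this PR. The point is only that the per-file function delegates to the per-base helper instead of repeating the lookup.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class FileRef:
    """Hypothetical reference to a data file, optionally under an external base."""

    path: str
    base_id: Optional[int] = None


def data_file_dir_for_base(
    default_data_dir: str,
    base_paths: Mapping[int, str],
    base_id: Optional[int],
) -> str:
    """Resolve the directory for a base ID; None means the dataset's own data dir."""
    if base_id is None:
        return default_data_dir
    if base_id not in base_paths:
        raise KeyError(f"no base path registered for base_id={base_id}")
    return base_paths[base_id]


def data_file_dir(
    default_data_dir: str,
    base_paths: Mapping[int, str],
    data_file: FileRef,
) -> str:
    """Resolve the directory for a file by delegating to the per-base helper."""
    return data_file_dir_for_base(default_data_dir, base_paths, data_file.base_id)


bases = {7: "s3://other-bucket/shared"}
local_dir = data_file_dir("ds.lance/data", bases, FileRef("a.lance"))
external_dir = data_file_dir("ds.lance/data", bases, FileRef("b.lance", base_id=7))
```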

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@westonpace westonpace marked this pull request as ready for review April 7, 2026 17:02

codecov bot commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 5.40541% with 70 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/dataset.rs | 5.40% | 69 Missing and 1 partial ⚠️ |


Comment thread rust/lance/src/dataset.rs
Comment on lines +1747 to +1749
let file = scheduler
    .open_file(&filepath, &CachedFileSize::unknown())
    .await?;
Contributor


nitpick: you just read the file size, so you should be able to pass it here:

Suggested change
- let file = scheduler
-     .open_file(&filepath, &CachedFileSize::unknown())
-     .await?;
+ let file = scheduler
+     .open_file(&filepath, &CachedFileSize::new(file_size))
+     .await?;
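
The reasoning behind this nitpick can be shown with a toy model in Python. This is a sketch under stated assumptions, not Lance's Rust API: `CachedSize`, `CountingStore`, and `open_file` are hypothetical names, and the model assumes that opening a file with an unknown size costs one extra size lookup against the store.

```python
from typing import Callable, Optional


class CachedSize:
    """Toy stand-in for a cached file size that may or may not be known yet."""

    def __init__(self, size: Optional[int] = None) -> None:
        self._size = size

    @classmethod
    def unknown(cls) -> "CachedSize":
        return cls(None)

    def get_or_fetch(self, fetch: Callable[[], int]) -> int:
        # Only call out to the store when no size has been cached.
        if self._size is None:
            self._size = fetch()
        return self._size


class CountingStore:
    """Pretend object store that counts how many size lookups it serves."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.size_lookups = 0

    def fetch_size(self) -> int:
        self.size_lookups += 1
        return self.size


def open_file(store: CountingStore, cached: CachedSize) -> int:
    """Return the file size needed to open the file, reusing the cache if set."""
    return cached.get_or_fetch(store.fetch_size)


store = CountingStore(size=1024)
file_size = store.fetch_size()            # the size was already read once
open_file(store, CachedSize.unknown())    # unknown size: triggers a second lookup
lookups_with_unknown = store.size_lookups
open_file(store, CachedSize(file_size))   # known size: no further lookup
lookups_with_known = store.size_lookups
```

In this model, passing the size that was already read leaves the lookup count unchanged, whereas passing an unknown size adds one lookup per open.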

@westonpace westonpace merged commit 6112a34 into lance-format:main Apr 8, 2026
28 checks passed

Labels

enhancement New feature or request python


Development

Successfully merging this pull request may close these issues.

Create a python helper function for creating DataFile metadata

2 participants