Skip to content

feat: expose tracked_files and all_files on LanceDataset#6011

Merged
wjones127 merged 2 commits into
lance-format:mainfrom
wjones127:feat/dataset-file-inspection-apis
Jun 5, 2026
Merged

feat: expose tracked_files and all_files on LanceDataset#6011
wjones127 merged 2 commits into
lance-format:mainfrom
wjones127:feat/dataset-file-inspection-apis

Conversation

@wjones127
Copy link
Copy Markdown
Contributor

@wjones127 wjones127 commented Feb 25, 2026

Adds new tracked_files() and all_files() methods that return data about files in a table. Both return as Arrow data.

tracked_files() outputs a row for every file referenced by each version. Files that are referenced by multiple versions (such as a data file) have a row for each version. This has columns for base_uri, version, path, and file_type.

all_files() outputs a row for every file in the dataset root directory, whether or not they are part of the table. This has columns for base_uri, path, file_size, last_modified.

These two data streams can be used in combination to do deeper analysis on file structure of a table. It can answer questions like: How much of the storage space is taken up by untracked files? When were untracked files created? Which files are taking up the most space? How big is version X?

@github-actions github-actions Bot added enhancement New feature or request A-python Python bindings labels Feb 25, 2026
@github-actions
Copy link
Copy Markdown
Contributor

PR Review

P1 Issues

1. Accidental test assertion removal

The diff removes an existing assertion from test_default_scan_options_nearest:

-    assert "id" in result.column_names

This appears unintentional - the assertion belongs to the previous test and should be preserved.

2. Documentation mismatch in tracked_files() Rust doc comment

The doc comment at rust/lance/src/dataset/files.rs:373 states:

| `type` | `Dictionary(Int32, Utf8)` (non-null) | ...

But the actual schema uses Int8 for the dictionary key:

DataType::Dictionary(Box::new(DataType::Int8), Box::new(DataType::Utf8))

The doc should say Dictionary(Int8, Utf8).


Otherwise, the implementation looks solid with good test coverage, efficient batching, and proper error propagation through the channel pattern.

@wjones127 wjones127 changed the title feat(python): expose tracked_files and all_files on LanceDataset feat: expose tracked_files and all_files on LanceDataset Feb 25, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Feb 25, 2026

Codecov Report

❌ Patch coverage is 93.77934% with 53 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/dataset/files.rs 93.13% 40 Missing and 13 partials ⚠️

📢 Thoughts on this report? Let us know!

@wjones127 wjones127 marked this pull request as ready for review February 25, 2026 21:53
@wjones127
Copy link
Copy Markdown
Contributor Author

wjones127 commented Feb 25, 2026

There will be some follow ups to make this useful:

  • Add progress reporting (re-use the index progress stuff)
  • Optimize speed of listing all files
  • Optimize speed of listing tracked files

@Xuanwo
Copy link
Copy Markdown
Collaborator

Xuanwo commented Mar 7, 2026

Should blob files also need to be tracked in those APIs?

@wjones127 wjones127 force-pushed the feat/dataset-file-inspection-apis branch 3 times, most recently from 8d63dd2 to 02e4bfc Compare May 12, 2026 00:46
Adds new `tracked_files()` and `all_files()` methods that return data
about files in a table. Both return as Arrow data.

`tracked_files()` outputs a row for every file referenced by each
version. Files that are referenced by multiple versions (such as a data
file) have a row for each version. This has columns for `base_uri`,
`version`, `path`, and `file_type`. Supports a `min_version` filter and
progress callback via `TrackedFilesOptions`. Internally split into a
Lister (enumerates locations) and a Reader (reads manifests with
memory-aware backpressure, ~1 GB budget) to avoid deadlocks.

`all_files()` outputs a row for every file in the dataset root
directory, whether or not they are part of the table. This has columns
for `base_uri`, `path`, `file_size`, `last_modified`.

These two data streams can be used in combination to do deeper analysis
on file structure of a table.

Adds PyO3 bindings and Python wrappers with docstrings; both return a
`pa.RecordBatchReader`.

Co-Authored-By: Claude <noreply@anthropic.com>
@wjones127 wjones127 force-pushed the feat/dataset-file-inspection-apis branch from 02e4bfc to e783188 Compare May 12, 2026 16:00
@wjones127
Copy link
Copy Markdown
Contributor Author

Should blob files also need to be tracked in those APIs?

Yes, though I'd like to do that in a follow up.

FileRow {
version: manifest.version,
base_uri: Cow::Borrowed(effective_base_uri),
path: Cow::Owned(format!("{}/{}", DATA_DIR, data_file.path)),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to check whether the base_uri points to a dataset root here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The base URIs in manifest.base_paths are always Lance dataset roots — they're only populated by Manifest::shallow_clone, which records the source dataset's root as a BasePath. So appending {DATA_DIR}/{data_file.path} yields a valid path without an extra check, and a None base_id falls back to this dataset's own root, which is likewise a valid root by definition.

Comment thread rust/lance/src/dataset/files.rs Outdated
let deletion_files = manifest.fragments.iter().filter_map(|fragment| {
fragment.deletion_file.as_ref().map(|del_file| FileRow {
version: manifest.version,
base_uri: Cow::Borrowed(base_uri),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible that the deletion files are from another base?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — yes. DeletionFile carries its own base_id and shallow_clone sets it alongside the data files. Fixed in ec45637: both branches now share a resolve_base_uri helper, and I extended the base-id unit test to cover a deletion file from another base.

Deletion files carry a `base_id` when they originate from a shallow
clone (set by `Manifest::shallow_clone`), but `tracked_files` always
reported them under the dataset's own `base_uri`. Resolve them against
`manifest.base_paths` the same way data files are, via a shared
`resolve_base_uri` helper.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine to me. I'll add the review from Claude for completeness but none of these suggestions seem critical:

Here is my review of PR #6011:


PR #6011: feat: expose tracked_files and all_files on LanceDataset

Overview: Adds two new streaming APIs — tracked_files() and all_files() — for inspecting the physical file layout of a Lance dataset. The Rust implementation lives in a new dataset/files.rs module and uses a multi-task pipeline architecture. Python bindings are thin wrappers delegating to Rust.


Correctness

Pipeline termination / stream completeness:
The tracked_files_with_options pipeline spawns 4 tasks (Lister, Reader, Emitter, IndexLister) that write into a shared tx channel. The stream ends when rx is dropped — but that only happens after RecordBatchStreamAdapter is dropped. The problem: there's no coordination to ensure all 4 senders have finished before the stream ends. If a consumer calls read_all() and the IndexLister or Emitter is still in flight when RecordBatchStreamAdapter returns, results could be silently dropped. Specifically:

  • tx is cloned 4 times; it's dropped in tracked_files_with_options after all 4 tokio::spawn calls, but tx_idx (the last clone) is only dropped when IndexLister finishes. If IndexLister is slow, the stream will correctly wait. This seems fine on closer inspection, but the flow is subtle enough to warrant a comment.

Index deduplication across manifest versions:
The uuid_cache in IndexLister deduplicates UUID directory listings across versions, which is correct and efficient. However, if multiple manifest versions reference the same index UUID, index_file_batch is called once per version with the same files. This means index files will appear N times in the output (once per version referencing them) — which appears to be intentional (as with data files), but it's not documented in the tracked_files API docs.

size as i64 cast in build_all_files_batch:
meta.size is usize, so on a 64-bit platform it's fine, but a usize value > i64::MAX would silently truncate. Using i64::try_from(meta.size).unwrap_or(i64::MAX) would be safer.

Memory budget race:
The Reader task checks inflight_mem < MANIFEST_MEMORY_BUDGET before launching, but adds to inflight_mem after the check. Under concurrent reads, the actual in-flight memory could exceed the budget by up to parallelism × max_manifest_size. This is probably acceptable given the budget is 1 GB and MANIFEST_DECOMPRESSION_RATIO is an estimate, but worth a comment.


Design / API

all_files missing base_paths coverage:
The docstring correctly notes that base_paths entries are not scanned, but this is a sharp edge for shallow-clone users. Consider at minimum adding a See Also reference to the tracked_files method and noting that cross-referencing the two outputs is how you detect untracked vs tracked files.

tracked_files return type:
tracked_files and tracked_files_with_options return SendableRecordBatchStream (not Result<...>), hiding async errors until the consumer iterates. This is consistent with DataFusion conventions, but the function signature difference (async fn returning a stream vs. a Result) may surprise callers. Documenting this explicitly would help.

FileType discriminant coupling:
The comment in file_types.rs correctly notes that discriminants must stay in sync with FILE_TYPE_DICT_ARRAY, but the only enforcement is a code comment. A #[cfg(test)] assertion checking that FileType::Manifest as i8 == 0, DataFile == 1, etc., would prevent silent breakage if someone reorders variants.


Performance

manifest_file_batches capacity estimate:
The batch count estimate iter.len().div_ceil(BATCH_SIZE) is computed before the first batch is flushed, then re-computed inside the loop as next_size. The exact_size(size) wrapping uses the outer estimate. Since ExactSizeIterator::len() is used for capacity, any mismatch causes over/under-allocation. This looks correct but is subtle — the outer size is the total batch count, and exact_size(size) just pre-sizes the iterator's length hint, not the builder capacity. Fine as-is, but adds cognitive overhead.

collect_column_values in tests:
Test helper calls dict_value_at which does two downcast_ref attempts per cell. Acceptable in tests but worth noting if this pattern migrates to production code.


Test Coverage

Tests are thorough and follow project conventions. Specific strengths:

  • test_tracked_files_paths_match_disk cross-validates tracked_files against all_files — excellent cross-API integration test.
  • test_manifest_file_rows_per_file_base_id is a focused unit test for the shallow-clone base_id resolution logic.
  • Progress callback semantics are verified.

Missing coverage:

  • No test for all_files when base_paths contains externally-located files (to document the known limitation).
  • No test for tracked_files on an empty dataset (zero manifests).
  • test_tracked_files_progress asserts the last update has manifests_total == Some(3), but the first two updates could have manifests_total == None due to the race between Lister and Emitter — the test doesn't assert this, which is fine but leaves the "total is None until listing finishes" contract untested.

Minor Issues

  • python/python/lance/dataset.py: all_files docstring says last_modified but the schema column is last_modified (consistent), while the original PR description says last_modified. No issue, just confirming consistency.
  • rust/lance-table/src/utils.rs: Adding impl<T: Iterator> ExactSizeIterator for ExactSize<T> is a blanket impl that could conflict with future stdlib changes. It's a small, self-contained addition but worth flagging as a potential semver hazard.
  • The MANIFEST_MEMORY_BUDGET const of 1 GB is undocumented as to whether this is per-dataset-instance or global. Since inflight_mem is created fresh per call to tracked_files_with_options, it's per-call — but concurrent calls could exceed available memory.

Summary

This is a well-structured, genuinely useful feature with solid test coverage. The pipeline architecture is appropriate for streaming large datasets without OOM. The main things worth addressing before merge:

  1. (P1) Document that index files appear once per manifest version that references them (or deduplicate if unintended).
  2. (P1) Add a #[cfg(test)] sanity check that FileType discriminants match FILE_TYPE_DICT_ARRAY ordering.
  3. (P2) usize as i64 cast for file sizes — use try_from or document the assumption.
  4. (P2) Test for empty dataset (zero manifests) edge case.

@wjones127 wjones127 merged commit df94ee6 into lance-format:main Jun 5, 2026
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-python Python bindings enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants