feat: expose tracked_files and all_files on LanceDataset by wjones127 · Pull Request #6011 · lance-format/lance

wjones127 · 2026-02-25T16:36:59Z

Adds new tracked_files() and all_files() methods that return data about files in a table. Both return Arrow data.

tracked_files() outputs a row for every file referenced by each version. Files that are referenced by multiple versions (such as a data file) have a row for each version. This has columns for base_uri, version, path, and file_type.

all_files() outputs a row for every file in the dataset root directory, whether or not they are part of the table. This has columns for base_uri, path, file_size, last_modified.

These two data streams can be used in combination to do deeper analysis on the file structure of a table. Together they can answer questions like: How much of the storage space is taken up by untracked files? When were untracked files created? Which files are taking up the most space? How big is version X?
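As a rough illustration of how the two outputs combine, here is a minimal sketch that models the rows as plain Rust structs rather than Arrow batches. The struct and function names are assumptions made for the example, not part of the PR. It computes the bytes held by untracked files (an anti-join on base_uri and path) and the bytes referenced by each version.

```rust
use std::collections::{HashMap, HashSet};

/// One row of the `all_files()` output, modeled as a plain struct for illustration.
#[derive(Debug, Clone)]
struct DiskFile {
    base_uri: String,
    path: String,
    file_size: i64,
}

/// One row of the `tracked_files()` output (the `file_type` column is omitted here).
#[derive(Debug, Clone)]
struct TrackedFile {
    base_uri: String,
    version: u64,
    path: String,
}

/// Total bytes held by files that no version references (anti-join on base_uri + path).
fn untracked_bytes(all: &[DiskFile], tracked: &[TrackedFile]) -> i64 {
    let referenced: HashSet<(&str, &str)> = tracked
        .iter()
        .map(|t| (t.base_uri.as_str(), t.path.as_str()))
        .collect();
    all.iter()
        .filter(|f| !referenced.contains(&(f.base_uri.as_str(), f.path.as_str())))
        .map(|f| f.file_size)
        .sum()
}

/// Bytes referenced by each version. A file shared by several versions counts toward each.
/// Tracked files with no matching `all_files()` row (for example, files under another
/// base path) are skipped because their size is unknown here.
fn bytes_per_version(all: &[DiskFile], tracked: &[TrackedFile]) -> HashMap<u64, i64> {
    let sizes: HashMap<(&str, &str), i64> = all
        .iter()
        .map(|f| ((f.base_uri.as_str(), f.path.as_str()), f.file_size))
        .collect();
    let mut totals = HashMap::new();
    for t in tracked {
        if let Some(size) = sizes.get(&(t.base_uri.as_str(), t.path.as_str())) {
            *totals.entry(t.version).or_insert(0) += size;
        }
    }
    totals
}
```

Because a shared data file has one tracked row per version, summing the per-version totals overstates physical storage; the untracked total does not have that problem since each on-disk file appears once.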

github-actions · 2026-02-25T16:38:14Z

PR Review

P1 Issues

1. Accidental test assertion removal

The diff removes an existing assertion from test_default_scan_options_nearest:

-    assert "id" in result.column_names

This appears unintentional - the assertion belongs to the previous test and should be preserved.

2. Documentation mismatch in tracked_files() Rust doc comment

The doc comment at rust/lance/src/dataset/files.rs:373 states:

| `type` | `Dictionary(Int32, Utf8)` (non-null) | ...

But the actual schema uses Int8 for the dictionary key:

DataType::Dictionary(Box::new(DataType::Int8), Box::new(DataType::Utf8))

The doc should say Dictionary(Int8, Utf8).

Otherwise, the implementation looks solid with good test coverage, efficient batching, and proper error propagation through the channel pattern.

codecov · 2026-02-25T17:31:34Z

Codecov Report

❌ Patch coverage is 93.77934% with 53 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance/src/dataset/files.rs	93.13%	40 Missing and 13 partials ⚠️

wjones127 · 2026-02-25T21:56:00Z

There will be some follow ups to make this useful:

Add progress reporting (re-use the index progress stuff)
Optimize speed of listing all files
Optimize speed of listing tracked files

Xuanwo · 2026-03-07T06:36:14Z

Should blob files also be tracked in those APIs?

Adds new `tracked_files()` and `all_files()` methods that return data about files in a table. Both return as Arrow data.

`tracked_files()` outputs a row for every file referenced by each version. Files that are referenced by multiple versions (such as a data file) have a row for each version. This has columns for `base_uri`, `version`, `path`, and `file_type`. Supports a `min_version` filter and progress callback via `TrackedFilesOptions`. Internally split into a Lister (enumerates locations) and a Reader (reads manifests with memory-aware backpressure, ~1 GB budget) to avoid deadlocks.

`all_files()` outputs a row for every file in the dataset root directory, whether or not they are part of the table. This has columns for `base_uri`, `path`, `file_size`, `last_modified`.

These two data streams can be used in combination to do deeper analysis on file structure of a table.

Adds PyO3 bindings and Python wrappers with docstrings; both return a `pa.RecordBatchReader`.

Co-Authored-By: Claude <noreply@anthropic.com>

wjones127 · 2026-05-12T20:38:08Z

> Should blob files also be tracked in those APIs?

Yes, though I'd like to do that in a follow up.

BubbleCal · 2026-05-27T13:55:25Z

+            FileRow {
+                version: manifest.version,
+                base_uri: Cow::Borrowed(effective_base_uri),
+                path: Cow::Owned(format!("{}/{}", DATA_DIR, data_file.path)),


Do we need to check whether the base_uri points to a dataset root here?

wjones127 · 2026-06-01

The base URIs in manifest.base_paths are always Lance dataset roots: they're only populated by Manifest::shallow_clone, which records the source dataset's root as a BasePath. So appending {DATA_DIR}/{data_file.path} yields a valid path without an extra check, and a None base_id falls back to this dataset's own root, which is likewise a valid root by definition.

BubbleCal · 2026-05-27T13:57:34Z

+    let deletion_files = manifest.fragments.iter().filter_map(|fragment| {
+        fragment.deletion_file.as_ref().map(|del_file| FileRow {
+            version: manifest.version,
+            base_uri: Cow::Borrowed(base_uri),


Is it possible that the deletion files are from another base?

wjones127 · 2026-06-01

Good catch, yes. DeletionFile carries its own base_id and shallow_clone sets it alongside the data files. Fixed in ec45637: both branches now share a resolve_base_uri helper, and I extended the base-id unit test to cover a deletion file from another base.

Deletion files carry a `base_id` when they originate from a shallow clone (set by `Manifest::shallow_clone`), but `tracked_files` always reported them under the dataset's own `base_uri`. Resolve them against `manifest.base_paths` the same way data files are, via a shared `resolve_base_uri` helper.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
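For readers following the base_id discussion, the resolution rule can be sketched roughly as below. The signature, the `HashMap<u32, String>` representation of `base_paths`, and the choice to return `None` for an unknown id are all assumptions for illustration; the actual helper in the PR may differ.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the `resolve_base_uri` helper named in the commit message.
/// `base_paths` maps a base id to the root URI of the source dataset.
fn resolve_base_uri<'a>(
    base_id: Option<u32>,
    base_paths: &'a HashMap<u32, String>,
    dataset_uri: &'a str,
) -> Option<&'a str> {
    match base_id {
        // No base id: the file lives under this dataset's own root.
        None => Some(dataset_uri),
        // A base id must resolve to a recorded base path; an unknown id yields None
        // (an assumption of this sketch, the real code may report an error instead).
        Some(id) => base_paths.get(&id).map(String::as_str),
    }
}
```

The point of sharing one helper is that data files and deletion files go through the same lookup, so a deletion file from a shallow clone is reported under the source dataset's root rather than the clone's.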

westonpace

This seems fine to me. I'll add the review from Claude for completeness but none of these suggestions seem critical:

Here is my review of PR #6011:

PR #6011: `feat: expose tracked_files and all_files on LanceDataset`

Overview: Adds two new streaming APIs — tracked_files() and all_files() — for inspecting the physical file layout of a Lance dataset. The Rust implementation lives in a new dataset/files.rs module and uses a multi-task pipeline architecture. Python bindings are thin wrappers delegating to Rust.

Correctness

Pipeline termination / stream completeness:
The tracked_files_with_options pipeline spawns 4 tasks (Lister, Reader, Emitter, IndexLister) that write into a shared tx channel. The stream ends when rx is dropped — but that only happens after RecordBatchStreamAdapter is dropped. The problem: there's no coordination to ensure all 4 senders have finished before the stream ends. If a consumer calls read_all() and the IndexLister or Emitter is still in flight when RecordBatchStreamAdapter returns, results could be silently dropped. Specifically:

tx is cloned 4 times; it's dropped in tracked_files_with_options after all 4 tokio::spawn calls, but tx_idx (the last clone) is only dropped when IndexLister finishes. If IndexLister is slow, the stream will correctly wait. This seems fine on closer inspection, but the flow is subtle enough to warrant a comment.
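The termination rule being relied on here can be shown with a small sketch. It uses `std::sync::mpsc` and threads instead of the async channel and tasks in the PR (a deliberate substitution to keep the example dependency-free), and the function name is made up. The receiver only sees end-of-stream once every sender clone, including the original, has been dropped.

```rust
use std::sync::mpsc;
use std::thread;

/// Spawns `producers` threads that each send one value, then drains the channel.
fn collect_from_producers(producers: usize) -> Vec<usize> {
    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();
    for id in 0..producers {
        let tx_task = tx.clone();
        handles.push(thread::spawn(move || {
            tx_task.send(id).expect("receiver is still alive");
            // tx_task is dropped here, when the producer finishes.
        }));
    }
    // Drop the original sender. Without this, the drain below would never end.
    drop(tx);
    // The iterator ends only after every sender clone has been dropped,
    // so a slow producer delays completion instead of losing rows.
    let mut received: Vec<usize> = rx.iter().collect();
    for handle in handles {
        handle.join().expect("producer panicked");
    }
    received.sort_unstable();
    received
}
```

Calling `collect_from_producers(4)` returns all four values regardless of which producer finishes last, which is the property the comment above concludes the pipeline has.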

Index deduplication across manifest versions:
The uuid_cache in IndexLister deduplicates UUID directory listings across versions, which is correct and efficient. However, if multiple manifest versions reference the same index UUID, index_file_batch is called once per version with the same files. This means index files will appear N times in the output (once per version referencing them) — which appears to be intentional (as with data files), but it's not documented in the tracked_files API docs.
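A minimal sketch of the behavior described, under the assumption that the cache is a map from index UUID to the listed file paths: the directory listing runs once per UUID, while rows are still emitted once per version. The function name, the `list_dir` callback, and the tuple row type are hypothetical.

```rust
use std::collections::HashMap;

/// Emits one (version, path) row per index file per version, listing each UUID only once.
/// `list_dir` stands in for the object-store listing of an index directory.
fn index_rows<F>(versions: &[(u64, Vec<&str>)], mut list_dir: F) -> Vec<(u64, String)>
where
    F: FnMut(&str) -> Vec<String>,
{
    let mut uuid_cache: HashMap<String, Vec<String>> = HashMap::new();
    let mut rows = Vec::new();
    for (version, uuids) in versions {
        for &uuid in uuids {
            // Only the first version that references a UUID pays for the listing.
            let files = uuid_cache
                .entry(uuid.to_string())
                .or_insert_with(|| list_dir(uuid));
            // Every version still gets its own copy of the rows.
            for file in files.iter() {
                rows.push((*version, file.clone()));
            }
        }
    }
    rows
}
```

With two versions referencing the same UUID, the listing callback runs once and the output contains two rows per index file, matching the per-version semantics of data files.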

size as i64 cast in build_all_files_batch:
meta.size is usize, so on a 64-bit platform it's fine in practice, but a value greater than i64::MAX would silently wrap to a negative number under an `as` cast. Using i64::try_from(meta.size).unwrap_or(i64::MAX) would be safer.
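To make the difference concrete, here is a small sketch contrasting the two conversions. It takes a `u64` rather than `usize` so the behavior is the same on every platform (an assumption of the example); the function names are illustrative.

```rust
/// Plain `as` cast: values above i64::MAX reinterpret as negative numbers.
fn size_as_cast(size: u64) -> i64 {
    size as i64
}

/// Checked conversion that saturates at i64::MAX instead of going negative.
fn size_saturating(size: u64) -> i64 {
    i64::try_from(size).unwrap_or(i64::MAX)
}
```

Both functions agree for any realistic file size; they only diverge above `i64::MAX`, where the `as` cast yields a negative `file_size` and the checked version clamps.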

Memory budget race:
The Reader task checks inflight_mem < MANIFEST_MEMORY_BUDGET before launching, but adds to inflight_mem after the check. Under concurrent reads, the actual in-flight memory could exceed the budget by up to parallelism × max_manifest_size. This is probably acceptable given the budget is 1 GB and MANIFEST_DECOMPRESSION_RATIO is an estimate, but worth a comment.
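If a strict bound were ever wanted, one option is to fold the check and the add into a single atomic step. The sketch below assumes the counter is an `AtomicUsize`; the function names are hypothetical and this is not how the PR implements it.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Tries to reserve `bytes` against `budget` in one atomic step.
/// Returns false, leaving the counter unchanged, if the reservation would exceed the budget.
fn try_reserve(inflight: &AtomicUsize, bytes: usize, budget: usize) -> bool {
    inflight
        .fetch_update(Ordering::AcqRel, Ordering::Acquire, |current| {
            let next = current.checked_add(bytes)?;
            (next <= budget).then_some(next)
        })
        .is_ok()
}

/// Returns a finished read's bytes to the budget.
fn release(inflight: &AtomicUsize, bytes: usize) {
    inflight.fetch_sub(bytes, Ordering::AcqRel);
}
```

One trade-off to keep in mind: a strict reservation like this can never admit a single manifest whose estimated size exceeds the whole budget, so the check-then-add approach, which lets one oversized read through, may be the more practical choice.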

Design / API

all_files missing base_paths coverage:
The docstring correctly notes that base_paths entries are not scanned, but this is a sharp edge for shallow-clone users. Consider at minimum adding a See Also reference to the tracked_files method and noting that cross-referencing the two outputs is how you detect untracked vs tracked files.

tracked_files return type:
tracked_files and tracked_files_with_options return SendableRecordBatchStream (not Result<...>), hiding async errors until the consumer iterates. This is consistent with DataFusion conventions, but the function signature difference (async fn returning a stream vs. a Result) may surprise callers. Documenting this explicitly would help.

FileType discriminant coupling:
The comment in file_types.rs correctly notes that discriminants must stay in sync with FILE_TYPE_DICT_ARRAY, but the only enforcement is a code comment. A #[cfg(test)] assertion checking that FileType::Manifest as i8 == 0, DataFile == 1, etc., would prevent silent breakage if someone reorders variants.
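A sketch of what such a check might look like. Only `Manifest = 0` and `DataFile = 1` come from the comment above; the other variants, the label strings, and the `FILE_TYPE_LABELS` array (standing in for FILE_TYPE_DICT_ARRAY) are hypothetical.

```rust
/// Illustrative enum. Variants beyond Manifest and DataFile are made up for the example.
#[repr(i8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FileType {
    Manifest = 0,
    DataFile = 1,
    DeletionFile = 2,
    IndexFile = 3,
}

/// Stand-in for FILE_TYPE_DICT_ARRAY: dictionary values indexed by discriminant.
const FILE_TYPE_LABELS: [&str; 4] = ["manifest", "data file", "deletion file", "index file"];

impl FileType {
    const ALL: [FileType; 4] = [
        FileType::Manifest,
        FileType::DataFile,
        FileType::DeletionFile,
        FileType::IndexFile,
    ];

    /// The label each variant is expected to map to in the dictionary.
    fn label(self) -> &'static str {
        match self {
            FileType::Manifest => "manifest",
            FileType::DataFile => "data file",
            FileType::DeletionFile => "deletion file",
            FileType::IndexFile => "index file",
        }
    }
}

/// Returns true when every variant's discriminant indexes its own label.
fn discriminants_match_dictionary() -> bool {
    FileType::ALL
        .iter()
        .all(|ft| FILE_TYPE_LABELS.get(*ft as i8 as usize) == Some(&ft.label()))
}

#[test]
fn file_type_discriminants_match_dictionary() {
    assert!(discriminants_match_dictionary());
}
```

If someone reorders the variants or the label array without updating the other, the test fails instead of producing mislabeled rows.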

Performance

manifest_file_batches capacity estimate:
The batch count estimate iter.len().div_ceil(BATCH_SIZE) is computed before the first batch is flushed, then re-computed inside the loop as next_size. The exact_size(size) wrapping uses the outer estimate. Since ExactSizeIterator::len() is used for capacity, any mismatch causes over/under-allocation. This looks correct but is subtle — the outer size is the total batch count, and exact_size(size) just pre-sizes the iterator's length hint, not the builder capacity. Fine as-is, but adds cognitive overhead.

collect_column_values in tests:
Test helper calls dict_value_at which does two downcast_ref attempts per cell. Acceptable in tests but worth noting if this pattern migrates to production code.

Test Coverage

Tests are thorough and follow project conventions. Specific strengths:

test_tracked_files_paths_match_disk cross-validates tracked_files against all_files — excellent cross-API integration test.
test_manifest_file_rows_per_file_base_id is a focused unit test for the shallow-clone base_id resolution logic.
Progress callback semantics are verified.

Missing coverage:

No test for all_files when base_paths contains externally-located files (to document the known limitation).
No test for tracked_files on an empty dataset (zero manifests).
test_tracked_files_progress asserts the last update has manifests_total == Some(3), but the first two updates could have manifests_total == None due to the race between Lister and Emitter — the test doesn't assert this, which is fine but leaves the "total is None until listing finishes" contract untested.

Minor Issues

python/python/lance/dataset.py: the all_files docstring, the schema column, and the original PR description all use last_modified. No issue, just confirming consistency.
rust/lance-table/src/utils.rs: Adding impl<T: Iterator> ExactSizeIterator for ExactSize<T> is a blanket impl that could conflict with future stdlib changes. It's a small, self-contained addition but worth flagging as a potential semver hazard.
The MANIFEST_MEMORY_BUDGET const of 1 GB is undocumented as to whether this is per-dataset-instance or global. Since inflight_mem is created fresh per call to tracked_files_with_options, it's per-call — but concurrent calls could exceed available memory.

Summary

This is a well-structured, genuinely useful feature with solid test coverage. The pipeline architecture is appropriate for streaming large datasets without OOM. The main things worth addressing before merge:

(P1) Document that index files appear once per manifest version that references them (or deduplicate if unintended).
(P1) Add a #[cfg(test)] sanity check that FileType discriminants match FILE_TYPE_DICT_ARRAY ordering.
(P2) usize as i64 cast for file sizes — use try_from or document the assumption.
(P2) Test for empty dataset (zero manifests) edge case.

github-actions Bot added the enhancement (New feature or request) and A-python (Python bindings) labels Feb 25, 2026

wjones127 changed the title ~~feat(python): expose tracked_files and all_files on LanceDataset~~ feat: expose tracked_files and all_files on LanceDataset Feb 25, 2026

wjones127 marked this pull request as ready for review February 25, 2026 21:53

wjones127 mentioned this pull request Mar 7, 2026

Automatic cleanup of indexes on failed builds #6123

Open

wjones127 force-pushed the feat/dataset-file-inspection-apis branch 3 times, most recently from 8d63dd2 to 02e4bfc May 12, 2026 00:46

wjones127 force-pushed the feat/dataset-file-inspection-apis branch from 02e4bfc to e783188 May 12, 2026 16:00

BubbleCal reviewed May 27, 2026

westonpace approved these changes Jun 5, 2026

wjones127 merged commit df94ee6 into lance-format:main Jun 5, 2026
29 checks passed

This was referenced Jun 5, 2026

Make sure tracked_files() supports shallow clone / base paths #7133

Open

Make sure tracked_files() supports blobs #7134

Open

wjones127 merged 2 commits into lance-format:main from wjones127:feat/dataset-file-inspection-apis
