feat: expose tracked_files and all_files on LanceDataset#6011
Conversation
PR ReviewP1 Issues1. Accidental test assertion removal The diff removes an existing assertion from - assert "id" in result.column_namesThis appears unintentional - the assertion belongs to the previous test and should be preserved. 2. Documentation mismatch in The doc comment at But the actual schema uses DataType::Dictionary(Box::new(DataType::Int8), Box::new(DataType::Utf8))The doc should say Otherwise, the implementation looks solid with good test coverage, efficient batching, and proper error propagation through the channel pattern. |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
There will be some follow ups to make this useful:
|
|
Should blob files also need to be tracked in those APIs? |
8d63dd2 to
02e4bfc
Compare
Adds new `tracked_files()` and `all_files()` methods that return data about files in a table. Both return as Arrow data. `tracked_files()` outputs a row for every file referenced by each version. Files that are referenced by multiple versions (such as a data file) have a row for each version. This has columns for `base_uri`, `version`, `path`, and `file_type`. Supports a `min_version` filter and progress callback via `TrackedFilesOptions`. Internally split into a Lister (enumerates locations) and a Reader (reads manifests with memory-aware backpressure, ~1 GB budget) to avoid deadlocks. `all_files()` outputs a row for every file in the dataset root directory, whether or not they are part of the table. This has columns for `base_uri`, `path`, `file_size`, `last_modified`. These two data streams can be used in combination to do deeper analysis on file structure of a table. Adds PyO3 bindings and Python wrappers with docstrings; both return a `pa.RecordBatchReader`. Co-Authored-By: Claude <noreply@anthropic.com>
02e4bfc to
e783188
Compare
Yes, though I'd like to do that in a follow up. |
| FileRow { | ||
| version: manifest.version, | ||
| base_uri: Cow::Borrowed(effective_base_uri), | ||
| path: Cow::Owned(format!("{}/{}", DATA_DIR, data_file.path)), |
There was a problem hiding this comment.
do we need to check whether the base_uri points to a dataset root here?
There was a problem hiding this comment.
The base URIs in manifest.base_paths are always Lance dataset roots — they're only populated by Manifest::shallow_clone, which records the source dataset's root as a BasePath. So appending {DATA_DIR}/{data_file.path} yields a valid path without an extra check, and a None base_id falls back to this dataset's own root, which is likewise a valid root by definition.
| let deletion_files = manifest.fragments.iter().filter_map(|fragment| { | ||
| fragment.deletion_file.as_ref().map(|del_file| FileRow { | ||
| version: manifest.version, | ||
| base_uri: Cow::Borrowed(base_uri), |
There was a problem hiding this comment.
is it possible that the deletion files are from another base?
There was a problem hiding this comment.
Good catch — yes. DeletionFile carries its own base_id and shallow_clone sets it alongside the data files. Fixed in ec45637: both branches now share a resolve_base_uri helper, and I extended the base-id unit test to cover a deletion file from another base.
Deletion files carry a `base_id` when they originate from a shallow clone (set by `Manifest::shallow_clone`), but `tracked_files` always reported them under the dataset's own `base_uri`. Resolve them against `manifest.base_paths` the same way data files are, via a shared `resolve_base_uri` helper. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
westonpace
left a comment
There was a problem hiding this comment.
This seems fine to me. I'll add the review from Claude for completeness but none of these suggestions seem critical:
Here is my review of PR #6011:
PR #6011: feat: expose tracked_files and all_files on LanceDataset
Overview: Adds two new streaming APIs — tracked_files() and all_files() — for inspecting the physical file layout of a Lance dataset. The Rust implementation lives in a new dataset/files.rs module and uses a multi-task pipeline architecture. Python bindings are thin wrappers delegating to Rust.
Correctness
Pipeline termination / stream completeness:
The tracked_files_with_options pipeline spawns 4 tasks (Lister, Reader, Emitter, IndexLister) that write into a shared tx channel. The stream ends when rx is dropped — but that only happens after RecordBatchStreamAdapter is dropped. The problem: there's no coordination to ensure all 4 senders have finished before the stream ends. If a consumer calls read_all() and the IndexLister or Emitter is still in flight when RecordBatchStreamAdapter returns, results could be silently dropped. Specifically:
txis cloned 4 times; it's dropped intracked_files_with_optionsafter all 4tokio::spawncalls, buttx_idx(the last clone) is only dropped when IndexLister finishes. If IndexLister is slow, the stream will correctly wait. This seems fine on closer inspection, but the flow is subtle enough to warrant a comment.
Index deduplication across manifest versions:
The uuid_cache in IndexLister deduplicates UUID directory listings across versions, which is correct and efficient. However, if multiple manifest versions reference the same index UUID, index_file_batch is called once per version with the same files. This means index files will appear N times in the output (once per version referencing them) — which appears to be intentional (as with data files), but it's not documented in the tracked_files API docs.
size as i64 cast in build_all_files_batch:
meta.size is usize, so on a 64-bit platform it's fine, but a usize value > i64::MAX would silently truncate. Using i64::try_from(meta.size).unwrap_or(i64::MAX) would be safer.
Memory budget race:
The Reader task checks inflight_mem < MANIFEST_MEMORY_BUDGET before launching, but adds to inflight_mem after the check. Under concurrent reads, the actual in-flight memory could exceed the budget by up to parallelism × max_manifest_size. This is probably acceptable given the budget is 1 GB and MANIFEST_DECOMPRESSION_RATIO is an estimate, but worth a comment.
Design / API
all_files missing base_paths coverage:
The docstring correctly notes that base_paths entries are not scanned, but this is a sharp edge for shallow-clone users. Consider at minimum adding a See Also reference to the tracked_files method and noting that cross-referencing the two outputs is how you detect untracked vs tracked files.
tracked_files return type:
tracked_files and tracked_files_with_options return SendableRecordBatchStream (not Result<...>), hiding async errors until the consumer iterates. This is consistent with DataFusion conventions, but the function signature difference (async fn returning a stream vs. a Result) may surprise callers. Documenting this explicitly would help.
FileType discriminant coupling:
The comment in file_types.rs correctly notes that discriminants must stay in sync with FILE_TYPE_DICT_ARRAY, but the only enforcement is a code comment. A #[cfg(test)] assertion checking that FileType::Manifest as i8 == 0, DataFile == 1, etc., would prevent silent breakage if someone reorders variants.
Performance
manifest_file_batches capacity estimate:
The batch count estimate iter.len().div_ceil(BATCH_SIZE) is computed before the first batch is flushed, then re-computed inside the loop as next_size. The exact_size(size) wrapping uses the outer estimate. Since ExactSizeIterator::len() is used for capacity, any mismatch causes over/under-allocation. This looks correct but is subtle — the outer size is the total batch count, and exact_size(size) just pre-sizes the iterator's length hint, not the builder capacity. Fine as-is, but adds cognitive overhead.
collect_column_values in tests:
Test helper calls dict_value_at which does two downcast_ref attempts per cell. Acceptable in tests but worth noting if this pattern migrates to production code.
Test Coverage
Tests are thorough and follow project conventions. Specific strengths:
test_tracked_files_paths_match_diskcross-validatestracked_filesagainstall_files— excellent cross-API integration test.test_manifest_file_rows_per_file_base_idis a focused unit test for the shallow-clonebase_idresolution logic.- Progress callback semantics are verified.
Missing coverage:
- No test for
all_fileswhenbase_pathscontains externally-located files (to document the known limitation). - No test for
tracked_fileson an empty dataset (zero manifests). test_tracked_files_progressasserts the last update hasmanifests_total == Some(3), but the first two updates could havemanifests_total == Nonedue to the race between Lister and Emitter — the test doesn't assert this, which is fine but leaves the "total is None until listing finishes" contract untested.
Minor Issues
python/python/lance/dataset.py:all_filesdocstring sayslast_modifiedbut the schema column islast_modified(consistent), while the original PR description sayslast_modified. No issue, just confirming consistency.rust/lance-table/src/utils.rs: Addingimpl<T: Iterator> ExactSizeIterator for ExactSize<T>is a blanket impl that could conflict with future stdlib changes. It's a small, self-contained addition but worth flagging as a potential semver hazard.- The
MANIFEST_MEMORY_BUDGETconst of 1 GB is undocumented as to whether this is per-dataset-instance or global. Sinceinflight_memis created fresh per call totracked_files_with_options, it's per-call — but concurrent calls could exceed available memory.
Summary
This is a well-structured, genuinely useful feature with solid test coverage. The pipeline architecture is appropriate for streaming large datasets without OOM. The main things worth addressing before merge:
- (P1) Document that index files appear once per manifest version that references them (or deduplicate if unintended).
- (P1) Add a
#[cfg(test)]sanity check thatFileTypediscriminants matchFILE_TYPE_DICT_ARRAYordering. - (P2)
usize as i64cast for file sizes — usetry_fromor document the assumption. - (P2) Test for empty dataset (zero manifests) edge case.
Adds new
tracked_files()andall_files()methods that return data about files in a table. Both return as Arrow data.tracked_files()outputs a row for every file referenced by each version. Files that are referenced by multiple versions (such as a data file) have a row for each version. This has columns forbase_uri,version,path, andfile_type.all_files()outputs a row for every file in the dataset root directory, whether or not they are part of the table. This has columns forbase_uri,path,file_size,last_modified.These two data streams can be used in combination to do deeper analysis on file structure of a table. It can answer questions like: How much of the storage space is taken up by untracked files? When were untracked files created? Which files are taking up the most space? How big is version X?