fix: refresh last_updated metadata on Operation::Merge for rewritten fragments#6640
Conversation
|
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
6d332c5 to
1ca31f2
Compare
0f6d501 to
613a87e
Compare
|
@claude review |
| /// Returns true if Operation::Merge rewrote this fragment's column data files (Fragment::files | ||
| /// changed versus the previous manifest). Used to bump last_updated_at_version_meta only when | ||
| /// new column values were materialized to disk. | ||
| /// | ||
| /// Deletion file changes alone are not treated as rewrites: tombstones remove rows but | ||
| /// survivors did not receive new column bytes; stamping last_updated for those rows would be | ||
| /// incorrect for CDF. | ||
| #[inline] | ||
| fn merge_fragment_physically_rewritten(prev: &Fragment, merged: &Fragment) -> bool { | ||
| debug_assert_eq!(prev.id, merged.id); | ||
| prev.files != merged.files | ||
| } |
There was a problem hiding this comment.
🟡 merge_fragment_physically_rewritten compares Fragment.files via Vec::PartialEq, which transitively compares CachedFileSize — an interior-mutable AtomicU64 that scheduler.rs:632 writes in place on the live manifest's atomics. A caller that constructs merged fragments by cloning from get_fragments() (which snapshots CachedFileSize) and a concurrent scan that flips the original atomic between clone-time and build_manifest can drive prev.files != merged.files to true on identical files, falsely triggering refresh_row_latest_update_meta_for_full_frag_rewrite_cols and overwriting carry-forward last_updated history. Fix: compare only the immutable identity fields (path, fields, column_indices, file_major_version, file_minor_version, base_id), or normalize CachedFileSize before comparing.
Extended reasoning...
Bug
merge_fragment_physically_rewritten in transaction.rs:3520-3531 decides whether the carry-forward path needs to refresh per-row last_updated metadata by comparing prev.files != merged.files. Vec::eq uses the derived PartialEq on DataFile (rust/lance-table/src/format/fragment.rs:27), which compares all fields including file_size_bytes: CachedFileSize.
CachedFileSize wraps an AtomicU64 (rust/lance-io/src/utils.rs:177). Critically:
- CachedFileSize::eq loads the AtomicU64 and compares the value (utils.rs:226-231).
- CachedFileSize::set mutates the AtomicU64 in place via Relaxed store (utils.rs:252-255), through a shared reference.
- Clone for CachedFileSize copies a snapshot of the current value into a new AtomicU64 (utils.rs:218-224); subsequent set() on the original is not reflected in the clone.
scheduler.rs:632 calls file_size_bytes.set(size) on the caller-supplied CachedFileSize whenever a file's size becomes known. Callers in dataset/optimize.rs:516, dataset/optimize/binary_copy.rs:263, and dataset/fragment.rs:987 pass &data_file.file_size_bytes from the live dataset.manifest.fragments — i.e. they mutate the live manifest's atomic.
In build_manifest (rust/lance/src/io/commit.rs:800), current_manifest is dataset.manifest.as_ref(), so prev.files[].file_size_bytes is the live atomic. The merged fragments come from caller-side construction, which for several flows starts from a Clone of get_fragments() (a frozen snapshot).
Trigger sequence
- T1: caller clones the existing fragments via get_fragments() → CachedFileSize snapshot, value = 0 (unknown).
- T2: a concurrent scan or compaction running on the same Dataset opens those files via open_file_with_priority(..., &df.file_size_bytes) → the live atomic is set to N. The snapshot is unaffected.
- T3: caller commits an Operation::Merge that carries the fragment forward unchanged. build_manifest evaluates prev.files != merged.files: prev side reads N from the live atomic, merged side reads 0 from the snapshot → equality returns false despite identical path/fields/versions/base_id.
- T4: merge_fragment_physically_rewritten returns true → refresh_row_latest_update_meta_for_full_frag_rewrite_cols stamps every row with new_version, destroying the preserved per-row last_updated_at_version_meta history that the caller intended to carry forward.
This silently breaks CDF semantics for the carry-forward case: rows are reported as 'updated at v_new' even though no column data was actually rewritten.
Why current in-tree callers do not exercise it
The two confirmed callers — add_columns/alter_columns and drop_columns_impl — change Fragment::files length (append a new file or drop one), so Vec::eq short-circuits on length and CachedFileSize never participates. The AllNulls path does not access files, so both atomics stay at zero. The PR's own carry-forward test (merge_build_manifest_skips_refresh_when_carry_forward_stable_row_ids) constructs both DataFiles with the same Option<NonZero>=None and runs no concurrent scan, so it does not exercise the race. The vulnerability is latent: any future external caller (Python/JNI Operation::Merge construction) that submits a true carry-forward fragment whose live manifest atomic gets populated mid-flight will hit it.
Suggested fix
Compare only the immutable identity fields, e.g.:
fn data_file_identity_eq(a: &DataFile, b: &DataFile) -> bool {
a.path == b.path
&& a.fields == b.fields
&& a.column_indices == b.column_indices
&& a.file_major_version == b.file_major_version
&& a.file_minor_version == b.file_minor_version
&& a.base_id == b.base_id
}Then iterate prev.files / merged.files pairwise. Alternatively, normalize file_size_bytes on both sides before the Vec equality check. The current debug_assert_eq!(prev.id, merged.id) plus a length check also belong in the helper.
| None => { | ||
| // Fragment id not present in the previous manifest (Merge may append | ||
| // ids beyond the original list). Apply the same full-fragment | ||
| // last_updated refresh as for a rewrite. | ||
| lance_table::rowids::version::refresh_row_latest_update_meta_for_full_frag_rewrite_cols( | ||
| fragment, | ||
| new_version, | ||
| )?; | ||
| } |
There was a problem hiding this comment.
🟡 In the new Operation::Merge None branch (lines 2031-2039), brand-new fragment IDs only have last_updated_at_version_meta initialized; created_at_version_meta is left at whatever the caller passed (typically None). This is inconsistent with Operation::Append (1810-1814) and Operation::Overwrite (1949-1953), which set both fields via build_version_meta for genuinely new fragments. No in-tree caller currently appends new fragment IDs through Merge, but the Python transaction API exposes Operation::Merge directly (python/src/transaction.rs:313), so external callers reaching this branch will see _row_created_at_version come back NULL while _row_last_updated_at_version is populated. Suggest mirroring the Append pattern in the None arm: build the version meta once and assign to both fields.
Extended reasoning...
What the bug is
The PR adds two arms inside the new Operation::Merge block in build_manifest:
Some(prev)branch (existing fragment id) — refreshlast_updatedonly ifFragment::fileschanged. This is correct: the rows already existed in a prior version, socreated_atmust be preserved as-is.Nonebranch (fragment id not in previous manifest, i.e. brand-new) — unconditionally callsrefresh_row_latest_update_meta_for_full_frag_rewrite_cols.
refresh_row_latest_update_meta_for_full_frag_rewrite_cols (rust/lance-table/src/rowids/version.rs:493-520) only writes fragment.last_updated_at_version_meta = Some(version_meta); at line 516. It never touches created_at_version_meta. So a brand-new fragment introduced via Merge will end up with last_updated stamped to new_version and created_at left as whatever the caller supplied — typically None.
Why this is inconsistent with the rest of the file
The established pattern for newly-introduced fragments under stable row IDs is to populate both fields:
// Operation::Append (lines 1810-1814)
let version_meta = build_version_meta(fragment, new_version);
fragment.last_updated_at_version_meta = version_meta.clone();
fragment.created_at_version_meta = version_meta;Operation::Overwrite does the same at 1949-1953, and Update's resolve_update_version_metadata does the same for genuinely new rows. The None branch is the only place a brand-new fragment gets last_updated set without created_at.
Step-by-step proof
- External caller (e.g. via the Python API at
python/src/transaction.rs:313) constructs anOperation::Mergewith a fragment whoseidis greater than any id in the previous manifest, withrow_id_meta = Some(...),physical_rows = Some(N), and both*_version_metafields left asNone(the natural default for a freshly-builtFragment). merge_fragments_valid(transaction.rs:3536) accepts this —test_merge_fragments_validTest 4 explicitly asserts that adding new fragment IDs is valid.build_manifestenters the Merge arm,prev_by_id.get(&fragment.id)returnsNone, the None arm runs, andrefresh_row_latest_update_meta_for_full_frag_rewrite_colspopulates onlylast_updated_at_version_metatonew_version.created_at_version_metastaysNone.- A subsequent CDF query against the new rows returns
new_versionfor_row_last_updated_at_versionbutNULLfor_row_created_at_version— for rows that demonstrably did not exist beforenew_version.
Why I disagree with the refutation
The refutation argues that Merge is caller-controlled and the asymmetry mirrors the Some(prev) arm (which also doesn't touch created_at). But the Some(prev) arm is preserving created_at from a row that genuinely was created in an earlier version — that is correct. The None arm by contrast is overwriting the caller's last_updated (treating it the same way Append does) while leaving created_at un-overwritten. If the contract is "caller fully owns version meta," the None branch should not be touching last_updated either; if the contract is "merge stamps brand-new fragments like Append does," it should also stamp created_at. The current half-and-half behavior is the inconsistency.
The refutation also notes (correctly) that no in-tree caller exercises this path today (schema_evolution.rs and others all derive fragments from get_fragments()), which is why I'm flagging this as a nit rather than blocking. But the None branch was deliberately added by this PR with a comment explicitly calling out that "Merge may append ids beyond the original list," and the Python API is a public surface that can hit it.
How to fix
Mirror the Append/Overwrite pattern in the None arm:
None => {
let version_meta = build_version_meta(fragment, new_version);
fragment.last_updated_at_version_meta = version_meta.clone();
fragment.created_at_version_meta = version_meta;
}Either that, or drop the None arm entirely if the case is genuinely impossible — but that contradicts merge_fragments_valid and the comment justifying the branch.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
613a87e to
c1f662f
Compare
Summary
build_manifest, when stable row IDs are enabled (next_row_id.is_some()), theOperation::Mergearm now bumpslast_updated_at_version_metafor fragments whoseon-disk data changed versus the previous manifest, using
refresh_row_latest_update_meta_for_full_frag_rewrite_cols(fragment, new_version).Fragment::filesdiffers from the priormanifest for the same fragment id (column data rewritten); skip when files match
(metadata-only / carry-forward). Deletion-file-only changes do not trigger refresh.
last_updatedis not left unset.build_manifestis called fresh on every commit retrywith the latest committed manifest, so
new_versionis always consistent with the actualcommitted version regardless of intervening transactions.
Root cause
Operation::Mergepreviously appended merged fragments without updating per-rowlast_updatedmetadata for physical rewrites. Dataset version advanced, but_row_last_updated_at_versioncould stay stale for rewritten rows, breaking CDF semanticsfor merges that commit through
Operation::Merge(e.g. ADD COLUMNS-style materialization andother callers sharing this arm).
Test plan
cargo test -p lance merge_build_manifest— covers files change -> last_updated -> newversion, same files -> unchanged, and stable row IDs disabled -> no last_updated metadata.
lanceCI / regression suites unchanged aside from these tests.