
feat: add blob v2 schema#4948

Merged
Xuanwo merged 1 commit into main from blob-v2-plan
Nov 6, 2025

Conversation

Collaborator

@Xuanwo Xuanwo commented Oct 14, 2025

Part of #4947

This PR adds the blob v2 schema for Lance so that we are ready to start writing the new desc fields for blob.

The new schema is gated under file format version 2.2.


This PR was primarily authored with Codex using GPT-5-Codex and then hand-reviewed by me. I AM responsible for every change made in this PR. I aimed to keep it aligned with our goals, though I may have missed minor issues. Please flag anything that feels off, I'll fix it quickly.

@Xuanwo Xuanwo requested a review from westonpace October 14, 2025 09:11
@github-actions github-actions bot added the enhancement New feature or request label Oct 14, 2025
@Xuanwo Xuanwo mentioned this pull request Oct 14, 2025
9 tasks
Collaborator Author

Xuanwo commented Oct 14, 2025

Oh, I get it. We don't support packed structs with Utf8 fields yet; we need to address #2862 first.

@Xuanwo Xuanwo force-pushed the blob-v2-plan branch 2 times, most recently from 3d02e43 to c58ff77 on October 27, 2025 05:03

codecov-commenter commented Oct 27, 2025

Codecov Report

❌ Patch coverage is 93.43066% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.73%. Comparing base (3088977) to head (dede437).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-core/src/datatypes/field.rs | 89.85% | 7 Missing ⚠️ |
| rust/lance-core/src/datatypes/schema.rs | 97.61% | 0 Missing and 1 partial ⚠️ |
| rust/lance/src/dataset/write.rs | 85.71% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4948      +/-   ##
==========================================
- Coverage   81.74%   81.73%   -0.02%     
==========================================
  Files         341      341              
  Lines      140915   141049     +134     
  Branches   140915   141049     +134     
==========================================
+ Hits       115198   115289      +91     
- Misses      21900    21945      +45     
+ Partials     3817     3815       -2     
Flag Coverage Δ
unittests 81.73% <93.43%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

Member

@westonpace westonpace left a comment

Hmm, I'm not sure this is the correct approach. I could also be wrong! My thinking was that blob-v2 was going to be more or less a table-level feature?

In other words, we had this table:

| Example Size | Category | Data Type | Table Type |
|---|---|---|---|
| Small | Inline | Normal Binary | Regular Column |
| 512KB | Out-of-line | Blob | Regular Column |
| 10MB | Packed | Packed Struct | BlobFile Column |
| 1GB | Dedicated | Packed Struct | BlobFile Column |

So I wasn't expecting any changes to the file reader (since we now have support for blob and packed struct).

It looks like you are introducing the data file concept here as a new encoding for the file reader? I would think the new data file columns would be handled mostly at the table level (the file reader would just see them as a packed struct column)?

))
}
}
DataBlock::Nullable(nullable) => self.create_per_value(field, nullable.data.as_ref()),
Member

How do we get a nullable block here?

Collaborator Author

In our new packed struct descriptor, blob_id / blob_uri are optional, so we need to handle the Nullable case first. Btw, I'm fine with changing them to non-nullable to make the logic easier.

Comment on lines +1464 to +1469
if matches!(data_type, DataType::Null) {
return Self::AllNull(AllNullDataBlock { num_values });
}
let mut builder = BooleanBufferBuilder::new(num_values as usize);
builder.append_n(num_values as usize, false);
nulls = Nullability::Some(NullBuffer::new(builder.finish()));
Member

Why this change?

Collaborator Author

That’s the same issue we ran into with nullable fields. I suspect there might be a bug, but I haven’t done deep research yet. The problem is that when we use optional fields like blob_id and blob_uri, we end up generating an all-null data block. But the packed struct per-value encoding can’t handle this case correctly. So I created a nullable buffer instead.


I'm starting to think using nullable fields is a bad idea 😭

My intention is to save some bits for cases where all blob_ids are null.
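The workaround being discussed can be sketched with a std-only model. The real code uses arrow-rs's `BooleanBufferBuilder` and `NullBuffer`; the type below is a hypothetical stand-in for the resulting validity bitmap, just to show the shape of "all values present but all marked null":

```rust
/// Minimal stand-in for a validity (null) bitmap: one bit per value,
/// `false` meaning null. Hypothetical type; the real code builds this
/// with arrow-rs's `BooleanBufferBuilder` and wraps it in a `NullBuffer`.
pub struct ValidityBitmap {
    bits: Vec<u8>,
    len: usize,
}

impl ValidityBitmap {
    /// Build a bitmap of `num_values` entries, all marked null (false),
    /// mirroring `builder.append_n(num_values, false)` in the diff.
    pub fn all_null(num_values: usize) -> Self {
        ValidityBitmap {
            bits: vec![0u8; (num_values + 7) / 8],
            len: num_values,
        }
    }

    /// True when the bit for value `i` is set (value is non-null).
    pub fn is_valid(&self, i: usize) -> bool {
        (self.bits[i / 8] >> (i % 8)) & 1 == 1
    }

    /// Count of null entries.
    pub fn null_count(&self) -> usize {
        (0..self.len).filter(|&i| !self.is_valid(i)).count()
    }
}
```

The point of the change: a `Nullable` block carrying this all-false bitmap flows through the regular per-value path, whereas an `AllNull` block would need its own branch in the packed-struct encoder.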

Comment on lines +179 to +184
if let Some(ids) = blob_ids.as_mut() {
ids.push(0);
}
if let Some(builder) = uri_builder.as_mut() {
builder.append_null();
}
Member

Probably not a big deal, given blobs are going to be large and expensive, so we don't have to worry that much about per-row costs. However, if this were a regular column, I'd suggest getting the if statements out of the per-value loop here.

Collaborator Author

Good idea, will give this a try.
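The suggested hoist can be sketched as follows (function names are hypothetical, std-only): branch on the `Option` once instead of once per value.

```rust
// Per-value variant, as in the diff above: the Option check runs on
// every iteration of the loop.
fn fill_per_value(blob_ids: &mut Option<Vec<u64>>, rows: usize) {
    for _ in 0..rows {
        if let Some(ids) = blob_ids.as_mut() {
            ids.push(0);
        }
    }
}

// Hoisted variant: branch once, then fill in a tight loop.
fn fill_hoisted(blob_ids: &mut Option<Vec<u64>>, rows: usize) {
    if let Some(ids) = blob_ids.as_mut() {
        ids.extend(std::iter::repeat(0).take(rows));
    }
}
```

Both produce the same output; as the review notes, the win is marginal here since blob rows are large and few, but it is the idiomatic shape for a regular column.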

Comment on lines +203 to +208
if let Some(ids) = blob_ids.as_mut() {
ids.push(0);
}
if let Some(builder) = uri_builder.as_mut() {
builder.append_value("");
}
Member

Wait, why are we inserting 0/"" here? Is this because they are packed / inline blobs?

Collaborator Author

Yep, we’re filling in missing values for inline and out-of-line blobs to make our decoding logic a bit cleaner. Maybe I just need to append_null instead; will fix.
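The two fill strategies under discussion, sketched with hypothetical helpers: a sentinel 0/"" value versus an explicit null carried by the validity bitmap.

```rust
/// Sentinel fill, as in the diff above: rows whose blob is inline or
/// out-of-line still get a placeholder blob_id of 0 and an empty blob_uri.
fn fill_sentinel(ids: &mut Vec<u64>, uris: &mut Vec<String>) {
    ids.push(0);
    uris.push(String::new());
}

/// Explicit-null fill, as the reply suggests (append_null): the absence
/// is recorded as a null instead of a magic value, so readers don't have
/// to know that 0 / "" mean "not a dedicated blob file".
fn fill_null(ids: &mut Vec<Option<u64>>, uris: &mut Vec<Option<String>>) {
    ids.push(None);
    uris.push(None);
}
```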

Collaborator Author

Xuanwo commented Oct 30, 2025

Thank you @westonpace for the review!

My thinking was that blob-v2 was going to be more or less a table-level feature?

Yes, blob v2 is a table-level feature; we're just using the struct descriptor to carry some metadata.

So I wasn't expecting any changes to the file reader (since we now have support for blob and packed struct).

The changes to the file reader are mostly for compatibility with 2.1 files, where we need to fill the added columns blob_id and blob_uri so they are handled correctly by the same logic.

It looks like you are introducing the data file concept here as a new encoding for the file reader?

Oh no, I’m not adding a new encoding here, just two new columns in the same struct descriptor. I tend to reuse the same field with different schemas and do evolution at runtime.

I would think the new data file columns would be handled mostly at the table level (the file reader would just see them as a packed struct column)?

Yes, that’s my expectation too. Maybe the change set itself isn't clear about what I'm doing? I can add some comments on the confusing parts, or just use a new field for that.
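The 2.1 compatibility path described here can be sketched as runtime schema evolution. The struct and the `position`/`size` field names below are illustrative only (the actual descriptor layout lives in the PR); the point is that old rows are widened to the v2 shape with nulls so both versions go through one code path.

```rust
/// Illustrative v2 descriptor row. `position`/`size` stand in for the
/// existing desc fields; `blob_id`/`blob_uri` are the columns added in v2.
#[derive(Debug, PartialEq)]
struct BlobDescV2 {
    position: u64,
    size: u64,
    blob_id: Option<u64>,
    blob_uri: Option<String>,
}

/// Evolve rows read from a 2.1 file (which only carry the old fields)
/// into the v2 shape by filling the added columns with nulls, so the
/// rest of the pipeline can treat both versions uniformly.
fn evolve_from_v1(rows: Vec<(u64, u64)>) -> Vec<BlobDescV2> {
    rows.into_iter()
        .map(|(position, size)| BlobDescV2 {
            position,
            size,
            blob_id: None,
            blob_uri: None,
        })
        .collect()
}
```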

Collaborator Author

Xuanwo commented Nov 3, 2025

Had an offline meeting with @westonpace; we will change the blob v2 schema into a table-level thing instead of a file-level thing.

Collaborator Author

Xuanwo commented Nov 4, 2025

cc @westonpace, let me know if the new impl aligns with what you think.

@Xuanwo Xuanwo requested a review from westonpace November 4, 2025 10:45
Signed-off-by: Xuanwo <github@xuanwo.io>
Member

@westonpace westonpace left a comment

Minor naming change. I think (though I may be understanding this wrong) that "blob version 1" is "inline only" and "blob version 2" is "inline, packed, and dedicated".

If this is correct then I think we should call it BLOBFILE_DESC_FIELD (or PACKED_DESC_FIELD). We are not replacing the existing inline approach, we are adding a new packed approach which will utilize blob-files. The inline approach will still use the 2-field description. The packed approach will use the 5-field description.

Comment on lines +72 to +73
pub static BLOB_V2_DESC_LANCE_FIELD: LazyLock<Field> =
LazyLock::new(|| Field::try_from(&*BLOB_V2_DESC_FIELD).unwrap());
Member

I wonder if we should just call this BLOBFILE_DESC_FIELD? This way it is clear we are not replacing BLOB_DESC_FIELD (we still need it for inline case)

Collaborator Author

My current idea is to have a blob v2 concept that includes all supported blob types like 'inline', 'packed', or 'dedicated'. This means blob v2 will cover all the uses of blob v1, which is just inline. I think this makes compatibility easier without changing any logic on the inline side.

We’ll only reuse the file encoding from blob v1 (inline), but all the table schema will be in blob v2. With this change, we can do the version check early at the table level instead of at the time of writing data.

Btw, I don't feel strongly about this. If you prefer to just keep the packed/dedicated blob types as an extension to inline, I'm fine with making that change.

Collaborator Author

I will move this discussion to #5163

Member

@westonpace westonpace left a comment

Changing review to approved since I think I only disagree on naming at this point.

@Xuanwo Xuanwo merged commit 1f38a7f into main Nov 6, 2025
30 of 31 checks passed
@Xuanwo Xuanwo deleted the blob-v2-plan branch November 6, 2025 10:36
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Nov 6, 2025
Part of lance-format#4947

This PR will add the blob v2 schema for Lance so that we are ready to start
writing new desc fields for blob.

The new schema is gated under file format version `2.2`.

---

**This PR was primarily authored with Codex using GPT-5-Codex and then
hand-reviewed by me. I AM responsible for every change made in this PR.
I aimed to keep it aligned with our goals, though I may have missed
minor issues. Please flag anything that feels off, I'll fix it
quickly.**

Signed-off-by: Xuanwo <github@xuanwo.io>
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026