Conversation
Oh, I get it. We didn't support packed struct with
Codecov Report

```
@@            Coverage Diff             @@
##             main    #4948      +/-   ##
==========================================
- Coverage   81.74%   81.73%    -0.02%
==========================================
  Files         341      341
  Lines      140915   141049      +134
==========================================
+ Hits       115198   115289       +91
- Misses      21900    21945       +45
+ Partials     3817     3815        -2
```
westonpace
left a comment
Hmm, I'm not sure this is the correct approach. I could also be wrong! My thinking was that blob-v2 was going to be more or less a table-level feature?
In other words, we had this table:
| Example Size | Category | Data Type | Table Type |
|---|---|---|---|
| Small | Inline | Normal Binary | Regular Column |
| 512KB | Out-of-line | Blob | Regular Column |
| 10MB | Packed | Packed Struct | BlobFile Column |
| 1GB | Dedicated | Packed Struct | BlobFile Column |
So I wasn't expecting any changes to the file reader (since we now have support for blob and packed struct).
It looks like you are introducing the data file concept here as a new encoding for the file reader? I would think the new data file columns would be handled mostly at the table level (the file reader would just see them as a packed struct column)?
```rust
            ))
        }
    }
    DataBlock::Nullable(nullable) => self.create_per_value(field, nullable.data.as_ref()),
```
How do we get a nullable block here?
In our new packed struct descriptor, blob_id / blob_uri are optional, so we need to handle the Nullable case first. Btw, I'm fine with changing them to non-nullable to make the logic easier.
rust/lance-encoding/src/data.rs
Outdated
```rust
if matches!(data_type, DataType::Null) {
    return Self::AllNull(AllNullDataBlock { num_values });
}
let mut builder = BooleanBufferBuilder::new(num_values as usize);
builder.append_n(num_values as usize, false);
nulls = Nullability::Some(NullBuffer::new(builder.finish()));
```
That's the same issue we ran into with nullable fields. I suspect there might be a bug, but I haven't dug into it deeply yet. The problem is that when we use optional fields like blob_id and blob_uri, we end up generating an all-null data block, but the packed-struct per-value encoding can't handle this case correctly. So I created a nullable buffer instead.
I'm starting to think using nullable fields was a bad idea 😭
My intention was to save some bits for the case where all blob_ids are null.
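To make the workaround concrete, here is a std-only sketch of the idea above. The `ValidityBitmap` type is illustrative and stands in for arrow's `NullBuffer` / `BooleanBufferBuilder`; it is not the actual lance-encoding code:

```rust
// Hypothetical sketch: when every value of an optional field (blob_id /
// blob_uri) is null, emit an explicit all-false validity bitmap instead of a
// dedicated all-null block, so the packed-struct per-value encoder still sees
// a regular nullable block it can handle.

/// Minimal stand-in for a validity (null) bitmap: one flag per value,
/// `false` meaning "this value is null".
struct ValidityBitmap {
    bits: Vec<bool>,
}

impl ValidityBitmap {
    /// Build a bitmap marking all `num_values` entries as null.
    fn all_null(num_values: usize) -> Self {
        Self {
            bits: vec![false; num_values],
        }
    }

    fn null_count(&self) -> usize {
        self.bits.iter().filter(|v| !**v).count()
    }
}

fn main() {
    // All blob_ids are null, but we still hand the encoder a nullable block
    // (values + validity) rather than an AllNull representation.
    let validity = ValidityBitmap::all_null(4);
    assert_eq!(validity.null_count(), 4);
    println!("nulls: {}", validity.null_count());
}
```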
```rust
if let Some(ids) = blob_ids.as_mut() {
    ids.push(0);
}
if let Some(builder) = uri_builder.as_mut() {
    builder.append_null();
}
```
Probably not a big deal, given blobs are going to be large and expensive, so we don't have to worry that much about per-row costs. However, if this were a regular column, I'd suggest getting the if statements out of the per-value loop here.
Good idea, will give this a try.
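For reference, the hoisting suggested above might look like this. The function and variable names are illustrative, not the actual lance code:

```rust
// Hypothetical sketch: instead of checking `if let Some(...)` once per row,
// match on the optional builder once and run a tight loop inside each branch.

fn fill_placeholders_hoisted(blob_ids: &mut Option<Vec<u64>>, num_rows: usize) {
    // Branch once, outside the loop, rather than once per value.
    if let Some(ids) = blob_ids.as_mut() {
        for _ in 0..num_rows {
            ids.push(0);
        }
        // With a Vec this could also be written as
        // `ids.extend(std::iter::repeat(0).take(num_rows))`.
    }
}

fn main() {
    let mut ids = Some(Vec::new());
    fill_placeholders_hoisted(&mut ids, 3);
    assert_eq!(ids.as_ref().unwrap().len(), 3);

    // When the column is absent, the whole loop is skipped.
    let mut none: Option<Vec<u64>> = None;
    fill_placeholders_hoisted(&mut none, 3);
    assert!(none.is_none());
}
```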
```rust
if let Some(ids) = blob_ids.as_mut() {
    ids.push(0);
}
if let Some(builder) = uri_builder.as_mut() {
    builder.append_value("");
}
```
Wait, why are we inserting 0/"" here? Is this because they are packed / inline blobs?
Yep, we’re filling in missing values for inline and out-of-line blobs to make our decoding logic a bit cleaner. Maybe I just need to append_null instead, will fix.
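The append_null fix discussed above could be sketched like this. Everything here (the `BlobKind` enum, the placeholder values for the packed case) is illustrative, not the actual Lance schema:

```rust
// Hypothetical sketch: for inline / out-of-line blobs the v2-only descriptor
// columns (blob_id / blob_uri) carry no data, so append nulls instead of
// placeholder values such as 0 / "".

enum BlobKind {
    Inline, // blob data stored with the row; no blob file involved
    Packed, // blob data stored in a shared blob file
}

fn append_desc(
    kind: BlobKind,
    blob_ids: &mut Vec<Option<u64>>,
    blob_uris: &mut Vec<Option<String>>,
) {
    match kind {
        // Inline blobs have no blob file, so the v2-only fields stay null.
        BlobKind::Inline => {
            blob_ids.push(None);
            blob_uris.push(None);
        }
        // Packed blobs reference a blob file (values here are placeholders).
        BlobKind::Packed => {
            blob_ids.push(Some(42));
            blob_uris.push(Some("blobs/000.bin".to_string()));
        }
    }
}

fn main() {
    let mut ids = Vec::new();
    let mut uris = Vec::new();
    append_desc(BlobKind::Inline, &mut ids, &mut uris);
    append_desc(BlobKind::Packed, &mut ids, &mut uris);
    assert_eq!(ids, vec![None, Some(42)]);
    assert!(uris[0].is_none() && uris[1].is_some());
}
```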
Thank you @westonpace for the review!
Yes, blob v2 is a table-level feature; we're just using the struct descriptor to carry some metadata.
The changes to the file reader are mostly for compatibility with 2.1 files, where we need to fill in the added columns.
Oh no, I'm not adding a new encoding here, just two new columns in the same struct descriptor. I tend to reuse the same field with different schemas and do evolution at runtime.
Yes, that's my expectation too. Maybe the change set itself isn't clear about what I'm doing? I can add some comments on the confusing parts, or just use a new field for that.
Had an offline meeting with @westonpace; will change the blob v2 schema into a table-level thing instead of a file-level thing.
cc @westonpace, let me know if the new impl aligns with what you had in mind.
Signed-off-by: Xuanwo <github@xuanwo.io>
westonpace
left a comment
Minor naming change. I think (but may be understanding wrong) that "blob version 1" is "inline only" and "blob version 2" is "inline, packed, and dedicated".
If this is correct then I think we should call it BLOBFILE_DESC_FIELD (or PACKED_DESC_FIELD). We are not replacing the existing inline approach, we are adding a new packed approach which will utilize blob-files. The inline approach will still use the 2-field description. The packed approach will use the 5-field description.
```rust
pub static BLOB_V2_DESC_LANCE_FIELD: LazyLock<Field> =
    LazyLock::new(|| Field::try_from(&*BLOB_V2_DESC_FIELD).unwrap());
```
I wonder if we should just call this BLOBFILE_DESC_FIELD? This way it is clear we are not replacing BLOB_DESC_FIELD (we still need it for inline case)
My current idea is to have a blob v2 concept that includes all supported blob types: 'inline', 'packed', and 'dedicated'. This means blob v2 covers all the uses of blob v1, which is inline only. I think this makes compatibility easier without changing any logic on the inline side.
We'll only reuse the file encoding from blob v1 (inline), but all the table schema will live in blob v2. With this change, we can do the version check early at the table level instead of at data-writing time.
Btw, I don't feel strongly about this. If you prefer to keep the packed/dedicated blob types as an extension to inline, I'm fine with making that change.
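To illustrate the two options being weighed, here is a rough sketch of the 2-field inline descriptor versus the extended descriptor. Only blob_id / blob_uri come from this thread; all other field names are assumptions, not the actual Lance schema:

```rust
// Hypothetical sketch of the naming discussion: the inline (v1) descriptor
// keeps its small 2-field shape, while the packed / blob-file descriptor
// extends it with optional fields that are null for inline blobs.

struct BlobDescV1 {
    position: u64, // offset of the inline blob bytes (assumed name)
    size: u64,     // length in bytes (assumed name)
}

struct BlobDescV2 {
    position: u64,
    size: u64,
    blob_id: Option<u64>,     // None for inline blobs
    blob_uri: Option<String>, // None for inline blobs
}

/// An inline blob is one with no blob-file reference.
fn is_inline(desc: &BlobDescV2) -> bool {
    desc.blob_id.is_none() && desc.blob_uri.is_none()
}

fn main() {
    let v1 = BlobDescV1 { position: 0, size: 128 };
    let v2 = BlobDescV2 {
        position: v1.position,
        size: v1.size,
        blob_id: None,
        blob_uri: None,
    };
    // A v1 descriptor maps onto a v2 descriptor with the new fields null.
    assert!(is_inline(&v2));
}
```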
westonpace
left a comment
Changing review to approved since I think I only disagree on naming at this point.
Part of lance-format#4947

This PR will add the blob v2 schema for lance so that we are ready to start writing new desc fields for blob. The new schema is gated under file format version `2.2`.

---

**This PR was primarily authored with Codex using GPT-5-Codex and then hand-reviewed by me. I AM responsible for every change made in this PR. I aimed to keep it aligned with our goals, though I may have missed minor issues. Please flag anything that feels off, I'll fix it quickly.**

Signed-off-by: Xuanwo <github@xuanwo.io>