Conversation
Oh, I get it. We didn't support packed struct with
Codecov Report

```
@@            Coverage Diff             @@
##             main    #4948      +/-   ##
==========================================
- Coverage   81.74%   81.73%    -0.02%
==========================================
  Files         341      341
  Lines      140915   141049      +134
==========================================
+ Hits       115198   115289       +91
- Misses      21900    21945       +45
+ Partials     3817     3815        -2
```
westonpace
left a comment
Hmm, I'm not sure this is the correct approach. I could also be wrong! My thinking was that blob-v2 was going to be more or less a table-level feature?
In other words, we had this table:
| Example Size | Category | Data Type | Table Type |
|---|---|---|---|
| Small | Inline | Normal Binary | Regular Column |
| 512KB | Out-of-line | Blob | Regular Column |
| 10MB | Packed | Packed Struct | BlobFile Column |
| 1GB | Dedicated | Packed Struct | BlobFile Column |
So I wasn't expecting any changes to the file reader (since we now have support for blob and packed struct).
It looks like you are introducing the data file concept here as a new encoding for the file reader? I would think the new data file columns would be handled mostly at the table level (the file reader would just see them as a packed struct column)?
```rust
            ))
        }
    }
    DataBlock::Nullable(nullable) => self.create_per_value(field, nullable.data.as_ref()),
```
How do we get a nullable block here?
In our new packed struct descriptor, blob_id / blob_uri are optional, so we need to handle the Nullable case first. Btw, I'm fine with changing them to non-nullable to make the logic easier.
rust/lance-encoding/src/data.rs
Outdated
```rust
if matches!(data_type, DataType::Null) {
    return Self::AllNull(AllNullDataBlock { num_values });
}
let mut builder = BooleanBufferBuilder::new(num_values as usize);
builder.append_n(num_values as usize, false);
nulls = Nullability::Some(NullBuffer::new(builder.finish()));
```
That's the same issue we ran into with nullable fields. I suspect there might be a bug, but I haven't dug into it deeply yet. The problem is that when we use optional fields like blob_id and blob_uri, we end up generating an all-null data block, but the packed-struct per-value encoding can't handle this case correctly. So I created a nullable buffer instead.
I'm starting to think using nullable fields was a bad idea 😭
My intention was to save some bits for the case where all blob_ids are null.
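To make the workaround concrete, here is a std-only sketch of the idea above. The `ValidityBitmap` type is illustrative and stands in for arrow's `NullBuffer` / `BooleanBufferBuilder`; it is not the actual lance-encoding code:

```rust
// Hypothetical sketch: when every value of an optional field (blob_id /
// blob_uri) is null, emit an explicit all-false validity bitmap instead of a
// dedicated all-null block, so the packed-struct per-value encoder still sees
// a regular nullable block it can handle.

/// Minimal stand-in for a validity (null) bitmap: one flag per value,
/// `false` meaning "this value is null".
struct ValidityBitmap {
    bits: Vec<bool>,
}

impl ValidityBitmap {
    /// Build a bitmap marking all `num_values` entries as null.
    fn all_null(num_values: usize) -> Self {
        Self {
            bits: vec![false; num_values],
        }
    }

    fn null_count(&self) -> usize {
        self.bits.iter().filter(|v| !**v).count()
    }
}

fn main() {
    // All blob_ids are null, but we still hand the encoder a nullable block
    // (values + validity) rather than an AllNull representation.
    let validity = ValidityBitmap::all_null(4);
    assert_eq!(validity.null_count(), 4);
    println!("nulls: {}", validity.null_count());
}
```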
```rust
if let Some(ids) = blob_ids.as_mut() {
    ids.push(0);
}
if let Some(builder) = uri_builder.as_mut() {
    builder.append_null();
}
```
Probably not a big deal, given blobs are going to be large and expensive, so we don't have to worry that much about per-row costs. However, if this were a regular column, I'd suggest getting the if statements out of the per-value loop here.
Good idea, will give this a try.
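For reference, the hoisting suggested above might look like this. The function and variable names are illustrative, not the actual lance code:

```rust
// Hypothetical sketch: instead of checking `if let Some(...)` once per row,
// match on the optional builder once and run a tight loop inside each branch.

fn fill_placeholders_hoisted(blob_ids: &mut Option<Vec<u64>>, num_rows: usize) {
    // Branch once, outside the loop, rather than once per value.
    if let Some(ids) = blob_ids.as_mut() {
        for _ in 0..num_rows {
            ids.push(0);
        }
        // With a Vec this could also be written as
        // `ids.extend(std::iter::repeat(0).take(num_rows))`.
    }
}

fn main() {
    let mut ids = Some(Vec::new());
    fill_placeholders_hoisted(&mut ids, 3);
    assert_eq!(ids.as_ref().unwrap().len(), 3);

    // When the column is absent, the whole loop is skipped.
    let mut none: Option<Vec<u64>> = None;
    fill_placeholders_hoisted(&mut none, 3);
    assert!(none.is_none());
}
```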
```rust
if let Some(ids) = blob_ids.as_mut() {
    ids.push(0);
}
if let Some(builder) = uri_builder.as_mut() {
    builder.append_value("");
}
```
Wait, why are we inserting 0/"" here? Is this because they are packed / inline blobs?
Yep, we’re filling in missing values for inline and out-of-line blobs to make our decoding logic a bit cleaner. Maybe I just need to append_null instead, will fix.
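The append_null fix discussed above could be sketched like this. Everything here (the `BlobKind` enum, the placeholder values for the packed case) is illustrative, not the actual Lance schema:

```rust
// Hypothetical sketch: for inline / out-of-line blobs the v2-only descriptor
// columns (blob_id / blob_uri) carry no data, so append nulls instead of
// placeholder values such as 0 / "".

enum BlobKind {
    Inline, // blob data stored with the row; no blob file involved
    Packed, // blob data stored in a shared blob file
}

fn append_desc(
    kind: BlobKind,
    blob_ids: &mut Vec<Option<u64>>,
    blob_uris: &mut Vec<Option<String>>,
) {
    match kind {
        // Inline blobs have no blob file, so the v2-only fields stay null.
        BlobKind::Inline => {
            blob_ids.push(None);
            blob_uris.push(None);
        }
        // Packed blobs reference a blob file (values here are placeholders).
        BlobKind::Packed => {
            blob_ids.push(Some(42));
            blob_uris.push(Some("blobs/000.bin".to_string()));
        }
    }
}

fn main() {
    let mut ids = Vec::new();
    let mut uris = Vec::new();
    append_desc(BlobKind::Inline, &mut ids, &mut uris);
    append_desc(BlobKind::Packed, &mut ids, &mut uris);
    assert_eq!(ids, vec![None, Some(42)]);
    assert!(uris[0].is_none() && uris[1].is_some());
}
```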
Thank you @westonpace for the review!
Yes, blob v2 is a table-level feature; we're just using the struct descriptor to carry some metadata.
The changes to the file reader are mostly for compatibility with 2.1 files, where we need to fill in the added columns.
Oh no, I'm not adding a new encoding here, just two new columns in the same struct descriptor. I tend to reuse the same field with different schemas and do evolution at runtime.
Yes, that's my expectation too. Maybe the change set itself isn't clear about what I'm doing? I can add some comments on the confusing parts, or just use a new field for that.
Had an offline meeting with @westonpace; will change the blob v2 schema into a table-level thing instead of a file-level thing.
cc @westonpace, let me know if the new impl aligns with what you had in mind.
Signed-off-by: Xuanwo <github@xuanwo.io>
westonpace
left a comment
Minor naming change. I think (but may be understanding wrong) that "blob version 1" is "inline only" and "blob version 2" is "inline, packed, and dedicated".
If this is correct then I think we should call it BLOBFILE_DESC_FIELD (or PACKED_DESC_FIELD). We are not replacing the existing inline approach, we are adding a new packed approach which will utilize blob-files. The inline approach will still use the 2-field description. The packed approach will use the 5-field description.
```rust
pub static BLOB_V2_DESC_LANCE_FIELD: LazyLock<Field> =
    LazyLock::new(|| Field::try_from(&*BLOB_V2_DESC_FIELD).unwrap());
```
I wonder if we should just call this BLOBFILE_DESC_FIELD? This way it is clear we are not replacing BLOB_DESC_FIELD (we still need it for inline case)
My current idea is to have a blob v2 concept that includes all supported blob types: 'inline', 'packed', and 'dedicated'. This means blob v2 covers all the uses of blob v1, which is inline only. I think this makes compatibility easier without changing any logic on the inline side.
We'll only reuse the file encoding from blob v1 (inline), but all the table schema will live in blob v2. With this change, we can do the version check early at the table level instead of at data-writing time.
Btw, I don't feel strongly about this. If you prefer to keep the packed/dedicated blob types as an extension to inline, I'm fine with making that change.
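To illustrate the two options being weighed, here is a rough sketch of the 2-field inline descriptor versus the extended descriptor. Only blob_id / blob_uri come from this thread; all other field names are assumptions, not the actual Lance schema:

```rust
// Hypothetical sketch of the naming discussion: the inline (v1) descriptor
// keeps its small 2-field shape, while the packed / blob-file descriptor
// extends it with optional fields that are null for inline blobs.

struct BlobDescV1 {
    position: u64, // offset of the inline blob bytes (assumed name)
    size: u64,     // length in bytes (assumed name)
}

struct BlobDescV2 {
    position: u64,
    size: u64,
    blob_id: Option<u64>,     // None for inline blobs
    blob_uri: Option<String>, // None for inline blobs
}

/// An inline blob is one with no blob-file reference.
fn is_inline(desc: &BlobDescV2) -> bool {
    desc.blob_id.is_none() && desc.blob_uri.is_none()
}

fn main() {
    let v1 = BlobDescV1 { position: 0, size: 128 };
    let v2 = BlobDescV2 {
        position: v1.position,
        size: v1.size,
        blob_id: None,
        blob_uri: None,
    };
    // A v1 descriptor maps onto a v2 descriptor with the new fields null.
    assert!(is_inline(&v2));
}
```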
westonpace
left a comment
Changing review to approved since I think I only disagree on naming at this point.
Part of lance-format#4947

This PR will add the blob v2 schema for lance so that we are ready to start writing new desc fields for blob. The new schema is gated under file format version `2.2`.

---

**This PR was primarily authored with Codex using GPT-5-Codex and then hand-reviewed by me. I AM responsible for every change made in this PR. I aimed to keep it aligned with our goals, though I may have missed minor issues. Please flag anything that feels off, I'll fix it quickly.**

Signed-off-by: Xuanwo <github@xuanwo.io>