Lance Blob v2 Object Layer Proposal #7174
Replies: 3 comments 1 reply
-
I think I get the gist of what you are saying. You are proposing a plugin extension point (the object layer?) whose responsibility is to take in a description and a request and return data. As a result we should not always assume that position and size mean "offset into a file and number of bytes in that file". However, readers can be confident they will get the 5-value description struct and that there is an API to convert that struct into bytes at a later date. I assume this means we will need metadata of some kind (pack file metadata, etc.) to convert from position to instructions into object storage?

To test my understanding let me think about the video case. With the existing capabilities I would just have one row per video and a column of type blob representing the video bytes. I could have another column, maybe of type list, which is all the "segment ids" in the video, and another column of type list<list> which is all the frame ids per segment. To read a single frame I would do something like

With this new proposal a blob could be a frame in the video (output is an image) or it could be a segment of the video (output is a smaller video). I suppose I could even have three columns (video, segment, frame) all with type blob (all backed by the same data) and one row per frame. Reading a single frame would just be

I'm +1 on the proposal. I think the main advantage of the object layer is that we move the decoding logic into this plugin layer and the user (whoever is writing the queries) doesn't have to keep track of complicated UDFs. They just use the same old
-
+1, it would be great to make this really universal/pluggable to enable other use cases where
like
-
+1 on this proposal, and I'd be happy to contribute PRs for it. I want to add some context on the video case, since I've been working on storing video data in Lance (GOP-level blobs
So while I agree the Raw Pack is the right minimal first step, I think the video/frame pack is a strong motivating case for
-
Abstract
Blob v2 already decouples a blob’s logical value from its physical placement: a blob can be stored as Inline, Packed, Dedicated, or External. This proposal adds a more general Object Layer on top of these placements.

The Object Layer defines how a row-level blob descriptor resolves to a logical blob object:

This allows Lance to express exact sharing, delta encoding, chunked blobs, and future domain-specific blob representations with a single model. As a first step, we only need to implement a minimal Raw Pack under the Packed placement; other pack types should be designed separately.

Problem
Today the Blob v2 descriptor is largely interpreted as a physical location:
For raw packed blobs this means:
This is sufficient for raw bytes, but there is no clean place to express richer blob representations:
These are object representation concerns, not new placements. They should not turn BlobKind into a codec enum, nor should they be modeled as row-to-row dependencies or new page layouts.

Design
Introduce an internal Object Layer into the Blob v2 read/write path. The descriptor is first resolved into a placement-specific object reference, then planning and reading happen through a uniform contract.
flowchart TD
    R["Row"] --> D["Blob v2 descriptor"]
    D --> O["BlobObjectRef"]
    O --> P{"Placement backend"}
    P --> I["Inline"]
    P --> K["Packed"]
    P --> X["Dedicated"]
    P --> E["External"]
    I --> B["logical bytes"]
    K --> B
    X --> B
    E --> B

The uniform contract can be expressed as:
A key invariant is:
If a representation requires dependencies, chunks, or codec metadata, that metadata must be available before reading the main payload. A reader must not read a delta payload first and only then discover it needs another object.
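To make the plan-before-read rule concrete, here is a minimal sketch in Python. It is not the proposal's contract: the `plan` and `read` method names, the `ReadPlan` type, the fields of `BlobObjectRef`, and the toy "delta" (a suffix appended to a base object) are all assumptions made for illustration. The point it shows is that `plan` consults only metadata and returns every dependency, so no payload byte is read before the full set of reads is known.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class BlobObjectRef:
    """Resolved object reference; these fields are hypothetical."""
    placement: str
    key: int


@dataclass
class ReadPlan:
    """All reads needed for one blob, known before any payload byte is read."""
    payload: BlobObjectRef
    dependencies: List[BlobObjectRef] = field(default_factory=list)


class ToyDeltaBackend:
    """Toy backend: each object is a base, or a suffix appended to a base."""

    def __init__(self, base_of: Dict[int, Optional[int]], payloads: Dict[int, bytes]):
        self._base_of = base_of      # metadata only: key -> base key (or None)
        self._payloads = payloads    # payload storage: key -> bytes
        self.payload_reads = 0       # counts payload fetches, for the demo

    def plan(self, ref: BlobObjectRef) -> ReadPlan:
        # Walk the dependency chain using metadata only.
        deps = []
        base = self._base_of[ref.key]
        while base is not None:
            deps.append(BlobObjectRef(ref.placement, base))
            base = self._base_of[base]
        return ReadPlan(payload=ref, dependencies=deps)

    def _fetch(self, ref: BlobObjectRef) -> bytes:
        self.payload_reads += 1
        return self._payloads[ref.key]

    def read(self, plan: ReadPlan) -> bytes:
        # Oldest base first, then each suffix, then the requested object.
        parts = [self._fetch(dep) for dep in reversed(plan.dependencies)]
        parts.append(self._fetch(plan.payload))
        return b"".join(parts)


backend = ToyDeltaBackend(base_of={1: None, 2: 1},
                          payloads={1: b"hello", 2: b" world"})
plan = backend.plan(BlobObjectRef("packed", 2))
reads_after_planning = backend.payload_reads  # still 0: planning is metadata-only
data = backend.read(plan)
```

Because the plan lists its dependencies up front, a scheduler could batch or coalesce the reads before issuing any I/O, which is what the invariant is protecting.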
Placement boundaries
The Object Layer spans all Blob v2 placements, but the role of each placement differs.
Inline: […] Packed.
Packed: […]
Dedicated: […]
External: […]

A critical boundary:
BlobKind continues to express only placement. Representation details belong to the Object Layer backends.

Packed Backend and Raw Pack
The first concrete backend should be the Raw Pack for Packed.

In legacy raw packed blobs, position means a byte offset:

In object-backed packed blobs, position should be interpreted as an opaque backend reference:

The Raw Pack only needs to prove the new boundary holds:
The Raw Pack can be implemented as an internal Lance file, or as a smaller indexed sidecar format. This proposal does not require all future packs to share the same schema.
The Raw Pack must not store its internal payload recursively as Blob v2. The pack itself is already a blob container.
Future Delta Pack, Chunked Pack, exact-share policy, source-revision pack, video-frame pack, and custom pack types should each have their own independent designs.
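As an illustration of what "position as an opaque backend reference" could look like for a Raw Pack, the sketch below uses an in-memory pack with a small index. Everything here is an assumption rather than the proposed format: `ToyRawPack`, `PackedDescriptor`, the `pack_id` field, and the index layout are hypothetical names chosen for the example. The only idea taken from the proposal is that the reader hands `position` to the pack instead of treating it as a byte offset.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class PackedDescriptor:
    """Hypothetical subset of a descriptor for an object-backed Packed blob."""
    pack_id: int
    position: int  # opaque object id inside the pack, NOT a byte offset
    size: int      # logical size of the blob in bytes


class ToyRawPack:
    """In-memory stand-in for a pack: raw payloads plus an object index."""

    def __init__(self) -> None:
        self._data = bytearray()
        self._index: List[Tuple[int, int]] = []  # object id -> (offset, length)

    def append(self, payload: bytes) -> int:
        """Store a payload and return its opaque object id."""
        object_id = len(self._index)
        self._index.append((len(self._data), len(payload)))
        self._data += payload
        return object_id

    def read(self, object_id: int) -> bytes:
        """Resolve an object id through the index, then slice the raw bytes."""
        offset, length = self._index[object_id]
        return bytes(self._data[offset:offset + length])


def read_blob(packs: Dict[int, ToyRawPack], desc: PackedDescriptor) -> bytes:
    """Read a blob by delegating position to the pack, never seeking on it."""
    payload = packs[desc.pack_id].read(desc.position)
    if len(payload) != desc.size:
        raise ValueError("descriptor size does not match the stored object")
    return payload


pack = ToyRawPack()
first_id = pack.append(b"abc")
second_id = pack.append(b"defgh")
descriptor = PackedDescriptor(pack_id=7, position=second_id, size=5)
blob = read_blob({7: pack}, descriptor)
```

Here the second object has id 1 but lives at byte offset 3, so a reader that treated `position` as an offset would return the wrong bytes. That mismatch is why the reinterpretation needs an explicit compatibility story.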
Compatibility
The Object Layer itself is an internal abstraction. A required feature is only needed when a backend changes the interpretation of existing descriptor fields.
For object-backed Packed, Lance will need a feature gate similar to:

Compatibility requirements:
[…] Inline, raw Packed, Dedicated, and External blobs.
[…] Packed data as raw packed byte slices.
[…] Packed data must fail explicitly on readers that lack the required feature.
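The "fail explicitly" requirement can be sketched as a check that runs before any blob is decoded. This is only an illustration: the feature name `hypothetical.object_backed_packed`, the `check_required_features` helper, and the error type are invented for the example and do not reflect how Lance actually records or checks feature flags.

```python
from typing import FrozenSet, Iterable


class UnsupportedFeatureError(Exception):
    """Raised when data requires a feature the reader does not implement."""


# Hypothetical feature name; the real gate is not specified here.
OBJECT_PACKED = "hypothetical.object_backed_packed"


def check_required_features(required: Iterable[str], supported: FrozenSet[str]) -> None:
    """Refuse to read if any required feature is unknown to this reader."""
    missing = sorted(set(required) - supported)
    if missing:
        raise UnsupportedFeatureError(
            "reader lacks required features: " + ", ".join(missing)
        )


OLD_READER: FrozenSet[str] = frozenset()
NEW_READER: FrozenSet[str] = frozenset({OBJECT_PACKED})

# Legacy data declares no required feature, so both readers accept it.
check_required_features([], OLD_READER)
check_required_features([], NEW_READER)

# Object-backed Packed data is readable only by a reader that knows the feature.
check_required_features([OBJECT_PACKED], NEW_READER)
```

Passing `[OBJECT_PACKED]` with `OLD_READER` raises `UnsupportedFeatureError`, so an old reader stops with a clear error instead of slicing the pack at a position that is no longer a byte offset.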