What is the difference between a group and a namespace? Aren't both of them a collection of tables / datasets?
If I have the same topic present in many groups, can I query the topic across groups? How is schema resolution handled if the schema of a topic changes incompatibly across groups?
There are some related workstreams that we should probably discuss together. Let me share what I plan to add in a Lance Catalog (both directory and REST). I am wondering whether these would be enough for your use case, and whether we could just extend the catalog/namespace surface instead of creating a separate group concept.
Workstream 1: add more objects
The catalog spec has used the concept of "object" from day 1, anticipating more object types than just table and namespace. We plan to add the following:
Basically, whatever exists in common database systems makes sense to add, and we have very concrete customer requests for these items. This is also a pretty natural development, since these objects mostly exist already in systems like Unity/Polaris/Gravitino/Iceberg REST. But I think we want to optimize our implementation for ML/AI use cases, based on our experience running products like Geneva.
Workstream 2: metadata/tag for all these objects
For each object, the directory catalog
This also comes from a very concrete customer ask: they want to store around 100k tables in the catalog and want an easy way to search tables from agent prompts. They also want to annotate tables with long descriptions generated by an LLM. But I feel this would also work for grouping objects in a catalog? For example, you can assign a table, a view, and their related UDTF all with tags
One related prototype I am working on is #6794: if we naively run updates against the system table, it would be very inefficient for this use case, and using full copy-on-write more or less solves this problem. Currently the biggest problem of the directory catalog is that its throughput is bounded by the object store commit rate and can hardly go beyond 6-7 TPS on S3, or at most 20 TPS on S3 Express. I am planning to explore sharding to scale up TPS as the next step.
Abstract
Lance Group is a versioned, queryable container for organizing a set of related Lance datasets.
With a group, users can manage multiple Lance datasets as a reproducible data unit. A group records named members, pinned dataset versions, semantic relationships, and reusable logical views.
Core definition:
Motivation
Many Lance workloads naturally involve multiple related datasets:
These datasets belong to the same run, session, experiment, or data product. Users need a stable way to open them together, pin versions together, describe relationships among them, and query derived views across datasets.
Lance Group provides a first-class model for this multi-dataset pattern.
User Model
The main concepts are:
Example group:
Manifest Table
The group manifest is itself a Lance table. It records group identity, members, relationships, and views, and directly reuses the versioning, schema evolution, metadata, scan, and namespace capabilities of Lance datasets.
A group could have the following physical layout:
Opening a group version is equivalent to opening the corresponding dataset version of _group.lance:
The manifest table uses a structured schema. Each row represents a group entry, with the entry type distinguished by entry_type; different entry types use the corresponding struct column to express their data.
Example member row:
Example relationship row:
Example view row:
Opening the same _group.lance version resolves to the same set of member dataset versions. This makes experiments, feature generation, evaluations, and downstream training jobs reproducible.
The manifest table can also be queried directly for debugging, auditing, UI display, and catalog synchronization:
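As a rough sketch of what such a query could look like, the snippet below models manifest rows as plain Python objects and filters them by entry_type. The field names (name, uri, version, kind, left, right), the entry_type labels, and the sample members are assumptions made for illustration; the proposal only states that each row carries an entry_type plus a matching struct column. A real implementation would scan _group.lance rather than a Python list.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical struct columns; the real manifest schema is not specified here.
@dataclass(frozen=True)
class Member:
    name: str
    uri: str
    version: int  # pinned dataset version

@dataclass(frozen=True)
class Relationship:
    kind: str
    left: str
    right: str

@dataclass(frozen=True)
class ManifestRow:
    entry_type: str  # assumed labels: "member", "relationship", "view"
    member: Optional[Member] = None
    relationship: Optional[Relationship] = None

def entries_of(rows, entry_type):
    """Return only the rows of one entry type, like a filtered scan."""
    return [r for r in rows if r.entry_type == entry_type]

def pinned_versions(rows):
    """Map each member name to the dataset version pinned in this manifest."""
    return {r.member.name: r.member.version
            for r in entries_of(rows, "member") if r.member is not None}

# Made-up sample manifest for one group version.
manifest_rows = [
    ManifestRow("member", member=Member("camera", "camera.lance", 12)),
    ManifestRow("member", member=Member("imu", "imu.lance", 7)),
    ManifestRow("relationship",
                relationship=Relationship("temporal", "camera", "imu")),
]

print(pinned_versions(manifest_rows))
```

Because the pinned versions are read from the rows of a single manifest version, reading that version again yields the same mapping, which is the reproducibility property described above.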
Namespace Integration
Groups are integrated into the Lance namespace as named resources.
The namespace can resolve a group name to:
The group manifest table then resolves members and their pinned dataset versions.
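The two-step resolution described here (namespace to manifest, then manifest to pinned members) can be sketched with in-memory stand-ins. Everything below is hypothetical: the group name, the manifest path, the version numbers, and the resolve_group helper are illustrative assumptions, not part of any existing Lance API.

```python
# Hypothetical namespace: group name -> location of its manifest table.
NAMESPACE = {"robot_run_42": "groups/robot_run_42/_group.lance"}

# Hypothetical manifest history: (manifest location, group version) ->
# member name -> pinned dataset version.
MANIFEST_VERSIONS = {
    ("groups/robot_run_42/_group.lance", 1): {"camera": 10, "imu": 5},
    ("groups/robot_run_42/_group.lance", 2): {"camera": 12, "imu": 7},
}

def resolve_group(name, group_version):
    """Step 1: namespace lookup. Step 2: manifest lookup at a fixed version."""
    manifest_uri = NAMESPACE[name]
    members = MANIFEST_VERSIONS[(manifest_uri, group_version)]
    return dict(members)  # copy so callers cannot mutate the stored manifest

# Version 1 keeps resolving to the same pins even though version 2 exists.
print(resolve_group("robot_run_42", 1))
print(resolve_group("robot_run_42", 2))
```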
This provides a uniform discovery path for connectors:
Temporal Relationships
Temporal data is one of the core use cases for Lance Group. Groups can declare temporal relationships between members, and the execution layer can build aligned views based on these declarations.
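One plausible form of an aligned view is an as-of join, where each row of one member is paired with the latest row of another member at or before its timestamp. The sketch below shows that idea in plain Python. The as-of semantics, the asof_align helper, and the sample camera/IMU data are assumptions for illustration; the proposal does not specify which alignment strategies the execution layer would implement.

```python
from bisect import bisect_right

def asof_align(left, right):
    """Pair each (ts, value) in left with the latest right value whose
    timestamp is <= ts. Both inputs must be sorted by timestamp."""
    right_ts = [ts for ts, _ in right]
    aligned = []
    for ts, left_value in left:
        # Index of the last right row with timestamp <= ts, or -1 if none.
        i = bisect_right(right_ts, ts) - 1
        right_value = right[i][1] if i >= 0 else None
        aligned.append((ts, left_value, right_value))
    return aligned

# Made-up sample data: timestamps in milliseconds.
camera = [(100, "frame0"), (133, "frame1"), (166, "frame2")]
imu = [(95, 0.1), (105, 0.2), (130, 0.3), (170, 0.4)]

for row in asof_align(camera, imu):
    print(row)
```

Rows on the left with no earlier match on the right get None, so the aligned view keeps every left row.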
Example relationship:
This relationship can support:
API Sketch
Python:
Rust:
SQL:
DuckDB table function:
Connector Integration
DataFusion:
DuckDB:
Spark:
Python:
Implementation Structure
The suggested layering is as follows:
Core storage capabilities that groups and temporal execution will use:
Value
Lance Group turns a set of related datasets into a reproducible data product.
For robotics and multimodal workloads, the workflow becomes:
For ML and data engineering workloads, the same abstraction can support feature groups, evaluation datasets, benchmark suites, derived training tables, and multi-table data releases.
The end result is a single model shared across Python, DataFusion, DuckDB, Spark, and Lance-native applications:
Open Design Items