Lance Group Design Proposal #6887

Xuanwo · 2026-05-21T07:51:15Z

Xuanwo
May 21, 2026
Maintainer

Abstract

Lance Group is a versioned, queryable container for organizing a set of related Lance datasets.

With a group, users can manage multiple Lance datasets as a reproducible data unit. A group records named members, pinned dataset versions, semantic relationships, and reusable logical views.

Namespace
  Resolves names, locations, storage options, permissions

Group
  Organizes related Lance datasets into a versioned data unit

Dataset
  Stores a physical Lance table

Group View
  Exposes a logical table derived from group members

Core definition:

A Lance Group is a versioned semantic container for related Lance datasets, enabling reproducible multi-table data products and reusable logical views.

Motivation

Many Lance workloads naturally involve multiple related datasets:

camera frames       -> Lance dataset
imu samples         -> Lance dataset
lidar sweeps        -> Lance dataset
calibration records -> Lance dataset
run metadata        -> Lance dataset
derived features    -> Lance dataset

These datasets belong to the same run, session, experiment, or data product. Users need a stable way to open them together, pin versions together, describe relationships among them, and query derived views across datasets.

Lance Group provides a first-class model for this multi-dataset pattern.

User Model

The main concepts are:

Group
  A versioned container for a set of related Lance datasets.

Member
  A named Lance dataset pointing to a version, tag, branch, URI, or namespace identifier.

Relationship
  A typed link between members, e.g. temporal alignment, foreign key, or derived-from.

View
  A named logical table derived from group members and relationships.

Materialized View
  A view written back as a Lance dataset.

Example group:

robot_run_001: Lance Group
  members:
    camera: s3://logs/run-001/camera.lance@v17
    imu: s3://logs/run-001/imu.lance@v42
    lidar: s3://logs/run-001/lidar.lance@v8
    calibration: s3://logs/run-001/calibration.lance@v3

  relationships:
    camera_imu_time:
      type: temporal_asof
      left: camera
      right: imu
      by: [session_id]
      left_time: timestamp
      right_time: timestamp
      direction: backward
      tolerance: 5ms

  views:
    camera_with_imu:
      type: temporal_align
      base: camera
      include: [imu]
      relationship: camera_imu_time

Manifest Table

The group manifest is itself a Lance table. It records group identity, members, relationships, and views, and directly reuses the versioning, schema evolution, metadata, scan, and namespace capabilities of Lance datasets.

A group could have the following physical layout:

s3://bucket/logs/run-001/
  _group.lance       # group manifest table
  camera.lance       # member dataset
  imu.lance          # member dataset
  lidar.lance        # member dataset

Opening a group version is equivalent to opening the corresponding dataset version of _group.lance:

open_group("s3://bucket/logs/run-001", version=12)
  -> open Dataset("s3://bucket/logs/run-001/_group.lance").checkout_version(12)
  -> scan manifest rows
  -> resolve pinned member dataset versions

The manifest table uses a structured schema. Each row represents one group entry. The entry_type column identifies the kind of entry, and each entry type stores its data in the matching struct column (member, relationship, or view).

entry_type: string
name: string
description: string?

member: struct<
  kind: string,
  uri: string,
  namespace: string?,
  version: uint64?,
  tag: string?,
  branch: string?
>?

relationship: struct<
  type: string,
  left: string,
  right: string,
  by: list<string>,
  left_time: string?,
  right_time: string?,
  direction: string?,
  tolerance_ns: int64?,
  allow_exact_match: bool?,
  tie_break: string?
>?

view: struct<
  type: string,
  base: string,
  include: list<string>,
  relationship: string?,
  materialized_uri: string?,
  materialized_version: uint64?
>?

created_at: timestamp
updated_at: timestamp
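
As a rough illustration of how a writer might check a row against this schema before committing it, here is a minimal sketch over plain Python dicts. The function name validate_entry is hypothetical, and the rule that a row populates only the struct matching its entry_type is an assumption; the proposal does not state it explicitly.

```python
from typing import Any, Mapping

# Struct columns from the manifest schema; each doubles as an entry_type value.
STRUCT_COLUMNS = ("member", "relationship", "view")


def validate_entry(row: Mapping[str, Any]) -> None:
    """Raise ValueError if a manifest row is inconsistent with its entry_type.

    Assumption (not stated in the proposal): a row populates only the
    struct column named by its entry_type and leaves the others null.
    """
    entry_type = row.get("entry_type")
    if entry_type not in STRUCT_COLUMNS:
        raise ValueError(f"unknown entry_type: {entry_type!r}")
    if not row.get("name"):
        raise ValueError("name is required")
    if row.get(entry_type) is None:
        raise ValueError(f"{entry_type} row is missing its {entry_type} struct")
    extra = [c for c in STRUCT_COLUMNS if c != entry_type and row.get(c) is not None]
    if extra:
        raise ValueError(f"{entry_type} row also populates: {extra}")


# A well-formed member row passes silently.
validate_entry({
    "entry_type": "member",
    "name": "camera",
    "member": {"kind": "lance_dataset",
               "uri": "s3://bucket/logs/run-001/camera.lance",
               "version": 17},
})
```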

Example member row:

entry_type: "member"
name: "camera"
member:
  kind: "lance_dataset"
  namespace: "robot_logs.run_001.camera"
  uri: "s3://bucket/logs/run-001/camera.lance"
  version: 17

Example relationship row:

entry_type: "relationship"
name: "camera_imu_time"
relationship:
  type: "temporal_asof"
  left: "camera"
  right: "imu"
  by: ["session_id"]
  left_time: "timestamp"
  right_time: "timestamp"
  direction: "backward"
  tolerance_ns: 5000000
  allow_exact_match: true
  tie_break: "latest"
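
The example group writes the tolerance as 5ms, while the manifest row stores tolerance_ns: 5000000. A conversion between the two might look like the sketch below. The helper name parse_tolerance_ns and the set of accepted units are assumptions for illustration; the proposal does not define a duration syntax.

```python
import re

# Nanoseconds per supported unit (the unit list is an assumption).
_UNIT_NS = {"ns": 1, "us": 1_000, "ms": 1_000_000, "s": 1_000_000_000}


def parse_tolerance_ns(text: str) -> int:
    """Convert a duration such as "5ms" into integer nanoseconds."""
    m = re.fullmatch(r"(\d+)\s*(ns|us|ms|s)", text.strip())
    if m is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(m.group(1)) * _UNIT_NS[m.group(2)]


print(parse_tolerance_ns("5ms"))  # 5000000
```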

Example view row:

entry_type: "view"
name: "camera_with_imu"
view:
  type: "temporal_align"
  base: "camera"
  include: ["imu"]
  relationship: "camera_imu_time"

Opening the same _group.lance version resolves to the same set of member dataset versions. This makes experiments, feature generation, evaluations, and downstream training jobs reproducible.

The manifest table can also be queried directly for debugging, auditing, UI display, and catalog synchronization:

SELECT name, member.uri, member.version
FROM group_manifest
WHERE entry_type = 'member';
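
The same member lookup can be sketched in Python over manifest rows that have already been scanned into dicts. This is only an illustration: PinnedMember and resolve_members are hypothetical names, not part of any existing Lance API, and treating member names as unique within a group version is an assumption.

```python
from typing import Any, Dict, Iterable, Mapping, NamedTuple, Optional


class PinnedMember(NamedTuple):
    uri: str
    version: Optional[int]
    tag: Optional[str]
    branch: Optional[str]


def resolve_members(rows: Iterable[Mapping[str, Any]]) -> Dict[str, PinnedMember]:
    """Map member name -> pinned reference, skipping non-member rows."""
    members: Dict[str, PinnedMember] = {}
    for row in rows:
        if row["entry_type"] != "member":
            continue
        name = row["name"]
        if name in members:
            # Assumption: a member name appears at most once per group version.
            raise ValueError(f"duplicate member name: {name}")
        m = row["member"]
        members[name] = PinnedMember(
            m["uri"], m.get("version"), m.get("tag"), m.get("branch")
        )
    return members


rows = [
    {"entry_type": "member", "name": "camera",
     "member": {"kind": "lance_dataset",
                "uri": "s3://bucket/logs/run-001/camera.lance", "version": 17}},
    {"entry_type": "member", "name": "imu",
     "member": {"kind": "lance_dataset",
                "uri": "s3://bucket/logs/run-001/imu.lance", "version": 42}},
    {"entry_type": "view", "name": "camera_with_imu",
     "view": {"type": "temporal_align", "base": "camera", "include": ["imu"]}},
]
pinned = resolve_members(rows)
print(pinned["camera"].version)  # 17
```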

Namespace Integration

Groups are integrated into the Lance namespace as named resources.

robot_logs.run_001              -> group
robot_logs.run_001.camera       -> member dataset
robot_logs.run_001.imu          -> member dataset
robot_logs.run_001.views.camera_with_imu -> view

The namespace can resolve a group name to:

group root location or _group.lance location
storage options
current group version

The group manifest table then resolves members and their pinned dataset versions.

This provides a uniform discovery path for connectors:

open group by name
  -> resolve group manifest table
  -> resolve member datasets
  -> expose members and views to the query engine
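
To make the naming layout in this section concrete, the sketch below classifies a dotted identifier relative to a known group. The function classify is hypothetical, and the "views" path segment is taken from the examples in this proposal rather than from any existing namespace spec.

```python
from typing import Set, Tuple


def classify(identifier: str, group: str,
             members: Set[str], views: Set[str]) -> Tuple[str, str]:
    """Return (kind, name) for an identifier relative to a known group."""
    if identifier == group:
        return ("group", group)
    prefix = group + "."
    if not identifier.startswith(prefix):
        raise KeyError(f"{identifier} is not under group {group}")
    rest = identifier[len(prefix):].split(".")
    # <group>.<member> names a member dataset.
    if len(rest) == 1 and rest[0] in members:
        return ("member", rest[0])
    # <group>.views.<view> names a logical view.
    if len(rest) == 2 and rest[0] == "views" and rest[1] in views:
        return ("view", rest[1])
    raise KeyError(f"cannot resolve {identifier}")


members = {"camera", "imu"}
views = {"camera_with_imu"}
print(classify("robot_logs.run_001.camera", "robot_logs.run_001", members, views))
# ('member', 'camera')
```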

Temporal Relationships

Temporal data is one of the core use cases for Lance Group. Groups can declare temporal relationships between members, and the execution layer can build aligned views based on these declarations.

Example relationship:

AsOfJoin {
  left: camera
  right: imu
  by: [session_id]
  left_time: timestamp
  right_time: timestamp
  direction: backward
  tolerance: 5ms
  allow_exact_match: true
  tie_break: latest
  output: left_outer
}

This relationship can support:

point-in-time alignment between streams;
nearest-neighbor temporal lookup;
resampling onto a base timeline;
windowed aggregation around an event stream;
materialized feature tables for training and evaluation.
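
For readers unfamiliar with as-of joins, here is a minimal in-memory sketch of the backward, left_outer case from the example relationship. It is not the proposed lance-temporal kernel. The function name is hypothetical, timestamps are assumed to be integer nanoseconds, the tolerance is treated as inclusive, and tie_break "latest" is read as "among right rows with equal timestamps, prefer the one that appears last in the input"; the proposal does not pin down any of these details.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]


def asof_join_backward(
    left: List[Row],
    right: List[Row],
    by: List[str],
    left_time: str,
    right_time: str,
    tolerance_ns: Optional[int] = None,
    allow_exact_match: bool = True,
) -> List[Tuple[Row, Optional[Row]]]:
    """Pair each left row with the latest right row at or before it.

    Output is left_outer: unmatched left rows are paired with None.
    """
    # Group right rows by key and sort each group by time. sorted() is
    # stable, so rows with equal timestamps keep their input order.
    groups: Dict[tuple, List[Row]] = defaultdict(list)
    for r in right:
        groups[tuple(r[k] for k in by)].append(r)
    index: Dict[tuple, Tuple[List[int], List[Row]]] = {}
    for key, rs in groups.items():
        rs = sorted(rs, key=lambda r: r[right_time])
        index[key] = ([r[right_time] for r in rs], rs)

    out: List[Tuple[Row, Optional[Row]]] = []
    for lrow in left:
        times, rs = index.get(tuple(lrow[k] for k in by), ([], []))
        t = lrow[left_time]
        # bisect_right admits right_time == t; bisect_left excludes it.
        pos = bisect_right(times, t) if allow_exact_match else bisect_left(times, t)
        match = rs[pos - 1] if pos > 0 else None
        if match is not None and tolerance_ns is not None:
            if t - match[right_time] > tolerance_ns:
                match = None  # closest earlier sample is too old
        out.append((lrow, match))
    return out


camera = [
    {"session_id": "s1", "timestamp": 10_000_000, "frame": 0},
    {"session_id": "s1", "timestamp": 30_000_000, "frame": 1},
]
imu = [
    {"session_id": "s1", "timestamp": 7_000_000, "ax": 0.1},
    {"session_id": "s1", "timestamp": 12_000_000, "ax": 0.2},
]
aligned = asof_join_backward(
    camera, imu, by=["session_id"],
    left_time="timestamp", right_time="timestamp",
    tolerance_ns=5_000_000,
)
```

In this example, frame 0 pairs with the IMU sample at 7 ms (3 ms earlier, within tolerance), while frame 1 gets no match because the closest earlier sample is 18 ms old.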

API Sketch

Python:

import lance

group = lance.open_group("robot_logs.run_001")

camera = group.dataset("camera")
imu = group.dataset("imu")

aligned = group.align(
    base="camera",
    with_=["imu"],
    relationship="camera_imu_time",
)

aligned.to_lance("s3://bucket/features/run-001/camera_with_imu.lance")

Rust:

let group = LanceGroup::open("robot_logs.run_001").await?;

let camera = group.dataset("camera").await?;
let aligned = group
    .align()
    .base("camera")
    .with("imu")
    .relationship("camera_imu_time")
    .build()
    .await?;

SQL:

SELECT *
FROM robot_logs.run_001.views.camera_with_imu;

DuckDB table function:

SELECT *
FROM lance_group_asof_join(
  'robot_logs.run_001',
  left => 'camera',
  right => 'imu',
  relationship => 'camera_imu_time'
);

Connector Integration

DataFusion:

Register group members as Lance table providers.
Register group views as logical tables.
Execute temporal views via reusable Lance temporal operators.

DuckDB:

Expose group members via table functions.
Expose group views via table functions.
Map temporal table functions to Lance temporal execution.

Spark:

Open groups as a data source.
Expose members as logical relations.
Expose views as DataFrames.
Push partitioned temporal reads down to Lance where possible.

Python:

Provide group APIs suitable for interactive use.
Materialize views back into Lance datasets.
Share the same group manifest table and temporal semantics with SQL connectors.

Implementation Structure

The suggested layering is as follows:

lance
  single dataset storage, query, index, and version primitives

lance-namespace
  name resolution for datasets and groups

lance-group
  group manifest table, members, relationships, and views

lance-temporal
  reusable temporal execution semantics and Rust execution kernels

lance-robotics
  robotics-oriented ingestion and presets built on Lance Group

Core storage capabilities that groups and temporal execution will use:

dataset version open;
manifest table scan;
projection, filter, limit, and ordered scan;
timestamp and range pruning;
sorted-by / clustered-by metadata;
timestamp statistics or scalar indices;
row-address and late materialization;
scan planning APIs for connectors.

Value

Lance Group turns a set of related datasets into a reproducible data product.

For robotics and multimodal workloads, the workflow becomes:

raw multi-topic logs
  -> topic/table group
  -> reproducible snapshot
  -> temporal alignment
  -> materialized feature/eval/training view
  -> random access / vector / multimodal scan

For ML and data engineering workloads, the same abstraction can support feature groups, evaluation datasets, benchmark suites, derived training tables, and multi-table data releases.

The end result is a single model shared across Python, DataFusion, DuckDB, Spark, and Lance-native applications:

Users can open a versioned group of related datasets, query or materialize consistent logical views across them, and reuse the same temporal semantics across the Lance ecosystem.

Open Design Items

API shape for group resources in the namespace.
Group commit behavior when pinning multiple member versions.
View specification format.
Relationship extension mechanism.
Temporal execution API shared across Python, DataFusion, DuckDB, and Spark.

westonpace · 2026-05-21T13:39:30Z

westonpace
May 21, 2026
Maintainer

What is the difference between a group and a namespace? Aren't both of them a collection of tables / datasets?

2 replies

Xuanwo May 21, 2026
Maintainer Author

Unlike a namespace, a group is more like a logical table over a collection of tables. A group needs to maintain the relationships between those tables.

Take this one as an example:

robot_run_001: Lance Group
  members:
    camera: s3://logs/run-001/camera.lance@v17
    imu: s3://logs/run-001/imu.lance@v42
    lidar: s3://logs/run-001/lidar.lance@v8
    calibration: s3://logs/run-001/calibration.lance@v3

  relationships:
    camera_imu_time:
      type: temporal_asof
      left: camera
      right: imu
      by: [session_id]
      left_time: timestamp
      right_time: timestamp
      direction: backward
      tolerance: 5ms

  views:
    camera_with_imu:
      type: temporal_align
      base: camera
      include: [imu]
      relationship: camera_imu_time

robot_run_001 is a Lance Group that packages several related Lance datasets from the same robot run. Each member, such as camera, imu, lidar, and calibration, is pinned to a specific dataset version for reproducibility. The group also defines a temporal as-of relationship between camera and imu, so each camera frame can be aligned with the nearest earlier IMU sample within 5 ms. The camera_with_imu view exposes that aligned result as a logical table.

Technically, users can implement all of that logic themselves, but lance-group provides it out of the box with a friendlier API.

westonpace May 22, 2026
Maintainer

This feels a little bit like a non-materialized view (spanning multiple tables) provided we have some mechanism for an as-of join.

wkalt · 2026-05-21T14:16:04Z

wkalt
May 21, 2026
Collaborator

If I have the same topic present in many groups, can I query the topic across groups? How is schema resolution handled if the schema of a topic changes incompatibly across groups?


jackye1995 · 2026-05-22T07:22:30Z

jackye1995
May 22, 2026
Maintainer

There are some related workstreams that we should probably discuss together. Let me share what I plan to add in a Lance Catalog (both directory and REST). I am wondering whether these would be enough for your use case, and whether we can just extend the catalog/namespace surface instead of creating another group concept.

Workstream 1: add more objects

The catalog spec has used the concept of an "object" since day 1, anticipating more object types than just table and namespace. We plan to add the following:

MV and view (we just added MV)
UDF
Lineage (UDTF, chunking UDF)

Basically, whatever exists in common database systems, I think it makes sense to be added, and we have very concrete customer requests asking for these items.

This is also a pretty natural development, since these objects mostly exist already in systems like Unity/Polaris/Gravitino/Iceberg REST. But I think we want to optimize our implementation against ML/AI use cases better based on our experience running products like Geneva.

Workstream 2: metadata/tag for all these objects

For each object, the directory catalog __manifest system table has a metadata column. My plan is to let it store all sorts of tags for these objects, and we can create JSON indexes to do different types of searches.

This also comes from a very concrete customer ask: they want to store around 100k tables in the catalog and want an easy way to search tables from agent prompts. They also want to annotate tables with long descriptions generated by an LLM.

But I feel this would also work for grouping objects in a catalog. For example, you can assign a table, a view, and their related UDTF the same tag group:xxx, so you can search based on it.

One related prototype I am working on is #6794: naively running updates against the system table would be very inefficient for this use case, and full copy-on-write more or less solves that problem. Currently the biggest problem with the directory catalog is that its throughput is bounded by the object store commit rate; it can hardly go beyond 6-7 TPS on S3, or at most 20 TPS on S3 Express. I am planning to explore sharding to scale up TPS as the next step.

