Merged
19 changes: 10 additions & 9 deletions docs/docs.json
@@ -63,6 +63,16 @@
"performance"
]
},
{
"group": "Model training",
"pages": [
"training/why-lancedb",
"training/index",
"training/torch",
"training/object-detection",
"training/vlm-finetuning"
]
},
{
"group": "Guides",
"pages": [
@@ -141,15 +151,6 @@
"storage/index",
"storage/configuration"
]
},
{
"group": "Training",
"pages": [
"training/index",
"training/torch",
"training/object-detection",
"training/vlm-finetuning"
]
}
]
},
109 changes: 71 additions & 38 deletions docs/index.mdx
@@ -3,51 +3,85 @@ title: LanceDB
sidebarTitle: "LanceDB"
description: "Multimodal lakehouse for AI."
icon: "/static/assets/logo/lancedb-icon-gray.svg"
keywords: ["open source", "oss"]
keywords: ["multimodal lakehouse", "training", "feature engineering", "search", "open source", "oss"]
---

**LanceDB** is a [multimodal lakehouse](https://lancedb.com/blog/multimodal-lakehouse/) for
AI, built on top of [Lance](/lance), an open-source lakehouse format. Below, we list a few
ways LanceDB can help you build and scale your AI and ML workloads.
**LanceDB** is a [multimodal lakehouse](https://lancedb.com/blog/multimodal-lakehouse/) for AI teams that need
one data layer for curation, feature engineering, search and retrieval, and model training.
It is built on top of [Lance](/lance), an open-source lakehouse format designed for multimodal AI data.

Move from data exploration to model training on one, unified platform without needing to manage a
fragmented stack of storage, feature, retrieval, and training systems.

## Build better models, faster

Training data and experimentation slow down when raw data, metadata, embeddings, features, and governance
artifacts live in separate systems. LanceDB keeps them together in one versioned multimodal table, so AI teams spend less
time stitching infrastructure together and more time improving datasets, testing features, and keeping GPUs fed.

![Training data lifecycle: Curation, Feature Engineering, Search and Retrieval, Training](/static/assets/images/overview/training-data-lifecycle.svg)

Use the same table to curate training data, add derived features, retrieve examples, and feed training jobs that rely on expensive GPUs.
Training workloads can sample, shuffle, and scan projected columns from local storage or object storage, then assemble
GPU-ready batches from a tagged dataset version.

For a deeper look at how this works in training pipelines, start with [Why LanceDB for training](/training/why-lancedb).

## LanceDB suite

The LanceDB suite includes LanceDB OSS, an open-source embedded retrieval library, and LanceDB Enterprise,
a multimodal lakehouse platform for the full AI data lifecycle.
OSS is easy to set up on a local machine for search and regular-scale workflows. LanceDB Enterprise is built
for teams that need scale without building bespoke infrastructure for curation,
feature engineering, search and retrieval, and efficient training data access.

![LanceDB suite: OSS search and Enterprise multimodal lakehouse on Lance format](/static/assets/images/overview/lancedb-suite.svg)

## Why teams use LanceDB

<Steps>
<Step title="High-performance random access and data management for model training">
Use LanceDB to curate, explore and distribute very large multimodal datasets for training and fine-tuning models.
LanceDB comes with built-in table versioning, schema evolution, and fast random access, making it far more efficient to do
dataset slicing, sampling, filtering and shuffles on large, rapidly evolving datasets.
<Step title="One table for the whole AI data loop">
Store images, video, audio, text, annotations, embeddings, and model-generated features together in one schema-enforced table.
The same table can support dataset curation, feature backfills, experiment splits, retrieval, and training.
</Step>
<Step title="High-throughput data access for training">
Training workloads mix fast random access with high-throughput sequential scans. LanceDB is designed for both, so
teams can shuffle data into GPU-ready batches more efficiently, improve input throughput, and iterate on experiments faster.
</Step>
<Step title="Massively scalable, fast and high-quality retrieval − without breaking the bank">
Use LanceDB as the data + retrieval layer for production AI workloads: RAG, agents, semantic search,
recommendation systems, and more.
Keep multimodal data, metadata, and embeddings in the same table and query them via vector search,
full-text search or SQL. Easily add new features (columns in your tables) as your
application evolves, without copying existing data.
<Step title="Fast, versatile search and retrieval">
Whether the end user is a human or an agent, LanceDB powers production retrieval workloads such as semantic search,
hybrid search, RAG, agent memory, and recommendation systems. Retrieval runs against the same LanceDB tables used
for curation, feature engineering, and training workflows.
</Step>
</Steps>

LanceDB is designed for a variety of workloads and deployment scenarios, and supports use cases
that are way beyond traditional vector search. The LanceDB suite includes LanceDB OSS, an open-source embedded library,
and LanceDB Enterprise, a distributed and managed multimodal lakehouse.
Both are built on top of the same open-source Lance format and table abstractions.

![](/static/assets/images/overview/lancedb-suite.png)
## Start with your workload

## Use cases

- **Search**: Build high-performance search and retrieval applications using LanceDB's optimized storage, including vector search, full-text search, and hybrid search with secondary indexes.
- **Data Curation**: Manage and filter on petabyte-scale multimodal datasets, including video and point cloud data, to gain insights, explore data and inform model development.
- **Feature engineering**: Add new columns (features), create embeddings, and transform your data at
scale. LanceDB lets you extend tables both vertically and horizontally with minimal I/O overhead.
- **Training**: Efficiently access and manage large-scale multimodal datasets for training and fine-tuning AI models.
<CardGroup cols={2}>
<Card title="Train and fine-tune models" icon="fire" href="/training/why-lancedb">
Learn why LanceDB works well as the data layer for training workloads.
</Card>
<Card title="Load data into PyTorch" icon="boxes-stacked" href="/training/">
Use LanceDB tables and permutations for projected, shuffled, random-access training reads.
</Card>
<Card title="Browse ready-to-use datasets" icon="database" href="/datasets">
Explore Lance-formatted multimodal datasets with raw bytes, metadata, embeddings, and indices.
</Card>
<Card title="Build search and retrieval" icon="search" href="/search/">
Use vector search, full-text search, hybrid search, reranking, filtering, and SQL.
</Card>
</CardGroup>

## Choose how you run LanceDB
## From local development to production scale

Depending on your needs, you can choose one of the following ways to run LanceDB.
LanceDB OSS and LanceDB Enterprise share the same Lance format and table model. Start locally with the embedded OSS
library, then move to Enterprise when your team needs distributed scale, managed infrastructure, private deployment,
or higher-throughput curation, feature engineering, search and retrieval, and training workflows.

### 1. LanceDB OSS
The fastest way to get started is the open-source embedded library, with client SDKs in Python, TypeScript
and Rust. Run it locally during development, then use the same data model and APIs as you scale up
and need a managed solution. Start here:
and Rust. Run it locally in just a few steps, which lets you explore datasets, curate data, and run search and retrieval workloads
for agents. Start here:

<Columns cols={2}>
<Card
@@ -59,19 +93,18 @@ and need a managed solution. Start here:
</Card>
<Card
title="Basic Table Operations"
icon="search"
icon="table"
href="/tables/"
>
Create tables, search vectors, and modify data in LanceDB.
Create tables, evolve schemas, version data, and modify rows in LanceDB.
</Card>
</Columns>

### 2. LanceDB Enterprise

[LanceDB Enterprise](/enterprise) is a distributed and managed **multimodal lakehouse** built for
search, curation, feature engineering, and training-oriented data access workflows
on top of the same core table abstraction. This eliminates the need for teams to build bespoke
infrastructure to manage petabyte-scale multimodal datasets.
[LanceDB Enterprise](/enterprise) is a petabyte-scale (and beyond), distributed **multimodal lakehouse** platform built for
search, curation, feature engineering, and high-throughput training data access workflows on top of the same core table
abstraction. This eliminates the need for teams to build bespoke infrastructure to manage large multimodal datasets.
To set up LanceDB Enterprise in your organization, reach out to us at
[contact@lancedb.com](mailto:contact@lancedb.com).

@@ -88,4 +121,4 @@ private deployments, and can operate under strict [security requirements](/enter
href="/enterprise/quickstart"
>
Get started with LanceDB Enterprise in minutes.
</Card>
</Card>
30 changes: 16 additions & 14 deletions docs/lance.mdx
@@ -5,15 +5,15 @@ description: "Open-source lakehouse format for multimodal AI."
icon: "/static/assets/logo/lance-logo-gray.svg"
---

[Lance](https://lance.org/) is an open-source lakehouse format, which provides the
foundation for LanceDB's capabilities. It provides a file format,
table format, and catalog spec with multimodal data at the center of its design, allowing developers
[Lance](https://lance.org/) is an open-source, columnar lakehouse format for multimodal AI.
It provides a file format, table format, and lightweight catalog spec, allowing developers
to build a complete open lakehouse on top of object storage.

Building on top of open foundations and optimizing the format for AI workloads brings
high-performance vector search, full-text search, random access, and feature engineering capabilities
to a single unified system ([LanceDB](/enterprise)), eliminating the need for bespoke ETL and data pipelines that move data
to multiple other specialized data systems.
Building on top of open foundations and optimizing the format for random access
(without compromising scan performance) enables
high-performance vector search, full-text search, indexing, and feature engineering capabilities.
[LanceDB](/enterprise) builds on these capabilities so teams can work with one multimodal data layer
instead of moving data across separate storage, search, feature, and training systems.

<Card
title="Lance format documentation"
@@ -23,15 +23,17 @@ to multiple other specialized data systems.
Visit the Lance format documentation to learn more about its design, features, and how it enables the multimodal lakehouse.
</Card>

## Advantages of the Lance format
## Capabilities of the Lance format

Advantage | Description
Capability | What it enables
--- | ---
Multimodal storage | Efficiently holds vectors, images, videos, audio, text, and more
Version control | Built-in data versioning for reproducible ML experiments and data lineage
ML-optimized | Designed for training and inference workloads with fast random access
Query performance | Columnar storage enables blazing-fast vector search and analytics
Cloud-native | Seamless integration with cloud object stores (S3, GCS, Azure Blob)
Multimodal storage | Store images, video, audio, text, embeddings, annotations, metadata, features, and more, all in one table.
First-class blob API | Store large binary objects such as images, video, audio, and model artifacts in blob columns with lazy reads and streaming byte access.
Fast random access and scans | Sample, shuffle, and retrieve individual rows efficiently without giving up high-throughput sequential reads.
Flexible data evolution | Add, drop, rename, or alter columns as datasets change, often without rewriting existing data files.
Versioned tables | Reproduce experiments, restore previous states, and tie downstream artifacts to the exact table version they used.
Hybrid search and indexing | Combine vector search, full-text search, and scalar filters on the same dataset with Lance indexes.
Open lakehouse interoperability | Build on object storage and connect Lance tables to open engines such as PyTorch, Ray, Spark, Trino, DuckDB and Polars.

## Key concepts

95 changes: 95 additions & 0 deletions docs/static/assets/images/overview/lancedb-suite.svg