feat: add compound (multi-column) scalar index #5480

tomsanbear · 2025-12-15T20:18:29Z

Summary

Introduces compound B-tree scalar indices to Lance, enabling efficient lookups on multi-column predicates using a single index structure instead of intersecting multiple single-column indices at query time.

Motivation

Many workloads filter on multiple columns simultaneously:

Multi-tenant + time-series: WHERE tenant_id = 'acme' AND timestamp > T
Status + time: WHERE status = 'active' AND created_at BETWEEN X AND Y
Hierarchical: WHERE region = 'us-west' AND department = 'engineering'

Compound indices store rows sorted by the combined key, following leftmost prefix semantics.
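To make the leftmost prefix idea concrete, here is a minimal sketch in which a sorted slice of (tenant_id, timestamp, row_id) tuples stands in for the compound index. The `Entry` alias and both function names are hypothetical illustrations, not the types or functions used in this PR.

```rust
// Illustrative sketch only: a sorted slice stands in for the compound index.
// `Entry`, `prefix_lookup`, and `prefix_range_lookup` are hypothetical names.

/// One index entry: (tenant_id, timestamp, row_id).
type Entry = (&'static str, i64, u64);

/// Row ids for `tenant_id = tenant` (a leftmost-prefix lookup).
/// `entries` must be sorted by (tenant_id, timestamp).
fn prefix_lookup(entries: &[Entry], tenant: &str) -> Vec<u64> {
    // All rows for one tenant are contiguous in the sorted order,
    // so two binary searches bound the matching range.
    let start = entries.partition_point(|e| e.0 < tenant);
    let end = entries.partition_point(|e| e.0 <= tenant);
    entries[start..end].iter().map(|e| e.2).collect()
}

/// Row ids for `tenant_id = tenant AND timestamp > after`.
fn prefix_range_lookup(entries: &[Entry], tenant: &str, after: i64) -> Vec<u64> {
    // Skip every key that is <= (tenant, after) in combined-key order.
    let start = entries.partition_point(|e| e.0 < tenant || (e.0 == tenant && e.1 <= after));
    // Stop at the first key that belongs to a later tenant.
    let end = entries.partition_point(|e| e.0 <= tenant);
    entries[start..end].iter().map(|e| e.2).collect()
}
```

A timestamp-only predicate gets no such contiguous range, because timestamps are only ordered within each tenant.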

Query Patterns Supported

Full key equality: col1 = X AND col2 = Y AND col3 = Z
Prefix lookup: col1 = X or col1 = X AND col2 = Y
Prefix + range: col1 = X AND col2 > Y
IN-list: col1 IN (...) or col1 = X AND col2 IN (...)
IS NULL: col1 = X AND col2 IS NULL
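The patterns above can be read as a rule for how many leading index columns a conjunction of predicates can use. The sketch below is one possible reading, not the parser from this PR: `Pred` and `usable_prefix_len` are hypothetical names, and it assumes that IN and IS NULL on a non-final column still allow later columns to be used, which the list above does not confirm.

```rust
// Illustrative sketch only; `Pred` and `usable_prefix_len` are hypothetical
// and do not mirror CompoundQueryParser.

/// Kinds of per-column predicate taken from the supported-patterns list.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Pred {
    Eq,
    In,
    IsNull,
    Range,
}

/// Count how many leading index columns the ANDed predicates can use.
/// Returns 0 when the first index column is unconstrained (no skip scan).
fn usable_prefix_len(index_cols: &[&str], preds: &[(&str, Pred)]) -> usize {
    let mut used = 0;
    for col in index_cols {
        match preds.iter().find(|(c, _)| c == col) {
            // Assumption: point-like predicates keep the prefix usable.
            Some((_, Pred::Eq | Pred::In | Pred::IsNull)) => used += 1,
            // A range uses this column but ends the usable prefix.
            Some((_, Pred::Range)) => {
                used += 1;
                break;
            }
            // A gap in the prefix stops matching.
            None => break,
        }
    }
    used
}
```

Under this reading, a filter on `timestamp` alone uses zero columns of a (tenant_id, timestamp) index, so it would fall back to another index or a scan.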

API

dataset
    .create_index(&["tenant_id", "timestamp"], IndexType::Scalar)
    .with_index_name("tenant_time_idx")
    .execute()
    .await?;

Benchmark Results

Benchmarks run on multi-tenant time-series data with queries like WHERE tenant_id = X AND timestamp > Y:

Single fragment dataset:

Scenario	No Index	BTree (tenant only)	Dual BTree	Compound	Compound vs BTree	Compound vs Dual
Tenant only	9.31ms	456µs	459µs	1.13ms	2.47x slower	2.45x slower
Tenant + narrow range	10.62ms	476µs	377µs	670µs	1.41x slower	1.78x slower
Tenant + wide range	9.38ms	514µs	1.67ms	959µs	1.87x slower	1.74x faster
Tenant + full range	8.92ms	512µs	3.09ms	1.15ms	2.25x slower	2.68x faster
Timestamp only	4.43ms	4.43ms	3.47ms	4.22ms	1.05x faster	1.22x slower

Multi-fragment dataset (more realistic production scenario):

Scenario	No Index	BTree (tenant only)	Dual BTree	Compound	Compound vs BTree	Compound vs Dual
Tenant only	8.55ms	2.85ms	2.70ms	2.92ms	1.02x slower	1.08x slower
Tenant + narrow range	8.23ms	2.08ms	472µs	693µs	3.00x faster	1.47x slower
Tenant + medium range	7.43ms	2.31ms	2.77ms	2.07ms	1.11x faster	1.34x faster
Tenant + wide range	7.69ms	1.89ms	767µs	818µs	2.32x faster	1.07x slower
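For readers checking the comparison columns, the small sketch below shows how a "faster/slower" ratio can be derived from two timings. The function name `compare` is hypothetical and this is not the benchmark harness; the table values were presumably computed from unrounded timings, so recomputing from the rounded figures can differ in the last digit (for example 1.13ms against 456µs gives 2.48x, where the table shows 2.47x).

```rust
// Illustrative sketch only; `compare` is a hypothetical helper.

/// Describe how `candidate_us` compares with `baseline_us` (both in microseconds).
fn compare(candidate_us: f64, baseline_us: f64) -> String {
    if candidate_us <= baseline_us {
        // Candidate took less time: report how many times faster it is.
        format!("{:.2}x faster", baseline_us / candidate_us)
    } else {
        // Candidate took more time: report how many times slower it is.
        format!("{:.2}x slower", candidate_us / baseline_us)
    }
}
```

For the multi-fragment "Tenant + narrow range" row, `compare(693.0, 2080.0)` reproduces the Compound vs BTree entry and `compare(693.0, 472.0)` reproduces the Compound vs Dual entry.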

Production Experience

We've been running this implementation in our product with beta customers and are seeing stable, positive results. The index has been exercised against real multi-tenant time-series workloads.

A Note on the Implementation

I understand this is a large change. Many decisions, particularly around introducing CompoundSargableQuery as a separate type and the associated changes for managing index creation, were made pragmatically to ease the initial implementation and reduce the effort of maintaining this fork until we could upstream it.

Limitations

2-8 columns per index: arbitrarily decided; I was unsure where this configuration should live or whether it should be unbounded
OR conditions not directly supported
LIKE/pattern matching not supported
No support for Postgres-like "skip scans"; the leftmost prefix needs to match

chatgpt-codex-connector · 2025-12-15T20:18:34Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

wjones127 · 2025-12-15T20:24:59Z

It looks like your benchmark only compared to having a BTree index on one of the columns. Could you share results if you create a BTree index on both columns? Our query engine can combine the results of multiple index lookups, so I'd be curious how that compared to a compound index.

tomsanbear · 2025-12-15T21:07:51Z

Reran the benchmark, will push an update to it later but here's a look at the initial result:

Benchmarks run on multi-tenant time-series data with queries like WHERE tenant_id = X AND timestamp > Y:

Single fragment dataset:

Scenario	No Index	BTree (tenant only)	Dual BTree	Compound	Compound vs BTree	Compound vs Dual
Tenant only	9.31ms	456µs	459µs	1.13ms	2.47x slower	2.45x slower
Tenant + narrow range	10.62ms	476µs	377µs	670µs	1.41x slower	1.78x slower
Tenant + wide range	9.38ms	514µs	1.67ms	959µs	1.87x slower	1.74x faster
Tenant + full range	8.92ms	512µs	3.09ms	1.15ms	2.25x slower	2.68x faster
Timestamp only	4.43ms	4.43ms	3.47ms	4.22ms	1.05x faster	1.22x slower

Multi-fragment dataset (more realistic production scenario):

Scenario	No Index	BTree (tenant only)	Dual BTree	Compound	Compound vs BTree	Compound vs Dual
Tenant only	8.55ms	2.85ms	2.70ms	2.92ms	1.02x slower	1.08x slower
Tenant + narrow range	8.23ms	2.08ms	472µs	693µs	3.00x faster	1.47x slower
Tenant + medium range	7.43ms	2.31ms	2.77ms	2.07ms	1.11x faster	1.34x faster
Tenant + wide range	7.69ms	1.89ms	767µs	818µs	2.32x faster	1.07x slower

wjones127

My immediate concerns on this PR aren't really about the compound index itself, but about the changes to expression handling and other parts of the generic index code. I think this would be the first index that covers multiple columns, and I think it needs careful design. I think we need a design discussion on that level before we are ready to move forward on the compound index itself.

I think reviewing both the API for multi-column indices and the concrete impl of the compound index might be too much. I think we should first outline the API changes needed to support multi-column indices (how are queries routed to them and run, for example).

I think the implementation itself looks pretty solid. There are lots of good tests, including property testing which I like to see.

westonpace · 2025-12-19T14:58:47Z

I agree this probably needs to be broken up (11K LOC for one PR is a little daunting 😰). That being said it looks like a lot of well tested work. Thank you for starting this effort!

I also agree with Will that a good start should be on supporting indexes on multiple columns both in the table format and the scanner. This is likely to be useful for other indexes as well. Maybe a good order can be...

Table format support for multiple indexes
Compound sargable query and its parser
Scanner support for multiple indexes
Compound btree index search and simple train
Distributed training

At a glance it seems very reasonable and there is historical tradition for these kinds of compound indexes. However, like Will, I am also interested in understanding how compound indexes compare to multiple individual indexes.

I do believe they can speed up certain classes of queries. Tenant + narrow range makes sense to me as the winner since the alternative requires crafting the entire bitmap of tenant which can be slow. Although in that case I might wonder if a bitmap index on tenant plus a btree index on the range column would perform similarly to this compound case.
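The cost being described can be sketched as follows. This is a toy model, not Lance's query engine: `dual_index_lookup` is a hypothetical name, and a linear filter over an in-memory slice stands in for each single-column index lookup. It only illustrates that the two-index path materializes a row-id set per predicate before intersecting them.

```rust
// Illustrative sketch only; the filters below stand in for index lookups.
use std::collections::BTreeSet;

/// Dual-index path for `tenant_id = tenant AND timestamp > after`.
/// `rows[i]` is (tenant_id, timestamp) for row id `i`.
/// Returns the matching row ids and how many ids were materialized first.
fn dual_index_lookup(rows: &[(&str, i64)], tenant: &str, after: i64) -> (BTreeSet<u64>, usize) {
    // Stand-in for the tenant index: every row id for this tenant.
    let by_tenant: BTreeSet<u64> = rows
        .iter()
        .enumerate()
        .filter(|(_, r)| r.0 == tenant)
        .map(|(i, _)| i as u64)
        .collect();
    // Stand-in for the timestamp index: every row id in range, for all tenants.
    let by_time: BTreeSet<u64> = rows
        .iter()
        .enumerate()
        .filter(|(_, r)| r.1 > after)
        .map(|(i, _)| i as u64)
        .collect();
    // Only the intersection is the answer; the rest was intermediate work.
    let result: BTreeSet<u64> = by_tenant.intersection(&by_time).copied().collect();
    (result, by_tenant.len() + by_time.len())
}
```

With a large tenant and a narrow range, the intermediate sets can be far larger than the final result, which is the overhead a single sorted compound key avoids.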

tomsanbear · 2025-12-20T20:37:21Z

Appreciate the feedback, that makes sense. I figured this would need several iterations and probably redesigns. For the most part I wanted to get a POC done to evaluate it against other indexing strategies, and it achieved that on our side.

I guess my next question is: what is the best way to approach the proper design and development stages for this? I'm interested in contributing to this, along with any parallel or prerequisite groundwork that you feel might be needed before a multi-column index can go in.

Introduces compound B-tree scalar indices enabling efficient lookups on multi-column predicates using a single index structure.

Features:
- CompoundSargableQuery and CompoundScalarQuery types for multi-column queries
- CompoundBTreeIndex with load(), search(), update(), and merge support
- CompoundQueryParser for extracting multi-column predicates from expressions
- Per-column page statistics for query pruning
- Support for prefix lookups, range queries, IN-lists, and IS NULL
- Fragment reuse remapping for compound indices during compaction
- Leftmost prefix rule semantics (2-8 columns)

API:

    dataset
        .create_index(&["tenant_id", "timestamp"], IndexType::Scalar)
        .with_index_name("tenant_time_idx")
        .execute()
        .await?;

Query patterns supported:
- Full key equality: col1 = X AND col2 = Y AND col3 = Z
- Prefix lookup: col1 = X or col1 = X AND col2 = Y
- Prefix + range: col1 = X AND col2 > Y
- IN-list: col1 IN (...) or col1 = X AND col2 IN (...)
- IS NULL: col1 = X AND col2 IS NULL

Adds a fourth benchmark scenario with two separate BTree indices (one on tenant_id, one on timestamp) to compare against the compound index.

Key findings:
- Compound index is 2-3x faster than single BTree for multi-column queries
- Dual BTree can outperform compound for very narrow range queries
- Compound wins for medium/wide ranges by avoiding intersection overhead

github-actions bot added the enhancement New feature or request label Dec 15, 2025

wjones127 self-assigned this Dec 15, 2025

wjones127 reviewed Dec 18, 2025

tomsanbear added 2 commits January 15, 2026 10:36

tomsanbear force-pushed the feat/compound-index-clean branch from 411d221 to 4871f9c Compare January 15, 2026 17:06

tomsanbear closed this Jan 15, 2026
