Multi-Segment FTS Scoring Semantics: Choosing Between Local and Global BM25 #6789

Xuanwo (Maintainer) · 2026-05-14T14:36:36Z

Abstract

Multi-segment FTS allows a single logical FTS index to be composed of multiple physical segments. This introduces a user-visible semantic issue for BM25 scoring: should BM25 corpus statistics be computed independently for each segment, or uniformly across the entire logical FTS index?

This proposal is not about removing global BM25. It aims to make the scoring semantics of multi-segment FTS explicit, so that users can choose between two modes:

Local: each segment uses its own BM25 statistics. This is better suited for large-scale and distributed queries, but _score values are only approximately comparable across segments.
Global: all target segments use a unified set of BM25 statistics. This produces ranking closer to a single merged FTS index, but requires more coordination overhead at query time.

We hope to use this proposal to drive community discussion on: how Lance should define the default scoring behavior for multi-segment FTS, and how the semantics of _score and top-k ranking should be communicated to users.

Background

Lance is extending its multi-segment index capabilities so that large tables can be built and queried through multiple independent index artifacts. For FTS, this means a logical FTS index may no longer correspond to a single physical inverted index, but instead be composed of multiple committed FTS segments.

In a single FTS index, BM25 corpus statistics have a natural definition: they come from the document collection covered by that index. With multi-segment FTS, the boundary of the corpus is no longer uniquely defined.

One interpretation is that each segment is its own scoring corpus. At query time, each segment scores independently, and the results are then merged.

Another interpretation is that all segments under a logical FTS index together form a single scoring corpus. At query time, global statistics are obtained first, and then all segments score documents using the same set of BM25 statistics.

This is not merely an implementation detail. It affects the _score values users see, and can also influence top-k results when a limit is applied.

User-Visible Semantics

FTS query results carry at least two layers of semantics:

matching: which documents satisfy the query.
ranking: how matching documents are ordered by _score.

The scoring mode of multi-segment FTS should not alter matching correctness. Regardless of whether local or global scoring is used, a document that satisfies the query conditions must always be considered a match.

The difference appears at the ranking level. BM25 scores depend on corpus-level statistics such as:

total number of documents
total number of tokens
average document length
document frequency of query terms

If these statistics come from different segments, _score values produced by different segments are only approximately comparable. If the statistics come from the entire logical FTS index, the _score values will be closer to what a single merged index would produce.
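To make the dependence on corpus statistics concrete, here is a minimal sketch of BM25 scoring in Python. It assumes the common Lucene-style idf and the conventional defaults k1 = 1.2 and b = 0.75. `CorpusStats` and `bm25` are illustrative names, not Lance APIs, the numbers are made up, and Lance's exact formula may differ.

```python
import math
from dataclasses import dataclass


@dataclass
class CorpusStats:
    num_docs: int      # total number of documents (N)
    total_tokens: int  # total number of tokens across all documents
    doc_freq: dict     # term -> document frequency (df_t)

    @property
    def avgdl(self) -> float:
        # average document length
        return self.total_tokens / self.num_docs


def bm25(tf, doc_len, stats, k1=1.2, b=0.75):
    """Score one document; tf maps each query term to its frequency in the doc."""
    score = 0.0
    for term, freq in tf.items():
        df = stats.doc_freq.get(term, 0)
        if df == 0 or freq == 0:
            continue
        idf = math.log(1.0 + (stats.num_docs - df + 0.5) / (df + 0.5))
        norm = freq + k1 * (1.0 - b + b * doc_len / stats.avgdl)
        score += idf * freq * (k1 + 1.0) / norm
    return score


# The same document scored against two different corpus boundaries.
doc_tf, doc_len = {"lance": 2}, 10
segment_stats = CorpusStats(num_docs=100, total_tokens=1_000, doc_freq={"lance": 50})
index_stats = CorpusStats(num_docs=10_000, total_tokens=120_000, doc_freq={"lance": 60})

local_score = bm25(doc_tf, doc_len, segment_stats)
global_score = bm25(doc_tf, doc_len, index_stats)
```

In this toy setup the term looks common inside the segment but rare across the whole index, so the same document gets a noticeably higher score under index-wide statistics.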

Therefore, multi-segment FTS must clearly tell users whether the default _score is calculated on a segment-local basis or on a logical-index-global basis.

Proposed Scoring Modes

Local

In Local mode, each segment scores documents using its own BM25 statistics.

Semantics:

Matching correctness is preserved.
_score is based on segment-local corpus statistics.
Cross-segment _score values are approximate and not guaranteed to be strictly comparable across segments.
Top-k ranking with a limit may differ from that of a single merged FTS index.
Queries do not require an initial aggregation of BM25 statistics across all segments.

This mirrors the default trade-off in many distributed search systems. For example, Elasticsearch defaults to query_then_fetch with shard-local scoring; when global term statistics are needed, users must explicitly use dfs_query_then_fetch.
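A minimal sketch of Local mode for a single-term query, assuming in-memory segments and the same Lucene-style BM25 as above. `local_top_k` and `search_local` are hypothetical helper names, and the toy data is chosen so that the query term is common in one segment and rare in the other.

```python
import heapq
import math

K1, B = 1.2, 0.75  # conventional BM25 defaults, assumed here


def bm25_term(tf, doc_len, n_docs, avgdl, df):
    idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
    return idf * tf * (K1 + 1.0) / (tf + K1 * (1.0 - B + B * doc_len / avgdl))


def local_top_k(docs, k):
    """Score one segment using only its own statistics.

    docs maps doc_id -> (tf of the query term, document length).
    """
    n_docs = len(docs)
    avgdl = sum(doc_len for _, doc_len in docs.values()) / n_docs
    df = sum(1 for tf, _ in docs.values() if tf > 0)
    hits = [
        (bm25_term(tf, doc_len, n_docs, avgdl, df), doc_id)
        for doc_id, (tf, doc_len) in docs.items()
        if tf > 0
    ]
    return heapq.nlargest(k, hits)


def search_local(segments, k):
    """Merge per-segment top-k lists by their segment-local scores."""
    candidates = [hit for docs in segments for hit in local_top_k(docs, k)]
    return heapq.nlargest(k, candidates)


seg_a = {"a1": (3, 10), "a2": (1, 10), "a3": (1, 10)}                # term is common here
seg_b = {"b1": (1, 10), "b2": (0, 10), "b3": (0, 10), "b4": (0, 10)}  # term is rare here

local_best = search_local([seg_a, seg_b], k=1)[0][1]
merged_best = local_top_k({**seg_a, **seg_b}, k=1)[0][1]
```

With this data, Local mode returns `b1` as the top hit because the term is rare within its segment, while scoring the same documents as one merged corpus returns `a1`. Both documents match in either case; only the ranking differs.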

Global

In Global mode, the target segments score documents using unified BM25 statistics.

Semantics:

Matching correctness is preserved.
_score is based on corpus-wide statistics of the logical FTS index.
Top-k ranking is closer to that of a single merged FTS index.
Queries require extra coordination to obtain global BM25 statistics.

This mode suits scenarios where users care more about consistent global relevance ranking than about minimizing query coordination overhead.

Why Both Modes Should Exist

Local and Global are not a matter of right or wrong. They represent reasonable trade-offs for different workloads.

Local is more suitable for:

tables with many segments.
distributed queries.
users who primarily care about recalling matching documents, not strict global _score consistency.
low-latency or high-concurrency scenarios.

Global is more suitable for:

users who rely on _score for precise relevance ranking.
users who want multi-segment FTS behavior to stay as close as possible to a single merged FTS index.
scenarios with a small number of segments, where the extra coordination cost is acceptable.
debugging, evaluation, and ranking-sensitive workloads.

Lance should not hide one of these semantics inside the implementation. A cleaner approach is to expose this trade-off explicitly to users.

Default Behavior

This is the question on which this proposal most seeks community input.

One option is to default to Local. This would make multi-segment FTS better suited out of the box for large-scale and distributed queries, and would also align more closely with the default distributed search semantics of systems like Elasticsearch. The trade-off is that _score and top-k ranking are not guaranteed to be fully consistent with a merged index.

Another option is to default to Global. This would give users a stronger expectation of score consistency and stay closer to the current correctness-first implementation. However, as the number of segments grows, it makes query-time global coordination a default cost.

Whichever default is chosen, the documentation should clearly state:

Matching correctness is not affected by the scoring mode.
The scoring mode affects _score.
The scoring mode may affect top-k results when a limit is applied.
If users need ranking close to a merged index, they should choose Global.
If users care more about scalability and query latency, they should choose Local.

API Naming

Naming also merits community discussion.

One pair of names is:

Local
Global

The advantage is that they directly describe the source of BM25 statistics, without requiring users to understand Elasticsearch terminology.

Another pair of names is:

QueryThenFetch
DfsQueryThenFetch

The advantage is alignment with Elasticsearch terminology; users familiar with distributed search will immediately understand the trade-off. The downside is that it ties Lance’s API naming to Elasticsearch’s execution model.

I lean towards Local / Global, because they more directly describe the semantics Lance needs to convey: the scope of the scoring statistics.

Relationship to Existing Behavior

The current global scoring path in multi-segment FTS can be seen as the implementation foundation for Global mode. Its value remains: it provides more consistent cross-segment relevance scoring, and can also serve as a contrasting semantic for Local mode.

Introducing Local mode does not mean global BM25 is wrong; it simply acknowledges that, in large-scale distributed scenarios, requiring global BM25 statistics for every query by default may not be the most appropriate trade-off.

Therefore, the core of this proposal is not “remove global BM25”, but:

Multi-segment FTS should make BM25 scoring semantics explicit, so users can choose between scalable local scoring and globally consistent scoring.

Community Discussion Questions

We especially hope the community will provide feedback on these questions:

Should Lance multi-segment FTS default to Local or Global?
Do users expect _score to be globally comparable by default in multi-segment FTS?
Are Local / Global suitable user-facing names?
Is it necessary to align with Elasticsearch’s query_then_fetch / dfs_query_then_fetch terminology?
How should the documentation describe that top-k ranking may differ from a merged index under local scoring?
In which scenarios are users most likely to need Global?
Should the planner be allowed to automatically choose global scoring when the number of segments is very small, or must the choice be entirely user-driven?

Conclusion

Multi-segment FTS exposes the BM25 corpus boundary problem that was previously hidden inside a single index. We need to define it as clear user-facing semantics rather than leaving it buried in implementation details.

This proposal is not about whether global BM25 should exist, but about how Lance should present and expose this choice to users: prioritize better distributed scalability by default, or prioritize stronger global score consistency by default.

jackye1995 (Maintainer) · 2026-05-14T17:23:07Z

I remember from earlier research that Lucene uses global BM25. I looked further into the details of Lucene vs. Elasticsearch, and here are the findings:

There are actually two different boundaries here.

In Lucene, an index can have many physical segments, but BM25 statistics are computed at the IndexSearcher / top-level reader level. IndexSearcher aggregates term and field statistics across all leaf readers / segments, and BM25Similarity uses those aggregated stats for IDF and average document length. So Lucene’s default behavior is global BM25 across Lucene segments.

Elasticsearch adds another layer above Lucene: shards. Each Elasticsearch shard is itself a Lucene index. By default, Elasticsearch uses query_then_fetch, meaning each shard scores using its own shard-local Lucene statistics, and the coordinator merges shard top-k results. However, inside each shard, Lucene is still doing global BM25 across that shard’s Lucene segments.

So the default Elasticsearch behavior is:

global BM25 across Lucene segments within one shard
local BM25 across Elasticsearch shards
coordinator merges shard results afterward

Elasticsearch also has dfs_query_then_fetch, which runs an extra DFS phase to collect term / collection statistics across shards before scoring. That makes scoring closer to global BM25 across all searched shards, but with extra coordination cost.

The scale is important. An Elasticsearch shard is not comparable to a tiny Lucene segment. Elasticsearch defaults to one primary shard per index, and its data stream lifecycle default rollover is around 50GB or 200M documents per primary shard. Lucene segments are much smaller internal storage / merge units: Lucene flushes new segments by default around a 16MB RAM buffer, and the default tiered merge policy targets a max merged segment around 5GB during normal merging. But Lucene does not treat each of those segments as a separate BM25 corpus.

Mapping this back to Lance:

Our persisted FTS segment definitely maps more closely to an Elasticsearch shard than to a Lucene internal segment. The 200M rows default from Elasticsearch also seems like a reasonable default for us to consider. If we follow that model, then the default should be Local, with optional Global for ranking-sensitive queries. This also means the initial iteration can be just local scoring.
MemWAL is different. It maintains an in-memory FTS index, and when we flush the MemTable, that index is flushed as a Lance FTS index too. These flushed segments are much smaller and map more closely to Lucene segment sizes, so this case is a gray area. We should probably benchmark whether local scoring is good enough or whether global scoring is needed. Global might be needed here because the size difference between these flushed segments and persisted FTS segments can be very large, which may make local BM25 scores less comparable. cc @touch-of-grey and @hamersaw

I still think Local / Global are better API names than query_then_fetch / dfs_query_then_fetch, because the latter describe Elasticsearch’s execution model while the real user-facing choice is the scope of BM25 statistics.

0 replies

wjones127 (Maintainer) · 2026-05-14T17:37:07Z

Thanks for writing this up. I did a little research to understand what the "coordination overhead" we are optimizing here actually is.

Why Global generally requires two phases

BM25 scoring per (doc, query) needs both per-doc data (tf, |d|) and corpus-level data (idf per query term, avgdl). In a single-segment index these come from the same source so there's no coordination question. In multi-segment FTS, the per-doc data lives in segments but the corpus-level data is a function of all segments.

The corpus-level inputs are additive across segments:

N = Σ N_i
avgdl = (Σ sumdl_i) / N
df_t = Σ df_t,i for each query term t
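The additive relations above can be sketched as follows. `SegmentSummary`, `aggregate`, and `idf` are hypothetical names, the numbers are made up, and the idf shown is the common Lucene-style form, which may not match Lance exactly.

```python
import math
from dataclasses import dataclass


@dataclass
class SegmentSummary:
    num_docs: int     # N_i
    sum_doc_len: int  # sumdl_i
    doc_freq: dict    # term -> df_t,i, for the query terms only


def aggregate(summaries):
    """Combine per-segment summaries into exact global (N, avgdl, df_t)."""
    n = sum(s.num_docs for s in summaries)
    avgdl = sum(s.sum_doc_len for s in summaries) / n
    df = {}
    for s in summaries:
        for term, count in s.doc_freq.items():
            df[term] = df.get(term, 0) + count
    return n, avgdl, df


def idf(n, df_t):
    return math.log(1.0 + (n - df_t + 0.5) / (df_t + 0.5))


summaries = [
    SegmentSummary(num_docs=100, sum_doc_len=1_000, doc_freq={"lance": 50}),
    SegmentSummary(num_docs=900, sum_doc_len=11_000, doc_freq={"lance": 10}),
]
n, avgdl, df = aggregate(summaries)

local_idf = idf(100, 50)           # what the first segment would use on its own
global_idf = idf(n, df["lance"])   # what every segment uses after aggregation
```

Because idf changes once the summaries are combined, anything derived from it, including score upper bounds, changes as well.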

So in principle the coordinator can compute exact global stats from a small per-segment summary. The catch is that segments need those stats before they can score, and they need to score before they can prune — and pruning is what makes FTS fast. Inverted-index scoring relies on techniques like WAND / block-max WAND that skip large portions of the posting lists by comparing per-block upper bounds against the current top-k threshold. Those upper bounds are functions of idf, which is a function of df_t and N, which are global.

This is the structural reason two phases show up:

sequenceDiagram
    participant C as Coordinator
    participant S1 as Segment 1
    participant S2 as Segment 2
    participant SN as Segment N

    Note over C,SN: Phase 1: gather corpus statistics
    C->>S1: query terms [q1, q2, ...]
    C->>S2: query terms [q1, q2, ...]
    C->>SN: query terms [q1, q2, ...]
    S1-->>C: (N_1, sumdl_1, df_t,1)
    S2-->>C: (N_2, sumdl_2, df_t,2)
    SN-->>C: (N_N, sumdl_N, df_t,N)

    Note over C: Aggregate:<br/>N = Σ N_i<br/>avgdl = Σ sumdl_i / N<br/>df_t = Σ df_t,i

    Note over C,SN: Phase 2: score with global stats
    C->>S1: globals (N, avgdl, df_t)
    C->>S2: globals (N, avgdl, df_t)
    C->>SN: globals (N, avgdl, df_t)
    S1-->>C: top-k with global BM25
    S2-->>C: top-k with global BM25
    SN-->>C: top-k with global BM25

    Note over C: Merge top-k across segments

The two phases are not about payload size — the per-segment summary in phase 1 is O(|query terms|) plus two scalars, which is tiny. They're about a sequential dependency: segments need globals before they can usefully score and prune.
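The sequential dependency can be sketched in a few lines. This is a toy, in-memory model: `Segment`, `summary`, `top_k`, and `search_global` are hypothetical names, documents are plain token lists, and the loops run sequentially where a real coordinator would fan out to segments in parallel.

```python
import heapq
import math

K1, B = 1.2, 0.75  # conventional BM25 defaults, assumed here


class Segment:
    def __init__(self, docs):
        self.docs = docs  # doc_id -> list of tokens

    def summary(self, terms):
        """Phase 1: return (N_i, sumdl_i, df_t,i) for the query terms."""
        df = {t: sum(1 for toks in self.docs.values() if t in toks) for t in terms}
        return len(self.docs), sum(len(toks) for toks in self.docs.values()), df

    def top_k(self, terms, n, avgdl, df, k):
        """Phase 2: score with the coordinator-supplied global statistics."""
        hits = []
        for doc_id, toks in self.docs.items():
            score = 0.0
            for t in terms:
                tf = toks.count(t)
                if tf:
                    idf = math.log(1.0 + (n - df[t] + 0.5) / (df[t] + 0.5))
                    score += idf * tf * (K1 + 1.0) / (tf + K1 * (1.0 - B + B * len(toks) / avgdl))
            if score > 0.0:
                hits.append((score, doc_id))
        return heapq.nlargest(k, hits)


def search_global(segments, terms, k):
    # Phase 1: gather and aggregate the additive statistics.
    summaries = [seg.summary(terms) for seg in segments]
    n = sum(s[0] for s in summaries)
    avgdl = sum(s[1] for s in summaries) / n
    df = {t: sum(s[2][t] for s in summaries) for t in terms}
    # Phase 2 cannot start until phase 1 has finished for every segment.
    partial = [seg.top_k(terms, n, avgdl, df, k) for seg in segments]
    return heapq.nlargest(k, (hit for hits in partial for hit in hits))


docs_1 = {"a1": ["lance"] * 3 + ["db"] * 7, "a2": ["lance"] + ["db"] * 9}
docs_2 = {"b1": ["lance"] + ["db"] * 9, "b2": ["db"] * 10}

split_result = search_global([Segment(docs_1), Segment(docs_2)], ["lance"], k=2)
merged_result = search_global([Segment({**docs_1, **docs_2})], ["lance"], k=2)
```

In this sketch the two-segment result is identical to scoring a single merged segment, which is the property Global mode is meant to provide.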

2 replies

wjones127 May 14, 2026
Maintainer

@Xuanwo One question I have is whether this is really much of a concern when we are running on a single node. Is this purely a distributed search concern?

Xuanwo May 15, 2026
Maintainer Author

Yes, it's not a big difference on a single node. It is purely a distributed search concern.

wjones127 (Maintainer) · 2026-05-14T17:46:03Z

Alternate Proposal: `LocalWithGlobalRescore` as an Alternative to `Local`

I was talking to an LLM about this and came up with an alternative to Local that we might want to consider.

Summary

I'd like to propose a third scoring mode for multi-segment FTS that I think is worth considering as an alternative to Local. It gets essentially the same latency profile as Local (one RTT, segment-local pruning), but produces _score values that are globally comparable and a top-k ranking that's much closer to a merged index.

The core idea: each segment returns its top-K' candidates (K' > k) along with the per-doc sufficient statistics needed to recompute BM25. The coordinator combines per-segment stat summaries into global stats and rescores the union of candidates using the global stats. Final top-k is selected from the rescored set.

Mechanics

Each segment, in a single response, returns:

Its local stats summary: (N_i, sumdl_i, df_t,i for each query term t)
Its top-K' candidates, each as (doc_id, |d|, {tf_t,d for each query term})

The coordinator:

Aggregates summaries: N = Σ N_i, avgdl = (Σ sumdl_i) / N, df_t = Σ df_t,i
Computes idf_t from global df_t and N
For each candidate from each segment, computes the exact global BM25 score using the doc's |d| and {tf_t,d} against global idf_t and avgdl
Selects top-k from the unioned, rescored candidate set
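The coordinator side of these steps could look roughly like the sketch below. `Candidate`, `SegmentResponse`, and `rescore` are hypothetical names, and the BM25 form (Lucene-style idf, k1 = 1.2, b = 0.75) is an assumption rather than Lance's confirmed formula.

```python
import heapq
import math
from dataclasses import dataclass

K1, B = 1.2, 0.75  # conventional BM25 defaults, assumed here


@dataclass
class Candidate:
    doc_id: str
    doc_len: int  # |d|
    tf: dict      # term -> tf_t,d for each query term


@dataclass
class SegmentResponse:
    num_docs: int     # N_i
    sum_doc_len: int  # sumdl_i
    doc_freq: dict    # term -> df_t,i
    candidates: list  # the segment's local top-K'


def rescore(responses, k):
    """Aggregate stats, rescore every candidate with global BM25, keep top-k."""
    n = sum(r.num_docs for r in responses)
    avgdl = sum(r.sum_doc_len for r in responses) / n
    df = {}
    for r in responses:
        for term, count in r.doc_freq.items():
            df[term] = df.get(term, 0) + count
    idf = {t: math.log(1.0 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    scored = []
    for r in responses:
        for cand in r.candidates:
            norm = K1 * (1.0 - B + B * cand.doc_len / avgdl)
            score = sum(
                idf[t] * tf * (K1 + 1.0) / (tf + norm)
                for t, tf in cand.tf.items()
                if tf > 0
            )
            scored.append((score, cand.doc_id))
    return heapq.nlargest(k, scored)


responses = [
    SegmentResponse(3, 30, {"lance": 3}, [Candidate("a1", 10, {"lance": 3})]),
    SegmentResponse(4, 40, {"lance": 1}, [Candidate("b1", 10, {"lance": 1})]),
]
top = rescore(responses, k=1)
```

Note that the coordinator never needs the segment-local scores: doc length and term frequencies are enough to recompute the score once the global statistics are known.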

sequenceDiagram
    participant C as Coordinator
    participant S1 as Segment 1
    participant S2 as Segment 2
    participant SN as Segment N
 
    C->>S1: query, k, K'
    C->>S2: query, k, K'
    C->>SN: query, k, K'
 
    Note over S1,SN: Each segment:<br/>1. Score locally for pruning<br/>2. Take top-K' candidates<br/>3. Emit (doclen, tf vector) per candidate<br/>4. Emit local stats summary
 
    S1-->>C: stats_1 + K' candidates with (doclen, tfs)
    S2-->>C: stats_2 + K' candidates with (doclen, tfs)
    SN-->>C: stats_N + K' candidates with (doclen, tfs)
 
    Note over C: Aggregate stats → global (N, avgdl, df_t)<br/>Rescore all candidates with global BM25<br/>Select top-k from rescored union

Properties

Latency. One RTT, same as Local. Segments do their normal posting-list traversal with local-idf pruning. No coordination round trip.

Score semantics. _score values returned to the user are exact global BM25, computed against the true global corpus stats. They are directly comparable across segments without the cross-segment scale issues of Local.

Ranking accuracy. The final top-k is exact for any doc that made it into a segment's top-K'. A doc only fails to appear in the final top-k if it ranked below K' in its own segment under local scoring but would have ranked in the top-k under global scoring. This failure mode shrinks rapidly as K' grows.

Payload. Each candidate adds O(|query terms|) integers (one tf per term, plus |d| and doc_id). For typical FTS queries (a handful of terms) and reasonable K' (a few hundred), per-segment response size is bounded and modest.
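As a rough back-of-the-envelope check on the payload claim, the sketch below counts integers per segment response. The function name is hypothetical, and the byte estimate assumes fixed-width 8-byte integers with no compression, which is an assumption and not something the proposal specifies.

```python
def response_ints(num_query_terms, k_prime):
    """Count the integers in one segment response."""
    per_candidate = num_query_terms + 2  # one tf per term, plus |d| and doc_id
    summary = num_query_terms + 2        # one df per term, plus N_i and sumdl_i
    return k_prime * per_candidate + summary


ints = response_ints(num_query_terms=4, k_prime=300)
approx_bytes = ints * 8  # assumed 8-byte integers
```

For a 4-term query with K' = 300 this comes to 1,806 integers, or about 14 KB per segment under that encoding assumption.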

Comparison to `Local` and `Global`

| Property | `Local` | `LocalWithGlobalRescore` | `Global` (two-phase) |
| --- | --- | --- | --- |
| Round trips | 1 | 1 | 2 |
| `_score` semantics | Segment-local | Global, exact | Global, exact |
| `_score` cross-segment comparability | Approximate | Exact | Exact |
| Top-k ranking | Approximate | Exact for survivors of local top-K' | Exact |
| Per-doc payload | doc_id, score | doc_id, `\|d\|`, `{tf_t,d}` | doc_id, score |
| Segment-side pruning | Local idf | Local idf | Global idf (WAND) |
| Failure mode | Score and ranking distortion | Doc ranks below K' locally but would clear globally | None |

The key observation: LocalWithGlobalRescore and Local make the same approximation at the segment level (both prune with local idf), but LocalWithGlobalRescore corrects the scoring distortion at the coordinator. The only thing Local saves over LocalWithGlobalRescore is the per-doc payload — a handful of extra integers per candidate.

Why this might be preferable to `Local` as a mode

The argument for exposing Local is that it's the cheap, scalable option for large multi-segment deployments. But the main thing Local actually trades away — score and ranking comparability across segments — is precisely what users notice when they look at results. The thing it saves — a coordinator round trip — is something LocalWithGlobalRescore also saves.

Put differently: Local makes two approximations (segment-local pruning AND segment-local scoring), and the first one is forced by single-RTT semantics while the second one isn't. LocalWithGlobalRescore keeps the forced approximation and drops the optional one.

This raises the question of whether Local needs to exist as a user-facing mode at all, or whether LocalWithGlobalRescore should take its place in the API. The cases where Local's smaller payload would matter (e.g., extreme query throughput with very large K') seem narrow enough that we should gather empirical evidence before exposing Local as a first-class mode.

Open questions

How should K' be chosen? Options: a fixed multiplier of k (e.g., K' = 10·k), a fixed minimum (e.g., K' = max(k, 100)), or user-configurable. Worth empirical study on representative workloads to see where the ranking accuracy curve flattens.
Should the rescore happen on the coordinator or be pushed back to segments as a second optional phase? Coordinator-side is simpler and is fine as long as Σ K' across segments fits comfortably in coordinator memory.
Does the existence of LocalWithGlobalRescore change the case for Local strongly enough to drop Local from the public API, or should both exist?
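For the first question, one way to combine the listed options is sketched below. `choose_k_prime` and its parameters are hypothetical, and the defaults (a multiplier of 10 and a floor of 100) simply reuse the example values above rather than any measured recommendation.

```python
def choose_k_prime(k, *, multiplier=10, floor=100, override=None):
    """Pick the per-segment candidate count K' for a requested top-k."""
    if override is not None:
        # User-configured value; it must still cover the requested k.
        if override < k:
            raise ValueError("K' must be at least k")
        return override
    # Fixed multiplier of k, but never below a fixed minimum.
    return max(k * multiplier, floor)
```

For example, `choose_k_prime(10)` returns 100 and `choose_k_prime(50)` returns 500. Whether either default is adequate is exactly what the empirical study would need to show.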

Naming

LocalWithGlobalRescore is descriptive but long. Alternatives worth considering: Rescored, GlobalRescore, ApproximateGlobal. I lean toward something that signals "global scoring, approximate recall" rather than "local with extra steps."

1 reply

Xuanwo May 15, 2026
Maintainer Author

I love this idea. It looks like we can make local the underlying persistent implementation, and only expose scoring = "accurate" | "fast".

When scoring = "accurate", we use the existing global BM25 logic, with cache as an optimization.
When scoring = "fast", we use the logic represented here.

wjones127 (Maintainer) · 2026-05-14T17:53:30Z

Other alternative: Global with cache

The other alternative to Local would be to have the coordinator keep all the corpus statistics in a cache. For indexed parts, it could load them directly from the index and keep them in the index cache. It could then often skip the IO involved in the first phase.

For unindexed fragments, this would have to be a cache of the computed stats that have been used so far. New queries would still have to do the first phase by brute force. We'd have to measure whether blocking on this makes a big difference. It can always be skipped with fast_search=true.

One open question is whether this limits scalability: all of the term stats would have to fit on one node.
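A minimal sketch of such a coordinator-side cache, assuming document frequencies are keyed by (segment, term). `CorpusStatsCache` and `load_df` are hypothetical names, and the loader callable stands in for whatever index IO or brute-force scan would actually produce the value.

```python
class CorpusStatsCache:
    """Coordinator-side cache of per-segment document frequencies."""

    def __init__(self, load_df):
        self._load_df = load_df  # callable(segment_id, term) -> df_t,i
        self._df = {}            # (segment_id, term) -> df_t,i
        self.loads = 0           # number of times the loader was actually called

    def global_df(self, segment_ids, term):
        total = 0
        for segment_id in segment_ids:
            key = (segment_id, term)
            if key not in self._df:
                # Cache miss: this is the phase-1 work the cache tries to avoid.
                self._df[key] = self._load_df(segment_id, term)
                self.loads += 1
            total += self._df[key]
        return total


stored = {("seg-1", "lance"): 50, ("seg-2", "lance"): 10}
cache = CorpusStatsCache(lambda segment_id, term: stored.get((segment_id, term), 0))

first = cache.global_df(["seg-1", "seg-2"], "lance")
loads_after_first = cache.loads
second = cache.global_df(["seg-1", "seg-2"], "lance")
```

The second lookup is served entirely from memory. The dictionary here is unbounded, which is the single-node scalability concern in concrete form; a real cache would need eviction and invalidation when segments change.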

0 replies
