Multi-Segment FTS Scoring Semantics: Choosing Between Local and Global BM25 #6789
Replies: 4 comments 3 replies
-
|
I remember when I did research on Lucene and it is using global BM25. Looked more into the details of Lucene vs Elasticsearch, here is the finding: There are actually two different boundaries here. In Lucene, an index can have many physical segments, but BM25 statistics are computed at the Elasticsearch adds another layer above Lucene: shards. Each Elasticsearch shard is itself a Lucene index. By default, Elasticsearch uses So the default Elasticsearch behavior is:
Elasticsearch also has The scale is important. An Elasticsearch shard is not comparable to a tiny Lucene segment. Elasticsearch defaults to one primary shard per index, and its data stream lifecycle default rollover is around Mapping this back to Lance:
I still think |
Beta Was this translation helpful? Give feedback.
-
|
Thanks for writing this up. I did a little research to understand what the "coordination overhead" we are optimizing here. Why Global generally requires two phasesBM25 scoring per (doc, query) needs both per-doc data (tf, |d|) and corpus-level data (idf per query term, avgdl). In a single-segment index these come from the same source so there's no coordination question. In multi-segment FTS, the per-doc data lives in segments but the corpus-level data is a function of all segments. The corpus-level inputs are additive across segments:
So in principle the coordinator can compute exact global stats from a small per-segment summary. The catch is that segments need those stats before they can score, and they need to score before they can prune — and pruning is what makes FTS fast. Inverted-index scoring relies on techniques like WAND / block-max WAND that skip large portions of the posting lists by comparing per-block upper bounds against the current top-k threshold. Those upper bounds are functions of This is the structural reason two phases show up: sequenceDiagram
participant C as Coordinator
participant S1 as Segment 1
participant S2 as Segment 2
participant SN as Segment N
Note over C,SN: Phase 1: gather corpus statistics
C->>S1: query terms [q1, q2, ...]
C->>S2: query terms [q1, q2, ...]
C->>SN: query terms [q1, q2, ...]
S1-->>C: (N_1, sumdl_1, df_t,1)
S2-->>C: (N_2, sumdl_2, df_t,2)
SN-->>C: (N_N, sumdl_N, df_t,N)
Note over C: Aggregate:<br/>N = Σ N_i<br/>avgdl = Σ sumdl_i / N<br/>df_t = Σ df_t,i
Note over C,SN: Phase 2: score with global stats
C->>S1: globals (N, avgdl, df_t)
C->>S2: globals (N, avgdl, df_t)
C->>SN: globals (N, avgdl, df_t)
S1-->>C: top-k with global BM25
S2-->>C: top-k with global BM25
SN-->>C: top-k with global BM25
Note over C: Merge top-k across segments
The two phases are not about payload size — the per-segment summary in phase 1 is |
Beta Was this translation helpful? Give feedback.
-
Alternate Proposal:
|
| Property | Local |
LocalWithGlobalRescore |
Global (two-phase) |
|---|---|---|---|
| Round trips | 1 | 1 | 2 |
_score semantics |
Segment-local | Global, exact | Global, exact |
_score cross-segment comparability |
Approximate | Exact | Exact |
| Top-k ranking | Approximate | Exact for survivors of local top-K' | Exact |
| Per-doc payload | doc_id, score | doc_id, ` | d |
| Segment-side pruning | Local idf | Local idf | Global idf (WAND) |
| Failure mode | Score and ranking distortion | Doc ranks below K' locally but would clear globally | None |
The key observation: LocalWithGlobalRescore and Local make the same approximation at the segment level (both prune with local idf), but LocalWithGlobalRescore corrects the scoring distortion at the coordinator. The only thing Local saves over LocalWithGlobalRescore is the per-doc payload — a handful of extra integers per candidate.
Why this might be preferable to Local as a mode
The argument for exposing Local is that it's the cheap, scalable option for large multi-segment deployments. But the main thing Local actually trades away — score and ranking comparability across segments — is precisely what users notice when they look at results. The thing it saves — a coordinator round trip — is something LocalWithGlobalRescore also saves.
Put differently: Local makes two approximations (segment-local pruning AND segment-local scoring), and the first one is forced by single-RTT semantics while the second one isn't. LocalWithGlobalRescore keeps the forced approximation and drops the optional one.
This raises the question of whether Local needs to exist as a user-facing mode at all, or whether LocalWithGlobalRescore should take its place in the API. The cases where Local's smaller payload would matter (e.g., extreme query throughput with very large K') feel narrow enough to be worth empirical evidence before exposing as a first-class mode.
Open questions
- How should K' be chosen? Options: a fixed multiplier of k (e.g., K' = 10·k), a fixed minimum (e.g., K' = max(k, 100)), or user-configurable. Worth empirical study on representative workloads to see where the ranking accuracy curve flattens.
- Should the rescore happen on the coordinator or be pushed back to segments as a second optional phase? Coordinator-side is simpler and is fine as long as Σ K' across segments fits comfortably in coordinator memory.
- Does the existence of
LocalWithGlobalRescorechange the case forLocalstrongly enough to dropLocalfrom the public API, or should both exist?
Naming
LocalWithGlobalRescore is descriptive but long. Alternatives worth considering: Rescored, GlobalRescore, ApproximateGlobal. I lean toward something that signals "global scoring, approximate recall" rather than "local with extra steps."
Beta Was this translation helpful? Give feedback.
-
Other alternative: Global with cacheThe other alternative to local would be having the coordinator maintain all the corpus statistics in a cache. For indexed parts, it could load this directly from the index and keep in the index cache. Then it could often skip the IO involved in the first phase. For unindexed fragments, this would have to be a cache of computed stats that have been used so far. New queries would have to still do the first phase with brute force. We'd have to measure and see whether blocking on this makes a huge difference. Can always be skipped with There would be some question as to whether this limits the scalability: the entire term stats would have to fit on one node. |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
Abstract
Multi-segment FTS allows a single logical FTS index to be composed of multiple physical segments. This introduces a user-visible semantic issue for BM25 scoring: should BM25 corpus statistics be computed independently for each segment, or uniformly across the entire logical FTS index?
This proposal is not about removing global BM25. It aims to make the scoring semantics of multi-segment FTS explicit, so that users can choose between two modes:
Local: each segment uses its own BM25 statistics. This is better suited for large-scale and distributed queries, but_scorevalues are only approximately comparable across segments.Global: all target segments use a unified set of BM25 statistics. This produces ranking closer to a single merged FTS index, but requires more coordination overhead at query time.We hope to use this proposal to drive community discussion on: how Lance should define the default scoring behavior for multi-segment FTS, and how the semantics of
_scoreand top-k ranking should be communicated to users.Background
Lance is extending its multi-segment index capabilities so that large tables can be built and queried through multiple independent index artifacts. For FTS, this means a logical FTS index may no longer correspond to a single physical inverted index, but instead be composed of multiple committed FTS segments.
In a single FTS index, BM25 corpus statistics have a natural definition: they come from the document collection covered by that index. With multi-segment FTS, the boundary of the corpus is no longer uniquely defined.
One interpretation is that each segment is its own scoring corpus. At query time, each segment scores independently, and the results are then merged.
Another interpretation is that all segments under a logical FTS index together form a single scoring corpus. At query time, global statistics are obtained first, and then all segments score documents using the same set of BM25 statistics.
This is not merely an implementation detail. It affects the
_scorevalues users see, and can also influence top-k results when alimitis applied.User-Visible Semantics
FTS query results carry at least two layers of semantics:
_score.The scoring mode of multi-segment FTS should not alter matching correctness. Regardless of whether local or global scoring is used, a document that satisfies the query conditions must always be considered a match.
The difference appears at the ranking level. BM25 scores depend on corpus-level statistics such as:
If these statistics come from different segments,
_scorevalues produced by different segments are only approximately comparable. If the statistics come from the entire logical FTS index, the_scorevalues will be closer to what a single merged index would produce.Therefore, multi-segment FTS must clearly tell users whether the default
_scoreis calculated on a segment-local basis or on a logical-index-global basis.Proposed Scoring Modes
Local
In
Localmode, each segment scores documents using its own BM25 statistics.Semantics:
_scoreis based on segment-local corpus statistics._scorevalues are approximate and not guaranteed to be strictly comparable across segments.limitmay differ from that of a single merged FTS index.This mirrors the default trade-off in many distributed search systems. For example, Elasticsearch defaults to
query_then_fetchwith shard-local scoring; when global term statistics are needed, users must explicitly usedfs_query_then_fetch.Global
In
Globalmode, the target segments score documents using unified BM25 statistics.Semantics:
_scoreis based on corpus-wide statistics of the logical FTS index.This mode suits scenarios where users care more about consistent global relevance ranking than about minimizing query coordination overhead.
Why Both Modes Should Exist
LocalandGlobalare not a matter of right or wrong. They represent reasonable trade-offs for different workloads.Localis more suitable for:_scoreconsistency.Globalis more suitable for:_scorefor precise relevance ranking.Lance should not hide one of these semantics inside the implementation. A cleaner approach is to expose this trade-off explicitly to users.
Default Behavior
This is the question on which this proposal most seeks community input.
One option is to default to
Local. This would make multi-segment FTS better suited out of the box for large-scale and distributed queries, and would also align more closely with the default distributed search semantics of systems like Elasticsearch. The trade-off is that_scoreand top-k ranking are not guaranteed to be fully consistent with a merged index.Another option is to default to
Global. This would give users a stronger expectation of score consistency and stay closer to the current correctness-first implementation. However, as the number of segments grows, it makes query-time global coordination a default cost.Whichever default is chosen, the documentation should clearly state:
_score.limitis applied.Global.Local.API Naming
Naming also merits community discussion.
One pair of names is:
LocalGlobalThe advantage is that they directly describe the source of BM25 statistics, without requiring users to understand Elasticsearch terminology.
Another pair of names is:
QueryThenFetchDfsQueryThenFetchThe advantage is alignment with Elasticsearch terminology; users familiar with distributed search will immediately understand the trade-off. The downside is that it ties Lance’s API naming to Elasticsearch’s execution model.
I lean towards
Local/Global, because they more directly describe the semantics Lance needs to convey: the scope of the scoring statistics.Relationship to Existing Behavior
The current global scoring path in multi-segment FTS can be seen as the implementation foundation for
Globalmode. Its value remains: it provides more consistent cross-segment relevance scoring, and can also serve as a contrasting semantic forLocalmode.Introducing
Localmode does not mean global BM25 is wrong; it simply acknowledges that, in large-scale distributed scenarios, requiring global BM25 statistics for every query by default may not be the most appropriate trade-off.Therefore, the core of this proposal is not “remove global BM25”, but:
Community Discussion Questions
We especially hope the community will provide feedback on these questions:
LocalorGlobal?_scoreto be globally comparable by default in multi-segment FTS?Local/Globalsuitable user-facing names?query_then_fetch/dfs_query_then_fetchterminology?Global?Conclusion
Multi-segment FTS exposes the BM25 corpus boundary problem that was previously hidden inside a single index. We need to define it as clear user-facing semantics rather than leaving it buried in implementation details.
This proposal is not about whether global BM25 should exist, but about how Lance should present and expose this choice to users: prioritize better distributed scalability by default, or prioritize stronger global score consistency by default.
Beta Was this translation helpful? Give feedback.
All reactions