deterministic score tiebreaker #1608

missinglink · 2022-02-28T18:19:36Z

orangejulius · 2022-02-28T20:49:06Z

Huh, fun. Looks like this actually fails because loading the _id field from disk is expensive:

[exception] java.util.concurrent.ExecutionException: CircuitBreakingException[[fielddata] Data too large, data for [_id] would be [3399601624/3.1gb], which is larger than the limit of [3373164134/3.1gb]]

missinglink · 2022-02-28T21:24:43Z

Wow, OK. The problem with using _doc instead is that it changes between builds.

missinglink · 2022-02-28T21:26:34Z

I'd be interested to know which query caused that. I assumed the sort was applied after hits are pruned.

missinglink · 2022-02-28T21:27:18Z

Or at least that the second sort criterion was only a secondary sort applied to the top $size hits of the first criterion.

missinglink · 2022-03-01T12:48:24Z

CircuitBreakingException

https://gitlab.com/crossref/event_data_query/-/issues/33

missinglink · 2022-03-01T12:58:20Z

Reference for _id: https://www.elastic.co/guide/en/elasticsearch/reference/master/mapping-id-field.html
It mentions this:

In case sorting or aggregating on the _id field is required, it is advised to duplicate the content of the _id field into another field that has doc_values enabled.

It seems that, since docvalues is not enabled for _id, the entire document needs to be loaded from disk just to use this one field for sorting, which is clearly prohibitively expensive.

So there are a couple of options if we'd like to pursue this:

1. Use [_score, _doc]. This provides a secondary sort key based on the document insertion order. It would be deterministic within a single build but not across multiple builds. My assumption is that this is 'cheap', since the _doc integer is probably already available within the results and wouldn't need to be fetched, but I don't think it really solves the problem, or at least it half-solves a problem we might not have 🤔
2. Use some fields with docvalues, something like [_score, layer, source, source_id] (the first two being defined in the schema as keyword_with_doc_values and the latter being keyword, which would need to change). This would achieve the desired result but would potentially cause a bit of disk spin and therefore a perf hit.

[edit] reversing the fields to have the less common terms first would be preferable!
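
For illustration, the docvalues option could be expressed as a sort clause along the lines of the sketch below. The helper name `with_tiebreaker` and the example query are hypothetical and not taken from the Pelias codebase; only the field names come from this thread, and the sketch assumes those fields have doc values enabled.

```python
import copy


def with_tiebreaker(body, fields):
    """Return a copy of a search body sorted by _score (descending),
    then by each tiebreaker field (ascending)."""
    out = copy.deepcopy(body)
    sort = [{"_score": {"order": "desc"}}]
    sort += [{field: {"order": "asc"}} for field in fields]
    out["sort"] = sort
    return out


# Hypothetical query; only the sort clause matters here.
query = {"size": 40, "query": {"match": {"name.default": "london"}}}

# Fields reversed so the most selective one comes first, per the [edit] note.
body = with_tiebreaker(query, ["source_id", "source", "layer"])
```

The original `query` is left untouched, so the same helper could wrap any existing query body without changing how it is built.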

missinglink · 2022-03-01T13:00:45Z

My personal preference would be either (3) abandon this and move on, or (2) use docvalues fields and hope that the fields are small enough that they are always in RAM and therefore don't suffer a perf hit.

orangejulius · 2022-03-01T13:27:29Z

Yeah, using [_score, layer, source, source_id] makes sense: the source_id field in our Elasticsearch documents has been without a purpose for quite some time, since it's duplicated in the _id field and easily calculated. But if a separate field from _id is the only way to use docvalues, then that's perfect.

missinglink · 2022-03-01T16:48:33Z

Seems we can probably use only source_id: since it's almost unique, we won't require any further sorting conditions.
The dilemma here is that using such a field will also use the most RAM 🤔

orangejulius · 2022-03-01T18:22:26Z

Maybe we want to try a different approach then? What if we re-sorted results in the API after retrieving them from Elasticsearch? Sort first by score, then by gid. This will always be fast because it's sorting at most ~80 records at once (with size=40 and a factor for extra records to remove dupes), and won't require any changes to our ES queries or their performance characteristics.

If sorting by source_id in Elasticsearch turned out to be free, then that would also be fine, but I suspect it's not free.
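
A minimal sketch of the API-side re-sort described above, assuming each hit is a dict with hypothetical `score` and `gid` keys (the real API is not written in Python and its hit shape may differ):

```python
def resort(hits):
    """Sort hits by score (highest first), breaking ties by gid (ascending)."""
    return sorted(hits, key=lambda h: (-h["score"], h["gid"]))


# Made-up hits: two of them share the same score.
hits = [
    {"gid": "openstreetmap:venue:node/2", "score": 1.5},
    {"gid": "geonames:locality:101", "score": 2.0},
    {"gid": "openstreetmap:venue:node/1", "score": 1.5},
]

ordered = [h["gid"] for h in resort(hits)]
```

With at most ~80 records per request the cost of this sort is negligible.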

missinglink · 2022-03-01T22:22:33Z

Yeah, I had a similar thought. The concern there is that if the top n 'hits' returned from ES to the API were non-deterministic, then it wouldn't really solve the core issue.
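
The concern can be shown with a toy example. This is a simulation under stated assumptions, not Elasticsearch behaviour verbatim: `top_n` stands in for ES truncating tied hits in whatever internal order a given build happens to have, and the `gid` values are made up.

```python
def top_n(hits, n):
    """Keep the first n hits by score only. sorted() is stable, so tied
    hits keep their incoming order, standing in for a build-dependent order."""
    return sorted(hits, key=lambda h: -h["score"])[:n]


def resort(hits):
    """API-side re-sort: score descending, then gid ascending."""
    return sorted(hits, key=lambda h: (-h["score"], h["gid"]))


# Three hits with identical scores, seen in two different internal orders.
build_a = [{"gid": g, "score": 1.0} for g in ("a", "b", "c")]
build_b = [{"gid": g, "score": 1.0} for g in ("c", "b", "a")]

result_a = [h["gid"] for h in resort(top_n(build_a, 2))]
result_b = [h["gid"] for h in resort(top_n(build_b, 2))]
```

Here `result_a` and `result_b` differ even though both were re-sorted, because the truncation happened before the tiebreaker was applied.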

orangejulius · 2022-03-01T22:39:49Z

Ahh, good point. It would probably handle most cases but definitely not every case.

So let's try a build with source_id docvalues enabled, and see how bad it is?

If it's really bad, maybe we do it API-side for now and work on introducing more scoring differentiators into the queries in the future?
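
For reference, enabling doc values on a keyword field is a mapping-level setting. The fragment below is a hypothetical sketch written as a Python dict, not the actual Pelias schema definition:

```python
import json

# Hypothetical mapping fragment; the real schema may define this differently.
source_id_mapping = {
    "properties": {
        "source_id": {
            "type": "keyword",
            "doc_values": True,  # on-disk columnar values, usable for sorting
        }
    }
}

print(json.dumps(source_id_mapping, indent=2))
```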

feat(scoring): deterministic score tiebreaker

5f85b2a

missinglink mentioned this pull request Mar 16, 2022

enable docvalues for source_id field pelias/schema#482

Draft
