Optimize Attributes Criterion #3378
Labels: enhancement (New feature or improvement), milli (Related to the milli workspace), performance (Related to performance in terms of search/indexation speed or RAM/CPU/Disk consumption), v1.2.0 (PRs/issues solved in v1.2.0, released on 2023-06-05)
The algorithm
The Set-based algorithm iterates over intervals of positions that give sets of docids (from the `word_level_position_docids` database) and chooses, among the different Query derivations, the best interval giving docids.
Explanations
For each query, we have several Query derivations, mainly because of `ngrams` or `word-split`. To be able to fetch the best docids, we have to build meta-intervals for each Query derivation and keep the best.
What is an Interval?
An interval is defined by a `word`, a `left`, and a `right`, and contains a set of docids:
- `word`: the word we want to find.
- `left`: the left-most word-position where the searched word can match.
- `right`: the right-most word-position where the searched word can match.

The docids in the set are those that match `word` at a word-position between `left` and `right`.
The interval `("hello", 16, 32)` would contain a set of `docids` that match the word `"hello"` at a word-position between `16` and `32`.
Merge of intervals of words in the same Query derivation to build the meta-intervals
Because each word in the same Query derivation has position intervals containing sets of docids, we have to merge these intervals to create meta-intervals:
![Attributes criterion (set)](https://user-images.githubusercontent.com/6482087/114524578-a16ec600-9c45-11eb-9b96-7e7ed622a148.jpg)
To create meta-intervals, we have to cross-merge word-intervals to keep the best relevancy when we select `docids`. In this schema, we can see that the farther the intervals are, the more cross-merging is needed to create them.
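
To make the interval and meta-interval definitions above more concrete, here is a minimal Rust sketch. It is not the milli implementation: `DocIds`, `Interval`, `MetaInterval`, and `cross_merge` are hypothetical names, `BTreeSet<u32>` stands in for whatever set type the real code uses, and the merge rule (intersect the docids of every pairing, rank by the sum of the `left` bounds) is an assumption chosen for illustration rather than the rule drawn in the schema.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a set of document ids.
type DocIds = BTreeSet<u32>;

/// A word-interval: the docids matching `word` at a position in `left..=right`.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct Interval {
    word: String,
    left: u32,
    right: u32,
    docids: DocIds,
}

/// Hypothetical meta-interval built from one interval per word.
#[derive(Debug, Clone, PartialEq)]
struct MetaInterval {
    /// Assumed relevancy rank: sum of the `left` bounds (lower is better).
    rank: u32,
    /// Docids present in every merged word-interval.
    docids: DocIds,
}

/// Cross-merges the intervals of two words of the same Query derivation.
/// Every pairing is intersected, empty results are dropped, and the
/// remaining meta-intervals are sorted from best to worst assumed rank.
fn cross_merge(a: &[Interval], b: &[Interval]) -> Vec<MetaInterval> {
    let mut out = Vec::new();
    for ia in a {
        for ib in b {
            let docids: DocIds = ia.docids.intersection(&ib.docids).copied().collect();
            if !docids.is_empty() {
                out.push(MetaInterval { rank: ia.left + ib.left, docids });
            }
        }
    }
    // Stable sort: pairings with equal rank keep their generation order.
    out.sort_by_key(|m| m.rank);
    out
}
```

In this sketch, the pairing of the two closest intervals comes out first, and pairings that involve farther intervals only appear after more combinations have been computed, which mirrors the observation that farther meta-intervals need more cross-merging.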
The issue
This algorithm is reset at every call of the criterion, meaning that it needs to restart the iteration over all intervals to fetch the next Set of docids.
We could keep the last state of the algorithm by transforming it from a simple `function` into a real `iterator` and storing it in the criterion instance.
Warning
There are some interval-level details that are not explained in this issue, which make the implementation of this optimization more difficult.
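
The difference between restarting on every call and keeping the state can be sketched as follows. This is only an illustration under simplifying assumptions: the candidate sets are given as a plain `Vec`, and `nth_bucket_restarting`, `BucketIter`, and `AttributeCriterion` are hypothetical names, not the types used in milli. The `visited` counters exist only to show how much work each approach does.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a set of document ids.
type DocIds = BTreeSet<u32>;

/// Stateless version: every call walks the candidates again from the start.
/// Returns the `n`-th non-empty set and how many candidates were visited.
fn nth_bucket_restarting(candidates: &[DocIds], n: usize) -> (Option<DocIds>, usize) {
    let mut visited = 0;
    let mut found = 0;
    for set in candidates {
        visited += 1;
        if set.is_empty() {
            continue;
        }
        if found == n {
            return (Some(set.clone()), visited);
        }
        found += 1;
    }
    (None, visited)
}

/// Stateful version: the cursor survives between calls to `next`.
struct BucketIter {
    candidates: Vec<DocIds>,
    cursor: usize,
    visited: usize,
}

impl Iterator for BucketIter {
    type Item = DocIds;

    fn next(&mut self) -> Option<DocIds> {
        while self.cursor < self.candidates.len() {
            let set = self.candidates[self.cursor].clone();
            self.cursor += 1;
            self.visited += 1;
            if !set.is_empty() {
                return Some(set);
            }
        }
        None
    }
}

/// Hypothetical criterion instance that owns the iterator and its state.
struct AttributeCriterion {
    buckets: BucketIter,
}

impl AttributeCriterion {
    fn new(candidates: Vec<DocIds>) -> Self {
        Self { buckets: BucketIter { candidates, cursor: 0, visited: 0 } }
    }

    /// Resumes where the previous call stopped instead of restarting.
    fn next_bucket(&mut self) -> Option<DocIds> {
        self.buckets.next()
    }
}
```

With the stateless function, fetching the first and then the second non-empty set re-visits the leading candidates on the second call; with the iterator stored in the criterion, each candidate is visited once across all calls.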
self note (@LegendreM): `BinaryHeap`