This repository has been archived by the owner on Apr 4, 2023. It is now read-only.
Reduce the size of the word_pair_proximity database #639
Merged
loiclec added the following labels on Sep 14, 2022:
- indexing: Related to the documents/settings indexing algorithms.
- querying: Related to the searching/fetch data algorithms.
- DB breaking: The related changes break the DB.
- performance: Related to performance in terms of search/indexing speed or RAM/CPU/disk consumption.
loiclec force-pushed the word-pair-proximity-docids-refactor branch 2 times, most recently from 9a656b9 to 2983dd8 (September 19, 2022 14:35)
Same for word_prefix_pair_proximity
Similar to the word_prefix_pair_proximity one but instead the keys are: (proximity, prefix, word2)
loiclec force-pushed the word-pair-proximity-docids-refactor branch from 2983dd8 to ab2f6f3 (October 18, 2022 08:38)
ManyTheFish suggested changes on Oct 18, 2022
ManyTheFish suggested changes on Oct 20, 2022
@loiclec, just a message in case you missed the wild git conflict appearing 🐯
ManyTheFish approved these changes on Oct 25, 2022
Approved! I let you merge this @loiclec if everything is ok on your side 😄
Thanks @ManyTheFish! bors merge
Build succeeded.
Pull Request
What does this PR do?
Fixes #634
Now, the value corresponding to the key `prox word1 word2` in the `word_pair_proximity_docids` database contains the ids of the documents in which:
- `word1` is followed by `word2`
- the minimum number of words between `word1` and `word2` is `prox-1`
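
To make the new key layout concrete, here is a minimal sketch of which `(prox, word1, word2)` keys a single tokenized document would contribute to, going only by the description above. The `pair_keys` function and the `MAX_PROX` limit are hypothetical names chosen for illustration, not part of the real indexer, and keeping only the smallest proximity per ordered pair is an assumption based on the wording "minimum number of words".

```rust
use std::collections::BTreeMap;

/// Hypothetical upper bound on the proximity that gets recorded.
/// The real limit is not stated in this description.
const MAX_PROX: usize = 3;

/// Returns the `(prox, word1, word2)` keys one document would appear under.
/// Pairs are ordered (`word1` comes before `word2` in the text) and only the
/// smallest proximity seen for each ordered pair is kept (an assumption).
fn pair_keys(tokens: &[&str]) -> Vec<(u8, String, String)> {
    let mut min_prox: BTreeMap<(&str, &str), usize> = BTreeMap::new();
    for (i, w1) in tokens.iter().enumerate() {
        // Look at the next MAX_PROX words following `w1`.
        for (offset, w2) in tokens[i + 1..].iter().take(MAX_PROX).enumerate() {
            let prox = offset + 1; // `prox - 1` words sit between w1 and w2
            let entry = min_prox.entry((*w1, *w2)).or_insert(prox);
            *entry = (*entry).min(prox);
        }
    }
    let mut keys: Vec<(u8, String, String)> = min_prox
        .into_iter()
        .map(|((w1, w2), prox)| (prox as u8, w1.to_string(), w2.to_string()))
        .collect();
    keys.sort();
    keys
}
```

With the tokens `fox is smarter`, this yields the keys `1 fox is`, `1 is smarter` and `2 fox smarter`, and nothing for the reversed pairs.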
Before this PR, the `word_pair_proximity_docids` database had keys with the format `word1 word2 prox`, and the value contained the ids of the documents in which either:
- `word1` is followed by `word2` after a minimum of `prox-1` words in between them
- `word2` is followed by `word1` after a minimum of `prox-2` words
As a consequence of this change, calls such as:
have to be replaced with:
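
Going only by the key semantics described above, the old value for `word1 word2 prox` corresponds to the union of the new values for `prox word1 word2` and `prox-1 word2 word1`. The following is a minimal sketch of that migration, using an in-memory `BTreeMap` as a stand-in for the real database. The `NewDb` and `DocIds` types and both helper functions are hypothetical and are not milli's API.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Stand-in for a set of document ids (hypothetical; not the real bitmap type).
type DocIds = BTreeSet<u32>;
/// In-memory stand-in for the new database, keyed by `(prox, word1, word2)`.
type NewDb = BTreeMap<(u8, String, String), DocIds>;

/// Looks up one new-style key, returning an empty set when it is absent.
fn get(db: &NewDb, prox: u8, word1: &str, word2: &str) -> DocIds {
    db.get(&(prox, word1.to_string(), word2.to_string()))
        .cloned()
        .unwrap_or_default()
}

/// Rebuilds what the old key `word1 word2 prox` used to return:
/// `word1` followed by `word2` at `prox`, or `word2` followed by `word1`
/// at `prox - 1`.
fn old_style_lookup(db: &NewDb, word1: &str, word2: &str, prox: u8) -> DocIds {
    let mut docids = get(db, prox, word1, word2);
    // No key exists for proximity 0, so the swapped lookup is skipped there.
    if prox > 1 {
        docids.extend(get(db, prox - 1, word2, word1));
    }
    docids
}
```

Callers that only care about one word order can query the new key directly and skip the second lookup.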
Phrase search
The PR also fixes two bugs in the `resolve_phrase` function. The first bug is that a phrase containing the same word twice would always return zero documents (e.g. `"dog eats dog"`).
The second bug occurs with a phrase such as `"fox is smarter than a dog"` and the document with the text:
In that case, the phrase search would not return the document because:
- the key in `word_pair_proximity_docids` is `fox dog 2`
- `resolve_phrase` looks for `fox dog 5`, which returns 0 documents
New implementation of `resolve_phrase`
Given the phrase:
We select the document ids corresponding to all of the following keys in `word_pair_proximity_docids`:
- `1 fox is`
- `1 is smarter`
- `1 smarter than`
- (`1 fox smarter` OR `2 fox smarter`)
- (`1 is than` OR `2 is than`)
- (`1 than dog` OR `2 than dog`)
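
The key list above can be reproduced with a small sketch. It assumes the phrase is `fox is smarter than a dog`, that `a` is a skipped position (for example a stop word, written as `None` below), and that each word is paired with at most the next two positions. `phrase_key_groups` and `WINDOW` are hypothetical names, and this is an illustration of the listed keys rather than the actual `resolve_phrase` code.

```rust
/// Number of following positions each word is paired with (an assumption
/// inferred from the keys listed above).
const WINDOW: usize = 2;

/// Builds one group of `(prox, word1, word2)` keys per word pair.
/// Keys inside a group are alternatives (OR); a matching document is
/// expected to satisfy every group.
fn phrase_key_groups(phrase: &[Option<&str>]) -> Vec<Vec<(u8, String, String)>> {
    let mut groups = Vec::new();
    for (i, w1) in phrase.iter().enumerate() {
        // Skipped positions (None) never start a pair.
        if let Some(w1) = *w1 {
            for (offset, w2) in phrase[i + 1..].iter().take(WINDOW).enumerate() {
                if let Some(w2) = *w2 {
                    let distance = (offset + 1) as u8;
                    // Accept any stored proximity from 1 up to the distance
                    // between the two words in the phrase.
                    let group: Vec<(u8, String, String)> = (1..=distance)
                        .map(|prox| (prox, w1.to_string(), w2.to_string()))
                        .collect();
                    groups.push(group);
                }
            }
        }
    }
    groups
}
```

For the phrase above, passed as `[Some("fox"), Some("is"), Some("smarter"), Some("than"), None, Some("dog")]`, this produces six groups matching the six entries in the list, in a different order. Resolving the phrase would then presumably mean taking the union of document ids within each group and intersecting across groups.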
Benchmark Results
Indexing:
Search Wiki:
Search songs: