Optimize BM25 for better kiwix-search suggestions #492
Codecov Report
@@ Coverage Diff @@
## master #492 +/- ##
==========================================
+ Coverage 73.25% 75.79% +2.54%
==========================================
Files 89 88 -1
Lines 3631 3599 -32
Branches 1626 1612 -14
==========================================
+ Hits 2660 2728 +68
+ Misses 971 870 -101
- Partials 0 1 +1
I've missed the |
I thought the Levenshtein distance was not used anymore because it was too slow! Where is it still in use? Otherwise we should remove it.
I think @mgautierfr is right: it is still being used. I am not familiar with testing the performance, but if it's doing the same thing that
With the current configuration, when we search "berlin" on this test set, the result is:
I am afraid a single test case focusing on the reported corner case is not enough. Since I don't have any expertise in search engines in general, or in the BM25 probabilistic formula in particular, I can't judge to what extent this change will have negative effects. You must demonstrate that suggestions work acceptably well on a broad set of inputs (both titles and queries).
I understand the comment, but I'm not sure I have other pertinent tests in mind. Here are a few ideas:
This pr only improves the single-term search. I should add these tests here:
I will add tests for phrase search in the other pr.
This is acceptable if you can guarantee that this PR doesn't affect phrase search in any way. In general when changing some piece of functionality you must ensure that the chances of breaking it are minimal. Tests provide a certain level of such confidence. Therefore before touching untested code one should cover it with tests (and that can be done in a separate PR).
I understand your concern @veloman-yunkan. I have tested the changes extensively using some big zim files, and have added another test to verify the order for a small phrase query. Some of the results:
After
As far as this PR is concerned, the impact of the BM25 tuning is equal on all the terms of a phrase, since the weight given to every term's wdf is reduced by a constant factor. The slight improvement this PR brings to phrase search is in the order of documents with the same weight (in the "before" case, all documents with score 91). This will be improved and tested further as part of the next PR.
@kelson42 @mgautierfr PS: I was not able to find any concrete example where the current implementation of Levenshtein distance gave noticeably better suggestions.
I have rebased that branch |
@veloman-yunkan I have rebased this PR on #503 so that the changes you mentioned in your review can be implemented. This PR should be merged only after #503 is merged.
That's good. When you rebase (or start) a PR A on top of another PR B, I advise that you set the base branch of A to the development branch of B. Then B's commits will not show up in A's history, and when B is merged A's base branch will be updated automatically.
Now that this PR has been rebased on top of #503, please squash the commit "rebasing to 502-filename-extension-issue, extending TempFile for"
On my side, this is an approval. I will let @veloman-yunkan give the final approval (and merge the PR) when it is OK for him.
The change history of this PR looks messy. Please squash all commits into one. Then you can split it into the following commits:
- Introduce the suggestion unit test without any changes to the suggestions algorithm. The test must pass.
- Change the suggestion algorithm and update the test. Thus the user-observable change in the code behavior will be automatically documented.
- Remove levenshtein
I am sorry. I didn't notice that you force-pushed the base branch. |
@maneeshpm Please rebase the top 3 commits of this PR |
Fixes #458
Since we are using a separate index for titles, the default Xapian::BM25Weight tuning parameters pose some issues. BM25 is a "bag of words" algorithm based on word frequency: there is no scoring bonus for matching word order or for anchoring, both of which are preferred when we search over titles. The changes I plan to include in this PR are:
- Xapian::BM25Weight
  The within-document-frequency (wdf) factor k1, with a default value of 1, is too high for a title search. Reducing k1 to 0.001 and increasing length normalization is a sufficient improvement.
- set_sort_by_relevance_then_values(valuesmap["title"])
  When searching a large index, several documents have the same relevance. This mix-up causes issues such as the match for a single-term query like "berlin" ending up far down the suggestion list when it should be near the top. Sorting by value among documents with the same relevance brings it back to the top.
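The tie-breaking idea in the second item can be sketched without Xapian as an ordinary two-key sort. This is an illustration under stated assumptions, not the code in this PR: the `Hit` struct, the function name, and the ascending title order are hypothetical stand-ins for what Xapian does internally when a sort value is set (the corresponding `Xapian::Enquire` method is spelled `set_sort_by_relevance_then_value`).

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for one suggestion hit: a relevance score plus the
// title that would be stored as a sortable value in the index.
struct Hit {
    int score;
    std::string title;
};

// Order hits by descending relevance; among hits with equal relevance,
// fall back to ascending title so the ordering is deterministic.
void sort_by_relevance_then_title(std::vector<Hit>& hits) {
    std::sort(hits.begin(), hits.end(), [](const Hit& a, const Hit& b) {
        if (a.score != b.score) {
            return a.score > b.score;  // higher relevance first
        }
        return a.title < b.title;      // tie-break on the title value
    });
}
```

Because a string sorts before any longer string it is a prefix of, a title such as "Berlin" comes ahead of "Berlin Wall" when both have the same score, which is the behavior wanted for single-term queries.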
With these two changes: