Optimize BM25 for better kiwix-search suggestions #492

maneeshpm · 2021-02-06T15:04:14Z

Fixes #458
Since we use a separate index for titles, the default Xapian::BM25Weight tuning parameters cause some issues. BM25 is a "bag of words" algorithm based on word frequency: it gives no scoring bonus for matching word order or for anchoring a match at the start of the title, both of which are preferable when searching over titles. The changes I plan to include in this PR are:

Tune Xapian::BM25Weight
The within-document-frequency (wdf) parameter k1, with a default value of 1, gives wdf too much influence for a title search. Reducing k1 to 0.001 and increasing the length normalization is a sufficient improvement.
set_sort_by_relevance_then_value(valuesmap["title"])
When searching a large index, many documents end up with the same relevance and get mixed together. As a result, a single-term query like "berlin" can see its best match land far down the suggestion list when it should be near the top. Sorting documents of equal relevance by their title value brings such matches back to the top.
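
To see why such a small k1 flattens the influence of term frequency, here is a minimal sketch of the wdf component of the textbook BM25 formula. It is not Xapian's source code: the names `bm25_wdf_factor` and `repeat_bonus`, and the sample values used for `b` and the normalized length, are assumptions chosen for illustration.

```cpp
#include <cassert>

// Textbook BM25 term-frequency component (illustrative, not Xapian's code):
//   (k1 + 1) * wdf / (k1 * ((1 - b) + b * normlen) + wdf)
// wdf     : within-document frequency of the term (must be > 0 here)
// normlen : document length divided by the average document length
// k1, b   : BM25 tuning parameters
double bm25_wdf_factor(double wdf, double normlen, double k1, double b)
{
    assert(wdf > 0.0 && k1 >= 0.0);
    return (k1 + 1.0) * wdf / (k1 * ((1.0 - b) + b * normlen) + wdf);
}

// How much the component grows when a title repeats the term (wdf = 2)
// compared with a title that contains it only once (wdf = 1).
double repeat_bonus(double k1, double b, double normlen)
{
    return bm25_wdf_factor(2.0, normlen, k1, b)
         - bm25_wdf_factor(1.0, normlen, k1, b);
}
```

With k1 = 1, b = 0.5 and an average-length title, `repeat_bonus` is about 0.33; with k1 = 0.001 it drops to about 0.0005, so repeating the query term in a title barely changes this part of the score.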

With these two changes:

$ kiwix-search --suggestion -v wikipedia_en_all_mini_2021-01.zim "berlin" 
Performing suggestion query `berlin`
Setup queryparser using language eng
Mark query as 'partial'
Parsed query 'berlin' to Query((Zberlin@1 OR (WILDCARD SYNONYM berlin OR Zberlin@1)))
Berlin, Berlin 100
Berlin Berlin 100
Berlin, Berlin (2020) 99
Berlín 99
Berliner 99
Berline 99
Berlin 99
.berlin 99
Hotel Berlin, Berlin 99
Hotel Berlin Berlin 99

codecov · 2021-02-06T15:05:26Z

Codecov Report

Merging #492 (4810555) into master (35cd997) will increase coverage by 2.54%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #492      +/-   ##
==========================================
+ Coverage   73.25%   75.79%   +2.54%     
==========================================
  Files          89       88       -1     
  Lines        3631     3599      -32     
  Branches     1626     1612      -14     
==========================================
+ Hits         2660     2728      +68     
+ Misses        971      870     -101     
- Partials        0        1       +1

Impacted Files	Coverage Δ
src/search.cpp	`57.50% <100.00%> (+8.42%)`	⬆️
src/archive.cpp	`53.01% <0.00%> (+3.01%)`	⬆️
src/fileimpl.cpp	`83.95% <0.00%> (+3.70%)`	⬆️
src/writer/item.cpp	`66.66% <0.00%> (+4.76%)`	⬆️
src/search_iterator.cpp	`30.69% <0.00%> (+8.91%)`	⬆️
src/tools.cpp	`100.00% <0.00%> (+9.37%)`	⬆️
src/dirent_lookup.h	`98.41% <0.00%> (+19.04%)`	⬆️
src/search_internal.h	`75.00% <0.00%> (+28.12%)`	⬆️
include/zim/writer/item.h	`84.61% <0.00%> (+44.61%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

mgautierfr · 2021-02-15T09:22:19Z

I had missed set_sort_by_relevance_then_value in Xapian.
The levenshtein_distance we have (https://github.com/openzim/libzim/blob/master/src/search.cpp#L118) does somewhat the same thing, but probably with worse performance. It would be nice to compare set_sort_by_relevance_then_value with levenshtein_distance, and maybe remove levenshtein_distance if it is useless.

kelson42 · 2021-02-15T16:28:14Z

I thought the Levenshtein distance was not used anymore because it was too slow! Where is it still in use? Otherwise we should remove it.

maneeshpm · 2021-02-15T17:06:10Z

I think @mgautierfr is right: it is still being used. I am not familiar with performance testing, but if it is doing the same thing that set_sort_by_relevance_then_value() already does, it can be removed.

With the current configuration, when we search for "berlin" on this test set, the result is:
{"berlin", "hotel berlin, berlin", "not berlin", "berlin wall", "again berlin"}
in that order, without the last item "fooland".
@kelson42 the optimal suggestion set IMO is
{"berlin", "berlin wall", "hotel berlin, berlin", "not berlin", "again berlin"}
where titles starting with "berlin" are "anchored" at the top. This can be implemented in the other PR, as we discussed.
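
The "anchored" ordering described above can be sketched as a stable partition over the current result list. This is only an illustration, not the implementation planned for the other PR; the helper names `starts_with` and `anchor_first` are hypothetical, and titles and query are assumed to be lower-cased already.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// True when `title` begins with `query` (both assumed lower-cased).
bool starts_with(const std::string& title, const std::string& query)
{
    return title.compare(0, query.size(), query) == 0;
}

// Moves titles anchored at `query` to the front while keeping the existing
// relative order inside each of the two groups.
std::vector<std::string> anchor_first(std::vector<std::string> titles,
                                      const std::string& query)
{
    assert(!query.empty());  // an empty query would "anchor" every title
    std::stable_partition(titles.begin(), titles.end(),
        [&query](const std::string& t) { return starts_with(t, query); });
    return titles;
}
```

Applied to the current order {"berlin", "hotel berlin, berlin", "not berlin", "berlin wall", "again berlin"} with the query "berlin", `anchor_first` returns the order proposed above, because "berlin wall" moves up next to "berlin" and everything else keeps its place.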

veloman-yunkan

I am afraid a single test case focusing on the reported corner case is not enough. Since I don't have any expertise in search engines in general and the BM25 probabilistic formula in particular I can't judge to what extent this change will have negative effects. You must demonstrate that suggestions work acceptably well on a broad set of inputs (of both titles and queries).

test/suggestion.cpp

kelson42 · 2021-02-16T17:17:32Z

> I am afraid a single test case focusing on the reported corner case is not enough. Since I don't have any expertise in search engines in general and the BM25 probabilistic formula in particular I can't judge to what extent this change will have negative effects. You must demonstrate that suggestions work acceptably well on a broad set of inputs (of both titles and queries).

I understand the comment, but I'm not sure I have other pertinent tests in mind. Here are a few ideas:

Test the empty search
Test case/accent insensitivity
Test that the results narrow as the search pattern becomes more complete: "the", "the wolf", "the wolf of", ...
Test the same words in a different order
Test a search without results
Test a search with more results than the limit
maneeshpm · 2021-02-16T20:02:21Z

This pr only improves the single-term search. I should add these tests here:

empty search
case sensitivity
search without results
search with more results than limit

I will add tests for phrase search in the other pr.

incremental search
order of words

veloman-yunkan · 2021-02-17T08:58:59Z

> This pr only improves the single-term search. I should add these tests here:
> * empty search
> * case sensitivity
> * search without results
> * search with more results than limit
>
> I will add tests for phrase search in the other pr.
> * incremental search
> * order of words

This is acceptable if you can guarantee that this PR doesn't affect phrase search in any way. In general when changing some piece of functionality you must ensure that the chances of breaking it are minimal. Tests provide a certain level of such confidence. Therefore before touching untested code one should cover it with tests (and that can be done in a separate PR).

maneeshpm · 2021-02-17T13:32:11Z

I understand your concern @veloman-yunkan. I have tested the changes extensively using some big zim files, and have added another test to verify the order for a small phrase query. Some of the results:
Before

$ kiwix-search -v --suggestion wikipedia_en_all_mini_2021-01.zim "summer in"
Performing suggestion query `summer in`
Setup queryparser using language eng
Mark query as 'partial'
Parsed query 'summer in' to Query((Zsummer@1 AND (WILDCARD SYNONYM in OR Zin@2)))
In Summer 100
In re Summers 91
In Summer (Renoir) 91
Summer in Berlin 91
Summer in Paradise 91
Shivers in Summer 91
Summer in Abaddon 91
Flambards in Summer 91
Summer in Siam 91
Death in Summer 91

After

$ kiwix-search -v --suggestion wikipedia_en_all_mini_2021-01.zim "summer in"  
Performing suggestion query `summer in`
Setup queryparser using language eng
Mark query as 'partial'
Parsed query 'summer in' to Query((Zsummer@1 AND (WILDCARD SYNONYM in OR Zin@2)))
In Summer 100
Sympathy in Summer 99
Summers in PA 99
Summer in Tyrol 99
Summer in Transylvania 99
Summer in Siam 99
Summer in Paradise 99
Summer in Mississippi 99
Summer in Kingston 99
Summer in Genova 99

As far as this PR is concerned, the BM25 tuning affects all the terms of a phrase equally, since the weight given to every term's wdf is reduced by the same constant factor. The slight improvement this PR brings to phrase search is in the order of documents with the same weight (in the "before" case, all the documents with score 91). This will be improved and tested further as part of the next PR.

test/suggestion.cpp

maneeshpm · 2021-02-17T17:05:16Z

> I had missed set_sort_by_relevance_then_value in Xapian.
> The levenshtein_distance we have (https://github.com/openzim/libzim/blob/master/src/search.cpp#L118) does somewhat the same thing, but probably with worse performance. It would be nice to compare set_sort_by_relevance_then_value with levenshtein_distance, and maybe remove levenshtein_distance if it is useless.

@kelson42 @mgautierfr set_sort_by_relevance_then_value() is not doing the same thing as the Levenshtein distance. This function directly sorts the documents that have the same relevance by increasing (or decreasing) value, whereas the Levenshtein code computed the distance between the query string and each document title and then used it as the sort key. Are we still planning to remove the Levenshtein distance?

PS: I was not able to find any concrete example where the current Levenshtein implementation gave apparently better suggestions.
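
The difference between the two orderings can be sketched in plain C++. This is an analogue for illustration only, not the Xapian or libzim code: the `Hit` struct and the function names are hypothetical, the title string stands in for the stored title value, and the Levenshtein variant here sorts purely by distance, which may differ from what the removed code did with relevance.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Hit {
    int relevance;      // percentage score, higher is better
    std::string title;  // stands in for the stored "title" value
};

// Relevance decides the order; the title value only breaks ties.
void sort_by_relevance_then_title(std::vector<Hit>& hits)
{
    std::sort(hits.begin(), hits.end(), [](const Hit& a, const Hit& b) {
        if (a.relevance != b.relevance) return a.relevance > b.relevance;
        return a.title < b.title;
    });
}

// Classic dynamic-programming edit distance (insert, delete, substitute).
std::size_t levenshtein(const std::string& a, const std::string& b)
{
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t subst =
                prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({subst, prev[j] + 1, cur[j - 1] + 1});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// The distance between the query and each title is the sort key.
void sort_by_levenshtein(std::vector<Hit>& hits, const std::string& query)
{
    std::stable_sort(hits.begin(), hits.end(),
        [&query](const Hit& a, const Hit& b) {
            return levenshtein(query, a.title) < levenshtein(query, b.title);
        });
}

void example()
{
    std::vector<Hit> hits =
        {{99, "berliner"}, {100, "berlin, berlin"}, {99, "berlin"}};
    sort_by_relevance_then_title(hits);
    assert(hits[0].title == "berlin, berlin" && hits[2].title == "berliner");
    sort_by_levenshtein(hits, "berlin");
    assert(hits[0].title == "berlin");  // distance 0 wins regardless of score
}
```

The first ordering never lets a lower-scored title overtake a higher-scored one, while the distance-based ordering can, which is the distinction made above.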

test/suggestion.cpp

kelson42 · 2021-02-20T10:56:06Z

I have rebased that branch

maneeshpm · 2021-02-22T20:10:53Z

@veloman-yunkan I have rebased this PR on #503 so that the changes you mentioned in your review can be implemented. This PR should be merged only after #503 is merged.

veloman-yunkan · 2021-02-23T09:33:32Z

> @veloman-yunkan I have rebased this PR on #503 so that the changes you mentioned in your review can be implemented. This PR should be merged only after #503 is merged.

That's good. When you rebase (or start) a PR A on top of another PR B, I advise that you set the base branch of A to the development branch of B. Then B's commits will not show up in A's history, and when B is merged A's base branch will be updated automatically.

veloman-yunkan

Now that this PR has been rebased on top of #503, please squash the commit "rebasing to 502-filename-extension-issue, extending TempFile for"

mgautierfr · 2021-02-23T10:29:24Z

On my side, this is approved. I'll let @veloman-yunkan give the final approval (and merge the PR) when it is OK for him.

veloman-yunkan

The change history of this PR looks messy. Please squash all commits into one. Then you can split it into the following commits:

Introduce the suggestion unit test without any changes to the suggestions algorithm. The test must pass.
Change the suggestion algorithm and update the test. Thus the user-observable change in the code behavior will be automatically documented.
Remove levenshtein

test/suggestion.cpp

veloman-yunkan · 2021-02-23T15:23:45Z

@kelson42

Your rebase and subsequent force push were not performed properly. For a PR with a base branch different from master, the rebase must be performed in a special way when the base branch is rebased. In general it would be better if the author of the PR deals with it.

I am sorry. I didn't notice that you force-pushed the base branch.

veloman-yunkan · 2021-02-23T15:27:57Z

@maneeshpm Please rebase the top 3 commits of this PR --onto 502-filename-extension-issue. You will have to repeat it if the base branch is rebased again.

maneeshpm force-pushed the 360-kiwix-search-order-issue branch 2 times, most recently from 95fe273 to 6dde33d Compare February 13, 2021 13:56

maneeshpm changed the title from "indexing title with position kiwix-search" to "Optimize BM25 for better kiwix-search suggestions" Feb 13, 2021

maneeshpm self-assigned this Feb 13, 2021

maneeshpm linked an issue Feb 13, 2021 that may be closed by this pull request

Can't look up Berlin via Search Index #458

Closed

maneeshpm force-pushed the 360-kiwix-search-order-issue branch 2 times, most recently from e1734e5 to 2ded96d Compare February 14, 2021 07:47

maneeshpm force-pushed the 360-kiwix-search-order-issue branch from b000543 to 3bdff52 Compare February 15, 2021 17:01

maneeshpm marked this pull request as ready for review February 16, 2021 05:21

kelson42 requested a review from veloman-yunkan February 16, 2021 06:26

veloman-yunkan requested changes Feb 16, 2021

maneeshpm force-pushed the 360-kiwix-search-order-issue branch from 1a71000 to 544f52f Compare February 17, 2021 13:25

maneeshpm requested a review from veloman-yunkan February 17, 2021 13:32

veloman-yunkan requested changes Feb 17, 2021

maneeshpm force-pushed the 360-kiwix-search-order-issue branch from 544f52f to 3b1ebbb Compare February 17, 2021 18:31

maneeshpm mentioned this pull request Feb 17, 2021

Improve phrase search suggestions #501

Merged

maneeshpm requested a review from veloman-yunkan February 17, 2021 19:06

maneeshpm force-pushed the 360-kiwix-search-order-issue branch from 3b1ebbb to fe807fd Compare February 18, 2021 06:24

veloman-yunkan requested changes Feb 18, 2021

kelson42 force-pushed the 360-kiwix-search-order-issue branch from 6ad47fb to a9476e3 Compare February 20, 2021 10:56

maneeshpm requested a review from mgautierfr February 22, 2021 19:52

maneeshpm force-pushed the 360-kiwix-search-order-issue branch from 94ea501 to 72a5768 Compare February 22, 2021 20:06

maneeshpm force-pushed the 360-kiwix-search-order-issue branch from acf0364 to ee23d54 Compare February 22, 2021 20:13

veloman-yunkan changed the base branch from master to 502-filename-extension-issue February 23, 2021 09:34

veloman-yunkan requested changes Feb 23, 2021

maneeshpm force-pushed the 360-kiwix-search-order-issue branch from ee23d54 to 0d04fc4 Compare February 23, 2021 09:58

maneeshpm requested a review from veloman-yunkan February 23, 2021 10:01

mgautierfr approved these changes Feb 23, 2021

veloman-yunkan requested changes Feb 23, 2021

maneeshpm force-pushed the 360-kiwix-search-order-issue branch from 0d04fc4 to 88efdc1 Compare February 23, 2021 12:06

maneeshpm requested a review from veloman-yunkan February 23, 2021 12:14

maneeshpm force-pushed the 502-filename-extension-issue branch from 90f7213 to 22651fb Compare February 23, 2021 14:20

maneeshpm force-pushed the 360-kiwix-search-order-issue branch from 88efdc1 to 4702103 Compare February 23, 2021 14:29

veloman-yunkan approved these changes Feb 23, 2021

kelson42 force-pushed the 502-filename-extension-issue branch from 22651fb to 8499a83 Compare February 23, 2021 15:17

maneeshpm force-pushed the 360-kiwix-search-order-issue branch from 4702103 to 8e55f97 Compare February 23, 2021 16:20

kelson42 force-pushed the 502-filename-extension-issue branch from 8499a83 to e705b41 Compare February 23, 2021 16:47

maneeshpm force-pushed the 360-kiwix-search-order-issue branch from 8e55f97 to 38f5752 Compare February 23, 2021 16:56

Base automatically changed from 502-filename-extension-issue to master February 24, 2021 09:42

maneeshpm added 3 commits February 24, 2021 13:21

Add suggestion unit test

649c6b9

Tune BM25Weight, update unit tests

20b58a8

Remove levenshtein

4810555

kelson42 force-pushed the 360-kiwix-search-order-issue branch from 38f5752 to 4810555 Compare February 24, 2021 12:21

kelson42 merged commit ac2cc1f into master Feb 24, 2021

kelson42 deleted the 360-kiwix-search-order-issue branch February 24, 2021 13:12
