Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize BM25 for better kiwix-search suggestions #492

Merged
merged 3 commits into from
Feb 24, 2021

Conversation

maneeshpm
Copy link
Collaborator

@maneeshpm maneeshpm commented Feb 6, 2021

Fixes #458
Since we are using a separate index for titles, using the default Xapian::BM25Weight tuning parameters poses some issues. BM25 is a "bag of words" algorithm based upon the frequency of words - there's no scoring bonus for matching ordering or for anchoring for a title search which is preferred when we search over titles. The changes I plan to include with this pr are:

  • Tune Xapian::BM25Weight
    The within-document-frequency(wdf) factor k1 with a default value of 1 is too much for a title search. Reducing k1 to 0.001 and increasing length normalization is sufficient improvement.
  • set_sort_by_relevance_then_values(valuesmap["title"])
    When searching a large index, we have several documents with the same relevance. This mixup causes issues like single term queries like "berlin" going way down the suggestion list when it should be around the top. Sorting by values for the same relevance brings them back to top.

With these two changes:

$ kiwix-search --suggestion -v wikipedia_en_all_mini_2021-01.zim "berlin" 
Performing suggestion query `berlin`
Setup queryparser using language eng
Mark query as 'partial'
Parsed query 'berlin' to Query((Zberlin@1 OR (WILDCARD SYNONYM berlin OR Zberlin@1)))
Berlin, Berlin 100
Berlin Berlin 100
Berlin, Berlin (2020) 99
Berlín 99
Berliner 99
Berline 99
Berlin 99
.berlin 99
Hotel Berlin, Berlin 99
Hotel Berlin Berlin 99

@codecov
Copy link

codecov bot commented Feb 6, 2021

Codecov Report

Merging #492 (4810555) into master (35cd997) will increase coverage by 2.54%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #492      +/-   ##
==========================================
+ Coverage   73.25%   75.79%   +2.54%     
==========================================
  Files          89       88       -1     
  Lines        3631     3599      -32     
  Branches     1626     1612      -14     
==========================================
+ Hits         2660     2728      +68     
+ Misses        971      870     -101     
- Partials        0        1       +1     
Impacted Files Coverage Δ
src/search.cpp 57.50% <100.00%> (+8.42%) ⬆️
src/archive.cpp 53.01% <0.00%> (+3.01%) ⬆️
src/fileimpl.cpp 83.95% <0.00%> (+3.70%) ⬆️
src/writer/item.cpp 66.66% <0.00%> (+4.76%) ⬆️
src/search_iterator.cpp 30.69% <0.00%> (+8.91%) ⬆️
src/tools.cpp 100.00% <0.00%> (+9.37%) ⬆️
src/dirent_lookup.h 98.41% <0.00%> (+19.04%) ⬆️
src/search_internal.h 75.00% <0.00%> (+28.12%) ⬆️
include/zim/writer/item.h 84.61% <0.00%> (+44.61%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 35cd997...4810555. Read the comment docs.

@maneeshpm maneeshpm force-pushed the 360-kiwix-search-order-issue branch 2 times, most recently from 95fe273 to 6dde33d Compare February 13, 2021 13:56
@maneeshpm maneeshpm changed the title indexing title with position kiwix-search Optimize BM25 for better kiwix-search suggestions Feb 13, 2021
@maneeshpm maneeshpm self-assigned this Feb 13, 2021
@maneeshpm maneeshpm linked an issue Feb 13, 2021 that may be closed by this pull request
@maneeshpm maneeshpm force-pushed the 360-kiwix-search-order-issue branch 2 times, most recently from e1734e5 to 2ded96d Compare February 14, 2021 07:47
@mgautierfr
Copy link
Collaborator

I've missed the set_sort_by_relevance_then_values in xapian.
The levenshtein_distance we have (https://github.com/openzim/libzim/blob/master/src/search.cpp#L118) is somehow doing the same thing but probably with slower performances. It would be nice to compare the use of set_sort_by_relevance_then_values with the levenshtein_distance and maybe remove the levensthein_distance if it is useless.

@kelson42
Copy link
Contributor

I thought the lev. distance was not used anymore because two slow! Where is that still in use? Otherwise we should remove it.

@maneeshpm
Copy link
Collaborator Author

maneeshpm commented Feb 15, 2021

I think @mgautierfr is right. It is still being used. I am not familiar with testing the performance, but if its doing the same thing that set_sort_by_relevance_then_value() already does, it can be removed.

With the current configuration, when we search "berlin" on this test set, the result is:
{"berlin", "hotel berlin, berlin", "not berlin", "berlin wall", "again berlin"}
in order without the last item "fooland".
@kelson42 the optimal suggestion set IMO is
{"berlin", "berlin wall", "hotel berlin, berlin", "not berlin", "again berlin"}
where terms starting with berlin are "anchored". This can be implemented in the other pr as we discussed.

@maneeshpm maneeshpm marked this pull request as ready for review February 16, 2021 05:21
Copy link
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am afraid a single test case focusing on the reported corner case is not enough. Since I don't have any expertise in search engines in general and the BM25 probabilistic formula in particular I can't judge to what extent this change will have negative effects. You must demonstrate that suggestions work acceptably well on a broad set of inputs (of both titles and queries).

test/suggestion.cpp Outdated Show resolved Hide resolved
test/suggestion.cpp Outdated Show resolved Hide resolved
@kelson42
Copy link
Contributor

kelson42 commented Feb 16, 2021

I am afraid a single test case focusing on the reported corner case is not enough. Since I don't have any expertise in search engines in general and the BM25 probabilistic formula in particular I can't judge to what extent this change will have negative effects. You must demonstrate that suggestions work acceptably well on a broad set of inputs (of both titles and queries).

I understand the comment but I’m not sure I have other pertinent test in my mind. Here a few ideas:

  • test the empty search
  • Test the case/accent insensitivity
  • Test the reductions of result each time the search pattern is more complete: “the”, “the wolf”, “the wolf of”, ...
  • Test the same words but in an other order
  • Test the search without results
  • Test the search woth more result than the limit

@maneeshpm
Copy link
Collaborator Author

This pr only improves the single-term search. I should add these tests here:

  • empty search
  • case sensitivity
  • search without results
  • search with more results than limit

I will add tests for phrase search in the other pr.

  • incremental search
  • order of words

@veloman-yunkan
Copy link
Collaborator

This pr only improves the single-term search. I should add these tests here:

* empty search

* case sensitivity

* search without results

* search with more results than limit

I will add tests for phrase search in the other pr.

* incremental search

* order of words

This is acceptable if you can guarantee that this PR doesn't affect phrase search in any way. In general when changing some piece of functionality you must ensure that the chances of breaking it are minimal. Tests provide a certain level of such confidence. Therefore before touching untested code one should cover it with tests (and that can be done in a separate PR).

@maneeshpm
Copy link
Collaborator Author

maneeshpm commented Feb 17, 2021

I understand your concern @veloman-yunkan. I have tested the changes extensively using some big zim files, and have added another test to verify the order for a small phrase query. Some of the results:
Before

$ kiwix-search -v --suggestion wikipedia_en_all_mini_2021-01.zim "summer in"
Performing suggestion query `summer in`
Setup queryparser using language eng
Mark query as 'partial'
Parsed query 'summer in' to Query((Zsummer@1 AND (WILDCARD SYNONYM in OR Zin@2)))
In Summer 100
In re Summers 91
In Summer (Renoir) 91
Summer in Berlin 91
Summer in Paradise 91
Shivers in Summer 91
Summer in Abaddon 91
Flambards in Summer 91
Summer in Siam 91
Death in Summer 91

After

$ kiwix-search -v --suggestion wikipedia_en_all_mini_2021-01.zim "summer in"  
Performing suggestion query `summer in`
Setup queryparser using language eng
Mark query as 'partial'
Parsed query 'summer in' to Query((Zsummer@1 AND (WILDCARD SYNONYM in OR Zin@2)))
In Summer 100
Sympathy in Summer 99
Summers in PA 99
Summer in Tyrol 99
Summer in Transylvania 99
Summer in Siam 99
Summer in Paradise 99
Summer in Mississippi 99
Summer in Kingston 99
Summer in Genova 99

As far as this pr is concerned, The impact of BM25 tuning is equal on all the terms of a phrase since the weightage of all terms wdf is reduced by a constant factor. The slight improvement for phrase search this pr brings is the order of documents with the same weightage (in the before case, all documents with score 91). This will be improved and tested further as part of the next pr.

test/suggestion.cpp Outdated Show resolved Hide resolved
test/suggestion.cpp Outdated Show resolved Hide resolved
test/suggestion.cpp Outdated Show resolved Hide resolved
@maneeshpm
Copy link
Collaborator Author

maneeshpm commented Feb 17, 2021

I've missed the set_sort_by_relevance_then_values in xapian.
The levenshtein_distance we have (https://github.com/openzim/libzim/blob/master/src/search.cpp#L118) is somehow doing the same thing but probably with slower performances. It would be nice to compare the use of set_sort_by_relevance_then_values with the levenshtein_distance and maybe remove the levensthein_distance if it is useless.

@kelson42 @mgautierfr set_sort_by_relevance_then_value() is not doing the same thing as Levenshtein distance. This function directly sorts the the documents having same relevance by increasing(or decreasing) values. Whereas Levenshtein distance calculated the difference between the query string and document title then used it as a key for sort. Are we still planning to remove Levenshtein distance?

PS: I was not able to find any concrete example where the current implementation of lev gave apparent better suggestions.

test/suggestion.cpp Outdated Show resolved Hide resolved
@kelson42
Copy link
Contributor

I have rebased that branch

@maneeshpm
Copy link
Collaborator Author

@veloman-yunkan I have rebased this PR on #503 so that the changes you mentioned in your review can be implemented. This PR should be merged only after #503 is merged.

@veloman-yunkan
Copy link
Collaborator

@veloman-yunkan I have rebased this PR on #503 so that the changes you mentioned in your review can be implemented. This PR should be merged only after #503 is merged.

That's good. When you rebase (or start) a PR A on top of another PR B, I advise that you set the base branch of A to the development branch of B. Then B's commits will not show up in A's history, and when B is merged A's base branch will be updated automatically.

@veloman-yunkan veloman-yunkan changed the base branch from master to 502-filename-extension-issue February 23, 2021 09:34
Copy link
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that this PR has been rebased on top of #503, please squash the commit "rebasing to 502-filename-extension-issue, extending TempFile for"

@mgautierfr
Copy link
Collaborator

On my side, this is a approval. I let @veloman-yunkan make the final approval (and merge) the PR when it is ok for him.

Copy link
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change history of this PR looks messy. Please squash all commits into one. Then you can split it into the following commits:

  1. Introduce the suggestion unit test without any changes to the suggestions algorithm. The test must pass.
  2. Change the suggestion algorithm and update the test. Thus the user-observable change in the code behavior will be automatically documented.
  3. Remove levenshtein

test/suggestion.cpp Outdated Show resolved Hide resolved
test/suggestion.cpp Outdated Show resolved Hide resolved
@veloman-yunkan
Copy link
Collaborator

veloman-yunkan commented Feb 23, 2021

@kelson42

Your rebase and subsequent force push was not performed properly. For a PR with a base branch different from master, the rebase must be performed in a special way when the base branch in rebased. In general it would be better if the author of the PR deals with it.

I am sorry. I didn't notice that you force-pushed the base branch.

@veloman-yunkan
Copy link
Collaborator

veloman-yunkan commented Feb 23, 2021

@maneeshpm Please rebase the top 3 commits of this PR --onto 502-filename-extension-issue. You will have to repeat it if the base branch is rebased again.

Base automatically changed from 502-filename-extension-issue to master February 24, 2021 09:42
@kelson42 kelson42 merged commit ac2cc1f into master Feb 24, 2021
@kelson42 kelson42 deleted the 360-kiwix-search-order-issue branch February 24, 2021 13:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Can't look up Berlin via Search Index
4 participants