Better handling of multi search #729

mgautierfr · 2022-03-23T15:47:19Z

Based on #724

- Better cache system
- Be able to specify a list of ZIM files to search in.
- Correctly protect search from multithread race conditions (may be the root cause of #760, "Better unexpected exception handling?")
- The maximum number of books on which we do a multizim search can be configured

The books to use in the search are selected with new query-string parameters:

- `books.id` specifies the ID of a book to use (may be provided several times to select several books).
- `books.name` specifies the name of a book to use (may be provided several times to select several books).
- `content`: same as `books.name`, kept for compatibility. `content` can be provided only once.
- `books.filter.foo` selects books using the `foo` criterion. The available criteria are the same as those used to search books in the OPDS stream.
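To illustrate how these parameter families could be told apart, here is a minimal sketch. `BookSelection` and `classifyBookParams` are hypothetical names invented for this example, not the API added by the PR (the actual selection is done by `selectBooks`), and the sketch does not enforce the "only once" rule for `content`.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical container for the book-selection part of a query string.
struct BookSelection {
  std::vector<std::string> ids;                // values of books.id
  std::vector<std::string> names;              // values of books.name / content
  std::map<std::string, std::string> filters;  // books.filter.<criterion> -> value
};

// Group already-decoded (key, value) pairs into the parameter families
// listed above. Unrelated arguments (e.g. pattern) are ignored.
inline BookSelection classifyBookParams(
    const std::vector<std::pair<std::string, std::string>>& args)
{
  static const std::string filterPrefix = "books.filter.";
  BookSelection sel;
  for (const auto& kv : args) {
    const std::string& key = kv.first;
    if (key == "books.id") {
      sel.ids.push_back(kv.second);
    } else if (key == "books.name" || key == "content") {
      sel.names.push_back(kv.second);
    } else if (key.size() > filterPrefix.size()
               && key.compare(0, filterPrefix.size(), filterPrefix) == 0) {
      // Keep only the criterion name, e.g. "lang" for books.filter.lang.
      sel.filters[key.substr(filterPrefix.size())] = kv.second;
    }
  }
  return sel;
}
```

With `books.id=a&books.id=b&books.filter.lang=eng`, this sketch would collect two IDs and one `lang` filter.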

This PR now integrates #730, as both PRs must be merged together to have something coherent.

codecov · 2022-03-30T13:27:34Z

Codecov Report

Merging #729 (531a6a4) into master (d4da05e) will increase coverage by 1.41%.
The diff coverage is 89.01%.

❗ Current head 531a6a4 differs from pull request most recent head a7651d0. Consider uploading reports for the commit a7651d0 to get more accurate results

@@            Coverage Diff             @@
##           master     #729      +/-   ##
==========================================
+ Coverage   61.97%   63.39%   +1.41%     
==========================================
  Files          58       59       +1     
  Lines        3887     4051     +164     
  Branches     2103     2192      +89     
==========================================
+ Hits         2409     2568     +159     
- Misses       1477     1481       +4     
- Partials        1        2       +1

| Impacted Files | Coverage Δ | |
|---|---|---|
| include/search_renderer.h | `100.00% <ø> (ø)` | |
| include/server.h | `100.00% <ø> (ø)` | |
| src/server.cpp | `79.16% <ø> (ø)` | |
| src/server/request_context.cpp | `81.44% <33.33%> (-5.43%)` | ⬇️ |
| src/library.cpp | `82.40% <80.39%> (+0.97%)` | ⬆️ |
| src/tools/otherTools.h | `85.71% <85.71%> (ø)` | |
| src/tools/lrucache.h | `98.36% <90.00%> (+7.18%)` | ⬆️ |
| src/server/internalServer.cpp | `84.81% <90.83%> (+1.69%)` | ⬆️ |
| include/library.h | `100.00% <100.00%> (ø)` | |
| src/search_renderer.cpp | `90.81% <100.00%> (ø)` | |

... and 5 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

veloman-yunkan

Since @kelson42 requested my review on this WIP PR, I started looking at the changes, but I soon figured out that I was missing some context. The outcome of my first iteration is a few low-value comments on the first couple of commits. It would be very helpful if a high-level description of the use model and the functional enhancement sought by this PR were provided.

src/library.cpp

include/library.h

src/library.cpp

src/server/request_context.h

stale · 2022-04-16T05:59:43Z

This pull request has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

mgautierfr · 2022-04-21T13:52:03Z

@veloman-yunkan There are still a few unit tests missing, but it is ready for re-review. Please review it as a new PR, as I've changed a few things when rebasing on master and it was difficult to do fixup commits. PR description updated.

veloman-yunkan

This is only part of the review (I only had time to skim through the first several commits).

include/library.h

veloman-yunkan · 2022-04-23T09:11:38Z

src/tools/otherTools.h

+    try {
+      const char* envString = std::getenv(name);
+      if (envString == nullptr) {
+        throw std::runtime_error("Environment variable not set");


Include the name of the environment variable in the error message

This exception is never thrown to the caller, as it is caught by the following catch(...).
We have to provide a string, so we use it as "documentation", but generating a more detailed one would be useless.
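For context, the pattern under discussion can be sketched as follows. This is only an illustration: `parseValue` is a hypothetical helper, and the body of `getEnvVar` is reconstructed from the diff hunk above under the assumption that it falls back to a default value.

```cpp
#include <cstdlib>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: parse a value of type T from a string, throwing on failure.
template <class T>
T parseValue(const std::string& str)
{
  std::istringstream iss(str);
  T value;
  if (!(iss >> value)) {
    throw std::invalid_argument("cannot parse value");
  }
  return value;
}

// Return the value of the environment variable `name`, or `defaultValue`
// if the variable is unset or cannot be parsed.
template <class T>
T getEnvVar(const char* name, const T& defaultValue)
{
  try {
    const char* envString = std::getenv(name);
    if (envString == nullptr) {
      // Never reaches the caller: it is swallowed by catch(...) below,
      // so the message only serves as documentation.
      throw std::runtime_error("Environment variable not set");
    }
    return parseValue<T>(envString);
  } catch (...) {
    // Any failure falls through to the default value.
  }
  return defaultValue;
}
```

Since every exception is absorbed inside the function, the text of the message has no observable effect for callers.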

src/tools/otherTools.h

src/tools/otherTools.cpp

src/library.cpp

veloman-yunkan · 2022-04-23T10:06:08Z

src/library.cpp

 bool Library::removeBookById(const std::string& id)
 {
  std::lock_guard<std::mutex> lock(m_mutex);
  mp_impl->m_bookDB->delete_document("Q" + id);
-  dropReader(id);
+  dropCache(id);
  return mp_impl->m_books.erase(id) == 1;
 }


Shouldn't the cache size be updated here too? If not, a comment must be added stating that this is intentional.

I don't think the cache size needs to be updated here. I've added a comment.

veloman-yunkan · 2022-04-23T10:25:14Z

src/tools/lrucache.h

@@ -138,12 +138,18 @@ class lru_cache {
    return _cache_items_map.size();
  }

+  size_t set_max_size(size_t new_size) {


I know that this class is a hybrid of a snake with a camel, but I wonder if your choice of the style here was deliberate :)

Now getting serious. If the cache's current size exceeds the new value of max size, shouldn't it be truncated immediately? Though, a deeper question is - do we really need a dynamic cache size at all (i.e. is linking the cache size to the actual amount of data a good idea)?

> I know that this class is a hybrid of a snake with a camel, but I wonder if your choice of the style here was deliberate :)

Yes, indeed.

> Now getting serious. If the cache's current size exceeds the new value of max size, shouldn't it be truncated immediately? Though, a deeper question is - do we really need a dynamic cache size at all (i.e. is linking the cache size to the actual amount of data a good idea)?

The real question is: what is a good default value for the cache size?
For a use case like library.kiwix.org, where we have a lot of ZIM files, we probably want a large cache.
But for a small server running on a Raspberry Pi, we want a small cache.

Using a percentage of the number of books is a heuristic that takes this into account (although it is not perfect, like all heuristics).
Before this PR, the cache was created after the library was populated, so we could calculate the cache size once. But as we now add books to the library after the cache is created, we need to increase the cache size as we add books.

Reducing the actual cache size seems less important. Either it is not a problem, or it was already a problem when we increased the cache size (in which case the user should have set a fixed value corresponding to their use case).
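The trade-off discussed in this thread can be sketched with a small stand-alone cache. This is not the `kiwix::lru_cache` code: `SketchLruCache` and `cacheSizeFor` are hypothetical, the 10% default is an arbitrary value chosen for the example, and returning the previous maximum from `set_max_size` is an assumption about the intended semantics.

```cpp
#include <cstddef>
#include <list>
#include <map>
#include <utility>

// Minimal LRU cache sketch with an adjustable maximum size.
template <class Key, class Value>
class SketchLruCache {
 public:
  explicit SketchLruCache(std::size_t maxSize) : m_maxSize(maxSize) {}

  void put(const Key& key, const Value& value) {
    auto it = m_index.find(key);
    if (it != m_index.end()) {
      m_items.erase(it->second);
      m_index.erase(it);
    }
    m_items.emplace_front(key, value);
    m_index[key] = m_items.begin();
    // Evict least recently used entries until we fit the current maximum.
    while (m_index.size() > m_maxSize) {
      m_index.erase(m_items.back().first);
      m_items.pop_back();
    }
  }

  bool contains(const Key& key) const { return m_index.count(key) != 0; }
  std::size_t size() const { return m_index.size(); }

  // Assumption: returns the previous maximum. Existing entries are not
  // evicted here; a smaller limit only takes effect on the next put().
  std::size_t set_max_size(std::size_t newSize) {
    const std::size_t previous = m_maxSize;
    m_maxSize = newSize;
    return previous;
  }

 private:
  using Item = std::pair<Key, Value>;
  std::list<Item> m_items;  // most recently used first
  std::map<Key, typename std::list<Item>::iterator> m_index;
  std::size_t m_maxSize;
};

// Hypothetical heuristic: cache a percentage of the books, but at least one.
inline std::size_t cacheSizeFor(std::size_t bookCount, std::size_t percent = 10) {
  const std::size_t n = bookCount * percent / 100;
  return n > 0 ? n : 1;
}
```

In this sketch, shrinking the limit does not truncate immediately, which matches the position that reducing the cache size is the less important case; truncating eagerly would mean running the same eviction loop inside `set_max_size`.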

veloman-yunkan

A second set of review comments covering the next few commits.

include/library.h

src/library.cpp

include/library.h

src/library.cpp

src/server/internalServer.cpp

veloman-yunkan

This completes the first pass over the entire PR. But I think that with the big picture of the PR now in my head I will make another pass even before any of my comments is addressed.

src/search_renderer.cpp

src/server/request_context.h

static/templates/search_result.html

src/search_renderer.cpp

src/server/internalServer.cpp

static/i18n/en.json

mgautierfr · 2022-04-27T14:03:23Z

@veloman-yunkan I should have handled all your numerous remarks. Ready for another review pass.

When ConcurrentCache stores a shared_ptr, we may have a shared_ptr still in use after the ConcurrentCache has dropped it. When we "recreate" a value to put in the cache, we don't want to actually recreate it, but to copy the shared_ptr that is in use. To do so, we use an (unlimited) store of weak_ptr (aka `WeakStore`). Every created shared_ptr added to the cache also has a weak_ptr reference stored in the WeakStore, and we check the WeakStore before creating the value.
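The idea can be sketched as follows. This is a simplified illustration, not the code from the PR: `WeakStoreSketch` and `getOrCreate` are hypothetical names, and the lookup-then-create sequence is not atomic here, so two threads could still create the same value concurrently.

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <string>

// Sketch of an unbounded store of weak_ptr, keyed like the cache.
template <class Key, class Value>
class WeakStoreSketch {
 public:
  // Return the live shared_ptr registered for `key`, or nullptr if
  // nobody holds it any more.
  std::shared_ptr<Value> get(const Key& key) {
    std::lock_guard<std::mutex> lock(m_mutex);
    auto it = m_store.find(key);
    if (it == m_store.end()) return nullptr;
    std::shared_ptr<Value> alive = it->second.lock();
    if (!alive) m_store.erase(it);  // drop the expired entry
    return alive;
  }

  void add(const Key& key, const std::shared_ptr<Value>& value) {
    std::lock_guard<std::mutex> lock(m_mutex);
    m_store[key] = value;
  }

 private:
  std::mutex m_mutex;
  std::map<Key, std::weak_ptr<Value>> m_store;
};

// Reuse a value that is still held elsewhere before creating a new one.
template <class Key, class Value, class Factory>
std::shared_ptr<Value> getOrCreate(WeakStoreSketch<Key, Value>& store,
                                   const Key& key, Factory create)
{
  if (auto existing = store.get(key)) return existing;
  std::shared_ptr<Value> created = create();
  store.add(key, created);
  return created;
}
```

Because the store only holds weak references, it never keeps a value alive by itself; once every owner releases its shared_ptr, the next lookup creates a fresh value.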

libzim's search is not thread-safe (mainly because Xapian is not). So we must protect our search objects from multithreaded calls. The best way to do this is to associate a mutex with the `zim::Searcher` and lock the searcher each time we access an object derived from the searcher (search, results, iterator, ...).
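A minimal sketch of pairing a non-thread-safe object with its mutex is shown below. `LockableSearcher` and `withLock` are hypothetical names used for illustration; the template parameter stands in for `zim::Searcher` so that the example does not depend on libzim.

```cpp
#include <mutex>
#include <utility>

// Pairs a non-thread-safe searcher with the mutex that guards it.
template <class Searcher>
class LockableSearcher {
 public:
  explicit LockableSearcher(Searcher searcher)
    : m_searcher(std::move(searcher)) {}

  // Run `work` with exclusive access to the searcher. Everything derived
  // from the searcher (search, results, iterators) should be used inside
  // `work`, so that it stays under the same lock.
  template <class Work>
  auto withLock(Work work) -> decltype(work(std::declval<Searcher&>())) {
    std::lock_guard<std::mutex> lock(m_mutex);
    return work(m_searcher);
  }

 private:
  std::mutex m_mutex;
  Searcher m_searcher;
};
```

Keeping the searcher private and exposing it only through `withLock` makes it harder to touch derived objects without holding the lock, at the cost of a slightly more awkward call site.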

We have to reuse the query the user gives us to generate the pagination links. At the search result rendering step, we don't have access to the query object. The best place to know which arguments are used to select books (and so which arguments to keep in the pagination links) is where we parse the query to select books. This fixes the tests (pagination links) with a book selector other than "books.id=" (pattern=jazz&books.query.lang=eng).
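As a rough illustration of keeping only selected arguments when rebuilding a query string for pagination links, consider the sketch below. `QueryArgs`, `urlEncodeSketch` and `buildQueryString` are hypothetical names, not the `RequestContext` API, and the encoder assumes the default "C" locale.

```cpp
#include <cctype>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using QueryArgs = std::vector<std::pair<std::string, std::string>>;

// Percent-encode everything except RFC 3986 unreserved characters.
inline std::string urlEncodeSketch(const std::string& in) {
  static const char hex[] = "0123456789ABCDEF";
  std::string out;
  for (unsigned char c : in) {
    if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
      out += static_cast<char>(c);
    } else {
      out += '%';
      out += hex[c >> 4];
      out += hex[c & 0x0F];
    }
  }
  return out;
}

// Rebuild a query string from the arguments whose key is accepted by `keep`.
inline std::string buildQueryString(
    const QueryArgs& args,
    const std::function<bool(const std::string&)>& keep)
{
  std::string out;
  for (const auto& kv : args) {
    if (!keep(kv.first)) continue;
    if (!out.empty()) out += '&';
    out += urlEncodeSketch(kv.first) + '=' + urlEncodeSketch(kv.second);
  }
  return out;
}
```

A filter that keeps `pattern` and every key starting with `books.` would drop paging arguments such as a hypothetical `start`, so each pagination link can append its own offset.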

veloman-yunkan

LGTM

mgautierfr changed the base branch from master to search_improvement March 23, 2022 15:49

mgautierfr mentioned this pull request Mar 23, 2022

Search rendering #730

Closed

mgautierfr force-pushed the search_improvement branch 4 times, most recently from 701a416 to 311f783 March 29, 2022 12:07

Base automatically changed from search_improvement to master March 29, 2022 12:42

mgautierfr force-pushed the multizimsearch branch from 27953dc to 8637a8e March 30, 2022 10:15

kelson42 requested a review from veloman-yunkan March 30, 2022 15:14

veloman-yunkan reviewed Mar 30, 2022

src/library.cpp (outdated)

include/library.h (outdated)

include/library.h (outdated)

src/library.cpp (outdated)

src/server/request_context.h (outdated)

mgautierfr mentioned this pull request Apr 13, 2022

Replace deprecated functions kiwix/kiwix-desktop#831

Merged

stale bot added the stale label Apr 16, 2022

mgautierfr force-pushed the multizimsearch branch from d72ea4c to 5296157 April 21, 2022 13:45

stale bot removed the stale label Apr 21, 2022

mgautierfr requested a review from veloman-yunkan April 21, 2022 13:52

mgautierfr force-pushed the multizimsearch branch 5 times, most recently from 5431661 to 8be016a April 21, 2022 15:26

veloman-yunkan requested changes Apr 23, 2022

veloman-yunkan requested changes Apr 24, 2022

mgautierfr force-pushed the multizimsearch branch from 8be016a to b7522f1 April 27, 2022 14:01

mgautierfr requested a review from veloman-yunkan April 27, 2022 14:03

mgautierfr force-pushed the multizimsearch branch from b7522f1 to 282c03b April 27, 2022 15:37

mgautierfr and others added 23 commits June 2, 2022 12:22

Use the newly introduced searcherCache for multizim searcher.

8546236

Handle multiple arguments in RequestContext.

98c54b2

Allow user to select multiple books when doing search.

22996e4

Move get_search_filter and subrange.

76ebfd7

Add a prefix in get_search_filter

4438106

The prefix will be used to parse a "query to select books" in different contexts. For now, we have only one context: selecting books for the catalog search. But we will want to select books to do a fulltext search on them (this will be done in a later commit).

Introduce selectBooks

76d5faf

`selectBooks` allows us to parse a query in a "standard" way to get the book(s) on which the user wants to work.

Use selectBooks in handle_search

39d0a56

Make the search_rendered handle multizim search.

077ceac

This introduces an intermediate mustache object to store information about the request made by the user.

Move i18n helper functions

c721320

Introduce Error exception to do i18n

f0065fd

Prefix env variable name with KIWIX_

cf30233

Limit the number of zim in multizim fulltext search.

b74910b

We currently limit the number to 5, but this will be changed in the next commit.

Make the limit of zim files per search configurable.

0081b4d

The default value is 0, which means no limit.

Copy the lrucache test from libzim.

2b38d2c

- Adapt lrucache.cpp for the right include path and use `kiwix::lru_cache` instead of `zim::lru_cache`.
- Add missing `#include <set>` in lrucache.h

Preparing to enhance the search results testsuite

3b3d7ad

Providing the core part of the query explicitly in the search results testsuite test data.

First test case for multizim search

f45962c

Add some more testing.

e2ab7fd

Note that some tests are failing and will be fixed in the next commits.

Make the request_context be able to generate a querystring for a subset.

b483a8e

The request_context can now take a filter to select arguments to keep in the query string.

Correctly url encode querystring

3bca433

Fix tests whose query string needs URL encoding (pattern=jazz&books.query.title=Ray%20Charles)

Check early that provided bookIds are valid

a7651d0

mgautierfr force-pushed the multizimsearch branch from 684e5dc to a7651d0 June 2, 2022 10:39

veloman-yunkan approved these changes Jun 2, 2022

mgautierfr merged commit 3704d8a into master Jun 2, 2022

mgautierfr deleted the multizimsearch branch June 2, 2022 10:49

veloman-yunkan mentioned this pull request Jul 31, 2022

Strange internal server errors when searching on a ZIM file served via library.xml #803

Open

veloman-yunkan mentioned this pull request Oct 13, 2022

docker kiwix-serve crashing kiwix/kiwix-tools#579

Closed

veloman-yunkan mentioned this pull request Oct 31, 2022

kiwix-serve ZIM fd needs to be smarter kiwix/kiwix-tools#142

Open
