Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better handling of multi search #729

Merged
merged 29 commits into from
Jun 2, 2022
Merged

Better handling of multi search #729

merged 29 commits into from
Jun 2, 2022

Conversation

mgautierfr
Copy link
Member

@mgautierfr mgautierfr commented Mar 23, 2022

Based on #724

  • Better cache system
  • Be able to specify a list of zim files to search in.
  • Correctly protect search from multithread race condition (may be the root cause of Better unexpected exception handling? #760)
  • Maximal number of books on which we do a multizim search can be configured

The filtering of books to use in the search is made using new querystring parameter:

  • books.id to specify book's id to use (may be provided several times to select several books)
  • books.name to specify book's name to use (may be provided several times to select several books).
  • content, same as books.name. Keep for compatibility. content can be provided only once
  • books.filter.foo to do a search on the books using the foo criteria. Available criterias are the same as to search books in the opds stream

This PR now integrate #730 as both PR must be merge together to have something coherent.

@mgautierfr mgautierfr changed the base branch from master to search_improvement March 23, 2022 15:49
@mgautierfr mgautierfr mentioned this pull request Mar 23, 2022
@mgautierfr mgautierfr force-pushed the search_improvement branch 4 times, most recently from 701a416 to 311f783 Compare March 29, 2022 12:07
Base automatically changed from search_improvement to master March 29, 2022 12:42
@codecov
Copy link

codecov bot commented Mar 30, 2022

Codecov Report

Merging #729 (531a6a4) into master (d4da05e) will increase coverage by 1.41%.
The diff coverage is 89.01%.

❗ Current head 531a6a4 differs from pull request most recent head a7651d0. Consider uploading reports for the commit a7651d0 to get more accurate results

@@            Coverage Diff             @@
##           master     #729      +/-   ##
==========================================
+ Coverage   61.97%   63.39%   +1.41%     
==========================================
  Files          58       59       +1     
  Lines        3887     4051     +164     
  Branches     2103     2192      +89     
==========================================
+ Hits         2409     2568     +159     
- Misses       1477     1481       +4     
- Partials        1        2       +1     
Impacted Files Coverage Δ
include/search_renderer.h 100.00% <ø> (ø)
include/server.h 100.00% <ø> (ø)
src/server.cpp 79.16% <ø> (ø)
src/server/request_context.cpp 81.44% <33.33%> (-5.43%) ⬇️
src/library.cpp 82.40% <80.39%> (+0.97%) ⬆️
src/tools/otherTools.h 85.71% <85.71%> (ø)
src/tools/lrucache.h 98.36% <90.00%> (+7.18%) ⬆️
src/server/internalServer.cpp 84.81% <90.83%> (+1.69%) ⬆️
include/library.h 100.00% <100.00%> (ø)
src/search_renderer.cpp 90.81% <100.00%> (ø)
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d4da05e...a7651d0. Read the comment docs.

Copy link
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since @kelson42 requested my review on this WIP PR I started looking at the changes but I soon figured out that I was missing some context. The outcome of my first iteration are a few low value comments for the first couple of commits. It will be much helpful if a high level description of the use-model and functional enhancement sought by this PR is provided.

src/library.cpp Outdated Show resolved Hide resolved
include/library.h Outdated Show resolved Hide resolved
include/library.h Outdated Show resolved Hide resolved
src/library.cpp Outdated Show resolved Hide resolved
src/server/request_context.h Outdated Show resolved Hide resolved
@stale
Copy link

stale bot commented Apr 16, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions.

@mgautierfr
Copy link
Member Author

@veloman-yunkan There is still few unit tests missing but it is ready for re-review. Please review it as a new PR as I've change few thing when rebasing on master and it was difficult to do fixup commit. PR description updated.

@mgautierfr mgautierfr force-pushed the multizimsearch branch 5 times, most recently from 5431661 to 8be016a Compare April 21, 2022 15:26
Copy link
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only part of the review (I only had time to skim through the first several commits).

include/library.h Outdated Show resolved Hide resolved
try {
const char* envString = std::getenv(name);
if (envString == nullptr) {
throw std::runtime_error("Environment variable not set");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Include the name of the environment variable in the error message

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This execption is never thrown to the caller as it is catched by following catch(...)
We have to provide a string, so we use it as "documentation", but it is useless to generate one.

src/tools/otherTools.h Show resolved Hide resolved
src/tools/otherTools.cpp Outdated Show resolved Hide resolved
src/library.cpp Outdated Show resolved Hide resolved
src/library.cpp Outdated Show resolved Hide resolved
src/library.cpp Outdated Show resolved Hide resolved
src/library.cpp Show resolved Hide resolved
src/library.cpp Outdated
Comment on lines 194 to 226
bool Library::removeBookById(const std::string& id)
{
std::lock_guard<std::mutex> lock(m_mutex);
mp_impl->m_bookDB->delete_document("Q" + id);
dropReader(id);
dropCache(id);
return mp_impl->m_books.erase(id) == 1;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the cache size be updated here too? If not, a comment must be added that it is not done intentionally

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it is not needed to update the cache size here. I've added a comment.

@@ -138,12 +138,18 @@ class lru_cache {
return _cache_items_map.size();
}

size_t set_max_size(size_t new_size) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that this class is a hybrid of a snake with a camel, but I wonder if your choice of the style here was deliberate :)

Now getting serious. If the cache's current size exceeds the new value of max size, shouldn't it be truncated immediately? Though, a deeper question is - do we really need a dynamic cache size at all (i.e. is linking the cache size to the actual amount of data a good idea)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that this class is a hybrid of a snake with a camel, but I wonder if your choice of the style here was deliberate :)

Yes, indeed.

Now getting serious. If the cache's current size exceeds the new value of max size, shouldn't it be truncated immediately? Though, a deeper question is - do we really need a dynamic cache size at all (i.e. is linking the cache size to the actual amount of data a good idea)?

The real question is what is a good default value for the cache size ?
On a use case as library.kiwix.org, as we have a lot of zim files, we probably want a important cache.
But on small server runs on a raspberryPI, we want a small cache.

Using a percentage of the number of book is a heuristic that takes this into account (although not perfect, as all heuristic).
Before this PR, the cache was created after the library was populated, so we could calculate the cache size once. But as we add books to the library after the cache creation, we need to increase the cache size as we add books.

Reducing the actual cache size seems less important. Either it is not a problem, or it was already a problem when we increase the cache size (and so user should have set a fixed value corresponding to its usecase)

Copy link
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A second set of review comments covering the next few commits.

include/library.h Show resolved Hide resolved
src/library.cpp Show resolved Hide resolved
src/library.cpp Outdated Show resolved Hide resolved
include/library.h Outdated Show resolved Hide resolved
src/library.cpp Outdated Show resolved Hide resolved
src/server/internalServer.cpp Outdated Show resolved Hide resolved
src/server/internalServer.cpp Outdated Show resolved Hide resolved
src/server/internalServer.cpp Outdated Show resolved Hide resolved
src/server/internalServer.cpp Outdated Show resolved Hide resolved
src/server/internalServer.cpp Outdated Show resolved Hide resolved
Copy link
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This completes the first pass over the entire PR. But I think that with the big picture of the PR now in my head I will make another pass even before any of my comments is addressed.

src/search_renderer.cpp Outdated Show resolved Hide resolved
src/server/request_context.h Outdated Show resolved Hide resolved
static/templates/search_result.html Outdated Show resolved Hide resolved
src/search_renderer.cpp Outdated Show resolved Hide resolved
src/server/internalServer.cpp Outdated Show resolved Hide resolved
src/server/internalServer.cpp Outdated Show resolved Hide resolved
static/i18n/en.json Outdated Show resolved Hide resolved
@mgautierfr
Copy link
Member Author

@veloman-yunkan I should have handle all your numerous remarks. Ready for another review pass.

mgautierfr and others added 23 commits June 2, 2022 12:22
The prefix will be used to parse a "query to select book" in different context.
For now we have only one context : selecting books for the catalog search.
But we will want to select books to do fulltext search on them
(will be done in later commit)
`selectBooks` allow us to parse a query in a "standard" way to get
the book(s) on which the user want to work.
This introduce a intermediate mustache object to store information
about the request made by the user.
We are currently limiting to 5 but it will be changed in next commit.
The default value is 0, which means no limit.
- Adapt lrucache.cpp for rigth include path
  and use `kiwix::lru_cache` instead of `zim::lru_cache`.
- Add missing `#include <set>` in lrucache.h
When ConcurrentCache store a shared_ptr we may have shared_ptr in used
while the ConcurrentCache has drop it.
When we "recreate" a value to put in the cache, we don't want to recreate
it, but copying the shared_ptr in use.

To do so we use a (unlimited) store of weak_ptr (aka `WeakStore`)
Every created shared_ptr added to the cache has a weak_ptr ref also stored
in the WeakStore, and we check the WeakStore before creating the value.
libzim's search is not thread safe (mainly because xapian is not).
So we must protect our search objects from multi thread calls.

The best way to do this is to associate a mutex to the `zim::Searcher`
and lock the searcher each time we access object derivated from the
searcher (search, results, iterator, ...)
Providing the core part of the query explicitly in the search results
testsuite test data.
Note that some tests are failing and will be fixed in next commits.
The request_context can now take a filter to select arguments to
keep in the query string.
We have to reuse the query the user give us to generate the
pagination links.
At search result rendering step we don't have access to the query object.
The best place to know which arguments are used to select books
(and so which arguments to keep in the pagination links) is when we
parse the query to select books.

Fix tests (pagination links) with book selector other than "books.id="
(pattern=jazz&books.query.lang=eng)
Fix tests with querystring needed url encoding
(pattern=jazz&books.query.title=Ray%20Charles)
Copy link
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGBT

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants