Support for multilang ZIMs #904

Merged
merged 6 commits into from Mar 8, 2023

@veloman-yunkan (Collaborator) commented Feb 27, 2023

Fixes #903

Note that a multilanguage ZIM/book is counted as 1 full book in the results to each of the matching /catalog/v2/entries?lang=<LANG> queries.
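The counting rule in the note above can be sketched as follows. This is only an illustration, not libkiwix code: `ToyEntry` and `countEntriesForLanguage` are hypothetical names, and the sketch assumes each entry already carries its languages as a list.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a catalog entry (not the libkiwix Book class).
struct ToyEntry {
  std::string title;
  std::vector<std::string> languages;  // e.g. {"eng"} or {"eng", "fra"}
};

// Count the entries that would match a lang=<LANG> filter.
// A multilang entry contributes exactly 1 to every language it declares.
std::size_t countEntriesForLanguage(const std::vector<ToyEntry>& entries,
                                    const std::string& lang) {
  return static_cast<std::size_t>(std::count_if(
      entries.begin(), entries.end(), [&lang](const ToyEntry& entry) {
        return std::find(entry.languages.begin(), entry.languages.end(),
                         lang) != entry.languages.end();
      }));
}
```

With three entries declaring `eng`, `eng,fra` and `fra`, both the `eng` and the `fra` counts come out as 2, so the per-language totals can add up to more than the number of books.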

@codecov

codecov bot commented Feb 27, 2023

Codecov Report

Patch coverage: 89.65% and project coverage change: +0.11 🎉

Comparison is base (3072513) 72.00% compared to head (eb002ae) 72.11%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #904      +/-   ##
==========================================
+ Coverage   72.00%   72.11%   +0.11%     
==========================================
  Files          54       54              
  Lines        3751     3766      +15     
  Branches     2089     2100      +11     
==========================================
+ Hits         2701     2716      +15     
  Misses       1048     1048              
  Partials        2        2              
Impacted Files                   Coverage Δ
src/libxml_dumper.cpp            0.00% <0.00%> (ø)
src/library.cpp                  84.03% <95.00%> (+0.43%) ⬆️
include/book.h                   96.15% <100.00%> (ø)
src/book.cpp                     88.00% <100.00%> (+0.16%) ⬆️
src/manager.cpp                  75.60% <100.00%> (ø)
src/opds_dumper.cpp              99.18% <100.00%> (ø)
src/server/internalServer.cpp    88.71% <100.00%> (+0.01%) ⬆️


@veloman-yunkan veloman-yunkan marked this pull request as ready for review February 28, 2023 12:58
@mgautierfr
Member

While this PR does fix #903, it seems a bit too simple to fully support multilang ZIMs. (I would be pleasantly surprised if it were enough.)

At the very least, in Library::updateBookDB we use book.getLanguage() to set the stemmer. With a multilang book, that will not work.
Similarly, we don't split the comma-separated list when indexing the book itself. It seems to work (based on the tests you have added), probably because we find the queried language in the book's language list, but it would be better to split it explicitly.

And I think we should change std::string Book::getLanguage() to std::vector<std::string> Book::getLanguages(). This way, we would force all our users to handle multilang books correctly.
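To illustrate why splitting the list explicitly is safer than matching against the raw string, here is a minimal sketch. It does not claim to reproduce how the libkiwix index actually matches languages; `splitLanguages`, `matchesBySubstring` and `matchesBySplit` are hypothetical helpers.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Split a comma-separated language list ("eng,fra") into individual codes.
std::vector<std::string> splitLanguages(const std::string& csv) {
  std::vector<std::string> codes;
  std::istringstream stream(csv);
  std::string code;
  while (std::getline(stream, code, ',')) {
    if (!code.empty()) codes.push_back(code);
  }
  return codes;
}

// Naive check: the query appears somewhere in the raw string.
bool matchesBySubstring(const std::string& rawLanguages,
                        const std::string& lang) {
  return rawLanguages.find(lang) != std::string::npos;
}

// Explicit check: the query equals one of the split codes.
bool matchesBySplit(const std::string& rawLanguages, const std::string& lang) {
  const std::vector<std::string> codes = splitLanguages(rawLanguages);
  return std::find(codes.begin(), codes.end(), lang) != codes.end();
}
```

Both checks agree for a query such as `fra` against `eng,fra`, but only the substring check accepts the partial code `en`, which is the kind of accidental match explicit splitting avoids.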

@veloman-yunkan
Collaborator Author

> in Library::updateBookDB, we use book.getLanguage() to set the stemmer. With a multilang book, it will not work.

What stemmer should be used for a multilang book? Currently, in such cases the stemmer is simply not used since an attempt to create a stemmer for a non-existing language fails with an exception which is intercepted and ignored.
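The fallback described above (a failed stemmer construction being caught and ignored) can be sketched like this. `ToyStemmer` is a hypothetical stand-in that only knows three language codes; it is not the stemmer class libkiwix uses, and `tryMakeStemmer` is likewise an assumed helper name.

```cpp
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical stemmer that only supports a fixed set of language codes.
class ToyStemmer {
 public:
  explicit ToyStemmer(const std::string& lang) : lang_(lang) {
    if (lang != "eng" && lang != "fra" && lang != "deu") {
      throw std::invalid_argument("no stemmer for language: " + lang);
    }
  }
  const std::string& language() const { return lang_; }

 private:
  std::string lang_;
};

// Try to build a stemmer. If the language is unknown (for example the raw
// multilang string "eng,fra"), swallow the exception and return nullptr so
// that the caller simply proceeds without stemming.
std::unique_ptr<ToyStemmer> tryMakeStemmer(const std::string& lang) {
  try {
    return std::unique_ptr<ToyStemmer>(new ToyStemmer(lang));
  } catch (const std::invalid_argument&) {
    return nullptr;
  }
}
```

Under these assumptions, passing a single known code yields a stemmer, while passing the whole comma-separated value yields none, which matches the "stemmer is simply not used" behaviour described in the comment.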

> Similarly, we don't split the comma-separated list when indexing the book itself. It seems to work (based on the tests you have added), probably because we find the queried language in the book's language list, but it would be better to split it explicitly.

Will fix.

> And I think we should change std::string Book::getLanguage() to std::vector<std::string> Book::getLanguages()

Should we do that in one step? Or add Book::getLanguages() and deprecate Book::getLanguage()?

@veloman-yunkan
Collaborator Author

> > And I think we should change std::string Book::getLanguage() to std::vector<std::string> Book::getLanguages()

> Should we do that in one step? Or add Book::getLanguages() and deprecate Book::getLanguage()?

Another option is to preserve Book::getLanguage() as a wrapper around Book::getLanguages() that throws when called on a multilang book.

@veloman-yunkan veloman-yunkan force-pushed the support_for_multilang_zims branch 2 times, most recently from f87d7b4 to c7c4655 Compare March 1, 2023 15:29
@mgautierfr
Member

Agree with the current code.

I think it would be better to deprecate Book::getLanguage. We can be sure that people will forget to catch the exception that could be thrown for a multilang ZIM. It is better to make the change explicit in the API.

@veloman-yunkan
Collaborator Author

> I think it would be better to deprecate Book::getLanguage. We can be sure that people will forget to catch the exception that could be thrown for a multilang ZIM. It is better to make the change explicit in the API.

@mgautierfr OK, I will deprecate it. Should its current implementation remain unchanged (i.e. should it keep returning a comma separated list of languages for multilang ZIMs)? The other alternatives are:

  1. return the first element of Book::getLanguages() (or an empty string if the latter is empty)
  2. return the only element of Book::getLanguages() if there is only one language, otherwise raise an exception.
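For comparison, the three candidate behaviours for the deprecated getter (keep the comma-separated value, or alternatives 1 and 2 above) can be written as free functions over a language list. This is a sketch only; the function names are hypothetical and none of them is libkiwix API.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Current behaviour: return all languages joined with commas.
std::string joinLanguages(const std::vector<std::string>& languages) {
  std::string joined;
  for (std::size_t i = 0; i < languages.size(); ++i) {
    if (i != 0) joined += ',';
    joined += languages[i];
  }
  return joined;
}

// Alternative 1: return the first language, or "" if there is none.
std::string firstLanguageOrEmpty(const std::vector<std::string>& languages) {
  return languages.empty() ? std::string() : languages.front();
}

// Alternative 2: return the only language, otherwise throw.
std::string onlyLanguageOrThrow(const std::vector<std::string>& languages) {
  if (languages.size() != 1) {
    throw std::runtime_error("book does not have exactly one language");
  }
  return languages.front();
}
```

For a book declaring `eng` and `fra`, the first function returns `eng,fra`, the second returns `eng`, and the third throws, which is the case callers would be likely to forget to handle.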

@veloman-yunkan
Collaborator Author

@mgautierfr And what about Bookmark::getLanguage()? Should we make the same change to Bookmark?

@mgautierfr
Member

> Should its current implementation remain unchanged (i.e. should it keep returning a comma separated list of languages for multilang ZIMs)?

Yes. We keep the function (as deprecated) so as not to break the API, so we shouldn't change its behavior (even if it is buggy).

> And what about Bookmark::getLanguage()? Should we make the same change to Bookmark?

A Bookmark refers to a single article. We can expect that even if the ZIM file is multilingual, each article is in one language only, so no change is needed.

`Book::getLanguages()` is used instead of `Book::getLanguage()` when
determining the set of languages for a collection of books.
Introduced `Book::getCommaSeparatedLanguages()` instead.
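Putting the outcome of the discussion together, the resulting API shape might look roughly like the sketch below. `ToyBook` and `collectLanguages` are hypothetical, and the signatures are assumptions based on the names mentioned in this thread rather than a copy of the libkiwix headers.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical book class mirroring the accessor names discussed above.
class ToyBook {
 public:
  explicit ToyBook(std::vector<std::string> languages)
      : languages_(std::move(languages)) {}

  // Preferred accessor: one element per language.
  const std::vector<std::string>& getLanguages() const { return languages_; }

  // Explicitly named accessor for the raw comma-separated form.
  std::string getCommaSeparatedLanguages() const {
    std::string joined;
    for (std::size_t i = 0; i < languages_.size(); ++i) {
      if (i != 0) joined += ',';
      joined += languages_[i];
    }
    return joined;
  }

  // Kept for API compatibility with unchanged behaviour, but deprecated.
  [[deprecated("use getLanguages() or getCommaSeparatedLanguages()")]]
  std::string getLanguage() const { return getCommaSeparatedLanguages(); }

 private:
  std::vector<std::string> languages_;
};

// Determine the set of languages for a collection of books by iterating
// over getLanguages() instead of treating "eng,fra" as a single language.
std::set<std::string> collectLanguages(const std::vector<ToyBook>& books) {
  std::set<std::string> result;
  for (const ToyBook& book : books) {
    result.insert(book.getLanguages().begin(), book.getLanguages().end());
  }
  return result;
}
```

The `[[deprecated]]` attribute makes the compiler warn at each call site of the old getter, which moves the change into the API visibly without altering what the function returns.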
@mgautierfr mgautierfr merged commit 88de978 into main Mar 8, 2023
@mgautierfr mgautierfr deleted the support_for_multilang_zims branch March 8, 2023 14:31
Successfully merging this pull request may close these issues.

ZIM with multiple language metadata are not properly supported
2 participants