Add indexdata + automatic indexing of PDF items #182
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff            @@
##              main     #182    +/-   ##
=========================================
  Coverage   100.00%  100.00%
=========================================
  Files           32       33     +1
  Lines         1452     1531    +79
  Branches       251      273    +22
=========================================
+ Hits          1452     1531    +79
(force-pushed from 2b753ea to d50b67b)
Some files, such as https://irp.fas.org/doddir/milmed/milderm.pdf, raise a "MuPDF error: format error: cmsOpenProfileFromMem failed" error. It looks fixable, since it is an ICC profile issue (which we do not care about): pymupdf/PyMuPDF#3572. I will fix this.
The fix is different than expected, but at least it works. The PR is ready for review again.
(force-pushed from 3da83f5 to 7d519e1)
Thank you, this is great.
I think we should extend `StaticItem` instead.
As for the API, I think we can add the following to `add_item_for()`:
- `auto_index: bool = True`: keeps libzim auto index + our PDF auto-index. Setting this to False would skip the PDF but would also overload the item with an empty `IndexData` so that libzim doesn't index it.
- `index_content: str | None = None`, which would generate the appropriate index data with word count if set. It doesn't handle keywords, but the current PDF implementation doesn't either, and we can extend it in the future.
WDYT?
I did not pass […] And I also modified […] Other than that, I think the change will please you.
I see it's missing from my comment but I meant […] There are a couple of unresolved discussions…
Then I get what you meant, and I agree the extra import is not very lean.
I finally decided to keep using […]
Fix #167
Fix #168
Edited description
Changes:
- `IndexData` to hold indexing data (title, content, keywords) before passing it to libzim
- `index_data: IndexData | None` and `auto_index: bool | None` for customizing indexing in `StaticItem` and `add_item_for`:
  - `index_data`: passed by the caller for customized indexing
  - `auto_index`: set to False to disable indexing (both in python-scraperlib and libzim)

Former description and points to discuss
Changes:
- `IndexingItem` class capable of customizing index data, either from data passed by the scraper or automatically from PDF content
- `IndexData` class holding the index data

Open points to discuss:
- `IndexingItem` class, or should we simply embed all this logic in `StaticItem`?
- `add_indexing_item_for`, similar to `add_item_for`? Or just enrich `add_item_for` with new arguments?
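For readers unfamiliar with the idea, a holder like the `IndexData` mentioned in the edited description could look roughly like the sketch below. It is a standalone illustration, not the class added by this PR: the accessor names are modeled on the `libzim.writer.IndexData` interface as I understand it, and the word count derived from whitespace-separated content is an assumption.

```python
from dataclasses import dataclass


@dataclass
class IndexData:
    """Standalone sketch of a holder for index data (not the real class)."""

    title: str
    content: str
    keywords: str = ""

    def has_indexdata(self) -> bool:
        # Nothing to index when both title and content are empty.
        return bool(self.title or self.content)

    def get_title(self) -> str:
        return self.title

    def get_content(self) -> str:
        return self.content

    def get_keywords(self) -> str:
        return self.keywords

    def get_wordcount(self) -> int:
        # Assumption: words are whitespace-separated tokens of the content.
        return len(self.content.split())


# Custom index data a caller might supply for an item.
custom = IndexData(title="Manual", content="military dermatology handbook")

# Empty index data, the shape used to tell the indexer to skip an item.
skip = IndexData(title="", content="")
```

Here `custom.has_indexdata()` is true, while `skip.has_indexdata()` is false, which is how an empty holder signals that the item should not be indexed.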