Skip to content

Conversation

benoit74
Copy link
Collaborator

@benoit74 benoit74 commented Jul 11, 2024

Fix #167
Fix #168

Edited description

Changes:

  • create helper class IndexData to hold indexing data (title, content, keywords) before passing it to libzim
  • add support for automated indexing of PDFs in python-scraperlib, building an IndexData object automatically
    • it relies on PyMuPDF for now to extract PDF metadata and content
    • item_data title is populated based on title + author + subject, separated by dashes
    • item_data content is simply all pages content concatenated with new line to separate them
  • add support for index_data: IndexData | None and auto_index: bool | None for customizing indexing in StaticItem and add_item_for:
    • pass custom index_data from calller for customized indexing
    • set auto_index to False to disable indexing (both in python-scraperlib and libzim)
    • otherwise, indexing is automated in python-scraperlib for PDF documents (for now) or in libzim (for others, text or html for now)

Former description and points to discuss

Changes:

  • add new IndexingItem class capable to customize index data from data passed from the scraper or automatically from PDF content
    • this uses a new IndexData class holding the index data
    • for PDF, it relies on PyMuPDF for now to extract PDF metadata and content
      • item title is populated based on title + author + subject, separated by dashes
      • item content is simply all pages content concatenated with new line to separate them

Open points to discuss:

  • do we really need this new IndexingItem class or should we simply embed all this logic in StaticItem?
  • if we keep the separate class, do we need a new add_indexing_item_for, similar to add_item_for? Or just enrich the add_item_for with new arguments?

Copy link

codecov bot commented Jul 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (1eddabc) to head (c91646f).

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #182   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           32        33    +1     
  Lines         1452      1531   +79     
  Branches       251       273   +22     
=========================================
+ Hits          1452      1531   +79     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@benoit74 benoit74 force-pushed the pdf_support branch 2 times, most recently from 2b753ea to d50b67b Compare July 15, 2024 12:37
@benoit74 benoit74 self-assigned this Jul 15, 2024
@benoit74 benoit74 marked this pull request as ready for review July 15, 2024 12:46
@benoit74 benoit74 requested a review from rgaudin July 15, 2024 12:46
@benoit74
Copy link
Collaborator Author

Some files like https://irp.fas.org/doddir/milmed/milderm.pdf are raising "MuPDF error: format error: cmsOpenProfileFromMem failed" error. Looks like it could be fixed since it is an ICC profile issue (for which we do not care): pymupdf/PyMuPDF#3572. I will fix this.

@benoit74
Copy link
Collaborator Author

Fix is different than expected, but at least it is working, PR is again ready for review

Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, this is great.
I think we should extend StaticItem instead.

As for the API, I think we can add the following to add_item_for():

  • auto_index: bool = True: keeps libzim auto index + our pdf autoindex. Setting this to False would skip the PDF but would also overload the item with an empty IndexData so that libzim doesn't index it.
  • index_content: str | None = None which would generate the appropriate indexdata with wordcount if set. It doesn't handle keywords but current PDF impl doesn't either and we can extend in the future.

WDYT?

@benoit74 benoit74 changed the title Add indexing item + automatic indexing of PDF items Add indexdata + automatic indexing of PDF items Jul 29, 2024
@benoit74
Copy link
Collaborator Author

I did not passed index_content: str | None = None but index_data: IndexData | None since it also allows to set the title which is used for suggestions, which is quite important (item title is not used for suggestions when index data is passed)

And I also modified add_item_for since this is quite heavily used in scrapers.

Other than that, I think the change will please you.

@benoit74 benoit74 requested a review from rgaudin July 29, 2024 13:38
@rgaudin
Copy link
Member

rgaudin commented Jul 29, 2024

I did not passed index_content: str | None = None but index_data: IndexData | None since it also allows to set the title which is used for suggestions

I see it's missing from my comment but I meant index_content and index_title. I think requiring this extra import is in opposition with what add_item_for tries to achieve but you're the judge of that.

There are a couple of unresolved discussions…

@benoit74
Copy link
Collaborator Author

I see it's missing from my comment but I meant index_content and index_title. I think requiring this extra import is in opposition with what add_item_for tries to achieve but you're the judge of that.

Then I get what you meant, and I agree the extra import is not very lean

@benoit74
Copy link
Collaborator Author

I finally decided to keep using index_data in add_item_for and StaticItem because it is a convenient way to force user to pass both title and content should he decide to customize index_data and to detect when this is not done with pyright. Otherwise one might be tempted to pass only an index_title or only an index_content and this is not what we want.

@benoit74 benoit74 merged commit efc2942 into main Jul 30, 2024
@benoit74 benoit74 deleted the pdf_support branch July 30, 2024 06:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add utility to get PDF info for proper titles on PDF entries Add utility to index PDF documents content
2 participants