Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: Switched from PyMuPDF to pypdfium2 #829

Merged
merged 18 commits into from Feb 24, 2022
Merged

refactor: Switched from PyMuPDF to pypdfium2 #829

merged 18 commits into from Feb 24, 2022

Conversation

fg-mindee
Copy link
Contributor

This PR introduces the following modifications:

  • switched PDF backend from PyMuPDF to pypdfium2
  • deprecated annotation reading (not supported by pypdfium2 for now)
  • updates unittests & documentation
  • only uses PyMuPDF for PDF mocking in the unittests (does impact license for users)

Closes #486

Any feedback is welcome!

@fg-mindee fg-mindee added topic: documentation Improvements or additions to documentation critical High priority module: io Related to doctr.io ext: tests Related to tests folder type: breaking change Introducing a breaking change ext: docs Related to docs folder labels Feb 21, 2022
@fg-mindee fg-mindee added this to the 0.5.1 milestone Feb 21, 2022
@fg-mindee fg-mindee self-assigned this Feb 21, 2022
charlesmindee
charlesmindee previously approved these changes Feb 22, 2022
Copy link
Collaborator

@charlesmindee charlesmindee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for that! I wanted to do it quickly but I am running out of time 😅
Are you sure keeping pymupdf for the tests is not an issue ?

@fg-mindee
Copy link
Contributor Author

Thanks for that! I wanted to do it quickly but I am running out of time sweat_smile Are you sure keeping pymupdf for the tests is not an issue ?

For pypi downloads, I'm positive since it doesn't ship the test folder. But just to make sure that the entire codebase is rid of this, I changed it to Pillow + docTR synthesize function to mock PDFs 👍

@codecov
Copy link

codecov bot commented Feb 23, 2022

Codecov Report

Merging #829 (f7f585f) into main (41237e9) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #829      +/-   ##
==========================================
- Coverage   95.97%   95.95%   -0.02%     
==========================================
  Files         131      131              
  Lines        4988     4993       +5     
==========================================
+ Hits         4787     4791       +4     
- Misses        201      202       +1     
Flag Coverage Δ
unittests 95.95% <100.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
doctr/io/pdf.py 100.00% <100.00%> (+1.66%) ⬆️
doctr/io/reader.py 100.00% <100.00%> (ø)
doctr/models/_utils.py 97.59% <0.00%> (-1.21%) ⬇️
doctr/models/detection/linknet/tensorflow.py 97.67% <0.00%> (-1.00%) ⬇️
doctr/models/detection/linknet/pytorch.py 97.97% <0.00%> (-0.89%) ⬇️
doctr/models/classification/resnet/pytorch.py 100.00% <0.00%> (ø)
doctr/models/classification/resnet/tensorflow.py 100.00% <0.00%> (ø)
doctr/models/classification/magc_resnet/pytorch.py 100.00% <0.00%> (ø)
doctr/models/recognition/sar/pytorch.py 99.20% <0.00%> (+0.06%) ⬆️
doctr/models/utils/pytorch.py 100.00% <0.00%> (+5.00%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 41237e9...f7f585f. Read the comment docs.

@fg-mindee
Copy link
Contributor Author

fg-mindee commented Feb 23, 2022

Also it seems that pypdfium2 isn't supported by some versions of Python (cf. docker job) 🤔 But we'll handle this in another PR

charlesmindee
charlesmindee previously approved these changes Feb 24, 2022
Copy link
Collaborator

@charlesmindee charlesmindee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! It is cool to get rid of pymupdf

Copy link
Collaborator

@charlesmindee charlesmindee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

@fg-mindee fg-mindee merged commit 2581daa into main Feb 24, 2022
@fg-mindee fg-mindee deleted the pdf branch February 24, 2022 16:27
@github-actions
Copy link

Hey @fg-mindee 👋
You merged this PR, but it is not correctly labeled. The list of valid labels is available at https://github.com/mindee/doctr/blob/main/.github/verify_pr_labels.py

@mara004
Copy link
Contributor

mara004 commented May 18, 2022

I see that text extraction and image localisation code was removed with this change.

PDFium provides capabilities to extract text from PDFs, but I don't have a support model for it yet in pypdfium2. It's a feature that has already been requested several times, and I hope to work on it soon. The pdf_text_page example from pdfbrain1 should give a pretty good idea on how to do it.

I'm not sure if PDFium can locate images, but will look it up.

Footnotes

  1. This is a collection of helper classes for interfacing with the PDFium API using ctypes, but it still uses the old pypdfium, the predecessor of pypdfium2.

@frgfm
Copy link
Collaborator

frgfm commented May 18, 2022

Thanks a lot @mara004, we're quite excited about how thorough you are in the development of this project :)

@mara004
Copy link
Contributor

mara004 commented May 19, 2022

I'm currently working on the new text extraction helper class in pypdfium2-team/pypdfium2#110. If you have any suggestions/requests about the API, please let me know.
Concerning images, I've browsed the pdfium code in fpdf_edit.h and found out that locating image objects would be possible using the functions FPDFPage_CountObjects(), FPDFPage_GetObject(), FPDFPageObj_GetType() and FPDFPageObj_GetBounds(). I've also created a support model that is part of the same pull request.

@mara004
Copy link
Contributor

mara004 commented May 20, 2022

I have just merged the new support models into main: pypdfium2-team/pypdfium2@bbc2438

@frgfm
Copy link
Collaborator

frgfm commented May 22, 2022

@mara004 that's great! Do you have any minimal snippet example please? So that I can integrate the feature into docTR 😁

@mara004
Copy link
Contributor

mara004 commented May 22, 2022

I guess something like this:

Images

# Creates a list of image bounding boxes of left, right, bottom and top in PDF canvas units
pdf = pdfium.PdfDocument(filepath)
page = pdf.get_page(index)
images = []
for obj in page.get_objects():
    if obj.get_type() == pdfium.FPDF_PAGEOBJ_IMAGE:
        images.append(obj.get_pos())
[g.close() for g in (page, pdf)]

Text

# Creates a list of pairs of bounding boxes and their text content
pdf = pdfium.PdfDocument(filepath)
page = pdf.get_page(index)
textpage = page.get_textpage()
text_boxes = []
for bbox in textpage.get_rectboxes():
    text_boxes.append( (bbox, textpage.get_text(*bbox)) )
[g.close() for g in (textpage, page, pdf)]

@mara004
Copy link
Contributor

mara004 commented May 24, 2022

FYI, I yet made a few changes to the text API in pypdfium2-team/pypdfium2@3ac10be and adapted the above example.

@mara004
Copy link
Contributor

mara004 commented May 25, 2022

I just released pypdfium2 1.10.0, which contains the new support models.

@frgfm
Copy link
Collaborator

frgfm commented May 25, 2022

Can we get people like you as "customer service" for all OSS libraries haha? 😁
I'll check how to integrate that soon, thanks 🙏

@mara004
Copy link
Contributor

mara004 commented May 25, 2022

I'm glad to help OSS projects as far as my limited knowledge permits :).
Seriously, in the end I think pypdfium2 can only benefit if it gets used more widely, so I aim to make it fit for the needs of embedders. I'm personally working on a python project which depends on PDF rendering quite a lot, so I thought it might be important to consolidate pypdfium2 a bit.

@mara004
Copy link
Contributor

mara004 commented Jun 8, 2022

The API rewrite I mentioned in the other thread should be finished now, and I plan to release version 2.0.0 of pypdfium2 soon (there's a pre-release already). I have updated the above examples again and will submit a PR for the rendering code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
critical High priority ext: docs Related to docs folder ext: tests Related to tests folder module: io Related to doctr.io topic: documentation Improvements or additions to documentation type: breaking change Introducing a breaking change type: misc Miscellaneous
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider removing PyMuPDF for dependency that is not AGPL licensed
4 participants