refactor: Switched from PyMuPDF to pypdfium2 #829

fg-mindee · 2022-02-21T16:33:37Z

This PR introduces the following modifications:

switched the PDF backend from PyMuPDF to pypdfium2
deprecated annotation reading (not supported by pypdfium2 for now)
updated unittests & documentation
only uses PyMuPDF for PDF mocking in the unittests (doesn't impact the license for users)

Closes #486

Any feedback is welcome!

charlesmindee

Thanks for that! I wanted to do it quickly but I am running out of time 😅
Are you sure keeping pymupdf for the tests is not an issue?

fg-mindee · 2022-02-23T17:44:57Z

> Thanks for that! I wanted to do it quickly but I am running out of time 😅 Are you sure keeping pymupdf for the tests is not an issue?

For pypi downloads, I'm positive since it doesn't ship the test folder. But just to make sure that the entire codebase is rid of this, I changed it to Pillow + docTR synthesize function to mock PDFs 👍

codecov · 2022-02-23T17:52:59Z

Codecov Report

Merging #829 (f7f585f) into main (41237e9) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #829      +/-   ##
==========================================
- Coverage   95.97%   95.95%   -0.02%     
==========================================
  Files         131      131              
  Lines        4988     4993       +5     
==========================================
+ Hits         4787     4791       +4     
- Misses        201      202       +1

Flag	Coverage Δ
unittests	`95.95% <100.00%> (-0.02%)`	⬇️

Impacted Files	Coverage Δ
doctr/io/pdf.py	`100.00% <100.00%> (+1.66%)`	⬆️
doctr/io/reader.py	`100.00% <100.00%> (ø)`
doctr/models/_utils.py	`97.59% <0.00%> (-1.21%)`	⬇️
doctr/models/detection/linknet/tensorflow.py	`97.67% <0.00%> (-1.00%)`	⬇️
doctr/models/detection/linknet/pytorch.py	`97.97% <0.00%> (-0.89%)`	⬇️
doctr/models/classification/resnet/pytorch.py	`100.00% <0.00%> (ø)`
doctr/models/classification/resnet/tensorflow.py	`100.00% <0.00%> (ø)`
doctr/models/classification/magc_resnet/pytorch.py	`100.00% <0.00%> (ø)`
doctr/models/recognition/sar/pytorch.py	`99.20% <0.00%> (+0.06%)`	⬆️
doctr/models/utils/pytorch.py	`100.00% <0.00%> (+5.00%)`	⬆️

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 41237e9...f7f585f.

fg-mindee · 2022-02-23T17:53:57Z

Also it seems that pypdfium2 isn't supported by some versions of Python (cf. docker job) 🤔 But we'll handle this in another PR

charlesmindee

Thanks! It is cool to get rid of pymupdf

charlesmindee

thanks!

github-actions · 2022-02-24T16:27:22Z

Hey @fg-mindee 👋
You merged this PR, but it is not correctly labeled. The list of valid labels is available at https://github.com/mindee/doctr/blob/main/.github/verify_pr_labels.py

mara004 · 2022-05-18T13:34:46Z

I see that text extraction and image localisation code was removed with this change.

PDFium provides capabilities to extract text from PDFs, but I don't have a support model for it yet in pypdfium2. It's a feature that has already been requested several times, and I hope to work on it soon. The pdf_text_page example from pdfbrain¹ should give a pretty good idea on how to do it.

I'm not sure if PDFium can locate images, but will look it up.

¹ This is a collection of helper classes for interfacing with the PDFium API using ctypes, but it still uses the old pypdfium, the predecessor of pypdfium2.

frgfm · 2022-05-18T13:37:12Z

Thanks a lot @mara004, we're quite excited about how thorough you are in the development of this project :)

mara004 · 2022-05-19T20:50:46Z

I'm currently working on the new text extraction helper class in pypdfium2-team/pypdfium2#110. If you have any suggestions/requests about the API, please let me know.
Concerning images, I've browsed the pdfium code in fpdf_edit.h and found out that locating image objects would be possible using the functions FPDFPage_CountObjects(), FPDFPage_GetObject(), FPDFPageObj_GetType() and FPDFPageObj_GetBounds(). I've also created a support model that is part of the same pull request.
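The object iteration described above could be sketched roughly as follows. This is only an illustration, not the support model from the pull request: `locate_images`, `api` and `page_handle` are hypothetical names, and `api` stands for whichever module exposes the raw PDFium bindings under their C names (where that module lives depends on the pypdfium2 version, so it is left as a parameter).

```python
import ctypes

FPDF_PAGEOBJ_IMAGE = 3  # value defined in PDFium's fpdf_edit.h


def locate_images(api, page_handle):
    """Collect (left, bottom, right, top) bounds of the image objects on a page.

    Assumptions: `api` exposes the four raw PDFium functions under their C
    names, and `page_handle` is a loaded FPDF_PAGE handle.
    """
    boxes = []
    for i in range(api.FPDFPage_CountObjects(page_handle)):
        obj = api.FPDFPage_GetObject(page_handle, i)
        if api.FPDFPageObj_GetType(obj) != FPDF_PAGEOBJ_IMAGE:
            continue
        # FPDFPageObj_GetBounds() writes its results through four float pointers
        left, bottom, right, top = (ctypes.c_float() for _ in range(4))
        ok = api.FPDFPageObj_GetBounds(
            obj,
            ctypes.pointer(left),
            ctypes.pointer(bottom),
            ctypes.pointer(right),
            ctypes.pointer(top),
        )
        if ok:
            boxes.append((left.value, bottom.value, right.value, top.value))
    return boxes
```

Objects whose bounds cannot be read are skipped rather than reported with zeroed coordinates.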

mara004 · 2022-05-20T14:43:21Z

I have just merged the new support models into main: pypdfium2-team/pypdfium2@bbc2438

frgfm · 2022-05-22T20:45:28Z

@mara004 that's great! Do you have any minimal snippet example please? So that I can integrate the feature into docTR 😁

mara004 · 2022-05-22T21:17:14Z

I guess something like this:

Images

import pypdfium2 as pdfium

# Creates a list of image bounding boxes of left, right, bottom and top in PDF canvas units
# (`filepath` is the PDF path, `index` the zero-based page number)
pdf = pdfium.PdfDocument(filepath)
page = pdf.get_page(index)
images = []
for obj in page.get_objects():
    if obj.get_type() == pdfium.FPDF_PAGEOBJ_IMAGE:
        images.append(obj.get_pos())
# Release the page before the document that owns it
for g in (page, pdf):
    g.close()

Text

import pypdfium2 as pdfium

# Creates a list of pairs of bounding boxes and their text content
# (`filepath` is the PDF path, `index` the zero-based page number)
pdf = pdfium.PdfDocument(filepath)
page = pdf.get_page(index)
textpage = page.get_textpage()
text_boxes = []
for bbox in textpage.get_rectboxes():
    text_boxes.append((bbox, textpage.get_text(*bbox)))
# Release objects in reverse order of creation
for g in (textpage, page, pdf):
    g.close()
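Since the boxes above are in PDF canvas units, an embedder would probably need to convert them before mixing them with image-space results. Here is a minimal sketch, assuming each box is ordered (left, bottom, right, top) with the origin at the bottom-left of the page, and that the target is relative coordinates with a top-left origin. `to_relative_box` is a hypothetical helper, not part of pypdfium2 or docTR, and the page size must be supplied by the caller.

```python
def to_relative_box(bbox, page_width, page_height):
    """Map a PDF-canvas box to relative (xmin, ymin, xmax, ymax), top-left origin.

    Assumes `bbox` is (left, bottom, right, top) in the same units as the
    page size, with the y axis pointing upwards as on the PDF canvas.
    """
    left, bottom, right, top = bbox
    return (
        left / page_width,
        1 - top / page_height,     # the top edge becomes the smallest y
        right / page_width,
        1 - bottom / page_height,  # the bottom edge becomes the largest y
    )
```

For example, `to_relative_box((50, 25, 150, 75), 200, 100)` maps a box centred on a 200 x 100 page to `(0.25, 0.25, 0.75, 0.75)`.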

mara004 · 2022-05-24T14:48:24Z

FYI, I made a few more changes to the text API in pypdfium2-team/pypdfium2@3ac10be and adapted the above example.

mara004 · 2022-05-25T16:22:18Z

I just released pypdfium2 1.10.0, which contains the new support models.

frgfm · 2022-05-25T19:28:35Z

Can we get people like you as "customer service" for all OSS libraries haha? 😁
I'll check how to integrate that soon, thanks 🙏

mara004 · 2022-05-25T19:37:12Z

I'm glad to help OSS projects as far as my limited knowledge permits :).
Seriously, in the end I think pypdfium2 can only benefit if it gets used more widely, so I aim to make it fit for the needs of embedders. I'm personally working on a python project which depends on PDF rendering quite a lot, so I thought it might be important to consolidate pypdfium2 a bit.

mara004 · 2022-06-08T17:17:37Z

The API rewrite I mentioned in the other thread should be finished now, and I plan to release version 2.0.0 of pypdfium2 soon (there's a pre-release already). I have updated the above examples again and will submit a PR for the rendering code.

fg-mindee added 11 commits February 21, 2022 16:46

chore: Updated PDF lib

dd170ae

refactor: Refactored PDF parsing

5488bbb

test: Updated unittests

7cc8198

docs: Updated instructions

fce7d9f

refactor: Switched to another PDF backend

c361bbf

docs: Updated documentation

62250db

fix: Fixed demo

e801428

refactor: Removed legacy imports

5f4099b

style: Updated mypy config

498f224

fix: Fixed read_pdf

2b57500

chore: Updated deps

932f5e9

fg-mindee added the topic: documentation (Improvements or additions to documentation), critical (High priority), module: io (Related to doctr.io), ext: tests (Related to tests folder), type: breaking change (Introducing a breaking change), and ext: docs (Related to docs folder) labels Feb 21, 2022

fg-mindee added this to the 0.5.1 milestone Feb 21, 2022

fg-mindee requested review from fharper and charlesmindee February 21, 2022 16:33

fg-mindee self-assigned this Feb 21, 2022

charlesmindee previously approved these changes Feb 22, 2022

fg-mindee added 5 commits February 23, 2022 18:17

test: Fixed unittests

caa9a86

fix: Fixed analysis script

2477e59

chore: Fixed requirements

ae471cf

test: Removed PyMuPDF from unittests

e535e4b

chore: Removed PyMuPDF

51523fe

fg-mindee dismissed charlesmindee’s stale review via 51523fe February 23, 2022 17:33

fg-mindee requested a review from charlesmindee February 23, 2022 17:45

test: Fixed unittest

0506938

charlesmindee previously approved these changes Feb 24, 2022

charlesmindee mentioned this pull request Feb 24, 2022

[conda] Unable to make a conda build #113

Closed

fix: Fixed Dockerfile

f7f585f

fg-mindee dismissed charlesmindee’s stale review via f7f585f February 24, 2022 15:38

fg-mindee requested a review from charlesmindee February 24, 2022 15:46

charlesmindee approved these changes Feb 24, 2022

fg-mindee merged commit 2581daa into main Feb 24, 2022

fg-mindee deleted the pdf branch February 24, 2022 16:27

fg-mindee added the type: misc (Miscellaneous) label Feb 25, 2022

fg-mindee mentioned this pull request Mar 10, 2022

Brew installed packages are not found in Python #813

Closed

frgfm mentioned this pull request May 7, 2022

Add information on PR labelling process #907

Closed

frgfm mentioned this pull request May 31, 2022

[documents] Benchmark PDF document reading + numpy conversion options #23

Closed

mara004 mentioned this pull request Jun 8, 2022

Update io/pdf.py to new pypdfium2 API #944

Merged
