Improve efficiency on PDFs which contain large amounts of text #29

Closed
gabriel-v opened this issue May 13, 2022 · 5 comments


gabriel-v commented May 13, 2022

If a PDF contains a lot of text and only a few pictures, we only want to OCR the pictures. The script currently OCRs whole pages, including any existing text, which is undesirable: it wastes CPU time and degrades the existing text.

I want to implement a change (probably optional, enabled by a flag) to run OCR only on the images, not on any existing text. I would split the images from the PDF using pdfimages, and then somehow re-create the layer sandwich using only the OCR output generated for those images. The original text in the files should be left untouched.

Do you have any pointers on doing this? I have a couple of ideas I want to investigate:

  • process everything page by page
    • edit the PDF to make all text invisible (same color as background)
    • run pipeline as it is -- OCR should be faster, since most of the page is blank
    • recombine original PDF text & image layers with the new OCR layer overlay (still page by page)
    • still inefficient -- OCR needs to scan through a lot of empty pages
  • process everything image by image
    • run pdfimages to extract images from PDF (along with page number, img size and coordinates)
    • maybe use pdf2html to get image location & position
    • create PDF sandwiches for each image separately (using pdf2pdfocr, of course)
    • re-combine them in the original PDF using pdfjam and pdftk
    • more efficient -- we don't give blank images to the OCR engine
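
The first step of the image-by-image idea above can be sketched in Python. This is only an illustration, assuming poppler's pdfimages is installed and on the PATH; the helper names parse_pdfimages_list and list_images are hypothetical and are not part of pdf2pdfocr. As far as I know, `pdfimages -list` reports the page number and pixel size of each image but not its position on the page, so placement would still need another tool.

```python
import subprocess
from collections import defaultdict


def parse_pdfimages_list(listing):
    """Map page number -> list of (width, height) for each image row.

    Expects the table printed by `pdfimages -list`, whose first columns
    are: page, num, type, width, height.
    """
    images = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        # Data rows start with a numeric page number; the header row and
        # the dashed separator line do not, so they are skipped here.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        # Keep only real images; mask rows (e.g. "smask") are ignored.
        if fields[2] != "image":
            continue
        page, width, height = int(fields[0]), int(fields[3]), int(fields[4])
        images[page].append((width, height))
    return dict(images)


def list_images(pdf_path):
    """Run `pdfimages -list` on a file and parse its output."""
    completed = subprocess.run(
        ["pdfimages", "-list", pdf_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_pdfimages_list(completed.stdout)
```

Pages missing from the returned dictionary contain no images, so under this approach they would never be handed to the OCR engine.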
@LeoFCardoso
Owner

Hello, @gabriel-v.
Great use case. Thank you.
For now, I think it would be simpler if the script removed all existing text before starting OCR.
I'll check on this.

@LeoFCardoso
Owner

I think a good starting point is here: https://stackoverflow.com/questions/24322338/remove-all-text-from-pdf-file

input.pdf

I tested manually with a 162-page text PDF with a single image on page 1 (see attached file).

PDF with text removed (first page with image and 161 blank pages)
[2022-05-13 09:09:22.666163] [LOG] Success in 71.173 seconds!

Normal PDF (162 pages with text)
[2022-05-13 09:21:12.134621] [LOG] Success in 607.781 seconds!

My first conclusion is: tesseract is fast with blank pages.
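
A text-stripping step like the one tested above could be sketched as follows. This assumes Ghostscript is installed as `gs` and is recent enough to support the `-dFILTERTEXT` switch (version 9.23 or later, as I understand it); the function names are hypothetical, and this is not necessarily how pdf2pdfocr does it.

```python
import subprocess


def build_strip_text_command(src, dst, gs_binary="gs"):
    """Build a Ghostscript command that rewrites src without its text.

    -o sets the output file (and implies batch, no-pause mode),
    -sDEVICE=pdfwrite produces a PDF, and -dFILTERTEXT drops text objects
    while leaving images and vector graphics in place.
    """
    return [gs_binary, "-o", dst, "-sDEVICE=pdfwrite", "-dFILTERTEXT", src]


def strip_text(src, dst):
    """Write a copy of src to dst with all text removed."""
    subprocess.run(build_strip_text_command(src, dst), check=True)
```

The OCR pipeline would then run on the stripped copy, and the resulting OCR layer would be merged back over the untouched original.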

Maybe we can optimize even further by detecting blank pages and not calling tesseract for them at all. The method "do_check_img_greyscale" can be used as an example.
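
A blank-page check could look roughly like this. It is a minimal sketch using only the standard library, assuming the page has already been rendered to 8-bit greyscale pixels; the function name is_blank and both thresholds are made up for illustration and are not taken from do_check_img_greyscale.

```python
def is_blank(gray_pixels, white_threshold=250, max_ink_fraction=0.0005):
    """Return True if almost every pixel is close to white.

    gray_pixels is a bytes-like object with one 0-255 value per pixel.
    A pixel counts as "ink" when it is darker than white_threshold; the
    page counts as blank when the share of ink pixels does not exceed
    max_ink_fraction, which tolerates a few specks of scanner noise.
    """
    if not gray_pixels:
        return True
    ink = sum(1 for value in gray_pixels if value < white_threshold)
    return ink / len(gray_pixels) <= max_ink_fraction
```

With Pillow available (an assumption), the pixels could come from `Image.open(path).convert("L").tobytes()`, and pages for which is_blank returns True would skip the tesseract call.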

This use case is interesting. I'll code this! :-)

@LeoFCardoso
Owner

Please let me know if the last commit works for you.

@LeoFCardoso
Owner

[2022-05-15 09:24:48.847586] [LOG] Success in 16.423 seconds!

@gabriel-v
Author

Yes, with the new flag I see a roughly 8x speedup on documents with a moderate amount of text and few images (30 pages in my test; see the timings below).

Thank you for the feature!

Testing on this document with 30 pages of text: https://raw.githubusercontent.com/liquidinvestigations/hoover-testdata/master/data/disk-files/pdf-doc-txt/stanley.ec02.pdf

time pdf2pdfocr.py -v -a -l eng -x '--oem 1 --psm 1' -j 0.01 -i document.pdf
real    1m53.855s
user    1m48.059s

time pdf2pdfocr.py -v -a -l eng -x '--oem 1 --psm 1' -j 0.01 --ignore-existing-text -i document.pdf
real    0m13.981s
user    0m12.642s
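
For reference, the speedup implied by the two `real` values above can be worked out with a few lines of Python. The parse_time helper is a hypothetical name, and it only handles the `XmY.YYYs` form that bash's `time` prints.

```python
import re


def parse_time(value):
    """Convert a bash `time` value such as '1m53.855s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unrecognised time value: {value!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


baseline = parse_time("1m53.855s")   # without --ignore-existing-text
with_flag = parse_time("0m13.981s")  # with --ignore-existing-text
speedup = baseline / with_flag
print(f"{speedup:.1f}x")  # prints 8.1x
```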

Thanks again!
