Improve efficiency on PDFs which contain large amounts of text #29
Comments
Hello, @gabriel-v.
I think a good starting point is here: https://stackoverflow.com/questions/24322338/remove-all-text-from-pdf-file
I tested manually with a 162-page text PDF that has a single image on page 1 (see attached file):
No text in PDF (first page with image and 161 blank pages)
Normal PDF (162 pages with text)
My first conclusion is that tesseract is fast on blank pages. Maybe we can optimize even further by detecting blank pages and not calling tesseract for them at all. The method "do_check_img_greyscale" can be used as an example.
This use case is interesting. I'll code this! :-)
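A rough sketch of the blank-page idea, assuming each page has already been rasterized to 8-bit grayscale bytes. The names is_blank_page and pages_needing_ocr and both thresholds are hypothetical and not taken from the script:

```python
from typing import List


def is_blank_page(gray_pixels: bytes, white_level: int = 245,
                  max_ink_ratio: float = 0.0005) -> bool:
    """Return True when almost every pixel is at least `white_level` bright.

    `gray_pixels` holds one byte per pixel (0 = black, 255 = white).
    Both thresholds are guesses and would need tuning on real scans.
    """
    if not gray_pixels:
        return True  # nothing was rendered, so treat the page as blank
    ink = sum(1 for value in gray_pixels if value < white_level)
    return ink / len(gray_pixels) <= max_ink_ratio


def pages_needing_ocr(pages: List[bytes]) -> List[int]:
    """Return the 1-based numbers of the pages that are not blank."""
    return [number for number, pixels in enumerate(pages, start=1)
            if not is_blank_page(pixels)]
```

Only the page numbers returned by pages_needing_ocr would then be handed to tesseract. The small ink tolerance is there so that a few specks of scanner noise do not force a full OCR pass.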
Please let me know if the last commit works for you.
Yes, I see a 9x speed improvement with the new flag for documents with a moderate amount of text and few images (30 pages). Thank you for the feature! I tested on this document with 30 pages of text: https://raw.githubusercontent.com/liquidinvestigations/hoover-testdata/master/data/disk-files/pdf-doc-txt/stanley.ec02.pdf
Thanks again!
If a PDF contains a large amount of text and a small number of pictures, we only want to OCR the pictures. The script currently OCRs whole pages, including any existing text, which is undesirable because it wastes CPU time and degrades the existing text.
I want to implement a change (probably optional, enabled by a flag) to run OCR only on the images, not on any existing text. I would split the images away from the PDF using pdfimages, and then somehow re-create the layer sandwich using only the OCR generated for those images. The original text inside the files should be left untouched.
Do you have any pointers on doing this? I have a couple of ideas I want to investigate:
pdfimages to extract images from the PDF (along with page number, image size and coordinates)
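As a possible first step for the pdfimages idea, here is a minimal sketch that finds which pages contain images at all, so that text-only pages can skip OCR. It assumes poppler's `pdfimages -list` prints a table whose first five columns are page, num, type, width and height. The function names are hypothetical. As far as I can tell the listing does not include the position of an image on the page, so coordinates would have to come from somewhere else.

```python
import subprocess
from typing import Dict, List


def parse_pdfimages_list(listing: str) -> Dict[int, List[dict]]:
    """Group the rows of `pdfimages -list` output by page number."""
    images: Dict[int, List[dict]] = {}
    for line in listing.splitlines():
        fields = line.split()
        # Data rows start with the page number; the header line and the
        # dashed rule below it do not, so they are skipped here.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        page, _num, kind, width, height = fields[:5]
        images.setdefault(int(page), []).append(
            {"type": kind, "width": int(width), "height": int(height)}
        )
    return images


def list_images(pdf_path: str) -> Dict[int, List[dict]]:
    """Run `pdfimages -list` (needs poppler-utils installed) and parse it."""
    result = subprocess.run(
        ["pdfimages", "-list", pdf_path],
        check=True, capture_output=True, text=True,
    )
    return parse_pdfimages_list(result.stdout)
```

Pages missing from the returned dictionary have no embedded images, so when they already carry text they could be copied through unchanged and only the remaining pages sent to OCR.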