Hi. I process a lot of PDFs for ML, but in many cases I don't want the images; they just make the files big and I do nothing with them. Is there already a way to OCR a PDF without keeping the images (photographs)? If not, do you know of a workflow where I could pre-process the PDF to strip the images before OCR'ing?
Thanks, this product is GREAT.
In the most recent versions you can run
ocrmypdf --optimize 0 --sidecar ocr_output.txt --output-type none in.pdf /dev/null
to disable generation of an output PDF and produce the OCR output as "sidecar" text. If the PDF has a mix of born-digital text and OCRable text, the sidecar contains only the OCR text. Presumably you're after the text anyway, so this may be sufficient.
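Since you mention processing a lot of PDFs, here is a rough sketch of how that command could be wrapped for a whole folder. It only shells out to the same ocrmypdf invocation shown above; the function names `build_command` and `ocr_folder_to_text` and the folder layout are my own assumptions for illustration, not part of ocrmypdf.

```python
import os
import subprocess
from pathlib import Path


def build_command(pdf_path, sidecar_path):
    """Return the ocrmypdf argument list for a text-only (sidecar) run.

    Hypothetical helper: it just assembles the command quoted above.
    """
    return [
        "ocrmypdf",
        "--optimize", "0",
        "--sidecar", str(sidecar_path),
        "--output-type", "none",
        str(pdf_path),
        os.devnull,  # no PDF is kept; this is "/dev/null" on POSIX systems
    ]


def ocr_folder_to_text(input_dir, output_dir):
    """Run ocrmypdf on every *.pdf in input_dir, writing one .txt per PDF."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(Path(input_dir).glob("*.pdf")):
        sidecar_path = output_dir / (pdf_path.stem + ".txt")
        # check=True raises CalledProcessError if ocrmypdf exits non-zero
        subprocess.run(build_command(pdf_path, sidecar_path), check=True)
```

Calling `ocr_folder_to_text("pdfs", "text")` would then leave one text file per input PDF in `text/`, assuming ocrmypdf is on your PATH and is recent enough to accept `--output-type none`.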
I don't think it would be a good idea to discard any images after OCR is performed: if you later discover errors in the text recognition, you cannot recover the original or redo the OCR with better software. It's best to keep the images intact.
If you have a look at graft.py, that is where the original page is combined with the OCR text layer. You could discard the original page there and include only the text layer.