[Feature]: Switch to remove images? #1127

pinballelectronica · 2023-07-13T12:59:42Z

Describe the proposed feature

Hi. I process a lot of PDFs for ML, but in many cases I don't want the images: they just make the files big and I do nothing with them. Is there already a way to OCR a PDF without keeping the images (photographs)? If not, do you know of a workflow where I could pre-process the PDF to strip the images before OCRing?

Thanks, this product is GREAT.

jbarlow83 · 2023-07-13T21:01:24Z

In the most recent versions you can use
ocrmypdf --optimize 0 --sidecar ocr_output.txt --output-type none in.pdf /dev/null
to disable generation of an output PDF and produce the OCR output as "sidecar" text. If the PDF has a mix of born-digital text and OCRable text, the sidecar contains only the OCR text. Presumably you're after the text anyway, so this may be sufficient.
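For batch use, the command above could be wrapped in a small script. The following is only a sketch: the helper names `sidecar_command` and `ocr_folder_to_text`, the one-`.txt`-per-PDF naming, and the folder layout are assumptions made for illustration, and it assumes `ocrmypdf` is on the PATH of a POSIX system (because of `/dev/null`).

```python
import subprocess
from pathlib import Path


def sidecar_command(pdf: Path, text_dir: Path) -> list[str]:
    """Build the text-only ocrmypdf command quoted in the comment above.

    The sidecar file name (<pdf stem>.txt inside text_dir) is a
    hypothetical convention chosen for this sketch.
    """
    sidecar = text_dir / (pdf.stem + ".txt")
    return [
        "ocrmypdf",
        "--optimize", "0",
        "--sidecar", str(sidecar),
        "--output-type", "none",
        str(pdf),
        "/dev/null",  # no output PDF is kept; assumes a POSIX system
    ]


def ocr_folder_to_text(pdf_dir: Path, text_dir: Path) -> list[Path]:
    """Run the command for every *.pdf in pdf_dir; return the PDFs that failed."""
    text_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        result = subprocess.run(sidecar_command(pdf, text_dir))
        if result.returncode != 0:
            failed.append(pdf)
    return failed
```

Calling `ocr_folder_to_text(Path("scans"), Path("text"))` would then leave one text file per input PDF in `text/` and return the list of inputs for which `ocrmypdf` exited with a non-zero status.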

I don't think it would be a good idea to discard any images after OCR is performed: if you later discover errors in the text recognition, you cannot recover the original or redo the OCR with a better engine. It's best to keep the images intact.

If you have a look at graft.py, that is where the original page is combined with the OCR. You could discard the original page and only include the text layer.

jbarlow83 · 2023-08-09T00:10:33Z

One can also use Ghostscript to delete all images from a PDF:

gs -q -sDEVICE=pdfwrite -dFILTERIMAGES=true -o output.pdf input.pdf
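The Ghostscript call above could be scripted in the same way. This is a minimal sketch, assuming the `gs` executable is on the PATH; the helper names `strip_images_command` and `strip_images` are hypothetical. Since `-dFILTERIMAGES` removes every image, running it on a scanned PDF before OCR would presumably leave nothing to recognize, so it likely belongs on files whose images are no longer needed.

```python
import subprocess
from pathlib import Path


def strip_images_command(src: Path, dst: Path) -> list[str]:
    """Build the Ghostscript command from the comment above."""
    return [
        "gs",
        "-q",
        "-sDEVICE=pdfwrite",
        "-dFILTERIMAGES=true",
        "-o", str(dst),
        str(src),
    ]


def strip_images(src: Path, dst: Path) -> None:
    """Write a copy of src without images to dst."""
    # Guard against pointing Ghostscript's output at its own input file.
    if src.resolve() == dst.resolve():
        raise ValueError("input and output must be different files")
    # check=True raises CalledProcessError if Ghostscript exits non-zero.
    subprocess.run(strip_images_command(src, dst), check=True)
```

For example, `strip_images(Path("input.pdf"), Path("output.pdf"))` runs the same command as the one-liner and raises an exception if Ghostscript fails.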


pinballelectronica added the enhancement label Jul 13, 2023

pinballelectronica assigned jbarlow83 Jul 13, 2023

jbarlow83 closed this as completed Aug 9, 2023
