[Feature]: Switch to remove images? #1127

Closed
pinballelectronica opened this issue Jul 13, 2023 · 2 comments

@pinballelectronica

Describe the proposed feature

Hi. I process a lot of PDFs for ML, but in many cases I don't want the images: they just make the files big and I do nothing with them. Is there already a way to OCR a PDF without keeping the images (photographs)? If not, do you know of a workflow where I could pre-process the PDF to strip the images prior to OCR'ing?

Thanks, this product is GREAT.

@jbarlow83
Collaborator

In the most recent versions you can use
ocrmypdf --optimize 0 --sidecar ocr_output.txt --output-type none in.pdf /dev/null
to disable generation of an output PDF and produce the OCR output as "sidecar" text. If the PDF has a mix of born-digital text and OCRable text, the sidecar contains only the OCR text. Presumably you're after the text anyway, so this may be sufficient.
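For a whole folder of PDFs, the same command can be wrapped in a small loop. This is only a sketch, not part of OCRmyPDF: the function name ocr_sidecar_all, the directory arguments, and the OCRMYPDF override variable are all hypothetical names chosen for illustration. It assumes a version of ocrmypdf recent enough to accept --output-type none.

```shell
#!/bin/sh
# Sketch: write one sidecar text file per PDF, producing no output PDFs.
# ocr_sidecar_all and the OCRMYPDF variable are illustrative names only.
ocr_sidecar_all() {
    in_dir=$1    # directory containing the input PDFs
    out_dir=$2   # directory that receives the .txt sidecar files
    runner=${OCRMYPDF:-ocrmypdf}   # override to dry-run or to pick a binary

    mkdir -p "$out_dir" || return 1

    for pdf in "$in_dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory has no PDFs.
        [ -e "$pdf" ] || continue
        name=$(basename "$pdf" .pdf)
        # Same flags as the command above; only the paths vary.
        "$runner" --optimize 0 --sidecar "$out_dir/$name.txt" \
            --output-type none "$pdf" /dev/null \
            || echo "failed: $pdf" >&2
    done
}
```

Calling ocr_sidecar_all ./pdfs ./text would leave the input PDFs untouched and write ./text/NAME.txt for each one. Failures are reported on stderr and the loop moves on to the next file.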

I don't think it would be a good idea to discard any images after OCR is performed: if you later discover errors in the text recognition, you cannot recover the originals or rerun OCR with better software. It's best to keep the images intact.

If you have a look at graft.py, that is where the original page is combined with the OCR text layer. You could modify it to discard the original page and include only the text layer.

@jbarlow83
Collaborator

One can also use Ghostscript to delete all images from a PDF:

gs -q -sDEVICE=pdfwrite -dFILTERIMAGE -o output.pdf input.pdf
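
A small wrapper can make it harder to destroy the only copy of a file by accident when running this. The sketch below is an illustration, not a tested workflow: strip_images and the GS override variable are hypothetical names, and it uses the switch as Ghostscript's documentation spells it, -dFILTERIMAGE. It refuses to run when the output path is written identically to the input path. Note that on a scanned PDF the page images are what OCR reads, so stripping them before OCR would leave nothing to recognize.

```shell
#!/bin/sh
# Sketch: write an image-free copy of a PDF and leave the original alone.
# strip_images and the GS variable are illustrative names only.
strip_images() {
    src=$1   # original PDF, never modified
    dst=$2   # image-free copy to create
    gs_bin=${GS:-gs}   # override to dry-run or to pick a binary

    # Plain string comparison; it does not catch two spellings of one path.
    if [ "$src" = "$dst" ]; then
        echo "refusing to overwrite input: $src" >&2
        return 1
    fi

    # -dFILTERIMAGE makes Ghostscript ignore bitmap images in the input.
    "$gs_bin" -q -sDEVICE=pdfwrite -dFILTERIMAGE -o "$dst" "$src"
}
```

For example, strip_images scan.pdf scan-noimages.pdf keeps scan.pdf for any later re-OCR and produces the smaller copy alongside it.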

