[FEATURE]: Make PDF Searchable: Overlay OCR'd text over Image? #7

Closed
dcaud opened this issue Jan 13, 2023 · 2 comments
@dcaud

dcaud commented Jan 13, 2023

Input: A PDF
Output: A PDF

The output would look identical to the input, but would carry invisible OCR'd text in the same position as the text-as-image. I guess this could be accomplished by taking the bounding boxes from Google's JSON output and using something like R Markdown to recreate the PDF (i.e., add the image to a page, then add the OCR'd text on top of it)?

There is an example Python script here that essentially seems to do the job: https://shreevatsa.net/post/add-ocr-layer-to-pdf/
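
To make the bounding-box idea concrete, here is a minimal sketch of the coordinate step only. It assumes each word's box arrives as a list of normalized vertices (dicts with `x` and `y` between 0 and 1, origin at the top-left of the page image), and that a missing key means 0; check this against the actual JSON you get back. The function name `bbox_to_pdf_points` is made up for illustration and is not taken from the linked script.

```python
from typing import Dict, List, Tuple


def bbox_to_pdf_points(
    normalized_vertices: List[Dict[str, float]],
    page_width_pt: float,
    page_height_pt: float,
) -> Tuple[float, float, float, float]:
    """Map normalized vertices (top-left origin, range 0..1) to a PDF
    rectangle (bottom-left origin, units in points) as (x0, y0, x1, y1)."""
    # Assumption: a vertex with a missing "x" or "y" key sits at 0 on that axis.
    xs = [v.get("x", 0.0) for v in normalized_vertices]
    ys = [v.get("y", 0.0) for v in normalized_vertices]

    x0 = min(xs) * page_width_pt
    x1 = max(xs) * page_width_pt
    # Flip the y axis: image coordinates grow downward, PDF coordinates upward.
    y0 = (1.0 - max(ys)) * page_height_pt
    y1 = (1.0 - min(ys)) * page_height_pt
    return (x0, y0, x1, y1)


if __name__ == "__main__":
    word_box = [
        {"x": 0.1, "y": 0.2},
        {"x": 0.5, "y": 0.2},
        {"x": 0.5, "y": 0.3},
        {"x": 0.1, "y": 0.3},
    ]
    # US Letter page: 612 x 792 points.
    print(bbox_to_pdf_points(word_box, 612.0, 792.0))
```

The resulting rectangle is where the invisible text for that word would be drawn over the page image; the drawing itself is left to whichever PDF tool ends up being used.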

@dcaud dcaud added the enhancement New feature or request label Jan 13, 2023
@Hegghammer
Owner

Agreed. As a start, I reverse-engineered gcv2hocr, so I can convert .json files from Document AI to .hocr files, which in turn can be used to create PDFs with hocr-tools. It works fine for LTR languages, but I'm struggling with RTL scripts. Will publish when I have a robust solution.
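
For readers unfamiliar with hOCR, the conversion step might look roughly like the sketch below. This is not the gcv2hocr-derived code mentioned above, only an illustration of the target format: each word becomes an `ocrx_word` span whose `title` holds a pixel `bbox`. The input shape (text plus normalized corner coordinates) and the function name `words_to_hocr` are assumptions, and a real converter would also group words into `ocr_line` and `ocr_par` elements and handle text direction.

```python
from html import escape
from typing import List, Tuple

# Hypothetical input shape: (text, x0, y0, x1, y1), coordinates normalized
# to 0..1 with the origin at the top-left of the page image.
Word = Tuple[str, float, float, float, float]


def words_to_hocr(words: List[Word], width_px: int, height_px: int) -> str:
    """Build a minimal single-page hOCR document from word boxes."""
    spans = []
    for i, (text, x0, y0, x1, y1) in enumerate(words, start=1):
        # hOCR bounding boxes are integer pixel coordinates: left top right bottom.
        left, top = round(x0 * width_px), round(y0 * height_px)
        right, bottom = round(x1 * width_px), round(y1 * height_px)
        spans.append(
            f"<span class='ocrx_word' id='word_1_{i}' "
            f"title='bbox {left} {top} {right} {bottom}'>{escape(text)}</span>"
        )
    body = "\n".join(spans)
    return (
        "<html><body>\n"
        f"<div class='ocr_page' id='page_1' title='bbox 0 0 {width_px} {height_px}'>\n"
        f"{body}\n"
        "</div>\n"
        "</body></html>"
    )


if __name__ == "__main__":
    print(words_to_hocr([("Hello", 0.1, 0.2, 0.5, 0.3)], 1000, 2000))
```

Simplified for brevity: words are emitted directly under the page element, so downstream tools that expect line-level structure may need more than this produces.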

@Hegghammer
Owner

I have developed a temporary solution for this, described here: https://dair.info/articles/searchable_pdfs.

I will look more closely at the LaTeX approach when I have the time. The sticking point is whether TinyTeX can handle it. If not, users would need a separate, working LaTeX installation, and the solution would no longer be R-native.
