[FEATURE]: Make PDF Searchable: Overlay OCR'd text over Image? #7

dcaud · 2023-01-13T23:03:33Z

Input: A PDF
Output: A PDF

The output should look identical to the input but carry invisible OCR'd text in the same position as the text-as-image. I guess this could be done by taking the bounding boxes from Google's JSON output and using something like R Markdown to rebuild the PDF (i.e., add the page image to a page, then add the OCR'd text on top of it)?

There is an example Python script here that essentially seems to do the job: https://shreevatsa.net/post/add-ocr-layer-to-pdf/
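To make the overlay idea concrete, here is a minimal Python sketch of the coordinate step only. It assumes the bounding boxes arrive as normalized vertices (values from 0 to 1, origin at the top-left of the page), which is how I understand Google's JSON to report them; the function names are hypothetical and not taken from the linked script. It maps a box into PDF points (origin at the bottom-left) and emits a content-stream fragment that uses text rendering mode 3, which PDF defines as invisible text. Embedding the fragment in a real page (fonts, image placement, horizontal scaling to match the word width) is left out.

```python
def bbox_to_pdf_rect(vertices, page_width_pt, page_height_pt):
    """Map normalized, top-left-origin vertices to a PDF rectangle.

    Returns (x_left, y_bottom, width, height) in points, measured from
    the bottom-left corner of the page as PDF expects.
    Missing "x"/"y" keys are treated as 0.0 (a defensive assumption).
    """
    xs = [v.get("x", 0.0) for v in vertices]
    ys = [v.get("y", 0.0) for v in vertices]
    x_left = min(xs) * page_width_pt
    # Flip the vertical axis: the largest normalized y is the box's bottom edge.
    y_bottom = (1.0 - max(ys)) * page_height_pt
    width = (max(xs) - min(xs)) * page_width_pt
    height = (max(ys) - min(ys)) * page_height_pt
    return x_left, y_bottom, width, height


def escape_pdf_string(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(text, vertices, page_width_pt, page_height_pt, font="/F1"):
    """Build content-stream operators that place `text` invisibly over its box.

    `font` is a hypothetical resource name; the page must define it.
    The font size is approximated by the box height.
    """
    x, y, _width, height = bbox_to_pdf_rect(vertices, page_width_pt, page_height_pt)
    size = max(height, 1.0)
    return (
        f"BT 3 Tr {font} {size:.2f} Tf "
        f"{x:.2f} {y:.2f} Td ({escape_pdf_string(text)}) Tj ET"
    )


# Example: one word on a 600 x 800 pt page.
word_box = [
    {"x": 0.1, "y": 0.2},
    {"x": 0.5, "y": 0.2},
    {"x": 0.5, "y": 0.25},
    {"x": 0.1, "y": 0.25},
]
ops = invisible_text_ops("Hello (world)", word_box, 600, 800)
```

The `3 Tr` operator is what keeps the text selectable and searchable without painting it, so the page still looks like the original scan.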

Hegghammer · 2023-01-24T01:25:49Z

Agreed. As a start, I reverse-engineered gcv2hocr, so I can convert .json files from Document AI to .hocr files, which in turn can be used to create PDFs with hocr-tools. It works fine for LTR languages, but I'm struggling with RTL scripts. Will publish when I have a robust solution.
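For readers unfamiliar with hOCR, the conversion described above boils down to turning each recognized word and its bounding box into an HTML element whose `title` attribute carries a pixel `bbox`. The Python sketch below illustrates that single step. It is not the gcv2hocr logic or the code used in this package; the function names, the `word_id` scheme, and the assumption that the input boxes are normalized vertices with a top-left origin are all mine. It handles only the plain LTR case and does nothing for the RTL problem mentioned above.

```python
from html import escape


def normalized_to_pixel_bbox(vertices, page_width_px, page_height_px):
    """Convert normalized vertices (0..1) to an integer pixel box.

    Returns (x0, y0, x1, y1) with a top-left origin, the layout hOCR
    uses in its `bbox` property. Missing keys default to 0.0.
    """
    xs = [v.get("x", 0.0) for v in vertices]
    ys = [v.get("y", 0.0) for v in vertices]
    return (
        round(min(xs) * page_width_px),
        round(min(ys) * page_height_px),
        round(max(xs) * page_width_px),
        round(max(ys) * page_height_px),
    )


def hocr_word(text, vertices, page_width_px, page_height_px, word_id):
    """Render one word as an hOCR `ocrx_word` span.

    `word_id` is a hypothetical identifier chosen by the caller.
    """
    x0, y0, x1, y1 = normalized_to_pixel_bbox(vertices, page_width_px, page_height_px)
    return (
        f"<span class='ocrx_word' id='{word_id}' "
        f"title='bbox {x0} {y0} {x1} {y1}'>{escape(text)}</span>"
    )


# Example: a word on a page image of 1000 x 2000 pixels.
box = [
    {"x": 0.1, "y": 0.2},
    {"x": 0.5, "y": 0.2},
    {"x": 0.5, "y": 0.25},
    {"x": 0.1, "y": 0.25},
]
span = hocr_word("R&D", box, 1000, 2000, "word_1_1")
```

A full hOCR file would wrap such spans in `ocr_line`, `ocr_par`, and `ocr_page` elements before handing the result to hocr-tools.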

Hegghammer · 2023-08-31T21:23:45Z

I have developed a temporary solution for this, described here: https://dair.info/articles/searchable_pdfs.

I will look more closely at the LaTeX approach when I have the time. The sticking point is whether TinyTeX can handle it. If not, users would need a separate working LaTeX installation, and the solution would no longer be R-native.

dcaud added the "enhancement" (New feature or request) label Jan 13, 2023

dcaud assigned Hegghammer Jan 13, 2023

Hegghammer closed this as completed Aug 31, 2023
