The output will look like the input but will contain an invisible OCR'd text layer positioned exactly where the text appears as an image. I guess this could be accomplished by taking the bounding boxes from Google's JSON output and then using something like R Markdown to recreate the PDFs (i.e., add the page image to a page, then overlay the OCR'd text in the right positions)?
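Something along these lines, perhaps. Here is a rough sketch of the overlay step in Python with reportlab (the file names, word list, and coordinates below are made up, and Document AI's normalized bounding boxes would first have to be scaled to page points):

```python
# Sketch: draw the page scan, then write each OCR'd word on top in
# text-render mode 3 (neither filled nor stroked), so the text is
# searchable and selectable but not visible.
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import A4

page_width, page_height = A4

# Hypothetical OCR output: word text plus bounding box in page points.
words = [
    ("Hello", 72, 720, 120, 735),
    ("world", 126, 720, 170, 735),
]

c = canvas.Canvas("output.pdf", pagesize=A4)

# The scanned page image as the visible background.
c.drawImage("page_1.png", 0, 0, width=page_width, height=page_height)

for text, x0, y0, x1, y1 in words:
    t = c.beginText()
    t.setTextRenderMode(3)           # 3 = invisible text
    t.setFont("Helvetica", y1 - y0)  # rough size from box height
    t.setTextOrigin(x0, y0)
    t.textOut(text)
    c.drawText(t)

c.showPage()
c.save()
```

A real implementation would also scale each word horizontally to match its box width, which is what makes the invisible layer line up with the printed text when you select it.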
Agreed. As a start, I reverse-engineered gcv2hocr, so I can convert .json files from Document AI to .hocr files, which in turn can be used to create PDFs with hocr-tools. It works fine for LTR languages, but I'm struggling with RTL scripts. Will publish when I have a robust solution.
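For illustration, the core of the JSON-to-hOCR step is just emitting `ocrx_word` spans with bbox coordinates inside an `ocr_page` div; roughly something like this toy sketch (the word list is made up, and the extraction of words and pixel boxes from a real Document AI response is omitted):

```python
# Toy sketch: turn a list of (text, x0, y0, x1, y1) words into minimal hOCR,
# which hocr-tools can then combine with the page image to produce a PDF.
from html import escape

def words_to_hocr(words, page_w, page_h):
    spans = "\n".join(
        f"    <span class='ocrx_word' title='bbox {x0} {y0} {x1} {y1}'>{escape(t)}</span>"
        for t, x0, y0, x1, y1 in words
    )
    return (
        "<html><body>\n"
        f"  <div class='ocr_page' title='bbox 0 0 {page_w} {page_h}'>\n"
        f"{spans}\n"
        "  </div>\n"
        "</body></html>"
    )

print(words_to_hocr([("Hello", 100, 100, 300, 150)], 2480, 3508))
```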
I will look more closely at the LaTeX approach when I have the time. The sticking point is whether TinyTeX can handle it. If not, it would require users to have a working LaTeX installation on the side, and the solution would no longer be R-native.
Input: A PDF
Output: A PDF
There is an example Python script here that essentially seems to do the job: https://shreevatsa.net/post/add-ocr-layer-to-pdf/