Spaces in Japanese #1041

KajiyaOokami · 2022-11-29T16:25:27Z

Hi all!
Is it possible to run OCR so that all spaces are ignored in the output? Languages like Japanese do not use spaces (not even after commas or periods), but OCRmyPDF currently seems to find spaces between almost every character. This is very problematic when you want to search for sentences or words in the document, or paste parts of it into Google Translate.
Thank you in advance!

KajiyaOokami · 2022-11-29T16:28:19Z

Oh, I forgot to give an example! What should be copy-pasted from the output PDF as:

"美学や社会学の分野では",

Appears instead as:

"美学や社会学の分野では".

And it is like that for the whole document.

KajiyaOokami · 2022-12-01T08:47:08Z

I have no programming knowledge at all, but after researching a little more, I believe the problem arises not when the text is extracted from the input PDF, but when it is laid onto the output PDF. I think so because if, for instance, the text is extracted from the input PDF and displayed in the terminal instead, there are no spaces.
I suspect the issue is related to character box sizes, particularly if the program applies Roman character box sizes to non-Roman characters. If so, the text layer laid over Japanese text would consist of thin, widely spaced boxes (which the program would interpret as many spaces) rather than the compact, wider boxes it should have.

From a source: "In principle, characters in Japanese composition are designed in a square box and positioned without spaces, i.e. solid setting."
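Until the text layer itself is fixed, one possible workaround is to clean the text after copying or extracting it. The following is only a minimal sketch in Python, not part of OCRmyPDF or Tesseract: the function name `strip_cjk_spaces` and the chosen Unicode ranges are my own assumptions, and it simply deletes whitespace that sits between two Japanese/CJK characters while leaving spaces next to Latin text alone.

```python
import re

# Assumed character ranges (illustrative, not exhaustive):
#   U+3001-U+303F  CJK punctuation (U+3000, the ideographic space, is excluded)
#   U+3040-U+309F  Hiragana
#   U+30A0-U+30FF  Katakana
#   U+4E00-U+9FFF  CJK Unified Ideographs (kanji)
#   U+FF00-U+FFEF  Fullwidth and halfwidth forms
_CJK = r"\u3001-\u303F\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF\uFF00-\uFFEF"

# Whitespace (ASCII space, tab, ideographic space) with a CJK character
# immediately before and immediately after it.
_SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{_CJK}])[ \t\u3000]+(?=[{_CJK}])")


def strip_cjk_spaces(text: str) -> str:
    """Remove whitespace between two CJK characters (hypothetical helper)."""
    return _SPACE_BETWEEN_CJK.sub("", text)


if __name__ == "__main__":
    # Spaced-out text of the kind described in this issue.
    print(strip_cjk_spaces("美 学 や 社 会 学 の 分 野 で は"))
```

This only repairs text after it has left the PDF; searching inside the PDF viewer would still see the spaces.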

amitdo · 2022-12-28T07:24:54Z

tesseract-ocr/tesseract#2702

jbarlow83 · 2023-06-09T06:25:37Z

duplicate - tesseract issue

jbarlow83 added the "third party issue" (Problem with a third party dependency) label · Jun 9, 2023

jbarlow83 closed this as not planned · Jun 9, 2023
