Spaces in Japanese #1041

Closed
KajiyaOokami opened this issue Nov 29, 2022 · 4 comments
Labels
third party issue Problem with a third party dependency

Comments

@KajiyaOokami

Hi all!
Is it possible to run OCR so that all spaces are left out of the output? Languages like Japanese do not really use spaces (not even after commas or periods), but OCRmyPDF currently seems to find a space between almost every character. That is very problematic when you want to search the document for sentences or words, or paste parts of it into Google Translate.
Thank you in advance!

@KajiyaOokami
Author

Oh, I forgot to give an example! Text that should copy-paste from the output PDF as:

"美学や社会学の分野では",

Appears instead as:

"美学 や 社会 学 の 分 野 で は".

And it is like that for the whole document.
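
As a stopgap for text that has already been copied or extracted from the output PDF, the spurious spaces can be stripped in post-processing. The following is only a minimal Python sketch, not a feature of OCRmyPDF or Tesseract: the function name `strip_cjk_spaces` and the choice of Unicode ranges are assumptions made for illustration, and it does not change the text layer inside the PDF, so searching within a PDF viewer is unaffected.

```python
import re

# Assumed set of "Japanese" characters for this sketch: CJK punctuation,
# hiragana, katakana, common kanji, and full-width/half-width forms.
# (U+3000, the ideographic space, is deliberately excluded from the set.)
_CJK = r"\u3001-\u303f\u3040-\u309f\u30a0-\u30ff\u4e00-\u9fff\uff00-\uffef"

# Whitespace that has a CJK character on both sides. Lookarounds are used so
# that runs such as "分 野 で は" are handled in a single pass.
_SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{_CJK}])[ \t\u3000]+(?=[{_CJK}])")


def strip_cjk_spaces(text: str) -> str:
    """Remove whitespace between two CJK characters; keep all other spacing."""
    return _SPACE_BETWEEN_CJK.sub("", text)


if __name__ == "__main__":
    print(strip_cjk_spaces("美学 や 社会 学 の 分 野 で は"))
    # prints: 美学や社会学の分野では
```

Spaces next to Latin text are left alone, so mixed-language lines keep their word boundaries.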

@KajiyaOokami
Author

KajiyaOokami commented Dec 1, 2022

I have no programming knowledge at all, but after researching a little more, I believe the problem arises not when the text is extracted from the input PDF, but when it is laid onto the output PDF. I think so because if, for instance, the text is extracted from the input PDF and then displayed in the terminal, there are no spaces.
I suspect the issue is related to character box sizes, in particular that the program may be applying box sizes meant for Roman characters to non-Roman characters. If so, the layer applied over Japanese text would consist of thin, widely spaced boxes (which the program then interprets as many spaces), rather than the wider, tightly packed boxes it should have.

From a source: "In principle, characters in Japanese composition are designed in a square box and positioned without spaces, i.e. solid setting."

@amitdo

amitdo commented Dec 28, 2022

jbarlow83 added the "third party issue" label (Problem with a third party dependency) on Jun 9, 2023
@jbarlow83
Collaborator

duplicate - tesseract issue

jbarlow83 closed this as not planned on Jun 9, 2023