Spaces in Japanese #1041
Comments
Oh, I forgot to give an example! Text that should copy-paste from the output PDF as "美学や社会学の分野では" instead appears as "美学 や 社会 学 の 分 野 で は". The whole document is like that.
I have no programming knowledge at all, but after researching a little more, I believe the problem arises not when the text is extracted from the input PDF, but when it is laid out in the output PDF. I think so because if the text is extracted from the input PDF and displayed in the terminal instead, there are no spaces. From a source: "In principle, characters in Japanese composition are designed in a square box and positioned without spaces, i.e. solid setting."
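As a possible stopgap, text copied out of the PDF could be cleaned up afterwards. The following is only a minimal sketch, not something OCRmyPDF or Tesseract provides: the function name `strip_cjk_spaces` and the choice of Unicode ranges are assumptions made for illustration. It removes spaces and tabs that sit directly between two Japanese characters and leaves spacing around Latin text alone.

```python
import re

# Assumed character ranges: CJK punctuation, hiragana, katakana,
# CJK unified ideographs, and full-width forms.
_CJK = r"\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\u4e00-\u9fff\uff00-\uffef"

# Match runs of spaces/tabs that have a CJK character on both sides.
_SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{_CJK}])[ \t]+(?=[{_CJK}])")


def strip_cjk_spaces(text: str) -> str:
    """Remove spaces that appear between two CJK characters (hypothetical helper)."""
    return _SPACE_BETWEEN_CJK.sub("", text)


if __name__ == "__main__":
    print(strip_cjk_spaces("美学 や 社会 学 の 分 野 で は"))  # 美学や社会学の分野では
```

This only fixes text after it has been copied out; it does not change the text layer stored in the PDF, so searching inside the PDF itself would still be affected.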
Duplicate: this is a Tesseract issue.
Hi all!
I wonder whether it is possible to run OCR so that all spaces are ignored in the output. Languages like Japanese do not really use spaces (even after commas or periods), but OCRmyPDF currently seems to find spaces between almost every character, which is very problematic when you want to search for sentences or words in the document, or run parts of it through Google Translate.
Thank you in advance!