Words array not correctly generated for Japanese/Korean/Chinese #413

Closed
rigelglen opened this issue Feb 24, 2020 · 2 comments

@rigelglen

I'm trying to use tesseract.js to detect some Korean text, but every entry in the words output is a single character instead of a whole word.

[Image: 20200224_144821]

[Image: Screenshot 2020-02-24 at 4 22 34 PM]

The value of the text property is correctly recognised as 안녕하세요!, without spaces between the characters. However, the words array contains one entry per character.
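Since the text property is reported as correct while words is split per character, one possible client-side workaround is to regroup the per-character entries using the whitespace-delimited tokens of text. The sketch below is only an illustration: `regroupWords` is a hypothetical helper, not part of tesseract.js, and it assumes each entry in words carries `text`, `confidence`, and a `bbox` with `x0`, `y0`, `x1`, `y1`. Taking the minimum confidence of the merged fragments is an arbitrary, conservative choice.

```javascript
// Hypothetical helper (not part of tesseract.js): regroup per-character
// `words` entries into the whitespace-delimited tokens found in `text`.
// Assumes each entry looks like { text, confidence, bbox: { x0, y0, x1, y1 } }.
function regroupWords(text, words) {
  const tokens = text.split(/\s+/).filter(Boolean);
  const merged = [];
  let i = 0; // index of the next unconsumed entry in `words`

  for (const token of tokens) {
    let acc = '';
    const parts = [];
    // Consume fragments until they cover the current token.
    while (i < words.length && acc.length < token.length) {
      acc += words[i].text;
      parts.push(words[i]);
      i += 1;
    }
    // If the fragments do not line up with the token, give up and
    // return the input unchanged rather than guess.
    if (parts.length === 0 || acc !== token) {
      return words;
    }
    merged.push({
      text: token,
      // Assumption: the weakest fragment bounds the merged confidence.
      confidence: Math.min(...parts.map((p) => p.confidence)),
      // Union of the fragment bounding boxes.
      bbox: {
        x0: Math.min(...parts.map((p) => p.bbox.x0)),
        y0: Math.min(...parts.map((p) => p.bbox.y0)),
        x1: Math.max(...parts.map((p) => p.bbox.x1)),
        y1: Math.max(...parts.map((p) => p.bbox.y1)),
      },
    });
  }

  // Leftover fragments also mean the two views disagree.
  return i === words.length ? merged : words;
}
```

It could be called as `regroupWords(data.text, data.words)` on a recognition result. This only helps when text itself contains the right word boundaries, as described above; it does not change what the engine reports.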

The worker parameters I am setting are:

tessedit_pageseg_mode: PSM.AUTO,
tessedit_ocr_engine_mode: OEM.LSTM_ONLY,
preserve_interword_spaces: '1'

I am using tesseract.js: "^2.0.2".
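For context, a setup along the following lines would apply those parameters with the v2 worker API. This is a sketch under assumptions, not the exact code from the report: the function name `recognizeWithParams` is hypothetical, the `'kor'` language code is assumed, and the tesseract.js module is passed in as an argument (for example `require('tesseract.js')`) so the snippet has no hard import.

```javascript
// Sketch of a tesseract.js v2-style setup. `tesseract` is the module object,
// e.g. require('tesseract.js'); `image` is anything worker.recognize accepts.
async function recognizeWithParams(tesseract, image) {
  const { createWorker, PSM, OEM } = tesseract;
  const worker = createWorker();

  await worker.load();
  await worker.loadLanguage('kor'); // assumption: Korean traineddata
  await worker.initialize('kor');

  // The three parameters listed in the report.
  await worker.setParameters({
    tessedit_pageseg_mode: PSM.AUTO,
    tessedit_ocr_engine_mode: OEM.LSTM_ONLY,
    preserve_interword_spaces: '1',
  });

  const { data } = await worker.recognize(image);
  await worker.terminate();

  // `text` is the full string; `words` is the array discussed in this issue.
  return { text: data.text, words: data.words };
}
```

Later major versions of tesseract.js changed the worker lifecycle, so this shape should be checked against the installed version.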

@jeromewu
Member

jeromewu commented Mar 9, 2020

After some quick research, I have to say there is no good way to solve this, so we won't fix it for now.

jeromewu closed this as completed Mar 9, 2020
@kazupon

kazupon commented Mar 25, 2021

Hi!
I have the same issue.

I found a related issue in the tesseract repository:
tesseract-ocr/tesseract#991

According to that issue, it seems possible to work around the problem with parameters.

If tesseract.js-core can also be controlled through parameters, that would seem to solve this issue. Would that be hard to do?
