Words array not correctly generated for Japanese/Korean/Chinese #413

rigelglen · 2020-02-24T11:00:16Z

I'm trying to use tesseract.js to detect some Korean text, but the words array in the tesseract output contains one character per word.

The value of the text property is correctly recognised as 안녕하세요! without spaces between the characters; however, the words array contains only single characters.

The worker parameters I am setting are:

tessedit_pageseg_mode: PSM.AUTO,
tessedit_ocr_engine_mode: OEM.LSTM_ONLY,
preserve_interword_spaces: '1'

I am using tesseract.js: "^2.0.2".
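
Until this is handled by the library, one possible stopgap is to merge adjacent single-character entries on the caller's side after recognition. The sketch below is not part of tesseract.js: `mergeAdjacentWords` and `maxGapRatio` are hypothetical names, and it assumes each entry in the words array has a `text` string and a `bbox` of the form `{ x0, y0, x1, y1 }`, with all entries taken from one horizontal line in left-to-right order. The gap threshold is a guess that would need tuning per font and image.

```javascript
// Hypothetical post-processing helper (not a tesseract.js API).
// Merges neighbouring entries whose horizontal gap is small relative
// to the glyph height, so that "안", "녕", ... become "안녕...".
function mergeAdjacentWords(words, maxGapRatio = 0.5) {
  const merged = [];
  for (const word of words) {
    const prev = merged[merged.length - 1];
    if (prev) {
      const gap = word.bbox.x0 - prev.bbox.x1;
      const height = prev.bbox.y1 - prev.bbox.y0;
      if (gap <= height * maxGapRatio) {
        // Close enough: extend the previous entry instead of starting a new one.
        prev.text += word.text;
        prev.bbox.x1 = Math.max(prev.bbox.x1, word.bbox.x1);
        prev.bbox.y0 = Math.min(prev.bbox.y0, word.bbox.y0);
        prev.bbox.y1 = Math.max(prev.bbox.y1, word.bbox.y1);
        continue;
      }
    }
    // Copy the entry so the caller's array is left untouched.
    merged.push({ text: word.text, bbox: { ...word.bbox } });
  }
  return merged;
}

// Made-up input: five tightly spaced characters, a wide gap, then two more.
const sampleWords = [
  { text: '안', bbox: { x0: 0, y0: 0, x1: 20, y1: 20 } },
  { text: '녕', bbox: { x0: 22, y0: 0, x1: 42, y1: 20 } },
  { text: '하', bbox: { x0: 44, y0: 0, x1: 64, y1: 20 } },
  { text: '세', bbox: { x0: 66, y0: 0, x1: 86, y1: 20 } },
  { text: '요', bbox: { x0: 88, y0: 0, x1: 108, y1: 20 } },
  { text: '감', bbox: { x0: 140, y0: 0, x1: 160, y1: 20 } },
  { text: '사', bbox: { x0: 162, y0: 0, x1: 182, y1: 20 } },
];

console.log(mergeAdjacentWords(sampleWords).map((w) => w.text));
```

This only groups characters by visual spacing. It does not perform real word segmentation, so text with no visible gaps between words will come back as a single entry.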


jeromewu · 2020-03-09T13:15:55Z

After some quick research, I have to say there is no good way to solve this issue, so we won't fix it for now.

kazupon · 2021-03-25T08:59:11Z

Hi!
I have the same issue.

I found a related issue in the tesseract repository:
tesseract-ocr/tesseract#991

According to that issue, it seems possible to work around this with parameters.

If tesseract.js-core can also be controlled through parameters, that would seem to solve this issue. Would that be hard to do?

jeromewu added the wontfix label Mar 9, 2020

jeromewu closed this as completed Mar 9, 2020

xxchan mentioned this issue Apr 30, 2023

Extra spaces in non space-delimited language like CJK #748

Closed
