gImageREADER does not find non-english dicts #13

titus483 · 2015-01-24T18:43:31Z

This is for gImageReader 3.0.1 under Windows 7.
I followed the dictionary installation instructions: I downloaded the German de_DE.zip and copied de_DE.aff and de_DE.dic into /share/myspell/dicts. They are there along with the en_US files.
But when I try to select "German" under "Recognize selection", even after "Redetect Languages", I can't select "German" (or "Deutsch"). There is just "English" -> "English (United States)" or "Multilingual" -> "English".

manisandro · 2015-01-24T18:51:36Z

Hi,
Two types of language data are used by gImageReader:

* The tesseract language definitions: these are necessary for performing OCR for a specific language (tesseract is the OCR engine used by gImageReader). You can download these here [1].
* The spellchecking dictionaries: these are used to perform spell checking on the OCR result. The *.aff and *.dic files are spellchecking dictionary files.

So in short, while you installed the spellchecking dictionaries, you are missing the actual language support for tesseract. For German, you'll want to download this [2] and place the deu.traineddata file it contains in the Tesseract language definitions folder (.../share/tessdata).

[1] https://code.google.com/p/tesseract-ocr/downloads/list
[2] https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.deu.tar.gz
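
To make the distinction between the two kinds of data concrete, here is a minimal Python sketch that reports which of them are present for one language. The function name `language_data_status`, its parameters, and the assumption that `tessdata` and `myspell/dicts` sit under the same share folder are illustrative choices, not part of gImageReader.

```python
from pathlib import Path


def language_data_status(share_dir, tess_code, spell_code):
    """Report which language files exist under a share folder.

    tess_code is the tesseract name (e.g. "deu"); spell_code is the
    spellchecker locale (e.g. "de_DE"). The folder layout is an assumption.
    """
    share = Path(share_dir)
    dicts = share / "myspell" / "dicts"
    return {
        # OCR support needs the tesseract traineddata file.
        "ocr": (share / "tessdata" / f"{tess_code}.traineddata").is_file(),
        # Spell checking needs both the .aff and the .dic file.
        "spelling": all(
            (dicts / f"{spell_code}.{ext}").is_file() for ext in ("aff", "dic")
        ),
    }
```

Called as `language_data_status("path/to/share", "deu", "de_DE")` (the path is a placeholder), a result with `"ocr"` false and `"spelling"` true would correspond to dictionaries being installed while the OCR language itself is missing.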

titus483 · 2015-01-24T19:26:20Z

Hi Sandro,
thanks for your quick reply. Yes, I did that as a first step (sorry, I forgot to mention it):

* I copied deu.traineddata into the tessdata folder
* I copied the *.aff and *.dic files into the gImageReader folder

In fact, I followed an article in the German c't magazine 4/2015 where this is described step by step. But it still doesn't work for me...

manisandro · 2015-01-24T19:43:05Z

So you should have something like this in .../usr:

|
|--> myspell
|   |--> dicts
|       |--> de_DE.aff
|       |--> de_DE.dic
|       |--> en_US.aff
|       |--> en_US.dic
|       |--> README.txt
|--> tessdata
    |--> deu.traineddata
    |--> eng.traineddata
    |--> README.txt

If this does not work (though that would really be a first), try only with the deu.traineddata file, without the de_DE.aff and de_DE.dic to see whether German appears as an entry in the menu.
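
As a rough way to compare a real installation with the tree above, the following Python sketch lists what it finds under a base folder. The function name `installed_languages` and the `base_dir` parameter are hypothetical, and the `tessdata` and `myspell/dicts` subfolder names are taken from the tree.

```python
from pathlib import Path


def installed_languages(base_dir):
    """List OCR languages and spelling dictionaries found under base_dir."""
    base = Path(base_dir)
    # Each <name>.traineddata file corresponds to one tesseract language.
    ocr = sorted(p.stem for p in (base / "tessdata").glob("*.traineddata"))
    dicts_dir = base / "myspell" / "dicts"
    # A dictionary counts only if both its .aff and .dic files are present.
    spelling = sorted(
        p.stem for p in dicts_dir.glob("*.dic") if p.with_suffix(".aff").is_file()
    )
    return ocr, spelling
```

For a folder that matches the tree, this should return the OCR languages `deu` and `eng` and the dictionaries `de_DE` and `en_US`; the README.txt files are ignored.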

titus483 · 2015-01-24T19:55:56Z

Great! Now I've got it. My mistake was that I copied the deu.traineddata into
the Tesseract/tessdata folder, not into the gImageReader/.../tessdata folder!
And now I've got the right menu. Thanks a lot for your help!

manisandro · 2015-01-24T19:56:46Z

Ok cool!

narayaan · 2016-02-14T14:34:30Z

Maybe quite a noobish question, but I'm trying to add the Dutch tesseract data to gImageReader. A Google search led me to this page.

Since the tesseract code has been transferred to GitHub, I started looking there. I'm wondering which files exactly I should copy. All of them, or just the wordlist?

https://github.com/tesseract-ocr/langdata/tree/master/nld

narayaan · 2016-02-14T14:46:52Z

Found it: languages can now be downloaded at:
https://github.com/tesseract-ocr/tessdata

wally53 · 2016-07-24T19:26:49Z

Hi Sandro,
I am using gImageReader 3.1.91 under Windows 7 with Tesseract 3.05.00, and I am trying to install the German Fraktur OCR language data.
I followed the dictionary installation instructions and installed the following:

* I copied deu-frak.traineddata into the tessdata folder
* I copied the *.aff and *.dic files into the gImageReader folder

They are there along with the en_US files. But when I try to select "German" under "Recognize selection", even after "Redetect Languages", I can't select "German" (or "Deutsch"). There is just "English" -> "English (United States)".
It seems that other people had this kind of problem solved in the past, so obviously I am missing something somewhere.

manisandro · 2016-07-24T20:12:01Z

To which tessdata folder did you download the traineddata files? gImageReader bundles tesseract, so you need to make sure you place the traineddata files in the ...\gImageReader\share\tessdata folder.
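
Because a separate Tesseract installation can have its own tessdata folder, it may help to see every such folder and whether it holds the file in question. The Python sketch below walks a directory tree for that purpose; the function name `find_tessdata_folders` and its parameters are made up for this illustration.

```python
import os


def find_tessdata_folders(root, filename):
    """Yield (folder, has_file) for every 'tessdata' folder below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        # Compare case-insensitively, since Windows paths ignore case.
        if os.path.basename(dirpath).lower() == "tessdata":
            yield dirpath, filename in filenames
```

Running `list(find_tessdata_folders(some_root, "deu.traineddata"))`, with `some_root` set to a folder that contains both installations, would show at a glance if the file ended up in a standalone Tesseract folder instead of the one bundled with gImageReader.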

wally53 · 2016-07-25T07:59:59Z

The following files are in the ...\gImageReader\share\tessdata folder:

deu.traineddata
deu-frak.traineddata
eng.traineddata
README

manisandro · 2016-07-25T10:37:52Z

Did you make sure you downloaded the actual binary blob and not the html page on github for the traineddata file? Can you try with the integrated tessdata manager (you'll need to start the program as administrator)?
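
One quick way to tell a saved web page from the real data file is to look at the first bytes. The Python sketch below is only a heuristic: the function name `looks_like_html` and the `probe_size` parameter are invented for this example, and it does not validate the traineddata format itself.

```python
def looks_like_html(path, probe_size=512):
    """Return True if the file starts like an HTML page rather than binary data."""
    with open(path, "rb") as handle:
        # Ignore leading whitespace and letter case before comparing.
        head = handle.read(probe_size).lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))
```

If this returns True for a .traineddata file, the download most likely saved the GitHub page instead of the file, and it should be fetched again.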

wally53 · 2016-07-26T08:48:16Z

The following files and their sizes are in the
...\gImageReader\share\tessdata folder:

deu.traineddata 13 054 KB
deu-frak.traineddata 1 933 KB
eng.traineddata 21 364 KB
README 1 KB

Regarding "Can you try with the integrated tessdata manager (you'll need to start the program as administrator)?":

I have run gImageReader as administrator, without success.
However, I am not sure what it means to run the "integrated tessdata manager".

manisandro · 2016-07-27T07:44:52Z

The integrated tessdata manager can be launched from the language selection menu -> "manage languages..."
If that also does not work, we need to do some proper debugging...

pmontrasio · 2018-06-21T10:30:11Z

On Ubuntu the solution is

sudo apt-get install myspell-de

Other languages have their own myspell package, for example myspell-fr and myspell-it.
By the way, on Ubuntu the files in tessdata are installed with

sudo apt-get install tesseract-ocr-deu tesseract-ocr-fra tesseract-ocr-ita

manisandro closed this as completed Jan 24, 2015
