gImageREADER does not find non-english dicts #13

Closed
titus483 opened this issue Jan 24, 2015 · 14 comments

Comments

@titus483

This is for gImageReader 3.0.1 under Windows 7.
I followed the dictionary installation instructions, downloaded the German de_DE.zip, and copied de_DE.aff and de_DE.dic into /share/myspell/dicts. They are there along with the en_US files.
But when I try to pick "German" under "Recognize selection", even after "Redetect Languages", I can't select "German" (or "Deutsch"). There is just "English" -> "English (United States)" or "Multilingual" -> "English".

@manisandro
Owner

Hi,
Two types of language data are used by gImageReader:

  • The tesseract language definitions: these are necessary for performing OCR for a specific language (tesseract is the OCR engine used by gImageReader). You can download these here [1].
  • The spellchecking dictionaries. These are used to perform spell checking on the OCR result. The *.aff and *.dic files are spellchecking dictionary files.

So in short, while you installed the spellchecking dictionaries, you are missing the actual language support for tesseract. For German, you'll want to download this [2] and place the deu.traineddata file it contains in the Tesseract language definitions folder (.../share/tessdata).

[1] https://code.google.com/p/tesseract-ocr/downloads/list
[2] https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.deu.tar.gz
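
A minimal sketch of the distinction above, assuming Python is available. The function name classify_language_files is hypothetical and not part of gImageReader or tesseract; it only sorts file names by extension to show which kind of language data each file provides.

```python
from pathlib import PurePath


def classify_language_files(filenames):
    """Sort file names into OCR language data and spelling dictionaries."""
    ocr, spelling = set(), set()
    for name in filenames:
        path = PurePath(name)
        if path.suffix == ".traineddata":
            # tesseract language definition, e.g. deu.traineddata
            ocr.add(path.stem)
        elif path.suffix in (".aff", ".dic"):
            # spellchecking dictionary pair, e.g. de_DE.aff and de_DE.dic
            spelling.add(path.stem)
    return {"ocr": sorted(ocr), "spelling": sorted(spelling)}


# The situation from the original report: German dictionaries, English OCR data only.
print(classify_language_files(
    ["de_DE.aff", "de_DE.dic", "en_US.aff", "en_US.dic", "eng.traineddata"]
))
```

With only the dictionaries installed, German shows up under "spelling" but not under "ocr", which matches the symptom in the report.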

@titus483
Author

Hi Sandro,
thanks for your quick reply. Yes, I did that as a first step (sorry, I forgot to
mention it):

  1. I copied the deu.traineddata into the tessdata folder
  2. I copied the *.aff and *.dic files into the gImageReader folder

In fact, I followed an article in the German c't magazine 4/2015 that
describes this step by step. But it still doesn't work for me...

@manisandro
Owner

So you should have something like this in .../usr:

|
|--> myspell
|   |--> dicts
|       |--> de_DE.aff
|       |--> de_DE.dic
|       |--> en_US.aff
|       |--> en_US.dic
|       |--> README.txt
|--> tessdata
    |--> deu.traineddata
    |--> eng.traineddata
    |--> README.txt

If this does not work (though that would really be a first), try with only the deu.traineddata file, without de_DE.aff and de_DE.dic, to see whether German appears as an entry in the menu.
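
As a rough sketch, the layout above could be checked with a few lines of Python. The function name missing_language_files and its parameters are hypothetical; the root argument stands for whichever folder contains the myspell and tessdata folders in your installation.

```python
from pathlib import Path


def missing_language_files(root, ocr_lang, spell_lang):
    """Return the expected language files that are absent below root."""
    root = Path(root)
    expected = [
        Path("tessdata") / f"{ocr_lang}.traineddata",
        Path("myspell") / "dicts" / f"{spell_lang}.aff",
        Path("myspell") / "dicts" / f"{spell_lang}.dic",
    ]
    # Report paths relative to root, with forward slashes on every platform.
    return [p.as_posix() for p in expected if not (root / p).is_file()]
```

For German you would call missing_language_files(root, "deu", "de_DE"); an empty list means all three files are where the tree above expects them.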

@titus483
Author

Great! Now I've got it. My mistake was that I copied the deu.traineddata into
the Tesseract/tessdata folder, not into the gImageReader/.../tessdata folder!
And now I've got the right menu. Thanks a lot for your help!

@manisandro
Owner

Ok cool!

@narayaan

Maybe quite a noobish question, but I'm trying to add the Dutch tesseract data to gImageReader. A Google search led me to this page.

Since the tesseract code has been transferred to GitHub, I started looking there. I'm wondering which files exactly I should copy. All of them, or just the wordlist?

https://github.com/tesseract-ocr/langdata/tree/master/nld

@narayaan

Found it, languages can now be downloaded at:
https://github.com/tesseract-ocr/tessdata

@wally53

wally53 commented Jul 24, 2016

Hi Sandro,
I am using gImageReader 3.1.91 under Windows 7 with Tesseract 3.05.00 and I am trying to install the German Fraktur OCR software.
I followed the dictionary installation instructions and did the following:

  1. I copied the deu-frak.traineddata into the tessdata folder
  2. I copied the *.aff and *.dic files into the gImageReader folder

They are there along with the en_US files. But when I try to pick "German" under "Recognize selection", even after "Redetect Languages", I can't select "German" (or "Deutsch"). There is just "English" -> "English (United States)".
It seems that other people had this kind of problem solved in the past, so obviously I am missing something somewhere.

@manisandro
Owner

To which tessdata folder did you download the traineddata files? gImageReader bundles tesseract, so you need to make sure you place the traineddata files in the ...\gImageReader\share\tessdata folder.

@wally53

wally53 commented Jul 25, 2016

The following files are in the ...\gImageReader\share\tessdata folder:

deu.traineddata
deu-frak.traineddata
eng.traineddata
README




@manisandro
Owner

Did you make sure you downloaded the actual binary blob and not the HTML page on GitHub for the traineddata file? Can you try with the integrated tessdata manager (you'll need to start the program as administrator)?
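
One quick way to tell the two apart, sketched in Python under the assumption that a saved web page starts with an HTML doctype or an html tag. The helper name looks_like_html and the probe_size parameter are hypothetical, and a False result only rules out an HTML page; it does not prove the file is valid traineddata.

```python
HTML_MARKERS = (b"<!doctype html", b"<html")


def looks_like_html(path, probe_size=512):
    """Return True if the file starts like a saved web page."""
    with open(path, "rb") as handle:
        head = handle.read(probe_size)
    # Ignore a UTF-8 byte order mark and leading whitespace, compare case-insensitively.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    head = head.lstrip().lower()
    return head.startswith(HTML_MARKERS)
```

If this returns True for a .traineddata file, the download saved the GitHub page instead of the raw file and needs to be repeated.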

@wally53

wally53 commented Jul 26, 2016

The following files and their sizes are in the ...\gImageReader\share\tessdata folder:

File                    Size
deu.traineddata         13 054 KB
deu-frak.traineddata     1 933 KB
eng.traineddata         21 364 KB
README                       1 KB

Can you try with the integrated tessdata manager (you'll need to
start the program as administrator)?

I have run gImageReader as administrator, without success.
However, I am not sure what it means to run the "integrated
tessdata manager".




@manisandro
Owner

The integrated tessdata manager can be launched from the language selection menu -> "manage languages..."
If that also does not work, we need to do some proper debugging...

@pmontrasio

On Ubuntu the solution is

sudo apt-get install myspell-de

Other languages have their own myspell package, for example myspell-fr and myspell-it.
By the way, on Ubuntu the files in tessdata are installed with

sudo apt-get install tesseract-ocr-deu tesseract-ocr-fra tesseract-ocr-ita
