FYI - tesseract4.0.0-beta.1 traineddata files for scripts #323

Shreeshrii · 2018-03-18T12:48:08Z

Please see tesseract-ocr/tessdata_fast@9f875fb

The newly available script traineddata files have now been moved to a subdirectory.

This is FYI and applies only to the experimental Windows version.


manisandro · 2018-03-18T14:31:33Z

I think I'll generally need to check what needs to be updated for tesseract 4, also "fast" vs "best" models.

Shreeshrii · 2018-03-18T14:55:54Z

Debian and Ubuntu packages have tessdata_fast.

Only tessdata_best will work as "continue_from" for LSTM training.

Those who want to use the legacy tesseract engine with tesseract 4 need the traineddata from the tessdata repo.

manisandro · 2018-03-18T22:49:54Z

Not really having time to play around with tesseract 4 at the moment, do you perhaps know:

What consequences does the script folder have as far as gImageReader is concerned? Does the tessdata manager just need to separately handle scripts, and place them in the correct folder?
Is there any difference between invoking tesseract for a language traineddata or a script traineddata?
I suppose the "fast" models conflict with the "best" ones, i.e. the filenames collide in the tessdata folder? So the only clean approach for the tessdata manager to handle both would be to have two separate tessdata prefixes for the respective variants, and switch via the TESSDATA_PREFIX environment variable.

Shreeshrii · 2018-03-19T05:58:38Z

> Is there any difference between invoking tesseract for a language traineddata or a script traineddata?

No. It is the same, with -l in command line.

> Does the tessdata manager just need to separately handle scripts, and place them in the correct folder?

Earlier, all these files were in the root directory of the repo. Now the script traineddata files are under the script subdirectory. You will need to show both language and script files under the languages menu.

> clean approach for the tessdata manager to handle both would be to have two separate tessdata prefixes for the respective variants, and switch via the TESSDATA_PREFIX environment variable.

I don't think gImageReader allows training of tesseract. In that case you could limit users to models from tessdata_fast, since they are newer, smaller and faster.

There has been a change regarding the TESSDATA_PREFIX variable. Its processing has been made consistent with --tessdata-dir on the command line. Both should specify the full path.

tesseract-ocr/tesseract@af6994e

tesseract-ocr/tesseract@9035217

Shreeshrii · 2018-03-20T04:40:23Z

Please see following comment from Ray today, regarding best and fast

tesseract-ocr/tessdata_best#17 (comment)

manisandro · 2018-03-20T08:54:44Z

Thanks

amitdo · 2018-03-20T09:34:07Z

tesseract 4.0 supports this usage:

tesseract in.png out -l subdir/eng

manisandro · 2018-03-20T09:39:40Z

I suppose in the API, subdir is what is passed as datapath to TessBaseAPI::Init?

int Init(const char* datapath, const char* language)

amitdo · 2018-03-20T10:04:27Z

const char* language = "fast/eng";
//const char* language = "best/eng";
//const char* language = "best/script/Latin";

CC:@stweil

manisandro · 2018-03-20T10:05:20Z

Ok, thanks!

amitdo · 2018-03-20T11:06:06Z

On Debian testing/sid and Ubuntu 18.04:
https://packages.debian.org/buster/tesseract-ocr-script-latn

The scripts are not in a 'script' dir.
/usr/share/tesseract-ocr/4.00/tessdata/Latin.traineddata

manisandro · 2018-03-20T11:08:27Z

Since I'm also the Fedora maintainer of tesseract, I suppose the upstream recommendation would be something like

/<tessdata_prefix>/fast
/<tessdata_prefix>/best
/<tessdata_prefix>/script

right?

Shreeshrii · 2018-03-20T11:20:04Z

The script subdir is there in both fast and best. I think Amit recommended script under both best and fast:

const char* language = "fast/eng";
//const char* language = "fast/script/Latin";

//const char* language = "best/eng";
//const char* language = "best/script/Latin";

This would also reflect the directory structure in tessdata_fast and tessdata_best.
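
The naming scheme above can be sketched as a small helper that builds the string passed as `language` (or to `-l`). This is only an illustration: `languageArg` and `modelFile` are hypothetical helpers, not part of the Tesseract API, and the sketch assumes that the `fast` and `best` directories sit directly under the data path and that the model is looked up as `<datapath>/<language>.traineddata`.

```cpp
#include <string>

// Hypothetical helper: builds the value passed as `language` for a model
// stored as <datapath>/<variant>/[script/]<name>.traineddata.
std::string languageArg(const std::string& variant, const std::string& name, bool isScript) {
    std::string arg = variant.empty() ? "" : variant + "/";
    if (isScript) arg += "script/";
    return arg + name;
}

// Hypothetical helper: the file expected for a given data path and language
// argument, assuming a plain <datapath>/<language>.traineddata lookup.
std::string modelFile(std::string datapath, const std::string& language) {
    if (!datapath.empty() && datapath.back() != '/') datapath += '/';
    return datapath + language + ".traineddata";
}
```

For example, `languageArg("best", "Latin", true)` gives `best/script/Latin`, which matches the last line of the snippet above.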

stweil · 2018-03-20T13:39:46Z

The user interface ideally should not only offer to install or select languages, but also scripts for Tesseract 4. And for most users using data from tessdata_fast seems to be the best choice. Therefore the README.txt also needs an updated URL https://github.com/tesseract-ocr/tessdata_fast.

manisandro · 2018-03-29T22:38:53Z

I've added support for handling script traineddatas and switched to tessdata_fast if compiled against tesseract4.x.

Testing welcome. Windows builds of current master are here:
https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_i686_tesseract4.0.0beta1.exe
https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_x86_64_tesseract4.0.0beta1.exe

Shreeshrii · 2018-03-30T02:15:44Z

Thanks. I will download and give it a spin


Shreeshrii · 2018-03-30T09:49:34Z

I installed the beta just now and then looked at the share/tessdata folder. It still has all the old files that I had there. So, this could be a possible reason for incompatibility if users have older version of traineddata files.

I understand the need to keep the old traineddata so that files do not have to be downloaded again.

Maybe give an option during install whether old files should be saved...

manisandro · 2018-03-30T09:53:52Z

This is an issue I've thought about a couple of times. The best thing would be if the traineddatas contained the version of tesseract they are compatible with, so that you can notify the user about incompatible traineddatas. At the moment there is simply no way to know.

Shreeshrii · 2018-03-30T10:00:51Z

Also, the reason for the script subdir was because there is a lao language and Lao script.

I saw that all script files already in share/tessdata were identified as script in menu :-)

As a testcase, I downloaded both lao and Lao thru the Tessdata manager. The process got an error, screenshot attached. It is probably related to overwriting the file - Do I need to run the program in administrator mode if I want to install languages or is that handled internally?

Shreeshrii · 2018-03-30T10:09:14Z

OK, Ran program as administrator and chose Lao and lao. No error, but it installed only Lao. No message either.

Currently only this pair has a problem, because Windows file names are case insensitive. You could change one of their file names after downloading.
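
A tessdata manager could detect such pairs before downloading. The following is a minimal sketch, assuming ASCII-only model names; `lowerAscii` and `caseCollisions` are hypothetical names and not taken from gImageReader.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Lower-case an ASCII name, roughly how a case-insensitive file system compares it.
std::string lowerAscii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Returns the groups of names that differ only by case and would therefore
// end up as the same file when stored in a single directory on Windows.
std::vector<std::vector<std::string>> caseCollisions(const std::vector<std::string>& names) {
    std::map<std::string, std::vector<std::string>> byKey;
    for (const auto& n : names) byKey[lowerAscii(n)].push_back(n);

    std::vector<std::vector<std::string>> result;
    for (const auto& kv : byKey)
        if (kv.second.size() > 1) result.push_back(kv.second);
    return result;
}
```

With the names `eng`, `lao`, `Lao` and `Latin`, only the `lao`/`Lao` group is reported. Such a pair could then be renamed or placed in separate directories.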

manisandro · 2018-03-30T10:11:06Z

Hmm, cannot reproduce all traineddatas being marked as script...
There is however indeed the issue that script traineddatas aren't installed to the script subdir. The problem is that the script subdir was only introduced after the 4.0.0-beta.1 tag in the tessdata repo. I'll need to add some extra logic to handle this (though it will automatically be resolved with current tesseract master or following releases).

manisandro · 2018-03-30T10:12:11Z

And about admin vs non admin mode: if you don't have write permissions to %ProgramFiles%, you can still select to use user-directories for storing the data in the gimagereader setting dialog.

Shreeshrii · 2018-03-30T10:19:14Z

> cannot reproduce all traineddatas being marked as script...

I meant that the existing script traineddatas, those starting with a CAPITAL letter, were correctly identified as script in the menu, and sorted on top.

Suggestions:

User reports seem to suggest that individual languages are more accurate. So, in general, script traineddata would be useful for a language that has NOT been trained but is written in that script. So I would suggest that you keep the languages list on top, followed by scripts.
The language names are being displayed in their appropriate script. I think it would be useful to display the name or language code in English when hovering over the name, e.g. I have installed many languages for testing but do not necessarily know what each one looks like.
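
The first suggestion could be sketched as follows. It relies on the convention mentioned above, that script traineddata names start with a capital letter, which is an observation about the current files rather than a guarantee. `looksLikeScript` and `menuOrder` are hypothetical names.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Assumed convention: script models (e.g. "Latin") start with an upper-case
// letter, language models (e.g. "eng") do not.
bool looksLikeScript(const std::string& name) {
    return !name.empty() && std::isupper(static_cast<unsigned char>(name[0])) != 0;
}

// Orders names for a menu: languages first, then scripts,
// alphabetically within each group.
std::vector<std::string> menuOrder(std::vector<std::string> names) {
    std::sort(names.begin(), names.end(),
              [](const std::string& a, const std::string& b) {
                  const bool sa = looksLikeScript(a);
                  const bool sb = looksLikeScript(b);
                  if (sa != sb) return !sa;  // the language sorts before the script
                  return a < b;
              });
    return names;
}
```

For the input `Latin`, `eng`, `Devanagari`, `deu` this yields `deu`, `eng`, `Devanagari`, `Latin`.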

manisandro · 2018-03-30T10:21:44Z

Ok makes sense

manisandro · 2018-03-30T11:18:45Z

I've refreshed the builds (same links as above).

Shreeshrii · 2018-06-11T12:23:42Z

Please rebuild these 4.00-beta builds again for testing.

I am finding that my recently trained traineddata files are NOT working in gImageReader. They show up in the menu, but when I click on them, recognition does not happen. Older custom traineddata still works.

I am wondering whether there is some compatibility issue, since I can use them on command line with recently built tesseract.

These are variations on the Devanagari.traineddata, if you want I can provide a copy of the file.

manisandro · 2018-06-11T12:33:51Z

I'll try and do so this evening - I'm currently fighting (in my currently limited free time) with the tessdata manager to find a sufficiently robust logic to detect the tessdata git tag to use. Since you are a contributor to the tessdata repository: it would be great if the TESSERACT_VERSION_STR exposed in version.h were equal to the tessdata tag. The 4.0.0-beta1 release still has

#define TESSERACT_VERSION_STR "4.00.00alpha"

but the tessdata tag is 4.0.0-beta.1. (I believe [1] should ensure that in the future, a correct version string is exposed).

[1] tesseract-ocr/tesseract@6bbfc3b

Shreeshrii · 2018-06-11T13:58:07Z

@stweil is the one making changes regarding version.

The tessdata files themselves are all marked as 4.00.00alpha since they are from last year.

manisandro · 2018-06-11T13:59:54Z

Do the tessdatas actually contain a version encoded somewhere? Up to now I always just relied on the github tag, which currently is 4.0.0-beta.1.

Shreeshrii · 2018-06-11T14:04:19Z

A version string has been added to the files. Please see https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#format-of-traineddata-files

However, I don't think there is 'control' over what the values should be. It takes any string currently for new files.

Ray/Jeff's uploads say Version string:4.00.00alpha

Shreeshrii · 2018-06-12T08:31:32Z

> However, I don't think there is 'control' over what the values should be. It takes any string currently for new files.

My earlier comment is NOT entirely correct. While the version string can be set to any string, the program prepends its own version to it. As an example, see a recent training run where the tesseract version 4.0.0-beta.1-344-ge61406 has been prepended to the custom version string.

 combine_tessdata -u san-iast.traineddata san-iast
Extracting tessdata components from san-iast.traineddata
Wrote san-iast.config
Wrote san-iast.lstm
Wrote san-iast.lstm-punc-dawg
Wrote san-iast.lstm-word-dawg
Wrote san-iast.lstm-number-dawg
Wrote san-iast.lstm-unicharset
Wrote san-iast.lstm-recoder
Wrote san-iast.version
Version string:4.0.0-beta.1-344-ge61406:san:shreeshrii20180610:from:4.00.00alpha:Devanagari:synth20170629test
0:config:size=1013, offset=192
17:lstm:size=2777386, offset=1205
18:lstm-punc-dawg:size=4322, offset=2778591
19:lstm-word-dawg:size=19187298, offset=2782913
20:lstm-number-dawg:size=450, offset=21970211
21:lstm-unicharset:size=23358, offset=21970661
22:lstm-recoder:size=2929, offset=21994019
23:version:size=94, offset=21996948
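
If a tool wanted to use this string, a minimal sketch could split it on `:` and treat the first field as the version of the program that wrote the file. That interpretation is an assumption based on the output above. The remaining fields are free-form text chosen by whoever trained the model. `splitVersion` and `writerVersion` are hypothetical helpers, and the input is the text after the `Version string:` label, not the traineddata file itself.

```cpp
#include <string>
#include <vector>

// Splits the content of the `version` component on ':'.
std::vector<std::string> splitVersion(const std::string& version) {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        const auto pos = version.find(':', start);
        if (pos == std::string::npos) {
            fields.push_back(version.substr(start));
            break;
        }
        fields.push_back(version.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}

// Assumption: the first field is the version prepended by the writing program.
std::string writerVersion(const std::string& version) {
    return splitVersion(version).front();
}
```

Applied to the string above, the first field is `4.0.0-beta.1-344-ge61406`, while for the older files it is simply `4.00.00alpha`.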

manisandro · 2018-06-14T13:02:55Z

@Shreeshrii P.s. Sorry but I've not yet been able to do a build, something broke with the mingw-qt5-5.11 update in Fedora Rawhide, need to debug that one first.

Shreeshrii · 2018-06-14T13:03:53Z

No problem. Please take your time.

manisandro · 2018-06-21T22:13:25Z

At last, here are some fresh builds:

https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_i686_tesseract4.git87635c1.exe
https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_x86_64_tesseract4.git87635c1.exe

I'd be also interested in testing of the tessdata manager, which should now also properly handle script tessdatas.

Shreeshrii · 2018-06-27T05:28:17Z

Thank you. I was able to use my custom traineddata with it. Will test further and let you know.


manisandro closed this as completed Mar 29, 2018
