FYI - tesseract4.0.0-beta.1 traineddata files for scripts #323

Closed
Shreeshrii opened this issue Mar 18, 2018 · 35 comments

Comments

@Shreeshrii
Contributor

Please see tesseract-ocr/tessdata_fast@9f875fb

The newly available script traineddata files have now been moved to a subdirectory.

This is FYI and applies only to the experimental windows version.

@manisandro
Owner

I think I'll generally need to check what needs to be updated for tesseract 4, also "fast" vs "best" models.

@Shreeshrii
Contributor Author

Debian and Ubuntu packages have tessdata_fast.

Only tessdata_best will work as "continue_from" for LSTM training.

Those who want to use the legacy tesseract engine with tesseract 4 need the traineddata files from the tessdata repo.

@manisandro
Owner

Not really having time to play around with tesseract 4 at the moment, do you perhaps know:

  • What consequences does the script folder have as far as gImageReader is concerned? Does the tessdata manager just need to separately handle scripts, and place them in the correct folder?
  • Is there any difference between invoking tesseract for a language traineddata or a script traineddata?
  • I suppose the "fast" models conflict with the "best" ones, i.e. the filenames collide in the tessdata folder? So the only clean approach for the tessdata manager to handle both would be to have two separate tessdata prefixes for the respective variants, and switch via the TESSDATA_PREFIX environment variable.

@Shreeshrii
Contributor Author

Is there any difference between invoking tesseract for a language traineddata or a script traineddata?

No. It is the same, with -l in command line.

Does the tessdata manager just need to separately handle scripts, and place them in the correct folder?

Earlier, all these files were in the root directory of the repo. Now the script traineddata files are under the script subdirectory. You will need to show both language and script files under the languages menu.

clean approach for the tessdata manager to handle both would be to have two separate tessdata prefixes for the respective variants, and switch via the TESSDATA_PREFIX environment variable.

I don't think that gimagereader allows training of tesseract. In that case you could limit users to models from tessdata_fast, since they are newer/smaller/faster models.

There has been a change regarding the TESSDATA_PREFIX variable. Its processing has been made consistent with --tessdata-dir on the command line. Both should specify the full path.

tesseract-ocr/tesseract@af6994e

tesseract-ocr/tesseract@9035217
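A minimal sketch of the "two prefixes, switch via TESSDATA_PREFIX" idea raised in the question, assuming a hypothetical layout in which the fast and best models live in sibling directories named fast and best under one base directory. The function name env_for_variant and that layout are illustrative assumptions, not something gImageReader or tesseract prescribes.

```python
import os
from pathlib import Path


def env_for_variant(base_dir, variant, environ=None):
    """Return a copy of the environment with TESSDATA_PREFIX pointing at
    the tessdata directory of the chosen variant ("fast" or "best")."""
    if variant not in ("fast", "best"):
        raise ValueError(f"unknown variant: {variant!r}")
    env = dict(os.environ if environ is None else environ)
    # Store a full (absolute) path, following the advice that both
    # TESSDATA_PREFIX and --tessdata-dir should specify the full path.
    env["TESSDATA_PREFIX"] = str((Path(base_dir) / variant).resolve())
    return env
```

The returned dictionary could then be passed as the env argument of subprocess.run when launching tesseract, so the calling process's own environment is left untouched.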

@Shreeshrii
Contributor Author

Shreeshrii commented Mar 20, 2018

Please see the following comment from Ray today regarding best and fast:

tesseract-ocr/tessdata_best#17 (comment)

@manisandro
Owner

Thanks

@amitdo

amitdo commented Mar 20, 2018

tesseract 4.0 supports this usage:

tesseract in.png out -l subdir/eng

@manisandro
Owner

I suppose in the API, subdir is what is passed as datapath to TessBaseAPI::Init?

int Init(const char* datapath, const char* language)

@amitdo

amitdo commented Mar 20, 2018

const char* language = "fast/eng";
//const char* language = "best/eng";
//const char* language = "best/script/Latin";

CC:@stweil

@manisandro
Owner

Ok, thanks!

@amitdo

amitdo commented Mar 20, 2018

On Debian testing/sid and Ubuntu 18.04:
https://packages.debian.org/buster/tesseract-ocr-script-latn

The scripts are not in a 'script' dir.
/usr/share/tesseract-ocr/4.00/tessdata/Latin.traineddata
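
Since the Debian layout is flat while the upstream repositories use a script subdirectory, a front end may have to look in both places. A rough sketch of such a lookup, where find_traineddata is a hypothetical helper and the search order is an assumption:

```python
from pathlib import Path


def find_traineddata(tessdata_dir, name):
    """Return the path of <name>.traineddata, or None if it is not installed.

    Checks the upstream-style script/ subdirectory first, then the flat
    layout used by the Debian package mentioned above.
    """
    base = Path(tessdata_dir)
    candidates = (
        base / "script" / f"{name}.traineddata",
        base / f"{name}.traineddata",
    )
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```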

@manisandro
Owner

Since I'm also the Fedora maintainer of tesseract, I suppose the upstream recommendation would be something like

/<tessdata_prefix>/fast
/<tessdata_prefix>/best
/<tessdata_prefix>/script

right?

@Shreeshrii
Contributor Author

Shreeshrii commented Mar 20, 2018

script subdir is there in both fast and best. I think Amit recommended script under both best and fast

const char* language = "fast/eng";
//const char* language = "fast/script/Latin";

//const char* language = "best/eng";
//const char* language = "best/script/Latin";

This would also reflect the directory structure in tessdata_fast and tessdata_best.
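
To make the mapping concrete, here is a small sketch of how such a language string lines up with a file under the tessdata prefix. The helper names language_spec and expected_file are made up for illustration, and the assumption is that the data path, the language string and the ".traineddata" suffix are simply joined.

```python
from pathlib import PurePosixPath


def language_spec(variant, name, is_script=False):
    """Build the string passed to -l (or as `language` to Init)."""
    parts = [variant, "script", name] if is_script else [variant, name]
    return "/".join(parts)


def expected_file(tessdata_prefix, spec):
    """Path of the traineddata file that the given spec refers to."""
    return PurePosixPath(tessdata_prefix) / f"{spec}.traineddata"
```

For example, language_spec("best", "Latin", is_script=True) gives "best/script/Latin", which mirrors the directory structure of tessdata_best.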

@stweil
Contributor

stweil commented Mar 20, 2018

The user interface ideally should not only offer to install or select languages, but also scripts for Tesseract 4. And for most users using data from tessdata_fast seems to be the best choice. Therefore the README.txt also needs an updated URL https://github.com/tesseract-ocr/tessdata_fast.

@manisandro
Owner

I've added support for handling script traineddatas and switched to tessdata_fast if compiled against tesseract4.x.

Testing welcome, windows builds of current master are here:
https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_i686_tesseract4.0.0beta1.exe
https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_x86_64_tesseract4.0.0beta1.exe

@Shreeshrii
Contributor Author

Shreeshrii commented Mar 30, 2018 via email

@Shreeshrii
Contributor Author

I installed the beta just now and then looked at the share/tessdata folder. It still has all the old files that I had there. So this could be a possible reason for incompatibility if users have older versions of the traineddata files.

I understand the need to keep the old traineddata so that files do not have to be downloaded again.

Maybe give an option during install whether old files should be saved...

@manisandro
Owner

This is an issue I've thought about a couple of times. The best thing would be if the traineddatas contained the version of tesseract they are compatible with, so that you can notify the user about incompatible traineddatas. At the moment there is simply no way to know.

@Shreeshrii
Contributor Author

Also, the reason for the script subdir was because there is a lao language and Lao script.

I saw that all script files already in share/tessdata were identified as script in menu :-)

As a test case, I downloaded both lao and Lao through the tessdata manager. The process produced an error (screenshot attached). It is probably related to overwriting the file. Do I need to run the program in administrator mode if I want to install languages, or is that handled internally?

(Screenshot: gimagereader-lao)

@Shreeshrii
Contributor Author

OK, I ran the program as administrator and chose Lao and lao. No error, but it installed only Lao. No message either.

Currently only this pair has a problem, because file names on Windows are case-insensitive. You could change one of the file names after downloading.
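
A quick way to check a list of traineddata names for this kind of clash, as a sketch only. The function case_collisions is hypothetical, and case folding is used here as an approximation of how a case-insensitive file system compares names.

```python
from collections import defaultdict


def case_collisions(names):
    """Group names that differ only by letter case."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    # Keep only the groups in which more than one name shares a folded form.
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

A tessdata manager could run a check like this before downloading and then rename one file of each colliding pair, or keep scripts in their own subdirectory.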

@manisandro
Owner

manisandro commented Mar 30, 2018

Hmm, cannot reproduce all traineddatas being marked as script...
There is however indeed the issue that script traineddatas aren't installed to the script subdir. The problem is that the script subdir was only introduced after the 4.0.0-beta.1 tag in the tessdata repo. I'll need to add some extra logic to handle this (though it will automatically be resolved with current tesseract master or following releases).

@manisandro
Owner

And about admin vs non admin mode: if you don't have write permissions to %ProgramFiles%, you can still select to use user-directories for storing the data in the gimagereader setting dialog.

@Shreeshrii
Contributor Author

cannot reproduce all traineddatas being marked as script...

I meant that the existing script traineddatas, those starting with a CAPITAL letter, were correctly identified as script in the menu, and sorted on top.

Suggestions:

  1. User reports seem to suggest that individual language models are more accurate. So, in general, script traineddata would be useful for a language that has no trained model of its own but is written in that script. I would therefore suggest keeping the languages list on top, followed by the scripts.

  2. The language names are displayed in their own script. I think it would be useful to display the name or language code in English when hovering over the name, e.g. I have installed many languages for testing but do not necessarily know what each one looks like.
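
The first suggestion, combined with the capital-letter convention mentioned above, could be sketched like this. The names is_script and menu_order are illustrative, and the capital-letter test is only the heuristic described in this thread, not an official rule.

```python
def is_script(name):
    """Heuristic from this thread: script traineddata names start with a capital letter."""
    return name[:1].isupper()


def menu_order(names):
    """Languages first, then scripts, each group sorted alphabetically."""
    # False sorts before True, so languages come ahead of scripts.
    return sorted(names, key=lambda n: (is_script(n), n.casefold()))
```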

@manisandro
Owner

Ok makes sense

@manisandro
Owner

I've refreshed the builds (same links as above).

@Shreeshrii
Contributor Author

Please rebuild these 4.00-beta builds again for testing.

I am finding that my recently trained traineddata files are NOT working in gimagereader. They show up in the menu, but when I select them, recognition does not happen. Older custom traineddata still works.

I am wondering whether there is some compatibility issue, since I can use them on command line with recently built tesseract.

These are variations on the Devanagari.traineddata, if you want I can provide a copy of the file.

@manisandro
Owner

I'll try and do so this evening - I'm currently fighting (in my currently limited free time) with the tessdata manager to find a sufficiently robust logic to detect the tessdata git tag to use. Since you are a contributor to the tessdata repository: it would be great if the TESSERACT_VERSION_STR exposed in version.h were equal to the tessdata tag. The 4.0.0-beta1 release still has

#define TESSERACT_VERSION_STR "4.00.00alpha"

but the tessdata tag is 4.0.0-beta.1. (I believe [1] should ensure that in the future, a correct version string is exposed).

[1] tesseract-ocr/tesseract@6bbfc3b
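
To illustrate why the two strings are hard to match mechanically, here is a rough sketch that splits each into its numeric part and its suffix. The helper parse_version and its regular expression are assumptions made for this example and are not the logic used in gImageReader.

```python
import re


def parse_version(text):
    """Split a version string into ((major, minor, patch), suffix)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)[-.]?(.*)", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    numbers = tuple(int(part) for part in match.groups()[:3])
    return numbers, match.group(4)
```

With this, "4.00.00alpha" and "4.0.0-beta.1" agree on the numeric part (4, 0, 0) but carry different suffixes, so the library version string alone does not identify the tessdata tag.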

@Shreeshrii
Contributor Author

Shreeshrii commented Jun 11, 2018

@stweil is the one making changes regarding version.

The tessdata files themselves are all marked as 4.00.00alpha since they are from last year.

@manisandro
Owner

Do the tessdatas actually contain a version encoded somewhere? Up to now I always just relied on the github tag, which currently is 4.0.0-beta.1.

@Shreeshrii
Contributor Author

A version string has been added to the files. Please see https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#format-of-traineddata-files

However, I don't think there is 'control' over what the values should be. It takes any string currently for new files.

Ray/Jeff's uploads say Version string:4.00.00alpha

@Shreeshrii
Contributor Author

Shreeshrii commented Jun 12, 2018

However, I don't think there is 'control' over what the values should be. It takes any string currently for new files.

My earlier comment is NOT entirely correct. While the version string can be assigned any string, the program adds its own version in front of it. As an example, see a recent training run where the tesseract version 4.0.0-beta.1-344-ge61406 has been prepended to the custom version string.

 combine_tessdata -u san-iast.traineddata san-iast
Extracting tessdata components from san-iast.traineddata
Wrote san-iast.config
Wrote san-iast.lstm
Wrote san-iast.lstm-punc-dawg
Wrote san-iast.lstm-word-dawg
Wrote san-iast.lstm-number-dawg
Wrote san-iast.lstm-unicharset
Wrote san-iast.lstm-recoder
Wrote san-iast.version
Version string:4.0.0-beta.1-344-ge61406:san:shreeshrii20180610:from:4.00.00alpha:Devanagari:synth20170629test
0:config:size=1013, offset=192
17:lstm:size=2777386, offset=1205
18:lstm-punc-dawg:size=4322, offset=2778591
19:lstm-word-dawg:size=19187298, offset=2782913
20:lstm-number-dawg:size=450, offset=21970211
21:lstm-unicharset:size=23358, offset=21970661
22:lstm-recoder:size=2929, offset=21994019
23:version:size=94, offset=21996948
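
For a tool that wants to read this information, the listing above could be parsed along these lines. This is a sketch that assumes the output format stays exactly as shown here; parse_listing and is_contiguous are hypothetical helpers.

```python
import re

# Matches component lines such as "17:lstm:size=2777386, offset=1205".
ENTRY = re.compile(r"^(\d+):([\w-]+):size=(\d+), offset=(\d+)$")


def parse_listing(text):
    """Extract the version string and the component table from the listing."""
    version, entries = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Version string:"):
            version = line[len("Version string:"):]
            continue
        match = ENTRY.match(line)
        if match:
            index, name, size, offset = match.groups()
            entries.append((int(index), name, int(size), int(offset)))
    return version, entries


def is_contiguous(entries):
    """True if each component starts where the previous one ends."""
    return all(prev[3] + prev[2] == cur[3]
               for prev, cur in zip(entries, entries[1:]))
```

Splitting the version string on ":" then gives the tesseract version that wrote the file as the first field, followed by the custom part supplied during training.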

@manisandro
Owner

@Shreeshrii P.s. Sorry but I've not yet been able to do a build, something broke with the mingw-qt5-5.11 update in Fedora Rawhide, need to debug that one first.

@Shreeshrii
Contributor Author

No problem. Please take your time.

@manisandro
Owner

At last, here are some fresh builds:

https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_i686_tesseract4.git87635c1.exe
https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_x86_64_tesseract4.git87635c1.exe

I'd be also interested in testing of the tessdata manager, which should now also properly handle script tessdatas.

@Shreeshrii
Contributor Author

Shreeshrii commented Jun 27, 2018 via email
