FYI - tesseract4.0.0-beta.1 traineddata files for scripts #323
Comments
I think I'll generally need to check what needs to be updated for tesseract 4, also "fast" vs "best" models. |
Debian and Ubuntu packages have tessdata_fast. Only tessdata_best will work as "continue_from" for LSTM training. Those who want to use legacy tesseract engine with tesseract4 need traineddata from tessdata repo. |
Not really having time to play around with tesseract 4 at the moment, do you perhaps know:
|
No. It is the same, with -l in command line.
Earlier, all these files were in the root directory of the repo. Now the script traineddata files are under the script subdirectory. You will need to show both language and script files under the languages menu.
I don't think gImageReader allows training of tesseract; in that case you could limit users to the models from tessdata_fast, since they are newer, smaller and faster. There has also been a change regarding the TESSDATA_PREFIX variable: its processing has been made consistent with --tessdata-dir on the command line. Both should specify the full path. |
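To illustrate the point about TESSDATA_PREFIX and --tessdata-dir, here is a minimal sketch of how a front end might pick the directory. The helper name `resolveTessdataDir` and the precedence (command-line value before environment value) are assumptions for illustration, not something taken from tesseract or gImageReader. The only thing it relies on from the comment above is that both values are used as given, as the full path.

```cpp
#include <string>

// Hypothetical helper (not part of tesseract or gImageReader).
// cliDir    - value given via --tessdata-dir, empty if the option was not used
// envPrefix - value of TESSDATA_PREFIX (e.g. from std::getenv), may be nullptr
// Both are treated as the full path; nothing is appended to either of them.
std::string resolveTessdataDir(const std::string& cliDir, const char* envPrefix) {
    if (!cliDir.empty()) {
        return cliDir;                  // assumed precedence: explicit option first
    }
    if (envPrefix != nullptr && *envPrefix != '\0') {
        return std::string(envPrefix);  // fall back to the environment variable
    }
    return std::string();               // nothing configured; caller decides what to do
}
```

A caller would typically pass `std::getenv("TESSDATA_PREFIX")` as the second argument.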
Please see following comment from Ray today, regarding best and fast |
Thanks |
tesseract 4.0 supports this usage:
|
I suppose in the API,
|
CC:@stweil |
Ok, thanks! |
On Debian testing/sid and Ubuntu 18.04: The scripts are not in a 'script' dir. |
Since I'm also the Fedora maintainer of tesseract, I suppose the upstream recommendation would be something like
right? |
The script subdir is there in both fast and best. I think Amit recommended script under both best and fast:
const char* language = "fast/eng";
//const char* language = "best/eng";
This would also reflect the directory structure in tessdata_fast and tessdata_best. |
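As a rough sketch of the idea above, the language argument can be built from a subdirectory and a traineddata name. The helper `languageSpec` is hypothetical; the commented usage assumes the usual `tesseract::TessBaseAPI::Init(datapath, language)` call and that the subdirectory actually exists below the tessdata directory.

```cpp
#include <string>

// Hypothetical helper: build the language argument for a traineddata file
// that lives in a subdirectory (e.g. "script", "fast", "best") of tessdata.
std::string languageSpec(const std::string& subdir, const std::string& name) {
    return subdir.empty() ? name : subdir + "/" + name;
}

// Possible usage with the Tesseract API (not compiled here, needs
// <tesseract/baseapi.h> and a tessdata directory laid out accordingly):
//   tesseract::TessBaseAPI api;
//   api.Init(tessdataDir, languageSpec("script", "Devanagari").c_str());
```

Keeping the subdirectory separate from the name lets the UI list languages and scripts independently and only join them when calling into the engine.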
The user interface ideally should not only offer to install or select languages, but also scripts for Tesseract 4. And for most users using data from tessdata_fast seems to be the best choice. Therefore the README.txt also needs an updated URL https://github.com/tesseract-ocr/tessdata_fast. |
I've added support for handling script traineddatas and switched to tessdata_fast if compiled against tesseract4.x. Testing welcome, windows builds of current master are here: |
Thanks. I will download and give it a spin
|
I installed the beta just now and then looked at the share/tessdata folder. It still has all the old files that I had there. So, this could be a possible reason for incompatibility if users have older version of traineddata files. I understand the need to keep the old traineddata so that files do not have to be downloaded again. Maybe give an option during install whether old files should be saved... |
This is an issue I've thought about a couple of times. The best thing would be if the traineddatas contained the version of tesseract they are compatible with, so that you can notify the user about incompatible traineddatas. At the moment there is simply no way to know. |
Also, the reason for the script subdir was because there is a lao language and Lao script. I saw that all script files already in share/tessdata were identified as script in menu :-) As a testcase, I downloaded both lao and Lao thru the Tessdata manager. The process got an error, screenshot attached. It is probably related to overwriting the file - Do I need to run the program in administrator mode if I want to install languages or is that handled internally? |
OK, ran the program as administrator and chose Lao and lao. No error, but it installed only Lao, and there was no message either. Currently only this pair has a problem, because Windows is case insensitive. You could change one of their file names after downloading. |
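The Lao/lao case could be caught before downloading with a check along these lines. This is only a sketch: the function names are made up for illustration, and lower-casing is limited to ASCII, which is enough for the traineddata file names discussed here.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Lower-case an ASCII file name.
std::string toLowerAscii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Hypothetical check: return every name that differs only by case from a name
// seen earlier in the list, i.e. that would overwrite it on a case-insensitive
// file system.
std::vector<std::string> caseInsensitiveClashes(const std::vector<std::string>& names) {
    std::map<std::string, std::string> seen;  // lower-cased name -> first spelling
    std::vector<std::string> clashes;
    for (const std::string& name : names) {
        auto result = seen.emplace(toLowerAscii(name), name);
        if (!result.second && result.first->second != name) {
            clashes.push_back(name);
        }
    }
    return clashes;
}
```

A manager could then rename or warn about the names this returns instead of silently keeping only one file.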
Hmm, cannot reproduce all traineddatas being marked as script... |
And about admin vs non admin mode: if you don't have write permissions to |
I meant that the existing script traineddatas, those starting with a CAPITAL letter, were correctly identified as script in the menu, and sorted on top. Suggestions:
|
Ok makes sense |
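For reference, the "starts with a capital letter" rule mentioned above can be sketched as follows. Both function names are hypothetical, and the rule itself is just the naming convention described in this thread, not something the traineddata files declare.

```cpp
#include <cctype>
#include <string>

// Assumption from this thread: script traineddatas start with a capital
// letter (e.g. "Devanagari"), language traineddatas do not (e.g. "eng").
bool isScriptTraineddata(const std::string& name) {
    return !name.empty() && std::isupper(static_cast<unsigned char>(name[0])) != 0;
}

// Ordering for a menu: scripts first, then alphabetical within each group.
bool scriptsFirst(const std::string& a, const std::string& b) {
    const bool scriptA = isScriptTraineddata(a);
    const bool scriptB = isScriptTraineddata(b);
    if (scriptA != scriptB) {
        return scriptA;
    }
    return a < b;
}
```

`scriptsFirst` can be passed directly to `std::sort` as the comparison function.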
I've refreshed the builds (same links as above). |
Please rebuild these 4.00-beta builds again for testing. I am finding that my recently trained traineddata files are NOT working in gimagereader. They show up in menu, but when clicked on the recognition does not happen. Older custom traineddata still works. I am wondering whether there is some compatibility issue, since I can use them on command line with recently built tesseract. These are variations on the Devanagari.traineddata, if you want I can provide a copy of the file. |
I'll try and do so this evening - I'm currently fighting (in my currently limited free time) with the tessdata manager to find a sufficiently robust logic to detect the tessdata git tag to use. Since you are a contributor to the tessdata repository: it would be great if the TESSERACT_VERSION_STR exposed in version.h were equal to the tessdata tag. The 4.0.0-beta1 release still has
but the tessdata tag is |
@stweil is the one making changes regarding version. The tessdata files themselves are all marked as 4.00.00alpha since they are from last year. |
Do the tessdatas actually contain a version encoded somewhere? Up to now I always just relied on the github tag, which currently is 4.0.0-beta.1. |
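One way to compare a string like 4.00.00alpha with a tag like 4.0.0-beta.1 is to look only at the leading numeric components. The sketch below does that; the function names are hypothetical, and ignoring the alpha/beta suffix is an assumption that may be too lenient for real compatibility checks.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Extract the leading dot-separated numbers, e.g. "4.00.00alpha" -> {4, 0, 0}.
// Parsing stops at the first character that is neither a digit nor a dot
// following a number.
std::vector<int> numericVersionParts(const std::string& version) {
    std::vector<int> parts;
    std::size_t i = 0;
    while (i < version.size() && std::isdigit(static_cast<unsigned char>(version[i]))) {
        int value = 0;
        while (i < version.size() && std::isdigit(static_cast<unsigned char>(version[i]))) {
            value = value * 10 + (version[i] - '0');
            ++i;
        }
        parts.push_back(value);
        if (i < version.size() && version[i] == '.') {
            ++i;  // skip the separator and read the next component
        } else {
            break;
        }
    }
    return parts;
}

// Assumption: two version strings match if their numeric parts are equal;
// suffixes such as "alpha" or "-beta.1" are ignored.
bool sameNumericVersion(const std::string& a, const std::string& b) {
    return numericVersionParts(a) == numericVersionParts(b);
}
```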
Version string has been added to the files. Please see https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#format-of-traineddata-files However, I don't think there is 'control' over what the values should be. It takes any string currently for new files. Ray/Jeff's uploads say |
My earlier comment is NOT entirely correct. While version string can be assigned any string, the program adds its version before it. As an example see a recent training where the tesseract version
|
@Shreeshrii P.s. Sorry but I've not yet been able to do a build, something broke with the mingw-qt5-5.11 update in Fedora Rawhide, need to debug that one first. |
No problem. Please take your time. |
At last, here are some fresh builds: https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_i686_tesseract4.git87635c1.exe https://smani.fedorapeople.org/tmp/gImageReader_3.2.99_qt5_x86_64_tesseract4.git87635c1.exe I'd also be interested in testing of the tessdata manager, which should now also properly handle script tessdatas. |
Thank you. I was able to use my custom traineddata with it. Will test
further and let you know.
|
Please see tesseract-ocr/tessdata_fast@9f875fb
The newly available script traineddata files have now been moved to a subdirectory.
This is FYI and applies only to the experimental windows version.