Other Languages #54
There are no default models, but you can train one easily, either using training data from real scanned images or artificial data generated with ocropus-linegen. We have used it for Devanagari and Greek script with a lot of success. Some researchers have reported results on Arabic handwriting recognition using OCRopus. I can help you get a basic model running if you decide to train your own.
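Training your own model starts from paired line images and transcripts: ocropy's tools pair each `*.bin.png` line image with a `*.gt.txt` transcript sharing the same stem. The sketch below illustrates that layout; the helper functions (`make_pair`, `find_pairs`) are illustrative stand-ins, not part of ocrolib.

```python
# Illustrative sketch of the training-data layout ocropus-rtrain consumes:
# each line image "NNNN.bin.png" sits next to a transcript "NNNN.gt.txt".
import os
import tempfile

def make_pair(directory, stem, transcript):
    """Create a placeholder line image and its ground-truth transcript."""
    png = os.path.join(directory, stem + ".bin.png")
    gt = os.path.join(directory, stem + ".gt.txt")
    open(png, "wb").close()  # stands in for a real binarized line image
    with open(gt, "w", encoding="utf-8") as f:
        f.write(transcript + "\n")
    return png, gt

def find_pairs(directory):
    """List (image, transcript) pairs the way a trainer would discover them."""
    pairs = []
    for name in sorted(os.listdir(directory)):
        if name.endswith(".bin.png"):
            stem = name[: -len(".bin.png")]
            gt = os.path.join(directory, stem + ".gt.txt")
            if os.path.exists(gt):
                pairs.append((os.path.join(directory, name), gt))
    return pairs

with tempfile.TemporaryDirectory() as d:
    make_pair(d, "0001", "देवनागरी")
    make_pair(d, "0002", "ελληνικά")
    print(len(find_pairs(d)))  # 2
```

With such pairs in place, training is a matter of pointing ocropus-rtrain at the image files; the transcripts are found automatically by the naming convention.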
Thanks so much Adnan! Your help would be much appreciated. Can you point me to what you did with the Devanagari or Greek languages? We can also take this offline if you prefer.
You are welcome! The only thing we did differently for Devanagari was the text-line normalization. Instead of using the default ocropus line normalization, we used a different method.
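The idea of swapping the line normalizer can be sketched generically. The version below simply rescales a line image to a fixed height with nearest-neighbour sampling; this is a hedged stand-in (the target height and the method are assumptions for illustration, not the normalizer the commenters actually used, and ocropy's default dewarping normalizer is more involved).

```python
import numpy as np

TARGET_HEIGHT = 48  # illustrative value, not from the thread

def normalize_line(img, target_height=TARGET_HEIGHT):
    """Rescale a 2-D line image to a fixed height, preserving aspect
    ratio, via nearest-neighbour index sampling.  A deliberately simple
    stand-in for a script-specific text-line normalizer."""
    h, w = img.shape
    scale = target_height / float(h)
    new_w = max(1, int(round(w * scale)))
    rows = np.clip((np.arange(target_height) / scale).astype(int), 0, h - 1)
    cols = np.clip((np.arange(new_w) / scale).astype(int), 0, w - 1)
    return img[np.ix_(rows, cols)]

line = np.random.rand(31, 200)          # a mock text-line image
print(normalize_line(line).shape)       # (48, 310)
```

Scripts with features above or below the baseline (Devanagari's shirorekha and matras, for instance) are sensitive to how this vertical normalization is done, which is why a per-script method can matter.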
Hi, thanks for this wonderful project. I am trying to test it on Japanese text.
In ocropus-rtrain, I changed repr to unicode. You are great!
ocropus-rtrain creates the codec (the set of target characters) via read_text + lstm.normalize_nfkc, but during the training loop the correct text (transcript) is loaded by ocrolib.read_text(base+".gt.txt"). Doesn't this cause a problem?
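The concern above is that the codec is built from NFKC-normalized text while the training loop reads raw transcripts, so a raw transcript can contain characters the codec never saw. A small stdlib demo shows characters common in Japanese text that NFKC rewrites (this mirrors the mismatch being described, not ocropy's actual code path):

```python
import unicodedata

# Characters that NFKC rewrites: a full-width digit, a decomposed
# (dakuten) kana, and a squared unit symbol.
samples = ["１", "カ\u3099", "㌔"]

for s in samples:
    norm = unicodedata.normalize("NFKC", s)
    # each sample changes under NFKC, so a codec built from normalized
    # text would lack the raw form found in an unnormalized transcript
    print(repr(s), "->", repr(norm))
```

If the mismatch is real, the fix would be to apply the same normalization in both places, i.e. normalize transcripts before looking characters up in the codec.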
After 4 million iterations with 2402 kinds of Japanese characters, it does not seem to converge. I'll try the C++ version.
How big was your dataset? |
I generated 2000 lines of random text (UTF-8) from 2402 characters (the official common-use character set).
For Chinese characters, you probably need a much larger number of hidden units, and possibly some other tricks as well. Please share what you come up with. |
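A rough parameter count gives a sense of why a large character set strains a small network: the softmax output layer alone scales with the number of classes. The accounting below follows a standard single-layer bidirectional LSTM (4 gates, each with input, recurrent and bias weights); ocropy's actual network may differ in detail, so treat the numbers as order-of-magnitude only.

```python
def lstm_param_count(n_input, n_hidden, n_classes, bidirectional=True):
    """Rough parameter count: one LSTM layer plus a softmax output layer
    that sees the (possibly concatenated) hidden state."""
    # 4 gates, each with input weights, recurrent weights and a bias
    gates = 4 * (n_hidden * n_input + n_hidden * n_hidden + n_hidden)
    directions = 2 if bidirectional else 1
    output = (directions * n_hidden + 1) * n_classes  # weights + bias per class
    return directions * gates + output

latin = lstm_param_count(48, 100, 100)    # ~100-class Latin codec
cjk   = lstm_param_count(48, 100, 2402)   # 2402-class Japanese codec
print(latin, cjk)  # 139300 602002
```

With 100 hidden units, moving from ~100 to 2402 classes more than quadruples the model size while leaving the recurrent capacity unchanged, which is consistent with the advice to increase the hidden layer substantially for CJK scripts.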
@isaomatsunami Have you made any progress in training Japanese Character? I'm trying to train ocropy to recognize Chinese now. |
No. I tried ocropy with 200 hidden nodes and found that, as far as I can tell, it began to learn one character by forgetting another.
Any update on Chinese? I have read Adnan's PhD thesis, and I have 2 million documents (PDF or XPS, which we can convert to JPEG) containing both Chinese and English characters. I need some help and tips on how to train a model.
Hi, it would be interesting to see how LSTM would work on Chinese. Can you send me some sample pages? Kind regards, Adnan Ul-Hasan
@adnanulhasan |
@isaomatsunami Sir, how did you get all your ground-truth data?
Hi guys |
Hi, |
Hi @adnanulhasan, thanks for your response.
Give the path to the gt.txt files instead of mentioning telugucharacters.
@adnanulhasan thanks dude, |
@adnanulhasan One of the papers from your group mentions the availability of a ground-truth Devanagari database called 'Dev-DB'. Is there any chance you could link me to it?
@adnanulhasan If I want to train an Arabic model, do you suggest using ocropy or clstm? |
Is there support for non-latin languages like Chinese, Japanese or Thai?