
Other Languages #54

Open
cinjon opened this issue Aug 6, 2015 · 23 comments
@cinjon

cinjon commented Aug 6, 2015

Is there support for non-Latin languages like Chinese, Japanese, or Thai?

@adnanulhasan
Contributor

There are no default models, but you can train one easily, either using training data from real scanned images or artificial data generated with ocropus-linegen. We have used it for Devanagari and Greek script with a lot of success. Some researchers have reported results on Arabic handwriting recognition using OCRopus. I can help you get a basic model running if you decide to train your own.
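For reference, a minimal sketch of that workflow (the file names are placeholders, and the exact options and default output directory may vary by version; check ocropus-linegen --help):

    # Generate artificial line images from a UTF-8 text file and a TTF font
    ocropus-linegen -t corpus.txt -f MyScriptFont.ttf

    # Train an LSTM line recognizer on the generated lines
    # (assuming linegen/ as the generator's output directory)
    ocropus-rtrain -o mymodel linegen/*/*.bin.png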

@cinjon
Author

cinjon commented Aug 6, 2015

Thanks so much Adnan!

Your help would be much appreciated. Can you point me to what you did with the Devanagari or Greek languages? We can also take this offline if you prefer.

@adnanulhasan
Contributor

You are welcome!

The only thing we did differently for Devanagari was the text-line normalization: instead of the default ocropus line normalization, we used a different method.
I think it would be better if we could talk off this platform. You can email me at adnan@cs.uni-kl.de.

@isaomatsunami

Hi, thanks for this wonderful project.

I am trying to test it on Japanese text.
In case you don't know, Japanese text looks like this:
"日本語でFracturは亀の子文字という"
Yes, there are characters with more than 20 strokes, and Japanese uses around 5,000 different characters.
Which tuning parameters should I pay attention to? Rough suggestions are appreciated; I will experiment.

@isaomatsunami

In ocropus-rtrain, I changed repr to unicode so that the output prints readably:

    print " TRU:", unicode(transcript)
    print " ALN:", unicode(gta[:len(transcript)+5])
    print " OUT:", unicode(pred[:len(transcript)+5])

OCROPY learns Japanese!

You are great! My Mac is learning 2,705 characters now; it's just like a kid learning to read.
Model data is over 50 MB.

@isaomatsunami

ocropus-rtrain builds the codec (the set of target characters) via read_text + lstm.normalize_nfkc:
ocrolib.read_text() calls ocrolib.normalize_text(), which internally calls unicodedata.normalize('NFC', s);
lstm.normalize_nfkc() calls unicodedata.normalize('NFKC', s).

During the training loop, the ground-truth text (transcript) is loaded by ocrolib.read_text(base + ".gt.txt"),
so the transcript goes through NFC but not NFKC normalization.

Doesn't this cause any problem?
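For illustration, a small Python 2 snippet (matching ocropy's Python version) showing where NFC and NFKC disagree; the sample character is just an example:

    # -*- coding: utf-8 -*-
    # NFC keeps compatibility characters such as the "fi" ligature as-is,
    # while NFKC decomposes them, so a codec built from NFKC-normalized text
    # and a transcript that is only NFC-normalized can disagree.
    import unicodedata

    s = u"\ufb01"  # LATIN SMALL LIGATURE FI
    print repr(unicodedata.normalize('NFC', s))   # u'\ufb01' (unchanged)
    print repr(unicodedata.normalize('NFKC', s))  # u'fi' (decomposed)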

@isaomatsunami

After 4 million iterations with 2,402 kinds of Japanese characters, it does not seem to converge. I'll try the C++ version.

@cinjon
Author

cinjon commented Oct 6, 2015

How big was your dataset?

@isaomatsunami

I generated 2,000 lines of random text (UTF-8) from 2,402 characters (the official common-use character set).
The C++ version seems to run without any modification.
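For anyone reproducing this, a rough sketch (not the actual script) of generating such random lines as input for ocropus-linegen; charset.txt and the line length of 30 are assumptions:

    # -*- coding: utf-8 -*-
    # Write 2000 lines of random text drawn from a fixed character set,
    # for use as the -t input to ocropus-linegen.
    import codecs
    import random

    with codecs.open('charset.txt', encoding='utf-8') as f:
        chars = list(f.read().strip())  # the target characters, one string

    with codecs.open('random_lines.txt', 'w', encoding='utf-8') as out:
        for _ in range(2000):
            out.write(u''.join(random.choice(chars) for _ in range(30)) + u'\n')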

@tmbdev
Collaborator

tmbdev commented Oct 11, 2015

For Chinese characters, you probably need a much larger number of hidden units, and possibly some other tricks as well. Please share what you come up with.
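For example, assuming ocropus-rtrain's -S/--hiddensize option (the default hidden layer is much smaller; check ocropus-rtrain --help on your version):

    # 800 hidden units is illustrative, not a tested recommendation
    ocropus-rtrain -S 800 -o cjk-model training/*.bin.png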

@Halfish

Halfish commented Jan 29, 2016

@isaomatsunami Have you made any progress training Japanese characters? I'm trying to train ocropy to recognize Chinese now.

@isaomatsunami

No. I tried ocropy with 200 hidden nodes and found, as far as I can estimate, that it began to learn one character by forgetting another.
I am training clstm on 3,877 classes of Chinese/Japanese characters with 800 hidden nodes.
After 150,000 iterations, it stays at a 3.8-5% error rate. See the clstm section.
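For reference, a sketch of such a clstm run, assuming clstm's convention of passing hyperparameters through environment variables (the exact names, e.g. hidden and lrate, may differ between versions; see the clstm README):

    # clstmocrtrain reads file lists of line images with matching .gt.txt files
    hidden=800 lrate=1e-4 save_name=cjk-model \
        clstmocrtrain train-filelist.txt test-filelist.txt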

@wanghaisheng

wanghaisheng commented Apr 17, 2016

Any update about Chinese? I have read Adnan's PhD thesis, and I have 2 million documents (PDF or XPS, which we can transform to JPEG) containing both Chinese and English characters. I need some help and tips on how to train a model.
Do we need to specify the DPI of the images?

@adnanulhasan
Contributor

Hi,

It would be interesting to see how LSTM would work on Chinese. Can you send me some sample pages?

Kind regards,

Adnan Ul-Hasan


@wanghaisheng

@adnanulhasan
You can reach me at edwin_uestc@163.com

@wanghaisheng

@isaomatsunami Sir, how did you get all your ground truth data?
I am following https://github.com/tmbdev/ocropy/wiki/Working-with-Ground-Truth right now, but I want to generate training data from an existing character set.

@harinath141

Hi guys,
I am working on a model for Telugu, an Indic language, and I am stuck at this point:
I just want to train it with the Telugu character set, but ocropus-rtrain loads all characters, digits, and so on. I even created a telugu='' variable in ocrolib/chars.py, but did not succeed.
Please help me.

@adnanulhasan
Contributor

Hi,
Training ocropy for Telugu should be straightforward. You can use the -c parameter to include the characters from the GT text files.

@harinath141

Hi @adnanulhasan, thanks for your response.
I'm trying the command:

    ocropus-rtrain -o te book/0001/010000.bin.png -c telugucharacters

But it's not working.

@adnanulhasan
Contributor

Give the path to the gt.txt files instead of mentioning telugucharacters:

    -c book/0001/010000.gt.txt
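Putting it together, a full invocation might look like this (the glob patterns are illustrative; adjust them to your book layout):

    # Build the codec from all ground-truth transcripts and train on the
    # corresponding binarized line images
    ocropus-rtrain -o te-model book/*/*.bin.png -c book/*/*.gt.txt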

@harinath141

@adnanulhasan Thanks, dude.
Sometimes a traceback error comes up during training.
Is this issue still open?

@switchfootsid

@adnanulhasan One of the papers from your group mentions the availability of a ground-truth Devanagari database called 'Dev-DB'. Is there any possibility you could link me to it?

@ghost

ghost commented Jun 18, 2017

@adnanulhasan If I want to train an Arabic model, do you suggest using ocropy or clstm?
What changes should I make to ocropy's chars.py?
