Madcat Arabic handwritten text line recognition #2356
Conversation
… fixed bug with character encoding
… with certain gedi files
Guys, let's leave PIL for now and sort this out later.
It's not a deep dependency.
…On Fri, May 4, 2018 at 11:46 AM, jtrmal ***@***.***> wrote:
I wonder if we should use imagemagick or a similar package, either for converting the image on-the-fly to some easier format and/or for querying the image properties (size, for example). I know it adds another dependency, but it's a fairly common package and it might simplify the whole issue for us.
y.
On Fri, May 4, 2018 at 11:27 AM Hossein Hadian ***@***.***> wrote:
> I'm a little surprised to see you replace scipy with PIL. I thought we
> were going to move, in the long term at least, from PIL to scikit-image;
> at least I see a conversation in my email about this. @hhadian
> <https://github.com/hhadian> might want to comment.
>
> One option is to have a script image/check_dependencies.sh and have it
> check dependencies for all the things used in the image/ scripts -- we can
> add options to enable or disable certain checks as needed depending on
> where it's called from. @hhadian <https://github.com/hhadian>, what do
> you think?
>
> As Ashish explained, it was because scipy would inefficiently load the
> whole image just to get its size, and since, AFAIK, PIL is required
> (when we use the recommended imageio) regardless, I imagined it would
> make sense to use it here.
> As an alternative approach (to get_image2num_frames.py), I think we can
> skip this step, get the features (without enforcing image lengths to
> allowed lengths), then compute image2num_frames from the features (which
> will be fast), and finally enforce the utterance length when getting the
> e2e egs in nnet3-chain-e2e-get-egs.cc. This can work for OCR because the
> beginning/end frames are always white pixel padding and we can simply
> repeat them.
>
> I agree with image/check_dependencies.sh. We can remove
> local/check_tools.sh and instead call that in run.sh. Also, we can move
> local/make_features.py to image/ because I guess all the OCR recipes use
> the same copy (is that right Ashish?).
>
> Hossein
>
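The efficiency point above (PIL can report an image's size without decoding the pixel data, unlike the scipy-based approach) can be sketched as follows; the temporary image below is just a stand-in for a real MADCAT line image:

```python
from PIL import Image  # Pillow, the maintained PIL fork
import os, tempfile

# Create a small stand-in image (in the recipe this would be a MADCAT line image).
path = os.path.join(tempfile.mkdtemp(), "line.png")
Image.new("L", (640, 48), color=255).save(path)

# Image.open() is lazy: it parses only the file header, so querying .size
# does not decode the pixel data.
with Image.open(path) as im:
    width, height = im.size
print(width, height)
```

This is why querying the size via PIL is cheap even for large page images.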
Madcat ar lm
updating parameters
Madcat ar 2
@@ -0,0 +1,226 @@
+#!/bin/bash
+
@aarora8, there should be results at the top of these files, obtained from the compare_wer.sh script.
Currently, run_cnn_chainali_1b.sh was giving a slightly worse result than run_flatstart_cnn1a.sh. I am currently running it with more epochs and more tree leaves. Should I update it after the current run completes, or with the recent results?
Sorry, I realized I haven't run run_cnn_1a.sh, run_cnn_chainali_1b, and run_cnn_end2endali_1a with the recent code. I have currently run the run_flatstart_cnn1a.sh and run_cnn_end2endali_1b (UC) scripts. I updated the results on those two scripts; I will run and update the results for the other scripts as well.
adding latest results, removing dev from extract features to save time
You can just put the current results for now and change them later.
But I think you should create a tuning/ directory and make those run* scripts links into it, i.e. all the 1a and 1b suffixes (etc.) would be within the tuning/ directory and there would be soft links from one directory above.
Experiments that differ only in a suffix like 1a or 1b are supposed to be different versions of the same experiment (differently tuned), and you generally won't change the results after doing the initial experiment. At least that's how we normally do it.
But before we merge this, it's OK to just take the best current experiment and make it 1a.
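The layout described above can be sketched as follows; the paths are illustrative (standing in for egs/madcat_ar/v1/local/chain/), and Python's os.symlink plays the role of `ln -s`:

```python
import os, tempfile

# Toy directory standing in for local/chain/ (illustrative paths only).
chain_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(chain_dir, "tuning"))

# The tuned variant lives inside tuning/ ...
with open(os.path.join(chain_dir, "tuning", "run_cnn_chainali_1b.sh"), "w") as f:
    f.write("#!/bin/bash\n")

# ... and a relative soft link one directory above points at it.
os.symlink(os.path.join("tuning", "run_cnn_chainali_1b.sh"),
           os.path.join(chain_dir, "run_cnn_chainali_1b.sh"))

link = os.path.join(chain_dir, "run_cnn_chainali_1b.sh")
print(os.path.islink(link))
```

Using a relative link target keeps the links valid if the recipe directory is moved or copied.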
OK, thank you. I will add the results and make it 1a.
Yes, update it later.
…On Tue, May 15, 2018 at 1:32 AM, Ashish Arora ***@***.***> wrote:
In egs/madcat_ar/v1/local/train_lm.sh
<#2356 (comment)>:
+#for order in 3; do
+#rm -f ${lm_dir}/${num_word}_${order}.pocolm/.done
+
+if [ $stage -le 0 ]; then
+ mkdir -p ${dir}/data
+ mkdir -p ${dir}/data/text
+
+ echo "$0: Getting the Data sources"
+
+ rm ${dir}/data/text/* 2>/dev/null || true
+
+ # use the validation data as the dev set.
+ # Note: the name 'dev' is treated specially by pocolm, it automatically
+ # becomes the dev set.
+
+ cat data/dev/text | cut -d " " -f 2- > ${dir}/data/text/dev.txt
I tried using the first 5000 lines of the train text here and the rest for training, but the first 5k lines occur again in the remaining training text, and train_lm.sh gives an error because of it. Currently, to remove the error, I reverted the change. I was also working on adding Arabic Gigaword corpus text data to the language model; that is not complete yet. Should I update it later with some portion of the Arabic Gigaword corpus text data?
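One way around the duplicate-lines error described above would be to drop from the training portion any line that also appears in the held-out dev portion; a toy sketch (the corpus, names, and split size are illustrative, not the recipe's actual code):

```python
# Toy corpus with deliberate repeats, mimicking duplicated lines in the
# MADCAT train text (illustrative data only).
lines = [f"line {i % 7000}" for i in range(20000)]

# First 5000 lines become the dev set (pocolm treats 'dev' specially).
dev = lines[:5000]
dev_set = set(dev)

# Drop any training line that also occurs in dev, so the sets stay disjoint.
train = [l for l in lines[5000:] if l not in dev_set]
print(len(dev), len(train))
```

Whether deduplicating like this is statistically appropriate for LM training is a separate question; the sketch only shows how to keep the dev set disjoint.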
It uses 750k utterances from the MADCAT Arabic data. It extracts the line images using the MAR (minimum area rectangle). It contains the recent TDNN training recipes for end-to-end and regular chain training. @hhadian
- Add README (README.txt)
- Add scripts for data preparation (text, wav.scp and utt2spk files): local/prepare_data.sh, local/process_data.py, local/create_line_image_from_page_image.py
- Add script for feature extraction: local/make_features.py
- Add scripts for lexicon, language modeling and grammar: local/train_lm.sh, local/prepare_lexicon.py, local/prepare_dict.sh
- Add scripts for GMM-HMM training and chain models: local/chain/run_cnn_1a.sh, run.sh, run_end2end.sh, local/chain/run_cnn_chainali_1b.sh, local/chain/run_flatstart_cnn1a.sh, local/chain/compare_wer.sh
- Other: cmd.sh, links to image/steps/utils, v1/local/score.sh, path.sh, local/check_tools.sh
Some of its info and features are as follows:
- It gets the line image from the MAR (minimum area rectangle).
- It currently builds the language model from the training utterances only.
- Its lexicon size is 95k words; the OOV rate is around 1.5%.
- For quick debugging and experiments, it can be run on a subset of the dataset based on the writing conditions (writing style, speed, carefulness) of the image.
- It contains the recent TDNN training recipes for end-to-end and regular chain training.
- WER 12.97% with line images formed by stitching the word images.
- WER 15.03% with line images formed using the MAR.
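The minimum-area-rectangle extraction mentioned above can be illustrated with a brute-force sketch (create_line_image_from_page_image.py presumably uses a proper library routine for this; the toy version below just searches a grid of rotation angles for the tightest axis-aligned box):

```python
import numpy as np

def min_area_rect(points, steps=360):
    """Approximate the minimum-area bounding rectangle of 2-D points by
    rotating them through a grid of angles in [0, 90) degrees and keeping
    the angle whose axis-aligned bounding box has the smallest area."""
    pts = np.asarray(points, dtype=float)
    best_area, best_angle = np.inf, 0.0
    for theta in np.linspace(0.0, np.pi / 2, steps, endpoint=False):
        c, s = np.cos(theta), np.sin(theta)
        rot = pts @ np.array([[c, -s], [s, c]]).T
        area = np.ptp(rot[:, 0]) * np.ptp(rot[:, 1])
        if area < best_area:
            best_area, best_angle = area, theta
    return best_area, best_angle

# Corners of a slightly rotated rectangle (true minimum area is 52;
# its axis-aligned bounding box has area 77).
corners = [(0, 0), (10, 2), (9, 7), (-1, 5)]
area, angle = min_area_rect(corners)
print(area)
```

The payoff for line images is the gap between the two areas: the MAR crops a rotated text line much more tightly than the axis-aligned bounding box would.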
To do:
- replace PIL in create_line_image_from_page_image
- replace the convex_hull library routine
- update configs