
[egs] Add recipe for Mozilla Common Voice corpus v1 #2057

Merged: 5 commits merged into kaldi-asr:master on Dec 4, 2017

Conversation

3 participants
entn-at (Contributor) commented Dec 2, 2017

This is a basic recipe for the recently released Mozilla Common Voice corpus (v1, CC-0 licensed); see https://voice.mozilla.org/data

Some of the data preparation scripts were taken from the voxforge recipe (dict, LM). The systems and chain model setup were adapted from mini_librispeech (including speed perturbation, PCA transform for i-vector extraction, etc.).

I did not tune the setup; the chain system already achieves WERs of about 5% (see RESULTS).

entn-at (Contributor) commented Dec 2, 2017

Note that this recipe currently uses only the "valid" portion of the corpus, i.e. utterances that at least 2 people have listened to, with a majority of those listeners agreeing that the audio matches the text.
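As an illustration, that validation rule can be sketched as follows (a hypothetical sketch: the `up_votes`/`down_votes` column names follow the Common Voice v1 metadata CSVs, and the sample rows are made up):

```shell
# Tiny made-up metadata sample (filename,up_votes,down_votes):
cat > cv_sample.csv <<'EOF'
filename,up_votes,down_votes
clip001.mp3,3,0
clip002.mp3,1,0
clip003.mp3,1,2
EOF

# "Valid" clips: at least 2 total listens, and more up-votes than
# down-votes (i.e. a majority agreed the audio matches the text).
awk -F, 'NR > 1 && ($2 + $3) >= 2 && $2 > $3 { print $1 }' cv_sample.csv
# prints: clip001.mp3
```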

fixed-affine-layer name=lda input=Append(-2,-1,0,1,2,ReplaceIndex(ivector, t, 0)) affine-transform-file=$dir/configs/lda.mat
# the first splicing is moved before the lda layer, so no splicing here
relu-batchnorm-layer name=tdnn1 dim=512

danpovey (Contributor) commented Dec 2, 2017

This system is rather small for a 500-hour dataset. You may want to try dim=768 instead of 512.

I also notice that in the RESULTS file you called this 1e (IIRC).
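Following the widening suggestion above, the change would be a one-value edit to the xconfig layer line (a sketch; everything except the dim value is unchanged from the snippet under review):

```
relu-batchnorm-layer name=tdnn1 dim=768
```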

@@ -0,0 +1,65 @@
#!/bin/bash

jtrmal (Contributor) commented Dec 2, 2017

Can you replace this script with a symlink to steps/score_kaldi.sh, please?
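One way to do this (a sketch, assuming the standard egs directory layout where local/score.sh sits next to a steps/ symlink at the recipe root):

```shell
# Replace the local copy with a relative symlink to the shared script.
mkdir -p local                                  # already exists in a real recipe
rm -f local/score.sh
ln -sf ../steps/score_kaldi.sh local/score.sh
```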

if [ $stage -le 0 ]; then
  mkdir -p $data

  local/download_and_untar.sh $(/usr/bin/dirname $data) $data_url

jtrmal (Contributor) commented Dec 2, 2017

Is there a particular reason for the absolute pathname /usr/bin/dirname?
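For reference, dirname is a standard POSIX utility, so letting the shell resolve it via PATH is the portable choice (hard-coding /usr/bin/dirname breaks on systems that install it elsewhere, e.g. under /bin). A sketch with a made-up example path:

```shell
data=data/cv_corpus_v1   # example value
dir=$(dirname "$data")   # resolved via PATH; no hard-coded /usr/bin
echo "$dir"
# prints: data
```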

--trainer.num-epochs=4 \
--trainer.frames-per-iter=1500000 \
--trainer.optimization.num-jobs-initial=3 \
--trainer.optimization.num-jobs-final=3 \

danpovey (Contributor) commented Dec 2, 2017

If your setup allows it, it would be a good idea, for speed, to increase num-jobs-final to something like 12.

entn-at (Contributor) commented Dec 2, 2017

Unfortunately I only have 3 GPUs, but I will change it to 12 in the script.
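The resulting edit would be a one-option change in the training invocation (a sketch; the surrounding flags are as in the snippet quoted above):

```
--trainer.optimization.num-jobs-initial=3 \
--trainer.optimization.num-jobs-final=12 \
```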

for f in phones.txt words.txt L.fst L_disambig.fst phones; do
  cp -r data/lang/$f $test
done
cat $lmdir/lm.arpa | \

jtrmal (Contributor) commented Dec 2, 2017

I'd prefer the rest of the script to be replaced by utils/format_lm.sh
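The replacement could look roughly like this (a sketch: the ARPA file and directory names are assumptions based on typical Kaldi recipes; utils/format_lm.sh takes a lang dir, a gzipped ARPA LM, a lexicon, and an output dir):

```shell
# Build the test lang directory directly from the ARPA LM.
gzip -c $lmdir/lm.arpa > $lmdir/lm.arpa.gz   # format_lm.sh expects a gzipped ARPA
utils/format_lm.sh data/lang $lmdir/lm.arpa.gz data/local/dict/lexicon.txt data/lang_test
```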

entn-at (Contributor) commented Dec 2, 2017

No problem, I have GridEngine set up. I'm going to test it with num-jobs-final=12 (it's just going to take a while longer).

entn-at added a commit that referenced this pull request Dec 2, 2017

Addressing comments: change score.sh to a symlink to steps/score_kaldi.sh; remove path to dirname; replace local/format_data.sh with call to utils/format_lm.sh; use <unk> instead of SIL in lexicon
entn-at (Contributor) commented Dec 2, 2017

I made the following changes:

  • change score.sh to a symlink to steps/score_kaldi.sh
  • remove absolute path to dirname
  • replace local/format_data.sh with a call to utils/format_lm.sh
  • use <unk> instead of SIL in lexicon

I'm currently running the whole recipe from start to finish. Once that's done I'll add another commit with the changes to run_tdnn_1a.sh and RESULTS.

danpovey (Contributor) commented Dec 4, 2017

Thanks a lot! @jtrmal, please merge when and if you're OK with it. No need to check more, necessarily.

entn-at (Contributor) commented Dec 4, 2017

Thanks for the quick review and the helpful comments!

jtrmal (Contributor) commented Dec 4, 2017

All right, I'll merge. Thanks a lot!

@jtrmal jtrmal merged commit 93ceca7 into kaldi-asr:master Dec 4, 2017

1 check passed: continuous-integration/travis-ci/pr (The Travis CI build passed)

kronos-cm added a commit to kronos-cm/kaldi that referenced this pull request Dec 18, 2017

Merge branch 'master' of https://github.com/kaldi-asr/kaldi
* 'master' of https://github.com/kaldi-asr/kaldi: (58 commits)
  [src] Fix bug in nnet3 optimization, affecting Scale() operation; cosmetic fixes. (kaldi-asr#2088)
  [egs] Mac compatibility fix to SGMM+MMI: remove -T option to cp (kaldi-asr#2087)
  [egs] Copy dictionary-preparation-script fix from fisher-english(8e7793f) to fisher-swbd and ami (kaldi-asr#2084)
  [egs] Small fix to backstitch in AMI scripts (kaldi-asr#2083)
  [scripts] Fix augment_data_dir.py (relates to non-pipe case of wav.scp) (kaldi-asr#2081)
  [egs,scripts] Add OPGRU scripts and recipes (kaldi-asr#1950)
  [egs] Add an l2-regularize-based recipe for image recognition setups (kaldi-asr#2066)
  [src] Bug-fix to assertion in cu-sparse-matrix.cc (RE large matrices) (kaldi-asr#2077)
  [egs] Add a tdnn+lstm+attention+backstitch recipe for tedlium (kaldi-asr#1982)
  [src,egs] Small cosmetic fixes (kaldi-asr#2074)
  [src] Small fix RE CuSparse error code printing (kaldi-asr#2070)
  [src] Fix compilation error on MSVC: missing include. (kaldi-asr#2064)
  [egs] Update to CSJ example scripts, with chain+TDNN recipes.  Thanks: @rickychanhoyin (kaldi-asr#2035)
  [scripts,egs] Convert ". path.sh" to ". ./path.sh" (kaldi-asr#2061)
  [doc] Add documentation about matrix row and column ranges in scp files.
  [egs] Add recipe for Mozilla Common Voice corpus v1 (kaldi-asr#2057)
  [scripts] Fix bug in slurm.pl affecting log format (kaldi-asr#2063)
  [src] Fix some small typos (kaldi-asr#2060)
  [scripts] Adding --num-threads option to ivector extraction scripts; script fixes (kaldi-asr#2055)
  [src] Bug-fix to conceptual bug in Minimum Bayes Risk/sausage code.  Thanks:@jtrmal (kaldi-asr#2056)
  ...

mahsa7823 pushed a commit to mahsa7823/kaldi that referenced this pull request Feb 28, 2018

[egs] Add recipe for Mozilla Common Voice corpus v1 (kaldi-asr#2057)
* [egs] Add recipe for Mozilla Common Voice corpus v1

* Addressing comments

change score.sh to a symlink to steps/score_kaldi.sh; remove path to dirname; replace local/format_data.sh with call to utils/format_lm.sh; use <unk> instead of SIL in lexicon

* Update chain tdnn system and results

* Add license (Apache 2.0) info line to data prep script

Skaiste pushed a commit to Skaiste/idlak that referenced this pull request Sep 26, 2018

[egs] Add recipe for Mozilla Common Voice corpus v1 (kaldi-asr#2057)
* [egs] Add recipe for Mozilla Common Voice corpus v1

* Addressing comments

change score.sh to a symlink to steps/score_kaldi.sh; remove path to dirname; replace local/format_data.sh with call to utils/format_lm.sh; use <unk> instead of SIL in lexicon

* Update chain tdnn system and results

* Add license (Apache 2.0) info line to data prep script
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment