EC2 Installation Walkthrough
In this guide I explain how to set up OpenDcd with Kaldi on EC2 and decode open-source models based on the Librispeech corpus. For this walkthrough I used a large instance with four cores and 15GB of memory. OpenDcd is very memory efficient for both decoding and graph construction, and this is easily enough to decode the large 4-gram model.
Machine Configuration and Setup
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install -y gcc-4.9 g++-4.9 cpp-4.9 subversion make zlib1g-dev automake libtool autoconf libatlas3-base flac git
Due to a bug in gcc 4.8, we installed gcc 4.9 and set symbolic links to make it the system default. The last command also points /bin/sh at bash.
sudo ln -s /usr/bin/g++-4.9 /usr/bin/g++
sudo ln -s /usr/bin/gcc-4.9 /usr/bin/gcc
sudo ln -s /usr/bin/gcc-4.9 /usr/bin/cc
sudo ln -s -f bash /bin/sh
svn co https://svn.code.sf.net/p/kaldi/code/trunk kaldi
cd kaldi/tools
make
cd ../src
./configure
For decent runtime performance it is essential to edit the kaldi.mk file and add the -O2 switch. Then type make to build Kaldi, optionally specifying the number of cores with -j.
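The kaldi.mk edit can also be scripted. The following is a minimal sketch, not part of Kaldi or OpenDcd: add_o2 is a hypothetical helper, and it assumes that kaldi.mk passes compiler flags through a make variable named CXXFLAGS, so check your generated file before relying on it. It appends a new line rather than editing the existing assignment, so multi-line flag definitions are left untouched.

```shell
# add_o2 FILE: append "CXXFLAGS += -O2" to FILE unless -O2 already
# appears somewhere in it. Hypothetical helper for illustration;
# assumes the makefile fragment uses the CXXFLAGS variable.
add_o2() {
  mk_file="$1"
  if ! grep -q -- '-O2' "$mk_file"; then
    printf '\nCXXFLAGS += -O2\n' >> "$mk_file"
  fi
}

# Usage (from kaldi/src, after ./configure):
#   add_o2 kaldi.mk
#   make -j4
```

Running the helper twice is harmless, because the second call finds the flag and leaves the file alone.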
git clone https://github.com/edobashira/opendcd.git
cd opendcd/3rdparty
make
cd ../src/bin
make -j4
There are two graph construction methods. In the first, we take a set of Kaldi component transducers as the input to the build process. In the second, we take a raw language model and lexicon and build everything from scratch. In this recipe we use the pre-built models from kaldi-asr.org and the first method.
We need three sets of models: the language model and lexicon, the acoustic model, and the models used in the iVector extractor.
The helper script makeclevel.sh will build the cascade from the models. It needs four parameters: the two locations of the model files, the directory to write the result to, and the path where Kaldi is installed.
script/makeclevel.sh lang_test_tgsmall nnet_a graph_test_tgsmall ../../kaldi
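Because the build takes several paths, it can help to confirm the input directories exist before starting it. This is a small sketch under that assumption only; check_build_inputs is a hypothetical helper and does not inspect which files makeclevel.sh expects inside each directory.

```shell
# check_build_inputs DIR...: report every argument that is not an
# existing directory and return non-zero if any are missing.
# Hypothetical helper; it only tests for directory existence.
check_build_inputs() {
  status=0
  for dir in "$@"; do
    if [ ! -d "$dir" ]; then
      echo "missing directory: $dir" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage:
#   check_build_inputs lang_test_tgsmall nnet_a ../../kaldi &&
#     script/makeclevel.sh lang_test_tgsmall nnet_a graph_test_tgsmall ../../kaldi
```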
In modern neural-network-based speech recognition, the decoding pipeline consists of three steps: feature extraction, state likelihood computation, and the search algorithm.
First we will grab a set of utterances from OpenSLR.
wget http://www.openslr.org/resources/12/test-clean.tar.gz
tar -zxf test-clean.tar.gz
Recent versions of Kaldi include a new online decoder that comes with several tools for online decoding. In particular, online2-wav-nnet2-am-compute is perfect for our needs. It takes the raw waveform and computes the features and the neural network's output activations, which makes it ideal for connecting with OpenDcd to complete the recognition cascade.
online2bin/online2-wav-nnet2-am-compute \
  --online=true \
  --apply-log=true \
  --config=online_nnet2_decoding.conf \
  nnet_a/final.mdl \
  ark:test-clean.utt2psk \
  "ark:~/tools/kaldi/src/featbin/wav-copy scp,p:test-clean.scp ark:- |" \
  ark:-
We first need to create several config files and the utterance lists. The OpenDcd repository contains the config files, and the utterance lists are generated by a helper script. Their contents are briefly described here. The utterance list test-clean.scp gives the utterance IDs and the flac command that converts each file to raw wav.
1089-134686-0011 flac -c -d -s LibriSpeech/test-clean/1089/134686//1089-134686-0011.flac |
1089-134686-0028 flac -c -d -s LibriSpeech/test-clean/1089/134686//1089-134686-0028.flac |
1089-134686-0032 flac -c -d -s LibriSpeech/test-clean/1089/134686//1089-134686-0032.flac |
... ...
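To make the format concrete, here is a rough sketch of how a list in this shape could be produced from the extracted corpus. It is not the helper script shipped with OpenDcd; make_scp is a hypothetical function that assumes each utterance ID is the flac file name without its extension.

```shell
# make_scp DIR: print one "<utterance-id> flac -c -d -s <path> |" line
# for every .flac file under DIR, sorted by utterance ID.
# Hypothetical helper, not the script from the OpenDcd repository.
make_scp() {
  find "$1" -name '*.flac' | while IFS= read -r flac_path; do
    utt_id="$(basename "$flac_path" .flac)"
    printf '%s flac -c -d -s %s |\n' "$utt_id" "$flac_path"
  done | LC_ALL=C sort
}

# Usage:
#   make_scp LibriSpeech/test-clean > test-clean.scp
```

The trailing pipe character on each line tells Kaldi to run the command and read the wav data from its standard output.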
The other utterance list file is the utt2spk file. We won't be using any speaker adaptation in this walkthrough, but the file is still needed; with no adaptation, each utterance ID simply maps to itself.
1089-134686-0011 1089-134686-0011
1089-134686-0028 1089-134686-0028
1089-134686-0032 1089-134686-0032
1089-134686-0012 1089-134686-0012
1089-134686-0022 1089-134686-0022
... ...
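Since every utterance maps to itself, an identity mapping like this can be derived from the first column of the scp list. The sketch below assumes that column holds the utterance ID; make_utt2spk is a hypothetical name, not a tool from Kaldi or OpenDcd.

```shell
# make_utt2spk SCP_FILE: print "<utterance-id> <utterance-id>" for each
# line of SCP_FILE, treating every utterance as its own speaker.
# Hypothetical helper for illustration.
make_utt2spk() {
  awk '{ print $1, $1 }' "$1"
}

# Usage: redirect the output to whatever file name the decoding
# command's rspecifier points at, for example:
#   make_utt2spk test-clean.scp > test-clean.utt2spk
```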
In the final step we connect the feature extraction to OpenDcd to complete the decoding pipeline.
~/tools/kaldi/src/online2bin/online2-wav-nnet2-am-compute \
  --online=true \
  --apply-log=true \
  --config=online_nnet2_decoding.conf \
  nnet_a/final.mdl \
  ark:test-clean.utt2psk \
  "ark:~/tools/kaldi/src/featbin/wav-copy scp,p:test-clean.scp ark:- |" \
  ark:- 2> feats.log |\
  ../src/bin/dcd-recog \
  --word_symbols_table=words.txt \
  --decoder_type=hmm_lattice \
  --beam=15 \
  --acoustic_scale=0.1 \
  --fst_reset_period=1 \
  graph_test_tgsmall/arcs.far \
  graph_test_tgsmall/la.C.det.L.fst,graph_test_tgsmall/G.fst \
  ark:- recog.far
If everything worked correctly the decoder will write output like the following:
Currently the recognition results are written in two ways: directly to stdout as part of the logging, and as an OpenFst FAR file.
farinfo recog-dynamic.far
far type                 sttable
arc type                 standard
fst type                 vector
# of FSTs                38
total # of states        4043
total # of arcs          4005
total # of final states  38
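One quick sanity check is to compare the number of FSTs in the archive with the number of utterances that were decoded. The following sketch pulls that count out of farinfo output in the layout shown above; count_fsts is a hypothetical helper and assumes the "# of FSTs" line begins the line as it does here.

```shell
# count_fsts: read farinfo output on stdin and print the value from
# the "# of FSTs" line. Hypothetical helper for illustration.
count_fsts() {
  awk '/^# of FSTs/ { print $NF }'
}

# Usage:
#   farinfo recog.far | count_fsts
#   wc -l < test-clean.scp   # number of utterances in the input list
```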
farprintnbeststrings is a tool included with OpenDcd that provides many more features than the standard OpenFst tool.
compute-wer is a command from Kaldi. The command below prints the best string for each utterance in the set of FAR files output by the decoders. In addition, it removes the unk symbol and does not display the path weights. The reference file LibriSpeech/dev-clean/text is automatically installed as part of the librispeech scripts.
farprintnbeststrings --symbols=words.txt --print_weights=false \
  --format=kaldi --wildcards=3 dev-clean.a?.far | \
  compute-wer --text --mode=present ark:- ark:LibriSpeech/dev-clean/text