EC2 Installation Walkthrough

Introduction

In this guide I will explain how to set up OpenDcd with Kaldi on EC2 and decode open-source models based on the LibriSpeech corpus. For this walkthrough I used a large instance with four cores and 15GB of memory. OpenDcd is very memory efficient for both decoding and graph construction, and this is easily enough to decode the large 4-gram model.

Machine Configuration and Setup

sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install -y gcc-4.9 g++-4.9 cpp-4.9 subversion make zlib1g-dev automake libtool autoconf libatlas3-base flac git

Due to a bug in gcc 4.8, we install gcc 4.9 and add symbolic links to make it the system default:

sudo ln -s /usr/bin/g++-4.9 /usr/bin/g++
sudo ln -s /usr/bin/gcc-4.9 /usr/bin/gcc
sudo ln -s /usr/bin/gcc-4.9 /usr/bin/cc
sudo ln -s -f bash /bin/sh
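
Both compilers should now report version 4.9.x:

   gcc --version
   g++ --version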

Install Kaldi

   svn co https://svn.code.sf.net/p/kaldi/code/trunk kaldi
   cd kaldi/tools
   make
   cd ../src
   ./configure

For decent runtime performance it is essential to edit the kaldi.mk file and add the -O2 switch before building.
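
One way to script this edit (a minimal sketch, assuming kaldi.mk defines the usual CXXFLAGS variable; you can equally open the file in an editor):

   # Append -O2 to the compiler flags (check that your kaldi.mk
   # actually uses CXXFLAGS before relying on this one-liner)
   echo 'CXXFLAGS += -O2' >> kaldi.mk

Now just type make to build Kaldi, optionally specifying the number of cores: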

  make -j4

Install OpenDcd

   git clone https://github.com/edobashira/opendcd.git
   cd opendcd/3rdparty
   make
   cd ../src/bin
   make -j4

Graph Construction

There are two graph construction methods. In the first, we take a set of Kaldi component transducers as the input to the build process. In the second, we take a raw language model and lexicon and build everything from scratch. In this recipe we will use the pre-built models from kaldi-asr.org and follow the first method.

Download Models

We need three sets of models: the language model and lexicon, the acoustic model, and the models used in the iVector extractor.
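
As a sketch only (the archive names and URL below are hypothetical placeholders; browse kaldi-asr.org for the current LibriSpeech online-nnet2 build and substitute the real paths):

   # REPLACE_WITH_BUILD_PATH is a placeholder; take the real path from kaldi-asr.org
   BASE=http://kaldi-asr.org/downloads/REPLACE_WITH_BUILD_PATH
   wget "$BASE/lang_test_tgsmall.tar.gz"   # language model and lexicon
   wget "$BASE/nnet_a.tar.gz"              # acoustic model and iVector extractor
   tar -zxf lang_test_tgsmall.tar.gz
   tar -zxf nnet_a.tar.gz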

Build Cascade

The helper script makeclevel.sh will build the cascade from the models. It needs four parameters: the two model directories, the directory to write the results to, and the path where Kaldi is installed.

   script/makeclevel.sh lang_test_tgsmall  nnet_a graph_test_tgsmall ../../kaldi

Decoding

In modern neural-network-based speech recognition the decoding pipeline consists of three steps: feature extraction, state likelihood computation, and the search algorithm.

Features

First we will grab a set of utterances from OpenSLR.

   wget http://www.openslr.org/resources/12/test-clean.tar.gz
   tar -zxf test-clean.tar.gz

Recent versions of Kaldi include a new online decoding setup with several tools for online decoding. In particular, online2-wav-nnet2-am-compute is perfect for our needs: it takes the raw waveform, computes the features, and outputs the neural network activations. This makes it ideal for connecting with OpenDcd to complete the recognition cascade.

 online2bin/online2-wav-nnet2-am-compute \
  --online=true \
  --apply-log=true \
  --config=online_nnet2_decoding.conf \
  nnet_a/final.mdl \
  ark:test-clean.utt2spk \
  "ark:~/tools/kaldi/src/featbin/wav-copy scp,p:test-clean.scp ark:- |" \
  ark:-

Before running this we need to create several config files and two utterance lists. The OpenDcd repository contains the config files, and the utterance lists are generated by a helper script (a sketch for generating them with standard shell tools follows the utt2spk example below). The script file test-clean.scp gives the file names and the flac command used to convert them to raw WAV files.

   1089-134686-0011  flac -c -d -s LibriSpeech/test-clean/1089/134686//1089-134686-0011.flac |
   1089-134686-0028  flac -c -d -s LibriSpeech/test-clean/1089/134686//1089-134686-0028.flac |
   1089-134686-0032  flac -c -d -s LibriSpeech/test-clean/1089/134686//1089-134686-0032.flac |
   ...
   ...

The second list is the utt2spk file. We won't be using any speaker adaptation in this walkthrough, but the file is still needed; with no adaptation each utterance simply maps to itself.

   1089-134686-0011        1089-134686-0011
   1089-134686-0028        1089-134686-0028
   1089-134686-0032        1089-134686-0032
   1089-134686-0012        1089-134686-0012
   1089-134686-0022        1089-134686-0022
   ...
   ...
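
Both lists can be generated from the extracted corpus with standard shell tools; a minimal sketch, assuming the LibriSpeech directory layout shown above:

   # Build test-clean.scp and test-clean.utt2spk from the flac files
   for f in $(find LibriSpeech/test-clean -name '*.flac' | sort); do
     utt=$(basename "$f" .flac)
     echo "$utt flac -c -d -s $f |" >> test-clean.scp
     echo "$utt $utt" >> test-clean.utt2spk
   done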

Decode

In the final step we connect the feature extraction to OpenDcd to complete the decoding pipeline.

  ~/tools/kaldi/src/online2bin/online2-wav-nnet2-am-compute \
  --online=true \
  --apply-log=true \
  --config=online_nnet2_decoding.conf \
  nnet_a/final.mdl \
  ark:test-clean.utt2spk \
  "ark:~/tools/kaldi/src/featbin/wav-copy scp,p:test-clean.scp ark:- |" \
  ark:- 2> feats.log |\
  ../src/bin/dcd-recog \
    --word_symbols_table=words.txt \
    --decoder_type=hmm_lattice \
    --beam=15 \
    --acoustic_scale=0.1 \
    --fst_reset_period=1 \
    graph_test_tgsmall/arcs.far \
    graph_test_tgsmall/la.C.det.L.fst,graph_test_tgsmall/G.fst \
    ark:- recog.far

If everything worked correctly the decoder will write the recognition results in two ways: directly to stdout as part of the logging, and as an OpenFst FAR file. Running farinfo on the output archive summarizes its contents:

   farinfo recog.far
   far type                                          sttable
   arc type                                          standard
   fst type                                          vector
   # of FSTs                                         38
   total # of states                                 4043
   total # of arcs                                   4005
   total # of final states                           38
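
Individual lattices can be pulled out of the archive for closer inspection, for example with OpenFst's farextract (a sketch; see the OpenFst FAR tools for the full set of options):

   # Write each FST in the archive to its own file, named by utterance key
   farextract --filename_prefix=lat_ recog.far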

Evaluation

farprintnbeststrings is a tool included with OpenDcd that provides many more features than the standard OpenFst tool. compute-wer is a command from Kaldi. The command below prints the best string from each of the FAR files output by the decoder; in addition it removes the unk symbol and does not print the path weights. The reference file LibriSpeech/dev-clean/text is installed automatically as part of the LibriSpeech scripts.

    farprintnbeststrings --symbols=words.txt --print_weights=false \
    --format=kaldi --wildcards=3 dev-clean.a?.far | \
    compute-wer --text --mode=present ark:- ark:LibriSpeech/dev-clean/text

   Beam   Max Arcs   WER    SER
   15     inf        7.79   61.07