---

# Recipe for MiniLibriSpeech

Source:

* https://groups.google.com/forum/#!topic/kaldi-help/tzyCwt7zgMQ

* https://towardsdatascience.com/how-to-start-with-kaldi-and-speech-recognition-a9b7670ffff6

* https://eleanorchodroff.com/tutorial/kaldi/training-overview.html

* https://jrmeyer.github.io/asr/2016/12/15/DNN-AM-Kaldi.html

* https://kaldi-asr.org/doc/kaldi_for_dummies.html

* https://kaldi-asr.org/doc/tutorial_running.html for commands to view results and models

* https://github.com/DB-jiemin/kaldi-script/blob/master/decoding/thesis_oplatek.pdf

* https://sites.google.com/site/dpovey/kaldi-lectures

* https://medium.freecodecamp.org/a-deep-dive-into-part-of-speech-tagging-using-viterbi-algorithm-17c8de32e8bc (Viterbi algorithm)

Recipe taken from the CRIM Kaldi repo, cloned with:

* `git clone https://www.crim.ca/stash/scm/reco/crim_kaldi_egs.git`

It is located at:

* `crim_kaldi_egs/mini_librispeech/s5`

Snippets of code are taken from:

* `crim_kaldi_egs/mini_librispeech/s5/run.sh`

## Basics of ASR



During training, two models are trained independently. Speech training data, consisting of transcribed recordings in the target language, is used to learn the acoustic model. The role of the acoustic model is to predict components of a phoneme, given a segment of audio features produced by the feature extractor. Phonemes are the elemental sounds of a language; English has about 44, for example. The language model is an n-gram model that predicts the next word in a sequence. It is trained on text only, such as transcripts of conversations. If you hear the sentence, “Today, the clouds look grey. I think it’s going to _____”, you know that the word ‘rain’ is far more likely to come next than ‘potato’. Together, these two models form the basis for speech recognition.

During inference, audio is run through the feature extractor and the acoustic model, producing phoneme probabilities over time. An ASR decoder uses these probabilities, along with the language model, to decode the most likely written sentence for the given input waveform. NVIDIA’s work on optimizing the Kaldi pipeline includes earlier GPU optimizations of the acoustic model and a GPU-based Viterbi decoder for the language-model decoding step, which NVIDIA presents as the first GPU parallelization of this compute-intensive process.
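The language-model intuition can be sketched with a toy bigram model. This is only an illustration: the three-sentence corpus is made up, and real n-gram models (such as the LibriSpeech ARPA files used later) add smoothing and back-off, which this sketch leaves out.

```bash
# Toy bigram model: P(next | prev) = count(prev next) / count(prev).
# The corpus below is invented for illustration only.
corpus=$(mktemp)
cat > "$corpus" <<'EOF'
it is going to rain
it is going to rain today
it is going to snow
EOF

# usage: bigram_prob <prev-word> <next-word> <corpus-file>
bigram_prob() {
  awk -v prev="$1" -v nxt="$2" '
    {
      for (i = 1; i < NF; i++) {
        if ($i == prev) {
          total++                      # prev followed by any word
          if ($(i + 1) == nxt) hits++  # prev followed by nxt
        }
      }
    }
    END { if (total > 0) printf "%.3f\n", hits / total; else print "0.000" }
  ' "$3"
}

p_rain=$(bigram_prob to rain "$corpus")
p_potato=$(bigram_prob to potato "$corpus")
echo "P(rain | to)   = $p_rain"
echo "P(potato | to) = $p_potato"
rm -f "$corpus"
```

With this corpus, "to" is followed by "rain" in two of three cases and never by "potato", so the unsmoothed model assigns "potato" a probability of zero. Avoiding such zeros is one reason real language models use smoothing.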

## 0. Prepare directory structure and symbolic links

### 0.1 Create symbolic links in `crim_kaldi_egs/mini_librispeech/s5` for:

* `steps`: `ln -s ../../wsj/s5/steps .`
    
* `utils`: `ln -s ../../wsj/s5/utils .`
    
### 0.2 Create directories if not already present in recipe:

* `conf`: Configuration files for this recipe. The `conf` directory requires one file, `mfcc.conf`, which contains the parameters for MFCC feature extraction.
    
* `local`: Scripts and data specific to this recipe or project.
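The setup steps in this section can be sketched as a small script that is safe to re-run. It works in a throwaway directory, and the `egs/wsj/s5` and `egs/mini_librispeech/s5` layout is an assumption that mirrors the standard Kaldi tree; adjust the relative path if your checkout is organised differently.

```bash
# Sketch: create the symlinks and directories for a recipe, idempotently.
# The directory layout below is an assumption, built in a temp directory.
root=$(mktemp -d)
mkdir -p "$root/egs/wsj/s5/steps" "$root/egs/wsj/s5/utils" \
         "$root/egs/mini_librispeech/s5"
cd "$root/egs/mini_librispeech/s5"

for name in steps utils; do
  # -f replaces an existing link; -n avoids descending into it.
  ln -sfn "../../wsj/s5/$name" "$name"
done

# mkdir -p is a no-op when the directories already exist.
mkdir -p conf local

ls -l steps utils
```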

## 1. Obtain a written transcript of the speech data

For a more precise alignment, utterance (~sentence) level start and end times are helpful, but not necessary.

### 1.1 Consider these variables from now on:

```bash
# Change this location to somewhere where you want to put the data.
data=./corpus/

data_url=www.openslr.org/resources/31
lm_url=www.openslr.org/resources/11
```

### 1.2 Consider this command file definition (`cmd.sh`) if you run locally:

```bash
# you can change cmd.sh depending on what type of queue you are using.
# If you have no queueing system and want to run on a local machine, you
# can change all instances 'queue.pl' to run.pl (but be careful and run
# commands one by one: most recipes will exhaust the memory on your
# machine).  queue.pl works with GridEngine (qsub).  slurm.pl works
# with slurm.  Different queues are configured differently, with different
# queue names and different ways of specifying things like memory;
# to account for these differences you can create and edit the file
# conf/queue.conf to match your queue's configuration.  Search for
# conf/queue.conf in http://kaldi-asr.org/doc/queue.html for more information,
# or search for the string 'default_config' in utils/queue.pl or utils/slurm.pl.

export train_cmd="run.pl"
export decode_cmd="run.pl"
export mkgraph_cmd="run.pl"
```

### 1.3 Run path.sh

`path.sh` should work if you have access to the build `/misc/scratch01/reco/osterrfr/kaldi_hg_builds/build_2019-03-24_33_1ac8c922cbf6b2c34756d4b467cfa6067a6dba90`. Otherwise, change the first line of the file so that it points to the root of a cloned Kaldi repo:

```bash
export KALDI_ROOT=`pwd`/../../..
```

### 1.4 Download datasets

```bash
# Download dev and train sets
# Saved in ./corpus
#
# name: Mini LibriSpeech ASR corpus
# summary: Subset of LibriSpeech corpus for purpose of regression testing
# category: speech
# license: CC BY 4.0
# file: dev-clean-2.tar.gz   development set, "clean" speech
# file: train-clean-5.tar.gz training set, "clean" speech
# file: md5sum.txt           md5 checksums of files

for part in dev-clean-2 train-clean-5; do
  local/download_and_untar.sh $data $data_url $part
done
```

## 2. Format transcripts for Kaldi

Kaldi requires various formats of the transcripts for acoustic model training. You’ll need the start and end times of each utterance, the speaker ID of each utterance, and a list of all words and phonemes present in the transcript.

### 2.1 Get language models

**Language models, corpus, vocabulary and lexicon** are taken from LibriSpeech.

* **librispeech-lm-corpus.tgz**: 14500 public domain books, used as training material for the LibriSpeech's LM
* **librispeech-lm-norm.txt.gz**: Normalized LM training text
* **librispeech-vocab.txt**: 200K word vocabulary for the LM
* **librispeech-lexicon.txt**: Pronunciations, some of which G2P auto-generated, for all words in the vocabulary
* **3-gram.arpa.gz**: 3-gram ARPA LM, not pruned

> lm_tglarge.arpa.gz

* **3-gram.pruned.1e-7.arpa.gz**: 3-gram ARPA LM, pruned with threshold 1e-7

> lm_tgmed.arpa.gz

* **3-gram.pruned.3e-7.arpa.gz**: 3-gram ARPA LM, pruned with threshold 3e-7

> lm_tgsmall.arpa.gz

* **4-gram.arpa.gz**: 4-gram ARPA LM, usually used for rescoring
* **g2p-model-5**: Fifth order Sequitur G2P model

```bash
# Download language models
# Saved in ./corpus with symlink in data/local/lm
#
# name: LibriSpeech language models, vocabulary and G2P models
# summary: Language modelling resources, for use with the LibriSpeech ASR corpus
# category: text
# license: Public domain

local/download_lm.sh $lm_url $data data/local/lm
```

### 2.2 Prepare data

Prepare the data for the dev and train sets of Mini LibriSpeech. Output files are generated in `data/dev_clean_2` and `data/train_clean_5`. These files are:

* wav.scp

```
file_id path/file
1272-135031-0000 flac -c -d -s ./corpus//LibriSpeech/dev-clean-2/1272/135031/1272-135031-0000.flac |
```

* text

```
utt_id WORD1 WORD2 WORD3 WORD4 ...
1272-135031-0000 BECAUSE YOU WERE SLEEPING INSTEAD OF CONQUERING THE LOVELY ROSE PRINCESS HAS BECOME A FIDDLE WITHOUT A BOW WHILE POOR SHAGGY SITS THERE A COOING DOVE
```

* utt2spk

```
utt_id spkr
1272-135031-0000 1272-135031
1272-135031-0001 1272-135031
```

* spk2gender

```
spkr gender
1272-135031 m
1272-141231 m
```

* utt2dur

```
utt_id dur
1272-135031-0000 10.885
1272-135031-0001 11.13
```

* spk2utt

```
spkr utt_id1 utt_id2 utt_id3 ...
1272-135031 1272-135031-0000 1272-135031-0001 ...
```

```bash
  for part in dev-clean-2 train-clean-5; do
    # Use underscore-separated names in data directories.
    # Returns files in data/dev_clean_2 and data/train_clean_5:
    #     wav.scp, text, utt2spk, spk2gender, utt2dur
    # Usage: $0 <src-dir> <dst-dir>
    # e.g.: $0 /export/a15/vpanayotov/data/LibriSpeech/dev-clean data/dev-clean
    local/data_prep.sh $data/LibriSpeech/$part data/$(echo $part | sed s/-/_/g)
  done
```
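Of the files listed above, `spk2utt` carries the same information as `utt2spk`, grouped by speaker. Kaldi ships `utils/utt2spk_to_spk2utt.pl` for this conversion. The awk version below is only a sketch of the idea, and the third utterance ID is a made-up example.

```bash
# Sketch: derive spk2utt from utt2spk by grouping utterances per speaker.
utt2spk=$(mktemp)
cat > "$utt2spk" <<'EOF'
1272-135031-0000 1272-135031
1272-135031-0001 1272-135031
1272-141231-0000 1272-141231
EOF

spk2utt=$(awk '
  {
    if ($2 in utts) {
      utts[$2] = utts[$2] " " $1     # append to a known speaker
    } else {
      order[++n] = $2                # remember first-seen speaker order
      utts[$2] = $1
    }
  }
  END { for (i = 1; i <= n; i++) print order[i], utts[order[i]] }
' "$utt2spk")
echo "$spk2utt"
rm -f "$utt2spk"
```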

The next script prepares the dictionary and auto-generates the pronunciations for the words (lexicon) that are in our vocabulary but not in CMUdict. Files are saved in `data/local/dict_nosp`:

* silence_phones.txt

```
SIL
SPN
```

* optional_silence.txt

```
SIL
```

* nonsilence_phones.txt

```
AA AA0 AA1 AA2 
AE AE0 AE1 AE2 
AH AH0 AH1 AH2 
AO AO0 AO1 AO2 
...
```

* extra_questions.txt

```
SIL SPN 
AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG OW OY P R S SH T TH UH UW V W Y Z ZH 
AA1 AE1 AH1 AO1 AW1 AY1 EH1 ER1 EY1 IH1 IY1 OW1 OY1 UH1 UW1 
AA0 AE0 AH0 AO0 AW0 AY0 EH0 ER0 EY0 IH0 IY0 OW0 OY0 UH0 UW0 
AA2 AE2 AH2 AO2 AW2 AY2 EH2 ER2 EY2 IH2 IY2 OW2 OY2 UH2 UW2 
```

* lexicon.txt

```
WORD W ER D
LEXICON L EH K S IH K AH N
!SIL SIL
<SPOKEN_NOISE> SPN
<UNK> SPN  
A  AH0  
A  EY1  
A''S	EY1 Z  
A'BODY	EY1 B AA2 D IY0  
A'COURT	EY1 K AO2 R T  
A'D	EY1 D
```

```bash
# Saved in data/local/dict_nosp.
#     silence_phones.txt, optional_silence.txt, nonsilence_phones.txt, extra_questions.txt and lexicon.txt
# "nosp" refers to the dictionary  before silence probabilities and pronunciation
# probabilities are added.
local/prepare_dict.sh --stage 3 --nj 30 --cmd "$train_cmd" \
  data/local/lm data/local/lm data/local/dict_nosp
```
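A quick sanity check on the dictionary directory is that every phone used in `lexicon.txt` appears in one of the phone lists. Kaldi's own validator is `utils/validate_dict_dir.pl`; the sketch below only illustrates that one check, using truncated phone lists and an invented `BOGUS` entry.

```bash
# Sketch: flag lexicon entries that use a phone missing from the phone lists.
# The phone lists are truncated and the BOGUS entry is invented.
dict=$(mktemp -d)
printf 'SIL\nSPN\n' > "$dict/silence_phones.txt"
printf 'AA AA0 AA1 AA2\nEY EY0 EY1 EY2\nD\nZ\n' > "$dict/nonsilence_phones.txt"
cat > "$dict/lexicon.txt" <<'EOF'
!SIL SIL
<UNK> SPN
A EY1
A'D EY1 D
BOGUS XX1 D
EOF

cat "$dict/silence_phones.txt" "$dict/nonsilence_phones.txt" > "$dict/all_phones.txt"

bad=$(awk '
  FNR == NR { for (i = 1; i <= NF; i++) known[$i] = 1; next }   # phone lists
  { for (i = 2; i <= NF; i++) if (!($i in known)) print $1, $i } # lexicon
' "$dict/all_phones.txt" "$dict/lexicon.txt")
echo "unknown phones: ${bad:-none}"
rm -rf "$dict"
```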

### 2.3 Build lang directory

The next script prepares a directory such as `data/lang/`, in the standard format, given a source directory containing a dictionary `lexicon.txt` in a form like:

> word phone1 phone2 ... phoneN

per line (alternate prons would be separate lines), or a dictionary with probabilities called `lexiconp.txt` in a form:

> word pron-prob phone1 phone2 ... phoneN

(with 0.0 < pron-prob <= 1.0); note: if `lexiconp.txt` exists, we use it even if `lexicon.txt` exists.
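The two lexicon formats differ only by the probability column. As a sketch, a `lexicon.txt` can be turned into the `lexiconp.txt` form by giving every pronunciation a probability of 1.0. To my understanding `utils/prepare_lang.sh` does something equivalent when only `lexicon.txt` is present, but treat that as an assumption and check the script.

```bash
# Sketch: "word phone1 ... phoneN" -> "word pron-prob phone1 ... phoneN",
# with every pronunciation given probability 1.0.
lexicon=$(mktemp)
cat > "$lexicon" <<'EOF'
A AH0
A EY1
A'D EY1 D
EOF

lexiconp=$(awk '{ w = $1; $1 = ""; sub(/^ +/, ""); print w, "1.0", $0 }' "$lexicon")
echo "$lexiconp"
rm -f "$lexicon"
```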

The source directory must also contain the files `silence_phones.txt`, `nonsilence_phones.txt`, `optional_silence.txt` and `extra_questions.txt`. Here, `silence_phones.txt` and `nonsilence_phones.txt` are lists of silence and non-silence phones respectively (where silence includes various kinds of noise, laugh, cough, filled pauses etc., and nonsilence phones includes the "real" phones). Each line of those files is a list of phones, and the phones on each line are assumed to correspond to the same "base phone", i.e. they will be different stress or tone variations of the same basic phone.

The file `optional_silence.txt` contains just a single phone (typically SIL) which is used for optional silence in the lexicon.

`extra_questions.txt` might be empty; typically will consist of lists of phones, all members of each list with the same stress or tone; and also possibly a list for the silence phones.  This will augment the automatically generated questions (note: the automatically generated ones will treat all the stress/tone versions of a phone the same, so will not "get to ask" about stress or tone).

This script adds word-position-dependent phones and constructs a host of other derived files that go in `data/lang/`.

See http://kaldi-asr.org/doc/data_prep.html#data_prep_lang_creating for more info.

```bash
  # Usage: utils/prepare_lang.sh <dict-src-dir> <oov-dict-entry> <tmp-dir> <lang-dir>
  # e.g.: utils/prepare_lang.sh data/local/dict <SPOKEN_NOISE> data/local/lang data/lang
  utils/prepare_lang.sh data/local/dict_nosp \
    "<UNK>" data/local/lang_tmp_nosp data/lang_nosp
```

The script creates `$tmpdir/phone_map.txt`, which has the following format on each line:

> `<original phone> <version 1 of original phone> <version 2> ...`

Where the versions depend on the position of the phone within a word. For instance, we'd have:

> AA AA_B AA_E AA_I AA_S

for (B)egin, (E)nd, (I)nternal and (S)ingleton and in the case of silence:

> SIL SIL SIL_B SIL_E SIL_I SIL_S

because SIL on its own is one of the variants; this is for when it doesn't occur inside a word but as an option in the lexicon.
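The mapping described above can be reproduced with a few lines of shell. This is a sketch of the naming scheme only, not the code used by `utils/prepare_lang.sh`; the helper name `phone_map` is made up.

```bash
# Sketch: expand each phone into word-position-dependent variants.
# Silence phones keep the bare form as an extra variant.
phone_map() {   # usage: phone_map <silence|nonsilence> <phone>...
  kind=$1; shift
  for p in "$@"; do
    if [ "$kind" = silence ]; then
      echo "$p $p ${p}_B ${p}_E ${p}_I ${p}_S"
    else
      echo "$p ${p}_B ${p}_E ${p}_I ${p}_S"
    fi
  done
}

sil_map=$(phone_map silence SIL)
aa_map=$(phone_map nonsilence AA)
echo "$sil_map"
echo "$aa_map"
```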

The directory `data/lang_nosp` is populated with:
    
* phones/: Directory with language files
    
* L_disambig.fst: Lexicon FST with disambiguation symbols

* topo: HMM topology

* oov.int: Unknown word token-to-int
  
```
3
```

* oov.txt: Unknown word token

```
<UNK>
```
    
* L.fst: Lexicon FST
    
* words.txt : Mapping of word-to-int
    
```
<eps> 0
!SIL 1
<SPOKEN_NOISE> 2
<UNK> 3
A 4
A''S 5
A'BODY 6
```
    
* phones.txt: Mapping of phone-to-int
    
```
<eps> 0
SIL 1
SIL_B 2
``` 
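The symbol tables are what turn text into the integer IDs used inside the FSTs. As a sketch of how `words.txt` and `oov.int` work together, the function below maps a transcript line to integers and sends any out-of-vocabulary word to the ID of `<UNK>`. Kaldi's own tool for this is `utils/sym2int.pl` with `--map-oov`; the helper name, the utterance ID and the word `ZEBRA` here are made up.

```bash
# Sketch: map a transcript to integer ids with words.txt, sending
# out-of-vocabulary words to the id of <UNK> (the oov.int value).
words=$(mktemp)
cat > "$words" <<'EOF'
<eps> 0
!SIL 1
<SPOKEN_NOISE> 2
<UNK> 3
A 4
A''S 5
A'BODY 6
EOF

text_to_int() {   # usage: text_to_int <words.txt> <utt_id> <word>...
  table=$1; shift
  echo "$*" | awk -v table="$table" '
    BEGIN {
      while ((getline line < table) > 0) { split(line, f, " "); id[f[1]] = f[2] }
    }
    {
      out = $1                                   # keep the utterance id
      for (i = 2; i <= NF; i++)
        out = out " " (($i in id) ? id[$i] : id["<UNK>"])
      print out
    }'
}

ints=$(text_to_int "$words" utt-0001 A ZEBRA "A'BODY")
echo "$ints"
rm -f "$words"
```

Here `ZEBRA` is not in the toy table, so it comes out as `3`, the ID of `<UNK>`.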
    
### 2.4 Prepare Language Models
    
Reads and reformats the language models in `data/local/lm`.

Populates the directories `data/lang_nosp_test_tgsmall`, `data/lang_nosp_test_tgmed` and `data/lang_nosp_test_tglarge` with the same kind of files as in `data/lang_nosp`. Each directory corresponds to a different 3-gram LM (pruned or not).
    
```bash
# Prepare the test time language model(G) transducers
# Usage: $0 <lm-dir>
# e.g.: $0 /export/a15/vpanayotov/data/lm
local/format_lms.sh --src-dir data/lang_nosp data/local/lm

# Create ConstArpaLm format language model for full 3-gram and 4-gram LMs
# This script reads in an Arpa format language model, and converts it into the
# ConstArpaLm format language model.
# Usage: 
#   $0 [options] <arpa-lm-path> <old-lang-dir> <new-lang-dir>
utils/build_const_arpa_lm.sh data/local/lm/lm_tglarge.arpa.gz \
  data/lang_nosp data/lang_nosp_test_tglarge
```

## 3. Extract acoustic features from the audio

Mel Frequency Cepstral Coefficients (MFCC) are the most commonly used features, but Perceptual Linear Prediction (PLP) features and other features are also an option. These features serve as the basis for the acoustic models.

### 3.1 Spread MFCCs to different machines

Spread the MFCCs over several machines, as this dataset is quite large. This script creates storage directories on different file systems, and creates symbolic links to those directories.

```bash
utils/create_split_dir.pl /export/gpu-0{3,4,5}/egs/storage egs/storage
```

will `mkdir -p` all of those directories, and will create the links:

* `egs/storage/1` -> `/export/gpu-03/egs/storage`
* `egs/storage/2` -> `/export/gpu-04/egs/storage`
* ...
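The effect of `create_split_dir.pl` can be sketched in plain shell. This is an illustration of the behaviour described above, not the Perl script itself, and temporary directories stand in for the `/export/gpu-0{3,4,5}` file systems.

```bash
# Sketch: make each real storage directory and link it from
# <pseudo_storage_dir>/1, /2, ... in order.
# Temp directories stand in for the real /export/gpu-0{3,4,5} mounts.
base=$(mktemp -d)
pseudo="$base/egs/storage"
mkdir -p "$pseudo"

i=0
for host in gpu-03 gpu-04 gpu-05; do
  i=$((i + 1))
  real="$base/export/$host/egs/storage"
  mkdir -p "$real"
  ln -s "$real" "$pseudo/$i"
done

ls -l "$pseudo"
```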

```bash
  # Usage: utils/create_split_dir.pl <actual_storage_dirs> <pseudo_storage_dir>
  # e.g.: utils/create_split_dir.pl /export/gpu-0{3,4,5}/egs/storage egs/storage
  if [[  $(hostname -f) ==  *.clsp.jhu.edu ]]; then
    mfcc=$(basename $mfccdir) # in case was absolute pathname (unlikely), get basename.
    utils/create_split_dir.pl /export/b{07,14,16,17}/$USER/kaldi-data/egs/librispeech/s5/$mfcc/storage \
      $mfccdir/storage
  fi
```

### 3.2 Compute MFCCs and CMVN stats

**MFCC**: Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up an MFC. They are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). The difference between the cepstrum and the mel-frequency cepstrum is that in the MFC, the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly-spaced frequency bands used in the normal cepstrum.

**CMVN**: Cepstral mean and variance normalization (CMVN) is a computationally efficient normalization technique for robust speech recognition. The performance of CMVN is known to degrade for short utterances. This is due to insufficient data for parameter estimation and loss of discriminable information as all utterances are forced to have zero mean and unit variance.

CMVN minimizes distortion by noise contamination for robust feature extraction by linearly transforming the cepstral coefficients to have the same segmental statistics. Cepstral Normalization has been effective in the CMU Sphinx for maintaining a high level of recognition accuracy over a wide variety of acoustical environments.
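As a sketch of the normalization itself, the awk script below subtracts the per-coefficient mean and divides by the per-coefficient standard deviation over a toy feature matrix. The numbers are made up, and the matrix stands for a single speaker; the recipe accumulates these statistics per speaker with `steps/compute_cmvn_stats.sh`.

```bash
# Sketch: cepstral mean and variance normalization on a toy feature matrix.
# Rows are frames, columns are coefficients; the values are made up.
feats=$(mktemp)
cat > "$feats" <<'EOF'
1.0 10.0
2.0 20.0
3.0 30.0
EOF

normalized=$(awk '
  {
    ncol = NF
    for (j = 1; j <= NF; j++) { x[NR, j] = $j; sum[j] += $j; sumsq[j] += $j * $j }
  }
  END {
    for (j = 1; j <= ncol; j++) {
      mean[j] = sum[j] / NR
      std[j] = sqrt(sumsq[j] / NR - mean[j] * mean[j])   # population std dev
    }
    for (t = 1; t <= NR; t++) {
      line = ""
      for (j = 1; j <= ncol; j++)
        line = line (j > 1 ? " " : "") sprintf("%.3f", (x[t, j] - mean[j]) / std[j])
      print line
    }
  }' "$feats")
echo "$normalized"
rm -f "$feats"
```

Both columns end up with zero mean and unit variance, even though the second started ten times larger than the first.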

```bash
  for part in dev_clean_2 train_clean_5; do
    # Usage: $0 [options] <data-dir> [<log-dir> [<mfcc-dir>] ];
    # e.g.: $0 data/train exp/make_mfcc/train mfcc
    # Note: <log-dir> defaults to <data-dir>/log, and <mfccdir> defaults to <data-dir>/data
    steps/make_mfcc.sh --cmd "$train_cmd" --nj 10 data/$part exp/make_mfcc/$part $mfccdir
    
    # Compute cepstral mean and variance statistics per speaker.
    # We do this in just one job; it's fast.
    # This script takes no options.
    # Usage: $0 [options] <data-dir> [<log-dir> [<cmvn-dir>] ];
    # e.g.: $0 data/train exp/make_mfcc/train mfcc
    steps/compute_cmvn_stats.sh data/$part exp/make_mfcc/$part $mfccdir
  done
```

will populate `data/dev_clean_2` and `data/train_clean_5` with `feats.scp` and `cmvn.scp` files. These files contain paths to the MFCC features and CMVN statistics in `mfcc/raw_mfcc*` and `mfcc/cmvn*`.

```bash
# Get the shortest 500 utterances first because those are more likely
# to have accurate alignments.
utils/subset_data_dir.sh --shortest data/train_clean_5 500 data/train_500short
```

will populate `data/train_500short` with the same kind of files as in `data/train_clean_5` but for only 500 utterances instead of the complete train set (1519). Features computed for this small subset are used for training the monophone model.
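The idea of taking the shortest utterances can be sketched from a `utt2dur` file with standard tools. This is not how `utils/subset_data_dir.sh` is implemented; it only shows the selection step. The utterance IDs are placeholders, and two of the four durations are made up.

```bash
# Sketch: pick the N shortest utterances from a utt2dur file.
utt2dur=$(mktemp)
cat > "$utt2dur" <<'EOF'
utt-a 10.885
utt-b 11.13
utt-c 2.5
utt-d 7.02
EOF

n=2
# Sort numerically on the duration column, keep the first n ids.
shortest=$(LC_ALL=C sort -k2,2n "$utt2dur" | head -n "$n" | awk '{ print $1 }')
echo "$shortest"
rm -f "$utt2dur"
```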

## 4. Train monophone models

A monophone model is an acoustic model that does not include any contextual information about the preceding or following phone. It is used as a building block for the triphone models, which do make use of contextual information.

**Note**: from this point forward, we will be assuming a Gaussian Mixture Model/Hidden Markov Model (GMM/HMM) framework. This is in contrast to a deep neural network (DNN) system.

### 4.1 Training

`Delta+delta-delta`: training computes delta and double-delta features, or dynamic coefficients, to supplement the MFCC features. Delta and delta-delta features are numerical estimates of the first and second order derivatives of the signal (features). As such, the computation is usually performed on a larger window of feature vectors. While a window of two feature vectors would probably work, it would be a very crude approximation (similar to how a delta-difference is a very crude approximation of the derivative). Delta features are computed on the window of the original features; the delta-delta are then computed on the window of the delta-features.
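A minimal sketch of the delta computation, assuming the common regression formula over a window of N = 2 frames on each side and replicating the edge frames. The one-dimensional track is made up; real features have one such track per MFCC coefficient, and delta-deltas are obtained by applying the same step to the deltas.

```bash
# Sketch: delta features for a 1-D toy feature track, window N = 2:
#   d[t] = sum_{n=1..N} n * (c[t+n] - c[t-n]) / (2 * sum_{n=1..N} n^2)
# Frames outside the utterance are replaced by the nearest edge frame.
track="1 2 3 4 5 6"    # made-up values, one coefficient per frame

deltas=$(echo "$track" | awk -v N=2 '
  function clamp(i) { return i < 1 ? 1 : (i > NF ? NF : i) }
  BEGIN { for (n = 1; n <= N; n++) denom += 2 * n * n }
  {
    for (t = 1; t <= NF; t++) {
      num = 0
      for (n = 1; n <= N; n++) num += n * ($(clamp(t + n)) - $(clamp(t - n)))
      printf "%s%.2f", (t > 1 ? " " : ""), num / denom
    }
    print ""
  }')
echo "$deltas"
```

For this linear ramp the interior frames get a slope of 1.00, while the edge frames get smaller values (0.50 and 0.80) because of the replicated edges.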

`LDA-MLLT`: stands for Linear Discriminant Analysis – Maximum Likelihood Linear Transform. The Linear Discriminant Analysis takes the feature vectors and builds HMM states, but with a reduced feature space for all data. The Maximum Likelihood Linear Transform takes the reduced feature space from the LDA and derives a unique transformation for each speaker. MLLT is therefore a step towards speaker normalization, as it minimizes differences among speakers.

`LDA`: In this recipe, LDA stands for Linear Discriminant Analysis, a supervised linear transform that projects feature vectors (typically several spliced frames) onto a lower-dimensional space chosen to separate the acoustic classes as well as possible. It should not be confused with latent Dirichlet allocation, the topic model from natural language processing that shares the same acronym.

`SAT`: stands for Speaker Adaptive Training. SAT also performs speaker and noise normalization by adapting to each specific speaker with a particular data transform. This results in more homogenous or standardized data, allowing the model to use its parameters on estimating variance due to the phoneme, as opposed to the speaker or recording environment.

```bash
# Flat start and monophone training, with delta-delta features.
# This script applies cepstral mean normalization (per speaker).
# Usage: steps/train_mono.sh [options] <data-dir> <lang-dir> <exp-dir>
#  e.g.: steps/train_mono.sh data/train.1k data/lang exp/mono
steps/train_mono.sh --boost-silence 1.25 --nj 5 --cmd "$train_cmd" \
  data/train_500short data/lang_nosp exp/mono
```

Results are saved in `exp/mono`. Monophone system is initialized, training graphs are compiled and data is aligned.

`exp/mono/log/analyze_alignments.log` contains stats. Example:

```
At utterance begin, SIL accounts for 100.0% of phone occurrences, 
with duration (median, mean, 95-percentile) is (29,32.3,55) frames.
At utterance end, SIL accounts for 99.8% of phone occurrences, 
with duration (median, mean, 95-percentile) is (25,28.7,52) frames.
Overall, nonsilence accounts for 94.5% of phone occurrences, 
with duration (median, mean, 95-percentile) is (7,7.9,17) frames.
Overall, SIL accounts for 5.4% of phone occurrences, 
with duration (median, mean, 95-percentile) is (30,36.7,94) frames.
Overall, AH0_I accounts for 3.9% of phone occurrences, 
with duration (median, mean, 95-percentile) is (3,3.8,7) frames.
```

Global results are printed:

```
6 warnings in exp/mono/log/init.log
1 warnings in exp/mono/log/update.*.log
318 warnings in exp/mono/log/align.*.*.log
exp/mono: nj=5 align prob=-99.26 over 1.15h [retry=0.4%, fail=0.0%] states=127 gauss=1000
steps/train_mono.sh: Done training monophone system in exp/mono
```

### 4.2 Decoding

`mkgraph.sh` creates a fully expanded decoding graph (HCLG) that represents all the language-model, pronunciation dictionary (lexicon), context-dependency, and HMM structure in our model.  The output is a Finite State Transducer (FST) that has word-ids on the output, and pdf-ids on the input (these are indexes that resolve to Gaussian Mixture Models). See `http://kaldi-asr.org/doc/graph_recipe_test.html`.

`decode.sh` works on CMN + (delta+delta-delta | LDA+MLLT) features; it works out what type of features you used (assuming it's one of these two). It can also apply Feature-space Maximum Likelihood Linear Regression (fMLLR) transforms when they are supplied; fMLLR is a widely used technique for speaker adaptation in HMM-based speech recognition.

```bash
  (
    # "Usage: utils/mkgraph.sh [options] <lang-dir> <model-dir> <graphdir>"
    # "e.g.: utils/mkgraph.sh data/lang_test exp/tri1/ exp/tri1/graph"
    utils/mkgraph.sh data/lang_nosp_test_tgsmall \
      exp/mono exp/mono/graph_nosp_tgsmall
      
    for test in dev_clean_2; do
      # Usage: steps/decode.sh [options] <graph-dir> <data-dir> <decode-dir>"
      # ... where <decode-dir> is assumed to be a sub-directory of the directory"
      #  where the model is."
      # e.g.: steps/decode.sh exp/mono/graph_tgpr data/test_dev93 exp/mono/decode_dev93_tgpr"
      steps/decode.sh --nj 10 --cmd "$decode_cmd" exp/mono/graph_nosp_tgsmall \
        data/$test exp/mono/decode_nosp_tgsmall_$test
    done
  )&
```

will populate `exp/mono/decode_nosp_tgsmall_dev_clean_2` with lattices `lat.*.gz` and decoding logs in `exp/mono/decode_nosp_tgsmall_dev_clean_2/log`. Example of decoding log:

```
1272-135031-0000 BECAUSE YOU ARE SWEETNESS OF CONQUERING POLTROONS 
CRITICISMS TO FIT LOVABLE OLD PORTUGUESE ASSERTED COOING DOVE 
LOG (gmm-latgen-faster[5.5.286~1-b96ca]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:289) 
Log-like per frame for utterance 
1272-135031-0000 is -8.33485 over 1087 frames.
1272-135031-0001 HE HAS GONE GONE FOR GOOD DISCIPLINE CRUMPLED 
MANAGED TO SQUEEZE INTO HER ROOM BESIDE GRINDING AND HAD WITNESSED 
THE CROSSES WITH WHITE FINGERS 
LOG (gmm-latgen-faster[5.5.286~1-b96ca]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:289) 
Log-like per frame for utterance 
1272-135031-0001 is -8.51052 over 1111 frames.
1272-135031-0002 I HAVE REMAINED A PRISONER PLAY BECAUSE I WISHED 
TO BE ONE OF US INSISTED FOR THEN TURNED TO STONE CHANGES HE 
WHO USED TO BE INTRODUCED 
LOG (gmm-latgen-faster[5.5.286~1-b96ca]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:289) 
Log-like per frame for utterance 
1272-135031-0002 is -8.30051 over 1146 frames.
```

The directory contains WER results (`wer_*_*.*`) and lattices (`lat.*.gz`). In `scoring/` we have predicted transcripts (`*.*.*.tra`). In `scoring/log/` we have best paths description to obtain the transcriptions (`best_path.*.log`).
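The WER reported in those files is the word-level edit distance between reference and hypothesis, divided by the number of reference words. A minimal sketch is shown below; the recipe's scoring scripts rely on Kaldi's `compute-wer`, and this awk function is only an illustration. The two strings are the first four words of the reference and decoded transcripts for utterance `1272-135031-0000` shown earlier.

```bash
# Sketch: word error rate = word-level edit distance / reference length.
wer() {   # usage: wer "<reference words>" "<hypothesis words>"
  awk -v ref="$1" -v hyp="$2" '
    function min3(a, b, c,    m) { m = a < b ? a : b; return m < c ? m : c }
    BEGIN {
      n = split(ref, r, " "); k = split(hyp, h, " ")
      for (i = 0; i <= n; i++) d[i, 0] = i      # deletions only
      for (j = 0; j <= k; j++) d[0, j] = j      # insertions only
      for (i = 1; i <= n; i++)
        for (j = 1; j <= k; j++)
          d[i, j] = min3(d[i - 1, j] + 1, d[i, j - 1] + 1,
                         d[i - 1, j - 1] + (r[i] != h[j]))
      printf "%.2f\n", 100 * d[n, k] / n
    }'
}

score=$(wer "BECAUSE YOU WERE SLEEPING" "BECAUSE YOU ARE SWEETNESS")
echo "WER: $score%"
```

Two of the four reference words are substituted, so this prints a WER of 50.00%.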

## 5. Align audio with the acoustic models

The parameters of the acoustic model are estimated in acoustic training steps; however, the process can be better optimized by cycling through training and alignment phases. This is also known as Viterbi training (related, but more computationally expensive procedures include the Forward-Backward algorithm and Expectation Maximization). By aligning the audio to the reference transcript with the most current acoustic model, additional training algorithms can then use this output to improve or refine the parameters of the model. Therefore, each training step will be followed by an alignment step where the audio and text can be realigned.

Alignment techniques: https://montreal-forced-aligner.readthedocs.io/en/latest/alignment_techniques.html

### 5.1 Alignment

The actual alignment algorithm will always be the same; the different scripts accept different types of acoustic model input. Speaker independent alignment, as it sounds, will exclude speaker-specific information in the alignment process.

`fMLLR`: stands for Feature Space Maximum Likelihood Linear Regression. After SAT training, the acoustic model is no longer trained on the original features, but on speaker-normalized features. For alignment, we essentially have to remove the speaker identity from the features by estimating the speaker identity (with the inverse of the fMLLR matrix), then removing it from the model (by multiplying the inverse matrix with the feature vector). These quasi-speaker-independent acoustic models can then be used in the alignment process.

`align_si.sh` computes training alignments using a model with delta or LDA+MLLT features. If you supply the `--use-graphs true` option, it will use the training graphs from the source directory (where the model is). In this case the number of jobs must match the source directory.

```bash 
# "usage: steps/align_si.sh <data-dir> <lang-dir> <src-dir> <align-dir>"
# "e.g.:  steps/align_si.sh data/train data/lang exp/tri1 exp/tri1_ali"
steps/align_si.sh --boost-silence 1.25 --nj 5 --cmd "$train_cmd" \
  data/train_clean_5 data/lang_nosp exp/mono exp/mono_ali_train_clean_5
```

will populate `exp/mono_ali_train_clean_5` with alignments `ali.*.gz` and alignment logs in `exp/mono_ali_train_clean_5/log`. Example:

`ali.5`: Each integer is a transition-id attached to one frame; a transition-id identifies, among other things, the phone and HMM state aligned to that frame

```
6848-252323-0000 2 8 5 5 5 5 5 5 5 5 5 5 5 5 5 18 17 830 829 832 834 914 
913 916 918 1778 1780 1779 1779 1782 1781 1781 2156 2155 2158 2157 2157 
2157 2157 2160 2159 1094 1093 1093 1096 1095 1095 1095 1095 1095 1098 
...
```
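Because an alignment holds one integer per frame, the number of integers on a line gives the utterance length in frames. The sketch below converts a frame count to seconds, assuming a 10 ms frame shift (the usual MFCC default; check `conf/mfcc.conf` if it was changed). The alignment line is a made-up, shortened example, and 1087 is the frame count reported for utterance `1272-135031-0000` in the decoding log.

```bash
# Sketch: frames -> seconds, assuming a 10 ms frame shift.
frames_to_seconds() {   # usage: frames_to_seconds <num_frames> [frame_shift_sec]
  awk -v n="$1" -v step="${2:-0.01}" 'BEGIN { printf "%.2f\n", n * step }'
}

ali_line="utt-0001 2 8 5 5 5 5 18 17 830 829"     # made-up, shortened line
num_frames=$(echo "$ali_line" | awk '{ print NF - 1 }')   # drop the utt id
echo "$num_frames frames"

seconds=$(frames_to_seconds 1087)
echo "$seconds s"
```

The result, 10.87 s, is close to the 10.885 s listed for the same utterance in `utt2dur`.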


`analyze_alignments.log`

```
At utterance begin, SIL accounts for 100.0% of phone occurrences, 
with duration (median, mean, 95-percentile) is (29,31.7,55) frames.
At utterance end, SIL accounts for 99.7% of phone occurrences, 
with duration (median, mean, 95-percentile) is (24,27.7,51) frames.
Overall, nonsilence accounts for 94.5% of phone occurrences, 
with duration (median, mean, 95-percentile) is (7,8.0,17) frames.
Overall, SIL accounts for 5.4% of phone occurrences, 
with duration (median, mean, 95-percentile) is (32,39.3,102) frames.
Overall, N_I accounts for 3.8% of phone occurrences, 
with duration (median, mean, 95-percentile) is (5,5.3,10) frames.
```

`align.*.log`

```
LOG (compile-train-graphs[5.5.286~1-b96ca]:main():compile-train-graphs.cc:147) 
compile-train-graphs: succeeded for 285 graphs, failed for 0
LOG (gmm-align-compiled[5.5.286~1-b96ca]:main():gmm-align-compiled.cc:127) 
1737-146161-0015
LOG (apply-cmvn[5.5.286~1-b96ca]:main():apply-cmvn.cc:162) 
Applied cepstral mean normalization to 285 utterances, errors on 0
LOG (gmm-align-compiled[5.5.286~1-b96ca]:main():gmm-align-compiled.cc:127) 
1737-146161-0016
LOG (gmm-align-compiled[5.5.286~1-b96ca]:main():gmm-align-compiled.cc:127) 
1737-146161-0017
LOG (gmm-align-compiled[5.5.286~1-b96ca]:main():gmm-align-compiled.cc:135) 
Overall log-likelihood per frame is -99.2303 over 350828 frames.
LOG (gmm-align-compiled[5.5.286~1-b96ca]:main():gmm-align-compiled.cc:137) 
Retried 2 out of 285 utterances.
LOG (gmm-align-compiled[5.5.286~1-b96ca]:main():gmm-align-compiled.cc:139) 
Done 285, errors on 0
```

## 6. Train triphone models

While monophone models simply represent the acoustic parameters of a single phoneme, we know that phonemes will vary considerably depending on their particular context. The triphone models represent a phoneme variant in the context of two other (left and right) phonemes.

At this point, we’ll also need to deal with the fact that not all triphone units are present (or will ever be present) in the dataset. There are (# of phonemes)<sup>3</sup> possible triphone models, but only a subset of those will actually occur in the data. Furthermore, a unit must occur multiple times in the data for sufficient statistics to be gathered for it. A phonetic decision tree groups these triphones into a smaller number of acoustically distinct units, thereby reducing the number of parameters and making the problem computationally feasible.
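A rough order-of-magnitude sketch of that reduction follows. The phoneme count of 44 is the figure for English quoted in the introduction, the 3 states per triphone is an assumption about the HMM topology, and 1560 is the `states=` value from the `tri1` summary in this recipe. The recipe's actual phone set (with stress and word-position markers) is larger, so the real gap is wider.

```bash
# Sketch: possible triphones versus tied states after tree clustering.
num_phones=44          # phoneme count for English quoted in the introduction
states_per_hmm=3       # assumption: 3 emitting states per triphone HMM
tied_states=1560       # "states=1560" from the tri1 summary of this recipe

possible_triphones=$((num_phones * num_phones * num_phones))
possible_states=$((possible_triphones * states_per_hmm))

echo "possible triphones:        $possible_triphones"
echo "possible untied states:    $possible_states"
echo "tied states after the tree: $tied_states"
```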

### 6.1 Train a first delta + delta-delta triphone system (tri1)

Train a first delta + delta-delta triphone system on all utterances.

```bash
# Usage: steps/train_deltas.sh <num-leaves> <tot-gauss> <data-dir> <lang-dir> <alignment-dir> <exp-dir>
# e.g.: steps/train_deltas.sh 2000 10000 data/train_si84_half data/lang exp/mono_ali exp/tri1
steps/train_deltas.sh --boost-silence 1.25 --cmd "$train_cmd" 2000 10000 \
  data/train_clean_5 data/lang_nosp exp/mono_ali_train_clean_5 exp/tri1
```

Results are saved in `exp/tri1`. Tree statistics are accumulated, questions are generated, the tree is built, the alignments from the monophone model are converted using the new tree, and training graphs are compiled from the transcripts.

```
1 warnings in exp/tri1/log/update.*.log
13 warnings in exp/tri1/log/init_model.log
28 warnings in exp/tri1/log/align.*.*.log
1 warnings in exp/tri1/log/build_tree.log
exp/tri1: nj=5 align prob=-96.49 over 5.30h [retry=0.1%, fail=0.0%] 
states=1560 gauss=10022 tree-im
```

Here is a list of logs:

`acc_tree.*.log`

```
LOG (apply-cmvn[5.5.286~1-b96ca]:main():apply-cmvn.cc:162) 
Applied cepstral mean normalization to 285 utterances, errors on 0
LOG (acc-tree-stats[5.5.286~1-b96ca]:main():acc-tree-stats.cc:118) 
Accumulated stats for 285 files, 0 failed due to no alignment, 0 failed for other reasons.
LOG (acc-tree-stats[5.5.286~1-b96ca]:main():acc-tree-stats.cc:121) 
Number of separate stats (context-dependent states) is 34071
```

`analyze_alignments.log`

```
At utterance begin, SIL accounts for 100.0% of phone occurrences, 
with duration (median, mean, 95-percentile) is (28,30.9,54) frames.
At utterance end, SIL accounts for 99.7% of phone occurrences, 
with duration (median, mean, 95-percentile) is (23,25.5,49) frames.
Overall, nonsilence accounts for 94.4% of phone occurrences, 
with duration (median, mean, 95-percentile) is (7,8.1,17) frames.
Overall, SIL accounts for 5.5% of phone occurrences, 
with duration (median, mean, 95-percentile) is (29,36.1,98) frames.
Overall, N_I accounts for 3.8% of phone occurrences, 
with duration (median, mean, 95-percentile) is (5,5.2,10) frames.
```

`align.*.*.log`

```
LOG (apply-cmvn[5.5.286~1-b96ca]:main():apply-cmvn.cc:162) 
Applied cepstral mean normalization to 285 utterances, errors on 0
LOG (gmm-align-compiled[5.5.286~1-b96ca]:main():gmm-align-compiled.cc:127) 
1737-146161-0015
LOG (gmm-align-compiled[5.5.286~1-b96ca]:main():gmm-align-compiled.cc:127) 
1737-146161-0016
LOG (gmm-align-compiled[5.5.286~1-b96ca]:main():gmm-align-compiled.cc:127) 
1737-146161-0017
LOG (gmm-align-compiled[5.5.286~1-b96ca]:main():gmm-align-compiled.cc:135) 
Overall log-likelihood per frame is -98.5235 over 350828 frames.
LOG (gmm-align-compiled[5.5.286~1-b96ca]:main():gmm-align-compiled.cc:137) 
Retried 1 out of 285 utterances.
LOG (gmm-align-compiled[5.5.286~1-b96ca]:main():gmm-align-compiled.cc:139) 
Done 285, errors on 0
```

`update.*.*.log`

```
LOG (gmm-sum-accs[5.5.286~1-b96ca]:main():gmm-sum-accs.cc:63) 
Summed 5 stats, total count 1.90919e+06, avg like/frame -100.436
LOG (gmm-sum-accs[5.5.286~1-b96ca]:main():gmm-sum-accs.cc:66) 
Total count of stats is 1.90919e+06
LOG (gmm-sum-accs[5.5.286~1-b96ca]:main():gmm-sum-accs.cc:67) Written stats to -
LOG (gmm-est[5.5.286~1-b96ca]:MleUpdate():transition-model.cc:528) 
TransitionModel::Update, objf change is 0.319806 per frame over 1.90919e+06 frames. 
LOG (gmm-est[5.5.286~1-b96ca]:MleUpdate():transition-model.cc:531) 
82 probabilities floored, 8886 out of 13128 transition-states skipped due 
to insuffient data (it is normal to have some skipped.)
LOG (gmm-est[5.5.286~1-b96ca]:main():gmm-est.cc:102) Transition model update: 
Overall 0.319806 log-like improvement per frame over 1.90919e+06 frames.
LOG (gmm-est[5.5.286~1-b96ca]:MleAmDiagGmmUpdate():mle-am-diag-gmm.cc:225) 
0 variance elements floored in 0 Gaussians, out of 2000
LOG (gmm-est[5.5.286~1-b96ca]:MleAmDiagGmmUpdate():mle-am-diag-gmm.cc:229) 
Removed 0 Gaussians due to counts < --min-gaussian-occupancy=10 
and --remove-low-count-gaussians=true
LOG (gmm-est[5.5.286~1-b96ca]:main():gmm-est.cc:113) GMM update: 
Overall 0.00141222 objective function improvement per frame over 1.90919e+06 frames
LOG (gmm-est[5.5.286~1-b96ca]:main():gmm-est.cc:116) GMM update: 
Overall avg like per frame = -100.436 over 1.90919e+06 frames.
LOG (gmm-est[5.5.286~1-b96ca]:SplitByCount():am-diag-gmm.cc:116) 
Split 1560 states with target = 2000, power = 0.25, perturb_factor = 0.01 
and min_count = 20, split #Gauss from 2000 to 2002
LOG (gmm-est[5.5.286~1-b96ca]:main():gmm-est.cc:146) Written model to exp/tri1/2.mdl
```

`acc.*.*.log`

```
LOG (gmm-acc-stats-ali[5.5.286~1-b96ca]:main():gmm-acc-stats-ali.cc:105) 
Processed 50 utterances; for utterance 1088-134315-0049 avg. 
like is -101.339 over 1463 frames.
LOG (gmm-acc-stats-ali[5.5.286~1-b96ca]:main():gmm-acc-stats-ali.cc:105) 
Processed 100 utterances; for utterance 118-47824-0018 avg. 
like is -105.318 over 1401 frames.
LOG (gmm-acc-stats-ali[5.5.286~1-b96ca]:main():gmm-acc-stats-ali.cc:105) 
Processed 150 utterances; for utterance 118-47824-0068 avg. 
like is -104.088 over 1343 frames.
LOG (gmm-acc-stats-ali[5.5.286~1-b96ca]:main():gmm-acc-stats-ali.cc:105) 
Processed 200 utterances; for utterance 163-122947-0031 avg. 
like is -96.8731 over 1455 frames.
LOG (gmm-acc-stats-ali[5.5.286~1-b96ca]:main():gmm-acc-stats-ali.cc:105) 
Processed 250 utterances; for utterance 163-122947-0081 avg. 
like is -94.4654 over 1318 frames.
LOG (apply-cmvn[5.5.286~1-b96ca]:main():apply-cmvn.cc:162) 
Applied cepstral mean normalization to 285 utterances, errors on 0
LOG (gmm-acc-stats-ali[5.5.286~1-b96ca]:main():gmm-acc-stats-ali.cc:112) 
Done 285 files, 0 with errors.
LOG (gmm-acc-stats-ali[5.5.286~1-b96ca]:main():gmm-acc-stats-ali.cc:115) 
Overall avg like per frame (Gaussian only) = -100.506 over 350828 frames.
LOG (gmm-acc-stats-ali[5.5.286~1-b96ca]:main():gmm-acc-stats-ali.cc:123) 
Written accs.
```

`init_model.log`

```
LOG (gmm-init-model[5.5.286~1-b96ca]:main():gmm-init-model.cc:271) 
Number of separate statistics is 81610
WARNING (gmm-init-model[5.5.286~1-b96ca]:InitAmGmm():gmm-init-model.cc:83) 
Very small count for state 8: 58; corresponding phone list: 107 108 109 110 
WARNING (gmm-init-model[5.5.286~1-b96ca]:InitAmGmm():gmm-init-model.cc:83) 
Very small count for state 203: 56; corresponding phone list: 223 224 225 226 
WARNING (gmm-init-model[5.5.286~1-b96ca]:InitAmGmm():gmm-init-model.cc:83) 
Very small count for state 225: 71; corresponding phone list: 125 129 133 137 
WARNING (gmm-init-model[5.5.286~1-b96ca]:InitAmGmm():gmm-init-model.cc:83) 
Very small count for state 335: 77; corresponding phone list: 224 225 226 
WARNING (gmm-init-model[5.5.286~1-b96ca]:InitAmGmm():gmm-init-model.cc:83) 
Very small count for state 392: 95; corresponding phone list: 108 109 110 
WARNING (gmm-init-model[5.5.286~1-b96ca]:InitAmGmm():gmm-init-model.cc:83) 
Very small count for state 436: 98; corresponding phone list: 239 240 241 
242 243 244 245 246 247 248 249 250 251 252 253 254 
WARNING (gmm-init-model[5.5.286~1-b96ca]:InitAmGmm():gmm-init-model.cc:83) 
Very small count for state 631: 98; corresponding phone list: 183 184 185 
186 191 192 193 194 195 196 197 198 
WARNING (gmm-init-model[5.5.286~1-b96ca]:InitAmGmm():gmm-init-model.cc:83) 
Very small count for state 637: 26; corresponding phone list: 239 240 241 
242 243 244 245 246 247 248 249 250 251 252 253 254 
WARNING (gmm-init-model[5.5.286~1-b96ca]:InitAmGmm():gmm-init-model.cc:83) 
Very small count for state 697: 96; corresponding phone list: 283 284 285 286 
WARNING (gmm-init-model[5.5.286~1-b96ca]:InitAmGmm():gmm-init-model.cc:83) 
Very small count for state 741: 86; corresponding phone list: 183 184 185 
186 187 188 189 190 191 192 193 194 195 196 197 198 
WARNING (gmm-init-model[5.5.286~1-b96ca]:InitAmGmm():gmm-init-model.cc:83) 
Very small count for state 932: 51; corresponding phone list: 227 228 229 230 
WARNING (gmm-init-model[5.5.286~1-b96ca]:InitAmGmm():gmm-init-model.cc:83) 
Very small count for state 1100: 53; corresponding phone list: 43 44 45 46 
55 56 57 58 
WARNING (gmm-init-model[5.5.286~1-b96ca]:InitAmGmm():gmm-init-model.cc:83) 
Very small count for state 1452: 53; corresponding phone list: 43 45 46 51 
53 54 55 57 58 
LOG (gmm-init-model[5.5.286~1-b96ca]:main():gmm-init-model.cc:306) 
Wrote model.
```

`build_tree.log`

```
LOG (build-tree[5.5.286~1-b96ca]:main():build-tree.cc:104) 
Number of separate statistics is 81610
LOG (build-tree[5.5.286~1-b96ca]:BuildTree():build-tree.cc:161) 
BuildTree: before building trees, map has 41 leaves.
LOG (build-tree[5.5.286~1-b96ca]:SplitDecisionTree():build-tree-utils.cc:577) 
DoDecisionTreeSplit: split 1959 times, #leaves now 2000
LOG (build-tree[5.5.286~1-b96ca]:BuildTree():build-tree.cc:187) 
Setting clustering threshold to smallest split 551.656
VLOG[1] (build-tree[5.5.286~1-b96ca]:BuildTree():build-tree.cc:196) 
After decision tree split, num-leaves = 2000, like-impr = 4.57321 
per frame over 1.90919e+06 frames.
VLOG[1] (build-tree[5.5.286~1-b96ca]:BuildTree():build-tree.cc:200) 
Including just phones that were split, improvement is 4.57321 
per frame over 1.90919e+06 frames.
LOG (build-tree[5.5.286~1-b96ca]:BuildTree():build-tree.cc:215) 
BuildTree: removed 437 leaves.
VLOG[1] (build-tree[5.5.286~1-b96ca]:ClusterEventMapToNClustersRestrictedByMap()
:build-tree-utils.cc:914) Number of non-empty clusters in map = 41
VLOG[1] (build-tree[5.5.286~1-b96ca]:ClusterEventMapToNClustersRestrictedByMap()
:build-tree-utils.cc:915) Number of non-empty clusters = 1563
LOG (build-tree[5.5.286~1-b96ca]:BuildTree():build-tree.cc:231) BuildTree: 
Rounded num leaves to multiple of 8 by removing 3 leaves.
VLOG[1] (build-tree[5.5.286~1-b96ca]:BuildTree():build-tree.cc:256) 
Objf change due to clustering -0.0875763 per frame.
VLOG[1] (build-tree[5.5.286~1-b96ca]:BuildTree():build-tree.cc:259) 
Normalizing over only split phones, this is: -0.0875763 per frame.
VLOG[1] (build-tree[5.5.286~1-b96ca]:BuildTree():build-tree.cc:262) 
Num-leaves is now 1560
VLOG[1] (build-tree[5.5.286~1-b96ca]:main():build-tree.cc:143) 
For pdf-id 8, low count 58
VLOG[1] (build-tree[5.5.286~1-b96ca]:main():build-tree.cc:143) 
For pdf-id 203, low count 56
VLOG[1] (build-tree[5.5.286~1-b96ca]:main():build-tree.cc:143) 
For pdf-id 225, low count 71
VLOG[1] (build-tree[5.5.286~1-b96ca]:main():build-tree.cc:143) 
For pdf-id 335, low count 77
VLOG[1] (build-tree[5.5.286~1-b96ca]:main():build-tree.cc:143) 
For pdf-id 392, low count 95
VLOG[1] (build-tree[5.5.286~1-b96ca]:main():build-tree.cc:143) 
For pdf-id 436, low count 98
VLOG[1] (build-tree[5.5.286~1-b96ca]:main():build-tree.cc:143) 
For pdf-id 631, low count 98
VLOG[1] (build-tree[5.5.286~1-b96ca]:main():build-tree.cc:143) 
For pdf-id 637, low count 26
VLOG[1] (build-tree[5.5.286~1-b96ca]:main():build-tree.cc:143) 
For pdf-id 697, low count 96
VLOG[1] (build-tree[5.5.286~1-b96ca]:main():build-tree.cc:143) 
For pdf-id 741, low count 86
VLOG[1] (build-tree[5.5.286~1-b96ca]:main():build-tree.cc:143) 
For pdf-id 932, low count 51
VLOG[1] (build-tree[5.5.286~1-b96ca]:main():build-tree.cc:143) 
For pdf-id 1100, low count 53
VLOG[1] (build-tree[5.5.286~1-b96ca]:main():build-tree.cc:143) 
For pdf-id 1452, low count 53
WARNING (build-tree[5.5.286~1-b96ca]:main():build-tree.cc:185) 
Saw no stats for following phones: 2 3 4 5 6 7 8 9 11 12 13 14 
16 18 26 27 28 29 30 32 34 36 38 40 42 43 44 45 46 56 58 59 60 
61 62 64 66 70 72 74 75 76 77 78 80 82 86 90 91 92 93 94 98 103 
106 110 114 118 122 123 124 125 126 128 130 132 136 138 139 140 
141 142 150 151 154 155 156 157 158 162 170 174 178 180 182 183 
184 185 186 190 192 194 196 198 199 200 201 202 206 214 218 222 
230 234 235 238 239 240 241 242 246 254 255 256 257 258 259 260 
262 266 267 270 274 278 282 286 290 294 295 296 297 298 299 300 
302 303 304 306 307 308 310 311 312 313 314 315 318 319 322 323 
326 330 332 334 336 338 339 342 343 346 
LOG (build-tree[5.5.286~1-b96ca]:main():build-tree.cc:189) 
Wrote tree
```
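
The leaf counts reported in `build_tree.log` can be followed step by step. The short calculation below only replays the numbers printed in the log above; the variable names are chosen for this illustration.

```bash
initial_leaves=41      # "before building trees, map has 41 leaves"
splits=1959            # "split 1959 times"
removed=437            # "removed 437 leaves" (clustering)
rounding=3             # "Rounded num leaves to multiple of 8 by removing 3 leaves"

after_split=$((initial_leaves + splits))
after_cluster=$((after_split - removed))
final_leaves=$((after_cluster - rounding))

echo "after split: $after_split, after clustering: $after_cluster, final: $final_leaves"
```

The final value matches both `Num-leaves is now 1560` in this log and the `Split 1560 states` line in `update.*.*.log`.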


### 6.2 Decode first delta + delta-delta triphone model

```bash
# decode using the tri1 model
utils/mkgraph.sh data/lang_nosp_test_tgsmall exp/tri1 exp/tri1/graph_nosp_tgsmall

for test in dev_clean_2; do
  steps/decode.sh --nj 5 --cmd "$decode_cmd" exp/tri1/graph_nosp_tgsmall \
    data/$test exp/tri1/decode_nosp_tgsmall_$test

  # Do language model rescoring of lattices (remove old LM, add new LM)
  # Usage: steps/lmrescore.sh [options] <old-lang-dir> <new-lang-dir> <data-dir>
  #                                     <input-decode-dir> <output-decode-dir>
  steps/lmrescore.sh --cmd "$decode_cmd" data/lang_nosp_test_{tgsmall,tgmed} \
    data/$test exp/tri1/decode_nosp_{tgsmall,tgmed}_$test

  # This script rescores lattices with the ConstArpaLm format language model.
  # Does language model rescoring of lattices (remove old LM, add new LM)
  # Usage: ... [options] <old-lang-dir> <new-lang-dir>
  #                      <data-dir> <input-decode-dir> <output-decode-dir>
  steps/lmrescore_const_arpa.sh --cmd "$decode_cmd" data/lang_nosp_test_{tgsmall,tglarge} \
    data/$test exp/tri1/decode_nosp_{tgsmall,tglarge}_$test
done
```

This populates `exp/tri1/decode_nosp_tgsmall_dev_clean_2`, `exp/tri1/decode_nosp_tgmed_dev_clean_2` and `exp/tri1/decode_nosp_tglarge_dev_clean_2` with lattices (`lat.*.gz`); the decoding logs are in the `log` subdirectory of each, e.g. `exp/tri1/decode_nosp_tgsmall_dev_clean_2/log`.

They contain WER results (`wer_*_*.*`) and lattices (`lat.*.gz`). In `scoring/` we have predicted transcripts (`*.*.*.tra`). In `scoring/log/` we have best paths description to obtain the transcriptions (`best_path.*.log`).
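
The rescoring calls pass `data/lang_nosp_test_{tgsmall,tgmed}` as a single token, yet the usage messages ask for separate `<old-lang-dir>` and `<new-lang-dir>` arguments. This works because bash brace expansion turns one token into two arguments. The small demonstration below runs through `bash -c` explicitly, since brace expansion is a bash feature and is not guaranteed in every `sh`.

```bash
# One token with braces expands to two separate arguments.
lang_dirs=$(bash -c 'printf "%s\n" data/lang_nosp_test_{tgsmall,tgmed}')

# Braces are expanded before variables, so a trailing $test still works.
decode_dirs=$(bash -c 'test=dev_clean_2; printf "%s\n" exp/tri1/decode_nosp_{tgsmall,tgmed}_$test')

printf '%s\n' "$lang_dirs" "$decode_dirs"
```

The first pair gives the old and new language directories, the second pair the input and output decode directories.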

### 6.3 Align audio

```bash
# Realignment
steps/align_si.sh --nj 5 --cmd "$train_cmd" data/train_clean_5 data/lang_nosp \
  exp/tri1 exp/tri1_ali_train_clean_5
```

This populates `exp/tri1_ali_train_clean_5` with alignments (`ali.*.gz`) and alignment logs in `exp/tri1_ali_train_clean_5/log`.




## 7. Re-align audio with the acoustic models & re-train triphone models

Repeat steps 5 and 6 with additional triphone training algorithms to obtain more refined models. These typically include delta+delta-delta training, LDA-MLLT, and SAT. The alignment algorithms include speaker-independent alignment and fMLLR.

**Training Algorithms**: Delta+delta-delta training, LDA-MLLT and SAT.

**Alignment Algorithms**: Speaker-independent alignment and fMLLR.

### 7.1 LDA+MLLT system. (tri2b)

LDA+MLLT refers to the way we transform the features after computing the MFCCs: we splice across several frames, reduce the dimension (to 40 by default) using Linear Discriminant Analysis (LDA), and then estimate, over multiple iterations, a diagonalizing transform known as MLLT or STC.

See http://kaldi-asr.org/doc/transform.html for more explanation.
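
As a rough illustration of the dimensions involved, the calculation below assumes 13-dimensional MFCCs (an assumption; the dimension is not stated in this section) and the splicing context of 3 frames on each side used by the `--splice-opts` of the training command in this section.

```bash
mfcc_dim=13          # assumption: 13 cepstral coefficients per frame
left_context=3       # from --splice-opts "--left-context=3 --right-context=3"
right_context=3
lda_dim=40           # default output dimension mentioned above

frames=$((left_context + 1 + right_context))   # current frame plus its context
spliced_dim=$((mfcc_dim * frames))

echo "spliced: $frames frames x $mfcc_dim = $spliced_dim dims, reduced by LDA to $lda_dim"
```

Under these assumptions, LDA projects a 91-dimensional spliced vector down to 40 dimensions before the MLLT transform is estimated.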

#### 7.1.1 Train LDA+MLLT system.

```bash
  # "Usage: steps/train_lda_mllt.sh [options] <#leaves> <#gauss> <data> <lang> <alignments> <dir>"
  # " e.g.: steps/train_lda_mllt.sh 2500 15000 data/train_si84 data/lang exp/tri1_ali_si84 exp/tri2b"
  steps/train_lda_mllt.sh --cmd "$train_cmd" \
    --splice-opts "--left-context=3 --right-context=3" 2500 15000 \
    data/train_clean_5 data/lang_nosp exp/tri1_ali_train_clean_5 exp/tri2b
```

Results are saved in `exp/tri2b`. The script accumulates LDA and tree statistics, generates the questions, builds the tree, converts the alignments from the tri1 model to the new tree, and compiles the training graphs from the transcripts.

```
steps/diagnostic/analyze_alignments.sh: see stats in exp/tri2b/log/analyze_alignments.log
8 warnings in exp/tri2b/log/update.*.log
25 warnings in exp/tri2b/log/align.*.*.log
1 warnings in exp/tri2b/log/build_tree.log
51 warnings in exp/tri2b/log/init_model.log
exp/tri2b: nj=5 align prob=-47.77 over 5.30h [retry=0.1%, fail=0.0%] 
states=2024 gauss=15022 tree-impr=4.81 lda-sum=21.79 mllt:impr,logdet=1.13,1.69
steps/train_lda_mllt.sh: Done training system with LDA+MLLT features in exp/tri2b
```

#### 7.1.2 Decode using the LDA+MLLT model

```bash
  # decode using the LDA+MLLT model
  (
    utils/mkgraph.sh data/lang_nosp_test_tgsmall \
      exp/tri2b exp/tri2b/graph_nosp_tgsmall
      
    for test in dev_clean_2; do
      steps/decode.sh --nj 10 --cmd "$decode_cmd" exp/tri2b/graph_nosp_tgsmall \
        data/$test exp/tri2b/decode_nosp_tgsmall_$test
        
      steps/lmrescore.sh --cmd "$decode_cmd" data/lang_nosp_test_{tgsmall,tgmed} \
        data/$test exp/tri2b/decode_nosp_{tgsmall,tgmed}_$test
        
      steps/lmrescore_const_arpa.sh \
        --cmd "$decode_cmd" data/lang_nosp_test_{tgsmall,tglarge} \
        data/$test exp/tri2b/decode_nosp_{tgsmall,tglarge}_$test
    done
  )&
```

#### 7.1.3 Align audio

```bash
  # Realign utts using the tri2b model
  steps/align_si.sh  --nj 5 --cmd "$train_cmd" --use-graphs true \
    data/train_clean_5 data/lang_nosp exp/tri2b exp/tri2b_ali_train_clean_5
```

### 7.2 LDA+MLLT+SAT system. (tri3b)

This does Speaker Adapted Training (SAT), i.e. train on fMLLR-adapted features.  It can be done on top of either LDA+MLLT, or delta and delta-delta features.  If there are no transforms supplied in the alignment directory, it will estimate transforms itself before building the tree (and in any case, it estimates transforms a number of times during training).

#### 7.2.1 Train LDA+MLLT+SAT system.

```bash
  # "Usage: steps/train_sat.sh <#leaves> <#gauss> <data> <lang> <ali-dir> <exp-dir>"
  # " e.g.: steps/train_sat.sh 2500 15000 data/train_si84 data/lang exp/tri2b_ali_si84 exp/tri3b"
  steps/train_sat.sh --cmd "$train_cmd" 2500 15000 \
    data/train_clean_5 data/lang_nosp exp/tri2b_ali_train_clean_5 exp/tri3b
```

Results are saved in `exp/tri3b`. The script obtains initial fMLLR transforms, accumulates tree statistics, generates the questions, builds the tree, initializes the model, converts the alignments from the tri2b model to the new tree, and compiles the training graphs from the transcripts.

```
steps/diagnostic/analyze_alignments.sh: see stats in exp/tri3b/log/analyze_alignments.log
3 warnings in exp/tri3b/log/update.*.log
18 warnings in exp/tri3b/log/init_model.log
24 warnings in exp/tri3b/log/align.*.*.log
1 warnings in exp/tri3b/log/build_tree.log
steps/train_sat.sh: Likelihood evolution:
-49.7344 -49.1791 -48.9736 -48.8073 -48.2381 -47.7561 -47.4154 -47.1582 -46.968 
-46.4597 -46.2195 -45.9896 -45.8512 -45.7332 -45.6174 -45.5092 -45.4093 -45.3156 
-45.227 -45.0607 -44.9338 -44.8556 -44.784 -44.7152 -44.6476 -44.5852 -44.5248 
-44.4642 -44.4051 -44.305 -44.2294 -44.2026 -44.1855 -44.1726
exp/tri3b: nj=5 align prob=-47.17 over 5.30h [retry=0.0%, fail=0.0%] 
states=2032 gauss=15017 fmllr-impr=2.57 over 4.25h tree-impr=7.05
steps/train_sat.sh: done training SAT system in exp/tri3b
```
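
The "Likelihood evolution" line is easier to read once summarized. The sketch below takes the values printed above, counts them, checks that the likelihood never decreases from one iteration to the next, and reports the total gain. It is only a reading aid for this log, not part of the recipe.

```bash
summary=$(awk '
  {
    for (i = 1; i <= NF; i++) {
      if (n == 0) first = $i            # likelihood at the first iteration
      else if ($i < prev) drops++       # count iterations where it got worse
      prev = $i
      n++
    }
  }
  END { printf "n=%d gain=%.4f drops=%d\n", n, prev - first, drops }
' <<'EOF'
-49.7344 -49.1791 -48.9736 -48.8073 -48.2381 -47.7561 -47.4154 -47.1582 -46.968
-46.4597 -46.2195 -45.9896 -45.8512 -45.7332 -45.6174 -45.5092 -45.4093 -45.3156
-45.227 -45.0607 -44.9338 -44.8556 -44.784 -44.7152 -44.6476 -44.5852 -44.5248
-44.4642 -44.4051 -44.305 -44.2294 -44.2026 -44.1855 -44.1726
EOF
)
echo "$summary"
```

The per-frame log-likelihood improves at every reported iteration, with most of the gain in the early iterations.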

#### 7.2.2 Decode using the LDA+MLLT+SAT model 

Decoding script that does fMLLR. This can be on top of delta+delta-delta, or LDA+MLLT features.

There are 3 models involved potentially in this script, and for a standard, speaker-independent system they will all be the same. 

* The "alignment model" is for the 1st-pass decoding and to get the Gaussian-level alignments for the "adaptation model" the first time we do fMLLR.  

* The "adaptation model" is used to estimate fMLLR transforms and to generate state-level lattices.  

* The lattices are then rescored with the "final model".

The following table explains where we get these 3 models from. ($srcdir is one level up from the decoding directory.)

| Model              | Default source                                                                | Option to override  |
|--------------------|-------------------------------------------------------------------------------|---------------------|
| "alignment model"  | `$srcdir/final.alimdl` (or `$srcdir/final.mdl` if `final.alimdl` is absent)   | `--alignment-model` |
| "adaptation model" | `$srcdir/final.mdl`                                                           | `--adapt-model`     |
| "final model"      | `$srcdir/final.mdl`                                                           | `--final-model`     |
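
The fallback rule for the alignment model in the table can be sketched as a small shell function. This is only an illustration of the rule as described here, not the code of `steps/decode_fmllr.sh`; the function name `pick_alignment_model` and the temporary directory are made up for the demonstration.

```bash
# Prefer final.alimdl; fall back to final.mdl when it is absent.
pick_alignment_model() {
  srcdir=$1
  if [ -f "$srcdir/final.alimdl" ]; then
    echo "$srcdir/final.alimdl"
  else
    echo "$srcdir/final.mdl"
  fi
}

# Demonstration on a throwaway directory (not a real experiment directory).
demo=$(mktemp -d)
touch "$demo/final.mdl"
without_alimdl=$(pick_alignment_model "$demo")
touch "$demo/final.alimdl"
with_alimdl=$(pick_alignment_model "$demo")
rm -rf "$demo"

echo "without alimdl: $without_alimdl"
echo "with alimdl:    $with_alimdl"
```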

```bash
  # decode using the tri3b model
  (
    utils/mkgraph.sh data/lang_nosp_test_tgsmall \
      exp/tri3b exp/tri3b/graph_nosp_tgsmall
      
    for test in dev_clean_2; do
      # "Usage: steps/decode_fmllr.sh [options] <graph-dir> <data-dir> <decode-dir>"
      # " e.g.: steps/decode_fmllr.sh exp/tri2b/graph_tgpr data/test_dev93 exp/tri2b/decode_dev93_tgpr"
      steps/decode_fmllr.sh --nj 10 --cmd "$decode_cmd" \
        exp/tri3b/graph_nosp_tgsmall data/$test \
        exp/tri3b/decode_nosp_tgsmall_$test
        
      steps/lmrescore.sh --cmd "$decode_cmd" data/lang_nosp_test_{tgsmall,tgmed} \
        data/$test exp/tri3b/decode_nosp_{tgsmall,tgmed}_$test
        
      steps/lmrescore_const_arpa.sh \
        --cmd "$decode_cmd" data/lang_nosp_test_{tgsmall,tglarge} \
        data/$test exp/tri3b/decode_nosp_{tgsmall,tglarge}_$test
    done
  )&
```

#### 7.2.3 Compute pronunciation and silence probabilities + Recreate Lang Directory

Now we compute the pronunciation and silence probabilities from training data, and re-create the lang directory. Pronunciation probabilities are saved in `exp/tri3b`.

`steps/get_prons.sh`: This script writes files `prons.*.gz` in the directory provided, which must contain alignments (`ali.*.gz`) or lattices (`lat.*.gz`). These files are in the format output by `nbest-to-prons` (see its usage message): each line can be interpreted as

```
<utterance-id> <begin-frame> <num-frames> <word> <phone1> <phone2> ... <phoneN>
1088-134315-0000 0 19 0 1  
1088-134315-0000 19 23 8537 35 340   
1088-134315-0000 42 12 198712 335 320   
1088-134315-0000 54 42 96204 231 248
```

and you could convert these into text form using a command like:

```bash
# The prons files hold integer ids, so map them back to symbols with int2sym.pl
gunzip -c prons.*.gz | utils/int2sym.pl -f 4 words.txt | utils/int2sym.pl -f 5- phones.txt
```

`pron_counts{_nowb}.txt`: Number of times each pronunciation of a word occurs.

```
10651 <eps> SIL 
2588 THE DH_B AH0_E 
1277 AND AE1_B N_I D_E 
1147 A AH0_S 
1124 OF AH0_B V_E 
```
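
The phones in this file carry word-position markers (`_B`, `_I`, `_E`, `_S`). Assuming the `_nowb` variant simply drops those markers, the effect can be sketched with a few lines of awk on the sample above. The function name `strip_word_boundaries` is hypothetical, and unlike a real implementation this sketch does not merge the counts of entries that become identical after stripping.

```bash
# Remove a trailing _B, _I, _E or _S from every phone (fields 3 and up).
strip_word_boundaries() {
  awk '{
    for (i = 3; i <= NF; i++) sub(/_[BIES]$/, "", $i)
    print
  }'
}

nowb=$(strip_word_boundaries <<'EOF'
10651 <eps> SIL
2588 THE DH_B AH0_E
1277 AND AE1_B N_I D_E
1147 A AH0_S
1124 OF AH0_B V_E
EOF
)
printf '%s\n' "$nowb"
```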

The main steps of this script are:

1. Count the silences before and after words (actually prons). Create a text-like file in which each word is replaced by a "word pron" pair: the format of `prons.*.gz` is changed from pron-per-line to utterance-per-line (with tab-separated "word pron" pairs), and `<s>` and `</s>` are added at the beginning and end of each sentence. The `_B`, `_I`, `_S`, `_E` markers are removed from phones.

`pron_perutt_nowb.txt`

```
1088-134315-0000	<s>	<eps> SIL	AS AE1 Z	YOU Y UW1	KNOW N OW1	<eps> SIL	
AND AE1 N D	AS EH1 Z	I AY1	HAVE HH AE1 V	GIVEN G IH1 V IH0 N	YOU Y UW1	
PROOF P R UW1 F	<eps> SIL	I AY1	HAVE HH AE1 V	THE DH AH0	GREATEST G R EY1 T AH0 S T	
ADMIRATION AE2 D M ER0 EY1 SH AH0 N	IN IH0 N	THE DH AH0	WORLD W ER1 L D	FOR F ER0	
ONE W AH1 N	WHOSE HH UW1 Z	WORK W ER1 K	FOR F ER0	HUMANITY Y UW0 M AE1 N IH0 T IY0	
HAS HH AE1 Z	WON W AH1 N	SUCH S AH1 CH	UNIVERSAL Y UW2 N AH0 V ER1 S AH0 L	
RECOGNITION R EH2 K IH0 G N IH1 SH AH0 N	<eps> SIL	I AY1	HOPE HH OW1 P	THAT DH AH0 T	
WE W IY1	SHALL SH AE1 L	BOTH B OW1 TH	FORGET F ER0 G EH1 T	THIS DH IH0 S	
UNHAPPY AH0 N HH AE1 P IY0	MORNING M AO1 R N IH0 NG	<eps> SIL	AND AE1 N D	THAT DH AH0 T	
YOU Y UW1	WILL W AH0 L	GIVE G IH1 V	ME M IY1	AN AH0 N	
OPPORTUNITY AA2 P ER0 T UW1 N AH0 T IY0	OF AH0 V	RENDERING R EH1 N D ER0 IH0 NG	
TO T IH0	YOU Y UW1	<eps> SIL	IN IH0 N	PERSON P ER1 S AH0 N	<eps> SIL	</s>
```

2. Collect bigram counts for words. To be more specific, we are actually collecting counts for "v ? w", where "?" represents silence or non-silence.

`pron_bigram_counts_nowb.txt`

```
1	HAD HH AE1 D	SLIPPED S L IH1 P T
1	AND AE1 N D	LOWER L OW1 ER0
1	BEEN B IH1 N	THE DH IY0
14	THEN DH EH1 N	HE HH IY1
```

3. Collect bigram counts for silence and words. The count file has 4 count fields, followed by the "word pron" pair. For ordinary words, the sum of the first two counts equals the sum of the last two (the total count of this word); `<s>` and `</s>` only have counts on one side. All fields are separated by spaces:

> `<sil-before-count> <nonsil-before-count> <sil-after-count> <nonsil-after-count> <word> <phone1> <phone2> ...`

`sil_counts_nowb.txt`

```
1513 6 0 0 </s>
32 117 78 71 <UNK> SPN
0 0 1518 1 <s>
114 1033 18 1129 A AH0
24 71 1 94 A EY1
1 2 0 3 ABANDON AH0 B AE1 N D AH0 N
0 3 1 2 ABANDONED AH0 B AE1 N D AH0 N D
6 82 6 82 ABOUT AH0 B AW1 T
3 17 1 19 ACROSS AH0 K R AO1 S
```
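
The relation between the four count fields can be checked directly on the sample above. The awk sketch below skips `<s>` and `</s>`, which only have counts on one side, and compares the "before" and "after" totals for every other entry.

```bash
check=$(awk '
  # Sentence-boundary symbols only have one side, so skip them.
  $5 == "<s>" || $5 == "</s>" { next }
  {
    before = $1 + $2        # silence + non-silence before the word
    after  = $3 + $4        # silence + non-silence after the word
    verdict = (before == after) ? "ok" : "MISMATCH"
    printf "%s %d %d %s\n", $5, before, after, verdict
  }
' <<'EOF'
1513 6 0 0 </s>
32 117 78 71 <UNK> SPN
0 0 1518 1 <s>
114 1033 18 1129 A AH0
24 71 1 94 A EY1
1 2 0 3 ABANDON AH0 B AE1 N D AH0 N
0 3 1 2 ABANDONED AH0 B AE1 N D AH0 N D
6 82 6 82 ABOUT AH0 B AW1 T
3 17 1 19 ACROSS AH0 K R AO1 S
EOF
)
printf '%s\n' "$check"
```

For `A AH0` both totals come to 1147, which is also the count listed for `A AH0_S` in the `pron_counts` sample earlier in this section.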

`dict_dir_add_pronprobs.sh`: This script takes pronunciation counts, e.g. generated by aligning your training data and getting the prons using `steps/get_prons.sh`, and creates a modified dictionary directory with pronunciation probabilities. If the `[input-sil-counts]` parameter is provided, it also includes silprobs in the generated lexicon.

The method this script implements is described in the paper
"Pronunciation and Silence Probability Modeling for ASR" by Guoguo Chen et al.; see http://www.danielpovey.com/files/2015_interspeech_silprob.pdf

Create `data/local/dict/lexiconp_silprob.txt` and `data/local/dict/silprob.txt` if silence counts file exists. 

`lexiconp_silprob.txt`:

```
word pron-prob P(s_r | w) F(s_l | w) F(n_l | w) pron
!SIL 1 0.20 1.00 1.00 SIL
<SPOKEN_NOISE> 1 0.20 1.00 1.00 SPN
<UNK> 1 0.52 1.96 0.88 SPN
A 0.0836236 0.01 1.71 0.87 EY1
A 1 0.02 0.93 1.01 AH0
A''S 1 0.20 1.00 1.00 EY1 Z
A'BODY 1 0.20 1.00 1.00 EY1 B AA2 D IY0
A'COURT 1 0.20 1.00 1.00 EY1 K AO2 R T
FROWN 1 0.35 0.88 1.07 F R AW1 N
FROWN'D 1 0.20 1.00 1.00 F R AW1 N D
FROWNED 1 0.60 0.92 1.04 F R AW1 N D
FROWNIN 1 0.20 1.00 1.00 F R AW1 N IH0 N
FROWNING 1 0.13 1.20 0.80 F R AW1 N IH0 NG
```

where:  

* P(s_r | w) is the probability of silence to the right of the word
* F(s_l | w) is a factor which is greater than one if silence to the left of the word is more than averagely probable.
* F(n_l | w) is a factor which is greater than one if nonsilence to the left of the word is more than averagely probable.

`silprob.txt`:

```
<s> 0.99
</s>_s 2.50504315618901
</s>_n 0.00871250898477495
overall 0.20
```

`lexiconp.txt`: Probability of prons

```
Some low-probability prons include:
# sort -k2,2 -n data/local/dict/lexiconp.txt  | head -n 8
WAS 0.00887574 W AO1 Z
US 0.0144927 Y UW1 EH1 S
WAS 0.0221894 W AA1 Z
WHILE 0.025641 HH W AY1 L
LAST 0.0322581 L AO1 S T
GRAHAM 0.0333334 G R EY1 AH0 M
EVERY 0.0350877 EH1 V ER0 IY0
AGAIN 0.0377358 AH0 G EY1 N
```
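
The recipe calls `utils/dict_dir_add_pronprobs.sh` with `--max-normalize true`, which scales the probabilities so that the most frequent pronunciation of each word gets probability 1. The sketch below shows plain max-normalization on the two pronunciations of `A`, using the totals from the `sil_counts_nowb.txt` sample (1147 for `AH0`, 95 for `EY1`). It is a simplified illustration, not the script's actual computation.

```bash
probs=$(awk '
  {
    pron = $3
    for (i = 4; i <= NF; i++) pron = pron " " $i
    word[NR] = $2; count[NR] = $1; prons[NR] = pron
    if ($1 > best[$2]) best[$2] = $1 + 0     # largest count seen for this word
  }
  END {
    for (r = 1; r <= NR; r++)
      printf "%s %.4f %s\n", word[r], count[r] / best[word[r]], prons[r]
  }
' <<'EOF'
1147 A AH0
95 A EY1
EOF
)
printf '%s\n' "$probs"
```

The lexicon shown earlier lists 0.0836 for `A EY1`, slightly above the 0.0828 obtained here, which suggests that the real script applies some smoothing to the counts that this sketch ignores.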

```bash  
  # usage: $0 <data-dir> <lang-dir> <dir>
  # e.g.:  $0 data/train data/lang exp/tri3
  # or:  $0 data/train data/lang exp/tri3/decode_dev
  steps/get_prons.sh --cmd "$train_cmd" \
    data/train_clean_5 data/lang_nosp exp/tri3b
    
  # Usage: $0 [options] <input-dict-dir> <input-pron-counts> \
  #           [input-sil-counts] [input-bigram-counts] <output-dict-dir>
  #  e.g.: $0 data/local/dict \
  #           exp/tri3/pron_counts_nowb.txt exp/tri3/sil_counts_nowb.txt \
  #           exp/tri3/pron_bigram_counts_nowb.txt data/local/dict_prons
  #  e.g.: $0 data/local/dict \
  #           exp/tri3/pron_counts_nowb.txt data/local/dict_prons
  utils/dict_dir_add_pronprobs.sh --max-normalize true \
    data/local/dict_nosp \
    exp/tri3b/pron_counts_nowb.txt exp/tri3b/sil_counts_nowb.txt \
    exp/tri3b/pron_bigram_counts_nowb.txt data/local/dict

  utils/prepare_lang.sh data/local/dict \
    "<UNK>" data/local/lang_tmp data/lang

  local/format_lms.sh --src-dir data/lang data/local/lm

  utils/build_const_arpa_lm.sh \
    data/local/lm/lm_tglarge.arpa.gz data/lang data/lang_test_tglarge
```

#### 7.2.4 Align audio

Computes training alignments; assumes features are (LDA+MLLT or delta+delta-delta) + fMLLR (probably with SAT models). It first computes an alignment with the final.alimdl (or the final.mdl if final.alimdl is not present), then does 2 iterations of fMLLR estimation.

If you supply the --use-graphs option, it will use the training graphs from the source directory (where the model is). In this case the number of jobs must match the source directory.

```bash
  # "usage: steps/align_fmllr.sh <data-dir> <lang-dir> <src-dir> <align-dir>"
  # "e.g.:  steps/align_fmllr.sh data/train data/lang exp/tri1 exp/tri1_ali"
  steps/align_fmllr.sh --nj 5 --cmd "$train_cmd" \
    data/train_clean_5 data/lang exp/tri3b exp/tri3b_ali_train_clean_5
```
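
When `--use-graphs` is supplied, the number of jobs must match the source directory. A simple guard, assuming the source directory records its job count in a `num_jobs` file (as alignment directories do), could look like the sketch below. The function name `check_num_jobs` and the temporary directory are hypothetical.

```bash
# Fail if the requested number of jobs differs from the one stored in <src-dir>/num_jobs.
check_num_jobs() {
  srcdir=$1
  nj=$2
  src_nj=$(cat "$srcdir/num_jobs") || return 1
  if [ "$src_nj" -ne "$nj" ]; then
    echo "job mismatch: $srcdir was built with $src_nj jobs, requested $nj" >&2
    return 1
  fi
}

# Demonstration on a throwaway directory (not a real experiment directory).
demo=$(mktemp -d)
echo 5 > "$demo/num_jobs"
if check_num_jobs "$demo" 5; then same_nj=accepted; else same_nj=rejected; fi
if check_num_jobs "$demo" 10 2>/dev/null; then other_nj=accepted; else other_nj=rejected; fi
rm -rf "$demo"

echo "nj=5: $same_nj, nj=10: $other_nj"
```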

#### 7.2.5 Decode using the LDA+MLLT+SAT model with silence and pronunciation probabilities

```bash
  # Test the tri3b system with the silprobs and pron-probs.

  # decode using the tri3b model
  utils/mkgraph.sh data/lang_test_tgsmall \
                   exp/tri3b exp/tri3b/graph_tgsmall
                   
  for test in dev_clean_2; do
    steps/decode_fmllr.sh --nj 10 --cmd "$decode_cmd" \
                          exp/tri3b/graph_tgsmall data/$test \
                          exp/tri3b/decode_tgsmall_$test
                          
    steps/lmrescore.sh --cmd "$decode_cmd" data/lang_test_{tgsmall,tgmed} \
                       data/$test exp/tri3b/decode_{tgsmall,tgmed}_$test
                       
    steps/lmrescore_const_arpa.sh \
      --cmd "$decode_cmd" data/lang_test_{tgsmall,tglarge} \
      data/$test exp/tri3b/decode_{tgsmall,tglarge}_$test
  done
```

## 8. DNN Acoustic Model

`tdnn1h` is like `tdnn1g`, but it is a re-tuned model based on ResNet-style TDNN-F layers with bypass connections. Below, `1h2` and `1h3` are just reruns of `1h` with different `--affix` options, to give some idea of the run-to-run variation.


```bash
local/chain/compare_wer.sh --online exp/chain/tdnn1g_sp exp/chain/tdnn1h_sp \
                           exp/chain/tdnn1h2_sp exp/chain/tdnn1h3_sp
```

Results:

| System                | tdnn1g_sp | tdnn1h_sp | tdnn1h2_sp | tdnn1h3_sp| 
|-|-|-|-|-|  
| WER dev_clean_2 (tgsmall)      | 13.50     | 12.09     | 12.23     | 12.19| 
| [online:]         | 13.52     | 12.11     | 12.25     | 12.14| 
| WER dev_clean_2 (tglarge)       | 9.79     |  8.59     |  8.64     |  8.73| 
| [online:]          | 9.79     |  8.76     |  8.65     |  8.78| 
| Final train prob        | -0.0460   | -0.0493   | -0.0490   | -0.0493| 
| Final valid prob        | -0.0892   | -0.0805   | -0.0803   | -0.0813| 
| Final train prob (xent)   | -1.1739   | -1.1730   | -1.1742   | -1.1749| 
| Final valid prob (xent)   | -1.4487   | -1.3872   | -1.3857   | -1.3913| 
| Num-params                 | 6234672  |  5207856  |  5207856  |  5207856| 
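
To put the table in perspective, the relative WER reduction from `tdnn1g_sp` to `tdnn1h_sp` can be compared with the spread between the three `1h` runs. The calculation below uses only the offline WER values from the table; the helper name `rel_reduction` is made up for this illustration.

```bash
# Relative reduction in percent: 100 * (baseline - new) / baseline
rel_reduction() {
  awk -v base="$1" -v new="$2" 'BEGIN { printf "%.1f\n", 100 * (base - new) / base }'
}

tgsmall_gain=$(rel_reduction 13.50 12.09)   # tdnn1g_sp vs tdnn1h_sp, tgsmall
tglarge_gain=$(rel_reduction 9.79 8.59)     # tdnn1g_sp vs tdnn1h_sp, tglarge

# Spread between the best and worst of the three 1h runs (tgsmall).
run_spread=$(awk 'BEGIN { printf "%.2f\n", 12.23 - 12.09 }')

echo "relative WER reduction: tgsmall ${tgsmall_gain}%, tglarge ${tglarge_gain}%"
echo "run-to-run spread (tgsmall): ${run_spread} absolute"
```

The absolute gain from `1g` to `1h` on tgsmall (1.41 points) is about ten times the spread between reruns, so the improvement is unlikely to be run-to-run noise.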

```
exp/chain/tdnn1g_sp: num-iters=25 nj=2..5 num-params=6.2M dim=40+100->2328 
                     combine=-0.056->-0.055 (over 3) 
                     xent:train/valid[15,24,final]=(-1.50,-1.23,-1.17/-1.73,-1.52,-1.45) 
                     logprob:train/valid[15,24,final]=(-0.063,-0.051,-0.046/-0.101,-0.094,-0.089)
exp/chain/tdnn1h_sp: num-iters=34 nj=2..5 num-params=5.2M dim=40+100->2328 
                     combine=-0.049->-0.046 (over 4) 
                     xent:train/valid[21,33,final]=(-1.50,-1.22,-1.17/-1.66,-1.44,-1.39) 
                     logprob:train/valid[21,33,final]=(-0.068,-0.055,-0.049/-0.097,-0.088,-0.080)
exp/chain/tdnn1h2_sp: num-iters=34 nj=2..5 num-params=5.2M dim=40+100->2328 
                      combine=-0.049->-0.046 (over 4) 
                      xent:train/valid[21,33,final]=(-1.50,-1.22,-1.17/-1.67,-1.43,-1.39) 
                      logprob:train/valid[21,33,final]=(-0.068,-0.055,-0.049/-0.096,-0.087,-0.080)
exp/chain/tdnn1h3_sp: num-iters=34 nj=2..5 num-params=5.2M dim=40+100->2328 
                      combine=-0.050->-0.046 (over 4) 
                      xent:train/valid[21,33,final]=(-1.51,-1.23,-1.17/-1.67,-1.45,-1.39) 
                      logprob:train/valid[21,33,final]=(-0.068,-0.055,-0.049/-0.097,-0.089,-0.081)
```

Obtained from:

```bash
local/chain/run_tdnn.sh --stage 0
```

### 8.0 Train a GMM system and generate alignments

This step was done in section 7. Before you can start training your DNN, you will need the following directories, all generated as part of normal GMM-HMM training in Kaldi:

1. a training data dir (as generated by a `prepare_data.sh` script in an `s5/local` directory)
2. a language dir (which has information on your phones, decision tree, etc, probably generated by `prepare_lang.sh`)
3. an alignment dir (generated by something like `align_si.sh`).
4. a feature dir (for example MFCCs; made by the `make_mfcc.sh` script)

Here are the files needed:

```bash
## DEPENDENCIES FROM GMM-HMM SYSTEM ##

# DATA DIR FILES
$data_dir/feats.scp
$data_dir/splitJOBN                # where JOBN is the total number of JOBs (eg. split4)
                   /JOB            # one dir for each JOB, up to JOBN
                        /feats.scp


# LANGUAGE DIR FILES
$lang_dir/topo


# ALIGN DIR FILES
$ali_dir/ali.JOB.gz                     # for as many JOBs as you ran
$ali_dir/final.mdl
$ali_dir/tree
$ali_dir/num_jobs


# MFCC DIR FILES
$mfcc_dir/raw_mfcc_train.JOB.{ark,scp}  # for as many JOBs as you ran
```
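
Before launching DNN training, it can help to verify that these files are in place. The function below is a minimal sketch that checks a subset of the list above (it does not check the per-job `ali.JOB.gz` or split directories). The name `check_gmm_deps` and the demonstration directories are hypothetical.

```bash
# Report every missing dependency; succeed only if none is missing.
check_gmm_deps() {
  data_dir=$1
  lang_dir=$2
  ali_dir=$3
  missing=0
  for f in "$data_dir/feats.scp" "$lang_dir/topo" \
           "$ali_dir/final.mdl" "$ali_dir/tree" "$ali_dir/num_jobs"; do
    if [ ! -f "$f" ]; then
      echo "missing: $f" >&2
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}

# Demonstration on throwaway directories (not a real Kaldi setup).
demo=$(mktemp -d)
mkdir -p "$demo/data" "$demo/lang" "$demo/ali"
touch "$demo/data/feats.scp" "$demo/lang/topo" "$demo/ali/final.mdl" "$demo/ali/num_jobs"
if check_gmm_deps "$demo/data" "$demo/lang" "$demo/ali" 2>/dev/null; then before=ok; else before=incomplete; fi
touch "$demo/ali/tree"
if check_gmm_deps "$demo/data" "$demo/lang" "$demo/ali"; then after=ok; else after=incomplete; fi
rm -rf "$demo"

echo "before adding tree: $before, after: $after"
```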

### 8.1 Generate i-vectors

This script is called from local/nnet3/run_tdnn.sh and local/chain/run_tdnn.sh (and may eventually be called by more scripts).  It contains the common feature preparation and iVector-related parts of the script.  See those scripts for examples of usage.

```bash
local/nnet3/run_ivector_common.sh --stage $stage \
                                  --train-set $train_set \
                                  --gmm $gmm \
                                  --nnet3-affix "$nnet3_affix"
```

* Although the nnet will be trained on high-resolution data, we still have to speed-perturb the normal (low-resolution) data to get the alignments. The `_sp` suffix stands for speed-perturbed.
    * Preparing directory for low-resolution speed-perturbed data (for alignment)
    * Making MFCC features for low-resolution speed-perturbed data
* Aligning with the perturbed low-resolution data
    * Speed-perturbed data is in `data/train_clean_5_sp`, combined from `data/train_clean_5_sp0.9` and `data/train_clean_5_sp1.1`.
    * Old files are kept in `data/train_clean_5_sp/.backup`
    * The combined files have additional IDs (for utterances or speakers) with prefixes `sp0.9-*` or `sp1.1-*`.
    * Prefixes also appear in newly computed files such as the MFCC or CMVN features.
* Create high-resolution MFCC features (with 40 cepstra instead of 13) (hires = high resolution)
    * Do volume-perturbation on the training data prior to extracting hires features; this helps make trained nnets more invariant to test data volume.
    * New folders are created: `train_clean_5_sp_hires` and `dev_clean_2_hires`
    * Same file as in `train_clean_5`, for example, are computed in these folders. 
* Computing a subset of data to train the diagonal UBM (Universal Background Model)
    * The subset is roughly 25% of all utterances.
    * Saved in `exp/nnet3/diag_ubm/train_clean_5_sp_hires_subset`
* Computing a PCA transform from the hi-res data.
    * Saved in `exp/nnet3/pca_transform` 

* Training the diagonal UBM (Use 512 Gaussians in the UBM).
    * Saved in `exp/nnet3/diag_ubm`
    ```
    steps/online/nnet2/train_diag_ubm.sh: initializing model from E-M in memory,
    steps/online/nnet2/train_diag_ubm.sh: starting from 256 Gaussians, reaching 512;
    steps/online/nnet2/train_diag_ubm.sh: for 20 iterations, using at most 700000 frames of data
    Getting Gaussian-selection info
    steps/online/nnet2/train_diag_ubm.sh: will train for 4 iterations, in parallel over
    steps/online/nnet2/train_diag_ubm.sh: 30 machines, parallelized with 'run.pl'
    steps/online/nnet2/train_diag_ubm.sh: Training pass 0
    ...
    ```

* Train the iVector extractor.  Use all of the speed-perturbed data since iVector extractors can be sensitive to the amount of data. The script defaults to an iVector dimension of 100.
    * Extractor is saved in `exp/nnet3/extractor`
    ```
    steps/online/nnet2/train_ivector_extractor.sh: doing Gaussian selection and posterior computation
    Accumulating stats (pass 0)
    Summing accs (pass 0)
    Updating model (pass 0)
    ...
    utils/data/modify_speaker_info.sh: copied data from data/train_clean_5_sp_hires to
    exp/nnet3/ivectors_train_clean_5_sp_hires/train_clean_5_sp_hires_max2, 
    number of speakers changed from 87 to 2304
    utils/validate_data_dir.sh: Successfully validated data-directory
    exp/nnet3/ivectors_train_clean_5_sp_hires/train_clean_5_sp_hires_max2
    ```
 

* We extract iVectors on the speed-perturbed training data after combining short segments; this is the data the system is trained on. With `--utts-per-spk-max 2`, the script pairs the utterances into twos and treats each pair as one speaker, which gives more diversity in iVectors. Note that these are extracted 'online'.
    * Note, we don't encode the 'max2' in the name of the ivectordir even though that's the data we extract the ivectors from, as it's still going to be valid for the non-'max2' data, the utterance list is the same.
    * Having a larger number of speakers is helpful for generalization, and to handle per-utterance decoding well (iVector starts at zero).
    * i-Vectors are saved in `exp/nnet3/ivectors_train_clean_5_sp_hires` and `exp/nnet3/ivectors_dev_clean_2_hires`
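
The speaker count jump in the log above (87 to 2304) comes from splitting every speaker into pseudo-speakers of at most two utterances. A minimal Python sketch of that idea follows. The pseudo-speaker naming (`<speaker>-001`, ...) and the utterance IDs are hypothetical and only for illustration; the real `utils/data/modify_speaker_info.sh` may name things differently.

```python
def split_speakers(spk2utt, utts_per_spk_max=2):
    """Split each speaker into pseudo-speakers holding at most N utterances."""
    new_spk2utt = {}
    for spk, utts in spk2utt.items():
        for start in range(0, len(utts), utts_per_spk_max):
            # Hypothetical naming scheme, not necessarily Kaldi's.
            new_spk = "{}-{:03d}".format(spk, start // utts_per_spk_max + 1)
            new_spk2utt[new_spk] = utts[start:start + utts_per_spk_max]
    return new_spk2utt


# Toy input with made-up utterance IDs.
spk2utt = {"1088": ["1088-a", "1088-b", "1088-c"], "1737": ["1737-a"]}
paired = split_speakers(spk2utt)
print(len(spk2utt), "speakers ->", len(paired), "pseudo-speakers")
```

More pseudo-speakers means more distinct iVector estimates during training, which is the diversity the note above refers to.
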
   

 
### 8.2 Create lang directory with chain-type topology

Create a version of the `lang/` directory in `data/lang_chain` that has one state per phone in the topo file. (Note: it really has two states. The first one is only repeated once; the second one has zero or more repeats.)

Generate a topology file. This allows control of the number of states in the non-silence HMMs, and in the silence HMMs. This is a modified version of 'utils/gen_topo.pl' that generates a different type of topology, one that we believe should be useful in the 'chain' model.  

> Note: right now it doesn't have any real options, and it treats silence and nonsilence the same.  The intention is that you write different versions of this script, or add options, if you experiment with it.

```bash
cp -r data/lang $lang
silphonelist=$(cat $lang/phones/silence.csl) || exit 1;
nonsilphonelist=$(cat $lang/phones/nonsilence.csl) || exit 1;
# Use our special topology... note that later on may have to tune this
# topology.
# Usage: steps/nnet3/chain/gen_topo.py
# <colon-separated-nonsilence-phones> <colon-separated-silence-phones>
# e.g.:  steps/nnet3/chain/gen_topo.py 4:5:6:7:8:9:10 1:2:3
steps/nnet3/chain/gen_topo.py $nonsilphonelist $silphonelist >$lang/topo
```    
**lang/topo**: We make the transition-probs 0.5 so they normalize, to keep the code happy. In fact, we always set the transition probability scale to 0.0 in the 'chain' code, so they are never used. 

* Note: the `<ForwardPdfClass>` will actually happen on the incoming arc because we always build the graph with "reorder=true".
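
To make the "first state once, second state zero or more times" behaviour concrete, here is a small Python sketch of the pdf-class sequence one phone would emit. It assumes the forward pdf-class is labelled 0 and the self-loop pdf-class is labelled 1, and that the forward one comes first because of "reorder=true"; these labels are an assumption, not read from the generated `topo` file.

```python
def pdf_class_sequence(num_frames):
    """Pdf-classes emitted by a single phone that spans `num_frames` frames."""
    if num_frames < 1:
        raise ValueError("a phone must span at least one frame")
    # First frame: forward pdf-class (assumed label 0).
    # Remaining frames: self-loop pdf-class (assumed label 1), zero or more times.
    return [0] + [1] * (num_frames - 1)


for n in (1, 2, 4):
    print(n, pdf_class_sequence(n))
```

A phone can therefore be traversed in a single frame, which fits the reduced output frame rate used by 'chain' models.
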
    
### 8.3 Generate alignments

Get the alignments as lattices (gives the chain training more freedom). Use the same num-jobs as the alignments.

There is also a version of `align_fmllr_lats.sh` that uses "basis fMLLR"; it is suitable for situations where there is very little data per speaker (e.g. when there is a one-to-one mapping between utterances and speakers). It is intended for use where the model was trained with basis-fMLLR (i.e. when you trained the model with `train_sat_basis.sh` where you normally would have trained with `train_sat.sh`), or when it was trained with SAT but you ran `get_fmllr_basis.sh` on the source-model directory.
    
```bash
# usage: steps/align_fmllr_lats.sh <data-dir> <lang-dir> <src-dir> <align-dir>
# e.g.:  steps/align_fmllr_lats.sh data/train data/lang exp/tri1 exp/tri1_lats
steps/align_fmllr_lats.sh --nj 75 --cmd "$train_cmd" ${lores_train_data_dir} \
  data/lang $gmm_dir $lat_dir
rm $lat_dir/fsts.*.gz # save space
```
    
### 8.4 Tree Building    
     
Build a tree using our new topology.  We know we have alignments for the speed-perturbed data (local/nnet3/run_ivector_common.sh made them), so use those.  The num-leaves is always somewhat less than the num-leaves from the GMM baseline.

This script builds a tree for use in the 'chain' systems (although the script itself is pretty generic and doesn't use any 'chain' binaries).  This is just like the first stages of a standard system, like 'train_sat.sh', except it does 'convert-ali' to convert alignments to a monophone topology just created from the 'lang' directory (in case the topology is different from where you got the system's alignments from), and it stops after the tree-building and model-initialization stage, without re-estimating the Gaussians or training the transitions.

```bash
# Usage: $0 <#leaves> <data> <lang> <ali-dir> <exp-dir>
# e.g.: $0 --frame-subsampling-factor 3 
steps/nnet3/chain/build_tree.sh \
  --frame-subsampling-factor 3 \
  --context-opts "--context-width=2 --central-position=1" \
  --cmd "$train_cmd" 3500 ${lores_train_data_dir} \
  $lang $ali_dir $tree_dir
```

### 8.5 Creating Neural Net Configs

Note that the input layer with `name=input` must be the layer immediately preceding the `fixed-affine-layer`; this enables the short notation for the descriptor.

```bash
num_targets=$(tree-info $tree_dir/tree |grep num-pdfs|awk '{print $2}') 
learning_rate_factor=$(echo "print (0.5/$xent_regularize)" | python)
```
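
The two shell lines above can be mirrored in Python to see what they compute. This is only a sketch: the sample `tree-info` output is illustrative (the real numbers depend on the tree that was built), and `xent_regularize=0.1` is an assumed value, since the variable is set elsewhere in the recipe.

```python
def parse_num_pdfs(tree_info_output):
    """Return the integer following 'num-pdfs' in tree-info style output."""
    for line in tree_info_output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "num-pdfs":
            return int(fields[1])
    raise ValueError("no 'num-pdfs' line found")


def xent_learning_rate_factor(xent_regularize):
    """Same arithmetic as the shell one-liner: 0.5 / xent_regularize."""
    return 0.5 / xent_regularize


# Illustrative output only; not taken from a real run.
sample_output = "num-pdfs 2016\ncontext-width 2\ncentral-position 1\n"
num_targets = parse_num_pdfs(sample_output)
learning_rate_factor = xent_learning_rate_factor(0.1)  # assumed xent_regularize
print(num_targets, learning_rate_factor)
```
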
```bash
tdnn_opts="l2-regularize=0.03 dropout-proportion=0.0 dropout-per-dim-continuous=true"
tdnnf_opts="l2-regularize=0.03 dropout-proportion=0.0 bypass-scale=0.66"
linear_opts="l2-regularize=0.03 orthonormal-constraint=-1.0"
prefinal_opts="l2-regularize=0.03"
output_opts="l2-regularize=0.015"

mkdir -p $dir/configs
cat <<EOF > $dir/configs/network.xconfig
input dim=100 name=ivector
input dim=40 name=input

# please note that it is important to have input layer with the name=input
# as the layer immediately preceding the fixed-affine-layer to enable
# the use of short notation for the descriptor
fixed-affine-layer name=lda input=Append(-1,0,1,ReplaceIndex(ivector, t, 0)) \
affine-transform-file=$dir/configs/lda.mat

# the first splicing is moved before the lda layer, so no splicing here
relu-batchnorm-dropout-layer name=tdnn1 $tdnn_opts dim=768
tdnnf-layer name=tdnnf2 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=1
tdnnf-layer name=tdnnf3 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=1
tdnnf-layer name=tdnnf4 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=1
tdnnf-layer name=tdnnf5 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=0
tdnnf-layer name=tdnnf6 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=3
tdnnf-layer name=tdnnf7 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=3
tdnnf-layer name=tdnnf8 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=3
tdnnf-layer name=tdnnf9 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=3
tdnnf-layer name=tdnnf10 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=3
tdnnf-layer name=tdnnf11 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=3
tdnnf-layer name=tdnnf12 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=3
tdnnf-layer name=tdnnf13 $tdnnf_opts dim=768 bottleneck-dim=96 time-stride=3
linear-component name=prefinal-l dim=192 $linear_opts

## adding the layers for chain branch
prefinal-layer name=prefinal-chain input=prefinal-l $prefinal_opts small-dim=192 big-dim=768
output-layer name=output include-log-softmax=false dim=$num_targets $output_opts

# adding the layers for xent branch
prefinal-layer name=prefinal-xent input=prefinal-l $prefinal_opts small-dim=192 big-dim=768
output-layer name=output-xent dim=$num_targets learning-rate-factor=$learning_rate_factor $output_opts
EOF
steps/nnet3/xconfig_to_configs.py --xconfig-file $dir/configs/network.xconfig --config-dir $dir/configs/
```
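
A rough way to reason about how much temporal context this network sees is to add up the splicing offsets. The sketch below assumes that each `tdnnf-layer` adds its `time-stride` frames of context on each side, and that the `Append(-1,0,1,...)` before the LDA layer adds one frame per side. Treat the result as an estimate under those assumptions, not as the value Kaldi reports.

```python
def total_context(first_splice, time_strides):
    """Frames of context on each side, assuming one stride of context per layer."""
    return first_splice + sum(time_strides)


# time-stride values copied from the xconfig above (tdnnf2 .. tdnnf13).
strides = [1, 1, 1, 0] + [3] * 8
context = total_context(1, strides)
print("context per side:", context, "frames")
```

The layers with `time-stride=3` dominate the total, so most of the context comes from the upper part of the network.
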

* Configuration saved in `exp/chain{}/tdnn{1h}_sp`. The path depends on `nnet3_affix` and `affix`, which are {} and {1h} in this case.
* `exp/chain{}/tdnn{1h}_sp/configs` contains all configuration files.

### 8.6 Train Neural Net

```bash
steps/nnet3/chain/train.py --stage=$train_stage \
--cmd="$decode_cmd" \
--feat.online-ivector-dir=$train_ivector_dir \
--feat.cmvn-opts="--norm-means=false --norm-vars=false" \
--chain.xent-regularize $xent_regularize \
--chain.leaky-hmm-coefficient=0.1 \
--chain.l2-regularize=0.0 \
--chain.apply-deriv-weights=false \
--chain.lm-opts="--num-extra-lm-states=2000" \
--trainer.dropout-schedule $dropout_schedule \
--trainer.add-option="--optimization.memory-compression-level=2" \
--trainer.srand=$srand \
--trainer.max-param-change=2.0 \
--trainer.num-epochs=20 \
--trainer.frames-per-iter=3000000 \
--trainer.optimization.num-jobs-initial=2 \
--trainer.optimization.num-jobs-final=5 \
--trainer.optimization.initial-effective-lrate=0.002 \
--trainer.optimization.final-effective-lrate=0.0002 \
--trainer.num-chunk-per-minibatch=128,64 \
--egs.chunk-width=$chunk_width \
--egs.dir="$common_egs_dir" \
--egs.opts="--frames-overlap-per-eg 0" \
--cleanup.remove-egs=$remove_egs \
--use-gpu=true \
--reporting.email="$reporting_email" \
--feat-dir=$train_data_dir \
--tree-dir=$tree_dir \
--lat-dir=$lat_dir \
--dir=$dir  || exit 1;
```
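
The options `initial-effective-lrate`, `final-effective-lrate`, `num-jobs-initial` and `num-jobs-final` describe a schedule over the course of training. The Python sketch below shows one plausible reading: the effective learning rate decays exponentially from the initial to the final value, and the number of parallel jobs grows linearly. The exact formulas in `steps/nnet3/chain/train.py` may differ, so treat both functions as assumptions.

```python
import math


def effective_lrate(progress, initial=0.002, final=0.0002):
    """Exponential interpolation between initial and final; progress in [0, 1]."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be between 0 and 1")
    return initial * math.exp(progress * math.log(final / initial))


def num_jobs(progress, initial=2, final=5):
    """Linear ramp in the number of parallel jobs, rounded to the nearest integer."""
    return int(0.5 + initial + (final - initial) * progress)


for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, round(effective_lrate(p), 6), num_jobs(p))
```

With the values from the command above, the effective learning rate drops by a factor of 10 over training while the job count goes from 2 to 5.
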

### 8.7 Make Neural Net Graph

```bash
# Note: it's not important to give mkgraph.sh the lang directory with the
# matched topology (since it gets the topology file from the model).
utils/mkgraph.sh \
  --self-loop-scale 1.0 data/lang_test_tgsmall \
  $tree_dir $tree_dir/graph_tgsmall || exit 1;
```

### 8.8 Decoding

```bash
for data in $test_sets; do
  (
    nspk=$(wc -l <data/${data}_hires/spk2utt)
    steps/nnet3/decode.sh \
        --acwt 1.0 --post-decode-acwt 10.0 \
        --frames-per-chunk $frames_per_chunk \
        --nj $nspk --cmd "$decode_cmd"  --num-threads 4 \
        --online-ivector-dir exp/nnet3${nnet3_affix}/ivectors_${data}_hires \
        $tree_dir/graph_tgsmall data/${data}_hires ${dir}/decode_tgsmall_${data} || exit 1
    steps/lmrescore_const_arpa.sh --cmd "$decode_cmd" \
      data/lang_test_{tgsmall,tglarge} \
     data/${data}_hires ${dir}/decode_{tgsmall,tglarge}_${data} || exit 1
  ) || touch $dir/.error &
done
```

### 8.9 Online Decoding

```bash
for data in $test_sets; do
  (
    nspk=$(wc -l <data/${data}_hires/spk2utt)
    # note: we just give it "data/${data}" as it only uses the wav.scp, the
    # feature type does not matter.
    steps/online/nnet3/decode.sh \
      --acwt 1.0 --post-decode-acwt 10.0 \
      --nj $nspk --cmd "$decode_cmd" \
      $tree_dir/graph_tgsmall data/${data} ${dir}_online/decode_tgsmall_${data} || exit 1
    steps/lmrescore_const_arpa.sh --cmd "$decode_cmd" \
      data/lang_test_{tgsmall,tglarge} \
     data/${data}_hires ${dir}_online/decode_{tgsmall,tglarge}_${data} || exit 1
  ) || touch $dir/.error &
done
```

<br>

---

## Training GMM-HMM Acoustic Models

https://eleanorchodroff.com/tutorial/kaldi/training-acoustic-models.html

### Create files for `data/train`

1) Run `append_transcripts.py` to obtain the `text` file, an utterance-by-utterance transcript.

```bash
cd mycorpus/data/train
python append_transcripts.py ../local/LibriSpeech_train/train-clean-5/
```

2) Run the following command to reduce the lexicon to only the words present in the corpus and obtain `words.txt`

```bash
cd mycorpus/data/train
cut -d ' ' -f 2- text | tr ' ' '\n' | sort -u > words.txt
```
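
For readers less used to `cut`/`tr`/`sort`, the pipeline above does roughly the following: drop the utterance ID, split the transcript into words, and keep each word once. A Python sketch of the same idea is below. The sample lines are made up, and Python's sort order can differ from `sort -u` depending on the locale.

```python
def unique_words(text_lines):
    """Collect the distinct words of a Kaldi-style `text` file, sorted."""
    words = set()
    for line in text_lines:
        # The first field is the utterance ID; the rest is the transcript.
        words.update(line.split()[1:])
    return sorted(words)


# Made-up transcript lines in the `<utt-id> <word> <word> ...` layout.
sample_text = [
    "1088-134315-0000 AS YOU KNOW",
    "1088-134315-0001 YOU KNOW IT",
]
print(unique_words(sample_text))
```
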

3) Run the following command to downsize the lexicon to only the words in the corpus, to obtain `data/local/lang/lexicon.txt`.

```bash
cd mycorpus/data/train
python filter_dict.py ../lang/lexicon.txt ../local/lang/lexicon.txt words.txt
```

4) Run the following command to convert flac format to wav format.

```bash
cd mycorpus/data/train
python get_wavs.py ../local/LibriSpeech_train/train-clean-5/ ../local/LibriSpeech_train/train-clean-5-wav/
```

5) Run the following command to create `segments` file.

```bash
cd mycorpus
python data/train/create_segments.py data/local/LibriSpeech_train/train-clean-5-wav/
```

6) Run the following command to create `wav.scp` file.

```bash
cd mycorpus
python data/train/create_wav_file.py data/local/LibriSpeech_train/train-clean-5-wav/
```

7) Run the following command to create `utt2spk` file.

```bash
cd mycorpus/data/train
cat segments | cut -f 1 -d ' ' | \
perl -ane 'chomp; @F = split "-", $_; print $_ . " " . @F[0] . "\n";' > utt2spk
```
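
The Perl one-liner above takes the utterance ID from each line of `segments` and uses everything before the first `-` as the speaker ID. The same logic as a Python sketch, with made-up segment lines:

```python
def utt2spk_lines(segment_lines):
    """Build `<utt-id> <speaker-id>` lines from Kaldi-style `segments` lines."""
    result = []
    for line in segment_lines:
        if not line.strip():
            continue  # skip blank lines
        utt_id = line.split()[0]
        speaker = utt_id.split("-")[0]  # text before the first '-'
        result.append(f"{utt_id} {speaker}")
    return result


# Made-up segments: <utt-id> <recording-id> <start> <end>
sample_segments = [
    "1088-134315-0000 1088-134315-0000 0.0 14.5",
    "1737-142396-0003 1737-142396-0003 0.0 7.2",
]
print("\n".join(utt2spk_lines(sample_segments)))
```

This only works because LibriSpeech utterance IDs start with the speaker ID; other corpora may need a different rule.
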

8) Run the following command to create the `spk2utt` file. It also removes from `segments` and `wav.scp` any entries that do not appear in the other files.

```bash
cd mycorpus
utils/fix_data_dir.sh data/train/
```
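
`spk2utt` is simply the inverse of `utt2spk`: one line per speaker, followed by all of that speaker's utterances. A minimal Python sketch of the inversion is below; it is not the Kaldi script itself, and the sample lines are made up.

```python
def spk2utt_lines(utt2spk):
    """Invert `<utt-id> <speaker-id>` lines into `<speaker-id> <utt-id> ...` lines."""
    by_speaker = {}
    for line in utt2spk:
        utt_id, speaker = line.split()
        by_speaker.setdefault(speaker, []).append(utt_id)
    return [
        f"{speaker} {' '.join(sorted(utts))}"
        for speaker, utts in sorted(by_speaker.items())
    ]


sample_utt2spk = [
    "1088-134315-0001 1088",
    "1737-142396-0003 1737",
    "1088-134315-0000 1088",
]
print("\n".join(spk2utt_lines(sample_utt2spk)))
```
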
### Create files for `data/local/lang`

```bash
cd ../local/lang
```

1) Create `nonsilence_phones.txt`.

```bash
# One pipeline split across lines; a trailing | continues the command.
cut -d ' ' -f 2- lexicon.txt |
  tr ' ' '\n' |
  sort -u > nonsilence_phones.txt
```
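
The same extraction in Python, as a sketch. One difference from the shell pipeline is an added assumption: this version drops the phones that are later written to `silence_phones.txt` (`SIL` and `oov`), in case the lexicon contains entries that use them. Pass an empty tuple to mirror the shell command exactly. The lexicon lines are made up.

```python
def phones_from_lexicon(lexicon_lines, silence_phones=("SIL", "oov")):
    """Collect the distinct phones of a lexicon, minus the given silence phones."""
    phones = set()
    for line in lexicon_lines:
        # The first field is the word; the remaining fields are its phones.
        phones.update(line.split()[1:])
    return sorted(phones - set(silence_phones))


# Made-up lexicon entries in the `<word> <phone> <phone> ...` layout.
sample_lexicon = [
    "KNOW N OW1",
    "YOU Y UW1",
    "<oov> oov",
]
print(phones_from_lexicon(sample_lexicon))
```
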

2) Create `silence_phones.txt`.

```bash
printf 'SIL\noov\n' > silence_phones.txt
```

3) Create `optional_silence.txt`.

```bash
printf 'SIL\n' > optional_silence.txt
```

### Create files for `data/lang`

1) Populate the directory

```bash
cd mycorpus
utils/prepare_lang.sh data/local/lang '<oov>' data/local/ data/lang
```

### Parallelization wrapper

1) Create `cmd.sh`.
```bash
cd mycorpus

# Write cmd.sh (equivalent to typing these two lines in an editor like vim)
cat > cmd.sh <<'EOF'
train_cmd="run.pl"
decode_cmd="run.pl"
EOF

chmod 750 cmd.sh
```

2) Create `conf/mfcc.conf`.
```bash
cd mycorpus/conf

# Write mfcc.conf (equivalent to typing these two lines in an editor like vim)
cat > mfcc.conf <<'EOF'
--use-energy=false
--sample-frequency=16000
EOF

chmod 750 mfcc.conf
```
3) Extract MFCC features and compute CMVN statistics.
```bash
cd mycorpus
steps/make_mfcc.sh --cmd run.pl --nj 16 data/train/ exp/make_mfcc/data/train mfcc
steps/compute_cmvn_stats.sh data/train/ exp/make_mfcc/data/train/ mfcc
```

### Monophone training and alignment

1) Take subset of data for monophone training.
```bash
cd mycorpus
utils/subset_data_dir.sh --first data/train 1000 data/train_1k
```

2) Train monophones.
```bash
steps/train_mono.sh --boost-silence 1.25 --nj 10 --cmd run.pl data/train_1k data/lang exp/mono_1k
```

3) Align monophones.
```bash
steps/align_si.sh --boost-silence 1.25 --nj 16 --cmd run.pl data/train data/lang exp/mono_1k exp/mono_ali || exit 1;
```

### Triphone training and alignment

1) Train delta-based triphones
```bash
steps/train_deltas.sh --boost-silence 1.25 --cmd run.pl 2000 10000 data/train data/lang exp/mono_ali exp/tri1 || exit 1;
```

2) Align delta-based triphones.
```bash
steps/align_si.sh --nj 24 --cmd run.pl data/train data/lang exp/tri1 exp/tri1_ali || exit 1;
```

3) Train delta + delta-delta triphones.
```bash
steps/train_deltas.sh --cmd run.pl 2500 15000 data/train data/lang exp/tri1_ali exp/tri2a || exit 1;
```

4) Align delta + delta-delta triphones.
```bash
steps/align_si.sh  --nj 24 --cmd run.pl --use-graphs true data/train data/lang exp/tri2a exp/tri2a_ali  || exit 1;
```

5) Train LDA-MLLT triphones.
```bash
steps/train_lda_mllt.sh --cmd run.pl 3500 20000 data/train data/lang exp/tri2a_ali exp/tri3a || exit 1;
```

6) Align LDA-MLLT triphones with FMLLR.
```bash
steps/align_fmllr.sh --nj 28 --cmd run.pl data/train data/lang exp/tri3a exp/tri3a_ali || exit 1;
```

7) Train SAT triphones.
```bash
steps/train_sat.sh  --cmd run.pl 4200 40000 data/train data/lang exp/tri3a_ali exp/tri4a || exit 1;
```

8) Align SAT triphones with FMLLR.
```bash
steps/align_fmllr.sh  --cmd run.pl data/train data/lang exp/tri4a exp/tri4a_ali || exit 1;
```

<br>

---
## Training DNN Acoustic Models

https://jrmeyer.github.io/asr/2016/12/15/DNN-AM-Kaldi.html

### First Things First: train a GMM system and generate alignments

1) Using all the files generated previously.

* `data_dir`: mycorpus/data/train where splitJOBN = split28
* `lang_dir`: mycorpus/data/lang
* `ali_dir`: mycorpus/exp/tri4a_ali
* `mfcc_dir`: mycorpus/mfcc

```bash
## DEPENDENCIES FROM GMM-HMM SYSTEM ##

# DATA DIR FILES
$data_dir/feats.scp
$data_dir/splitJOBN                # where JOBN is the total number of JOBs (eg. split4)
                   /JOB            # one dir for each JOB, up to JOBN
                        /feats.scp


# LANGUAGE DIR FILES
$lang_dir/topo


# ALIGN DIR FILES
$ali_dir/ali.JOB.gz                     # for as many JOBs as you ran
$ali_dir/final.mdl
$ali_dir/tree
$ali_dir/num_jobs


# MFCC DIR FILES
$mfcc_dir/raw_mfcc_train.JOB.{ark,scp}  # for as many JOBs as you ran
```
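
Before starting DNN training it can save time to check that these files exist. The Python sketch below tests a subset of the list above (`feats.scp`, `topo`, `final.mdl`, `tree`, `num_jobs`, and one `ali.JOB.gz` per job). The function name and the choice of which files to check are illustrative, not part of Kaldi.

```python
import os


def missing_dependencies(data_dir, lang_dir, ali_dir):
    """Return the paths from the dependency list that are not present on disk."""
    num_jobs_file = os.path.join(ali_dir, "num_jobs")
    required = [
        os.path.join(data_dir, "feats.scp"),
        os.path.join(lang_dir, "topo"),
        os.path.join(ali_dir, "final.mdl"),
        os.path.join(ali_dir, "tree"),
        num_jobs_file,
    ]
    # If num_jobs is readable, also expect one alignment archive per job.
    if os.path.isfile(num_jobs_file):
        with open(num_jobs_file) as handle:
            num_jobs = int(handle.read().strip())
        required += [
            os.path.join(ali_dir, f"ali.{job}.gz") for job in range(1, num_jobs + 1)
        ]
    return [path for path in required if not os.path.isfile(path)]


missing = missing_dependencies(
    "mycorpus/data/train", "mycorpus/data/lang", "mycorpus/exp/tri4a_ali"
)
for path in missing:
    print("missing:", path)
```

An empty result means the GMM-HMM stage left everything this sketch looks for.
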

### The Main Run Script: `run_nnet2_simple.sh`

1) Create the script following the instructions and using the correct paths.

