# 1.2: Examining the *required* files

Our version of the `kaldi` pipeline will depend on the files and directory structures explained in this notebook.

**Note**: If you would like to run this pipeline with your **own data**, you **must** have all of the following before proceeding.

In our case, everything we need should now be in the `raw_data` directory.  There you will see the original `.tar.gz` downloads along with:

 - the `lexicon` file: `librispeech-lexicon.txt`
 - the `transcripts` file: `librispeech-transcripts.txt`
 - the different `language models`: `3-gram.arpa.gz`, `3-gram.pruned.*.arpa.gz`, `4-gram.arpa.gz`

In [20]:
ls raw_data

3-gram.arpa.gz              dev-other.tar.gz             lm_tglarge.arpa.gz
3-gram.pruned.1e-7.arpa.gz  librispeech-lexicon.txt      lm_tgmed.arpa.gz
3-gram.pruned.3e-7.arpa.gz  librispeech-lm-norm.txt.gz   lm_tgsmall.arpa.gz
4-gram.arpa.gz              librispeech-transcripts.txt  test-clean.tar.gz
LibriSpeech                 librispeech-vocab.txt        test-other.tar.gz
dev-clean.tar.gz            lm_fglarge.arpa.gz           train-clean-100.tar.gz


Additional files for the different subsets of the dataset are in `raw_data/LibriSpeech`, including an `audio` directory and a `data` directory for each subset.

In [22]:
ls raw_data/LibriSpeech

BOOKS.TXT     SPEAKERS.TXT     dev-other_audio   test-other
CHAPTERS.TXT  dev-clean_audio  dev-other_data    test-other_data
LICENSE.TXT   dev-clean_data   test-clean_audio  train-clean-100
README.TXT    dev-other        test-clean_data   train-clean-100_data


## audio files

All audio files for a particular subset (*e.g.* `train`, `dev`, `test`) must be in a flat directory structure (*i.e.* without any sub-directories).

In [10]:
ls raw_data/LibriSpeech/test-clean_audio | head    # look at first 10 files in the directory to exhibit flat structure

1089-134686-0000.wav
1089-134686-0001.wav
1089-134686-0002.wav
1089-134686-0003.wav
1089-134686-0004.wav
1089-134686-0005.wav
1089-134686-0006.wav
1089-134686-0007.wav
1089-134686-0008.wav
1089-134686-0009.wav
ls: write error: Broken pipe
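If you are bringing your own data, the flat-structure requirement can be checked automatically rather than by eye.  The sketch below is one way to do it; the helper name `is_flat_dir` is our own invention, and it assumes a `find` that supports `-mindepth` (GNU and BSD `find` both do).

```shell
# Hypothetical helper: succeeds only if the directory has no sub-directories.
# $1 = directory holding the audio files of one subset.
is_flat_dir() {
    # -mindepth 1 skips the directory itself; -type d keeps only directories.
    nested_dirs=$(find "$1" -mindepth 1 -type d)
    [ -z "$nested_dirs" ]
}

# Example call, using a path from the listing above:
# is_flat_dir raw_data/LibriSpeech/test-clean_audio && echo "flat"
```

The function returns a non-zero status as soon as any nested directory exists, so it can be used directly in an `if` or with `&&`.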


The format of the audio is important as well.  `kaldi` expects the audio to be encoded as `16-bit signed little endian` PCM (more information about this is [here](https://wiki.multimedia.cx/index.php/PCM)).

The sample rate of the audio is a hyperparameter that becomes important in a later step.  The most common rates are `16 kHz` for recorded audio and `8 kHz` for recorded phone calls.  In our case, we downsampled the `librispeech` data to `8 kHz`.

In [13]:
file raw_data/LibriSpeech/test-clean_audio/1089-134686-0000.wav

raw_data/LibriSpeech/test-clean_audio/1089-134686-0000.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 8000 Hz
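To check many files at once, the description printed by `file` can be matched against the expected encoding and sample rate.  The following is a minimal sketch, not a robust validator: the helper name `has_expected_format` is hypothetical, and it assumes `file` words its description exactly as in the output above (the wording can differ between versions of `file`).

```shell
# Hypothetical helper: checks a `file` description for 16-bit mono PCM at a given rate.
# $1 = description string (e.g. from `file -b some.wav`), $2 = expected sample rate in Hz.
has_expected_format() {
    case "$1" in
        *"PCM, 16 bit, mono $2 Hz"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example loop over one subset (paths taken from the listings above):
# for wav in raw_data/LibriSpeech/test-clean_audio/*.wav; do
#     has_expected_format "$(file -b "$wav")" 8000 || echo "unexpected format: $wav"
# done
```

Matching on the description string keeps the check dependency-free, at the cost of being tied to the wording of `file`.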


The command below counts the utterances (audio files) in each subset.

In [27]:
for part in train-clean-100_audio dev-clean_audio dev-other_audio test-clean_audio test-other_audio; do
    count=$(ls raw_data/LibriSpeech/${part} | wc -l)
    echo "There are ${count} utterances in the ${part} subset"
done

There are 17553 utterances in the train-clean-100_audio subset
There are 2703 utterances in the dev-clean_audio subset
There are 2864 utterances in the dev-other_audio subset
There are 2620 utterances in the test-clean_audio subset
There are 2939 utterances in the test-other_audio subset


## segments file 
#### (**UNUSED** for `librispeech`)

The `librispeech` data is already segmented into small audio files (~2-10 seconds long).  `kaldi` *can* handle data that is unsegmented, but it requires an additional `segments` file with the following format:

```
[utterance-id] [audio-basename] [utterance-start] [utterance-stop]
[utterance-id] [audio-basename] [utterance-start] [utterance-stop]
[utterance-id] [audio-basename] [utterance-start] [utterance-stop]
```

The `utterance-start` and `utterance-stop` times are given in seconds.  This allows `kaldi` to process **each segment** as a separate audio file.
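If you do need a `segments` file for your own data, a quick sanity check can catch malformed lines before they reach `kaldi`.  The sketch below is only illustrative: the helper name `bad_segment_lines` is hypothetical, and it assumes start and stop times are plain numbers in seconds.

```shell
# Hypothetical helper: prints every malformed line of a segments file.
# A line is accepted when it has exactly four fields and start < stop.
# $1 = path to the segments file.
bad_segment_lines() {
    awk 'NF != 4 || ($3 + 0) >= ($4 + 0) { print FILENAME ":" FNR ": " $0 }' "$1"
}

# Example call (the path is an assumption; librispeech has no segments file):
# bad_segment_lines path/to/segments
```

An empty output means every line has the four expected fields and a positive duration.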

## transcript file

All transcripts for all audio subsets should be in one text file with the following format:
```
[utterance-id] [transcript text]
[utterance-id] [transcript text]
[utterance-id] [transcript text]
```

The `utterance-id` is used to identify the particular utterance.  In the case of **segmented** audio like the `librispeech` dataset, this will **also** be the audio basename (*i.e.* without `.wav`).

In [28]:
head raw_data/librispeech-transcripts.txt

1272-128104-0000 MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
1272-128104-0001 NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER
1272-128104-0002 HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND
1272-128104-0003 HE HAS GRAVE DOUBTS WHETHER SIR FREDERICK LEIGHTON'S WORK IS REALLY GREEK AFTER ALL AND CAN DISCOVER IN IT BUT LITTLE OF ROCKY ITHACA
1272-128104-0004 LINNELL'S PICTURES ARE A SORT OF UP GUARDS AND AT EM PAINTINGS AND MASON'S EXQUISITE IDYLLS ARE AS NATIONAL AS A JINGO POEM MISTER BIRKET FOSTER'S LANDSCAPES SMILE AT ONE MUCH IN THE SAME WAY THAT MISTER CARKER USED TO FLASH HIS TEETH AND MISTER JOHN COLLIER GIVES HIS SITTER A CHEERFUL SLAP ON THE BACK BEFORE HE SAYS LIKE A SHAMPOOER IN A TURKISH BATH NEXT MAN
1272-128104-0005 IT IS OBVIOUSLY UNNECESSARY FOR US TO POINT OUT HOW LUMINOUS THESE CRITICI
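Because the `utterance-id` doubles as the audio basename for segmented data, it is easy to check that every audio file has a transcript.  The following is a sketch under that assumption; the helper name `missing_transcripts` is our own.

```shell
# Hypothetical helper: lists audio basenames that have no line in the transcript file.
# $1 = flat directory of .wav files, $2 = transcript file.
missing_transcripts() {
    for wav in "$1"/*.wav; do
        [ -e "$wav" ] || continue      # skip the literal pattern if nothing matched
        name=${wav##*/}                # strip the directory part
        printf '%s\n' "${name%.wav}"   # strip the extension
    done | awk 'FILENAME == ARGV[1] { seen[$1] = 1; next }   # ids from the transcript file
                !($1 in seen)                                # basenames read from stdin
               ' "$2" -
}

# Example call, using paths from the listings above:
# missing_transcripts raw_data/LibriSpeech/test-clean_audio raw_data/librispeech-transcripts.txt
```

An empty output means every `.wav` file in the directory has a matching `utterance-id`.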

## phones file

This file contains a list of all the phones used to make up the words in our `lexicon`.

It takes the following format:

```
[phone_1]
[phone_2]
[phone_3]
```

In [36]:
head raw_data/librispeech-phones.txt
tail raw_data/librispeech-phones.txt

AA0
AA1
AA2
AE0
AE1
AE2
AH0
AH1
AH2
AO0
UH1
UH2
UW0
UW1
UW2
V
W
Y
Z
ZH


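You'll notice that some phones end in a digit.  This allows us to distinguish different stresses or tones of a phone.  In this phone set the digit marks lexical stress: `0` for no stress, `1` for primary stress, and `2` for secondary stress.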
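Since the phones file is meant to list every phone used in the `lexicon`, the two files can be cross-checked.  The sketch below is one possible check; the helper name `unlisted_phones` is hypothetical, and it assumes the lexicon has the word in the first field and phones in all remaining fields.

```shell
# Hypothetical helper: prints each phone used in the lexicon but absent from the phones file.
# $1 = phones file (one phone per line), $2 = lexicon file.
unlisted_phones() {
    awk 'FILENAME == ARGV[1] { listed[$1] = 1; next }    # collect the listed phones
         {
             # Field 1 is the word; every later field is a phone.
             for (i = 2; i <= NF; i++)
                 if (!($i in listed) && !reported[$i]++) print $i
         }' "$1" "$2"
}

# Example call, using the file names from this notebook:
# unlisted_phones raw_data/librispeech-phones.txt raw_data/librispeech-lexicon.txt
```

Each missing phone is reported once, in the order it is first encountered; an empty output means the phones file covers the lexicon.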

## lexicon

The `lexicon` is a file containing all the words in our vocabulary **and** their pronunciations.  

**Note:** Our `ASR` system can only predict words that appear in this `lexicon`.  In other words, if a word is **not** in this `lexicon`, then our system will **never** be able to predict it.

It takes the following format:

```
[short_word]    [phone_1] [phone_2] [phone_3]
[longer_word]   [phone_1] [phone_2] [phone_3] [phone_4] [phone_5]
[another_word]  [phone_1] [phone_2] [phone_3]
```

**Note:** The first `whitespace` is a `tab`; the remaining ones are `space`s.

**Note:** While it is convenient for humans, `kaldi` does **not** require that this file be in alphabetical order.

In [32]:
head -n5 raw_data/librispeech-lexicon.txt
tail -n5 raw_data/librispeech-lexicon.txt

A  AH0
A  EY1
A''S	EY1 Z
A'BODY	EY1 B AA2 D IY0
A'COURT	EY1 K AO2 R T
ZYNOOL'S	Z IH1 N UW1 L Z
ZYOBOR	Z AY1 OW0 B AO0 R
ZZ	Z
ZZZ	Z Z
ZZZZ	Z AH0 Z
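Because the system can never predict a word that is missing from the `lexicon`, it is worth finding out which transcript words fall outside it.  The following is a rough sketch; the helper name `oov_words` is our own, and it assumes the transcript format shown earlier (an `utterance-id` followed by the words).

```shell
# Hypothetical helper: counts transcript words that do not appear in the lexicon.
# $1 = lexicon file, $2 = transcript file.
# Prints "<count> <word>" lines, most frequent first.
oov_words() {
    awk 'FILENAME == ARGV[1] { vocab[$1] = 1; next }    # first field of the lexicon is the word
         {
             # Field 1 of a transcript line is the utterance-id, so start at field 2.
             for (i = 2; i <= NF; i++)
                 if (!($i in vocab)) count[$i]++
         }
         END { for (w in count) print count[w], w }' "$1" "$2" | sort -k1,1nr -k2,2
}

# Example call, using the file names from this notebook:
# oov_words raw_data/librispeech-lexicon.txt raw_data/librispeech-transcripts.txt | head
```

Words near the top of the output are the most frequent out-of-vocabulary items, and so the first candidates to add to the `lexicon`.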


## language model