# 1.2: Examining the *required* files

Our version of the `kaldi` pipeline will depend on the files and directory structures explained in this notebook.

**Note**: If you would like to run this pipeline with your **own data**, you **must** have all of the following before proceeding.

In our case, everything we need should now be in the `raw_data` directory.  Here you will see the original `.tar.gz` downloads along with:

 - the `lexicon` file: `librispeech-lexicon.txt`
 - the `phones` file: `librispeech-phones.txt`
 - the `transcripts` file: `librispeech-transcripts.txt`
 - the different `language models`: `3-gram.arpa.gz`, `3-gram.pruned.*.arpa.gz`, `4-gram.arpa.gz`

In [None]:
ls raw_data

Additional files for the different subsets of the dataset are in `raw_data/LibriSpeech`, including an `audio` file and a `data` file for each subset.

In [None]:
ls raw_data/LibriSpeech

## audio files

All audio files for a particular subset (*e.g.* `train`, `dev`, `test`, etc.) must be in a flat directory structure (*i.e.* without any sub-directories).

In [None]:
ls raw_data/LibriSpeech/test-clean_audio | head    # look at first 10 files in the directory to exhibit flat structure
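
Looking at the first ten files is only a spot check. As a rough sketch (not part of the original pipeline), the flat-structure requirement can be tested for a whole directory; `check_flat_dir` is a hypothetical helper name, and `find -mindepth` assumes GNU or BSD `find`.

```shell
# List any sub-directories found under an audio directory.
# Prints nothing and returns 0 when the directory is flat;
# prints the offending sub-directories and returns 1 otherwise.
check_flat_dir() {
    nested=$(find "$1" -mindepth 1 -type d)
    if [ -n "$nested" ]; then
        echo "$nested"
        return 1
    fi
}
# usage: check_flat_dir raw_data/LibriSpeech/test-clean_audio
```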

The format of the audio matters as well.  `kaldi` expects the audio to be encoded as `16-bit signed little endian` PCM (more information about this is [here](https://wiki.multimedia.cx/index.php/PCM)).

The sample rate of the audio is a hyperparameter that becomes important in a later step.  The most common rates are `16 kHz` for recorded audio and `8 kHz` for recorded phone calls.  In our case, we downsampled the `librispeech` data to `8 kHz`.

In [None]:
file raw_data/LibriSpeech/test-clean_audio/1089-134686-0000.wav
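
The same properties can be read straight from the WAV header when a machine-checkable answer is needed. The sketch below is an illustration only: `wav_field` and `wav_info` are hypothetical helper names, and it assumes the canonical RIFF layout in which the `fmt ` chunk immediately follows the 12-byte RIFF header (format tag at byte 20, sample rate at byte 24, bits per sample at byte 34). A format tag of `1` means uncompressed PCM, which WAV stores as signed little-endian samples at a bit depth of 16.

```shell
# Read a little-endian unsigned integer of NBYTES (up to 4) bytes
# starting at byte OFFSET of FILE.
# Usage: wav_field FILE OFFSET NBYTES
wav_field() {
    od -An -t u1 -j "$2" -N "$3" "$1" | {
        read -r b0 b1 b2 b3
        echo $(( ${b0:-0} + (${b1:-0} << 8) + (${b2:-0} << 16) + (${b3:-0} << 24) ))
    }
}

# Print the format tag, sample rate and bit depth of a WAV file.
wav_info() {
    echo "format=$(wav_field "$1" 20 2) rate=$(wav_field "$1" 24 4) bits=$(wav_field "$1" 34 2)"
}
# usage: wav_info raw_data/LibriSpeech/test-clean_audio/1089-134686-0000.wav
```

For the downsampled data described above, one would expect `format=1 rate=8000 bits=16`. Files with extra chunks before `fmt ` would need a real parser such as `soxi`.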

The command below will generate counts for each subset.

In [None]:
for part in train-clean-100_audio dev-clean_audio dev-other_audio test-clean_audio test-other_audio; do
    count=$(ls raw_data/LibriSpeech/${part} | wc -l)
    echo "There are ${count} utterances in the ${part} subset"
done

## segments file 
#### (**UNUSED** for `librispeech`)

The `librispeech` data is already segmented into small audio files (~2-10 seconds long).  `kaldi` *can* handle data that is unsegmented, but it requires an additional `segments` file with the following format:

```
[utterance-id] [audio-basename] [utterance-start] [utterance-stop]
[utterance-id] [audio-basename] [utterance-start] [utterance-stop]
[utterance-id] [audio-basename] [utterance-start] [utterance-stop]
```

This allows `kaldi` to process **each segment** as a separate audio file.
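
If you do write a `segments` file for your own data, a quick sanity check can catch malformed lines early. The following is a minimal sketch under the format shown above; `check_segments` is a hypothetical helper name, and the only rules it enforces are "exactly four fields" and "start strictly before stop".

```shell
# Print every line of a segments file that does not have exactly four
# fields or whose start time is not strictly before its stop time.
# Returns 1 if any such line is found, 0 otherwise.
check_segments() {
    awk 'NF != 4 || ($3 + 0) >= ($4 + 0) { print FILENAME ":" FNR ": " $0; bad = 1 }
         END { exit bad + 0 }' "$1"
}
# usage: check_segments path/to/segments
```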

## transcript file

All transcripts for all audio subsets should be in one text file with the following format:
```
[utterance-id] [transcript text]
[utterance-id] [transcript text]
[utterance-id] [transcript text]
```

The `utterance-id` is used to identify the particular utterance.  In the case of **segmented** audio like the `librispeech` dataset, this will **also** be the audio basename (*i.e.* without `.wav`).

In [None]:
head raw_data/librispeech-transcripts.txt
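
Because the `utterance-id` doubles as the audio basename here, it is straightforward to check that every audio file has a transcript. This is a sketch rather than a step in the pipeline; `untranscribed_audio` is a hypothetical helper name, and it assumes the audio files end in `.wav` and that the transcripts file is not empty.

```shell
# Print the basename of every .wav file in AUDIO_DIR that has no
# matching utterance-id in the transcripts file.
# Usage: untranscribed_audio TRANSCRIPTS AUDIO_DIR
untranscribed_audio() {
    ls "$2" | sed -n 's/\.wav$//p' |
        awk 'NR == FNR { seen[$1] = 1; next } !($1 in seen)' "$1" -
}
# usage: untranscribed_audio raw_data/librispeech-transcripts.txt raw_data/LibriSpeech/test-clean_audio
```

An empty result means every utterance in that subset is transcribed.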

## phones file

This file contains a list of all the phones used to make up the words in our `lexicon`.

It takes the following format:

```
[phone_1]
[phone_2]
[phone_3]
```

In [None]:
head raw_data/librispeech-phones.txt
tail raw_data/librispeech-phones.txt

You'll notice that some phones end in a digit.  This lets us distinguish different stresses or tones of a phone.

We also need to identify `silence phones`.  These are phones that represent sounds that do **not** correspond to spoken words.  This could be laughter, coughing, passing cars, whatever.  In our case, we will keep it simple and have one `phone` (`SIL`) to represent everything non-spoken.  But you can be as granular with this as you'd like **as long as** your `transcripts` file accurately transcribes them all!

**Note:** This `silence phone` will be a hyperparameter to the first step in our pipeline!

In [None]:
grep -A2 -B2 SIL raw_data/librispeech-phones.txt

## lexicon

The `lexicon` is a file containing all the words in our vocabulary **and** their pronunciations.  

**Note:** Only words that appear in this `lexicon` will be words that our `ASR` system predicts.  In other words, if the word is **not** in this `lexicon`, then our system will **never** be able to predict it.

It takes the following format:

```
[short_word]    [phone_1] [phone_2] [phone_3]
[longer_word]   [phone_1] [phone_2] [phone_3] [phone_4] [phone_5]
[another_word]  [phone_1] [phone_2] [phone_3]
```

**Note:** The first `whitespace` is a `tab`; the remaining ones are `space`s.

**Note:** While it is convenient for humans, `kaldi` does **not** require that this file be in alphabetical order.

In [None]:
head -n5 raw_data/librispeech-lexicon.txt
tail -n5 raw_data/librispeech-lexicon.txt
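
Two properties of the `lexicon` are easy to get wrong when building your own: the tab after the word, and using only phones that appear in the `phones` file. The sketch below checks both; `unknown_phones` and `missing_tab` are hypothetical helper names, not part of `kaldi`.

```shell
# Print "word phone" for every phone used in the lexicon that is not
# listed in the phones file.
# Usage: unknown_phones PHONES LEXICON
unknown_phones() {
    awk 'NR == FNR { known[$1] = 1; next }
         { for (i = 2; i <= NF; i++) if (!($i in known)) print $1, $i }' "$1" "$2"
}

# Print every lexicon line whose word is not immediately followed by a tab.
# Usage: missing_tab LEXICON
missing_tab() {
    tab=$(printf '\t')
    grep -v "^[^ ${tab}][^ ${tab}]*${tab}" "$1"
}
# usage: unknown_phones raw_data/librispeech-phones.txt raw_data/librispeech-lexicon.txt
# usage: missing_tab raw_data/librispeech-lexicon.txt
```

Both commands print nothing when the `lexicon` is consistent.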

You will also notice that multiple entries for the same word are allowed, provided they have different pronunciations.

In [None]:
grep "INDIRECTLY" raw_data/librispeech-lexicon.txt

We also have an entry for unknown words (`<unk>`).  `kaldi` requires this "placeholder" for any words that it can't decode.  It is made up of the single `silence phone`, `SIL`.

In [None]:
grep "<unk>" raw_data/librispeech-lexicon.txt
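
Since the system can never predict a word that is missing from the `lexicon`, it helps to know which transcript words fall into that category. The following sketch lists them with their frequencies; `oov_words` is a hypothetical helper name, and it assumes words are compared as exact, case-sensitive strings.

```shell
# Print each word that occurs in the transcripts but has no lexicon
# entry, preceded by its number of occurrences, most frequent first.
# Usage: oov_words LEXICON TRANSCRIPTS
oov_words() {
    awk 'NR == FNR { vocab[$1] = 1; next }
         { for (i = 2; i <= NF; i++) if (!($i in vocab)) count[$i]++ }
         END { for (w in count) print count[w], w }' "$1" "$2" |
        sort -k1,1nr -k2,2
}
# usage: oov_words raw_data/librispeech-lexicon.txt raw_data/librispeech-transcripts.txt
```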

## language model

The `language model` must be in the `ARPA` format (more details on this format in week 2).  

In [None]:
head raw_data/3-gram.pruned.3e-7.arpa
echo "..."
grep -A4 -E '\\2-grams' raw_data/3-gram.pruned.3e-7.arpa
echo "..."
grep -A4 -E '\\3-grams' raw_data/3-gram.pruned.3e-7.arpa
echo "..."
tail raw_data/3-gram.pruned.3e-7.arpa

**Note:** The `language model` **can** remain compressed, but later steps must then read it through a pipe from `gzip -d`.  We have already uncompressed the `language model`s for this dataset.
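
As an illustration of reading a compressed model through a pipe, the sketch below summarises the `\data\` header of an `ARPA` file arriving on standard input. `arpa_ngram_counts` is a hypothetical helper name; the usage line assumes the `.arpa.gz` download is still present in `raw_data`.

```shell
# Read an ARPA language model on stdin and print "<order> <count>"
# for each "ngram N=COUNT" line in its \data\ header.
# Stops reading as soon as the first n-gram section begins.
arpa_ngram_counts() {
    awk '/^\\data\\/ { header = 1; next }
         header && /^ngram / { sub(/^ngram /, ""); split($0, kv, "="); print kv[1], kv[2] }
         header && /^\\[0-9]+-grams:/ { exit }'
}
# usage: gzip -dc raw_data/3-gram.arpa.gz | arpa_ngram_counts
```

For an uncompressed model, redirect the file instead: `arpa_ngram_counts < raw_data/3-gram.pruned.3e-7.arpa`.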

We have four `language model`s of different size and complexity built for the `librispeech` data.

In [None]:
ls raw_data | grep "arpa"