# 1.1: Downloading and preparing `librispeech` files

We will be using `librispeech`, a free dataset consisting of readings of books available from Project Gutenberg.

The language models and lexicon are explained [here](http://www.openslr.org/11/).

Note: You do **not** need to download them yourself.  The scripts below will automatically download the necessary files.

## Downloading the audio

In [None]:
# location to download raw audio
data=${KALDI_INSTRUCTIONAL_PATH}/raw_data
mkdir -p ${data}    # -p: do not fail if the directory already exists

# base url for downloads
data_url=www.openslr.org/resources/12
lm_url=www.openslr.org/resources/11

# source files with path information
. ${KALDI_INSTRUCTIONAL_PATH}/path.sh

The audio files are explained [here](http://www.openslr.org/12/).  

There are two sets of audio: `clean` and `other`.  `clean` is the subset of the audio that is very clearly articulated and therefore "easier" to run through `ASR`.  `other` is the subset that is much more difficult to run through `ASR`.  There are also three training sets of different sizes: `100 hrs`, `360 hrs`, and `500 hrs`.

We will do all of our training on `train-clean-100`, and will test on *both* `test-clean` and `test-other`.

The command below will download the following audio subsets into the directory `INSTRUCTIONAL/raw_data`:
 - `train-clean-100`
 - `dev-clean`
 - `dev-other`
 - `test-clean`
 - `test-other`

**Note:** Depending on your internet connection speed, this step could take quite a while (possibly more than `1 hr`) to complete.

In [None]:
for part in dev-clean test-clean dev-other test-other train-clean-100; do
    ${KALDI_INSTRUCTIONAL_PATH}/local/download_and_untar.sh ${data} ${data_url} ${part}
done
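
Once the loop finishes, it can be worth confirming that every subset was actually unpacked. The sketch below assumes the archives extract into `${data}/LibriSpeech/<part>` (the layout the later cells rely on). `check_subsets` is a hypothetical helper written for illustration; it is not one of the provided scripts.

```shell
# Hypothetical helper: report any expected subset directory that is missing.
# Usage: check_subsets <LibriSpeech root> <part> [<part> ...]
# Returns 0 if every directory exists, 1 otherwise.
check_subsets() {
    root=$1
    shift
    missing=0
    for p in "$@"; do
        if [ ! -d "${root}/${p}" ]; then
            echo "missing: ${p}"
            missing=1
        fi
    done
    return ${missing}
}
```

For example, `check_subsets ${data}/LibriSpeech dev-clean test-clean dev-other test-other train-clean-100` should print nothing if all five subsets are present.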

## Converting the audio

We will use `ffmpeg` to `downsample` and `convert` the `librispeech` audio files from `16kHz flac` to `8kHz wav` (with `16-bit signed little endian encoding`).  

We will also consolidate each of the `train`, `dev`, and `test` audio subsets into its own flat directory under `raw_data/LibriSpeech`:
 - `train-clean-100_audio`
 - `dev-clean_audio`
 - `dev-other_audio`
 - `test-clean_audio`
 - `test-other_audio`
 
**Note:** This step could take up to `1 hr` to complete.

In [None]:
for part in dev-clean test-clean dev-other test-other train-clean-100; do
    ${KALDI_INSTRUCTIONAL_PATH}/utils/data/convert_audio_directory.sh \
        -i ${KALDI_INSTRUCTIONAL_PATH}/raw_data/LibriSpeech/${part} \
        -o ${KALDI_INSTRUCTIONAL_PATH}/raw_data/LibriSpeech/${part}_audio \
        -s 8000
done
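
For reference, the per-file conversion described above can be sketched as a single `ffmpeg` call. The contents of `convert_audio_directory.sh` are not shown in this notebook, so the flags below are an assumption about an equivalent command rather than a copy of what the script runs. `flac_to_wav_8k` is a hypothetical helper name.

```shell
# Hypothetical helper: convert one FLAC file to an 8 kHz, 16-bit signed
# little-endian WAV file.
# Usage: flac_to_wav_8k <input.flac> <output.wav>
flac_to_wav_8k() {
    in_file=$1
    out_file=$2
    # -ar 8000       : resample the output to 8 kHz
    # -c:a pcm_s16le : encode as 16-bit signed little-endian PCM
    # -y             : overwrite the output file if it already exists
    ffmpeg -loglevel error -y -i "$in_file" -ar 8000 -c:a pcm_s16le "$out_file"
}
```

The sample rate passed to `-ar` plays the same role as the `-s 8000` argument given to the conversion script above.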

After the conversion, we make another quick pass to clean up the filenames.

In [None]:
for part in dev-clean_audio test-clean_audio dev-other_audio test-other_audio train-clean-100_audio; do
    ${KALDI_INSTRUCTIONAL_PATH}/utils/data/strip_duplicate_filetype.sh \
        ${KALDI_INSTRUCTIONAL_PATH}/raw_data/LibriSpeech/${part}
done
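
The kind of cleanup involved can be illustrated with shell parameter expansion. This sketch assumes the converted files end in a doubled extension such as `.flac.wav`; that naming pattern is a guess based on the script's name, and `strip_duplicate_ext` is a hypothetical helper, not the script itself.

```shell
# Hypothetical helper: print the cleaned-up name for a file that ends in
# a doubled extension such as ".flac.wav" (an assumed naming pattern).
# Names without the doubled extension are printed unchanged.
strip_duplicate_ext() {
    name=$1
    case "$name" in
        *.flac.wav) echo "${name%.flac.wav}.wav" ;;
        *)          echo "$name" ;;
    esac
}
```

The helper only computes the new name; renaming would be a separate `mv "$f" "$(strip_duplicate_ext "$f")"` step.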

## Downloading `language model`s and `lexicon`

The other files we need, the `language model`s and the `lexicon`, have already been built for us.

`librispeech-lm-norm.txt.gz` is a compressed file of the `text` used to build the language models. <br>
`librispeech-lexicon.txt` is a file that contains all the words in the `ASR` vocabulary and their pronunciations. <br>
`3-gram.arpa.gz` is a compressed `3-gram` `language model`. <br>
`3-gram.pruned.1e-7.arpa.gz` and `3-gram.pruned.3e-7.arpa.gz` are `prune`d versions of `3-gram.arpa.gz`. <br>
`4-gram.arpa.gz` is a compressed `4-gram` `language model`. <br>

The command below will download these files into the directory `INSTRUCTIONAL/raw_data`.

**Note:** This step could take up to `1 hr` to complete.

In [None]:
${KALDI_INSTRUCTIONAL_PATH}/local/download_lm.sh ${lm_url} ${data}

For simplicity of later steps, we will uncompress the language models.

In [None]:
for lm in 3-gram.arpa.gz 3-gram.pruned.1e-7.arpa.gz 3-gram.pruned.3e-7.arpa.gz 4-gram.arpa.gz; do
    gzip -df raw_data/${lm}
    echo "uncompressed ${lm}"
done
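
An uncompressed ARPA file is plain text, so its header can be inspected directly. The sketch below relies on the standard ARPA layout, in which `ngram N=count` lines sit between the `\data\` marker and the first `\1-grams:` section. `arpa_counts` is a hypothetical helper name.

```shell
# Hypothetical helper: print the "ngram N=count" header lines of an ARPA file.
# Usage: arpa_counts <file.arpa>
arpa_counts() {
    # Stop at the first n-gram section so the (large) body is never scanned.
    awk '/^\\1-grams:/ { exit } /^ngram [0-9]+=/ { print }' "$1"
}
```

Running `arpa_counts raw_data/3-gram.pruned.3e-7.arpa` gives a quick view of how many n-grams of each order a model keeps, which makes the effect of pruning visible when compared against `3-gram.arpa`.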

We are also going to remove some `symbolic links` created in a previous step.

In [None]:
rm raw_data/lm_*.gz

In [None]:
ls raw_data

## Fixing bugs in `lexicon`

There are two identical entries for `SPIRITS` in the `lexicon`.

In [None]:
cat raw_data/librispeech-lexicon.txt | grep SPIRITS
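
`grep` shows the entries for one word at a time. To scan the whole `lexicon` for exact duplicate lines, a minimal sketch with `sort` and `uniq` might look like this (`lexicon_duplicates` is a hypothetical helper name):

```shell
# Hypothetical helper: print each lexicon line that occurs more than once.
# Usage: lexicon_duplicates <lexicon file>
lexicon_duplicates() {
    # uniq -d only compares adjacent lines, so the input must be sorted first
    LC_ALL=C sort "$1" | uniq -d
}
```

Each duplicated line is printed once, however many times it is repeated in the file.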

We are going to replace the second `SPIRITS` entry with an alternative pronunciation (more about alternative pronunciations later):

```
SPIRITS  S P IH1 R IH1 T S
```

In [None]:
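# rewrite only the 4th match of this phone string in the file;
# every other match is written back unchanged
cat raw_data/librispeech-lexicon.txt | \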
    perl -pe 's{S P IH1 R IH0 T S}{++$n == 4 ? "S P IH1 R IH1 T S" : $&}ge' \
    > raw_data/librispeech-lexicon.txt.corrected
    
mv raw_data/librispeech-lexicon.txt.corrected raw_data/librispeech-lexicon.txt

In [None]:
cat raw_data/librispeech-lexicon.txt | grep SPIRITS
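
The `perl` one-liner counts matches of the phone string across the whole file, so it depends on where that string happens to occur. As an alternative sketch (not the method used in this notebook), the same change can be keyed on the word itself with `awk`. `fix_second_spirits` is a hypothetical helper name, and the sketch assumes each `lexicon` line starts with the word followed by whitespace.

```shell
# Hypothetical helper: print the lexicon with the second SPIRITS entry
# changed from "S P IH1 R IH0 T S" to "S P IH1 R IH1 T S".
# Usage: fix_second_spirits <lexicon file> > <corrected file>
fix_second_spirits() {
    # ++seen is only evaluated on lines whose first field is exactly SPIRITS,
    # so words that merely contain it (e.g. DISPIRITS) are not counted.
    awk '$1 == "SPIRITS" && ++seen == 2 { sub(/IH0/, "IH1") } { print }' "$1"
}
```

As with the cell above, the output should be written to a temporary file and then moved over the original.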

`kaldi` also expects the `lexicon` to contain an entry for `<unk>` (unknown) words, so we will add one (more about this later):

```
<unk>    SIL
```

In [None]:
printf "<unk>\tSIL\n" >> raw_data/librispeech-lexicon.txt

In [None]:
tail -n3 raw_data/librispeech-lexicon.txt
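
Because the `printf` cell appends unconditionally, running it twice would add the `<unk>` entry twice. A guarded version is sketched below; `add_unk_entry` is a hypothetical helper name.

```shell
# Hypothetical helper: append the <unk> entry only if the lexicon lacks one.
# Usage: add_unk_entry <lexicon file>
add_unk_entry() {
    lexicon=$1
    # match "<unk>" followed by whitespace at the start of a line
    if ! grep -q '^<unk>[[:space:]]' "$lexicon"; then
        printf '<unk>\tSIL\n' >> "$lexicon"
    fi
}
```

Calling `add_unk_entry raw_data/librispeech-lexicon.txt` any number of times leaves exactly one `<unk>` line in the file.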

## Building a transcript file

The following script will generate a bunch of files, most of which we will ignore for now...

In [None]:
for part in dev-clean test-clean dev-other test-other train-clean-100; do
    data_files=${data}/LibriSpeech/${part}_data
    mkdir -p ${data_files}
    ${KALDI_INSTRUCTIONAL_PATH}/local/data_prep.sh \
        ${data}/LibriSpeech/${part} \
        ${data_files}
    rm -r ${data}/LibriSpeech/${part}    # we no longer need the original audio, so we can delete it
done

...but it will allow us to easily make a single file containing all the transcripts.

In [None]:
for part in dev-clean test-clean dev-other test-other train-clean-100; do
    data_files=${data}/LibriSpeech/${part}_data
    cat ${data_files}/text
done > ${data}/librispeech-transcripts.txt

In [None]:
head -n5 raw_data/librispeech-transcripts.txt
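
Each line of a `kaldi` `text` file is an utterance ID followed by the words of the transcript. With the transcripts in a single file, it becomes easy to check them against the `lexicon`. The sketch below lists transcript words that have no `lexicon` entry; `list_oov_words` is a hypothetical helper name.

```shell
# Hypothetical helper: list words in a Kaldi-style text file (utterance ID
# followed by words) that have no entry in the lexicon.
# Usage: list_oov_words <lexicon file> <transcripts file>
list_oov_words() {
    lexicon=$1
    transcripts=$2
    # First file: remember every lexicon word (field 1).
    # Second file: skip the utterance ID (field 1) and test each word.
    awk 'NR == FNR { known[$1] = 1; next }
         { for (i = 2; i <= NF; i++) if (!($i in known)) print $i }' \
        "$lexicon" "$transcripts" | LC_ALL=C sort -u
}
```

For example, `list_oov_words raw_data/librispeech-lexicon.txt raw_data/librispeech-transcripts.txt` prints one out-of-vocabulary word per line, or nothing if every word is covered.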

## Building a phones file

The following script will build a list of all the phones used in our lexicon.

In [None]:
python local/build_phones_list.py \
    ${data}/librispeech-lexicon.txt \
    ${data}/librispeech-phones.txt

In [None]:
head -n5 raw_data/librispeech-phones.txt
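
The idea behind the phones list can be sketched in a few lines of shell. `build_phones_list.py` is not shown in this notebook, so this is an assumption about an equivalent approach, not a description of that script: it treats every field after the word as a phone and keeps stress markers such as `IH1` as they are. `list_phones` is a hypothetical helper name.

```shell
# Hypothetical helper: print the unique phone symbols used in a lexicon,
# one per line, treating every field after the word as a phone.
# Usage: list_phones <lexicon file>
list_phones() {
    awk '{ for (i = 2; i <= NF; i++) print $i }' "$1" | LC_ALL=C sort -u
}
```

Comparing `list_phones raw_data/librispeech-lexicon.txt` with `raw_data/librispeech-phones.txt` is one way to see whether the Python script applies any extra normalization.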