# 1.2: Preparing the *required* files

The scripts below will prepare the required files.  The next notebook (`1.3: Examining the *required* files`) will look at their format, file structures, and the roles they play in our `ASR` pipeline.

The command below builds these required files for **each** subset:
 - `wav.scp`
 - `spk2utt`
 - `utt2spk`
 - `text`

In [None]:
# location of downloaded audio data
data=${KALDI_INSTRUCTIONAL_PATH}/raw_data

for part in dev-clean test-clean dev-other test-other train-clean-100; do
      # use underscore-separated names in data directories.
      ${KALDI_INSTRUCTIONAL_PATH}/local/data_prep.sh \
          $data/LibriSpeech/$part \
          ${KALDI_INSTRUCTIONAL_PATH}/data/$(echo $part | sed s/-/_/g)
done
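
Once the loop finishes, it can be worth confirming that every subset directory actually received all four files. The following is a minimal sketch, not part of the original scripts: `check_required_files` is a hypothetical helper name, and the check only tests that each file exists and is non-empty.

```shell
# Hypothetical helper (not part of data_prep.sh): print the name of each
# required file that is absent or empty in the given data directory.
# Returns non-zero if anything is missing.
check_required_files() {
    _dir=$1
    _missing=0
    for _f in wav.scp spk2utt utt2spk text; do
        if [ ! -s "$_dir/$_f" ]; then
            echo "$_f"
            _missing=1
        fi
    done
    return $_missing
}

# Check each subset directory, using the same underscore naming as the loop above.
for part in dev-clean test-clean dev-other test-other train-clean-100; do
    dir=${KALDI_INSTRUCTIONAL_PATH:-.}/data/$(echo "$part" | sed 's/-/_/g')
    if missing=$(check_required_files "$dir"); then
        echo "$dir: ok"
    else
        echo "$dir: missing" $missing
    fi
done
```

A directory reported as missing files usually means `data_prep.sh` did not find the corresponding subset under `raw_data/LibriSpeech`.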

We will use `ffmpeg` to downsample and convert the LibriSpeech audio files from 16 kHz FLAC to 8 kHz WAV (16-bit signed little-endian PCM).
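
For a single file, the conversion can be sketched roughly as below. This is an illustration under stated assumptions, not the contents of `convert_audio_directory.sh`: the helper name `flac_to_wav`, the `DRY_RUN` switch, and the example filenames are all hypothetical. The `ffmpeg` options themselves are standard: `-ar` sets the output sample rate and `-c:a pcm_s16le` selects 16-bit signed little-endian PCM.

```shell
# Hypothetical helper: convert one FLAC file to 16-bit PCM WAV at the given
# sample rate (default 8000 Hz). With DRY_RUN set, only print the command.
flac_to_wav() {
    # usage: flac_to_wav <input.flac> <output.wav> [sample_rate]
    set -- ffmpeg -nostdin -loglevel error -y -i "$1" -ar "${3:-8000}" -c:a pcm_s16le "$2"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"    # show the command without running it
    else
        "$@"         # run ffmpeg with the arguments built above
    fi
}

# Preview the command for an example file without touching any audio.
DRY_RUN=1 flac_to_wav example.flac example.wav
```

Previewing the command first is a cheap way to check the sample rate and codec before committing to a conversion that may run for a long time.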

We will also consolidate the audio for each subset into its own flat directory, named after the subset:
 - `train-clean-100_audio`
 - `dev-clean_audio`
 - `dev-other_audio`
 - `test-clean_audio`
 - `test-other_audio`
 
**Note:** This step can take up to an hour to complete.

In [None]:
for part in dev-clean test-clean dev-other test-other train-clean-100; do
    ${KALDI_INSTRUCTIONAL_PATH}/utils/data/convert_audio_directory.sh \
        -i ${KALDI_INSTRUCTIONAL_PATH}/raw_data/LibriSpeech/${part} \
        -o ${KALDI_INSTRUCTIONAL_PATH}/raw_data/LibriSpeech/${part}_audio \
        -s 8000 \
        -r
done

Then we make another quick pass to clean up the filenames.

In [None]:
for part in dev-clean_audio test-clean_audio dev-other_audio test-other_audio train-clean-100_audio; do
    ${KALDI_INSTRUCTIONAL_PATH}/utils/data/strip_duplicate_filetype.sh \
        ${KALDI_INSTRUCTIONAL_PATH}/raw_data/LibriSpeech/${part}
done

echo  "process is completed"
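
The source of `strip_duplicate_filetype.sh` is not shown here, so the following is only a guess at the kind of cleanup involved. It assumes the conversion step leaves doubled extensions such as `name.flac.wav`, which is an assumption rather than something confirmed above; `strip_flac_suffix` is a hypothetical name.

```shell
# Hypothetical sketch: rename every "*.flac.wav" under a directory to "*.wav".
# Assumes the doubled extension is what needs cleaning up.
strip_flac_suffix() {
    find "$1" -type f -name '*.flac.wav' | while IFS= read -r f; do
        # ${f%.flac.wav} removes the doubled suffix; ".wav" is appended again.
        mv -- "$f" "${f%.flac.wav}.wav"
    done
}

# Example (path is illustrative):
# strip_flac_suffix "${KALDI_INSTRUCTIONAL_PATH}/raw_data/LibriSpeech/dev-clean_audio"
```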

Because we've restructured the audio, we need to rebuild the `wav.scp` file for each subset.
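
As a rough illustration of what the rebuild involves: each line of a `wav.scp` pairs an utterance ID with the location of its audio, and Kaldi expects the file to be sorted by ID in the C locale. The sketch below assumes the utterance ID is simply the filename without its `.wav` extension; `make_wav_scp` is a hypothetical helper, not one of the scripts used in these notebooks.

```shell
# Hypothetical sketch: emit "<utterance-id> <path>" for every .wav file under
# a directory, sorted by utterance ID in the C locale.
# Assumes the utterance ID equals the filename minus ".wav".
make_wav_scp() {
    find "$1" -type f -name '*.wav' | while IFS= read -r f; do
        printf '%s %s\n' "$(basename "$f" .wav)" "$f"
    done | LC_ALL=C sort -k1,1
}

# Example (paths are illustrative):
# make_wav_scp "${KALDI_INSTRUCTIONAL_PATH}/raw_data/LibriSpeech/dev-clean_audio" \
#     > "${KALDI_INSTRUCTIONAL_PATH}/data/dev_clean/wav.scp"
```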