# 1.2: Examining the *required* files

Our version of the `kaldi` pipeline will depend on the files and directory structures explained in this notebook.

**Note**: If you would like to run this pipeline with your **own data**, you **must** have all of the following before proceeding.

## audio files

All audio files for a particular subset (*e.g.* `train`, `dev`, `test`) must be in a flat directory structure (*i.e.* without any sub-directories).

In [10]:
ls raw_data/LibriSpeech/test-clean_audio | head    # look at first 10 files in the directory to exhibit flat structure

1089-134686-0000.wav
1089-134686-0001.wav
1089-134686-0002.wav
1089-134686-0003.wav
1089-134686-0004.wav
1089-134686-0005.wav
1089-134686-0006.wav
1089-134686-0007.wav
1089-134686-0008.wav
1089-134686-0009.wav
ls: write error: Broken pipe
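
The same flat-structure requirement can be checked programmatically. The sketch below is only an illustration: the function names `find_subdirectories` and `is_flat` are hypothetical helpers and are not part of `kaldi` or of this pipeline.

```python
from pathlib import Path


def find_subdirectories(audio_dir):
    """Return the sorted names of sub-directories found directly inside audio_dir."""
    return sorted(p.name for p in Path(audio_dir).iterdir() if p.is_dir())


def is_flat(audio_dir):
    """A directory counts as flat when it contains no sub-directories."""
    return not find_subdirectories(audio_dir)
```

For example, `is_flat("raw_data/LibriSpeech/test-clean_audio")` should return `True` for the directory listed above, and `find_subdirectories` names any offending sub-directories otherwise.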


The format of the audio is important as well.  `kaldi` expects the audio to be encoded as `16-bit signed little-endian` PCM (more information about this is [here](https://wiki.multimedia.cx/index.php/PCM)).

The sample rate of the audio is a hyperparameter that will become important in a later step.  The most common rates are `16 kHz` for recorded audio and `8 kHz` for recorded phone calls.  `librispeech` is distributed at `16 kHz`; the `file` command reports the rate of any particular file.

In [13]:
file raw_data/LibriSpeech/test-clean_audio/1089-134686-0000.wav

raw_data/LibriSpeech/test-clean_audio/1089-134686-0000.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 8000 Hz
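
If you are preparing your own data, the same header fields can be read with Python's standard `wave` module. This is a minimal sketch and not part of the pipeline: `describe_wav` and `has_expected_format` are hypothetical helper names, and the mono requirement and the default `expected_rate_hz=16000` are assumptions that you should adjust to your data.

```python
import wave


def describe_wav(path):
    """Read the header fields that matter here from a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "bits_per_sample": 8 * wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def has_expected_format(path, expected_rate_hz=16000):
    """True for 16-bit mono audio at the expected sample rate.

    Assumptions: mono audio and a 16 kHz default; pass the rate your data uses.
    """
    info = describe_wav(path)
    return (
        info["bits_per_sample"] == 16
        and info["channels"] == 1
        and info["sample_rate_hz"] == expected_rate_hz
    )
```

Calling `describe_wav` on a file shows the values that `file` prints above, and `has_expected_format` can be run over a whole directory to find files that need converting. The `wave` module only reads uncompressed PCM, so a `wave.Error` on open is itself a sign that a file is not in the expected encoding.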


## segments file 
#### (**UNUSED** for `librispeech`)

The `librispeech` data is already segmented into small audio files (~2-10 seconds long).  `kaldi` *can* handle data that is unsegmented, but it requires an additional `segments` file with the following format:

```
[utterance-id] [audio-basename] [utterance-start] [utterance-stop]
[utterance-id] [audio-basename] [utterance-start] [utterance-stop]
[utterance-id] [audio-basename] [utterance-start] [utterance-stop]
```

The start and stop times are given in seconds.  This allows `kaldi` to process **each segment** as if it were a separate audio file.
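
To make the four-column format concrete, here is a small sketch of a parser for such a file. It is an illustration only: the `Segment` type and `parse_segments` function are hypothetical names, and the checks (exactly four fields, stop after start) are assumptions about what a well-formed file should look like.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    utterance_id: str
    audio_basename: str
    start: float  # seconds from the beginning of the audio file
    stop: float   # seconds from the beginning of the audio file

    @property
    def duration(self):
        """Length of the segment in seconds."""
        return self.stop - self.start


def parse_segments(lines):
    """Parse lines of '[utterance-id] [audio-basename] [start] [stop]'."""
    segments = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"line {line_number}: expected 4 fields, got {len(fields)}")
        utterance_id, audio_basename, start, stop = fields
        segment = Segment(utterance_id, audio_basename, float(start), float(stop))
        if segment.stop <= segment.start:
            raise ValueError(f"line {line_number}: stop must be after start")
        segments.append(segment)
    return segments
```

Given a file object opened on a `segments` file, `parse_segments(f)` returns one `Segment` per utterance, so several utterances can point at the same `audio_basename` with different time ranges.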

## transcript file

All transcripts for all audio subsets should be in one text file with the following format:
```
[utterance-id] [transcript text]
[utterance-id] [transcript text]
[utterance-id] [transcript text]
```

The `utterance-id` is used to identify the particular utterance.  In the case of **segmented** audio like the `librispeech` dataset, this will **also** be the audio basename (*i.e.* without `.wav`).

In [15]:
head ${KALDI_INSTRUCTIONAL_PATH}/raw_data/transcripts.txt

1272-128104-0000 MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
1272-128104-0001 NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER
1272-128104-0002 HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND
1272-128104-0003 HE HAS GRAVE DOUBTS WHETHER SIR FREDERICK LEIGHTON'S WORK IS REALLY GREEK AFTER ALL AND CAN DISCOVER IN IT BUT LITTLE OF ROCKY ITHACA
1272-128104-0004 LINNELL'S PICTURES ARE A SORT OF UP GUARDS AND AT EM PAINTINGS AND MASON'S EXQUISITE IDYLLS ARE AS NATIONAL AS A JINGO POEM MISTER BIRKET FOSTER'S LANDSCAPES SMILE AT ONE MUCH IN THE SAME WAY THAT MISTER CARKER USED TO FLASH HIS TEETH AND MISTER JOHN COLLIER GIVES HIS SITTER A CHEERFUL SLAP ON THE BACK BEFORE HE SAYS LIKE A SHAMPOOER IN A TURKISH BATH NEXT MAN
1272-128104-0005 IT IS OBVIOUSLY UNNECESSARY FOR US TO POINT OUT HOW LUMINOUS THESE CRITICI
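
Because the `utterance-id` doubles as the audio basename for segmented data, it is straightforward to check that every audio file has a transcript. The following is a sketch under that assumption; `parse_transcripts` and `missing_transcripts` are hypothetical helper names and not part of `kaldi`.

```python
from pathlib import Path


def parse_transcripts(lines):
    """Map each utterance-id to its transcript text."""
    transcripts = {}
    for line in lines:
        parts = line.strip().split(maxsplit=1)
        if not parts:
            continue  # skip blank lines
        utterance_id = parts[0]
        text = parts[1] if len(parts) > 1 else ""
        if utterance_id in transcripts:
            raise ValueError(f"duplicate utterance-id: {utterance_id}")
        transcripts[utterance_id] = text
    return transcripts


def missing_transcripts(audio_dir, transcripts):
    """Return sorted audio basenames in audio_dir that have no transcript entry."""
    basenames = (p.stem for p in Path(audio_dir).glob("*.wav"))
    return sorted(name for name in basenames if name not in transcripts)
```

With a file object opened on `transcripts.txt`, `parse_transcripts(f)` builds the lookup table, and `missing_transcripts` can then be run once per audio subset; an empty list means every `.wav` file in that subset is covered.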

## phones file

## lexicon

## language model