# 3.2: Inspecting the `data` directory

`run_prepare_data.sh` generates a new directory, `data`, that contains many of the files required for the `ASR` pipeline.  We will inspect its contents below.

**Note:** The official `kaldi` documentation has a more detailed explanation of these files [here](http://kaldi-asr.org/doc/data_prep.html).  Just beware that **some** files explained there are not relevant to our pipeline.

In [None]:
ls data

## `data/{train|test}_dir`

Each of these directories contains four files describing its subset, `train` or `test` (this assumes you configured both subsets when you ran `run_prepare_data.sh`, as we did).

In [5]:
ls data/train_dir

spk2utt  text  utt2spk  wav.scp
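
The official documentation linked above describes `utt2spk` as one `<utterance-id> <speaker-id>` pair per line, and `spk2utt` as the inverse: one speaker per line, followed by all of that speaker's utterances. As a minimal sketch of that relationship, the inverse mapping can be rebuilt with `awk`. The IDs below are invented for illustration, the files are written to a scratch directory rather than to `data/train_dir`, and `kaldi` ships its own helper scripts for this conversion, so this is not necessarily how the pipeline does it.

```shell
# Scratch directory so that nothing under data/ is touched.
tmpdir=$(mktemp -d)

# Hypothetical utt2spk contents: "<utterance-id> <speaker-id>" per line.
cat > "$tmpdir/utt2spk" <<'EOF'
spk1-utt1 spk1
spk1-utt2 spk1
spk2-utt1 spk2
EOF

# Group utterance IDs by speaker: "<speaker-id> <utt1> <utt2> ..."
# awk does not guarantee iteration order, so the result is sorted afterwards.
awk '{ utts[$2] = utts[$2] " " $1 }
     END { for (spk in utts) print spk utts[spk] }' "$tmpdir/utt2spk" \
  | LC_ALL=C sort > "$tmpdir/spk2utt"

cat "$tmpdir/spk2utt"
```

Each line of the resulting file starts with a speaker ID and lists that speaker's utterances after it.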


### `wav.scp`

### `text`

### `spk2utt`

### `utt2spk`

## `data/local`

This directory is an intermediate (essentially, `temp`) directory used for housing files as they are manipulated and/or built for later use.  All the **important** files will appear in another subdirectory of `data`, so we won't spend too much time on the items here.

### `waves.{train|test}`

Each of these files is simply a list of the audio files that belong to the corresponding subset, `train` or `test`.

In [None]:
head -n5 data/local/waves.train
echo ...
head -n5 data/local/waves.test

### `lm_tg.arpa`

This is a modified version of the `language model` that you supplied as an argument to `run_prepare_data.sh`, with every `n-gram` containing `<UNK>` removed.  This ensures that our model will **never** predict `<UNK>` when decoding.

In [None]:
diff data/local/lm_tg.arpa raw_data/3-gram.pruned.3e-7.arpa | head -n10
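
To make the idea of "removing every `n-gram` that contains `<UNK>`" concrete, here is a rough sketch on a toy `ARPA` file that simply drops every line mentioning `<UNK>`. The toy model and file names are invented, and this is not necessarily how `run_prepare_data.sh` does it. In particular, this sketch leaves the `ngram N=` counts in the header untouched, whereas a careful tool would also update them.

```shell
# Scratch directory with a tiny, made-up ARPA-format language model.
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.arpa" <<'EOF'
\data\
ngram 1=3
ngram 2=2

\1-grams:
-1.0 <UNK> -0.5
-0.7 HELLO -0.3
-0.7 WORLD -0.3

\2-grams:
-0.4 HELLO WORLD
-0.9 HELLO <UNK>

\end\
EOF

# -F treats <UNK> as a fixed string; -v keeps only the lines that do NOT contain it.
# Note: the header counts ("ngram 1=3", "ngram 2=2") are now stale.
grep -F -v '<UNK>' "$tmpdir/toy.arpa" > "$tmpdir/toy_no_unk.arpa"

cat "$tmpdir/toy_no_unk.arpa"
```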

### `data/local/dict`

This directory contains files pertaining to the `lexicon`.

In [None]:
ls data/local/dict

#### `lexicon.txt`

This is just a local copy of the `lexicon` you supplied to `run_prepare_data.sh`.

In [13]:
diff data/local/dict/lexicon.txt raw_data/librispeech-lexicon.txt

**Note:** `kaldi` will do this a lot: it will **copy** files to a `local` location and then call those files from that **local** location (as opposed to their original locations).  In my opinion this is inefficient from a disk space perspective, but it would be a tremendous amount of work to parameterize all the scripts to take a location.  And disk space is cheap, so we will suffer it.

#### `lexiconp.txt`

This is a form of the `lexicon` with an additional value for each word: the probability of that pronunciation.

In [None]:
cat data/local/dict/lexiconp.txt | grep INDIRECTLY

This allows you to provide a `lexicon` that not only lists alternative pronunciations, but also weights them according to their likelihood.  In our case, however, we don't have data to support setting those values, so all pronunciations are equally weighted at `1.0` (yes, it probably should be `.5` and `.5`, but `kaldi` is OK with `1.0` for all the different pronunciations of a given word).
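
As a minimal sketch of the relationship between the two files, the snippet below turns a tiny `lexicon` into a `lexiconp`-style file by inserting `1.0` between each word and its pronunciation. The entries and the scratch file names are invented for illustration, and the snippet assumes a whitespace-separated lexicon; it is not meant to reproduce `kaldi`'s own script.

```shell
# Scratch directory with a tiny, made-up lexicon: "<word> <phone> <phone> ..."
tmpdir=$(mktemp -d)
cat > "$tmpdir/lexicon.txt" <<'EOF'
HELLO HH AH0 L OW1
HELLO HH EH0 L OW1
WORLD W ER1 L D
EOF

# Keep the word, blank out field 1, trim the leading space that leaves behind,
# then print: word, probability 1.0, pronunciation.
awk '{ word = $1; $1 = ""; sub(/^ +/, ""); print word, "1.0", $0 }' \
  "$tmpdir/lexicon.txt" > "$tmpdir/lexiconp.txt"

cat "$tmpdir/lexiconp.txt"
```

Both pronunciations of `HELLO` end up with the same weight, which mirrors the equal weighting described above.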

#### `{non}silence_phones.txt` and `optional_silence.txt`

`silence_phones.txt` and `nonsilence_phones.txt` simply separate the `phones` we supplied to `run_prepare_data.sh` into those that refer to `silence` and those that don't.  In our case, the only `silence` phone is `SIL`.

In [None]:
cat data/local/dict/silence_phones.txt
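
The split itself is simple. As a sketch, assuming a phone list with one phone per line and `SIL` as the only `silence` phone, it could be done with two `grep` calls. The phone list and the scratch file names below are made up for illustration, and the actual script may do this differently.

```shell
# Scratch directory with a tiny, made-up phone list (one phone per line).
tmpdir=$(mktemp -d)
printf '%s\n' SIL AA AE B > "$tmpdir/all_phones.txt"

# -x matches whole lines only, so a phone that merely contains "SIL" is not caught.
grep -x 'SIL' "$tmpdir/all_phones.txt" > "$tmpdir/silence_phones.txt"

# -v inverts the match: everything that is not exactly SIL is a non-silence phone.
grep -v -x 'SIL' "$tmpdir/all_phones.txt" > "$tmpdir/nonsilence_phones.txt"

cat "$tmpdir/silence_phones.txt"
echo ...
cat "$tmpdir/nonsilence_phones.txt"
```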

`optional_silence.txt` contains the value for a `phone` we will use to identify the `silence` between words.  The official `kaldi` documentation linked above doesn't go into much detail as to why this is used.

In [None]:
cat data/local/dict/optional_silence.txt

### `data/local/lang`

This directory is purely a `temp` directory used while building `data/lang`, so we will skip it and move on to `data/lang`.

## `data/lang`

This directory will contain all the files needed to utilize the `language model` (how `kaldi` accesses the `ARPA`-format `language model` during decoding will be discussed later).

### `phones.txt`

### `words.txt`

### `oov.txt`

### `oov.int`

### `arpa_oov.txt`

### `topo`

### `L.fst` and `L_disambig.fst`

### `data/lang/phones`

This directory...