# 4.2: Inspecting the `mfcc` dir

`run_feature_extraction.sh` will generate two new directories: `exp` and `mfcc`.  We will inspect their contents below.

In [None]:
ls exp

In [None]:
ls mfcc

## `exp/make_mfcc`

This directory contains the `log` files produced while extracting the `mfcc`s.

**Note:** `exp` is short for `exp`eriment. The outputs of later steps in our `ASR` pipeline will also be written to this directory.

In [None]:
ls exp/make_mfcc

In [None]:
ls exp/make_mfcc/train_dir

**Note:** Whenever possible, we will take advantage of parallelization within `kaldi`. The log files from parallelized steps are named `file_name.JOB.log`, where `JOB` is an integer from `1` to the number of parallel jobs.
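
As a small illustration of this naming scheme, the sketch below pulls the job number out of a log file name. The helper `job_number` and the listed file names are assumptions modeled on the file inspected in this notebook; they are not part of `kaldi`.

```python
import re

# Matches names of the form "<file_name>.<JOB>.log", where JOB is an integer.
JOB_LOG_PATTERN = re.compile(r"^(?P<name>.+)\.(?P<job>\d+)\.log$")


def job_number(log_name):
    """Return the job number encoded in a log file name, or None if there is none."""
    match = JOB_LOG_PATTERN.match(log_name)
    return int(match.group("job")) if match else None


# Hypothetical listing, modeled on the log file shown in this notebook.
log_names = [
    "make_mfcc_train_dir.10.log",
    "make_mfcc_train_dir.1.log",
    "make_mfcc_train_dir.2.log",
]

# Sort numerically so that job 10 comes after job 2,
# which plain string sorting would not guarantee.
for name in sorted(log_names, key=job_number):
    print(job_number(name), name)
```

Sorting on the parsed integer rather than on the raw name keeps the logs in job order once there are ten or more jobs.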

In [None]:
cat exp/make_mfcc/train_dir/make_mfcc_train_dir.1.log

You will notice that these `log`s don't contain much useful information *other than* which audio files were processed by which parallel job.

## `mfcc`

This directory contains the actual extracted features for both the training and testing subsets.  

In [None]:
ls mfcc

### `mfcc/kaldi_config_args.json` 

This is a copy of the arguments used in `kaldi_config.json` when running `run_feature_extraction.sh`.  In this case, the most important thing to pay attention to is which `mfcc_config` was used.

In [None]:
cat mfcc/kaldi_config_args.json

### `.ark` and `.scp` files

`kaldi` has two of its own file types: `ark` and `scp`.  

In most cases, there will be a separate `ark` and `scp` file for each parallel job. Above (in `kaldi_config_args.json`), you can see the `num_processors` value that was used. There should be that many `ark` and `scp` files, each labeled with the integer of the job that produced it.
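
To make the relationship between `num_processors` and the file names concrete, here is a minimal sketch that builds the list of names one would expect to find. It assumes `num_processors` is a top-level key in the JSON and that the files follow the `raw_mfcc_train_dir.N.ark` / `raw_mfcc_train_dir.N.scp` pattern; the function name and the example config are hypothetical.

```python
import json


def expected_feature_files(config_text, prefix="raw_mfcc_train_dir"):
    """List the ark/scp names expected for each parallel job, numbered from 1."""
    config = json.loads(config_text)
    # Assumption: "num_processors" sits at the top level of the JSON object.
    num_jobs = int(config["num_processors"])
    names = []
    for job in range(1, num_jobs + 1):
        names.append(f"{prefix}.{job}.ark")
        names.append(f"{prefix}.{job}.scp")
    return names


# Hypothetical config contents; the real file may hold more keys
# or nest them differently.
example_config = '{"num_processors": 2}'

for name in expected_feature_files(example_config):
    print(name)
```

Comparing this list against the actual contents of `mfcc` is a quick way to spot a job that failed to write its output.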

In [None]:
ls mfcc | grep 'raw_mfcc_train_dir\..*\.scp'

Much more detail about these file types can be found [here](http://kaldi-asr.org/doc/io.html), but for now we'll simplify it this way:

An `ark` file (short for `archive`) is a file, usually `binary`, containing serialized `C++ objects` such as feature matrices, often for more than one audio sample, utterance, etc. An `scp` file (short for `script`) acts as a mapping of items to their "location" in the `kaldi` `archive`s.
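
The archive-plus-index idea can be sketched with a toy example. The format below is invented purely for illustration and is not `kaldi`'s real on-disk layout: it packs several payloads into one bytes buffer and keeps a separate index recording where each payload starts, so that a single item can be read without scanning the whole archive.

```python
import io


def write_toy_archive(items):
    """Pack (key, payload) pairs into one buffer and index each payload's position."""
    buffer = io.BytesIO()
    index = {}
    for key, payload in items:
        buffer.write(key.encode("ascii") + b" ")
        index[key] = buffer.tell()  # position of the first byte after the key
        buffer.write(len(payload).to_bytes(4, "little"))  # toy length header
        buffer.write(payload)
    return buffer.getvalue(), index


def read_by_offset(archive, offset):
    """Jump straight to a stored position and read back one payload."""
    size = int.from_bytes(archive[offset:offset + 4], "little")
    return archive[offset + 4:offset + 4 + size]


# Made-up keys and payloads standing in for utterance IDs and feature data.
archive, index = write_toy_archive([("utt1", b"first features"), ("utt2", b"second")])
print(index)
print(read_by_offset(archive, index["utt2"]))
```

Here `archive` plays the role of the `ark` file and `index` plays the role of the `scp` file.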

In [None]:
head mfcc/raw_mfcc_train_dir.1.scp

The actual features are stored in the `ark` files. Each line of an `scp` file names an utterance, followed by (1) the `ark` file that holds it and (2) the byte offset within that `ark` file (the `:\d+` portion of the line) at which that utterance's data begins.
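
A minimal sketch of pulling those pieces out of one `scp` line follows. The function name and the example line are hypothetical, and the sketch assumes every line ends in `:<integer>`, as in the output above.

```python
def parse_scp_line(line):
    """Split one scp line into (utterance_id, ark_path, position)."""
    utterance_id, location = line.strip().split(maxsplit=1)
    # Split on the last colon so that any colon earlier in the path is preserved.
    ark_path, _, position = location.rpartition(":")
    return utterance_id, ark_path, int(position)


# Hypothetical line, modeled on the format shown in the scp file.
example = "speaker1-utt001 /path/to/mfcc/raw_mfcc_train_dir.1.ark:16"
print(parse_scp_line(example))
```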

Often, `kaldi` provides command-line programs, written in `C++`, that allow us to inspect `ark` files in more detail. In the case of `MFCC`s, however, we'll use some third-party `python` code to explore the `mfcc` features in the next notebook.

In [None]:
ls mfcc  | grep cmvn

You'll notice some files called `cmvn_*`. `cmvn` stands for `cepstral mean and variance normalization`, a process that shifts and scales each cepstral coefficient toward zero mean and unit variance so that features are comparable across recordings. These files contain the statistics required to normalize our data. We won't spend much time on them, as the normalization step is done "automatically" in later steps of our ASR pipeline.
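
To show what the normalization itself does, here is a minimal sketch in plain `python`. It does not read the `cmvn_*` files or call `kaldi`; it simply computes the mean and variance of each coefficient over whatever frames it is given and applies them. The function name and the example frames are made up for illustration.

```python
import math


def cmvn(frames):
    """Normalize each coefficient (column) to zero mean and unit variance across frames."""
    num_frames = len(frames)
    num_coeffs = len(frames[0])
    means = [sum(f[c] for f in frames) / num_frames for c in range(num_coeffs)]
    variances = [
        sum((f[c] - means[c]) ** 2 for f in frames) / num_frames
        for c in range(num_coeffs)
    ]
    # Guard against division by zero for a coefficient that never changes.
    stds = [math.sqrt(v) if v > 0 else 1.0 for v in variances]
    return [
        [(f[c] - means[c]) / stds[c] for c in range(num_coeffs)]
        for f in frames
    ]


# Made-up frames with two coefficients each; real MFCC frames have more.
frames = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
for row in cmvn(frames):
    print(row)
```

After normalization both columns contain the same values even though the second started out ten times larger, which is the point of the procedure.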