# 3.0: Understanding the `config` file, the `run_*` scripts, & support scripts

The pipeline is set up in **five** steps:

 - `run_prepare_data`
 - `run_feature_extraction`
 - `run_train_phones`
 - `run_compile_graph`
 - `run_test`

and the arguments to each of those steps are handled by a `config` file called `kaldi_config.json`.

In the first step, the `config` file supplies the paths of the input files to the script; in later steps, most of the arguments passed are hyperparameters.

## `kaldi_config.json`

```
cat kaldi_config.json | grep -A15 run_prepare_data
echo ...
cat kaldi_config.json | grep -A15 run_train_phones
```

**Note:** `path`s must **always** be `absolute` (*i.e.* the **full** path).
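A relative path resolves against whatever directory a script happens to run from, so it can be worth checking a section of the `config` before starting the pipeline. The following is a minimal sketch of such a check, not part of the pipeline itself; the function name `find_relative_paths` and the keys `audio_dir` and `transcripts` are hypothetical.

```python
import os.path

def find_relative_paths(section, path_keys):
    """Return the keys in `path_keys` whose values are not absolute paths."""
    return [
        key for key in path_keys
        if section.get(key) is not None and not os.path.isabs(section[key])
    ]

# Hypothetical section; the key names are illustrative, not the real config.
example_section = {
    "audio_dir": "/home/user/corpus/audio",   # absolute: fine
    "transcripts": "corpus/transcripts.txt",  # relative: flagged
}

relative = find_relative_paths(example_section, ["audio_dir", "transcripts"])
print(relative)
```

Keys set to `null` are skipped by the check, since an unused argument has no path to validate.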

### `integer` v. `string` values

This `json` is parsed by the `python` `json` module (https://docs.python.org/2/library/json.html), so `string` values must be wrapped in `""` while `integer` values must **not** be.
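The difference is easy to see by parsing a small fragment directly. This is only an illustration of how the `json` module treats quoted and unquoted values; the key names are made up and do not come from `kaldi_config.json`.

```python
import json

# Hypothetical fragment; the key names are illustrative only.
fragment = '{"num_gaussians": 2000, "model_name": "mono", "quoted_number": "2000"}'
parsed = json.loads(fragment)

# Show the python type each value was parsed into.
for key, value in parsed.items():
    print(key, type(value).__name__, repr(value))
```

The unquoted `2000` arrives as an `int`, while the quoted `"2000"` arrives as a `str` and would need converting before any arithmetic.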

### `boolean` values

`boolean` values must be written as **lowercase** `true` and `false`. This is the form `shell` expects, and it is also the only form that `json` accepts.

**Note:** Do **not** wrap the value in `""`.
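A short sketch of what happens with each spelling when the `json` module parses it. The key names `flag_a` and `flag_b` are placeholders, not real options.

```python
import json

# Lowercase, unquoted: parsed as a real boolean.
ok = json.loads('{"flag_a": true, "flag_b": false}')

# Quoted: parsed as a string, which is truthy even when it says "false".
quoted = json.loads('{"flag_a": "false"}')

# Capitalized (python-style): not valid json at all.
try:
    json.loads('{"flag_a": True}')
    capitalized_is_valid = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    capitalized_is_valid = False

print(ok, quoted, capitalized_is_valid)
```

The quoted case is the subtle one: it parses without complaint, but the resulting non-empty string counts as true in `python`.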

### `null` values

If an argument is not needed for a particular configuration, `null` can be used.

**Note:** Do **not** wrap the value in `""`.
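When parsed by the `json` module, an unquoted `null` becomes `None`, whereas a quoted `"null"` is just a four-letter string. The sketch below shows the difference and one possible way to drop unused arguments. The key names are hypothetical, and the filtering step is an assumption for illustration; the `run_*` scripts may handle `null` differently.

```python
import json

# Hypothetical section; the key names are illustrative only.
section = json.loads(
    '{"optional_model": null, "quoted_null": "null", "num_iters": 5}'
)

# Keep only the arguments that were actually given a value.
provided = {key: value for key, value in section.items() if value is not None}

print(section)
print(provided)
```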

### `non-vanilla hyperparameters`

Most scripts have at least one option called `non-vanilla hyperparameters`.  

`kaldi` has its own internal argument-parsing system where any variable defined in a `shell` script can be set from the command line with `--[variable_name] [value]`.  Parameterizing every single hyperparameter for each pipeline step would make for an unmanageable `config` file, so I opted **not** to include some.  But **any** variable **can still be set** through this `config` `key`:

```
"non_vanilla_train_deltas_hyperparameters": {
    "flag": "-s",
    "value": "--num_iters 5 --beam 20"
...
```
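To make the `--[variable_name] [value]` idea concrete, here is a simplified imitation in `python` of how such overrides could be applied to a set of defaults. It is a sketch of the concept, not `kaldi`'s actual option-parsing code; the function `apply_overrides` and the default values are hypothetical, and rejecting unknown names is an assumption based on the rule that only variables already defined in the script can be set.

```python
import shlex

def apply_overrides(defaults, option_string):
    """Imitate `--[variable_name] [value]` overrides on a dict of defaults.

    A simplified stand-in for kaldi's option parsing, not its actual code.
    """
    settings = dict(defaults)  # copy, so the defaults stay untouched
    tokens = shlex.split(option_string)
    if len(tokens) % 2 != 0:
        raise ValueError("expected '--name value' pairs")
    for name_token, value in zip(tokens[0::2], tokens[1::2]):
        if not name_token.startswith("--"):
            raise ValueError(f"expected an option, got: {name_token}")
        name = name_token[2:]
        if name not in settings:
            raise KeyError(f"unknown variable: {name}")
        settings[name] = value
    return settings

# Hypothetical defaults, standing in for variables set at the top of a script.
defaults = {"num_iters": "35", "beam": "10", "cmd": "run.pl"}

# The "value" string from the config entry shown above.
value = "--num_iters 5 --beam 20"
settings = apply_overrides(defaults, value)
print(settings)
```

Values stay as strings here, as they would in a `shell` script; anything not mentioned in the option string keeps its default.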

## `run_*.sh`

The beginning of each `run_*` script does the following:

 - summarizes the purpose of the script
 - briefly explains the arguments it takes
 - identifies the outputs of the script (usually in the form of new directories and files)

```
head -n30 run_prepare_data.sh
```

Each `run_*.sh` script takes one argument: a `kaldi` `config` file.  The first thing the script does is set the necessary arguments from the appropriate section of the `config` file.
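As a rough picture of that first step, the sketch below parses a `config` and pulls out the section for one pipeline step. It assumes the top-level keys are named after the steps (as the section names `run_prepare_data` and `run_train_phones` suggest); the function `load_section`, the trimmed example `config`, and its inner keys are all hypothetical.

```python
import json

def load_section(config_text, step_name):
    """Parse the config and return only the section for one pipeline step."""
    config = json.loads(config_text)
    if step_name not in config:
        raise KeyError(f"no section named {step_name!r} in config")
    return config[step_name]

# Hypothetical, heavily trimmed config; the real file has more sections and keys.
example_config = """
{
  "run_prepare_data": {"audio_dir": "/abs/path/to/audio"},
  "run_train_phones": {"num_iters": 35}
}
"""

section = load_section(example_config, "run_train_phones")
print(section)
```

Failing loudly on a missing section is a design choice in this sketch: a typo in a step name is easier to diagnose at load time than as a missing argument later.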

## `kaldi` script structure

The `run_*.sh` scripts will all wrap individual `shell` scripts that can be found in the other directories of `egs/INSTRUCTIONAL`. Those directories are explained below.

 - `utils`: utility scripts
 - `local`: scripts particular to the corpus (*i.e.* the `eg`) being used
   - **Note:** in the `INSTRUCTIONAL` `eg`, this distinction between `utils` and `local` is a muddy one since the `INSTRUCTIONAL` `eg` is designed to be corpus/data-agnostic
 - `steps`: scripts focused particularly on steps of the `ASR` pipeline

## `path.sh` and `cmd.sh`

You will also see two `shell` scripts called `path.sh` and `cmd.sh`.

`path.sh` is a simple script that contains the path to the `kaldi` `src` directory (where the `C++` code lives).  When this file is `source`d (`. ./path.sh`) at the beginning of a `shell` script, it allows us to call the `C++` program we want by name alone, **without** having to worry about `absolute` paths.
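The underlying mechanism is ordinary `PATH` lookup: once a directory is on `PATH`, any executable inside it can be found by name. The sketch below demonstrates this with a throwaway directory, assuming a Unix-like system. The name `fake-kaldi-binary` is made up, and the sketch illustrates the general mechanism rather than the exact contents of `path.sh`.

```python
import os
import shutil
import stat
import tempfile

# Create a throwaway directory holding a fake executable.
# "fake-kaldi-binary" is a made-up name, not a real kaldi program.
bin_dir = tempfile.mkdtemp()
fake_binary = os.path.join(bin_dir, "fake-kaldi-binary")
with open(fake_binary, "w") as handle:
    handle.write("#!/bin/sh\necho hello\n")
os.chmod(fake_binary, stat.S_IRWXU)  # make it executable for the owner

# Before: the name alone cannot be resolved.
before = shutil.which("fake-kaldi-binary")

# After: prepend the directory to PATH, then look the name up again.
os.environ["PATH"] = bin_dir + os.pathsep + os.environ.get("PATH", "")
after = shutil.which("fake-kaldi-binary")

print(before)
print(after)

# Clean up the temporary directory.
shutil.rmtree(bin_dir)
```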

`kaldi` allows for different types of parallelization (see more [here](http://kaldi-asr.org/doc/queue.html)).  In our case we will use the simplest form (for one machine) called `run.pl`.  `cmd.sh` houses the default arguments for parallelization.  And in our case it's very simple:

```
cat cmd.sh
```

`10G` should be a safe setting for `memory`, but you can always change the default value in this script to fit your needs.