**Note**: This `notebook` is built for a `python` kernel, so each cell runs `python` by default.  You can put `%%bash` at the beginning of any cell to run `shell` commands instead.

In [None]:
import utils.feature_viz.feature_viz as fv
import plotly.graph_objs as go
import plotly.offline as py
py.init_notebook_mode(connected=True)
import numpy as np

# 4.3: Examining the `MFCC`s

Our `INSTRUCTIONAL` directory has a copy of [this third-party repository](https://github.com/vesis84/kaldi-io-for-python) in `utils/feature_viz`.  `utils/feature_viz/kaldi_io.py` has methods that allow us to read from (and write to...though we won't be using that) `.ark` files.

In [None]:
%%bash
ls utils/feature_viz/kaldi_io_for_python/

And there is a `python` module called `feature_viz.py` that has some methods wrapping `kaldi_io.py` that we'll use below to examine the `MFCC` features we extracted.

## Choosing an audio sample

All of our `mfcc` features are located in the `mfcc` directory.

In [None]:
%%bash
ls mfcc | grep mfcc

You'll notice that the number of file pairs equals the number of parallel jobs we used during feature extraction: for each job index, there is an `.ark` and a `.scp` file.

**Note:** If there is a *particular* audio sample that you'd like to inspect, you'll have to find which `.ark` file it resides in.

For this example, we'll look at an example with some interesting sounds in it.

In [None]:
%%bash
head raw_data/librispeech-transcripts.txt 

First let's find which `.ark` it is in.

In [None]:
%%bash 
for f in mfcc/raw_mfcc_*.scp; do
    match=$(grep "1272-128104-0000" "$f")
    if [[ -z $match ]]; then
        echo "not found in $f"
    else
        echo "found in $f: $match"
    fi
done
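
The same lookup can be sketched in `python`.  Each line of a Kaldi `.scp` file pairs an utterance ID with the `.ark` path and byte offset where its features start.  The two entries and the helper `parse_scp()` below are hypothetical, written only for illustration; they are not part of `feature_viz.py`.

```python
# Hypothetical .scp lines; real entries follow the same
# "<utt-id> <ark-path>:<byte-offset>" layout, but these paths and offsets are made up.
scp_lines = [
    "1272-128104-0000 mfcc/raw_mfcc_train_dir.1.ark:17",
    "1272-128104-0001 mfcc/raw_mfcc_train_dir.1.ark:7650",
]


def parse_scp(lines):
    """Map each utterance ID to an (ark_path, byte_offset) pair."""
    table = {}
    for line in lines:
        utt_id, location = line.strip().split(maxsplit=1)
        # Split on the last colon so that paths containing colons still work.
        ark_path, offset = location.rsplit(":", 1)
        table[utt_id] = (ark_path, int(offset))
    return table


scp_table = parse_scp(scp_lines)
print(scp_table["1272-128104-0000"])
```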

## Reading in features

In [None]:
SAMPLE = "1272-128104-0000"
ARK_SAMPLE = "raw_mfcc_train_dir.1.scp"

`read_in_features()` will read in the features for *all* of the utterances found in our `.ark`.

In [None]:
feats_in = fv.read_in_features("mfcc/{}".format(ARK_SAMPLE))
list(feats_in.keys())[:10]

If we look more closely at our sample, the `value` is a `numpy array` representing the features for each frame in the utterance

In [None]:
feats_in[SAMPLE]

## Understanding the feature shape

The shape of this `array` is `(num_frames x num_features)`

In [None]:
feats_in[SAMPLE].shape

There are two functions in `feature_viz.py` that can also easily extract this information:

In [None]:
fv.get_num_frames(feats_in[SAMPLE]), fv.get_num_features(feats_in[SAMPLE])

If we look back at the `MFCC` configuration file we used to extract these features, we'll see where this shape comes from.

In [None]:
%%bash
cat conf/mfcc_defaults.conf

`--num-ceps` dictates the number of columns our features will have.

And the number of frames depends on `--frame-length` and `--frame-shift` (along with the actual length of the audio of course).  In this case, each `frame` represents `25 ms` of audio, and the "next frame" is shifted `10 ms` to the right, and then consists of the next `25 ms`.  So consecutive frames overlap by `15 ms`.
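
As a rough sketch of that arithmetic, the helpers below count how many complete `25 ms` windows fit into a clip when each window starts `10 ms` after the previous one.  Both function names are made up for this example, and the code assumes that incomplete windows at the end of the audio are dropped rather than padded, so the exact count for a real file may differ by a frame or so depending on the extraction options.

```python
def num_frames(duration_ms, frame_length_ms=25, frame_shift_ms=10):
    """Count the complete frames that fit in the audio (assumes no edge padding)."""
    if duration_ms < frame_length_ms:
        return 0
    return 1 + (duration_ms - frame_length_ms) // frame_shift_ms


def frame_span_ms(index, frame_length_ms=25, frame_shift_ms=10):
    """Return the (start, end) time in ms covered by the frame at `index`."""
    start = index * frame_shift_ms
    return start, start + frame_length_ms


print(num_frames(1000))                     # 98 frames in one second of audio
print(frame_span_ms(0), frame_span_ms(1))   # (0, 25) (10, 35)
```

The two spans printed on the last line share the interval from `10 ms` to `25 ms`, which is the overlap between neighbouring frames.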

Since "1272-128104-0003" is a significantly longer transcript, we can assume that it will have more frames.

In [None]:
%%bash
cat raw_data/librispeech-transcripts.txt | grep -E "1272-128104-000[03]"

In [None]:
fv.get_num_frames(feats_in["1272-128104-0003"])

But, it will have the same number of `features`!

In [None]:
fv.get_num_features(feats_in["1272-128104-0003"])

## Viewing the features

`feature_viz.py` has a method `plot_frames()` that will plot the `MFCC`s for `n` consecutive frames of an audio sample

**Note:** In order to view the plot directly in this notebook, you need to run `py.iplot(output)` where `output` is the returned value from `plot_frames()`.

In [None]:
# the one *required* argument is a numpy array of features for any number of frames
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][60:61]   # [x:x+1] returns only frame x (as a 2-D array)
    )
)

Here you can see a single frame of features (frame `60`) for our audio sample.  If you prefer, you can view it as a `bar` graph by passing the additional argument `mode='bar'` to the function.  This *may* be more intuitive as the `vector` is `discrete` (*i.e.* there is **no value** for `x=3.5`).

In [None]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][60:61],   # [x:x+1] returns only frame x (as a 2-D array)
        mode='bar'
    )
)
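
A note on the slicing used in the cells above: a slice such as `[60:61]` keeps the result two-dimensional (one row of coefficients), whereas plain indexing with `[60]` drops the frame axis.  The array below is only a stand-in for `feats_in[SAMPLE]`, with illustrative sizes.

```python
import numpy as np

# Stand-in for feats_in[SAMPLE]: 100 frames x 13 coefficients (sizes are illustrative).
feats = np.zeros((100, 13))

print(feats[60:61].shape)  # (1, 13): one frame, still 2-D
print(feats[60].shape)     # (13,):   the same frame, but 1-D
print(feats[60:68].shape)  # (8, 13): eight consecutive frames
```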

`plot_frames()` can also plot multiple **consecutive** frames on the same graph.  Below are eight consecutive frames of audio.

**Note:** `plotly` allows you to click "on" and "off" any particular line by clicking on them in the legend.

In [None]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][60:68]
    )
)

You can, again, view these as a `bar` graph.

In [None]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][60:68],
        mode='bar'
    )
)

In this case, it *may* be easier to compare a *particular coefficient* (*i.e.* a single `x` value) across frames.  For example, you can see here that `x=1` is non-negative for all frames in this sequence.

## Including `phone` information

If provided with the information, `plot_frames()` can also label each `frame` with its *predicted* phone.  It is a *predicted* phone because the **alignment** of frames to a transcript is an integral part of the ASR pipeline, and its accuracy depends on the quality of that pipeline.

In our case, we have **not yet** done the steps in the pipeline that generate these `alignment` files (typically called `ali.*.gz`).   But for the sake of these visualizations, the alignments we need are included in `resource_files/feature_viz/all_ali`.  This is an uncompressed (`gzip -d`) concatenation of all of the parallelized outputs, and it contains alignment information for all of the audio in our training subset.

In [None]:
%%bash
ls resource_files/feature_viz

We can use a `C++` function called `ali-to-phones` to inspect these alignments. 

**Note:** Normally we could `source` `path.sh` (`. path.sh`) in order to avoid providing full paths to these C++ functions.  But because we are in a `python` `kernel` and only using `bash` in individual cells, that doesn't work.  So we have to provide full paths.

In [None]:
%%bash
${KALDI_PATH}/src/bin/ali-to-phones

In order to use this function, we also need an acoustic model (again, something we haven't built yet in our pipeline).  A model is provided in `resource_files/feature_viz/model_for_alignments.mdl`

In [None]:
%%bash
# using ark,t:- we are telling the function to output to STDOUT
${KALDI_PATH}/src/bin/ali-to-phones \
    resource_files/feature_viz/model_for_alignments.mdl \
    ark:resource_files/feature_viz/all_ali \
    ark,t:- | grep "1272-128104-0000"

This generates a sequence of indexes representing the sequence of phones for the given utterance.  If we want to see the actual phones these indexes refer to, we need to `pipe` this output to `int2sym.pl`

In [None]:
%%bash
${KALDI_INSTRUCTIONAL_PATH}/utils/int2sym.pl

You'll notice that this method requires a `symtab` (symbol table), which we have already built in `data/lang/phones.txt`

In [None]:
%%bash
head data/lang/phones.txt
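
Conceptually, the translation step is a dictionary lookup: the symbol table holds one `<symbol> <integer-id>` pair per line, and every field after the utterance ID is replaced by its symbol.  The sketch below illustrates that idea only.  The table entries, their IDs, and the helper names are invented for the example and are not taken from `data/lang/phones.txt` or from `int2sym.pl` itself.

```python
# Hypothetical excerpt of a symbol table: one "<symbol> <integer-id>" pair per line.
symtab_lines = ["<eps> 0", "SIL 1", "SIL_B 2", "IH1_B 3", "Z_E 4"]


def load_symtab(lines):
    """Build an integer-id -> symbol lookup from symbol-table lines."""
    id_to_symbol = {}
    for line in lines:
        symbol, idx = line.split()
        id_to_symbol[int(idx)] = symbol
    return id_to_symbol


def int2sym(line, id_to_symbol):
    """Translate every field after the first (the utterance ID), similar to `-f 2-`."""
    utt_id, *ids = line.split()
    return " ".join([utt_id] + [id_to_symbol[int(i)] for i in ids])


table = load_symtab(symtab_lines)
print(int2sym("1272-128104-0000 1 3 4", table))
```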

In [None]:
%%bash
cat raw_data/librispeech-transcripts.txt | grep -E "1272-128104-0000"

In [None]:
%%bash
${KALDI_PATH}/src/bin/ali-to-phones \
    resource_files/feature_viz/model_for_alignments.mdl \
    ark:resource_files/feature_viz/all_ali \
    ark,t:- |\
    ${KALDI_INSTRUCTIONAL_PATH}/utils/int2sym.pl -f 2- data/lang/phones.txt |\
    grep "1272-128104-0000"

But you'll notice that this is far fewer phones than the 584 frames that we know exist for this audio sample.  In this case, the function collapsed each consecutive run of the same phone into a single symbol for easy "viewing".  It shouldn't be surprising that a particular sound might last for longer than one frame (each frame covers `25 ms` and starts just `10 ms` after the previous one).
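
The collapsing described above amounts to grouping consecutive identical labels.  Here is a minimal sketch with `itertools.groupby`, using a short made-up per-frame sequence rather than real alignment output.

```python
from itertools import groupby

# Made-up per-frame labels: one phone label per frame.
per_frame = ["SIL", "SIL", "SIL", "M_B", "M_B",
             "IH1_I", "IH1_I", "IH1_I", "IH1_I", "S_I"]

# Keep one label per run of identical consecutive labels.
collapsed = [phone for phone, _ in groupby(per_frame)]

# Or keep the run lengths too, i.e. how many frames each phone lasted.
run_lengths = [(phone, len(list(run))) for phone, run in groupby(per_frame)]

print(collapsed)    # ['SIL', 'M_B', 'IH1_I', 'S_I']
print(run_lengths)  # [('SIL', 3), ('M_B', 2), ('IH1_I', 4), ('S_I', 1)]
```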

We can add the `--per-frame` argument to get a `phone` for each `frame`.

In [None]:
%%bash
${KALDI_PATH}/src/bin/ali-to-phones \
    --per-frame=true \
    resource_files/feature_viz/model_for_alignments.mdl \
    ark:resource_files/feature_viz/all_ali \
    ark,t:- |\
    ${KALDI_INSTRUCTIONAL_PATH}/utils/int2sym.pl -f 2- data/lang/phones.txt |\
    grep "1272-128104-0000"

And so now we have a full command we can run that will generate this output for each audio sample in our training set.  We will run it below and save it to `resource_files/feature_viz/all_ali_phoned`.

In [None]:
%%bash
${KALDI_PATH}/src/bin/ali-to-phones \
    --per-frame=true \
    resource_files/feature_viz/model_for_alignments.mdl \
    ark:resource_files/feature_viz/all_ali \
    ark,t:- |\
    ${KALDI_INSTRUCTIONAL_PATH}/utils/int2sym.pl -f 2- data/lang/phones.txt > resource_files/feature_viz/all_ali_phoned

In [None]:
%%bash
cat resource_files/feature_viz/all_ali_phoned | grep "1272-128104-0000"

`feature_viz.py` also has a method for reading these alignments into a `<dict>`, keyed by utterance ID.  Each `value` is a `<list>` of phones whose length equals the number of frames in that utterance.

In [None]:
ali_in = fv.read_in_alignments("resource_files/feature_viz/all_ali_phoned")
ali_in[SAMPLE][60:68]

In [None]:
len(ali_in[SAMPLE]) == fv.get_num_frames(feats_in[SAMPLE])
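
For intuition, a reader for this kind of file can be very small: each line holds an utterance ID followed by one phone per frame.  The function below is a hypothetical stand-in written for this example, not the actual implementation of `fv.read_in_alignments()`, and the two input lines are made up.

```python
# Made-up lines in the "<utt-id> <phone> <phone> ..." layout of the per-frame file.
ali_lines = [
    "utt-a SIL SIL M_B IH1_I",
    "utt-b SIL HH_B",
]


def read_alignments(lines):
    """Map each utterance ID to its list of per-frame phone labels."""
    alignments = {}
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            alignments[fields[0]] = fields[1:]
    return alignments


alignments = read_alignments(ali_lines)
print(alignments["utt-a"])
print(len(alignments["utt-a"]))
```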

Now we can provide the relevant `<list>` of phones to `plot_frames()` to label each frame with its *predicted* phone.

In [None]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][60:68],
        phones=ali_in[SAMPLE][60:68]
    )
)

We can now see that these 8 frames *most likely* represent *three* different phones.  

Again, clicking on individual lines in the legend will turn that line "on"/"off" from the plot.

## Including "average" `mfcc` information

`feature_viz.py` also has a function that will calculate the *average* `MFCC` vector for each `phone` in our training subset.

First we need to build a `<dict>` that groups all of the `MFCC` vectors for each phone together.

In [None]:
grouped_dict = fv.get_grouped_phones_dict(
                    feats_dict=feats_in, 
                    ali_dict=ali_in
)
list(grouped_dict.keys())[:5]

Each `value` is a `<list>` of all the examples of that phone in our subset.  In this case, there are 5955 *predicted* `F_B` phones in our subset.

In [None]:
len(grouped_dict["F_B"])
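
The grouping step can be pictured as walking through every utterance, pairing each frame with its aligned phone, and appending the frame to that phone's list.  The following is a sketch under that assumption, with tiny made-up inputs; `group_by_phone()` is a hypothetical name and may differ from what `get_grouped_phones_dict()` does internally.

```python
from collections import defaultdict

import numpy as np

# Tiny made-up inputs: one utterance with three 2-dimensional "feature" frames.
feats = {"utt-a": np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])}
ali = {"utt-a": ["SIL", "F_B", "F_B"]}


def group_by_phone(feats_dict, ali_dict):
    """Collect every frame's feature vector under its aligned phone label."""
    grouped = defaultdict(list)
    for utt_id, frames in feats_dict.items():
        for frame, phone in zip(frames, ali_dict[utt_id]):
            grouped[phone].append(frame)
    return dict(grouped)


grouped = group_by_phone(feats, ali)
print(sorted(grouped.keys()))  # ['F_B', 'SIL']
print(len(grouped["F_B"]))     # 2
```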

We can then run `get_average_mfccs()` to generate the `mean` `MFCC` vector for each phone.  Below is the `mean` `MFCC` vector for all of the `F_B`s seen in our subset.

In [None]:
ave_dict = fv.get_average_mfccs(
    phones_dict=grouped_dict
)
ave_dict["F_B"]
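
Taking the mean over a phone's list of vectors is an element-wise average, one value per coefficient.  A minimal sketch with made-up two-dimensional vectors (the real vectors have one entry per cepstral coefficient, and `get_average_mfccs()` may be implemented differently):

```python
import numpy as np

# Made-up grouped vectors: phone label -> list of per-frame feature vectors.
grouped = {
    "F_B": [np.array([3.0, 4.0]), np.array([5.0, 6.0])],
    "SIL": [np.array([1.0, 2.0])],
}

# axis=0 averages across frames, leaving one mean value per coefficient.
averages = {phone: np.mean(vectors, axis=0) for phone, vectors in grouped.items()}

print(averages["F_B"])  # [4. 5.]
```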

`plot_frames()` can also show you the average vector of each phone that appears in your plot.

In [None]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][60:68],
        phones=ali_in[SAMPLE][60:68],
        average_mfccs_dict=ave_dict
    )
)

You can now compare the particular `MFCC` vectors for a given phone to its average vector.  This *may* be more useful when plotted over the `bar graph` version.

In [None]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][60:68],
        phones=ali_in[SAMPLE][60:68],
        average_mfccs_dict=ave_dict,
        mode='bar'
    )
)

## Case Study of `IH`

Let's look more closely at a particular `phone`.  There are *five* instances of `IH` in our chosen sample.

In [None]:
%%bash
cat raw_data/librispeech-transcripts.txt | grep -E "1272-128104-0000"

Let's find the frames that correspond to them.

**Note:** The phone is `IH1`, where the trailing digit is a stress marker rather than a separate pronunciation: `1` marks primary stress, `0` no stress, and `2` secondary stress.  That is why there is also an `IH0` and an `IH2`.

In [None]:
%%bash
cat data/lang/phones.txt | grep IH

In [None]:
enumerated_phones = list(enumerate(ali_in[SAMPLE]))

list(filter(
    lambda x: "IH" in x[1],
    enumerated_phones)
    )

And we can see that those *five* instances correspond to the following frames:

- 59-62
- 96-98
- 125-131
- 243-245
- 470-473
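
Rather than reading the ranges off the filtered list by eye, the runs can also be located programmatically.  The helper `find_runs()` below is hypothetical (it is not part of `feature_viz.py`) and is shown on a short made-up label sequence; it returns inclusive `(first_frame, last_frame)` pairs for every run whose label contains a given substring.

```python
from itertools import groupby

# Made-up per-frame labels standing in for ali_in[SAMPLE].
phones = ["M_B", "IH1_I", "IH1_I", "S_I", "SIL", "IH1_B", "IH1_B", "IH1_B", "Z_E"]


def find_runs(phones, substring):
    """Return inclusive (first, last) frame indexes of each run matching `substring`."""
    runs = []
    index = 0
    for phone, group in groupby(phones):
        length = len(list(group))
        if substring in phone:
            runs.append((index, index + length - 1))
        index += length
    return runs


print(find_runs(phones, "IH"))  # [(1, 2), (5, 7)]
```

Because the pairs are inclusive, the `stop` value for a `python` slice is the second number plus one.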

In [None]:
ex_1 = {
    "start": 59,
    "stop": 63    # +1 to be inclusive
}

ex_2 = {
    "start": 96,
    "stop": 99    # +1 to be inclusive
}

ex_3 = {
    "start": 125,
    "stop": 132    # +1 to be inclusive
}

ex_4 = {
    "start": 243,
    "stop": 246    # +1 to be inclusive
}

ex_5 = {
    "start": 470,
    "stop": 474    # +1 to be inclusive
}

In [None]:
ali_in[SAMPLE][ex_1["start"]:ex_1["stop"]]

Let's plot each sequence of phones.

### `ex_1`: `IH` in "MISTER"

In [None]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][ex_1["start"]:ex_1["stop"]],
        phones=ali_in[SAMPLE][ex_1["start"]:ex_1["stop"]],
        average_mfccs_dict=ave_dict
    )
)

In [None]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][ex_1["start"]:ex_1["stop"]],
        phones=ali_in[SAMPLE][ex_1["start"]:ex_1["stop"]],
        average_mfccs_dict=ave_dict,
        mode='bar'
    )
)

### `ex_2`: `IH` in "QUILTER"

In [None]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][ex_2["start"]:ex_2["stop"]],
        phones=ali_in[SAMPLE][ex_2["start"]:ex_2["stop"]],
        average_mfccs_dict=ave_dict
    )
)

In [None]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][ex_2["start"]:ex_2["stop"]],
        phones=ali_in[SAMPLE][ex_2["start"]:ex_2["stop"]],
        average_mfccs_dict=ave_dict,
        mode='bar'
    )
)

### `ex_3`: `IH` in "IS"

In [None]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][ex_3["start"]:ex_3["stop"]],
        phones=ali_in[SAMPLE][ex_3["start"]:ex_3["stop"]],
        average_mfccs_dict=ave_dict
    )
)

In [None]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][ex_3["start"]:ex_3["stop"]],
        phones=ali_in[SAMPLE][ex_3["start"]:ex_3["stop"]],
        average_mfccs_dict=ave_dict,
        mode='bar'
    )
)

### `ex_4`: `IH` in "MIDDLE"

In [None]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][ex_4["start"]:ex_4["stop"]],
        phones=ali_in[SAMPLE][ex_4["start"]:ex_4["stop"]],
        average_mfccs_dict=ave_dict
    )
)

In [None]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][ex_4["start"]:ex_4["stop"]],
        phones=ali_in[SAMPLE][ex_4["start"]:ex_4["stop"]],
        average_mfccs_dict=ave_dict,
        mode='bar'
    )
)

### `ex_5`: `IH` in "HIS"

In [None]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][ex_5["start"]:ex_5["stop"]],
        phones=ali_in[SAMPLE][ex_5["start"]:ex_5["stop"]],
        average_mfccs_dict=ave_dict
    )
)

In [None]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][ex_5["start"]:ex_5["stop"]],
        phones=ali_in[SAMPLE][ex_5["start"]:ex_5["stop"]],
        average_mfccs_dict=ave_dict,
        mode='bar'
    )
)

### `ex_3` and `ex_5`

Since `ex_3` and `ex_5` are "IS" and "HIS", respectively, let's plot them together.

In [None]:
ex_3_5_frames = np.vstack(
    (
        feats_in[SAMPLE][ex_3["start"]:ex_3["stop"]],
        feats_in[SAMPLE][ex_5["start"]:ex_5["stop"]]
    )
)
ex_3_5_frames.shape

In [None]:
ex_3_5_phones = ali_in[SAMPLE][ex_3["start"]:ex_3["stop"]] + ali_in[SAMPLE][ex_5["start"]:ex_5["stop"]]
ex_3_5_phones

In [None]:
py.iplot(
    fv.plot_frames(
        frames=ex_3_5_frames,
        phones=ex_3_5_phones,
        average_mfccs_dict=ave_dict,
        mode='bar'
    )
)

It's hard to see any similarity in the individual frames of each `IH`, but we *can* see that the average vector for `IH_B` (when `IH` comes at the **beginning** of a word) and the average vector for `IH_I` (when `IH` comes in the **middle** of a word) are quite similar, which shouldn't be too surprising.

But then the question remains why these individual frames of a particular `IH` differ so much.

### phones in context

Perhaps it's because each of these sequences of `IH` frames occurs in a different "context" (in terms of which phones come **before** and **after** it), and this context has a direct effect on the features.

#### `ex_1` ("MIS..")

Let's re-plot `ex_1` ("MISTER"), but this time with a few frames before and after added for context.

In [None]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][ex_1["start"]-2:ex_1["stop"]+2],
        phones=ali_in[SAMPLE][ex_1["start"]-2:ex_1["stop"]+2],
        average_mfccs_dict=ave_dict
    )
)

Compare frame `3/8` with `4/8` (the first two frames of `IH`).  They are very similar.

Now compare `3/8` to `6/8` (the *first* and *last* frames of `IH`).  Very different.

Now compare `2/8` to `3/8` (the *last* `M` frame with the first `IH` frame).  And compare `6/8` with `7/8` (the *last* `IH` frame and the *first* `S` frame).
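
If you would like a number to go with these visual comparisons, one option (an addition for illustration, not something `feature_viz.py` provides) is the Euclidean distance between consecutive frames: small values mean two neighbouring frames look alike, large values mean the features changed a lot.  The frames below are made up.

```python
import numpy as np

# Made-up frames (rows) with two coefficients each, standing in for a slice of MFCCs.
frames = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
    [3.0, 5.0],
])

# np.diff(axis=0) subtracts each frame from the next one;
# the norm of each difference is the distance between the two frames.
step_distances = np.linalg.norm(np.diff(frames, axis=0), axis=1)

print(step_distances)  # [1. 5.]
```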

#### `ex_2` ("...WIL...")

Compare this sequence (`ex_1`) with `ex_2` ("QUILTER")

In [None]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][ex_2["start"]-2:ex_2["stop"]+2],
        phones=ali_in[SAMPLE][ex_2["start"]-2:ex_2["stop"]+2],
        average_mfccs_dict=ave_dict
    )
)

Notice how much more tightly all these frames line up with each other.  In other words, the impact of the `W` and the `L` on the `IH` frames is much less noticeable.

#### `ex_3` ("...R IS")

This example shows the **last two** frames of the **previous** word, "QUILTER", along with the majority of the frames associated with "IS".

In [None]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][ex_3["start"]-2:ex_3["stop"]+2],
        phones=ali_in[SAMPLE][ex_3["start"]-2:ex_3["stop"]+2]
    )
)

Starting with `frame 1/11` as the *only* visible line, add each next frame one at a time.  Notice how at first, the `IH1_B` frames (`3/11` and `4/11`) are almost *exactly* the same as the `ER0_E` frames.  But over time, they begin to noticeably shift (one could argue) *towards* the `Z_E`.