**Note**: This `notebook` is built for a `python` kernel, so each cell runs `python` by default.  You can put `%%bash` at the beginning of any cell to run `shell` commands instead.

In [1]:
import utils.feature_viz.feature_viz as fv
import plotly.graph_objs as go
import plotly.offline as py
py.init_notebook_mode(connected=True)
import numpy as np

# 4.3: Examining the `MFCC`s

Our `INSTRUCTIONAL` directory has a copy of [this third-party repository](https://github.com/vesis84/kaldi-io-for-python) in `utils/feature_viz`.  `utils/feature_viz/kaldi_io.py` has methods that allow us to read from (and write to...though we won't be using that) `.ark` files.

In [2]:
%%bash
ls utils/feature_viz/kaldi_io_for_python/

__init__.pyc
kaldi_io.pyc


And there is a `python` module called `feature_viz.py` that has some methods wrapping `kaldi_io.py` that we'll use below to examine the `MFCC` features we extracted.

## Choosing an audio sample

All of our `mfcc` features are located in the `mfcc` directory.

In [3]:
%%bash
ls mfcc | grep mfcc

raw_mfcc_test_dir.1.ark
raw_mfcc_test_dir.1.scp
raw_mfcc_test_dir.2.ark
raw_mfcc_test_dir.2.scp
raw_mfcc_test_dir.3.ark
raw_mfcc_test_dir.3.scp
raw_mfcc_test_dir.4.ark
raw_mfcc_test_dir.4.scp
raw_mfcc_train_dir.1.ark
raw_mfcc_train_dir.1.scp
raw_mfcc_train_dir.2.ark
raw_mfcc_train_dir.2.scp
raw_mfcc_train_dir.3.ark
raw_mfcc_train_dir.3.scp
raw_mfcc_train_dir.4.ark
raw_mfcc_train_dir.4.scp


You'll notice that the number of file pairs equals the number of parallel jobs we used during feature extraction: for each job index, there is an `.ark` and a `.scp` file.

**Note:** If there is a *particular* audio sample that you'd like to inspect, you'll have to find which `.ark` file it resides in.

For this example, we'll look at an example with some interesting sounds in it.

In [4]:
%%bash
head raw_data/librispeech-transcripts.txt 

1272-128104-0000 MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
1272-128104-0001 NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER
1272-128104-0002 HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND
1272-128104-0003 HE HAS GRAVE DOUBTS WHETHER SIR FREDERICK LEIGHTON'S WORK IS REALLY GREEK AFTER ALL AND CAN DISCOVER IN IT BUT LITTLE OF ROCKY ITHACA
1272-128104-0004 LINNELL'S PICTURES ARE A SORT OF UP GUARDS AND AT EM PAINTINGS AND MASON'S EXQUISITE IDYLLS ARE AS NATIONAL AS A JINGO POEM MISTER BIRKET FOSTER'S LANDSCAPES SMILE AT ONE MUCH IN THE SAME WAY THAT MISTER CARKER USED TO FLASH HIS TEETH AND MISTER JOHN COLLIER GIVES HIS SITTER A CHEERFUL SLAP ON THE BACK BEFORE HE SAYS LIKE A SHAMPOOER IN A TURKISH BATH NEXT MAN
1272-128104-0005 IT IS OBVIOUSLY UNNECESSARY FOR US TO POINT OUT HOW LUMINOUS THESE CRITICI

First let's find which `.ark` it is in.

In [5]:
%%bash 
for f in `ls mfcc/raw_mfcc_*.scp`; do
    match=$(cat $f | grep "1272-128104-0000")
    if [[ -z $match ]]; then
        echo "not found in $f"
    else
        echo "found in $f: $match"
    fi
done

not found in mfcc/raw_mfcc_test_dir.1.scp
not found in mfcc/raw_mfcc_test_dir.2.scp
not found in mfcc/raw_mfcc_test_dir.3.scp
not found in mfcc/raw_mfcc_test_dir.4.scp
found in mfcc/raw_mfcc_train_dir.1.scp: 1272-128104-0000 /home//kaldi/egs/INSTRUCTIONAL/mfcc/raw_mfcc_train_dir.1.ark:17
not found in mfcc/raw_mfcc_train_dir.2.scp
not found in mfcc/raw_mfcc_train_dir.3.scp
not found in mfcc/raw_mfcc_train_dir.4.scp
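The same lookup can be sketched in `python`.  This is only an illustration, not part of `feature_viz.py`: `find_utterance()` is a hypothetical helper, and it assumes each `.scp` line has the form `<utt_id> <ark_path>:<offset>` as in the output above.

```python
import glob


def find_utterance(scp_paths, utt_id):
    """Return (scp_path, ark_location) for utt_id, or None if no .scp lists it."""
    for scp_path in scp_paths:
        with open(scp_path) as scp_file:
            for line in scp_file:
                # each line is expected to look like "<utt_id> <ark_path>:<offset>"
                parts = line.split(None, 1)
                if len(parts) == 2 and parts[0] == utt_id:
                    return scp_path, parts[1].strip()
    return None


# prints None when no matching .scp files exist in the working directory
print(find_utterance(sorted(glob.glob("mfcc/raw_mfcc_*.scp")), "1272-128104-0000"))
```

Unlike `grep`, this compares the whole utterance ID, so an ID that is a prefix of another ID will not produce a false match.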


## Reading in features

In [6]:
SAMPLE = "1272-128104-0000"
ARK_SAMPLE = "raw_mfcc_train_dir.1.scp"

`read_in_features()` will read in the features for *all* of the utterances listed in our `.scp` file (each entry points into the matching `.ark`).

In [7]:
feats_in = fv.read_in_features("mfcc/{}".format(ARK_SAMPLE))
list(feats_in.keys())[:10]

['2035-147961-0031',
 '2035-147961-0030',
 '2035-147961-0033',
 '2035-147961-0032',
 '2035-147961-0035',
 '2035-147961-0034',
 '2035-147961-0037',
 '2035-147960-0014',
 '2035-147961-0039',
 '2035-147961-0038']

If we look more closely at our sample, the `value` is a `numpy` array holding one row of features for each frame in the utterance.

In [8]:
feats_in[SAMPLE]

array([[ 12.14908981, -21.18790054, -20.31894112, ...,  16.23352814,
          0.05224136, -16.93466187],
       [ 12.14908981, -21.75504494, -22.66297722, ...,  22.01498985,
         -9.08587933,  -1.47463059],
       [ 12.23464489, -19.48646164, -17.97490501, ...,  22.78585243,
         -0.42246622,  -2.26634622],
       ..., 
       [ 11.72131634, -24.59077454,  -8.06653976, ...,  11.60835743,
         -1.01585066,  -2.7061882 ],
       [ 11.63576221, -24.59077454,  -8.06653976, ...,  21.24412918,
         -9.67926407,  -7.3891058 ],
       [ 11.89242554, -22.32219124,  -6.91603661, ...,  16.23352814,
         -7.06837225,  -5.78508282]], dtype=float32)

## Understanding the feature shape

The shape of this `array` is `(num_frames, num_features)`.

In [9]:
feats_in[SAMPLE].shape

(584, 13)

There are two functions in `feature_viz.py` that can also easily extract this information:

In [10]:
fv.get_num_frames(feats_in[SAMPLE]), fv.get_num_features(feats_in[SAMPLE])

(584, 13)

If we look back at the `MFCC` configuration file we used to extract these features, we'll see where this shape comes from.

In [11]:
%%bash
cat conf/mfcc_defaults.conf

--frame-length=25               # frame length in milliseconds
--frame-shift=10                # frame shift in milliseconds
--num-ceps=13                   # number of cepstra in computation (incl. C0)
--num-mel-bins=23               # number of triangular mel-frequency bins
--use-energy=true               # use energy (not C0) in computation
--low-freq=20                   # low cutoff frequency for mel bins
--high-freq=0                   # high cutoff frequency for mel bins
--window-type=povey             # choose "hamming", "hanning", "rectangular"
--snip-edges=true               # only output frames that fit in file
                                # number of frames depends on frame-length
                                        # if false, depends on frame-shift
--dither=1                      # random 1bit of noise added
                                        # ensures no log(0) calculations
--sample-frequency=8000         # sample rate of audio



`--num-ceps` dictates the number of columns our features will have.

And the number of frames depends on `--frame-length` and `--frame-shift` (along with the actual length of the audio of course).  In this case, each `frame` represents `25 ms` of audio, and the "next frame" starts `10 ms` later and again covers `25 ms`.  So consecutive frames overlap by `15 ms`.
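The arithmetic can be sketched as follows.  This is a simplified illustration in milliseconds (the real computation works in samples), it assumes `--snip-edges=true` as in the configuration above, and the `5855 ms` duration is a hypothetical value chosen only because it yields 584 frames.

```python
def num_frames(duration_ms, frame_length_ms=25, frame_shift_ms=10):
    """Count the frames that fit entirely inside the audio (snip-edges=true)."""
    if duration_ms < frame_length_ms:
        return 0
    return 1 + (duration_ms - frame_length_ms) // frame_shift_ms


def frame_span_ms(index, frame_length_ms=25, frame_shift_ms=10):
    """Return the (start, end) time in ms covered by the frame at `index`."""
    start = index * frame_shift_ms
    return start, start + frame_length_ms


print(num_frames(5855))                      # 584
print(frame_span_ms(0), frame_span_ms(1))    # (0, 25) (10, 35)
```

The spans of frames 0 and 1 share the interval from `10 ms` to `25 ms`, which is the `15 ms` overlap.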

Since "1272-128104-0003" is a significantly longer transcript, we can assume that it will have more frames.

In [12]:
cat raw_data/librispeech-transcripts.txt | grep -E "1272-128104-000[03]"

1272-128104-0000 MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
1272-128104-0003 HE HAS GRAVE DOUBTS WHETHER SIR FREDERICK LEIGHTON'S WORK IS REALLY GREEK AFTER ALL AND CAN DISCOVER IN IT BUT LITTLE OF ROCKY ITHACA


In [13]:
fv.get_num_frames(feats_in["1272-128104-0003"])

988

But, it will have the same number of `features`!

In [14]:
fv.get_num_features(feats_in["1272-128104-0003"])

13

## Viewing the features

`feature_viz.py` has a method `plot_frames()` that will plot the `MFCC`s for `n` consecutive frames of an audio sample

**Note:** In order to view the plot directly in this notebook, you need to run `py.iplot(output)` where `output` is the returned value from `plot_frames()`.

In [15]:
# the one *required* argument is a numpy array of features for any number of frames
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][60:61]   # [x:x+1] returns only the frame at index x, as a 2-D array
    )
)

Here you can see a single frame of features (the frame at index 60) for our audio sample.
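The slice `[60:61]` is used instead of the plain index `[60]` because slicing keeps the array two-dimensional.  A minimal sketch with a stand-in array of the same shape as our sample (the zeros are placeholders, not real features):

```python
import numpy as np

# placeholder array with the same shape as feats_in[SAMPLE]
feats = np.zeros((584, 13), dtype=np.float32)

print(feats[60:61].shape)   # (1, 13)  one frame, still (num_frames, num_features)
print(feats[60].shape)      # (13,)    a bare feature vector
print(feats[60:68].shape)   # (8, 13)  eight consecutive frames
```

Presumably `plot_frames()` expects the two-dimensional form, since its argument is described as an array of features "for any number of frames".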

`plot_frames()` can also plot multiple **consecutive** frames on the same graph.  Below are eight consecutive frames of audio.

**Note:** `plotly` allows you to click "on" and "off" any particular line by clicking on them in the legend.

In [16]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][60:68]
    )
)

## Including `phone` information

If provided with the information, `plot_frames()` can also label each `frame` by its *predicted* phone.  It is a *predicted* phone because the **alignment** of frames to a transcript is an integral part of the ASR pipeline, but its accuracy is dependent on the quality of the pipeline.

In our case, we have **not yet** done the steps in the pipeline that generate these `alignment` files (typically called `ali.*.gz`).   But for the sake of these visualizations, the alignments we need are included in `resource_files/feature_viz/all_ali`.  This is an uncompressed (`gzip -d`) concatenation of all of the parallelized outputs, and it contains alignment information for all of the audio in our training subset.

In [17]:
%%bash
ls resource_files/feature_viz

all_ali
all_ali_phoned
model_for_alignments.mdl


We can use a `C++` program called `ali-to-phones` to inspect these alignments.

**Note:** Normally we could `source` `path.sh` (`. path.sh`) in order to avoid providing full paths to these `C++` programs.  But because we are in a `python` `kernel` and only using `bash` in individual cells, that doesn't work.  So we have to provide full paths.

In [24]:
%%bash
${KALDI_PATH}/src/bin/ali-to-phones

/home//kaldi/src/bin/ali-to-phones 

Convert model-level alignments to phone-sequences (in integer, not text, form)
Usage:  ali-to-phones  [options] <model> <alignments-rspecifier> <phone-transcript-wspecifier|ctm-wxfilename>
e.g.: 
 ali-to-phones 1.mdl ark:1.ali ark:-
or:
 ali-to-phones --ctm-output 1.mdl ark:1.ali 1.ctm
See also: show-alignments lattice-align-phones

Options:
  --ctm-output                : If true, output the alignments in ctm format (the confidences will be set to 1) (bool, default = false)
  --frame-shift               : frame shift used to control the times of the ctm output (float, default = 0.01)
  --per-frame                 : If true, write out the frame-level phone alignment (else phone sequence) (bool, default = false)
  --write-lengths             : If true, write the #frames for each phone (different format) (bool, default = false)

Standard options:
  --config                    : Configuration file to read (this option may be repeated) (string, default 

In order to use this function, we also need an acoustic model (again, something we haven't built yet in our pipeline).  A model is provided in `resource_files/feature_viz/model_for_alignments.mdl`

In [25]:
%%bash
# using ark,t:- we are telling the function to output to STDOUT
${KALDI_PATH}/src/bin/ali-to-phones \
    resource_files/feature_viz/model_for_alignments.mdl \
    ark:resource_files/feature_viz/all_ali \
    ark,t:- | grep "1272-128104-0000"

1272-128104-0000 1 178 148 224 232 107 170 268 148 176 232 107 146 275 90 155 30 216 12 224 32 175 30 263 1 90 31 178 148 88 32 175 170 176 24 224 32 275 1 30 184 87 266 159 109 134 176 24 87 230 31 266 100 176 172 32 179 138 148 275 134 48 224 216 32 175 1 


/home//kaldi/src/bin/ali-to-phones resource_files/feature_viz/model_for_alignments.mdl ark:resource_files/feature_viz/all_ali ark,t:- 
LOG (ali-to-phones[5.2.191~1-48be1]:main():ali-to-phones.cc:134) Done 2703 utterances.


This generates a sequence of indexes representing the sequence of phones for the given utterance.  If we want to see the actual phones these indexes refer to, we need to `pipe` this output to `int2sym.pl`

In [28]:
%%bash
${KALDI_INSTRUCTIONAL_PATH}/utils/int2sym.pl

Usage: sym2int.pl [options] symtab [input] > output
options: [-f (<field>|<field_start>-<field-end>)]
e.g.: -f 2, or -f 3-4


You'll notice that this script requires a `symtab` (symbol table), which we have already built in `data/lang/phones.txt`.

In [23]:
%%bash
head data/lang/phones.txt

<eps> 0
SIL 1
SIL_B 2
SIL_E 3
SIL_I 4
SIL_S 5
AA0_B 6
AA0_E 7
AA0_I 8
AA0_S 9
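Conceptually, the mapping from integers to symbols is a simple table lookup.  The sketch below is an illustration in `python`, not the actual `int2sym.pl` logic; `load_symbol_table()` and `ints_to_symbols()` are hypothetical helpers, and the table contains only a few entries taken from `data/lang/phones.txt`.

```python
def load_symbol_table(lines):
    """Build {index: symbol} from lines of the form '<symbol> <index>'."""
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            symbol, index = parts
            table[int(index)] = symbol
    return table


def ints_to_symbols(ids, table):
    """Replace each integer phone ID with its symbol."""
    return [table[i] for i in ids]


# a small excerpt of phones.txt, enough for the lookup below
symtab_lines = ["<eps> 0", "SIL 1", "IH1_B 146", "IH1_I 148"]
table = load_symbol_table(symtab_lines)
print(ints_to_symbols([1, 148, 146], table))   # ['SIL', 'IH1_I', 'IH1_B']
```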


In [30]:
%%bash
${KALDI_PATH}/src/bin/ali-to-phones \
    resource_files/feature_viz/model_for_alignments.mdl \
    ark:resource_files/feature_viz/all_ali \
    ark,t:- |\
    ${KALDI_INSTRUCTIONAL_PATH}/utils/int2sym.pl -f 2- data/lang/phones.txt |\
    grep "1272-128104-0000"

1272-128104-0000 SIL M_B IH1_I S_I T_I ER0_E K_B W_I IH1_I L_I T_I ER0_E IH1_B Z_E DH_B IY0_E AH0_B P_I AA1_I S_I AH0_I L_E AH0_B V_E SIL DH_B AH0_E M_B IH1_I D_I AH0_I L_E K_B L_I AE1_I S_I AH0_I Z_E SIL AH0_B N_I D_E W_B IY1_E ER0_S G_B L_I AE1_I D_E T_B AH0_E W_B EH1_I L_I K_I AH0_I M_E HH_B IH1_I Z_E G_B AO1_I S_I P_I AH0_I L_E SIL 


/home//kaldi/src/bin/ali-to-phones resource_files/feature_viz/model_for_alignments.mdl ark:resource_files/feature_viz/all_ali ark,t:- 
LOG (ali-to-phones[5.2.191~1-48be1]:main():ali-to-phones.cc:134) Done 2703 utterances.


But you'll notice that this is far fewer phones than the 584 frames that we know exist for this audio sample.  In this case, the function collapsed each consecutive run of the same phone into a single entry for easy "viewing".  It shouldn't be surprising that a particular sound might last for longer than one frame (each frame covers `25 ms` of audio but starts only `10 ms` after the previous one).
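The relationship between a frame-level sequence and the collapsed sequence can be sketched with `itertools.groupby`.  The phone list here is a made-up toy example, and `collapse_runs()` is a hypothetical helper rather than anything provided by the toolkit.

```python
from itertools import groupby


def collapse_runs(per_frame_phones):
    """Collapse consecutive repeats into (phone, run_length) pairs."""
    return [(phone, len(list(run))) for phone, run in groupby(per_frame_phones)]


# toy frame-level labels, not taken from the real alignment
per_frame = ["M_B", "M_B", "IH1_I", "IH1_I", "IH1_I", "S_I"]
runs = collapse_runs(per_frame)
print(runs)                            # [('M_B', 2), ('IH1_I', 3), ('S_I', 1)]
print([phone for phone, _ in runs])    # ['M_B', 'IH1_I', 'S_I']
```

The run lengths always sum to the number of frames, which is a useful sanity check when comparing the two kinds of output.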

We can add the `--per-frame` argument to get a `phone` for each `frame`.

In [31]:
%%bash
${KALDI_PATH}/src/bin/ali-to-phones \
    --per-frame=true \
    resource_files/feature_viz/model_for_alignments.mdl \
    ark:resource_files/feature_viz/all_ali \
    ark,t:- |\
    ${KALDI_INSTRUCTIONAL_PATH}/utils/int2sym.pl -f 2- data/lang/phones.txt |\
    grep "1272-128104-0000"

1272-128104-0000 SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL M_B M_B M_B M_B M_B M_B M_B M_B IH1_I IH1_I IH1_I IH1_I IH1_I S_I S_I S_I T_I T_I T_I T_I T_I T_I ER0_E ER0_E ER0_E ER0_E ER0_E K_B K_B K_B K_B K_B K_B K_B K_B K_B K_B K_B K_B K_B W_I W_I W_I W_I W_I IH1_I IH1_I IH1_I L_I L_I L_I L_I T_I T_I T_I T_I T_I ER0_E ER0_E ER0_E ER0_E ER0_E ER0_E ER0_E ER0_E ER0_E ER0_E ER0_E ER0_E ER0_E ER0_E ER0_E ER0_E ER0_E IH1_B IH1_B IH1_B IH1_B IH1_B IH1_B IH1_B IH1_B IH1_B IH1_B Z_E Z_E Z_E Z_E Z_E DH_B DH_B DH_B DH_B DH_B DH_B DH_B IY0_E IY0_E IY0_E IY0_E IY0_E AH0_B AH0_B AH0_B AH0_B AH0_B AH0_B AH0_B AH0_B AH0_B P_I P_I P_I P_I P_I P_I P_I P_I P_I P_I P_I P_I P_I AA1_I AA1_I AA1_I AA1_I AA1_I AA1_I AA1_I S_I S_I S_I S_I S_I S_I S_I S_I S_I S_I S_I S_I AH0_I AH0_I AH0_I AH0_I AH0_I AH0_I L_E L_E L_E L_E L_E L_E L_E L_E L_E L_E L_E L_

/home//kaldi/src/bin/ali-to-phones --per-frame=true resource_files/feature_viz/model_for_alignments.mdl ark:resource_files/feature_viz/all_ali ark,t:- 
LOG (ali-to-phones[5.2.191~1-48be1]:main():ali-to-phones.cc:134) Done 2703 utterances.


And so now we have a full command we can run that will generate this output for each audio sample in our training set.  We will run it below and save it to `resource_files/feature_viz/all_ali_phoned`.

In [32]:
%%bash
${KALDI_PATH}/src/bin/ali-to-phones \
    --per-frame=true \
    resource_files/feature_viz/model_for_alignments.mdl \
    ark:resource_files/feature_viz/all_ali \
    ark,t:- |\
    ${KALDI_INSTRUCTIONAL_PATH}/utils/int2sym.pl -f 2- data/lang/phones.txt > resource_files/feature_viz/all_ali_phoned

/home//kaldi/src/bin/ali-to-phones --per-frame=true resource_files/feature_viz/model_for_alignments.mdl ark:resource_files/feature_viz/all_ali ark,t:- 
LOG (ali-to-phones[5.2.191~1-48be1]:main():ali-to-phones.cc:134) Done 2703 utterances.


In [33]:
%%bash
cat resource_files/feature_viz/all_ali_phoned | grep "1272-128104-0000"

1272-128104-0000 SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL SIL M_B M_B M_B M_B M_B M_B M_B M_B IH1_I IH1_I IH1_I IH1_I IH1_I S_I S_I S_I T_I T_I T_I T_I T_I T_I ER0_E ER0_E ER0_E ER0_E ER0_E K_B K_B K_B K_B K_B K_B K_B K_B K_B K_B K_B K_B K_B W_I W_I W_I W_I W_I IH1_I IH1_I IH1_I L_I L_I L_I L_I T_I T_I T_I T_I T_I ER0_E ER0_E ER0_E ER0_E ER0_E ER0_E ER0_E ER0_E ER0_E ER0_E ER0_E ER0_E ER0_E ER0_E ER0_E ER0_E ER0_E IH1_B IH1_B IH1_B IH1_B IH1_B IH1_B IH1_B IH1_B IH1_B IH1_B Z_E Z_E Z_E Z_E Z_E DH_B DH_B DH_B DH_B DH_B DH_B DH_B IY0_E IY0_E IY0_E IY0_E IY0_E AH0_B AH0_B AH0_B AH0_B AH0_B AH0_B AH0_B AH0_B AH0_B P_I P_I P_I P_I P_I P_I P_I P_I P_I P_I P_I P_I P_I AA1_I AA1_I AA1_I AA1_I AA1_I AA1_I AA1_I S_I S_I S_I S_I S_I S_I S_I S_I S_I S_I S_I S_I AH0_I AH0_I AH0_I AH0_I AH0_I AH0_I L_E L_E L_E L_E L_E L_E L_E L_E L_E L_E L_E L_

`feature_viz.py` also has a method for reading these alignments into a `<dict>` keyed by utterance ID.  Each `value` is a `<list>` of phones whose length equals the number of frames in that utterance.

In [35]:
ali_in = fv.read_in_alignments("resource_files/feature_viz/all_ali_phoned")
ali_in[SAMPLE][60:68]

['IH1_I', 'IH1_I', 'IH1_I', 'IH1_I', 'S_I', 'S_I', 'S_I', 'T_I']

In [36]:
len(ali_in[SAMPLE]) == fv.get_num_frames(feats_in[SAMPLE])

True
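For reference, the text format of `all_ali_phoned` is simple enough to parse by hand.  The following is a sketch of one way to do it, not the actual implementation of `read_in_alignments()`; `parse_alignments()` and the sample lines are hypothetical.

```python
def parse_alignments(lines):
    """Map each utterance ID to its list of per-frame phones."""
    alignments = {}
    for line in lines:
        tokens = line.split()
        if tokens:
            # first token is the utterance ID, the rest are per-frame phones
            alignments[tokens[0]] = tokens[1:]
    return alignments


# toy lines in the same "<utt_id> <phone> <phone> ..." layout
sample_lines = ["utt-a SIL SIL M_B IH1_I", "utt-b SIL AH0_B"]
ali = parse_alignments(sample_lines)
print(ali["utt-a"])        # ['SIL', 'SIL', 'M_B', 'IH1_I']
print(len(ali["utt-b"]))   # 2
```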

Now we can provide the relevant `<list>` of phones to `plot_frames()` to label each frame with its *predicted* phone.

In [37]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][60:68],
        phones=ali_in[SAMPLE][60:68]
    )
)

We can now see that these 8 frames *most likely* represent *three* different phones.  

Again, clicking on individual lines in the legend will turn that line "on"/"off" from the plot.

## Including "average" `mfcc` information

`feature_viz.py` also has a function that will calculate the *average* `MFCC` vector for each `phone` in our training subset.

First we need to build a `<dict>` that groups all of the `MFCC` vectors for each phone together.

In [38]:
grouped_dict = fv.get_grouped_phones_dict(
                    feats_dict=feats_in, 
                    ali_dict=ali_in
)
list(grouped_dict.keys())[:10]

['IY2_I',
 'AA2_B',
 'EH1_I',
 'EH1_B',
 'SIL',
 'L_B',
 'IY1_I',
 'AY0_I',
 'F_B',
 'AY0_B']

Each `value` is a `<list>` of all the frames labeled with that phone in our subset.  In this case, there are 5955 frames *predicted* to be `F_B` in our subset.

In [39]:
len(grouped_dict["F_B"])

5955

We can then run `get_average_mfccs()` to generate the `mean` `MFCC` vector for each phone.  Below is the `mean` `MFCC` vector for all of the `F_B`s seen in our subset.

In [40]:
ave_dict = fv.get_average_mfccs(
    phones_dict=grouped_dict
)
ave_dict["F_B"]

array([ 15.63681507, -21.85861588,  -3.27805853,  -1.40559375,
       -10.15818024,  -8.24401665,  -3.1355145 ,   1.58451748,
        -3.61500621,  -0.10631811,  -4.7382741 ,  -2.49622965,  -3.75922012], dtype=float32)
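The grouping and averaging steps can be sketched with plain `numpy`.  This is an illustration of the idea, not the actual code behind `get_grouped_phones_dict()` and `get_average_mfccs()`; the helper names and the toy two-dimensional features are assumptions made for the example.

```python
from collections import defaultdict

import numpy as np


def group_frames_by_phone(feats_dict, ali_dict):
    """Collect every frame's feature vector under its aligned phone label."""
    grouped = defaultdict(list)
    for utt_id, frames in feats_dict.items():
        phones = ali_dict.get(utt_id)
        if phones is None or len(phones) != len(frames):
            continue  # skip utterances without a matching alignment
        for frame, phone in zip(frames, phones):
            grouped[phone].append(frame)
    return dict(grouped)


def average_by_phone(grouped):
    """Compute the mean feature vector for each phone."""
    return {phone: np.mean(np.stack(frames), axis=0)
            for phone, frames in grouped.items()}


# toy data: one utterance, three frames, two features per frame
toy_feats = {"utt-a": np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])}
toy_ali = {"utt-a": ["SIL", "F_B", "F_B"]}
grouped = group_frames_by_phone(toy_feats, toy_ali)
print(len(grouped["F_B"]))                  # 2
print(average_by_phone(grouped)["F_B"])     # [4. 5.]
```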

`plot_frames()` can also show you the average vector of each phone that appears in your plot.

In [41]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][60:68],
        phones=ali_in[SAMPLE][60:68],
        average_mfccs_dict=ave_dict
    )
)

You can now compare the particular `MFCC` vectors for a given phone to its average vector.

## Case Study of `IH`

Let's look more closely at a particular `phone`.  There are *five* instances of `IH` in our chosen sample.

In [42]:
cat raw_data/librispeech-transcripts.txt | grep -E "1272-128104-0000"

1272-128104-0000 MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL


Let's find the frames that correspond to them.

**Note:** The phone is `IH1`.  The trailing digit is the ARPAbet lexical stress marker, so `IH1` is `IH` with primary stress.  There are also `IH0` (unstressed) and `IH2` (secondary stress) in our phone set.

In [43]:
cat data/lang/phones.txt | grep IH

IH0_B 142
IH0_E 143
IH0_I 144
IH0_S 145
IH1_B 146
IH1_E 147
IH1_I 148
IH1_S 149
IH2_B 150
IH2_E 151
IH2_I 152
IH2_S 153


In [44]:
enumerated_phones = list(enumerate(ali_in[SAMPLE]))

list(filter(
    lambda x: "IH" in x[1],
    enumerated_phones)
    )

[(59, 'IH1_I'),
 (60, 'IH1_I'),
 (61, 'IH1_I'),
 (62, 'IH1_I'),
 (63, 'IH1_I'),
 (96, 'IH1_I'),
 (97, 'IH1_I'),
 (98, 'IH1_I'),
 (125, 'IH1_B'),
 (126, 'IH1_B'),
 (127, 'IH1_B'),
 (128, 'IH1_B'),
 (129, 'IH1_B'),
 (130, 'IH1_B'),
 (131, 'IH1_B'),
 (132, 'IH1_B'),
 (133, 'IH1_B'),
 (134, 'IH1_B'),
 (242, 'IH1_I'),
 (243, 'IH1_I'),
 (244, 'IH1_I'),
 (469, 'IH1_I'),
 (470, 'IH1_I'),
 (471, 'IH1_I'),
 (472, 'IH1_I'),
 (473, 'IH1_I'),
 (474, 'IH1_I'),
 (475, 'IH1_I'),
 (476, 'IH1_I')]

And we can see that those *five* instances correspond to the following frames:

- 59-63
- 96-98
- 125-134
- 242-244
- 469-476
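Rather than reading these boundaries off by eye, the runs can be found programmatically.  The sketch below uses a hypothetical helper, `find_phone_runs()`, on a toy phone list; `stop` is exclusive so that it can be used directly in a slice.

```python
from itertools import groupby


def find_phone_runs(phones, substring):
    """Return start/stop (stop exclusive) for each run of a matching phone."""
    runs = []
    index = 0
    for phone, group in groupby(phones):
        length = len(list(group))
        if substring in phone:
            runs.append({"phone": phone, "start": index, "stop": index + length})
        index += length
    return runs


# toy frame-level labels, not taken from the real alignment
toy = ["SIL", "M_B", "IH1_I", "IH1_I", "S_I", "IH1_B", "IH1_B", "IH1_B", "Z_E"]
print(find_phone_runs(toy, "IH"))
# [{'phone': 'IH1_I', 'start': 2, 'stop': 4}, {'phone': 'IH1_B', 'start': 5, 'stop': 8}]
```

One caveat: two adjacent instances of exactly the same phone label would be merged into a single run, since frame labels alone cannot tell them apart.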

In [45]:
ex_1 = {
    "start": 59,
    "stop": 64    # +1 to be inclusive
}

ex_2 = {
    "start": 96,
    "stop": 99    # +1 to be inclusive
}

ex_3 = {
    "start": 125,
    "stop": 135    # +1 to be inclusive
}

ex_4 = {
    "start": 242,
    "stop": 245    # +1 to be inclusive
}

ex_5 = {
    "start": 469,
    "stop": 477    # +1 to be inclusive
}

In [46]:
ali_in[SAMPLE][ex_1["start"]:ex_1["stop"]]

['IH1_I', 'IH1_I', 'IH1_I', 'IH1_I', 'IH1_I']

Let's plot each sequence of phones.

### `ex_1`: "MISTER"

In [47]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][ex_1["start"]:ex_1["stop"]],
        phones=ali_in[SAMPLE][ex_1["start"]:ex_1["stop"]],
        average_mfccs_dict=ave_dict
    )
)

### `ex_2`: "QUILTER"

In [48]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][ex_2["start"]:ex_2["stop"]],
        phones=ali_in[SAMPLE][ex_2["start"]:ex_2["stop"]],
        average_mfccs_dict=ave_dict
    )
)

### `ex_3`: "IS"

In [49]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][ex_3["start"]:ex_3["stop"]],
        phones=ali_in[SAMPLE][ex_3["start"]:ex_3["stop"]],
        average_mfccs_dict=ave_dict
    )
)

### `ex_4`: "MIDDLE"

In [50]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][ex_4["start"]:ex_4["stop"]],
        phones=ali_in[SAMPLE][ex_4["start"]:ex_4["stop"]],
        average_mfccs_dict=ave_dict
    )
)

### `ex_5`: "HIS"

In [51]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][ex_5["start"]:ex_5["stop"]],
        phones=ali_in[SAMPLE][ex_5["start"]:ex_5["stop"]],
        average_mfccs_dict=ave_dict
    )
)

### `ex_3` and `ex_5`

Since `ex_3` and `ex_5` are "IS" and "HIS", respectively, let's plot them together.

In [52]:
ex_3_5_frames = np.vstack(
    (
        feats_in[SAMPLE][ex_3["start"]:ex_3["stop"]],
        feats_in[SAMPLE][ex_5["start"]:ex_5["stop"]]
    )
)
ex_3_5_frames.shape

(18, 13)

In [53]:
ex_3_5_phones = ali_in[SAMPLE][ex_3["start"]:ex_3["stop"]] + ali_in[SAMPLE][ex_5["start"]:ex_5["stop"]]
ex_3_5_phones

['IH1_B',
 'IH1_B',
 'IH1_B',
 'IH1_B',
 'IH1_B',
 'IH1_B',
 'IH1_B',
 'IH1_B',
 'IH1_B',
 'IH1_B',
 'IH1_I',
 'IH1_I',
 'IH1_I',
 'IH1_I',
 'IH1_I',
 'IH1_I',
 'IH1_I',
 'IH1_I']

In [54]:
py.iplot(
    fv.plot_frames(
        frames=ex_3_5_frames,
        phones=ex_3_5_phones,
        average_mfccs_dict=ave_dict
    )
)

It's hard to see any similarity in the individual frames of each `IH`, but we *can* see that the average vector for `IH1_B` (when `IH` comes at the **beginning** of a word) and the average vector for `IH1_I` (when `IH` comes in the **middle** of a word) are quite similar.

But then the question remains why these individual frames of a particular `IH` differ so much.

### phones in context

Each of these sequences of `IH` phones comes in a different "context" (in terms of which phones came before and after it).  And this has a direct effect on the features.

#### `ex_1` ("MIS..")

Let's re-plot `ex_1` ("MISTER"), but this time with a few frames before and after added for context.

In [55]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][ex_1["start"]-2:ex_1["stop"]+2],
        phones=ali_in[SAMPLE][ex_1["start"]-2:ex_1["stop"]+2],
        average_mfccs_dict=ave_dict
    )
)

Compare frame `3/9` with `4/9` (the first two frames of `IH`).  They are very similar.

Now compare `3/9` to `7/9` (the *first* and *last* frames of `IH`).  Very different.

Now compare `2/9` to `3/9` (the *last* `M` frame with the first `IH` frame).  And compare `7/9` with `8/9` (the *last* `IH` frame and the *first* `S` frame).
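If you'd like a number to go with these visual comparisons, one option is the Euclidean distance between consecutive frames.  This is just one possible measure, sketched here on toy values; `consecutive_distances()` is a hypothetical helper, not part of `feature_viz.py`.

```python
import numpy as np


def consecutive_distances(frames):
    """Euclidean distance between each frame and the one that follows it."""
    frames = np.asarray(frames, dtype=float)
    return np.linalg.norm(np.diff(frames, axis=0), axis=1)


# toy frames with two features each, not real MFCCs
toy_frames = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 5.0]])
print(consecutive_distances(toy_frames))   # [1. 5.]
```

Applied to the nine frames plotted above, this would return eight distances, one for each pair of neighboring frames.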

#### `ex_2` ("...WIL...")

Compare this sequence (`ex_1`) with `ex_2` ("QUILTER")

In [56]:
py.iplot(
    fv.plot_frames(
        frames=feats_in[SAMPLE][ex_2["start"]-2:ex_2["stop"]+2],
        phones=ali_in[SAMPLE][ex_2["start"]-2:ex_2["stop"]+2],
        average_mfccs_dict=ave_dict
    )
)

Notice how much more tightly all these frames line up with each other.