# Model training - full set

In the previous notebook we experimented with different model architectures, hyperparameters and forms of preprocessing/augmentation. In this one we will test the most promising solutions on the entire set.

First we will have to tackle the fact that our main set is **currently unbalanced**, in that we have:

   1. A lot more examples in the _unknown_ category than other categories for train, cv & test sets.
   2. A lot fewer examples in the _silence_ category than other categories for train, cv & test sets.

In the _sample_ set we had a balanced mix. There's also another challenge that comes into play as we move from sample to main set - some of the examples in the full set are mislabelled. 

Once we balance, preprocess and persist our final data set we will move on to tuning our models to that data. We will then rewrite our most promising architecture in TensorFlow.

In [1]:
# first make sure we're in the parent directory of our data/sample folders.
!pwd

/home/paperspace/tensorflow_speech_recognition


## Import
We'll need a couple of additional libraries so let's import them.

In [2]:
# filter out warnings
import warnings
warnings.filterwarnings('ignore') 

In [3]:
import bcolz
import glob
import librosa
import matplotlib.pyplot as plt
import numpy as np
import os
import pickle
import random
import tensorflow
import time

# utils
from importlib import reload
import utils; reload(utils)

# Keras (the version bundled with TensorFlow)
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense, BatchNormalization, Dropout, Convolution1D, Conv1D, Conv2D, Input
from tensorflow.python.keras.layers import MaxPooling1D, MaxPooling2D, Flatten, SimpleRNN, GRU, ConvLSTM2D
from tensorflow.python.keras.layers import LSTM, Activation, GlobalMaxPool1D
from tensorflow.python.keras.optimizers import Adam
from tensorflow.python.keras.models import Model, load_model

# F1 and accuracy score metric
from sklearn.metrics import f1_score, accuracy_score
from sklearn.ensemble import RandomForestClassifier

## Prepare data
First we have to prepare our full dataset: 

1. Deal with mislabelled data
2. Balance the number of examples per category
3. Preprocess and persist

In [4]:
path_to_main = "data/main"

### Mislabelled examples
Kaggle is great in that its challenges resemble real-life problems. In my experience the work of a data scientist or machine learning engineer often has a lot to do with cleaning the data and making sure that the pipeline for getting more clean data is reliable.

Below is a list of all the 39 mislabelled examples from our cross-validation set.

In [5]:
mislabelled_cv_paths = ["cv/down/bdee441c_nohash_3.wav",
                        "cv/go/1bc45db9_nohash_0.wav",
                        "cv/go/1bc45db9_nohash_1.wav",
                        "cv/go/7fd25f7c_nohash_4.wav",
                        "cv/go/a6d586b7_nohash_2.wav",
                        "cv/go/d9462202_nohash_2.wav",
                        "cv/go/dbb40d24_nohash_0.wav",
                        "cv/go/dbb40d24_nohash_1.wav",
                        "cv/go/dbb40d24_nohash_2.wav",
                        "cv/go/dbb40d24_nohash_3.wav",
                        "cv/go/dbb40d24_nohash_4.wav",
                        "cv/go/dbb40d24_nohash_5.wav",
                        "cv/left/c842b5e4_nohash_0.wav",
                        "cv/left/dbb40d24_nohash_1.wav",
                        "cv/left/dbb40d24_nohash_2.wav",
                        "cv/left/dbb40d24_nohash_3.wav",
                        "cv/left/dbb40d24_nohash_4.wav",
                        "cv/left/dbb40d24_nohash_5.wav",
                        "cv/no/7c1d8533_nohash_3.wav",
                        "cv/no/dbb40d24_nohash_4.wav",
                        "cv/off/5fadb538_nohash_0.wav",
                        "cv/off/5fadb538_nohash_1.wav",
                        "cv/off/5fadb538_nohash_2.wav",
                        "cv/off/5fadb538_nohash_3.wav",
                        "cv/off/5fadb538_nohash_4.wav",
                        "cv/on/7c1d8533_nohash_2.wav",
                        "cv/on/7c1d8533_nohash_3.wav",
                        "cv/on/7fd25f7c_nohash_3.wav",
                        "cv/on/099d52ad_nohash_3.wav",
                        "cv/on/794cdfc5_nohash_0.wav",
                        "cv/on/a6d586b7_nohash_4.wav",
                        "cv/on/d197e3ae_nohash_2.wav",
                        "cv/right/9d32f10a_nohash_0.wav",
                        "cv/right/264f471d_nohash_4.wav",
                        "cv/right/439c84f4_nohash_0.wav",
                        "cv/right/a6d586b7_nohash_1.wav",
                        "cv/stop/7fd25f7c_nohash_1.wav",
                        "cv/stop/264f471d_nohash_1.wav",
                        "cv/stop/d9462202_nohash_0.wav"]

I wanted to find all mislabelled examples in one of our subsets to be able to gauge the scale of the problem. Let's see how many examples we have in our CV set in general.

In [5]:
# we'll need a list of all category folder names
categories_to_predict = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go", "silence", "unknown"]

In [7]:
# grab all .wav paths for CV
path_to_cv = os.path.join(path_to_main, "cv")
cv_wavs = []

for category in categories_to_predict:
    path_to_category = os.path.join(path_to_cv, category)
    category_files = utils.grab_wavs(path_to_category)
    cv_wavs.extend(category_files)

In [8]:
# scale of the problem (the list above holds the 39 mislabelled paths)
print("{:.2f}% of all CV examples are mislabelled".format(len(mislabelled_cv_paths) * 100 / len(cv_wavs)))

0.57% of all CV examples are mislabelled


Good, less than 1% of our samples are mislabelled. That's important to remember if we try to estimate perfect human performance on the dataset we were provided, but it's not a deal breaker for now.

Let's listen to some of the mislabelled examples.

In [9]:
# here we have a person saying "one" instead of "on"
utils.display_audio(os.path.join(path_to_main, "cv", "on", "7c1d8533_nohash_3.wav"))

In [10]:
# here instead of "on" we get background noise
utils.display_audio(os.path.join(path_to_main, "cv", "on", "a6d586b7_nohash_4.wav"))

In [11]:
# here a person doesn't manage to finish the word "right" before the wav cuts them off
utils.display_audio(os.path.join(path_to_main, "cv", "right", "439c84f4_nohash_0.wav"))

In [12]:
# here the word is unintelligible
utils.display_audio(os.path.join(path_to_main, "cv", "stop", "dbb40d24_nohash_0.wav"))

The majority of the mislabelled examples are silences where the person wasn't able to finish the utterance in time. Knowing this, we could expect our models to incorrectly predict the silence category.

Having tracked this in the CV set allows us to potentially remove all the mislabelled examples from it. Let's keep both the cleaned and uncleaned versions for now. The code below is Linux-specific; for Windows-compatible code, switch the separators in the mislabelled_cv_paths list.

In [13]:
# show a wav from entire CV set
cv_wavs[0]

'data/main/cv/yes/c4cfbe43_nohash_1.wav'

We have to turn our paths from the mislabelled_cv_paths list to match the above, and then remove them.

In [14]:
mislabelled_cv_paths = ["data/main/" + p for p in mislabelled_cv_paths]
mislabelled_cv_paths[0]

'data/main/cv/down/bdee441c_nohash_3.wav'

In [15]:
# keep only correctly labelled wavs
cv_wavs_cleaned = []
for wav_path in cv_wavs:
    if wav_path not in mislabelled_cv_paths:
        cv_wavs_cleaned.append(wav_path)

In [16]:
len(cv_wavs)

6850

In [17]:
len(cv_wavs_cleaned)

6811
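
The filtering step above compares raw path strings, which is why the separators matter on Windows. As a minimal sketch (not the notebook's own code), the comparison can be made separator-independent by normalising both sides with `os.path.normpath` and using a set for fast lookups. The function name `remove_mislabelled` and the toy paths are made up for illustration.

```python
import os

def remove_mislabelled(all_paths, mislabelled_paths):
    """Return the paths from all_paths that are not listed as mislabelled."""
    # normpath rewrites separators to the current platform's convention,
    # so on Windows "cv/go/x.wav" and "cv\\go\\x.wav" compare as equal
    mislabelled = {os.path.normpath(p) for p in mislabelled_paths}
    return [p for p in all_paths if os.path.normpath(p) not in mislabelled]

# toy paths, for illustration only
toy_cv_wavs = ["data/main/cv/go/aaa_nohash_0.wav",
               "data/main/cv/go/bbb_nohash_0.wav",
               "data/main/cv/yes/ccc_nohash_1.wav"]
toy_mislabelled = ["data/main/cv/go/bbb_nohash_0.wav"]

toy_cleaned = remove_mislabelled(toy_cv_wavs, toy_mislabelled)
print(len(toy_cv_wavs), len(toy_cleaned))
```

With the real lists this would be called as `remove_mislabelled(cv_wavs, mislabelled_cv_paths)`.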

### Balance the dataset
Let's see how many examples we currently have, per subset & per category.

In [45]:
subsets = ["train", "test", "cv"]

In [48]:
# let's use unix commands to see the imbalance
for subset in subsets:
    print(subset)
    path_to_subset = os.path.join(path_to_main, subset)
    for category in categories_to_predict:
        print(category, end="\t")
        path_to_category = os.path.join(path_to_subset, category)
        !ls $path_to_category | wc
    print()

train
yes	   1860    1860   40921
no	   1853    1853   40766
up	   1843    1843   40546
down	   1842    1842   40524
left	   1839    1839   40458
right	   1852    1852   40744
on	   1864    1864   41008
off	   1839    1839   40458
stop	   1885    1885   41470
go	   1861    1861   40942
silence	    294     294    6035
unknown	  32550   32550  881635

test
yes	    256     256    5632
no	    252     252    5544
up	    272     272    5984
down	    253     253    5566
left	    267     267    5874
right	    259     259    5698
on	    246     246    5412
off	    262     262    5764
stop	    249     249    5478
go	    251     251    5522
silence	     52      52    1063
unknown	   4268    4268  115594

cv
yes	    261     261    5742
no	    270     270    5940
up	    260     260    5720
down	    264     264    5808
left	    247     247    5434
right	    256     256    5632
on	    257     257    5654
off	    256     256    5632
stop	    246     246    5412
go	    260     260    5720
silence	     
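
The `ls | wc` loop above only works where those Unix commands are available. A possible portable alternative, sketched here under the assumption that the folders follow the `<root>/<subset>/<category>/` layout used in this notebook, counts the `.wav` files with `os.listdir`. The helper names `count_wavs` and `print_counts` are hypothetical.

```python
import os

def count_wavs(root, subsets, categories):
    """Count .wav files per (subset, category) under root/<subset>/<category>/."""
    counts = {}
    for subset in subsets:
        for category in categories:
            folder = os.path.join(root, subset, category)
            if os.path.isdir(folder):
                n = sum(1 for name in os.listdir(folder)
                        if name.lower().endswith(".wav"))
            else:
                n = 0  # a missing folder counts as zero examples
            counts[(subset, category)] = n
    return counts

def print_counts(counts, subsets, categories):
    """Print one block per subset, one 'category<TAB>count' line per category."""
    for subset in subsets:
        print(subset)
        for category in categories:
            print("{}\t{}".format(category, counts[(subset, category)]))
        print()
```

In this notebook the call would look like `print_counts(count_wavs(path_to_main, subsets, categories_to_predict), subsets, categories_to_predict)`.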

The simplest solution is to only use a random, limited number of the *unknown* examples and use our data augmentation techniques to create more *silence* samples. Let's listen to some of the silences first, to know whether we're creating different, but valid examples. 

Remember that the silence category is also supposed to cover background noises; it is by no means "quiet".

In [56]:
# let's grab our silences
utils.display_audio(os.path.join(path_to_cv, "silence", "dude_miaowing_5.wav"))

In [57]:
utils.display_audio(os.path.join(path_to_cv, "silence", "doing_the_dishes_73.wav"))

In [58]:
utils.display_audio(os.path.join(path_to_cv, "silence", "white_noise_30.wav"))

There doesn't seem to be that much diversity in our background noise samples. This is a potential challenge. Our techniques of adding white noise aren't very sophisticated and might result in very uniform examples. The models might in turn learn to latch on to those simple, repeating characteristics, which do not really convey the idea of a "silence" - as distinguished from words being spoken.

Let's grab the silence examples from the main cv set and apply the addition of white-noise, stretching and shifting (for increased randomness) to balance the category.
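
The `utils.augment_with_*` helpers are not shown in this notebook, so the following is only a rough sketch of what two of the steps could look like on a raw list of samples: adding uniformly distributed white noise and shifting the signal with wrap-around. The function names, the noise distribution and the wrap-around behaviour are all assumptions. Time stretching is left out because it needs proper resampling (librosa's `librosa.effects.time_stretch` is one option).

```python
import random

def add_white_noise(samples, factor, rng):
    """Add independent uniform noise in [-factor, factor] to every sample."""
    return [s + rng.uniform(-factor, factor) for s in samples]

def shift(samples, offset):
    """Rotate the signal to the right by `offset` samples, wrapping around."""
    if not samples:
        return []
    offset %= len(samples)
    return samples[-offset:] + samples[:-offset] if offset else list(samples)

# a seeded generator keeps the augmentation reproducible
rng = random.Random(12345678)

# toy "clip" of five samples, for illustration only
clip = [0.0, 10.0, 20.0, 30.0, 40.0]
noisy = add_white_noise(clip, factor=5.0, rng=rng)
shifted = shift(noisy, offset=2)
print(len(shifted) == len(clip))
```

Both steps keep the clip length unchanged, which matters because every example is later read into a fixed number of columns.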

In [61]:
cv_silences = [wav for wav in cv_wavs if "silence" in wav]
print(cv_silences[:5])

# confirm number
print(len(cv_silences))

['data/main/cv/silence/dude_miaowing_46.wav', 'data/main/cv/silence/dude_miaowing_39.wav', 'data/main/cv/silence/white_noise_55.wav', 'data/main/cv/silence/dude_miaowing_5.wav', 'data/main/cv/silence/white_noise_9.wav']
52


In [98]:
# define how many we need to add (for cv & test we need approximately 200, for train 1500)
to_add = 200

In [129]:
# add them to a separate directory, for cleanliness
!mkdir tmp

In [100]:
random.seed(12345678)

for i in range(to_add):
    print(i, end=" ")
    # grab a random silence
    wav_file = random.choice(cv_silences)
    
    # prepare outputs
    output_white_noise_file = os.path.join("tmp", "{}_whitenoise.wav".format(i))
    output_shift_file = os.path.join("tmp", "{}_shift.wav".format(i))
    output_stretch_file =  os.path.join("tmp", "{}_final.wav".format(i))

    # random white noise
    # within reasonable bounds and a constant seed
    white_noise_factor = random.uniform(1, 100)
    utils.augment_with_white_noise(wav_file, output_white_noise_file, white_noise_factor)

    # shifting (default factor)
    utils.augment_with_shift(output_white_noise_file, output_shift_file)

    # stretching
    stretch_factor = random.uniform(0.1, 3)
    utils.augment_with_stretch(output_shift_file, output_stretch_file, stretch_factor)
    
    # remove the previous 2 files
    !rm $output_shift_file
    !rm $output_white_noise_file

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 

In [101]:
!ls tmp | wc

    200     200    2690


Let's listen to the new silences we've created, to check if they're acceptably varied.

In [110]:
utils.display_audio(os.path.join("tmp", "1_final.wav"))

In [111]:
utils.display_audio(os.path.join("tmp", "2_final.wav"))

In [112]:
utils.display_audio(os.path.join("tmp", "3_final.wav"))

Compared to our source they seem varied enough. Let's move them back to the appropriate folder.

In [117]:
tmp_path = os.path.join("tmp", "*.wav")
path_to_cv_silence = os.path.join(path_to_cv, "silence")
!mv $tmp_path $path_to_cv_silence

Let's check if we got the desired effect.

In [119]:
!ls $path_to_cv_silence | wc

    252     252    3759


#### Repeat for test set

In [123]:
path_to_main

'data/main'

In [124]:
test_silences = utils.grab_wavs(os.path.join(path_to_main, "test", "silence"))
test_silences[:10]

['data/main/test/silence/dude_miaowing_9.wav',
 'data/main/test/silence/dude_miaowing_33.wav',
 'data/main/test/silence/white_noise_50.wav',
 'data/main/test/silence/white_noise_59.wav',
 'data/main/test/silence/running_tap_18.wav',
 'data/main/test/silence/white_noise_47.wav',
 'data/main/test/silence/white_noise_43.wav',
 'data/main/test/silence/exercise_bike_8.wav',
 'data/main/test/silence/pink_noise_2.wav',
 'data/main/test/silence/dude_miaowing_42.wav']

In [137]:
def balance_silences(paths_to_silences, to_be_added, subset_name, show_progress_in_tens=False):
    """
    Take a subset of the main dataset and add the appropriate amount of new silence examples.
    """
    
    for i in range(to_be_added):
        
        # sanity progress checker
        if show_progress_in_tens:
            if i % 10 == 0:
                print(i, end=" ")
        else:
            print(i, end=" ")
        
        # grab a random silence
        wav_file = random.choice(paths_to_silences)

        # prepare outputs
        output_white_noise_file = os.path.join("tmp", "{}_whitenoise.wav".format(i))
        output_shift_file = os.path.join("tmp", "{}_shift.wav".format(i))
        output_stretch_file =  os.path.join("tmp", "{}_final.wav".format(i))

        # random white noise
        # within reasonable bounds and a constant seed
        white_noise_factor = random.uniform(1, 100)
        utils.augment_with_white_noise(wav_file, output_white_noise_file, white_noise_factor)

        # shifting (default factor)
        utils.augment_with_shift(output_white_noise_file, output_shift_file)

        # stretching
        stretch_factor = random.uniform(0.1, 3)
        utils.augment_with_stretch(output_shift_file, output_stretch_file, stretch_factor)

        # remove the previous 2 files
        !rm $output_shift_file
        !rm $output_white_noise_file
        
    # move the created files to appropriate folder
    tmp_path = os.path.join("tmp", "*.wav")
    target_path = os.path.join(path_to_main, subset_name, "silence")
    !mv $tmp_path $target_path
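
The final `!mv` step relies on a Unix shell. A hedged sketch of the same move done with the standard library follows; `move_wavs` is a hypothetical helper, and the overwrite check is an extra safeguard that the shell version does not have (the generated files are always named `0_final.wav`, `1_final.wav`, ..., so running the augmentation twice for the same subset would otherwise replace earlier files).

```python
import glob
import os
import shutil

def move_wavs(source_dir, target_dir):
    """Move every .wav file from source_dir into target_dir; return the count."""
    os.makedirs(target_dir, exist_ok=True)
    moved = 0
    for wav_path in sorted(glob.glob(os.path.join(source_dir, "*.wav"))):
        destination = os.path.join(target_dir, os.path.basename(wav_path))
        if os.path.exists(destination):
            # refuse to silently overwrite a file created by an earlier run
            raise FileExistsError(destination)
        shutil.move(wav_path, destination)
        moved += 1
    return moved
```

Inside `balance_silences` this could stand in for the last three lines, for example `move_wavs("tmp", os.path.join(path_to_main, subset_name, "silence"))`.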

In [132]:
# call the function on the test set
balance_silences(test_silences, to_be_added=200, subset_name="test")

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 

In [134]:
# check
target_path = os.path.join(path_to_main, "test", "silence")
!ls $target_path | wc

    252     252    3753


#### Repeat for train set

In [136]:
train_silences = utils.grab_wavs(os.path.join(path_to_main, "train", "silence"))
len(train_silences)

294

In [138]:
# call the function on the train set
balance_silences(train_silences, to_be_added=1500, subset_name="train", show_progress_in_tens=True)

0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 450 460 470 480 490 500 510 520 530 540 550 560 570 580 590 600 610 620 630 640 650 660 670 680 690 700 710 720 730 740 750 760 770 780 790 800 810 820 830 840 850 860 870 880 890 900 910 920 930 940 950 960 970 980 990 1000 1010 1020 1030 1040 1050 1060 1070 1080 1090 1100 1110 1120 1130 1140 1150 1160 1170 1180 1190 1200 1210 1220 1230 1240 1250 1260 1270 1280 1290 1300 1310 1320 1330 1340 1350 1360 1370 1380 1390 1400 1410 1420 1430 1440 1450 1460 1470 1480 1490 

In [139]:
# check
target_path = os.path.join(path_to_main, "train", "silence")
!ls $target_path | wc

   1794    1794   27425


Finally let's confirm with a print out of all our example counts, per subset and category.

In [140]:
# let's use the same unix commands to confirm the new counts
for subset in subsets:
    print(subset)
    path_to_subset = os.path.join(path_to_main, subset)
    for category in categories_to_predict:
        print(category, end="\t")
        path_to_category = os.path.join(path_to_subset, category)
        !ls $path_to_category | wc
    print()

train
yes	   1860    1860   40921
no	   1853    1853   40766
up	   1843    1843   40546
down	   1842    1842   40524
left	   1839    1839   40458
right	   1852    1852   40744
on	   1864    1864   41008
off	   1839    1839   40458
stop	   1885    1885   41470
go	   1861    1861   40942
silence	   1794    1794   27425
unknown	  32550   32550  881635

test
yes	    256     256    5632
no	    252     252    5544
up	    272     272    5984
down	    253     253    5566
left	    267     267    5874
right	    259     259    5698
on	    246     246    5412
off	    262     262    5764
stop	    249     249    5478
go	    251     251    5522
silence	    252     252    3753
unknown	   4268    4268  115594

cv
yes	    261     261    5742
no	    270     270    5940
up	    260     260    5720
down	    264     264    5808
left	    247     247    5434
right	    256     256    5632
on	    257     257    5654
off	    256     256    5632
stop	    246     246    5412
go	    260     260    5720
silence	    2

#### Clean up
Remove the tmp directory.

In [141]:
# remove the temporary directory and all its contents
!rm -r tmp

Great, now our silences are balanced. We don't have to remove the over-represented *unknown* example files - instead we'll just remove as many as we need from a list of paths.

#### Create list of paths to .wav files
Per subset, per category. We have to repeat this for our CV set (and remove the mislabelled examples) because we've added the new silences.

In [13]:
# first grab the cv set
path_to_cv = os.path.join(path_to_main, "cv")
main_cv_wavs = []

for category in categories_to_predict:
    path_to_category = os.path.join(path_to_cv, category)
    category_files = utils.grab_wavs(path_to_category)
    
    # we use extend instead of append to add all elements from the iterable
    main_cv_wavs.extend(category_files)
    
print("How many CV samples?: ", len(main_cv_wavs))

How many CV samples?:  7050


In [189]:
# remove the mislabelled examples again
main_cv_wavs_cleaned = []
for wav_path in main_cv_wavs:
    if wav_path not in mislabelled_cv_paths:
        main_cv_wavs_cleaned.append(wav_path)
        
main_cv_wavs = main_cv_wavs_cleaned
print("How many CV samples after removing the mislabelled ones?: ", len(main_cv_wavs))

How many CV samples after removing the mislabelled ones?:  7011


In [190]:
main_cv_balanced = []

# and finally remove the over-represented unknown samples randomly
random.seed(1234567)

# shuffle works in-place (this randomizes the order of unknown samples)
random.shuffle(main_cv_wavs)

# we need to remove 3960 examples for CV, 4000 for Test, and 30,700 for Train
to_remove = 3960

# find an unknown example
for example in main_cv_wavs:
    
    # remove the example if it's unknown and we still need to remove some
    if to_remove > 0 and "unknown" in example:
        to_remove -= 1
    
    # add to balanced
    else:
        main_cv_balanced.append(example)

main_cv_wavs = main_cv_balanced
print("How many CV samples after removing the over-represented unknown?: ", len(main_cv_wavs))

How many CV samples after removing the over-represented unknown?:  3051
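
The same shuffle-and-drop logic is repeated for each subset, so it could be wrapped in one function. The sketch below is an illustration rather than the notebook's code: `category_of` and `drop_random` are hypothetical names, a private `random.Random(seed)` replaces the global seed, and the category is read from the parent folder name instead of a substring test on the whole path.

```python
import os
import random

def category_of(wav_path):
    """Category = name of the folder that directly contains the file."""
    return os.path.basename(os.path.dirname(wav_path))

def drop_random(paths, category, to_remove, seed):
    """Return a shuffled copy of paths with up to `to_remove` examples
    of `category` dropped."""
    rng = random.Random(seed)
    shuffled = list(paths)   # leave the caller's list untouched
    rng.shuffle(shuffled)
    kept = []
    for path in shuffled:
        if to_remove > 0 and category_of(path) == category:
            to_remove -= 1   # skip this over-represented example
        else:
            kept.append(path)
    return kept

# toy paths, for illustration only: 10 unknown and 3 yes examples
toy_paths = ["data/main/cv/unknown/u{}.wav".format(i) for i in range(10)]
toy_paths += ["data/main/cv/yes/y{}.wav".format(i) for i in range(3)]

toy_balanced = drop_random(toy_paths, "unknown", to_remove=7, seed=1234567)
print(len(toy_balanced))
```

Because the shuffle uses a different generator than the global `random.seed(1234567)` call, this sketch would not reproduce the exact selection made in the cells here.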


In [191]:
# confirm
remaining_unknowns_count = 0
for e in main_cv_wavs:
    if "unknown" in e:
        print(e)
        remaining_unknowns_count += 1
remaining_unknowns_count

data/main/cv/unknown/f2dd248e_nohash_0_one.wav
data/main/cv/unknown/c6ee87a7_nohash_0_bed.wav
data/main/cv/unknown/56eb74ae_nohash_3_six.wav
data/main/cv/unknown/bdee441c_nohash_1_four.wav
data/main/cv/unknown/c4e1f6e0_nohash_0_bed.wav
data/main/cv/unknown/b1426003_nohash_0_four.wav
data/main/cv/unknown/6071a214_nohash_0_five.wav
data/main/cv/unknown/ae927455_nohash_0_eight.wav
data/main/cv/unknown/3ca784ec_nohash_0_four.wav
data/main/cv/unknown/d874a786_nohash_1_seven.wav
data/main/cv/unknown/105e72bb_nohash_0_dog.wav
data/main/cv/unknown/2643992f_nohash_0_two.wav
data/main/cv/unknown/50f55535_nohash_0_cat.wav
data/main/cv/unknown/90804775_nohash_0_four.wav
data/main/cv/unknown/57cb3575_nohash_0_sheila.wav
data/main/cv/unknown/7fd25f7c_nohash_2_seven.wav
data/main/cv/unknown/a8cf01bc_nohash_2_nine.wav
data/main/cv/unknown/bfbd0e6b_nohash_2_one.wav
data/main/cv/unknown/9cde5de8_nohash_1_wow.wav
data/main/cv/unknown/22aa3665_nohash_0_wow.wav
data/main/cv/unknown/ae927455_nohash_1_four.w

261

Repeat for the **test** subset.

In [192]:
# first grab the test set
path_to_test = os.path.join(path_to_main, "test")
main_test_wavs = []

for category in categories_to_predict:
    path_to_category = os.path.join(path_to_test, category)
    category_files = utils.grab_wavs(path_to_category)
    
    # we use extend instead of append to add all elements from the iterable
    main_test_wavs.extend(category_files)
    
print("How many test samples?: ", len(main_test_wavs))

How many test samples?:  7087


In [193]:
main_test_balanced = []

random.seed(1234567)
random.shuffle(main_test_wavs)

# we need to remove 3960 examples for CV, 4000 for Test, and 30,700 for Train
to_remove = 4000

# find an unknown example
for example in main_test_wavs:
    
    # remove the example if it's unknown and we still need to remove some
    if to_remove > 0 and "unknown" in example:
        to_remove -= 1
    
    # add to balanced
    else:
        main_test_balanced.append(example)

main_test_wavs = main_test_balanced
print("How many test samples after removing the over-represented unknown?: ", len(main_test_wavs))

How many test samples after removing the over-represented unknown?:  3087


Repeat for the **train** subset.

In [194]:
# first grab the train set
path_to_train = os.path.join(path_to_main, "train")
main_train_wavs = []

for category in categories_to_predict:
    path_to_category = os.path.join(path_to_train, category)
    category_files = utils.grab_wavs(path_to_category)
    
    # we use extend instead of append to add all elements from the iterable
    main_train_wavs.extend(category_files)
    
print("How many train samples?: ", len(main_train_wavs))

How many train samples?:  52882


In [195]:
main_train_balanced = []

random.seed(1234567)
random.shuffle(main_train_wavs)

# we need to remove 3960 examples for CV, 4000 for Test, and 30,700 for Train
to_remove = 30700

# find an unknown example
for example in main_train_wavs:
    
    # remove the example if it's unknown and we still need to remove some
    if to_remove > 0 and "unknown" in example:
        to_remove -= 1
    
    # add to balanced
    else:
        main_train_balanced.append(example)

main_train_wavs = main_train_balanced
print("How many train samples after removing the over-represented unknown?: ", len(main_train_wavs))

How many train samples after removing the over-represented unknown?:  22182


#### Persist the selected paths to subset examples

In [211]:
with open(os.path.join(path_to_main, "main_cv_paths"), "wb") as f:
    pickle.dump(main_cv_wavs, f)

In [212]:
with open(os.path.join(path_to_main,"main_test_paths"), "wb") as f:
    pickle.dump(main_test_wavs, f)

In [213]:
with open(os.path.join(path_to_main, "main_train_paths"), "wb") as f:
    pickle.dump(main_train_wavs, f)

In [217]:
# sanity check
main_train_wavs[:5]

['data/main/train/up/cc71bada_nohash_0.wav',
 'data/main/train/silence/dude_miaowing_44.wav',
 'data/main/train/off/c1d39ce8_nohash_8.wav',
 'data/main/train/down/151bfb79_nohash_0.wav',
 'data/main/train/go/c2aeb59d_nohash_0.wav']

Reload if needed.

In [14]:
with open(os.path.join(path_to_main, "main_cv_paths"), "rb") as f:
    main_cv = pickle.load(f)

In [15]:
with open(os.path.join(path_to_main, "main_test_paths"), "rb") as f:
    main_test = pickle.load(f)

In [16]:
with open(os.path.join(path_to_main, "main_train_paths"), "rb") as f:
    main_train = pickle.load(f)

In [17]:
main_train[:5]

['data/main/train/up/cc71bada_nohash_0.wav',
 'data/main/train/silence/dude_miaowing_44.wav',
 'data/main/train/off/c1d39ce8_nohash_8.wav',
 'data/main/train/down/151bfb79_nohash_0.wav',
 'data/main/train/go/c2aeb59d_nohash_0.wav']

### Separate into X and y and persist
Both in the raw wav format and in the preprocessed form that proved effective during the sample experiments.

Let's start with the **y** for our CV set.

In [14]:
print("CV target (y)")

# figure out the dimensions
rows = len(main_cv)
columns = len(categories_to_predict)
dimensions = (rows, columns)
print("Target dimensions: {}".format(dimensions))

# get the y
cv_y = utils.get_y(main_cv, 13, categories_to_predict)
print("Received shape: {}".format(cv_y.shape))

CV target (y)
Target dimensions: (3051, 12)
Received shape: (3051, 12)
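
`utils.get_y` is not shown here, so the following is only a sketch of how a one-hot target row could be derived from a path. It assumes the category is the name of the folder that contains the file. (The integers passed to `utils.get_y`, 13, 15 and 16, happen to equal the lengths of `"data/main/cv/"`, `"data/main/test/"` and `"data/main/train/"`, so they may be character offsets to the category name, but that is an inference.) The name `one_hot_from_path` is hypothetical.

```python
import os

CATEGORIES = ["yes", "no", "up", "down", "left", "right",
              "on", "off", "stop", "go", "silence", "unknown"]

def one_hot_from_path(wav_path, categories):
    """Build a one-hot row whose 1.0 marks the file's parent folder name."""
    label = os.path.basename(os.path.dirname(wav_path))
    row = [0.0] * len(categories)
    row[categories.index(label)] = 1.0   # raises ValueError for unknown folders
    return row

toy_paths = ["data/main/train/up/cc71bada_nohash_0.wav",
             "data/main/train/silence/dude_miaowing_44.wav"]
toy_y = [one_hot_from_path(p, CATEGORIES) for p in toy_paths]
print(toy_y[0])
```

This returns plain lists; the notebook's `get_y` evidently returns an array with a `.shape`, so a conversion step would still be needed.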


Repeat for Test and Train sets.

In [15]:
print("Test target (y)")

# figure out the dimensions
rows = len(main_test)
columns = len(categories_to_predict)
dimensions = (rows, columns)
print("Target dimensions: {}".format(dimensions))

# get the y
test_y = utils.get_y(main_test, 15, categories_to_predict)
print("Received shape: {}".format(test_y.shape))

Test target (y)
Target dimensions: (3087, 12)
Received shape: (3087, 12)


In [16]:
print("Train target (y)")

# figure out the dimensions
rows = len(main_train)
columns = len(categories_to_predict)
dimensions = (rows, columns)
print("Target dimensions: {}".format(dimensions))

# get the y
train_y = utils.get_y(main_train, 16, categories_to_predict)
print("Received shape: {}".format(train_y.shape))

Train target (y)
Target dimensions: (22182, 12)
Received shape: (22182, 12)


And now the **X**.

In [None]:
# get the desired number of columns (n)
n = len(utils.get_wav_info(main_train[0])[1])
n

In [19]:
%%time
cv_X = utils.get_X(main_cv, n)
print("CV: ",cv_X.shape)

CV:  (3051, 16000)
CPU times: user 1min 52s, sys: 2min 42s, total: 4min 35s
Wall time: 4min 40s


In [20]:
%%time
test_X = utils.get_X(main_test, n)
print("Test: ",test_X.shape)

Test:  (3087, 16000)
CPU times: user 1min 53s, sys: 2min 43s, total: 4min 37s
Wall time: 4min 42s


The Train set contains over 20K examples, a size that can slow down some of the processing we want to apply to the data, so let's split it into smaller chunks, each of a similar size to the CV & Test sets.

In [17]:
# find a good splitting point
split_point = len(main_train) // 7
print(split_point)

3168


In [26]:
# split the Train set
main_train_subsets = []
for i in range(7):
    
    # grab a slice
    part_of_train = main_train[split_point * i:split_point * (i + 1)]
    
    # confirm we grabbed the right length
    print(len(part_of_train))
          
    # append
    main_train_subsets.append(part_of_train)

# sanity check
main_train_subsets[0][0]

3168
3168
3168
3168
3168
3168
3168


'data/main/train/up/cc71bada_nohash_0.wav'
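
One detail worth noting: 22182 is not divisible by 7 (7 × 3168 = 22176), so slicing with the floor-divided `split_point` leaves the last 6 paths out of the chunks, while `train_y` was built from all 22182 paths. If those examples should be kept, a split that spreads the remainder over the first chunks is one option. The sketch below uses a hypothetical helper, `split_into_chunks`, and integers stand in for the paths.

```python
def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous parts whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # the first `extra` chunks take one additional item each
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

# integers stand in for the 22182 train paths
toy_paths = list(range(22182))
chunks = split_into_chunks(toy_paths, 7)
print([len(c) for c in chunks])
```

The alternative is to keep the equal-sized chunks and trim the targets to match, for example with `train_y[:split_point * 7]`.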

In [29]:
# get the X for each part of the Train subsets
train_Xs = []
for i, paths in enumerate(main_train_subsets):
    
    # extract
    train_X_subset = utils.get_X(paths, n)
    print("Train subset {}: {}".format(i+1, train_X_subset.shape))
    
    # append
    train_Xs.append(train_X_subset)

Train subset 1: (3168, 16000)
Train subset 2: (3168, 16000)
Train subset 3: (3168, 16000)
Train subset 4: (3168, 16000)
Train subset 5: (3168, 16000)
Train subset 6: (3168, 16000)
Train subset 7: (3168, 16000)


Let's confirm through a quick sanity check that everything fits together.

In [35]:
# our first path should be an "up", which is the 3rd category to predict
print(main_train[0])
print(train_y[0])
print(train_Xs[0][0])

data/main/train/up/cc71bada_nohash_0.wav
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ -44. -119. -156. ...  160.  138.  135.]


Persist in a separate directory.

In [5]:
# define the bcolz array saving functions
def bcolz_save(fname, arr):
    c = bcolz.carray(arr, rootdir=fname, mode='w')
    c.flush()

def bcolz_load(fname):
    return bcolz.open(fname)[:]

In [6]:
!pwd

/home/paperspace/tensorflow_speech_recognition


In [7]:
path_to_main_preprocessed = os.path.join(path_to_main, "preprocessed")
path_to_main_preprocessed

'data/main/preprocessed'

In [25]:
# create the directory if it's not there already
# !mkdir $path_to_main_preprocessed

#### Persist the y

In [26]:
# save the y
bcolz_save(path_to_main_preprocessed + os.path.sep + "train_y" + ".bc", train_y)
bcolz_save(path_to_main_preprocessed + os.path.sep + "cv_y" + ".bc", cv_y)
bcolz_save(path_to_main_preprocessed + os.path.sep + "test_y" + ".bc", test_y)

#### Persist the X

In [None]:
# raw data
bcolz_save(path_to_main_preprocessed + os.path.sep + "cv_X" + ".bc", cv_X)
bcolz_save(path_to_main_preprocessed + os.path.sep + "test_X" + ".bc", test_X)

In [39]:
# in chunks for the Train set
for i, train_subset in enumerate(train_Xs):
    bcolz_save(path_to_main_preprocessed + os.path.sep + "train_X" + str(i + 1) +".bc", train_subset)

In [41]:
!ls $path_to_main_preprocessed

cv_X.bc  test_X.bc  train_X1.bc  train_X3.bc  train_X5.bc  train_X7.bc
cv_y.bc  test_y.bc  train_X2.bc  train_X4.bc  train_X6.bc  train_y.bc


### Preprocess & persist
Preprocess the raw .wav files in ways that have proven useful during experiments on the sample (2D MFCCs and Tempograms).

#### MFCCs 2D

In [21]:
%%time
# CV and Test set first
cv_X_mfccs_2D = utils.get_X_mfccs(main_cv, shape=(100, 32), mean=False)
test_X_mfccs_2D = utils.get_X_mfccs(main_test, shape=(100, 32), mean=False)

print("CV mfccs: ", cv_X_mfccs_2D.shape)
print("Test mfccs: ",test_X_mfccs_2D.shape)

CV mfccs:  (3051, 100, 32)
Test mfccs:  (3087, 100, 32)
CPU times: user 4min, sys: 21min 55s, total: 25min 56s
Wall time: 4min 54s


In [46]:
# get the X MFCCs 2D for each part of the Train subsets
train_Xs_mfccs_2D = []
for i, paths in enumerate(main_train_subsets):
    
    # extract
    train_X_mfccs_2D_subset = utils.get_X_mfccs(paths, shape=(100, 32), mean=False)
    print("Train MFCCs 2D subset {}: {}".format(i+1, train_X_mfccs_2D_subset.shape))
    
    # append
    train_Xs_mfccs_2D.append(train_X_mfccs_2D_subset)

Train MFCCs 2D subset 1: (3168, 100, 32)
Train MFCCs 2D subset 2: (3168, 100, 32)
Train MFCCs 2D subset 3: (3168, 100, 32)
Train MFCCs 2D subset 4: (3168, 100, 32)
Train MFCCs 2D subset 5: (3168, 100, 32)
Train MFCCs 2D subset 6: (3168, 100, 32)
Train MFCCs 2D subset 7: (3168, 100, 32)


In [52]:
# persist the 2D MFCCs (Train)
for i, train_X_mfccs_2D_subset in enumerate(train_Xs_mfccs_2D):
    bcolz_save(path_to_main_preprocessed + os.path.sep + "train_X_MFCCs_2D_" + str(i + 1) +".bc", train_X_mfccs_2D_subset)

In [24]:
# persist the 2D MFCCs (Test & CV)
bcolz_save(path_to_main_preprocessed + os.path.sep + "cv_X_MFCCs_2d" + ".bc", cv_X_mfccs_2D)
bcolz_save(path_to_main_preprocessed + os.path.sep + "test_X_MFCCs_2d" + ".bc", test_X_mfccs_2D)

In [53]:
!ls $path_to_main_preprocessed

cv_X.bc      train_X2.bc  train_X7.bc		 train_X_MFCCs_2D_5.bc
cv_y.bc      train_X3.bc  train_X_MFCCs_2D_1.bc  train_X_MFCCs_2D_6.bc
test_X.bc    train_X4.bc  train_X_MFCCs_2D_2.bc  train_X_MFCCs_2D_7.bc
test_y.bc    train_X5.bc  train_X_MFCCs_2D_3.bc  train_y.bc
train_X1.bc  train_X6.bc  train_X_MFCCs_2D_4.bc


#### Tempograms

In [22]:
%%time
# CV first
cv_X_tempogram = utils.get_X_tempogram(main_cv)
print("CV tempogram: ", cv_X_tempogram.shape)

CV tempogram:  (3051, 384, 32)
CPU times: user 8min 2s, sys: 32min, total: 40min 2s
Wall time: 8min 30s


In [23]:
%%time
# Test set
test_X_tempogram = utils.get_X_tempogram(main_test)
print("Test tempogram: ",test_X_tempogram.shape)

Test tempogram:  (3087, 384, 32)
CPU times: user 8min 12s, sys: 33min 5s, total: 41min 17s
Wall time: 8min 39s


In [58]:
# get the X tempograms for each part of the Train subsets
train_Xs_tempogram = []
for i, paths in enumerate(main_train_subsets):
    
    # extract
    train_X_tempogram_subset = utils.get_X_tempogram(paths)
    print("Train tempogram subset {}: {}".format(i+1, train_X_tempogram_subset.shape))
    
    # append
    train_Xs_tempogram.append(train_X_tempogram_subset)

Train tempogram subset 1: (3168, 384, 32)
Train tempogram subset 2: (3168, 384, 32)
Train tempogram subset 3: (3168, 384, 32)
Train tempogram subset 4: (3168, 384, 32)
Train tempogram subset 5: (3168, 384, 32)
Train tempogram subset 6: (3168, 384, 32)
Train tempogram subset 7: (3168, 384, 32)


In [59]:
# persist the tempogram (Train)
for i, train_X_tempogram_subset in enumerate(train_Xs_tempogram):
    bcolz_save(path_to_main_preprocessed + os.path.sep + "train_X_tempogram_" + str(i + 1) +".bc", train_X_tempogram_subset)

In [None]:
# persist the tempogram (Test & CV)
bcolz_save(path_to_main_preprocessed + os.path.sep + "cv_X_tempogram" + ".bc", cv_X_tempogram)
bcolz_save(path_to_main_preprocessed + os.path.sep + "test_X_tempogram" + ".bc", test_X_tempogram)

In [28]:
!ls $path_to_main_preprocessed

cv_X.bc		     train_X3.bc	    train_X_MFCCs_2D_6.bc
cv_X_MFCCs_2d.bc     train_X4.bc	    train_X_MFCCs_2D_7.bc
cv_X_tempogram.bc    train_X5.bc	    train_X_tempogram_1.bc
cv_y.bc		     train_X6.bc	    train_X_tempogram_2.bc
test_X.bc	     train_X7.bc	    train_X_tempogram_3.bc
test_X_MFCCs_2d.bc   train_X_MFCCs_2D_1.bc  train_X_tempogram_4.bc
test_X_tempogram.bc  train_X_MFCCs_2D_2.bc  train_X_tempogram_5.bc
test_y.bc	     train_X_MFCCs_2D_3.bc  train_X_tempogram_6.bc
train_X1.bc	     train_X_MFCCs_2D_4.bc  train_X_tempogram_7.bc
train_X2.bc	     train_X_MFCCs_2D_5.bc  train_y.bc


### Reload
If necessary, you can use this snippet to reload all of the previously preprocessed and persisted data sets.

In [8]:
# reload the y
train_y = bcolz_load(path_to_main_preprocessed + os.path.sep + "train_y" + ".bc")
cv_y = bcolz_load(path_to_main_preprocessed + os.path.sep + "cv_y" + ".bc")
test_y = bcolz_load(path_to_main_preprocessed + os.path.sep + "test_y" + ".bc")

In [9]:
# reload the Test & CV X
# raw
cv_X = bcolz_load(path_to_main_preprocessed + os.path.sep + "cv_X" + ".bc")
test_X = bcolz_load(path_to_main_preprocessed + os.path.sep + "test_X" + ".bc")

# tempogram
cv_X_tempogram = bcolz_load(path_to_main_preprocessed + os.path.sep + "cv_X_tempogram" + ".bc")
test_X_tempogram = bcolz_load(path_to_main_preprocessed + os.path.sep + "test_X_tempogram" + ".bc")

# 2D MFCCs
cv_X_mfccs_2D = bcolz_load(path_to_main_preprocessed + os.path.sep + "cv_X_MFCCs_2d" + ".bc")
test_X_mfccs_2D = bcolz_load(path_to_main_preprocessed + os.path.sep + "test_X_MFCCs_2d" + ".bc")

In [10]:
# reload the Train X
# raw
train_Xs = []
for i in range(7):
    train_subset = bcolz_load(path_to_main_preprocessed + os.path.sep + "train_X" + str(i + 1) +".bc")
    train_Xs.append(train_subset)
    
# MFCCs 2D
train_Xs_MFCCs_2D = []
for i in range(7):
    train_subset = bcolz_load(path_to_main_preprocessed + os.path.sep + "train_X_MFCCs_2D_" + str(i + 1) +".bc")
    train_Xs_MFCCs_2D.append(train_subset)

# Tempogram
train_Xs_tempogram = []
for i in range(7):
    train_subset = bcolz_load(path_to_main_preprocessed + os.path.sep + "train_X_tempogram_" + str(i + 1) +".bc")
    train_Xs_tempogram.append(train_subset)

#### Split the Train y
Since our Train X is split into seven subsets, we split our Train y the same way, so that each subset can be passed to our models together with its matching labels.

In [11]:
# Train X subsets have 3168 examples each (7 total), exactly
train_ys = []
subset_size = 3168
for i in range(7):
    train_y_subset = train_y[subset_size * i : subset_size * (i + 1)]
    train_ys.append(train_y_subset)

In [12]:
train_ys[0][0]

array([0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
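The slicing above relies on the Train set holding exactly 7 × 3168 examples. As a hedged sketch of the same idea, the hypothetical helper below splits any sequence into consecutive chunks and keeps a shorter final chunk instead of silently dropping it. A plain list stands in for `train_y` here.

```python
def split_into_chunks(items, chunk_size):
    """Split a sequence into consecutive chunks; the last chunk may be shorter."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

# stand-in for train_y: 7 subsets of 3168 labels each
labels = list(range(7 * 3168))
chunks = split_into_chunks(labels, 3168)
print(len(chunks), [len(c) for c in chunks])
```

Checking the chunk lengths against the shapes of the X subsets is a cheap way to catch misaligned features and labels before training.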

#### Expand Tempograms to 2D
One of our best models (based on experiments on the sample) used 2D Tempograms, so let's add a trailing channel dimension to our tempograms to make them usable with 2D convolutions.

In [13]:
train_Xs_tempogram_2D = [np.expand_dims(train_X_tempogram, axis=3) for train_X_tempogram in train_Xs_tempogram]
cv_X_tempogram_2D = np.expand_dims(cv_X_tempogram, axis=3)
test_X_tempogram_2D = np.expand_dims(test_X_tempogram, axis=3)

test_X_tempogram_2D.shape

(3087, 384, 32, 1)

## Train Models
We have the following promising model architectures to test on the entire set:

1. 1D Convolutional Tempogram Model
2. 2D Convolutional Tempogram Model
3. 1D Convolutional Community Model (deep learning on raw data)

We will want to persist the models that exceed a certain performance threshold. Let's set that threshold and create a directory for storing models.

In [14]:
# performance threshold
accuracy_threshold = 0.6

In [15]:
!pwd

/home/paperspace/tensorflow_speech_recognition


In [16]:
# create the directory for storing final models
# !mkdir models

In [17]:
path_to_models = os.path.join(os.path.expanduser("~"), "tensorflow_speech_recognition", "models")
path_to_models

'/home/paperspace/tensorflow_speech_recognition/models'

In [18]:
num_categories = 12

#### 1D Convolutional Tempogram Model
Let's try this model's architecture in exactly the same form as the version we trained on the sample set, where it reached approximately 0.4 accuracy.

In [20]:
current_model_path = os.path.join(path_to_models, "1D_CNN_TEMPOGRAM")

cnn3 = Sequential([
        Convolution1D(input_shape=(test_X_tempogram.shape[1], test_X_tempogram.shape[2]), 
                      kernel_size=32, filters=128, padding="same", activation="relu"),
        Dropout(0.11),
        MaxPooling1D(),
        Convolution1D(kernel_size=12, filters=128, padding="same", activation="relu"),
        Dropout(0.13),
        MaxPooling1D(),
        Flatten(),
        Dense(2000, activation="relu"),
        Dropout(.7),
        Dense(num_categories, activation="softmax")
    ])

cnn3.compile(Adam(lr=0.0001),loss="categorical_crossentropy", metrics=["accuracy"])

In [34]:
# start the timer
start = time.time()

In [30]:
# keep track of epoch
cur_epoch_nr = 1

# fit iteratively
for i, train_X_tempogram in enumerate(train_Xs_tempogram):
    
    # pretty printing
    print(i + 1, "/", len(train_Xs_tempogram), end=" | ")
    
    result = cnn3.fit(train_X_tempogram, train_ys[i], batch_size=32, epochs=1, verbose=0,
             validation_data=(cv_X_tempogram, cv_y))
    
    # pretty printing
    duration = time.time() - start
    start = time.time()
    print("Took {:.2f} seconds".format(duration), end=" | ")
    
    # results
    cv_acc = "{:.4f}".format(result.history["val_acc"][0]).replace(".","")
    train_acc = "{:.4f}".format(result.history["acc"][0]).replace(".","")
    print("Train acc: {} | CV acc: {}".format(train_acc, cv_acc))
    
    # saving
    if result.history["val_acc"][0] >= accuracy_threshold:
        cnn3.save_weights(current_model_path + "_" + str(cur_epoch_nr) + "_" + str(i + 1) + "_" + "TR" + train_acc + "_" + "CV" + cv_acc + ".h5")

1 / 7 | Took 41.53 seconds | Train acc: 03741 | CV acc: 03743
2 / 7 | Took 41.36 seconds | Train acc: 03845 | CV acc: 03648
3 / 7 | Took 41.15 seconds | Train acc: 03889 | CV acc: 03707
4 / 7 | Took 41.22 seconds | Train acc: 03662 | CV acc: 03812
5 / 7 | Took 41.37 seconds | Train acc: 03860 | CV acc: 03795
6 / 7 | Took 41.08 seconds | Train acc: 03854 | CV acc: 03900
7 / 7 | Took 41.39 seconds | Train acc: 05297 | CV acc: 01203
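Note that the accuracies in the log are printed in the same compact form used for the weight file names: the decimal point is stripped, so `03741` stands for 0.3741. The sketch below reproduces that naming scheme and decodes a tag back into a float. `weights_filename` and `parse_tag` are hypothetical helper names, not part of the notebook's `utils`.

```python
def accuracy_tag(acc):
    """Format an accuracy the way the training loop does: 0.3741 -> '03741'."""
    return "{:.4f}".format(acc).replace(".", "")

def parse_tag(tag):
    """Invert accuracy_tag: '03741' -> 0.3741."""
    return int(tag) / 10000

def weights_filename(prefix, epoch, subset, train_acc, cv_acc):
    """Build '<prefix>_<epoch>_<subset>_TR<train>_CV<cv>.h5'."""
    return "{}_{}_{}_TR{}_CV{}.h5".format(
        prefix, epoch, subset, accuracy_tag(train_acc), accuracy_tag(cv_acc))

name = weights_filename("1D_CNN_RAW", 1, 7, 0.7399, 0.5326)
print(name)
print(parse_tag("03741"))
```

Keeping the epoch, subset and both accuracies in the file name makes it possible to pick a checkpoint later without reloading it.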


Train for a couple more epochs, adjusting the learning rate.

In [19]:
# predefine the learning rates
# (one per epoch, i.e. per full pass over the 7 Train subsets)
lrs = [0.0003, 0.001, 0.0003,0.0001,0.00003,0.00003,0.03,
       0.01,0.003,0.001,0.0003,0.0003,0.00001,0.00001,
       0.00001,0.03,0.03,0.03,0.003,0.003,0.0003,0.00001,
       0.00001,0.00001,0.000001,0.000001,0.000001,
       0.000001,0.0000001,0.00000001,0.3,1.0,1.0,0.1,0.03,0.01,
       0.003,0.001,0.001,0.0003,0.0001,
       0.00003,0.00003,0.00001,0.00001,0.1,0.01,0.001,0.0001,
       0.00001,0.0003, 0.001, 0.0003,0.0001,0.00003,0.00003,0.03,
       0.01,0.003,0.001,0.0003,0.0003,0.00001,0.00001,
       0.00001,0.03,0.03,0.03,0.003,0.003,0.0003,0.00001,
       0.00001,0.00001,0.000001,0.000001,0.000001,0.3,1.0,
       1.0,0.1,0.03,0.01,0.003,0.001,0.001,0.0003,0.0001,
       0.00003,0.00003,0.00001,0.00001,0.1,0.01,0.001,0.0001,
       0.000001]
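Much of the hand-typed list follows a descending 3-1-3-1 ladder (for example 0.03, 0.01, 0.003, 0.001, 0.0003) that restarts from a higher value every few epochs. As a sketch only, such segments could be generated rather than typed. `lr_ladder` is a hypothetical helper and does not reproduce the exact list above.

```python
def lr_ladder(top_exponent, n_steps):
    """Descending ladder of learning rates: 3e<top>, 1e<top>, 3e<top-1>, 1e<top-1>, ..."""
    rates = []
    mantissa, exponent = 3, top_exponent
    for _ in range(n_steps):
        rates.append(float("{}e{}".format(mantissa, exponent)))
        if mantissa == 3:
            mantissa = 1
        else:
            mantissa, exponent = 3, exponent - 1
    return rates

# one ladder from 0.03 down to 0.0003, repeated twice to mimic a restart
segment = lr_ladder(-2, 5)
schedule = segment * 2
print(segment)
```

Generating the schedule keeps the number of epochs explicit (`len(schedule)`) and avoids typos in long runs of zeros.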

In [38]:
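# iteratively for each learning rate
from tensorflow.python.keras import backend as K

for lr in lrs:
    # adjust learning rate and epoch
    # (update the optimizer's existing variable in place; a plain attribute assignment
    # may not reach a training function that has already been built)
    K.set_value(cnn3.optimizer.lr, lr)
    cur_epoch_nr = cur_epoch_nr + 1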
    
    # fit iteratively
    for i, train_X_tempogram in enumerate(train_Xs_tempogram):

        # pretty printing
        print(i + 1, "/", len(train_Xs_tempogram), end=" | ")
        print("Actual epoch: {}".format(cur_epoch_nr), end=" | ")
        print("Current lr: {}".format(lr), end=" | ")

        result = cnn3.fit(train_X_tempogram, train_ys[i], batch_size=32, epochs=1, verbose=0, 
                 validation_data=(cv_X_tempogram, cv_y))

        # pretty printing
        duration = time.time() - start
        start = time.time()
        print("Took {:.2f} seconds".format(duration), end=" | ")

        # results
        cv_acc = "{:.4f}".format(result.history["val_acc"][0]).replace(".","")
        train_acc = "{:.4f}".format(result.history["acc"][0]).replace(".","")
        print("Train acc: {} | CV acc: {}".format(train_acc, cv_acc))

        # saving
        if result.history["val_acc"][0] >= accuracy_threshold:
            cnn3.save_weights(current_model_path + "_" + str(cur_epoch_nr) + "_" + str(i + 1) + "_" + "TR" + train_acc + "_" + "CV" + cv_acc + ".h5")


1 / 7 | Actual epoch: 3 | Current lr: 0.0003 | Took 48.65 seconds | Train acc: 02061 | CV acc: 02635
2 / 7 | Actual epoch: 3 | Current lr: 0.0003 | Took 40.76 seconds | Train acc: 03226 | CV acc: 03425
3 / 7 | Actual epoch: 3 | Current lr: 0.0003 | Took 40.88 seconds | Train acc: 03674 | CV acc: 03661
4 / 7 | Actual epoch: 3 | Current lr: 0.0003 | Took 41.19 seconds | Train acc: 03699 | CV acc: 03612
5 / 7 | Actual epoch: 3 | Current lr: 0.0003 | Took 43.42 seconds | Train acc: 03759 | CV acc: 03756
6 / 7 | Actual epoch: 3 | Current lr: 0.0003 | Took 43.88 seconds | Train acc: 03958 | CV acc: 03753
7 / 7 | Actual epoch: 3 | Current lr: 0.0003 | Took 43.59 seconds | Train acc: 05571 | CV acc: 01295
1 / 7 | Actual epoch: 4 | Current lr: 0.001 | Took 42.41 seconds | Train acc: 03324 | CV acc: 03802
2 / 7 | Actual epoch: 4 | Current lr: 0.001 | Took 43.42 seconds | Train acc: 04034 | CV acc: 03832
3 / 7 | Actual epoch: 4 | Current lr: 0.001 | Took 41.63 seconds | Train acc: 04100 | CV acc

After 22 epochs we get to a CV accuracy of about 0.45, a little better than the performance on our sample set, but below our expectations. Our model doesn't seem to be overfitting much either: the accuracy on the main train set is about 0.50. This could suggest that whilst the general approach of using 1D convolutional layers on preprocessed tempograms is valid, **our architecture may not be complex enough to capture all the important differences.**

Let's save the current model and reload it for further training.

In [39]:
# save the model
cnn3.save_weights(current_model_path + "_" + str(cur_epoch_nr) + "_" + str(i + 1) + "_" + "TR" + train_acc + "_" + "CV" + cv_acc + ".h5")

In [52]:
# reload the current model, if needed
# cnn3.load_weights(os.path.join(path_to_models, "1D_CNN_TEMPOGRAM_3_1_TR02705_CV02950.h5"))

In [40]:
# show latest results
best_training_accuracy = max(result.history["acc"])
best_validation_accuracy = max(result.history["val_acc"])
print("Best 1D CNN Tempogram model scores\nTrain acc: {:.4f}\nCV acc: {:.4f}".format(best_training_accuracy, best_validation_accuracy))

Best 1D CNN Tempogram model scores
Train acc: 0.4956
CV acc: 0.4448
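To put a number on the "not overfitting much" observation, the gap between the two scores printed above can be computed directly. This is a small illustrative sketch; `generalization_gap` is a hypothetical helper name.

```python
def generalization_gap(train_acc, cv_acc):
    """Training accuracy minus validation accuracy (positive: the model fits Train better)."""
    return round(train_acc - cv_acc, 4)

# scores reported for the 1D CNN Tempogram model
gap = generalization_gap(0.4956, 0.4448)
print("Train/CV gap: {:.4f}".format(gap))
```

A gap of roughly five percentage points with both scores below 0.5 points to underfitting rather than overfitting, which is consistent with the suspicion that the architecture lacks capacity.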


Our next logical step was to add more layers to our initial 1D Convolutional Tempogram Model and see whether that improves its performance. It didn't significantly increase the model's ability to fit the training set (accuracy stayed at or below about 0.5). This may be partly because tempogram transformations, whilst less susceptible to differences in pitch, discard some features of the original examples.

Let's train a 2D convolutional model and see if we can get a better performance from the tempograms.

#### 2D Convolutional Tempogram Model

In [22]:
current_model_path = os.path.join(path_to_models, "2D_CNN_TEMPOGRAM")

cnn5 = Sequential([
        Conv2D(input_shape=(test_X_tempogram_2D.shape[1], test_X_tempogram_2D.shape[2], 1), 
                      kernel_size=32, filters=128, padding="same", activation="relu"),
        Dropout(0.11),
        MaxPooling2D(),
        Conv2D(kernel_size=12, filters=128, padding="same", activation="relu"),
        Dropout(0.13),
        MaxPooling2D(),
        Flatten(),
        Dense(2000, activation="relu"),
        Dropout(.7),
        Dense(num_categories, activation="softmax")
    ])

cnn5.compile(Adam(lr=0.0001),loss="categorical_crossentropy", metrics=["accuracy"])

In [26]:
# start the timer
start = time.time()

In [24]:
# keep track of epoch
cur_epoch_nr = 1

# fit iteratively
for i, train_X_tempogram_2D in enumerate(train_Xs_tempogram_2D):
    
    # pretty printing
    print(i + 1, "/", len(train_Xs_tempogram_2D))
    
    result = cnn5.fit(train_X_tempogram_2D, train_ys[i], batch_size=32, epochs=1, 
             validation_data=(cv_X_tempogram_2D, cv_y))
    
    # pretty printing (the timer is not reset in this loop, so the time shown is cumulative)
    duration = time.time() - start
    print("Took {:.2f} seconds".format(duration))
    print()
    
    # results
    cv_acc = "{:.4f}".format(result.history["val_acc"][0]).replace(".","")
    train_acc = "{:.4f}".format(result.history["acc"][0]).replace(".","")
    
    # saving
    if result.history["val_acc"][0] >= accuracy_threshold:
        cnn5.save_weights(current_model_path + "_" + str(cur_epoch_nr) + "_" + str(i + 1) + "_" + "TR" + train_acc + "_" + "CV" + cv_acc + ".h5")

1 / 7
Train on 3168 samples, validate on 3051 samples
Epoch 1/1
Took 2735.55 seconds

2 / 7
Train on 3168 samples, validate on 3051 samples
Epoch 1/1
Took 5174.50 seconds

3 / 7
Train on 3168 samples, validate on 3051 samples
Epoch 1/1
Took 7841.48 seconds

4 / 7
Train on 3168 samples, validate on 3051 samples
Epoch 1/1
Took 10531.50 seconds

5 / 7
Train on 3168 samples, validate on 3051 samples
Epoch 1/1
Took 13271.69 seconds

6 / 7
Train on 3168 samples, validate on 3051 samples
Epoch 1/1
Took 16032.57 seconds

7 / 7
Train on 3168 samples, validate on 3051 samples
Epoch 1/1
Took 18803.15 seconds



Our validation accuracy doesn't seem to exceed the levels we've reached on the sample set. We might be able to experiment with different learning rates and get a better result, but given the fact that the 2D Tempogram model is much more expensive to train (time-wise), let's move on to the 1D convolutional model on raw data.
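The times in the 2D log grow with every subset because the timer is never reset inside that loop, so they appear to be cumulative. Under that assumption, the per-subset cost can be recovered by differencing the logged values, and compared with the roughly 41 seconds per subset of the 1D Tempogram model. Note that each figure includes the validation pass.

```python
# cumulative seconds copied from the 2D Tempogram training log
cumulative = [2735.55, 5174.50, 7841.48, 10531.50, 13271.69, 16032.57, 18803.15]

def per_step_durations(cumulative_seconds):
    """Turn cumulative timings into per-step durations."""
    durations, previous = [], 0.0
    for total in cumulative_seconds:
        durations.append(round(total - previous, 2))
        previous = total
    return durations

durations = per_step_durations(cumulative)
mean_2d = sum(durations) / len(durations)

# per-subset times from the first epoch of the 1D Tempogram model
times_1d = [41.53, 41.36, 41.15, 41.22, 41.37, 41.08, 41.39]
mean_1d = sum(times_1d) / len(times_1d)

print("2D per subset:", durations)
print("2D epoch: {:.2f} h".format(cumulative[-1] / 3600))
print("2D is about {:.0f}x slower per subset".format(mean_2d / mean_1d))
```

On these figures a single 2D epoch takes a little over five hours, against under five minutes for the 1D model.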

#### 1D Convolutional Community Model

In [20]:
# we need to expand the dimensions for 1D convolutions
commmunity_train_Xs = [np.expand_dims(train_X, axis=2) for train_X in train_Xs]
commmunity_train_Xs[0].shape

(3168, 16000, 1)

In [21]:
# same for CV
community_cv_X = np.expand_dims(cv_X, axis=2)
community_cv_X.shape

(3051, 16000, 1)

In [22]:
# Functional model
current_model_path = os.path.join(path_to_models, "1D_CNN_RAW")

# input layer & batch normalization
inputs = Input(shape = (16000,1))
x_1d = BatchNormalization(name = 'batchnormal_1d_in')(inputs)

# iteratively create 9 blocks of 2 convolutional layers with batchnorm and max-pooling
for i in range(9):
    
    name = 'step'+str(i)
    
    # first 1D convolutional block
    x_1d = Conv1D(8*(2 ** i), (3),padding = 'same', name = 'conv'+name+'_1')(x_1d)
    x_1d = BatchNormalization(name = 'batch'+name+'_1')(x_1d)
    x_1d = Activation('relu')(x_1d)
    
    # second 1D convolutional block
    x_1d = Conv1D(8*(2 ** i), (3),padding = 'same', name = 'conv'+name+'_2')(x_1d)
    x_1d = BatchNormalization(name = 'batch'+name+'_2')(x_1d)
    x_1d = Activation('relu')(x_1d)
    
    # max pooling
    x_1d = MaxPooling1D((2), padding='same')(x_1d)

# final convolution and dense layer
x_1d = Conv1D(1024, (1),name='last1024')(x_1d)
x_1d = GlobalMaxPool1D()(x_1d)
x_1d = Dense(1024, activation = 'relu', name= 'dense1024_onlygmax')(x_1d)
x_1d = Dropout(0.2)(x_1d)

# soft-maxed prediction layer
predictions = Dense(num_categories, activation = 'softmax',name='cls_1d')(x_1d)


community_model = Model(inputs=inputs, outputs=predictions)
community_model.compile(Adam(lr=0.0001),loss="categorical_crossentropy", metrics=["accuracy"])
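To see what the nine blocks do to the one-second, 16000-sample input, the shapes can be traced by hand. The sketch below assumes that `MaxPooling1D(2, padding='same')` halves the length and rounds up, and that `Conv1D(..., padding='same')` leaves the length unchanged. `trace_shapes` is a hypothetical helper for illustration only.

```python
import math

def trace_shapes(input_length=16000, n_blocks=9, base_filters=8):
    """Return (time_steps, filters) after each conv-conv-pool block."""
    shapes = []
    length = input_length
    for i in range(n_blocks):
        filters = base_filters * (2 ** i)   # 8, 16, ..., 2048
        length = math.ceil(length / 2)      # pooling with size 2 and 'same' padding
        shapes.append((length, filters))
    return shapes

for block, (length, filters) in enumerate(trace_shapes()):
    print("block {}: {} time steps x {} filters".format(block, length, filters))
```

Under these assumptions the last block outputs 32 time steps with 2048 filters. The final `Conv1D(1024, 1)` reduces that to 1024 channels and `GlobalMaxPool1D` collapses the time axis, leaving a 1024-dimensional vector for the dense layers.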

Train for 1 epoch.

In [27]:
# keep track of epoch
cur_epoch_nr = 1

# fit iteratively
for i, community_train_X in enumerate(commmunity_train_Xs):
    
    # pretty printing
    print(i + 1, "/", len(commmunity_train_Xs))
    
    result = community_model.fit(community_train_X, train_ys[i], batch_size=32, epochs=1, 
             validation_data=(community_cv_X, cv_y))
    
    # pretty printing (the timer is not reset in this loop, so the time shown is cumulative)
    duration = time.time() - start
    print("Took {:.2f} seconds".format(duration))
    print()
    
    # results
    cv_acc = "{:.4f}".format(result.history["val_acc"][0]).replace(".","")
    train_acc = "{:.4f}".format(result.history["acc"][0]).replace(".","")
    
    # saving
    if result.history["val_acc"][0] >= accuracy_threshold:
        community_model.save_weights(current_model_path + "_" + str(cur_epoch_nr) + "_" + str(i + 1) + "_" + "TR" + train_acc + "_" + "CV" + cv_acc + ".h5")

1 / 7
Train on 3168 samples, validate on 3051 samples
Epoch 1/1
Took 1112.49 seconds

2 / 7
Train on 3168 samples, validate on 3051 samples
Epoch 1/1
Took 2121.38 seconds

3 / 7
Train on 3168 samples, validate on 3051 samples
Epoch 1/1
Took 3180.02 seconds

4 / 7
Train on 3168 samples, validate on 3051 samples
Epoch 1/1
Took 4276.91 seconds

5 / 7
Train on 3168 samples, validate on 3051 samples
Epoch 1/1
Took 5312.86 seconds

6 / 7
Train on 3168 samples, validate on 3051 samples
Epoch 1/1
Took 6361.21 seconds

7 / 7
Train on 3168 samples, validate on 3051 samples
Epoch 1/1
Took 7365.40 seconds



In [29]:
# save the model
community_model.save_weights(current_model_path + "_" + str(cur_epoch_nr) + "_" + str(i + 1) + "_" + "TR" + train_acc + "_" + "CV" + cv_acc + ".h5")

In [23]:
# reload the current model, if needed (and restart the timer)
# community_model.load_weights(os.path.join(path_to_models, "1D_CNN_RAW_1_7_TR07399_CV05326.h5"))
# start = time.time()

Train for multiple epochs, adjusting the learning rate.

In [31]:
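# iteratively for each learning rate
from tensorflow.python.keras import backend as K

for lr in lrs:
    # adjust learning rate and epoch
    # (update the optimizer's existing variable in place; a plain attribute assignment
    # may not reach a training function that has already been built)
    K.set_value(community_model.optimizer.lr, lr)
    cur_epoch_nr = cur_epoch_nr + 1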
    
    # fit iteratively
    for i, community_train_X in enumerate(commmunity_train_Xs):

        # pretty printing
        print(i + 1, "/", len(commmunity_train_Xs), end=" | ")
        print("Actual epoch: {}".format(cur_epoch_nr), end=" | ")
        print("Current lr: {}".format(lr), end=" | ")

        result = community_model.fit(community_train_X, train_ys[i], batch_size=32, epochs=1, verbose=1, 
                 validation_data=(community_cv_X, cv_y))

        # pretty printing
        duration = time.time() - start
        start = time.time()
        print("Took {:.2f} seconds".format(duration), end=" | ")

        # results
        cv_acc = "{:.4f}".format(result.history["val_acc"][0]).replace(".","")
        train_acc = "{:.4f}".format(result.history["acc"][0]).replace(".","")
        print("Train acc: {} | CV acc: {}".format(train_acc, cv_acc))

        # saving
        if result.history["val_acc"][0] >= accuracy_threshold:
            community_model.save_weights(current_model_path + "_" + str(cur_epoch_nr) + "_" + str(i + 1) + "_" + "TR" + train_acc + "_" + "CV" + cv_acc + ".h5")

1 / 7 | Actual epoch: 2 | Current lr: 0.0003 | Train on 3168 samples, validate on 3051 samples
Epoch 1/1
Took 953.21 seconds | Train acc: 07674 | CV acc: 06785
2 / 7 | Actual epoch: 2 | Current lr: 0.0003 | Train on 3168 samples, validate on 3051 samples
Epoch 1/1
Took 951.80 seconds | Train acc: 07847 | CV acc: 06667
3 / 7 | Actual epoch: 2 | Current lr: 0.0003 | Train on 3168 samples, validate on 3051 samples
Epoch 1/1
Took 945.77 seconds | Train acc: 08103 | CV acc: 06247
4 / 7 | Actual epoch: 2 | Current lr: 0.0003 | Train on 3168 samples, validate on 3051 samples
Epoch 1/1
Took 945.37 seconds | Train acc: 08336 | CV acc: 07257
5 / 7 | Actual epoch: 2 | Current lr: 0.0003 | Train on 3168 samples, validate on 3051 samples
Epoch 1/1
Took 946.25 seconds | Train acc: 08355 | CV acc: 07607
6 / 7 | Actual epoch: 2 | Current lr: 0.0003 | Train on 3168 samples, validate on 3051 samples
Epoch 1/1
Took 948.96 seconds | Train acc: 08573 | CV acc: 08001
7 / 7 | Actual epoch: 2 | Current lr: 0.

We can see that after just 2 epochs our model has reached a **validation accuracy of more than 0.8**, with relatively little overfitting (the training accuracy is about 0.85). We can conclude that this is the architecture we want to rewrite in **TensorFlow**, train further and monitor via **TensorBoard**.