## Training Your Custom Dataset Keyword Spotting Model 
## Using 4-6-8-CustomDatasetKWSModel-rev4-part1.ipynb

Note to fellow TinyML students:  The original form of this Colab notebook was developed by a number of team members at Google and Harvard University.  For more information about a series of courses on TinyML, check out <a href="https://www.edx.org/professional-certificate/harvardx-applied-tiny-machine-learning-tinyml-for-scale">
Professional Certificate in
Applied Tiny Machine Learning (TinyML) for Scale</a>.

Sections of the original notebook need TensorFlow version 1.15.  Because TF 1.15 requires Python 3.6.x or 3.7.x, which are no longer supported in the Google Colab environment, the original notebook no longer runs.

To fix this problem, I have divided the work into two parts.  This first notebook is run within Google Colab and retains as much as possible of the original notebook.  The second notebook is run locally on the user's PC, which requires that the user has installed Anaconda within Fedora 37.

More information is available at my GitHub repository <a href="https://github.com/john-mangiaracina/TinyML-CustomKeywordSpotting">
TinyML-CustomKeywordSpotting</a>.

While perhaps not the most elegant of solutions, I have tested this and found it to run with no issues.  I hope this helps you with your studies in TinyML, and I hope you find it as rewarding to use as I found it to develop.

Good Luck, 

John


## Setup


In [1]:
!wget https://github.com/tensorflow/tensorflow/archive/v2.4.1.zip
!unzip v2.4.1.zip &> /dev/null
!mv tensorflow-2.4.1/ tensorflow/

--2023-06-11 03:37:33--  https://github.com/tensorflow/tensorflow/archive/v2.4.1.zip
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/tensorflow/tensorflow/zip/refs/tags/v2.4.1 [following]
--2023-06-11 03:37:33--  https://codeload.github.com/tensorflow/tensorflow/zip/refs/tags/v2.4.1
Resolving codeload.github.com (codeload.github.com)... 140.82.121.9
Connecting to codeload.github.com (codeload.github.com)|140.82.121.9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘v2.4.1.zip’

v2.4.1.zip              [    <=>             ]  66.13M  8.12MB/s    in 7.8s    

2023-06-11 03:37:41 (8.46 MB/s) - ‘v2.4.1.zip’ saved [69346072]



####  Let's check the version of Python used within Colab
####  At notebook publication in June 2023 this was 3.10.11; the run recorded below reports 3.10.12

In [2]:
#!pip uninstall tensorflow -y
#!pip install tensorflow-gpu==1.15

import sys

print("Python version")
print(sys.version)
#print(tf.__version__)

Python version
3.10.12 (main, Jun  7 2023, 12:45:35) [GCC 9.4.0]


Now let's check the TensorFlow version.
As of June 07, 2023, this is 2.12.0, which is fine for this notebook.

In [3]:
import tensorflow as tf
print(tf.__version__)

2.12.0
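
If you would rather have the notebook tell you when a version is unexpected than compare the printed strings by eye, a small guard along these lines is one option. It is only a sketch: the helper names `version_tuple` and `at_least` and the minimum versions shown are illustrative assumptions, not requirements stated by the original lab.

```python
import sys


def version_tuple(version):
    """Turn '2.12.0' into (2, 12, 0); trailing text such as 'rc1' is ignored."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


def at_least(version, minimum):
    """Return True when `version` is the same as or newer than `minimum`."""
    return version_tuple(version) >= version_tuple(minimum)


# Hypothetical threshold, chosen only for illustration.
python_version = "%d.%d.%d" % sys.version_info[:3]
print("Python new enough:", at_least(python_version, "3.7.0"))

# With TensorFlow imported as tf, the same helper applies:
# print("TensorFlow new enough:", at_least(tf.__version__, "2.0.0"))
```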


Let's now import some packages, define our path, and tend to some setup issues.

In [4]:
sys.path.append("/content/tensorflow/tensorflow/examples/speech_commands/")
import input_data
import models
import numpy as np
import glob
import os
import re
import shutil
from google.colab import files
!pip install ffmpeg-python &> /dev/null
#  We no longer need xxd in part 1 of the lab, but we will need it in part2
#!apt-get update && apt-get -qq install xxd

### Import your Custom Dataset
First we are going to download Pete's dataset to use as a base set of "other words" and "background noise" that you can build on top of for your dataset. We have found that doing this makes your model work a lot better, especially if you are training it with a small amount of custom data! We STRONGLY suggest you follow this approach, as it will have a large impact on your results!

**Note: this *will* take a couple of minutes to run!**

In [5]:
!wget https://storage.googleapis.com/download.tensorflow.org/data/speech_commands_v0.02.tar.gz
DATASET_DIR =  'dataset/'
!mkdir dataset
!tar -xf speech_commands_v0.02.tar.gz -C 'dataset'
!rm -r -f speech_commands_v0.02.tar.gz

--2023-06-11 03:37:55--  https://storage.googleapis.com/download.tensorflow.org/data/speech_commands_v0.02.tar.gz
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.79.128, 108.177.119.128, 108.177.126.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.79.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2428923189 (2.3G) [application/gzip]
Saving to: ‘speech_commands_v0.02.tar.gz’


2023-06-11 03:39:10 (31.0 MB/s) - ‘speech_commands_v0.02.tar.gz’ saved [2428923189/2428923189]
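
Once the archive is unpacked, it can be reassuring to see which word folders exist and how many clips each one holds. The following is a minimal sketch, assuming the layout used in this notebook (one sub-folder per word under `dataset/`, each holding `.wav` files); the helper name `count_wavs_per_label` is made up for illustration.

```python
import os
from collections import Counter


def count_wavs_per_label(dataset_dir):
    """Count the .wav files inside each immediate sub-folder of dataset_dir."""
    counts = Counter()
    for entry in os.scandir(dataset_dir):
        if entry.is_dir():
            counts[entry.name] = sum(
                1 for name in os.listdir(entry.path)
                if name.lower().endswith(".wav")
            )
    return counts


# Only report when the folder from the cell above is present.
if os.path.isdir("dataset"):
    for label, n_clips in sorted(count_wavs_per_label("dataset").items()):
        print(label, n_clips)
```

Running the same helper again after your own recordings have been copied in should show your custom words as additional folders.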



Now you'll need to upload all of the custom audio files that you recorded using the Open Speech Recording tool (aka the ```*.ogg``` files). **Note: you can select multiple files and upload them all at once!** 

If you are having trouble uploading files because your internet connection is too slow, feel free to skip this step and instead pick from the words in Pete's dataset, just as you did in the KWS assignment if you took Course 2.

Pete's dataset includes the following words:

Options for target words are (PICK FROM THIS LIST FOR BEST RESULTS): "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go", "backward", "forward", "follow", "learn".

Additional words that will be used to help train the "unknown" label are: "bed", "bird", "cat", "dog", "happy", "house", "marvin", "sheila", "tree", "wow"
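
To make the distinction between the two lists concrete, here is a small sketch that reports which group a word falls into. The word sets are copied from the lists above; the names `TARGET_OPTIONS`, `UNKNOWN_HELPERS`, and `classify_word` are illustrative and are not part of the lab code.

```python
# Words suggested above as target options in Pete's dataset.
TARGET_OPTIONS = {
    "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go",
    "backward", "forward", "follow", "learn",
}

# Words listed above as helping to train the "unknown" label.
UNKNOWN_HELPERS = {
    "bed", "bird", "cat", "dog", "happy", "house",
    "marvin", "sheila", "tree", "wow",
}


def classify_word(word):
    """Report which group a word belongs to, ignoring case and padding."""
    normalized = word.strip().lower()
    if normalized in TARGET_OPTIONS:
        return "target option"
    if normalized in UNKNOWN_HELPERS:
        return "unknown helper"
    return "custom"


for candidate in ("yes", "marvin", "pizza"):
    print(candidate, "->", classify_word(candidate))
```

A word reported as "custom" is one you would need to record and upload yourself.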



In [6]:
uploaded = files.upload()

Saving bread_1686435652788.ogg to bread_1686435652788.ogg
Saving pizza_1686435652783.ogg to pizza_1686435652783.ogg
Saving bread_1686435652780.ogg to bread_1686435652780.ogg
Saving bread_1686435652771.ogg to bread_1686435652771.ogg
Saving bread_1686435652766.ogg to bread_1686435652766.ogg
Saving bread_1686435652763.ogg to bread_1686435652763.ogg
Saving coke_1686435652759.ogg to coke_1686435652759.ogg
Saving bread_1686435652755.ogg to bread_1686435652755.ogg
Saving bread_1686435652751.ogg to bread_1686435652751.ogg
Saving coke_1686435652748.ogg to coke_1686435652748.ogg
Saving pizza_1686435541125.ogg to pizza_1686435541125.ogg
Saving coke_1686435541122.ogg to coke_1686435541122.ogg
Saving bread_1686435541120.ogg to bread_1686435541120.ogg
Saving coke_1686435541116.ogg to coke_1686435541116.ogg
Saving pizza_1686435541110.ogg to pizza_1686435541110.ogg
Saving bread_1686435541107.ogg to bread_1686435541107.ogg
Saving bread_1686435541104.ogg to bread_1686435541104.ogg
Saving pizza_168643554

Then we can convert them into correctly trimmed WAV files and store them in the appropriate folders in the ```DATASET_DIR```.
We will use Pete's extract_loudest_section tool, which is documented here: https://github.com/petewarden/extract_loudest_section

In [7]:
# convert the ogg files to wavs
!mkdir wavs
!find *.ogg -print0 | xargs -0 basename -s .ogg | xargs -I {} ffmpeg -i {}.ogg -ar 16000 wavs/{}.wav
!rm -r -f *.ogg

# then use pete's tool to only extract 1 second clips from them for use with the KWS pipeline
!mkdir trimmed_wavs
!git clone https://github.com/petewarden/extract_loudest_section.git
!make -C extract_loudest_section/
!/tmp/extract_loudest_section/gen/bin/extract_loudest_section 'wavs/*.wav' trimmed_wavs/
!rm -r -f wavs

Streaming output truncated to the last 5000 lines.
    Metadata:
      encoder         : Lavc58.54.100 pcm_s16le
size=      45kB time=00:00:01.43 bitrate= 256.8kbits/s speed= 307x    
video:0kB audio:45kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.169271%
ffmpeg version 4.2.7-0ubuntu0.1 Copyright (c) 2000-2022 the FFmpeg developers
  built with gcc 9 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-avresample --disable-filter=resample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable
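
To build some intuition for what "extracting the loudest section" means, here is a NumPy sketch that picks the one-second window with the most energy from a clip. This is an illustration of the general idea under my own assumptions, not the actual implementation of Pete's tool; the function name `loudest_window` and the zero-padding of short clips are choices made for this example.

```python
import numpy as np


def loudest_window(samples, sample_rate=16000, seconds=1.0):
    """Return the window of `seconds` length with the highest summed energy."""
    samples = np.asarray(samples, dtype=np.float64)
    window = int(sample_rate * seconds)
    if len(samples) <= window:
        # Clip is already short enough: pad the end with silence.
        return np.pad(samples, (0, window - len(samples)))
    energy = samples ** 2
    # A running sum lets every window's energy be computed in one pass.
    cumulative = np.concatenate(([0.0], np.cumsum(energy)))
    window_energy = cumulative[window:] - cumulative[:-window]
    start = int(np.argmax(window_energy))
    return samples[start:start + window]


# Toy signal: silence with a louder burst in the middle.
signal = np.zeros(48000)
signal[20000:30000] = 0.5
clip = loudest_window(signal)
print(len(clip), float(np.sum(clip ** 2)))
```

The fixed one-second length matches the clip length that the KWS pipeline expects, as noted in the cell above.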

In [8]:
# Store them in the appropriate folders
data_index = {}
os.chdir('trimmed_wavs')
search_path = os.path.join('*.wav')
for wav_path in glob.glob(search_path):
    original_wav_path = wav_path
    # Collapse any extra underscores so the name reads <word>_<instance>.wav
    parts = wav_path.split('_')
    if len(parts) > 2:
        wav_path = parts[0] + '_' + ''.join(parts[1:])
    matches = re.search(r'([^/_]+)_([^/_]+)\.wav', wav_path)
    if not matches:
        raise Exception('File name not in a recognized form:"%s"' % wav_path)
    word = matches.group(1).lower()
    instance = matches.group(2).lower()
    if word not in data_index:
        data_index[word] = {}
    if instance in data_index[word]:
        raise Exception('Audio instance already seen:"%s"' % wav_path)
    data_index[word][instance] = original_wav_path

# Copy each clip into dataset/<word>/<instance>.wav
output_dir = os.path.join('..', 'dataset')
try:
    os.mkdir(output_dir)
except FileExistsError:
    pass
for word in data_index:
    word_dir = os.path.join(output_dir, word)
    try:
        os.mkdir(word_dir)
        print('Created dir: ' + word_dir)
    except FileExistsError:
        print('Storing in existing dir: ' + word_dir)
    for instance in data_index[word]:
        wav_path = data_index[word][instance]
        output_path = os.path.join(word_dir, instance + '.wav')
        shutil.copyfile(wav_path, output_path)
os.chdir('..')
!rm -r -f trimmed_wavs

Created dir: ../dataset/coke
Created dir: ../dataset/bread
Created dir: ../dataset/pizza
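
The folder names above come from the part of each file name before the first underscore. The sketch below isolates that parsing step so you can try it on a file name before uploading; it mirrors the logic of the cell above, but the function name `split_wav_name` is an illustrative assumption.

```python
import re


def split_wav_name(filename):
    """Split '<word>_<instance>.wav' into a lower-cased (word, instance) pair."""
    parts = filename.split('_')
    if len(parts) > 2:
        # Keep the first underscore only; later pieces are joined together.
        filename = parts[0] + '_' + ''.join(parts[1:])
    match = re.search(r'([^/_]+)_([^/_]+)\.wav', filename)
    if not match:
        raise ValueError('File name not in a recognized form: "%s"' % filename)
    return match.group(1).lower(), match.group(2).lower()


print(split_wav_name('bread_1686435652788.wav'))
print(split_wav_name('Pizza_take_2.wav'))
```

A file name without any underscore raises `ValueError`, which corresponds to the "not in a recognized form" exception in the cell above.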


Due to the revisions to the lab, this next step is no longer optional.  Please zip up your processed custom dataset, then download it from the Files area by clicking on the three dots next to the zip file. Note: this command will take at least 3 minutes to run, because the combination of your data and Pete's dataset is relatively large!

We STRONGLY suggest you use Pete's dataset for any actual training you do!

Finally, if you have room in your Google Drive and would like to load and store your dataset from there in the future, you can mount your Google Drive in Colab [as described in this blog post](https://towardsdatascience.com/downloading-datasets-into-google-drive-via-google-colab-bcb1b30b0166).
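
If you take the Google Drive route, the copy step might look something like the sketch below. The helper name `copy_to_drive` and the folder `kws_datasets` are made up for illustration, and the commented lines assume Drive is mounted at `/content/drive`, with your files under `MyDrive`, and that the zip archive has already been created.

```python
import os
import shutil


def copy_to_drive(zip_path, drive_dir):
    """Copy zip_path into drive_dir (created if missing); return the new path."""
    os.makedirs(drive_dir, exist_ok=True)
    return shutil.copy(zip_path, drive_dir)


# In Colab, mount Drive first, then copy the archive once it exists:
# from google.colab import drive
# drive.mount('/content/drive')
# copy_to_drive('myKWSDataset.zip', '/content/drive/MyDrive/kws_datasets')
```

Copying to Drive happens on Google's side, so it avoids pushing the archive through your own internet connection.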

In [9]:
!zip -r myKWSDataset.zip dataset

Streaming output truncated to the last 5000 lines.
  adding: dataset/sheila/5b26c81b_nohash_0.wav (deflated 17%)
  adding: dataset/sheila/c0cb43d6_nohash_0.wav (deflated 28%)
  adding: dataset/sheila/8353fea1_nohash_0.wav (deflated 56%)
  adding: dataset/sheila/ca48dc76_nohash_1.wav (deflated 20%)
  adding: dataset/sheila/4fce7686_nohash_0.wav (deflated 32%)
  adding: dataset/sheila/893b3c11_nohash_0.wav (deflated 30%)
  adding: dataset/sheila/ece1a95a_nohash_1.wav (deflated 26%)
  adding: dataset/sheila/6c0f6493_nohash_0.wav (deflated 47%)
  adding: dataset/sheila/8c7f81df_nohash_0.wav (deflated 19%)
  adding: dataset/sheila/1b42b551_nohash_0.wav (deflated 17%)
  adding: dataset/sheila/ed3c2d05_nohash_0.wav (deflated 29%)
  adding: dataset/sheila/9beccfc8_nohash_5.wav (deflated 35%)
  adding: dataset/sheila/74b73f88_nohash_1.wav (deflated 30%)
  adding: dataset/sheila/483e2a6f_nohash_0.wav (deflated 25%)
  adding: dataset/sheila/cce7416f_nohash_0.wav (deflated 19%)
  add

After the zip completes, you may see a file whose name is a series of letters, but no file labeled myKWSDataset.zip.  Right-click in the sidebar and refresh it, and the zipped file should appear.

Occasionally, downloads from Colab are very slow and the system will time out before a 2 GB+ file can be successfully downloaded.  In those cases, try the cell below to form multiple separate zip files.  You can then right-click to download all of the files in parallel!  (This technique saved me a lot of frustration!)

Well done!  Once the file is successfully downloaded to your PC, part1 of the lab is complete.  Proceed to 4-6-8-CustomDatasetKWSModel-rev4-part2.ipynb.

For more information, check out my GitHub repository <a href="https://github.com/john-mangiaracina/TinyML-CustomKeywordSpotting">
TinyML-CustomKeywordSpotting</a>.

In [11]:
#  Forms multiple zip files for parallel download
#  Use if the single-file download is too slow or times out

#!zip -s 500M new.zip myKWSDataset.zip

  adding: myKWSDataset.zip (stored 0%)
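
The same "smaller pieces" idea can also be sketched in plain Python. Note that this is a byte-level split and rejoin, not the split-archive format that `zip -s` produces, so the pieces must be rejoined with `join_parts` rather than opened with an unzip tool. The names `split_file` and `join_parts` and the `.partNNN` suffix are illustrative assumptions.

```python
import shutil


def split_file(path, chunk_bytes):
    """Write path's contents to numbered part files of at most chunk_bytes each."""
    part_paths = []
    with open(path, "rb") as source:
        index = 0
        while True:
            # Each chunk is read fully into memory, so keep chunk_bytes modest.
            chunk = source.read(chunk_bytes)
            if not chunk:
                break
            index += 1
            part_path = "%s.part%03d" % (path, index)
            with open(part_path, "wb") as part:
                part.write(chunk)
            part_paths.append(part_path)
    return part_paths


def join_parts(part_paths, output_path):
    """Concatenate the part files, in the order given, into output_path."""
    with open(output_path, "wb") as output:
        for part_path in part_paths:
            with open(part_path, "rb") as part:
                shutil.copyfileobj(part, output)


# Example (commented out because it needs the archive from the earlier cell):
# parts = split_file("myKWSDataset.zip", 500 * 1024 * 1024)
# join_parts(parts, "myKWSDataset_rejoined.zip")
```

After downloading the parts to your PC, rejoin them in numeric order before unzipping.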


In [None]:
#  Optional

#  I experimented with this code so Colab wouldn't prematurely time me out

import time

print("I'm here")

for num in range(1, 200):
  time.sleep(60)
  print("Colab, I'm still here", num)

I'm here
Colab, I'm still here 1
Colab, I'm still here 2
Colab, I'm still here 3
Colab, I'm still here 4
Colab, I'm still here 5
Colab, I'm still here 6
Colab, I'm still here 7
Colab, I'm still here 8
Colab, I'm still here 9
Colab, I'm still here 10
Colab, I'm still here 11
Colab, I'm still here 12
Colab, I'm still here 13
Colab, I'm still here 14
Colab, I'm still here 15
Colab, I'm still here 16
Colab, I'm still here 17
Colab, I'm still here 18
Colab, I'm still here 19
Colab, I'm still here 20
Colab, I'm still here 21
Colab, I'm still here 22
Colab, I'm still here 23
Colab, I'm still here 24
Colab, I'm still here 25
Colab, I'm still here 26
Colab, I'm still here 27
Colab, I'm still here 28
Colab, I'm still here 29
Colab, I'm still here 30
Colab, I'm still here 31
Colab, I'm still here 32
Colab, I'm still here 33
Colab, I'm still here 34
Colab, I'm still here 35
Colab, I'm still here 36
Colab, I'm still here 37
Colab, I'm still here 38
Colab, I'm still here 39
Colab, I'm still here 40
