## Training Your Custom Dataset Keyword Spotting Model 
## Using 4-6-8-CustomDatasetKWSModel-rev4-part1.ipynb

Note to fellow TinyML students:  The original form of this Colab notebook was developed by a number of team members at Google and Harvard University.  For more information about a series of courses on TinyML, check out <a href="https://www.edx.org/professional-certificate/harvardx-applied-tiny-machine-learning-tinyml-for-scale">
Professional Certificate in
Applied Tiny Machine Learning (TinyML) for Scale</a>.

Sections of the original notebook need TensorFlow version 1.15.  Since TF 1.15 requires Python 3.6.x or 3.7.x, and these are no longer supported in the Google Colab development environment, the original notebook no longer runs there.

To fix this problem, I have divided the work into two parts.  This first notebook is to be run within Google Colab and retains as much as possible of the original notebook.  The second notebook is run locally on the user's PC, which requires that the user has installed Anaconda within Fedora 37.

More information is available at my GitHub repository <a href="https://github.com/john-mangiaracina/TinyML-CustomKeywordSpotting">
TinyML-CustomKeywordSpotting</a>.

While perhaps not the most elegant of solutions, I have tested this and found it to run with no issues.  I hope this helps you with your studies in TinyML, and I hope you find it as rewarding to use as I found it to develop.

Good Luck, 

John


## Setup


In [None]:
!wget https://github.com/tensorflow/tensorflow/archive/v2.4.1.zip
!unzip v2.4.1.zip &> /dev/null
!mv tensorflow-2.4.1/ tensorflow/

####  Let's check our version of Python used within Colab
####  As of notebook publication on June 07, 2023, this is 3.10.11

In [None]:
#!pip uninstall tensorflow -y
#!pip install tensorflow-gpu==1.15

import sys

print("Python version")
print(sys.version)
#print(tf.__version__)
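
The version constraint mentioned in the introduction can also be tested in code rather than read off the printout. The following is a minimal sketch: the helper name `can_run_tf_1_15` is hypothetical, and the accepted range (Python 3.6.x or 3.7.x) is taken from this notebook's own notes, not checked against the package metadata.

```python
import sys


def can_run_tf_1_15(version_info=sys.version_info):
    """Return True if the interpreter is Python 3.6.x or 3.7.x.

    The range is an assumption copied from this notebook's introduction.
    """
    major, minor = version_info[0], version_info[1]
    return major == 3 and minor in (6, 7)


if can_run_tf_1_15():
    print("This interpreter is in the range stated for TensorFlow 1.15.")
else:
    print("This interpreter is outside the range stated for TensorFlow 1.15.")
```

On a current Colab runtime this is expected to take the second branch, which is why the TF 1.15 steps live in part 2.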

Now let's check the TensorFlow version
As of June 07, 2023, this is 2.12.0.  For this notebook, that is fine.

In [None]:
import tensorflow as tf
print(tf.__version__)

Let's now import some packages, define our path, and tend to some setup issues.

In [None]:
sys.path.append("/content/tensorflow/tensorflow/examples/speech_commands/")
import input_data
import models
import numpy as np
import glob
import os
import re
import shutil
from google.colab import files
!pip install ffmpeg-python &> /dev/null
#  We no longer need xxd in part 1 of the lab, but we will need it in part 2
#!apt-get update && apt-get -qq install xxd

### Import your Custom Dataset
First we are going to download Pete's dataset to use as a base set of "other words" and "background noise" that you can build on top of for your dataset. We have found that doing this will make your model work a lot better, especially if you are training it with a small amount of custom data! We STRONGLY suggest you follow this approach as it will make a large impact on your results!

**Note: this *will* take a couple of minutes to run!**

In [None]:
!wget https://storage.googleapis.com/download.tensorflow.org/data/speech_commands_v0.02.tar.gz
DATASET_DIR =  'dataset/'
!mkdir dataset
!tar -xf speech_commands_v0.02.tar.gz -C 'dataset'
!rm -r -f speech_commands_v0.02.tar.gz

Now you'll need to upload all of the custom audio files that you recorded using the Open Speech Recording tool (aka the ```*.ogg``` files). **Note: you can select multiple files and upload them all at once!**

If you are having trouble uploading files because your internet bandwidth is too slow, feel free to skip this step. You can instead pick from the words in Pete's dataset, just as you did in the KWS assignment if you took Course 2.

Pete's dataset includes the following words:

Options for target words are (PICK FROM THIS LIST FOR BEST RESULTS): "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go", "backward", "forward", "follow", "learn"

Additional words that will be used to help train the "unknown" label are: "bed", "bird", "cat", "dog", "happy", "house", "marvin", "sheila", "tree", "wow"
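
If you plan to pick target words from the lists above, a quick check can catch typos before any training starts. This is a minimal sketch: the two lists are copied from the text above, while the helper name `check_wanted_words` and its `custom_words` parameter are hypothetical and not part of the original notebook.

```python
# Word lists copied from the text above.
TARGET_WORDS = [
    "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go",
    "backward", "forward", "follow", "learn",
]
UNKNOWN_WORDS = [
    "bed", "bird", "cat", "dog", "happy", "house", "marvin", "sheila",
    "tree", "wow",
]


def check_wanted_words(wanted, custom_words=()):
    """Split `wanted` into words that are available and words that are not.

    `custom_words` are words you recorded yourself (hypothetical parameter).
    """
    allowed = set(TARGET_WORDS) | {w.lower() for w in custom_words}
    cleaned = [w.strip().lower() for w in wanted if w.strip()]
    known = [w for w in cleaned if w in allowed]
    unrecognized = [w for w in cleaned if w not in allowed]
    return known, unrecognized


known, unrecognized = check_wanted_words(["Yes", "no", "hello"])
print("Available:", known)
print("Not in Pete's target list (record these yourself):", unrecognized)
```

Passing `custom_words=["hello"]` would move "hello" into the available list once you have uploaded recordings for it.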



In [None]:
uploaded = files.upload()

Then we can convert them into correctly trimmed WAV files and then store them in the appropriate folders in the ```DATASET_DIR```.
We will use Pete's extract_loudest_section tool which you can find more documentation about here: https://github.com/petewarden/extract_loudest_section

In [None]:
# convert the ogg files to wavs
!mkdir wavs
!find *.ogg -print0 | xargs -0 basename -s .ogg | xargs -I {} ffmpeg -i {}.ogg -ar 16000 wavs/{}.wav
!rm -r -f *.ogg

# then use pete's tool to only extract 1 second clips from them for use with the KWS pipeline
!mkdir trimmed_wavs
!git clone https://github.com/petewarden/extract_loudest_section.git
!make -C extract_loudest_section/
!/tmp/extract_loudest_section/gen/bin/extract_loudest_section 'wavs/*.wav' trimmed_wavs/
!rm -r -f wavs

In [None]:
# Store them in the appropriate folders
data_index = {}
os.chdir('trimmed_wavs')
search_path = os.path.join('*.wav')
for wav_path in glob.glob(search_path):
    original_wav_path = wav_path
    parts = wav_path.split('_')
    if len(parts) > 2:
        wav_path = parts[0] + '_' + ''.join(parts[1:])
    matches = re.search(r'([^/_]+)_([^/_]+)\.wav', wav_path)
    if not matches:
        raise Exception('File name not in a recognized form:"%s"' % wav_path)
    word = matches.group(1).lower()
    instance = matches.group(2).lower()
    if word not in data_index:
        data_index[word] = {}
    if instance in data_index[word]:
        raise Exception('Audio instance already seen:"%s"' % wav_path)
    data_index[word][instance] = original_wav_path

output_dir = os.path.join('..', 'dataset')
# Create the dataset directory if it does not exist yet
os.makedirs(output_dir, exist_ok=True)
for word in data_index:
    word_dir = os.path.join(output_dir, word)
    try:
        os.mkdir(word_dir)
        print('Created dir: ' + word_dir)
    except FileExistsError:
        print('Storing in existing dir: ' + word_dir)
    for instance in data_index[word]:
        wav_path = data_index[word][instance]
        output_path = os.path.join(word_dir, instance + '.wav')
        shutil.copyfile(wav_path, output_path)
os.chdir('..')
!rm -r -f trimmed_wavs
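
The cell above sorts clips by file name: everything before the first underscore becomes the word (and the folder name), and the rest becomes the instance. The sketch below isolates that parsing rule so you can try it on a name before uploading. The function name `parse_clip_name` and the sample file names are hypothetical; the logic mirrors the cell above.

```python
import re


def parse_clip_name(wav_path):
    """Return (word, instance) using the same rules as the cell above."""
    parts = wav_path.split("_")
    if len(parts) > 2:
        # Keep only the first underscore and merge everything after it.
        wav_path = parts[0] + "_" + "".join(parts[1:])
    match = re.search(r"([^/_]+)_([^/_]+)\.wav", wav_path)
    if not match:
        raise ValueError('File name not in a recognized form: "%s"' % wav_path)
    return match.group(1).lower(), match.group(2).lower()


# Made-up file names, for illustration only.
for name in ["Hello_abc123.wav", "hello_nohash_7.wav"]:
    print(name, "->", parse_clip_name(name))
```

A name with no underscore, such as `hello.wav`, does not match the pattern and raises an error, so each clip needs the `word_instance.wav` shape.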

Due to the revisions to the lab, this next step is no longer optional.  Please zip up your processed custom dataset.  Download it from the Files area by clicking on the three dots next to the zip file. Note: this command will take at least 3 minutes to run as the combination of your data and Pete's dataset will be relatively large!

We STRONGLY suggest you use Pete's dataset for any actual training you do!

Finally if you have room in your Google Drive and would like to load and store your dataset from there in the future you can mount your Google Drive in Colab [as described in this blog post](https://towardsdatascience.com/downloading-datasets-into-google-drive-via-google-colab-bcb1b30b0166).

In [None]:
!zip -r myKWSDataset.zip dataset

After the zip is performed, you may see a file with a series of letters, but no file labeled myKWSDataset.zip.  Right click in the sidebar and give it a quick refresh and the zipped file should appear.
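
Before starting a long download, it can be worth confirming that the archive actually contains your custom words. The following is a minimal sketch using Python's standard `zipfile` module; the function name `count_clips_per_word` is hypothetical, and it assumes entries are laid out as `dataset/<word>/<clip>.wav`, as produced by the cells above. The demo runs on a tiny in-memory archive with made-up entries.

```python
import io
import zipfile
from collections import Counter


def count_clips_per_word(zip_source):
    """Count .wav entries per word folder in a dataset zip.

    `zip_source` may be a path or a file-like object.
    """
    counts = Counter()
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            parts = name.split("/")
            # Only count entries shaped like dataset/<word>/<clip>.wav
            if len(parts) == 3 and parts[2].endswith(".wav"):
                counts[parts[1]] += 1
    return counts


# Demo on a small in-memory archive (made-up entries).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("dataset/yes/a.wav", b"")
    demo.writestr("dataset/yes/b.wav", b"")
    demo.writestr("dataset/hello/c.wav", b"")
    demo.writestr("dataset/README.md", b"")
buffer.seek(0)
demo_counts = count_clips_per_word(buffer)
print(dict(demo_counts))
```

On the real archive you would call `count_clips_per_word("myKWSDataset.zip")` and look for your own word among the keys.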

Occasionally, downloads from Colab are very slow and the system will time out before a 2 GB+ file can be successfully downloaded.  In those cases, try the cell below to form multiple separate zip files.  You can then right-click to download all files in parallel!  (This technique saved me a lot of frustration!)

Well done!  Once the file is successfully downloaded to your PC, part1 of the lab is complete.  Proceed to 4-6-8-CustomDatasetKWSModel-rev4-part2.ipynb.

For more information, check out my GitHub repository <a href="https://github.com/john-mangiaracina/TinyML-CustomKeywordSpotting">
TinyML-CustomKeywordSpotting</a>.

In [None]:
#  Forms multiple zip files for parallel download
#  Use if the single-file download is too slow or times out

#!zip -s 500M new.zip myKWSDataset.zip
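
As an alternative to the `zip -s` command above, the same idea (several smaller downloads that are rejoined on the PC) can be sketched in plain Python. This is not the approach used in this lab, only an illustration under stated assumptions: the helper names `split_file` and `join_files` and the `.partNNN` naming are hypothetical, and each chunk is read fully into memory, so keep the chunk size well below the available RAM.

```python
import os
import tempfile


def split_file(path, chunk_bytes):
    """Write path.part000, path.part001, ... and return the part paths."""
    part_paths = []
    with open(path, "rb") as source:
        index = 0
        while True:
            chunk = source.read(chunk_bytes)
            if not chunk:
                break
            part_path = "%s.part%03d" % (path, index)
            with open(part_path, "wb") as part:
                part.write(chunk)
            part_paths.append(part_path)
            index += 1
    return part_paths


def join_files(part_paths, output_path):
    """Concatenate the parts, in the order given, into output_path."""
    with open(output_path, "wb") as output:
        for part_path in part_paths:
            with open(part_path, "rb") as part:
                output.write(part.read())


# Round-trip demo on a small temporary file (2560 bytes, 1000-byte chunks).
with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "demo.bin")
    with open(original, "wb") as f:
        f.write(bytes(range(256)) * 10)
    parts = split_file(original, 1000)
    rebuilt = os.path.join(workdir, "rebuilt.bin")
    join_files(parts, rebuilt)
    with open(original, "rb") as a, open(rebuilt, "rb") as b:
        round_trip_ok = a.read() == b.read()
    part_count = len(parts)

print(part_count, round_trip_ok)
```

The parts must be joined in their numbered order; sorting the part names alphabetically preserves that order because of the zero-padded suffix.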

In [None]:
#  Optional

#  I experimented with this code so Colab wouldn't prematurely time me out

import time

print("I'm here")

for num in range(1, 200):
  time.sleep(60)
  print("Colab, I'm still here", num)