# Train an Audio Classifier for the Azure Sphere

Notebook authored by Jeremy Webb

This notebook walks through how to set up and train an audio classifier model that runs on the Microsoft Azure Sphere using the Embedded Learning Library.

**To train an audio classifier, read and run each cell below by clicking on the play button on the left side.**

## License

MIT License

Copyright (c) 2019 Jeremy Webb

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

# Set Up and Install Prerequisites

To train a machine learning model that runs on the Azure Sphere, this notebook uses the Embedded Learning Library (ELL). Microsoft developed ELL specifically for running machine learning models on resource-constrained systems.



Start by cloning the library. Note that v3.0.2 is specified because that version was tested with the following code.

In [0]:
ell_dir = "/content/ELL"
ell_scripts_dir = ell_dir + "/tools/utilities/pythonlibs/audio/training/"
!git clone --branch v3.0.2 https://github.com/Microsoft/ELL.git {ell_dir}

Install ELL prerequisites.

In [0]:
!sh -c 'echo deb http://apt.llvm.org/bionic/ llvm-toolchain-bionic-8 main >> /etc/apt/sources.list'
!sh -c 'echo deb-src http://apt.llvm.org/bionic/ llvm-toolchain-bionic-8 main >> /etc/apt/sources.list'
!wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key | apt-key add -
!apt-get -y update
!apt-get install -y gcc-8 g++-8 cmake libedit-dev zlibc zlib1g zlib1g-dev make
!apt-get install -y libopenblas-dev doxygen
!apt-get install -y python-pyaudio python3-pyaudio
!apt-get install llvm-8 -y
!pip install onnx

Install SWIG, which is used to generate Python to C++ interfaces for ELL.

In [0]:
!curl -O --location http://prdownloads.sourceforge.net/swig/swig-4.0.0.tar.gz
!tar zxvf swig-4.0.0.tar.gz
%cd swig-4.0.0
!./configure --without-pcre && make && make install
%cd /content

Build ELL and generate the Python libraries. This takes a while to run.

In [0]:
ell_build_dir = ell_dir + "/build"
!mkdir {ell_build_dir}
%cd {ell_build_dir}
!cmake ..
!make
!make _ELL_python
%cd /content/

# Set Up the Data

Set up paths and directories for the training data.

In [0]:
from pathlib import Path

base_dir = Path("/content/security/")
train_data_dir = base_dir / "audio/"
train_dir = base_dir / "models/"
drive_dir = Path("/content/drive/My Drive/Colab Notebooks/")
drive_data_dir = drive_dir / "Data" / "azuresphere"
!mkdir -p "{train_data_dir}"
!mkdir -p "{train_dir}"

Mount your Google Drive to load or store training data. This step is not necessary if you are not planning on storing training data in Google Drive.

In [0]:
from pathlib import Path
import zipfile
from google.colab import drive

audio_data_file = 'security_sounds.zip'
audio_data_path = drive_data_dir / audio_data_file
train_data_found = False

drive.mount('/content/drive')
if not drive_dir.exists():
  print("Note: Create 'Colab Notebooks' directory in drive root to proceed.")
elif not drive_data_dir.exists():
  print("Note: Create 'azuresphere' directory inside 'Colab Notebooks'" \
    " to proceed.")
if not audio_data_path.exists():
  print("Warning: Compressed training data not found in Drive."
    " Download the data in the next step.")
else:
  with zipfile.ZipFile(audio_data_path, 'r') as zip_ref:
    zip_ref.extractall(train_data_dir)
    train_data_found = True

## Download Training Data

Training data is built from a variety of samples collected from Freesound.org, which hosts a large collection of recorded sounds across many categories.

You will need to [register for an account](https://freesound.org/home/register/) and then [request an API access key](https://freesound.org/apiv2/apply). Copy and paste the access key into the "api_key" variable below.

If you wish to use different data than the defaults, edit the "labels_to_download" dictionary: add a key naming the training category, paired with a list of one or more FreesoundQuery objects as the value. A FreesoundQuery takes a string of words to use for the query and a space-separated string of tags to use to filter the query. The tags can be empty. A good way to figure out what query to use is to go to the [Freesound website](https://freesound.org) and try various queries in the search box. You can examine the results to see if the returned samples fit your intended category. For best results, ensure that there are at least 400 samples for each category; more samples will generally result in a more accurate model.
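As an illustration, a new category could be added as shown in the sketch below. The `dog_bark` label and its query strings are hypothetical and only show the shape of the dictionary; `FreesoundQuery` is repeated from the download cell so the snippet runs on its own.

```python
class FreesoundQuery:
    """Same container as in the download cell: query words, tags, segment flag."""
    def __init__(self, query='', tags='', segment=False):
        self.query = query
        self.tags = tags
        self.segment = segment

labels_to_download = {
    'gunshot': [FreesoundQuery('gunshot')],
    # hypothetical extra category built from two separate queries
    'dog_bark': [FreesoundQuery('dog bark', 'dog bark'),
                 FreesoundQuery('barking', 'dog')],
}

# show which queries feed each training category
for label, queries in labels_to_download.items():
    print(label, [(q.query, q.tags) for q in queries])
```

Each category's download budget is shared between its queries, so a category with two queries takes roughly half of its samples from each.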

Note that machine learning models are only as good as the data they are trained on. Ideally, each sample would contain only the sounds you want to train on, with no other noises or extra silence. However, because almost anyone can upload audio samples to Freesound, there is no guarantee that mislabeled or poor samples are not mixed in with the data collected by the script below. To get top-quality results, the data should be cleaned and checked for accuracy. This takes a lot of effort, but the good news is that decent results can be obtained without doing this.
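If you do want a first pass at cleaning, one possible starting point is to flag clips that are almost silent. The sketch below is not part of the original workflow: the helper names and the RMS threshold of 100 are assumptions chosen for illustration, and it expects the 16-bit mono WAV files that the download script produces.

```python
import array
import math
import sys
import wave
from pathlib import Path

def wav_rms(path):
    """Return the RMS amplitude of a 16-bit PCM WAV file (0.0 if it is empty)."""
    with wave.open(str(path), 'rb') as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        samples = array.array('h', wav.readframes(wav.getnframes()))
    if sys.byteorder == 'big':
        samples.byteswap()  # WAV sample data is stored little-endian
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def find_quiet_files(wav_dir, rms_threshold=100.0):
    """List WAV files under wav_dir whose RMS is below the (assumed) threshold."""
    return sorted(p for p in Path(wav_dir).rglob('*.wav')
                  if wav_rms(p) < rms_threshold)
```

In this notebook it could be called as `find_quiet_files(train_data_dir / "window_break")`. Quiet clips are expected in a category such as `background_noise`, so the result is a list to review by ear, not a list to delete.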

The script below will process each FreesoundQuery and download the resulting samples, saving them in a directory specified by the dictionary key name. It is not necessary to run the cell below if the training data has been loaded from Google Drive in the previous step.

In [0]:
# download dataset from Freesound since it wasn't found in previous step
if not train_data_found:
  import math
  from pathlib import Path
  import subprocess
  import json
  import requests

  api_key = "XXXXXXXXXXXXXXXXXXXXXXXXXX"
  max_download = 600  # max number of samples per category to download

  class FreesoundQuery:
    def __init__(self, query='', tags='', segment=False):
      self.query = query
      self.tags = tags
      self.segment = segment

  labels_to_download = {
    'window_break': [FreesoundQuery('shatter', 'glass'),
                      FreesoundQuery('window breaking', 'glass window')],
    #'door_break': [FreesoundQuery('wood break', 'wood Wood')],
    'gunshot': [FreesoundQuery('gunshot')],
    'background_noise': [FreesoundQuery('silence', segment=True), 
                        FreesoundQuery('background-noise indoor', segment=True)
                          ]
  }


  audio_codec = 'pcm_s16le'
  audio_container = '.wav'

  page_size = 150
  search_url = 'http://freesound.org/apiv2/search/text/?' \
              '&query={query}&token={api_token}&tag={tag}' \
              '&page={page}&page_size={page_size}' \
              '&filter=license:("Attribution" OR "Creative Commons 0")' \
              '&filter=duration:[* TO 30]' \
              '&fields=id,name,previews,username'
  info_url = 'http://freesound.org/apiv2/sounds/{sound_id}/?&token={api_token}'
  download_url = '{url}?&token={api_token}'

  # get list of all audio clips that match the query
  def get_audio_list(query, tags, max_files):
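    max_pages = math.ceil(max_files / page_size)
    page_count = 1
    url_list = []
    # tags is already a space-separated string, so normalize its whitespace
    # instead of joining its individual characters
    tag = ' '.join(tags.split())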
    url = search_url.format(query=query, api_token=api_key,
                            tag=tag, page=1, page_size=page_size)
    response = requests.get(url)
    response.raise_for_status()
    if(response.status_code >= 300 or len(response.json()) == 0):
      # when the page is invalid, a 300 redirect is returned
      # we don't want to do anymore processing after this
      print(f"Error encountered processing query: {query}")
      return None
    data = response.json()
    url_list.extend(data['results'])
    while data['next'] and page_count < max_pages:
      page_count += 1
      url = data['next'] + f'&token={api_key}'
      response = requests.get(url)
      response.raise_for_status()
      if(response.status_code >= 300 or len(response.json()) == 0):
        # when the page is invalid, a 300 redirect is returned
        # we don't want to do anymore processing after this
        print(f"Error encountered processing query: {query}")
        return None
      data = response.json()
      url_list.extend(data['results'])
    return url_list

  def download_audio(audio_data, label, store_dir, segment=False):
    store_dir.mkdir(parents=True, exist_ok=True)
    label = label.replace(' ', '_')
    sample_name = ""
    if segment:
      sample_name = "%d"
    audio_filename = str(audio_data['id']) + "_" + label \
    + sample_name + audio_container
    audio_filepath = store_dir / audio_filename

    audio_url = download_url.format(
        url=audio_data['previews']['preview-hq-ogg'],
        api_token=api_key
        )
    if segment:
      if Path(str(audio_filepath) % 0).exists():
        # file already downloaded
        return True
      audio_dl_cmd = ['ffmpeg', '-n',
          #'-ss', str(sound_start),  # The beginning of the trim window
          '-i', audio_url,          # audio URL
          #'-t', str(duration),      # total duration of the segment
          #'-vn',                    # suppress the video stream
          '-ac', '1',               # set the number of channels
          '-sample_fmt', 's16',     # bit depth
          '-acodec', audio_codec,   # output encoding
          '-ar', '16000',           # audio sample rate
          '-threads', '1',
          '-f', 'segment',
          '-segment_time', '5',     # split the audio into 5 second chunks
          str(audio_filepath)
          ]
    else:
      if audio_filepath.exists():
        # no need to download the file again
        return True
      audio_dl_cmd = ['ffmpeg', '-n',
          '-i', audio_url,          # audio URL
          '-ac', '1',               # set the number of channels
          '-sample_fmt', 's16',     # bit depth
          '-acodec', audio_codec,   # output encoding
          '-ar', '16000',           # audio sample rate
          '-threads', '1',
          str(audio_filepath)
          ]

    attribution = f"Downloading sample \"{audio_data['name']}\" " \
    f"provided by {audio_data['username']} from freesound.org"
    print(attribution)
    result = subprocess.run(audio_dl_cmd)
    return result.returncode == 0

  def download_dataset(labels_dict):
    for label, queries in labels_dict.items():
      max_files = max_download / len(queries)
      for query in queries:
        data_list = get_audio_list(query.query, query.tags, max_files)
        if data_list is None:
          # the query failed, so skip it and move on to the next one
          continue
        for data in data_list:
          result = download_audio(
              data, label, Path(train_data_dir / label), query.segment
              )
          if not result:
            print(f"Error downloading: {data['name']}")

  download_dataset(labels_to_download)
  print("All done!")

## Store Training Data

It's recommended to compress and store the training data in Google Drive if you plan on using this notebook again in the future. This will allow you to skip downloading the data again.

In [0]:
import shutil
import tempfile
from google.colab import files

# delete zip file in drive if it already exists
if audio_data_path.exists():
  audio_data_path.unlink()
# compress train data into a zip file and store in drive
with tempfile.TemporaryDirectory() as tempdir:
  zipf = tempdir + "/" + audio_data_path.stem
  shutil.make_archive(zipf, 'zip', train_data_dir)
  shutil.move(zipf + ".zip", drive_data_dir)

## Prepare Data for Training

Select the training, validation, and testing data by randomly choosing 64 samples from each category for the testing list and another 64 for the validation list; the rest are used for training. Note that simple random selection with replacement is used, so a sample can appear more than once in a list or in both lists. This is unlikely to have a large effect on training or testing, as long as the amount of data is large enough.

The cell below generates the validation and testing data lists.

In [0]:
import os
from pathlib import Path
import random

test_list = []
val_list = []
num_to_select = 64
for (root,dirs,files) in os.walk(train_data_dir):
  category_dir = Path(root).relative_to(train_data_dir)
  if len(files) > 0 and str(category_dir) != ".":
    for i in range(num_to_select):
      # random.choice can pick the same file more than once, but the
      # impact won't be that bad
      choice = random.choice(files)
      test_list.append(str(category_dir) + "/" + choice)
      choice = random.choice(files)
      val_list.append(str(category_dir) + "/" + choice)
with open(train_data_dir / 'testing_list.txt', 'w') as filehandle:
    for listitem in test_list:
        filehandle.write('%s\n' % listitem)
with open(train_data_dir / 'validation_list.txt', 'w') as filehandle:
    for listitem in val_list:
        filehandle.write('%s\n' % listitem)
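If duplicate or overlapping entries are a concern, one alternative (not what the cell above does) is to draw both lists from a single `random.sample` call, which selects without replacement. The helper name `split_without_replacement` and the stand-in file names are made up for this sketch.

```python
import random

def split_without_replacement(files, num_to_select, seed=None):
    """Pick disjoint test and validation lists from one category's files."""
    rng = random.Random(seed)
    if len(files) < 2 * num_to_select:
        raise ValueError("not enough files for disjoint test and validation lists")
    # one draw without replacement, then cut the result in half
    picked = rng.sample(list(files), 2 * num_to_select)
    return picked[:num_to_select], picked[num_to_select:]

files = [f"{i}_gunshot.wav" for i in range(200)]  # stand-in file names
test_files, val_files = split_without_replacement(files, 64, seed=0)
# sizes of both lists and the number of files they share
print(len(test_files), len(val_files), len(set(test_files) & set(val_files)))  # prints: 64 64 0
```

Passing a fixed `seed` makes the split reproducible between runs, which helps when comparing models trained on the same data.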

Generate the training data list and limit each category to num_files_to_use files. This variable keeps the training dataset balanced, with the same number of files per category. If you have more samples in each category, increase this setting.

In [0]:
import shutil

num_files_to_use = 350 #@param {type: "number"}

%cd {train_dir}
!python {ell_scripts_dir}make_training_list.py \
          --wav_files {train_data_dir} \
          --max_files_per_directory {num_files_to_use}
%cd /content/
shutil.copy(train_data_dir / "categories.txt", train_dir)
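To pick a sensible value for num_files_to_use, it can help to count how many WAV files each category actually contains. The sketch below assumes one sub-directory per category, as created by the download script; the function name and the temporary demo directory are hypothetical.

```python
import tempfile
from pathlib import Path

def count_wavs_per_category(data_dir):
    """Map each category sub-directory name to the number of .wav files in it."""
    return {d.name: sum(1 for _ in d.glob('*.wav'))
            for d in sorted(Path(data_dir).iterdir()) if d.is_dir()}

# demo on a throwaway directory tree standing in for the training data
with tempfile.TemporaryDirectory() as demo_dir:
    for name, count in (('gunshot', 3), ('window_break', 5)):
        (Path(demo_dir) / name).mkdir()
        for i in range(count):
            (Path(demo_dir) / name / f"{i}_{name}.wav").touch()
    counts = count_wavs_per_category(demo_dir)

print(counts)
# the smallest category limits how many files a balanced set can use
print("smallest category:", min(counts.values()))
```

In this notebook the call would be `count_wavs_per_category(train_data_dir)`. The smallest count is an upper bound for a balanced setting.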

# Prepare Audio Classifier Model

In this section, the audio classifier model will be set up for training. First you'll create and compile a featurizer, and then apply that featurizer to the training, validation, and testing data. Once the data has been preprocessed, the model is ready to be trained.

## Create Featurizer

Create and compile the featurizer by running the cells below.

The featurizer takes in a number of audio samples, in this case 512, and essentially creates a fingerprint for that set of samples. This fingerprint can be learned and classified by a machine learning model. It can also make the model run faster because it reduces the dimensionality of the input data for the model. For this classifier, mel-frequency cepstral coefficients (MFCCs) will be used. You can read about [MFCCs on Wikipedia](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum).
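To make the idea of a fingerprint concrete, here is a dependency-free sketch that turns one 512-sample frame into 80 log mel filterbank values. It is an illustration only and is not ELL's implementation: the mel formula, the triangular filter shape, the lack of a window function, and all function names are assumptions, and the naive DFT is far slower than a real FFT.

```python
import cmath
import math

SAMPLE_RATE = 16000
FRAME_SIZE = 512
NUM_FILTERS = 80

def hz_to_mel(hz):
    # one common mel formula (assumed here; other variants exist)
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (slow but dependency-free)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))) ** 2
            for k in range(n // 2 + 1)]

def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    max_mel = hz_to_mel(sample_rate / 2)
    # num_filters + 2 edge points, expressed as fractional FFT bin positions
    edges = [mel_to_hz(max_mel * i / (num_filters + 1)) * n_fft / sample_rate
             for i in range(num_filters + 2)]
    bank = []
    for m in range(1, num_filters + 1):
        left, center, right = edges[m - 1], edges[m], edges[m + 1]
        weights = []
        for b in range(n_fft // 2 + 1):
            if left < b <= center:
                weights.append((b - left) / (center - left))    # rising slope
            elif center < b < right:
                weights.append((right - b) / (right - center))  # falling slope
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank

def log_mel_features(frame, bank):
    """Weight the power spectrum by each filter and take the log."""
    spectrum = power_spectrum(frame)
    return [math.log(sum(w * p for w, p in zip(weights, spectrum)) + 1e-10)
            for weights in bank]

# a 1 kHz test tone stands in for one frame of real audio
frame = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(FRAME_SIZE)]
bank = mel_filterbank(NUM_FILTERS, FRAME_SIZE, SAMPLE_RATE)
features = log_mel_features(frame, bank)
print(len(frame), "samples ->", len(features), "features")
```

The dimensionality reduction mentioned above is visible in the sizes: 512 raw samples go in and 80 values come out, with the largest values in the filters that cover the tone's frequency.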

In [0]:
%cd {train_dir}
!python {ell_scripts_dir}make_featurizer.py --sample_rate 16000 \
          --window_size 512 \
          --input_buffer_size 512 \
          --filterbank_type mel \
          --filterbank_size 80 \
          --filterbank_nfft 512 --nfft 512 --log \
          --auto_scale
!python {ell_dir}/tools/wrap/wrap.py --model_file featurizer.ell \
  --outdir compiled_featurizer --module_name mfcc
!{ell_dir}/build/bin/print -imap featurizer.ell
%cd /content/

In [0]:
%cd {train_dir}
!mkdir compiled_featurizer/build
!cd compiled_featurizer/build && cmake .. && make
%cd /content/

## Create Features from Data

Using the featurizer created above, transform all the training, validation, and testing data into a set of features that can be fed into the machine learning model.

In [0]:
%cd {train_dir}
!python {ell_scripts_dir}make_dataset.py \
          --list_file {train_data_dir / "training_list.txt"} \
          --featurizer compiled_featurizer/mfcc \
          --window_size 40 --shift 40
!python {ell_scripts_dir}make_dataset.py \
          --list_file {train_data_dir / "validation_list.txt"} \
          --featurizer compiled_featurizer/mfcc \
          --window_size 40 --shift 40
!python {ell_scripts_dir}make_dataset.py \
          --list_file {train_data_dir / "testing_list.txt"} \
          --featurizer compiled_featurizer/mfcc \
          --window_size 40 --shift 40
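The numbers used in these commands translate into durations and shapes as sketched below. This assumes that `--window_size` and `--shift` in make_dataset.py are counted in featurizer frames, which is how this sketch reads the flags; check the script's help text if in doubt.

```python
sample_rate = 16000      # Hz, from make_featurizer.py --sample_rate
frame_size = 512         # samples per featurizer frame (--input_buffer_size)
filterbank_size = 80     # features produced per frame (--filterbank_size)
frames_per_window = 40   # assumed meaning of make_dataset.py --window_size
shift = 40               # assumed meaning of --shift, in frames

frame_ms = 1000 * frame_size / sample_rate
window_s = frames_per_window * frame_size / sample_rate
overlap = max(0, frames_per_window - shift)  # frames shared by adjacent windows

print(f"{frame_ms:.0f} ms per frame, {window_s:.2f} s per window, "
      f"{frames_per_window} x {filterbank_size} features, overlap {overlap} frames")
# prints: 32 ms per frame, 1.28 s per window, 40 x 80 features, overlap 0 frames
```

Under that reading, a shift equal to the window size means consecutive windows do not overlap. A smaller shift would produce more, partially overlapping, training rows from the same audio.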

# Train the Model

Create and train the machine learning model on the featurized data. The number of neurons in the hidden layers can be adjusted by changing the hidden_units number. If you have enough data, more units will allow the model to more accurately learn to classify the categories, but will result in a larger model. If the model is too large, it will not be able to run on the Azure Sphere. A larger model is also easier to overfit.

The number of training iterations can be adjusted by setting the epochs value. A higher number of epochs will teach the classifier to model the training data more accurately, but too many epochs will cause the model to overfit.
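To get a feel for how hidden_units affects model size, the sketch below estimates the parameter count of a single GRU layer followed by a linear output layer. The single-layer architecture, the PyTorch-style weight layout, the 80 input features, and the three output categories are all assumptions made for illustration; the model built by train_classifier.py may be structured differently.

```python
def gru_layer_params(input_size, hidden_units):
    """Parameters in one GRU layer: three gates, each with input weights,
    recurrent weights, and two bias vectors (PyTorch-style layout)."""
    return 3 * (hidden_units * (input_size + hidden_units) + 2 * hidden_units)

def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected output layer."""
    return in_features * out_features + out_features

def estimate_size(hidden_units, input_size=80, num_classes=3, bytes_per_param=4):
    """Return (parameter count, size in bytes) for the assumed architecture."""
    params = (gru_layer_params(input_size, hidden_units)
              + linear_params(hidden_units, num_classes))
    return params, params * bytes_per_param

for hidden in (55, 110, 220):
    params, size = estimate_size(hidden)
    print(f"hidden_units={hidden}: {params} parameters, "
          f"about {size / 1024:.0f} KiB as 32-bit floats")
```

The recurrent weights grow with the square of hidden_units, so doubling the setting more than doubles the model size under these assumptions.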

In [0]:
!python {ell_scripts_dir}train_classifier.py \
          --architecture GRU --use_gpu \
          --dataset {train_data_dir} \
          --categories {train_dir / "categories.txt"} \
          --outdir {train_dir} \
          --filename classifier \
          --hidden_units 110 \
          --epochs 30 \
          --normalize

# Prepare Model for Running on Azure Sphere

To adjust the audio classifier so it runs on the Azure Sphere, it will first be converted to the ELL format. The ELL model can then be tested against the testing data to compare the accuracy of the ELL model against the original. Finally, it will be compiled for running on the Azure Sphere.

## Convert ONNX Model to ELL Format

Convert the ONNX model generated in the previous step to an ELL model.

In [0]:
!python {ell_dir}/tools/importers/onnx/onnx_import.py classifier.onnx

## Test ELL Model

Compile the ELL model for testing.

In [0]:
!python {ell_dir}/tools/wrap/wrap.py --model_file classifier.ell --outdir SafeSound --module_name model
!mkdir SafeSound/build
!cd SafeSound/build && cmake .. && make

Run the tests for the ELL model below. You can compare the test accuracy to the training and testing accuracy shown in [Train the Model](#scrollTo=fIAvIouZqOCQ).

In [0]:
!python {ell_scripts_dir}test_ell_model.py \
 --classifier {train_dir / "SafeSound" / "model"} \
 --featurizer {train_dir / "compiled_featurizer" / "mfcc"} \
 --sample_rate 16000 --list_file {train_data_dir / "testing_list.txt"} \
 --categories {train_data_dir / "categories.txt"} --reset --auto_scale

## Compile ELL File for Use on Azure Sphere

In [0]:
!{ell_dir}/build/bin/compile -imap featurizer.ell -cfn Filter -cmn mfcc \
 --bitcode -od . --fuseLinearOps true --header --blas false --optimize true \
 --target custom --numBits 32 --cpu cortex-a7 --triple armv7--linux-gnueabihf --features +vfp4,+d16
!/usr/lib/llvm-8/bin/opt featurizer.bc -o featurizer.opt.bc -O3
!/usr/lib/llvm-8/bin/llc featurizer.opt.bc -o featurizer.o -O3 -filetype=obj \
 -mtriple=armv7--linux-gnueabihf -mcpu=cortex-a7 -relocation-model=pic -float-abi=hard -mattr=+vfp4,+d16
!{ell_dir}/build/bin/compile -imap classifier.ell -cfn Predict -cmn model \
 --bitcode -od . --fuseLinearOps true --header --blas false --optimize true \
 --target custom --numBits 32 --cpu cortex-a7 --triple armv7--linux-gnueabihf --features +vfp4,+d16
!/usr/lib/llvm-8/bin/opt classifier.bc -o classifier.opt.bc -O3
!/usr/lib/llvm-8/bin/llc classifier.opt.bc -o classifier.o -O3 -filetype=obj \
 -mtriple=armv7--linux-gnueabihf -mcpu=cortex-a7 -relocation-model=pic -float-abi=hard -mattr=+vfp4,+d16

## Download Classifier and Test Audio

Download the ELL classifier, featurizer, and associated header files below. Unzip the download and place the files in your Azure Sphere project.

If you wish, you can also specify and download an audio sample to test on the Azure Sphere. A header file containing the raw audio data will be generated. Note that the audio sample will be truncated to about one second to reduce the file size so it can be run on the Sphere.
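The truncation can be pictured as cutting the sample into fixed-size frames and zero-padding the last one. The sketch below uses the same frame size (512) and row limit (35) as the cell that generates the header; the helper name `frame_samples` is hypothetical and is not used by that cell.

```python
def frame_samples(samples, frame_size=512, max_rows=35):
    """Truncate to max_rows frames and zero-pad the final frame to full length."""
    kept = list(samples[:frame_size * max_rows])
    kept += [0] * (-len(kept) % frame_size)  # pad up to a multiple of frame_size
    return [kept[i:i + frame_size] for i in range(0, len(kept), frame_size)]

sample_rate = 16000
print(35 * 512 / sample_rate, "seconds at most")  # prints: 1.12 seconds at most

rows = frame_samples(list(range(1000)))           # 1000 stand-in sample values
print(len(rows), len(rows[-1]), rows[-1][-1])     # prints: 2 512 0
```

At 16 kHz, 35 frames of 512 samples hold 1.12 seconds of audio, which is where the "about one second" figure comes from.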

In [0]:
#@title Download Compiled Featurizer and Classifier
import tempfile
import time
import zipfile
from google.colab import files

zip_name = "audio_classifier.zip" #@param {type: "string"}

with tempfile.TemporaryDirectory() as tempdir:
  with zipfile.ZipFile(zip_name, "w", zipfile.ZIP_DEFLATED) as zipf:
      zipf.write(train_dir / "classifier.h", "classifier.h")
      zipf.write(train_dir / "classifier.o", "classifier.o")
      zipf.write(train_dir / "featurizer.h", "featurizer.h")
      zipf.write(train_dir / "featurizer.o", "featurizer.o")
  time.sleep(1)
  files.download(zip_name)

In [0]:
#@title Download Test Audio Sample {display-mode: "form"}
import tempfile
import time
from scipy.io import wavfile
from google.colab import files

category = "window_break" #@param ['background_noise', 'door_break', 'gunshot', 'window_break']
audio_filename = "452667_window_break.wav" #@param {type: "string"}
audio_file = train_data_dir / category / audio_filename
frame_size = 512
max_rows = 35
output_filename = category + ".h"
output_audio_filename = category + ".wav"

fs, data = wavfile.read(audio_file)
with tempfile.TemporaryDirectory() as tempdir:
  output_file = Path(tempdir) / output_filename
  output_audio = Path(tempdir) / output_audio_filename
  with open(output_file, mode='w') as f:
    f.write("short sample_wav_data[][AUDIO_FRAME_SIZE] = {\n")
    f.write("{")
    # keep at most max_rows full frames of audio
    num_samples = min(len(data), max_rows * frame_size)
    for i, point in enumerate(data[:num_samples]):
      if i % frame_size == 0 and i != 0:
        f.write(f"}},\n{{{point:.1f}, ")
      else:
        f.write(f"{point:.1f}, ")
    # zero-pad the final frame so every row has frame_size entries
    extraPaddingLength = -num_samples % frame_size
    for _ in range(extraPaddingLength):
      f.write("0.0, ")
    f.write("},\n};")
    wavfile.write(output_audio, fs, data[:num_samples])

  time.sleep(0.5)
  files.download(output_file)
  files.download(output_audio)