# Major assignment (BA885)

In this assignment you will solve a deep learning problem from scratch!

Your company accepts payments over the phone for its services and is now in the process of automating this task. As the sole data scientist on your team, your task is to build and train a voice recognition algorithm to record the customer's credit card number. You will do this in two stages:



1.   Stage 1 (Due April 15): Gather/crowdsource your own data and build a prototype algorithm that recognizes digits.

*   Collect audio samples with the help of your team (classmates). Each recording should meet the following specifications: duration 1 s, sample rate 16 kHz, and file name format `{label}_hash(count_name).wav` (see the example below). Try to collect at least 100 samples (10 samples per digit).

*   Build a simple model that takes the waveform (a 16,000-component vector) and classifies it as a digit in [0,...,9].

Until your own dataset is complete, you can use the "speech commands" dataset, but the final model should be trained only on the samples collected by the class.

You can use this notebook as a template for this assignment. Please share your thoughts on the following questions in your submission.

* What is the acceptable error rate for the model?

* What is the maximum accuracy you can reach with your limited number of samples (~3000)?

* Can you use data generation and transfer learning to improve the performance of your model?




2.   Stage 2 (Due May 6): TBA
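
The Stage 1 recording specification (1 s duration, 16 kHz sample rate) can be checked before a file is added to the shared dataset. The following is a minimal sketch using only the standard-library `wave` module; the function name `check_recording` and the 5% duration tolerance are assumptions made for illustration, not part of the assignment.

```python
import wave


def check_recording(path, expected_rate=16000, expected_seconds=1.0, tolerance=0.05):
    """Return True if the WAV file matches the expected sample rate and duration.

    `tolerance` (in seconds) is a hypothetical allowance for small length
    differences; adjust or set it to 0 for a strict check.
    """
    with wave.open(path, 'rb') as wav:
        rate = wav.getframerate()
        n_frames = wav.getnframes()
    duration = n_frames / rate
    return rate == expected_rate and abs(duration - expected_seconds) <= tolerance
```

A file that fails the check can be re-recorded or resampled before it is shared with the class.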


## Sample file names:



Use the following format when annotating the files:

`{label}_hash(count_name).wav`

Example: 4_297d828cd9bbfab3fd6a0ad5442e232b.wav

Here, label is the digit (0,...,9), count is the sample number, and name is your name. hash stands for a hash function; we will use the md5 function from hashlib.


In [3]:
import hashlib

In [20]:
hashlib.md5(b'012_Nima_Doroud').hexdigest()

'297d828cd9bbfab3fd6a0ad5442e232b'
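
The naming convention can be wrapped in a small helper so that every team member produces file names the same way. This is a sketch under assumptions: the helper name `make_filename` is hypothetical, and the three-digit zero padding of the count and the underscore-joined name are inferred from the single example above (`012_Nima_Doroud`), not stated as rules.

```python
import hashlib


def make_filename(label, count, name):
    """Build '{label}_{md5(count_name)}.wav' for one recording.

    Assumption: count is zero-padded to three digits, as in '012_Nima_Doroud'.
    """
    if label not in range(10):
        raise ValueError('label must be a digit between 0 and 9')
    key = '{:03d}_{}'.format(count, name)
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    return '{}_{}.wav'.format(label, digest)


print(make_filename(4, 12, 'Nima_Doroud'))
```

Because the label is the first character of the file name, it can be recovered later with `int(filename.split('_')[0])`.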

## Speech commands dataset

In [None]:
!pip install tensorflow_io



In [None]:
import os

from IPython import display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import glob
import shutil
import random
from collections import Counter

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_io as tfio
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import preprocessing
from sklearn.utils import shuffle

In [None]:
_ = tf.keras.utils.get_file('speech_commands.tar.gz',
                            'http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz',
                            cache_dir='./',
                            cache_subdir='datasets',
                            extract=True)

In [None]:
data = []
categories = []
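# Walk the extracted dataset; each subfolder name is the spoken-word label.
for folder, labels, samples in os.walk('./datasets/'):
    # Strip the 11-character './datasets/' prefix to get the label.
    label = folder[11:]
    if label:
        categories.append(label)
    for sample in samples:
        if sample.endswith('.wav'):
            data.append([os.path.join(folder, sample), label])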

In [None]:
data = np.array(data)

In [None]:
nums_list = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine']

In [None]:
num_samples = Counter(data[:,1])
for num in nums_list:
    print(num+' {}'.format(num_samples[num]))

zero 4052
one 3890
two 3880
three 3727
four 3728
five 4052
six 3860
seven 3998
eight 3787
nine 3934
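
The speech commands dataset also contains non-digit words, so a prototype digit classifier needs only the rows whose label is a digit word, with each word mapped to an integer class. A minimal sketch of that step follows; the helper `select_digits` and the sample rows are hypothetical, and the function is assumed to receive `[filepath, label]` pairs like the rows of `data`.

```python
nums_list = ['zero', 'one', 'two', 'three', 'four',
             'five', 'six', 'seven', 'eight', 'nine']

# Map each digit word to its integer class, e.g. 'seven' -> 7.
label_to_index = {name: index for index, name in enumerate(nums_list)}


def select_digits(rows):
    """Keep only rows with a digit-word label; return file paths and integer labels."""
    paths, labels = [], []
    for path, label in rows:
        if label in label_to_index:
            paths.append(path)
            labels.append(label_to_index[label])
    return paths, labels


# Hypothetical rows for illustration only.
example_rows = [['a.wav', 'zero'], ['b.wav', 'cat'], ['c.wav', 'seven']]
print(select_digits(example_rows))  # (['a.wav', 'c.wav'], [0, 7])
```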


## Some useful functions

In [None]:
def load_audio(filepath):
    # Import audio file as a tensor representing the waveform.
    audio = tfio.audio.AudioIOTensor(filepath)
    audio_rate = int(audio.rate)
    # Assert that the sample rate is 16 kHz (16,000 Hz).
    assert audio_rate == 16000
    # Return the waveform as a 1-dimensional float32 tensor, divided by 32767.0.
    return tf.cast(tf.squeeze(audio.to_tensor(), axis=[-1]), tf.float32) / 32767.0

def get_spectrogram(waveform):
    # Truncate the waveform to at most 16,000 samples (1 s at 16 kHz).
    input_len = 16000
    waveform = waveform[:input_len]
    # Zero-padding for an audio waveform with fewer than 16,000 samples.
    zero_padding = tf.zeros(
        [input_len] - tf.shape(waveform),
        dtype=tf.float32)
    # Cast the waveform tensors' dtype to float32.
    waveform = tf.cast(waveform, dtype=tf.float32)
    # Concatenate the waveform with `zero_padding`, which ensures all audio
    # clips are of the same length.
    equal_length = tf.concat([waveform, zero_padding], 0)
    # Convert the waveform to a spectrogram via a STFT.
    spectrogram = tf.signal.stft(
        equal_length, frame_length=255, frame_step=128)
    # Obtain the magnitude of the STFT.
    spectrogram = tf.abs(spectrogram)
    # Add a `channels` dimension, so that the spectrogram can be used
    # as image-like input data with convolution layers, which expect
    # shape (`batch_size`, `height`, `width`, `channels`).
    spectrogram = spectrogram[..., tf.newaxis]
    return spectrogram
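
The output shape of `get_spectrogram` follows from the STFT parameters, which is useful to know when sizing a model's input layer. The sketch below reproduces the arithmetic in plain Python, assuming the defaults of `tf.signal.stft` (no end padding, and an FFT length equal to the smallest power of two that is at least `frame_length`); the helper name `stft_output_shape` is hypothetical.

```python
def stft_output_shape(num_samples=16000, frame_length=255, frame_step=128):
    """Return (num_frames, num_frequency_bins) for an unpadded STFT."""
    # Number of complete frames that fit without padding the end of the signal.
    num_frames = 1 + (num_samples - frame_length) // frame_step
    # Smallest power of two that is >= frame_length.
    fft_length = 1
    while fft_length < frame_length:
        fft_length *= 2
    # A real-valued FFT keeps fft_length // 2 + 1 unique frequency bins.
    num_bins = fft_length // 2 + 1
    return num_frames, num_bins


print(stft_output_shape())  # (124, 129)
```

With the trailing channel dimension added by `get_spectrogram`, a 1 s clip therefore yields a tensor of shape (124, 129, 1).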

## Build your model
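
As a possible starting point, here is a minimal sketch of a model that takes the raw 16,000-sample waveform and outputs one score per digit. It is only an illustrative baseline, not the expected solution: the function name `build_baseline_model`, the layer sizes, and the choice of 1D convolutions are all assumptions, and a model built on spectrograms may work better.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_baseline_model(input_len=16000, num_classes=10):
    """Small 1D-convolutional classifier over raw waveforms (illustrative only)."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_len,)),
        # Add a channel axis so Conv1D sees (time, channels).
        layers.Reshape((input_len, 1)),
        layers.Conv1D(16, kernel_size=9, strides=4, activation='relu'),
        layers.MaxPooling1D(4),
        layers.Conv1D(32, kernel_size=9, strides=4, activation='relu'),
        # Average over time to get one feature vector per clip.
        layers.GlobalAveragePooling1D(),
        # One unnormalized score (logit) per digit.
        layers.Dense(num_classes),
    ])
    model.compile(
        optimizer='adam',
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy'])
    return model


model = build_baseline_model()
model.summary()
```

The model expects batches of shape (`batch_size`, 16000) with integer labels 0 to 9, so waveforms shorter than one second would need to be zero-padded first.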