# <color style="color:red">**!! Disclaimer !!**</color>

##### Copyright 2020 The TensorFlow Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

### <color style="color:green">**Contributions and License**</color>
This notebook is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).
We thank the following contributors to this notebook:
* Viviane Potocnik <vivianep@iis.ee.ethz.ch> (ETH Zurich)
* TensorFlow team [https://www.tensorflow.org/](https://www.tensorflow.org/)
* Leonard Lochte-Holtgreven <lleonard@ethz.ch> (ETH Zurich)

# Simple audio recognition: Recognizing keywords

This tutorial demonstrates how to preprocess audio files in the WAV format and build and train a basic [automatic speech recognition](https://en.wikipedia.org/wiki/Speech_recognition) (ASR) model for recognizing eight different words. You will use a portion of the [Speech Commands dataset](https://www.tensorflow.org/datasets/catalog/speech_commands) ([Warden, 2018](https://arxiv.org/abs/1804.03209)), which contains short (one-second or less) audio clips of commands, such as "down", "go", "left", "no", "right", "stop", "up" and "yes".

Real-world speech and audio recognition [systems](https://ai.googleblog.com/search/label/Speech%20Recognition) are complex. As in previous weeks, though, this tutorial should give you a basic understanding of the techniques involved.

## Setup

Import necessary modules and dependencies. You'll be using `tf.keras.utils.audio_dataset_from_directory` (introduced in TensorFlow 2.10), which helps generate audio classification datasets from directories of `.wav` files. You'll also need [seaborn](https://seaborn.pydata.org) for visualization in this tutorial.

In [None]:
import os
import pathlib

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import models
from IPython import display

# Set the seed value for experiment reproducibility.
seed = 42
tf.random.set_seed(seed)
np.random.seed(seed)

## Import the mini Speech Commands dataset

To save time with data loading, you will be working with a smaller version of the Speech Commands dataset. The [original dataset](https://www.tensorflow.org/datasets/catalog/speech_commands) consists of over 105,000 audio files in the [WAV (Waveform) audio file format](https://www.aelius.com/njh/wavemetatools/doc/riffmci.pdf) of people saying 35 different words. This data was collected by Google and released under a CC BY license.

Download and extract the `mini_speech_commands.zip` file containing the smaller Speech Commands dataset with `tf.keras.utils.get_file`:

In [None]:
DATASET_PATH = 'data/mini_speech_commands_extracted/mini_speech_commands'

data_dir = pathlib.Path(DATASET_PATH)
if not data_dir.exists():
  tf.keras.utils.get_file(
      'mini_speech_commands.zip',
      origin="http://storage.googleapis.com/download.tensorflow.org/data/mini_speech_commands.zip",
      extract=True,
      cache_dir='.', cache_subdir='data')

The dataset's audio clips are stored in eight folders corresponding to each speech command: `no`, `yes`, `down`, `go`, `left`, `up`, `right`, and `stop`:

In [None]:
commands = np.array(tf.io.gfile.listdir(str(data_dir)))
commands = commands[(commands != 'README.md') & (commands != '.DS_Store')]
print('Commands:', commands)

With the clips divided into directories this way, you can easily load the data using `tf.keras.utils.audio_dataset_from_directory`.

The audio clips are 1 second or less at 16 kHz. Setting `output_sequence_length=16000` pads the short ones to exactly 1 second (and would trim longer ones) so that they can be easily batched.

In [None]:
train_ds, val_ds = tf.keras.utils.audio_dataset_from_directory(
    directory=data_dir,
    batch_size=64,
    validation_split=0.2,
    seed=0,
    output_sequence_length=16000,
    subset='both')

label_names = np.array(train_ds.class_names)
print()
print("label names:", label_names)
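To make the effect of `output_sequence_length=16000` concrete, here is a minimal NumPy sketch of padding and trimming to a fixed length. The helper `pad_or_trim` is a hypothetical name used only for illustration; it is not part of Keras, and the sketch assumes that short clips are zero-padded at the end.

```python
import numpy as np

def pad_or_trim(waveform, target_length=16000):
    """Return `waveform` padded or cut to exactly `target_length` samples."""
    waveform = np.asarray(waveform, dtype=np.float32)
    if waveform.shape[0] >= target_length:
        # Longer clips are cut down to the target length.
        return waveform[:target_length]
    # Shorter clips get zeros (silence) appended at the end.
    padding = np.zeros(target_length - waveform.shape[0], dtype=np.float32)
    return np.concatenate([waveform, padding])

short_clip = np.ones(12000, dtype=np.float32)  # 0.75 s at 16 kHz
long_clip = np.ones(20000, dtype=np.float32)   # 1.25 s at 16 kHz
print(pad_or_trim(short_clip).shape, pad_or_trim(long_clip).shape)
```

Because every clip ends up with the same number of samples, the clips can be stacked into a single batch tensor.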

By default, `tf.keras.utils.audio_dataset_from_directory` includes a `num_channels` dimension. This dataset only contains single-channel audio, so use the `tf.squeeze` function to drop the extra axis:

In [None]:
def squeeze(audio, labels):
  audio = tf.squeeze(audio, axis=-1)
  return audio, labels

train_ds = train_ds.map(squeeze, tf.data.AUTOTUNE)
val_ds = val_ds.map(squeeze, tf.data.AUTOTUNE)

The `tf.keras.utils.audio_dataset_from_directory` function only returns up to two splits. It's a good idea to keep a test set separate from your validation set.
Ideally you'd keep it in a separate directory, but in this case you can use `Dataset.shard` to split the validation set into two halves. Note that iterating over **any** shard will load **all** the data, and only keep its fraction.

In [None]:
test_ds = val_ds.shard(num_shards=2, index=0)
val_ds = val_ds.shard(num_shards=2, index=1)
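The selection rule behind `Dataset.shard` can be mimicked in plain Python: shard `index` keeps the elements whose position modulo `num_shards` equals `index`. The function below is an illustrative stand-in, not the TensorFlow implementation. Since `val_ds` is already batched, the elements being split are whole batches.

```python
def shard(elements, num_shards, index):
    """Keep every `num_shards`-th element, starting at position `index`."""
    # Every element is visited, but only a fraction is kept.
    return [x for i, x in enumerate(elements) if i % num_shards == index]

batches = list(range(10))  # stand-ins for ten validation batches
test_part = shard(batches, num_shards=2, index=0)
val_part = shard(batches, num_shards=2, index=1)
print(test_part)  # [0, 2, 4, 6, 8]
print(val_part)   # [1, 3, 5, 7, 9]
```

The two shards are disjoint and together cover the original elements, which is what makes them usable as separate test and validation sets.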

Let's print the shape of one item in the `train_ds`! You should see that it consists of a batch with 64 audio files, each containing a sequence of 16000 samples.

In [None]:
for example_audio, example_labels in train_ds.take(1):  
  print(example_audio.shape)
  print(example_labels.shape)

Let's plot a few audio waveforms:

In [None]:
plt.figure(figsize=(16, 10))
rows = 3
cols = 3
n = rows * cols
for i in range(n):
  plt.subplot(rows, cols, i+1)
  audio_signal = example_audio[i]
  plt.plot(audio_signal)
  plt.title(label_names[example_labels[i]])
  plt.yticks(np.arange(-1.2, 1.2, 0.2))
  plt.ylim([-1.1, 1.1])

## Convert Waveforms to MFCCs

The waveforms in the dataset are represented in the time domain. Next, you'll transform the waveforms from the time-domain signals into feature representations by computing the Mel-frequency cepstral coefficients (MFCCs). MFCCs capture the spectral characteristics of the audio signal and are commonly used as input for machine learning models.

A Fourier transform (`tf.signal.fft`) converts a signal to its component frequencies, losing all time information. In comparison, the short-time Fourier transform (`tf.signal.stft`) splits the signal into windows of time and runs a Fourier transform on each window, preserving some time information. This results in a 2D tensor that can be further processed to compute MFCCs.
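The difference between the two transforms can be illustrated with a small NumPy sketch. It builds two synthetic signals that contain the same two tones in opposite order. The tone frequencies and the non-overlapping half-second windows are assumptions chosen to keep the example simple; `tf.signal.stft` additionally applies a window function and usually overlapping frames.

```python
import numpy as np

sample_rate = 16000
half = sample_rate // 2
t = np.arange(half) / sample_rate
low = np.sin(2 * np.pi * 500 * t)    # 0.5 s of a 500 Hz tone
high = np.sin(2 * np.pi * 2000 * t)  # 0.5 s of a 2000 Hz tone

low_then_high = np.concatenate([low, high])
high_then_low = np.concatenate([high, low])

# Whole-signal Fourier transform: both orderings give the same magnitudes,
# so the information about *when* each tone occurred is gone.
mag_a = np.abs(np.fft.rfft(low_then_high))
mag_b = np.abs(np.fft.rfft(high_then_low))
print(np.allclose(mag_a, mag_b))

def dominant_frequencies(signal, window=half):
    """Strongest frequency (in Hz) in each non-overlapping window."""
    frames = signal.reshape(-1, window)
    peaks = np.abs(np.fft.rfft(frames, axis=1)).argmax(axis=1)
    return peaks * sample_rate / window

# One transform per window: the ordering of the tones becomes visible.
print(dominant_frequencies(low_then_high))
print(dominant_frequencies(high_then_low))
```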

### Create a Utility Function for Converting Waveforms to MFCCs:

- **Length Consistency**: Ensure that the waveforms are of consistent length, since all input samples must have the same shape to be batched and fed to a neural network. Clips shorter than one second can be zero-padded (for example with `tf.zeros`). In this notebook, `output_sequence_length=16000` already took care of this when the datasets were loaded, so `get_mfcc` does not pad again.

- **STFT Parameters**: When calling `tf.signal.stft`, choose the `frame_length` and `frame_step` parameters to balance time and frequency resolution: longer frames resolve frequencies more finely but blur changes over time. The function below uses `frame_length=1024` and `frame_step=512`, that is, 64 ms windows with 50% overlap at 16 kHz.

- **Magnitude Calculation**: After computing the STFT, derive the magnitude of the complex numbers using `tf.abs`. The magnitude captures the intensity of each frequency component without the phase information, which is not necessary for MFCC extraction.

In [None]:
def get_mfcc(waveform, sample_rate=16000, num_mel_bins=20):
    # Convert the waveform to a spectrogram via a STFT.
    stft = tf.signal.stft(waveform, frame_length=1024, frame_step=512)
    
    # Obtain the magnitude of the STFT.
    magnitude_spectrogram = tf.abs(stft)
    
    # Build a filterbank that maps linear frequency bins to mel bins.
    mel_filterbank = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins, magnitude_spectrogram.shape[-1], sample_rate)
    
    # Apply the filterbank to obtain a mel spectrogram.
    mel_spectrogram = tf.matmul(magnitude_spectrogram, mel_filterbank)
    
    # Convert to log scale.
    log_mel_spectrogram = tf.math.log(mel_spectrogram + 1e-6)
    
    # Compute MFCCs from the log mel spectrogram.
    mfccs = tf.signal.mfccs_from_log_mel_spectrograms(log_mel_spectrogram)
    
    # Add a channels dimension for compatibility with CNNs.
    mfccs = mfccs[..., tf.newaxis]
    
    return mfccs
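The shape of the STFT output follows from simple arithmetic, sketched below in plain Python. The helper `stft_output_shape` is a hypothetical name for illustration. It assumes that only complete frames are kept and that the FFT size is the smallest power of two that is at least `frame_length`, which to my understanding matches the defaults of `tf.signal.stft`.

```python
def stft_output_shape(num_samples, frame_length, frame_step):
    """Number of frames and frequency bins when only full frames are kept."""
    num_frames = 1 + (num_samples - frame_length) // frame_step
    # Assumption: FFT size is the smallest power of two >= frame_length.
    fft_length = 1 << (frame_length - 1).bit_length()
    num_bins = fft_length // 2 + 1
    return num_frames, num_bins

print(stft_output_shape(16000, frame_length=1024, frame_step=512))  # (30, 513)
```

With 20 mel bins, each one-second clip should therefore yield an MFCC tensor of 30 time frames by 20 coefficients, plus the trailing channel axis.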

Next, start exploring the data. Print the shapes of a few examples' tensorized waveforms and the corresponding MFCCs, and play the original audio:

In [None]:
for i in range(3):
  label = label_names[example_labels[i]]
  waveform = example_audio[i]
  mfcc = get_mfcc(waveform)

  print('Label:', label)
  print('Waveform shape:', waveform.shape)
  print('MFCC shape:', mfcc.shape)
  print('Audio playback')
  display.display(display.Audio(waveform, rate=16000))

Now, define a function for displaying MFCCs:

In [None]:
def plot_mfcc(mfcc, ax):
    if len(mfcc.shape) > 2:
        assert len(mfcc.shape) == 3
        mfcc = np.squeeze(mfcc, axis=-1)  # Remove the channel dimension
    
    # Transpose the MFCCs for plotting (time on x-axis, MFCC coefficients on y-axis).
    mfcc = mfcc.T
    
    # Plot the MFCCs as discrete pixels ('nearest' disables interpolation).
    im = ax.imshow(mfcc, aspect='auto', origin='lower', interpolation='nearest', cmap='viridis')
    ax.set_ylabel('MFCC Coefficients')
    ax.set_xlabel('Time Frames')
    ax.set_title('MFCCs')
    # Reuse the image handle for the colorbar instead of drawing the image twice.
    plt.colorbar(im, ax=ax)


Plot the example's waveform over time and the corresponding MFCCs (coefficients over time):

In [None]:
# `waveform`, `mfcc`, and `label` still hold the last example from the loop above.
fig, axes = plt.subplots(2, figsize=(12, 8))

# Plot the waveform
timescale = np.arange(waveform.shape[0])
axes[0].plot(timescale, waveform.numpy())
axes[0].set_title('Waveform')
axes[0].set_xlim([0, 16000])  # Adjust according to your sample rate

# Plot the MFCCs using the modified plot_mfcc function
plot_mfcc(mfcc, axes[1])  # Use the updated plot function for MFCCs
axes[1].set_title('MFCCs')

plt.suptitle(label.title())
plt.show()

Now, create MFCC datasets from the audio datasets:

In [None]:
def make_mfcc_ds(ds):
  return ds.map(
      map_func=lambda audio,label: (get_mfcc(audio), label),
      num_parallel_calls=tf.data.AUTOTUNE)

In [None]:
train_mfcc_ds = make_mfcc_ds(train_ds)
val_mfcc_ds = make_mfcc_ds(val_ds)
test_mfcc_ds = make_mfcc_ds(test_ds)

Examine the MFCCs for different examples of the dataset:

In [None]:
for example_mfcc, example_mfcc_labels in train_mfcc_ds.take(1):
  break

In [None]:
rows = 3
cols = 3
n = rows*cols
fig, axes = plt.subplots(rows, cols, figsize=(16, 9))

for i in range(n):
    r = i // cols
    c = i % cols
    ax = axes[r][c]
    plot_mfcc(example_mfcc[i].numpy(), ax)
    ax.set_title(label_names[example_mfcc_labels[i].numpy()])

plt.show()

## Build and train the model

Add `Dataset.cache` and `Dataset.prefetch` operations to reduce read latency while training the model:

In [None]:
train_mfcc_ds = train_mfcc_ds.cache().shuffle(10000).prefetch(tf.data.AUTOTUNE)
val_mfcc_ds = val_mfcc_ds.cache().prefetch(tf.data.AUTOTUNE)
test_mfcc_ds = test_mfcc_ds.cache().prefetch(tf.data.AUTOTUNE)

For the model, you'll use a simple convolutional neural network (CNN), since you have transformed the audio files into MFCC images.

Your `tf.keras.Sequential` model will use the following Keras preprocessing layer:

- `tf.keras.layers.Normalization`: to normalize each value in the MFCC input using the mean and standard deviation of the training data.

Unlike a full spectrogram, the MFCC input is already small, so the model below does not include a `tf.keras.layers.Resizing` layer to downsample it.

Before the `Normalization` layer can be used, its `adapt` method must be called on the training data to compute aggregate statistics (that is, the mean and the standard deviation).

In [None]:
input_shape = example_mfcc.shape[1:]
print('Input shape:', input_shape)
num_labels = len(label_names)

# Instantiate the `tf.keras.layers.Normalization` layer.
norm_layer = layers.Normalization()
# Fit the state of the layer to the MFCCs
# with `Normalization.adapt`.
norm_layer.adapt(data=train_mfcc_ds.map(map_func=lambda mfcc, label: mfcc))

model = models.Sequential([
    layers.Input(shape=input_shape),
    norm_layer,
    layers.Conv2D(32, 3, activation='relu'),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(num_labels),
])

model.summary()
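What `adapt` and the `Normalization` layer do can be approximated with a few lines of NumPy. This is a simplified sketch on random stand-in data, not the Keras implementation: it assumes a single channel, so one mean and one standard deviation are computed over all values.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for MFCC tensors shaped (batch, frames, coefficients, channels).
train_mfccs = rng.normal(loc=-3.0, scale=5.0, size=(64, 30, 20, 1))

# "Adapt": compute the statistics once, from the training data only.
mean = train_mfccs.mean()
std = train_mfccs.std()

# "Call": apply the same fixed statistics to any input.
normalized = (train_mfccs - mean) / std
print(normalized.mean(), normalized.std())
```

The statistics are frozen after `adapt`, so validation and test inputs are scaled with the training mean and standard deviation rather than their own.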

Configure the Keras model with the Adam optimizer and the cross-entropy loss:

In [None]:
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)
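The final `Dense` layer outputs raw scores (logits), which is why the loss is created with `from_logits=True`. The NumPy sketch below shows, under simplifying assumptions, how logits become probabilities through a softmax and how the sparse cross-entropy is the mean negative log-probability of the true class. The function names and example values are illustrative only.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sparse_categorical_crossentropy(logits, labels):
    """Mean negative log-probability assigned to the true class."""
    probs = softmax(logits)
    true_class_probs = probs[np.arange(len(labels)), labels]
    return float(-np.log(true_class_probs).mean())

logits = np.array([[2.0, 0.5, -1.0],
                   [0.0, 0.0, 3.0]])
labels = np.array([0, 2])  # integer class indices, not one-hot vectors
print(softmax(logits).sum(axis=-1))
print(sparse_categorical_crossentropy(logits, labels))
```

"Sparse" refers to the label format: each label is a single integer class index, as produced by `audio_dataset_from_directory`, rather than a one-hot vector.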

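Train the model for up to 10 epochs for demonstration purposes. The `EarlyStopping` callback ends training sooner if the validation loss has not improved for two consecutive epochs: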

In [None]:
EPOCHS = 10
history = model.fit(
    train_mfcc_ds,
    validation_data=val_mfcc_ds,
    epochs=EPOCHS,
    callbacks=[tf.keras.callbacks.EarlyStopping(verbose=1, patience=2)],
)
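The stopping rule with `patience=2` can be sketched in plain Python. This is a simplified illustration of the idea, not the Keras implementation, and the validation losses are made-up numbers.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

print(early_stopping_epoch([1.0, 0.8, 0.85, 0.9, 0.7]))  # 3
print(early_stopping_epoch([1.0, 0.9, 0.8, 0.7]))        # None
```

In the first run the loss fails to improve at epochs 2 and 3, so training stops before the better value at epoch 4 is ever reached. A larger `patience` trades longer training for more tolerance of such noise.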

Let's plot the training and validation loss and accuracy curves to check how your model has improved during training:

In [None]:
metrics = history.history
plt.figure(figsize=(16,6))
plt.subplot(1,2,1)
plt.plot(history.epoch, metrics['loss'], metrics['val_loss'])
plt.legend(['loss', 'val_loss'])
plt.ylim([0, max(plt.ylim())])
plt.xlabel('Epoch')
plt.ylabel('Loss [CrossEntropy]')

plt.subplot(1,2,2)
plt.plot(history.epoch, 100*np.array(metrics['accuracy']), 100*np.array(metrics['val_accuracy']))
plt.legend(['accuracy', 'val_accuracy'])
plt.ylim([0, 100])
plt.xlabel('Epoch')
plt.ylabel('Accuracy [%]')

## Evaluate the model performance

Run the model on the test set and check the model's performance:

In [None]:
model.evaluate(test_mfcc_ds, return_dict=True)

### Display a confusion matrix

Use a [confusion matrix](https://developers.google.com/machine-learning/glossary#confusion-matrix) to check how well the model did classifying each of the commands in the test set:


In [None]:
y_pred = model.predict(test_mfcc_ds)

In [None]:
y_pred = tf.argmax(y_pred, axis=1)

In [None]:
y_true = tf.concat(list(test_mfcc_ds.map(lambda s,lab: lab)), axis=0)

In [None]:
confusion_mtx = tf.math.confusion_matrix(y_true, y_pred)
plt.figure(figsize=(10, 8))
sns.heatmap(confusion_mtx,
            xticklabels=label_names,
            yticklabels=label_names,
            annot=True,
            fmt='g',
            cmap='Blues')
plt.xlabel('Prediction')
plt.ylabel('Label')
plt.show()
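To read the heatmap, it helps to see how a confusion matrix is built. The NumPy sketch below uses a small made-up example with three classes. Rows index the true label and columns the predicted label, matching the axis labels in the plot above.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true label, columns the predicted label."""
    matrix = np.zeros((num_classes, num_classes), dtype=np.int64)
    # Add 1 to cell (true, predicted) for every example.
    np.add.at(matrix, (y_true, y_pred), 1)
    return matrix

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)
# Per-class recall: correct predictions divided by the row totals.
print(cm.diagonal() / cm.sum(axis=1))
```

Large off-diagonal entries point to pairs of commands that the model tends to confuse with each other.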

## Run inference on an audio file

Finally, verify the model's prediction output using an input audio file of someone saying "left". How well does your model perform?

In [None]:
x = data_dir/'left/0b09edd3_nohash_0.wav'
x = tf.io.read_file(str(x))
x, sample_rate = tf.audio.decode_wav(x, desired_channels=1, desired_samples=16000,)
x = tf.squeeze(x, axis=-1)
waveform = x
x = get_mfcc(x)
x = x[tf.newaxis,...]

prediction = model(x)
x_labels = label_names  # class names in the order the model was trained on
plt.bar(x_labels, tf.nn.softmax(prediction[0]))
plt.title('Left')
plt.show()

display.display(display.Audio(waveform, rate=16000))

As the output suggests, your model should have recognized the audio command as "left".

## TFLite Conversion

In [None]:
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

tflite_model = converter.convert()

with open('model.tflite', 'wb') as f:
  f.write(tflite_model)

Let's load the `model.tflite` file to check if everything is working properly:

In [None]:
# Load the TFLite model and allocate tensors
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

# Get input and output tensors
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

In [None]:
# Create a new validation set with batch size 1 so we can use it easily on the TFLite model
val_ds_batch1 = tf.keras.utils.audio_dataset_from_directory(
    directory=data_dir,
    batch_size=1,
    validation_split=0.2,
    seed=0,
    output_sequence_length=16000,
    subset='validation')

val_ds_batch1 = val_ds_batch1.map(squeeze, tf.data.AUTOTUNE)
val_mfcc_ds_batch1 = make_mfcc_ds(val_ds_batch1)

# Save one MFCC example and its label for later use.
for sample_mfcc, sample_label in val_mfcc_ds_batch1.take(1):
    print(sample_mfcc.shape)
    print(sample_label.shape)
    np.save('sample.npy', sample_mfcc.numpy())
    np.save('label.npy', sample_label.numpy())

In [None]:
def compute_accuracy(dataset):
    correct_predictions = 0
    total_samples = 0

    for features, labels in dataset:
        # Pass a NumPy array to the interpreter rather than a TensorFlow tensor.
        interpreter.set_tensor(input_details[0]['index'], features.numpy())
        interpreter.invoke()

        output_data = interpreter.get_tensor(output_details[0]['index'])
        predictions = np.argmax(output_data, axis=1)

        correct_predictions += np.sum(predictions == labels.numpy())
        total_samples += labels.shape[0]

    accuracy = correct_predictions / total_samples
    return accuracy

accuracy = compute_accuracy(val_mfcc_ds_batch1)
print(f'Accuracy: {accuracy * 100:.2f}%')

Finally, we can check the size of our TFLite file in MB:

In [None]:

model_size_bytes = os.path.getsize('model.tflite')
model_size_mb = model_size_bytes / (1024 * 1024)

print(f"TFLite file size: {model_size_mb:.2f} MB")

## Further Resources

This tutorial demonstrated how to carry out simple audio classification/automatic speech recognition using a convolutional neural network with TensorFlow and Python. To learn more, consider the following resources:

- The [Sound classification with YAMNet](https://www.tensorflow.org/hub/tutorials/yamnet) tutorial shows how to use transfer learning for audio classification.
- The notebooks from [Kaggle's TensorFlow speech recognition challenge](https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/overview).
- The [TensorFlow.js - Audio recognition using transfer learning codelab](https://codelabs.developers.google.com/codelabs/tensorflowjs-audio-codelab/index.html#0) teaches how to build your own interactive web app for audio classification.
- [A tutorial on deep learning for music information retrieval](https://arxiv.org/abs/1709.04396) (Choi et al., 2017) on arXiv.
- TensorFlow also has additional support for [audio data preparation and augmentation](https://www.tensorflow.org/io/tutorials/audio) to help with your own audio-based projects.
- Consider using the [librosa](https://librosa.org/) library for music and audio analysis.