# Computational Analysis of Sound and Music

# ESA 1 - Sound Event Detection - 1

Dr.-Ing. Jakob Abeßer, jakob.abesser@idmt.fraunhofer.de

**Last update:** 21.05.2024

TODO
  - implement generator (to be used in the next notebook for data augmentation with audiomentations library)
  - implement CNN (FSD50k) and CRNN basic

**Outline**

In this notebook, we revisit the M1 notebook and use a small dataset of **animal sounds** extracted from the **ESC-50 dataset**.
We will study how to
- use the **audiomentations** Python library for **data augmentation** and
- implement a custom **generator** that applies the data augmentation **during training**.

In [1]:
!pip install wget



In [41]:
import numpy as np
import sklearn as skl
import os
import matplotlib
import librosa
import matplotlib.pyplot as pl
import platform
import IPython.display as ipd
import wget
import zipfile
import glob

import audiomentations

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
import tensorflow as tf
"""
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Dropout, Conv2D, BatchNormalization, \
   MaxPooling2D, Activation, GlobalAveragePooling2D, GlobalMaxPooling2D, Input
from tensorflow.keras.utils import to_categorical
"""


## Dataset

Let's download the small animal sound dataset used before ...

In [3]:
if not os.path.isfile('animal_sounds.zip'):
    print('Please wait a couple of seconds ...')
    wget.download('https://github.com/machinelistening/machinelistening.github.io/blob/master/animal_sounds.zip?raw=true', 
                      out='animal_sounds.zip', bar=None)
    print('animal_sounds.zip downloaded successfully ...')
else:
    print('Files already exist!')
    
if not os.path.isdir('animal_sounds'):
    print("Let's unzip the file ... ")
    assert os.path.isfile('animal_sounds.zip')
    with zipfile.ZipFile('animal_sounds.zip', 'r') as f:
        # extract all contents into the given directory
        f.extractall('.')
    assert os.path.isdir('animal_sounds')
    print("All done :)")


Files already exist!


**Reminder**

The dataset used here is a manual selection of 5 examples for each of 5 animal classes from the ESC-50 dataset (https://github.com/karolpiczak/ESC-50).

As the first step, let's get a list of sound classes (animal types) and for each class, a list of audio files.

In [4]:
# list the subdirectories (which provide us the animal classes)
dir_dataset = 'animal_sounds'
sub_directories = glob.glob(os.path.join(dir_dataset, '*'))

n_sub = len(sub_directories)
# let's collect the files in each subdirectory
# the folder name is the class name
fn_wav_list = []
class_label = []
file_num_in_class = []

for i in range(n_sub):
    current_class_label = os.path.basename(sub_directories[i])
    current_fn_wav_list = sorted(glob.glob(os.path.join(sub_directories[i], '*.wav')))
    for k, fn_wav in enumerate(current_fn_wav_list):
        fn_wav_list.append(fn_wav)
        class_label.append(current_class_label)
        file_num_in_class.append(k)

n_files = len(class_label)
print('Here is our list of audio files, sorted by sound classes:')
for i in range(n_files):
    print(class_label[i], '-', fn_wav_list[i])
    
# this vector includes a "counter" for each file within its class, we use it later ...
file_num_in_class = np.array(file_num_in_class)
print(f"Here is a within-class file counter: {file_num_in_class}")

Here is our list of audio files, sorted by sound classes:
cat - animal_sounds\cat\1-34094-B-5.wav
cat - animal_sounds\cat\1-47819-A-5.wav
cat - animal_sounds\cat\1-56380-A-5.wav
cat - animal_sounds\cat\1-79113-A-5.wav
cat - animal_sounds\cat\2-110010-A-5.wav
cow - animal_sounds\cow\3-124376-A-3.wav
cow - animal_sounds\cow\3-126358-A-3.wav
cow - animal_sounds\cow\3-152039-A-3.wav
cow - animal_sounds\cow\3-160993-A-3.wav
cow - animal_sounds\cow\4-174860-A-3.wav
dog - animal_sounds\dog\2-114280-A-0.wav
dog - animal_sounds\dog\2-117271-A-0.wav
dog - animal_sounds\dog\2-118072-A-0.wav
dog - animal_sounds\dog\2-122104-A-0.wav
dog - animal_sounds\dog\3-136288-A-0.wav
frog - animal_sounds\frog\1-18757-A-4.wav
frog - animal_sounds\frog\1-31836-A-4.wav
frog - animal_sounds\frog\2-32515-A-4.wav
frog - animal_sounds\frog\2-52085-A-4.wav
frog - animal_sounds\frog\3-70962-A-4.wav
insect - animal_sounds\insect\1-17585-A-7.wav
insect - animal_sounds\insect\1-19501-A-7.wav
insect - animal_sounds\insect

Let's listen to one example per class...

In [5]:
for i in range(5):
    idx = 5*i  # always take the first one per class
    x, fs = librosa.load(fn_wav_list[idx])
    print(class_label[idx])
    ipd.display(ipd.Audio(data=x, rate=fs))

cat


cow


dog


frog


insect


Next, we assign a class ID to each file, i.e., an integer that represents its class.

In [6]:
unique_classes = sorted(list(set(class_label)))
print("All unique class labels (sorted alphabetically)", unique_classes)

class_id = np.array([unique_classes.index(_) for _ in class_label])
print("Class IDs of all files", class_id)

All unique class labels (sorted alphabetically) ['cat', 'cow', 'dog', 'frog', 'insect']
Class IDs of all files [0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4]
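As a side note, the same label-to-ID mapping can be obtained in one step with NumPy. The snippet below is a small sketch on a toy label list (the list `labels` is made up for illustration and only stands in for `class_label`).

```python
import numpy as np

# Toy label list standing in for `class_label` from the cell above
labels = ['cat', 'cat', 'cow', 'dog', 'dog', 'frog', 'insect']

# np.unique sorts the unique values; return_inverse additionally gives,
# for every entry, its index into that sorted list (i.e., the class ID)
unique_labels, label_ids = np.unique(labels, return_inverse=True)

print(unique_labels.tolist())
print(label_ids.tolist())
```

Since `np.unique` sorts alphabetically, the IDs agree with the `sorted(...)`-based approach used above.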


## Feature Extraction

Let's use a function to compute the **Mel spectrogram** with logarithmic magnitude scaling:

In [7]:
def compute_melspec(fn_wav, n_bins=128):
    """ Compute Mel spectrogram with logarithmic magnitude scaling 
    Args:
        fn_wav (str): WAV file name
        n_bins (int): Number of Mel frequency bins
    Returns:
        mel_spec (2d np.ndarray): Mel spectrogram (n_bins x n_frames)
    """
    x, fs = librosa.load(fn_wav, mono=True, sr=44100)
    S = librosa.feature.melspectrogram(y=x, sr=fs, n_mels=n_bins, fmax=fs/2)
    S_dB = librosa.power_to_db(S, ref=np.max)
    return S_dB
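To make the "logarithmic magnitude scaling" step more tangible, here is a rough NumPy sketch of what a power-to-decibel conversion relative to the maximum does. It is an illustrative re-implementation, not librosa's code; the constants `amin` and `top_db` are assumed to mirror librosa's default values.

```python
import numpy as np

def power_to_db_sketch(S, amin=1e-10, top_db=80.0):
    """Illustrative dB conversion relative to the maximum of S (assumed defaults)."""
    S = np.asarray(S, dtype=float)
    ref = S.max()
    # 10 * log10 of the power values, with a small floor to avoid log(0)
    log_spec = 10.0 * np.log10(np.maximum(amin, S)) - 10.0 * np.log10(max(amin, ref))
    # limit the dynamic range to top_db below the maximum
    return np.maximum(log_spec, log_spec.max() - top_db)

# Toy "power spectrogram" with values spanning many orders of magnitude
S = np.array([[1.0, 0.1], [1e-3, 1e-12]])
S_dB = power_to_db_sketch(S)
print(S_dB)  # the maximum maps to 0 dB, very small values are floored at -80 dB
```

The resulting values lie between -80 dB and 0 dB, which is a much friendlier range for a neural network than raw power values.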

Batch feature extraction over all WAV files

In [16]:
feat = []
for fn_wav in fn_wav_list:
    feat.append(compute_melspec(fn_wav))
feat = np.array(feat)

print(feat.shape)

(25, 128, 431)
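The number of frames (431) can be checked by hand. The sketch below assumes 5-second clips and librosa's default hop size of 512 samples with centered frames; these values are assumptions taken from the library defaults and are not set explicitly in `compute_melspec`.

```python
# Assumptions: 5 s clips, fs = 44.1 kHz (as in compute_melspec),
# hop_length = 512 and centered frames (assumed librosa defaults)
duration_s = 5
fs = 44100
hop_length = 512

n_samples = duration_s * fs
# with centered frames, there is one frame per hop plus one
n_frames = 1 + n_samples // hop_length

print(n_samples, n_frames)
```

This gives 220500 samples and 431 frames, which matches the last dimension of `feat.shape`.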


Remember the shape required for a CNN model:

$n_\mathrm{patches} \times n_\mathrm{freqbins} \times n_\mathrm{frames} \times n_\mathrm{channels}$

We only consider the magnitude channel, hence $n_\mathrm{channels}=1$.

In [17]:
feat = np.expand_dims(feat, axis=-1)
print(f"Final shape: {feat.shape}")

Final shape: (25, 128, 431, 1)


## Train-Test-Split

We use the ```file_num_in_class``` variable from before to separate our dataset into a **training set** and a **test set**. We will use the first three files in each class as the training set and the last two as the test set. Comparing the counter against a threshold yields boolean masks, which ```np.where``` converts into index arrays.

In [18]:
print("Remember what it looks like:", file_num_in_class)  # starts at 0 for the first file in each class, etc...

Remember what it looks like: [0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4]


In [19]:
is_train = np.where(file_num_in_class <= 2)[0]
is_test = np.where(file_num_in_class >= 3)[0]

print("Indices of the training set items:", is_train)
print("Indices of the test set items:", is_test)


Indices of the training set items: [ 0  1  2  5  6  7 10 11 12 15 16 17 20 21 22]
Indices of the test set items: [ 3  4  8  9 13 14 18 19 23 24]
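The relation between boolean masks and index arrays can be illustrated with a small sketch. The array `file_num` below is a reconstructed stand-in for `file_num_in_class` (5 files in each of 5 classes).

```python
import numpy as np

# Stand-in for file_num_in_class: counter 0..4 repeated for 5 classes
file_num = np.tile(np.arange(5), 5)

# Boolean masks: True where a file belongs to the respective set
train_mask = file_num <= 2
test_mask = ~train_mask

# Index arrays derived from the masks
idx_train = np.flatnonzero(train_mask)
idx_test = np.flatnonzero(test_mask)

print(idx_train)
print(idx_test)
```

Both masks and index arrays can be used to select rows from `feat` and `class_id`; the masks have the advantage that `~train_mask` directly guarantees disjoint sets.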


Now that we have split our dataset, we can generate the feature matrix and target vectors for the training and test set.

In [96]:
X_train = feat[is_train, :]
X_test = feat[is_test, :]

y_train = class_id[is_train]
y_test = class_id[is_test]

# one-hot-encoding
y_train = tf.keras.utils.to_categorical(y_train, num_classes=5)
y_test = tf.keras.utils.to_categorical(y_test, num_classes=5)

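# Data standardization: mean and standard deviation are estimated on the
# training set only and re-used for the test set, so both are scaled identically
mean_train = np.mean(X_train)
std_train = np.std(X_train)
X_train = (X_train - mean_train) / std_train
X_test = (X_test - mean_train) / std_train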

print("Let's look at the dimensions")
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

Let's look at the dimensions
(15, 128, 431, 1)
(15, 5)
(10, 128, 431, 1)
(10, 5)
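For intuition, one-hot encoding can also be written in plain NumPy. The following sketch uses a made-up vector of class IDs and is only meant to show what the encoded targets look like.

```python
import numpy as np

# Hypothetical class IDs for four files
class_ids = np.array([0, 2, 4, 1])
n_classes = 5

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(n_classes, dtype=np.float32)[class_ids]
print(one_hot)

# argmax along the class axis recovers the original IDs
recovered = np.argmax(one_hot, axis=1)
print(recovered)
```

The `argmax` step is the same operation that is used later to turn the model's softmax outputs back into class IDs.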


## Neural Network Architecture

We first use a CNN model for sound event classification, which has around **216k parameters**. 

This time, we use the **VGG-like** model from

 [1] Fonseca, E., Favory, X., Pons, J., Font, F., & Serra, X. (2020). FSD50K: An Open Dataset of Human-Labeled Sound Events. arXiv preprint arXiv:2010.00475. https://arxiv.org/abs/2010.00475

### GlobalMaxPooling / GlobalAveragePooling

*Source: https://8f430952.rocketcdn.me/wp-content/uploads/2021/08/image-278.png*

- global pooling operations summarize entire feature maps into scalar values
- in the example shown below, 4 feature maps are summarized and the result is an array with 4 values

In [97]:
ipd.Image(url='https://8f430952.rocketcdn.me/wp-content/uploads/2021/08/image-278.png')
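Global pooling is easy to reproduce in NumPy. The sketch below uses random values as hypothetical activations with 4 feature maps in a channels-last layout and mimics what global average and global max pooling compute.

```python
import numpy as np

# Hypothetical activations: batch of 1, 2 x 3 spatial grid, 4 feature maps
rng = np.random.default_rng(0)
fmaps = rng.normal(size=(1, 2, 3, 4))

# Global pooling reduces over both spatial axes, leaving one value per feature map
gap = fmaps.mean(axis=(1, 2))  # global average pooling, shape (1, 4)
gmp = fmaps.max(axis=(1, 2))   # global max pooling, shape (1, 4)

# Concatenating both summaries doubles the feature dimension
pooled = np.concatenate([gap, gmp], axis=-1)
print(gap.shape, gmp.shape, pooled.shape)
```

Because the spatial axes are removed, the pooled vector has the same length regardless of the number of time frames in the input.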

In [116]:
def creage_vgg_like_model(input_shape, num_output_dim):
    """ Create and compile a VGG-like CNN
    Args:
        input_shape (tuple): Shape of one input patch (n_freqbins, n_frames, n_channels)
        num_output_dim (int): Number of output classes
    Returns:
        model (tf.keras.models.Model): Compiled model
    """
    inp = tf.keras.layers.Input(shape=input_shape)

    # block 1: three convolutional layers with 32 filters each
    x = inp
    for _ in range(3):
        x = tf.keras.layers.Conv2D(32, kernel_size=(3, 3), padding='same')(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.Activation(activation="relu")(x)

    x = tf.keras.layers.MaxPooling2D((2, 2))(x)

    # block 2: two convolutional layers with 64 filters each
    for _ in range(2):
        x = tf.keras.layers.Conv2D(64, kernel_size=(3, 3), padding='same')(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.Activation(activation="relu")(x)

    x = tf.keras.layers.MaxPooling2D((2, 2))(x)

    # block 3: one convolutional layer with 128 filters
    x = tf.keras.layers.Conv2D(128, kernel_size=(3, 3), padding='same')(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Activation(activation="relu")(x)

    # summarize each feature map by its average and its maximum
    x = tf.keras.layers.concatenate([tf.keras.layers.GlobalAveragePooling2D()(x),
                                     tf.keras.layers.GlobalMaxPooling2D()(x)])

    # classification head
    x = tf.keras.layers.Dense(256, activation='relu')(x)
    x = tf.keras.layers.Dropout(0.5)(x)
    out = tf.keras.layers.Dense(num_output_dim, activation="softmax")(x)

    model = tf.keras.models.Model(inputs=inp, outputs=out)

    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    return model

# Example usage
input_shape = X_train.shape[1:] 
model = creage_vgg_like_model(input_shape, 5)
model.summary()


Model: "functional_49"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_29 (InputLayer)           [(None, 128, 431, 1) 0                                            
__________________________________________________________________________________________________
conv2d_162 (Conv2D)             (None, 128, 431, 32) 320         input_29[0][0]                   
__________________________________________________________________________________________________
batch_normalization_162 (BatchN (None, 128, 431, 32) 128         conv2d_162[0][0]                 
__________________________________________________________________________________________________
activation_162 (Activation)     (None, 128, 431, 32) 0           batch_normalization_162[0][0]    
______________________________________________________________________________________
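The parameter count of roughly 216k mentioned earlier can be reproduced by hand from the layer definitions. The helper functions below are written just for this check and are not part of Keras; batch normalization is counted with 4 values per channel (scale, offset, moving mean, moving variance), which agrees with the 128 parameters listed for the first batch normalization layer above.

```python
def conv2d_params(c_in, c_out, k=3):
    # k x k kernel per input/output channel pair, plus one bias per output channel
    return k * k * c_in * c_out + c_out

def batchnorm_params(c):
    # scale, offset, moving mean, moving variance per channel
    return 4 * c

def dense_params(n_in, n_out):
    # weight matrix plus bias vector
    return n_in * n_out + n_out

# channel progression of the six convolutional layers (input has 1 channel)
channels = [1, 32, 32, 32, 64, 64, 128]

total = 0
for c_in, c_out in zip(channels[:-1], channels[1:]):
    total += conv2d_params(c_in, c_out) + batchnorm_params(c_out)

# concatenated global average + max pooling gives 2 * 128 values
total += dense_params(2 * 128, 256)
total += dense_params(256, 5)

print(total)
```

The result is 216581, i.e., around 216k parameters. Pooling, activation, and dropout layers do not contribute any parameters.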

## Model training & evaluation

In [117]:
hist_1 = model.fit(X_train, y_train, batch_size=2, epochs=30, verbose=2)

Epoch 1/30
8/8 - 2s - loss: 6.2786 - accuracy: 0.1333
Epoch 2/30
8/8 - 2s - loss: 4.2867 - accuracy: 0.1333
Epoch 3/30
8/8 - 2s - loss: 3.5269 - accuracy: 0.2667
Epoch 4/30
8/8 - 2s - loss: 3.0351 - accuracy: 0.1333
Epoch 5/30
8/8 - 2s - loss: 1.7955 - accuracy: 0.3333
Epoch 6/30
8/8 - 2s - loss: 1.7136 - accuracy: 0.4667
Epoch 7/30
8/8 - 2s - loss: 1.8884 - accuracy: 0.4667
Epoch 8/30
8/8 - 2s - loss: 1.7682 - accuracy: 0.4000
Epoch 9/30
8/8 - 1s - loss: 1.9372 - accuracy: 0.2000
Epoch 10/30
8/8 - 2s - loss: 0.8733 - accuracy: 0.6667
Epoch 11/30
8/8 - 2s - loss: 1.5154 - accuracy: 0.6000
Epoch 12/30
8/8 - 2s - loss: 1.2330 - accuracy: 0.5333
Epoch 13/30
8/8 - 2s - loss: 1.2422 - accuracy: 0.5333
Epoch 14/30
8/8 - 2s - loss: 1.5392 - accuracy: 0.4667
Epoch 15/30
8/8 - 2s - loss: 1.3364 - accuracy: 0.4667
Epoch 16/30
8/8 - 2s - loss: 1.1005 - accuracy: 0.7333
Epoch 17/30
8/8 - 2s - loss: 1.2325 - accuracy: 0.6000
Epoch 18/30
8/8 - 2s - loss: 1.0880 - accuracy: 0.6000
Epoch 19/30
8/8 - 2

In [118]:
y_test_pred = model.predict(X_test)
class_id_test = np.argmax(y_test, axis=1)
class_id_test_pred = np.argmax(y_test_pred, axis=1)
acc = accuracy_score(class_id_test, class_id_test_pred)
print(f"Accuracy = {acc}")

Accuracy = 0.4
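A single accuracy value hides which classes are confused with each other. As a small sketch, the snippet below builds a confusion matrix with NumPy; the vectors `y_true` and `y_pred` are invented for illustration and are not the predictions of the model above.

```python
import numpy as np

# Hypothetical true and predicted class IDs for a 10-item test set
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
y_pred = np.array([0, 1, 1, 1, 0, 2, 3, 4, 4, 0])

n_classes = 5
conf_mat = np.zeros((n_classes, n_classes), dtype=int)
# rows: true class, columns: predicted class
np.add.at(conf_mat, (y_true, y_pred), 1)

# overall accuracy is the share of items on the main diagonal
accuracy = np.trace(conf_mat) / conf_mat.sum()
# per-class recall: correct predictions divided by the number of items per true class
per_class_recall = np.diag(conf_mat) / conf_mat.sum(axis=1)

print(conf_mat)
print(accuracy, per_class_recall)
```

With only two test files per class, such numbers are very coarse, so they should be read as a rough indication only.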


## Data Augmentation

Since our dataset is very small, we want to use **data augmentation** to create more training data instances, which are variations of existing instances. The intuition is that by having more diverse training examples, the model better generalizes to unseen (test) data.

### Audiomentations

The **audiomentations** Python library provides an easy-to-use wrapper for many data augmentation techniques such as
- adding background noise
- adding reverberation (impulse response)
- distortion
- spectral masking and so on...

Individual processing algorithms can be imported as classes and combined using the **Compose** wrapper.

In [119]:
from audiomentations import Compose, AddGaussianNoise, TimeStretch, PitchShift, Shift

augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.03, p=0.5),
    TimeStretch(min_rate=0.8, max_rate=1.2, p=0.5),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5)
])
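To get a feeling for what a transform with a probability parameter does, here is a simplified NumPy sketch of random Gaussian noise addition. The function `add_gaussian_noise` is a hypothetical stand-in written for this notebook and is not the audiomentations implementation; it only mimics the idea of a random amplitude and an application probability `p`.

```python
import numpy as np

def add_gaussian_noise(samples, rng, min_amplitude=0.001, max_amplitude=0.03, p=0.5):
    """Illustrative stand-in: add noise with probability p, otherwise pass through."""
    if rng.random() >= p:
        return samples  # transform is skipped for this call
    # draw a random noise amplitude for this call
    amplitude = rng.uniform(min_amplitude, max_amplitude)
    noise = rng.normal(size=samples.shape)
    return (samples + amplitude * noise).astype(samples.dtype)

rng = np.random.default_rng(42)
x = np.zeros(22050, dtype=np.float32)  # one second of silence as a toy signal

x_always = add_gaussian_noise(x, rng, p=1.0)  # always applied
x_never = add_gaussian_noise(x, rng, p=0.0)   # never applied
print(np.abs(x_always).max() > 0, np.array_equal(x_never, x))
```

Chaining several such transforms, each with its own probability, is what makes every call of the composed augmentation produce a different variation of the input.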



Let's take an audio file and listen to some (randomly) augmented versions of it...

In [120]:
# load audio 
idx = 15
x_orig, fs = librosa.load(fn_wav_list[idx])

for i in range(5):
    print(i)
    if i == 0:
        print('Original')
        x = x_orig
    else:
        print('Augmented version')
        x = augment(samples=x_orig, sample_rate=fs)
    
    ipd.display(ipd.Audio(data=x, rate=fs))


0
Original


1
Augmented version


2
Augmented version


3
Augmented version


4
Augmented version


To apply the random data augmentation during training, we need to do the following steps:
- import all audio files and store the original samples as rows in a matrix
- implement a **generator** which applies data augmentation in each training epoch and then computes the mel spectrogram

In [121]:
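# load all audio files at 44.1 kHz, i.e., the same sampling rate
# that is used for the feature extraction and in the generator below
all_samples = []
for fn_wav in fn_wav_list:
    x, fs = librosa.load(fn_wav, mono=True, sr=44100)
    all_samples.append(x)
all_samples = np.vstack(all_samples)
all_samples = all_samples[is_train, :]
print(all_samples.shape)

(15, 220500)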


## Generator

In [125]:
class DataGenerator(tf.keras.utils.Sequence):
    """ Yields one randomly augmented Mel spectrogram (batch of size 1) per step """

    def __init__(self, all_samples, targets):
        super().__init__()
        self.all_samples = all_samples
        self.targets = targets
        self.n_files = self.all_samples.shape[0]
        self.n_samples = self.all_samples.shape[1]
        # array of file indexes that we can shuffle after each training epoch to use files in random order
        self.indexes = np.arange(self.n_files)
        # sampling rate of the audio in all_samples (must match the rate used for loading)
        self.fs = 44100
        self.augment = Compose([AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.001, p=0.5),
                                TimeStretch(min_rate=0.95, max_rate=1.05, p=0.5),
                                PitchShift(min_semitones=-1, max_semitones=1, p=0.5)])

    def __len__(self):
        """ Returns the number of batches per epoch (one file per batch) """
        return self.n_files

    def __getitem__(self, index):
        # get current audio samples
        curr_samples = self.all_samples[self.indexes[index], :]
        # apply the generator's own data augmentation chain
        curr_samples_aug = self.augment(samples=curr_samples, sample_rate=self.fs)
        # compute Mel spectrogram of the augmented signal
        spec = librosa.feature.melspectrogram(y=curr_samples_aug, sr=self.fs, n_mels=128, fmax=self.fs/2)
        spec = librosa.power_to_db(spec, ref=np.max)
        # standardize each example individually
        spec -= np.mean(spec)
        spec /= np.std(spec)
        # define feature tensor: 1 patch, 1 channel
        feat = np.zeros((1, spec.shape[0], spec.shape[1], 1))
        feat[0, :, :, 0] = spec
        target = self.targets[self.indexes[index]]
        target = np.expand_dims(target, axis=0)

        return feat, target

    def on_epoch_end(self):
        # shuffle training file indices
        np.random.shuffle(self.indexes)


generator = DataGenerator(all_samples, y_train)
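The generator above returns one file per step. The three methods `__len__`, `__getitem__`, and `on_epoch_end` can also be used to serve larger batches. The class `ToyBatchSequence` below is a hypothetical NumPy-only stand-in (it does not inherit from the Keras class and skips augmentation) that sketches this protocol with an assumed batch size of 2.

```python
import math
import numpy as np

class ToyBatchSequence:
    """Hypothetical stand-in that mirrors the Sequence protocol with batches."""

    def __init__(self, features, targets, batch_size=2, seed=0):
        self.features = features
        self.targets = targets
        self.batch_size = batch_size
        self.indexes = np.arange(len(features))
        self.rng = np.random.default_rng(seed)

    def __len__(self):
        # number of batches per epoch; the last batch may be smaller
        return math.ceil(len(self.features) / self.batch_size)

    def __getitem__(self, index):
        # select the (possibly shuffled) file indexes of the requested batch
        sel = self.indexes[index * self.batch_size:(index + 1) * self.batch_size]
        return self.features[sel], self.targets[sel]

    def on_epoch_end(self):
        # new random file order for the next epoch
        self.rng.shuffle(self.indexes)

# Dummy data with the same shapes as X_train and y_train
features = np.zeros((15, 128, 431, 1), dtype=np.float32)
targets = np.eye(5, dtype=np.float32)[np.repeat(np.arange(5), 3)]

seq = ToyBatchSequence(features, targets, batch_size=2)
print(len(seq), seq[0][0].shape, seq[len(seq) - 1][0].shape)
seq.on_epoch_end()
```

With 15 training files and a batch size of 2, this yields 8 batches per epoch, the last of which contains a single file.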

In [126]:
model2 = creage_vgg_like_model(input_shape, 5)
model2.fit(generator, epochs=30, verbose=2)

Epoch 1/30
15/15 - 1s - loss: 5.8861 - accuracy: 0.1333
Epoch 2/30
15/15 - 1s - loss: 3.8250 - accuracy: 0.4000
Epoch 3/30
15/15 - 1s - loss: 4.4330 - accuracy: 0.0667
Epoch 4/30
15/15 - 1s - loss: 2.2675 - accuracy: 0.3333
Epoch 5/30
15/15 - 1s - loss: 1.6570 - accuracy: 0.2667
Epoch 6/30
15/15 - 1s - loss: 1.6239 - accuracy: 0.4000
Epoch 7/30
15/15 - 1s - loss: 1.2878 - accuracy: 0.6000
Epoch 8/30
15/15 - 1s - loss: 1.2747 - accuracy: 0.4000
Epoch 9/30
15/15 - 1s - loss: 1.4007 - accuracy: 0.5333
Epoch 10/30
15/15 - 1s - loss: 0.8086 - accuracy: 0.6000
Epoch 11/30
15/15 - 1s - loss: 1.2113 - accuracy: 0.5333
Epoch 12/30
15/15 - 1s - loss: 0.7203 - accuracy: 0.8000
Epoch 13/30
15/15 - 1s - loss: 0.7739 - accuracy: 0.8000
Epoch 14/30
15/15 - 1s - loss: 0.6464 - accuracy: 0.7333
Epoch 15/30
15/15 - 1s - loss: 0.7783 - accuracy: 0.6000
Epoch 16/30
15/15 - 1s - loss: 0.3203 - accuracy: 0.9333
Epoch 17/30
15/15 - 1s - loss: 0.3706 - accuracy: 0.8667
Epoch 18/30
15/15 - 1s - loss: 0.1277 - 

<tensorflow.python.keras.callbacks.History at 0x1f730240108>