# Neural Networks course project: #
# Google AudioSet sound classification with Deep Learning #
### Sapienza University of Rome ###

### by Ivan Senilov (1787618) ###

## 1. Introduction ##

This work is the practical part of the Neural Networks course taught at Sapienza University.

The goal of this coursework is to:

1. Explore the [Google AudioSet](https://research.google.com/audioset/index.html) [1]
2. Build the classification model based on Neural Network(s)
3. Validate the model
4. Evaluate and discuss the results

## 2. Audio classification problem ##

### 2.1 Audio features ###

Feature extraction is the signal processing task of computing a numerical representation of the signal that can be used to characterize an audio segment [3].

Most of the audio features fall into three categories [4]:

1. **Energy-based**. For example, 4Hz modulation energy used for speech/music classification [5].
2. **Spectrum-based**. Examples of this category are roll-off of the spectrum, spectral flux, Mel Frequency Cepstral Coefficients (MFCC) [5] and linear spectrum pair, band periodicity [6].
3. **Perception-based**, like pitch (estimated to discriminate songs and speech over music [7]).

The most developed areas of machine learning for audio classification include speech and music recognition, where MFCCs are widely used as features. MFCCs were introduced in 1980 [8] and outperformed other parametric representations in the recognition of spoken words. However, for other types of sound recognition the choice of feature extraction method is less obvious, even though MFCCs are also used, for example, in environmental sound classification [9].

<img src="figs/feat_extr.png">

Extraction pipelines for MPEG-7 (left) and MFCC (right) features (reprinted from [4])

A typical approach to feature extraction is to split the audio signal into small chunks of several $ms$ (the exact size is domain-dependent) and feed them into a feature extraction function from one of the many available frameworks (see Figures 1 and 2 for examples of extraction pipelines for MPEG-7 [10] and MFCC features). The most popular frameworks include YAAFE [11] and openSMILE [12], which allow extracting the following feature types:

1. Amplitude Modulation [13]. Analyzed frequency ranges are: Tremolo (4 - 8 Hz) and Grain (10 - 40 Hz). For each of these ranges, it computes:
    * Frequency of maximum energy in range
    * Difference of the energy of this frequency and the mean energy over all frequencies
    * Difference of the energy of this frequency and the mean energy in range
    * Product of the two first values.
2. Autocorrelation coefficients $\mathit{ac}$ on each frame.
    $$ac(k)=\sum_{i=0}^{N-k-1}x(i)x(i+k),$$    
    where $k$ is the lag in samples and (here and below) $N$ is the length of the frame in samples and $x(i)$ is the $i$-th sample of the frame.    
3. Onset detection using a complex domain spectral flux method [14].    
4. Energy $\mathit{en}$ as root mean square of an audio frame.
    $$en=\sqrt{\dfrac{\sum_{i=0}^{N-1}x(i)^2}{N}}$$    
5. Envelope of an oscillating signal (smooth curve outlining its extremes).    
6. Shape statistics (centroid, spread, skewness and kurtosis) of each frame’s Temporal Shape, Amplitude Envelope and Magnitude Spectrum.    
7. Linear Predictor Coefficients (LPC) of a signal frame [15].    
8. Line Spectral Frequency (LSF) coefficients of a signal frame [16].    
9. Loudness coefficients [17].    
10. Mel-frequencies cepstrum coefficients and Mel-frequencies spectrum.    
11. Magnitude spectrum.    
12. Octave band signal intensity (OBSI) using a triangular octave filter bank, and the OBSI ratio between consecutive octaves.
13. Sharpness and Spread of Loudness coefficients [18].    
14. Spectral crest factor per log-spaced band of 1/4 octave.    
15. Spectral decrease, spectral flatness.    
16. Spectral Flux.    
17. Spectral roll-off (the frequency below which 99% of the energy is contained) [5].
18. Spectral Slope (computed by linear regression of the spectral amplitude) [18].    
19. Spectral Variation (normalized correlation of spectrum between consecutive frames) [18].    
20. Zero-crossing rate (ZCR) for frame [5].
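
Three of the simpler features from the list above can be written directly from their definitions. The sketch below is a minimal pure-Python illustration, not the YAAFE or openSMILE implementation. It assumes that $k$ in the autocorrelation formula is the lag, and it normalizes the zero-crossing rate by the number of adjacent sample pairs (other normalizations exist). The function names are hypothetical.

```python
import math


def autocorrelation(frame, k):
    """ac(k) = sum_{i=0}^{N-k-1} x(i) * x(i+k), with k treated as the lag (assumption)."""
    n = len(frame)
    return sum(frame[i] * frame[i + k] for i in range(n - k))


def rms_energy(frame):
    """Energy as the root mean square of the frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (one possible normalization)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


# A tiny alternating frame: maximal zero-crossing rate, unit energy.
frame = [1.0, -1.0, 1.0, -1.0]
print(autocorrelation(frame, 0), autocorrelation(frame, 1))
print(rms_energy(frame), zero_crossing_rate(frame))
```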

Even though plenty of audio features are available, we focus on MFCCs, as they have proved to be an effective and widely used representation (see Figure 3 for an example heatmap of a feature matrix). We also consider the raw audio signal to see how features learned automatically (by a dedicated neural network architecture) perform in comparison with hand-crafted ones.

<img src="figs/spectr.png">

<p style="text-align: center; font-weight:bold"> Fig.3. Example of heatmap of feature matrix </p>

### 2.2 Model selection ###

As this work is part of the Neural Networks course, we do not consider more traditional machine learning algorithms. Another reason is that neural networks have proved to be much more effective in real-world pattern recognition problems, particularly in audio classification, as shown by [2] and many other studies.

In the next section, we evaluate the performance of the following combinations of feature extraction techniques and network architectures:

1. **MFCC features + Bidirectional Long Short-Term Memory (BLSTM) network**, where we combine features that are traditional for audio applications with a relatively modern recurrent neural network [19,20].
2. **Raw audio + Convolutional Neural Network (CNN) and BLSTM network**. CNNs are traditionally used in image recognition [21], where 2D kernels learn features. We instead use 1D kernels to learn patterns in the raw audio signal, keeping the original idea of the CNN.
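
To make the idea of a 1D kernel concrete, the following sketch slides a small kernel over a short signal, which is the operation a convolutional layer applies to raw audio (without padding here). It is an illustration only: the kernel is hand-picked rather than learned, and the function name is hypothetical.

```python
def conv1d_valid(signal, kernel):
    """Slide `kernel` over `signal` without padding (cross-correlation, as in CNN layers)."""
    width = len(kernel)
    return [
        sum(signal[start + j] * kernel[j] for j in range(width))
        for start in range(len(signal) - width + 1)
    ]


signal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]
edge_kernel = [-1.0, 1.0]  # hand-picked: responds positively to rising samples
response = conv1d_valid(signal, edge_kernel)
print(response)
```

In a trained network the kernel weights are learned from data, and many kernels run in parallel to produce many feature channels.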

## 3. Implementation ##

In this section, we will:

1. Load the ontology and the correspondence between video ids and class labels
2. Download the videos from YouTube to the respective directories
3. Load the labeled dataset into NumPy arrays
4. Build a classifier based on a neural network architecture
5. Evaluate it on a held-out test set

First of all, import data manipulation libs.

In [1]:
import pandas as pd
import numpy as np

In order to simplify the task (also in terms of computational complexity), we limit the number of categories to 3. We read the ontology into a Pandas dataframe and select only the *id*, *name* and *child_ids* fields.

In [2]:
classes_of_interest = ("Vehicle", "Channel, environment and background", "Natural sounds")

ontology = pd.read_json("ontology.json")   # read_json accepts a file path directly

ontology = ontology[["id", "name", "child_ids"]].set_index("id")

Then we extract *id*s for each (main and child) category (class) of interest recursively, building a dictionary where key is class name and value is a list of *id*s.

In [4]:
def extract_ids(cls, data):
    """Recursively collect the id of class `cls` and the ids of all its child classes."""
    out = []
    for class_id, row in data.iterrows():   # class_id avoids shadowing the built-in id()
        if row["name"] == cls:
            out.append(class_id)
            for child in row["child_ids"]:
                out.append(extract_ids(data["name"][child], data))
    return flatten(out)


def flatten(nested):
    """List flattening helper function."""
    flat = []
    for item in nested:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat

In [5]:
classes = {i:[] for i in classes_of_interest}
for cls in classes_of_interest:
    classes[cls] = extract_ids(cls, ontology)
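
The recursive collection of class ids can be illustrated on a toy ontology. This is a sketch under stated assumptions: the ontology is a plain dictionary instead of a dataframe, and the ids and the child classes shown here are made up for the example rather than taken from AudioSet.

```python
# Hypothetical miniature ontology: id -> name and child ids.
toy_ontology = {
    "v": {"name": "Vehicle", "child_ids": ["c", "t"]},
    "c": {"name": "Car", "child_ids": ["r"]},
    "t": {"name": "Train", "child_ids": []},
    "r": {"name": "Race car", "child_ids": []},
    "n": {"name": "Natural sounds", "child_ids": []},
}


def collect_ids(class_id, ontology):
    """Return class_id followed by the ids of all its descendants (depth-first)."""
    ids = [class_id]
    for child_id in ontology[class_id]["child_ids"]:
        ids.extend(collect_ids(child_id, ontology))
    return ids


name_to_id = {entry["name"]: class_id for class_id, entry in toy_ontology.items()}
vehicle_ids = collect_ids(name_to_id["Vehicle"], toy_ontology)
print(vehicle_ids)
```

Looking classes up by id, instead of scanning all rows by name at every level, keeps the recursion linear in the size of the subtree.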

Here we read the tables with the correspondence between YouTube IDs, segments in the videos and ids from the ontology.

In [6]:
train_data = pd.read_csv("balanced_train_segments.csv", skiprows=2, sep=", ", engine="python", index_col="# YTID")
train_data["positive_labels"] = train_data["positive_labels"].apply(lambda x: x.replace('\"','').split(","))

test_data = pd.read_csv("eval_segments.csv", skiprows=2, sep=", ", engine="python", index_col="# YTID")
test_data["positive_labels"] = test_data["positive_labels"].apply(lambda x: x.replace('\"','').split(","))

train_data.head()

| # YTID | start_seconds | end_seconds | positive_labels |
|---|---|---|---|
| --PJHxphWEs | 30.0 | 40.0 | [/m/09x0r, /t/dd00088] |
| --ZhevVpy1s | 50.0 | 60.0 | [/m/012xff] |
| --aE2O5G5WE | 0.0 | 10.0 | [/m/03fwl, /m/04rlf, /m/09x0r] |
| --aO5cdqSAg | 30.0 | 40.0 | [/t/dd00003, /t/dd00005] |
| --aaILOrkII | 200.0 | 210.0 | [/m/032s66, /m/073cg4] |
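
The same parsing can be sketched with the standard library only, which makes the handling of the quoted label list explicit. The row layout below (a space after each comma, labels in one quoted field) is an assumption about the segment files, and the two rows reuse values from the table above.

```python
import csv
import io

# Assumed layout of a segments file after the comment lines have been skipped.
sample = (
    "# YTID, start_seconds, end_seconds, positive_labels\n"
    '--PJHxphWEs, 30.000, 40.000, "/m/09x0r,/t/dd00088"\n'
    '--ZhevVpy1s, 50.000, 60.000, "/m/012xff"\n'
)

# skipinitialspace lets the reader recognize the quote that follows ", ".
reader = csv.reader(io.StringIO(sample), skipinitialspace=True)
header = next(reader)

segments = {}
for ytid, start, end, labels in reader:
    segments[ytid] = (float(start), float(end), labels.split(","))

print(segments["--PJHxphWEs"])
```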


Finally, we download segments of videos from YouTube to respective folders of classes.

In [None]:
import youtube_dl   # Python library for downloading from YouTube
import os


def crop(start, length, filename):
    """Helper function for cropping audio with ffmpeg."""
    command = "ffmpeg -y -i " + filename + \
    " -ss  " + str(start) + " -t " + str(length) + \
    " -ac 1 -acodec copy " + filename.split(".")[0] + "_.wav"
    os.system(command)

to_download = {i:[] for i in classes_of_interest}
    
for cls in classes_of_interest:
    for id in classes[cls]:
        for row in train_data.itertuples(): 
            if row[3][0] == id:
                to_download[cls].append(row[0])
                

# options for youtube-dl
options = {
    'format': 'bestaudio/best',
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'wav'   # wav format for lossless features extraction
    }],
    'extractaudio' : True,
    'ignoreerrors' : True,
    'audioformat' : "wav",
    'noplaylist' : True,    # only download single clip, not playlist
}
                
for cls in to_download:
    # setting path for download, removing commas and spaces in order to avoid filesystem access problems
    options['outtmpl'] = os.path.join(cls.replace(",","").replace(" ",""), '%(id)s.%(ext)s')
    for file in to_download[cls]:
        with youtube_dl.YoutubeDL(options) as ydl:
            ydl.download([file])   # downloading full audio clip (due to limitations of youtube-dl)
            filename = os.path.join(cls.replace(",","").replace(" ",""), file + ".wav")
            crop(train_data.loc[file]["start_seconds"],
                 10, filename)     # cropping of the clip in accordance with dataset csv file 
            try:
                os.remove(filename)   # removing original (non-cropped) clip
            except OSError:
                pass
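
As a possible alternative to concatenating a shell string, the ffmpeg call can be built as an argument list, which avoids quoting problems when a path contains spaces or special characters. This is a sketch with a hypothetical function name. It reuses the same ffmpeg options as the helper above and only builds the command without running it.

```python
import os


def build_crop_command(start, length, filename):
    """Build the ffmpeg argument list that crops `length` seconds starting at `start`."""
    stem, _ = os.path.splitext(filename)
    return [
        "ffmpeg", "-y",
        "-i", filename,
        "-ss", str(start),
        "-t", str(length),
        "-ac", "1",
        "-acodec", "copy",
        stem + "_.wav",   # cropped clip is written next to the original
    ]


cmd = build_crop_command(30.0, 10, "Vehicle/--PJHxphWEs.wav")
print(cmd)
```

The list could then be passed to `subprocess.run(cmd, check=True)`, which raises an exception if ffmpeg fails instead of silently continuing.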


Next, we extract features with the help of the SciPy (raw audio) and librosa (MFCC extraction) libraries.

In [None]:
import os
from scipy.io import wavfile
from librosa import feature

# Removing commas and spaces from class names to avoid problems with directory names
classes_dirs = [i.replace(",","").replace(" ","") for i in classes_of_interest]

feat = []
feat_raw = []
labels = []
class_num = 0
window = 32768                          # size of window for each sample
for directory in classes_dirs:
    count = 0
    for file in os.listdir(directory):
        filename = os.path.join(directory, file)
        if os.path.isfile(filename) and count < 1000:    # limiting number of files to read
            count += 1
            rate, frames = wavfile.read(filename)
            if len(frames.shape) > 1:    # if stereo, take only one channel
                frames = frames[:,0]
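            for i in range(0, len(frames) - (window+1), window // 2):   # windows with 50% overlap
                # librosa expects floating-point audio, while wavfile.read usually returns integers
                pxx = feature.mfcc(y=frames[i:i + window].astype(np.float32), sr=rate, n_mfcc=20)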
                feat.append(pxx)
                feat_raw.append(frames[i:i + window])
                labels.append(class_num)
    class_num += 1                          # each successive class is represented by incremented integer
data = np.stack(feat)
data_raw = np.stack(feat_raw)

from keras.utils import to_categorical
labels = to_categorical(labels)             # convert class number to 'one-hot' vector
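
The windowing and label encoding used above can be checked in isolation. The sketch below reproduces the window start positions (window of 32768 samples, hop of half a window) and the one-hot encoding in plain Python. The clip length of 441000 samples assumes a 10-second clip at 44.1 kHz, which is an illustrative assumption, and both function names are hypothetical.

```python
def window_starts(n_samples, window=32768):
    """Start indices of half-overlapping windows, mirroring the loop bounds used above."""
    return list(range(0, n_samples - (window + 1), window // 2))


def one_hot(label, n_classes):
    """Encode an integer class label as a one-hot vector."""
    return [1.0 if i == label else 0.0 for i in range(n_classes)]


starts = window_starts(441000)   # assumed: 10 s clip sampled at 44.1 kHz
print(len(starts), starts[:3])
print(one_hot(1, 3))
```

Under this assumption each clip contributes 25 training samples, and clips shorter than one window contribute none.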

Now we are ready to train our networks with the help of the Keras deep learning library, which provides a simple interface for building networks. But first, we define a custom Keras callback for evaluating the model after each epoch.

In [5]:
from keras.callbacks import Callback
from datetime import datetime
import os

class TestCallback(Callback):
    """Evaluate the model on the test set after each epoch and append the result to a CSV log."""
    def __init__(self, test_data, net_type):
        super().__init__()
        self.test_data = test_data
        self.net_type = net_type
        self.dt = datetime.now().strftime("%d-%m-%Y.%H-%M")
        
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        x, y = self.test_data
        loss, acc = self.model.evaluate(x, y, verbose=0)
        os.makedirs("logs", exist_ok=True)
        log_filename = os.path.join("logs", "log.") + self.net_type + "." + self.dt + ".csv"
        train_acc = logs.get("acc", logs.get("accuracy"))   # key name depends on the Keras version
        with open(log_filename, "a") as log:
            # columns: epoch no, test loss, test acc, train loss, train acc
            log.write("{},{},{},{},{}\n".format(epoch, loss, acc, logs.get("loss"), train_acc))
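
A log written by this callback can later be read back to find the best epoch and the gap between training and test accuracy. The sketch below assumes the five-column layout written above (epoch, test loss, test accuracy, train loss, train accuracy). The three rows are invented placeholder values used only to make the example runnable, not results of this work.

```python
import csv
import io

# Placeholder log content (made-up numbers, for illustration only).
log_text = (
    "0,1.02,0.48,1.05,0.45\n"
    "1,0.85,0.61,0.80,0.66\n"
    "2,0.90,0.59,0.55,0.81\n"
)

columns = ["epoch", "test_loss", "test_acc", "train_loss", "train_acc"]
rows = [
    dict(zip(columns, map(float, record)))
    for record in csv.reader(io.StringIO(log_text))
]

best = max(rows, key=lambda row: row["test_acc"])       # epoch with highest test accuracy
gap = rows[-1]["train_acc"] - rows[-1]["test_acc"]      # large gap hints at overfitting
print(int(best["epoch"]), round(gap, 2))
```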

First, we try a 3-layer BLSTM network adopted from [2], as it has proved effective on a similar problem. During the experiments, we found that using the ReLU activation function in the RNN layers causes the model to stop learning, which is most probably a consequence of vanishing gradients.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, accuracy_score
from keras.models import Sequential
from keras.layers import LSTM, Dense, Bidirectional


X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)

print("\nTraining model...")
# architecture of the network is adopted from https://arxiv.org/pdf/1511.07035.pdf
model = Sequential()
model.add(Bidirectional(LSTM(216, return_sequences=True, activation="tanh", dropout=0.5),
                        input_shape=(X_train.shape[1:])))
model.add(Bidirectional(LSTM(216, return_sequences=True, activation="tanh", dropout=0.4)))
model.add(Bidirectional(LSTM(216, activation="tanh", dropout=0.3)))
model.add(Dense(3, activation='softmax'))
model.summary()
model.compile(loss='categorical_crossentropy',
          optimizer='adam',
          metrics=['accuracy'])

test_callback = TestCallback((X_test, y_test), "MFCC-BLSTM-drop")

hist = model.fit(X_train, y_train,
                 callbacks= [test_callback],
                 # validation_data=(X_test, y_test),
                 epochs=100,
                 batch_size=128,
                 verbose=1)

print("\nEvaluating...")
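# predict_classes is not available in recent Keras versions, so take the argmax of the class probabilities
y_pred = to_categorical(np.argmax(model.predict(X_test, verbose=1), axis=-1), num_classes=3)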
print(y_pred, y_test)
acc = accuracy_score(y_test, y_pred)
print("\nAccuracy:", acc)
rec = recall_score(y_test, y_pred, average="macro")
print("Recall:", rec)
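
The two reported metrics differ when classes are imbalanced: accuracy counts correct samples overall, while macro recall averages the per-class recall with equal weight per class. The following pure-Python sketch illustrates the difference. It works on integer labels rather than one-hot vectors, and the function names and the toy labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


# Toy example: class 0 dominates and class 1 is always missed.
y_true = [0, 0, 0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 2]
print(accuracy(y_true, y_pred), macro_recall(y_true, y_pred))
```

Here accuracy stays high even though one class is never recognized, which is why macro recall is a useful companion metric.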

The architecture of the network is shown in Figure 4.

<img src="figs/blstm.png">
<p style="text-align: center; font-weight:bold"> Fig.4. Architecture of the BLSTM network trained on MFCC features</p>


The above code already includes regularization (dropout), but the first experiment was performed without it. Let's compare the performance of the model without and with dropout (Figures 5 and 6, respectively).

<img src="figs/mfcc-blstm.png">
<p style="text-align: center; font-weight:bold"> Fig.5. Accuracy graphs of BLSTM network on MFCC features (w/o dropout)</p>

The graph shows clear evidence of overfitting: training accuracy reaches 98%, but test set accuracy is only 73% at its maximum.

<img src="figs/mfcc-blstm-drop.png">
<p style="text-align: center; font-weight:bold"> Fig.6. Accuracy graphs of BLSTM network on MFCC features (w/dropout) </p>

Apparently, dropout increases accuracy on the test set (unseen by the model), so we add it to all subsequent models regardless of their architecture.
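
For intuition, dropout can be sketched in a few lines. The version below is generic "inverted" dropout applied to a list of activations: each value is zeroed with probability `rate` during training and the survivors are rescaled so that the expected value is unchanged. It is a simplified illustration, not the Keras implementation (the `dropout` argument of the Keras LSTM layer applies to the layer inputs), and the function name is hypothetical.

```python
import random


def dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale survivors by 1 / (1 - rate)."""
    if rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)          # fixed seed so the sketch is reproducible
activations = [1.0] * 10000
dropped = dropout(activations, rate=0.5, rng=rng)
mean_after = sum(dropped) / len(dropped)
print(round(mean_after, 2))
```

Because a different random subset of units is silenced at every step, the network cannot rely on any single unit, which is what reduces overfitting.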

Next, we add convolutional layers at the beginning of the network and train it on the raw audio signal instead of the MFCC features.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, accuracy_score
from keras.models import Sequential
from keras.layers import LSTM, Conv1D, Dense, Bidirectional, TimeDistributed
from keras.layers import MaxPooling1D, Dropout, GlobalAveragePooling1D


data_raw_1 = data_raw[:,::10]
X_train, X_test, y_train, y_test = train_test_split(data_raw_1, labels, test_size=0.2, random_state=42)
# add a singleton time-step axis and a channel axis: (samples, 1, length, 1)
X_train = X_train[:, np.newaxis, :, np.newaxis]
X_test = X_test[:, np.newaxis, :, np.newaxis]


print("\nTraining model...")
print("\nShape of the dataset:", X_train.shape)

model = Sequential()
model.add(TimeDistributed(Conv1D(filters=64, kernel_size=64, strides=1, padding="same", activation='tanh'),
                          input_shape=X_train.shape[1:]))   # inferred from the data so it matches the downsampling step
model.add(TimeDistributed(MaxPooling1D(2)))
model.add(TimeDistributed(Dropout(0.4)))
model.add(TimeDistributed(Conv1D(64, 64, padding="same", activation='tanh')))
model.add(TimeDistributed(MaxPooling1D(2)))
model.add(TimeDistributed(Dropout(0.4)))
model.add(TimeDistributed(Conv1D(128, 64, padding="same", activation='tanh')))
model.add(TimeDistributed(MaxPooling1D(2)))
model.add(TimeDistributed(Dropout(0.4)))
model.add(TimeDistributed(Conv1D(128, 64, padding="same", activation='tanh')))
model.add(TimeDistributed(MaxPooling1D(2)))
# model.add(TimeDistributed(Dropout(0.4)))
model.add(TimeDistributed(Conv1D(256, 64, padding="same", activation='tanh')))
model.add(TimeDistributed(MaxPooling1D(2)))
# model.add(TimeDistributed(Dropout(0.4)))
model.add(TimeDistributed(Conv1D(256, 32, padding="same", activation='tanh')))
model.add(TimeDistributed(GlobalAveragePooling1D()))
model.add(Bidirectional(LSTM(216, return_sequences=True, activation="tanh", dropout=0.3)))
model.add(Bidirectional(LSTM(216, return_sequences=True, activation="tanh", dropout=0.2)))
model.add(Bidirectional(LSTM(216, activation="tanh")))
model.add(Dense(3, activation='softmax'))
model.summary()
model.compile(loss='categorical_crossentropy',
          optimizer='adam',
          metrics=['accuracy'])

test_callback = TestCallback((X_test, y_test), "RAW-CNN-BLSTM-drop")

hist = model.fit(X_train, y_train,
                 callbacks=[test_callback],
                 # validation_data=(X_1, y_1),
                 epochs=100,
                 batch_size=128,
                 verbose=1)


print("\nEvaluating...")
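# predict_classes is not available in recent Keras versions, so take the argmax of the class probabilities
y_pred = to_categorical(np.argmax(model.predict(X_test, verbose=1), axis=-1), num_classes=3)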
print(y_pred, y_test)
acc = accuracy_score(y_test, y_pred)
print("\nAccuracy:", acc)
rec = recall_score(y_test, y_pred, average="macro")
print("Recall:", rec)

The architecture of the network is shown in Figure 7.

<img src="figs/cnn-blstm.png">
<p style="text-align: center; font-weight:bold"> Fig.7. Architecture of the CNN-BLSTM network trained on raw audio</p>

This architecture is much more complicated than the initial one, and training one epoch on 1,000 snippets **takes about 1,400 s on a Core i7 + GTX 1080Ti** with TensorFlow-GPU.
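
Part of the cost comes from the long sequences processed by the first convolutional layers. The sketch below tracks the sequence length through the five pooling stages, assuming an input length of 8192 samples (an illustrative value; the real length depends on the downsampling step). It relies on two facts about the layers used: a convolution with `padding="same"` and stride 1 keeps the length, and `MaxPooling1D(2)` halves it, rounding down.

```python
def pooled_length(length, pool_size=2):
    """Output length of max pooling with stride equal to pool_size and no padding."""
    return length // pool_size


length = 8192                     # assumed input length in samples
lengths = [length]
for _ in range(5):                # five Conv1D + MaxPooling1D stages
    length = pooled_length(length)
    lengths.append(length)

print(lengths)
```

After the last convolution, global average pooling collapses the remaining positions into one vector per time step, so with a single time step the recurrent layers receive a sequence of length one.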

Now let's look at the performance of the model in Figure 8:

<img src="figs/raw-cnn-blstm-drop.png">
<p style="text-align: center; font-weight:bold"> Fig.8. Accuracy graphs of CNN-BLSTM network on raw audio (w/dropout)</p>

Here, in contrast to the previous case, we see underfitting, which is probably caused by a suboptimal choice of window length or of the network hyperparameters (for example, too strong regularization). Unfortunately, I could not test other parameters due to limited access to a GPU-powered workstation.

## 4. Discussion, conclusion and further steps ##

In this work, I explored the problem of audio classification using Google's AudioSet as an example. First, the existing approaches to feature extraction and model building were reviewed. Then the hypothesis that a CNN can be applied to raw audio was checked by implementing the proposed architectures. We saw that a 3-layer BLSTM network on MFCC features yields good results even though it is prone to overfitting. On the other hand, adding CNN layers at the beginning of the network complicates it significantly but does not increase performance (although it does not overfit much).

Next steps could include extensive testing of different hyperparameters for both implemented networks (varying the number and type of layers, the number of neurons/filters, the size of the filters, regularization, etc.), but this was not done due to limited computational resources and time. For the same reason, cross-validation was not performed, as k-fold cross-validation would increase the training time by a factor of k.
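
The k-fold scheme mentioned above can be sketched as an index split: the samples are divided into k folds, and each fold serves once as the test set while the remaining folds form the training set, which is why the model has to be trained k times. The function name is hypothetical, and the sketch does not shuffle the samples, which a real experiment would normally do first.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering all samples exactly once as test."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    folds, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


folds = k_fold_indices(10, 3)
for train, test in folds:
    print(len(train), test)
```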

## 5. References ##

1. Gemmeke, Jort F., et al. "Audio Set: An ontology and human-labeled dataset for audio events." *IEEE ICASSP*. 2017.
2. Abdić, Irman, et al. "Detecting road surface wetness from audio: A deep learning approach." *Pattern Recognition (ICPR), 2016 23rd International Conference on. IEEE*, 2016.
3. Prasad, Bhanu, and SR Mahadeva Prasanna, eds. Speech, audio, image and biomedical signal processing using neural networks. Vol. 83. Springer, 2007.
4. Xiong, Ziyou, et al. "Comparing MFCC and MPEG-7 audio features for feature extraction, maximum likelihood HMM and entropic prior HMM for sports audio classification." Multimedia and Expo, 2003. ICME'03. Proceedings. 2003 International Conference on. Vol. 3. IEEE, 2003.
5. Scheirer, Eric, and Malcolm Slaney. "Construction and evaluation of a robust multifeature speech/music discriminator." Acoustics, Speech, and Signal Processing, 1997. ICASSP-97., 1997 IEEE International Conference on. Vol. 2. IEEE, 1997.
6. Lu, Lie, Stan Z. Li, and Hong-Jiang Zhang. "Content-based audio segmentation using support vector machines." Proceedings of the IEEE International Conference on Multimedia and Expo (ICME 2001). 2001.
7. Zhang, Tong, and C-C. Jay Kuo. "Content-based classification and retrieval of audio." Advanced Signal Processing Algorithms, Architectures, and Implementations VIII. Vol. 3461. International Society for Optics and Photonics, 1998.
8. Davis, Steven, and Paul Mermelstein. "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences." IEEE transactions on acoustics, speech, and signal processing 28.4 (1980): 357-366.
9. Chu, Selina, Shrikanth Narayanan, and C-C. Jay Kuo. "Environmental sound recognition with time–frequency audio features." IEEE Transactions on Audio, Speech, and Language Processing 17.6 (2009): 1142-1158.
10. Casey, Michael. "MPEG-7 sound-recognition tools." IEEE Transactions on circuits and Systems for video Technology 11.6 (2001): 737-747.
11. Mathieu, Benoit, et al. "YAAFE, an Easy to Use and Efficient Audio Feature Extraction Software." ISMIR. 2010.
12. Eyben, Florian, Martin Wöllmer, and Björn Schuller. "Opensmile: the munich versatile and fast open-source audio feature extractor." Proceedings of the 18th ACM international conference on Multimedia. ACM, 2010.
13. Eronen, Antti. "Automatic musical instrument recognition." Master's thesis, Tampere University of Technology (2001): 178.
14. Duxbury, Chris, et al. "Complex domain onset detection for musical signals." Proc. Digital Audio Effects Workshop (DAFx). Vol. 1. London: Queen Mary University, 2003.
15. Makhoul, John. "Linear prediction: A tutorial review." Proceedings of the IEEE 63.4 (1975): 561-580.
16. Bäckström, Tom, and Carlo Magi. "Properties of line spectrum pair polynomials—A review." Signal processing 86.11 (2006): 3286-3298.
17. Moore, Brian CJ, Brian R. Glasberg, and Thomas Baer. "A model for the prediction of thresholds, loudness, and partial loudness." Journal of the Audio Engineering Society 45.4 (1997): 224-240.
18. Peeters, Geoffroy. "A large set of audio features for sound description (similarity and classification) in the CUIDADO project." (2004).
19. Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997): 1735-1780.
20. Eyben, Florian, et al. "Universal onset detection with bidirectional long-short term memory neural networks." Proc. 11th Intern. Soc. for Music Information Retrieval Conference, ISMIR, Utrecht, The Netherlands. 2010.
21. Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.