# Neural Networks course project: #
# Google AudioSet sound classification with Deep Learning #
### Sapienza University of Rome ###

### by Ivan Senilov (1787618) ###

## 1. Introduction ##

This work is the practical part of the Neural Networks course taught at Sapienza University of Rome.

The goal of this coursework is to:

1. Explore the [Google AudioSet](https://research.google.com/audioset/index.html)
2. Build a classification model based on neural network(s)
3. Validate the model
4. Evaluate and discuss the results

## 2. Audio classification problem ##

### 2.1 Audio features ###

Feature extraction is the signal processing task of computing a numerical representation of a signal that can be used to characterize an audio segment \cite{prasad2007speech}.

Most of the audio features fall into three categories \cite{1221332}:

1. **Energy-based**. For example, 4Hz modulation energy used for speech/music classification \cite{scheirer1997construction}.
2. **Spectrum-based**. Examples of this category are spectral roll-off, spectral flux, Mel Frequency Cepstral Coefficients (MFCC) \cite{scheirer1997construction}, linear spectrum pair and band periodicity \cite{lu2001content}.
3. **Perceptual-based**, such as pitch (estimated to discriminate songs and speech over music \cite{zhang1998content}).

The most developed areas of machine learning for audio classification are speech \cite{shrawankar2013techniques,lee2003optimizing,ghahremani2014pitch} and music \cite{logan2000mel,wang2006shazam,eronen2001comparison} recognition, where MFCCs are widely used as features. MFCCs were introduced in 1980 \cite{davis1980comparison} and outperformed the other parametric representations they were compared against in the recognition of spoken words. For other types of sound, however, the choice of feature extraction method is less obvious, even though MFCCs are also used, for example, in environmental sound classification \cite{chu2009environmental,cotton2011spectral}.
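To make the idea of MFCCs more concrete, the following is a minimal NumPy sketch of one common textbook pipeline for a single frame: windowed power spectrum, triangular mel filterbank, logarithm and DCT-II. The function names and the parameter values (`n_mels`, `n_mfcc`, the HTK-style mel formula) are illustrative assumptions; library implementations such as librosa differ in details like filter normalization and scaling, so the numbers will not match them exactly.

```python
import numpy as np


def hz_to_mel(f):
    # HTK-style mel scale (one of several variants in use)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def mfcc_frame(frame, sr, n_mels=40, n_mfcc=20):
    """MFCC vector of one frame (illustrative, unnormalized)."""
    n_fft = frame.size
    power = np.abs(np.fft.rfft(frame * np.hanning(n_fft))) ** 2
    log_mel = np.log(mel_filterbank(n_mels, n_fft, sr) @ power + 1e-10)
    # DCT-II of the log mel energies decorrelates neighbouring bands
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_mels))
    return basis @ log_mel


sr = 16000                                   # assumed sample rate
t = np.arange(1024) / sr
frame = np.sin(2 * np.pi * 440.0 * t)        # synthetic 440 Hz test tone
coeffs = mfcc_frame(frame, sr)
print(coeffs.shape)                          # (20,)
```

Stacking such vectors over consecutive frames gives the coefficient-by-time matrix that is used as network input later in this notebook.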

<img src="figs/feat_extr.png">

Extraction pipelines for MPEG-7 (left) and MFCC (right) features (reprinted from \cite{1221332})

A typical feature extraction process splits the audio signal into short frames of several milliseconds (the exact size is domain-dependent) and feeds each frame to an extraction function from one of the many frameworks for audio feature extraction (see Figure \ref{fig:feature_extraction} for examples of extraction pipelines for MPEG-7 \cite{casey2001mpeg} and MFCC features). The most popular frameworks include YAAFE \cite{mathieu2010yaafe} and openSMILE \cite{eyben2010opensmile,eyben2013recent}, which can extract the following feature types:

1. Amplitude modulation \cite{eronen2001automatic}. The analyzed frequency ranges are tremolo (4 - 8 Hz) and grain (10 - 40 Hz). For each of these ranges, the following are computed:
    * Frequency of maximum energy in range
    * Difference of the energy of this frequency and the mean energy over all frequencies
    * Difference of the energy of this frequency and the mean energy in range
    * Product of the two first values.
2. Autocorrelation coefficients $\mathit{ac}$ on each frame.
    $$ac(k)=\sum_{i=0}^{N-k-1}x(i)x(i+k),$$    
    where (here and below) $k$ is the lag in samples, $N$ is the length of the frame in samples and $x(i)$ is the $i$-th sample of the frame.
3. Onset detection using a complex domain spectral flux method \cite{duxbury2003complex}.
4. Energy $\mathit{en}$ as root mean square of an audio frame.
    $$en=\sqrt{\dfrac{\sum_{i=0}^{N-1}x(i)^2}{N}}$$    
5. Envelope of an oscillating signal (smooth curve outlining its extremes).    
6. Shape statistics (centroid, spread, skewness and kurtosis) of each frame’s Temporal Shape, Amplitude Envelope and Magnitude Spectrum.    
7. Linear Predictor Coefficients (LPC) of a signal frame \cite{makhoul1975linear}.    
8. Line Spectral Frequency (LSF) coefficients of a signal frame \cite{backstrom2006properties,schussler1976stability}.    
9. Loudness coefficients \cite{moore1997model}.    
10. Mel-frequency cepstral coefficients (MFCC) and Mel-frequency spectrum.
11. Magnitude spectrum.    
12. Octave band signal intensity (OBSI) using a triangular octave filter bank, and the OBSI ratio between consecutive octaves.
13. Sharpness and Spread of Loudness coefficients \cite{peeters2004large}.    
14. Spectral crest factor per log-spaced band of 1/4 octave.    
15. Spectral decrease, spectral flatness.    
16. Spectral Flux.    
17. Spectral roll-off (the frequency below which 99\% of the energy is contained) \cite{scheirer1997construction}.
18. Spectral Slope (computed by linear regression of the spectral amplitude) \cite{peeters2004large}.    
19. Spectral Variation (normalized correlation of spectrum between consecutive frames) \cite{peeters2004large}.    
20. Zero-crossing rate (ZCR) of a frame \cite{scheirer1997construction}.
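A few of the features listed above can be sketched directly in NumPy. The example below frames a synthetic signal and computes the RMS energy, the autocorrelation coefficients, the zero-crossing rate and the spectral roll-off of one frame. It is a simplified illustration under assumed parameters (frame length, hop size, test tone) and is not the YAAFE or openSMILE implementation.

```python
import numpy as np


def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (one per row).
    The tail that does not fill a whole frame is dropped."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])


def rms_energy(frame):
    """en = sqrt(sum(x(i)^2) / N)"""
    return np.sqrt(np.mean(frame ** 2))


def autocorrelation(frame, max_lag):
    """ac(k) = sum_{i=0}^{N-k-1} x(i) x(i+k) for k = 0 .. max_lag-1"""
    n = len(frame)
    return np.array([np.dot(frame[:n - k], frame[k:]) for k in range(max_lag)])


def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ."""
    signs = np.signbit(frame)
    return np.mean(signs[1:] != signs[:-1])


def spectral_rolloff(frame, sr, fraction=0.99):
    """Lowest frequency below which `fraction` of the spectral energy lies."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    cumulative = np.cumsum(power)
    idx = np.searchsorted(cumulative, fraction * cumulative[-1])
    return freqs[idx]


sr = 8000                                        # assumed sample rate
x = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)   # 1 s synthetic test tone
frames = frame_signal(x, frame_len=256, hop=128)
print(frames.shape)                              # (61, 256)
print(rms_energy(frames[0]), zero_crossing_rate(frames[0]))
print(autocorrelation(frames[0], max_lag=4))
print(spectral_rolloff(frames[0], sr))
```

Note that `ac(0)` equals the sum of squared samples, so it is `N` times the squared RMS energy of the same frame.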

## 3. Implementation ##

In this section, we will:

1. Load the ontology and the correspondence between video ids and class labels
2. Download the videos from YouTube into the respective directories
3. Load the labeled dataset into NumPy arrays
4. Build a classifier based on a neural network architecture
5. Evaluate it on a held-out test split

First, we import the data manipulation libraries.

In [1]:
import pandas as pd
import numpy as np

To simplify the task (and reduce its computational cost), we limit the number of categories to 3. We read the ontology into a pandas DataFrame and keep only the *id*, *name* and *child_ids* fields.

In [2]:
classes_of_interest = ("Vehicle", "Channel, environment and background", "Natural sounds")

# read the ontology directly from the file (recent pandas versions deprecate passing a raw JSON string)
ontology = pd.read_json("ontology.json")

ontology = ontology[["id", "name", "child_ids"]].set_index("id")

Then we recursively extract the *id*s of each category (class) of interest and of all its child categories, building a dictionary where the key is the class name and the value is a list of *id*s.

In [3]:
def extract_ids(cls, data):
    """Recursively collect the id of class `cls` and the ids of all its descendants."""
    out = []
    for class_id, row in data.iterrows():
        if row["name"] == cls:
            out.append(class_id)
            for child in row["child_ids"]:
                # data["name"][child] maps a child id back to its class name
                out.append(extract_ids(data["name"][child], data))

    return flatten(out)


def flatten(nested):
    """Flatten an arbitrarily nested list into a flat list."""
    flat = []
    for item in nested:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat

In [4]:
classes = {i:[] for i in classes_of_interest}
for cls in classes_of_interest:
    classes[cls] = extract_ids(cls, ontology)
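The recursive collection of child ids can be illustrated on a small, made-up ontology. The sketch below uses plain dictionaries instead of a DataFrame; the ids and class hierarchy are hypothetical and only mimic the *id* / *name* / *child_ids* structure described above.

```python
def collect_ids(root_id, children):
    """Return root_id followed by the ids of all its descendants (depth-first)."""
    ids = [root_id]
    for child_id in children.get(root_id, []):
        ids.extend(collect_ids(child_id, children))
    return ids


# Hypothetical toy ontology: id -> (name, child ids). These ids are invented.
toy_ontology = {
    "/t/vehicle": ("Vehicle", ["/t/car", "/t/boat"]),
    "/t/car": ("Car", ["/t/horn"]),
    "/t/horn": ("Car horn", []),
    "/t/boat": ("Boat", []),
    "/t/rain": ("Rain", []),
}
children = {key: value[1] for key, value in toy_ontology.items()}
name_to_id = {value[0]: key for key, value in toy_ontology.items()}

vehicle_ids = collect_ids(name_to_id["Vehicle"], children)
print(vehicle_ids)   # ['/t/vehicle', '/t/car', '/t/horn', '/t/boat']
```

If a class can be reached through more than one parent (which may be the case in the real ontology), this traversal returns its id more than once; wrapping the result in `set(...)` removes such duplicates.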

Here we read the tables that map each YouTube ID and video segment to the ids from the ontology.

In [5]:
train_data = pd.read_csv("balanced_train_segments.csv", skiprows=2, sep=", ", engine="python", index_col="# YTID")
train_data["positive_labels"] = train_data["positive_labels"].apply(lambda x: x.replace('\"','').split(","))

test_data = pd.read_csv("eval_segments.csv", skiprows=2, sep=", ", engine="python", index_col="# YTID")
test_data["positive_labels"] = test_data["positive_labels"].apply(lambda x: x.replace('\"','').split(","))

train_data.head()

| # YTID | start_seconds | end_seconds | positive_labels |
|---|---|---|---|
| --PJHxphWEs | 30.0 | 40.0 | [/m/09x0r, /t/dd00088] |
| --ZhevVpy1s | 50.0 | 60.0 | [/m/012xff] |
| --aE2O5G5WE | 0.0 | 10.0 | [/m/03fwl, /m/04rlf, /m/09x0r] |
| --aO5cdqSAg | 30.0 | 40.0 | [/t/dd00003, /t/dd00005] |
| --aaILOrkII | 200.0 | 210.0 | [/m/032s66, /m/073cg4] |
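The same parsing step can also be sketched with the standard library only. The example below assumes a file layout of two comment lines, a header line, and rows whose last field is a quoted, comma-separated label list; the two comment lines in `sample` are placeholders, and the data rows reuse values from the table above.

```python
import csv

# Placeholder file contents: the first two lines stand in for the file's comment lines.
sample = '''# (comment line 1)
# (comment line 2)
# YTID, start_seconds, end_seconds, positive_labels
--PJHxphWEs, 30.000, 40.000, "/m/09x0r,/t/dd00088"
--ZhevVpy1s, 50.000, 60.000, "/m/012xff"
'''


def parse_segments(text):
    """Return {YTID: (start_seconds, end_seconds, [labels])}."""
    lines = text.splitlines()[3:]          # skip two comment lines and the header
    # skipinitialspace lets the reader treat the quoted label list as one field
    reader = csv.reader(lines, skipinitialspace=True)
    segments = {}
    for row in reader:
        if not row:                        # tolerate blank lines
            continue
        ytid, start, end, labels = row
        segments[ytid] = (float(start), float(end), labels.split(","))
    return segments


segments = parse_segments(sample)
print(segments["--PJHxphWEs"])   # (30.0, 40.0, ['/m/09x0r', '/t/dd00088'])
```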


Finally, we download the video segments from YouTube into the folder of the respective class.

In [6]:
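import youtube_dl   # Python library for downloading from YouTube
import os
import subprocess


def crop(start, length, filename):
    """Cut `length` seconds starting at `start` out of `filename` with ffmpeg
    and write the result as mono audio to <name>_.wav next to the original."""
    root, _ = os.path.splitext(filename)    # safe even if a directory name contains dots
    command = ["ffmpeg", "-y", "-i", filename,
               "-ss", str(start), "-t", str(length),
               "-ac", "1", root + "_.wav"]  # "-acodec copy" dropped: stream copy bypasses the mono downmix
    subprocess.run(command)                 # argument list avoids shell quoting problems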

to_download = {i:[] for i in classes_of_interest}
    
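for cls in classes_of_interest:
    for class_id in classes[cls]:
        for row in train_data.itertuples():
            if row[3][0] == class_id:   # row[3] is positive_labels; only the first label is compared
                to_download[cls].append(row[0])   # row[0] is the YouTube ID (the index)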
                

'''
options for youtube-dl
'''
options = {
    'format': 'bestaudio/best',
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'wav'   # wav format for lossless features extraction
    }],
    'extractaudio' : True,
    'ignoreerrors' : True,
    'audioformat' : "wav",
    'noplaylist' : True,    # only download single clip, not playlist
}
                
for cls in to_download:
    # set the download path, removing commas and spaces to avoid filesystem access problems
    options['outtmpl'] = os.path.join(cls.replace(",","").replace(" ",""), '%(id)s.%(ext)s')
    for file in to_download[cls]:
        with youtube_dl.YoutubeDL(options) as ydl:
            ydl.download([file])   # downloading full audio clip (due to limitations of youtube-dl)
            filename = os.path.join(cls.replace(",","").replace(" ",""), file + ".wav")
            crop(train_data.loc[file]["start_seconds"],
                 10, filename)     # cropping of the clip in accordance with dataset csv file 
            try:
                os.remove(filename)   # removing original (non-cropped) clip
            except OSError:
                pass


KeyboardInterrupt: 
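The selection loop above compares only the first label of every segment with each ontology id. As a hedged alternative (not the method used in this notebook), the sketch below builds a reverse index from ontology id to class name and checks all labels of a clip. The function name `assign_clips` and the toy data are hypothetical.

```python
def assign_clips(clip_labels, classes):
    """Map each class name to the clip ids whose labels contain
    at least one ontology id belonging to that class."""
    id_to_class = {}
    for cls, ids in classes.items():
        for ontology_id in ids:
            id_to_class.setdefault(ontology_id, set()).add(cls)

    selected = {cls: [] for cls in classes}
    for clip_id, labels in clip_labels.items():
        matched = set()
        for label in labels:
            matched |= id_to_class.get(label, set())
        for cls in sorted(matched):
            selected[cls].append(clip_id)
    return selected


# Hypothetical toy data in the same shape as `classes` and the label lists above
toy_classes = {"Vehicle": ["/t/vehicle", "/t/car"], "Natural sounds": ["/t/rain"]}
toy_clips = {
    "clipA": ["/t/speech", "/t/car"],   # matches Vehicle through its second label
    "clipB": ["/t/rain"],
    "clipC": ["/t/music"],              # matches no class of interest
}
print(assign_clips(toy_clips, toy_classes))
# {'Vehicle': ['clipA'], 'Natural sounds': ['clipB']}
```

With this variant a clip that carries labels from several classes of interest is assigned to each of them, which may or may not be desirable for a single-label classifier.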

Next, we extract features with the help of the SciPy and librosa (for MFCC extraction) libraries.

In [9]:
import os
from scipy.io import wavfile
from librosa import feature

# Remove commas and spaces from class names to avoid problems with directory names
classes_dirs = [i.replace(",","").replace(" ","") for i in classes_of_interest]

feat = []
labels = []
class_num = 0
window = 32768                          # size of window for each sample
for directory in classes_dirs:
    count = 0
    for file in os.listdir(directory):
        filename = os.path.join(directory, file)
        if os.path.isfile(filename) and count < 1000:    # limiting number of files to read
            count += 1
            rate, frames = wavfile.read(filename)
            if len(frames.shape) > 1:    # if stereo, take only one channel
                frames = frames[:,0]
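            frames = frames.astype(np.float32)   # librosa expects floating-point audio
            for i in range(0, len(frames)-window, window // 2):   # consecutive windows overlap by 50%
                pxx = feature.mfcc(y=frames[i:i + window - 1], sr=rate, n_mfcc=20)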
                feat.append(pxx)
                labels.append(class_num)
    class_num += 1                          # each successive class is represented by incremented integer
data = np.stack(feat)

from keras.utils import to_categorical
labels = to_categorical(labels)             # convert class number to 'one-hot' vector
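The windowing arithmetic in the loop above can be checked without any audio. The sketch below counts the half-overlapping windows cut from one clip and the number of MFCC time frames per window. The 10-second clip at 44.1 kHz is an assumption (the actual rate depends on the downloaded audio), and the frame-count formula assumes a centered STFT with a hop of 512 samples, which is librosa's default as far as we know.

```python
def window_starts(n_samples, window, hop):
    """Start indices produced by range(0, n_samples - window, hop)."""
    return list(range(0, n_samples - window, hop))


def stft_frames(n_samples, hop_length=512):
    """Number of frames of a centered STFT: 1 + floor(n_samples / hop_length)."""
    return 1 + n_samples // hop_length


window = 32768
n_samples = 10 * 44100                 # assumed: 10 s clip sampled at 44.1 kHz

starts = window_starts(n_samples, window, window // 2)
print(len(starts))                     # 25 windows per clip
print(stft_frames(window - 1))         # 64 MFCC frames per window
```

Under these assumptions each window yields a 20 x 64 MFCC matrix, which agrees with the input shape reported in the model summaries of the following cells.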

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, accuracy_score
from keras.models import Sequential
from keras.layers import LSTM, Conv2D, Dense, MaxPooling2D, GlobalAveragePooling2D
from keras.layers import Bidirectional   # keras.layers.wrappers is not available in recent Keras releases
# from keras.callbacks import TensorBoard, ModelCheckpoint, Callback
from keras import optimizers, regularizers

from datetime import datetime



X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], X_train.shape[2], 1))
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], X_test.shape[2], 1))

print("\nTraining model...")

model = Sequential()
model.add(Conv2D(filters=64, kernel_size=5, strides=1, padding="same", activation='relu', input_shape=X_train.shape[1:]))
model.add(MaxPooling2D(2))
model.add(Conv2D(filters=128, kernel_size=5))
model.add(GlobalAveragePooling2D())
model.add(Dense(y_train.shape[1], activation='softmax'))   # one output unit per class in the one-hot labels
model.summary()
model.compile(loss='categorical_crossentropy',
          optimizer='adam',
          metrics=['accuracy'])

hist = model.fit(X_train, y_train,
                 # callbacks=[tbCallback, mcCallback, testCallback0, testCallback1, testCallback2, testCallback3],
                 # validation_data=(X_1, y_1),
                 epochs=100,
                 batch_size=128,
                 verbose=1)


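print("\nEvaluating...")
# predict_classes is not available in recent Keras releases, so take the argmax of the predicted scores
y_pred = np.argmax(model.predict(X_test, verbose=1), axis=1)
y_true = np.argmax(y_test, axis=1)     # convert one-hot labels back to class indices
print(y_pred, y_true)
acc = accuracy_score(y_true, y_pred)
print("\nAccuracy:", acc)
rec = recall_score(y_true, y_pred, average="macro")
print("Recall:", rec)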


Training model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_11 (Conv2D)           (None, 20, 64, 64)        1664      
_________________________________________________________________
max_pooling2d_6 (MaxPooling2 (None, 10, 32, 64)        0         
_________________________________________________________________
conv2d_12 (Conv2D)           (None, 6, 28, 128)        204928    
_________________________________________________________________
global_average_pooling2d_3 ( (None, 128)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 258       
Total params: 206,850
Trainable params: 206,850
Non-trainable params: 0
_________________________________________________________________
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100

KeyboardInterrupt: 
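The output shapes and parameter counts in the summary above can be reproduced by hand. The sketch below applies the standard formulas for convolution output size and parameter count to the layers of this model; the helper names are hypothetical, and the input shape (20, 64, 1) and the two output units are taken from the summary.

```python
def conv_output(size, kernel, stride=1, padding="valid"):
    """Output length of a convolution or pooling step along one axis."""
    if padding == "same":
        return -(-size // stride)              # ceil(size / stride)
    return (size - kernel) // stride + 1


def conv2d_params(kernel, in_channels, filters):
    """kernel x kernel weights per input channel plus one bias, for every filter."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(inputs, units):
    return (inputs + 1) * units


h, w = 20, 64                                   # MFCC coefficients x time frames
h, w = conv_output(h, 5, padding="same"), conv_output(w, 5, padding="same")
p_conv1 = conv2d_params(5, 1, 64)               # first Conv2D: shape (20, 64, 64)
h, w = conv_output(h, 2, stride=2), conv_output(w, 2, stride=2)   # MaxPooling2D(2): (10, 32, 64)
h, w = conv_output(h, 5), conv_output(w, 5)     # second Conv2D, default "valid" padding: (6, 28, 128)
p_conv2 = conv2d_params(5, 64, 128)
p_dense = dense_params(128, 2)                  # after global average pooling: 128 features

print((h, w), p_conv1, p_conv2, p_dense)        # (6, 28) 1664 204928 258
print(p_conv1 + p_conv2 + p_dense)              # 206850
```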

During the experiments we found that using the ReLU activation function in the RNN layers stops the model from learning, which is most probably a consequence of vanishing gradients. The recurrent model below therefore uses tanh activations.

In [22]:
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], X_train.shape[2]))
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], X_test.shape[2]))

print("\nTraining model...")
# the network architecture is adapted from https://arxiv.org/pdf/1511.07035.pdf
model = Sequential()
model.add(Bidirectional(LSTM(216, return_sequences=True, activation="tanh"),
                        input_shape=X_train.shape[1:]))
model.add(Bidirectional(LSTM(216, return_sequences=True, activation="tanh")))
model.add(Bidirectional(LSTM(216, activation="tanh")))
# softmax (rather than sigmoid) pairs with categorical_crossentropy for mutually exclusive classes
model.add(Dense(y_train.shape[1], activation='softmax'))
model.summary()
model.compile(loss='categorical_crossentropy',
          optimizer='adam',
          metrics=['accuracy'])

hist = model.fit(X_train, y_train,
                 # callbacks=[tbCallback, mcCallback, testCallback0, testCallback1, testCallback2, testCallback3],
                 validation_data=(X_test, y_test),
                 epochs=100,
                 batch_size=128,
                 verbose=1)


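print("\nEvaluating...")
# predict_classes is not available in recent Keras releases, so take the argmax of the predicted scores
y_pred = np.argmax(model.predict(X_test, verbose=1), axis=1)
y_true = np.argmax(y_test, axis=1)     # convert one-hot labels back to class indices
print(y_pred, y_true)
acc = accuracy_score(y_true, y_pred)
print("\nAccuracy:", acc)
rec = recall_score(y_true, y_pred, average="macro")
print("Recall:", rec)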


Training model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bidirectional_19 (Bidirectio (None, 20, 432)           485568    
_________________________________________________________________
bidirectional_20 (Bidirectio (None, 20, 432)           1121472   
_________________________________________________________________
bidirectional_21 (Bidirectio (None, 432)               1121472   
_________________________________________________________________
dense_11 (Dense)             (None, 2)                 866       
Total params: 2,729,378
Trainable params: 2,729,378
Non-trainable params: 0
_________________________________________________________________
Train on 22125 samples, validate on 5532 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/1

KeyboardInterrupt: 
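The parameter counts in the recurrent model summary follow from the standard LSTM formula: four gates, each with input weights, recurrent weights and a bias. The sketch below reproduces the numbers; the helper names are hypothetical, and the input width of 64 and the 216 units are taken from the code and summary above.

```python
def lstm_params(input_dim, units):
    """4 gates x (input weights + recurrent weights + bias)."""
    return 4 * (units * (input_dim + units) + units)


def bidirectional_lstm_params(input_dim, units):
    """A bidirectional wrapper holds two independent LSTMs (forward and backward)."""
    return 2 * lstm_params(input_dim, units)


units = 216
first = bidirectional_lstm_params(64, units)           # input: 64 features per step
second = bidirectional_lstm_params(2 * units, units)   # input: 432 = concatenated directions
third = bidirectional_lstm_params(2 * units, units)
dense = (2 * units + 1) * 2                            # 432 inputs plus bias, 2 output units

print(first, second, third, dense)      # 485568 1121472 1121472 866
print(first + second + third + dense)   # 2729378
```

Note that with the input shaped as (20, 64), the LSTM iterates over the 20 MFCC coefficients as its sequence axis and sees the 64 time frames as features at each step.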

## x. References ##

1. Gemmeke, Jort F., et al. "Audio Set: An ontology and human-labeled dataset for audio events." *IEEE ICASSP*. 2017.
2. Abdić, Irman, et al. "Detecting road surface wetness from audio: A deep learning approach." *Pattern Recognition (ICPR), 2016 23rd International Conference on. IEEE*, 2016.