# Neural Networks course project: #
# Google AudioSet sound classification with Deep Learning #
### Sapienza University of Rome ###

### by Ivan Senilov (1787618) ###

## 1. Introduction ##

This work is the practical part of the Neural Networks course taught at Sapienza University.

The goal of this coursework is to:

1. Explore the [Google AudioSet](https://research.google.com/audioset/index.html) dataset
2. Build a classification model based on neural network(s)
3. Validate the model
4. Evaluate and discuss the results

## 2. Audio classification problem ##

### 2.1 Audio features ###

Feature extraction is the signal processing task of computing a numerical representation of a signal that can be used to characterize an audio segment \cite{prasad2007speech}.

Most of the audio features fall into three categories \cite{1221332}:

1. **Energy-based**. For example, the 4 Hz modulation energy used for speech/music classification \cite{scheirer1997construction}.
2. **Spectrum-based**. Examples are the spectral roll-off, spectral flux and Mel Frequency Cepstral Coefficients (MFCC) \cite{scheirer1997construction}, as well as linear spectrum pairs and band periodicity \cite{lu2001content}.
3. **Perception-based**, such as pitch (estimated to discriminate songs and speech over music \cite{zhang1998content}).

The most developed areas of machine learning for audio classification are speech \cite{shrawankar2013techniques,lee2003optimizing,ghahremani2014pitch} and music \cite{logan2000mel,wang2006shazam,eronen2001comparison} recognition, where MFCCs are widely used as features. MFCCs were introduced in 1980 \cite{davis1980comparison}, where they outperformed the other representations compared in that study on the recognition of spoken words. For other types of sound, however, the choice of feature extraction method is less obvious, even though MFCCs have also been applied to, for example, environmental sound classification \cite{chu2009environmental,cotton2011spectral}.

<img src="figs/feat_extr.png">

Extraction pipelines for MPEG-7 (left) and MFCC (right) features (reprinted from \cite{1221332})
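
As a rough illustration of the MFCC branch, the sketch below follows a common textbook formulation (framing, Hamming window, power spectrum, triangular mel filter bank, logarithm, DCT-II). It is not taken from the cited paper, and the exact steps in the figure may differ. All parameter values (`frame_len`, `hop`, `n_fft`, `n_filters`, `n_coeffs`) and function names are assumptions chosen for the example.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters equally spaced on the mel scale, shape (n_filters, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):      # rising edge of the triangle
            bank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge of the triangle
            bank[m - 1, k] = (right - k) / (right - center)
    return bank


def mfcc(signal, sample_rate, frame_len=400, hop=160, n_fft=512, n_filters=26, n_coeffs=13):
    """Return an array of shape (n_frames, n_coeffs); all defaults are illustrative."""
    # 1. split the signal into overlapping frames and apply a Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # 2. power spectrum of every frame
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    # 3. log energy in each mel band (small constant avoids log(0))
    log_energies = np.log(power @ mel_filterbank(n_filters, n_fft, sample_rate).T + 1e-10)
    # 4. DCT-II of the log energies, keeping the first n_coeffs coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.arange(n_coeffs)[:, None] * (2 * n[None, :] + 1) / (2 * n_filters))
    return log_energies @ basis.T


# one second of a 440 Hz tone sampled at 16 kHz (synthetic test signal)
t = np.arange(16000) / 16000.0
coeffs = mfcc(np.sin(2 * np.pi * 440.0 * t), 16000)
print(coeffs.shape)  # (98, 13)
```

In practice a dedicated library would normally be used instead; the sketch only makes the individual stages explicit.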

A typical feature extraction process splits the audio signal into small chunks of several ms (the exact size is domain-dependent) and passes each chunk to the extraction routine of one of the many available audio feature frameworks (see Figure \ref{fig:feature_extraction} for examples of extraction pipelines for MPEG-7 \cite{casey2001mpeg} and MFCC features). The most popular frameworks include YAAFE \cite{mathieu2010yaafe} and openSMILE \cite{eyben2010opensmile,eyben2013recent}, which can extract the following feature types:
\begin{itemize}
	\item Amplitude Modulation  \cite{eronen2001automatic}. Analyzed frequency ranges are: Tremolo (4 - 8 Hz) and Grain (10 - 40 Hz). For each of these ranges, it computes:
	\begin{itemize}
		\item Frequency of maximum energy in range
		\item Difference of the energy of this frequency and the mean energy over all frequencies
		\item Difference of the energy of this frequency and the mean energy in range
		\item Product of the two first values.
	\end{itemize}
	\item Autocorrelation coefficients $\mathit{ac}$ on each frame.
    $$ac(k)=\sum_{i=0}^{N-k-1}x(i)x(i+k),$$    
    where $k$ is the lag in samples, $N$ (here and below) is the length of the frame in samples and $x(i)$ is the $i$-th sample of the frame.
    \item Onset detection using a complex domain spectral flux method \cite{duxbury2003complex}.
    \item Energy $\mathit{en}$ as root mean square of an audio frame.
    $$en=\sqrt{\dfrac{\sum_{i=0}^{N-1}x(i)^2}{N}}$$    
    \item Envelope of an oscillating signal (smooth curve outlining its extremes).    
    \item Shape statistics (centroid, spread, skewness and kurtosis) of each frame’s Temporal Shape, Amplitude Envelope and Magnitude Spectrum.    
    \item Linear Predictor Coefficients (LPC) of a signal frame \cite{makhoul1975linear}.    
    \item Line Spectral Frequency (LSF) coefficients of a signal frame \cite{backstrom2006properties,schussler1976stability}.    
    \item Loudness coefficients \cite{moore1997model}.    
    \item Mel-frequency cepstral coefficients (MFCC) and Mel-frequency spectrum.
    \item Magnitude spectrum.    
    \item Octave band signal intensity (OBSI) using a triangular octave filter bank, and the OBSI ratio between consecutive octaves.
    \item Sharpness and Spread of Loudness coefficients \cite{peeters2004large}.    
    \item Spectral crest factor per log-spaced band of 1/4 octave.    
    \item Spectral decrease, spectral flatness.    
    \item Spectral Flux.    
    \item Spectral roll-off (the frequency below which 99\% of the energy is contained) \cite{scheirer1997construction}.
    \item Spectral Slope (computed by linear regression of the spectral amplitude) \cite{peeters2004large}.    
    \item Spectral Variation (normalized correlation of spectrum between consecutive frames) \cite{peeters2004large}.    
    \item Zero-crossing rate (ZCR) for frame \cite{scheirer1997construction}.
\end{itemize}
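
Several of the simpler features in the list above can be written in a few lines of NumPy. The following is a minimal sketch under stated assumptions, not the YAAFE or openSMILE implementation: the function names, the frame length, the hop size and the synthetic test signals are all chosen for illustration only.

```python
import numpy as np


def split_frames(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames, shape (n_frames, frame_len)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def autocorrelation(frame, max_lag):
    """ac(k) = sum_{i=0}^{N-k-1} x(i) x(i+k) for k = 0 .. max_lag - 1."""
    n = len(frame)
    return np.array([np.dot(frame[: n - k], frame[k:]) for k in range(max_lag)])


def rms_energy(frame):
    """Root mean square of the samples in the frame."""
    return np.sqrt(np.mean(frame ** 2))


def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ."""
    positive = frame >= 0
    return np.mean(positive[:-1] != positive[1:])


def spectral_rolloff(frame, sample_rate, fraction=0.99):
    """Lowest frequency below which `fraction` of the spectral energy lies."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    cumulative = np.cumsum(power)
    bin_index = np.searchsorted(cumulative, fraction * cumulative[-1])
    return np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)[bin_index]


# toy frame that alternates in sign on every sample
frame = np.array([1.0, -1.0, 1.0, -1.0])
print(autocorrelation(frame, 3), rms_energy(frame), zero_crossing_rate(frame))

# 0.1 s of a 1 kHz tone sampled at 8 kHz (synthetic test signal)
tone = np.sin(2 * np.pi * 1000.0 * np.arange(800) / 8000.0)
print(spectral_rolloff(tone, 8000))
```

Each function works on a single frame, so a whole clip would be processed by applying it to every row returned by `split_frames`.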

In [1]:
import pandas as pd
import numpy as np

In [2]:
classes_of_interest = ("Vehicle", "Channel, environment and background", "Natural sounds")

# read the ontology directly from the file (recent pandas versions deprecate passing a raw JSON string)
ontology = pd.read_json("ontology.json")

ontology = ontology[["id", "name", "child_ids"]].set_index("id")

In [3]:
def extract_ids(cls, data):
    """Return the ID of the class named `cls` followed by the IDs of all its descendants."""
    out = []
    for class_id, row in data.iterrows():
        if row["name"] == cls:
            out.append(class_id)
            # recurse into every child class, looked up by its name
            for child in row["child_ids"]:
                out.append(extract_ids(data["name"][child], data))

    return flatten(out)


def flatten(nested):
    """Flatten an arbitrarily nested list into a flat list."""
    flat = []
    for item in nested:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat
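
The recursive lookup above rescans the whole table for every child class. A possible alternative, sketched here under assumptions, walks the hierarchy with an explicit stack and a dictionary from class ID to child IDs. The function name `collect_descendants` and the toy ontology (including its `/t/...` IDs) are hypothetical and serve only to show the traversal; the real `ontology.json` has the same `id`, `name` and `child_ids` fields but different contents.

```python
import pandas as pd


def collect_descendants(root_name, ontology):
    """Return the IDs of the class named `root_name` and of all its descendants."""
    children = ontology["child_ids"].to_dict()            # class ID -> list of child IDs
    stack = ontology.index[ontology["name"] == root_name].tolist()
    seen = []
    while stack:
        class_id = stack.pop()
        if class_id in seen:                              # a class may be reachable twice
            continue
        seen.append(class_id)
        stack.extend(children.get(class_id, []))
    return seen


# hypothetical miniature ontology, indexed by class ID like the real one
toy = pd.DataFrame(
    {
        "id": ["/t/a", "/t/b", "/t/c", "/t/d", "/t/e"],
        "name": ["Vehicle", "Car", "Truck", "Speech", "Race car"],
        "child_ids": [["/t/b", "/t/c"], ["/t/e"], [], [], []],
    }
).set_index("id")

ids = collect_descendants("Vehicle", toy)
print(sorted(ids))
```

Unlike the recursive version, this one returns each ID at most once, which matters if a class is listed under more than one parent.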

In [8]:
classes = {i:[] for i in classes_of_interest}
for cls in classes_of_interest:
    classes[cls] = extract_ids(cls, ontology)

In [51]:
def load_segments(path):
    """Load a segments CSV and turn the quoted label string into a list of class IDs."""
    # the first two lines are skipped; the third one ("# YTID, ...") is used as the header
    data = pd.read_csv(path, skiprows=2, sep=", ", engine="python", index_col="# YTID")
    data["positive_labels"] = data["positive_labels"].apply(lambda x: x.replace('"', '').split(","))
    return data


train_data = load_segments("balanced_train_segments.csv")
test_data = load_segments("eval_segments.csv")
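
The same parsing can be sketched without a regular-expression separator by letting pandas skip the space after each comma, so that the quoted label field is read as a single value. The sample below is illustrative: its two data rows reuse values shown in the output table of this notebook, while the two leading comment lines are placeholders and not the real file header.

```python
import io

import pandas as pd

# Illustrative sample in the layout assumed for the segment files:
# two comment lines, a header line starting with "# YTID", then one row per clip.
sample = '''# placeholder comment line
# another placeholder comment line
# YTID, start_seconds, end_seconds, positive_labels
--PJHxphWEs, 30.000, 40.000, "/m/09x0r,/t/dd00088"
--ZhevVpy1s, 50.000, 60.000, "/m/012xff"
'''

segments = pd.read_csv(
    io.StringIO(sample),
    skiprows=2,              # drop the two comment lines
    skipinitialspace=True,   # ignore the space that follows every comma
    index_col="# YTID",
)
# the quotes are consumed by the parser, so only the split is left to do
segments["positive_labels"] = segments["positive_labels"].str.split(",")
print(segments)
```

With this variant the default C parser can be used and no manual quote stripping is needed.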

In [52]:
train_data

| # YTID | start_seconds | end_seconds | positive_labels |
|---|---|---|---|
| --PJHxphWEs | 30.0 | 40.0 | [/m/09x0r, /t/dd00088] |
| --ZhevVpy1s | 50.0 | 60.0 | [/m/012xff] |
| --aE2O5G5WE | 0.0 | 10.0 | [/m/03fwl, /m/04rlf, /m/09x0r] |
| --aO5cdqSAg | 30.0 | 40.0 | [/t/dd00003, /t/dd00005] |
| --aaILOrkII | 200.0 | 210.0 | [/m/032s66, /m/073cg4] |
| --cB2ZVjpnA | 30.0 | 40.0 | [/m/01y3hg] |
| --ekDLDTUXA | 30.0 | 40.0 | [/m/015lz1, /m/07pws3f] |
| -0DLPzsiXXE | 30.0 | 40.0 | [/m/04rlf, /m/07qwdck] |
| -0DdlOuIFUI | 50.0 | 60.0 | [/m/0130jx, /m/02jz0l, /m/0838f] |
| -0FHUc78Gqo | 30.0 | 40.0 | [/m/02w4v, /m/04rlf] |


In [56]:
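import youtube_dl
import os
import subprocess

def crop(start, end, filename):
    """Cut the [start, end] interval (in seconds) out of `filename` and save it as mono audio."""
    root, ext = os.path.splitext(filename)
    output = root + "_cropped" + ext  # assumed naming: ffmpeg cannot overwrite its own input in place
    # a list of arguments avoids shell quoting problems; str() allows numeric start/end values
    command = ["ffmpeg", "-y", "-i", filename, "-ss", str(start), "-to", str(end), "-ac", "1", output]
    subprocess.run(command, check=True)
    return output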

to_download = {i:[] for i in classes_of_interest}
    
for cls in classes_of_interest:
    for class_id in classes[cls]:
        for row in train_data.itertuples():
            # row[0] is the YouTube ID, row[3] the label list; only the first label is compared
            if row[3][0] == class_id:
                to_download[cls].append(row[0])
                

options = {
    'format': 'bestaudio/best',
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'wav'
    }],
    'extractaudio' : True,
    'ignoreerrors' : True,
    'audioformat' : "wav",    # requested audio format
    'noplaylist' : True,    # only download single clip, not playlist
}
'''                
for cls in to_download:
    options['outtmpl'] = os.path.join(cls, '%(id)s.%(ext)s')
    with youtube_dl.YoutubeDL(options) as ydl:
        ydl.download(to_download[cls])
'''

from os import walk

# collect the names of the files already present in each class directory
f = []
for cls in to_download:
    for (_, _, filenames) in walk(cls):
        f.extend(filenames)
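
The selection loop in this cell compares only the first label of every segment with each class ID. The sketch below shows, on a toy table, how the same selection could be written with a set lookup, and how it differs from matching on any label. The helper name `select_clips` and its `first_label_only` flag are hypothetical; the three rows reuse values from the output table shown earlier.

```python
import pandas as pd


def select_clips(segments, class_ids, first_label_only=True):
    """Return the YouTube IDs of the segments labelled with any ID in `class_ids`."""
    wanted = set(class_ids)
    if first_label_only:
        # same rule as the loop above: look at the first label only
        mask = segments["positive_labels"].apply(lambda labels: labels[0] in wanted)
    else:
        # alternative rule: accept a segment if any of its labels matches
        mask = segments["positive_labels"].apply(lambda labels: not wanted.isdisjoint(labels))
    return segments.index[mask.to_numpy()].tolist()


toy_segments = pd.DataFrame(
    {
        "positive_labels": [
            ["/m/09x0r", "/t/dd00088"],
            ["/m/012xff"],
            ["/m/03fwl", "/m/04rlf", "/m/09x0r"],
        ]
    },
    index=["--PJHxphWEs", "--ZhevVpy1s", "--aE2O5G5WE"],
)

first_only = select_clips(toy_segments, ["/m/09x0r"])
any_label = select_clips(toy_segments, ["/m/09x0r"], first_label_only=False)
print(first_only, any_label)
```

Which rule is appropriate depends on whether segments carrying a class only as a secondary label should count as examples of that class.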

