# Urban Sound Classification, Part 1

Source:
- http://aqibsaeed.github.io/2016-09-03-urban-sound-classification-part-1/
- https://github.com/aqibsaeed/Urban-Sound-Classification

- [Logistic Regression in Tensorflow with SMOTE](http://aqibsaeed.github.io/2016-08-10-logistic-regression-tf/)

In this blog post, we will learn techniques for classifying urban sounds into categories using machine learning. Earlier blog posts covered classification problems where the data can be easily expressed in vector form. For example, in a text dataset, each word in the corpus becomes a feature and its tf-idf score becomes the feature's value. Likewise, in the anomaly detection dataset we saw two features, “throughput” and “latency”, that were fed into a classifier. With sound, however, feature extraction is not as straightforward. Today, we will first see which features can be extracted from sound data and how easy it is to extract them in Python using an open source library called Librosa.

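To get started with this tutorial, please make sure you have the following tools installed:

- TensorFlow (the code below uses the 1.x API, e.g. `tf.placeholder` and `tf.Session`)
- Librosa
- Numpy
- Matplotlib
- scikit-learn (used only for the final F-score calculation)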

## Dataset

We need a labelled dataset that we can feed into a machine learning algorithm. Fortunately, researchers have published an urban sound dataset, UrbanSound8K. It contains 8,732 labelled sound clips (up to 4 seconds each) from ten classes: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gunshot, jackhammer, siren, and street music. The dataset comes pre-divided into 10 folds. To get the dataset please visit the following link, and if you use this dataset in your research, kindly don't forget to acknowledge it. In this dataset, the sound files are in .wav format; if you have files in another format such as .mp3, it is a good idea to convert them to .wav, because .mp3 is a lossy compression format (check this link for more information). To keep things simple, we will use sound files from only the first three folds, namely fold1, fold2 and fold3.

Let's read some sound files and visualise them to understand how different each sound clip is from the others. Matplotlib's `specgram` method performs all the required calculation and plotting of the spectrogram. Likewise, Librosa provides handy methods for plotting the waveform and the log power spectrogram. Looking at the plots shown in Figures 1, 2 and 3, we can see clear differences between sound clips of different classes.
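
To make concrete what a log power spectrogram contains, here is a minimal NumPy-only sketch. It is not how Matplotlib or Librosa implement their plots; the function name `log_power_spectrogram`, the frame size, the hop length and the synthetic 440 Hz tone are all assumptions chosen for illustration.

```python
import numpy as np

def log_power_spectrogram(signal, n_fft=1024, hop=512):
    """Return a (frequency bins, frames) array of power in decibels."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # A small constant avoids log(0) in silent bins.
    return 10.0 * np.log10(power + 1e-10).T

sr = 22050                              # hypothetical sample rate in Hz
t = np.arange(sr) / sr                  # one second of samples
tone = np.sin(2 * np.pi * 440 * t)      # synthetic 440 Hz tone

spec = log_power_spectrogram(tone)
freqs = np.fft.rfftfreq(1024, d=1 / sr)
peak_hz = freqs[spec.mean(axis=1).argmax()]
print(spec.shape, peak_hz)              # strongest bin sits near 440 Hz
```

Each column of `spec` describes the frequency content of one short time slice, which is what the spectrogram plots display as an image.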

In [1]:
import glob
import os
import librosa
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from matplotlib.pyplot import specgram
from sklearn.metrics import precision_recall_fscore_support

%matplotlib inline

## Feature Extraction

To extract useful features from sound data, we will use the Librosa library. It provides several methods for extracting different features from sound clips. We are going to use the methods listed below to extract various features:

- melspectrogram: Compute a Mel-scaled power spectrogram
- mfcc: Mel-frequency cepstral coefficients
- chroma_stft: Compute a chromagram from a waveform or power spectrogram
- spectral_contrast: Compute spectral contrast, using the method defined in [Music type classification by spectral contrast feature](http://ieeexplore.ieee.org/document/1035731/)
- tonnetz: Compute the tonal centroid features (tonnetz), following the method of [Detecting harmonic change in musical audio](https://dl.acm.org/citation.cfm?id=1178727)
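
The helper code further down allocates a feature matrix with 193 columns. As a rough sanity check, the sketch below adds up how many values each extractor contributes per clip, assuming Librosa's default settings everywhere except `n_mfcc=40`, which the helper sets explicitly. The dictionary name `FEATURE_SIZES` is made up for this illustration.

```python
# Values contributed per clip by each extractor, assuming Librosa defaults
# (n_mfcc=40 is the one non-default setting used in this tutorial).
FEATURE_SIZES = {
    "mfcc": 40,              # n_mfcc=40
    "chroma_stft": 12,       # 12 pitch classes
    "melspectrogram": 128,   # n_mels=128 by default
    "spectral_contrast": 7,  # n_bands=6 by default, plus one
    "tonnetz": 6,            # 6 tonal centroid dimensions
}

total_features = sum(FEATURE_SIZES.values())
print(total_features)  # 193
```

If you change any of these parameters, the hard-coded 193 in the helper has to change with them.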

To make the process of feature extraction from sound clips easy, two helper methods are defined.

The first, `parse_audio_files`, takes the parent directory name, the subdirectories within the parent directory and a file extension (default `*.wav`) as input. It iterates over all the files within the subdirectories and calls the second helper function, `extract_feature`, which takes a file path as input, reads the file by calling the `librosa.load` method, then extracts and returns the features discussed above.

These two methods are all that is required to convert raw sound clips into informative features (along with a class label for each sound clip) that we can feed directly into our classifier.
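
One detail worth spelling out: each Librosa extractor returns a matrix with one column per time frame, so clips of different lengths produce matrices of different widths. The helper averages over the time axis to get a fixed-length vector. The sketch below illustrates this with random stand-in matrices; the frame counts and the name `summarise` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for MFCC matrices of shape (n_mfcc, n_frames); the frame counts
# are illustrative values for a short clip and a longer clip.
short_clip = rng.normal(size=(40, 44))
long_clip = rng.normal(size=(40, 173))

def summarise(feature_matrix):
    # Same pattern as the helper: transpose, then average over frames.
    return np.mean(feature_matrix.T, axis=0)

short_vec = summarise(short_clip)
long_vec = summarise(long_clip)
print(short_vec.shape, long_vec.shape)  # (40,) (40,)
```

Averaging discards how the sound evolves over time, which keeps the classifier input simple at the cost of some information.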

Remember, the class label of each sound clip is in the file name. For example, if the file name is 108041-9-0-4.wav, then the class label is 9. Splitting the file name on `-` and taking the second item of the resulting array gives us the class label.
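
The label extraction can be tried on its own. This is a small sketch; the function name `class_label_from_path` and the example path are hypothetical. It uses `os.path.basename` so that the result does not depend on how deep the directory tree is.

```python
import os

def class_label_from_path(path):
    """Return the integer class ID encoded in the file name."""
    file_name = os.path.basename(path)      # e.g. "108041-9-0-4.wav"
    return int(file_name.split("-")[1])     # second dash-separated field

label = class_label_from_path("Sound-Data/fold1/108041-9-0-4.wav")
print(label)  # 9
```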

In [3]:
def extract_feature(file_name):
    # Load the clip (Librosa resamples to 22,050 Hz mono by default).
    X, sample_rate = librosa.load(file_name)
    stft = np.abs(librosa.stft(X))
    # Average each feature over time so every clip yields a fixed-length vector.
    mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
    mel = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T, axis=0)
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sample_rate).T, axis=0)
    tonnetz = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(X),
                                              sr=sample_rate).T, axis=0)
    return mfccs, chroma, mel, contrast, tonnetz

def parse_audio_files(parent_dir, sub_dirs, file_ext="*.wav"):
    # 193 = 40 MFCC + 12 chroma + 128 mel + 7 contrast + 6 tonnetz values.
    features, labels = np.empty((0, 193)), np.empty(0)
    for sub_dir in sub_dirs:
        for fn in glob.glob(os.path.join(parent_dir, sub_dir, file_ext)):
            try:
                mfccs, chroma, mel, contrast, tonnetz = extract_feature(fn)
            except Exception as e:
                print("Error encountered while parsing file: ", fn, e)
                continue
            ext_features = np.hstack([mfccs, chroma, mel, contrast, tonnetz])
            features = np.vstack([features, ext_features])
            # The class ID is the second dash-separated field of the file name.
            labels = np.append(labels, int(os.path.basename(fn).split('-')[1]))
    return np.array(features), np.array(labels, dtype=int)

def one_hot_encode(labels):
    n_labels = len(labels)
    n_unique_labels = len(np.unique(labels))
    encoded = np.zeros((n_labels, n_unique_labels))
    encoded[np.arange(n_labels), labels] = 1
    return encoded

In [4]:
parent_dir = 'Sound-Data'
tr_sub_dirs = ["fold1","fold2"]
ts_sub_dirs = ["fold3"]
tr_features, tr_labels = parse_audio_files(parent_dir,tr_sub_dirs)
ts_features, ts_labels = parse_audio_files(parent_dir,ts_sub_dirs)

tr_labels = one_hot_encode(tr_labels)
ts_labels = one_hot_encode(ts_labels)
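
To see what one-hot encoding produces, here is a small standalone sketch. It is a hypothetical variant named `one_hot` that takes the number of classes explicitly, an assumption made here so that the output always has 10 columns even if some class happens to be missing from a fold.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Turn integer class IDs into rows with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((len(labels), n_classes))
    encoded[np.arange(len(labels)), labels] = 1
    return encoded

example = one_hot([9, 0, 3], n_classes=10)
print(example.shape)  # (3, 10)
print(example[0])     # only the last position (class 9) is set
```

The width of this matrix has to match the `(None, n_classes)` shape of the label placeholder defined in the next section.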

## Classification using Multilayer Neural Network

> Note: If you want to use scikit-learn or any other library for training the classifier, feel free to do so. The goal of this tutorial is to provide an implementation of a neural network in TensorFlow for classification tasks.

Now that we have our training and testing sets ready, let's implement a neural network with two hidden layers in TensorFlow to classify each sound clip into one of the categories.

The code below defines the configuration parameters required by the neural network model, such as the number of training epochs, the number of neurons in each hidden layer and the learning rate.

In [5]:
training_epochs = 50
n_dim = tr_features.shape[1]
n_classes = 10
n_hidden_units_one = 280 
n_hidden_units_two = 300
sd = 1 / np.sqrt(n_dim)
learning_rate = 0.01

Now define placeholders for the features and class labels, which TensorFlow will fill with data at runtime. Furthermore, define weights and biases for the hidden and output layers of the network. For non-linearity, we use tanh in the first hidden layer and the sigmoid function in the second hidden layer. The output layer uses softmax, as we are dealing with a multiclass classification problem.

In [6]:
X = tf.placeholder(tf.float32,[None,n_dim])
Y = tf.placeholder(tf.float32,[None,n_classes])

W_1 = tf.Variable(tf.random_normal([n_dim,n_hidden_units_one], mean = 0, stddev=sd))
b_1 = tf.Variable(tf.random_normal([n_hidden_units_one], mean = 0, stddev=sd))
h_1 = tf.nn.tanh(tf.matmul(X,W_1) + b_1)

W_2 = tf.Variable(tf.random_normal([n_hidden_units_one,n_hidden_units_two],
                                   mean = 0, stddev=sd))
b_2 = tf.Variable(tf.random_normal([n_hidden_units_two], mean = 0, stddev=sd))
h_2 = tf.nn.sigmoid(tf.matmul(h_1,W_2) + b_2)

W = tf.Variable(tf.random_normal([n_hidden_units_two,n_classes], mean = 0, stddev=sd))
b = tf.Variable(tf.random_normal([n_classes], mean = 0, stddev=sd))
y_ = tf.nn.softmax(tf.matmul(h_2,W) + b)

init = tf.global_variables_initializer()
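
For readers who want to see what this graph computes without running TensorFlow, the forward pass can be sketched in plain NumPy. This is only an illustration with randomly initialised weights; the helper names `softmax`, `sigmoid` and `forward` and the random input batch are assumptions, while the layer sizes mirror the configuration above.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, p):
    h1 = np.tanh(x @ p["W_1"] + p["b_1"])      # first hidden layer
    h2 = sigmoid(h1 @ p["W_2"] + p["b_2"])     # second hidden layer
    return softmax(h2 @ p["W"] + p["b"])       # class probabilities

rng = np.random.default_rng(0)
n_dim, n_h1, n_h2, n_classes = 193, 280, 300, 10
sd = 1 / np.sqrt(n_dim)
params = {
    "W_1": rng.normal(0, sd, (n_dim, n_h1)),     "b_1": rng.normal(0, sd, n_h1),
    "W_2": rng.normal(0, sd, (n_h1, n_h2)),      "b_2": rng.normal(0, sd, n_h2),
    "W":   rng.normal(0, sd, (n_h2, n_classes)), "b":   rng.normal(0, sd, n_classes),
}

batch = rng.normal(size=(4, n_dim))    # four made-up feature vectors
probs = forward(batch, params)
print(probs.shape)                     # (4, 10)
```

Each row of `probs` sums to 1, so it can be read as a probability distribution over the ten classes.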


The cross-entropy cost function will be minimised using a gradient descent optimizer. The code below initialises the cost function and the optimizer, and also defines the operations used to calculate the accuracy of the model's predictions.

In [7]:
cost_function = -tf.reduce_sum(Y * tf.log(y_))
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost_function)

correct_prediction = tf.equal(tf.argmax(y_,1), tf.argmax(Y,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
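
The same two quantities can be checked by hand on a tiny example. The sketch below is a NumPy stand-in rather than the TensorFlow graph itself; the function names and the toy arrays are made up, and the `eps` clipping is an added safeguard against `log(0)` that the expression above does not include.

```python
import numpy as np

def cross_entropy(y_true, y_prob, eps=1e-12):
    # Summed over the batch, mirroring -reduce_sum(Y * log(y_)).
    return -np.sum(y_true * np.log(np.clip(y_prob, eps, 1.0)))

def accuracy(y_true, y_prob):
    # Fraction of rows where the most probable class is the true class.
    return np.mean(np.argmax(y_prob, axis=1) == np.argmax(y_true, axis=1))

y_true = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
y_prob = np.array([[0.5, 0.25, 0.25], [0.6, 0.3, 0.1]])

loss = cross_entropy(y_true, y_prob)   # equals -(ln 0.5 + ln 0.3)
acc = accuracy(y_true, y_prob)         # first row correct, second wrong
print(loss, acc)
```

Because the cost is a sum rather than a mean, its magnitude grows with the number of training clips, which is worth keeping in mind when choosing the learning rate.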

We have all the required pieces in place. Now let's train the neural network model, visualise whether the cost decreases with each epoch and make predictions on the test set, using the following code:

In [9]:
cost_history = np.empty(shape=[0],dtype=float)
y_true, y_pred = None, None
with tf.Session() as sess:
    sess.run(init)
    # Full-batch gradient descent: each epoch feeds the whole training set once.
    for epoch in range(training_epochs):
        _,cost = sess.run([optimizer,cost_function],feed_dict={X:tr_features,Y:tr_labels})
        cost_history = np.append(cost_history,cost)

    y_pred = sess.run(tf.argmax(y_,1),feed_dict={X: ts_features})
    y_true = sess.run(tf.argmax(ts_labels,1))
    print("Test accuracy: ",round(sess.run(accuracy,
        feed_dict={X: ts_features,Y: ts_labels}),3))

fig = plt.figure(figsize=(10,8))
plt.plot(cost_history)
plt.axis([0,training_epochs,0,np.max(cost_history)])
plt.show()

p,r,f,s = precision_recall_fscore_support(y_true, y_pred, average="micro")
print("F-Score:", round(f,3))
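
A note on the metric: with `average="micro"`, true positives, false positives and false negatives are pooled across all classes before precision and recall are computed. For a single-label multiclass problem like this one, the micro-averaged F-score therefore equals plain accuracy. The pure-Python sketch below illustrates this on made-up labels; `micro_f1` is a hypothetical helper, not scikit-learn's implementation, and it assumes at least one correct prediction.

```python
from collections import Counter

def micro_f1(y_true, y_pred):
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted class gains a false positive
            fn[t] += 1   # true class gains a false negative
    tp_sum, fp_sum, fn_sum = sum(tp.values()), sum(fp.values()), sum(fn.values())
    precision = tp_sum / (tp_sum + fp_sum)
    recall = tp_sum / (tp_sum + fn_sum)
    return 2 * precision * recall / (precision + recall)

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]

score = micro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(score, accuracy)  # both are 4/6
```

If you want a score that weights rare classes such as gunshot equally with common ones, `average="macro"` is the usual alternative.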

In this tutorial, we saw how to extract features from a sound dataset and train a neural network model with two hidden layers in TensorFlow to categorise sounds. I would encourage you to check the Librosa documentation and experiment with different neural network configurations, e.g. by changing the number of neurons or the number of hidden layers, or by introducing dropout.