# How to classify a sound?
> ''

- author: <a href=http://kozistr.tech/about>Hyeongchan Kim</a>
- categories: [python, data science]
- image: images/yomex-owo-Bm7Vm8T4BQs-unsplash.jpg
- permalink: /audio/
- hide: true

## Introduction

Countless sounds surround us in daily life: the music many of us listen to, human voices, and even the calls of endangered species such as birds. In some cases it is crucial to classify the source of a sound, and sound classification is already widely used for various purposes. For example, many music recommendation systems include a genre classifier. Ornithologists also use classifiers to identify bird species by their sounds, because it is hard to detect birdcalls in the field or in noisy environments. Audio classifiers therefore play a significant role in our lives.

Recently, thanks to advances in technology, computing power has grown exponentially, so even heavy computation such as deep learning (DL) can run in milliseconds. This brings promising performance gains across domains and lets us detect categories more precisely than in the past. However, applying deep learning both quickly and accurately is still a challenge.

In this post, we will use birdcalls as an example and show how to process the data quickly and detect rare kinds of birds using deep learning.

~~Recent works have shown that convolutional neural networks (CNNs) are powerful in various types of machine learning tasks, such as classification, across domains. A CNN enables a network to capture informative features by fusing spatial and channel-wise information within local receptive fields at each layer. For audio signals, many related works feed a spectrogram into a CNN, whose kernels extract features along both the frequency and time axes, and these approaches show promising results.
Unlike sound classification, sound event detection (SED) is the task of detecting events in recorded audio, and it is crucial to find single or multiple events of varying length at the segment level. Although a CNN is a powerful way to extract feature maps, it can't capture long-term dependencies in audio because of the limited size of its receptive field. To resolve this problem, we use both a gated recurrent unit (GRU) and a self-attention mechanism to handle long-term dependencies effectively. In this article, we'll explain a CNN architecture for detecting sound events that adopts a bi-GRU and a self-attention module, along with its implementation, and compare its performance with other methods.~~

## Processing on GPU

In the previous post, `Tony` explained why and how we should convert audio data into a *Spectrogram* before feeding it to a deep learning model. Continuing from that post, the speed of data processing is one of the keys to using a deep learning model effectively. Despite the increase in computing power, audio processing is still expensive on a Central Processing Unit ([CPU](https://en.wikipedia.org/wiki/Central_processing_unit)). However, if we process the data on a better-suited computation resource such as a Graphics Processing Unit ([GPU](https://en.wikipedia.org/wiki/Graphics_processing_unit)), processing can become about ten to one hundred times faster! In this post, we will show how to compute a *Spectrogram* quickly with a library called [`torchlibrosa`](https://github.com/qiuqiangkong/torchlibrosa), which lets us compute a `Spectrogram` on the GPU.

`torchlibrosa` is a `Python` library that implements some audio processing functions in [`PyTorch`](https://pytorch.org/) so that they can use `GPU` resources. Because the `Spectrogram` algorithm is implemented in `PyTorch`, a deep learning framework, it can run on a GPU device. Here's an example of extracting *Spectrogram* features from the given raw audio using `torchlibrosa`.

### Build Spectrogram processor

In [1]:
import torch
import torch.nn as nn
from torchlibrosa.stft import Spectrogram

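class AudioProcessing(nn.Module):
    """Wraps torchlibrosa's Spectrogram so it can be moved to the GPU like any module."""

    def __init__(self, window_size: int = 1024, hop_size: int = 320):
        super().__init__()
        # n_fft is not set here, so the library default is used
        self.spectrogram_extractor = Spectrogram(
            hop_length=hop_size,
            win_length=window_size
        )

    def forward(self, x):
        # x: raw waveform of shape (batch_size, num_samples)
        return self.spectrogram_extractor(x)

# build the processor and move it to the GPU
processor = AudioProcessing().cuda()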

### Load audio data

We can easily load audio data with the `librosa` library, one of the most popular `Python` audio processing libraries.

In [2]:
import librosa

# get raw audio data
example, _ = librosa.load('example.wav', sr=32000, mono=True)
example.shape


(72000,)

### Process `Spectrogram` via `torchlibrosa`

In [3]:
spectrogram = processor(torch.Tensor(example).unsqueeze(0).cuda())
spectrogram.size()  # (batch_size, 1 (=mono), time_steps, freq_bins)

torch.Size([1, 1, 226, 1025])

That's it! As the example above shows, we can compute a `Spectrogram` quickly with `torchlibrosa`.
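The output shape above can be checked by hand. The sketch below is a minimal helper, not part of `torchlibrosa`; `stft_output_shape` is a hypothetical name. It assumes a centered STFT (the signal is padded by `n_fft // 2` on each side) and an FFT size of 2048, which is what the 1025 frequency bins in the output imply, since `AudioProcessing` does not set `n_fft` itself.

```python
def stft_output_shape(num_samples: int, n_fft: int, hop_length: int, center: bool = True):
    """Return (time_steps, freq_bins) of a one-sided STFT (hypothetical helper)."""
    # With centering, the signal is padded by n_fft // 2 samples on both sides.
    padded = num_samples + 2 * (n_fft // 2) if center else num_samples
    # One frame per hop, as long as a full n_fft window still fits.
    time_steps = 1 + (padded - n_fft) // hop_length
    # A real-valued signal only needs the non-negative frequencies.
    freq_bins = n_fft // 2 + 1
    return time_steps, freq_bins


# 72,000 samples (2.25 s at 32 kHz), assumed n_fft of 2048, hop size of 320
print(stft_output_shape(72000, n_fft=2048, hop_length=320))  # (226, 1025)
```

The same arithmetic shows that `n_fft=1024` would give 513 frequency bins for the same number of time steps.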

### Benchmark processing speed

The example above shows that we can process audio data on the GPU with the `torchlibrosa` library. You may also wonder how much faster the GPU is. Here is a benchmark of processing speed on CPU and GPU devices. We randomly selected about 1,000 audio clips from the publicly available `UrbanSound` dataset and compared how long processing takes on the `CPU` and on the `GPU`.

In [4]:
import os
import numpy as np
from glob import glob

root_path = 'D:\\DataSet\\UrbanSound\\audio\\fold1'
audio_paths = sorted(glob(os.path.join(root_path, '*.wav')))

raw_audios = [librosa.load(audio_path, sr=32000, mono=True)[0] for audio_path in audio_paths]

In [5]:
%%time
def get_spectrogram(x):
    stft = librosa.stft(x, n_fft=1024, hop_length=320)
    spectrogram = np.abs(stft)
    return spectrogram

for raw_audio in raw_audios:
    get_spectrogram(raw_audio)

Wall time: 3.09 s


In [6]:
%%time
for raw_audio in raw_audios:
    processor(torch.Tensor(raw_audio).unsqueeze(0).cuda())

Wall time: 2.6 s


* TODO: show how large the speed difference is
* TODO: switch the dataset to birdcalls
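
One caveat when timing GPU code like this: CUDA kernels are launched asynchronously, so a wall-clock measurement can stop before the GPU has finished its work. The sketch below is a small timing helper written with that in mind; `benchmark` and its parameters are hypothetical names, not part of `torchlibrosa` or `librosa`. It makes a warm-up call first, then calls an optional `sync` function before reading the clock.

```python
import time
from typing import Callable, Iterable, Optional


def benchmark(
    fn: Callable,
    inputs: Iterable,
    sync: Optional[Callable[[], None]] = None,
    warmup: int = 1,
) -> float:
    """Return the wall-clock seconds needed to apply `fn` to every input."""
    inputs = list(inputs)

    # Warm-up calls absorb one-time costs (e.g. CUDA context creation).
    for x in inputs[:warmup]:
        fn(x)
    if sync is not None:
        sync()

    start = time.perf_counter()
    for x in inputs:
        fn(x)
    if sync is not None:
        sync()  # wait for pending asynchronous work before stopping the clock
    return time.perf_counter() - start


# Usage with a CPU-only stand-in workload (no GPU needed to run this):
elapsed = benchmark(lambda x: sum(v * v for v in x), [[0.1] * 1000] * 100)
print(f"{elapsed:.4f} s")
```

With the objects defined earlier in this post, the GPU run could be timed as `benchmark(lambda x: processor(torch.Tensor(x).unsqueeze(0).cuda()), raw_audios, sync=torch.cuda.synchronize)`. Note that this still includes the host-to-device copy of each clip.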

## Deep Learning Architecture

As mentioned above, deep learning also performs very well in the audio domain. It can capture various patterns of the target classes in time-series data. For birdcalls, however, the environment and the data matter even more. In environments such as open fields or the middle of the mountains, a lot of noise interferes with the birdcalls. A long recording can also contain many different birds. Considering these cases, we need to build a noise-robust, multi-label classifier.
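
To make "multi-label" concrete: each recording gets one independent yes/no decision per species instead of a single class. The sketch below illustrates the idea in plain Python and is not the architecture introduced in this post. The species names, the logits, and the 0.5 threshold are made-up assumptions for illustration.

```python
import math

# Hypothetical label set, for illustration only
SPECIES = ["species_a", "species_b", "species_c"]


def multi_hot(present, classes):
    """Encode the set of species heard in a clip as a 0/1 target vector."""
    return [1.0 if c in present else 0.0 for c in classes]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over classes, in a numerically stable form."""
    total = 0.0
    for z, t in zip(logits, targets):
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def predict(logits, classes, threshold=0.5):
    """Each class is decided independently, so several can be active at once."""
    return [c for c, z in zip(classes, logits) if sigmoid(z) >= threshold]


targets = multi_hot({"species_a", "species_c"}, SPECIES)  # two birds in one clip
logits = [2.0, -3.0, 0.5]                                  # made-up model outputs
print(targets)                   # [1.0, 0.0, 1.0]
print(predict(logits, SPECIES))  # ['species_a', 'species_c']
print(bce_with_logits(logits, targets))
```

Unlike a softmax over classes, the per-class sigmoid does not force the scores to compete, which matches recordings where several birds call at the same time.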

We are going to introduce a deep learning architecture designed for the Cornell Birdcall Identification Kaggle challenge.

### Architecture

## Summary