# Sound Event Detection with CNNs
> ''

- author: <a href=http://kozistr.tech/about>Hyeongchan Kim</a>
- categories: [python, data science]
- image:
- permalink: /audio/
- hide: true

## Introduction

Recent work has shown that convolutional neural networks (CNNs) are effective across many machine learning tasks, such as classification in various domains. A CNN captures informative features by fusing spatial and channel-wise information within the local receptive field of each layer. For audio signals, many related works feed a spectrogram to a CNN, whose kernels extract features along both the frequency and time axes, and these approaches show promising results.

Unlike sound classification, sound event detection (SED) aims to detect the events in recorded audio, so it is crucial to locate single or multiple events of varying length at the segment level. Although a CNN is a powerful feature extractor, it cannot capture long-range temporal dependencies in audio because its receptive field is limited. To address this, we use both a bidirectional gated recurrent unit (bi-GRU) and a self-attention mechanism to handle long-range temporal dependencies effectively.
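
To make the receptive-field limit concrete, here is a small worked calculation. The layer stack and the front-end settings (`hop_size`, `sample_rate`) are hypothetical values chosen for illustration, not the configuration used in this article.

```python
def receptive_field(layers):
    """Return the receptive field (in input frames) of a stack of 1-D layers.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    """
    field, jump = 1, 1  # jump: distance in input frames between adjacent outputs
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


# Hypothetical stack: four blocks, each with two 3-wide convolutions and a 2x pooling.
blocks = [(3, 1), (3, 1), (2, 2)] * 4
frames = receptive_field(blocks)

# Assumed front-end settings (not taken from this article).
hop_size, sample_rate = 320, 32000
seconds = frames * hop_size / sample_rate
print(f"{frames} frames, roughly {seconds:.2f} s of audio")  # 76 frames, roughly 0.76 s of audio
```

Under these assumptions each output frame sees well under a second of audio, so an event that lasts several seconds cannot be covered by the convolutions alone.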

In this article, we explain a CNN architecture for detecting sound events that adopts a bi-GRU and a self-attention module, walk through its implementation, and compare its performance with other methods.

## Data

* introducing the Cornell Birdcall dataset & task, or something general-purpose

## Pre-Processing

There are many ways to extract features from an audio signal, such as the *spectrogram* and *MFCC*, and a few great libraries, for example `librosa`, make them easy to use. However, extracting these features takes time, especially for a large number of long recordings. As a result, feature extraction can become a bottleneck when training a neural network because of the computation it requires.
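
As a rough sense of scale, the number of STFT frames grows linearly with the length of the recording. The sketch below counts frames using the usual framing arithmetic. The clip length, sample rate, hop size, and window size are assumed values for illustration only.

```python
def num_frames(num_samples, hop_size, window_size, center=True):
    """Number of STFT frames produced for a signal of `num_samples` samples."""
    if center:
        # The signal is padded by window_size // 2 on both sides,
        # so frames are centered on multiples of hop_size.
        return 1 + num_samples // hop_size
    return 1 + (num_samples - window_size) // hop_size


# Assumed settings: a 60-second clip at 32 kHz, 1024-sample window, 320-sample hop.
sample_rate, window_size, hop_size = 32000, 1024, 320
clip_samples = 60 * sample_rate

print(num_frames(clip_samples, hop_size, window_size))                 # 6001
print(num_frames(clip_samples, hop_size, window_size, center=False))   # 5997
```

Each of those frames needs its own FFT and mel projection, which is why thousands of long recordings add up quickly.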

To resolve this issue, we use [`torchlibrosa`](https://github.com/qiuqiangkong/torchlibrosa), a PyTorch implementation of `librosa` functions. One advantage of the library is that it can run on the **GPU**, which is typically faster than running on the **CPU**.

Here is a benchmark between running on GPU and CPU.

* To-Do : add a benchmark

* To-Do : add an example

In [None]:
import torch.nn as nn

from torchlibrosa.stft import LogmelFilterBank, Spectrogram


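class AudioProcessing(nn.Module):
    """Convert raw waveforms into log-mel spectrograms on the module's device."""

    # The default values below are illustrative assumptions, not settings from this article.
    def __init__(self, sample_rate=32000, window_size=1024, hop_size=320, mel_bins=64,
                 fmin=50, fmax=14000, window='hann', center=True, pad_mode='reflect',
                 ref=1.0, amin=1e-10, top_db=None):
        super().__init__()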

        self.spectrogram_extractor = Spectrogram(
            n_fft=window_size,
            hop_length=hop_size,
            win_length=window_size,
            window=window,
            center=center,
            pad_mode=pad_mode,
            freeze_parameters=True
        )

        self.logmel_extractor = LogmelFilterBank(
            sr=sample_rate,
            n_fft=window_size,
            n_mels=mel_bins,
            fmin=fmin,
            fmax=fmax,
            ref=ref,
            amin=amin,
            top_db=top_db,
            freeze_parameters=True
        )

    def forward(self, x):
        # x: (batch_size, data_length) raw waveform
        x = self.spectrogram_extractor(x)  # (batch_size, 1, time_steps, freq_bins)
        x = self.logmel_extractor(x)       # (batch_size, 1, time_steps, mel_bins)
        return x
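
One way to compare CPU and GPU feature extraction is a small timing helper like the sketch below. The helper name `time_fn` and its parameters are hypothetical. The key detail is the optional `sync` hook: CUDA kernels are launched asynchronously, so the clock should only be read after `torch.cuda.synchronize()` has returned.

```python
import time


def time_fn(fn, warmup=3, repeats=10, sync=None):
    """Return the mean wall-clock seconds per call of `fn`.

    `sync` is an optional zero-argument callable run before each clock read.
    For CUDA work, pass `torch.cuda.synchronize`.
    """
    for _ in range(warmup):  # warm-up calls are not timed
        fn()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / repeats


# Usage sketch (assumes PyTorch, torchlibrosa, and the AudioProcessing module above):
#
#   import torch
#   wave = torch.randn(16, 32000 * 10)   # 16 random 10-second clips at an assumed 32 kHz
#   cpu_model = AudioProcessing()
#   cpu_time = time_fn(lambda: cpu_model(wave))
#   if torch.cuda.is_available():
#       gpu_model, gpu_wave = AudioProcessing().cuda(), wave.cuda()
#       gpu_time = time_fn(lambda: gpu_model(gpu_wave), sync=torch.cuda.synchronize)
```

The measured times depend heavily on batch size, clip length, and hardware, so any comparison should be run on the machine that will do the training.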

## Architecture

* introducing related works on SED task like PANNs
* introducing our architecture


* explain why we chose / adopt this architecture
* explain why this architecture works
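
As a rough illustration of the idea described in the introduction, the sketch below stacks a small CNN, a bi-GRU, and a self-attention layer in PyTorch. It is a minimal sketch, not the architecture evaluated in this article: the class name `CnnGruAttention`, the layer sizes, the residual connection, and the max pooling over time used for clip-level scores are all assumptions.

```python
import torch
import torch.nn as nn


class CnnGruAttention(nn.Module):
    """CNN feature extractor -> bi-GRU -> self-attention -> frame-wise event probabilities."""

    def __init__(self, mel_bins=64, num_classes=10, channels=32, hidden=64, heads=4):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.AvgPool2d(kernel_size=(1, 2)),  # pool frequency only, keep time resolution
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.AvgPool2d(kernel_size=(1, 2)),
        )
        self.gru = nn.GRU(channels * (mel_bins // 4), hidden,
                          batch_first=True, bidirectional=True)
        self.attention = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):
        # x: (batch_size, 1, time_steps, mel_bins), the layout a log-mel front end produces
        x = self.cnn(x)                                 # (batch, channels, time, mel_bins // 4)
        x = x.permute(0, 2, 1, 3).flatten(start_dim=2)  # (batch, time, channels * mel_bins // 4)
        x, _ = self.gru(x)                              # (batch, time, 2 * hidden)
        attended, _ = self.attention(x, x, x)           # each frame attends to all frames
        x = x + attended                                # residual connection (an assumption)
        framewise = torch.sigmoid(self.classifier(x))   # (batch, time, num_classes)
        clipwise = framewise.max(dim=1).values          # (batch, num_classes)
        return framewise, clipwise
```

The convolutions only see a local window, while the bi-GRU passes information along the whole sequence in both directions and the attention layer lets any frame look directly at any other frame. The frame-wise output is what allows events of different lengths to be localized in time.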

### Training Recipe

The training recipe is also one of the most crucial parts of deep learning. Choices such as the optimizer, learning rate scheduler, and augmentations largely determine how well the architecture performs.

* augmentations
* hyper-parameters
* etc...
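
As one example of a recipe component, a learning rate schedule with linear warmup followed by cosine decay can be written as a plain function of the step number. This particular schedule and every value in it (`base_lr`, `warmup_steps`, `min_lr`) are assumptions for illustration. The article does not specify which scheduler was used.

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-3, warmup_steps=500, min_lr=1e-6):
    """Linear warmup to `base_lr`, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        # Ramp up linearly so the first updates are small.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Print the learning rate at a few points of a hypothetical 10,000-step run.
for step in (0, 499, 500, 5250, 10000):
    print(step, lr_at_step(step, total_steps=10000))
```

Because the function has no framework dependency, it can be plotted or unit-tested before it is wired into an optimizer.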

## Performance & Visualization

## Summary