# "Multi-Label Audio Tagging With Fastai2 Audio"
> "A Udacity ML Engineer Nanodegree Capstone Project: Using the in-development fastai2 audio library on the Kaggle Freesound 2019 Competition"
- toc: true
- branch: master
- badges: false
- comments: true
- categories: [audio, deep learning, SageMaker, GCP]
- image: images/some_folder/your_image.png
- hide: false
- search_exclude: true
- metadata_key1: metadata_value1
- metadata_key2: metadata_value2



Sections:

Remember to fully reference and give thanks to references

* Intro to Competition
* Intro to Fastai2 Audio
* Downloading the Data
* EDA
* Audio Data intro
* Chosen Spectrogram settings and Augmentations
* 2-stage Training Method - 2-Stage Kfold, Model details, learning rates

    - why good: 
    - used 1/5 training time of high scorers (approximately)
    - No time-consuming pre-processing 
    - multi-label accuracy within 98% in a few epochs on simple accuracy_multi metric
* Results  
* What's next 
    - self supervised learning?
    - improving on model

## Machine Listening

Machine Listening, a sub-field of machine learning, is a burgeoning area of research that uses signal processing and computational analysis to extract information from sound automatically. The field spans many research areas, as demonstrated by the latest Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge, a machine learning challenge dedicated to the research and development of new methods and algorithms. Its tasks include:

* Acoustic Scene Classification
* Sound Event Detection and Localization
* Sound Event Detection and Separation in Domestic Environments
* Urban Sound Tagging
* Automated Audio Captioning

As an acoustic engineer, I am extremely intrigued by this new field. Recent developments in machine learning algorithms have enabled significant progress in this area. The potential applications of the technology are wide and varied, so these tools could prove extremely useful for the acoustic practitioner, among many other uses.

The in-development user-contributed fast.ai2 Audio library {% fn 1 %} inspired me to undertake the development of a deep learning audio-tagging system for my Udacity Capstone project, as described in this blog post.



{{ '[fast.ai 2 audio](https://github.com/rbracco/fastai2_audio)'  | fndetail: 1 }}

## Freesound 2019 Kaggle Competition

**Problem**

The Freesound Audio Tagging 2019 {% fn 2 %} Kaggle Competition provides the basis for my research project. 

The challenge is to develop a system that automatically classifies multi-labelled audio files into 80 categories. Such a system could potentially be used for automatic audio/video file tagging and/or real-time sound event detection with noisy or untagged audio data. The problem has historically been investigated in a variety of ways:

* Conversion of audio to mel-spectrogram images fed into CNNs
* End-to-end deep learning
* Custom architectures involving auto-encoders
* Feature-representation transfer learning with custom architectures and Google's AudioSet

**Dataset**

The dataset used in the challenge is called FSDKaggle2019[^1] and was collected by members of Freesound (a Creative Commons licensed sound database from the Music Technology Group of Universitat Pompeu Fabra, Barcelona) and Google Research's Machine Perception Team[^2]. In addition to the tagging task itself, the classification of weakly labelled data from large-scale crowdsourced datasets provides a further problem for investigation[^2].

The dataset comprises the following elements:

- Freesound Dataset ([FSD](https://annotator.freesound.org/fsd/)): a dataset being collected at the [MTG-UPF](https://www.upf.edu/web/mtg) based on [Freesound](https://freesound.org/) content organized with the [AudioSet Ontology](https://research.google.com/audioset/ontology/index.html) and manually labelled by humans.
- The soundtracks of a pool of Flickr videos taken from the [Yahoo Flickr Creative Commons 100M dataset (YFCC)](http://code.flickr.net/2014/10/15/the-ins-and-outs-of-the-yahoo-flickr-100-million-creative-commons-dataset/) which are automatically labelled using metadata from the original Flickr clips. These items therefore have significantly more label noise than the Freesound Dataset items.

The data comprises 80 categories labelled according to Google's AudioSet Ontology[^3], with ground truth labels provided at the clip level. The clips range in duration from 0.3 to 30 s and are provided as uncompressed PCM 16-bit, 44.1 kHz mono audio files.
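The stated clip format can be checked directly from a file header using Python's standard `wave` module. The following is a minimal sketch rather than project code: the helper name `describe_wav` and the synthetic one-second tone standing in for a competition clip are assumptions made purely for illustration.

```python
import math
import os
import struct
import tempfile
import wave


def describe_wav(path):
    """Return basic format information read from a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        n_frames = wav.getnframes()
        sample_rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "bit_depth": wav.getsampwidth() * 8,  # bytes per sample -> bits
            "sample_rate": sample_rate,
            "duration_s": n_frames / sample_rate,
        }


# Hypothetical stand-in for a competition clip: 1 s of a 440 Hz tone,
# written as 16-bit, 44.1 kHz mono PCM.
sample_rate = 44100
samples = [
    int(32767 * 0.5 * math.sin(2 * math.pi * 440 * t / sample_rate))
    for t in range(sample_rate)
]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)           # mono
    wav.setsampwidth(2)           # 2 bytes = 16 bit
    wav.setframerate(sample_rate)
    wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))

info = describe_wav(demo_path)
print(info)
```

Running the same helper over every file in a folder would give the duration distribution of the clips, which is useful when choosing how to crop or pad clips to a fixed length.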


**Metric**

The problem is clearly quantifiable, as a number of metrics could be used to measure the accuracy of the model's predictions. However, the competition used label-weighted label-ranking average precision (lwlrap, pronounced 'Lol wrap') as its metric, because audio multi-labelling systems have achieved >95% accuracy with ease in recent years. This will be discussed in further detail below.
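To make the ranking metric concrete, here is a minimal pure-Python sketch of label-weighted label-ranking average precision. It is an illustration written for this post, not the competition's reference implementation. For every true label of every clip it takes the precision at the rank where that label appears, then averages over all (clip, true label) pairs, so clips with more labels carry more weight. Tied scores are ranked in class-index order, which is a simplifying assumption.

```python
def lwlrap(truth, scores):
    """Label-weighted label-ranking average precision.

    truth  -- list of rows of 0/1 flags, one flag per class
    scores -- list of rows of predicted scores, same shape as truth
    """
    precision_sum = 0.0
    n_true_labels = 0
    for y_row, s_row in zip(truth, scores):
        # Class indices sorted from highest to lowest predicted score.
        ranking = sorted(range(len(s_row)), key=lambda c: s_row[c], reverse=True)
        hits = 0
        for rank, cls in enumerate(ranking, start=1):
            if y_row[cls]:
                hits += 1
                # Precision at the rank where this true label is retrieved.
                precision_sum += hits / rank
                n_true_labels += 1
    return precision_sum / n_true_labels if n_true_labels else 0.0


# Two clips, three classes (toy values chosen for illustration).
truth = [[1, 0, 1], [0, 1, 0]]
scores = [[0.9, 0.8, 0.1], [0.2, 0.3, 0.5]]

score = lwlrap(truth, scores)
print(round(score, 4))  # 0.7222
```

In the toy example the three true labels are retrieved with precisions 1, 2/3 and 1/2, giving (1 + 2/3 + 1/2) / 3 = 13/18. A model that ranks every true label above every false one scores exactly 1.0.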





* Intro - from proposal
    * Problem - tagging data using noisy datasets and small curated set
* Multi-label - lwlrap (more on that later)
* Data sets - Curated and Noisy






{{ '[Kaggle Freesound Audio Tagging 2019 Competition](https://www.kaggle.com/c/freesound-audio-tagging-2019/overview)'  | fndetail: 2 }}

* Difference between audio and other data

## EDA - WIP

The first stage of the process was to understand the dataset more fully. Fortunately, as a Kaggle Competition dataset, it was well documented and cleanly organized.

The dataset was downloaded directly into the SageMaker/GCP instance storage for easy access, following the guidance in [this Kaggle discussion thread](https://www.kaggle.com/c/deepfake-detection-challenge/discussion/129521).

The files were then unzipped for the EDA. For further details, please see the [EDA notebook](https://github.com/mikful/udacity-mlend-capstone/blob/master/nbs_final/eda.ipynb).


**Pandas and Pandas Profiling**

In order to undertake the analysis of the data, Pandas and Pandas Profiling were used. 

[Pandas Profiling](https://github.com/pandas-profiling/pandas-profiling) is an extremely useful add-on package to Pandas, which creates HTML profile reports directly from Pandas DataFrames quickly and easily.
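As a rough illustration of the kind of analysis involved, the sketch below counts labels per clip and per-label frequency with plain Pandas. The rows are made up, and the column layout (an `fname` column plus a comma-separated `labels` column) is an assumption about the training CSV rather than something taken from the notebook.

```python
import pandas as pd

# Hypothetical rows; the column layout (`fname`, comma-separated `labels`)
# is an assumption about the competition's training CSV.
df = pd.DataFrame({
    "fname": ["a.wav", "b.wav", "c.wav", "d.wav"],
    "labels": ["Bark", "Bark,Applause", "Accordion", "Applause,Accordion,Bark"],
})

# Number of labels attached to each clip.
df["n_labels"] = df["labels"].str.split(",").str.len()

# Frequency of each individual label across all clips.
label_counts = df["labels"].str.split(",").explode().value_counts()

print(df["n_labels"].tolist())
print(label_counts.to_dict())
```

The same DataFrame could then be passed to Pandas Profiling (for example `ProfileReport(df).to_file("report.html")`) to produce the HTML report.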

Using these two packages the following was found:


**Curated Data**


**Noisy Data**

## Developing the Model - To Do

Although the course was taught using SageMaker, I trained the model on an AI Notebook instance on GCP, as I had free credits available and the GPU training time was likely to be significantly longer than my remaining SageMaker course credit would allow.

* Audio Data intro
* Chosen Spectrogram settings and Augmentations
* 2-stage Training Method - 2-Stage Kfold, Model details, learning rates

    - why good: 
    - used 1/5 training time of high scorers (approximately)
    - No time-consuming pre-processing 
    - multi-label accuracy within 98% in a few epochs on simple accuracy_multi metric
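The 98% figure on `accuracy_multi` is worth putting in context. A minimal sketch follows, assuming the metric applies a sigmoid to each output, thresholds it at 0.5, and reports the fraction of individual (clip, label) decisions that match the target. The function name `multi_label_accuracy` is hypothetical and this is not the library code. With 80 classes and only one true label, a model that predicts nothing at all already scores close to 99%, which is why a ranking metric is more informative here.

```python
import math


def multi_label_accuracy(logits, targets, thresh=0.5):
    """Fraction of individual (clip, label) decisions that match the targets."""
    correct = 0
    total = 0
    for logit_row, target_row in zip(logits, targets):
        for logit, target in zip(logit_row, target_row):
            prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
            correct += int((prob > thresh) == bool(target))
            total += 1
    return correct / total


n_classes = 80
# One clip with a single true label out of 80 classes.
targets = [[1] + [0] * (n_classes - 1)]
# A "model" that confidently predicts no label at all.
logits = [[-10.0] * n_classes]

acc = multi_label_accuracy(logits, targets)
print(acc)  # 0.9875
```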

## Results - To Do






## Evaluation of Model - To Do






## What's Next...


* Deploy a model using SageMaker / GCP for inference of user-uploaded audio.


## Thanks

Thanks to fastai and the fastai2-audio and fastpages communities, whose amazing efforts and incredible generosity have made this learning journey possible.