<a href="https://colab.research.google.com/github/minguezalba/MusiCNN-embeddings/blob/main/dataset_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dataset Generation
---
Author: Alba Mínguez Sánchez

March 2021

---

In this notebook, we build a song-embedding dataset from the GTZAN dataset for use in two downstream tasks: evaluating song similarity and genre classification.

The steps are the following:

1.   Download the GTZAN dataset with the help of the [mirdata](https://mirdata.readthedocs.io/en/latest/) library.
2.   Download [Essentia’s MSD-MusiCNN TensorFlow model](https://essentia.upf.edu/models/classifiers/genre_tzanetakis/) for the specific task of genre classification.
3.   Embeddings extraction: Extract the temporal embeddings of each song with the MusiCNN model and average them over time to get one mean embedding per song.
4.   Labels processing: Encode genre labels to numeric values.
5.   Save data to npy files.





**Packages and dependencies**

In [None]:
!pip install mirdata
!pip install essentia-tensorflow

In [None]:
import essentia.standard as es
import mirdata
import numpy as np

import json

from collections import Counter
from sklearn import preprocessing

# 1. Download GTZAN dataset

Download the GTZAN dataset using the mirdata library and validate the download.

In [None]:
DATASET_NAME = 'gtzan_genre'

dataset = mirdata.initialize(DATASET_NAME, data_home='/content/mydata/')
dataset.download()  # download the dataset
dataset.validate()  # validate that all the expected files are there

INFO: Downloading ['all'] to /content/mydata/
INFO: [all] downloading genres.tar.gz
1.14GB [12:11, 1.68MB/s]                            
100%|██████████| 1000/1000 [00:03<00:00, 310.51it/s]
INFO: Success: the dataset is complete and all files are valid.
INFO: --------------------


({'tracks': {}}, {'tracks': {}})
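
The tuple printed last is the return value of `dataset.validate()`. To our understanding, mirdata reports the missing files first and the files with invalid checksums second, so two empty `tracks` entries mean the download is complete. A minimal sketch of an explicit check, where `is_complete` is a hypothetical helper (not part of mirdata) and the tuple is copied from the output above:

```python
def is_complete(validation_result):
    """Return True when no file is reported as missing or as having an invalid checksum."""
    # Assumption: the result is (missing_files, invalid_checksums),
    # each a dict that maps a file group (e.g. 'tracks') to the affected files.
    missing_files, invalid_checksums = validation_result
    return not any(missing_files.values()) and not any(invalid_checksums.values())


# Value copied from the notebook output; with a live dataset it would be dataset.validate()
result = ({'tracks': {}}, {'tracks': {}})
print(is_complete(result))
```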

In [None]:
%cd mydata

/content/mydata


Load all tracks from the dataset

In [None]:
tracks = dataset.load_tracks()  # Return dict like {track_id: Track() object}

Check the data attributes and the genre distribution

In [None]:
example_track = dataset.choice_track()  # choose a random example track
print(example_track)  # see the available data

Track(
  audio_path="/content/mydata/gtzan_genre/genres/hiphop/hiphop.00049.wav",
  genre="hip-hop",
  track_id="hiphop.00049",
  audio: The track's audio

        Returns,
)


In [None]:
Counter([x.genre for x in tracks.values()])

Counter({'blues': 100,
         'classical': 100,
         'country': 100,
         'disco': 100,
         'hip-hop': 100,
         'jazz': 100,
         'metal': 100,
         'pop': 100,
         'reggae': 100,
         'rock': 100})

We can observe that the data is perfectly balanced across the 10 different classes/genres, with 100 tracks each.
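
The balance can also be checked programmatically instead of by reading the counter. The sketch below uses a toy genre list; `is_balanced` is a hypothetical helper, and with the real data the argument would be `[x.genre for x in tracks.values()]`.

```python
from collections import Counter


def is_balanced(genres):
    """Return True when every genre appears the same number of times."""
    counts = Counter(genres)
    # A balanced dataset has exactly one distinct count value
    return len(set(counts.values())) == 1


toy_genres = ['blues', 'rock', 'blues', 'rock']  # toy data, not the GTZAN labels
print(is_balanced(toy_genres))
```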

# 2. Download Essentia’s MSD-MusiCNN TensorFlow model

Download the two files that describe the model: its metadata (`.json`) and its frozen graph (`.pb`)

In [None]:
!curl -SLO https://essentia.upf.edu/models/classifiers/genre_tzanetakis/genre_tzanetakis-musicnn-msd-1.json
!curl -SLO https://essentia.upf.edu/models/classifiers/genre_tzanetakis/genre_tzanetakis-musicnn-msd-1.pb

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2092  100  2092    0     0   2716      0 --:--:-- --:--:-- --:--:--  2713
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3166k  100 3166k    0     0  1121k      0  0:00:02  0:00:02 --:--:-- 1120k


Explore model metadata, inputs and outputs

In [None]:
MODEL_NAME = 'genre_tzanetakis-musicnn-msd-1'
MODEL_JSON = f'{MODEL_NAME}.json'
MODEL_PB = f'{MODEL_NAME}.pb'

with open(MODEL_JSON, 'r') as f:  # the file is closed once the metadata is parsed
    musicnn_metadata = json.load(f)
for k, v in musicnn_metadata.items():
    print('{}: {}'.format(k, v))

name: genre GTZAN
type: multi-class classifier
link: https://essentia.upf.edu/models/classifiers/genre_tzanetakis/genre_tzanetakis-musicnn-msd-1.pb
version: 1
description: classification of music by genre
author: Pablo Alonso
email: pablo.alonso@upf.edu
release_date: 2020-03-31
framework: tensorflow
framework_version: 1.15.0
classes: ['blu', 'cla', 'cou', 'dis', 'hip', 'jaz', 'met', 'pop', 'reg', 'roc']
model_types: ['frozen_model']
dataset: {'name': 'the GTZAN Genre Collection', 'citation': '@article{tzanetakis2002musical,\n  title={Musical genre classification of audio signals},\n  author={Tzanetakis, George and Cook, Perry},\n  journal={IEEE Transactions on speech and audio processing},\n  volume={10},\n  number={5},\n  pages={293--302},\n  year={2002},\n  publisher={IEEE}\n}', 'size': '1000 track excerpts, 100 per genre', 'metrics': {'5-fold_cross_validation_normalized_accuracy': 0.83}}
schema: {'inputs': [{'name': 'model/Placeholder', 'type': 'float', 'shape': [187, 96]}], 'output

We can observe that the schema proposes the output of the penultimate dense layer as embeddings.
We will use this layer (`model/dense/BiasAdd`, 200 dimensions) to extract song embeddings from our dataset.
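
The embedding size can be looked up from the schema rather than hard-coded. The sketch below assumes the `outputs` entries follow the same layout as the `inputs` entry printed above (`name`, `type`, `shape`). The `schema` dictionary is a hand-written stand-in, because the printed schema is truncated, and `output_dim` is a hypothetical helper.

```python
# Stand-in schema: the 'inputs' entry is copied from the printed metadata;
# the 'outputs' entries are assumptions made for illustration only.
schema = {
    'inputs': [{'name': 'model/Placeholder', 'type': 'float', 'shape': [187, 96]}],
    'outputs': [
        {'name': 'model/Sigmoid', 'type': 'float', 'shape': [1, 10]},
        {'name': 'model/dense/BiasAdd', 'type': 'float', 'shape': [1, 200]},
    ],
}


def output_dim(schema, output_name):
    """Return the last dimension of the named output, or raise KeyError if it is absent."""
    for out in schema['outputs']:
        if out['name'] == output_name:
            return out['shape'][-1]
    raise KeyError(output_name)


EMBEDDING_LAYER = 'model/dense/BiasAdd'  # layer name used later in this notebook
print(output_dim(schema, EMBEDDING_LAYER))
```

With the real metadata, the first argument would be `musicnn_metadata['schema']`, provided it has the assumed layout.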

# 3. Embeddings extraction

We fix the sample rate at 16 kHz, as required by the input of the MusiCNN model.

In [None]:
MUSICNN_SR = 16000
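
As a rough illustration of what the sample rate implies, the model's input shape `[187, 96]` (187 frames of 96 mel bands) can be converted to a duration. The hop size of 256 samples is an assumption based on the usual MusiCNN front end; it is not stated in this notebook.

```python
MUSICNN_SR = 16000   # Hz, as fixed in this notebook
PATCH_FRAMES = 187   # frames per patch, from the model's input shape [187, 96]
HOP_SIZE = 256       # samples between consecutive frames (assumption, see above)

# Duration of audio covered by one input patch
patch_seconds = PATCH_FRAMES * HOP_SIZE / MUSICNN_SR
print(f"{patch_seconds:.3f} s per input patch")
```

Under that assumption each patch covers about 3 seconds of audio, so a song yields a sequence of embeddings rather than a single one.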

In [None]:
def extract_mean_embedding(filename):
  """
  Extract the time-averaged (mean) embedding of the audio stored in filename

  Args:
    filename (str): Path to the audio file

  Returns:
    np.ndarray: Mean embedding of the song, with shape (1, 200)
  """

  # Load the audio with Essentia's MonoLoader, which resamples it to the sample rate MusiCNN requires
  audio = es.MonoLoader(filename=filename, sampleRate=MUSICNN_SR)()

  # Extract the temporal embeddings from the penultimate dense layer
  musicnn_emb = es.TensorflowPredictMusiCNN(graphFilename=MODEL_PB, output='model/dense/BiasAdd')(audio)

  # Compute the mean embedding across the time axis
  mean_emb = np.mean(musicnn_emb, axis=0)
  mean_emb = mean_emb[np.newaxis, :]  # Each song is a 1x200 row vector

  return mean_emb

In [None]:
# This step may take several minutes
embeddings = np.zeros((len(tracks), 200))  # N songs x 200 embedding-dim

for i, track in enumerate(tracks.values()):
  embeddings[i, :] = extract_mean_embedding(track.audio_path)

In [None]:
embeddings.shape

(1000, 200)
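
The averaging step inside `extract_mean_embedding` can be illustrated without audio or the model. In this sketch, a random array stands in for the model output; the count of 19 temporal embeddings is an arbitrary assumption, while the 200 dimensions match the layer used above.

```python
import numpy as np

# Stand-in for the model output: 19 temporal embeddings of 200 dimensions each
rng = np.random.default_rng(0)
temporal_embeddings = rng.normal(size=(19, 200))

# Average over the time axis, then add a leading axis to get a 1x200 row vector
mean_emb = temporal_embeddings.mean(axis=0)
mean_emb = mean_emb[np.newaxis, :]

print(temporal_embeddings.shape, '->', mean_emb.shape)
```

Averaging discards the temporal order but gives every song a fixed-size vector, which is what allows all songs to be stacked into one `(N, 200)` matrix.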

# 4. Labels processing

Build a label encoder to transform the categorical genre labels into integers in the range 0 to 9.

In [None]:
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(list({x.genre for x in tracks.values()}))
label_encoder.classes_

array(['blues', 'classical', 'country', 'disco', 'hip-hop', 'jazz',
       'metal', 'pop', 'reggae', 'rock'], dtype='<U9')

In [None]:
labels = []
labels_decoded = []
track_ids = []

for i, (track_id, track) in enumerate(tracks.items()):
  print(f"{i+1}/{len(tracks)}", end="\r")
  labels.append(int(label_encoder.transform([track.genre])[0]))
  labels_decoded.append(track.genre)
  track_ids.append(track_id)
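
The loop above encodes one label per iteration. As a sketch of an alternative, scikit-learn's `LabelEncoder` can also encode a whole list in one call and map the integers back to genre names. The genre list here is toy data, and `toy_encoder` is a separate encoder so the one fitted above is left untouched.

```python
from sklearn import preprocessing

toy_genres = ['rock', 'blues', 'jazz', 'blues']  # toy data, not the GTZAN labels

toy_encoder = preprocessing.LabelEncoder()
# Classes are sorted alphabetically, so blues -> 0, jazz -> 1, rock -> 2
toy_labels = toy_encoder.fit_transform(toy_genres)

print(toy_labels.tolist())
print(toy_encoder.inverse_transform(toy_labels).tolist())
```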



# 5. Save data

Save the new data to `.npy` files.

In [None]:
%mkdir emb_dataset

In [None]:
with open('emb_dataset/embeddings.npy', 'wb') as f:
    np.save(f, embeddings)
with open('emb_dataset/labels.npy', 'wb') as f:
    np.save(f, labels)
with open('emb_dataset/labels_decoded.npy', 'wb') as f:
    np.save(f, labels_decoded)
with open('emb_dataset/track_ids.npy', 'wb') as f:
    np.save(f, track_ids)
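
To confirm that the saved files can be read back, a round trip with `np.load` is a quick check. The sketch below writes toy arrays to a temporary directory instead of `emb_dataset`, so the file names and values are illustrative only. Note that a list of strings is stored as a NumPy unicode array.

```python
import os
import tempfile

import numpy as np

toy_embeddings = np.zeros((3, 200))                            # toy stand-in for `embeddings`
toy_track_ids = ['blues.00000', 'blues.00001', 'rock.00000']   # toy stand-in for `track_ids`

with tempfile.TemporaryDirectory() as tmp_dir:
    emb_path = os.path.join(tmp_dir, 'embeddings.npy')
    ids_path = os.path.join(tmp_dir, 'track_ids.npy')

    np.save(emb_path, toy_embeddings)
    np.save(ids_path, toy_track_ids)

    # Load both files back into memory before the temporary directory is removed
    loaded_embeddings = np.load(emb_path)
    loaded_track_ids = np.load(ids_path)

# Each embedding row should have a matching track id
print(loaded_embeddings.shape, len(loaded_track_ids))
```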