<a href="https://colab.research.google.com/github/minguezalba/MusiCNN-embeddings/blob/main/genre_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Genre classification
---
Author: Alba Mínguez Sánchez

March 2021

---

In this notebook, we will train and evaluate a new transfer learning classifier using the embeddings from GTZAN dataset, extracted by Essentia’s MSD-MusiCNN TensorFlow. 

We will use the audio features (embeddings) as an input to train a shallow target classifier inspired by the ones used in *Alonso-Jiménez, P., Bogdanov, D., & Serra, X. (2020). Deep embeddings with Essentia models*: a multilayer perceptron with a single hidden layer with 100 neurons and ReLU activations. The output layer uses softmax for predictions.

To do this task, we will use Pytorch.



**Packages and dependencies**


In [1]:
import numpy as np
import torch
import plotly.figure_factory as ff

from torch import nn, cuda
from torch.utils.data import Dataset, DataLoader, SubsetRandomSampler
from sklearn.metrics import confusion_matrix

# Set fixed random number seed
torch.manual_seed(42)

<torch._C.Generator at 0x7f82d273c410>

**Check if GPU is available**

If you are running this notebook in Colab, you can use Google GPUs free. Go to  Edit > Notebook settings select GPU as Hardware accelerator.

In [2]:
using_gpu = cuda.is_available()
using_gpu

True

## 1. Loading embeddings dataset from npy files

This step requires to clone MusiCNN-embeddings repository to download the necessary files.

In [3]:
!git clone https://github.com/minguezalba/MusiCNN-embeddings.git

fatal: destination path 'MusiCNN-embeddings' already exists and is not an empty directory.


In [4]:
with open('MusiCNN-embeddings/emb_dataset/embeddings.npy', 'rb') as f:
    embeddings = np.load(f)
with open('MusiCNN-embeddings/emb_dataset/labels.npy', 'rb') as f:
    labels = np.load(f)
with open('MusiCNN-embeddings/emb_dataset/labels_decoded.npy', 'rb') as f:
    labels_decoded = np.load(f)
with open('MusiCNN-embeddings/emb_dataset/track_ids.npy', 'rb') as f:
    track_ids = np.load(f)

genres = {genre_id: genre for genre_id, genre in zip(labels, labels_decoded)}

Check data dimensions and types

In [5]:
print('embeddings: ', embeddings.shape, type(embeddings))
print('labels: ', labels.shape, type(labels))
print('labels_decoded: ', labels_decoded.shape, type(labels_decoded))
print('track_ids: ', track_ids.shape, type(track_ids))

embeddings:  (1000, 200) <class 'numpy.ndarray'>
labels:  (1000,) <class 'numpy.ndarray'>
labels_decoded:  (1000,) <class 'numpy.ndarray'>
track_ids:  (1000,) <class 'numpy.ndarray'>


# 1. Build custom Pytorch dataset

We will use Pytorch Datasets in order to ease the training and test pipeline. To do this, we will define a custom dataset which inherits from Pytorch Dataset class, and redefine the required methods.

In [6]:
class GenreDataset(Dataset):
  def __init__(self, embeddings, labels):
    # Initialization
    self.embeddings = embeddings
    self.labels = labels
    
  def __len__(self):
    # Denotes the total number of samples
    return len(self.labels)

  def __getitem__(self, index):
    # Generates one sample of data
    X = torch.from_numpy(self.embeddings[index, :].flatten())
    y = int(self.labels[index])

    return X, y

In [7]:
dataset = GenreDataset(embeddings, labels)

Check a sample from the dataset

In [8]:
iter_dataset = iter(dataset)
inputs, labels = next(iter_dataset)
print('label sample: ', labels)
print('input sample: ',inputs)

label sample:  1
input sample:  tensor([ -2.2277,  -5.8535,  -4.6969,  -6.3750,  -2.4076,  -6.3240,  15.2210,
          0.6533,  -8.8037,  -0.5683,  -8.0114, -12.6820,  -2.4189, -11.3236,
         -0.6095,  -2.7762,  -3.4282,   5.6842,  -6.3104, -17.0379, -16.2917,
         -9.1156,  -4.7927,  -4.6446,   1.3694,  -7.7527,  -6.6161,  -4.3145,
          7.3805,   3.8525,   8.4030,  -9.7115,  -7.4936,   2.9991, -14.2768,
        -12.8222,   0.9848,  -6.1617,  -6.8571,  12.2362, -16.0057,  -0.9801,
        -18.1293,   8.0538, -11.8202,  -5.4673,  -4.9918,   1.5809, -11.3232,
         -8.2023,  -7.0368, -13.2291, -11.1093,  -6.6410,  -9.5571, -11.5630,
         -6.8677,  -9.4659, -13.5530,   9.7406,   5.0170, -10.0505,  -8.9705,
         -8.3057,  -2.9102, -10.3271,  -5.7909,  -3.0057,  -4.0293,  -1.7750,
        -10.2112,  -2.4594,  -5.8352,  -6.5875,   6.0642,  -3.5237,  -4.7459,
          4.9358,  -9.3451,  -7.3402,   3.4711, -12.6000,  -0.0244, -10.3699,
         -5.2257,  -8.5515,  -6.

# 2. Build train and test dataloaders

We will randomly split our dataset in two: train (80%) and test (20%) using SubsetRandomSampler and Dataloaders from Pytorch.

In this step, we will also define batch size to train and test model.

In [9]:
BATCH_SIZE = 32

# Split using shuffled indexes
dataset_indices = list(range(len(dataset)))
np.random.shuffle(dataset_indices)
n_test = int(np.floor(0.2 * len(dataset)))
train_idx, test_idx = dataset_indices[n_test:], dataset_indices[:n_test]

train_sampler = SubsetRandomSampler(train_idx)
test_sampler = SubsetRandomSampler(test_idx)

# Data Loaders
train_loader = DataLoader(dataset=dataset, shuffle=False, batch_size=BATCH_SIZE, sampler=train_sampler)
test_loader = DataLoader(dataset=dataset, shuffle=False, batch_size=BATCH_SIZE, sampler=test_sampler)

# 3. Model definition

We will define a multilayer perceptron (MLP) with a single hidden layer with 100 neurons and ReLU activations. The output layer uses softmax for predictions. We want to use this model to classify a song-embedding out of the 10 possible genres.

In [10]:
class MLP(nn.Module):
  '''
    Multilayer Perceptron.
  '''
  def __init__(self, input_size, hidden_size, output_size):
    super().__init__()
    self.input_size = input_size
    self.hidden_size  = hidden_size
    self.output_size  = output_size

    net = nn.Sequential(
      nn.Linear(self.input_size, self.hidden_size),
      nn.ReLU(),
      nn.Linear(self.hidden_size, self.output_size)
    )

    if using_gpu:
      self.layers = net.cuda()
    else:
      self.layers = net

  def forward(self, x):
    '''Forward pass'''
    return self.layers(x)

In [11]:
# Initialize the MLP
mlp = MLP(200, 100, len(genres))

We also have to define the loss function and optimizer to use during the training process.

In [12]:
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-3)

# 4. Training the model

In [13]:
N_epochs = 30
mlp.train()

for epoch in range(0, N_epochs):
  train_loss = 0.0    # Set current loss value

  for i, data in enumerate(train_loader):    
    inputs, targets = data
    if using_gpu:
      inputs, targets = inputs.cuda(), targets.cuda()

    optimizer.zero_grad()    
    outputs = mlp(inputs.float())    
    loss = loss_function(outputs, targets.long())
    loss.backward()
    optimizer.step()    
    train_loss += loss.item()
  
  if epoch%5==4:
    print(f'Epoch {epoch+1} - Training Loss: {train_loss}')    

print('Training process has finished.')

Epoch 5 - Training Loss: 4.343503199517727
Epoch 10 - Training Loss: 2.5762134827673435
Epoch 15 - Training Loss: 2.0671953565906733
Epoch 20 - Training Loss: 1.5897746346890926
Epoch 25 - Training Loss: 0.6176864518783987
Epoch 30 - Training Loss: 0.9679376269923523
Training process has finished.


# 5. Testing the model

In [14]:
class_correct = list(0. for i in range(10))
class_total = list(0. for i in range(10))
correct = 0
total = 0

y_test = []
y_test_predicted = []

mlp.eval()
for data in test_loader:    
    inputs, targets = data
    if using_gpu:
      inputs, targets = inputs.cuda(), targets.cuda()

    outputs = mlp(inputs.float())   
    _, predicted = torch.max(outputs, 1)

    y_test.extend(list(targets.cpu().numpy()))
    y_test_predicted.extend(list(predicted.cpu().numpy()))

    total += targets.size(0)
    correct += (predicted == targets).sum().item()

    c = (predicted == targets).squeeze()
    for i in range(len(targets)):
      label = targets[i]
      class_correct[label] += c[i].item()
      class_total[label] += 1

print('Test process has finished.')

Test process has finished.


# 6. Evaluate the results

Decode labels from numerical to categorical values.

In [15]:
labels_test = np.array([genres[v] for v in y_test])
labels_test_predicted = np.array([genres[v] for v in y_test_predicted])

Build confusion matrix

In [16]:
genres_set = list(genres.values())

cm = confusion_matrix(labels_test, labels_test_predicted, labels=genres_set)
cm_text = [[str(y) for y in x] for x in cm]

# set up figure 
fig = ff.create_annotated_heatmap(cm, x=genres_set, y=genres_set, annotation_text=cm_text, colorscale='Blues')
fig.update_layout(title_text='<i><b>Confusion matrix</b></i>')
fig.update_layout(margin=dict(t=100, l=200, r=200),
                  xaxis = dict(title='Predicted genre'),
                  yaxis = dict(title='Real genre'))


#fig['layout']['xaxis']['autorange'] = "reversed"
fig['data'][0]['showscale'] = True  # add colorbar
fig.show()

Show accuracies per genre and global

In [17]:
print(f"Global accuracy: {100 * correct / total :.2f}%")
print("\nAccuracy by genre:")
for i, genre in genres.items():
    print(f"- {genre}: {100 * class_correct[i] / class_total[i] :.2f}%")

Global accuracy: 94.50%

Accuracy by genre:
- classical: 94.44%
- country: 88.89%
- disco: 88.89%
- hip-hop: 100.00%
- jazz: 100.00%
- rock: 93.33%
- blues: 100.00%
- reggae: 92.31%
- pop: 89.47%
- metal: 91.30%
