# Homework #2: Music Genre Classification
Music genre classification is an important task that can be used in many musical applications such as music search or recommender systems. Your mission is to build your own Convolutional Neural Network (CNN) model to classify audio files into different music genres. Specifically, the goals of this homework are as follows:

* Experiencing the whole pipeline of a deep-learning-based system: data preparation, feature extraction, model training, and evaluation
* Getting familiar with CNN architectures for music classification tasks
* Using PyTorch in practice

# Getting Ready

## Installing Packages

In [None]:
!pip install musicnn

## Preparing The Dataset
We use the [GTZAN](http://marsyas.info/downloads/datasets.html) dataset, which has been the most widely used dataset for music genre classification.
The dataset contains 30-second audio files from 10 different genres: reggae, classical, country, jazz, metal, pop, disco, hiphop, rock, and blues.
For this homework, we are going to use a subset of GTZAN with only 8 genres.

In [None]:
# Download the dataset
!gdown --id 1J1DM0QzuRgjzqVWosvPZ1k7MnBRG-IxS

In [None]:
# Uncompress the dataset
!tar zxf gtzan.tar.gz

## Importing Packages

In [None]:
import numpy as np
import os
import librosa
import torch
import torch.nn as nn
from tqdm.notebook import tqdm
from glob import glob
from torch.utils.data import Dataset, DataLoader

## Enabling and testing the GPU

First, you'll need to enable GPUs for the Colab notebook:

- Navigate to Edit (수정) → Notebook Settings (노트 설정)
- Select GPU from the Hardware Accelerator (하드웨어 가속기) drop-down

Next, we'll confirm that we can connect to the GPU with PyTorch and check versions of packages:

In [None]:
if not torch.cuda.is_available():
  raise SystemError('GPU device not found!')
print(f'Found GPU at: {torch.cuda.get_device_name()}')
print(f'PyTorch version: {torch.__version__}')
print(f'Librosa version: {librosa.__version__}')

If the cell above throws an error, enable the GPU by following the instructions above!

# Training CNNs from Scratch

The baseline code is provided so that you can easily start the homework and compare it with your own algorithm.
The baseline extracts mel-spectrograms and trains a simple CNN that consists of convolutional layers, batch normalization, max pooling, and a fully-connected layer.

## Extracting Mel-spectrograms

In [None]:
# Mel-spectrogram setup.
SR = 16000
FFT_HOP = 512
FFT_SIZE = 1024
NUM_MELS = 96
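
As a rough sanity check on these settings, the number of spectrogram frames and the time resolution can be estimated by hand. The sketch below assumes librosa's default centered framing, where a signal of `n` samples yields `1 + n // hop_length` frames; the helper names are illustrative and not part of the baseline.

```python
SR = 16000       # sampling rate in Hz (same value as above)
FFT_HOP = 512    # hop length in samples (same value as above)

def num_frames(num_samples, hop=FFT_HOP):
    """Frame count of a centered STFT (librosa's default `center=True`)."""
    return 1 + num_samples // hop

def frames_to_seconds(frames, sr=SR, hop=FFT_HOP):
    """Approximate duration covered by `frames` hops."""
    return frames * hop / sr

hop_ms = 1000 * FFT_HOP / SR           # time between consecutive frames, in ms
frames_30s = num_frames(30 * SR)       # frame count for a clip of exactly 30 s
seconds_936 = frames_to_seconds(936)   # duration covered by 936 frames

print(hop_ms, frames_30s, seconds_936)
```

With these settings, consecutive frames are 32 ms apart, so a 936-frame input such as the one in the architecture tables covers roughly 30 seconds of audio.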

In [None]:
genres = ['classical', 'country', 'disco', 'hiphop', 'jazz', 'metal', 'pop', 'reggae']
genre_dict = {g: i for i, g in enumerate(genres)}

In [None]:
def load_split(path):
  with open(path) as f:
    paths = [line.rstrip('\n') for line in f]
  return paths

train = load_split('gtzan/split/train.txt')
test = load_split('gtzan/split/test.txt')

# Number of clips in the training and test splits:
len(train), len(test)

In [None]:
# Make directories to save mel-spectrograms.
for genre in genres:
  os.makedirs('gtzan/spec/' + genre, exist_ok=True)
  
for path_in in tqdm(train + test):
  # The spectrograms will be saved under `gtzan/spec/` with a file extension of `.npy`.
  path_out = 'gtzan/spec/' + path_in.replace('.wav', '.npy')

  # Skip if the spectrogram already exists
  if os.path.isfile(path_out):
    continue
    
  # Load the audio signal with the desired sampling rate (SR).
  sig, _ = librosa.load(f'gtzan/wav/{path_in}', sr=SR, res_type='kaiser_fast')
  # Compute the power mel-spectrogram. The signal is passed as the keyword `y`,
  # which recent librosa versions require.
  melspec = librosa.feature.melspectrogram(y=sig, sr=SR, n_fft=FFT_SIZE, hop_length=FFT_HOP, n_mels=NUM_MELS)
  # Transform the power mel-spectrogram into the log compressed mel-spectrogram.
  melspec = librosa.power_to_db(melspec)
  # "float64" uses too much memory! "float32" has enough precision for spectrograms.
  melspec = melspec.astype('float32')

  # Save the spectrogram.
  np.save(path_out, melspec)

## Defining a dataset of spectrograms

In [None]:
# Data processing setup.
BATCH_SIZE = 4

In [None]:
class SpecDataset(Dataset):
  def __init__(self, paths, mean=0, std=1, time_dim_size=None):
    self.paths = paths
    self.mean = mean
    self.std = std
    self.time_dim_size = time_dim_size

  def __getitem__(self, i):
    # Get i-th path.
    path = self.paths[i]
    # Get i-th spectrogram path.
    path = 'gtzan/spec/' + path.replace('.wav', '.npy')

    # Extract the genre from its path.
    genre = path.split('/')[-2]
    # Turn the genre into an index number.
    label = genre_dict[genre]

    # Load the mel-spectrogram.
    spec = np.load(path)
    if self.time_dim_size is not None:
      # Slice the temporal dimension with a fixed length so that they have
      # the same temporal dimensionality in mini-batches.
      spec = spec[:, :self.time_dim_size]
    # Perform standard normalization using pre-computed mean and std.
    spec = (spec - self.mean) / self.std

    return spec, label
  
  def __len__(self):
    return len(self.paths)
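
To make the label lookup in `__getitem__` concrete, the following sketch traces one path through the same string operations. The file name is hypothetical (it only assumes that each entry of the split files has the form `<genre>/<file>.wav`), and `genre_dict` is rebuilt here so the snippet runs on its own.

```python
genres = ['classical', 'country', 'disco', 'hiphop', 'jazz', 'metal', 'pop', 'reggae']
genre_dict = {g: i for i, g in enumerate(genres)}

# Hypothetical split entry; the real file names may differ.
entry = 'jazz/jazz.00001.wav'

# Same steps as in `SpecDataset.__getitem__`.
spec_path = 'gtzan/spec/' + entry.replace('.wav', '.npy')
genre = spec_path.split('/')[-2]   # second-to-last path component
label = genre_dict[genre]          # integer class index

print(spec_path, genre, label)
```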

### Computing statistics of the training set
The code below computes the mean, the standard deviation, and the minimum temporal dimension size of the training set, which are then used to preprocess the inputs.

In [None]:
# Load all spectrograms.
dataset_train = SpecDataset(train)
specs = [s for s, _ in dataset_train]
# Compute the minimum temporal dimension size.
time_dims = [s.shape[1] for s in specs]
min_time_dim_size = min(time_dims)
# Stack the spectrograms
specs = [s[:, :min_time_dim_size] for s in specs]
specs = np.stack(specs)
# Compute mean and standard deviation for standard normalization.
mean = specs.mean()
std = specs.std()

min_time_dim_size, mean, std, 
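
The same steps can be traced on toy data. The sketch below uses two small random arrays in place of real spectrograms (the shapes and values are made up for illustration). It shows that cropping to the shortest clip makes the arrays stackable, and that standardizing with the global mean and standard deviation gives zero mean and unit standard deviation on the data the statistics were computed from, up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two toy "spectrograms" with 4 mel bins and different numbers of frames.
toy_specs = [rng.normal(5.0, 2.0, size=(4, 10)),
             rng.normal(5.0, 2.0, size=(4, 8))]

# Crop every array to the shortest temporal length so they can be stacked.
min_len = min(s.shape[1] for s in toy_specs)
stacked = np.stack([s[:, :min_len] for s in toy_specs])   # shape: (2, 4, 8)

# One global mean and standard deviation over all clips, bins, and frames.
toy_mean, toy_std = stacked.mean(), stacked.std()
normalized = (stacked - toy_mean) / toy_std

print(stacked.shape, normalized.mean(), normalized.std())
```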

### Creating datasets and data loaders using the pre-computed statistics

In [None]:
dataset_train = SpecDataset(train, mean, std, min_time_dim_size)
dataset_test = SpecDataset(test, mean, std, min_time_dim_size)

num_workers = os.cpu_count()
loader_train = DataLoader(dataset_train, batch_size=BATCH_SIZE, shuffle=True, num_workers=num_workers, drop_last=True)
loader_test = DataLoader(dataset_test, batch_size=BATCH_SIZE, shuffle=False, num_workers=num_workers, drop_last=False)

## Training a baseline
The table below shows the architecture of the baseline.

| Layer          | Output Size | Details                 |
|----------------|-------------|-------------------------|
| conv           | 32 x 936    | kernel_size=7, stride=1 |
| maxpool        | 32 x 133    | kernel_size=7, stride=7 |
| conv           | 32 x 133    | kernel_size=7, stride=1 |
| maxpool        | 32 x 19     | kernel_size=7, stride=7 |
| conv           | 32 x 19     | kernel_size=7, stride=1 |
| maxpool        | 32 x 2      | kernel_size=7, stride=7 |
| global_avgpool | 32 x 1      | -                       |

The class below is an implementation of it:

In [None]:
class Baseline(nn.Module):
  def __init__(self):
    super(Baseline, self).__init__()

    self.conv0 = nn.Sequential(
      nn.Conv1d(NUM_MELS, out_channels=32, kernel_size=7, stride=1, padding=3),
      nn.BatchNorm1d(32),
      nn.ReLU(),
      nn.MaxPool1d(kernel_size=7, stride=7)
    )

    self.conv1 = nn.Sequential(
      nn.Conv1d(32, out_channels=32, kernel_size=7, stride=1, padding=3),
      nn.BatchNorm1d(32),
      nn.ReLU(),
      nn.MaxPool1d(kernel_size=7, stride=7)
    )

    self.conv2 = nn.Sequential(
      nn.Conv1d(32, out_channels=32, kernel_size=7, stride=1, padding=3),
      nn.BatchNorm1d(32),
      nn.ReLU(),
      nn.MaxPool1d(kernel_size=7, stride=7)
    )

    # Aggregate features over temporal dimension.
    self.final_pool = nn.AdaptiveAvgPool1d(1)

    # Predict genres using the aggregated features.
    self.linear = nn.Linear(32, len(genres))

  def forward(self, x):
    x = self.conv0(x)
    x = self.conv1(x)
    x = self.conv2(x)
    x = self.final_pool(x)
    x = self.linear(x.squeeze(-1))
    return x
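
The output sizes in the architecture table can be reproduced by hand with the standard length formula for 1D convolution and pooling, `floor((L + 2 * padding - kernel_size) / stride) + 1` (dilation 1). The helper below is only a sketch for checking shapes; its name is not part of PyTorch or the baseline.

```python
def out_len(length, kernel_size, stride, padding=0):
    """Output length of a 1D conv or max-pool layer (dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1

length = 936  # temporal size of the input mel-spectrogram
sizes = []
for _ in range(3):  # three conv + maxpool blocks
    # Convolution with kernel 7, stride 1, padding 3 keeps the length.
    length = out_len(length, kernel_size=7, stride=1, padding=3)
    sizes.append(length)
    # Max pooling with kernel 7, stride 7 shrinks the length about 7 times.
    length = out_len(length, kernel_size=7, stride=7)
    sizes.append(length)

print(sizes)  # [936, 133, 133, 19, 19, 2]
```

The same helper can be used to check the shapes of your own architectures before training them.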

In [None]:
# Training setup.
LR = 0.0006  # learning rate
MOMENTUM = 0.9
NUM_EPOCHS = 10
weight_decay = 0.0  # L2 regularization weight

In [None]:
model = Baseline()
model

In [None]:
# Define a loss function, which is cross entropy here.
criterion = torch.nn.CrossEntropyLoss()
# Set up an optimizer. Here, we use stochastic gradient descent (SGD) with Nesterov momentum.
optimizer = torch.optim.SGD(model.parameters(), lr=LR, momentum=MOMENTUM, nesterov=True, weight_decay=weight_decay)
# Choose a device. We will use GPU if it's available, otherwise CPU.
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# Move variables to the desired device.
model.to(device)
criterion.to(device)

print(f'Optimizer: {optimizer}')
print(f'Device: {device}')
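
For intuition about what the optimizer does, here is a scalar sketch of SGD steps with Nesterov momentum, following the update rule described in the PyTorch `torch.optim.SGD` documentation (dampening 0, no weight decay). The toy objective `f(theta) = theta ** 2`, the step size, and all names are assumptions made for illustration.

```python
def nesterov_sgd_step(theta, grad, buf, lr, momentum):
    """One SGD update with Nesterov momentum for a single scalar parameter."""
    # Momentum buffer: the first step just stores the gradient.
    buf = grad if buf is None else momentum * buf + grad
    # Nesterov look-ahead: combine the gradient with the updated buffer.
    update = grad + momentum * buf
    return theta - lr * update, buf

theta, buf = 1.0, None
history = []
for _ in range(3):
    grad = 2 * theta  # derivative of theta ** 2
    theta, buf = nesterov_sgd_step(theta, grad, buf, lr=0.1, momentum=0.9)
    history.append(theta)

print(history)
```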

In [None]:
# Util function for computing accuracy.
def accuracy(source, target):
  source = source.max(1)[1].long().cpu()
  target = target.cpu()
  correct = (source == target).sum().item()
  return correct / float(source.shape[0])

In [None]:
# Set the status of the model as training.
model.train()

# Iterate over epochs.
for epoch in range(NUM_EPOCHS):
  epoch_loss = 0
  epoch_acc = 0
  pbar = tqdm(loader_train, desc=f'Epoch {epoch:02}')  # progress bar
  for x, y in pbar:
    # Move mini-batch to the desired device.
    x = x.to(device)
    y = y.to(device)

    # Feed forward the model.
    prediction = model(x)
    # Compute the loss.
    loss = criterion(prediction, y)
    # Compute the accuracy.
    acc = accuracy(prediction, y)

    # Perform backward propagation to compute gradients.
    loss.backward()
    # Update the parameters.
    optimizer.step()
    # Reset the computed gradients.
    optimizer.zero_grad()

    # Log training metrics.
    batch_size = len(x)
    epoch_loss += batch_size * loss.item()
    epoch_acc += batch_size * acc
    # Update the progress bar.
    pbar.set_postfix({'loss': epoch_loss / len(dataset_train), 
                      'acc': epoch_acc / len(dataset_train)})

In [None]:
# Set the status of the model as evaluation.
model.eval()

# `torch.no_grad()` disables gradient tracking. `model.eval()` alone only switches
# layers such as batch normalization to evaluation mode, so autograd would still
# record the computation graph. Use `torch.no_grad()` to avoid wasting memory on
# unnecessary gradients.
with torch.no_grad():
  epoch_loss = 0
  epoch_acc = 0
  pbar = tqdm(loader_test, desc='Test')  # progress bar
  for x, y in pbar:
    # Move mini-batch to the desired device.
    x = x.to(device)
    y = y.to(device)

    # Feed forward the model.
    prediction = model(x)
    # Compute the loss.
    loss = criterion(prediction, y)
    # Compute the accuracy.
    acc = accuracy(prediction, y)

    # Log test metrics.
    batch_size = len(x)
    epoch_loss += batch_size * loss.item()
    epoch_acc += batch_size * acc
    # Update the progress bar.
    pbar.set_postfix({'loss': epoch_loss / len(dataset_test), 'acc': epoch_acc / len(dataset_test)})

# Compute the evaluation scores.
test_loss = epoch_loss / len(dataset_test)
test_acc = epoch_acc / len(dataset_test)

print(f'test_loss={test_loss:.5f}, test_acc={test_acc * 100:.2f}%')

### [Question 1] Implement the given architecture.
Implement a CNN with the architecture below, train it, and report its test accuracy.

| Layer          | Output Size | Details                 |
|----------------|-------------|-------------------------|
| conv           | 16 x 936    | kernel_size=7, stride=1 |
| maxpool        | 16 x 133    | kernel_size=7, stride=7 |
| conv           | 32 x 133    | kernel_size=5, stride=1 |
| maxpool        | 32 x 26     | kernel_size=5, stride=5 |
| conv           | 64 x 26     | kernel_size=3, stride=1 |
| maxpool        | 64 x 8      | kernel_size=3, stride=3 |
| conv           | 128 x 8     | kernel_size=3, stride=1 |
| maxpool        | 128 x 2     | kernel_size=3, stride=3 |
| global_avgpool | 128 x 1     | -                       |

Note: you should choose appropriate padding values!
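
One common choice, assuming stride 1 and an odd kernel size, is a padding of `(kernel_size - 1) // 2`, which keeps the temporal length unchanged, as the conv rows of the table require. The check below is a plain-Python sketch with hypothetical helper names; it does not build the network.

```python
def same_padding(kernel_size):
    """Padding that preserves length for stride 1 and an odd kernel size."""
    assert kernel_size % 2 == 1, 'only valid for odd kernel sizes'
    return (kernel_size - 1) // 2

def conv_out_len(length, kernel_size, padding, stride=1):
    """Output length of a 1D convolution (dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1

# (input length, kernel size) of the four conv layers in the table.
conv_layers = [(936, 7), (133, 5), (26, 3), (8, 3)]
paddings = [same_padding(k) for _, k in conv_layers]
lengths = [conv_out_len(n, k, same_padding(k)) for n, k in conv_layers]

print(paddings)  # [3, 2, 1, 1]
print(lengths)   # [936, 133, 26, 8]
```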

In [None]:
# TODO: Question 1

# Exploiting Prior Knowledge using Pre-trained Models


Someone who already plays the acoustic guitar will probably learn the electric guitar faster than someone who has never played a guitar.
Here, we will use pre-trained models from [`musicnn`](https://github.com/jordipons/musicnn) (pronounced as "musician"), which includes CNNs already trained on a large number of songs.


You can predict some tags with the pre-trained model like this:

In [None]:
from musicnn.tagger import top_tags

_ = top_tags('gtzan/wav/' + train[0], model='MTT_musicnn', topN=10)

However, these 10 tags are not the outputs we want! Let's extract embeddings (i.e., features) using the pre-trained model and train a 2-layer MLP that takes the embeddings as inputs.

## Extracting embeddings using the pre-trained model

Side note: this will take about 23 min.

In [None]:
from musicnn.extractor import extractor

# Make directories to save embeddings.
for genre in genres:
  os.makedirs('gtzan/embed/' + genre, exist_ok=True)

for path_in in tqdm(train + test):
  # The embeddings will be saved under `gtzan/embed/` with a file extension of `.npy`.
  path_out = 'gtzan/embed/' + path_in.replace('.wav', '.npy')
  # Skip if the embedding already exists.
  if os.path.isfile(path_out):
    continue
  
  # Extract the embedding using the pre-trained model.
  _, _, embeds = extractor(f'gtzan/wav/{path_in}', model='MTT_musicnn', extract_features=True)
  # Average the embeddings over temporal dimension.
  embed = embeds['max_pool'].mean(axis=0)

  # Save the embedding.
  np.save(path_out, embed)

In [None]:
class EmbedDataset(Dataset):
  def __init__(self, paths):
    self.paths = paths

  def __getitem__(self, i):
    # Get i-th path.
    path = self.paths[i]
    # Get i-th embedding path.
    path = 'gtzan/embed/' + path.replace('.wav', '.npy')

    # Extract the genre from its path.
    genre = path.split('/')[-2]
    # Turn the genre into an index number.
    label = genre_dict[genre]

    # Load the embedding.
    embed = np.load(path)

    return embed, label
  
  def __len__(self):
    return len(self.paths)

In [None]:
dataset_train = EmbedDataset(train)
dataset_test = EmbedDataset(test)

num_workers = os.cpu_count()
loader_train = DataLoader(dataset_train, batch_size=BATCH_SIZE, shuffle=True, num_workers=num_workers, drop_last=True)
loader_test = DataLoader(dataset_test, batch_size=BATCH_SIZE, shuffle=False, num_workers=num_workers, drop_last=False)

In [None]:
embed_size = dataset_train[0][0].shape[0]
embed_size

### [Question 2] Implement, train, and evaluate a 2-layer MLP using the extracted embeddings.

In [None]:
# TODO: Question 2

# Improving Algorithms [[Leader Board]](https://docs.google.com/spreadsheets/d/1bzkMFeXABTae7kDJG6QCU_qnP1ppJDoNQLgGz3ksJu0/edit?usp=sharing)

### [Question 3] Improve the performance.
Now it is your turn. You should improve the baseline code with your own algorithm. There are many ways to improve it. The following are some possible ideas:

* The first thing to do is to segment the audio clips to generate more data. The baseline code uses the whole mel-spectrogram as the input to the network (e.g., 96x936 dimensions). Try feeding the network 3-5 second segments and averaging the predictions over all segments of an audio clip.

* You can try training a model using both mel-spectrograms and features extracted using the pre-trained models. The baseline code uses a pre-trained model trained on 19k songs, but `musicnn` also has models trained on 200k songs! Try that model by passing the `model='MSD_musicnn'` option during feature extraction.

* You can try 1D CNN or 2D CNN models and choose different model parameters:
    * Filter size
    * Pooling size
    * Stride size 
    * Number of filters
    * Model depth
    * Regularization: L2/L1 and Dropout

* You should try different training hyperparameters and optimizers:
    * Learning rate
    * Model depth
    * Optimizers: SGD (with Nesterov momentum), Adam, RMSProp, ...

* You can try different parameters (e.g., hop and window size) to extract the mel-spectrograms, or different features as input to the network (e.g., MFCC, chroma features, ...).

* You can also use ResNet or other CNNs with skip connections. 

* Furthermore, you can augment data using digital audio effects.
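
The segmentation idea in the first bullet can be sketched in plain Python. The snippet assumes the baseline's frame rate (`SR = 16000`, `FFT_HOP = 512`) and uses made-up per-segment probabilities in place of real model outputs; the function names are hypothetical.

```python
SR, FFT_HOP = 16000, 512

def segment_starts(num_frames, seg_frames, hop_frames):
    """Start indices of fixed-length segments that fit inside the clip."""
    return list(range(0, num_frames - seg_frames + 1, hop_frames))

def average_predictions(segment_probs):
    """Average class probabilities over all segments of one clip."""
    n = len(segment_probs)
    return [sum(col) / n for col in zip(*segment_probs)]

seg_frames = int(4.0 * SR / FFT_HOP)                    # frames in a 4-second segment
starts = segment_starts(936, seg_frames, seg_frames)    # non-overlapping segments

# Made-up probabilities for 3 segments and 2 classes.
clip_probs = average_predictions([[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]])
predicted_class = max(range(len(clip_probs)), key=clip_probs.__getitem__)

print(seg_frames, starts, clip_probs, predicted_class)
```

In such a setup, each training segment would typically inherit the genre label of its clip, and at test time the averaged prediction is compared with the clip's label. Overlapping segments can be obtained by choosing `hop_frames` smaller than `seg_frames`.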

In [None]:
# TODO: Question 3


# Deliverables
You should submit your Python code (`.ipynb` or `.py` files) and homework report (`.pdf` file) to KLMS. The report should include:
* Algorithm Description
* Experiments and Results
* Discussion

# Note
The code is written using PyTorch, but you can use TensorFlow for Question 3 if you want.

# Credit
This homework was implemented by Jongpil Lee, Soonbeom Choi and Taejun Kim in the KAIST Music and Audio Computing Lab.
