# Music genre classifier
## Neural networks as classifiers
This notebook should be seen as the third step in a series of notebooks aimed at building an ML audio classifier.

We continue our journey of music classification by training more complex models, such as CNNs and RNNs.
After this, we will see how our results compare against a pretrained model.
If you missed our previous steps, you can find them here:

- [preprocessing](https://github.com/pmhalvor/public-data/blob/master/notebooks/music-genre/preprocess.py)
- [traditional classifiers](https://github.com/pmhalvor/public-data/blob/master/notebooks/music-genre/classifiers.py) (note: currently only on branch [add/classifiers](https://github.com/pmhalvor/public-data/blob/add/classifiers/notebooks/music-genre/classifiers.ipynb))


## Goal
Train neural net classifiers to predict the genre of a song.

## Dataset
The dataset contains 1000 audio tracks, each 30 seconds long. It covers 10 genres, each represented by 100 tracks. All tracks are 22050 Hz mono 16-bit audio files in .wav format.
In [preprocess.py](preprocess.py), we convert the .wav files to MFCC features and store them as PyTorch tensors (`mfcc.pt`). Labels and file paths are stored as NumPy arrays.

## Source
https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification/ (accessed 2023-10-20)

In [63]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, f1_score

import numpy as np
import plotly.express as px
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Load data

In [3]:
mfcc_tensor = torch.load("mfcc.pt")
covariance_tensor =  torch.load("covariance.pt")
file_paths = np.load("file_paths.npy")
labels = np.load("labels.npy")

In [4]:
mfcc_tensor.shape

torch.Size([999, 2986, 13])

In [5]:
labels.shape

(999,)

In [6]:
# for plotting
file_paths.shape

(999,)

In [7]:
labels_to_idx = {label: idx for idx, label in enumerate(np.unique(labels))}
idx_to_labels = {idx: label for idx, label in enumerate(np.unique(labels))}
labels_to_idx

{'blues': 0,
 'classical': 1,
 'country': 2,
 'disco': 3,
 'hiphop': 4,
 'jazz': 5,
 'metal': 6,
 'pop': 7,
 'reggae': 8,
 'rock': 9}
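
The two dictionaries above map genre names to integer ids and back. As a minimal sketch (using a small made-up label array, `toy_labels`, rather than the real `labels.npy`), the same encoding can also be obtained in one call with `np.unique(..., return_inverse=True)`, which returns the sorted class names together with the integer id of every sample:

```python
import numpy as np

# Hypothetical stand-in for the real `labels` array loaded from labels.npy
toy_labels = np.array(["rock", "blues", "jazz", "blues", "rock"])

# `classes` is sorted alphabetically, `encoded` holds one integer id per sample
classes, encoded = np.unique(toy_labels, return_inverse=True)

# Same mapping as the dictionary comprehension used in the notebook
labels_to_idx_toy = {str(label): idx for idx, label in enumerate(classes)}

# Indexing the class array with the ids maps them back to genre names
decoded = classes[encoded]
```

Encoding the labels once up front would avoid repeating the dictionary lookup in every training batch, although the notebook keeps the string labels and converts them on the fly.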

# Build simple classifiers

We asked ChatGPT to generate the classifiers for us, using this prompt:

Me: _I want to test a basic feed-forward neural network, a CNN, and an RNN. I will be using PyTorch as my ML framework. Could you help me generate the basic boilerplate code?_ 


It came up with the following three models:


In [8]:
features = 13
measurements = 2986
input_size = measurements * features  # Flattened input length: frames x MFCC coefficients
hidden_size = 128  # Number of neurons in the hidden layer
num_classes = 10  # Number of music genres
criterion = nn.CrossEntropyLoss()

# Hyperparameters
num_epochs = 10
batch_size = 100
learning_rate = 0.001
rnn_layers = 2

In [9]:
class FFN(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(FFN, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, num_classes)
    
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x



torch.Size([128, 38818])
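
The shape printed above is consistent with the weight matrix of `fc1` (an assumption, since the cell that produced it is not shown): a layer `nn.Linear(38818, 128)` stores a 128 x 38818 weight matrix. As a rough sketch, the parameter count of the FFN with `hidden_size = 128` can be worked out with plain arithmetic. The helper `linear_params` is hypothetical and only exists for this illustration:

```python
def linear_params(in_features, out_features):
    """Number of weights plus biases in one fully connected layer."""
    return in_features * out_features + out_features


features, measurements = 13, 2986
input_size = measurements * features  # flattened MFCC matrix
hidden_size, num_classes = 128, 10

fc1_params = linear_params(input_size, hidden_size)
fc2_params = linear_params(hidden_size, num_classes)
total_params = fc1_params + fc2_params

print(input_size, fc1_params, fc2_params, total_params)
```

Almost all parameters sit in the first layer, because every one of the 38818 flattened inputs is connected to every hidden unit. With only a few hundred training tracks, a model of this size can be expected to overfit easily.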


In [10]:
# Define the Convolutional Neural Network model
class CNN(nn.Module):
    def __init__(self, num_channels, num_classes, out_channels=32, measurements=2986, features=13, verbose=False):
        super(CNN, self).__init__()

        kernel_size = 3
        stride = 1
        padding = 1

        self.conv1 = nn.Conv2d(in_channels=num_channels, out_channels=out_channels, kernel_size=1, stride=1, padding=0)
        self.relu = nn.ReLU()
        self.maxpool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc = nn.Linear(out_channels * measurements * 3, num_classes)  # Adjust the input size based on your data

        self.measurements = measurements
        self.features = features
        self.verbose = verbose
    
    def forward(self, x):
        print("x.shape", x.shape) if self.verbose else None
        x = x.reshape(-1, 1, self.measurements, self.features)
        print("x.shape", x.shape) if self.verbose else None
        x = self.conv1(x)
        print("x.shape", x.shape) if self.verbose else None
        x = self.relu(x)
        print("x.shape", x.shape) if self.verbose else None
        x = self.maxpool(x)
        print("x.shape", x.shape) if self.verbose else None
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        print("x.shape", x.shape) if self.verbose else None
        print("-"*10) if self.verbose else None
        return x


torch.Size([10, 286656])
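
The input size of the final linear layer depends on how the convolution and pooling layers change the spatial dimensions. The sketch below traces those sizes with the standard output-size formula for convolution and pooling windows; the helper `conv_out` is a hypothetical name introduced for this illustration:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


measurements, features, out_channels = 2986, 13, 32

# conv1 uses kernel_size=1, stride=1, padding=0, so both axes are unchanged
h, w = conv_out(measurements, 1), conv_out(features, 1)

# maxpool uses kernel_size=2, stride=2, which halves each axis, rounding down
h, w = conv_out(h, 2, stride=2), conv_out(w, 2, stride=2)

flat = out_channels * h * w
print(h, w, flat)
```

For these particular sizes the result equals `out_channels * measurements * 3`, the expression used in the `CNN` class, because 1493 x 6 happens to equal 2986 x 3. With a different number of frames or MFCC coefficients, the two expressions would generally disagree, so deriving the size from the layer settings is the safer option.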


In [34]:
# Define the Recurrent Neural Network model
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes, verbose=False):
        super(RNN, self).__init__()
        self.rnn = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

        self.verbose = verbose
    
    def forward(self, x):
        print('Shape of input: {}'.format(x.shape)) if self.verbose else None
        x = x.reshape([x.shape[0], 2986, 13])
        print('Shape of input: {}'.format(x.shape)) if self.verbose else None
    
        _, (hn, _) = self.rnn(x)
        print('Shape of hidden state: {}'.format(hn.shape)) if self.verbose else None
    
        x = self.fc(hn[-1, :, :])
        print('Shape of output: {}'.format(x.shape)) if self.verbose else None
    
        return x
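
The `RNN` class classifies a track from `hn[-1]`, the final hidden state of the top LSTM layer. The following sketch, with small made-up sizes so it runs quickly, shows the shapes `nn.LSTM` returns when `batch_first=True` and checks that `hn[-1]` matches the last time step of the output sequence:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical small sizes; the notebook uses 2986 frames and 128 hidden units
batch, frames, n_mfcc, hidden, layers = 4, 50, 13, 8, 2

lstm = nn.LSTM(n_mfcc, hidden, layers, batch_first=True)
x = torch.randn(batch, frames, n_mfcc)

# output: top-layer hidden state at every time step, (batch, frames, hidden)
# hn, cn: final hidden and cell state of every layer, (layers, batch, hidden)
output, (hn, cn) = lstm(x)

# For a unidirectional LSTM, the top layer's final state is the last output step
last_layer_state = hn[-1]
same = torch.allclose(output[:, -1, :], last_layer_state)
```

Using only the final state means the whole 2986-step sequence has to be summarized in a single vector, which is one reason long sequences can be hard for this kind of model.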


# Train test split

In [14]:
# Flatten each track into one row: (num_samples, num_frames * num_mfcc)
num_samples, num_frames, num_mfcc = mfcc_tensor.shape
mfcc_tensor_2d = mfcc_tensor.reshape(num_samples, num_frames * num_mfcc)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(mfcc_tensor_2d, labels, test_size=0.2, random_state=42)

# Get validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=42)

In [15]:
uniques, counts = np.unique(y_train, return_counts=True)
dict(zip(uniques, counts))

{'blues': 73,
 'classical': 61,
 'country': 68,
 'disco': 71,
 'hiphop': 73,
 'jazz': 71,
 'metal': 80,
 'pop': 72,
 'reggae': 77,
 'rock': 73}
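
The class counts above range from 61 to 80, since a plain random split does not preserve the genre proportions. A possible alternative, sketched here on made-up stand-in data rather than the real tensors, is to pass `stratify` to `train_test_split` so each split keeps roughly the same number of tracks per genre:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 999 samples, 10 nearly balanced classes
genres = np.array(["blues", "classical", "country", "disco", "hiphop",
                   "jazz", "metal", "pop", "reggae", "rock"])
y = np.repeat(genres, 100)[:999]
X = np.arange(999).reshape(-1, 1)

# Same proportions and seed as the notebook, plus stratification
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.1, random_state=42, stratify=y_tmp)

_, train_counts = np.unique(y_tr, return_counts=True)
print(len(X_tr), len(X_va), len(X_te), train_counts)
```

The split sizes are the same as in the notebook (719 training, 80 validation, and 200 test samples), but the per-genre counts in the training split differ by at most a track or two.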

# Train methods

In [16]:
def train_batch(batch, model, criterion, optimizer):
    # Get the batch of data
    batch_X, batch_y = batch
    # convert strings to ids
    batch_y = np.array([labels_to_idx[x] for x in batch_y])
    batch_X = batch_X.float()
    
    # Zero out the gradients
    optimizer.zero_grad()
    
    # Forward pass
    outputs = model(batch_X)
    loss = criterion(outputs, torch.tensor(batch_y))
    
    # Backward pass
    loss.backward()
    optimizer.step()
    
    return loss.item()  


def eval_batch(X_val, y_val, model, criterion):
     # Evaluate
    model.eval()
    with torch.no_grad():
        X_val = X_val.float()
        y_val = torch.tensor([labels_to_idx[x] for x in y_val])
        outputs = model(X_val)
        loss = criterion(outputs, y_val)
    
    return loss.item()


def train_epoch(X_train, y_train, model, criterion, optimizer, batch_size=100):
    # Shuffle the training data
    indices = np.arange(len(X_train))
    np.random.shuffle(indices)
    
    # Create batches from the shuffled indices (a final partial batch is dropped)
    num_batches = len(X_train) // batch_size
    batches = []
    for i in range(num_batches):
        batch_idx = indices[i*batch_size:(i+1)*batch_size]
        batches.append((X_train[torch.as_tensor(batch_idx, dtype=torch.long)], y_train[batch_idx]))
    
    # Train each batch
    losses = []
    for batch in batches:
        loss = train_batch(batch, model, criterion, optimizer)
        losses.append(loss)
    
    return losses


def train_model(X_train, y_train, X_val, y_val, model, criterion, optimizer, num_epochs=10, batch_size=100, verbose=False):
    train_losses = []
    val_losses = []
    average_loss = []
    
    for epoch in range(num_epochs):
        print('Epoch: {}'.format(epoch)) if verbose or epoch % 5 == 0 else None
        
        # Train
        model.train()
        losses = train_epoch(X_train, y_train, model, criterion, optimizer, batch_size=batch_size)
        train_losses.extend(losses)
        average_loss.append(np.mean(losses))
        
        # Evaluate
        model.eval()
        with torch.no_grad():
            eval_loss = eval_batch(X_val, y_val, model, criterion)
            val_losses.append(eval_loss)
        
        if verbose or epoch % 5 == 0:
            print('Train loss: {:.4f}'.format(losses[-1])) 
            print('Val loss: {:.4f}'.format(val_losses[-1]))
        

    return train_losses, val_losses, average_loss


X_train.numpy().shape


(719, 38818)
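
As an alternative to building batches by hand as in `train_epoch`, PyTorch's `DataLoader` can handle shuffling and batching. The sketch below uses made-up stand-in data and assumes the labels have already been converted to integer ids, which the notebook does inside `train_batch` instead:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical stand-in data: 250 samples, 20 features, integer class ids
X = torch.randn(250, 20)
y = torch.randint(0, 10, (250,))

# shuffle=True reshuffles the samples at the start of every epoch;
# drop_last defaults to False, so the final partial batch is kept
loader = DataLoader(TensorDataset(X, y), batch_size=100, shuffle=True)

batch_sizes = [len(batch_X) for batch_X, batch_y in loader]
print(batch_sizes)
```

Unlike the integer division in `train_epoch`, this keeps the last, smaller batch, so every sample is seen once per epoch.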

In [17]:
def plot_losses(train_losses, val_losses, model=""):
    """Plot using Plotly Express"""
    import plotly.express as px
    import pandas as pd
    pd.options.plotting.backend = "plotly"
    
    df = pd.DataFrame({
        'epoch': np.arange(len(train_losses)),
        'train_loss': train_losses,
        'val_loss': val_losses
    })
    
    fig = px.line(df, x='epoch', y=['train_loss', 'val_loss'], title=f'Losses {model}')
    fig.show()

# Training

## Feed Forward 

In [None]:
# Initialize the FFN model with the correct input_size
input_size = 38818  # Set the correct input size
hidden_size = 128
num_classes = 10

In [18]:
ffn_model = FFN(input_size, hidden_size*5, num_classes)
optimizer_ffn = optim.Adam(ffn_model.parameters(), lr=0.001)  # double check optimizer set-up

ffn_train_losses, ffn_val_losses, ffn_avg_losses = train_model(
    X_train, y_train, X_val, y_val, ffn_model, criterion, optimizer_ffn, 
    num_epochs=15, batch_size=batch_size, verbose=False
)

plot_losses(ffn_avg_losses, ffn_val_losses, "FFN")

Epoch: 0
Train loss: 681.9460
Val loss: 564.4234
Epoch: 5
Train loss: 6.5908
Val loss: 22.8691
Epoch: 10
Train loss: 0.3541
Val loss: 13.3757
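
For reference, a 10-class classifier that outputs uniform probabilities has a cross-entropy loss of ln(10), about 2.30, so the first-epoch losses above are very large. One possible contributor (an assumption, not something verified in this notebook) is that the raw MFCC values are fed in unscaled. A minimal sketch of per-feature standardization on made-up data, using statistics from the training split only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unscaled features with a large offset and spread
X_train_toy = rng.normal(loc=-200.0, scale=50.0, size=(80, 6))
X_val_toy = rng.normal(loc=-200.0, scale=50.0, size=(20, 6))

# Statistics come from the training split only, to avoid leaking validation data
mean = X_train_toy.mean(axis=0, keepdims=True)
std = X_train_toy.std(axis=0, keepdims=True) + 1e-8  # avoid division by zero

X_train_scaled = (X_train_toy - mean) / std
X_val_scaled = (X_val_toy - mean) / std  # reuse the training statistics

# Cross-entropy of a uniform prediction over 10 classes
uniform_loss = np.log(10)
```

Whether scaling actually helps here would need to be checked by rerunning the training cells with standardized inputs.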


## CNN

In [None]:
# Initialize the CNN model
num_channels = 1  # The MFCC matrix is treated as a single-channel image
num_classes = 10  # Number of output classes (genres in your case)

In [19]:
# Initialize the CNN model, loss function, and optimizer
cnn_model = CNN(num_channels, num_classes, out_channels=32, measurements=2986, features=13)
optimizer_cnn = optim.Adam(cnn_model.parameters(), lr=0.0001)

cnn_train_losses, cnn_val_losses, cnn_avg_losses = train_model(
    X_train, y_train, X_val, y_val, cnn_model, criterion, optimizer_cnn, 
    num_epochs=15, batch_size=batch_size, verbose=False
)

plot_losses(cnn_avg_losses, cnn_val_losses, "CNN")

Epoch: 0
Train loss: 405.5887
Val loss: 424.3141
Epoch: 5
Train loss: 35.5789
Val loss: 32.3569
Epoch: 10
Train loss: 4.0976
Val loss: 17.5252


## RNN

In [22]:
# Assuming your input tensor is named 'x' with size [100, 2986, 13]
input_size = 13  # Number of features
hidden_size = 128  # Number of hidden units in the RNN layer
num_layers = 2  # Number of RNN layers
num_classes = 10  # Number of output classes (genres in your case)
batch_size = 100  # Number of examples in a batch

In [None]:

# Initialize the RNN model, loss function, and optimizer
rnn_model = RNN(features, hidden_size, rnn_layers, num_classes)

In [36]:
# Initialize the RNN model, loss function, and optimizer
rnn_model = RNN(features, hidden_size, num_layers, num_classes, verbose=False)
optimizer_rnn = optim.Adam(rnn_model.parameters(), lr=0.1)

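# Note: this run monitors the loss on the test split (X_test, y_test),
# whereas the FFN and CNN runs above use the validation split.
rnn_train_losses, rnn_val_losses, rnn_avg_losses = train_model(
    X_train, y_train, X_test, y_test, rnn_model, criterion, optimizer_rnn, 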
    num_epochs=10, batch_size=batch_size, verbose=False
)

plot_losses(rnn_avg_losses, rnn_val_losses, "RNN")

Epoch: 0
Train loss: 3.1300
Val loss: 2.9816
Epoch: 5
Train loss: 2.2097
Val loss: 2.0377


# Evaluate


In [42]:
y_labels = np.array([labels_to_idx[x] for x in y_val])

In [73]:
def metrics(predictions, y_labels, verbose=False):
    """Calculate accuracy, F1 score, and confusion matrix"""
    accuracy = accuracy_score(y_labels, predictions)
    f1 = f1_score(y_labels, predictions, average='weighted', zero_division=0)
    confusion = confusion_matrix(y_labels, predictions)
    report = classification_report(y_labels, predictions, zero_division=0)

    if verbose:
        print("Accuracy:", accuracy)
        print("F1 Score:", f1) 
        # print("Confusion matrix:", confusion) 
        print("Classification Report:\n", report)
    return accuracy, f1, confusion, report

In [68]:
def plot_confusion_matrix(cm, classes=list(labels_to_idx), title='Confusion matrix', cmap=px.colors.sequential.Blues):
    """
    Plot the confusion matrix as a heatmap.
    `classes` must be listed in label-index order so the axis names match the matrix rows and columns.
    """
    fig = px.imshow(cm, x=classes, y=classes, color_continuous_scale=cmap)
    fig.update_layout(title=title, xaxis_title="Predicted", yaxis_title="Actual")
    fig.show()
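
Because the number of validation tracks per genre is small and uneven (between 5 and 11 in the reports in this notebook), raw counts in a confusion matrix can be hard to compare across rows. A minimal sketch, on a made-up 3-class matrix, of normalizing each row so the diagonal shows per-class recall:

```python
import numpy as np

# Hypothetical confusion matrix: rows are actual classes, columns are predictions
cm = np.array([[3, 1, 1],
               [0, 8, 2],
               [2, 2, 6]])

# Divide each row by its total; the maximum guards against empty rows
row_totals = cm.sum(axis=1, keepdims=True)
cm_normalized = cm / np.maximum(row_totals, 1)

# The diagonal of the row-normalized matrix is the recall of each class
recall_per_class = np.diag(cm_normalized)
print(recall_per_class)
```

The normalized matrix can be passed to `plot_confusion_matrix` in place of the raw counts.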


In [69]:
ffn_outputs = ffn_model(X_val).detach().numpy()
ffn_preds = np.argmax(ffn_outputs, axis=1)


ffn_acc, ffn_f1, ffn_cm, ffn_cr = metrics(ffn_preds, y_labels, verbose=True)

plot_confusion_matrix(ffn_cm)

Accuracy: 0.5
F1 Score: 0.48940404366874957
Classification Report:
               precision    recall  f1-score   support

           0       0.38      0.60      0.46         5
           1       0.80      0.73      0.76        11
           2       0.30      0.30      0.30        10
           3       0.00      0.00      0.00         5
           4       0.43      0.43      0.43         7
           5       0.56      0.56      0.56         9
           6       0.70      0.88      0.78         8
           7       0.50      0.71      0.59         7
           8       0.57      0.50      0.53         8
           9       0.50      0.20      0.29        10

    accuracy                           0.50        80
   macro avg       0.47      0.49      0.47        80
weighted avg       0.50      0.50      0.49        80
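
The `macro avg` and `weighted avg` rows differ only in how the per-class scores are combined: the macro average treats every genre equally, while the weighted average weights each genre by its support. As a small worked example, the F1 column of the report above can be recombined by hand, which reproduces the reported 0.47 and 0.49 up to rounding:

```python
import numpy as np

# Per-class F1 scores and supports copied from the FFN report above
f1_per_class = np.array([0.46, 0.76, 0.30, 0.00, 0.43, 0.56, 0.78, 0.59, 0.53, 0.29])
support = np.array([5, 11, 10, 5, 7, 9, 8, 7, 8, 10])

# Macro average: plain mean over classes
macro_f1 = f1_per_class.mean()

# Weighted average: mean over classes, weighted by the number of true samples
weighted_f1 = np.average(f1_per_class, weights=support)

print(round(float(macro_f1), 2), round(float(weighted_f1), 2))
```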



In [74]:
cnn_outputs = cnn_model(X_val).detach().numpy()
cnn_preds = np.argmax(cnn_outputs, axis=1)


cnn_acc, cnn_f1, cnn_cm, cnn_cr = metrics(cnn_preds, y_labels, verbose=True)

plot_confusion_matrix(cnn_cm)

Accuracy: 0.275
F1 Score: 0.2450187855622638
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.40      0.57         5
           1       0.75      0.82      0.78        11
           2       0.00      0.00      0.00        10
           3       0.08      0.60      0.14         5
           4       0.21      0.57      0.31         7
           5       1.00      0.11      0.20         9
           6       0.00      0.00      0.00         8
           7       0.60      0.43      0.50         7
           8       0.00      0.00      0.00         8
           9       0.00      0.00      0.00        10

    accuracy                           0.28        80
   macro avg       0.36      0.29      0.25        80
weighted avg       0.35      0.28      0.25        80



In [75]:
rnn_outputs = rnn_model(X_val).detach().numpy()
rnn_preds = np.argmax(rnn_outputs, axis=1)

rnn_acc, rnn_f1, rnn_cm, rnn_cr = metrics(rnn_preds, y_labels, verbose=True)

plot_confusion_matrix(rnn_cm)

Accuracy: 0.3625
F1 Score: 0.3153313652721547
Classification Report:
               precision    recall  f1-score   support

           0       0.14      0.60      0.22         5
           1       0.86      0.55      0.67        11
           2       0.00      0.00      0.00        10
           3       0.00      0.00      0.00         5
           4       1.00      0.14      0.25         7
           5       0.29      0.56      0.38         9
           6       0.45      0.62      0.53         8
           7       0.33      0.86      0.48         7
           8       0.75      0.38      0.50         8
           9       0.00      0.00      0.00        10

    accuracy                           0.36        80
   macro avg       0.38      0.37      0.30        80
weighted avg       0.40      0.36      0.32        80
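
The three evaluation cells repeat the same pattern: run the model on `X_val`, detach the outputs, and take the arg max. A possible way to factor this out is a small helper, sketched below. The name `predict` and the toy model are hypothetical; the helper also switches the model to evaluation mode and disables gradient tracking:

```python
import torch
import torch.nn as nn


def predict(model, X):
    """Return the predicted class index for every row of X."""
    model.eval()  # disable dropout / batch-norm updates, if the model has any
    with torch.no_grad():  # no gradient tracking needed for inference
        logits = model(X.float())
    return logits.argmax(dim=1).numpy()


# Hypothetical tiny model and inputs, just to exercise the helper
torch.manual_seed(0)
toy_model = nn.Linear(4, 3)
toy_X = torch.randn(5, 4)
toy_preds = predict(toy_model, toy_X)
print(toy_preds)
```

With such a helper, a call like `predict(ffn_model, X_val)` would replace the two-line pattern in each evaluation cell.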

