# Network Two: music genre classifier

This notebook contains the second part of the assigment for ECS7013P, Deep Learning for Audio and Music, at QMUL. It is expected to be used as an additional step for the predictions obtained from Network One.
It consists of an **implementation of a neural network for music genre classification, including an extra class to account for non-music examples** (missclassifications from networkOne).  
Code and ideas were based, built upon and inspired by:  
- Lectures Deep Learning for Audio and Music (ECS7013P), at QMUL, by Dan Stowell. 
- https://github.com/cetinsamet/music-genre-classification/tree/master/src
- https://github.com/slychief/ismir2018_tutorial 
- https://github.com/stenlytw/genre-classification
- https://github.com/keunwoochoi/dl4mir  

Also, this implementation uses pre-trained features, exploring torchvggish's package available at:  
- https://github.com/harritaylor/torchvggish  

Training, validation and testing were conducted using GTZAN and URBANSOUND8K datasets available at:  
- [GTZAN] http://marsyas.info/downloads/datasets.html  
- [URBANSOUND8K] https://urbansounddataset.weebly.com/urbansound8k.html  

Testing was also conducted using the output from *networkOne.ipynb*, a list of files from GTZAN and URBANSOUND8K classified as music. 

As per this coursework's requirements, a test case audio file is needed to evaluate the network's performance. A track from *Conduct*, by October Horse, a progressive metal band from Porto, Portugal was chosen:  
- https://octoberhorse.bandcamp.com/track/waving

In [2]:
import numpy as np
import pandas as pd
import os

import librosa
import pickle

import time

from sklearn.utils import shuffle
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix

import itertools
from collections import OrderedDict

import matplotlib.pyplot as plt
%matplotlib inline

import IPython.display as ipd

import torch
torch.manual_seed(123)
from torch.nn import Module, Conv2d, MaxPool2d, Linear, Dropout, BatchNorm2d
import torch.nn.functional as F
from torch.autograd import Variable

# Implementation

In [13]:
PATH = os.getcwd()

In [14]:
# Paths to where the datasets (GTZAN and URBANSOUND8K) are stored on local filesystem
path_to_gtzan  = "/import/c4dm-datasets/gtzan/"
path_to_urban = "/import/c4dm-datasets/UrbanSound8K/audio/"

In [15]:
GENRES = os.listdir(path_to_gtzan)
# Add genre 'unknown' to our labes to account for non-musical examples coming from networkOne
GENRES.append('unknown')

In [12]:
def get_data_gtzan(path):    
    # initialize lists to append spectrogram data and keys
    # and the dictionary to save everything
    data_list = []
    feat_Dict = {}
    
    # iterate over all genre folders in the dataset
    for genre in GENRES[:-1]: #last element is unknown
        genre_folder = path+genre+'/'
        # iterate over all audio tracks in the genre folder
        for track in os.listdir(genre_folder):
            track_path = genre_folder+track
            filename = genre+'/'+track[:-3] # filename without extension
            y, sr = librosa.load(track_path, mono=True)
            # compute melspectrogram (default num mels = 128)
            S = librosa.feature.melspectrogram(y, sr).T # shape = num frames x num mels
            # Convert it from amplitude squared to dB for easier visualization
            S = librosa.power_to_db(S, ref=np.max)
            # discard last (S.shape[0] % 128) elements of the spectrogram
            # so that the dimensionality is a multiple of num mels
            # so that we can split the spectrogram in chunks of dimensionality 128
            S = S[:-1 * (S.shape[0] % 128)] 
            # number of chunks = 10
            # for a 30 sec long audio track (3sec chunks)
            # (with default librosa settings for the spectrogram calculation)
            num_chunks  = S.shape[0] / 128
            # split spectrogram into chunks
            data_chunks = np.split(S, num_chunks)
            # (spectrogram chunk, label) tuple
            data_chunks = [(data, genre, filename) for data in data_chunks] # append the label for each chunk
            # each list element corresponds to a complete track
            data_list.append(data_chunks)
    
    # save spectrogram chunks in a list and return it
    # now each list element corresponds to a spectrogram chunk (rather than a complete track) 
    
    chunks = []
    for track in data_list:
        for audio_chunk in track:
            chunks.append(audio_chunk)
    
    return chunks

In [13]:
def get_data_urban(path):
    # initialize list to append spectrogram data
    label = "unknown"
    chunks = []
    
    for entry in os.scandir(path):
        if os.path.isdir(entry): ## check if it is a directory
        #folder_names.append(entry.name)
            for wavFile in os.scandir(entry.path):
                if(wavFile.name == '.DS_Store'):
                    break
                # Loading 1000 files
                # But each file is shorter than the ones in GTZAN
                elif(len(chunks) < 1000): # we will load 1000 files
                    filename = entry.name + '/' + str(wavFile.name)
                    filename = filename[:-4] # file extension does not need to be saved
                    y, sr = librosa.load(wavFile, mono=True)
                    S = librosa.feature.melspectrogram(y, sr).T # shape = num frames x num mels
                    # Convert it from amplitude squared to dB for easier visualization
                    S = librosa.power_to_db(S, ref=np.max)
                    # Discard samples
                    if (S.shape[0] >= 128):
                        S = S[:-1 * (S.shape[0] % 128)] 
                        num_chunks  = int(S.shape[0] / 128)
                        data_chunks = np.split(S, num_chunks)
                        data_chunks = [(data, label, filename) for data in data_chunks] # append the label for each chunk
                        # each list element corresponds to a complete track
                        chunks.extend(data_chunks)
                        

    # returns a list of tuples of shape (data_chunk, label, filename)
    return chunks

In [179]:
gtzan_chunks = get_data_gtzan(path_to_gtzan)
urban_chunks = get_data_urban(path_to_urban)

In [None]:
# Save dictionaries into pickle - GTZAN
feat_gtzan_pickle = open("gtzan_feat", "wb")
pickle.dump(gtzan_chunks, feat_gtzan_pickle)
feat_gtzan_pickle.close()

# Save dictionaries into pickle - URBANSOUND
feat_urban_pickle = open("urban_feat", "wb")
pickle.dump(urban_chunks, feat_urban_pickle)
feat_urban_pickle.close()

In [169]:
# Load from pickle - GTZAN
pickle_in = open("gtzan_feat","rb")
gtzan_chunks = pickle.load(pickle_in)

# Load from pickle - URBAN
pickle_in = open("urban_feat","rb")
urban_chunks = pickle.load(pickle_in)

In [171]:
# Convert to dataFrame
data_gtzan = pd.DataFrame.from_records(gtzan_chunks, columns=["spectrogram", "genre", "filename"])
data_urban = pd.DataFrame.from_records(urban_chunks, columns=["spectrogram", "genre", "filename"])
dtframes = [data_gtzan, data_urban]
raw_data_whole = pd.concat(dtframes)

In [None]:
raw_data = raw_data_whole.drop(columns="filename")

In [None]:
# UNCOMMENT CELL TO SEE/HEAR ONE EXAMPLE FROM EACH DATASET
# # Let's look at the spectrogram of one example
# # We use numpy "concatenate" to join the 1-second chunks back together.

# # Some examples, one per each class
# example_keys = ['fold7/194754-3-0-1', 'blues/blues.00032', 'classical/classical.00010', 
#                 'country/country.00012', 'disco/disco.00027', 'hiphop/hiphop.00020', 'jazz/jazz.00012',
#                 'metal/metal.00045', 'pop/pop.00070', 'reggae/reggae.00033', 'rock/rock.00032']

# examples_2D_list = []

# for i, example in enumerate(example_keys):
#     example_2D = raw_data_whole.loc[raw_data_whole['filename'] == example_keys[i]]
#     example_2D = example_2D.drop(columns = ['filename', 'genre']).to_numpy()
#     example_2D = np.concatenate(np.concatenate(example_2D)).T
#     examples_2D_list.append(example_2D)


# # Plotting side-by-side to visually inspect mel spectograms
# fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 12))
# axes[0,0].imshow(examples_2D_list[1], aspect='auto', origin='lower')
# axes[0,0].set_title(example_keys[1])
# axes[0,0].set_ylabel('Mel bins')
# axes[0,0].set_xlabel('Time frames')
# axes[0,1].imshow(examples_2D_list[7], aspect='auto', origin='lower')
# axes[0,1].set_title(example_keys[7])
# axes[0,1].set_ylabel('Mel bins')
# axes[0,1].set_xlabel('Time frames')
# axes[1,0].imshow(examples_2D_list[0], aspect='auto', origin='lower')
# axes[1,0].set_title('genre unknown '+ example_keys[0])
# axes[1,0].set_ylabel('Mel bins')
# axes[1,0].set_xlabel('Time frames')
# axes[1,1].imshow(examples_2D_list[2], aspect='auto', origin='lower')
# axes[1,1].set_title(example_keys[2])
# axes[1,1].set_ylabel('Mel bins')
# axes[1,1].set_xlabel('Time frames')
# fig.tight_layout()


# # Aurally inspect the files
# print(example_keys[1])
# ipd.display(ipd.Audio(librosa.load("%s%s.au" % (path_to_gtzan, example_keys[1]))[0], 
#           rate=librosa.load("%s%s.au" % (path_to_gtzan, example_keys[1]))[1]))

# print(example_keys[7])
# ipd.display(ipd.Audio(librosa.load("%s%s.au" % (path_to_gtzan, example_keys[7]))[0], 
#           rate=librosa.load("%s%s.au" % (path_to_gtzan, example_keys[7]))[1]))

# print("genre unknown" , example_keys[0])
# ipd.display(ipd.Audio(librosa.load("%s%s.wav" % (path_to_urban, example_keys[0]))[0], 
#           rate=librosa.load("%s%s.wav" % (path_to_urban, example_keys[0]))[1]))

# print(example_keys[2])
# ipd.display(ipd.Audio(librosa.load("%s%s.au" % (path_to_gtzan, example_keys[2]))[0], 
#           rate=librosa.load("%s%s.au" % (path_to_gtzan, example_keys[2]))[1]))

In [None]:
# Split data (70% for training, 20% for validation, 10% for testing)
train_data, val_data, test_data = [], [], []

# To mantain balance between genres/labels
for genre in GENRES:
    genre_df = raw_data[raw_data['genre'] == genre]
    # For each genre, append 700 values of spectograms and labels as training data
    train_data.append(genre_df.iloc[:700].values)
    # For each genre, append 200 values of spectograms and labels as validation data
    val_data.append(genre_df.iloc[700:900].values)
    # For each genre, append 100 values of spectograms and labels as test data
    test_data.append(genre_df.iloc[900:].values)

In [181]:
# Shuffles train, validation and test data
train_data = shuffle([record for genre_records in train_data for record in genre_records])
val_data = shuffle([record for genre_records in val_data for record in genre_records])
test_data = shuffle([record for genre_records in test_data for record in genre_records])

In [182]:
# Converts train, validation and test data to dataFrame
train_data = pd.DataFrame.from_records(train_data,  columns=['spectrogram', 'genre']) 
val_data = pd.DataFrame.from_records(val_data,  columns=['spectrogram', 'genre']) 
test_data = pd.DataFrame.from_records(test_data,  columns=['spectrogram', 'genre']) 

In [184]:
le = LabelEncoder()

# Creates x and y for training
x_train = np.stack(train_data['spectrogram'].values)
x_train = np.reshape(x_train, (x_train.shape[0], 1, x_train.shape[1], x_train.shape[2]))
y_train = np.stack(train_data['genre'].values)
y_train = le.fit(GENRES).transform(y_train)
# Creates x and y for validation
x_val = np.stack(val_data['spectrogram'].values)
x_val = np.reshape(x_val, (x_val.shape[0], 1, x_val.shape[1], x_val.shape[2]))
y_val = np.stack(val_data['genre'].values)
y_val = le.fit(GENRES).transform(y_val)
# Creates x and y for testing
x_test = np.stack(test_data['spectrogram'].values)
x_test = np.reshape(x_test, (x_test.shape[0], 1, x_test.shape[1], x_test.shape[2]))
y_test = np.stack(test_data['genre'].values)
y_test = le.fit(GENRES).transform(y_test)

In [8]:
# Class for network architecture
class classNet(Module):

    def __init__(self):
        super(classNet, self).__init__()

        self.conv1 = Conv2d(in_channels=1,     out_channels=64,    kernel_size=3,  stride=1,   padding=1)
        torch.nn.init.normal_(self.conv1.weight)
        self.bn1 = BatchNorm2d(64)
        self.pool1 = MaxPool2d(kernel_size=2)

        self.conv2 = Conv2d(in_channels=64, out_channels=128,      kernel_size=3,  stride=1,   padding=1)
        torch.nn.init.normal_(self.conv2.weight)
        self.bn2 = BatchNorm2d(128)
        self.pool2 = MaxPool2d(kernel_size=2)

        self.conv3 = Conv2d(in_channels=128, out_channels=256,      kernel_size=3,  stride=1,   padding=1)
        torch.nn.init.normal_(self.conv3.weight)
        self.bn3 = BatchNorm2d(256)
        self.pool3 = MaxPool2d(kernel_size=4)

        self.conv4 = Conv2d(in_channels=256, out_channels=512,      kernel_size=3,  stride=1,   padding=1)
        torch.nn.init.normal_(self.conv4.weight)
        self.bn4 = BatchNorm2d(512)
        self.pool4 = MaxPool2d(kernel_size=2)

        self.conv5 = Conv2d(in_channels=512, out_channels=512,      kernel_size=3,  stride=1,   padding=1)
        torch.nn.init.normal_(self.conv4.weight)
        self.bn5 = BatchNorm2d(512)
        self.pool5 = MaxPool2d(kernel_size=2)
        
        self.fc1 = Linear(in_features=2048,  out_features=1000)
        self.drop1 = Dropout(0.5)

        self.fc2 = Linear(in_features=1000,   out_features=11)

    def forward(self, inp):
        x = F.relu(self.bn1(self.conv1(inp)))
        x = self.pool1(x)

        x = F.relu(self.bn2(self.conv2(x)))
        x = self.pool2(x)

        x = F.relu(self.bn3(self.conv3(x)))
        x = self.pool3(x)

        x = F.relu(self.bn4(self.conv4(x)))
        x = self.pool4(x)
        
        x = F.relu(self.bn5(self.conv5(x)))
        x = self.pool5(x)
        
        x = x.view(x.size()[0], -1)
        x = F.relu(self.fc1(x))
        x = self.drop1(x)

        x = F.log_softmax(self.fc2(x))
        
        return x

In [None]:
# Create object from class classNet()
net = classNet()
# Use CUDA
net.cuda()

In [285]:
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

In [71]:
EPOCH_NUM = 100
BATCH_SIZE = 16

TRAIN_SIZE = len(x_train)
VALID_SIZE = len(x_val)
TEST_SIZE = len(x_test)

In [None]:
# Variables for plotting losses and accuracies
acc_train_values = []
loss_train_values = []
acc_val_values = []
loss_val_values = []

# To count time elapsed between epochs
t0 = time.time()

# Training and validation loop 
for epoch in range(EPOCH_NUM):
    
    t1 = time.time()
    inp_train = Variable(torch.from_numpy(x_train)).float().cuda()
    out_train = Variable(torch.from_numpy(y_train)).long().cuda()
    inp_valid = Variable(torch.from_numpy(x_val)).float().cuda()
    out_valid = Variable(torch.from_numpy(y_val)).long().cuda()

    # Training phase
    train_loss = 0
    optimizer.zero_grad() 
    
    # Calculating loss
    for i in range(0, TRAIN_SIZE, BATCH_SIZE):
        x_train_batch = inp_train[i:i + BATCH_SIZE] 
        y_train_batch = out_train[i:i + BATCH_SIZE]

        pred_train_batch = net(x_train_batch)
        loss_train_batch = criterion(pred_train_batch, y_train_batch)
        train_loss += loss_train_batch.data.cpu().numpy()

        loss_train_batch.backward()
        
    optimizer.step()  

    epoch_train_loss = (train_loss * BATCH_SIZE) / TRAIN_SIZE
    loss_train_values.append(epoch_train_loss)
    
    train_sum = 0
    # Calculating accuracy
    for i in range(0, TRAIN_SIZE, BATCH_SIZE):
        pred_train = net(inp_train[i:i + BATCH_SIZE])
        # Get the argmax
        indices_train = pred_train.max(1)[1]
        train_sum += (indices_train == out_train[i:i + BATCH_SIZE]).sum().data.cpu().numpy()
        
    train_accuracy = train_sum / float(TRAIN_SIZE)
    acc_train_values.append(train_accuracy)
    
    
    
    # Validation phase
    valid_loss = 0
    # Calculating the loss
    for i in range(0, VALID_SIZE, BATCH_SIZE):
        x_valid_batch = inp_valid[i:i + BATCH_SIZE]  
        y_valid_batch = out_valid[i:i + BATCH_SIZE]

        pred_valid_batch = net(x_valid_batch)
        loss_valid_batch = criterion(pred_valid_batch, y_valid_batch)
        valid_loss += loss_valid_batch.data.cpu().numpy()

    epoch_valid_loss = (valid_loss * BATCH_SIZE) / VALID_SIZE
    loss_val_values.append(epoch_valid_loss)
    
    valid_sum = 0
    # Calculating accuracy
    for i in range(0, VALID_SIZE, BATCH_SIZE):
        pred_valid = net(inp_valid[i:i + BATCH_SIZE])
        # Get the argmax
        indices_valid = pred_valid.max(1)[1]
        valid_sum += (indices_valid == out_valid[i:i + BATCH_SIZE]).sum().data.cpu().numpy()
        
    valid_accuracy = valid_sum / float(VALID_SIZE)
    acc_val_values.append(valid_accuracy)

    print("Epoch: %d\tTrain loss : %.2f\tValid loss : %.2f\tTrain acc : %.2f\tValid acc : %.2f" % \
          (epoch + 1, epoch_train_loss, epoch_valid_loss, train_accuracy, valid_accuracy))
    print('Epoch time: {:.3f} seconds'.format(time.time() - t1))
    
print('TOTAL TIME: {:.3f} seconds'.format(time.time() - t0))  

In [None]:
# UNCOMMENT CELL TO PLOT TRAINING/VALIDATION LOSS AND ACCURACY
# # Plotting side-by-side both accuracy and loss, in training and validation set

# # For 150 epochs
# fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(13, 5))
# axes[0].plot(loss_val_values,label="Validation")
# axes[0].plot(loss_train_values, label='Training')
# axes[0].set_xlabel('Epochs')
# axes[0].set_ylabel('Loss')
# axes[0].legend()
# axes[0].set_title('Loss: (validation set loss = %.2f)' % loss_val_values[-1])

# axes[1].plot(acc_val_values,label="Validation")
# axes[1].plot(acc_train_values, label='Training')
# axes[1].set_xlabel('Epochs')
# axes[1].set_ylabel('Accuracy')
# axes[1].legend()
# axes[1].set_title('Accuracy: (validation set accuracy = %.2f)' % acc_val_values[-1])

# fig.tight_layout()

In [None]:
# Save the model
cwd = os.getcwd()
torch.save(net.state_dict(), cwd+'/classGTZANURBAN.pt')
print('-> ptorch model is saved.')

In [10]:
cwd = os.getcwd()
model = classNet()
model.cuda()
model.load_state_dict(torch.load(cwd+'/classGTZANURBAN.pt'))

<All keys matched successfully>

In [None]:
# Evaluate accuracy on the test set
inp_test = Variable(torch.from_numpy(x_test)).float().cuda() 
out_test = Variable(torch.from_numpy(y_test)).long().cuda() 

test_sum = 0
predictions = []

for i in range(0, TEST_SIZE, BATCH_SIZE):
    pred_test = model(inp_test[i:i + BATCH_SIZE])
    # Get the argmax
    indices_test = pred_test.max(1)[1]
    test_sum += (indices_test == out_test[i:i + BATCH_SIZE]).sum().data.cpu().numpy()
    # To plot the confusion matrix
    predictions.append(indices_test.cpu().numpy()) 
test_accuracy = test_sum / float(TEST_SIZE)
print("Test acc: %.2f" % test_accuracy)

In [None]:
# Shaping predictions to use when plotting the confusion matrix
predictions = np.asarray(predictions)
predictions.flatten
predictions.shape

In [105]:
# Function to plot the confusion matrix
# From: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
cm = confusion_matrix(predictions, y_test)

In [None]:
# UNCOMMENT TO PLOT NORMALIZED CONFUSION MATRIX
# plt.figure(figsize=(10,10))
# plot_confusion_matrix(cm, GENRES, normalize=True)

In [None]:
# UNCOMMENT TO PLOT UNNORMALIZED CONFUSION MATRIX
# plt.figure(figsize=(10,10))
# plot_confusion_matrix(cm, GENRES, normalize=False)

## Performance evaluation of networkTwo with predictions from networkOne

In [174]:
# Load output of networkOne from a pickle
pickle_in = open("music_keys.pickle","rb")
music_keys = pickle.load(pickle_in)

In [175]:
# Filter raw_data_whole to get the correspondent music keys from network
music_keys_data = pd.DataFrame()
df = pd.DataFrame()
df2 = []
for key in music_keys:
    df = raw_data_whole[raw_data_whole['filename'] == key]
    df2.append(df)
    music_keys_data = pd.concat(df2)
    
music_keys_data = music_keys_data.drop(columns="filename")

In [177]:
le = LabelEncoder()

# Create x and y for testing
x_data = np.stack(music_keys_data['spectrogram'].values)
x_data = np.reshape(x_data, (x_data.shape[0], 1, x_data.shape[1], x_data.shape[2]))
y_data = np.stack(music_keys_data['genre'].values)
y_data = le.fit(GENRES).transform(y_data)

In [None]:
# Evaluate accuracy on the predictions from networkOne

TEST_SIZE = len(x_data)

inp_test = Variable(torch.from_numpy(x_data)).float().cuda()
out_test = Variable(torch.from_numpy(y_data)).long().cuda()

test_sum = 0
predictions = []

for i in range(0, TEST_SIZE, BATCH_SIZE):
    pred_test       = model(inp_test[i:i + BATCH_SIZE])
    # Get the argmax
    indices_test    = pred_test.max(1)[1]
    test_sum        += (indices_test == out_test[i:i + BATCH_SIZE]).sum().data.cpu().numpy()
    # To plot the confusion matrix
    predictions.append(indices_test.cpu().numpy())
test_accuracy = test_sum / float(TEST_SIZE)
print("Test acc: %.2f" % test_accuracy)

In [None]:
# Shaping predictions to use when plotting the confusion matrix
pred = np.asarray(predictions)
pred = np.concatenate(pred)

In [103]:
cm = confusion_matrix(pred, y_data)

In [None]:
# UNCOMMENT TO PLOT NORMALIZED CONFUSION MATRIX
# plt.figure(figsize=(10,10))
# plot_confusion_matrix(cm, GENRES, normalize=True)

## Performance evaluation of networkTwo on test audio file

In [25]:
# Audio file test case
audiotest_path = '/homes/pps30/venvs/venv_dl4am_a1/dl4am/assignment/Waving_OH.wav'
# Offset and duration were chosen to retrieve a snippet that contains a 'metal' section and 'non-metal' one
y, sr = librosa.load(audiotest_path, offset=96,duration=30, mono=True)

ipd.Audio(y, rate=sr)

In [5]:
S_audio = librosa.feature.melspectrogram(y, sr).T # shape = num frames x num mels
# Convert it from amplitude squared to dB for easier visualization
S_audio = librosa.power_to_db(S_audio, ref=np.max)
# discard last (S.shape[0] % 128) elements of the spectrogram
# so that the dimensionality is a multiple of num mels
# so that we can split the spectrogram in chunks of dimensionality 128
S_audio = S_audio[:-1 * (S_audio.shape[0] % 128)] 
# number of chunks = 10
# for a 30 sec long audio track (3sec chunks)
# (with default librosa settings for the spectrogram calculation)
num_chunks_audio  = S_audio.shape[0] / 128
# split spectrogram into chunks
data_chunks = np.split(S_audio, num_chunks_audio)

In [6]:
data_chunks = [(data, 'metal') for data in data_chunks]

In [7]:
data_audiotest = pd.DataFrame.from_records(data_chunks, columns=["spectrogram", "genre"])

In [16]:
le = LabelEncoder()

# Create x and y for testing
x_audiotest = np.stack(data_audiotest['spectrogram'].values)
x_audiotest = np.reshape(x_audiotest, (x_audiotest.shape[0], 1, x_audiotest.shape[1], x_audiotest.shape[2]))

y_audiotest = np.stack(data_audiotest['genre'].values)
y_audiotest = le.fit(GENRES).transform(y_audiotest)

In [22]:
# EVALUATE TEST ACCURACY

TEST_SIZE = len(x_audiotest)
BATCH_SIZE = 16

inp_test = Variable(torch.from_numpy(x_audiotest)).float().cuda()
out_test = Variable(torch.from_numpy(y_audiotest)).long().cuda()

test_sum = 0
predictions = []

for i in range(0, TEST_SIZE, BATCH_SIZE):
    pred_test       = model(inp_test[i:i + BATCH_SIZE])
    # Get the argmax
    indices_test    = pred_test.max(1)[1]
    test_sum        += (indices_test == out_test[i:i + BATCH_SIZE]).sum().data.cpu().numpy()
    # To plot the confusion matrix
    predictions.append(indices_test.cpu().numpy())
test_accuracy   = test_sum / float(TEST_SIZE)
print("Test acc: %.2f" % test_accuracy)

Test acc: 0.60




In [23]:
predictions

[array([6, 6, 6, 6, 6, 6, 0, 0, 8, 8])]

In [24]:
list(le.inverse_transform(predictions[0]))

['metal',
 'metal',
 'metal',
 'metal',
 'metal',
 'metal',
 'blues',
 'blues',
 'reggae',
 'reggae']