# CSCI 0451: Deep Music Genre Classification


### Introduction

This blog post explores classifying music genre from song lyrics and engineered features using PyTorch. I implement three neural networks and train them on the same dataset. The first network uses only the lyrics, the second uses only the engineered features, and the third uses both. I then compare the performance of the three networks and discuss the results.

In [18]:
import pandas as pd

url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/tcc_ceds_music.csv"
df = pd.read_csv(url)

In [2]:
df.head()

Unnamed: 0.1,Unnamed: 0,artist_name,track_name,release_date,genre,lyrics,len,dating,violence,world/life,...,sadness,feelings,danceability,loudness,acousticness,instrumentalness,valence,energy,topic,age
0,0,mukesh,mohabbat bhi jhoothi,1950,pop,hold time feel break feel untrue convince spea...,95,0.000598,0.063746,0.000598,...,0.380299,0.117175,0.357739,0.454119,0.997992,0.901822,0.339448,0.13711,sadness,1.0
1,4,frankie laine,i believe,1950,pop,believe drop rain fall grow believe darkest ni...,51,0.035537,0.096777,0.443435,...,0.001284,0.001284,0.331745,0.64754,0.954819,2e-06,0.325021,0.26324,world/life,1.0
2,6,johnnie ray,cry,1950,pop,sweetheart send letter goodbye secret feel bet...,24,0.00277,0.00277,0.00277,...,0.00277,0.225422,0.456298,0.585288,0.840361,0.0,0.351814,0.139112,music,1.0
3,10,pérez prado,patricia,1950,pop,kiss lips want stroll charm mambo chacha merin...,54,0.048249,0.001548,0.001548,...,0.225889,0.001548,0.686992,0.744404,0.083935,0.199393,0.77535,0.743736,romantic,1.0
4,12,giorgos papadopoulos,apopse eida oneiro,1950,pop,till darling till matter know till dream live ...,48,0.00135,0.00135,0.417772,...,0.0688,0.00135,0.291671,0.646489,0.975904,0.000246,0.597073,0.394375,romantic,1.0


In [19]:
engineered_features = ['dating', 'violence', 'world/life', 'night/time','shake the audience','family/gospel', 'romantic', 'communication','obscene', 'music', 'movement/places', 'light/visual perceptions','family/spiritual', 'like/girls', 'sadness', 'feelings', 'danceability','loudness', 'acousticness', 'instrumentalness', 'valence', 'energy']

In [24]:
# transform target variable, "genre", into a categorical variable
genres = {'pop': 0, 'country': 1, 'blues': 2, 'jazz': 3, 'reggae': 4, 'rock': 5, 'hip hop': 6}
df["genre"] = df["genre"].apply(genres.get)

df.head()

Unnamed: 0,dating,violence,world/life,night/time,shake the audience,family/gospel,romantic,communication,obscene,music,...,sadness,feelings,danceability,loudness,acousticness,instrumentalness,valence,energy,lyrics,genre
0,0.000598,0.063746,0.000598,0.000598,0.000598,0.048857,0.017104,0.263751,0.000598,0.039288,...,0.380299,0.117175,0.357739,0.454119,0.997992,0.901822,0.339448,0.13711,hold time feel break feel untrue convince spea...,0
1,0.035537,0.096777,0.443435,0.001284,0.001284,0.027007,0.001284,0.001284,0.001284,0.118034,...,0.001284,0.001284,0.331745,0.64754,0.954819,2e-06,0.325021,0.26324,believe drop rain fall grow believe darkest ni...,0
2,0.00277,0.00277,0.00277,0.00277,0.00277,0.00277,0.158564,0.250668,0.00277,0.323794,...,0.00277,0.225422,0.456298,0.585288,0.840361,0.0,0.351814,0.139112,sweetheart send letter goodbye secret feel bet...,0
3,0.048249,0.001548,0.001548,0.001548,0.0215,0.001548,0.411536,0.001548,0.001548,0.001548,...,0.225889,0.001548,0.686992,0.744404,0.083935,0.199393,0.77535,0.743736,kiss lips want stroll charm mambo chacha merin...,0
4,0.00135,0.00135,0.417772,0.00135,0.00135,0.00135,0.46343,0.00135,0.00135,0.00135,...,0.0688,0.00135,0.291671,0.646489,0.975904,0.000246,0.597073,0.394375,till darling till matter know till dream live ...,0


In [28]:
# create a new dataframe with only the engineered features, lyrics and the target
df2 = df[["lyrics", "genre"] + engineered_features]
df2.head()

Unnamed: 0,lyrics,genre,dating,violence,world/life,night/time,shake the audience,family/gospel,romantic,communication,...,family/spiritual,like/girls,sadness,feelings,danceability,loudness,acousticness,instrumentalness,valence,energy
0,hold time feel break feel untrue convince spea...,0,0.000598,0.063746,0.000598,0.000598,0.000598,0.048857,0.017104,0.263751,...,0.000598,0.000598,0.380299,0.117175,0.357739,0.454119,0.997992,0.901822,0.339448,0.13711
1,believe drop rain fall grow believe darkest ni...,0,0.035537,0.096777,0.443435,0.001284,0.001284,0.027007,0.001284,0.001284,...,0.051124,0.001284,0.001284,0.001284,0.331745,0.64754,0.954819,2e-06,0.325021,0.26324
2,sweetheart send letter goodbye secret feel bet...,0,0.00277,0.00277,0.00277,0.00277,0.00277,0.00277,0.158564,0.250668,...,0.00277,0.00277,0.00277,0.225422,0.456298,0.585288,0.840361,0.0,0.351814,0.139112
3,kiss lips want stroll charm mambo chacha merin...,0,0.048249,0.001548,0.001548,0.001548,0.0215,0.001548,0.411536,0.001548,...,0.001548,0.081132,0.225889,0.001548,0.686992,0.744404,0.083935,0.199393,0.77535,0.743736
4,till darling till matter know till dream live ...,0,0.00135,0.00135,0.417772,0.00135,0.00135,0.00135,0.46343,0.00135,...,0.029755,0.00135,0.0688,0.00135,0.291671,0.646489,0.975904,0.000246,0.597073,0.394375


In [30]:
# explore the "genre" data distribution
df2["genre"].value_counts()

genre
0    7042
1    5445
2    4604
5    4034
3    3845
4    2498
6     904
Name: count, dtype: int64
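
These counts also give the base rate that any model has to beat. Here is a small sketch of that arithmetic, with the counts copied by hand from the output above and mapped back to genre names through the `genres` dictionary (the variable names are mine, for illustration only):

```python
# Class counts copied from the value_counts() output above.
genre_counts = {"pop": 7042, "country": 5445, "blues": 4604, "rock": 4034,
                "jazz": 3845, "reggae": 2498, "hip hop": 904}

total = sum(genre_counts.values())
shares = {genre: count / total for genre, count in genre_counts.items()}

# A classifier that always predicts the most common genre scores this accuracy.
majority_genre = max(genre_counts, key=genre_counts.get)
base_rate = shares[majority_genre]

for genre, share in shares.items():
    print(f"{genre:8s} {share:.1%}")
print(f"base rate ({majority_genre}): {base_rate:.1%}")
```

The classes are fairly imbalanced, so accuracy alone can look better than it is: always guessing the largest class is already right about a quarter of the time.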

### Dataloader

In [31]:
# dataloader
from torch.utils.data import Dataset, DataLoader

class LyricsDataset(Dataset):
    def __init__(self, df):
        self.df = df
        
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        lyrics = row["lyrics"]
        genre = row["genre"]
        features = row[engineered_features]
        return {
            "lyrics": lyrics,
            "features": features,
            "genre": genre
        }
    
    def get_labels(self):
        return self.df["genre"].values

In [46]:
# train test split
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df2, test_size = 0.2, random_state = 123, shuffle=True)

train_df.shape, test_df.shape

train_dataset = LyricsDataset(train_df)
val_dataset = LyricsDataset(test_df)

In [47]:
# test the dataloader
train_dataset[0]

{'lyrics': 'cause leave heartbreaker hurt cause leave heartbreaker hurt help cause lonely need somebody want somebody cause leave heartbreaker hurt cause leave heartbreaker hurt know die know cry go life go life cause leave heartbreaker hurt cause leave heartbreaker hurt flash cause leave heartbreaker hurt cause leave heartbreaker hurt gotta gonna know deep inside treat good cause leave heartbreaker hurt cause leave heartbreaker hurt',
 'features': dating                      0.000975
 violence                    0.000975
 world/life                  0.044549
 night/time                  0.000975
 shake the audience          0.000975
 family/gospel               0.000975
 romantic                    0.000975
 communication               0.159908
 obscene                     0.071244
 music                       0.000975
 movement/places             0.000975
 light/visual perceptions    0.000975
 family/spiritual            0.000975
 like/girls                  0.000975
 sadness        

In [62]:
# training loop function for all three models
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score
import numpy as np

def train_loop(model, train_loader, val_loader, loss_fn, optimizer, device, epochs=2):
    for epoch in range(epochs):
        model.train()
        for i, data in enumerate(train_loader):
            lyrics = data[0]
            features = data[1]
            genre = data[2]
            
            lyrics = lyrics.to(device)
            features = features.to(device)
            genre = genre.to(device)
            
            optimizer.zero_grad()
            outputs = model(lyrics, features)
            loss = loss_fn(outputs, genre)
            loss.backward()
            optimizer.step()
            
            if i % 100 == 0:
                print(f"Epoch {epoch}, iter {i}, loss: {loss.item()}")
                
        val_loss, val_acc = evaluate(model, val_loader, loss_fn, device)
        print(f"Validation loss: {val_loss}, Validation accuracy: {val_acc}")

def evaluate(model, val_loader, loss_fn, device):
    model.eval()
    val_loss = 0  # running sum of per-batch losses (a total, not an average)
    all_preds = []
    all_targets = []
    with torch.no_grad():
        for data in val_loader:
            lyrics = data[0]
            features = data[1]
            genre = data[2]
            
            lyrics = lyrics.to(device)
            features = features.to(device)
            genre = genre.to(device)
            
            outputs = model(lyrics, features)
            loss = loss_fn(outputs, genre)
            val_loss += loss.item()
            
            preds = torch.argmax(outputs, dim=1)
            all_preds.append(preds.cpu().numpy())
            all_targets.append(genre.cpu().numpy())
            
    all_preds = np.concatenate(all_preds)
    all_targets = np.concatenate(all_targets)
    acc = accuracy_score(all_targets, all_preds)
    return val_loss, acc


In [39]:
%conda install -c pytorch torchtext -y

The following NEW packages will be INSTALLED:

  torchtext          pytorch/noarch::torchtext-0.6.0-py_1 




In [57]:
# tokenization
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import Vocab
from collections import Counter

tokenizer = get_tokenizer("basic_english")
counter = Counter()
for line in train_df["lyrics"]:
    counter.update(tokenizer(line))
vocab = Vocab(counter, specials=("<unk>", "<pad>"))

# collate function
from torch.nn.utils.rnn import pad_sequence

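def collate_batch(batch):
    # convert each song's lyrics to a tensor of vocabulary indices
    lyrics = [torch.tensor([vocab[token] for token in tokenizer(item["lyrics"])]) for item in batch]
    # right-pad to the longest song in the batch; the default padding value is 0
    lyrics = pad_sequence(lyrics, batch_first=True)
    # stack the engineered features into a (batch, n_features) float tensor
    features = torch.stack([torch.tensor(item["features"].values.astype('float32')) for item in batch])
    genre = torch.tensor([item["genre"] for item in batch])
    return lyrics, features, genre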

# data
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=collate_batch)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False, collate_fn=collate_batch)

next(iter(train_loader))

(tensor([[  169,     8,   121,  ...,     0,     0,     0],
         [   11,    11,    11,  ...,     0,     0,     0],
         [   76,    47,     2,  ...,     0,     0,     0],
         ...,
         [ 1189,    13,    52,  ...,     0,     0,     0],
         [   18,   796, 39065,  ...,     0,     0,     0],
         [  146,   356,   395,  ...,     0,     0,     0]]),
 tensor([[2.0243e-03, 4.6580e-02, 7.6299e-01, 2.0243e-03, 2.0243e-03, 2.0243e-03,
          2.0243e-03, 7.0791e-02, 2.0243e-03, 2.0243e-03, 2.0243e-03, 2.0243e-03,
          2.0243e-03, 2.0243e-03, 5.0386e-02, 2.0243e-03, 3.6315e-01, 5.2188e-01,
          2.7410e-01, 1.0931e-04, 1.7045e-01, 3.6835e-01],
         [3.0903e-02, 9.9645e-02, 6.1766e-01, 7.3099e-04, 1.1184e-01, 1.0164e-01,
          7.3099e-04, 7.3099e-04, 7.3099e-04, 7.3099e-04, 7.3099e-04, 7.3099e-04,
          1.2789e-02, 1.6749e-02, 7.3099e-04, 7.3099e-04, 5.7002e-01, 5.1824e-01,
          3.9458e-01, 6.5789e-03, 8.4336e-01, 5.4553e-01],
         [7.1634e-02
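
The trailing zeros in the lyrics tensor are padding. `pad_sequence` pads with 0 unless a different `padding_value` is passed, so padded positions share an index with whichever token the vocabulary maps to 0. The sketch below is a pure-Python stand-in for that padding step, not the PyTorch implementation. The token ids are made up, and the idea that `<pad>` sits at index 1 is an assumption I have not verified against the vocabulary:

```python
def pad_batch(sequences, pad_id):
    """Right-pad each list of token ids to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

# Hypothetical token ids, not taken from the real vocabulary.
batch = [[169, 8, 121, 40], [11, 11], [76]]

print(pad_batch(batch, pad_id=0))  # mimics the default padding value of 0
print(pad_batch(batch, pad_id=1))  # padding with a dedicated <pad> index (assumed to be 1)
```

If padding and a real token end up sharing an index, the embedding layer cannot tell them apart, which may matter for short songs in a batch with one very long song.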

In [60]:
# network 1: genre classification using only the lyrics
from torch import nn

class LyricsModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout):
        super(LyricsModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers = n_layers, 
                           bidirectional = bidirectional, 
                           dropout = dropout,
                           batch_first = True)
        self.fc = nn.Linear(hidden_dim*2, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, lyrics, features):
        embedded = self.embedding(lyrics)
        output, (hidden, cell) = self.rnn(embedded)
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
        return self.fc(hidden)
    
# hyperparameters
vocab_size = len(vocab)
embedding_dim = 4
hidden_dim = 32
output_dim = 7
n_layers = 2
bidirectional = True
dropout = 0.5


In [63]:
# training
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = LyricsModel(vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.1)

train_loop(model, train_loader, val_loader, loss_fn, optimizer, device, epochs=1)

Epoch 0, iter 0, loss: 1.9052387475967407
Epoch 0, iter 100, loss: 1.8855047225952148
Epoch 0, iter 200, loss: 1.9408767223358154
Epoch 0, iter 300, loss: 1.7533650398254395
Epoch 0, iter 400, loss: 1.784616470336914
Epoch 0, iter 500, loss: 1.7031092643737793
Epoch 0, iter 600, loss: 1.9928700923919678
Epoch 0, iter 700, loss: 1.7837918996810913
Validation loss: 326.86677515506744, Validation accuracy: 0.2562114537444934


Huh. That's pretty bad accuracy.
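
The validation loss looks huge next to the training losses, but `evaluate` adds up one loss value per batch, so 326.87 is a total rather than an average. Here is a rough sketch of the conversion, assuming the validation set holds 5,675 songs (20% of 28,372, rounded up) and the batch size of 32 used above. Since the last batch is smaller than the rest, the result is only approximately the per-song loss:

```python
import math

summed_val_loss = 326.86677515506744  # printed by train_loop for the lyrics model
n_val_songs = 5675                    # assumption: 20% of 28,372 songs, rounded up
batch_size = 32

# evaluate() adds one loss value per batch, so divide by the number of batches.
n_batches = math.ceil(n_val_songs / batch_size)
mean_val_loss = summed_val_loss / n_batches

# Cross-entropy of a uniform guess over 7 genres, for comparison.
chance_loss = math.log(7)

print(f"batches: {n_batches}")
print(f"mean validation loss per batch: {mean_val_loss:.3f}")
print(f"uniform-guess loss: {chance_loss:.3f}")
```

On this scale the validation loss is in the same range as the training losses, and only a little below what a uniform guess over the seven genres would give.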

In [66]:
# network 2: genre classification using only the engineered features
class FeaturesModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, dropout):
        super(FeaturesModel, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, lyrics, features):
        x = self.dropout(torch.relu(self.fc1(features)))
        return self.fc2(x)
    
# hyperparameters
input_dim = len(engineered_features)
hidden_dim = 32
output_dim = 7
dropout = 0.5

In [67]:
# training
model = FeaturesModel(input_dim, hidden_dim, output_dim, dropout).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.1)

train_loop(model, train_loader, val_loader, loss_fn, optimizer, device, epochs=1)

Epoch 0, iter 0, loss: 1.9866634607315063
Epoch 0, iter 100, loss: 1.8374310731887817
Epoch 0, iter 200, loss: 1.691967248916626
Epoch 0, iter 300, loss: 1.899735689163208
Epoch 0, iter 400, loss: 1.7824122905731201
Epoch 0, iter 500, loss: 2.0268425941467285
Epoch 0, iter 600, loss: 1.9030365943908691
Epoch 0, iter 700, loss: 1.843409538269043
Validation loss: 320.2576390504837, Validation accuracy: 0.2659030837004405


Well, I suppose an accuracy of 26.6% is better than 25.6%, so we have an improvement of about one percentage point there. I think I am limited by the capabilities of my poor tablet...

In [68]:
# network 3: genre classification using both the lyrics and the engineered features
class CombinedModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout, input_dim):
        super(CombinedModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers = n_layers, 
                           bidirectional = bidirectional, 
                           dropout = dropout,
                           batch_first = True)
        self.fc1 = nn.Linear(hidden_dim*2 + input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, lyrics, features):
        embedded = self.embedding(lyrics)
        output, (hidden, cell) = self.rnn(embedded)
        hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)
        combined = torch.cat((hidden, features), dim = 1)
        x = self.dropout(torch.relu(self.fc1(combined)))
        return self.fc2(x)
    
# hyperparameters
vocab_size = len(vocab)
embedding_dim = 4
hidden_dim = 32
output_dim = 7
n_layers = 2
bidirectional = True
dropout = 0.5
input_dim = len(engineered_features)

In [69]:
# training
model = CombinedModel(vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout, input_dim).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.1)

train_loop(model, train_loader, val_loader, loss_fn, optimizer, device, epochs=1)


Epoch 0, iter 0, loss: 1.914614200592041
Epoch 0, iter 100, loss: 2.0118398666381836
Epoch 0, iter 200, loss: 1.968980312347412
Epoch 0, iter 300, loss: 1.862599492073059
Epoch 0, iter 400, loss: 1.8309868574142456
Epoch 0, iter 500, loss: 1.8668416738510132
Epoch 0, iter 600, loss: 2.001359462738037
Epoch 0, iter 700, loss: 2.0670621395111084
Validation loss: 327.63753485679626, Validation accuracy: 0.2562114537444934


The accuracy of the combined model has dropped back to about 25.6%, the same as the lyrics-only model, which is a bit disappointing. However, I am impressed that, with so little hyperparameter tuning, the model was able to run at all.

I think the reason this model is not performing well is that it is barely learning: the training loss stays between about 1.8 and 2.1 for the whole epoch, close to ln 7 ≈ 1.95, the loss of a uniform guess over seven genres. With real hyperparameter tuning, I think I could get a much better result from the third model than from either of the first two.

I cannot be too disappointed with the results, as they are slightly above the base rate of always guessing 'pop', the most frequent genre at about 25% of songs (7,042 of 28,372). I think that with more time and resources, I could improve the model to have a higher accuracy.
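
To put the three accuracies on a more concrete footing, they can be converted back into numbers of correctly classified songs. This is a quick sketch, assuming the validation set holds 5,675 songs (20% of 28,372, rounded up); the dictionary keys are just labels I picked:

```python
n_val_songs = 5675  # assumption: 20% of 28,372 songs, rounded up

# Validation accuracies copied from the three training runs above.
val_accuracy = {
    "lyrics only": 0.2562114537444934,
    "features only": 0.2659030837004405,
    "combined": 0.2562114537444934,
}

# Number of validation songs each model classified correctly.
correct = {name: round(acc * n_val_songs) for name, acc in val_accuracy.items()}
for name, n in correct.items():
    print(f"{name:14s} {n} of {n_val_songs}")
```

The lyrics-only and combined models land on exactly the same count. That would be consistent with both predicting the same genre for nearly every song, though it does not prove it, and a confusion matrix would be the way to check.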

### Conclusion

This was a fun foray into the world of music genres. It was interesting to compare this approach with how I classified song lyrics by gender in my Natural Language Processing class. There I was able to get mid-to-high-90s accuracy using BERT and Naive Bayes models on ADA, and I think the gap in accuracy is due to the difference in available resources. ADA is really cool, and I would have loved to use it for this, but I am not sure how to use it with Jupyter notebooks.

One of the interesting things I noticed in both projects is that the hyperparameters are everything: the smallest change can wildly alter accuracy and speed. With my NLP final project, we were only able to avoid running out of working memory with a learning rate of 0.0001, and that's even on ADA! I am reminded of Professor Biester's WiDS talk where she analyzed all of Reddit. I have no idea what kind of computing power was necessary for that.