# Deep Music Classification

### Introduction

Like most of the population, I listen to music every day: when I'm in the shower, when I'm walking to class, or when I want to process a specific emotion. I enjoy a variety of genres, especially pop and country. If you played me a pop or country song, I don't think I'd have trouble telling which of the two it was. Songs within a genre tend to share word choice in their lyrics, similar levels of "upbeat-ness," and similar subject matter.

In this blog post, I will use PyTorch to predict the genre of a song based on the track's lyrics and engineered features.

I will create three neural networks using PyTorch and train them, evaluating each one on unseen validation data.

The first neural network will use only the song lyrics. The second will use engineered features, such as 'family/gospel', 'romantic', and 'obscene', which contain numerical ratings of how strongly each topic applies to the song. The third will use both the lyrics and the engineered features.

Finally, I will investigate the word embeddings learned by my models and consider what biases my model has learned based on songs' content.

### Data Preparation

Load the dataset. It contains 28,372 songs (rows), each described by 31 columns.

In [5]:
import torch
import pandas as pd
from sklearn.model_selection import train_test_split
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/tcc_ceds_music.csv"
df = pd.read_csv(url)

df.head(5)

Unnamed: 0.1,Unnamed: 0,artist_name,track_name,release_date,genre,lyrics,len,dating,violence,world/life,...,sadness,feelings,danceability,loudness,acousticness,instrumentalness,valence,energy,topic,age
0,0,mukesh,mohabbat bhi jhoothi,1950,pop,hold time feel break feel untrue convince spea...,95,0.000598,0.063746,0.000598,...,0.380299,0.117175,0.357739,0.454119,0.997992,0.901822,0.339448,0.13711,sadness,1.0
1,4,frankie laine,i believe,1950,pop,believe drop rain fall grow believe darkest ni...,51,0.035537,0.096777,0.443435,...,0.001284,0.001284,0.331745,0.64754,0.954819,2e-06,0.325021,0.26324,world/life,1.0
2,6,johnnie ray,cry,1950,pop,sweetheart send letter goodbye secret feel bet...,24,0.00277,0.00277,0.00277,...,0.00277,0.225422,0.456298,0.585288,0.840361,0.0,0.351814,0.139112,music,1.0
3,10,pérez prado,patricia,1950,pop,kiss lips want stroll charm mambo chacha merin...,54,0.048249,0.001548,0.001548,...,0.225889,0.001548,0.686992,0.744404,0.083935,0.199393,0.77535,0.743736,romantic,1.0
4,12,giorgos papadopoulos,apopse eida oneiro,1950,pop,till darling till matter know till dream live ...,48,0.00135,0.00135,0.417772,...,0.0688,0.00135,0.291671,0.646489,0.975904,0.000246,0.597073,0.394375,romantic,1.0


There are 7 genres we'll classify into.

In [6]:
# how many genres?
df["genre"].unique()

array(['pop', 'country', 'blues', 'jazz', 'reggae', 'rock', 'hip hop'],
      dtype=object)

Label encode the genres, mapping each genre name to an integer.

In [7]:
# assign label to genre
genres = {
    "pop": 0,
    "country" : 1, 
    "blues": 2,
    "jazz": 3,
    "reggae": 4,
    "rock": 5,
    "hip hop": 6
}
df["genre"] = df["genre"].apply(genres.get)
df.head(5)

Unnamed: 0.1,Unnamed: 0,artist_name,track_name,release_date,genre,lyrics,len,dating,violence,world/life,...,sadness,feelings,danceability,loudness,acousticness,instrumentalness,valence,energy,topic,age
0,0,mukesh,mohabbat bhi jhoothi,1950,0,hold time feel break feel untrue convince spea...,95,0.000598,0.063746,0.000598,...,0.380299,0.117175,0.357739,0.454119,0.997992,0.901822,0.339448,0.13711,sadness,1.0
1,4,frankie laine,i believe,1950,0,believe drop rain fall grow believe darkest ni...,51,0.035537,0.096777,0.443435,...,0.001284,0.001284,0.331745,0.64754,0.954819,2e-06,0.325021,0.26324,world/life,1.0
2,6,johnnie ray,cry,1950,0,sweetheart send letter goodbye secret feel bet...,24,0.00277,0.00277,0.00277,...,0.00277,0.225422,0.456298,0.585288,0.840361,0.0,0.351814,0.139112,music,1.0
3,10,pérez prado,patricia,1950,0,kiss lips want stroll charm mambo chacha merin...,54,0.048249,0.001548,0.001548,...,0.225889,0.001548,0.686992,0.744404,0.083935,0.199393,0.77535,0.743736,romantic,1.0
4,12,giorgos papadopoulos,apopse eida oneiro,1950,0,till darling till matter know till dream live ...,48,0.00135,0.00135,0.417772,...,0.0688,0.00135,0.291671,0.646489,0.975904,0.000246,0.597073,0.394375,romantic,1.0
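
The hand-written dictionary above could also be derived from the list of genre names, together with an inverse mapping for turning predicted integers back into names. Here is a minimal sketch, assuming the genre order returned by df["genre"].unique(); the names genre_names, genre_to_id, and id_to_genre are my own illustrative choices and do not appear in the notebook.

```python
# Genre names in the order returned by df["genre"].unique() above.
genre_names = ["pop", "country", "blues", "jazz", "reggae", "rock", "hip hop"]

# Forward mapping (name -> integer label) and inverse mapping (label -> name).
genre_to_id = {name: i for i, name in enumerate(genre_names)}
id_to_genre = {i: name for name, i in genre_to_id.items()}

print(genre_to_id["rock"])  # integer label used as the training target
print(id_to_genre[6])       # decode a predicted label back to a genre name
```

The inverse mapping is handy later, when a model outputs a class index and we want to read it as a genre.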


What would be the baseline accuracy for our model?

In [8]:
# baseline accuracy
df.groupby("genre").size() / len(df)

genre
0    0.248202
1    0.191915
2    0.162273
3    0.135521
4    0.088045
5    0.142182
6    0.031862
dtype: float64

If our model always predicted pop, the most common genre, it would achieve about 25% accuracy (24.8%). Let's see if we can beat this using neural networks.
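
The baseline is simply the share of the most common class. As a rough sketch of that calculation without pandas, assuming a plain list of integer labels (the majority_baseline helper and the toy labels are hypothetical, not part of the notebook):

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting it)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Toy labels (hypothetical, not the real dataset): 0 = pop, 1 = country, 2 = blues.
toy_labels = [0, 0, 0, 1, 1, 2, 2, 0]
label, acc = majority_baseline(toy_labels)
print(label, acc)
```

Any model worth keeping should beat this number on the validation set, not just on the training set.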

### Neural Network 1: Lyrics

This class lets us retrieve each song's lyrics, genre label, and engineered features from the data frame.

In [9]:
from torch.utils.data import Dataset, DataLoader

class TextDataFromDF(Dataset):
    def __init__(self, df):
        self.df = df

    def __getitem__(self, index):
        # columns are selected by position: 5 = lyrics, 4 = genre label,
        # 6:28 = the 22 numeric columns from 'len' through 'valence'
        return self.df.iloc[index, 5], self.df.iloc[index, 4], self.df.iloc[index, 6:28]

    def __len__(self):
        return len(self.df)

Split the data into training and validation sets.

In [10]:
df_train, df_val = train_test_split(df,shuffle = True, test_size = 0.2)
train_data = TextDataFromDF(df_train)
val_data   = TextDataFromDF(df_val)

Investigate an entry of our training dataset. Each entry has the lyrics, the genre label, and the engineered features.

In [11]:
train_data[3]

('scorn give lie save blue sky turn grey know game play wind rage start build madness card draw help hear sound approach thunder cloud blacken remember warn shelter fierce wind hear echo anger fear roar thunder nightmares confusion come true ace eights fate lightning bolt swords pull gauntlet slash rain survive testify cloud blacken remember warn shelter fierce wind',
 2,
 len                               59
 dating                      0.001462
 violence                    0.346129
 world/life                  0.179856
 night/time                  0.001462
 shake the audience          0.001462
 family/gospel               0.001462
 romantic                    0.001462
 communication               0.001462
 obscene                     0.001462
 music                       0.204009
 movement/places             0.001462
 light/visual perceptions    0.248077
 family/spiritual            0.001462
 like/girls                  0.001462
 sadness                     0.001462
 feelings        

To tokenize the text, we split each song's lyrics into individual words (tokens). Each token will later be assigned an integer value, which allows us to feed the lyrics into a neural network.

In [12]:
tokenizer = get_tokenizer('basic_english')
tokenized = tokenizer(train_data[194][0])
tokenized[0:10]

['stay',
 'mornin',
 'pass',
 'time',
 'somethin',
 'wrong',
 'denyin',
 'changin',
 'maybe',
 'stop']

To build a vocabulary that maps each token to an integer, create a yield_tokens generator that tokenizes every song in the training data.

In [13]:
def yield_tokens(data_iter):
    for text, _label, _features in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_data), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

Let's check out our vocabulary.

In [14]:
vocab.get_itos()[0:10]

['<unk>',
 'know',
 'like',
 'time',
 'come',
 'go',
 'feel',
 'away',
 'heart',
 'yeah']

How does this look as tokens? We see the words as their integer representation.

In [15]:
vocab(tokenized)[0:10]

[46, 791, 191, 3, 388, 91, 8594, 3165, 173, 70]
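
To make the token-to-integer step less of a black box, here is a simplified pure-Python sketch of the idea: count tokens, give the most frequent ones the smallest indices, and send unseen words to the unknown token. This is only an approximation of what the torchtext vocabulary does (its tie-breaking may differ), and build_vocab, encode, and the mini-corpus are hypothetical names and data.

```python
from collections import Counter

def build_vocab(token_lists, unk="<unk>"):
    """Map tokens to integers, most frequent first, with index 0 reserved for unk."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Sort by descending frequency, breaking ties alphabetically.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    itos = [unk] + ordered
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

def encode(tokens, stoi, unk="<unk>"):
    """Look up each token, falling back to the unk index for unseen words."""
    return [stoi.get(tok, stoi[unk]) for tok in tokens]

# Hypothetical mini-corpus standing in for the tokenized training lyrics.
corpus = [["know", "time", "know"], ["time", "know", "heart"]]
stoi, itos = build_vocab(corpus)
print(itos)
print(encode(["know", "heart", "banjo"], stoi))
```

This mirrors why 'know' and 'time' sit near the front of the real vocabulary: they are among the most frequent words in the lyrics.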

Make a text pipeline that converts a lyrics string into a fixed-length tensor of token indices, truncating long songs and padding short ones.

In [16]:
max_len = 30
num_tokens = len(vocab.get_itos())
def text_pipeline(x):
    # tokenize input string x
    tokens = vocab(tokenizer(x))
    # create a tensor of length max_len filled with the padding index (num_tokens)
    y = torch.zeros(max_len, dtype=torch.int64) + num_tokens
    # trim tokens to the first max_len tokens
    if len(tokens) > max_len:
        tokens = tokens[0:max_len]
    # replace first len(tokens) elements of y with tokenized input
    y[0:len(tokens)] = torch.tensor(tokens,dtype=torch.int64)
    return y

label_pipeline = lambda x: int(x)
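
The padding logic is easier to see on plain lists. In this small sketch, which assumes a hypothetical vocabulary of 10 tokens, index 10 plays the role that num_tokens plays above; pad_or_truncate is an illustrative helper of mine, not a function from the notebook.

```python
def pad_or_truncate(token_ids, max_len, pad_id):
    """Return exactly max_len ids: cut long inputs, right-pad short ones."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

pad_id = 10  # hypothetical vocabulary of 10 tokens, so index 10 is the padding slot
print(pad_or_truncate([4, 7, 1], max_len=5, pad_id=pad_id))
print(pad_or_truncate([4, 7, 1, 3, 9, 2], max_len=5, pad_id=pad_id))
```

Because the padding index is one past the last real token, the embedding layer later needs vocab_size + 1 rows.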

The collate_batch function turns a batch of (lyrics, label, features) entries into three tensors.

In [17]:
def collate_batch(batch):
    label_list, text_list, feature_list = [], [], []
    for (_text, _label, _features) in batch:

        # feature pipeline
        feature_list.append(torch.tensor(_features))

        # add label to list
        label_list.append(label_pipeline(_label))

        # add text (as sequence of integers) to list
        processed_text = text_pipeline(_text)
        text_list.append(processed_text)

    feature_list = torch.stack(feature_list)
    label_list = torch.tensor(label_list, dtype=torch.int64)
    text_list = torch.stack(text_list)
    return text_list, label_list, feature_list

Create data loaders for the training and validation sets.

In [18]:
train_loader = DataLoader(train_data, batch_size=8, shuffle=True, collate_fn=collate_batch)
val_loader = DataLoader(val_data, batch_size=8, shuffle=True, collate_fn=collate_batch)

Now we can build our model! First, embed the text, then apply dropout to reduce overfitting, average the embeddings across the words in each song, and finish with a fully-connected linear layer.

In [19]:
from torch import nn
import torch.nn.functional as F

class TextClassificationModel(nn.Module):
    
    def __init__(self, vocab_size, embedding_dim, max_len, num_class):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size+1, embedding_dim)
        self.dropout = nn.Dropout(p=0.2)
        self.fc   = nn.Linear(embedding_dim, num_class)  

    def forward(self, x):
        x = self.embedding(x)
        x = self.dropout(x)
        x = x.mean(axis=1)  
        x = self.fc(x)
        return x

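Set parameters for model training. Note that max_len is reassigned to 100 here; because text_pipeline reads the global max_len when it is called, lyrics are padded or truncated to 100 tokens during training rather than 30.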

In [20]:
vocab_size = len(vocab)
embedding_dim = 3
max_len = 100
num_class = 7

lyrics_model = TextClassificationModel(vocab_size, embedding_dim, max_len, num_class)

optimizer = torch.optim.Adam(lyrics_model.parameters(), lr=.1)
loss_fn = torch.nn.CrossEntropyLoss()

Here is the training loop and evaluation method we'll use for each neural network.

In [21]:
import time

def train(model, dataloader, lyrics, engineering):
    # note: uses the global optimizer, loss_fn, and epoch variables set in other cells
    # keep track of time for each epoch
    epoch_start_time = time.time()

    # for measuring accuracy
    total_acc, total_count = 0, 0

    for idx, (text, label, features) in enumerate(dataloader):
        # zero gradients
        optimizer.zero_grad()

        # prediction on batch, based on specified features 
        predicted_label = ''
        if(engineering and not lyrics):
            predicted_label = model(features)
        elif(lyrics and not engineering):
            predicted_label = model(text)
        elif(lyrics and engineering):
            predicted_label = model(text, features)

        # evaluate loss on prediction
        loss = loss_fn(predicted_label, label)

        # compute gradient
        loss.backward()

        # take an optimization step
        optimizer.step()

        # for printing accuracy
        total_acc   += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        
    print(f'| epoch {epoch:3d} | train accuracy {total_acc/total_count:8.3f} | time: {time.time() - epoch_start_time:5.2f}s')
    
def evaluate(model, dataloader, lyrics=True, engineering=True):

    # for determining accuracy
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (text, label, features) in enumerate(dataloader):

            # form prediction on batch
            if(engineering and not lyrics):
                predicted_label = model(features)
            elif(lyrics and not engineering):
                predicted_label = model(text)
            elif(lyrics and engineering):
                predicted_label = model(text, features)

            # compute accuracy
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
            
    return total_acc/total_count
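
The same three-way branching appears in both train and evaluate. One way to keep it in a single place is a small helper like the sketch below. The name forward_pass is hypothetical, and the stand-in lambdas only show which inputs each mode receives; they are not real models.

```python
def forward_pass(model, text, features, lyrics, engineering):
    """Call the model with the inputs selected by the two flags."""
    if lyrics and engineering:
        return model(text, features)
    if lyrics:
        return model(text)
    if engineering:
        return model(features)
    # The original branches silently skip this case.
    raise ValueError("at least one of lyrics/engineering must be True")

# Stand-in callables (hypothetical) to show which inputs each mode receives.
text_only = lambda text: ("text", text)
both = lambda text, features: ("both", text, features)
print(forward_pass(text_only, "la la", [0.1], lyrics=True, engineering=False))
print(forward_pass(both, "la la", [0.1], lyrics=True, engineering=True))
```

A related detail worth knowing: PyTorch modules start in training mode, and Dropout behaves differently after model.eval() is called. The functions above never switch modes, so dropout stays active during evaluation; calling model.train() at the start of train and model.eval() at the start of evaluate is the usual pattern.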

Now we can finally train the model. In just 5 epochs we reach 56% training accuracy.

In [22]:
EPOCHS = 5
for epoch in range(1, EPOCHS + 1):
    train(lyrics_model, train_loader, lyrics = True, engineering = False)


| epoch   1 | train accuracy    0.284 | time:  9.81s
| epoch   2 | train accuracy    0.385 | time: 16.25s
| epoch   3 | train accuracy    0.466 | time: 16.35s
| epoch   4 | train accuracy    0.527 | time: 17.27s
| epoch   5 | train accuracy    0.564 | time: 16.79s


Evaluate our model on the validation set. At 32%, validation accuracy is better than the 25% baseline, although the gap from the 56% training accuracy suggests the model is overfitting.

In [23]:
evaluate(lyrics_model, val_loader, lyrics = True, engineering = False)


0.3221145374449339

### Neural Network 2: Engineered Features

We were able to beat the baseline using just the lyrics. Can we do the same using engineered features -- descriptors of topics relevant to the songs?

In [24]:
# for reference only: the dataset class selects its 22 features by column position (6:28),
# and 'genre' in this list is the label rather than an input feature
engineered_features = ['genre', 'dating', 'violence', 'world/life', 'night/time','shake the audience','family/gospel', 'romantic', 'communication','obscene', 'music', 'movement/places', 'light/visual perceptions','family/spiritual', 'like/girls', 'sadness', 'feelings', 'danceability','loudness', 'acousticness', 'instrumentalness', 'valence', 'energy']
len(engineered_features)

23

In [25]:
# train test split
df_train, df_val = train_test_split(df,shuffle = True, test_size = 0.2)
train_data = TextDataFromDF(df_train)
val_data   = TextDataFromDF(df_val)

Our Engineering Classification Model utilizes sequences of Linear, ReLU, and Dropout layers.

In [44]:
from torch import nn

class EngineeringClassificationModel(nn.Module):
    def __init__(self, input_size, num_class):
        super().__init__()
        self.model = nn.Sequential( 
            nn.Linear(input_size, 128), 
            nn.ReLU(), 
            nn.Dropout(0.2),
            nn.Linear(128, 64),
            nn.ReLU(), 
            nn.Dropout(0.2),
            nn.Linear(64, 32), 
            nn.ReLU(), 
            nn.Linear(32, 16), 
            nn.ReLU(), 
            nn.Linear(16, num_class), 
            nn.Softmax(dim=1) )

    def forward(self, x):
        x = x.float()
        x = torch.flatten(x, 1)
        x = self.model(x)
        return x

In [45]:
train_loader = DataLoader(train_data, batch_size=8, shuffle=True, collate_fn=collate_batch)
val_loader = DataLoader(val_data, batch_size=8, shuffle=True, collate_fn=collate_batch)

In [46]:
input_size = 22
num_classes = 7
engineer_model = EngineeringClassificationModel(input_size, num_classes)

optimizer = torch.optim.Adam(engineer_model.parameters(), lr=.00001)
loss_fn = torch.nn.CrossEntropyLoss()

Let's train and test.

In [47]:
EPOCHS = 10
for epoch in range(1, EPOCHS + 1):
    train(engineer_model, train_loader, lyrics=False, engineering=True)



| epoch   1 | train accuracy    0.235 | time:  9.59s
| epoch   2 | train accuracy    0.248 | time: 15.03s
| epoch   3 | train accuracy    0.248 | time: 15.05s
| epoch   4 | train accuracy    0.248 | time: 15.06s
| epoch   5 | train accuracy    0.248 | time: 15.07s
| epoch   6 | train accuracy    0.248 | time: 16.55s
| epoch   7 | train accuracy    0.248 | time: 16.54s
| epoch   8 | train accuracy    0.248 | time: 17.64s
| epoch   9 | train accuracy    0.248 | time: 15.33s
| epoch  10 | train accuracy    0.248 | time: 16.13s


In [30]:
evaluate(engineer_model, val_loader, lyrics = False, engineering = True)


0.24916299559471367

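This is essentially the same as the baseline: training accuracy is stuck at 0.248, the share of pop songs, which suggests the model predicts pop for every song. Perhaps lyrics are better at predicting genre, or the model I designed is not strong enough for this task. Two details of the setup may also contribute: the final Softmax layer feeds probabilities into CrossEntropyLoss, which expects unnormalized scores and applies its own softmax, and the learning rate of 0.00001 is very small.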
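
One possible contributor to the flat training curve is that the model ends in nn.Softmax, while CrossEntropyLoss applies its own softmax to whatever it receives. The pure-Python sketch below reproduces the arithmetic for a single sample with 7 classes. It is an illustration of the loss formula under that assumption, not a diagnosis I have verified by retraining, and the helper names are my own.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_scores(scores, target):
    """Cross-entropy for one sample: -log softmax(scores)[target]."""
    return -math.log(softmax(scores)[target])

num_classes = 7
# If the model already outputs probabilities, the most confident correct
# prediction it can make is a one-hot vector.
perfect = [1.0] + [0.0] * (num_classes - 1)
uniform = [1.0 / num_classes] * num_classes

best_loss = cross_entropy_from_scores(perfect, target=0)
uniform_loss = cross_entropy_from_scores(uniform, target=0)
print(round(best_loss, 3), round(uniform_loss, 3))
```

Even a perfect one-hot output cannot push the loss below roughly 1.17, while a uniform guess scores about 1.95, so the loss has little room to reward better predictions. Returning the raw output of the last Linear layer and letting CrossEntropyLoss do the softmax avoids this squeeze.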

### Neural Network 3: Lyrics and Engineered Features

Can we get high accuracy by combining our previous two neural networks -- using both lyrics and engineered features to classify song genre?

In [31]:
class CombinedModel(nn.Module):

    def __init__(self, vocab_size, embedding_dim, num_features, num_classes):
        super().__init__()

        # Text Pipeline
        self.embedding = nn.Embedding(vocab_size+1, embedding_dim)
        self.text_fc = nn.Linear(embedding_dim, 128)

        # Engineered Features Pipeline
        self.engineered_fc = nn.Linear(num_features, 128)

        # Combined Layers
        # 12928 = 100 * 128 (flattened text, max_len = 100) + 128 (engineered features)
        self.combine_fc = nn.Linear(12928, 64)
        self.output_fc = nn.Linear(64, num_classes)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, text, engineer):
        # text pipeline: embed each token, apply a linear layer, then flatten
        # (batch, max_len, 128) into (batch, max_len * 128)
        text_embed = self.embedding(text)
        x_1 = self.text_fc(text_embed)
        x_1 = torch.flatten(x_1, 1)

        # engineered features: a single fully-connected linear layer
        engineer = engineer.float()
        x_2 = self.engineered_fc(engineer)

        # concatenate the two 2-d tensors along the feature dimension
        combined = torch.cat((x_1, x_2), dim=1)

        # pass the result through two more fully-connected layers
        combined = self.combine_fc(combined)
        output = self.output_fc(combined)
        output = self.softmax(output)

        return output
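
The hard-coded 12928 in combine_fc depends on the sequence length. A small sketch of where that number comes from, assuming the layer widths used above (combined_input_size is a hypothetical helper, not part of the notebook):

```python
def combined_input_size(max_len, text_hidden=128, feature_hidden=128):
    """Width of torch.cat((x_1, x_2), dim=1) in CombinedModel.forward."""
    # text_fc maps each of the max_len token embeddings to text_hidden values,
    # and flattening lays them side by side; engineered_fc adds feature_hidden more.
    return max_len * text_hidden + feature_hidden

print(combined_input_size(100))  # matches the hard-coded 12928
print(combined_input_size(30))   # size needed if max_len were 30 instead
```

Computing this value from max_len inside the constructor, instead of typing it in, would keep the model working if the sequence length changes.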

In [32]:
train_loader = DataLoader(train_data, batch_size=8, shuffle=True, collate_fn=collate_batch)
val_loader = DataLoader(val_data, batch_size=8, shuffle=True, collate_fn=collate_batch)

In [33]:
vocab_size = len(vocab)
embedding_dim = 3
num_features = 22
num_classes = 7

combined_model = CombinedModel(vocab_size, embedding_dim, num_features, num_classes)

optimizer = torch.optim.Adam(combined_model.parameters(), lr=0.0001)
loss_fn = torch.nn.CrossEntropyLoss()

In [34]:
EPOCHS = 5
for epoch in range(1, EPOCHS + 1):
    train(combined_model, train_loader, lyrics=True, engineering=True)


| epoch   1 | train accuracy    0.246 | time: 29.97s
| epoch   2 | train accuracy    0.252 | time: 31.29s
| epoch   3 | train accuracy    0.257 | time: 30.10s
| epoch   4 | train accuracy    0.263 | time: 30.53s
| epoch   5 | train accuracy    0.264 | time: 30.80s


In [35]:
evaluate(combined_model, val_loader, lyrics = True, engineering = True)


0.2526872246696035

Once again, only slightly better than the baseline!

### Visualize Word Embedding

Text embedding models blindly learn associations between the words used in their input text, including biased ones. It would be unsurprising to see such associations in a model trained on songs, considering that lyrics, in my experience especially in hip-hop and country, can carry racist and sexist undertones.

In [36]:
# for embedding visualization later
import plotly.express as px 
import plotly.io as pio
import numpy as np

In [37]:
embedding_matrix = combined_model.embedding.cpu().weight.data.numpy()
tokens = vocab.get_itos()

Let's use PCA to reduce the embeddings to two dimensions for plotting.

In [38]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
weights = pca.fit_transform(embedding_matrix)

In [39]:
tokens = vocab.get_itos()
# the embedding has one extra row for the padding index; label it with a blank
tokens.append(" ")
embedding_df = pd.DataFrame({
    'word' : tokens, 
    'x0'   : weights[:,0],
    'x1'   : weights[:,1]
})

embedding_df

Unnamed: 0,word,x0,x1
0,<unk>,-0.241795,0.101041
1,know,1.901600,0.761639
2,like,0.061012,-0.139691
3,time,-2.086986,-0.204237
4,come,-0.004097,-0.908995
...,...,...,...
45659,트램펄린,-0.433597,0.306044
45660,한번쯤은,0.411126,-0.498122
45661,함께라는,-1.094337,-1.840321
45662,ﬁnished,-1.300753,1.882751


In [40]:
fig = px.scatter(embedding_df, 
                 x = "x0", 
                 y = "x1", 
                 size = list(np.ones(len(embedding_df))),
                 size_max = 10,
                 hover_name = "word")

fig.show()

![PCA graph](pca.jpg)

It's a bit hard to make sense of, seeing that we are classifying into 7 categories. However, some of the "outlier" words in our PCA plot seem to be associated with specific genres. 

In [41]:
feminine = ["she", "her", "woman"]
masculine = ["he", "him", "man"]

highlight_1 = ["strong", "powerful", "smart",     "thinking", "brave", "muscle"]
highlight_2 = ["hot",    "sexy",     "beautiful", "shopping", "children", "thin"]

def gender_mapper(x):
    if x in feminine:
        return 1
    elif x in masculine:
        return 4
    elif x in highlight_1:
        return 3
    elif x in highlight_2:
        return 2
    else:
        return 0

embedding_df["highlight"] = embedding_df["word"].apply(gender_mapper)
embedding_df["size"]      = np.array(1.0 + 50*(embedding_df["highlight"] > 0))

# keep only the highlighted words
sub_df = embedding_df[embedding_df["highlight"] > 0]

In [42]:
import plotly.express as px 

fig = px.scatter(sub_df, 
                 x = "x0", 
                 y = "x1", 
                 color = "highlight",
                 size = list(sub_df["size"]),
                 size_max = 10,
                 hover_name = "word", 
                 text = "word")

fig.update_traces(textposition='top center')

fig.show()

![word embedding image](word_embeddings.jpg)

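I'm pleased to see little association between stereotypically feminine traits and unequivocally feminine words, and likewise between stereotypically masculine traits and unequivocally masculine words. That said, the combined model barely beat the baseline, so its embeddings may simply not have learned much structure yet.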

### Conclusion

In this blog post, I learned to design neural networks to handle different types of inputs: text input (lyrics) and engineered features. I created classification models for 7 genres; the lyrics model reached 32% validation accuracy against a 25% baseline, while the other two models only matched or slightly exceeded it. Finally, I analyzed the word embeddings learned by my model, considering how these came to be.

Neural network models are ubiquitous -- perhaps among the most widely used machine learning methods today. Every time I use fingerprint ID on my phone or auto-complete, I'm likely relying on a neural network. It's exciting to begin to understand how these work, and to apply them in an interesting way -- to music!