INFX 575  SPRING 2017 
Alfonso Bonilla, Jared Praino, Michael Stepanovic

## Questions

Can a song’s genre be predicted from its lyrics?
How do different genre constructs compare in terms of classification accuracy?
Which machine learning model best suits this task?

Through our project, we attempted to determine whether a song’s genre could be classified purely from its lyrical content.  Beyond that yes-or-no question, we wanted to know which genres are easiest to predict and which machine learning algorithm best suits the task.

## Data

To answer these questions, we compiled data from three sources: Million Song Dataset, MusiXmatch, and Last.fm.  Respectively, these sources provided us with song metadata, lyrical content (in bag-of-words format), and expert-labeled genre tags for songs.  After joining these three datasets, we were left with 83,192 songs for analysis.

Songs were represented in bag-of-words format over the 5,000 most popular words across all songs: in each row, the first 5,000 indices hold word counts and the 5,001st index identifies the song’s genre. Preliminary analysis of the data revealed a skew in song genres, with nearly half of all songs (49.42%) classified as ‘Rock’. We therefore set a benchmark of 50% accuracy, which a model must reach to beat the simplest strategy of guessing ‘Rock’ for every song.
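The row layout and the "always guess the most common genre" baseline can be sketched as follows. This is a minimal illustration, not the project's code: the helper names `split_row` and `majority_baseline` are hypothetical, and the toy rows use a made-up 4-word vocabulary in place of the real 5,000 words.

```python
from collections import Counter

def split_row(row, n_words):
    """Split one row into (word counts, genre label).

    Assumes the layout described above: n_words counts, then the genre.
    """
    return row[:n_words], row[n_words]

def majority_baseline(labels):
    """Return the most common label and the accuracy of always guessing it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Toy rows with a 4-word vocabulary instead of 5,000 (values are made up).
toy_rows = [
    [2, 0, 1, 0, "Rock"],
    [0, 3, 0, 1, "Latin"],
    [1, 0, 0, 2, "Rock"],
    [0, 0, 4, 0, "Rap"],
]
toy_labels = [split_row(row, 4)[1] for row in toy_rows]
baseline_label, baseline_accuracy = majority_baseline(toy_labels)
print(baseline_label, baseline_accuracy)
```

On the real dataset, the same calculation would give the 49.42% figure quoted above, which is why 50% is a natural bar for any trained model to clear.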

Million Song Dataset -- https://labrosa.ee.columbia.edu/millionsong/
MusiXmatch dataset (lyrics BoWs) -- http://labrosa.ee.columbia.edu/millionsong/musixmatch
Last.fm API (song genre tags) -- https://labrosa.ee.columbia.edu/millionsong/lastfm#api

Diekroeger, D. (2012). Can Song Lyrics Predict Genre? Stanford University. Retrieved from http://cs229.stanford.edu/proj2012/Diekroeger-CanSongLyricsPredictGenre.pdf
Fell, M., & Sporleder, C. (2014). Lyrics-based Analysis and Classification of Music. In COLING (Vol. 2014, pp. 620-631).
Bertin-Mahieux, T., Ellis, D. P. W., Whitman, B., & Lamere, P. (2011). The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011).
![image.png](attachment:image.png)

## Results

Across all genres, we achieved 45.13% classification accuracy.  This did not reach our benchmark, which leads us to believe that more data than lyrics alone may be needed to build an accurate genre classifier. However, certain genres can be accurately predicted from their lyrics, most notably ‘Latin’.  We suspect this is because Latin songs tend to contain Spanish lyrics, unlike any other genre analyzed.  Genres like ‘New’ and ‘Rock’ performed very poorly.  We suspect this is because either the training set was too small (‘New’ contained only 141 songs) or the genre was too broad and spanned too many genre constructs (‘Rock’ has many sub-categories).
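To make the distinction between overall accuracy and per-genre accuracy concrete, here is a small sketch that computes, for each true genre, the fraction of its songs that were predicted correctly. The function name `per_genre_accuracy` and the toy labels are illustrative assumptions, not the project's code or results.

```python
from collections import defaultdict

def per_genre_accuracy(true_labels, predicted_labels):
    """Fraction of songs of each true genre that were predicted correctly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return {genre: correct[genre] / total[genre] for genre in total}

# Made-up labels for illustration only.
toy_true = ["Rock", "Rock", "Latin", "Latin", "New"]
toy_pred = ["Rock", "Pop", "Latin", "Latin", "Rock"]
toy_scores = per_genre_accuracy(toy_true, toy_pred)
print(toy_scores)
```

A breakdown like this shows how a model can fall short overall while still doing well on a distinctive genre and badly on a small or very broad one.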

Of the two models we tried, k-NN and Naive Bayes, we concluded that Naive Bayes was the better model for predicting song genre. It gave the following result:

![image.png](attachment:image.png)

## Methods and Approach

We attempted to classify genres with two methods.  First, we used the k-Nearest Neighbors (k-NN) model, with Euclidean distance as the measure of proximity between songs. This resulted in 36.09% accuracy using k = 1.

When we increased k to 2, both computation time and accuracy became substantially worse, and a k greater than 3 triggered a memory error. In addition, k-NN tended to predict ‘Rock’ for each song, so the 36.09% accuracy largely reflects the fact that nearly half of the songs are ‘Rock’.  We concluded that k-NN was an inappropriate model.
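A brute-force version of k-NN with Euclidean distance can be sketched in a few lines. This is an illustrative single-machine sketch with made-up data and hypothetical function names, not the implementation we ran. It also shows why the method is costly: every prediction compares the query against every training song across every word index.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_vectors, train_labels, query, k=1):
    """Predict the most common label among the k nearest training vectors."""
    # Sort all training songs by distance to the query, then keep the first k.
    neighbors = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy 3-word vocabulary; values are made up for illustration.
toy_train = [[3, 0, 0], [2, 1, 0], [0, 0, 4]]
toy_genres = ["Rock", "Rock", "Latin"]
toy_query = [0, 1, 3]

print(knn_predict(toy_train, toy_genres, toy_query, k=1))
print(knn_predict(toy_train, toy_genres, toy_query, k=3))
```

In this toy example the single nearest neighbor is the Latin song, but with k = 3 the two Rock songs outvote it. The same effect, at scale, is one plausible reason an unbalanced dataset pushes k-NN toward the dominant genre.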

The second model we tried was Naive Bayes, which reached 45.13% accuracy at α = 0.1, as shown above in Results.  We implemented this model on AWS: we uploaded our data to an S3 bucket, copied the bucket to an EC2 volume, and attached the volume to an EMR cluster, where we ran MapReduce with Spark, Hadoop, and Zeppelin.

This model was a substantial improvement over k-NN, with accurate predictions for ‘Latin’, ‘Metal’, ‘Rap’, ‘R&B’, ‘Country’, ‘World’, and ‘Blues’.  However, it did not meet our benchmark of 50% overall accuracy.  We concluded that not every genre can be classified from lyrics alone, and that more data is likely required to determine a song’s genre algorithmically.
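For readers who want to see the mechanics, the following is a minimal plain-Python sketch of a multinomial Naive Bayes classifier over word counts. It is not the Spark job we ran: it assumes that α acts as an additive smoothing parameter, the function names `train_nb` and `predict_nb` are hypothetical, and the data is a made-up 3-word vocabulary.

```python
import math
from collections import Counter

def train_nb(vectors, labels, alpha=0.1):
    """Fit per-genre log priors and smoothed per-word log likelihoods."""
    n_words = len(vectors[0])
    class_counts = Counter(labels)
    # Total count of each word within each genre.
    word_totals = {genre: [0] * n_words for genre in class_counts}
    for vec, genre in zip(vectors, labels):
        for i, count in enumerate(vec):
            word_totals[genre][i] += count
    model = {}
    for genre, n_songs in class_counts.items():
        total = sum(word_totals[genre])
        log_prior = math.log(n_songs / len(labels))
        # Additive smoothing keeps unseen words from producing log(0).
        log_likelihood = [
            math.log((count + alpha) / (total + alpha * n_words))
            for count in word_totals[genre]
        ]
        model[genre] = (log_prior, log_likelihood)
    return model

def predict_nb(model, vec):
    """Return the genre with the highest log posterior score."""
    def score(genre):
        log_prior, log_likelihood = model[genre]
        return log_prior + sum(c * ll for c, ll in zip(vec, log_likelihood))
    return max(model, key=score)

# Toy vocabulary: ["love", "amor", "guitar"]; counts are made up.
toy_vectors = [[2, 0, 3], [1, 0, 2], [0, 3, 1], [1, 4, 0]]
toy_genres = ["Rock", "Rock", "Latin", "Latin"]
toy_model = train_nb(toy_vectors, toy_genres, alpha=0.1)

print(predict_nb(toy_model, [0, 2, 0]))
print(predict_nb(toy_model, [1, 0, 2]))
```

Training is a single pass of counting per genre, which is why this model fits a MapReduce-style workflow far more naturally than k-NN does.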

Naive Bayes Classification Algorithm

Applies Bayes' theorem with “naive” independence assumptions between the features.
PROS: 
•	Highly scalable
•	Implementations exist (see: Spark MLlib)
•	Works well with text data
CONS: 
•	The “naive” independence assumption rarely holds for lyrics, where words tend to co-occur
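
For reference, the decision rule can be written out as below. This assumes the multinomial variant of Naive Bayes with additive smoothing, which is the usual choice for word counts; the symbols are introduced here for illustration. $c_i$ is the count of word $i$ in the song, $V$ is the vocabulary size (5,000 in our data), $N_{g,i}$ is the total count of word $i$ across training songs of genre $g$, $N_g = \sum_i N_{g,i}$, and $\alpha$ is the smoothing parameter.

$$
\hat{g} = \arg\max_{g} \left[ \log P(g) + \sum_{i=1}^{V} c_i \log P(w_i \mid g) \right],
\qquad
P(w_i \mid g) = \frac{N_{g,i} + \alpha}{N_g + \alpha V}
$$

The sum over words is where the independence assumption enters: each word contributes to the score separately, regardless of which other words appear in the song.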

## Limitations

Our data was sparse and unbalanced: we had too much ‘Rock’ and not enough of anything else.  In addition, the song lyrics were only provided as a bag of words due to copyright issues.  As such, lyrics were represented with only 5,000 features, which include stop words common across all songs.

Our original algorithm of choice was k-NN.  However, there was no Spark implementation of k-NN, which is notoriously hard to parallelize in Spark because of its “lazy learning” nature.  As a result, our attempts to parallelize the algorithm were unsuccessful and led to extremely long runtimes.

## Related Work

A student from Stanford University, Danny Diekroeger, attempted to answer this question using the Naive Bayes model (Diekroeger, 2012). His findings were inconclusive: he limited himself to five genre categories, and his much smaller dataset produced low classification accuracy.

Michael Fell and Caroline Sporleder addressed genre classification, as well as release date classification, using an n-gram model (Fell & Sporleder, 2014).  They included lyrical content, imagery, slang, rhyme structure, chorus, and six other attributes in their model.  Their research led to poor genre classification, but accurate year prediction.

## Future Work

We would like to run principal component analysis (PCA) to reduce the number of features in the bag-of-words.  We hope that fewer features would make the k-NN model faster, more efficient, and more accurate.  PCA could also potentially provide us with a discriminant projection of genres.

We would also like to try other classification methods, such as Neural Networks and Support Vector Machines.  Alongside these more “complicated” models, we would like to evaluate simpler approaches such as word count and TF-IDF (Hedayati, 2016), along with Random Forests.  Ultimately, we would like to construct a mixed model to classify genres solely by lyrics.
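
As a rough idea of what TF-IDF weighting would do to our bag-of-words rows, here is a small sketch. It uses one common formulation (term frequency times log of inverse document frequency); other variants exist, and the function name `tf_idf` and the toy counts are assumptions for illustration only.

```python
import math

def tf_idf(count_vectors):
    """Re-weight raw word counts by term frequency times inverse document frequency."""
    n_docs = len(count_vectors)
    n_words = len(count_vectors[0])
    # Number of songs in which each word appears at least once.
    doc_freq = [
        sum(1 for vec in count_vectors if vec[i] > 0) for i in range(n_words)
    ]
    idf = [math.log(n_docs / df) if df else 0.0 for df in doc_freq]
    weighted = []
    for vec in count_vectors:
        total = sum(vec)
        weighted.append(
            [(c / total) * w if total else 0.0 for c, w in zip(vec, idf)]
        )
    return weighted

# Made-up counts: word 0 appears in every song, like a stop word.
toy_counts = [[2, 1, 0], [1, 0, 3], [1, 2, 0]]
toy_weighted = tf_idf(toy_counts)
print(toy_weighted)
```

In this formulation, a word that appears in every song receives a weight of zero, so the stop words noted under Limitations would stop contributing to the features.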