Building a TFIDF-NMF model which performs topic modelling on Eminem's Lyrics

jai-agrawal/eminem-lyrics

Analysing Eminem's Lyrics Using Topic-Modelling

One fascinating combined use of Natural Language Processing (NLP) and Machine Learning is analysing the diction of texts. Two concepts make this possible: Term Frequency-Inverse Document Frequency (TF-IDF), which scores how significant a word is to a document relative to the rest of a corpus, and Non-Negative Matrix Factorization (NMF), which groups those weighted words into topics without supervision. Together, the TFIDF-NMF model extracts topics from a corpus of texts, for which the user only needs the raw text files. I decided to use these concepts to extract topics from the lyrics of rap artist Eminem.
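
To make the TF-IDF idea concrete, here is a minimal sketch in plain Python on a made-up three-document corpus. It uses the textbook weighting tf × log(N / df); library vectorizers usually add smoothing and normalization, so their numbers will differ. The toy documents and the `tf_idf` helper are illustrative assumptions and are not taken from this repository.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration, not real lyrics).
docs = [
    "money money fame",
    "fame and family",
    "family love daughter",
]

def tf_idf(docs):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

weights = tf_idf(docs)
# "money" appears only in the first document, so it scores higher there
# than "fame", which is shared with the second document.
print(weights[0])
```

A word that occurs in every document gets a weight of zero under this formula, which is why very common words contribute little to the topics.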

Data Used

The data used for this project is pulled solely from the Genius website. More details are in the next subsection.

Data Preparation

As outlined in data.py, the first thing I needed was an API key, which allows data to be pulled from the Genius website through the lyricsgenius library. It was easy enough to get once I made an account. Next, I decided to analyse only the lyrics from Eminem's 11 studio albums, an arbitrary decision but one that would be useful for future work (see the last subsection). I made a list of all of them myself, since I couldn't find a feature for this in the lyricsgenius library, and then pulled the lyrics for each album. These were saved to my directory in JSON format. Next, I opened these JSON files and saved the relevant data into a dictionary (this, too, would be helpful for future work rather than for this project specifically). I then used the dictionary to build a list containing only the lyrics. While data previously had to be normalized before being fed into the TFIDF vectorizer, as far as I understand it now does its own preprocessing, including lowercasing and tokenization. (If this is scikit-learn's TfidfVectorizer, lemmatization is not part of that preprocessing and would need a custom tokenizer.) Thus, building the list and saving it to a pickle file, for use in main.py, was the last step needed in data prep.
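
The JSON-to-pickle part of this step can be sketched as follows. This is not the contents of data.py: the key names (`name`, `tracks`, `song`, `title`, `lyrics`), the helper functions, and the file names are assumptions made for illustration, and the placeholder album stands in for a file saved by lyricsgenius. Fetching from Genius is left out because it needs an API key and network access.

```python
import json
import pickle
import tempfile
from pathlib import Path

def load_albums(json_dir):
    """Load every album JSON file found in json_dir into a list of dicts."""
    return [json.loads(path.read_text(encoding="utf-8"))
            for path in sorted(Path(json_dir).glob("*.json"))]

def build_lyrics_dict(albums):
    """Map album name -> {song title: lyrics}. The key names are assumed."""
    return {
        album["name"]: {
            track["song"]["title"]: track["song"]["lyrics"]
            for track in album["tracks"]
        }
        for album in albums
    }

def flatten_lyrics(lyrics_by_album):
    """Collapse the nested dictionary into a flat list of lyric strings."""
    return [lyrics
            for songs in lyrics_by_album.values()
            for lyrics in songs.values()]

# Demo with placeholder data written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    sample_album = {
        "name": "Album A",
        "tracks": [
            {"song": {"title": "Song 1", "lyrics": "first placeholder lyrics"}},
            {"song": {"title": "Song 2", "lyrics": "second placeholder lyrics"}},
        ],
    }
    (tmp_dir / "album_a.json").write_text(json.dumps(sample_album), encoding="utf-8")

    lyrics_by_album = build_lyrics_dict(load_albums(tmp_dir))
    all_lyrics = flatten_lyrics(lyrics_by_album)

    # Save the list for the modelling script, then read it back as a check.
    pickle_path = tmp_dir / "lyrics.pkl"
    with open(pickle_path, "wb") as f:
        pickle.dump(all_lyrics, f)
    with open(pickle_path, "rb") as f:
        restored = pickle.load(f)

print(restored)
```

Keeping the intermediate dictionary keyed by album means the album boundaries survive, which is what the later per-album analysis would need.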

Main Code

I first opened the pickle file containing the lyrics list. I then loaded a list of stopwords from the NLTK library and added words I found in Eminem's lyrics that were missing from it, presumably because they are profanity or slang. Next, I built vectorizers using both the TFIDF and Bag-of-Words (BOW) concepts. While TFIDF is preferred when analysing corpora, it did not take the separate corpora (albums) as an argument, so I'm not sure how well it performed. The BOW model was not that relevant either. I used these vectorizers to transform the lyrics into separate TFIDF and BOW data, and then fed both into Latent Dirichlet Allocation (LDA, also used for topic modelling) and NMF models. The best result came from the TFIDF-NMF model. The topics it found were:

[Screenshot of the topics found, 2021-05-15]

*excuse the profanity

Evaluation

While the topics seem obscure, I could find some meaning in them. For example, I would associate Topic #1 with Eminem returning to features, Topic #2 with fatherhood, and Topic #3 with love. A further improvement the code needs is to make TFIDF more useful by incorporating the corpus concept, that is, by inputting the album data separately.

Future Work

I would firstly want to, as stated in previous subsections, analyse Eminem's works not all together but separately, album by album. I would then want to know how significant each of these topics is in each album. I would also want to make a t-SNE plot of all the vocabulary the code has analysed. Lastly, I would want to apply this to other artists and writers.
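
One possible way to measure how significant each topic is in an album is to average the per-song topic weights over the songs of that album. The sketch below is only an assumption about how this could be done: the `mean_topic_weights` helper, the weights, and the album labels are all made up, and in practice the weights would come from the document-topic matrix of the fitted NMF model.

```python
from collections import defaultdict

def mean_topic_weights(doc_topic, album_labels):
    """Average each topic's weight over the songs belonging to every album."""
    sums = {}
    counts = defaultdict(int)
    for row, album in zip(doc_topic, album_labels):
        if album not in sums:
            sums[album] = [0.0] * len(row)
        sums[album] = [total + weight for total, weight in zip(sums[album], row)]
        counts[album] += 1
    return {
        album: [total / counts[album] for total in totals]
        for album, totals in sums.items()
    }

# Made-up topic weights for three songs, with the album each song belongs to.
doc_topic = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]]
album_labels = ["Album A", "Album A", "Album B"]

per_album = mean_topic_weights(doc_topic, album_labels)
print(per_album)
```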