Analysing Eminem's Lyrics Using Topic-Modelling

One fascinating combined use of Natural Language Processing (NLP) and Machine Learning is analysing the diction of a body of text. This is achieved through concepts such as Term Frequency-Inverse Document Frequency (TF-IDF), which measures how significant a word is to a document relative to the rest of the corpus, and Non-Negative Matrix Factorization (NMF), which groups the words weighted by TF-IDF into topics without supervision. Together, the TFIDF-NMF model extracts topics from a corpus, for which the user only needs the raw text files. I decided to leverage these concepts to extract topics from rap artist Eminem's lyrics.
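To make the TF-IDF idea concrete, here is a minimal pure-Python sketch of the textbook weighting, tf * log(N / df). It is only an illustration under my own assumptions: the function name `tf_idf`, the whitespace tokenization, and the toy sentences are hypothetical, and library vectorizers typically use smoothed variants of this formula, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy documents, made up for illustration only.
docs = [
    "the beat drops and the crowd goes wild",
    "the daughter waits at home",
    "the crowd waits for the beat",
]
scores = tf_idf(docs)
```

A word that appears in every document (here "the") gets a weight of zero, while a word unique to one document (here "daughter") outweighs a word shared with another document ("waits"). That is the sense in which TF-IDF marks a word as significant.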

Data Used

The data used for this project is pulled solely from the Genius website. The next subsection describes how it was collected.

Data Preparation

As outlined in data.py, the first thing I needed was an API key, which allows data to be pulled from the Genius website through the lyricsgenius library. This was easy enough once I made an account. Next, I decided to analyse only the lyrics from Eminem's 11 studio albums, which was an arbitrary decision but would be useful for future work (see the last subsection). I made a list of all of them myself, since I couldn't find a feature for this in the lyricsgenius library, and then pulled the lyrics for each album individually. These were saved to my directory in JSON format.

Next, I opened these JSON files and saved the relevant data into a dictionary (this, too, would be helpful for future work rather than for this project specifically). I then used the dictionary to make a list containing only the lyrics. While it was previously required that data fed into the TFIDF vectorizer be normalized, it now does its own basic preprocessing, including lowercasing and tokenization, as far as I understand. Assuming scikit-learn's TfidfVectorizer, it does not lemmatize on its own, so lemmatization would still have to be done separately if wanted. Thus, making the list and saving it to a pickle file (for use in main.py) was the last step needed in data prep.
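The JSON-to-pickle step described above can be sketched roughly as follows. This is not the code in data.py: the function names `collect_lyrics` and `save_lyrics` are hypothetical, and the JSON layout (a "tracks" list whose items hold a "song" object with a "lyrics" string) is my assumption about what the saved album files look like, so it should be checked against the actual files.

```python
import json
import pickle
from pathlib import Path


def collect_lyrics(json_dir):
    """Read every album JSON in json_dir and return a flat list of lyric strings."""
    lyrics = []
    for path in sorted(Path(json_dir).glob("*.json")):
        with open(path, encoding="utf-8") as handle:
            album = json.load(handle)
        # Assumed layout: {"tracks": [{"song": {"lyrics": "..."}}, ...]}
        for track in album.get("tracks", []):
            text = track.get("song", {}).get("lyrics")
            if text:  # skip tracks with missing or empty lyrics
                lyrics.append(text)
    return lyrics


def save_lyrics(lyrics, pickle_path):
    """Pickle the list of lyrics so a later script can load it."""
    with open(pickle_path, "wb") as handle:
        pickle.dump(lyrics, handle)
```

A call such as `save_lyrics(collect_lyrics("albums"), "lyrics.pkl")` would then produce the pickle file, where "albums" is a placeholder for wherever the JSON files were saved.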

Main Code

I first opened the pickle file containing the list of lyrics. I then loaded a list of stopwords from the NLTK library and added words I found in Eminem's lyrics that were not in that list, presumably because they are profane or slang. Next, I built vectorizers using both the TFIDF and Bag-of-Words (BOW) concepts. While TFIDF is preferred when analysing corpora, it did not take the separate corpora (albums) as an argument, and thus I'm not sure how well it performed. The BOW was not that relevant either. I used these vectorizers to transform the lyrics into separate TFIDF and BOW matrices, and then fed each matrix into both Latent Dirichlet Allocation (LDA, also used for topic modelling) and NMF models. The best result I found was from the TFIDF-NMF model. The topics found were as follows:

*excuse the profanity
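To show what the NMF step does with a document-term matrix, here is a small pure-Python sketch using the standard multiplicative update rules. It is an illustration only, not the code in main.py, which presumably relies on a library implementation. The helper names (`matmul`, `transpose`, `nmf`, `top_terms`, `squared_error`), the toy matrix, and the toy vocabulary are all hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factorise non-negative V (docs x terms) into W (docs x topics) and H (topics x terms)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_terms)]
             for i in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_topics)]
             for i in range(n_docs)]
    return w, h


def top_terms(h, vocab, n=3):
    """List the n highest-weighted vocabulary terms for each topic row of H."""
    return [[vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -row[j])[:n]]
            for row in h]


def squared_error(v, w, h):
    """Sum of squared differences between V and its reconstruction W H."""
    approx = matmul(w, h)
    return sum((a - b) ** 2 for ra, rb in zip(v, approx) for a, b in zip(ra, rb))


# Toy weights: the first two "songs" use only the first two words, the rest the last two.
vocab = ["stage", "crowd", "daughter", "home"]
v = [[2, 1, 0, 0], [4, 2, 0, 0], [0, 0, 1, 3], [0, 0, 2, 6]]
w, h = nmf(v, n_topics=2, n_iter=500)
topics = top_terms(h, vocab, n=2)
error = squared_error(v, w, h)
```

Each row of `h` is a topic expressed as weights over the vocabulary, and each row of `w` says how strongly a song draws on each topic. Reading off the top-weighted words per row of `h` is how a topic list like the one above is usually produced.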

Evaluation

While the topics seem obscure, I could find some meaning in them. For example, I would associate Topic #1 with Eminem returning to features, Topic #2 with fatherhood, and Topic #3 with love. A further improvement to the code would be to make TFIDF more useful by incorporating the corpus concept, that is, by inputting the album data separately.

Future Work

I would first want to, as stated in the previous subsections, analyse Eminem's works not all together but separately, album by album. I would then want to know how significant each of these topics is in each album. I would also want to make a t-SNE plot of all the vocabulary the code has analysed. Lastly, I would want to apply this to other artists and writers.
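One possible way to measure how significant each topic is in each album is to average the per-song topic weights within an album. The sketch below assumes a document-topic matrix (one row of non-negative weights per song, such as the W matrix from NMF) and a parallel list of album labels. The function name `topic_share_by_album`, the album labels, and the numbers are hypothetical placeholders, not results from this project.

```python
from collections import defaultdict


def topic_share_by_album(doc_topic, albums):
    """Average the normalised topic weights of each album's songs."""
    sums = {}
    counts = defaultdict(int)
    for weights, album in zip(doc_topic, albums):
        total = sum(weights)
        if total == 0:
            continue  # song with no weight on any topic
        # Normalise so each song's topic weights sum to 1.
        shares = [w / total for w in weights]
        if album not in sums:
            sums[album] = [0.0] * len(weights)
        sums[album] = [s + x for s, x in zip(sums[album], shares)]
        counts[album] += 1
    return {album: [s / counts[album] for s in sums[album]] for album in sums}


# Placeholder weights for three songs over two topics (made up for illustration).
doc_topic = [[1.0, 3.0], [3.0, 1.0], [0.0, 2.0]]
albums = ["Album A", "Album A", "Album B"]
shares = topic_share_by_album(doc_topic, albums)
```

Normalising per song first keeps long or heavily weighted songs from dominating an album's average. Whether that is the right choice depends on the question being asked.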

jai-agrawal/eminem-lyrics
