# Topic Modeling Experiment with Lyrics from Billboard100 from 1965-2015
by Keith Truong

## Introduction

Topic modeling uses an algorithm to find patterns of words that tend to occur together. The method can be used to identify recurring structures or topics within a corpus and to draw relationships among its documents (Micah Saxton, Topic Modeling Best Practices).

For this project, I apply this method to a dataset of songs that charted on the Billboard100 from 1965 to 2015, retrieved from Humanities Datasets in Context. The dataset contains 5101 entries, each with a song title, artist name, Billboard100 ranking, release year, lyrics, and source. About 271 entries lack usable lyrics because the field is either blank or categorized as instrumental.

This project attempts to illustrate the common themes and topics among popular songs in the US over 50 years. With the topic modeling approach, I hope to study the words and topics that recur in songwriting. Besides studying the connections between artists and their songs, this project also aims to reflect the audience of these songs. The common threads among these musical expressions and preferences can offer a lens for understanding American pop culture from 1965 to 2015.

Most of the code in this project draws on the coding assignments from CLS160: Introduction to Digital Humanities, taught by Dr. Micah Saxton in Fall 2023 at Tufts University.


## Code
### Set up

In [1]:
! pip install funcy
! pip install tzdata
! pip install --no-dependencies pyLDAvis
! pip install wget
! pip install gensim




In [2]:
from collections import defaultdict
import wget
from gensim import corpora, models
import pandas as pd
import pyLDAvis.gensim
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
import spacy

In [3]:
# set up nlp pipeline
nlp = spacy.load("en_core_web_sm")
nlp.disable_pipes('ner', 'parser')

['ner', 'parser']

### Upload data

In [4]:
path = ".../"
file_name = 'billboard_lyrics_1964_2015.csv'
df = pd.read_csv(file_name, encoding="ISO-8859-1")
df.head()

| | Rank | Song | Artist | Year | Lyrics |
|---|---|---|---|---|---|
| 0 | 1 | wooly bully | sam the sham and the pharaohs | 1965 | sam the sham miscellaneous wooly bully wooly b... |
| 1 | 2 | i cant help myself sugar pie honey bunch | four tops | 1965 | sugar pie honey bunch you know that i love you... |
| 2 | 4 | you were on my mind | we five | 1965 | when i woke up this morning you were on my min... |
| 3 | 5 | youve lost that lovin feelin | the righteous brothers | 1965 | you never close your eyes anymore when i kiss ... |
| 4 | 6 | downtown | petula clark | 1965 | when youre alone and life is making you lonely... |


### Prepare data

In [5]:
df['Lyrics'] = df['Lyrics'].astype(str) 

In [6]:
print (df.dtypes)

Rank       int64
Song      object
Artist    object
Year       int64
Lyrics    object
dtype: object


In [7]:
# extract lemmas that are nouns or verbs
def process_text(text):
    """Remove new line characters and lemmatize text. Returns string of lemmas"""
    text = text.replace('\n', ' ')
    doc = nlp(text)
    tokens = [token for token in doc]
    no_stops = [token for token in tokens if not token.is_stop]
    no_punct = [token for token in no_stops if token.is_alpha]
    noun_verb_tokens = [token for token in no_punct if token.pos_ == 'NOUN' or token.pos_ == 'VERB']
    noun_verb_lemmas = [token.lemma_ for token in noun_verb_tokens]
    noun_verb_lemmas_lower = [lemma.lower() for lemma in noun_verb_lemmas]
    noun_verb_lemmas_string = ' '.join(noun_verb_lemmas_lower)
    return noun_verb_lemmas_string

In [8]:
# extract lemmas that are only nouns
def process_text_nouns(text):
    """Remove new line characters and lemmatize text. Returns string of lemmas"""
    text = text.replace('\n', ' ')
    doc = nlp(text)
    tokens = [token for token in doc]
    no_stops = [token for token in tokens if not token.is_stop]
    no_punct = [token for token in no_stops if token.is_alpha]
    noun_tokens = [token for token in no_punct if token.pos_ == 'NOUN']
    noun_lemmas = [token.lemma_ for token in noun_tokens]
    noun_lemmas_lower = [lemma.lower() for lemma in noun_lemmas]
    noun_lemmas_string = ' '.join(noun_lemmas_lower)
    return noun_lemmas_string

In [9]:
# extract lemmas that are only verbs
def process_text_verbs(text):
    """Remove new line characters and lemmatize text. Returns string of lemmas"""
    text = text.replace('\n', ' ')
    doc = nlp(text)
    tokens = [token for token in doc]
    no_stops = [token for token in tokens if not token.is_stop]
    no_punct = [token for token in no_stops if token.is_alpha]
    verb_tokens = [token for token in no_punct if token.pos_ == 'VERB']
    verb_lemmas = [token.lemma_ for token in verb_tokens]
    verb_lemmas_lower = [lemma.lower() for lemma in verb_lemmas]
    verb_lemmas_string = ' '.join(verb_lemmas_lower)
    return verb_lemmas_string
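
The three functions above differ only in which part-of-speech tags they keep, so they could be collapsed into one helper. The following is a minimal sketch, not the code used to produce the results in this notebook: `lemmas_for_pos` and its `allowed_pos` parameter are hypothetical names, and the function only assumes that each token exposes the spaCy attributes already used above (`is_stop`, `is_alpha`, `pos_`, `lemma_`).

```python
def lemmas_for_pos(tokens, allowed_pos):
    """Return a space-joined string of lowercased lemmas for the tokens
    whose part-of-speech tag is in allowed_pos.

    tokens: any iterable of spaCy-like tokens (for example a spaCy Doc).
    allowed_pos: a set of coarse POS tags such as {'NOUN', 'VERB'}.
    """
    kept = [
        token.lemma_.lower()
        for token in tokens
        if not token.is_stop            # drop stop words
        and token.is_alpha              # drop punctuation, digits, symbols
        and token.pos_ in allowed_pos   # keep only the requested tags
    ]
    return ' '.join(kept)
```

With the pipeline from this notebook, `lemmas_for_pos(nlp(text.replace('\n', ' ')), {'NOUN', 'VERB'})` would play the role of `process_text`, while `{'NOUN'}` or `{'VERB'}` would correspond to the other two functions.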

In [11]:
# extract lemmas
df['noun_verb_lemmas'] = df['Lyrics'].apply(process_text)
df['noun_verb_lemmas'].head()

0    sham domingo samudio do tre tell thing see hor...
1    sugar honey bunch know love help love elsein l...
2    wake morning mind mind get trouble whoaoh get ...
3    close eye kiss lip s tenderness fingertip try ...
4    life make downtown get worry noise hurry help ...
Name: noun_verb_lemmas, dtype: object

In [12]:
df['noun_lemmas'] = df['Lyrics'].apply(process_text_nouns)
df['noun_lemmas'].head()

0    sham domingo samudio tre thing horn jaw let ch...
1    sugar honey bunch elsein life picture finger e...
2    morning mind mind trouble worry wound corner p...
3    eye lip tenderness fingertip baby baby whoa fe...
4    life downtown worry noise hurry downtownjust m...
Name: noun_lemmas, dtype: object

In [13]:
df['verb_lemmas'] = df['Lyrics'].apply(process_text_verbs)
df['verb_lemmas'].head()

0    do tell see tell let come learn dance tell s p...
1    know love help love come leave kiss timeswhen ...
2    wake get whoaoh get whoaoh get bind go ease ea...
3    close kiss s try know lose lose go go go wohno...
4    make get help know listen lose forget forget w...
Name: verb_lemmas, dtype: object

In [14]:
# extract the data out of the DataFrame
nv_documents = df['noun_verb_lemmas'].to_list()
n_documents = df['noun_lemmas'].to_list()
v_documents = df['verb_lemmas'].to_list()

In [15]:
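# length in characters of the first processed document in each corpus
# (a notebook cell displays only the value of the last expression)
len(nv_documents[0])
len(n_documents[0])
len(v_documents[0])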

75

In [18]:
# tokenize - the syntax below will create a list of lists
nv_texts =[
    [word for word in nv_document.lower().split()]
    for nv_document in nv_documents
]

n_texts =[
    [word for word in n_document.lower().split()]
    for n_document in n_documents
]

v_texts =[
    [word for word in v_document.lower().split()]
    for v_document in v_documents
]

In [20]:
# create a count of each token, separately for each corpus, so that a token
# seen once in one corpus is not kept just because it also occurs in another
nv_frequency = defaultdict(int)
for nv_text in nv_texts:
    for token in nv_text:
        nv_frequency[token] += 1

n_frequency = defaultdict(int)
for n_text in n_texts:
    for token in n_text:
        n_frequency[token] += 1

v_frequency = defaultdict(int)
for v_text in v_texts:
    for token in v_text:
        v_frequency[token] += 1

In [21]:
# remove words that appear only 1 time in their own corpus
nv_texts = [
    [token for token in text if nv_frequency[token] > 1]
    for text in nv_texts
]

n_texts = [
    [token for token in text if n_frequency[token] > 1]
    for text in n_texts
]

v_texts = [
    [token for token in text if v_frequency[token] > 1]
    for text in v_texts
]
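
The count-and-filter step can also be written as one small helper that counts tokens within a single corpus. This is a sketch using only the standard library; `drop_rare_tokens` and `min_count` are hypothetical names, not part of Gensim, and the toy corpus is invented for illustration.

```python
from collections import Counter

def drop_rare_tokens(texts, min_count=2):
    """Remove tokens that occur fewer than min_count times in texts.

    texts: a list of token lists (one list per document).
    Counts are taken over this corpus only.
    """
    counts = Counter(token for text in texts for token in text)
    return [
        [token for token in text if counts[token] >= min_count]
        for text in texts
    ]

# toy corpus: "zebra" occurs once and is dropped
toy_texts = [["love", "heart", "zebra"], ["love", "heart"], ["love"]]
filtered = drop_rare_tokens(toy_texts)
```

Calling the helper once per corpus keeps the three vocabularies independent of each other.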

### Build topic model

In [22]:
# create a dictionary based off our texts
# The dictionary maps each token to a unique integer id
nv_dictionary = corpora.Dictionary(nv_texts)

In [23]:
# create a dictionary based off our texts
# The dictionary maps each token to a unique integer id
n_dictionary = corpora.Dictionary(n_texts)

In [24]:
# create a dictionary based off our texts
# The dictionary maps each token to a unique integer id
v_dictionary = corpora.Dictionary(v_texts)

In [39]:
# create a corpus based off our dictionary and our texts
nv_corpus = [nv_dictionary.doc2bow(nv_text) for nv_text in nv_texts]

In [44]:
# create a corpus based off our dictionary and our texts
n_corpus = [n_dictionary.doc2bow(n_text) for n_text in n_texts]

In [45]:
# create a corpus based off our dictionary and our texts
v_corpus = [v_dictionary.doc2bow(v_text) for v_text in v_texts]
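
To make the dictionary and corpus steps concrete, here is a pure-Python sketch of the bag-of-words idea behind `corpora.Dictionary` and `doc2bow`: each distinct token gets an integer id, and each document becomes a list of `(token_id, count)` pairs. It is an illustration only; `build_vocab` and `to_bow` are hypothetical helpers, and the ids Gensim assigns may differ.

```python
from collections import Counter

def build_vocab(texts):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for text in texts:
        for token in text:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(text, vocab):
    """Convert one tokenized document to sorted (token_id, count) pairs.

    Tokens that are missing from vocab are ignored.
    """
    counts = Counter(vocab[token] for token in text if token in vocab)
    return sorted(counts.items())

# toy corpus for illustration only
toy_texts = [["love", "heart", "love"], ["heart", "dance"]]
vocab = build_vocab(toy_texts)
toy_corpus = [to_bow(text, vocab) for text in toy_texts]
```

Word order is discarded in this representation, which is why the model sees each song only as a set of token counts.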

In [40]:
# build LDA model
lda_model_nv = models.LdaModel(corpus=nv_corpus, id2word=nv_dictionary, num_topics=20, passes=70)

In [46]:
# build LDA model
lda_model_n = models.LdaModel(corpus=n_corpus, id2word=n_dictionary, num_topics=20, passes=70)

In [47]:
# build LDA model
lda_model_v = models.LdaModel(corpus=v_corpus, id2word=v_dictionary, num_topics=20, passes=70)

In [41]:
# explore topics
lda_model_nv.print_topics()

[(0,
  '0.218*"break" + 0.113*"heart" + 0.109*"help" + 0.032*"tear" + 0.031*"funk" + 0.019*"diva" + 0.016*"damage" + 0.013*"shout" + 0.010*"whoop" + 0.009*"aye"'),
 (1,
  '0.060*"fire" + 0.058*"burn" + 0.057*"lie" + 0.055*"stop" + 0.040*"stand" + 0.025*"move" + 0.021*"kick" + 0.015*"clap" + 0.013*"round" + 0.012*"make"'),
 (2,
  '0.087*"girl" + 0.063*"get" + 0.061*"know" + 0.052*"s" + 0.044*"ai" + 0.032*"tell" + 0.028*"man" + 0.022*"like" + 0.020*"want" + 0.020*"look"'),
 (3,
  '0.142*"go" + 0.089*"run" + 0.075*"know" + 0.068*"dance" + 0.054*"miss" + 0.051*"beat" + 0.024*"city" + 0.015*"night" + 0.014*"road" + 0.012*"trouble"'),
 (4,
  '0.138*"wait" + 0.031*"kid" + 0.029*"write" + 0.027*"power" + 0.023*"lip" + 0.020*"fight" + 0.019*"read" + 0.018*"care" + 0.017*"face" + 0.016*"sign"'),
 (5,
  '0.068*"boy" + 0.059*"hand" + 0.057*"turn" + 0.038*"hear" + 0.038*"play" + 0.035*"song" + 0.034*"music" + 0.033*"bring" + 0.028*"sing" + 0.018*"air"'),
 (6,
  '0.053*"know" + 0.051*"m" + 0.040*"ti

In [48]:
# explore topics
lda_model_n.print_topics()

[(0,
  '0.272*"way" + 0.154*"tonight" + 0.060*"friend" + 0.025*"day" + 0.014*"honey" + 0.014*"line" + 0.011*"time" + 0.011*"waitin" + 0.010*"alright" + 0.009*"cause"'),
 (1,
  '0.616*"baby" + 0.041*"woman" + 0.012*"lovin" + 0.010*"minute" + 0.010*"babe" + 0.009*"motion" + 0.008*"lover" + 0.007*"burn" + 0.006*"thing" + 0.005*"man"'),
 (2,
  '0.108*"hand" + 0.101*"let" + 0.078*"dance" + 0.058*"music" + 0.057*"floor" + 0.048*"party" + 0.022*"gon" + 0.018*"bass" + 0.015*"rhythm" + 0.015*"fun"'),
 (3,
  '0.072*"feelin" + 0.054*"tryin" + 0.033*"edge" + 0.031*"wing" + 0.030*"clap" + 0.029*"comin" + 0.027*"chorus" + 0.021*"highway" + 0.020*"goin" + 0.019*"hell"'),
 (4,
  '0.159*"boy" + 0.077*"day" + 0.075*"song" + 0.032*"babe" + 0.014*"rack" + 0.014*"jump" + 0.013*"number" + 0.013*"man" + 0.010*"hair" + 0.009*"weed"'),
 (5,
  '0.110*"heart" + 0.073*"world" + 0.043*"man" + 0.022*"time" + 0.021*"life" + 0.021*"people" + 0.017*"bit" + 0.015*"arm" + 0.014*"chance" + 0.012*"truth"'),
 (6,
  '0.119*

In [49]:
# explore topics
lda_model_v.print_topics()

[(0,
  '0.611*"love" + 0.049*"help" + 0.023*"hate" + 0.020*"blame" + 0.013*"get" + 0.009*"touch" + 0.009*"know" + 0.007*"care" + 0.006*"stand" + 0.005*"name"'),
 (1,
  '0.273*"feel" + 0.111*"fall" + 0.063*"believe" + 0.063*"try" + 0.041*"make" + 0.027*"dream" + 0.021*"wonder" + 0.020*"call" + 0.019*"m" + 0.019*"happen"'),
 (2,
  '0.278*"break" + 0.150*"kiss" + 0.080*"beat" + 0.059*"write" + 0.045*"fight" + 0.043*"build" + 0.041*"end" + 0.035*"mess" + 0.029*"learn" + 0.028*"matter"'),
 (3,
  '0.199*"find" + 0.125*"look" + 0.091*"fly" + 0.060*"see" + 0.047*"wake" + 0.047*"shine" + 0.025*"drink" + 0.022*"carry" + 0.022*"reach" + 0.016*"lose"'),
 (4,
  '0.187*"worry" + 0.164*"bout" + 0.073*"prove" + 0.072*"wash" + 0.047*"fool" + 0.039*"cross" + 0.035*"wind" + 0.021*"bury" + 0.019*"fire" + 0.016*"cast"'),
 (5,
  '0.727*"gon" + 0.024*"move" + 0.019*"lift" + 0.014*"wanna" + 0.011*"have" + 0.010*"celebrate" + 0.010*"meet" + 0.010*"tear" + 0.009*"m" + 0.009*"treat"'),
 (6,
  '0.369*"s" + 0.210*
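
The strings returned by `print_topics()` are convenient to read but awkward to compare. As a sketch, a regular expression can split one of these strings into `(word, weight)` pairs. `parse_topic_string` is a hypothetical helper, and the sample string is the beginning of topic 0 from the noun-and-verb model output.

```python
import re

# matches fragments such as 0.218*"break"
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic_string(topic_string):
    """Return a list of (word, weight) pairs from a print_topics() string."""
    return [
        (word, float(weight))
        for weight, word in TERM_PATTERN.findall(topic_string)
    ]

sample = '0.218*"break" + 0.113*"heart" + 0.109*"help" + 0.032*"tear"'
pairs = parse_topic_string(sample)
```

Gensim's `show_topic` method returns word and probability pairs directly, so parsing is mainly useful when only the printed output is available.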

In [42]:
# Find topics in each document
lda_model_nv.get_document_topics(nv_corpus[80])

[(2, 0.21636395),
 (6, 0.67062074),
 (9, 0.03466244),
 (15, 0.028915156),
 (16, 0.028671093),
 (18, 0.012890657)]

In [50]:
# Find topics in each document
lda_model_n.get_document_topics(n_corpus[80])

[(1, 0.028749341),
 (4, 0.510122),
 (6, 0.12708528),
 (7, 0.1561706),
 (15, 0.15908806)]

In [51]:
# Find topics in each document
lda_model_v.get_document_topics(v_corpus[80])

[(1, 0.17447914),
 (3, 0.1423452),
 (4, 0.021875653),
 (8, 0.032979436),
 (9, 0.15122499),
 (11, 0.07550787),
 (13, 0.020934427),
 (15, 0.16748421),
 (18, 0.048522145),
 (19, 0.15484223)]
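
`get_document_topics` returns `(topic_id, probability)` pairs, so the dominant topic of a document is the pair with the highest probability. The sketch below applies this to two of the outputs shown above for document 80; `dominant_topic` is a hypothetical helper.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

# values copied from the noun-and-verb output for document 80
nv_doc_80 = [(2, 0.21636395), (6, 0.67062074), (9, 0.03466244),
             (15, 0.028915156), (16, 0.028671093), (18, 0.012890657)]
# values copied from the noun-only output for document 80
n_doc_80 = [(1, 0.028749341), (4, 0.510122), (6, 0.12708528),
            (7, 0.1561706), (15, 0.15908806)]

nv_top = dominant_topic(nv_doc_80)  # topic 6
n_top = dominant_topic(n_doc_80)    # topic 4
```

For document 80, the noun-and-verb model places about 67% of its weight on topic 6 and the noun-only model about 51% on topic 4, while the verb-only output spreads over ten topics with none above 18%. Topic ids are specific to each model, so topic 6 in one model is unrelated to topic 6 in another.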

### Visualize topic model with noun-and-verb corpus

In [52]:
# visualize
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model_nv, nv_corpus, nv_dictionary)
vis

### Visualize topic model with noun-only corpus

In [53]:
# visualize
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model_n, n_corpus, n_dictionary)
vis

### Visualize topic model with verb-only corpus

In [54]:
# visualize
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model_v, v_corpus, v_dictionary)
vis

## Discussion

### Data set

The raw dataset was manually modified before it was uploaded to this notebook. Of the 5101 entries, 271 were removed because they lacked usable lyrics, being either blank or categorized as instrumental. Fewer than 10 of these 271 removed entries are songs with lyrics in non-Latin alphabets. I attempted to include these entries, but the topic modeling algorithm does not seem to work effectively with lyrics that are transcribed rather than translated. After some research, I traced the problem to the file encoding, which cannot read these entries correctly. I tried different encoding options without success, and I would like to solve this issue in future work.
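
One possible explanation for the encoding problem, offered here as an assumption rather than a diagnosis of this dataset: ISO-8859-1 maps every byte to a character, so reading a file with that encoding never fails, but any bytes that were actually UTF-8 come out as garbled characters. The sketch below shows the effect on a single word and defines a hypothetical helper, `decode_lyrics`, that tries UTF-8 first and falls back to ISO-8859-1.

```python
def decode_lyrics(raw_bytes):
    """Decode bytes as UTF-8 when possible, otherwise as ISO-8859-1.

    ISO-8859-1 can decode any byte sequence, so the fallback never raises.
    """
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return raw_bytes.decode("iso-8859-1")

original = "любовь"  # Russian for "love", used only as an example
utf8_bytes = original.encode("utf-8")

garbled = utf8_bytes.decode("iso-8859-1")  # UTF-8 bytes read as ISO-8859-1
recovered = decode_lyrics(utf8_bytes)      # UTF-8 is tried first

# bytes that are not valid UTF-8 fall back to ISO-8859-1
fallback = decode_lyrics(b"coraz\xf3n")
```

Whether this applies to the removed entries depends on how the source file was actually encoded, which would need to be checked against the raw data.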

### Corpus

When topic modeling was first run on the uploaded lyrics as a test, the results were dominated by extraneous words such as prepositions and conjunctions. Therefore, while preparing the data, I removed all tokens that are not nouns or verbs. I also wanted to look at adjectives but did not include them in this experiment; spaCy's part-of-speech tagger does label adjectives (as `ADJ`), so they could be added in a later iteration. I also created two additional corpora, noun-only and verb-only, to observe how Gensim handles corpora built from different parts of speech.

### Analysis of Topic Modeling Artifacts

#### Noun-and-verb corpus

The topic model of the noun-and-verb corpus produces a diverse set of topics. The topics above the PC1 axis transition smoothly among money, partying, violence, sex, male/female identifiers and references, and body parts, while the topics below the PC1 axis focus more on emotions (mostly sadness), nature, cityscapes, breaking up, and making love.


#### Noun-only corpus

The topic model of the noun-only corpus produces a less coherent set of topics. Of the 20 topics generated, 18 are located above the PC1 axis, mostly with mental and physical associations of love, nature, and life. The two topics located below the PC1 axis are more aggressive in their grouping of salient terms.

#### Verb-only corpus

The topic model of the verb-only corpus produces the least coherent set of topics. Most of the topics cluster to the right of the PC2 axis. Surprisingly, few of these topics show evident themes of violence, sex, or love like those in the previous experiments.
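
The coherence judgments in this section are based on reading the visualizations, and a numeric check could complement them. The sketch below computes a UMass-style coherence score in pure Python: for each pair of a topic's top words it adds log((D(w_m, w_l) + 1) / D(w_l)), where D counts the documents containing the given words, so scores closer to zero suggest words that co-occur more often. This measure was not used in the original experiment; `umass_coherence` is a hypothetical helper and the toy corpus is invented for illustration.

```python
import math

def umass_coherence(top_words, texts):
    """UMass-style coherence of one topic.

    top_words: the topic's words, ordered from most to least probable.
    texts: a list of token lists (one per document).
    Pairs whose earlier word appears in no document are skipped.
    """
    doc_sets = [set(text) for text in texts]

    def doc_freq(*words):
        # number of documents that contain every word in words
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                continue
            d_ml = doc_freq(top_words[m], top_words[l])
            score += math.log((d_ml + 1) / d_l)
    return score

# toy corpus for illustration only
toy_texts = [["love", "heart", "kiss"], ["love", "heart"],
             ["road", "city"], ["love", "road"]]
related = umass_coherence(["love", "heart"], toy_texts)
unrelated = umass_coherence(["heart", "city"], toy_texts)
```

In the toy corpus, "love" and "heart" share documents while "heart" and "city" never do, so the first pair scores higher. Gensim also provides a `CoherenceModel` class that could be applied to the three models for the same purpose.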

### Future Direction

I would like to examine this dataset by year of release and observe how the topics change over time. A strategy for that project might start with building a separate topic model for the songs charted each year, then finding a way to define each topic effectively so that topics can be ranked by frequency.
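
A simpler starting point than one model per year, sketched here as one possible option, is to keep a single model, assign each song its dominant topic, and count those topics per year. The sketch assumes two parallel lists that do not exist in this notebook, `years` (taken from the `Year` column) and `dominant_topics` (one topic id per song); both names and the toy values are hypothetical.

```python
from collections import Counter, defaultdict

def topic_counts_by_year(years, dominant_topics):
    """Count how many songs fall under each dominant topic in each year."""
    counts = defaultdict(Counter)
    for year, topic_id in zip(years, dominant_topics):
        counts[year][topic_id] += 1
    return counts

# toy values for illustration only
years = [1965, 1965, 1965, 2015, 2015]
dominant_topics = [6, 6, 2, 4, 4]

by_year = topic_counts_by_year(years, dominant_topics)
top_1965 = by_year[1965].most_common(1)  # most frequent topic in 1965
```

Because the topic ids come from one shared model, the yearly counts are directly comparable, which avoids having to match topics across separately trained models.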