# Week 15 Topic Modeling

### Task 1. Change some parameters (optional)
Run the code below to find topics in song lyrics using NMF and LDA. To experiment with the settings, change the following parameters:
- `n_components` in `NMF()` and `LatentDirichletAllocation()`: the total number of topics
- `max_df` in `CountVectorizer()` and `TfidfVectorizer()`: ignore terms that appear in more than this share of the songs (a float such as 0.8 means 80%)
- `min_df` in `CountVectorizer()` and `TfidfVectorizer()`: ignore terms that appear in fewer than `min_df` songs
- `ngram_range` in `CountVectorizer()` and `TfidfVectorizer()`: which n-grams to extract, e.g. `(1, 1)` for unigrams only or `(1, 2)` for unigrams and bigrams

### Task 2. Analyze the results (optional)
- Try to name topics based on their most dominant words. Topics are stored in NMF_topics.csv and LDA_topics.csv (rows: topics, columns: words).
- Can you find some coherent topics? Can you also spot junk topics for which you cannot come up with a meaningful name?
- Pick a few sample song lyrics and take a look at their topic distributions. Do they make sense?
- Does the relationship between genres and their dominant topics make sense?


In [2]:
import os
import csv
import time
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

In [4]:
# git clone "Music Dataset: Lyrics and Metadata from 1950 to 2019"
# the original corpus is adopted from https://data.mendeley.com/datasets/3t9vbwxgr5/2
if os.path.exists('W15'): 
    !rm -fr 'W15/'
!git clone https://github.com/music-data-mining/W15.git
%cd W15

Cloning into 'W15'...
remote: Enumerating objects: 7, done.
remote: Counting objects: 100% (7/7), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 7 (delta 0), reused 7 (delta 0), pack-reused 0
Unpacking objects: 100% (7/7), done.
/content/W15/W15


## Load Data

We will be using lyrics from Music Dataset v2 ([*Music Dataset: Lyrics and Metadata from 1950 to 2019*](https://data.mendeley.com/datasets/3t9vbwxgr5/2)).

In [None]:
# read a relevant subset of the dataset
lyrics = pd.read_csv('tcc_ceds_music.csv', 
                     usecols=['artist_name', 'track_name', 'release_date', 'genre', 'lyrics', 'topic'])   

In [None]:
# show 10 random rows
lyrics.sample(10) 

Unnamed: 0,artist_name,track_name,release_date,genre,lyrics,topic
22250,the expendables,one night stand,2002,reggae,night stand ringtone mobile phone gonna gonna ...,violence
158,t. m. soundararajan,enna vendum,1958,pop,girl friend swag sauce drip swagu girl friends...,obscene
19624,the meters,down by the river,2001,jazz,reason hide madness sorrow impossible today oo...,violence
3761,green day,dry ice,1991,pop,late night dream fly hand hand wake cold sweat...,sadness
23469,elvis presley,it is no secret (what god can do),1957,rock,chime time ring news slip fell long add streng...,sadness
911,the byrds,all i really want to do,1965,pop,lookin compete beat cheat mistreat simplify cl...,obscene
3887,saint etienne,spring,1992,pop,eye need forget yesterday time love lose feel ...,world/life
24285,toto,georgy porgy,1978,rock,situation need contemplation systematic addict...,romantic
9589,america,fallin' off the world,1984,country,time feel heart beat want walkin street matter...,world/life
6147,walk the moon,tightrope,2012,pop,easy heart easy heart walk little tightrope wa...,sadness


In [None]:
print("Number of songs with lyrics: ", len(lyrics))

Number of songs with lyrics:  28372


OK, we have a dataset of 28,372 songs, which is fairly large. Can we quickly summarize the lyrics into topics? The answer is yes. Next, we will use NMF and LDA to group the lyrics into topics.

Note that each sample in the dataset is already annotated with a topic (there are eight categories: 'feelings', 'music', 'night/time', 'obscene', 'romantic', 'sadness', 'violence', and 'world/life'). We could, of course, build a classifier in a supervised manner. But a more realistic setting for topic modeling is one where 1) the dataset is too large to review manually and 2) we do not know the labels beforehand. So we first want a good overview of the possible topics, obtained in an unsupervised way.

## Feature Engineering

Before fitting a model, we need to transform the lyrics into an n × m matrix for the model to decompose: `n` is the number of documents and `m` is the number of unique word-level features. A feature can be a single word ('unigram') or two consecutive words ('bigram'), with or without tf-idf weighting.

We will reuse the text feature engineering covered earlier.
- For LDA, we use a term-frequency (tf) unigram matrix (tf-idf is not recommended for LDA; see the original paper by Blei et al.: https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf?ref=https://githubhelp.com).
- For NMF, we test a tf-idf unigram matrix and a tf matrix of unigrams and bigrams.

In [None]:
# max_df=0.8 drops terms that appear in more than 80% of the songs (too common to be informative)
# min_df=2 drops terms that appear in only one song; these are often misspellings or non-standard spellings
# stop_words='english' removes very common function words (e.g., 'is', 'of', and 'which')
tfv = CountVectorizer(max_df=0.8, min_df=2, stop_words='english', ngram_range=(1,1))
start_time = time.time()
tf_unigram = tfv.fit_transform(lyrics['lyrics'])
print("It took %s seconds to generate a TF unigram matrix for LDA" % (time.time() - start_time))
# the first dimension is the number of samples (songs)
# the second is the number of unique unigrams
print("tf.shape: ", tf_unigram.shape)  

It took 1.7820885181427002 seconds to generate a TF matrix for LDA
tf.shape:  (28372, 22720)


In [None]:
tfidfv_unigram = TfidfVectorizer(max_df=0.8, min_df=2, stop_words='english', ngram_range=(1,1))
start_time = time.time()
tfidf_unigram = tfidfv_unigram.fit_transform(lyrics['lyrics'])
print("It took %s seconds to generate a TFIDF unigram matrix for NMF" % (time.time() - start_time))
print("tfidf_unigram.shape: ", tfidf_unigram.shape)

It took 2.400155782699585 seconds to generate a TFIDF unigram matrix for NMF
tfidf_unigram.shape:  (28372, 22720)


In [None]:
tfv_uni_bigram = CountVectorizer(max_df=0.5, min_df=5, stop_words='english', ngram_range=(1,2))
start_time = time.time()
tf_uni_bigram = tfv_uni_bigram.fit_transform(lyrics['lyrics'])
print("It took %s seconds to generate a TF bigram matrix for LDA" % (time.time() - start_time))
print("tf_uni_bigram.shape: ", tf_uni_bigram.shape)

It took 5.1385931968688965 seconds to generate a TF bigram matrix for LDA
tf_uni_bigram.shape:  (28372, 50644)


## 1. NMF

In [None]:
# fit the NMF model on the tf-idf unigram features
# `n_components` is a rough estimate of how many latent topics we would like to infer from the data;
# starting from 10 or 30 is fine, the choice is up to you
# beta_loss=1 selects the Kullback-Leibler divergence, which requires the 'mu' solver
nmf = NMF(n_components=30, solver='mu', init='nndsvda', beta_loss=1, random_state=42)
start_time = time.time()
nmf.fit(tfidf_unigram)
print("It took %s seconds to find NMF topics" % (time.time() - start_time))

It took 138.40290427207947 seconds to find NMF topics
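
To make the decomposition concrete: NMF approximates a non-negative matrix V (documents × terms) by the product of two smaller non-negative matrices, W (documents × topics) and H (topics × terms). The following is a minimal sketch on a random toy matrix, not on the lyrics. It uses scikit-learn's default Frobenius loss rather than the KL setting above, purely to keep the example short, and the matrix sizes are arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative "document-term" matrix: 6 documents x 8 terms.
rng = np.random.default_rng(0)
V = rng.random((6, 8))

toy_nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = toy_nmf.fit_transform(V)   # document-topic weights, shape (6, 2)
H = toy_nmf.components_        # topic-term weights, shape (2, 8)

approx = W @ H                 # low-rank reconstruction of V
rel_error = np.linalg.norm(V - approx) / np.linalg.norm(V)

print(W.shape, H.shape)
print("relative reconstruction error:", round(rel_error, 3))
```

In the notebook, `nmf.components_` plays the role of H and the output of `nmf.transform()` plays the role of W.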


In [None]:
start_time = time.time()
topics_nmf = []
feature_names = tfidfv_unigram.get_feature_names_out()  # look up the vocabulary once, not once per word
for index, topic in enumerate(nmf.components_):  # H = nmf.components_ (topics x terms)
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    # indices of the 15 largest weights, in descending order
    words = [feature_names[i] for i in topic.argsort()[:-15-1:-1]]
    print(words)
    topics_nmf.append(words)  # the top 15 words of each topic
# with open("NMF_topics.csv", "w", newline="") as f:
#     writer = csv.writer(f)
#     writer.writerow(np.arange(1,16))
#     for l in topics_nmf:
#       writer.writerow(l)
print("It took %s seconds to generate NMF_topics.csv" % (time.time() - start_time))

THE TOP 15 WORDS FOR TOPIC #0
['like', 'check', 'straight', 'cause', 'number', 'damn', 'deal', 'beat', 'style', 'rhyme', 'kid', 'kick', 'drop', 'game', 'play']
THE TOP 15 WORDS FOR TOPIC #1
['heart', 'break', 'tear', 'apart', 'start', 'promise', 'know', 'darling', 'inside', 'hearts', 'deep', 'eye', 'smile', 'pain', 'lose']
THE TOP 15 WORDS FOR TOPIC #2
['time', 'mind', 'wait', 'years', 'forever', 'line', 'waste', 'pass', 'spend', 'lose', 'grow', 'wonder', 'search', 'know', 'maybe']
THE TOP 15 WORDS FOR TOPIC #3
['away', 'walk', 'fade', 'run', 'yesterday', 'slip', 'throw', 'watch', 'blow', 'wind', 'river', 'stay', 'turn', 'place', 'drift']
THE TOP 15 WORDS FOR TOPIC #4
['life', 'earth', 'power', 'shall', 'live', 'fate', 'spirit', 'strength', 'bless', 'holy', 'soul', 'rise', 'peace', 'beauty', 'praise']
THE TOP 15 WORDS FOR TOPIC #5
['sweet', 'kiss', 'dear', 'heaven', 'moon', 'lover', 'bring', 'thrill', 'send', 'shin', 'warm', 'letter', 'lady', 'lips', 'till']
THE TOP 15 WORDS FOR TOPIC 

Do the clusters make sense? Although the value of `n_components` was chosen arbitrarily, one of the clusters in my output is very reasonable:

***
THE TOP 15 WORDS FOR TOPIC #25
['yeah', 'miss', 'summer', 'whoa', 'winter', 'woah', 'lovely', 'bang', 'spring', 'autumn', 'beach', 'september', 'like', 'butterfly', 'cold']
***

It seems the latent topic is 'season'.
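
The top-word extraction above relies on the slice `argsort()[:-15-1:-1]`, which can be hard to read at first. Here is a small sketch with a made-up weight vector and vocabulary (both hypothetical) and `k = 3` instead of 15: `argsort()` returns indices in ascending order of weight, and the slice walks backwards from the end to pick the `k` largest.

```python
import numpy as np

# Hypothetical topic row: one weight per vocabulary term.
weights = np.array([0.1, 0.7, 0.05, 0.4, 0.9])
vocab = np.array(["rain", "love", "car", "heart", "night"])
k = 3

# argsort() is ascending; [:-k-1:-1] reads the last k indices in reverse,
# i.e. the indices of the k largest weights in descending order.
top_idx = weights.argsort()[:-k - 1:-1]
top_words = vocab[top_idx].tolist()

# An equivalent, more explicit spelling: reverse first, then take k.
alt_idx = np.argsort(weights)[::-1][:k]

print(top_words)  # ['night', 'love', 'heart']
```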

In [None]:
topic_dist_NMF = nmf.transform(tfidf_unigram) 

In [None]:
lyrics['Topic_NMF'] = topic_dist_NMF.argmax(axis=1) # assign the most dominant topic to each song
lyrics.sample(10) #show random rows

Unnamed: 0,artist_name,track_name,release_date,genre,lyrics,topic,Topic_NMF
16591,royal blood,you can be so cruel,2014,blues,outside window outside door lonely show kill g...,violence,1
15064,sweet,wig-wam bam,1993,blues,suck bigger muthafucka right glass malone suck...,obscene,22
6004,avril lavigne,wish you were here,2011,pop,tough strong like girl give shit wall walk rem...,obscene,0
27729,freeman,rescaper,2010,hip hop,climb catacombs multi millions rest remain dea...,obscene,22
5986,travis barker,let's go,2011,pop,barker yeah lelelet yelawolf lelelet twista ho...,obscene,22
14271,the birthday party,"6"" gold blade",1982,blues,stick sixinch gold blade head girl lie teeth h...,violence,8
14921,louis prima,twist all night,1991,blues,come baby twist night come twist night come tw...,night/time,10
5708,rihanna,disturbia,2008,pop,bedum bedum wrong bedum bedum feel like bedum ...,obscene,10
26606,david crowder band,how he loves,2009,rock,jealous love like hurricane tree bend beneath ...,romantic,24
17067,gary clark jr.,don't wait til tomorrow,2019,blues,tire fight tire wrong hate cry wish right wron...,night/time,20
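
Assigning a single dominant topic with `argmax` discards how strong that assignment is. A minimal sketch with a hypothetical 3 × 4 weight matrix shows two things worth checking when inspecting a song's topic distribution: the rows returned by NMF are not probabilities, so they need normalizing before they can be read as shares, and `argmax` silently returns topic 0 for ties or all-zero rows.

```python
import numpy as np

# Hypothetical document-topic weights (3 songs x 4 topics),
# of the kind an NMF transform might return.
W = np.array([
    [0.0, 0.2, 0.6, 0.2],   # clear winner: topic 2
    [0.5, 0.5, 0.0, 0.0],   # tie between topics 0 and 1
    [0.0, 0.0, 0.0, 0.0],   # e.g. a song with no retained vocabulary
])

dominant = W.argmax(axis=1)   # ties and empty rows fall back to the first index

# Normalize each row to shares that sum to 1 (empty rows stay all zero).
row_sums = W.sum(axis=1, keepdims=True)
shares = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)

# Margin between the largest and second-largest weight: small values
# flag songs whose "dominant" topic is not really dominant.
sorted_w = np.sort(W, axis=1)
margin = sorted_w[:, -1] - sorted_w[:, -2]

print(dominant)
print(shares)
print(margin)
```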


In [None]:
lyrics['Topic_NMF'] = lyrics['Topic_NMF'].apply(str)
lyrics.groupby(["genre"])[["Topic_NMF"]].describe() # show stats of the topics per each genre

Unnamed: 0_level_0,Topic_NMF,Topic_NMF,Topic_NMF,Topic_NMF
Unnamed: 0_level_1,count,unique,top,freq
genre,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
blues,4604,30,15,240
country,5445,30,1,523
hip hop,904,29,22,514
jazz,3845,30,22,316
pop,7042,30,22,539
reggae,2498,30,4,269
rock,4034,30,15,354


In [None]:
s = lyrics['Topic_NMF'].groupby(lyrics['genre']).value_counts()
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  
  print(s)

genre    Topic_NMF
blues    15           240
         27           227
         1            225
         2            223
         18           200
         16           185
         24           179
         8            176
         10           175
         5            175
         6            162
         28           159
         17           158
         20           158
         4            157
         7            148
         9            148
         0            140
         13           138
         19           138
         3            131
         11           130
         12           121
         21           121
         22           114
         26           103
         14           100
         29            95
         25            92
         23            86
country  1            523
         2            341
         18           299
         6            266
         10           264
         5            248
         17           245
         13        

In [None]:
s.groupby('genre').head(3) # show the top 3 topics per genre

genre    Topic_NMF
blues    15           240
         27           227
         1            225
country  1            523
         2            341
         18           299
hip hop  22           514
         15            31
         0             25
jazz     22           316
         4            262
         5            239
pop      22           539
         1            433
         2            336
reggae   4            269
         22           258
         18           251
rock     15           354
         27           336
         2            227
Name: Topic_NMF, dtype: int64
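
Raw counts are hard to compare across genres because the genres differ a lot in size: topic 22 covers 514 of 904 hip hop songs (about 57%) but 539 of 7042 pop songs (about 8%). One way to put genres on the same scale is a row-normalized cross-tabulation. The sketch below uses a small made-up DataFrame (the data and the column name `Topic` are hypothetical); on the real data the same call could be applied to `lyrics['genre']` and `lyrics['Topic_NMF']`.

```python
import pandas as pd

# Hypothetical miniature of the genre / dominant-topic columns.
toy = pd.DataFrame({
    "genre": ["pop", "pop", "pop", "pop", "hip hop", "hip hop"],
    "Topic": ["22", "1", "22", "2", "22", "22"],
})

# normalize="index" divides each row by its total, so every genre sums to 1.
share = pd.crosstab(toy["genre"], toy["Topic"], normalize="index")
top_topic = share.idxmax(axis=1)   # most frequent topic per genre

print(share)
print(top_topic)
```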

## 2. NMF with both TF unigram and bigram

In [None]:
nmf_uni_bigram = NMF(n_components=30, solver='mu', init= 'nndsvda', beta_loss=1, random_state=42)
start_time = time.time()
nmf_uni_bigram.fit(tf_uni_bigram)
print("It took %s seconds to find NMF topics" % (time.time() - start_time))

It took 169.89509439468384 seconds to find NMF topics


In [None]:
start_time = time.time()
topics_nmf2 = []
feature_names = tfv_uni_bigram.get_feature_names_out()  # look up the vocabulary once, not once per word
for index, topic in enumerate(nmf_uni_bigram.components_):  # H = nmf_uni_bigram.components_
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    # indices of the 15 largest weights, in descending order
    words = [feature_names[i] for i in topic.argsort()[:-15-1:-1]]
    print(words)
    topics_nmf2.append(words)  # the top 15 words (unigrams or bigrams) of each topic

with open("NMF2_topics.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(np.arange(1,16))
    for l in topics_nmf2:
      writer.writerow(l)
      
print("It took %s seconds to generate NMF2_topics.csv" % (time.time() - start_time))

THE TOP 15 WORDS FOR TOPIC #0
['look', 'like', 'face', 'think', 'head', 'eye', 'place', 'open', 'say', 'make', 'try', 'lyric', 'cause', 'tell', 'commercial']
THE TOP 15 WORDS FOR TOPIC #1
['time', 'time time', 'waste', 'line', 'come', 'think', 'time come', 'waste time', 'spend', 'turn', 'know time', 'second', 'fine', 'time know', 'lose']
THE TOP 15 WORDS FOR TOPIC #2
['away', 'walk', 'stay', 'away away', 'fade', 'walk away', 'slip', 'turn', 'look', 'throw', 'hide', 'fade away', 'run', 'steal', 'today']
THE TOP 15 WORDS FOR TOPIC #3
['like', 'fuck', 'shit', 'bitch', 'nigga', 'niggas', 'cause', 'come', 'real', 'gotta', 'check', 'tell', 'beat', 'wanna', 'tryna']
THE TOP 15 WORDS FOR TOPIC #4
['feel', 'like', 'real', 'feel like', 'touch', 'feel feel', 'pain', 'inside', 'feel good', 'know feel', 'cause', 'ohoh', 'closer', 'good feel', 'body']
THE TOP 15 WORDS FOR TOPIC #5
['yeah', 'yeah yeah', 'bout', 'whoa', 'talkin', 'woah', 'dirty', 'talk', 'talkin bout', 'like', 'drop', 'cause', 'yeah k

In [None]:
topic_dist_NMF2 = nmf_uni_bigram.transform(tf_uni_bigram)

In [None]:
lyrics['Topic_NMF2'] = topic_dist_NMF2.argmax(axis=1)
lyrics.sample(10) # show 10 random rows

Unnamed: 0,artist_name,track_name,release_date,genre,lyrics,topic,Topic_NMF2
6011,go radio,rolling in the deep,2011,pop,start heart reach fever pitch bring dark final...,sadness,9
7989,loretta lynn,put it off until tomorrow,1966,country,tomorrow wowo hurt today go away leave tomorro...,world/life,2
28240,n.w.a.,100 miles and runnin',2018,hip hop,think gonna away spot immediate vicinity runni...,obscene,3
5445,new found glory,hold my hand,2006,pop,hair swing eye motor head turn want long time ...,romantic,10
23066,sticky fingers,velvet skies,2014,reggae,come yeah alright baby music boys see light ru...,violence,27
11959,the be good tanyas,the littlest birds,2012,country,feel like hobo lonesome blue fair summer summe...,music,19
26745,bobaflex,bury me with my guns,2011,rock,conversations make sense stop suffer stop scre...,violence,0
11532,the be good tanyas,for the turnstiles,2006,country,sailors seasick mamas hear sirens shore singin...,world/life,29
17098,dean martin,just for fun,1950,jazz,softly sigh close maybe agree hand magic land ...,sadness,2
7950,lorne greene,i'm a gun,1966,country,bear blast furnace steel wonder shape hard col...,violence,29


In [None]:
lyrics['Topic_NMF2']=lyrics['Topic_NMF2'].apply(str)
lyrics.groupby(["genre"])[["Topic_NMF2"]].describe() # show stats of the topics per each genre

Unnamed: 0_level_0,Topic_NMF2,Topic_NMF2,Topic_NMF2,Topic_NMF2
Unnamed: 0_level_1,count,unique,top,freq
genre,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
blues,4604,30,29,319
country,5445,30,9,409
hip hop,904,30,3,615
jazz,3845,30,3,439
pop,7042,30,3,592
reggae,2498,30,3,362
rock,4034,30,0,389


In [None]:
s = lyrics['Topic_NMF2'].groupby(lyrics['genre']).value_counts()
s.groupby('genre').head(3) # show the top 3 topics per genre

genre    Topic_NMF2
blues    29            319
         0             253
         13            193
country  9             409
         19            306
         21            291
hip hop  3             615
         20             31
         29             27
jazz     3             439
         20            347
         19            247
pop      3             592
         0             504
         9             366
reggae   3             362
         7             263
         29            217
rock     0             389
         29            326
         20            297
Name: Topic_NMF2, dtype: int64

------

## 3. Latent Dirichlet Allocation (LDA)

In [None]:
lda = LatentDirichletAllocation(n_components=30, random_state=42)
start_time = time.time()
lda.fit(tf_unigram)
print("It took %s seconds to find LDA topics" % (time.time() - start_time))

It took 180.20423102378845 seconds to find LDA topics
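
Unlike NMF, LDA is a probabilistic model, so its outputs can be read as distributions. The sketch below fits a tiny LDA on a made-up four-document corpus (hypothetical data, and far too small for meaningful topics) to show the shapes involved. In current scikit-learn versions the document-topic rows returned by `fit_transform()` already sum to 1, while `components_` holds unnormalized pseudo-counts that can be normalized into topic-word distributions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus with two loose themes.
toy_docs = [
    "love heart kiss love",
    "heart kiss sweet love",
    "street fight gun street",
    "gun fight street blood",
]

vec = CountVectorizer()
counts = vec.fit_transform(toy_docs)   # LDA expects raw counts, not tf-idf

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = toy_lda.fit_transform(counts)   # shape (4 documents, 2 topics)

# Normalize the pseudo-counts so each topic is a distribution over words.
topic_word = toy_lda.components_ / toy_lda.components_.sum(axis=1, keepdims=True)

print(doc_topic.round(2))
print(topic_word.shape)
```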


In [None]:
start_time = time.time()
topics_lda = []
feature_names = tfv.get_feature_names_out()  # look up the vocabulary once, not once per word
for index, topic in enumerate(lda.components_):  # lda.components_ has shape (topics, terms)
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    # indices of the 15 largest weights, in descending order
    words = [feature_names[i] for i in topic.argsort()[:-15-1:-1]]
    print(words)
    topics_lda.append(words)  # the top 15 words of each topic
# with open("LDA_topics.csv", "w", newline="") as f:
#     writer = csv.writer(f)
#     writer.writerow(np.arange(1,16))
#     for l in topics_lda:
#       writer.writerow(l)
print("It took %s seconds to generate LDA_topics.csv" % (time.time() - start_time))

THE TOP 15 WORDS FOR TOPIC #0
['black', 'open', 'hole', 'radio', 'teach', 'eye', 'point', 'like', 'bang', 'swing', 'pain', 'white', 'everybody', 'mama', 'texas']
THE TOP 15 WORDS FOR TOPIC #1
['cold', 'hurt', 'write', 'game', 'minute', 'letter', 'read', 'oooh', 'cause', 'tell', 'burn', 'head', 'like', 'sittin', 'want']
THE TOP 15 WORDS FOR TOPIC #2
['baby', 'wish', 'blood', 'come', 'girl', 'lady', 'bleed', 'luck', 'drink', 'check', 'power', 'head', 'watch', 'damn', 'run']
THE TOP 15 WORDS FOR TOPIC #3
['feel', 'good', 'real', 'know', 'like', 'morning', 'woah', 'cause', 'moment', 'tire', 'ways', 'years', 'tell', 'thing', 'yeah']
THE TOP 15 WORDS FOR TOPIC #4
['play', 'better', 'hear', 'music', 'devil', 'come', 'listen', 'like', 'band', 'sound', 'guitar', 'know', 'dance', 'say', 'thing']
THE TOP 15 WORDS FOR TOPIC #5
['world', 'fight', 'ready', 'save', 'come', 'know', 'lose', 'share', 'battle', 'everybody', 'stand', 'say', 'steady', 'makin', 'word']
THE TOP 15 WORDS FOR TOPIC #6
['time',

In [None]:
topic_dist_LDA = lda.transform(tf_unigram)

In [None]:
lyrics['Topic_LDA'] = topic_dist_LDA.argmax(axis=1)
lyrics.sample(10) #show random rows

Unnamed: 0,artist_name,track_name,release_date,genre,lyrics,topic,Topic_NMF2,Topic_LDA
1765,abba,i saw it in the mirror,1973,pop,mirror face longer need place mirror look eye ...,sadness,0,21
4691,destiny's child,hey ladies,1999,pop,ladies wrong decide holdin strength leave ladi...,obscene,11,26
3712,they might be giants,particle man,1990,pop,particle particle things particle like importa...,violence,7,27
3330,big black,bad penny,1987,pop,ought know liar ought know curse nature bless ...,obscene,8,21
27300,matt maeson,put it on me,2018,rock,hang high blame blame street pain cold inescap...,violence,15,18
26647,audrey assad,restless,2010,rock,dwell songs sing rise heavens rise heart heart...,world/life,9,26
19185,judy garland,i got rhythm,1994,jazz,days sign need money bird tree sing song shoul...,music,22,19
5712,bon iver,blindsided,2008,pop,bike downtown lock board nail crouch like crow...,violence,29,29
19976,devin townsend band,vampira,2006,jazz,night follow go nowhow night darkness strong c...,violence,6,22
20307,helen kane,i wanna be loved by you,2011,jazz,greedy kind want simple know mind rest eye gli...,romantic,17,17


In [None]:
s = lyrics['Topic_LDA'].groupby(lyrics['genre']).value_counts()
s.groupby('genre').head(3) # show the top 3 topics per genre

genre    Topic_LDA
blues    7             443
         21            369
         22            303
country  7            1036
         21            554
         25            401
hip hop  11            697
         22             33
         7              18
jazz     11            502
         19            333
         21            319
pop      7             927
         11            812
         21            721
reggae   11            491
         22            175
         18            158
rock     7             433
         22            398
         21            370
Name: Topic_LDA, dtype: int64