# Topic Modeling

## Introduction

Having looked at the sentiment of the lyrics, I'll now try another popular text analysis technique: topic modeling. Topic modeling identifies the topics present in a collection of texts (the corpus). Each document in the corpus is made up of at least one topic and may be a mix of several.

As an initial attempt, I decided to do topic modeling using **Latent Dirichlet Allocation (LDA)**, which is one of many topic modeling techniques. I chose it because it can put weight on a single topic even when a document is a mix of topics, and because it seemed easy to implement. It was specifically designed for text data.

To use a topic modeling technique, you need to provide (1) a document-term matrix and (2) the number of topics you would like the algorithm to pick up.
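
As a minimal sketch of those two inputs, the snippet below builds a tiny document-term matrix by hand. The two documents and the names `docs`, `vocab`, and `dtm` are made up for illustration; they are not taken from the lyrics dataset, and the real notebook builds its matrix with `CountVectorizer` instead.

```python
from collections import Counter

# Hypothetical toy documents (not from the lyrics dataset)
docs = {
    "song_a": "love home heart love",
    "song_b": "whiskey water home",
}

# Vocabulary: sorted unique terms across all documents
vocab = sorted({term for text in docs.values() for term in text.split()})

# Document-term matrix: one row per document, one column per vocabulary term
dtm = {}
for title, text in docs.items():
    counts = Counter(text.split())
    dtm[title] = [counts[term] for term in vocab]

# The second input is chosen by the analyst; the algorithm does not learn it
num_topics = 2

print(vocab)
for title, row in dtm.items():
    print(title, row)
```

Each row counts how often every vocabulary term occurs in one document, which is the same shape as the matrix loaded in the next section.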

LDA is an unsupervised algorithm. This means that once LDA is applied, the user has to interpret the results and decide whether the words in each topic make sense. Ideally the words in each topic should share a common theme, making it a... TOPIC! But sometimes the words just seem random, in which case you can try changing the number of topics or the terms in the document-term matrix or, if nothing seems to work, a different model.
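
One rough way to make "do the words make sense" slightly more concrete is to check how much the top words of different topics overlap: topics that share most of their top words are hard to tell apart. The sketch below is only an illustration and not part of the original analysis. `top_words` and `jaccard` are hypothetical helpers, and the two strings are shortened examples in the format that gensim's `print_topics()` returns.

```python
import re
from itertools import combinations

def top_words(topic_string):
    """Extract the quoted words from a gensim-style topic string."""
    return re.findall(r'"([^"]+)"', topic_string)

def jaccard(a, b):
    """Share of words that two word lists have in common (0 = disjoint, 1 = identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Shortened example strings in the print_topics() format
topics = {
    0: '0.027*"love" + 0.015*"home" + 0.015*"heart" + 0.012*"dont"',
    1: '0.015*"dont" + 0.014*"love" + 0.012*"tell" + 0.011*"time"',
}

# Compare every pair of topics
for (i, a), (j, b) in combinations(topics.items(), 2):
    print(i, j, round(jaccard(top_words(a), top_words(b)), 2))
```

A high overlap does not prove the topics are bad, but it is a hint that the model is not separating the vocabulary very well.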

## Topic Modeling - Attempt #1 (All Songs)

In [1]:
# Let's read in our document-term matrix
import pandas as pd
import pickle

data = pd.read_pickle('lyrics/song_dtm_stop.pkl')
data

Title,accepted,act,actors,address,adore,advice,affair,afternoon,age,ago,...,yellow,yes,yesterday,york,youd,youll,young,younger,youre,youve
No Such Thing,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
Why Georgia,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
My Stupid Mouth,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
Your Body Is A Wonderland,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
Neon,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
"""Moving On and Getting Over""",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
"""Never on the Day You Leave""",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,2,0,0,0,0
"""Rosie""",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"""Roll It on Home""",0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1


In [2]:
# Import the necessary modules for LDA with gensim
# Terminal / Anaconda Navigator: conda install -c conda-forge gensim
from gensim import matutils, models
import scipy.sparse

unable to import 'smart_open.gcs', disabling that module


In [3]:
# One of the required inputs is a term-document matrix
tdm = data.transpose()
tdm.head()

Title,No Such Thing,Why Georgia,My Stupid Mouth,Your Body Is A Wonderland,Neon,City Love,83,3X5,Love Song For No One,Back To You,...,"""Helpless""","""Love on the Weekend""","""In the Blood""","""Changing""","""Theme from ""The Search for Everything""""","""Moving On and Getting Over""","""Never on the Day You Leave""","""Rosie""","""Roll It on Home""","""You're Gonna Live Forever in Me"""
accepted,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
act,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
actors,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
address,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
adore,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
# We're going to put the term-document matrix into a new gensim format, from df --> sparse matrix --> gensim corpus
# Contains the term and count of each term
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)
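
Conceptually, a gensim corpus is one bag-of-words per document: a list of `(term_id, count)` pairs with zero counts left out. `Sparse2Corpus` treats each column as a document by default, which is why the matrix is transposed first. The pure-Python sketch below imitates that conversion on a made-up matrix; `to_bow_corpus` is a hypothetical helper, not a gensim function.

```python
# Hypothetical term-document matrix: rows are terms, columns are documents
terms = ["heart", "home", "love"]
tdm_toy = [
    [1, 0],  # heart
    [1, 1],  # home
    [2, 0],  # love
]

def to_bow_corpus(term_doc_matrix):
    """Turn each column (document) into a list of (term_id, count) pairs, skipping zeros."""
    n_docs = len(term_doc_matrix[0])
    return [
        [(term_id, row[doc]) for term_id, row in enumerate(term_doc_matrix) if row[doc] != 0]
        for doc in range(n_docs)
    ]

bow_corpus = to_bow_corpus(tdm_toy)
print(bow_corpus)
```

The term ids are just row positions, which is why gensim also needs a separate dictionary that maps each id back to its word.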

In [5]:
# Gensim also requires a dictionary mapping each term's index in the term-document matrix to the term itself
with open("lyrics/cv_s_stop.pkl", "rb") as f:
    cv = pickle.load(f)
id2word = dict((v, k) for k, v in cv.vocabulary_.items())

In [6]:
# id2word is a dictionary that indexes each word
id2word

{1420: 'welcome',
 973: 'real',
 1464: 'world',
 1040: 'said',
 217: 'condescendingly',
 1069: 'seat',
 697: 'life',
 915: 'plot',
 92: 'black',
 1431: 'white',
 713: 'lived',
 324: 'dreams',
 936: 'prom',
 661: 'kings',
 316: 'drama',
 953: 'queens',
 608: 'id',
 1288: 'think',
 82: 'best',
 575: 'hiding',
 1139: 'sleeve',
 734: 'love',
 1272: 'tell',
 1196: 'stay',
 625: 'inside',
 705: 'lines',
 1164: 'somethings',
 84: 'better',
 1390: 'want',
 1033: 'run',
 529: 'halls',
 576: 'high',
 1058: 'school',
 1060: 'scream',
 743: 'lungs',
 1282: 'theres',
 1286: 'thing',
 696: 'lie',
 1489: 'youve',
 1011: 'rise',
 493: 'good',
 114: 'boys',
 476: 'girls',
 146: 'called',
 1008: 'right',
 1326: 'track',
 377: 'faded',
 547: 'hats',
 500: 'grabbing',
 241: 'credits',
 762: 'maybe',
 1331: 'transfers',
 971: 'read',
 106: 'books',
 23: 'answers',
 873: 'parents',
 1284: 'theyre',
 472: 'getting',
 846: 'older',
 1455: 'wonder',
 1285: 'theyve',
 1449: 'wished',
 773: 'memories',
 1308: 't

One way to pick the number of topics is to compute a coherence score for each candidate number and compare the scores. Below I simply try 2, 3, and 4 topics and inspect the results by eye.
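
To give a feel for what a coherence score measures, here is a small hand-rolled sketch of the UMass coherence measure, which rewards topics whose top words tend to appear in the same documents. This is an illustration under stated assumptions, not the code used in this notebook: `umass_coherence` and the toy `documents` are made up, and gensim provides a `CoherenceModel` class that should be preferred for a real trained model.

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents):
    """UMass coherence for one topic.

    top_words: the topic's words, ordered from most to least probable.
    documents: list of sets of terms, one set per document.
    Assumes every word in top_words occurs in at least one document.
    """
    def doc_freq(*words):
        # Number of documents that contain all of the given words
        return sum(1 for doc in documents if all(w in doc for w in words))

    score = 0.0
    # For each pair, condition on the higher-ranked word of the two
    for higher, lower in combinations(top_words, 2):
        score += math.log((doc_freq(higher, lower) + 1) / doc_freq(higher))
    return score

# Hypothetical toy corpus
documents = [
    {"love", "heart", "home"},
    {"love", "heart"},
    {"whiskey", "water"},
    {"love", "water"},
]

coherent = umass_coherence(["love", "heart"], documents)
mixed = umass_coherence(["love", "whiskey"], documents)
print(coherent, mixed)
```

Scores closer to zero indicate words that co-occur more often. In practice you would compute the score for each candidate number of topics and look for the point where it stops improving.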

In [7]:
# Now that we have the corpus (term-document matrix) and id2word (dictionary of location: term),
# we need to specify two other parameters as well - the number of topics and the number of passes
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10)
lda.print_topics()

[(0,
  '0.027*"love" + 0.015*"home" + 0.015*"heart" + 0.012*"dont" + 0.011*"feel" + 0.009*"way" + 0.009*"say" + 0.008*"waiting" + 0.008*"ill" + 0.008*"man"'),
 (1,
  '0.015*"dont" + 0.014*"love" + 0.012*"tell" + 0.011*"time" + 0.009*"life" + 0.009*"whiskey" + 0.008*"ill" + 0.008*"going" + 0.008*"come" + 0.008*"want"')]

In [8]:
# LDA for num_topics = 3
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)
lda.print_topics()

[(0,
  '0.012*"whiskey" + 0.011*"love" + 0.011*"time" + 0.010*"youre" + 0.010*"dont" + 0.009*"changing" + 0.009*"gone" + 0.008*"thats" + 0.008*"water" + 0.008*"heart"'),
 (1,
  '0.033*"love" + 0.019*"dont" + 0.018*"home" + 0.013*"feel" + 0.013*"going" + 0.012*"say" + 0.011*"want" + 0.011*"come" + 0.009*"time" + 0.008*"right"'),
 (2,
  '0.018*"waiting" + 0.017*"tell" + 0.016*"heart" + 0.015*"ill" + 0.015*"life" + 0.014*"half" + 0.013*"world" + 0.010*"love" + 0.010*"body" + 0.009*"change"')]

In [9]:
# LDA for num_topics = 4
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()

[(0,
  '0.027*"love" + 0.015*"life" + 0.014*"time" + 0.013*"heart" + 0.013*"dont" + 0.011*"good" + 0.011*"way" + 0.010*"want" + 0.010*"half" + 0.010*"right"'),
 (1,
  '0.042*"love" + 0.023*"feel" + 0.016*"whiskey" + 0.015*"man" + 0.012*"water" + 0.012*"hold" + 0.011*"gone" + 0.010*"ill" + 0.009*"youre" + 0.009*"going"'),
 (2,
  '0.018*"waiting" + 0.014*"world" + 0.014*"dont" + 0.012*"change" + 0.012*"ooh" + 0.011*"moving" + 0.011*"time" + 0.010*"changing" + 0.010*"day" + 0.010*"leave"'),
 (3,
  '0.025*"home" + 0.017*"dont" + 0.016*"tell" + 0.015*"say" + 0.013*"ill" + 0.012*"love" + 0.011*"come" + 0.011*"right" + 0.010*"bed" + 0.010*"face"')]

## Topic Modeling - Attempt #2 (Nouns Only)

One popular trick is to look only at terms from one part of speech (only nouns, only adjectives, etc.). The tags come from the Penn Treebank tag set: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
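
In the Penn Treebank tag set, noun tags start with `NN` (`NN`, `NNS`, `NNP`, `NNPS`) and adjective tags start with `JJ` (`JJ`, `JJR`, `JJS`), so filtering by part of speech comes down to a prefix check on the tag. The sketch below shows just that step on hand-tagged tokens. The tags are assigned by me for illustration rather than produced by a tagger, and `keep_by_tag_prefix` is a hypothetical helper.

```python
# Hypothetical pre-tagged tokens, in the (word, tag) shape that nltk.pos_tag returns
tagged = [
    ("stupid", "JJ"), ("mouth", "NN"), ("has", "VBZ"), ("got", "VBN"),
    ("me", "PRP"), ("in", "IN"), ("trouble", "NN"),
]

def keep_by_tag_prefix(tagged_tokens, prefixes):
    """Keep the words whose tag starts with any of the given prefixes."""
    return [word for word, tag in tagged_tokens if tag.startswith(tuple(prefixes))]

nouns_only = keep_by_tag_prefix(tagged, ["NN"])
nouns_and_adjectives = keep_by_tag_prefix(tagged, ["NN", "JJ"])

print(nouns_only)
print(nouns_and_adjectives)
```

Matching on the two-letter prefix rather than the exact tag is what lets a single check cover singular, plural, and proper nouns at once.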

In [8]:
# Let's create a function to pull out nouns from a string of text
from nltk import word_tokenize, pos_tag

def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)] 
    return ' '.join(all_nouns)

In [11]:
# Read in the cleaned data, before the CountVectorizer step
data_clean = pd.read_pickle('lyrics/corpus.pkl')
data_clean

Title,Album,Year,Track #,Title,Track Length,Lyrics
No Such Thing,Room For Squares,2001,1,No Such Thing,231,welcome to the real world she said to me conde...
Why Georgia,Room For Squares,2001,2,Why Georgia,269,i am driving up in the kind of morning that l...
My Stupid Mouth,Room For Squares,2001,3,My Stupid Mouth,225,my stupid mouth has got me in trouble i said t...
Your Body Is A Wonderland,Room For Squares,2001,4,Your Body Is A Wonderland,250,we got the afternoon you got this room for two...
Neon,Room For Squares,2001,5,Neon,262,when sky blue gets dark enough to see the colo...
...,...,...,...,...,...,...
"""Moving On and Getting Over""",The Search For Everything,2017,8,"""Moving On and Getting Over""",261,moving on and getting over are not the same it...
"""Never on the Day You Leave""",The Search For Everything,2017,9,"""Never on the Day You Leave""",224,no its never on the day you leave that you won...
"""Rosie""",The Search For Everything,2017,10,"""Rosie""",243,rosie come down and get the door for me im dru...
"""Roll It on Home""",The Search For Everything,2017,11,"""Roll It on Home""",205,one last drink to drink wishful thinkin and th...


In [12]:
# Apply the nouns function to the lyrics to keep only the nouns
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

data_nouns = pd.DataFrame(data_clean.Lyrics.apply(nouns))
data_nouns

[nltk_data] Downloading package punkt to /Users/numalj/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/numalj/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Title,Lyrics
No Such Thing,welcome world seat life plot well i dreams kin...
Why Georgia,i kind morning afternoon gloom exits apartment...
My Stupid Mouth,mouth trouble i date dinner yesterday i change...
Your Body Is A Wonderland,afternoon room thing left mile inch skin porce...
Neon,sky colors city trail ruby sunrise one shes st...
...,...
"""Moving On and Getting Over""",mind i time friends i mind i time time fact yo...
"""Never on the Day You Leave""",day youll sound day grows time sing shell hair...
"""Rosie""",rosie door i dream rain heart hand room tune r...
"""Roll It on Home""",drink thinkin bar walls journey jukebox singin...


In [13]:
# Create a new document-term matrix using only nouns
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Re-add the additional stop words since we are recreating the document-term matrix
add_stop_words = ['just', 'im', 'like', 'know', 'got', 'oh']
stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

# Recreate a document-term matrix with only nouns
cvn = CountVectorizer(stop_words=stop_words)
data_cvn = cvn.fit_transform(data_nouns.Lyrics)
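# Note: scikit-learn 1.2 removed get_feature_names(); on newer versions use get_feature_names_out()
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names())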
data_dtmn.index = data_nouns.index
data_dtmn

Title,actors,address,advice,affair,afternoon,age,aint,air,airports,alarms,...,year,years,yellow,yes,yesterday,york,youd,youll,youre,youve
No Such Thing,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,3
Why Georgia,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
My Stupid Mouth,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
Your Body Is A Wonderland,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
Neon,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
"""Moving On and Getting Over""",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
"""Never on the Day You Leave""",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
"""Rosie""",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"""Roll It on Home""",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [14]:
# Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())

In [15]:
# Let's start with 2 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=2, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.027*"home" + 0.025*"way" + 0.024*"heart" + 0.020*"time" + 0.019*"man" + 0.015*"world" + 0.014*"water" + 0.013*"half" + 0.013*"love" + 0.010*"youre"'),
 (1,
  '0.029*"life" + 0.029*"love" + 0.019*"time" + 0.015*"heart" + 0.015*"dont" + 0.012*"thing" + 0.012*"home" + 0.009*"day" + 0.008*"youre" + 0.008*"world"')]

In [16]:
# Let's try topics = 3
ldan = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.032*"love" + 0.030*"time" + 0.023*"man" + 0.019*"way" + 0.019*"heart" + 0.019*"world" + 0.016*"water" + 0.014*"life" + 0.011*"thing" + 0.010*"ooh"'),
 (1,
  '0.056*"home" + 0.022*"way" + 0.015*"time" + 0.013*"face" + 0.013*"boy" + 0.013*"sadness" + 0.012*"repair" + 0.010*"life" + 0.010*"body" + 0.010*"ill"'),
 (2,
  '0.032*"heart" + 0.024*"half" + 0.023*"life" + 0.021*"dont" + 0.017*"love" + 0.014*"room" + 0.013*"bed" + 0.013*"youre" + 0.012*"home" + 0.011*"world"')]

In [17]:
# Let's try 4 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.064*"home" + 0.034*"man" + 0.030*"life" + 0.016*"water" + 0.014*"whiskey" + 0.013*"time" + 0.013*"sadness" + 0.012*"repair" + 0.012*"face" + 0.010*"day"'),
 (1,
  '0.028*"dont" + 0.023*"ooh" + 0.023*"life" + 0.020*"world" + 0.019*"bed" + 0.018*"daughters" + 0.013*"assassin" + 0.013*"head" + 0.011*"thing" + 0.010*"door"'),
 (2,
  '0.052*"love" + 0.028*"time" + 0.026*"way" + 0.017*"youre" + 0.015*"body" + 0.014*"day" + 0.011*"aint" + 0.010*"age" + 0.010*"thing" + 0.009*"life"'),
 (3,
  '0.055*"heart" + 0.026*"half" + 0.022*"world" + 0.022*"time" + 0.021*"way" + 0.018*"love" + 0.013*"dont" + 0.012*"helpless" + 0.012*"mind" + 0.011*"heartbreak"')]

## Topic Modeling - Attempt #3 (Nouns and Adjectives)

In [18]:
# Let's create a function to pull out nouns and adjectives from a string of text
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    nouns_adj = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)] 
    return ' '.join(nouns_adj)

In [19]:
# Apply the nouns_adj function to the lyrics to keep only nouns and adjectives
data_nouns_adj = pd.DataFrame(data_clean.Lyrics.apply(nouns_adj))
data_nouns_adj

Title,Lyrics
No Such Thing,welcome real world seat life plot black white ...
Why Georgia,i kind morning afternoon gloom exits apartment...
My Stupid Mouth,stupid mouth trouble i much date dinner yester...
Your Body Is A Wonderland,afternoon room thing ive left mile inch skin p...
Neon,sky blue dark colors city trail ruby red diamo...
...,...
"""Moving On and Getting Over""",same im older cant mind i time friends long i ...
"""Never on the Day You Leave""",day goodbye youll old familiar sound day grows...
"""Rosie""",rosie door rosie i dream january rain heart ha...
"""Roll It on Home""",last drink wishful thinkin bar walls journey j...


In [38]:
# Re-add the additional stop words since we are recreating the document-term matrix
add_stop_words = ['just', 'im', 'like', 'know', 'got', 'oh','dont', 'ill']
stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

# Create a new document-term matrix using only nouns and adjectives, also remove common words with max_df
cvna = CountVectorizer(stop_words=stop_words, max_df=.8, min_df=0.05)
data_cvna = cvna.fit_transform(data_nouns_adj.Lyrics)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names())
data_dtmna.index = data_nouns_adj.index
data_dtmna

Title,aint,alive,baby,bed,best,big,black,blue,broken,cause,...,ways,white,wont,words,world,wrong,yeah,youll,youre,youve
No Such Thing,0,1,0,0,1,0,1,0,0,0,...,0,2,0,0,4,0,0,0,0,3
Why Georgia,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
My Stupid Mouth,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
Your Body Is A Wonderland,0,0,0,1,0,2,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
Neon,0,0,0,0,0,0,0,1,0,0,...,0,1,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
"""Moving On and Getting Over""",0,0,1,0,0,0,0,0,0,2,...,0,0,0,0,0,1,0,0,0,1
"""Never on the Day You Leave""",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
"""Rosie""",0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
"""Roll It on Home""",0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


From the `CountVectorizer` documentation:

**max_df**: float in range [0.0, 1.0] or int (default=1.0)
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

**min_df**: float in range [0.0, 1.0] or int (default=1)
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
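
To make the two thresholds concrete, the sketch below reimplements the filtering rule by hand for the float (proportion) case only. It is an illustration of the rule as described above, not scikit-learn's actual code; `filter_by_document_frequency` and the toy documents are made up.

```python
def filter_by_document_frequency(documents, min_df=0.05, max_df=0.8):
    """Keep terms whose document frequency lies between min_df and max_df (as proportions)."""
    n_docs = len(documents)

    # Document frequency: in how many documents does each term occur at least once?
    doc_freq = {}
    for doc in documents:
        for term in set(doc.split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1

    # Terms strictly below `low` or strictly above `high` are ignored
    low, high = min_df * n_docs, max_df * n_docs
    return sorted(term for term, df in doc_freq.items() if low <= df <= high)

# Hypothetical toy corpus: "love" is in every document, most other words in only one
toy_docs = [
    "love home heart",
    "love whiskey",
    "love water home",
    "love neon",
    "love world",
]

print(filter_by_document_frequency(toy_docs, min_df=0.3, max_df=0.8))
print(filter_by_document_frequency(toy_docs))
```

With `max_df=0.8`, a word like "love" that appears in every document is dropped as a corpus-specific stop word, while `min_df` removes words that appear in too few documents to help define a topic.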

In [39]:
# Create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [25]:
# Let's start with 2 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=100)
ldana.print_topics()

[(0,
  '0.015*"world" + 0.011*"half" + 0.009*"goodbye" + 0.007*"body" + 0.007*"bed" + 0.006*"long" + 0.006*"sadness" + 0.006*"baby" + 0.006*"bold" + 0.005*"yeah"'),
 (1,
  '0.023*"whiskey" + 0.018*"water" + 0.012*"face" + 0.012*"olivia" + 0.011*"helpless" + 0.009*"age" + 0.009*"gon" + 0.009*"weekend" + 0.008*"aint" + 0.008*"days"')]

In [26]:
# Let's try 3 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=100)
ldana.print_topics()

[(0,
  '0.021*"world" + 0.013*"whiskey" + 0.010*"water" + 0.009*"body" + 0.008*"bed" + 0.008*"yeah" + 0.008*"sadness" + 0.007*"days" + 0.007*"face" + 0.007*"baby"'),
 (1,
  '0.025*"half" + 0.017*"goodbye" + 0.012*"heartbreak" + 0.008*"hearts" + 0.008*"long" + 0.008*"assassin" + 0.008*"warfare" + 0.007*"youll" + 0.007*"night" + 0.006*"pain"'),
 (2,
  '0.019*"helpless" + 0.016*"gon" + 0.016*"weekend" + 0.015*"roll" + 0.013*"honey" + 0.010*"water" + 0.010*"wave" + 0.010*"blood" + 0.008*"nobodys" + 0.007*"rosie"')]

In [27]:
# Re-run with 3 topics (LDA starts from a random initialization, so the topics differ between runs)
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=100)
ldana.print_topics()

[(0,
  '0.038*"whiskey" + 0.022*"water" + 0.020*"olivia" + 0.019*"face" + 0.015*"age" + 0.013*"aint" + 0.013*"days" + 0.012*"boy" + 0.012*"worry" + 0.012*"verb"'),
 (1,
  '0.035*"half" + 0.024*"goodbye" + 0.016*"heartbreak" + 0.012*"assassin" + 0.012*"warfare" + 0.011*"night" + 0.011*"hearts" + 0.009*"fight" + 0.009*"lovers" + 0.009*"long"'),
 (2,
  '0.016*"world" + 0.008*"body" + 0.007*"baby" + 0.007*"sadness" + 0.007*"bed" + 0.006*"bold" + 0.006*"old" + 0.006*"gon" + 0.006*"yeah" + 0.005*"neon"')]

In [28]:
# Let's try 4 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=100)
ldana.print_topics()

[(0,
  '0.025*"world" + 0.015*"bold" + 0.014*"gon" + 0.013*"repair" + 0.013*"helpless" + 0.012*"baby" + 0.011*"weekend" + 0.010*"water" + 0.010*"roll" + 0.009*"train"'),
 (1,
  '0.016*"whiskey" + 0.015*"half" + 0.011*"face" + 0.011*"goodbye" + 0.010*"water" + 0.008*"bed" + 0.008*"sadness" + 0.008*"olivia" + 0.008*"aint" + 0.008*"free"'),
 (2,
  '0.001*"feet" + 0.001*"older" + 0.001*"safe" + 0.001*"think" + 0.001*"clothes" + 0.001*"sea" + 0.001*"hair" + 0.001*"youve" + 0.001*"lights" + 0.001*"sky"'),
 (3,
  '0.016*"neon" + 0.013*"body" + 0.013*"world" + 0.012*"living" + 0.009*"hands" + 0.009*"wonderland" + 0.008*"great" + 0.008*"shes" + 0.008*"long" + 0.008*"indoors"')]

The topics aren't too convincing. Let's see whether each album maps to a more general topic instead.

## Topic Modeling on Albums - Nouns and Adjectives

In [2]:
# Read in the cleaned data, before the CountVectorizer step
data_clean_album = pd.read_pickle('lyrics/album_corpus.pkl')
data_clean_album

Year,Album,Year,Total Album Length,Num_of_Tracks,Lyrics
2001,Room For Squares,2001,3259,13,welcome to the real world she said to me conde...
2003,Heavier Things,2003,2780,10,i worry i weigh three times my body i worry i ...
2006,Continuum,2006,2987,12,me and all my friends were all misunderstood t...
2009,Battle Studies,2009,2798,11,lightning strikes inside my chest to keep me u...
2012,Born And Raised,2012,2799,12,ooh ooh ooh ooh close your eyes and clone you...
2013,Paradise Valley,2013,2410,11,dear marie tell me what it was i used to be d...
2017,The Search For Everything,2017,2629,12,i still feel like your man i still feel like y...


In [6]:
# Let's create a function to pull out nouns and adjectives from a string of text
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    nouns_adj = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)] 
    return ' '.join(nouns_adj)

In [9]:
# Apply the nouns_adj function to the album lyrics to keep only nouns and adjectives
data_nouns_adj = pd.DataFrame(data_clean_album.Lyrics.apply(nouns_adj))
data_nouns_adj

Year,Lyrics
2001,welcome real world seat life plot black white ...
2003,i i times body i i fear morning calm i rock ca...
2006,friends misunderstood nothing way everything t...
2009,strikes chest night dream ways pain clouds sul...
2012,ooh ooh ooh eyes clone heart army innocence ev...
2013,dear marie dear marie youre road i cant boy fi...
2017,i man i man i i i man prettiest girl room come...


In [17]:
# Create a new document-term matrix using only nouns and adjectives
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Re-add the additional stop words since we are recreating the document-term matrix
add_stop_words = ['just', 'im', 'like', 'know', 'got', 'oh','dont', 'ill', 'ooh']
stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

# Create a new document-term matrix using only nouns and adjectives, also remove common words with max_df
cvna = CountVectorizer(stop_words=stop_words, max_df=.8, min_df=0.05)
data_cvna = cvna.fit_transform(data_nouns_adj.Lyrics)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names())
data_dtmna.index = data_nouns_adj.index
data_dtmna

Year,actors,address,advice,affair,afternoon,age,aint,air,airports,alarms,...,years,yellow,yes,yesterday,york,youd,youll,young,younger,youve
2001,1,0,0,0,2,0,0,0,0,0,...,1,0,0,1,0,0,1,0,0,3
2003,0,1,0,0,2,0,0,1,1,0,...,0,0,1,0,0,1,0,0,0,0
2006,0,0,3,0,0,0,3,0,0,1,...,0,1,0,0,0,0,2,2,0,0
2009,0,0,0,1,0,0,0,2,0,0,...,0,1,1,1,3,0,2,2,1,0
2012,0,0,0,0,0,10,9,1,0,0,...,0,0,0,1,1,1,0,0,0,1
2013,0,0,0,0,0,1,4,0,0,0,...,1,0,0,0,0,1,5,0,0,3
2017,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,2,2,0,1


In [18]:
from gensim import matutils, models
import scipy.sparse

# Create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [31]:
# Let's start with 2 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=500)
ldana.print_topics()

[(0,
  '0.017*"half" + 0.016*"world" + 0.011*"goodbye" + 0.009*"bold" + 0.008*"repair" + 0.008*"war" + 0.008*"hearts" + 0.007*"heartbreak" + 0.007*"train" + 0.006*"pain"'),
 (1,
  '0.013*"whiskey" + 0.011*"water" + 0.009*"world" + 0.008*"body" + 0.007*"face" + 0.007*"days" + 0.007*"sadness" + 0.007*"bed" + 0.007*"olivia" + 0.006*"neon"')]

In [30]:
# Let's try 3 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=500)
ldana.print_topics()

[(0,
  '0.019*"sadness" + 0.016*"bed" + 0.015*"daughters" + 0.012*"mothers" + 0.012*"someday" + 0.010*"yeah" + 0.010*"body" + 0.010*"oooooo" + 0.008*"bigger" + 0.008*"wheel"'),
 (1,
  '0.022*"world" + 0.009*"bold" + 0.008*"neon" + 0.008*"repair" + 0.007*"train" + 0.007*"body" + 0.006*"shes" + 0.006*"trust" + 0.006*"baby" + 0.006*"living"'),
 (2,
  '0.020*"whiskey" + 0.018*"half" + 0.016*"water" + 0.013*"face" + 0.013*"goodbye" + 0.010*"olivia" + 0.009*"free" + 0.009*"helpless" + 0.008*"night" + 0.008*"heartbreak"')]

In [29]:
# Let's try 4 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=500)
ldana.print_topics()

[(0,
  '0.026*"world" + 0.015*"bold" + 0.013*"repair" + 0.011*"train" + 0.010*"baby" + 0.010*"trust" + 0.009*"lovin" + 0.008*"old" + 0.008*"youll" + 0.008*"burning"'),
 (1,
  '0.040*"half" + 0.027*"goodbye" + 0.019*"heartbreak" + 0.014*"warfare" + 0.014*"assassin" + 0.012*"hearts" + 0.012*"night" + 0.010*"fight" + 0.010*"long" + 0.010*"lovers"'),
 (2,
  '0.011*"neon" + 0.011*"helpless" + 0.010*"gon" + 0.010*"body" + 0.010*"weekend" + 0.010*"world" + 0.009*"long" + 0.009*"living" + 0.009*"roll" + 0.008*"honey"'),
 (3,
  '0.026*"whiskey" + 0.016*"water" + 0.014*"sadness" + 0.014*"olivia" + 0.013*"face" + 0.012*"bed" + 0.011*"daughters" + 0.010*"age" + 0.009*"days" + 0.009*"yeah"')]

Looking at the three-topic model, the topics seem to be:
1.  Family (Mothers and Daughters)
2.  Living Life
3.  Breakups

## Topic Modeling - NMF (Non-negative Matrix Factorization)

In [34]:
from gensim.models import ldamodel
import gensim.corpora
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import NMF
from sklearn.preprocessing import normalize
import pickle

In [36]:
# Read in the cleaned data, before the CountVectorizer step
data_clean_songs = pd.read_pickle('lyrics/corpus.pkl')
data_clean_songs

Title,Album,Year,Track #,Title,Track Length,Lyrics
No Such Thing,Room For Squares,2001,1,No Such Thing,231,welcome to the real world she said to me conde...
Why Georgia,Room For Squares,2001,2,Why Georgia,269,i am driving up in the kind of morning that l...
My Stupid Mouth,Room For Squares,2001,3,My Stupid Mouth,225,my stupid mouth has got me in trouble i said t...
Your Body Is A Wonderland,Room For Squares,2001,4,Your Body Is A Wonderland,250,we got the afternoon you got this room for two...
Neon,Room For Squares,2001,5,Neon,262,when sky blue gets dark enough to see the colo...
...,...,...,...,...,...,...
"""Moving On and Getting Over""",The Search For Everything,2017,8,"""Moving On and Getting Over""",261,moving on and getting over are not the same it...
"""Never on the Day You Leave""",The Search For Everything,2017,9,"""Never on the Day You Leave""",224,no its never on the day you leave that you won...
"""Rosie""",The Search For Everything,2017,10,"""Rosie""",243,rosie come down and get the door for me im dru...
"""Roll It on Home""",The Search For Everything,2017,11,"""Roll It on Home""",205,one last drink to drink wishful thinkin and th...


In [47]:
cv = CountVectorizer(stop_words='english', analyzer='word', max_features=5000)
song_cv = cv.fit_transform(data_clean_songs.Lyrics)
song_dtm = pd.DataFrame(song_cv.toarray(), columns=cv.get_feature_names())
song_dtm.index = data_clean_songs.index
song_dtm


Title,accepted,act,actors,address,adore,advice,affair,afternoon,age,ago,...,yellow,yes,yesterday,york,youd,youll,young,younger,youre,youve
No Such Thing,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
Why Georgia,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
My Stupid Mouth,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
Your Body Is A Wonderland,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
Neon,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
"""Moving On and Getting Over""",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
"""Never on the Day You Leave""",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,2,0,0,0,0
"""Rosie""",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"""Roll It on Home""",0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1


In [51]:
transformer = TfidfTransformer(smooth_idf=False)
tfidf_song_cv = transformer.fit_transform(song_cv)

In [52]:
# L1-normalize each row (song) so that its TF-IDF weights sum to 1
xtfidf_norm = normalize(tfidf_song_cv, norm='l1', axis=1)
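
The two cells above weight the raw counts and then rescale each song's row. As a rough sketch of the arithmetic, assuming the documented behaviour of `TfidfTransformer(smooth_idf=False)` (idf = ln(n_docs / df) + 1, followed by its default L2 row normalization), the pure-Python version below reproduces both steps on a made-up count matrix. `tfidf_rows` and `l1_normalize` are hypothetical helpers, not scikit-learn functions.

```python
import math

def tfidf_rows(count_rows):
    """Weight counts by idf = ln(n_docs / df) + 1, then L2-normalize each row.

    Assumes every term (column) occurs in at least one document.
    """
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) + 1 for d in df]

    rows = []
    for row in count_rows:
        weighted = [count * weight for count, weight in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        rows.append([v / norm if norm else 0.0 for v in weighted])
    return rows

def l1_normalize(rows):
    """Rescale each row so that its absolute values sum to 1."""
    out = []
    for row in rows:
        total = sum(abs(v) for v in row)
        out.append([v / total if total else 0.0 for v in row])
    return out

# Hypothetical counts: 2 documents x 3 terms
counts = [
    [2, 1, 0],
    [1, 0, 1],
]

weights = l1_normalize(tfidf_rows(counts))
for row in weights:
    print([round(v, 3) for v in row])
```

The first term appears in both documents, so its idf is the minimum value of 1, while the terms unique to one document get a higher weight.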

In [63]:
# Num of topics
num_topics = 4

# Create an NMF model with the chosen number of topics and fit it
model = NMF(n_components=num_topics, init='nndsvd')
model.fit(xtfidf_norm)

NMF(alpha=0.0, beta_loss='frobenius', init='nndsvd', l1_ratio=0.0, max_iter=200,
    n_components=4, random_state=None, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)
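
NMF approximates the non-negative document-term matrix V as a product W × H of two smaller non-negative matrices: W holds the topic weights per document and H holds the term weights per topic (what scikit-learn exposes as `components_`). The model above uses scikit-learn's coordinate descent solver (`solver='cd'`). The sketch below swaps in the classic multiplicative update rule instead, purely because it is short enough to write in plain Python. All names and the toy matrix are made up for illustration.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Frobenius norm of V - W H, i.e. the reconstruction error."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5

def nmf_multiplicative(v, n_components, n_iter=200, seed=0, eps=1e-9):
    """Factor V into W (docs x topics) and H (topics x terms) with multiplicative updates."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_components)] for _ in range(n_rows)]
    h = [[rng.random() + eps for _ in range(n_cols)] for _ in range(n_components)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_cols)]
             for k in range(n_components)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_components)]
             for i in range(n_rows)]
    return w, h

# Hypothetical matrix with two obvious "topics" (terms 0-1 and terms 2-3)
v_toy = [
    [1, 1, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 3, 3],
]

w_toy, h_toy = nmf_multiplicative(v_toy, n_components=2)
print(frobenius_error(v_toy, w_toy, h_toy))
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout, which is what makes the resulting topics readable as additive combinations of words.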

In [61]:
def get_nmf_topics(model, n_top_words):
    '''Return a DataFrame with one column per topic listing its n_top_words highest-weighted terms.'''
    # The word ids need to be reverse-mapped to the words so we can print the topic terms
    # (scikit-learn 1.2 removed get_feature_names(); use get_feature_names_out() there)
    feat_names = cv.get_feature_names()

    word_dict = {}
    # One row of components_ per topic
    for i in range(model.components_.shape[0]):
        # For each topic, take the indices of the n_top_words largest weights, largest first
        words_ids = model.components_[i].argsort()[:-n_top_words - 1:-1]
        words = [feat_names[key] for key in words_ids]
        word_dict['Topic # ' + '{:02d}'.format(i + 1)] = words

    return pd.DataFrame(word_dict)

In [62]:
get_nmf_topics(model, 20)

Unnamed: 0,Topic # 01,Topic # 02,Topic # 03,Topic # 04,Topic # 05,Topic # 06
0,love,olivia,im,changing,goodbye,feel
1,verb,like,helpless,hearts,say,man
2,aint,thinking,tell,young,try,like
3,dreaming,man,moving,run,gone,babe
4,thing,steal,waiting,old,want,wash
5,oh,gets,got,round,break,know
6,hold,time,change,senses,heart,whiskey
7,youve,taken,going,follow,going,water
8,yeah,im,time,apart,wildly,cause
9,youre,sleep,ill,fences,plane,dont
