# Model - Rap Genius

## Topic Modeling

For topic modeling we use a technique called Latent Dirichlet Allocation (LDA), with NLTK for part-of-speech tagging. We want to know which themes arise in each rapper's lyrics, and who tends to talk about what.

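LDA is a probabilistic model. The Dirichlet distribution is a probability distribution over sets of proportions that sum to 1, and LDA uses it as the prior for each document's mix of topics and each topic's mix of words. Given the words in a document, we are trying to estimate the probability that the document is about each topic.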
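
To make the Dirichlet part concrete, here is a minimal sketch using only the standard library. It draws one set of topic proportions for a single document. The three-topic setup, the prior value 0.5, and the helper name `sample_dirichlet` are assumptions chosen for illustration, not settings taken from the models in this notebook.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution.

    Standard construction: draw independent Gamma(alpha_i, 1) values
    and normalise them so they sum to 1.
    """
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical setup: one document, three topics, symmetric prior.
topic_mix = sample_dirichlet([0.5, 0.5, 0.5], rng)

print([round(p, 3) for p in topic_mix])
print(sum(topic_mix))  # the proportions sum to 1 (up to float rounding)
```

Each sample is a candidate "this document is x% topic 0, y% topic 1, z% topic 2" split, which is the kind of quantity LDA tries to infer from the observed words.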

### Attempt 1 -- All Text
This is not going to work the first time through. 

In [43]:
from gensim import matutils, models # matutils will turn the array into a bag of words
import pandas as pd
import pickle
import scipy.sparse
from nltk import word_tokenize, pos_tag
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

In [5]:
data = pd.read_pickle('DataFrame_with_new_stopwords.pkl')
data

Artist,02,10,100,1000,1008,10yearolds,11,12,125,140,...,zeros,zip,zod,zombie,zone,zonin,zöld,ölén,úgy,な音楽
Drake,0,0,6,0,0,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
Jayz,0,0,2,0,0,0,2,0,0,1,...,0,0,0,0,0,0,0,0,0,0
Nas,0,1,0,1,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0
Eminem,1,0,0,0,0,1,0,1,0,0,...,0,0,1,1,0,0,0,0,0,0
Future,0,0,0,0,1,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,1
KanyeWest,0,0,0,0,0,0,0,1,2,0,...,0,0,0,1,0,3,1,1,1,0


In [7]:
# transpose to create a term-document matrix
tdm = data.transpose()
tdm.head()

Artist,Drake,Jayz,Nas,Eminem,Future,KanyeWest
2,0,0,0,1,0,0
10,0,0,1,0,0,0
100,6,2,0,0,0,0
1000,0,0,1,0,0,0
1008,0,0,0,0,1,0


In [8]:
# now turn the tdm into a sparse matrix, which stores only the nonzero entries
# (most of the counts here are zero; a dense matrix stores every entry, zeros included)

sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)
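
For intuition about what the corpus contains, here is a plain-Python sketch of the same idea: each document (a column of the term-document matrix) becomes a list of nonzero `(term_id, count)` pairs. This is not gensim's implementation. It uses only the five rows shown in the `tdm.head()` output above, and the helper name `column_to_bow` is made up for this example.

```python
# Term-document counts copied from the tdm.head() output above:
# rows are terms, columns are artists (documents).
artists = ["Drake", "Jayz", "Nas", "Eminem", "Future", "KanyeWest"]
tdm_rows = [
    [0, 0, 0, 1, 0, 0],  # term id 0
    [0, 0, 1, 0, 0, 0],  # term id 1
    [6, 2, 0, 0, 0, 0],  # term id 2
    [0, 0, 1, 0, 0, 0],  # term id 3
    [0, 0, 0, 0, 1, 0],  # term id 4
]

def column_to_bow(rows, col):
    """Keep only the nonzero (term_id, count) pairs for one document."""
    return [(term_id, row[col]) for term_id, row in enumerate(rows) if row[col]]

bows = {name: column_to_bow(tdm_rows, i) for i, name in enumerate(artists)}
for name, bow in bows.items():
    print(name, bow)
```

Zero counts are never stored, which is why a sparse representation is much smaller than the dense table when the vocabulary has thousands of terms.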

In [15]:
# gensim requires a dictionary that maps each term id to its term in the tdm
# cv.vocabulary_ maps term -> column index, so invert it to index -> term
with open('cv_stop.pkl', 'rb') as f:
    cv = pickle.load(f)
id2word = {v: k for k, v in cv.vocabulary_.items()}

In [16]:
id2word # to see only the first few items, use itertools.islice

{3548: 'produced',
 531: 'boi1da',
 1753: 'frank',
 1405: 'dukes',
 3092: 'noah',
 34: '40',
 4063: 'shebib',
 3086: 'nineteen85',
 2: '100',
 4921: 'verse',
 396: 'bein',
 846: 'chill',
 3692: 'real',
 3630: 'quick',
 3667: 'raptopaythebill',
 1615: 'feel',
 2631: 'little',
 454: 'bit',
 3151: 'oh',
 2668: 'lord',
 5175: 'worth',
 95: 'actions',
 2680: 'louder',
 5161: 'words',
 2084: 'high',
 1431: 'earth',
 5000: 'wanna',
 4816: 'turf',
 3866: 'rookie',
 4926: 'vet',
 4107: 'shoutout',
 456: 'bitches',
 2114: 'holdin',
 4034: 'set',
 3357: 'phone',
 2661: 'lookin',
 3373: 'pictures',
 3077: 'night',
 1890: 'gon',
 4883: 'upset',
 3986: 'scrollin',
 2558: 'left',
 1186: 'dawg',
 3690: 'ready',
 1928: 'greatest',
 2045: 'headed',
 2825: 'mean',
 5030: 'way',
 1774: 'friendly',
 2533: 'lay',
 2967: 'mothafuckin',
 4393: 'steph',
 1138: 'curry',
 4097: 'shot',
 1029: 'cookin',
 3946: 'sauce',
 827: 'chef',
 3483: 'pot',
 569: 'boy',
 32: '360',
 5189: 'wrist',
 3222: 'ovo',
 3699: 'real
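
The dictionary above is long, so it helps to look at only a few entries. A minimal sketch of the inversion and of `itertools.islice`, using a handful of pairs copied from the output above. The names `vocabulary` and `id2word_small` are stand-ins for `cv.vocabulary_` and `id2word`.

```python
from itertools import islice

# A few term -> column-index pairs in the shape of cv.vocabulary_,
# copied from the output above.
vocabulary = {
    "produced": 3548,
    "boi1da": 531,
    "frank": 1753,
    "dukes": 1405,
    "noah": 3092,
}

# Invert to index -> term, the mapping passed to gensim as id2word.
id2word_small = {index: term for term, index in vocabulary.items()}

# islice takes the first n items without building the full list first.
first_three = list(islice(id2word_small.items(), 3))
print(first_three)  # [(3548, 'produced'), (531, 'boi1da'), (1753, 'frank')]
```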

In [21]:
# Instantiate the LDA model

lda = models.LdaModel(
    corpus=corpus, # the term-document matrix as a gensim corpus
    id2word=id2word, # dict of term id -> term
    num_topics=2, # number of topics the model will try to discover
    passes=10, # number of passes over the whole corpus during training.
              # start with 10 and see what difference moving up or down makes.
              # more passes give the model more chances to converge on stable topics.
    random_state=42
)
# print the discovered topics
lda.print_topics()
# the output shows the top words of each topic with their weights; it does not name the topic.
# this output probably won't make much sense.

[(0,
  '0.007*"verse" + 0.007*"life" + 0.006*"bitch" + 0.006*"chorus" + 0.005*"just" + 0.005*"ass" + 0.004*"high" + 0.004*"new" + 0.004*"em" + 0.004*"time"'),
 (1,
  '0.010*"just" + 0.006*"need" + 0.006*"verse" + 0.006*"chorus" + 0.005*"want" + 0.005*"say" + 0.004*"bitch" + 0.004*"time" + 0.004*"make" + 0.004*"em"')]

#### Adjust the topics count

In [23]:
# let's try 3 topics
lda = models.LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=3,
    passes=10,
    random_state= 42
)

lda.print_topics()
# not getting anything better

[(0,
  '0.009*"just" + 0.007*"verse" + 0.007*"chorus" + 0.007*"bitch" + 0.006*"ass" + 0.006*"life" + 0.005*"need" + 0.005*"make" + 0.005*"real" + 0.005*"time"'),
 (1,
  '0.009*"just" + 0.005*"say" + 0.005*"let" + 0.005*"em" + 0.004*"fuckin" + 0.004*"think" + 0.004*"better" + 0.004*"bitch" + 0.004*"verse" + 0.004*"yall"'),
 (2,
  '0.007*"verse" + 0.006*"chorus" + 0.005*"just" + 0.005*"black" + 0.005*"bitch" + 0.005*"life" + 0.005*"em" + 0.005*"west" + 0.004*"kanye" + 0.004*"want"')]

In [24]:
# let's try 4 topics
lda = models.LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=4,
    passes=10,
    random_state= 42
)

lda.print_topics()
# not getting anything better...again

[(0,
  '0.009*"just" + 0.008*"verse" + 0.007*"chorus" + 0.007*"bitch" + 0.007*"ass" + 0.006*"life" + 0.005*"need" + 0.005*"make" + 0.005*"real" + 0.005*"time"'),
 (1,
  '0.010*"just" + 0.005*"let" + 0.005*"say" + 0.005*"em" + 0.005*"fuckin" + 0.005*"think" + 0.005*"better" + 0.004*"bitch" + 0.004*"verse" + 0.004*"yall"'),
 (2,
  '0.007*"verse" + 0.006*"chorus" + 0.006*"just" + 0.005*"black" + 0.005*"bitch" + 0.005*"life" + 0.005*"em" + 0.005*"west" + 0.005*"kanye" + 0.004*"want"'),
 (3,
  '0.000*"just" + 0.000*"verse" + 0.000*"time" + 0.000*"bitch" + 0.000*"life" + 0.000*"chorus" + 0.000*"ass" + 0.000*"make" + 0.000*"need" + 0.000*"em"')]

Going to need to go back and clean much more...

#### Adjust the number of passes

In [25]:
# I'll use 3 topics w/ 50 passes
lda = models.LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=3,
    passes=50,
    random_state= 42
)

lda.print_topics()
# that didn't seem to work any better. One last shot...

[(0,
  '0.009*"just" + 0.007*"verse" + 0.007*"chorus" + 0.007*"bitch" + 0.006*"ass" + 0.006*"life" + 0.005*"need" + 0.005*"real" + 0.005*"make" + 0.005*"high"'),
 (1,
  '0.009*"just" + 0.005*"let" + 0.005*"say" + 0.005*"em" + 0.004*"fuckin" + 0.004*"better" + 0.004*"think" + 0.004*"bitch" + 0.004*"yall" + 0.004*"verse"'),
 (2,
  '0.007*"verse" + 0.006*"chorus" + 0.005*"just" + 0.005*"black" + 0.005*"bitch" + 0.005*"west" + 0.005*"em" + 0.005*"life" + 0.004*"kanye" + 0.004*"big"')]

In [30]:
# I'll try w/ 100 passes
lda = models.LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=3,
    passes=100,
#     distributed=True, # This will not work without a module named Pyro4
    random_state= 42,
    per_word_topics=False # if True, the model also computes the most likely topics for each word; it does not change what print_topics shows
)

lda.print_topics()
# still the same...

[(0,
  '0.009*"just" + 0.007*"verse" + 0.007*"chorus" + 0.007*"bitch" + 0.006*"ass" + 0.006*"life" + 0.005*"need" + 0.005*"real" + 0.005*"make" + 0.005*"high"'),
 (1,
  '0.009*"just" + 0.005*"let" + 0.005*"say" + 0.005*"em" + 0.004*"fuckin" + 0.004*"better" + 0.004*"think" + 0.004*"bitch" + 0.004*"yall" + 0.004*"verse"'),
 (2,
  '0.007*"verse" + 0.006*"chorus" + 0.005*"just" + 0.005*"black" + 0.005*"bitch" + 0.005*"west" + 0.005*"em" + 0.005*"life" + 0.004*"kanye" + 0.004*"big"')]

### Nouns only

In [39]:
def nouns(text):
    """
    pull out the nouns only from a string of text
    """
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)]
    return ' '.join(all_nouns)
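
The filter works because every Penn Treebank noun tag (`NN`, `NNS`, `NNP`, `NNPS`) starts with `NN`. A minimal sketch of that prefix test on hand-tagged pairs, so it runs without NLTK or its data downloads. The tags below were assigned by hand for illustration, and `keep_by_prefix` is a hypothetical helper.

```python
# Hand-written (word, tag) pairs in the format pos_tag returns.
# The tags are assigned by hand for illustration, not produced by NLTK.
tagged = [
    ("drake", "NNP"),     # proper noun, singular
    ("pictures", "NNS"),  # noun, plural
    ("phone", "NN"),      # noun, singular
    ("quick", "JJ"),      # adjective
    ("feel", "VB"),       # verb, base form
]

def keep_by_prefix(pairs, prefixes):
    """Return words whose tag starts with any of the given prefixes."""
    return [word for word, tag in pairs if tag.startswith(tuple(prefixes))]

nouns_only = keep_by_prefix(tagged, ["NN"])
nouns_and_adjectives = keep_by_prefix(tagged, ["NN", "JJ"])

print(nouns_only)            # ['drake', 'pictures', 'phone']
print(nouns_and_adjectives)  # ['drake', 'pictures', 'phone', 'quick']
```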

In [40]:
# use the cleaned data to gather the nouns
data_clean = pd.read_pickle('DataFrame_Corpus.pkl')
data_clean

Artist,Lyrics,Artist Name
Drake,produced by boi1da frank dukes noah 40 shebib ...,Drake
Jayz,intro hannah williams do i find it so hard whe...,Jayz
Nas,produced by ron browz intro fuck jay z whats u...,Nas
Eminem,verse 1 now this shits about to kick off this ...,Eminem
Future,intro high klassified な音楽 i got the truth in m...,Future
KanyeWest,produced by daft punk kanye west verse 1 for ...,KanyeWest


In [42]:
# apply the noun function that was created above
data_nouns = pd.DataFrame(data_clean.Lyrics.apply(nouns))
data_nouns
# below is the corpus with nouns only. Some words look like they don't
# belong, such as 'moves', but notice the word 'dance' next to it:
# in 'dance moves' it is tagged as a noun.

Artist,Lyrics
Drake,boi1da frank dukes shebib part verse fuck bein...
Jayz,intro hannah williams i heart im day day look ...
Nas,ron browz fuck jay z ayo i z dick nigga style ...
Eminem,party lets hiphop scratch im bout track everyb...
Future,な音楽 i truth weeknd dont dance moves nigga sham...
KanyeWest,daft punk kanye verse theme song jeans byanyme...


In [96]:
# okay, now we have to repeat the steps above:
# we need another document-term matrix and another dictionary
# before modeling.

# the noun-only text has not had stop words removed yet, so add them again
add_stop_words = [ # inside this list I will insert words that just shouldn't be considered
    'な音楽','verse','produced','intro','just','em','chorus',
    'bitch','kanye','west','boi1da','ass','yall', 'zöld',
    'ölén','úgy', 'im',
    
    'fuck','fucking','fucks','fuckin','nigga','niggas','shit', 
    'bitch','bitches','pussy','hoes','muhfucka','motherfucker',
    'ass',
    
    'lets','cause','thats', 'youre','aint', 'yeah', 'future','nas',
    'drake'
]
# CountVectorizer expects a list here; newer scikit-learn versions reject a frozenset
stop_words = list(text.ENGLISH_STOP_WORDS.union(add_stop_words))

# Create a document term matrix to turn into a term document matrix
cvn = CountVectorizer(
    stop_words=stop_words
)
data_cvn = cvn.fit_transform(
    data_nouns.Lyrics
)
data_dtmn = pd.DataFrame(
    data_cvn.toarray(),
    columns = cvn.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
)
data_dtmn.index = data_nouns.index
data_dtmn

Artist,a1,aaaah,ability,abundance,accelerants,accolades,account,accounts,ace,acetaminophen,...,yung,zapatos,zazie,ze,zeros,zip,zod,zombie,zone,zonin
Drake,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,1,0,0,1,0
Jayz,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
Nas,0,0,0,1,0,0,0,1,1,0,...,0,0,0,1,1,0,0,0,0,0
Eminem,0,0,1,0,1,1,1,0,1,1,...,1,0,0,0,0,0,1,1,0,0
Future,0,0,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
KanyeWest,1,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,3
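
To show what the vectorizer step produces, here is a rough standard-library sketch that builds a tiny document-term matrix with `collections.Counter` while skipping stop words. It is a simplification: `CountVectorizer` applies its own tokenization rules, which this whitespace split does not reproduce. The two short documents and the helper name `count_terms` are made up.

```python
from collections import Counter

# Short made-up documents; the real input is data_nouns.Lyrics.
docs = {
    "artist_a": "verse money city money night",
    "artist_b": "chorus love night love world",
}
extra_stop_words = {"verse", "chorus"}  # a subset of add_stop_words above

def count_terms(text_value, stop_words):
    """Split on whitespace and count every token that is not a stop word."""
    return Counter(tok for tok in text_value.split() if tok not in stop_words)

counts = {name: count_terms(body, extra_stop_words) for name, body in docs.items()}

# The sorted vocabulary gives the column order of the matrix.
vocab = sorted(set().union(*counts.values()))
matrix = [[counts[name][term] for term in vocab] for name in docs]

print(vocab)
for name, row in zip(docs, matrix):
    print(name, row)
```

Rows are documents and columns are terms, which matches the layout of `data_dtmn` above.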


In [97]:
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))
id2wordn = dict((v,k) for k,v in cvn.vocabulary_.items())

In [98]:
# Instantiate the LDA model

ldan = models.LdaModel(
    corpus=corpusn, # the noun-only term-document matrix as a gensim corpus
    id2word=id2wordn, # dict of term id -> term
    num_topics=2, # number of topics the model will try to discover
    passes=10, # number of passes over the whole corpus during training.
              # start with 10 and see what difference moving up or down makes.
              # more passes give the model more chances to converge on stable topics.
    random_state=42
)
# print the discovered topics
ldan.print_topics()
# the output shows the top words of each topic with their weights; it does not name the topic.
# this is starting to look decent, due to the stop words,
# which only reinforces that I need to clean better.

[(0,
  '0.015*"clique" + 0.009*"swerve" + 0.005*"girl" + 0.005*"money" + 0.005*"hands" + 0.005*"monster" + 0.004*"life" + 0.004*"teeth" + 0.004*"sound" + 0.004*"need"'),
 (1,
  '0.011*"love" + 0.009*"life" + 0.009*"time" + 0.008*"world" + 0.008*"man" + 0.006*"day" + 0.006*"dont" + 0.005*"way" + 0.005*"money" + 0.004*"thou"')]
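
One rough way to check whether two topics are actually distinct is to measure how many top words they share. A minimal sketch, using the two topic strings copied from the output above. The Jaccard overlap is a choice made for this example, not something the notebook computes, and `top_words` is a hypothetical helper.

```python
import re

# Topic strings copied from the ldan.print_topics() output above.
topic_0 = ('0.015*"clique" + 0.009*"swerve" + 0.005*"girl" + 0.005*"money" + '
           '0.005*"hands" + 0.005*"monster" + 0.004*"life" + 0.004*"teeth" + '
           '0.004*"sound" + 0.004*"need"')
topic_1 = ('0.011*"love" + 0.009*"life" + 0.009*"time" + 0.008*"world" + '
           '0.008*"man" + 0.006*"day" + 0.006*"dont" + 0.005*"way" + '
           '0.005*"money" + 0.004*"thou"')

def top_words(topic_string):
    """Extract the quoted words from a print_topics-style string."""
    return set(re.findall(r'"([^"]+)"', topic_string))

shared = top_words(topic_0) & top_words(topic_1)
combined = top_words(topic_0) | top_words(topic_1)
jaccard = len(shared) / len(combined)

print(sorted(shared))     # ['life', 'money']
print(round(jaccard, 3))  # 0.111
```

A low overlap suggests the topics are separating, although it says nothing about whether each topic is meaningful.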

In [99]:
# trying 3 topics w/ 50 passes
ldan = models.LdaModel(
    corpus=corpusn,
    id2word=id2wordn,
    num_topics=3,
    passes=50,
    random_state= 42
)

ldan.print_topics()

[(0,
  '0.018*"clique" + 0.011*"swerve" + 0.006*"girl" + 0.006*"hands" + 0.006*"money" + 0.006*"monster" + 0.005*"life" + 0.005*"sound" + 0.005*"teeth" + 0.005*"need"'),
 (1,
  '0.008*"time" + 0.008*"day" + 0.007*"dont" + 0.007*"man" + 0.006*"way" + 0.006*"love" + 0.005*"face" + 0.005*"girl" + 0.005*"home" + 0.004*"water"'),
 (2,
  '0.015*"love" + 0.015*"world" + 0.015*"life" + 0.009*"thou" + 0.009*"time" + 0.009*"man" + 0.007*"trophy" + 0.006*"dog" + 0.006*"commas" + 0.006*"money"')]

In [100]:
# trying 4 topics w/ 50 passes
ldan = models.LdaModel(
    corpus=corpusn,
    id2word=id2wordn,
    num_topics=4,
    passes=50,
    random_state= 42
)

ldan.print_topics()

[(0,
  '0.020*"clique" + 0.012*"swerve" + 0.007*"girl" + 0.007*"hands" + 0.007*"money" + 0.006*"monster" + 0.006*"life" + 0.005*"sound" + 0.005*"teeth" + 0.005*"need"'),
 (1,
  '0.009*"time" + 0.008*"day" + 0.008*"dont" + 0.007*"man" + 0.007*"way" + 0.007*"love" + 0.005*"face" + 0.005*"girl" + 0.005*"home" + 0.004*"water"'),
 (2,
  '0.018*"life" + 0.018*"thou" + 0.014*"trophy" + 0.013*"dog" + 0.013*"commas" + 0.010*"money" + 0.009*"percocets" + 0.008*"time" + 0.007*"wifey" + 0.007*"foreigns"'),
 (3,
  '0.023*"world" + 0.023*"love" + 0.012*"man" + 0.009*"represent" + 0.009*"life" + 0.008*"time" + 0.006*"mind" + 0.006*"son" + 0.005*"death" + 0.005*"state"')]

### Nouns and Adjectives

In [101]:
def noun_adj(text):
    """
    Same as nouns above, but now including adjectives too.
    """
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    nouns_adj = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)]
    return ' '.join(nouns_adj)

In [102]:
data_nouns_adj = pd.DataFrame(data_clean.Lyrics.apply(noun_adj))
data_nouns_adj

Artist,Lyrics
Drake,boi1da frank dukes shebib nineteen85 part vers...
Jayz,intro hannah williams hard i heart im day day ...
Nas,ron browz intro fuck jay z niggas ayo i aint j...
Eminem,verse party wack lets hiphop scratch im bout t...
Future,high な音楽 i truth verse weeknd nigga dont dance...
KanyeWest,daft punk kanye west verse theme song black le...


In [103]:
# instantiate countvectorizer
cvna = CountVectorizer(
    stop_words=stop_words,
#     max_df=.8 # consider adding this in, see what difference it makes
)
data_cvna = cvna.fit_transform(data_nouns_adj.Lyrics)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
data_dtmna.index = data_nouns_adj.index
data_dtmna

Artist,21st,41st,a1,aaaah,ability,able,absurd,abundance,ac,accelerants,...,yung,zapatos,zazie,ze,zeros,zip,zod,zombie,zone,zonin
Drake,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,1,0,0,1,0
Jayz,1,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
Nas,0,1,0,0,0,0,0,1,0,0,...,0,0,0,1,1,0,0,0,0,0
Eminem,0,0,0,0,1,2,1,0,1,1,...,1,0,0,0,0,0,1,1,0,0
Future,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
KanyeWest,0,0,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,3
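
The commented-out `max_df=.8` option drops terms that appear in too large a share of the documents. A minimal standard-library sketch of that rule, assuming the usual meaning of a float `max_df` (ignore terms whose document frequency is strictly above the threshold). The token lists and the helper name `document_frequency` are made up.

```python
# Made-up token lists standing in for the per-artist documents.
docs = [
    ["love", "money", "night"],
    ["love", "city", "money"],
    ["love", "dream", "city"],
    ["love", "night", "street"],
    ["love", "money", "world"],
]

def document_frequency(documents):
    """Fraction of documents in which each term appears at least once."""
    n_docs = len(documents)
    terms = set().union(*documents)
    return {t: sum(t in set(d) for d in documents) / n_docs for t in terms}

df = document_frequency(docs)
max_df = 0.8  # same threshold as the commented-out option above

kept = sorted(t for t, frac in df.items() if frac <= max_df)
dropped = sorted(t for t, frac in df.items() if frac > max_df)

print(kept)     # ['city', 'dream', 'money', 'night', 'street', 'world']
print(dropped)  # ['love']
```

With only six documents, as in this notebook, a threshold of 0.8 removes terms used by five or six of the artists, since 5/6 is already above 0.8.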


In [104]:
# create our sparse matrix and vocab dictionary
corpusna = matutils.Sparse2Corpus(
    scipy.sparse.csr_matrix(data_dtmna.transpose())
)

id2wordna = dict((v,k) for k, v in cvna.vocabulary_.items())

In [105]:
# time to model. We will start with 2 topics and move up to 4

ldana = models.LdaModel(
    corpus=corpusna,
    id2word=id2wordna,
    num_topics=2,
    passes=10
)

ldana.print_topics()

[(0,
  '0.009*"life" + 0.009*"love" + 0.008*"world" + 0.006*"high" + 0.006*"new" + 0.005*"time" + 0.005*"man" + 0.005*"day" + 0.005*"low" + 0.005*"thou"'),
 (1,
  '0.006*"dont" + 0.006*"time" + 0.006*"clique" + 0.005*"man" + 0.005*"god" + 0.005*"girl" + 0.005*"way" + 0.004*"love" + 0.004*"real" + 0.004*"night"')]

In [106]:
# 3 topics with 50 passes
ldana = models.LdaModel(
    corpus=corpusna,
    id2word=id2wordna,
    num_topics=3,
    passes=50
)

ldana.print_topics()

[(0,
  '0.011*"clique" + 0.008*"girl" + 0.007*"god" + 0.007*"real" + 0.007*"time" + 0.006*"swerve" + 0.006*"love" + 0.006*"man" + 0.006*"black" + 0.005*"big"'),
 (1,
  '0.008*"life" + 0.007*"high" + 0.007*"dont" + 0.007*"thou" + 0.006*"low" + 0.006*"time" + 0.006*"ta" + 0.005*"trophy" + 0.005*"gon" + 0.005*"dog"'),
 (2,
  '0.013*"love" + 0.012*"world" + 0.008*"new" + 0.007*"man" + 0.006*"black" + 0.006*"life" + 0.006*"york" + 0.006*"day" + 0.005*"represent" + 0.005*"time"')]

In [107]:
# 4 topics with 1000 passes
ldana = models.LdaModel(
    corpus=corpusna,
    id2word=id2wordna,
    num_topics=4,
    passes=1000
)

ldana.print_topics()

[(0,
  '0.000*"hold" + 0.000*"insane" + 0.000*"hit" + 0.000*"story" + 0.000*"watch" + 0.000*"roll" + 0.000*"mama" + 0.000*"reason" + 0.000*"lord" + 0.000*"code"'),
 (1,
  '0.010*"love" + 0.009*"world" + 0.006*"man" + 0.006*"new" + 0.006*"dont" + 0.006*"day" + 0.005*"time" + 0.005*"life" + 0.005*"black" + 0.004*"york"'),
 (2,
  '0.014*"high" + 0.014*"life" + 0.014*"thou" + 0.013*"low" + 0.011*"trophy" + 0.010*"dog" + 0.010*"commas" + 0.008*"money" + 0.007*"percocets" + 0.007*"strong"'),
 (3,
  '0.011*"clique" + 0.008*"girl" + 0.008*"real" + 0.008*"god" + 0.007*"time" + 0.007*"love" + 0.007*"swerve" + 0.007*"man" + 0.006*"black" + 0.006*"way"')]

You have to interpret the topics yourself: the model returns only weighted word lists, so naming each topic is subjective.
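
To get back to the question of who tends to talk about what, one option is to look at each artist's dominant topic. A minimal sketch, with invented probabilities and placeholder artist names. With gensim, per-document `(topic_id, probability)` pairs like these can be obtained by indexing a trained model with the corpus, for example `ldan[corpusn]`. The helper name `dominant_topic` is hypothetical.

```python
# Hypothetical (topic_id, probability) pairs per artist. The numbers are
# invented for illustration and are not results from the models above.
doc_topics = {
    "artist_a": [(0, 0.91), (1, 0.09)],
    "artist_b": [(0, 0.12), (1, 0.88)],
    "artist_c": [(0, 0.55), (1, 0.45)],
}

def dominant_topic(pairs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(pairs, key=lambda pair: pair[1])

assignments = {name: dominant_topic(pairs)[0] for name, pairs in doc_topics.items()}
print(assignments)  # {'artist_a': 0, 'artist_b': 1, 'artist_c': 0}
```

A near-even split, like the third entry, is a hint that the document does not belong cleanly to either topic.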