<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Model---Rap-Genius" data-toc-modified-id="Model---Rap-Genius-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Model - Rap Genius</a></span><ul class="toc-item"><li><span><a href="#Topic-Modeling" data-toc-modified-id="Topic-Modeling-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Topic Modeling</a></span><ul class="toc-item"><li><span><a href="#Attempt-1----All-Text" data-toc-modified-id="Attempt-1----All-Text-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Attempt 1 -- All Text</a></span><ul class="toc-item"><li><span><a href="#Adjust-the-topics-count" data-toc-modified-id="Adjust-the-topics-count-1.1.1.1"><span class="toc-item-num">1.1.1.1&nbsp;&nbsp;</span>Adjust the topics count</a></span></li><li><span><a href="#Adjust-the-number-of-passes" data-toc-modified-id="Adjust-the-number-of-passes-1.1.1.2"><span class="toc-item-num">1.1.1.2&nbsp;&nbsp;</span>Adjust the number of passes</a></span></li></ul></li><li><span><a href="#Nouns-only" data-toc-modified-id="Nouns-only-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span>Nouns only</a></span></li><li><span><a href="#Nouns-and-Adjectives" data-toc-modified-id="Nouns-and-Adjectives-1.1.3"><span class="toc-item-num">1.1.3&nbsp;&nbsp;</span>Nouns and Adjectives</a></span></li></ul></li><li><span><a href="#Text-Generation" data-toc-modified-id="Text-Generation-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Text Generation</a></span><ul class="toc-item"><li><span><a href="#Read-in-data-to-imitate" data-toc-modified-id="Read-in-data-to-imitate-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Read in data to imitate</a></span></li><li><span><a href="#Build-a-Markov-Chains" data-toc-modified-id="Build-a-Markov-Chains-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Build a Markov Chains</a></span></li><li><span><a href="#Create-a-text-generator" data-toc-modified-id="Create-a-text-generator-1.2.3"><span class="toc-item-num">1.2.3&nbsp;&nbsp;</span>Create a text generator</a></span></li></ul></li></ul></li></ul></div>

# Model - Rap Genius

In [1]:
from gensim import matutils, models # matutils will turn the array into a bag of words
import pandas as pd
import pickle
import scipy.sparse
from nltk import word_tokenize, pos_tag
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer
from collections import defaultdict
import random
from IPython.lib.pretty import pprint

## Topic Modeling

for topic modeling we are going to be using a technique called Laten Dirichlet Allocation (LDA). NLTK for part of speech tagging. We want to know what are the different themes that arise in a rappers lyrics, and who tends to talk about what.

LDA deals with probability. Dirichlet is a type of probability and we are trying to discover what is the probability that the document is about a specific topic given a set of words. 

### Attempt 1 -- All Text
This is not going to work the first time through. 

In [2]:
data = pd.read_pickle('../Datasets/Pickled_Files/DataFrame_with_new_stopwords.pkl')
data

Unnamed: 0_level_0,02,10,100,1000,1008,10yearolds,11,12,125,140,...,zeros,zip,zod,zombie,zone,zonin,zöld,ölén,úgy,な音楽
Artist,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Drake,0,0,6,0,0,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
Jayz,0,0,2,0,0,0,2,0,0,1,...,0,0,0,0,0,0,0,0,0,0
Nas,0,1,0,1,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0
Eminem,1,0,0,0,0,1,0,1,0,0,...,0,0,1,1,0,0,0,0,0,0
Future,0,0,0,0,1,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,1
KanyeWest,0,0,0,0,0,0,0,1,2,0,...,0,0,0,1,0,3,1,1,1,0


In [3]:
# transpose to create a term-document matrix
tdm = data.transpose()
tdm.head()

Artist,Drake,Jayz,Nas,Eminem,Future,KanyeWest
2,0,0,0,1,0,0
10,0,0,1,0,0,0
100,6,2,0,0,0,0
1000,0,0,1,0,0,0
1008,0,0,0,0,1,0


In [4]:
# now we are going to turn the tdm into a sparse matrix
# where most of the elements are zero (the opposite is a dense matrix where they are nonzero)

sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)

In [5]:
# gensim requires a dictionary of all the terms and where they reside in the tdm
# this will show us all the unique words in our corpus.
cv = pickle.load(open('../Datasets/Pickled_Files/cv_stop.pkl', 'rb'))
id2word = dict((v,k) for k, v in cv.vocabulary_.items())

In [7]:
# Instantiate the LDA model

lda = models.LdaModel(
    corpus=corpus, # this our term document matrix
    id2word=id2word, # this our dict of location:term
    num_topics=2, # choosing two topics the model will try to discover
    passes=10, # we will start with 10 passes and see what difference it makes moving up or down.
              # this will go through the document once searching for the best topics based off the document. I'm asking it to do it 10x.
              # the more passes the more the topics begin to make sense.
    random_state = 42,
    alpha = 'auto', # The alpha controls the mixture of topics for any given document. 
                   # Turn it down and the documents will likely have less of a mixture of topics. 
                   # Turn it up and the documents will likely have more of a mixture of topics.
    eta = 'auto' # this is beta. This is just like alpha but instead it deals with words for any given topic.
)
# print the discovered topics out
lda.print_topics()
# the output will show you the top words of the topic. It will not output the topic itself.
# this output probably won't make much sense. 

[(0,
  '0.007*"verse" + 0.007*"life" + 0.006*"bitch" + 0.006*"chorus" + 0.005*"just" + 0.005*"ass" + 0.004*"high" + 0.004*"new" + 0.004*"em" + 0.004*"time"'),
 (1,
  '0.010*"just" + 0.006*"need" + 0.006*"verse" + 0.006*"chorus" + 0.005*"want" + 0.005*"say" + 0.004*"bitch" + 0.004*"time" + 0.004*"make" + 0.004*"em"')]

#### Adjust the topics count

In [8]:
# let's try 3 topics
lda = models.LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=3,
    passes=10,
    random_state= 42,
    alpha = 'auto', # The alpha controls the mixture of topics for any given document. 
                   # Turn it down and the documents will likely have less of a mixture of topics. 
                   # Turn it up and the documents will likely have more of a mixture of topics.
    eta = 'auto' # this is beta. This is just like alpha but instead it deals with words for any given topic.
)

lda.print_topics()
# not getting anything better

[(0,
  '0.009*"just" + 0.007*"verse" + 0.007*"chorus" + 0.007*"bitch" + 0.006*"ass" + 0.006*"life" + 0.005*"need" + 0.005*"make" + 0.005*"real" + 0.005*"time"'),
 (1,
  '0.009*"just" + 0.005*"say" + 0.005*"let" + 0.005*"em" + 0.004*"fuckin" + 0.004*"think" + 0.004*"better" + 0.004*"bitch" + 0.004*"verse" + 0.004*"yall"'),
 (2,
  '0.007*"verse" + 0.006*"chorus" + 0.005*"just" + 0.005*"black" + 0.005*"bitch" + 0.005*"life" + 0.005*"em" + 0.005*"west" + 0.004*"kanye" + 0.004*"want"')]

In [9]:
# let's try 4
# let's try 3 topics
lda = models.LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=4,
    passes=10,
    random_state= 42,
    alpha = 'auto', # The alpha controls the mixture of topics for any given document. 
                   # Turn it down and the documents will likely have less of a mixture of topics. 
                   # Turn it up and the documents will likely have more of a mixture of topics.
    eta = 'auto' # this is beta. This is just like alpha but instead it deals with words for any given topic.
)

lda.print_topics()
# not getting anything better...again

[(0,
  '0.009*"just" + 0.008*"verse" + 0.007*"chorus" + 0.007*"bitch" + 0.007*"ass" + 0.006*"life" + 0.005*"need" + 0.005*"make" + 0.005*"real" + 0.005*"time"'),
 (1,
  '0.010*"just" + 0.005*"let" + 0.005*"say" + 0.005*"em" + 0.005*"fuckin" + 0.005*"think" + 0.005*"better" + 0.004*"bitch" + 0.004*"verse" + 0.004*"yall"'),
 (2,
  '0.007*"verse" + 0.006*"chorus" + 0.006*"just" + 0.005*"black" + 0.005*"bitch" + 0.005*"life" + 0.005*"em" + 0.005*"west" + 0.005*"kanye" + 0.004*"want"'),
 (3,
  '0.000*"just" + 0.000*"verse" + 0.000*"time" + 0.000*"bitch" + 0.000*"life" + 0.000*"chorus" + 0.000*"ass" + 0.000*"make" + 0.000*"need" + 0.000*"em"')]

Going to need to go back and clean much more...

#### Adjust the number of passes

In [10]:
# I'll use 4 topics w/ 50 passes
lda = models.LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=3,
    passes=50,
    random_state= 42,
    alpha = 'auto', # The alpha controls the mixture of topics for any given document. 
                   # Turn it down and the documents will likely have less of a mixture of topics. 
                   # Turn it up and the documents will likely have more of a mixture of topics.
    eta = 'auto' # this is beta. This is just like alpha but instead it deals with words for any given topic.
)

lda.print_topics()
# that didn't seem to work any better. One last shot...

[(0,
  '0.009*"just" + 0.007*"verse" + 0.007*"chorus" + 0.007*"bitch" + 0.006*"ass" + 0.006*"life" + 0.005*"need" + 0.005*"real" + 0.005*"make" + 0.005*"high"'),
 (1,
  '0.009*"just" + 0.005*"let" + 0.005*"say" + 0.005*"em" + 0.004*"fuckin" + 0.004*"better" + 0.004*"think" + 0.004*"bitch" + 0.004*"yall" + 0.004*"verse"'),
 (2,
  '0.007*"verse" + 0.006*"chorus" + 0.005*"just" + 0.005*"black" + 0.005*"bitch" + 0.005*"west" + 0.005*"em" + 0.005*"life" + 0.004*"kanye" + 0.004*"big"')]

In [11]:
# I'll try w/ 100 passes
lda = models.LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=3,
    passes=100,
#     distributed=True, # This will not work without a module named Pyro4
    random_state= 42,
    per_word_topics=False, # prints the list of topics (not working)
    alpha = 'auto', # The alpha controls the mixture of topics for any given document. 
                   # Turn it down and the documents will likely have less of a mixture of topics. 
                   # Turn it up and the documents will likely have more of a mixture of topics.
    eta = 'auto' # this is beta. This is just like alpha but instead it deals with words for any given topic.
)

lda.print_topics()
# still the same...

[(0,
  '0.009*"just" + 0.007*"verse" + 0.007*"chorus" + 0.007*"bitch" + 0.006*"ass" + 0.006*"life" + 0.005*"need" + 0.005*"make" + 0.005*"real" + 0.005*"time"'),
 (1,
  '0.009*"just" + 0.005*"say" + 0.005*"let" + 0.005*"em" + 0.004*"fuckin" + 0.004*"better" + 0.004*"think" + 0.004*"bitch" + 0.004*"yall" + 0.004*"verse"'),
 (2,
  '0.007*"verse" + 0.006*"chorus" + 0.005*"just" + 0.005*"black" + 0.005*"bitch" + 0.005*"em" + 0.005*"west" + 0.005*"life" + 0.004*"kanye" + 0.004*"big"')]

### Nouns only

In [12]:
def nouns(text):
    """
    pull out the nouns only from a string of text
    """
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)]
    return ' '.join(all_nouns)

In [13]:
# use the cleaned data to gather the nouns
data_clean = pd.read_pickle('../Datasets/Pickled_Files/DataFrame_Corpus.pkl')
data_clean

Unnamed: 0_level_0,Lyrics,Artist Name
Artist,Unnamed: 1_level_1,Unnamed: 2_level_1
Drake,produced by boi1da frank dukes noah 40 shebib ...,Drake
Jayz,intro hannah williams do i find it so hard whe...,Jayz
Nas,produced by ron browz intro fuck jay z whats u...,Nas
Eminem,verse 1 now this shits about to kick off this ...,Eminem
Future,intro high klassified な音楽 i got the truth in m...,Future
KanyeWest,produced by daft punk kanye west verse 1 for ...,KanyeWest


In [14]:
# apply the noun function that was created above
data_nouns = pd.DataFrame(data_clean.Lyrics.apply(nouns))
data_nouns
# below you will see the corpus with nouns only. It seems like some
# words just don't belong like the word 'moves' but notice the word dance next
# to it. That makes it a noun. 

Unnamed: 0_level_0,Lyrics
Artist,Unnamed: 1_level_1
Drake,boi1da frank dukes shebib part verse fuck bein...
Jayz,intro hannah williams i heart im day day look ...
Nas,ron browz fuck jay z ayo i z dick nigga style ...
Eminem,party lets hiphop scratch im bout track everyb...
Future,な音楽 i truth weeknd dont dance moves nigga sham...
KanyeWest,daft punk kanye verse theme song jeans byanyme...


In [15]:
# okay now we have to do the same steps above all over again
# meaning we need another term document matrix and another dictionary 
# before modeling. 

# right now our tdm doesn not have stop words in it. Re-add them
add_stop_words = [ # inside this list I will insert words that just shouldn't be considered
    'な音楽','verse','produced','intro','just','em','chorus',
    'bitch','kanye','west','boi1da','ass','yall', 'zöld',
    'ölén','úgy', 'im',
    
    'fuck','fucking','fucks','fuckin','nigga','niggas','shit', 
    'bitch','bitches','pussy','hoes','muhfucka','motherfucker',
    'ass', 'dont', 'ya','yuh','ayy','ta','ive'
    
    'lets','cause','thats', 'youre','aint', 'yeah', 'future','nas',
    'drake'
]
stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

# Create a document term matrix to turn into a term document matrix
cvn = CountVectorizer(
    stop_words=stop_words
)
data_cvn = cvn.fit_transform(
    data_nouns.Lyrics
)
data_dtmn = pd.DataFrame(
    data_cvn.toarray(),
    columns = cvn.get_feature_names()
)
data_dtmn.index = data_nouns.index
data_dtmn

Unnamed: 0_level_0,a1,aaaah,ability,abundance,accelerants,accolades,account,accounts,ace,acetaminophen,...,yung,zapatos,zazie,ze,zeros,zip,zod,zombie,zone,zonin
Artist,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Drake,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,1,0,0,1,0
Jayz,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
Nas,0,0,0,1,0,0,0,1,1,0,...,0,0,0,1,1,0,0,0,0,0
Eminem,0,0,1,0,1,1,1,0,1,1,...,1,0,0,0,0,0,1,1,0,0
Future,0,0,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
KanyeWest,1,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,3


In [16]:
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))
id2wordn = dict((v,k) for k,v in cvn.vocabulary_.items())

In [17]:
# Instantiate the LDA model

ldan = models.LdaModel(
    corpus=corpusn, # this our term document matrix
    id2word=id2wordn, # this our dict of location:term
    num_topics=2, # choosing two topics the model will try to discover
    passes=10, # we will start with 10 passes and see what difference it makes moving up or down.
              # this will go through the document once searching for the best topics based off the document. I'm asking it to do it 10x.
              # the more passes the more the topics begin to make sense.
    random_state = 42,
    alpha = 'auto', # The alpha controls the mixture of topics for any given document. 
                   # Turn it down and the documents will likely have less of a mixture of topics. 
                   # Turn it up and the documents will likely have more of a mixture of topics.
    eta = 'auto' # this is beta. This is just like alpha but instead it deals with words for any given topic.
)
# print the discovered topics out
ldan.print_topics()
# the output will show you the top words of the topic. It will not output the topic itself.
# this is starting look decent as an output do to the stop words, 
# which only reinforces that I need to clean better.

[(0,
  '0.016*"love" + 0.012*"world" + 0.010*"man" + 0.008*"time" + 0.006*"life" + 0.006*"york" + 0.006*"day" + 0.005*"way" + 0.005*"baby" + 0.004*"represent"'),
 (1,
  '0.009*"life" + 0.007*"clique" + 0.007*"time" + 0.006*"lets" + 0.006*"money" + 0.006*"thou" + 0.005*"gon" + 0.005*"trophy" + 0.005*"everybody" + 0.004*"day"')]

In [18]:
# trying 3 topics w/ 50 passes
ldan = models.LdaModel(
    corpus=corpusn,
    id2word=id2wordn,
    num_topics=3,
    passes=50,
    random_state= 42,
    alpha = 'auto', # The alpha controls the mixture of topics for any given document. 
                   # Turn it down and the documents will likely have less of a mixture of topics. 
                   # Turn it up and the documents will likely have more of a mixture of topics.
    eta = 'auto' # this is beta. This is just like alpha but instead it deals with words for any given topic.
)

ldan.print_topics()

[(0,
  '0.018*"love" + 0.013*"world" + 0.011*"man" + 0.009*"time" + 0.007*"life" + 0.007*"york" + 0.006*"day" + 0.005*"way" + 0.005*"baby" + 0.005*"represent"'),
 (1,
  '0.010*"clique" + 0.006*"swerve" + 0.006*"time" + 0.005*"gon" + 0.005*"everybody" + 0.005*"life" + 0.004*"day" + 0.004*"moment" + 0.004*"monster" + 0.004*"man"'),
 (2,
  '0.016*"life" + 0.016*"thou" + 0.015*"lets" + 0.012*"trophy" + 0.011*"dog" + 0.011*"commas" + 0.009*"money" + 0.008*"percocets" + 0.007*"time" + 0.006*"wifey"')]

In [19]:
# trying 4 topics w/ 50 passes
ldan = models.LdaModel(
    corpus=corpusn,
    id2word=id2wordn,
    num_topics=4,
    passes=50,
    random_state= 42,
    alpha = 'auto', # The alpha controls the mixture of topics for any given document. 
                   # Turn it down and the documents will likely have less of a mixture of topics. 
                   # Turn it up and the documents will likely have more of a mixture of topics.
    eta = 'auto' # this is beta. This is just like alpha but instead it deals with words for any given topic.
)

ldan.print_topics()

[(0,
  '0.023*"love" + 0.018*"world" + 0.015*"man" + 0.012*"time" + 0.007*"way" + 0.007*"represent" + 0.007*"life" + 0.006*"face" + 0.006*"home" + 0.005*"things"'),
 (1,
  '0.000*"world" + 0.000*"love" + 0.000*"man" + 0.000*"life" + 0.000*"time" + 0.000*"day" + 0.000*"clique" + 0.000*"gon" + 0.000*"money" + 0.000*"represent"'),
 (2,
  '0.020*"clique" + 0.012*"swerve" + 0.007*"girl" + 0.007*"money" + 0.007*"hands" + 0.006*"monster" + 0.006*"life" + 0.005*"teeth" + 0.005*"sound" + 0.005*"need"'),
 (3,
  '0.010*"life" + 0.009*"day" + 0.008*"lets" + 0.007*"time" + 0.007*"thou" + 0.006*"money" + 0.006*"trophy" + 0.005*"gon" + 0.005*"dog" + 0.005*"commas"')]

### Nouns and Adjectives

In [20]:
def noun_adj(text):
    """
    Same as nouns above, but now including adjectives too.
    """
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    nouns_adj = [word for (word,pos) in pos_tag(tokenized) if is_noun_adj (pos)]
    return ' '.join(nouns_adj)

In [21]:
data_nouns_adj = pd.DataFrame(data_clean.Lyrics.apply(noun_adj))
data_nouns_adj

Unnamed: 0_level_0,Lyrics
Artist,Unnamed: 1_level_1
Drake,boi1da frank dukes shebib nineteen85 part vers...
Jayz,intro hannah williams hard i heart im day day ...
Nas,ron browz intro fuck jay z niggas ayo i aint j...
Eminem,verse party wack lets hiphop scratch im bout t...
Future,high な音楽 i truth verse weeknd nigga dont dance...
KanyeWest,daft punk kanye west verse theme song black le...


In [22]:
# instantiate countvectorizer
cvna = CountVectorizer(
    stop_words=stop_words,
#     max_df=.8 # consider adding this in, see what difference it makes
)
data_cvna = cvna.fit_transform(data_nouns_adj.Lyrics)
data_dtmna = pd.DataFrame (data_cvna.toarray(), columns = cvna.get_feature_names())
data_dtmna.index = data_nouns_adj.index
data_dtmna

Unnamed: 0_level_0,21st,41st,a1,aaaah,ability,able,absurd,abundance,ac,accelerants,...,yung,zapatos,zazie,ze,zeros,zip,zod,zombie,zone,zonin
Artist,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Drake,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,1,0,0,1,0
Jayz,1,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
Nas,0,1,0,0,0,0,0,1,0,0,...,0,0,0,1,1,0,0,0,0,0
Eminem,0,0,0,0,1,2,1,0,1,1,...,1,0,0,0,0,0,1,1,0,0
Future,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
KanyeWest,0,0,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,3


In [23]:
# create our sparse matrix and vocab dictionary
corpusna = matutils.Sparse2Corpus(
    scipy.sparse.csr_matrix(data_dtmna.transpose())
)

id2wordna = dict((v,k) for k, v in cvna.vocabulary_.items())

In [24]:
# time to model. We will start with 2 and move to 4

ldana = models.LdaModel(
    corpus=corpusna,
    id2word=id2wordna,
    num_topics=2,
    passes=10,
    random_state = 42,
    alpha = 'auto', # The alpha controls the mixture of topics for any given document. 
                   # Turn it down and the documents will likely have less of a mixture of topics. 
                   # Turn it up and the documents will likely have more of a mixture of topics.
    eta = 'auto' # this is beta. This is just like alpha but instead it deals with words for any given topic.
)

ldana.print_topics()

[(0,
  '0.008*"life" + 0.006*"high" + 0.006*"time" + 0.006*"money" + 0.006*"real" + 0.006*"girl" + 0.006*"clique" + 0.006*"lets" + 0.005*"god" + 0.005*"big"'),
 (1,
  '0.010*"world" + 0.010*"love" + 0.006*"man" + 0.005*"time" + 0.004*"life" + 0.004*"gon" + 0.004*"represent" + 0.004*"better" + 0.003*"new" + 0.003*"yo"')]

In [25]:
# 3 topics with 50 passes
ldana = models.LdaModel(
    corpus=corpusna,
    id2word=id2wordna,
    num_topics=3,
    passes=50,
    random_state = 42,
    alpha = 'auto', # The alpha controls the mixture of topics for any given document. 
                   # Turn it down and the documents will likely have less of a mixture of topics. 
                   # Turn it up and the documents will likely have more of a mixture of topics.
    eta = 'auto' # this is beta. This is just like alpha but instead it deals with words for any given topic.
)

ldana.print_topics()

[(0,
  '0.011*"life" + 0.009*"high" + 0.009*"lets" + 0.008*"low" + 0.008*"thou" + 0.007*"day" + 0.007*"money" + 0.006*"trophy" + 0.006*"dog" + 0.006*"new"'),
 (1,
  '0.005*"better" + 0.005*"gon" + 0.005*"common" + 0.005*"time" + 0.005*"moment" + 0.004*"oohoohoohooh" + 0.004*"hes" + 0.004*"eminem" + 0.004*"day" + 0.004*"way"'),
 (2,
  '0.013*"love" + 0.009*"world" + 0.009*"man" + 0.008*"clique" + 0.008*"time" + 0.007*"black" + 0.006*"real" + 0.006*"life" + 0.005*"big" + 0.005*"god"')]

In [26]:
# 4 topics with 1000 passes
ldana = models.LdaModel(
    corpus=corpusna,
    id2word=id2wordna,
    num_topics=4,
    passes=1000,
    random_state= 42,
    alpha = 'auto', # The alpha controls the mixture of topics for any given document. 
                   # Turn it down and the documents will likely have less of a mixture of topics. 
                   # Turn it up and the documents will likely have more of a mixture of topics.
    eta = 'auto' # this is beta. This is just like alpha but instead it deals with words for any given topic.
)

ldana.print_topics()

[(0,
  '0.011*"life" + 0.010*"high" + 0.010*"lets" + 0.009*"thou" + 0.009*"low" + 0.008*"day" + 0.007*"money" + 0.007*"trophy" + 0.006*"commas" + 0.006*"dog"'),
 (1,
  '0.006*"better" + 0.005*"gon" + 0.005*"common" + 0.005*"time" + 0.005*"moment" + 0.005*"oohoohoohooh" + 0.004*"eminem" + 0.004*"hes" + 0.004*"day" + 0.004*"way"'),
 (2,
  '0.018*"love" + 0.014*"world" + 0.011*"man" + 0.010*"time" + 0.008*"real" + 0.006*"way" + 0.006*"represent" + 0.005*"black" + 0.005*"life" + 0.005*"new"'),
 (3,
  '0.016*"clique" + 0.009*"god" + 0.009*"swerve" + 0.007*"black" + 0.007*"big" + 0.006*"girl" + 0.005*"monster" + 0.005*"money" + 0.005*"hands" + 0.005*"sean"')]

- You are going to have to interpret the topics. Very subjective
- Another technique you can apply is once you have ran your documents through LDA you can do a DBSCAN to do a more strict categorization of the documents. 
- LDA is a fuzzy categorization. 


In [27]:
for index, topic in ldana.show_topics(formatted=False, num_words= 30):
        print('Topic: {} \nWords: {}'.format(index, [w[0] for w in topic]))

Topic: 0 
Words: ['life', 'high', 'lets', 'thou', 'low', 'day', 'money', 'trophy', 'dog', 'commas', 'new', 'big', 'time', 'percocets', 'york', 'young', 'bad', 'strong', 'real', 'baby', 'water', 'uh', 'problems', 'wifey', 'mask', 'house', 'girl', 'metro', 'gang', 'foreigns']
Topic: 1 
Words: ['better', 'gon', 'common', 'moment', 'time', 'oohoohoohooh', 'eminem', 'hes', 'day', 'way', 'man', 'rap', 'music', 'crazy', 'everybody', 'little', 'ive', 'ill', 'head', 'shot', 'yo', 'boy', 'mic', 'joyner', 'people', 'woo', 'lil', 'world', 'night', 'ready']
Topic: 2 
Words: ['love', 'world', 'man', 'time', 'real', 'represent', 'way', 'black', 'life', 'new', 'face', 'fake', 'home', 'girl', 'things', 'night', 'big', 'mind', 'minewhose', 'wishin', 'son', 'quick', 'york', 'thing', 'baby', 'yo', 'death', 'days', 'long', 'state']
Topic: 3 
Words: ['clique', 'god', 'swerve', 'black', 'big', 'girl', 'hands', 'monster', 'money', 'life', 'sean', 'mornin', 'need', 'thirsty', 'teeth', 'new', 'sound', 'high', '

In [28]:
top_words_per_topic = []
for t in range(ldana.num_topics):
    top_words_per_topic.extend([(t, ) + x for x in ldana.show_topic(t, topn = 5)])

display = pd.DataFrame(top_words_per_topic, columns=['Topic', 'Word', 'P'])

In [29]:
display

Unnamed: 0,Topic,Word,P
0,0,life,0.011404
1,0,high,0.009815
2,0,lets,0.009588
3,0,low,0.00868
4,0,thou,0.00868
5,1,better,0.005964
6,1,gon,0.005428
7,1,common,0.00516
8,1,time,0.004892
9,1,moment,0.004892


In [30]:
corpus_transformed = ldana[corpusna]
list(zip([a for [(a,b)] in corpus_transformed], data_dtmna.index))

[(2, 'Drake'),
 (0, 'Jayz'),
 (2, 'Nas'),
 (1, 'Eminem'),
 (0, 'Future'),
 (3, 'KanyeWest')]

## Text Generation
Using a corpus we can use markov chains to generate brand new text. We preserve the order of the text and punctuation too. We can generate new text based off this corpus in the style of a certain rapper, like Drake.

Markov Chains are a mathematical way of representing how systems change over time. Unfortunately they are memoryless and only knows what happens in one previous state. Basically, tomorrow's weather is based on what happens today. 

Put another way Today it rained, there is a 50% chance it will rain again, 30% chance it will be sunny and, 20% chance it will be cloudy. The system will choose rainy, but now there is a 30% chance it will rain, 50% chance it will be sunny, and 20% chance it will be cloudy. Now the system chooses sunny and the system starts over again.

For the computer to do this we can create a dictionary, the key being a word, and the value a list of words that would proceed that word. We then will randomly select a word from that list that appears a number of time.

### Read in data to imitate

In [15]:
data = pd.read_pickle('../Datasets/Pickled_Files/DataFrame_Corpus.pkl')
data

Unnamed: 0_level_0,Lyrics,Artist Name
Artist,Unnamed: 1_level_1,Unnamed: 2_level_1
Drake,produced by boi1da frank dukes noah 40 shebib ...,Drake
Jayz,intro hannah williams do i find it so hard whe...,Jayz
Nas,produced by ron browz intro fuck jay z whats u...,Nas
Eminem,verse 1 now this shits about to kick off this ...,Eminem
Future,intro high klassified な音楽 i got the truth in m...,Future
KanyeWest,produced by daft punk kanye west verse 1 for ...,KanyeWest


In [16]:
drake_text = data.Lyrics.loc['Drake']
drake_text[0:200]

'produced by boi1da frank dukes noah 40 shebib  nineteen85 part i 0 to 100 verse 1 fuck bein on some chill shit we go 0 to 100 nigga real quick they be on that raptopaythebill shit and i dont feel that'

### Build a Markov Chains
- Keys are all the words in the corpus 
- Values are all the words that follow the key

In [17]:
def markov_chain(text):
    
    #tokenize the text by word, including punctuation
    words = text.split(' ')
    
    # initialize the dictionary that will hold all the keys and values
    m_dict = defaultdict(list)
    
    # create a list of the word: list of words format
    for current_word, next_word in zip(words[0:-1], words[1:]):
        m_dict[current_word].append(next_word)
    
    # convert it back to a dictionary
    m_dict = dict(m_dict)
    return m_dict

In [19]:
drake_dictionary = markov_chain(drake_text)
pprint(drake_dictionary,max_seq_length= 5)
# there are alot of dupes. That's okay because that means there is a
# high probability that word will get picked after the key.

{'produced': ['by', 'by'],
 'by': ['boi1da', 'key', 'the', 'a', 'nineteen85', ...],
 'boi1da': ['frank', 'whats'],
 'frank': ['dukes'],
 'dukes': ['noah'],
 ...}


### Create a text generator

In [6]:
def generate_sentence (chain, count=20):
    
    # capitalize the first word
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()
    
    # generate the second word from the value list. 
    # don't forget to use this new word as word 1 to move forward.
    for i in range(count-1):
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2
        
    # end it with a period
    # psuedo code for later: if final word is not an determiner/article, then:
    sentence += '...'
    return(sentence)

In [17]:
generate_sentence(drake_dictionary)

'Outlive me all me stay up im the shine i know it like eight advances god damn okay made yall...'

**Drake Samples**

'Perspective i did you cause youre talkin boasy and shit 0 to do it i dont even know it you.'


'All me no more time where where i see i been cookin with my cell phone late night late night.'


'Cuddle with the goats for someone else you left your advances god damn chorus gods plan i had a lot...'


'Million off and when she say youll never ever leave from beside me cause i was a good girl and...'

In [20]:
nasText = data.Lyrics.loc['Nas']
nasText[0:200]

'produced by ron browz intro fuck jay z whats up niggas ayo i know you aint talkin about me dog you what fuck jay z you been on my dick nigga you love my style nigga fuck jay z chorus i fuck with your '

In [21]:
nasDictionary =  markov_chain(nasText)
pprint(nasDictionary,max_seq_length=5)

{'produced': ['by'],
 'by': ['ron', 'rashad', 'large', 'aesop', 'villanova', ...],
 'ron': ['browz'],
 'browz': ['intro'],
 'intro': ['fuck', 'nas', 'az', 'yeah', 'umm'],
 ...}


In [13]:
generate_sentence(nasDictionary)

'Changes and ill defeat foes yall rock fellas put you traded your mothafuckin mind i aint the story yesterday when...'

**Nas Samples**

'Headed for breakfast dime sexes and i had me laugh in these last time is i fuck i trap em.'


'Prisoner set free all races combined in hand in a maximum state pen and be riffin while im a cheetah...'