## Products In Rap Lyrics

In [1]:
import nltk

In [3]:
from nltk.corpus import PlaintextCorpusReader
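corpus_root = 'JayZ'
# Match only the lyric files (assumed to end in .txt) so that stray
# non-text files such as macOS's .DS_Store are not read into the corpus
wordlist = PlaintextCorpusReader(corpus_root, r'.*\.txt')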

In [6]:
import re

def create_corpus(wordlist, some_corpus):
    """
    Append the word tokens of every file in `wordlist` to `some_corpus`.

    `wordlist` is a PlaintextCorpusReader and `some_corpus` is a list
    (pass [] to start a new corpus). Returns the extended list of tokens.
    Requires the regular expression module `re`.
    """
    # TODO: this is where the rap stopwords should be removed.
    for fileid in wordlist.fileids():
        print(fileid)  # show which file is being processed
        raw = wordlist.raw(fileid)
        words = re.split(r'\W+', raw)  # split the raw text on runs of non-word characters
        some_corpus.extend(words)

    return some_corpus
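
One side effect of splitting on `\W+` is worth seeing on a small input: apostrophes count as separators, so contractions break apart, and text that ends in punctuation leaves an empty string at the end of the list. The sketch below shows this on a made-up line and contrasts it with a hypothetical alternative, `tokenize_keep_contractions`, which is not part of this notebook and only illustrates one way to keep contractions whole.

```python
import re

sample = "I'm rolling, he'll never die!"  # made-up text, not taken from the corpus

# Same split as create_corpus: every run of non-word characters is a separator.
split_tokens = re.split(r'\W+', sample)
print(split_tokens)

def tokenize_keep_contractions(text):
    """Hypothetical helper: keep an apostrophe when it is followed by letters inside a word."""
    return re.findall(r"[A-Za-z0-9_]+(?:'[A-Za-z]+)*", text)

kept_tokens = tokenize_keep_contractions(sample)
print(kept_tokens)
```

The first list contains `I`, `m`, `he`, `ll` and a trailing empty string; the second keeps `I'm` and `he'll` intact and has no empty entries.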

In [7]:
the_corpus = create_corpus(wordlist, []) 

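.DS_Store


UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte

The error comes from `.DS_Store`, a binary macOS metadata file that sits in the corpus folder and is not valid UTF-8. The reader picks it up whenever its file pattern matches every file in `corpus_root`; a pattern that matches only the lyric files (for example, only names ending in `.txt`) avoids it.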

In [None]:
len(the_corpus)

In [None]:
the_corpus[:10]

In [None]:
Albums = wordlist.fileids()
Albums[:14]
[fileid for fileid in Albums[:14]]

In [None]:
the_corpus[34990:35000]

We can now work out how varied the vocabulary is in the first 35,000 words of the Jay Z corpus. An astute observer will notice that we have not done any data cleaning. For example, look at the last 10 words of that slice, `the_corpus[34990:35000]`: `['die', 'And', 'even', 'if', 'Jehovah', 'witness', 'bet', 'he', 'll', 'never']`. The contraction "he'll" has been split into two separate words, `he` and `ll`. This happens because `create_corpus` treats every contiguous run of letters, digits, or underscores as a word and splits on everything else, including punctuation and whitespace. As a result, contractions like "I'm" or "he'll" get treated as two words. We can use the function `lexical_diversity` to measure how many times, on average, each unique word is used in our Jay Z corpus.

In [None]:
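def lexical_diversity(my_text_data):
    word_count = len(my_text_data)        # total number of tokens
    vocab_size = len(set(my_text_data))   # number of distinct tokens
    # float() avoids truncating integer division under Python 2
    diversity_score = float(word_count) / vocab_size
    return diversity_score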
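
As a quick sanity check, here is the same calculation on a six-word toy sentence. The function is repeated so the snippet runs on its own, with the division forced to floating point (an assumption on my part, so that the result is not truncated under Python 2); `toy_tokens` is made up for illustration.

```python
def lexical_diversity(my_text_data):
    word_count = len(my_text_data)        # total number of tokens
    vocab_size = len(set(my_text_data))   # number of distinct tokens
    return float(word_count) / vocab_size

toy_tokens = ['to', 'be', 'or', 'not', 'to', 'be']
toy_score = lexical_diversity(toy_tokens)
print(toy_score)  # 6 tokens / 4 distinct words = 1.5
```

A score of 1.5 means each distinct word is used 1.5 times on average; higher scores mean more repetition.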

Calling our function on the first 35,000 words of the Jay Z corpus gives us a score, which we can compare with the score for the same number of words from Jane Austen's *Emma*.

In [None]:
emma = nltk.corpus.gutenberg.words('austen-emma.txt')

print("The lexical diversity score for the first 35,000 words in the Jay Z corpus is %s"
      % lexical_diversity(the_corpus[:35000]))
print("The lexical diversity score for the first 35,000 words in the Emma corpus is %s"
      % lexical_diversity(emma[:35000]))



In [None]:
[fileid[5:] for fileid in Albums[:14]]

In [None]:
basketball_bag_of_words = ['bounce','crossover','technical',
 'shooting','double','jump','goal','backdoor','chest','ball',
 'team','block','throw','offensive','point','airball','pick',
 'assist','shot','layup','break','dribble','roll','cut','forward',
 'move','zone','three-pointer','free','post','fast','blocking','backcourt',
 'violation','foul','field','pass','turnover','alley-oop','guard']

In [None]:
cfd = nltk.ConditionalFreqDist(
          (target, fileid[5:])
           for fileid in Albums[:14]
           for w in wordlist.words(fileid)
           for target in basketball_bag_of_words
           if w.lower().startswith(target))
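
To make the counting explicit, here is a plain-Python sketch of what the conditional frequency distribution above tallies: for each target term, how many tokens in each song start with that term. The song names and tokens are made up, and `count_prefix_matches` is a hypothetical helper rather than an NLTK function.

```python
from collections import Counter, defaultdict

def count_prefix_matches(songs, targets):
    """Map each target to a Counter of {song: number of tokens starting with that target}."""
    counts = defaultdict(Counter)
    for song, tokens in songs.items():
        for w in tokens:
            for target in targets:
                if w.lower().startswith(target):
                    counts[target][song] += 1
    return counts

# Made-up tokens for two imaginary songs
toy_songs = {
    'Song A': ['Rolling', 'with', 'the', 'team', 'cute', 'roll'],
    'Song B': ['pass', 'the', 'ball', 'Passion'],
}
toy_counts = count_prefix_matches(toy_songs, ['roll', 'cut', 'pass', 'ball'])
print(toy_counts['roll']['Song A'])
print(toy_counts['cut']['Song A'])
print(toy_counts['pass']['Song B'])
```

Prefix matching also counts unrelated words, such as `cute` for `cut` or `Passion` for `pass`, so a high count is only a hint that a term deserves a closer look in context.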

In [None]:
# Render plots inline in the notebook
%matplotlib inline

In [None]:
cfd.plot()

From the plot we see that the basketball term "roll" seems to be used extensively in the song *Party Life*. Let's take a closer look at this phenomenon, and determine if "roll" was used in the "basketball" sense of the term. To do this, we need to see the context in which it was used. What we really need is a concordance. Let's build one.

The first thing I want to do is create a corpus that contains only words from the *American Gangster* album.

In [None]:
AmericanGangster_wordlist = PlaintextCorpusReader(corpus_root, 'JayZ_American Gangster_.*') 
AmericanGangster_corpus = create_corpus(AmericanGangster_wordlist, []) 

Building a concordance gets us into the area of elementary information retrieval (IR)<a href="#fn1" id="ref1">1</a>; think <i>basic search engine</i>. So why do we even need to “normalize” terms? We want to match <b>U.S.A.</b> and <b>USA</b>. Also, when we enter <b>roll</b>, we would like to match <b>Roll</b> and <b>rolling</b>. One way to do this is to stem each word, that is, reduce it to its base/stem/root form, so that <b>automate(s)</b>, <b>automatic</b>, and <b>automation</b> all reduce to <b>automat</b>. Most stemmers are pretty basic and just chop off standard affixes indicating things like tense (e.g., "-ed") and possessive forms (e.g., "-'s"). Here, we'll use one of the most popular English-language stemmers, the Porter stemmer, which comes with NLTK.

Once our tokens are stemmed, we can rest easy knowing that roll, Rolling, Rolls will all stem to roll.

<sup id="fn1">1. Some of this content has been adapted from Dan Jurafsky's <a href="https://web.stanford.edu/class/cs124/">Stanford CS124 class</a>. <a href="#ref1" title="Jump back to footnote 1 in the text.">↩</a></sup>
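
To see why normalization helps matching, here is a deliberately naive sketch. `toy_normalize` is a hypothetical helper that lowercases a term, drops periods, and strips a few common suffixes; it is not the Porter algorithm, which applies a much more careful sequence of rules.

```python
def toy_normalize(term):
    """Hypothetical, deliberately naive normalizer (NOT the Porter stemmer)."""
    term = term.lower().replace('.', '')      # 'U.S.A.' becomes 'usa'
    for suffix in ("'s", 'ing', 'ed', 's'):   # try each suffix in this order
        # Only strip when at least three characters would remain
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[:-len(suffix)]
    return term

for raw_term in ['U.S.A.', 'USA', 'roll', 'Roll', 'rolling', 'Rolls']:
    print(raw_term + ' -> ' + toy_normalize(raw_term))
```

Both spellings of USA end up as `usa`, and all four forms of "roll" end up as `roll`, so a search for any one form finds the others.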

In [None]:
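porter = nltk.PorterStemmer()

# Stem every token; lowercase the result so that 'Roll' and 'roll' share one stem
stemmed_tokens = [porter.stem(t).lower() for t in AmericanGangster_corpus]

# Show a sample of the stems (in sorted order) with their counts
for token in sorted(set(stemmed_tokens))[860:870]:
    print(token + ' [' + str(stemmed_tokens.count(token)) + ']')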


    
Now we can go ahead and create a concordance to test whether "roll" is used in the basketball (pick-and-roll) sense.
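
The next cell calls `IndexedText`, which this notebook never defines. It looks like the class of that name from the NLTK book's example of indexing a text using a stemmer; that is an assumption. Below is a minimal sketch in the same spirit, written with plain Python data structures so that it works with any object exposing a `stem(word)` method, such as `nltk.PorterStemmer()`. The `concordance_lines` method and the `_LowercaseStemmer` stand-in used in the demo are hypothetical additions for illustration.

```python
from collections import defaultdict

class IndexedText(object):
    """Index a list of tokens by stem so that all forms of a word can be looked up together."""

    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        self._index = defaultdict(list)   # stem -> positions in the text
        for i, word in enumerate(text):
            self._index[self._stem(word)].append(i)

    def _stem(self, word):
        return self._stemmer.stem(word).lower()

    def concordance_lines(self, word, width=40):
        """Return one line per occurrence of the word's stem; the match starts the right half."""
        key = self._stem(word)
        wc = width // 4                   # words of context on each side
        lines = []
        for i in self._index.get(key, []):
            left = ' '.join(self._text[max(0, i - wc):i])
            right = ' '.join(self._text[i:i + wc])
            lines.append(left[-width:].rjust(width) + ' ' + right[:width].ljust(width))
        return lines

    def concordance(self, word, width=40):
        for line in self.concordance_lines(word, width):
            print(line)


class _LowercaseStemmer(object):
    """Hypothetical stand-in for a real stemmer, used only in this demo."""
    def stem(self, word):
        word = word.lower()
        return word[:-3] if word.endswith('ing') else word

demo_tokens = ['We', 'keep', 'Rolling', 'on', 'and', 'roll', 'out']
demo = IndexedText(_LowercaseStemmer(), demo_tokens)
demo.concordance('roll')
```

Because the index is keyed by stem, looking up `roll` finds both `Rolling` and `roll` in the demo tokens, each shown with the words around it.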

In [None]:
AmericanGangster_lyrics = IndexedText(porter, AmericanGangster_corpus)
AmericanGangster_lyrics.concordance('roll')

In [None]:
print(AmericanGangster_wordlist.raw(fileids='JayZ_American Gangster_Party Life.txt'))

Based on the context, you can decide whether the word "roll" is used in a basketball sense. This is really where the "art" in "Arts and Sciences" comes into play in data science and NLP.