# Eminem, Akon & NLP
## I use NLTK to perform stemming, lemmatization, tokenization, and stop word removal on a dataset of my choice

I stumbled upon this [interesting piece](https://genius.com/a/akon-explains-how-eminem-treats-recording-music-like-a-nine-to-five-job) on Genius where "Akon Explains How Eminem Treats Recording Music Like A Nine-To-Five Job". I'll scrape this and use it as my dataset.

## Getting the Data

In [30]:
# Web scraping imports
import requests
from bs4 import BeautifulSoup

page = requests.get('https://genius.com/a/akon-explains-how-eminem-treats-recording-music-like-a-nine-to-five-job').text # Get all data from URL
soup = BeautifulSoup(page, "lxml") # Read as an HTML document
text = [p.text for p in soup.find(class_="article_rich_text_formatting").find_all('p')] # Pull out all text from the class article_rich_text_formatting. Creates a list of <p> paragraphs

text # Let's look at the data

['Back in 2006, Akon recruited Eminem for “Smack That,” the first single from his sophomore album, Konvicted. In a recent interview with Hot 97’s Ebro in the Morning, the Senegalese-American singer provided a look into the Detroit rapper’s 9-to-5 recording schedule.',
 '“Working with him made me look at the business different because he was the first artist that I worked with that actually treated the business like a real job,” Akon explained. “He comes in at 9 a.m. every day to the studio, takes his lunch break at 1, and is out of there by 5 p.m. It’s like a schedule.”',
 'Not anticipating Em’s strict schedule, Akon showed up on the first day at 6 p.m., only to find out that the Shady Records co-founder had already left for the day. After Em told him to arrive at the studio at 9 a.m. the next day, Akon recalled how Em immediately paused working not only for lunch, but at the end of the day as well.',
 "“I’m in the middle of writing a record, and he’s like, ‘Yo, I’m about to go out for

## Tokenization

My data is actually a list of paragraph strings. I'm going to join them into one large string to make it easier to work with.

In [31]:
text = ' '.join(text)
text

"Back in 2006, Akon recruited Eminem for “Smack That,” the first single from his sophomore album, Konvicted. In a recent interview with Hot 97’s Ebro in the Morning, the Senegalese-American singer provided a look into the Detroit rapper’s 9-to-5 recording schedule. “Working with him made me look at the business different because he was the first artist that I worked with that actually treated the business like a real job,” Akon explained. “He comes in at 9 a.m. every day to the studio, takes his lunch break at 1, and is out of there by 5 p.m. It’s like a schedule.” Not anticipating Em’s strict schedule, Akon showed up on the first day at 6 p.m., only to find out that the Shady Records co-founder had already left for the day. After Em told him to arrive at the studio at 9 a.m. the next day, Akon recalled how Em immediately paused working not only for lunch, but at the end of the day as well. “I’m in the middle of writing a record, and he’s like, ‘Yo, I’m about to go out for lunch,’” Ako

In [32]:
# Make all text lowercase
text_lowered = text.lower()
text_lowered

"back in 2006, akon recruited eminem for “smack that,” the first single from his sophomore album, konvicted. in a recent interview with hot 97’s ebro in the morning, the senegalese-american singer provided a look into the detroit rapper’s 9-to-5 recording schedule. “working with him made me look at the business different because he was the first artist that i worked with that actually treated the business like a real job,” akon explained. “he comes in at 9 a.m. every day to the studio, takes his lunch break at 1, and is out of there by 5 p.m. it’s like a schedule.” not anticipating em’s strict schedule, akon showed up on the first day at 6 p.m., only to find out that the shady records co-founder had already left for the day. after em told him to arrive at the studio at 9 a.m. the next day, akon recalled how em immediately paused working not only for lunch, but at the end of the day as well. “i’m in the middle of writing a record, and he’s like, ‘yo, i’m about to go out for lunch,’” ako

In [36]:
from nltk.tokenize import word_tokenize
# Tokenize the lowercase text
word_tokens = word_tokenize(text_lowered) 
word_tokens

['back',
 'in',
 '2006',
 ',',
 'akon',
 'recruited',
 'eminem',
 'for',
 '“',
 'smack',
 'that',
 ',',
 '”',
 'the',
 'first',
 'single',
 'from',
 'his',
 'sophomore',
 'album',
 ',',
 'konvicted',
 '.',
 'in',
 'a',
 'recent',
 'interview',
 'with',
 'hot',
 '97',
 '’',
 's',
 'ebro',
 'in',
 'the',
 'morning',
 ',',
 'the',
 'senegalese-american',
 'singer',
 'provided',
 'a',
 'look',
 'into',
 'the',
 'detroit',
 'rapper',
 '’',
 's',
 '9-to-5',
 'recording',
 'schedule',
 '.',
 '“',
 'working',
 'with',
 'him',
 'made',
 'me',
 'look',
 'at',
 'the',
 'business',
 'different',
 'because',
 'he',
 'was',
 'the',
 'first',
 'artist',
 'that',
 'i',
 'worked',
 'with',
 'that',
 'actually',
 'treated',
 'the',
 'business',
 'like',
 'a',
 'real',
 'job',
 ',',
 '”',
 'akon',
 'explained',
 '.',
 '“',
 'he',
 'comes',
 'in',
 'at',
 '9',
 'a.m.',
 'every',
 'day',
 'to',
 'the',
 'studio',
 ',',
 'takes',
 'his',
 'lunch',
 'break',
 'at',
 '1',
 ',',
 'and',
 'is',
 'out',
 'of',
 

## Removing Stop Words

In [39]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english')) # build the stop word set once instead of once per token
filtered_words = [word for word in word_tokens if word not in stop_words]
filtered_words

['back',
 '2006',
 ',',
 'akon',
 'recruited',
 'eminem',
 '“',
 'smack',
 ',',
 '”',
 'first',
 'single',
 'sophomore',
 'album',
 ',',
 'konvicted',
 '.',
 'recent',
 'interview',
 'hot',
 '97',
 '’',
 'ebro',
 'morning',
 ',',
 'senegalese-american',
 'singer',
 'provided',
 'look',
 'detroit',
 'rapper',
 '’',
 '9-to-5',
 'recording',
 'schedule',
 '.',
 '“',
 'working',
 'made',
 'look',
 'business',
 'different',
 'first',
 'artist',
 'worked',
 'actually',
 'treated',
 'business',
 'like',
 'real',
 'job',
 ',',
 '”',
 'akon',
 'explained',
 '.',
 '“',
 'comes',
 '9',
 'a.m.',
 'every',
 'day',
 'studio',
 ',',
 'takes',
 'lunch',
 'break',
 '1',
 ',',
 '5',
 'p.m.',
 '’',
 'like',
 'schedule.',
 '”',
 'anticipating',
 'em',
 '’',
 'strict',
 'schedule',
 ',',
 'akon',
 'showed',
 'first',
 'day',
 '6',
 'p.m.',
 ',',
 'find',
 'shady',
 'records',
 'co-founder',
 'already',
 'left',
 'day',
 '.',
 'em',
 'told',
 'arrive',
 'studio',
 '9',
 'a.m.',
 'next',
 'day',
 ',',
 'akon
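Stop word removal boils down to a membership test against a fixed word list. As a minimal sketch that runs without downloading any NLTK data, the snippet below filters the first few tokens from above against a tiny hand-picked stop set. `sample_tokens` and `tiny_stop_words` are illustrative names, and the six-word set is an assumption standing in for NLTK's much longer English list.

```python
# First tokens of the article, copied from the word_tokens output above.
sample_tokens = ['back', 'in', '2006', ',', 'akon', 'recruited', 'eminem',
                 'for', '“', 'smack', 'that', ',', '”', 'the', 'first',
                 'single', 'from', 'his', 'sophomore', 'album', ',',
                 'konvicted', '.']

# Hand-picked stand-in for stopwords.words('english'); a set gives fast lookups.
tiny_stop_words = {'in', 'for', 'that', 'the', 'from', 'his'}

# Keep every token that is not in the stop set.
sample_filtered = [tok for tok in sample_tokens if tok not in tiny_stop_words]
print(sample_filtered)
```

Punctuation tokens such as `,` and `“` survive because they are not in the stop list, which matches what the filtered output above shows.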

## Stemming
I'll use the Porter stemmer.

In [40]:
from nltk.stem import PorterStemmer

porter = PorterStemmer()
stemmed_words = [porter.stem(w) for w in filtered_words]
stemmed_words

['back',
 '2006',
 ',',
 'akon',
 'recruit',
 'eminem',
 '“',
 'smack',
 ',',
 '”',
 'first',
 'singl',
 'sophomor',
 'album',
 ',',
 'konvict',
 '.',
 'recent',
 'interview',
 'hot',
 '97',
 '’',
 'ebro',
 'morn',
 ',',
 'senegalese-american',
 'singer',
 'provid',
 'look',
 'detroit',
 'rapper',
 '’',
 '9-to-5',
 'record',
 'schedul',
 '.',
 '“',
 'work',
 'made',
 'look',
 'busi',
 'differ',
 'first',
 'artist',
 'work',
 'actual',
 'treat',
 'busi',
 'like',
 'real',
 'job',
 ',',
 '”',
 'akon',
 'explain',
 '.',
 '“',
 'come',
 '9',
 'a.m.',
 'everi',
 'day',
 'studio',
 ',',
 'take',
 'lunch',
 'break',
 '1',
 ',',
 '5',
 'p.m.',
 '’',
 'like',
 'schedule.',
 '”',
 'anticip',
 'em',
 '’',
 'strict',
 'schedul',
 ',',
 'akon',
 'show',
 'first',
 'day',
 '6',
 'p.m.',
 ',',
 'find',
 'shadi',
 'record',
 'co-found',
 'alreadi',
 'left',
 'day',
 '.',
 'em',
 'told',
 'arriv',
 'studio',
 '9',
 'a.m.',
 'next',
 'day',
 ',',
 'akon',
 'recal',
 'em',
 'immedi',
 'paus',
 'work',
 '

**For an easier side-by-side visual comparison I will combine the original filtered words and the stemmed filtered words into a dictionary.**

Notice, for example, that words like `single` became `singl` and `sophomore` became `sophomor`.

In [41]:
dict(zip(filtered_words, stemmed_words))

{'back': 'back',
 '2006': '2006',
 ',': ',',
 'akon': 'akon',
 'recruited': 'recruit',
 'eminem': 'eminem',
 '“': '“',
 'smack': 'smack',
 '”': '”',
 'first': 'first',
 'single': 'singl',
 'sophomore': 'sophomor',
 'album': 'album',
 'konvicted': 'konvict',
 '.': '.',
 'recent': 'recent',
 'interview': 'interview',
 'hot': 'hot',
 '97': '97',
 '’': '’',
 'ebro': 'ebro',
 'morning': 'morn',
 'senegalese-american': 'senegalese-american',
 'singer': 'singer',
 'provided': 'provid',
 'look': 'look',
 'detroit': 'detroit',
 'rapper': 'rapper',
 '9-to-5': '9-to-5',
 'recording': 'record',
 'schedule': 'schedul',
 'working': 'work',
 'made': 'made',
 'business': 'busi',
 'different': 'differ',
 'artist': 'artist',
 'worked': 'work',
 'actually': 'actual',
 'treated': 'treat',
 'like': 'like',
 'real': 'real',
 'job': 'job',
 'explained': 'explain',
 'comes': 'come',
 '9': '9',
 'a.m.': 'a.m.',
 'every': 'everi',
 'day': 'day',
 'studio': 'studio',
 'takes': 'take',
 'lunch': 'lunch',
 'break'
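One caveat with `dict(zip(...))`: a dictionary keeps one entry per unique token, so repeated tokens such as `akon` or `,` show up only once. To focus on what the stemmer actually changed, a small sketch like the following keeps only the pairs that differ. The two sample lists are copied from the outputs above, and `changed` is an illustrative name.

```python
# A few filtered tokens and their Porter stems, copied from the outputs above.
filtered_sample = ['back', '2006', 'akon', 'recruited', 'eminem', 'smack',
                   'first', 'single', 'sophomore', 'album', 'konvicted']
stemmed_sample = ['back', '2006', 'akon', 'recruit', 'eminem', 'smack',
                  'first', 'singl', 'sophomor', 'album', 'konvict']

# Keep only the tokens whose stem differs from the original word.
changed = {word: stem
           for word, stem in zip(filtered_sample, stemmed_sample)
           if word != stem}
print(changed)
```

Filtering this way makes truncated stems like `singl` and `sophomor` easier to spot than scanning the full dictionary.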

## Lemmatization

In [42]:
from nltk import pos_tag

tagged_words = pos_tag(filtered_words)
tagged_words

[('back', 'RB'),
 ('2006', 'CD'),
 (',', ','),
 ('akon', 'RB'),
 ('recruited', 'VBN'),
 ('eminem', 'NN'),
 ('“', 'NNP'),
 ('smack', 'NN'),
 (',', ','),
 ('”', 'NNP'),
 ('first', 'RB'),
 ('single', 'JJ'),
 ('sophomore', 'NN'),
 ('album', 'NN'),
 (',', ','),
 ('konvicted', 'VBN'),
 ('.', '.'),
 ('recent', 'JJ'),
 ('interview', 'NN'),
 ('hot', 'JJ'),
 ('97', 'CD'),
 ('’', 'JJ'),
 ('ebro', 'NN'),
 ('morning', 'NN'),
 (',', ','),
 ('senegalese-american', 'JJ'),
 ('singer', 'NN'),
 ('provided', 'VBD'),
 ('look', 'NN'),
 ('detroit', 'JJ'),
 ('rapper', 'NN'),
 ('’', 'NNP'),
 ('9-to-5', 'CD'),
 ('recording', 'NN'),
 ('schedule', 'NN'),
 ('.', '.'),
 ('“', 'NN'),
 ('working', 'VBG'),
 ('made', 'VBN'),
 ('look', 'NN'),
 ('business', 'NN'),
 ('different', 'JJ'),
 ('first', 'JJ'),
 ('artist', 'NN'),
 ('worked', 'VBD'),
 ('actually', 'RB'),
 ('treated', 'JJ'),
 ('business', 'NN'),
 ('like', 'IN'),
 ('real', 'JJ'),
 ('job', 'NN'),
 (',', ','),
 ('”', 'NNP'),
 ('akon', 'NN'),
 ('explained', 'VBD'),
 (

### Transformation

By default, `WordNetLemmatizer.lemmatize()` assumes that a word is a noun if no explicit POS tag is passed in. To avoid wrong lemmas, POS-tag your data before lemmatizing.

Above I have `tagged_words`, which is a list of tuples. However, `.lemmatize()` takes the word as a string for its first argument and a POS tag as a string for its second argument. So I'll need a function that iterates over the list of tuples and lemmatizes each word.

Additionally, because the POS tags used by the part-of-speech tagger are not the same as the POS codes used by WordNet, I need a small mapping function to convert POS tagger tags to WordNet POS codes.

To convert Treebank tags to WordNet tags the mapping is as follows:
1. wn.VERB = 'v'
2. wn.ADV = 'r'
3. wn.NOUN = 'n'
4. wn.ADJ = 'a'
5. wn.ADJ_SAT = 's'. But we can ignore `'s'` because the `WordNetLemmatizer` in NLTK does not differentiate satellite adjectives from normal adjectives.

From the [WordNet docs](http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html).

The other parts of speech will be tagged as nouns. See this [post](https://stackoverflow.com/questions/51634328/wordnetlemmatizer-different-handling-of-wn-adj-and-wn-adj-sat) if you're interested in details.

In [46]:
tagged_words[0][0]

'back'

In [104]:
def convert_tag(treebank_tag):
    '''Convert Treebank tags to WordNet tags'''
    if treebank_tag.startswith('J'):
        return 'a'
    elif treebank_tag.startswith('V'):
        return 'v'
    elif treebank_tag.startswith('N'):
        return 'n'
    elif treebank_tag.startswith('R'):
        return 'r'
    else:
        return 'n' # if no match, default to noun
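To see what the mapping does with tags it doesn't recognize, here is a minimal dictionary-based sketch of the same idea, applied to a few tagged tokens from the `pos_tag` output above. `to_wordnet_pos` and `PREFIX_TO_WORDNET` are hypothetical names used only for this illustration; they are not part of NLTK.

```python
# First letter of a Penn Treebank tag -> WordNet POS code.
PREFIX_TO_WORDNET = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}

def to_wordnet_pos(treebank_tag):
    """Map a Treebank tag to a WordNet POS code, defaulting to noun."""
    return PREFIX_TO_WORDNET.get(treebank_tag[:1], 'n')

# A few (word, tag) pairs copied from the tagged_words output above.
sample_tagged = [('back', 'RB'), ('2006', 'CD'), (',', ','),
                 ('recruited', 'VBN'), ('single', 'JJ'), ('album', 'NN')]

for word, tag in sample_tagged:
    print(word, tag, to_wordnet_pos(tag))
```

Numbers (`CD`) and punctuation fall through to `'n'`. That is harmless here, because the lemmatizer returns any word it cannot find in WordNet unchanged.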

In [110]:
from nltk.stem import WordNetLemmatizer # lemmatizes a word based on its part of speech

# Lemmatize using WordNet's built-in morphy function. Returns the input word unchanged if it cannot be found in WordNet.
wnl = WordNetLemmatizer()

def tuple_lemmatizer(tuple_list):
    '''Lemmatize words leveraging their part-of-speech tags as well'''
    lemma_word = []
    
    for tupl in tuple_list:
        lemma_word.append(wnl.lemmatize(tupl[0], convert_tag(tupl[1])))
    
    return lemma_word

In [112]:
lemma_words = tuple_lemmatizer(tagged_words) # list of lemmatized words

In [113]:
dict(zip(filtered_words, lemma_words)) # key = original filtered text; value = lemmatized words

{'back': 'back',
 '2006': '2006',
 ',': ',',
 'akon': 'akon',
 'recruited': 'recruit',
 'eminem': 'eminem',
 '“': '“',
 'smack': 'smack',
 '”': '”',
 'first': 'first',
 'single': 'single',
 'sophomore': 'sophomore',
 'album': 'album',
 'konvicted': 'konvicted',
 '.': '.',
 'recent': 'recent',
 'interview': 'interview',
 'hot': 'hot',
 '97': '97',
 '’': '’',
 'ebro': 'ebro',
 'morning': 'morning',
 'senegalese-american': 'senegalese-american',
 'singer': 'singer',
 'provided': 'provide',
 'look': 'look',
 'detroit': 'detroit',
 'rapper': 'rapper',
 '9-to-5': '9-to-5',
 'recording': 'recording',
 'schedule': 'schedule',
 'working': 'work',
 'made': 'make',
 'business': 'business',
 'different': 'different',
 'artist': 'artist',
 'worked': 'work',
 'actually': 'actually',
 'treated': 'treated',
 'like': 'like',
 'real': 'real',
 'job': 'job',
 'explained': 'explain',
 'comes': 'come',
 '9': '9',
 'a.m.': 'a.m.',
 'every': 'every',
 'day': 'day',
 'studio': 'studio',
 'takes': 'take',
 'lu