# Basic Data Analysis Review: Taylor Swift Lyrics

## Importing Libraries

In [1]:
import string
import re
import numpy as np
import pandas as pd

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.stem import SnowballStemmer
import spacy

# The "en" shortcut was removed in spaCy 3; load the model by its full name
sp = spacy.load("en_core_web_sm")


## Overview

While driving and listening to the radio recently, I heard one of Taylor Swift's songs. Her lyrics were easy to understand, which got me thinking about how they compare to the most common words in the English language and to the lyrics of other artists. 
This led me on a journey of parsing and tokenizing to answer the overall question: 

``` How do Taylor Swift's lyrics compare to the most common words of the English language?```


### The Data

The data is freely available on [Kaggle](https://www.kaggle.com/PromptCloudHQ/taylor-swift-song-lyrics-from-all-the-albums). It contains lyrics from 6 of her 7 albums, organized by line. Unfortunately, her most recent album, "Lover", was not available at the time of this analysis. 

In [2]:
# Loading in data
data = pd.read_csv("data/taylor_swift_lyrics.csv", encoding="latin1")

In [3]:
data.head()

Unnamed: 0,artist,album,track_title,track_n,lyric,line,year
0,Taylor Swift,Taylor Swift,Tim McGraw,1,He said the way my blue eyes shined,1,2006
1,Taylor Swift,Taylor Swift,Tim McGraw,1,Put those Georgia stars to shame that night,2,2006
2,Taylor Swift,Taylor Swift,Tim McGraw,1,"I said, ""That's a lie""",3,2006
3,Taylor Swift,Taylor Swift,Tim McGraw,1,Just a boy in a Chevy truck,4,2006
4,Taylor Swift,Taylor Swift,Tim McGraw,1,That had a tendency of gettin' stuck,5,2006


The dataset has 7 columns and 4862 rows. The columns are defined below:

* `artist`: Artist name
* `album`: Album name
* `track_title`: Song title
* `track_n`: Track number in the album
* `lyric`: Song lyric
* `line`: Line number in the song
* `year`: Year the album was released


The data is organized by song line, so I'll need to concatenate the `lyric` column by song. 

In [4]:
# This will group by the track title and keep all the needed columns.
# The lines are being joined with a simple space
data = data.groupby(['album', 'year', 'track_n', 'track_title', ])[
    'lyric'].apply(' '.join).reset_index()
data.head()

Unnamed: 0,album,year,track_n,track_title,lyric
0,1989,2014,1,Welcome to New York,"Walking through a crowd, the village is aglow ..."
1,1989,2014,2,Blank Space,"Nice to meet you, where you been? I could show..."
2,1989,2014,3,Style,"Midnight You come and pick me up, no headlight..."
3,1989,2014,4,Out of the Woods,Looking at it now It all seems so simple We we...
4,1989,2014,5,All You Had to Do Was Stay,People like you always want back The love they...


The data is now grouped by track title, which makes it easier to tokenize. Below are some simple summary statistics. 

In [5]:
data.describe(include = "all")

Unnamed: 0,album,year,track_n,track_title,lyric
count,94,94.0,94.0,94.0,94
unique,6,,,94.0,94
top,Red,,,22.0,"Tall, dark, and superman He puts papers in his..."
freq,19,,,1.0,1
mean,,2011.329787,8.457447,,
std,,3.557178,4.753414,,
min,,2006.0,1.0,,
25%,,2008.0,4.25,,
50%,,2012.0,8.0,,
75%,,2014.0,12.0,,


Next, we add a word count feature to the data set, which could be useful later.

In [6]:
data['word_count'] = data['lyric'].apply(lambda x: len(str(x).split(" ")))
data.head()

Unnamed: 0,album,year,track_n,track_title,lyric,word_count
0,1989,2014,1,Welcome to New York,"Walking through a crowd, the village is aglow ...",321
1,1989,2014,2,Blank Space,"Nice to meet you, where you been? I could show...",502
2,1989,2014,3,Style,"Midnight You come and pick me up, no headlight...",371
3,1989,2014,4,Out of the Woods,Looking at it now It all seems so simple We we...,652
4,1989,2014,5,All You Had to Do Was Stay,People like you always want back The love they...,453
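
One caveat on this count: `split(" ")` splits on every single space, so a doubled space produces an empty string that is counted as a word, while `split()` with no argument collapses runs of whitespace. Whether this matters depends on the raw lyric lines. A small sketch on a made-up string (not taken from the dataset):

```python
# Made-up sample line with a doubled space (not taken from the dataset)
sample = "we  are never ever getting back together"

count_single_space = len(sample.split(" "))  # splits on every single space
count_whitespace = len(sample.split())       # collapses runs of whitespace

print(count_single_space, count_whitespace)
```

If the two numbers differ on the real data, `split()` is probably the count you want.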


## Pre-processing

### Convert to lowercase

Let's first convert all the lyrics to lowercase.

In [7]:
data['clean_lyrics'] = data['lyric'].apply(
    lambda lyric: " ".join(word.lower() for word in lyric.split()))

# Replace hyphens with spaces so hyphenated words become separate words
data["clean_lyrics"] = data["clean_lyrics"].str.replace("-", " ")

In [8]:
data.head()

Unnamed: 0,album,year,track_n,track_title,lyric,word_count,clean_lyrics
0,1989,2014,1,Welcome to New York,"Walking through a crowd, the village is aglow ...",321,"walking through a crowd, the village is aglow ..."
1,1989,2014,2,Blank Space,"Nice to meet you, where you been? I could show...",502,"nice to meet you, where you been? i could show..."
2,1989,2014,3,Style,"Midnight You come and pick me up, no headlight...",371,"midnight you come and pick me up, no headlight..."
3,1989,2014,4,Out of the Woods,Looking at it now It all seems so simple We we...,652,looking at it now it all seems so simple we we...
4,1989,2014,5,All You Had to Do Was Stay,People like you always want back The love they...,453,people like you always want back the love they...


### Expanding contractions

The next part is a little trickier. I want to expand all the contractions in the lyrics. 
Let's identify them first.

In [9]:
# Finding all the contractions that need to be addressed
contracts = []
for lyric in data['clean_lyrics']:
    # adds a word if it contains an apostrophe
    contracts.extend(re.findall(r"\w+'\w+", lyric))
    # adds a word if it ends with an apostrophe
    contracts.extend(re.findall(r"\w+' ", lyric))
    # adds a word if it starts with an apostrophe
    contracts.extend(re.findall(r" '+\w+", lyric))

contracts = list(np.unique(contracts))
print(contracts)

[" '45", " 'baby", " 'bout", " 'cause", " 'em", " 'fore", " 'i", " 'round", " 'til", " 'till", " 'yes", "ain't", "aren't", "baby's", "battle's", "beggin' ", "bein' ", "breakin' ", "brother's", "burnin' ", "c'mon", "callin' ", "can't", "city's", "comin' ", "cory's", "could've", "couldn't", "cruisin' ", "cryin' ", "daddy's", "darlin' ", "didn't", "doesn't", "doin' ", "don't", "drinkin' ", "drivin' ", "dyin' ", "everybody's", "everything's", "father's", "gettin' ", "gon' ", "gravity's", "groovin' ", "guessin' ", "hadn't", "hand's", "hasn't", "haven't", "he'd", "he'll", "he's", "heart's", "here' ", "here's", "holdin' ", "how'd", "how's", "i'd", "i'll", "i'm", "i've", "isn't", "it'll", "it's", "jury's", "let's", "lookin' ", "love's", "lovin' ", "lyin' ", "magic's", "makin' ", "man's", "mom's", "momma's", "mother's", "movin' ", "n' ", "nineteen's", "nobody's", "nothin' ", "nothing's", "one's", "parents' ", "passenger's", "people's", "pickin' ", "pj's", "reputation's", "ridin' ", "runnin' ", 
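
To see what each of the three patterns picks up, and why some entries in the list above carry a leading or trailing space, here is a small sketch on a made-up line (not from the dataset):

```python
import re

# Made-up sample line (not from the dataset), lowercased like clean_lyrics
sample = "i'm just callin' 'cause you don't know"

inner = re.findall(r"\w+'\w+", sample)   # apostrophe inside the word
trailing = re.findall(r"\w+' ", sample)  # apostrophe at the end, then a space
leading = re.findall(r" '+\w+", sample)  # space, apostrophe(s), then the word

print(inner, trailing, leading)
```

The space is part of the second and third patterns, so it ends up inside the matched string and has to be stripped later.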

We can use the `contractions` module to expand the most common contractions. A few needed manual expanding, such as **_c'mon_**, _**why'd**_ and _**that'll**_, and all the shortened verbs needed to be changed to their full form (example: callin' becomes **calling**). 

In [10]:
import contractions

In [11]:
# This creates a list with the value of the contractions when expanded,
# aligned index-for-index with `contracts`

# Contractions that are expanded by hand instead of by the contractions module
manual_expansions = {
    " 'bout": "about",
    " 'cause": "because",
    " 'em": "them",
    " 'fore": "before",
    " 'round": "around",
    " 'til": "until",
    " 'till": "until",
    "c'mon": "come on",
    "why'd": "why did",
    "that'll": "that will",
    "shoulda' ": "should have",
}

expands = []
for word in contracts:
    if word == "n' ":
        word = "and"
    # Restore the dropped "g" in words such as "callin' " -> "calling".
    # Note that this also turns "gon' " into "gong".
    if word.endswith("n' "):
        word = word[:-2] + "g"

    if word in manual_expansions:
        expands.append(manual_expansions[word])
    else:
        expands.append(contractions.fix(word))
print(expands)

[" '45", " 'baby", 'about', 'because', 'them', 'before', " 'i", 'around', 'until', 'until', " 'yes", 'are not', 'are not', "baby's", "battle's", 'begging', 'being', 'breaking', "brother's", 'burning', 'come on', 'calling', 'can not', "city's", 'coming', "cory's", 'could have', 'could not', 'cruising', 'crying', "daddy's", 'darling', 'did not', 'does not', 'doing', 'do not', 'drinking', 'driving', 'dying', "everybody's", "everything's", "father's", 'getting', 'gong', "gravity's", 'grooving', 'guessing', 'had not', "hand's", 'has not', 'have not', 'he would', 'he will', 'he is', "heart's", "here' ", "here's", 'holding', 'how did', 'how is', 'I would', 'I will', 'I am', 'I have', 'is not', 'it will', 'it is', "jury's", 'let us', 'looking', "love's", 'loving', 'lying', "magic's", 'making', "man's", "mom's", "momma's", "mother's", 'moving', 'and', "nineteen's", "nobody's", 'nothing', "nothing's", "one's", "parents' ", "passenger's", "people's", 'picking', "pj's", "reputation's", 'riding', '

One quick adjustment so we can match the contractions when tokenizing: remove the space in front of them, and the apostrophe and space at the end of words ending with an apostrophe. 

In [12]:
for word in range(len(contracts)):
    if contracts[word].startswith(" "):
        contracts[word] = contracts[word][1:]
    if contracts[word].endswith("n\' "):
        contracts[word] = contracts[word][:-2]

# This will remove the space and the apostrophe before the word
for word in range(len(expands)):
    if expands[word].startswith(" "):
        expands[word] = expands[word][2:]
    if expands[word].startswith("\'"):
        expands[word] = expands[word][1:]

In [13]:
print(contracts)

["'45", "'baby", "'bout", "'cause", "'em", "'fore", "'i", "'round", "'til", "'till", "'yes", "ain't", "aren't", "baby's", "battle's", 'beggin', 'bein', 'breakin', "brother's", 'burnin', "c'mon", 'callin', "can't", "city's", 'comin', "cory's", "could've", "couldn't", 'cruisin', 'cryin', "daddy's", 'darlin', "didn't", "doesn't", 'doin', "don't", 'drinkin', 'drivin', 'dyin', "everybody's", "everything's", "father's", 'gettin', 'gon', "gravity's", 'groovin', 'guessin', "hadn't", "hand's", "hasn't", "haven't", "he'd", "he'll", "he's", "heart's", "here' ", "here's", 'holdin', "how'd", "how's", "i'd", "i'll", "i'm", "i've", "isn't", "it'll", "it's", "jury's", "let's", 'lookin', "love's", 'lovin', 'lyin', "magic's", 'makin', "man's", "mom's", "momma's", "mother's", 'movin', 'n', "nineteen's", "nobody's", 'nothin', "nothing's", "one's", "parents' ", "passenger's", "people's", 'pickin', "pj's", "reputation's", 'ridin', 'runnin', 'sayin', "she'd", "she'll", "she's", "should've", "shoulda' ", "sho

In [14]:
print(expands)

['45', 'baby', 'about', 'because', 'them', 'before', 'i', 'around', 'until', 'until', 'yes', 'are not', 'are not', "baby's", "battle's", 'begging', 'being', 'breaking', "brother's", 'burning', 'come on', 'calling', 'can not', "city's", 'coming', "cory's", 'could have', 'could not', 'cruising', 'crying', "daddy's", 'darling', 'did not', 'does not', 'doing', 'do not', 'drinking', 'driving', 'dying', "everybody's", "everything's", "father's", 'getting', 'gong', "gravity's", 'grooving', 'guessing', 'had not', "hand's", 'has not', 'have not', 'he would', 'he will', 'he is', "heart's", "here' ", "here's", 'holding', 'how did', 'how is', 'I would', 'I will', 'I am', 'I have', 'is not', 'it will', 'it is', "jury's", 'let us', 'looking', "love's", 'loving', 'lying', "magic's", 'making', "man's", "mom's", "momma's", "mother's", 'moving', 'and', "nineteen's", "nobody's", 'nothing', "nothing's", "one's", "parents' ", "passenger's", "people's", 'picking', "pj's", "reputation's", 'riding', 'running'

We will build a dictionary that maps each contraction to its expansion so processing is easier. 

In [15]:
# Creates a dictionary containing the contractions and their respective expansions
dictionary = dict(zip(contracts, expands))
print(dictionary)

{"'45": '45', "'baby": 'baby', "'bout": 'about', "'cause": 'because', "'em": 'them', "'fore": 'before', "'i": 'i', "'round": 'around', "'til": 'until', "'till": 'until', "'yes": 'yes', "ain't": 'are not', "aren't": 'are not', "baby's": "baby's", "battle's": "battle's", 'beggin': 'begging', 'bein': 'being', 'breakin': 'breaking', "brother's": "brother's", 'burnin': 'burning', "c'mon": 'come on', 'callin': 'calling', "can't": 'can not', "city's": "city's", 'comin': 'coming', "cory's": "cory's", "could've": 'could have', "couldn't": 'could not', 'cruisin': 'cruising', 'cryin': 'crying', "daddy's": "daddy's", 'darlin': 'darling', "didn't": 'did not', "doesn't": 'does not', 'doin': 'doing', "don't": 'do not', 'drinkin': 'drinking', 'drivin': 'driving', 'dyin': 'dying', "everybody's": "everybody's", "everything's": "everything's", "father's": "father's", 'gettin': 'getting', 'gon': 'gong', "gravity's": "gravity's", 'groovin': 'grooving', 'guessin': 'guessing', "hadn't": 'had not', "hand's": 

Some contractions are slang and do not include an apostrophe. We handle these manually in the function below. 

In [16]:
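def expanding_contractions(string):
    """
    Expand the contractions and slang in a lyric string.

    Each whitespace-separated word is replaced by its expansion from
    `dictionary` when one exists, a few slang words are expanded by hand,
    and words ending in "in'" get their final "g" back.
    Returns the rebuilt string.
    """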

    split_string = string.split()
    for word in range(len(split_string)):
        if split_string[word] in dictionary:
            split_string[word] = dictionary[split_string[word]]
        if split_string[word]  == "cause":
            split_string[word] = "because"
        if split_string[word]  == "wanna":
            split_string[word] = "want to"
        if split_string[word]  == "till":
            split_string[word] = "until"
        if split_string[word]  == "kinda":
            split_string[word] = "kind of"
        if split_string[word]  == "skip-skippin'":
            split_string[word] = "skip"
        if split_string[word]  == "trip-trippin'":
            split_string[word] = "trip"       
        if split_string[word]  == "whatcha":
            split_string[word] = "what are you"
        if split_string[word]  == "outta":
            split_string[word] = "out of" 
        if split_string[word]  == "shoulda'":
            split_string[word] = "should have" 
        if split_string[word]  == "cmon":
            split_string[word] = "come on" 
        if split_string[word]  == "shoulda":
            split_string[word] = "should have" 

            
        
        if split_string[word]  in [ "comin", "screamin", "doin", "flyin", "shakin", "usin", "pacin"]:
            split_string[word] = str(split_string[word] + "g")
            
        if split_string[word].endswith("in\'"):
            split_string[word] = str(split_string[word][:-1] + "g")
    
        if split_string[word].endswith("in\',"):
            split_string[word] = str(split_string[word][:-2] + "g")
      
        
    return " ".join(split_string)



#["sparkin","beggin","breakin","burnin", "callin", "comin","cruisin",
#                                  "cryin", "darlin","doin", "drinkin", "drivin","dyin", "gettin", "groovin",
#                                   "guessin", "wonderin", "whippin", "usin", "tryin", "trippin", "toyin",
#                                  "thinkin", "thankin", "standin", "sneakin", "touchin", "twistin",
#                                  "skippin", "sittin", "screamin", "sayin", "shakin", "runnin", 
#                                  "ridin", "pickin", "nothin", "flyin", "holdin", "lookin", "lovin", "lyin", 
#                                  "makin", "movin", "pacin"]
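
The chain of `if` statements above can also be written as a lookup table. A minimal sketch, assuming a small hypothetical `SLANG` dictionary and a hypothetical helper named `expand_slang` (neither is part of this notebook), covering only a few of the cases handled above:

```python
# Hypothetical lookup table; only a subset of the slang handled above
SLANG = {
    "cause": "because",
    "wanna": "want to",
    "till": "until",
    "kinda": "kind of",
    "whatcha": "what are you",
    "outta": "out of",
    "shoulda": "should have",
}


def expand_slang(text):
    """Replace slang tokens and restore the dropped "g" in words ending in "in'"."""
    words = []
    for word in text.split():
        word = SLANG.get(word, word)
        if word.endswith("in'"):
            word = word[:-1] + "g"  # "runnin'" -> "running"
        words.append(word)
    return " ".join(words)


print(expand_slang("i wanna keep runnin' till the mornin'"))
```

Adding a new slang word then only means adding one dictionary entry.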


We can now update the `clean_lyrics` column for each song by expanding these contractions. 

In [17]:
data['clean_lyrics'] = data['clean_lyrics'].map(lambda s: expanding_contractions(s))

## New contraction cleaned song lyrics example

In [18]:
data['clean_lyrics'][52]

'long were the nights when my days once revolved around you counting my footsteps praying the floor will not fall through... again my mother accused me of losing my mind but i swore i was fine you paint me a blue sky then go back and turn it to rain and i lived in your chess game but you changed the rules everyday wondering which version of you i might get on the phone, tonight well i stopped picking up and this song is to let you know why dear john, i see it all now that you are gone do not you think i was too young to be messed with the girl in the dress cried the whole way home i should have known well maybe it is me and my blind optimism to blame or maybe it is you and your sick need to give love and take it away and you will add my name to your long list of traitors who do not understand and i look back in regret how i ignored when they said "run as fast as you can" dear john, i see it all now that you are gone do not you think i was too young to be messed with the girl in the dre

### Lemmatization and Punctuation

I explored several lemmatization methods. Although this one takes the most setup, I believe it does the best job of identifying the roots of words while still returning a full word. 

To reproduce this method of lemmatization, follow these steps, which I amended from the article [Lemmatization Approaches with Examples in Python](https://www.machinelearningplus.com/nlp/lemmatization-examples-python/):
1. Install Java from [this link](https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html)
2. On your terminal (if using a mac) run :

    ```brew cask install java ```


3. Install the python package `stanfordcorenlp` on your terminal using:

    ```pip install stanfordcorenlp```
    
    
4. Download the Stanford CoreNLP software from [this link](https://stanfordnlp.github.io/CoreNLP/index.html#download).
5. Navigate to the Stanford CoreNLP folder in your terminal and run the following:

    ``` cd stanford-corenlp-full-2018-02-27 ```

    ```java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -annotators "tokenize,ssplit,pos,lemma,parse,sentiment" -port 9000 -timeout 30000```


In [19]:
from stanfordcorenlp import StanfordCoreNLP
import json

The function below is adapted, with some alterations, from the article [Lemmatization Approaches with Examples in Python](https://www.machinelearningplus.com/nlp/lemmatization-examples-python/).

In [20]:
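def preprocess(connection, phrase):
    
    """
    Lemmatize a phrase using a running Stanford CoreNLP server.

    Punctuation tokens are dropped and a trailing "ness" or "less" is cut
    from each token before the phrase is sent to the server through
    `connection`. Returns the lemmas as one lowercase string with any
    remaining punctuation removed.
    """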

    props = {'annotators': 'pos,lemma',
             'pipelineLanguage': 'en',
             'outputFormat': 'json'}

    tokenize = connection.word_tokenize(phrase)

    sents_no_punct = []
    for token in tokenize:
        if token not in string.punctuation:
            if token.endswith('ness') or token.endswith('less'):
                token = token[:-4]
            sents_no_punct.append(token)
        
    sentence_no_punct = " ".join(sents_no_punct)

    parsed_string = connection.annotate(sentence_no_punct, properties=props)
    parsed_dict = json.loads(parsed_string)

    lemma_list = []
    for sentence in parsed_dict['sentences']:
        for word in sentence['tokens']:
            lemma_list.append(word['lemma'])

    lemmatized = " ".join(lemma_list)
    lemmatized = lemmatized.translate(
        str.maketrans('', '', string.punctuation))
    
    return lemmatized.lower()


nlp = StanfordCoreNLP('http://localhost', port=9000, timeout=30000)


data['clean_lyrics'] = data['clean_lyrics'].map(lambda s: preprocess(nlp, s))
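
The punctuation filter and suffix stripping inside `preprocess` can be tried without a running CoreNLP server. A rough sketch, using a plain `split()` in place of the server's tokenizer (an assumption made only to keep the example self-contained) and a hypothetical helper name:

```python
import string


def strip_punct_and_suffix(tokens):
    """Drop punctuation tokens and cut a trailing "ness" or "less", as preprocess does."""
    kept = []
    for token in tokens:
        if token not in string.punctuation:
            if token.endswith("ness") or token.endswith("less"):
                token = token[:-4]
            kept.append(token)
    return kept


# Made-up tokens; split() stands in for the CoreNLP tokenizer here
tokens = "fearless , kindness , unless !".split()
print(strip_punct_and_suffix(tokens))
```

Short words such as "unless" lose their ending too, so a minimum-length check before stripping may be worth considering.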

### Sample after preprocessing

#### Before

In [21]:
data['lyric'][16]

"There's something 'bout the way The street looks when it's just rained There's a glow off the pavement You walk me to the car And you know I wanna ask you to dance right there In the middle of the parking lot, yeah Oh, yeah We're driving down the road I wonder if you know I'm trying so hard not to get caught up now But you're just so cool Run your hands through your hair Absent mindedly making me want you And I don't know how it gets better than this You take my hand and drag me head first Fearless And I don't know why but with you I'd dance in a storm in my best dress Fearless So baby drive slow 'til we run out of road in this one horse town I wanna stay right here in this passenger's seat You put your eyes on me In this moment now capture it, remember it And I don't know how it gets better than this You take my hand and drag me head first Fearless And I don't know why but with you I'd dance in a storm in my best dress Fearless Well you stood there with me in the doorway My hands sha

#### After

In [22]:
data['clean_lyrics'][16]

'there be something about the way the street look when it be just rain there be a glow off the pavement you walk i to the car and you know i want to ask you to dance right there in the middle of the parking lot yeah oh yeah we be drive down the road i wonder if you know i be try so hard not to get catch up now but you be just so cool run you hand through you hair absent mindedly make i want you and i do not know how it get better than this you take my hand and drag i head first fear and i do not know why but with you i would dance in a storm in my best dress fear so baby drive slow until we run out of road in this one horse town i want to stay right here in this passenger s seat you put you eye on i in this moment now capture it remember it and i do not know how it get better than this you take my hand and drag i head first fear and i do not know why but with you i would dance in a storm in my best dress fear well you stand there with i in the doorway my hand shake i be not usually thi

This doesn't look too bad! Now we can finally move onto Exploratory Data Analysis and answering the question.

## Tokenization 

In [23]:
data['clean_lyrics_tokenized'] = data['clean_lyrics'].apply(lambda x: nltk.word_tokenize(x))

That was easily done. Let's explore Taylor's lyrics in their entirety. 

## Exploration of All Lyrics

In [24]:
all_lyrics = np.concatenate(data['clean_lyrics_tokenized'].values).tolist()

In [25]:
all_albums_vocab_count = len(all_lyrics)

Across 6 of her 7 albums, her preprocessed lyrics contain 37491 words. 

In [26]:
values, counts = np.unique(all_lyrics, return_counts=True)

In [27]:
len(values)

1739

Only 1739 of them are unique.

This next part lets me explore the words Taylor uses most often. The code creates a dictionary of each word and its frequency count.

In [41]:
from collections import Counter, OrderedDict

#Counter(all_lyrics).most_common(1000)
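
For reference, `Counter.most_common(n)` returns the `n` most frequent items as `(word, count)` pairs, most frequent first. A small sketch on a made-up token list (not the real lyrics):

```python
from collections import Counter

# Made-up tokens standing in for all_lyrics
tokens = ["ooh", "you", "ooh", "love", "you", "ooh"]

counts = Counter(tokens)
top_two = counts.most_common(2)  # most frequent first
n_unique = len(counts)           # number of distinct tokens

print(top_two, n_unique)
```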

Fun fact: "ooh" is used 103 times, "ey" 72 times, "na" 68 times, "gon" 65 times, "uh" 25 times and "mm" 19 times. All of these appear more often than "boy", which is written 17 times. 

In [29]:
#Counter(all_lyrics).most_common(1834)[:-1834:-1]

Let's import the top 5000 US English words, courtesy of [wordfrequency.info](http://www.wordfrequency.info). This CSV was downloaded from their site, and they are the owners of the material. 

In [30]:
word_frequency_df = pd.read_csv('word_frequency.csv')
word_frequency_list = list(word_frequency_df['Word'])
word_frequency_list = [k.lower() for k in word_frequency_list] # change to all lowercase
top_1000 = word_frequency_list[0:1000]
top_100 = word_frequency_list[0:100]

In [31]:
#Convert to dictionary over a counter object
all_lyrics_dict = dict(Counter(all_lyrics))

In [32]:
all_lyrics_dict = dict(Counter(all_lyrics))
common_words = 0
original_word = dict()
for word in all_lyrics_dict.keys(): 
    if word in top_100:
        common_words += all_lyrics_dict[word]
    else: 
        original_word[word] = all_lyrics_dict[word]
print(common_words)


24448


In [33]:
all_lyrics_dict = dict(Counter(all_lyrics))
common_words_1000 = 0
original_word = dict()
for word in all_lyrics_dict.keys(): 
    if word in top_1000:
        common_words_1000 += all_lyrics_dict[word]
    else: 
        original_word[word] = all_lyrics_dict[word]
print(common_words_1000)

32954


In [34]:
all_lyrics_dict = dict(Counter(all_lyrics))
common_words_5000 = 0
original_word = dict()
for word in all_lyrics_dict.keys(): 
    if word in word_frequency_list:
        common_words_5000 += all_lyrics_dict[word]
    else: 
        original_word[word] = all_lyrics_dict[word]
print(common_words_5000)

35954
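
The three cells above repeat the same loop with a different reference list each time. A sketch of how that could be folded into one helper, assuming a hypothetical function named `coverage` and made-up toy data (not the real lyrics or frequency list):

```python
from collections import Counter


def coverage(word_counts, reference_words):
    """Return (total occurrences of words in reference_words, counts of the other words)."""
    reference = set(reference_words)  # set membership is faster than scanning a list
    common_total = 0
    others = {}
    for word, count in word_counts.items():
        if word in reference:
            common_total += count
        else:
            others[word] = count
    return common_total, others


# Made-up data standing in for all_lyrics and the frequency list
toy_counts = Counter(["the", "the", "ooh", "you", "ooh", "starlight"])
toy_reference = ["the", "you", "be"]

common_total, others = coverage(toy_counts, toy_reference)
share = round(common_total / sum(toy_counts.values()) * 100, 2)
print(common_total, others, share)
```

With the real data, the same call could be made once each for `top_100`, `top_1000` and `word_frequency_list`.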


This means that 24448 of the words in her lyrics come from the 100 most common English words. Her most frequent words that are not in the frequency list (`original_word`) are below:

In [42]:
sorted(original_word.items(), key=lambda x: x[1], reverse=True)[:20]

[('ooh', 103),
 ('ey', 72),
 ('s', 70),
 ('na', 68),
 ('gon', 65),
 ('la', 42),
 ('whoa', 34),
 ('york', 30),
 ('fake', 23),
 ('mmm', 22),
 ('starlight', 22),
 ('getaway', 22),
 ('wonderland', 21),
 ('mm', 19),
 ('darling', 18),
 ('e', 16),
 ('alright', 14),
 ('ta', 14),
 ('aah', 13),
 ('loving', 13)]

So of the total preprocessed lyrics, 65% come from the 100 most common English words and 87% come from the 1000 most common. 

In [36]:
round(common_words/all_albums_vocab_count *100,2)

64.81

In [37]:
round(common_words_1000/all_albums_vocab_count *100,2)

87.36

The rest are oooos and ahhhhs :) jk

## Attributions:

* [Lemmatization Approaches with Examples in Python](https://www.machinelearningplus.com/nlp/lemmatization-examples-python/)