<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Synopsis" data-toc-modified-id="Synopsis-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Synopsis</a></span></li><li><span><a href="#Importing-Libraries" data-toc-modified-id="Importing-Libraries-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Importing Libraries</a></span></li><li><span><a href="#Read-in-data" data-toc-modified-id="Read-in-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Read in data</a></span></li><li><span><a href="#Cleaning" data-toc-modified-id="Cleaning-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Cleaning</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Corpus" data-toc-modified-id="Corpus-4.0.1"><span class="toc-item-num">4.0.1&nbsp;&nbsp;</span>Corpus</a></span></li><li><span><a href="#Countvectorizer" data-toc-modified-id="Countvectorizer-4.0.2"><span class="toc-item-num">4.0.2&nbsp;&nbsp;</span>Countvectorizer</a></span></li></ul></li></ul></li></ul></div>

# Synopsis

Now that we have the lyrics in a dataframe with each artist as the index, we can begin pre-processing the data for Exploratory Data Analysis and modeling. The problem is that the raw text contains characters and markup that are not useful for our analysis and could badly distort the final output. So we will spend time cleaning the data to put it in the right form for analysis and modeling. This includes lowercasing the text and removing punctuation, section headers, line-break indicators, etc.
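As a rough illustration of the cleaning steps just listed, here is a minimal sketch on a made-up sample string. The helper name `clean_lyrics_sketch` and the sample text are assumptions for illustration only. The sketch strips bracketed section headers before removing punctuation, which is one possible ordering and not necessarily the one used in the cells that follow.

```python
import re

def clean_lyrics_sketch(raw):
    """Hypothetical helper: return a list of cleaned lyric lines."""
    text = raw.lower()
    # Drop bracketed section headers such as "[verse 1]" while the
    # brackets are still present to match on.
    text = re.sub(r"\[.*?\]", "", text)
    # Remove everything that is not a word character or whitespace.
    text = re.sub(r"[^\w\s]+", "", text)
    # Split on line breaks and drop blank lines.
    return [line.strip() for line in text.split("\n") if line.strip()]

# Made-up sample in the same shape as the raw lyrics
sample = "[Verse 1]\nKnow yourself, know your worth!\n\n[Hook]\nReal quick"
cleaned = clean_lyrics_sketch(sample)
print(cleaned)
```

Removing the headers first matters because the punctuation step deletes the brackets that identify them.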

# Importing Libraries

In [2]:
# !pip install -U contractions
# !pip install -U inflect
# !pip install -U nltk

In [3]:
# Read in necessary modules

import re, string, unicodedata
import nltk
import contractions
import codecs
import inflect
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk import word_tokenize, sent_tokenize
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, SnowballStemmer

import pickle

# choose how much of dictionaries/lists to print out

from IPython.lib.pretty import pprint

# importing warnings to silence warnings (with no category given, this filter ignores all of them, not only FutureWarning)

import warnings
warnings.simplefilter(action='ignore')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Vonn\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Vonn\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Vonn\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Read in data

In [4]:
# Read in data from our Raw Dataframe pickle file

with codecs.open('../Datasets/Pickled_Files/Raw_Dataframe','rb') as f:
    df_1 = pickle.load(f)

df_1.set_index('Artist',inplace = True)
df_1

Unnamed: 0_level_0,Lyrics
Artist,Unnamed: 1_level_1
Drake,"[Produced by Boi-1da, Frank Dukes, Noah ""40"" S..."
Jayz,[Intro: Hannah Williams]\nDo I find it so hard...
Nas,[Produced by Ron Browz]\n\n[Intro]\nFuck Jay Z...
Eminem,"[Verse 1]\nNow this shit's about to kick off, ..."
Future,"[Intro]\nDJ Esco-Moe City, the coolest DJ on t..."
KanyeWest,[Produced By Daft Punk & Kanye West]\n\n[Verse...


In [5]:
# Peek at the first 100 characters of Drake's raw lyrics
df_1.loc['Drake', 'Lyrics'][0:100]

'[Produced by Boi-1da, Frank Dukes, Noah "40" Shebib, & Nineteen85]\n\n[Part I: 0 to 100]\n\n[Verse 1]\nFu'

# Cleaning

In [6]:

def first_cleaning_session():
    """
    First cleaning session, applied in place to df_1.Lyrics.
    It handles the simple things that show up in any version of the lyrics
    that I obtain, before moving on to the harder cleaning sessions
    (lemmatizing, stemming, or profanity).
    """
    index = list(df_1.index.values)
    rapper = df_1.Lyrics

    for artist_ in index:
        # Each step is wrapped in try/except so the function can be re-run
        # once the lyrics are lists: the string-only steps then raise and
        # are skipped.
        # Lower text
        try:
            rapper.loc[artist_] = rapper.loc[artist_].lower()
        except Exception:
            pass
        # Remove punctuation (anything that is not a word character or whitespace)
        try:
            rapper.loc[artist_] = re.sub(r"[^\w\s]+", "", rapper.loc[artist_])
        except Exception:
            pass
        # Split text into lines
        try:
            rapper.loc[artist_] = rapper.loc[artist_].split("\n")
        except Exception:
            pass
        # Remove producer lines. Note: the punctuation step above has already
        # stripped "[", so '[produced' can no longer match and these lines
        # are left in.
        try:
            rapper.loc[artist_] = [line for line in df_1.loc[artist_][0] if '[produced' not in line]
        except Exception:
            pass
        # Remove blank strings
        try:
            rapper.loc[artist_] = list(filter(None, df_1.loc[artist_][0]))
        except Exception:
            pass
        

In [7]:
# apply the first cleaning session to the dataframe

first_cleaning_session()

In [8]:
# take a look to see how it turned out...

df_1

Unnamed: 0_level_0,Lyrics
Artist,Unnamed: 1_level_1
Drake,[produced by boi1da frank dukes noah 40 shebib...
Jayz,"[intro hannah williams, do i find it so hard, ..."
Nas,"[produced by ron browz, intro, fuck jay z, wha..."
Eminem,"[verse 1, now this shits about to kick off thi..."
Future,"[intro, dj escomoe city the coolest dj on the ..."
KanyeWest,"[produced by daft punk kanye west, verse 1, f..."


Each entry in the data frame is now a list of lyric lines, and each line is a new "bar". In the rap world, a bar is another way of saying a line of a verse. "A bar is a measure of time in music, and in rap music a bar signifies a verse of the song within a 1, 2, 3, 4 count."

Next, we want to take another look at the dataframe and make some further changes to the lyrics. The first cleaning pass doesn't catch everything, so we go back through and look for cases where the text is still not in the form we want. We are looking for the following:

- Section Headers
- Producer labels
- Unnecessary characters and punctuation
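One way to look for these leftovers is to flag suspicious lines for manual review. The sketch below is only a heuristic under stated assumptions: the helper `flag_leftovers` and the keyword list in `HEADER_PATTERN` are hypothetical, and the sample lines are taken or shortened from the Drake lyrics in this notebook.

```python
import re

# Assumed keyword list: words that typically start a section header or
# producer label once brackets and punctuation have been stripped.
HEADER_PATTERN = re.compile(
    r"^(produced by|intro|outro|verse|hook|chorus|bridge|part)\b"
)

def flag_leftovers(lines):
    """Hypothetical helper: return lines that still look like headers,
    producer labels, or that contain leftover punctuation."""
    flagged = []
    for line in lines:
        looks_like_header = HEADER_PATTERN.match(line) is not None
        has_punctuation = re.search(r"[^\w\s]", line) is not None
        if looks_like_header or has_punctuation:
            flagged.append(line)
    return flagged

lines = [
    "produced by boi1da frank dukes noah 40 shebib  nineteen85",
    "part i 0 to 100",
    "verse 1",
    "oh lord know yourself know your worth",
]
leftovers = flag_leftovers(lines)
print(leftovers)
```

A genuine lyric that happens to start with one of the keywords would also be flagged, so the result is a list to inspect, not a list to delete blindly.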

In [9]:
pprint(df_1.loc['Drake'][0], max_width= 0, newline='\n', max_seq_length=10)
# now it is a list of lyrics

['produced by boi1da frank dukes noah 40 shebib  nineteen85',
 'part i 0 to 100',
 'verse 1',
 'fuck bein on some chill shit',
 'we go 0 to 100 nigga real quick',
 'they be on that raptopaythebill shit',
 'and i dont feel that shit not even a little bit',
 'oh lord know yourself know your worth nigga',
 'my actions been louder than my words nigga',
 'how you so high but still so down to earth nigga',
 ...]


In [10]:
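# Join each artist's lines into one string and split it on bracketed
# section headers such as "[verse 1: drake]".
# Note: the first cleaning pass already removed "[", "]" and ":", so at
# this point the split returns the text as a single piece.
song = {}
index = list(df_1.index.values)
for artst in index:
    var_1 = df_1.Lyrics.loc[artst]
    song[artst] = []
    s = " ".join(var_1)
    # Raw string so "\[" and "\]" reach the regex engine as written
    lst = re.split(r'\[(.*?)\]', s)
    for i in lst:
        # Pieces containing ":" are kept only if they name the artist;
        # everything else is kept as is
        if ":" in i:
            if artst in i:
                song[artst].append(i)
        else:
            song[artst].append(i)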

In [11]:
for artst in index:
    rapper = df_1.Lyrics
    rapper.loc[artst] = song[artst]

df_1

Unnamed: 0_level_0,Lyrics
Artist,Unnamed: 1_level_1
Drake,[produced by boi1da frank dukes noah 40 shebib...
Jayz,[intro hannah williams do i find it so hard wh...
Nas,[produced by ron browz intro fuck jay z whats ...
Eminem,[verse 1 now this shits about to kick off this...
Future,[intro dj escomoe city the coolest dj on the m...
KanyeWest,[produced by daft punk kanye west verse 1 for...


In [12]:
# Keep only list entries longer than 20 characters
song = {}
index = list(df_1.index.values)
for artst in index:
    song[artst] = []
    lst = df_1.loc[artst][0]
    for i in lst:
        if len(i) > 20:
            song[artst].append(i)

In [13]:
for artst in index:
    rapper = df_1.Lyrics
    rapper.loc[artst] = song[artst]

df_1

Unnamed: 0_level_0,Lyrics
Artist,Unnamed: 1_level_1
Drake,[produced by boi1da frank dukes noah 40 shebib...
Jayz,[intro hannah williams do i find it so hard wh...
Nas,[produced by ron browz intro fuck jay z whats ...
Eminem,[verse 1 now this shits about to kick off this...
Future,[intro dj escomoe city the coolest dj on the m...
KanyeWest,[produced by daft punk kanye west verse 1 for...


In [14]:
first_cleaning_session()

In [15]:
index = list(df_1.index.values)
rapper = df_1.Lyrics

# Join each artist's list of lyrics back into one string
for artist_ in index:
    rapper.loc[artist_] = " ".join(rapper[artist_])

There are two ways we can represent the text that was just cleaned: as a corpus (the cleaned lyrics themselves, one string per artist) or as a document-term matrix of word counts, a bag-of-words representation built with CountVectorizer. We are going to keep both forms, because different EDA and modeling techniques need the data in one form or the other.
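To make the difference concrete, here is a minimal plain-Python sketch of what a document-term matrix holds. The artist names and texts are made up, and the whitespace split is a simplification: it is not CountVectorizer's own tokenization, which also applies its own token pattern and stop-word handling.

```python
from collections import Counter

# Toy corpus: one cleaned string per (made-up) artist
toy_corpus = {
    "ArtistA": "money money cars",
    "ArtistB": "cars and dreams",
}

# Count whitespace-separated tokens for each artist
counts = {artist: Counter(text.split()) for artist, text in toy_corpus.items()}

# The sorted vocabulary across all documents becomes the matrix columns
vocab = sorted(set().union(*counts.values()))

# One row per artist, one column per term; missing terms count as 0
toy_dtm = {artist: [c[term] for term in vocab] for artist, c in counts.items()}

print(vocab)
for artist, row in toy_dtm.items():
    print(artist, row)
```

The corpus keeps word order and is what text-based tools read; the matrix discards order and keeps only counts, which is the form most count-based models expect.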

### Corpus

In [16]:
# After cleaning this dataframe is in the form of a corpus.
df_1

Unnamed: 0_level_0,Lyrics
Artist,Unnamed: 1_level_1
Drake,produced by boi1da frank dukes noah 40 shebib ...
Jayz,intro hannah williams do i find it so hard whe...
Nas,produced by ron browz intro fuck jay z whats u...
Eminem,verse 1 now this shits about to kick off this ...
Future,intro dj escomoe city the coolest dj on the mo...
KanyeWest,produced by daft punk kanye west verse 1 for ...


In [17]:
# For visualizations later on let's create a column of the artists' names. 
full_names = df_1.index.tolist()
df_1['Artist Name'] = full_names
df_1

Unnamed: 0_level_0,Lyrics,Artist Name
Artist,Unnamed: 1_level_1,Unnamed: 2_level_1
Drake,produced by boi1da frank dukes noah 40 shebib ...,Drake
Jayz,intro hannah williams do i find it so hard whe...,Jayz
Nas,produced by ron browz intro fuck jay z whats u...,Nas
Eminem,verse 1 now this shits about to kick off this ...,Eminem
Future,intro dj escomoe city the coolest dj on the mo...,Future
KanyeWest,produced by daft punk kanye west verse 1 for ...,KanyeWest


In [18]:
#pickle it for later use
df_1.to_pickle('C:/Users/Vonn/DSI - Nash/GAProjects/Capstone Project/Datasets/Pickled_Files/DataFrame_Corpus.pkl')

### Countvectorizer

In [19]:
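# Pass the stop words as a list; recent scikit-learn versions expect a list (or 'english'/None) here
stop_words = stopwords.words("english")
cv = CountVectorizer(stop_words=stop_words)
df_cv = cv.fit_transform(df_1.Lyrics)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (available since 1.0) replaces it
df_dtm = pd.DataFrame(df_cv.toarray(), columns=cv.get_feature_names_out())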
df_dtm.index = df_1.index
df_dtm

Unnamed: 0_level_0,02,10,100,1000,10yearolds,11,12,125,140,15,...,zip,zod,zombie,zone,zonin,zé,zöld,ölén,úgy,な音楽
Artist,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Drake,0,0,6,0,0,0,0,0,0,1,...,1,0,0,1,0,0,0,0,0,0
Jayz,0,0,2,0,0,2,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
Nas,0,1,0,1,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
Eminem,1,0,0,0,1,0,1,0,0,0,...,0,1,1,0,0,0,0,0,0,0
Future,0,0,0,0,0,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,1
KanyeWest,0,0,0,0,0,0,1,2,0,0,...,0,0,1,0,3,0,1,1,1,0


In [20]:
# Let's pickle it for later use
df_dtm.to_pickle('C:/Users/Vonn/DSI - Nash/GAProjects/Capstone Project/Datasets/Pickled_Files/DataFrame_Document_Term_Matrix.pkl')