# Problem Statement

Every artist goes through changes in life, and their music usually reflects that. I'm a big fan of John Mayer's music, mostly because he's one of the few artists keeping the guitar solo alive in modern music. Being a big guitar fan, I often get lost in the solos and the background music. Music aside, I was curious about what John Mayer sings about, so I set out to find:
1. Changes in vocabulary across his albums
2. The general sentiment of his lyrics (positive or negative tone)
3. The topics he sings about

I've broken this project down into four main steps:

1. **Create the Data Sets**
    1. Get the raw data by scraping
    2. Clean the Data
    3. Convert the data into required formats (Corpus and Document-Term-Matrix)


2. **EDA (Exploratory Data Analysis)**
    1. Most common words
    2. Wordclouds

3. **Sentiment Analysis**

4. **Topic Modeling**

# Create the Data Sets

## Introduction

In this section I will perform three main steps:

1. Get the Data - This involves scraping the lyrics of each song from a website
2. Clean the Data - This involves pre-processing the text so that it is in a form suitable for analysis and NLP
3. Convert the cleaned data into a format that the algorithms can use

The output of this notebook will be clean, organized data in two standard text formats:

1. **Corpus** - a collection of text
2. **Document-Term Matrix** - word counts in matrix format

## 1. Getting the Raw Data

Getting the lyrics of the song involves two steps:

1. Make a list of all the songs that I need lyrics for
    - Organize the song name, album and year
2. Scrape a website to extract the lyrics for each song in the list
    - Use the genius.com API to find each song's page, then BeautifulSoup to scrape the lyrics from it

### Organize song names

For one of my other projects I needed the names and artists of hundreds of songs across the decades, so I used **selenium** to scrape an online list of songs.
In this case there are only a few albums, and a simple Google search gives me the song names, albums and years.

In [137]:
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
logging.debug("test")

import pandas as pd
import csv

In [138]:
discography_df_raw = pd.read_csv('data_sets/JM_Discography.csv')

In [139]:
print(discography_df_raw.shape)
discography_df_raw.head(2)

(81, 5)


Unnamed: 0,Album,Year,Track #,Title,Track Length
0,Room For Squares,2001,1,No Such Thing,3:51
1,Room For Squares,2001,2,Why Georgia,4:29


In [140]:
# Convert track length ("M:SS") into seconds
import time

def convert_to_seconds(track_len):
    # Parse "minutes:seconds"; note that %M only accepts 0-59 minutes
    time_obj = time.strptime(track_len.split(',')[0], '%M:%S')
    return time_obj.tm_min * 60 + time_obj.tm_sec

convert_t = lambda x: convert_to_seconds(x)

In [141]:
# apply() already returns a Series, so no DataFrame wrapper is needed
discography_df_raw['Track Length'] = discography_df_raw['Track Length'].apply(convert_to_seconds)
discography_df_raw.head()

Unnamed: 0,Album,Year,Track #,Title,Track Length
0,Room For Squares,2001,1,No Such Thing,231
1,Room For Squares,2001,2,Why Georgia,269
2,Room For Squares,2001,3,My Stupid Mouth,225
3,Room For Squares,2001,4,Your Body Is A Wonderland,250
4,Room For Squares,2001,5,Neon,262
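
The `%M:%S` format used above only accepts minute values from 0 to 59, so a track of an hour or longer would raise an error. A minimal sketch of an arithmetic alternative follows; the function name `track_length_to_seconds` is hypothetical and not part of the original notebook.

```python
def track_length_to_seconds(track_len: str) -> int:
    """Convert 'M:SS' or 'H:MM:SS' into total seconds."""
    seconds = 0
    # Each colon-separated field is worth 60x the field to its right
    for part in track_len.strip().split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

print(track_length_to_seconds("3:51"))     # first track in the table above
print(track_length_to_seconds("61:05"))    # minutes above 59 are fine here
print(track_length_to_seconds("1:02:03"))  # hours are handled too
```

It could be applied in the same way as the original helper, for example with `discography_df_raw['Track Length'].apply(track_length_to_seconds)`.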


### Get lyrics from Genius.com using their API

In [142]:
#Get lyrics from Genius

# Make HTTP requests
import requests
# Scrape data from an HTML document
from bs4 import BeautifulSoup
# I/O
import os
# Search and manipulate strings
import re

import pickle

#Search for song and then scrape lyrics

# Read the Genius API token from the environment instead of hard-coding it
GENIUS_API_TOKEN = os.environ.get("GENIUS_API_TOKEN", "")
eras = ['seventies','eighties', 'nineties', 'twothousands', 'twentytens']

# Get the search response for a song from the Genius API
def request_song_object(song_name):
    try:
        base_url = 'https://api.genius.com'
        headers = {'Authorization': 'Bearer ' + GENIUS_API_TOKEN}
        search_url = base_url + '/search'
        # Send the query as URL parameters (a GET request should not carry a body)
        response = requests.get(search_url, params={'q': song_name}, headers=headers)
        return response
    except requests.RequestException:
        print("Couldn't get url for: " + song_name)
        return None

def request_artist_info(artist_name, page):
    base_url = 'https://api.genius.com'
    headers = {'Authorization': 'Bearer ' + GENIUS_API_TOKEN}
    search_url = base_url + '/search'
    params = {'q': artist_name, 'per_page': 10, 'page': page}
    response = requests.get(search_url, params=params, headers=headers)
    return response

def request_song_url(song_name):
    logger.info("getting url for: " + song_name)
    response = request_song_object(song_name)
    if response is None:
        return ''
    hits = response.json()['response']['hits']
    # Take the top search hit; return '' when the search finds nothing
    return hits[0]['result']['url'] if hits else ''

# Scrape lyrics from a Genius.com song URL
def scrape_song_lyrics(url):
    try:
        logger.info('Getting lyrics for: ' + url)
        page = requests.get(url)
        html = BeautifulSoup(page.text, 'html.parser')
        lyrics = html.find('div', class_='lyrics').get_text()
        # Remove bracketed or parenthesized text such as [Chorus] or (Verse 1)
        lyrics = re.sub(r'[\(\[].*?[\)\]]', '', lyrics)
        # Drop empty lines and join the remaining lines with spaces
        lyrics = ' '.join(s for s in lyrics.splitlines() if s)
        return lyrics
    except Exception:
        # find() returns None when the page has no <div class="lyrics">
        print("Failed to get lyrics for: " + url)
        return ''

def get_url_list(songList):
    # Look up the Genius URL for every song name
    return [request_song_url(songName) for songName in songList]

In [143]:
# Check if the lyrics seem right
url = request_song_url("Clarity John Mayer")
url

INFO:root:getting url for: Clarity John Mayer


'https://genius.com/John-mayer-clarity-lyrics'
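
`request_song_url` trusts the first search hit, which can be wrong when another artist has a song with a similar title. The sketch below picks the first hit whose artist matches instead. The notebook only confirms the `response -> hits -> result -> url` path; the `primary_artist` / `name` fields are an assumption about the response layout and should be checked against a real response. The helper name `pick_song_url` and the sample dictionary are made up for illustration.

```python
def pick_song_url(search_json: dict, artist: str) -> str:
    """Return the URL of the first hit whose artist matches, else ''."""
    hits = search_json.get("response", {}).get("hits", [])
    for hit in hits:
        result = hit.get("result", {})
        # ASSUMPTION: each result carries a primary_artist object with a name
        name = result.get("primary_artist", {}).get("name", "")
        if name.lower() == artist.lower():
            return result.get("url", "")
    return ""

# Hand-made stand-in for an API response (not real data from Genius)
fake_response = {
    "response": {
        "hits": [
            {"result": {"url": "https://example.com/other-artist-clarity",
                        "primary_artist": {"name": "Someone Else"}}},
            {"result": {"url": "https://example.com/john-mayer-clarity",
                        "primary_artist": {"name": "John Mayer"}}},
        ]
    }
}

print(pick_song_url(fake_response, "John Mayer"))
```

Returning an empty string when nothing matches keeps the behaviour in line with `scrape_song_lyrics`, which also signals failure with `''`.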

In [144]:
lyrics = scrape_song_lyrics(url)
logger.debug(lyrics)

INFO:root:Getting lyrics for: https://genius.com/John-mayer-clarity-lyrics


Failed to get lyrics for: https://genius.com/John-mayer-clarity-lyrics


In [145]:
# Use API to find lyrics page and scrape lyrics for each song using BeautifulSoup

# Collect the lyrics in a list, then build the Series once; this avoids
# the warning raised by creating an empty pd.Series([]) without a dtype
lyrics = []

for title in discography_df_raw['Title']:
    song_url = request_song_url(title + " John Mayer")
    lyrics.append(scrape_song_lyrics(song_url))

lyrics = pd.Series(lyrics, index=discography_df_raw.index)

INFO:root:getting url for: No Such Thing John Mayer
INFO:root:Getting lyrics for: https://genius.com/John-mayer-no-such-thing-lyrics
INFO:root:getting url for: Why Georgia John Mayer
INFO:root:Getting lyrics for: https://genius.com/John-mayer-why-georgia-lyrics
INFO:root:getting url for: My Stupid Mouth John Mayer
INFO:root:Getting lyrics for: https://genius.com/John-mayer-my-stupid-mouth-lyrics
INFO:root:getting url for: Your Body Is A Wonderland John Mayer
INFO:root:Getting lyrics for: https://genius.com/John-mayer-your-body-is-a-wonderland-lyrics
INFO:root:getting url for: Neon John Mayer
INFO:root:Getting lyrics for: https://genius.com/John-mayer-neon-lyrics
INFO:root:getting url for: City Love John Mayer
INFO:root:Getting lyrics for: https://genius.com/John-mayer-city-love-lyrics
INFO:root:getting url for: 83 John Mayer
INFO:root:Getting lyrics for: https://genius.com/John-mayer-83-lyrics
INFO:root:getting url for: 3X5 John Mayer
INFO:root:Getting lyrics f

Failed to get lyrics for: https://genius.com/John-mayer-clarity-lyrics


INFO:root:Getting lyrics for: https://genius.com/John-mayer-bigger-than-my-body-lyrics
INFO:root:getting url for: Something's Missing John Mayer
INFO:root:Getting lyrics for: https://genius.com/John-mayer-somethings-missing-lyrics
INFO:root:getting url for: New Deep John Mayer
INFO:root:Getting lyrics for: https://genius.com/John-mayer-new-deep-lyrics
INFO:root:getting url for: Come Back To Bed John Mayer
INFO:root:Getting lyrics for: https://genius.com/John-mayer-come-back-to-bed-lyrics
INFO:root:getting url for: Home Life John Mayer
INFO:root:Getting lyrics for: https://genius.com/John-mayer-home-life-lyrics
INFO:root:getting url for: Split Screen Sadness John Mayer
INFO:root:Getting lyrics for: https://genius.com/John-mayer-split-screen-sadness-lyrics
INFO:root:getting url for: Daughters John Mayer
INFO:root:Getting lyrics for: https://genius.com/John-mayer-daughters-lyrics
INFO:root:getting url for: Only Heart John Mayer
INFO:root:Getting lyrics for: https://genius.com/John-mayer-o

INFO:root:getting url for: In the Blood John Mayer
INFO:root:Getting lyrics for: https://genius.com/John-mayer-in-the-blood-lyrics
INFO:root:getting url for: Changing John Mayer
INFO:root:Getting lyrics for: https://genius.com/John-mayer-changing-lyrics
INFO:root:getting url for: Theme from The Search for Everything John Mayer
INFO:root:Getting lyrics for: https://genius.com/John-mayer-theme-from-the-search-for-everything-lyrics
INFO:root:getting url for: Moving On and Getting Over John Mayer
INFO:root:Getting lyrics for: https://genius.com/John-mayer-moving-on-and-getting-over-lyrics
INFO:root:getting url for: Never on the Day You Leave John Mayer
INFO:root:Getting lyrics for: https://genius.com/John-mayer-never-on-the-day-you-leave-lyrics
INFO:root:getting url for: Rosie John Mayer
INFO:root:Getting lyrics for: https://genius.com/John-mayer-rosie-lyrics
INFO:root:getting url for: Roll It on Home John Mayer
INFO:root:Getting lyrics for: https://genius.com/John-mayer-roll-it-on-home-ly

In [146]:
discography_df_raw.insert(len(discography_df_raw.columns),'Lyrics',lyrics)

In [147]:
# See which songs we failed to get lyrics for
discography_df_raw.loc[discography_df_raw['Lyrics']=='']


Unnamed: 0,Album,Year,Track #,Title,Track Length,Lyrics
13,Heavier Things,2003,1,Clarity,272,
75,The Search For Everything,2017,7,Theme from The Search for Everything,114,


In [148]:
discography_df_raw.loc[discography_df_raw['Title'].str.contains('Edge Of')]['Lyrics'].item()

"Young and full of running Tell me where has that taken me Just a great figure eight or a tiny infinity Love is really nothing But a dream that keeps waking me For all of my trying, we still end up dying How can it be Don't say a word, just come over and lie here with me Because I'm just about to set fire to everything I see I want you so bad, I'll go back on the things I believe There I just said it, I'm scared you'll forget about me So young and full of running All the way to the edge of desire Steady my breathing, silently screaming I have to have you now Wired and I'm tired Think I'll sleep in my clothes on the floor Or maybe this mattress will spin on its axis And find me on yours Don't say a word, just come over and lie here with me Because I'm just about to set fire to everything I see I want you so bad, I'll go back on the things I believe There I just said it, I'm scared you'll forget about me"

## Cleaning the Data

To start, I applied the following steps to get the lyrics into good enough shape to build a document-term matrix.

**Common data cleaning steps on all text:**
* Make text all lower case
* Remove punctuation
* Remove numerical values
* Remove common nonsensical text (\n)
* Tokenize text
* Remove stop words

**More data cleaning steps after tokenization:**
* Stemming / lemmatization
* Parts of speech tagging
* Create bi-grams or tri-grams
* Deal with typos
* And more...
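
The common steps listed above can be sketched end to end on a single string. This uses only the standard library; the helper name `basic_clean` and the tiny `STOP_WORDS` set are illustrative assumptions, and the notebook itself removes stop words later through `CountVectorizer`.

```python
import re
import string

# Tiny illustrative stop-word set; a real run would use a fuller list
STOP_WORDS = {"a", "the", "to", "me", "she", "i", "am"}

def basic_clean(text: str) -> list:
    text = text.lower()                        # lower case
    text = text.replace("\n", " ")             # newlines become spaces
    # remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\w*\d\w*", "", text)       # remove words containing digits
    tokens = text.split()                      # tokenize on whitespace
    return [t for t in tokens if t not in STOP_WORDS]

sample = '"Welcome to the real world", she said to me\nI am driving up \'85'
print(basic_clean(sample))
# ['welcome', 'real', 'world', 'said', 'driving', 'up']
```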

In [149]:
# Apply a first round of text cleaning techniques
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    # Raw strings keep regex escapes such as \[ and \w from being read as string escapes
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

round1 = lambda x: clean_text_round1(x)

In [150]:
discography_df_raw.head()

Unnamed: 0,Album,Year,Track #,Title,Track Length,Lyrics
0,Room For Squares,2001,1,No Such Thing,231,"""Welcome to the real world"", she said to me Co..."
1,Room For Squares,2001,2,Why Georgia,269,I am driving up '85 in the Kind of morning tha...
2,Room For Squares,2001,3,My Stupid Mouth,225,My stupid mouth Has got me in trouble I said t...
3,Room For Squares,2001,4,Your Body Is A Wonderland,250,We got the afternoon You got this room for two...
4,Room For Squares,2001,5,Neon,262,When sky blue gets dark enough To see the colo...


In [151]:
discography_clean = discography_df_raw.copy()

# Apply the first cleaning round and take a look at the updated text
discography_clean['Lyrics'] = discography_df_raw.Lyrics.apply(round1)
discography_clean.head() 

Unnamed: 0,Album,Year,Track #,Title,Track Length,Lyrics
0,Room For Squares,2001,1,No Such Thing,231,welcome to the real world she said to me conde...
1,Room For Squares,2001,2,Why Georgia,269,i am driving up in the kind of morning that l...
2,Room For Squares,2001,3,My Stupid Mouth,225,my stupid mouth has got me in trouble i said t...
3,Room For Squares,2001,4,Your Body Is A Wonderland,250,we got the afternoon you got this room for two...
4,Room For Squares,2001,5,Neon,262,when sky blue gets dark enough to see the colo...


In [152]:
# Apply a second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation and nonsensical text that was missed the first time around.'''
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    return text

round2 = lambda x: clean_text_round2(x)

In [153]:
discography_clean['Lyrics'] = discography_clean.Lyrics.apply(round2)
discography_clean.head()

Unnamed: 0,Album,Year,Track #,Title,Track Length,Lyrics
0,Room For Squares,2001,1,No Such Thing,231,welcome to the real world she said to me conde...
1,Room For Squares,2001,2,Why Georgia,269,i am driving up in the kind of morning that l...
2,Room For Squares,2001,3,My Stupid Mouth,225,my stupid mouth has got me in trouble i said t...
3,Room For Squares,2001,4,Your Body Is A Wonderland,250,we got the afternoon you got this room for two...
4,Room For Squares,2001,5,Neon,262,when sky blue gets dark enough to see the colo...


**NOTE:** Other cleaning and pre-processing steps I would consider are:
* Mark 'cheering' and 'cheer' as the same word (stemming / lemmatization)
* Combine 'thank you' into one term (bi-grams)
* And a lot more...
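
As a rough illustration of the bi-gram idea, adjacent tokens can be paired and counted with the standard library alone. The helper name `bigram_counts` and the sample sentence are made up for this sketch.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs, joined into a single term."""
    return Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))

tokens = "thank you so much thank you".split()
counts = bigram_counts(tokens)
print(counts.most_common(1))  # [('thank you', 2)]
```

scikit-learn's `CountVectorizer` also accepts an `ngram_range` argument, which would be the more natural route if bi-grams were added to the document-term matrix built later in this notebook.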

## Organizing the Data

As mentioned earlier, we want to get the data organized into the following two formats:
1. **Corpus -** a collection of text
2. **Document-Term Matrix -** word counts in matrix format

I will aggregate the data at two levels:

1. Document-Term Matrix per song
2. **Document-Term Matrix per album** (group all the lyrics in each album together). This is the more useful one, since we are trying to observe changes over time. Each album here was released in a different year, so grouping by album is the same as grouping by year.


### Corpus

The corpus was already created in the previous section: the cleaned lyrics of each song are held in discography_clean['Lyrics'].

In [154]:
# Let's pickle it for later use
discography_df_raw.to_pickle("lyrics/corpus_original.pkl")
discography_clean.to_pickle("lyrics/corpus.pkl")

#### Aggregate by Album

In [168]:
# Aggregate each measure per album, then inner-join the pieces into one album-level corpus

tmp_df = discography_clean.groupby(['Album','Year'])['Track Length'].sum().reset_index()
print(tmp_df.shape)

tmp_df2 = discography_clean.groupby(['Album','Year'])['Track #'].count().reset_index()
print(tmp_df2.shape)

# Merge the two tables
album_df = tmp_df.merge(tmp_df2, on=['Album', 'Year'], how='inner')

tmp_df3 = discography_clean.groupby(['Album','Year'])['Lyrics'].agg(lambda x: ' '.join(x)).reset_index()
print(tmp_df3.shape)

album_df = album_df.merge(tmp_df3, on=['Album', 'Year'], how='inner').sort_values(by='Year')

# Rename
album_df = album_df.rename(columns={"Track Length": "Total Album Length", "Track #": "Num_of_Tracks"})

# Change index
album_df.set_index("Year", inplace = True, append = False, drop = False)
album_df

(7, 3)
(7, 3)
(7, 3)


Year,Album,Year,Total Album Length,Num_of_Tracks,Lyrics
2001,Room For Squares,2001,3259,13,welcome to the real world she said to me conde...
2003,Heavier Things,2003,2780,10,this is a call to the colorblind this is an i...
2006,Continuum,2006,2987,12,me and all my friends were all misunderstood t...
2009,Battle Studies,2009,2798,11,lightning strikes inside my chest to keep me u...
2012,Born And Raised,2012,2799,12,goodbye cold goodbye rain goodbye sorrow and g...
2013,Paradise Valley,2013,2410,11,rivers strong you cant swim inside it we could...
2017,The Search For Everything,2017,2629,12,i still feel like your man i still feel like y...
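
The three group-bys and two merges above could also be written as one `groupby` with named aggregation. This is a sketch on a made-up toy frame rather than the real discography, and it assumes a pandas version that supports named aggregation (0.25 or later).

```python
import pandas as pd

# Toy stand-in for discography_clean (values are made up)
toy = pd.DataFrame({
    "Album": ["A", "A", "B"],
    "Year": [2001, 2001, 2003],
    "Track #": [1, 2, 1],
    "Track Length": [231, 269, 272],
    "Lyrics": ["first song", "second song", "third song"],
})

album_toy = (
    toy.groupby(["Album", "Year"], as_index=False)
       .agg(**{
           # output column name: (input column, aggregation)
           "Total Album Length": ("Track Length", "sum"),
           "Num_of_Tracks": ("Track #", "count"),
           "Lyrics": ("Lyrics", lambda x: " ".join(x)),
       })
       .sort_values("Year")
)
print(album_toy)
```

The dictionary unpacking is only needed because one of the output column names contains a space.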


In [169]:
# Let's pickle it for later use
album_df.to_pickle("lyrics/album_corpus.pkl")

### Document-Term Matrix

A document-term matrix is built by tokenizing the text, which means breaking it down into smaller parts, and then counting the tokens. The most common tokenization technique is to break text into words. We can do this using scikit-learn's CountVectorizer, where every row represents a different document and every column represents a different word.

In addition, with CountVectorizer, we can remove stop words. Stop words are common words that add little meaning to text, such as 'a', 'the', etc.
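
To make the row and column layout concrete, here is a hand-rolled document-term matrix for two tiny made-up documents. It is only an illustration of the idea, not what `CountVectorizer` does internally: the whitespace tokenizer and the four-word stop list are simplifying assumptions.

```python
from collections import Counter

docs = ["love is the song", "the song of love and love"]
stop_words = {"is", "the", "of", "and"}  # illustrative, not scikit-learn's list

# Tokenize on whitespace and drop stop words
tokenized = [[w for w in d.split() if w not in stop_words] for d in docs]

# Columns: the sorted vocabulary across all documents
vocab = sorted({w for doc in tokenized for w in doc})

# Rows: one list of word counts per document
dtm = []
for doc in tokenized:
    counts = Counter(doc)
    dtm.append([counts[w] for w in vocab])

print(vocab)  # ['love', 'song']
print(dtm)    # [[1, 1], [2, 1]]
```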

In [172]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
song_cv = cv.fit_transform(discography_clean.Lyrics)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
song_dtm = pd.DataFrame(song_cv.toarray(), columns=cv.get_feature_names_out())
song_dtm.index = discography_clean.index
song_dtm

Unnamed: 0,accepted,act,actors,address,adore,advice,affair,afternoon,age,ago,...,yellow,yes,yesterday,york,youd,youll,young,younger,youre,youve
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
1,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2
77,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,2,0,0,0,0
78,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
79,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1


#### Document Term Matrix for Albums

In [171]:
cv_a = CountVectorizer(stop_words='english')
album_cv = cv_a.fit_transform(album_df.Lyrics)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
album_dtm = pd.DataFrame(album_cv.toarray(), columns=cv_a.get_feature_names_out())
album_dtm.index = album_df.index
album_dtm

Year,accepted,act,actors,address,adore,advice,affair,afternoon,age,ago,...,yellow,yes,yesterday,york,youd,youll,young,younger,youre,youve
2001,0,0,1,0,0,0,0,2,0,1,...,0,0,1,0,0,4,0,0,4,3
2003,0,0,0,1,1,0,0,1,0,0,...,0,3,0,0,1,2,0,0,4,0
2006,0,0,0,0,0,3,0,0,0,0,...,1,0,0,0,0,2,2,0,8,0
2009,0,0,0,0,0,0,1,0,0,0,...,1,1,1,3,0,2,2,1,2,0
2012,1,1,0,0,0,0,0,0,10,0,...,0,0,1,1,1,1,1,0,6,4
2013,0,0,0,0,1,0,0,0,1,0,...,0,2,0,0,2,5,0,0,16,3
2017,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,1,2,2,0,5,4


In [173]:
# Pickle the Document-term matrix
song_dtm.to_pickle("lyrics/song_dtm.pkl")
album_dtm.to_pickle("lyrics/album_dtm.pkl")

# Pickle the fitted vectorizers; the with-blocks make sure the files are closed
with open("lyrics/cv.pkl", "wb") as f:
    pickle.dump(cv, f)
with open("lyrics/cv_a.pkl", "wb") as f:
    pickle.dump(cv_a, f)

# Exploratory Data Analysis
