# Pop Lyrics Preprocessing for RNN Input
---

This notebook reads pop lyrics from [scraped_lyrics.tsv](https://raw.githubusercontent.com/jamesthomson/Evolution_of_Pop_Lyrics/master/data/scraped_lyrics.tsv)
and applies the following preprocessing and cleaning steps:

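* Removes entries whose lyrics are the placeholder "Lyrics Not found"
* Converts '\r\n' sequences into tags that mark where each line and stanza starts and ends
* Uses regular expressions to remove non-lyrical text such as [chorus] or (2x)
* Replaces words outside the 10,000 most frequent with an `<oov>` token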

### Reading in the data

In [25]:
import pandas as pd
import re

In [26]:
pop = pd.read_csv('https://raw.githubusercontent.com/jamesthomson/Evolution_of_Pop_Lyrics/master/data/scraped_lyrics.tsv',sep='\t')

### Cleaning and preprocessing

In [27]:
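def clean_str(string):
    """
    Tokenization/string cleaning; the result is lowercased.
    Adapted from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    """
    # Keep only letters, digits and a little punctuation. Note that this also
    # turns '\r', '\n', '[', ']', '{', '}', '<', '>' and '/' into spaces.
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    # Split common contractions into separate tokens.
    string = re.sub(r"\'s", " \'s", string)
    string = re.sub(r"\'ve", " \'ve", string)
    string = re.sub(r"n\'t", " not", string)
    string = re.sub(r"\'re", " \'re", string)
    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'ll", " \'ll", string)
    # Pad punctuation with spaces. The backslashes written by the replacement
    # strings are turned into spaces by the final substitution below.
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " \( ", string)
    string = re.sub(r"\)", " \) ", string)
    string = re.sub(r"\?", " \? ", string)
    string = re.sub(r"\s{2,}", " ", string)
    # '<', '/' and '>' were already removed above, so this no longer matches.
    string = re.sub(r"<br />", " ", string) # Replace HTML break with white space
    # Removes the leftover "br", but also "br" inside words ("breeze" -> " eeze").
    string = re.sub(r"br", " ", string)
    string = re.sub(r"\\", " ", string)
    return string.strip().lower()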
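The order of the substitutions in `clean_str` matters: because the character filter runs first, the `<br />` pattern can never match, and the fallback `br` pattern eats into ordinary words. A minimal sketch of a reordered variant follows, assuming it is applied to one lyric line at a time. The name `clean_line` and the grouped contraction pattern are illustrative choices, not part of the original notebook, and the output differs slightly from `clean_str` (for example, it leaves "breeze" intact).

```python
import re

def clean_line(text):
    """Clean a single lyric line (hypothetical variant of clean_str)."""
    # Handle HTML breaks first, while '<', '/' and '>' are still present.
    text = re.sub(r"<br\s*/?>", " ", text)
    # Keep only letters, digits and a little punctuation.
    text = re.sub(r"[^A-Za-z0-9(),!?'`]", " ", text)
    # Split contractions: "can't" -> "ca not", "it's" -> "it 's".
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"'(s|ve|re|d|ll)\b", r" '\1", text)
    # Pad punctuation with spaces so each mark becomes its own token.
    text = re.sub(r"([(),!?])", r" \1 ", text)
    # Collapse runs of whitespace, then normalize case.
    text = re.sub(r"\s{2,}", " ", text)
    return text.strip().lower()

example_1 = clean_line("It's kisses when we breeze<br />tell me")
example_2 = clean_line("You can't, right?")
```

Here `example_1` keeps the word "breeze" whole, and `example_2` shows the punctuation padding and the "n't" expansion.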

In [28]:
pop_clean = pop[pop['lyrics']!='Lyrics Not found']
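The filtering step can be tried on a small in-memory table before running it on the full download. The sketch below assumes a toy TSV with hypothetical `title` and `lyrics` columns (the real file has its own columns). It also shows two details that matter later in the notebook: the filtered frame keeps the original row labels, and calling `.copy()` on it yields an independent frame to which columns can be added without a `SettingWithCopyWarning`.

```python
import io
import pandas as pd

# Hypothetical three-row TSV standing in for the scraped lyrics file.
tsv = (
    "title\tlyrics\n"
    "Song A\tla la la\n"
    "Song B\tLyrics Not found\n"
    "Song C\tdo re mi\n"
)
toy = pd.read_csv(io.StringIO(tsv), sep='\t')

# Drop placeholder rows; .copy() makes the result independent of `toy`.
toy_clean = toy[toy['lyrics'] != 'Lyrics Not found'].copy()

# Adding a column to the copy is safe.
toy_clean['n_words'] = toy_clean['lyrics'].str.split().str.len()

# The original row labels survive the filter (row 1 is simply missing).
kept_labels = list(toy_clean.index)
```

Because the labels are preserved, a lookup such as `toy_clean.lyrics[2]` refers to the original row 2, not to the second remaining row.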

In [29]:
x_text = [clean_str(sent) for sent in pop_clean.lyrics]

### Vocabulary building

In [30]:


def replace_with_oov(input_str, vocab):
    """Replace every word that is not in vocab with the '<oov>' token."""
    return ' '.join(word if word in vocab else '<oov>'
                    for word in input_str.split())


word_count = {} # Keys are words, values are frequencies

for lyric in x_text:
    for word in lyric.split():
        # Counts start at 1 on the first occurrence
        word_count[word] = word_count.get(word, 0) + 1

# Words sorted from most to least frequent
res = sorted(word_count, key=word_count.__getitem__, reverse=True)

# Keep the 10,000 most frequent words; a set makes membership tests fast
vocab = set(res[:10000])

# Replacing words that are not in the vocab with '<oov>'
cleaned_x_text = [replace_with_oov(item, vocab) for item in x_text]
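The same vocabulary step can be expressed with `collections.Counter` from the standard library. This is a minimal sketch on toy data, not the notebook's own code: `build_vocab`, the `oov_token` parameter and the three sample strings are illustrative, and `max_size=2` stands in for the 10,000-word cutoff used above.

```python
from collections import Counter

def build_vocab(texts, max_size):
    """Return the set of the max_size most frequent whitespace-separated words."""
    counts = Counter(word for text in texts for word in text.split())
    return {word for word, _ in counts.most_common(max_size)}

def replace_with_oov(text, vocab, oov_token='<oov>'):
    """Replace every word outside vocab with oov_token."""
    return ' '.join(w if w in vocab else oov_token for w in text.split())

# Toy corpus: "la" occurs 4 times, "love" twice, everything else once.
texts = ["la la la love", "love me do", "la di da"]
vocab = build_vocab(texts, max_size=2)
replaced = [replace_with_oov(t, vocab) for t in texts]
```

With this toy corpus only "la" and "love" survive; every other word becomes `<oov>`.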

In [32]:
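def get_tagged_lyric(str_input):
    """
    Wrap a lyric in <s> (stanza) and <l> (line) tags.
    A blank line ('\r\n\r\n') ends a stanza; a single '\r\n' ends a line.
    """
    # NOTE: in this notebook clean_str has already replaced '\r' and '\n' with
    # spaces, so on the cleaned text only the outer tags are added.
    tagged_lyric = str_input.replace('\r\n\r\n', '</l></s><s><l>')
    tagged_lyric = tagged_lyric.replace('\r\n', '</l><l>')
    return '<s><l>' + tagged_lyric + '</l></s>'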
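For the line and stanza tags to carry information, the '\r\n' separators have to be present when tagging happens. One way to arrange that, sketched here under the assumption that raw lyrics use '\r\n' line endings, is to split the raw text first and clean each line separately. `tag_raw_lyric` and its `clean_line` parameter are hypothetical names introduced for illustration; any per-line cleaner can be passed in.

```python
def tag_raw_lyric(raw, clean_line=lambda line: line.strip()):
    """Tag a raw lyric with <s>/<l> markers, cleaning each line on its own."""
    stanzas = []
    for stanza in raw.split('\r\n\r\n'):
        lines = [clean_line(line) for line in stanza.split('\r\n')]
        # Drop lines that are empty after cleaning.
        lines = [line for line in lines if line]
        if lines:
            body = ''.join('<l>' + line + '</l>' for line in lines)
            stanzas.append('<s>' + body + '</s>')
    return ''.join(stanzas)

tagged = tag_raw_lyric("a\r\nb\r\n\r\nc")
lowered = tag_raw_lyric("Hello\r\nWorld", clean_line=lambda s: s.lower())
```

Splitting before cleaning means a character filter applied per line can no longer erase the separators, and empty stanzas never produce empty tag pairs.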

In [33]:
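# Non-lyrical annotations to strip out
pattern_1 = r'\([0-9]+x\)'  # repeat markers such as (2x)
pattern_2 = r'\[.*?\]'      # bracketed tags such as [chorus]
pattern_3 = r'\{.*?\}'      # braced tags such as {chorus}
pattern_4 = 'chorus'        # matches anywhere, including inside longer words
pattern_5 = 'verse'         # matches anywhere, e.g. inside "universe"

all_patterns = [pattern_1, pattern_2, pattern_3, pattern_4, pattern_5]

# NOTE: clean_str has already run at this point. It removed square and curly
# brackets and padded parentheses with spaces ("( 2x )"), so patterns 1-3 no
# longer match here; only patterns 4 and 5 have an effect.

final_lyrics = []

for lyric in cleaned_x_text:
    try:
        lyric = lyric.lower()
        for pattern in all_patterns:
            lyric = re.sub(pattern, '', lyric)
        final_lyrics.append(get_tagged_lyric(lyric))
    except Exception as e:
        print("There was a problem: %s" % e)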
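The annotation patterns work best on the raw text, before brackets and parentheses have been altered. The sketch below combines them into one compiled, case-insensitive expression. It is an illustrative alternative rather than the notebook's method: `NON_LYRIC` and `strip_annotations` are hypothetical names, and the word-boundary rule for bare "chorus"/"verse" labels is an assumption that will also remove those words when they occur as genuine lyrics.

```python
import re

NON_LYRIC = re.compile(
    r"\(\s*(?:x\s*\d+|\d+\s*x)\s*\)"            # repeat markers: (2x), (x2), ( 2x )
    r"|\[[^\]]*\]"                               # bracketed tags: [chorus], [verse 1]
    r"|\{[^}]*\}"                                # braced tags: {Chorus}
    r"|\b(?:chorus|verse)\b[ \t]*\d*[ \t]*:?",   # bare labels: "Chorus:", "Verse 2:"
    re.IGNORECASE,
)

def strip_annotations(raw):
    """Remove non-lyrical annotations while leaving line breaks untouched."""
    return NON_LYRIC.sub('', raw)

header_removed = strip_annotations("Chorus:\r\nNow you caught my heart")
repeat_removed = strip_annotations("{Chorus} (2x)").strip()
word_kept = strip_annotations("the universe")
```

Using `[ \t]*` rather than `\s*` after the label keeps the pattern from swallowing the line break that follows it, and the `\b` anchors leave words such as "universe" alone.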

In [34]:
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# that pandas raises when a column is set directly on the filtered slice
pop_clean = pop_clean.assign(Final_lyrics=final_lyrics)


In [37]:
# Before cleaning
print(pop_clean.lyrics[1000])

Chorus:
Now you caught my heart for the evening
Kissed my cheek, moved in you confused things
Should I just sit out or come harder
Help me find my way

Messin' me up, my whole head
Teasing me, just like Tisha did Martin
Now look, at what you starting
School boy crush, and it ain't on the hush
The whole world, see it, but you can't
My people's they complain, sittin' rave and rant
Your name is out my mouth like a ancient chant
Got me like a dog as I pause and pant

Speaking of which, got a leash and a wish, just to rock you miss
Make a militant move, peep my strategy
End of the day you're not mad at me
Not dealing with nobody now, that's what you told me
I said hey yo it's cool, we could just be friendly
Cause yo, picture me messin it up
Her mind I corrupt with the ill C-cups
Shiiit, I'm on my day off
Bullshittin, hopin' that the day go slow
Got me like a friend, what confuses me though
It's kisses when we breeze, tell me what the deal yo

{Chorus} (2x)

Now 

In [36]:
# After cleaning
print(pop_clean.Final_lyrics[1000])

<s><l> now you caught my heart for the evening kissed my cheek , moved in you confused things should i just sit out or come harder help me find my way messin' me up , my whole head teasing me , just like <oov> did martin now look , at what you starting school boy crush , and it ai not on the hush the whole world , see it , but you ca not my people 's they complain , sittin' <oov> and <oov> your name is out my mouth like a ancient chant got me like a dog as i pause and <oov> speaking of which , got a <oov> and a wish , just to rock you miss make a <oov> move , peep my <oov> end of the day you 're not mad at me not dealing with nobody now , that 's what you told me i said hey yo it 's cool , we could just be friendly cause yo , picture me messin it up her mind i <oov> with the ill c cups <oov> , i'm on my day off <oov> , hopin' that the day go slow got me like a friend , what <oov> me though it 's kisses when we eeze , tell me what the deal yo  ( 2x ) now why you wanna go and do that lov

In [38]:
pop_clean.to_csv('data/pop_clean_lyrics_dataset.csv')
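Since `to_csv` also writes the row labels, it can help to name that column so the file reads back cleanly. The following is a small round-trip sketch on a one-row toy frame, using an in-memory buffer instead of a file. The `row_id` label is a hypothetical name chosen for illustration.

```python
import io
import pandas as pd

# One-row stand-in for pop_clean, keeping an original row label of 1000.
toy = pd.DataFrame({'Final_lyrics': ['<s><l>la la</l></s>']}, index=[1000])

# Write with a named index column, then read it back as the index.
buf = io.StringIO()
toy.to_csv(buf, index_label='row_id')
buf.seek(0)
restored = pd.read_csv(buf, index_col='row_id')

round_trip_value = restored.loc[1000, 'Final_lyrics']
```

The tags survive the round trip unchanged, and the original row label is still usable for lookups.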