## Cleaning Your Song Lyrics with Regex (and more Beautiful Soup)

_Lauren F. Klein wrote version 1.0 of this notebook. Dan Sinykin updated it in 2020 and Lauren Klein updated it again in 2021._

As part of the homework due last week, you did something like this:

In [59]:
# get the text of some song lyrics
import requests 

resp = requests.get("https://genius.com/Aretha-franklin-respect-lyrics") 
html_str = resp.text

And then the BeautifulSoup part of the process...

In [69]:
# use BeautifulSoup to find the lyrics tags on the page
from bs4 import BeautifulSoup
document = BeautifulSoup(html_str, "html.parser")

lyrics_divs = document.find_all("div", attrs={"class": "Lyrics__Container-sc-1ynbvzw-8"})

lyrics_divs

[<div class="Lyrics__Container-sc-1ynbvzw-8 eOLwDW"><a class="ReferentFragment__ClickTarget-oqvzi6-0 evuxZm" href="/5055050/Aretha-franklin-respect/Written-by-otis-redding"><span class="ReferentFragment__Highlight-oqvzi6-1 kCgfVd">[Written by Otis Redding]</span></a><br/><br/>[Verse 1]<br/><a class="ReferentFragment__ClickTarget-oqvzi6-0 evuxZm" href="/5394087/Aretha-franklin-respect/What-you-want-baby-i-got-it-what-you-need-do-you-know-i-got-it"><span class="ReferentFragment__Highlight-oqvzi6-1 eZruqd">What you want, baby, I got it<br/>What you need, do you know I got it?</span></a><br/><br/>[Chorus]<br/>All I'm askin' is for a little respect when you come home<br/>(Just a little bit) Hey baby<br/>(Just a little bit) when you get home<br/>(Just a little bit) mister<br/>(Just a little bit)<br/><br/>[Verse 2]<br/><a class="ReferentFragment__ClickTarget-oqvzi6-0 evuxZm" href="/10449041/Aretha-franklin-respect/I-aint-gonna-do-you-wrong-while-youre-gone-aint-gon-do-you-wrong-cause-i-dont-w

At this point there were several options for the next step:

In [79]:
lyrics = []

for div in lyrics_divs:
    lyrics.append(div.string)  # .string is None here because each div has more than one child node
    
lyrics

[None, None]

In [97]:
lyrics = []

for div in lyrics_divs:
    lyrics.append(div.text) # this gets us closer, but since .text just strips out
                            # the HTML tags, the <br/> line breaks are lost and lines run together

lyrics

["[Written by Otis Redding][Verse 1]What you want, baby, I got itWhat you need, do you know I got it?[Chorus]All I'm askin' is for a little respect when you come home(Just a little bit) Hey baby(Just a little bit) when you get home(Just a little bit) mister(Just a little bit)[Verse 2]I ain't gonna do you wrong while you're goneAin't gon' do you wrong 'cause I don't wanna[Chorus]All I'm askin' is for a little respect when you come home(Just a little bit) Baby(Just a little bit) When you get home(Just a little bit) Yeah(Just a little bit)[Verse 3]I'm about to give you all of my moneyAnd all I'm askin' in return, honeyIs to give me my propers when you get home",
 "[Refrain](Just a, just a, just a, just a) Yeah, baby(Just a, just a, just a, just a) When you get home(Just a little bit) Yeah(Just a little bit)[Instrumental][Verse 4]Ooh, your kisses, sweeter than honeyAnd guess what? So is my money[Chorus 2]All I want you to do for me, is give it to me when you get home(Re, re, re, re) Yeah b

There is one more helpful Beautiful Soup method that we can use to extract the contents from the HTML: **the `get_text` method.**

This is a more flexible version of the `.text` attribute we used last week.

`get_text` returns all the text in a document or, crucially for us, all the text beneath the specified tag, as a single string.

You can also include a delimiter, which is inserted between the pieces of text, as follows:

In [91]:
lyrics = []

for div in lyrics_divs:
    lyrics.append(div.get_text(separator='\n'))
    
lyrics

["[Written by Otis Redding]\n[Verse 1]\nWhat you want, baby, I got it\nWhat you need, do you know I got it?\n[Chorus]\nAll I'm askin' is for a little respect when you come home\n(Just a little bit) Hey baby\n(Just a little bit) when you get home\n(Just a little bit) mister\n(Just a little bit)\n[Verse 2]\nI ain't gonna do you wrong while you're gone\nAin't gon' do you wrong 'cause I don't wanna\n[Chorus]\nAll I'm askin' is for a little respect when you come home\n(Just a little bit) Baby\n(Just a little bit) When you get home\n(Just a little bit) Yeah\n(Just a little bit)\n[Verse 3]\nI'm about to give you all of my money\nAnd all I'm askin' in return, honey\nIs to give me my propers when you get home",
 "[Refrain]\n(Just a, just a, just a, just a) Yeah, baby\n(Just a, just a, just a, just a) When you get home\n(Just a little bit) Yeah\n(Just a little bit)\n[Instrumental]\n[Verse 4]\nOoh, your kisses, sweeter than honey\nAnd guess what? So is my money\n[Chorus 2]\nAll I want you to do f

In [92]:
# let's print it out so that we get the newlines formatted properly

for chunk in lyrics:
    print(chunk)

[Written by Otis Redding]
[Verse 1]
What you want, baby, I got it
What you need, do you know I got it?
[Chorus]
All I'm askin' is for a little respect when you come home
(Just a little bit) Hey baby
(Just a little bit) when you get home
(Just a little bit) mister
(Just a little bit)
[Verse 2]
I ain't gonna do you wrong while you're gone
Ain't gon' do you wrong 'cause I don't wanna
[Chorus]
All I'm askin' is for a little respect when you come home
(Just a little bit) Baby
(Just a little bit) When you get home
(Just a little bit) Yeah
(Just a little bit)
[Verse 3]
I'm about to give you all of my money
And all I'm askin' in return, honey
Is to give me my propers when you get home
[Refrain]
(Just a, just a, just a, just a) Yeah, baby
(Just a, just a, just a, just a) When you get home
(Just a little bit) Yeah
(Just a little bit)
[Instrumental]
[Verse 4]
Ooh, your kisses, sweeter than honey
And guess what? So is my money
[Chorus 2]
All I want you to do for me, is give it to me when you get h

This is pretty good. But it's still not perfect.

**What are some problems with this text?**

In [None]:
# We still have words between brackets that aren't really part of the lyrics

This is where regex comes in handy.

Let's remove the info between brackets.

**Which of the functions that we've discussed in class today should we use to remove it?**

In [None]:
re.sub() 

Remember that `re.sub` takes the format:

`newstring = re.sub(pattern, replacement, original_string)`

When we want to get rid of something without replacing it, we use `""` as the replacement string.

So, in our case, we can do something like:

`cleaner_lyrics = re.sub(pattern, "", lyrics)`
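Before tackling the lyrics, here is a minimal sketch of `re.sub` on a made-up string (the string and the digit pattern are illustrations only, not part of the homework). It shows the difference between swapping matches for something else and deleting them with `""`:

```python
import re

toy = "Track 07: Respect (1967)"  # made-up string for illustration

# replace every run of digits with "#"
masked = re.sub(r"[0-9]+", "#", toy)

# use "" as the replacement to delete every run of digits instead
stripped = re.sub(r"[0-9]+", "", toy)

print(masked)    # Track #: Respect (#)
print(stripped)  # Track : Respect ()
```

Note that `re.sub` returns a new string; `toy` itself is unchanged.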

**But what goes in the pattern?**

**Exercise: How can we use regex to remove the brackets and the info between them?**

Hints:
* In order to look for a `[`, you need to escape it like this `\[`
* A `.` represents any character except a newline (by default)
* A `*` after a character matches zero or more instances of that character
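Each hint can be tried on its own first. The sketch below uses a made-up line (not taken from the lyrics) and also shows one thing to watch for: `.*` is greedy, so on a line with two bracketed tags it runs from the first `[` to the last `]`. Adding `?` after the `*` makes it stop at the first `]` instead.

```python
import re

toy = "[Intro] Hey [x2]"  # made-up line, not from the lyrics above

# \[ matches a literal opening bracket
opening_brackets = re.findall(r"\[", toy)

# . matches any single character except a newline; * repeats the item before it
# zero or more times, so "H.*" grabs everything from "H" to the end of the line
from_h = re.findall(r"H.*", toy)

# .* is greedy: it runs from the first "[" to the LAST "]" on the line
greedy = re.sub(r"\[.*\]", "", toy)

# .*? is lazy: it stops at the first "]" it can
lazy = re.sub(r"\[.*?\]", "", toy)

print(opening_brackets)
print(from_h)
print(repr(greedy))
print(repr(lazy))
```

In our lyrics every bracketed tag sits on its own line, so the greedy and lazy versions should behave the same there; the difference only matters when a line holds more than one tag.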

In [None]:
import re

cleaner_lyrics = []

for chunk in lyrics:
    # raw string (r"...") so the backslashes reach the regex engine unchanged
    cleaner_chunk = re.sub(r"\[.*\]", "", chunk)
    cleaner_lyrics.append(cleaner_chunk)
    
cleaner_lyrics

["\n\nWhat you want, baby, I got it\nWhat you need, do you know I got it?\n\nAll I'm askin' is for a little respect when you come home\n(Just a little bit) Hey baby\n(Just a little bit) when you get home\n(Just a little bit) mister\n(Just a little bit)\n\nI ain't gonna do you wrong while you're gone\nAin't gon' do you wrong 'cause I don't wanna\n\nAll I'm askin' is for a little respect when you come home\n(Just a little bit) Baby\n(Just a little bit) When you get home\n(Just a little bit) Yeah\n(Just a little bit)\n\nI'm about to give you all of my money\nAnd all I'm askin' in return, honey\nIs to give me my propers when you get home",
 "\n(Just a, just a, just a, just a) Yeah, baby\n(Just a, just a, just a, just a) When you get home\n(Just a little bit) Yeah\n(Just a little bit)\n\n\nOoh, your kisses, sweeter than honey\nAnd guess what? So is my money\n\nAll I want you to do for me, is give it to me when you get home\n(Re, re, re, re) Yeah baby\n(Re, re, re, re) Whip it to me\n(Respec

Now there are two extra newlines at the beginning of the first chunk.

**How can we remove these?**

Hints: 
* `\n` is the same in regex
* `^` outside square brackets matches the start of a string (as the first character inside them, as in `[^a]`, it means "not" instead)
* `{}` with a number inside indicates how many times the preceding item must repeat
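These hints can be checked on small made-up strings before applying them to the lyrics. A minimal sketch (the strings are illustrations only):

```python
import re

# ^ outside square brackets anchors the match to the start of the string
only_first = re.sub(r"^a", "", "aba")   # removes the leading "a" only

# ^ as the first character inside square brackets means "anything except"
not_a = re.sub(r"[^a]", "", "aba")      # removes everything that is not "a"

# {2} means "exactly two of the preceding item"
double_o = re.findall(r"o{2}", "foo boo bo")

print(only_first)  # ba
print(not_a)       # aa
print(double_o)    # ['oo', 'oo']
```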

In [None]:
cleanest_lyrics = []

for chunk in cleaner_lyrics: 
    # remove two newlines at the very start of the chunk
    cleanest_chunk = re.sub(r"^\n{2}", "", chunk)
    cleanest_lyrics.append(cleanest_chunk)

cleanest_lyrics

["What you want, baby, I got it\nWhat you need, do you know I got it?\n\nAll I'm askin' is for a little respect when you come home\n(Just a little bit) Hey baby\n(Just a little bit) when you get home\n(Just a little bit) mister\n(Just a little bit)\n\nI ain't gonna do you wrong while you're gone\nAin't gon' do you wrong 'cause I don't wanna\n\nAll I'm askin' is for a little respect when you come home\n(Just a little bit) Baby\n(Just a little bit) When you get home\n(Just a little bit) Yeah\n(Just a little bit)\n\nI'm about to give you all of my money\nAnd all I'm askin' in return, honey\nIs to give me my propers when you get home",
 "\n(Just a, just a, just a, just a) Yeah, baby\n(Just a, just a, just a, just a) When you get home\n(Just a little bit) Yeah\n(Just a little bit)\n\n\nOoh, your kisses, sweeter than honey\nAnd guess what? So is my money\n\nAll I want you to do for me, is give it to me when you get home\n(Re, re, re, re) Yeah baby\n(Re, re, re, re) Whip it to me\n(Respect, j

In [95]:
# print it out to see our beautiful work:

for chunk in cleanest_lyrics:
    print(chunk)

What you want, baby, I got it
What you need, do you know I got it?

All I'm askin' is for a little respect when you come home
(Just a little bit) Hey baby
(Just a little bit) when you get home
(Just a little bit) mister
(Just a little bit)

I ain't gonna do you wrong while you're gone
Ain't gon' do you wrong 'cause I don't wanna

All I'm askin' is for a little respect when you come home
(Just a little bit) Baby
(Just a little bit) When you get home
(Just a little bit) Yeah
(Just a little bit)

I'm about to give you all of my money
And all I'm askin' in return, honey
Is to give me my propers when you get home

(Just a, just a, just a, just a) Yeah, baby
(Just a, just a, just a, just a) When you get home
(Just a little bit) Yeah
(Just a little bit)


Ooh, your kisses, sweeter than honey
And guess what? So is my money

All I want you to do for me, is give it to me when you get home
(Re, re, re, re) Yeah baby
(Re, re, re, re) Whip it to me
(Respect, just a little bit) When you get home, no
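The printed output still has a blank line where the second chunk begins, and a double blank line where two bracketed tags sat next to each other (`[Instrumental]` and `[Verse 4]`). If we wanted to tidy those too, one possible approach is sketched below. `tidy_newlines` is a hypothetical helper name, and the chunks are toy stand-ins for `cleanest_lyrics`:

```python
import re

def tidy_newlines(chunk):
    """Hypothetical helper: drop leading newlines and collapse extra blank lines."""
    chunk = re.sub(r"^\n+", "", chunk)        # one or more newlines at the start
    chunk = re.sub(r"\n{3,}", "\n\n", chunk)  # three or more in a row become two
    return chunk

# toy stand-ins with the same kinds of leftover newlines as our lyrics
toy_chunks = ["\n\nline one\n\nline two", "\nline three\n\n\nline four"]
tidied = [tidy_newlines(chunk) for chunk in toy_chunks]

for chunk in tidied:
    print(chunk)
```

Using `+` instead of `{2}` means the pattern no longer depends on how many newlines happen to lead each chunk.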

And with that, we're done! 

Now, one last thing: we need to save our clean lyrics to a file:

In [96]:
# this is how you save to a file in Python
# (writelines writes each chunk as-is; it does not add newlines between chunks)

with open("lyrics.txt", "w", encoding="utf-8") as file:
    file.writelines(cleanest_lyrics)
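Because `writelines` writes each string exactly as given, chunks only end up on separate lines if they already start or end with a newline. A minimal alternative sketch, assuming we want exactly one newline between chunks, joins them first and then reads the file back to check the result. The chunks and the file name here are made up, and the file goes in a temporary directory so nothing real is overwritten:

```python
import os
import tempfile

chunks = ["first chunk\nline two", "second chunk"]  # toy stand-in for cleanest_lyrics

# hypothetical file name, placed in a throwaway temporary directory
path = os.path.join(tempfile.mkdtemp(), "lyrics_demo.txt")

# join the chunks with a newline so they do not run together
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(chunks))

# read the file back to confirm what was saved
with open(path, encoding="utf-8") as f:
    saved = f.read()

print(saved)
```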

We did it!

**If there's time: write pseudocode for how we'd put all the pieces together to create our lyrics corpus.**