## Cleaning Your Song Lyrics with Regex (and more Beautiful Soup)

As part of the homework due today, you did something like this:

In [10]:
# get the text of some song lyrics
import requests 

resp = requests.get("https://genius.com/Aretha-franklin-respect-lyrics") 
html_str = resp.text

In [11]:
# use BeautifulSoup to find the lyrics tag on the page

from bs4 import BeautifulSoup
document = BeautifulSoup(html_str, "html.parser")
lyrics_div = document.find('div', attrs={'class': 'lyrics'})

And you ended up with something that looked like this:

In [12]:
lyrics_div

<div class="lyrics">
<!--sse-->
<p><a annotation-fragment="5055050" class="referent" classification="verified" data-id="5055050" href="/Aretha-franklin-respect-lyrics#note-5055050" image="false" ng-class="{
          'referent--linked_to_preview': song_ctrl.referent_has_preview(fragment_id),
          'referent--linked_to_preview_active': song_ctrl.highlight_preview_referent(fragment_element_id),
          'referent--purple_indicator': song_ctrl.show_preview_referent_indicator(fragment_element_id)
        }" ng-click="open()" on-hover-with-no-digest="set_current_hover_and_digest(hover ? fragment_id : undefined)" pending-editorial-actions-count="0" prevent-default-click="" verified-annotator-ids="27144">[Written by Otis Redding]</a><br/>
<br/>
[Verse 1]<br/>
<a annotation-fragment="5394087" class="referent" classification="accepted" data-id="5394087" href="/Aretha-franklin-respect-lyrics#note-5394087" image="false" ng-class="{
          'referent--linked_to_preview': song_ctrl.referent_

There is one more helpful Beautiful Soup method that we can use to extract the contents from the HTML: **the `get_text` method.**

This is the method behind the `.text` attribute we've been using to date. `.text` is shorthand for `get_text()` with its default arguments; calling `get_text()` directly also lets you pass options such as `separator` and `strip`.

`get_text` returns all the text in a document or, crucially for us, beneath a specified tag, as a single string.
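
To see this on something smaller than the full Genius page, here is a minimal sketch using a made-up HTML snippet. The snippet and the variable names are assumptions for illustration, not the real page. It also tries the optional `separator` and `strip` arguments.

```python
from bs4 import BeautifulSoup

# a tiny, made-up stand-in for the lyrics div
html = (
    '<div class="lyrics"><p><a href="#note">[Verse 1]</a><br/>'
    "What you want<br/>Baby, I got it</p></div>"
)

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", attrs={"class": "lyrics"})

# default call: every piece of text under the tag, glued together
plain = div.get_text()
print(repr(plain))   # '[Verse 1]What you wantBaby, I got it'

# separator goes between the pieces; strip=True trims each piece
spaced = div.get_text(separator="\n", strip=True)
print(repr(spaced))  # '[Verse 1]\nWhat you want\nBaby, I got it'
```

Note that the tags and their attributes are gone in both results. Only the text between them survives.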

Behold!

In [13]:
lyrics = lyrics_div.get_text()

lyrics

"\n\n[Written by Otis Redding]\n\n[Verse 1]\nWhat you want, baby, I got it\nWhat you need, do you know I got it?\n\n[Chorus]\nAll I'm askin' is for a little respect when you come home\n(Just a little bit) Hey baby\n(Just a little bit) when you get home\n(Just a little bit) mister\n(Just a little bit)\n\n[Verse 2]\nI ain't gonna do you wrong while you're gone\nAin't gon' do you wrong 'cause I don't wanna\n\n[Chorus]\nAll I'm askin' is for a little respect when you come home\n(Just a little bit) Baby\n(Just a little bit) When you get home\n(Just a little bit) Yeah\n(Just a little bit)\n\n[Verse 3]\nI'm about to give you all of my money\nAnd all I'm askin' in return, honey\nIs to give me my propers when you get home\n\n[Refrain]\n(Just a, just a, just a, just a) Yeah, baby\n(Just a, just a, just a, just a) When you get home\n(Just a little bit) Yeah\n(Just a little bit)\n\n[Verse 4]\nOoh, your kisses, sweeter than honey\nAnd guess what? So is my money\n\n[Chorus 2]\nAll I want you to do f

This is pretty good, but it's still not perfect.

What are some problems with this text?

In [None]:
# show as file -- messy_lyrics 

# extra newlines, indications of chorus, author, etc. 

This is where regex comes in very handy.

Let's start by removing those extra newlines at the start of the string.

**Which of the functions that we've discussed in class today should we use to remove them?**

In [None]:
# re.sub 

Remember that `re.sub` takes the format:

`newstring = re.sub(pattern, replacement, original_string)`
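
As a quick refresher, here is `re.sub` on a toy string (the string is made up for illustration). Every non-overlapping match of the pattern is replaced, and the original string is left unchanged because Python strings are immutable.

```python
import re

original_string = "R-E-S-P-E-C-T"

# replace every hyphen with nothing
newstring = re.sub("-", "", original_string)
print(newstring)        # RESPECT

# the optional count argument limits how many matches get replaced
first_only = re.sub("-", "", original_string, count=1)
print(first_only)       # RE-S-P-E-C-T

# the original is untouched
print(original_string)  # R-E-S-P-E-C-T
```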

In our case, we can do something like:

`cleaner_lyrics = re.sub(pattern, "", lyrics)`

**But what goes in the pattern?**

Hints:
* a newline (`\n`) is written the same way in regex
* `^` matches the start of the string, so put it before the square brackets
* `{}` with a number inside says how many times to repeat the element just before it
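
Each hint can be tried on its own with a small made-up string before combining them. This sketch uses spaces instead of newlines, so the pieces are easy to see in the output. The variable names are just for illustration.

```python
import re

padded = "   hello   world"  # three spaces, hello, three spaces, world

# ^ anchors the match to the very start of the string
one_removed = re.sub("^[ ]", "", padded)
print(repr(one_removed))    # '  hello   world'

# {3} repeats the preceding element exactly three times
three_removed = re.sub("^[ ]{3}", "", padded)
print(repr(three_removed))  # 'hello   world'

# without ^, the same pattern can match anywhere in the string
anywhere = re.sub("[ ]{3}", "", padded)
print(repr(anywhere))       # 'helloworld'
```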

In [14]:
import re

cleaner_lyrics = re.sub("^[\n]{2}", "", lyrics)

cleaner_lyrics



"[Written by Otis Redding]\n\n[Verse 1]\nWhat you want, baby, I got it\nWhat you need, do you know I got it?\n\n[Chorus]\nAll I'm askin' is for a little respect when you come home\n(Just a little bit) Hey baby\n(Just a little bit) when you get home\n(Just a little bit) mister\n(Just a little bit)\n\n[Verse 2]\nI ain't gonna do you wrong while you're gone\nAin't gon' do you wrong 'cause I don't wanna\n\n[Chorus]\nAll I'm askin' is for a little respect when you come home\n(Just a little bit) Baby\n(Just a little bit) When you get home\n(Just a little bit) Yeah\n(Just a little bit)\n\n[Verse 3]\nI'm about to give you all of my money\nAnd all I'm askin' in return, honey\nIs to give me my propers when you get home\n\n[Refrain]\n(Just a, just a, just a, just a) Yeah, baby\n(Just a, just a, just a, just a) When you get home\n(Just a little bit) Yeah\n(Just a little bit)\n\n[Verse 4]\nOoh, your kisses, sweeter than honey\nAnd guess what? So is my money\n\n[Chorus 2]\nAll I want you to do for m

What about the other issue with these lyrics: all of the info in square brackets?

**Exercise: How can we use regex to remove it?**

Additional hints:
* In order to look for a `[`, you need to escape it like this: `\[`
* A `.` represents any character except a newline
* A `*` matches zero or more instances of the element just before it
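
Two behaviours are worth checking before using these hints on the lyrics: where `.*` stops, and how much it grabs when a line holds more than one bracketed tag. The strings below are made up for illustration, and `[^\]]*` is one possible alternative pattern, not something the lyrics above require.

```python
import re

sample = "[Intro]\nla la la\n[Outro]\nbye"

# .* cannot cross a newline, so each tag is matched separately
per_line = re.findall(r"\[.*\]", sample)
print(per_line)  # ['[Intro]', '[Outro]']

one_line = "[Verse 1] keep this [Chorus]"

# * is greedy: .* runs from the first [ to the last ] on the line
greedy = re.findall(r"\[.*\]", one_line)
print(greedy)    # ['[Verse 1] keep this [Chorus]']

# [^\]]* means "any run of characters that are not ]"
bounded = re.findall(r"\[[^\]]*\]", one_line)
print(bounded)   # ['[Verse 1]', '[Chorus]']
```

In the Genius text each tag sits on its own line, so the greedy behaviour does no harm there.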

In [15]:
# raw string (r"...") so Python hands the backslashes to the regex engine untouched
cleanest_lyrics = re.sub(r"\[.*\]\n", "", cleaner_lyrics)

cleanest_lyrics

"\nWhat you want, baby, I got it\nWhat you need, do you know I got it?\n\nAll I'm askin' is for a little respect when you come home\n(Just a little bit) Hey baby\n(Just a little bit) when you get home\n(Just a little bit) mister\n(Just a little bit)\n\nI ain't gonna do you wrong while you're gone\nAin't gon' do you wrong 'cause I don't wanna\n\nAll I'm askin' is for a little respect when you come home\n(Just a little bit) Baby\n(Just a little bit) When you get home\n(Just a little bit) Yeah\n(Just a little bit)\n\nI'm about to give you all of my money\nAnd all I'm askin' in return, honey\nIs to give me my propers when you get home\n\n(Just a, just a, just a, just a) Yeah, baby\n(Just a, just a, just a, just a) When you get home\n(Just a little bit) Yeah\n(Just a little bit)\n\nOoh, your kisses, sweeter than honey\nAnd guess what? So is my money\n\nAll I want you to do for me, is give it to me when you get home\n(Re, re, re ,re) Yeah baby\n(Re, re, re ,re) Whip it to me\n(Respect, just 

Oops. There's still one extra newline at the start of the string.

**How can we remove this?**

In [16]:
more_cleanest_lyrics = re.sub("^[\n]", "", cleanest_lyrics)

more_cleanest_lyrics

"What you want, baby, I got it\nWhat you need, do you know I got it?\n\nAll I'm askin' is for a little respect when you come home\n(Just a little bit) Hey baby\n(Just a little bit) when you get home\n(Just a little bit) mister\n(Just a little bit)\n\nI ain't gonna do you wrong while you're gone\nAin't gon' do you wrong 'cause I don't wanna\n\nAll I'm askin' is for a little respect when you come home\n(Just a little bit) Baby\n(Just a little bit) When you get home\n(Just a little bit) Yeah\n(Just a little bit)\n\nI'm about to give you all of my money\nAnd all I'm askin' in return, honey\nIs to give me my propers when you get home\n\n(Just a, just a, just a, just a) Yeah, baby\n(Just a, just a, just a, just a) When you get home\n(Just a little bit) Yeah\n(Just a little bit)\n\nOoh, your kisses, sweeter than honey\nAnd guess what? So is my money\n\nAll I want you to do for me, is give it to me when you get home\n(Re, re, re ,re) Yeah baby\n(Re, re, re ,re) Whip it to me\n(Respect, just a 

In [None]:
# could remove newline at end w/ $ instead of ^ but don't have to do it
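
For completeness, trailing newlines can be handled the same way. This is a small sketch on a made-up string. `str.strip` is shown as a non-regex alternative, which is an addition here rather than something covered above.

```python
import re

text = "\nfirst line\nsecond line\n\n"

# $ anchors to the end of the string; \n+ covers one or more newlines
no_trailing = re.sub(r"\n+$", "", text)
print(repr(no_trailing))  # '\nfirst line\nsecond line'

# str.strip removes the given characters from both ends in one call
no_edges = text.strip("\n")
print(repr(no_edges))     # 'first line\nsecond line'
```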

And with that, we're done! 

Now, one last thing: we need to save our clean lyrics to a file:

In [19]:
# this is how you save a string to a file in Python
# (write takes one string; writelines is meant for a list of strings)

with open("lyrics.txt", "w", encoding="utf-8") as file:
    file.write(more_cleanest_lyrics)


We did it!

**If time, pseudocode how we'd put all the pieces together to create our lyrics corpus**
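
One possible way to sketch this in Python rather than pseudocode: wrap the cleaning steps in a function, then loop over the songs and write one file per song. Everything named here is hypothetical: `clean_lyrics`, `build_corpus`, the `fetch_lyrics` argument (a stand-in for the `requests`, Beautiful Soup, and `get_text` steps above), and the one-file-per-song layout. The sketch uses `strip` in place of the two `^` substitutions, which is a simplification and not the approach from the cells above.

```python
import re
from pathlib import Path


def clean_lyrics(raw):
    """Remove bracketed tags like [Verse 1], then newlines at either end."""
    no_tags = re.sub(r"\[.*\]\n", "", raw)
    return no_tags.strip("\n")


def build_corpus(songs, fetch_lyrics, out_dir):
    """Clean and save lyrics for every song.

    songs        -- dict mapping a short file name to a lyrics page URL
    fetch_lyrics -- hypothetical function: takes a URL, returns raw lyric text
    out_dir      -- folder that will hold one .txt file per song
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    for name, url in songs.items():
        cleaned = clean_lyrics(fetch_lyrics(url))
        (out_path / f"{name}.txt").write_text(cleaned, encoding="utf-8")
```

Passing the fetching step in as a function keeps the cleaning and saving code easy to try without hitting the network: for a quick check, `fetch_lyrics` can be a function that returns a fixed string.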