## Cleaning Your Song Lyrics with Regex (and more Beautiful Soup)

_Lauren F. Klein wrote version 1.0 of this notebook._

As part of the homework due today, you did something like this:

In [5]:
# get the text of some song lyrics
import requests 
import re

resp = requests.get("https://genius.com/Aretha-franklin-respect-lyrics") 
html_str = resp.text

## Import BeautifulSoup

In [2]:
# use BeautifulSoup to find the lyrics tag on the page

from bs4 import BeautifulSoup
document = BeautifulSoup(html_str, "html.parser")
lyrics = document.find("p").text

And you ended up with something that looked like this:

In [3]:
lyrics

"[Written by Otis Redding]\n\n[Verse 1]\nWhat you want, baby, I got it\nWhat you need, do you know I got it?\n\n[Chorus]\nAll I'm askin' is for a little respect when you come home\n(Just a little bit) Hey baby\n(Just a little bit) when you get home\n(Just a little bit) mister\n(Just a little bit)\n\n[Verse 2]\nI ain't gonna do you wrong while you're gone\nAin't gon' do you wrong 'cause I don't wanna\n\n[Chorus]\nAll I'm askin' is for a little respect when you come home\n(Just a little bit) Baby\n(Just a little bit) When you get home\n(Just a little bit) Yeah\n(Just a little bit)\n\n[Verse 3]\nI'm about to give you all of my money\nAnd all I'm askin' in return, honey\nIs to give me my propers when you get home\n\n[Refrain]\n(Just a, just a, just a, just a) Yeah, baby\n(Just a, just a, just a, just a) When you get home\n(Just a little bit) Yeah\n(Just a little bit)\n\n[Verse 4]\nOoh, your kisses, sweeter than honey\nAnd guess what? So is my money\n\n[Chorus 2]\nAll I want you to do for m

This is pretty good, but it's still not perfect.

What are some problems with this text?

We still have words between brackets, such as `[Verse 1]` and `[Chorus]`, that aren't really part of the lyrics.

This is where regex comes in handy.

Let's remove the info between brackets.

**Which of the functions that we've discussed in class today should we use to remove it?**

* ANSWER HERE

Right! `sub`

Remember that `re.sub` takes the format:

`newstring = re.sub(pattern, replacement, original_string)`

When we want to get rid of something without replacing it, we use `""` as the replacement string.

So, in our case, we can do something like:

`cleaner_lyrics = re.sub(pattern, "", lyrics)`
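
To see the difference between replacing and removing before we worry about the pattern, here is a minimal sketch on a made-up string. The `sample` text and the variable names are assumptions for illustration; they are not part of the homework.

```python
import re

# Hypothetical sample line, not taken from the real lyrics.
sample = "la la (hey) la"

# Replace every "la" with "na".
replaced = re.sub("la", "na", sample)

# An empty replacement string deletes every match instead.
removed = re.sub("la", "", sample)

print(replaced)       # na na (hey) na
print(repr(removed))  # '  (hey) '
```

Notice that the spaces around each removed match stay behind. Deleting text with `re.sub` often leaves leftover whitespace that needs its own cleanup step.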

**But what goes in the pattern?**

**Exercise: How can we use regex to remove the brackets and the info between them?**

Hints:
* To match a literal `[`, escape it with a backslash: `\[`
* A `.` matches any single character except a newline
* A `*` after a character matches zero or more instances of that character
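
Each hint can be tried on its own before combining them. The sketch below uses `re.findall`, which returns a list of everything a pattern matched, on a made-up `sample` string shaped like our lyrics. The sample, the variable names, and the `r"..."` raw-string prefix (which stops Python from interpreting the backslash itself) are choices made for this illustration.

```python
import re

# Hypothetical sample in the same shape as the lyrics.
sample = "[Verse 1]\nWhat you want\n[Chorus]\nJust a little bit"

# \[ matches a literal opening bracket.
open_brackets = re.findall(r"\[", sample)

# . matches one character after the bracket (anything except a newline).
first_two_chars = re.findall(r"\[.", sample)

# .* keeps matching until it hits a newline, so each match stays on one line.
rest_of_line = re.findall(r"\[.*", sample)

print(open_brackets)    # ['[', '[']
print(first_two_chars)  # ['[V', '[C']
print(rest_of_line)     # ['[Verse 1]', '[Chorus]']
```

Because `.` stops at a newline, a pattern built from these pieces handles each bracketed label separately instead of swallowing everything from the first `[` to the last `]` in the song.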

In [8]:
# the r"..." raw string passes the backslash through to the regex engine unchanged
cleaner_lyrics = re.sub(r"\[.*\]", "", lyrics)

cleaner_lyrics

"\n\n\nWhat you want, baby, I got it\nWhat you need, do you know I got it?\n\n\nAll I'm askin' is for a little respect when you come home\n(Just a little bit) Hey baby\n(Just a little bit) when you get home\n(Just a little bit) mister\n(Just a little bit)\n\n\nI ain't gonna do you wrong while you're gone\nAin't gon' do you wrong 'cause I don't wanna\n\n\nAll I'm askin' is for a little respect when you come home\n(Just a little bit) Baby\n(Just a little bit) When you get home\n(Just a little bit) Yeah\n(Just a little bit)\n\n\nI'm about to give you all of my money\nAnd all I'm askin' in return, honey\nIs to give me my propers when you get home\n\n\n(Just a, just a, just a, just a) Yeah, baby\n(Just a, just a, just a, just a) When you get home\n(Just a little bit) Yeah\n(Just a little bit)\n\n\nOoh, your kisses, sweeter than honey\nAnd guess what? So is my money\n\n\nAll I want you to do for me, is give it to me when you get home\n(Re, re, re ,re) Yeah baby\n(Re, re, re ,re) Whip it to m

Now there are three extra newlines at the beginning.

**How can we remove these?**

Hints: 
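* `\n` matches a newline in regex, just as it stands for one in a Python string
* `^` at the start of a pattern matches the start of the string (inside square brackets, a leading `^` means "not" instead)
* `{}` with a number inside says how many times the preceding item must repeat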
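
The two meanings of `^` are easy to mix up, so here is a small sketch that contrasts them. The `sample` string and variable names are made up for illustration.

```python
import re

# Hypothetical sample: three leading newlines and three more in the middle.
sample = "\n\n\nfirst line\n\n\nsecond line"

# ^ outside square brackets anchors the match to the start of the string,
# so only the leading run of newlines is removed.
trimmed = re.sub(r"^\n{3}", "", sample)

# Without ^, every run of three newlines is removed.
unanchored = re.sub(r"\n{3}", "", sample)

# ^ as the first character inside square brackets negates the set:
# [^\n] matches any character that is NOT a newline.
only_newlines = re.sub(r"[^\n]", "", sample)

print(repr(trimmed))        # 'first line\n\n\nsecond line'
print(repr(unanchored))     # 'first linesecond line'
print(repr(only_newlines))  # '\n\n\n\n\n\n'
```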

In [9]:
cleanest_lyrics = re.sub(r"^\n{3}", "", cleaner_lyrics)

In [10]:
cleanest_lyrics

"What you want, baby, I got it\nWhat you need, do you know I got it?\n\n\nAll I'm askin' is for a little respect when you come home\n(Just a little bit) Hey baby\n(Just a little bit) when you get home\n(Just a little bit) mister\n(Just a little bit)\n\n\nI ain't gonna do you wrong while you're gone\nAin't gon' do you wrong 'cause I don't wanna\n\n\nAll I'm askin' is for a little respect when you come home\n(Just a little bit) Baby\n(Just a little bit) When you get home\n(Just a little bit) Yeah\n(Just a little bit)\n\n\nI'm about to give you all of my money\nAnd all I'm askin' in return, honey\nIs to give me my propers when you get home\n\n\n(Just a, just a, just a, just a) Yeah, baby\n(Just a, just a, just a, just a) When you get home\n(Just a little bit) Yeah\n(Just a little bit)\n\n\nOoh, your kisses, sweeter than honey\nAnd guess what? So is my money\n\n\nAll I want you to do for me, is give it to me when you get home\n(Re, re, re ,re) Yeah baby\n(Re, re, re ,re) Whip it to me\n(Re

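The output above still has three newlines between stanzas where the bracketed labels used to be. If that bothers you, one optional tidy-up (not part of the original exercise) is to collapse those runs into a single blank line. This sketch assumes the `{3,}` form of the curly-brace syntax, which means "three or more", and uses a made-up `sample` string rather than the full lyrics.

```python
import re

# Hypothetical sample with the same stanza gaps as the cleaned lyrics.
sample = "What you want\nWhat you need\n\n\nAll I'm askin'\n\n\nI ain't gonna"

# \n{3,} matches a run of three or more newlines; replace each run with two,
# which leaves exactly one blank line between stanzas.
collapsed = re.sub(r"\n{3,}", "\n\n", sample)

print(collapsed)
```
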
And with that, we're done! 

Now, one last thing: we need to save our clean lyrics to a file:

In [11]:
# this is how you save a string to a file in Python

with open("lyrics.txt", "w", encoding="utf-8") as file:
    # write() takes one string; writelines() expects a list of strings
    file.write(cleanest_lyrics)
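
A quick way to confirm a save worked is to read the file back and compare it with what you wrote. The sketch below does this in a temporary folder so it leaves nothing behind. The file name `lyrics_demo.txt` and the short `text` sample are assumptions for illustration.

```python
import os
import tempfile

# Hypothetical short sample standing in for the cleaned lyrics.
text = "What you want, baby, I got it\nWhat you need, do you know I got it?\n"

# Work inside a temporary directory that is deleted when the block ends.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "lyrics_demo.txt")  # hypothetical file name

    # "w" opens the file for writing, replacing any existing content.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # "r" opens the file for reading; read() returns the whole file as one string.
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()

print(round_trip == text)  # True
```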

We did it!