## Cleaning Your Song Lyrics with Regex (and more Beautiful Soup)

_Lauren F. Klein wrote version 1.0 of this notebook. Dan Sinykin updated it in 2020 and Lauren Klein updated it again in 2021._

As part of the homework due last week, you did something like this:

In [None]:
# get the text of some song lyrics
import requests

resp = requests.get("https://genius.com/Aretha-franklin-respect-lyrics")
resp.raise_for_status()  # stop here if the page could not be fetched
html_str = resp.text

And then the BeautifulSoup part of the process...

In [None]:
# use BeautifulSoup to find the lyrics tags on the page
from bs4 import BeautifulSoup
document = BeautifulSoup(html_str, "html.parser")

lyrics_divs = document.find_all("div", attrs={"class": "Lyrics__Container-sc-1ynbvzw-8"})

lyrics_divs
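
Class names like the one above are generated by the site and can change over time, so an exact match may come back empty. One possible workaround, sketched here on a made-up HTML snippet rather than the real Genius page, is to pass a compiled regex as the class filter so that any class starting with `Lyrics__Container` matches. The snippet, its class suffixes, and the `toy_` variable names are assumptions for illustration only.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML standing in for a lyrics page (not the real Genius markup).
toy_html = (
    '<div class="Lyrics__Container-sc-abc123-1">What you want</div>'
    '<div class="Footer-sc-xyz-0">Not lyrics</div>'
    '<div class="Lyrics__Container-sc-abc123-1 extra">Baby, I got it</div>'
)
toy_document = BeautifulSoup(toy_html, "html.parser")

# Match any div with a class that begins with "Lyrics__Container".
toy_divs = toy_document.find_all("div", class_=re.compile(r"^Lyrics__Container"))

for div in toy_divs:
    print(div.text)
```

If the real page's class suffix has changed since this notebook was written, a prefix match like this may still find the lyrics, but that depends on how the site is built at the time.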

At this point there were several options for the next step:

In [None]:
lyrics = []

for div in lyrics_divs:
    lyrics.append(div.string) # this does not work here: .string is None when a tag has more than one child
    
lyrics

In [None]:
lyrics = []

for div in lyrics_divs:
    lyrics.append(div.text) # this gets us closer, but since .text just strips out
                            # all the html, lines separated only by tags run together

lyrics

There is one more helpful Beautiful Soup method that we can use to extract the contents from the HTML: **the `get_text` method.**

This is a more flexible version of the `.text` attribute we used last week.

`get_text` returns all the text in a document or, crucially for us, all the text beneath a specified tag, as a single string.

You can also pass a delimiter (the `separator` argument), which is inserted between each piece of text:

In [None]:
lyrics = []

for div in lyrics_divs:
    lyrics.append(div.get_text(separator='\n'))
    
lyrics
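
To see the three options side by side, here is a minimal sketch on a made-up snippet. The HTML and the `toy_` names are assumptions for illustration; the real page has many more tags.

```python
from bs4 import BeautifulSoup

# Made-up markup: two lyric lines separated only by a <br/> tag.
toy_html = "<div>What you want<br/>Baby, I got it</div>"
toy_div = BeautifulSoup(toy_html, "html.parser").find("div")

print(toy_div.string)                          # None: the div has more than one child
print(repr(toy_div.text))                      # the two lines run together
print(repr(toy_div.get_text(separator="\n")))  # a newline between each piece of text
```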

In [None]:
# let's print it out so that we get the newlines formatted properly

for chunk in lyrics:
    print(chunk)

This is pretty good, but it's still not perfect.

**What are some problems with this text?**

This is where regex comes in handy.

Let's remove the info between brackets.

**Which of the functions that we've discussed in class today should we use to remove it?**

Remember that `re.sub` takes the format:

`newstring = re.sub(pattern, replacement, original_string)`

When we want to get rid of something without replacing it, we use `""` as the replacement string.
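
As a quick illustration of the difference between replacing and deleting, here is a small sketch on a made-up line. The hyphen pattern is only a stand-in and is not the pattern the exercise below asks for.

```python
import re

line = "R-E-S-P-E-C-T, find out what it means to me"

# Replace every hyphen with a space.
spaced = re.sub(r"-", " ", line)

# Delete every hyphen by replacing it with the empty string.
joined = re.sub(r"-", "", line)

print(spaced)
print(joined)
```

Note that `re.sub` returns a new string and leaves `line` unchanged.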

So, in our case, for each string `chunk` in our `lyrics` list, we can do something like:

`cleaner_chunk = re.sub(pattern, "", chunk)`

**But what goes in the pattern?**

**Exercise: How can we use regex to remove the brackets and the info between them?**

Hints:
* In order to look for a `[`, you need to escape it like this: `\[`
* A `.` represents any single character (except a newline)
* A `*` after a character matches zero or more instances of that character
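
Each hint can be tried on its own before combining them. The sketch below uses a made-up sample string and deliberately stops short of the full pattern, which is left as the exercise.

```python
import re

sample = "[Hook] ooh, just a little bit"

# An escaped bracket matches a literal "[".
brackets = re.findall(r"\[", sample)

# "." matches any single character, so "b.t" finds "bit".
any_char = re.findall(r"b.t", sample)

# "*" applies to the character before it: "o*h" is zero or more "o"s, then "h".
repeated = re.search(r"o*h", sample).group()

print(brackets)
print(any_char)
print(repeated)
```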

In [None]:
import re

cleaner_lyrics = []

for chunk in lyrics: 
    cleaner_chunk = re.sub() #fill in the pattern, the replacement, and the original string 
    cleaner_lyrics.append(cleaner_chunk)
    
cleaner_lyrics

Now there are two extra newlines at the beginning.

**How can we remove these?**

Hints: 
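* `\n` matches a newline in regex, just as it represents one in a Python string
* `^` at the start of a pattern (outside square brackets) matches the start of a string
* `{}` with a number inside, placed after a character, indicates how many times in a row to match that character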
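
Here is a small sketch of each hint in isolation, using a made-up string. Combining them into the pattern for the cell below is left as the exercise.

```python
import re

text = "aa-aa\nbb"

# "^" anchors the match to the start of the string,
# so only the first "aa" is removed.
anchored = re.sub(r"^aa", "", text)

# "{2}" means exactly two of the preceding character in a row.
doubled = re.findall(r"b{2}", text)

# "\n" in a pattern matches a newline character.
newline_count = len(re.findall(r"\n", text))

print(repr(anchored))
print(doubled)
print(newline_count)
```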

In [None]:
cleanest_lyrics = []

for chunk in cleaner_lyrics: 
    cleanest_chunk = re.sub() #fill in the pattern, the replacement, and the original string 
    cleanest_lyrics.append(cleanest_chunk)

cleanest_lyrics

In [None]:
# print it out to see our beautiful work:

for chunk in cleanest_lyrics:
    print(chunk)

And with that, we're done! 

Now, one last thing: we need to save our clean lyrics to a file:

In [None]:
# this is how you save to a file in python

with open("lyrics.txt", "w", encoding="utf-8") as file:
    # join the chunks with newlines; writelines() would run them together
    file.write("\n".join(cleanest_lyrics))
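
One detail worth knowing when saving a list of strings: `writelines` writes each string exactly as it is and does not add newlines between them. The sketch below shows the difference using a temporary file and two made-up chunks standing in for `cleanest_lyrics`. The file name and chunk contents are assumptions for illustration.

```python
import os
import tempfile

chunks = ["first chunk", "second chunk"]  # stand-in for cleanest_lyrics

path = os.path.join(tempfile.mkdtemp(), "lyrics_demo.txt")

# writelines() writes each string as-is, with no separator in between.
with open(path, "w", encoding="utf-8") as f:
    f.writelines(chunks)
with open(path, encoding="utf-8") as f:
    run_together = f.read()

# Joining with "\n" first keeps each chunk on its own line.
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(chunks))
with open(path, encoding="utf-8") as f:
    separated = f.read()

print(repr(run_together))
print(repr(separated))
```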

We did it!

**If time, pseudocode how we'd put all the pieces together to create our lyrics corpus**