## Cleaning Song Lyrics with Regex (and more Beautiful Soup)

As part of our web scraping lesson last week, we did something like this:

In [None]:
# get the text of some song lyrics
import requests 

resp = requests.get('https://raw.githubusercontent.com/laurenfklein/QTM340-Fall22/main/corpora/lyrics/Beyonce-break-my-soul-lyrics.html')
html_str = resp.text

And then the BeautifulSoup part of the process...

In [None]:
# use BeautifulSoup to find the lyrics tags on the page
from bs4 import BeautifulSoup
document = BeautifulSoup(html_str, "html.parser")

lyrics_divs = document.find("div", attrs={"data-lyrics-container": "true"})

# make sure we got something
lyrics_divs
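
A quick aside on `find`: it returns only the first tag that matches, while `find_all` returns every match. Here is a minimal sketch of the difference. The `toy_html` string and its contents are made up for illustration and are not taken from the real lyrics page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a page with two lyrics containers
toy_html = """
<div data-lyrics-container="true">First verse</div>
<div class="ad">Not lyrics</div>
<div data-lyrics-container="true">Second verse</div>
"""
toy_doc = BeautifulSoup(toy_html, "html.parser")

# find() returns only the first matching tag (or None if nothing matches)
first_div = toy_doc.find("div", attrs={"data-lyrics-container": "true"})
print(first_div.get_text())

# find_all() returns a list-like collection of every matching tag
all_divs = toy_doc.find_all("div", attrs={"data-lyrics-container": "true"})
all_texts = [div.get_text() for div in all_divs]
print(all_texts)
```

If a page ever splits its lyrics across several containers, `find_all` plus a loop is one way to collect them all.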

And then we did a quick and dirty `.get_text()`:

In [None]:
lyrics = lyrics_divs.get_text()

lyrics

One thing I forgot to mention last class is that you can pass a delimiter to the `get_text` method through its `separator` argument, so that you don't end up with all of those words jammed together. This is how you do it:

In [None]:
lyrics = lyrics_divs.get_text(separator='\n')
    
lyrics

In [None]:
# let's print it out so that we get the newlines formatted properly

print(lyrics)
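
To see the effect of `separator` in isolation, here is a small sketch on a made-up snippet of HTML. The `toy_div` contents are an assumption for illustration only; the real lyrics page is much messier.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: three lines of text separated by <br/> tags
toy_div = BeautifulSoup(
    "<div>Line one<br/>Line two<br/>Line three</div>", "html.parser"
).find("div")

# Without a separator, the pieces of text are joined directly together
jammed = toy_div.get_text()
print(repr(jammed))

# With a separator, each piece of text is joined with a newline
separated = toy_div.get_text(separator="\n")
print(separated)
```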

This is pretty good. But it's still not perfect.

**What are some problems with this text?**

This is where regex comes in handy.

Let's remove the info between brackets.

**Which of the functions that we've discussed in class today should we use to remove it?**

Remember that `re.sub` takes the format:

`newstring = re.sub(pattern, replacement, original_string)`

When we want to get rid of something without replacing it, we use `""` as the replacement string.

So, in our case, we can do something like:

`cleaner_lyrics = re.sub(pattern, "", lyrics)`
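
Before working out the lyrics pattern, it may help to see `re.sub` run on a toy string. This sketch uses a made-up string and a digit pattern (`\d+`), which is not the pattern the exercise needs.

```python
import re

# Hypothetical example string, not from the lyrics
toy_text = "Track 12 of 20"

# An empty replacement string deletes every match
no_digits = re.sub(r"\d+", "", toy_text)
print(repr(no_digits))  # 'Track  of '

# A non-empty replacement swaps each match for the new text
masked = re.sub(r"\d+", "#", toy_text)
print(masked)  # Track # of #

# re.sub returns a new string; the original is left unchanged
print(toy_text)  # Track 12 of 20
```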

**But what goes in the pattern?**

**Exercise: How can we use regex to remove the brackets and the info between them?**

Hints:
* In order to look for a `[`, you need to escape it like this `\[`
* A `.` represents any character
* A `*` after a character matches zero or more instances of that character
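
Each hint can be tried on its own before you combine them. The sketch below uses `re.findall` on made-up strings to show what each piece matches; it deliberately stops short of the full answer to the exercise.

```python
import re

# An escaped bracket matches a literal "[" character
bracket_matches = re.findall(r"\[", "a[b]c")
print(bracket_matches)  # ['[']

# "." matches any single character (so "ac" alone does not match "a.c")
dot_matches = re.findall(r"a.c", "abc a-c ac")
print(dot_matches)  # ['abc', 'a-c']

# "*" means zero or more of the character just before it
star_matches = re.findall(r"ab*", "a ab abbb")
print(star_matches)  # ['a', 'ab', 'abbb']
```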

In [None]:
import re

cleaner_lyrics = re.sub() #fill in the pattern, the replacement, and the original string 

print(cleaner_lyrics)

That's almost all fixed. 

**How can we remove that first bit of info in the intro?**

Hints: 
* `.` matches any character... except for newlines!
* You can do this with a few iterations if you need to
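
Here is a small sketch of both hints on a made-up three-line string. The `re.DOTALL` flag is one standard way to let `.` cross newlines; it is shown as an option, not as the required approach for the exercise.

```python
import re

# Hypothetical multi-line text, not from the lyrics
toy_text = "start\nmiddle\nend"

# By default "." stops at a newline, so ".*" only covers the first line
first_line = re.match(r".*", toy_text).group()
print(repr(first_line))  # 'start'

# With flags=re.DOTALL, "." matches newlines too
everything = re.match(r".*", toy_text, flags=re.DOTALL).group()
print(repr(everything))  # 'start\nmiddle\nend'

# Alternatively, clean up in several passes, one line at a time
step_one = re.sub(r"start\n", "", toy_text)
step_two = re.sub(r"middle\n", "", step_one)
print(repr(step_two))  # 'end'
```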

In [None]:
cleanest_lyrics = re.sub(,  , cleaner_lyrics) #fill in the pattern and the replacement
print(cleanest_lyrics)



And with that, we're done! 

Now, one last thing: we need to save our clean lyrics to a file:

In [None]:
# this is how you save a file from Colab to Google Drive

from google.colab import drive
drive.mount('/content/gdrive')

with open("/content/gdrive/My Drive/lyrics.txt", "w", encoding="utf-8") as file:
    # write() takes a single string; writelines() is meant for a list of strings
    file.write(cleanest_lyrics)
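
If you are not working in Colab, the same idea works with a local path. This is a minimal sketch that writes a made-up string to a temporary folder and reads it back to check the result; `toy_lyrics` and `out_path` are placeholder names, so swap in your own text and location.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for the cleaned lyrics
toy_lyrics = "First line\nSecond line\n"

# Write to a throwaway folder so the example does not touch your real files
out_path = Path(tempfile.mkdtemp()) / "lyrics.txt"

with open(out_path, "w", encoding="utf-8") as file:
    file.write(toy_lyrics)

# Read the file back to confirm that it holds what we wrote
with open(out_path, encoding="utf-8") as file:
    round_trip = file.read()

print(round_trip == toy_lyrics)  # True
```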

We did it!

_Lauren F. Klein wrote version 1.0 of this notebook. Dan Sinykin updated it in 2020 and Lauren Klein updated it again in 2021 and 2022._
