## Cleaning Song Lyrics with Regex (and Beautiful Soup)

Regular expressions may seem inconsequential, but they turn out to be very useful when cleaning all sorts of text. Today we're going to use them to clean some text--more specifically, song lyrics--that I've scraped from Genius.com. Let's go!

In [None]:
# download the raw HTML of the song's lyrics page from GitHub
import requests

resp = requests.get('https://raw.githubusercontent.com/laurenfklein/QTM340-Fall22/main/corpora/lyrics/Beyonce-break-my-soul-lyrics.html')
html_str = resp.text

html_str

'<!doctype html>\n<html>\n  <head>\n    <title>Beyoncé – BREAK MY SOUL Lyrics | Genius Lyrics</title>\n\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n<meta content=\'width=device-width,initial-scale=1\' name=\'viewport\'>\n\n  <meta name="apple-itunes-app" content="app-id=709482991">\n\n<link href="https://assets.genius.com/images/apple-touch-icon.png?1662478452" rel="apple-touch-icon" />\n\n\n  \n\n  <link href="https://assets.genius.com/images/apple-touch-icon.png?1662478452" rel="apple-touch-icon" />\n\n  \n\n  <!-- Mobile IE allows us to activate ClearType technology for smoothing fonts for easy reading -->\n  <meta http-equiv="cleartype" content="on">\n\n\n\n\n<META name="y_key" content="f63347d284f184b0">\n\n<meta property="og:site_name" content="Genius"/>\n<meta property="fb:app_id" content="265539304824" />\n<meta property="fb:pages" content="308252472676410" />\n\n<link title="Genius" type="application/opensearchdescription+xml" rel="search" href="htt

And then the BeautifulSoup part of the process...

In [None]:
# use BeautifulSoup (an HTML parser) to find the lyrics tags on the page
from bs4 import BeautifulSoup
document = BeautifulSoup(html_str, "html.parser")

lyrics_divs = document.find("div", attrs={"data-lyrics-container": "true"})

# take a look
lyrics_divs

<div class="Lyrics__Container-sc-1ynbvzw-6 YYrds" data-lyrics-container="true">[Intro: Big Freedia &amp; <i>Beyoncé</i>]<br/><a class="ReferentFragmentdesktop__ClickTarget-sc-110r0d9-0 jvutUp" href="/26106119/Beyonce-break-my-soul/Im-bout-to-explode-take-off-this-load-bend-it-bust-it-open-wont-ya-make-it-go-yaka-yaka-yaka-yaka-yaka-yaka-yaka-yaka-yaka-yaka-yaka-yaka-yaka-yaka-yaka-yaka-release-ya-wiggle-yaka-yaka-yaka-yaka-yaka-yaka-yaka-yaka-yaka-yaka-yaka-yaka-yaka-yaka-yaka-yaka-release-ya-wiggle"><span class="ReferentFragmentdesktop__Highlight-sc-110r0d9-1 jAzSMw">I'm 'bout to explode, take off this load<br/>Bend it, bust it open, won't ya make it go<br/>Yaka-yaka, yaka-yaka, yaka-yaka, yaka-yaka<br/>Yaka-yaka, yaka-yaka, yaka-yaka, yaka-yaka (Release ya wiggle)<br/>Yaka-yaka, yaka-yaka, yaka-yaka, yaka-yaka<br/>Yaka-yaka, yaka-yaka, yaka-yaka, yaka-yaka (Release ya wiggle)</span></a><span style="position:absolute;opacity:0;width:0;height:0;pointer-events:none;z-index:-1" tabindex=

Do a quick and dirty `.get_text()` to have something to start with:

In [None]:
lyrics = lyrics_divs.get_text(separator='\n')

lyrics

"[Intro: Big Freedia & \nBeyoncé\n]\nI'm 'bout to explode, take off this load\nBend it, bust it open, won't ya make it go\nYaka-yaka, yaka-yaka, yaka-yaka, yaka-yaka\nYaka-yaka, yaka-yaka, yaka-yaka, yaka-yaka (Release ya wiggle)\nYaka-yaka, yaka-yaka, yaka-yaka, yaka-yaka\nYaka-yaka, yaka-yaka, yaka-yaka, yaka-yaka (Release ya wiggle)\nLa-la-la-la, la-la-la-la, la-la-la-la\nLa-la-la-la, la-la-la-la, la-la-la-la, la-la-la-la, la\n[Chorus: Beyoncé]\nYou won't break my soul\nYou won't break my soul\nYou won't break my soul\nYou won't break my soul\nI'm tellin' everybody\nEverybody\nEverybody\nEverybody\n[Verse 1: Beyoncé]\nNow, I just fell in love\nAnd I just quit my job\nI'm gonna find new drive\nDamn, they work me so damn hard\nWork by nine\nThen off past five\nAnd they work my nerves\nThat's why I cannot sleep at night\n[Pre-Chorus: Beyoncé]\nI'm lookin' for motivation\nI'm lookin' for a new foundation, yeah\nAnd I'm on that new vibration\nI'm buildin' my own foundation, yeah\nHold up

In [None]:
# print it out so that the newlines are rendered as actual line breaks

print(lyrics)

[Intro: Big Freedia & 
Beyoncé
]
I'm 'bout to explode, take off this load
Bend it, bust it open, won't ya make it go
Yaka-yaka, yaka-yaka, yaka-yaka, yaka-yaka
Yaka-yaka, yaka-yaka, yaka-yaka, yaka-yaka (Release ya wiggle)
Yaka-yaka, yaka-yaka, yaka-yaka, yaka-yaka
Yaka-yaka, yaka-yaka, yaka-yaka, yaka-yaka (Release ya wiggle)
La-la-la-la, la-la-la-la, la-la-la-la
La-la-la-la, la-la-la-la, la-la-la-la, la-la-la-la, la
[Chorus: Beyoncé]
You won't break my soul
You won't break my soul
You won't break my soul
You won't break my soul
I'm tellin' everybody
Everybody
Everybody
Everybody
[Verse 1: Beyoncé]
Now, I just fell in love
And I just quit my job
I'm gonna find new drive
Damn, they work me so damn hard
Work by nine
Then off past five
And they work my nerves
That's why I cannot sleep at night
[Pre-Chorus: Beyoncé]
I'm lookin' for motivation
I'm lookin' for a new foundation, yeah
And I'm on that new vibration
I'm buildin' my own foundation, yeah
Hold up, oh, baby, baby
[Chorus: Beyoncé]


This is pretty good, but it's still not perfect.

**What are some problems with this text?**

In [None]:
# remove

# We still have words between brackets that aren't really part of the lyrics

This is where regex comes in handy.

Let's remove the info between brackets.

**Which of the functions that we've discussed in class today should we use to remove it?**

Remember that `re.sub` takes the format:

`newstring = re.sub(pattern, replacement, original_string)`

When we want to get rid of something without replacing it, we use `""` as the replacement string.

So, in our case, we can do something like:

`cleaner_lyrics = re.sub(pattern, "", lyrics)`
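To see the shape of the call in isolation, here is a minimal sketch on a made-up string. The `sample` text and the digit pattern are assumptions for illustration only; they are not the answer to the exercise below.

```python
import re

# A made-up string, used only to illustrate the call
sample = "Track 12: Intro 2022"

# Replace every run of digits with the empty string, i.e. delete it
no_digits = re.sub(r"\d+", "", sample)

print(repr(no_digits))  # 'Track : Intro '
```

Note that `re.sub` returns a new string and leaves `sample` untouched, which is why the result is assigned to a new variable.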

**But what goes in the pattern?**

**Exercise: How can we use regex to remove the brackets and the info between them?**

Hints:
* To match a literal `[`, you need to escape it like this: `\[`
* A `.` matches any single character
* A `*` after a character matches zero or more instances of that character
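These hints can be tried out on a short made-up string before touching the lyrics. The sketch below assumes a toy `line` with two bracketed tags. One thing it shows is that `*` is greedy: it grabs as much as it can, so on a single line the match runs to the last `]`.

```python
import re

# A made-up line with two bracketed tags
line = "[Verse 1] hello [Chorus] world"

# \[ matches a literal "[", . matches any character except a newline,
# and * repeats the preceding element zero or more times (greedily)
greedy = re.findall(r"\[.*\]", line)

# Adding ? after * makes the repetition lazy: it stops at the first "]"
lazy = re.findall(r"\[.*?\]", line)

print(greedy)  # ['[Verse 1] hello [Chorus]']
print(lazy)    # ['[Verse 1]', '[Chorus]']
```

In the lyrics each header sits on its own line, so the greedy and lazy versions behave the same there, but the difference matters as soon as two tags share a line.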

In [None]:
import re

cleaner_lyrics = re.sub(r"\[.*\]", "", lyrics)  # a raw string (r"...") keeps the backslashes intact for the regex engine

print(cleaner_lyrics)

[Intro: Big Freedia & 
Beyoncé
]
I'm 'bout to explode, take off this load
Bend it, bust it open, won't ya make it go
Yaka-yaka, yaka-yaka, yaka-yaka, yaka-yaka
Yaka-yaka, yaka-yaka, yaka-yaka, yaka-yaka (Release ya wiggle)
Yaka-yaka, yaka-yaka, yaka-yaka, yaka-yaka
Yaka-yaka, yaka-yaka, yaka-yaka, yaka-yaka (Release ya wiggle)
La-la-la-la, la-la-la-la, la-la-la-la
La-la-la-la, la-la-la-la, la-la-la-la, la-la-la-la, la

You won't break my soul
You won't break my soul
You won't break my soul
You won't break my soul
I'm tellin' everybody
Everybody
Everybody
Everybody

Now, I just fell in love
And I just quit my job
I'm gonna find new drive
Damn, they work me so damn hard
Work by nine
Then off past five
And they work my nerves
That's why I cannot sleep at night

I'm lookin' for motivation
I'm lookin' for a new foundation, yeah
And I'm on that new vibration
I'm buildin' my own foundation, yeah
Hold up, oh, baby, baby

You won't break my soul (Na, na)
You won't break my soul (No-no, na, na)


In [None]:
# prompt: Give me the regex for matching the following pattern: a square bracket, followed by any characters, and closed by a square bracket. The characters can be spread across multiple lines

r"\[.*?\]"

In [None]:
cleanest_lyrics = re.sub(r"\[.*?\]", "", cleaner_lyrics, flags=re.DOTALL) #fill in the pattern and the replacement

print(cleanest_lyrics)


I'm 'bout to explode, take off this load
Bend it, bust it open, won't ya make it go
Yaka-yaka, yaka-yaka, yaka-yaka, yaka-yaka
Yaka-yaka, yaka-yaka, yaka-yaka, yaka-yaka (Release ya wiggle)
Yaka-yaka, yaka-yaka, yaka-yaka, yaka-yaka
Yaka-yaka, yaka-yaka, yaka-yaka, yaka-yaka (Release ya wiggle)
La-la-la-la, la-la-la-la, la-la-la-la
La-la-la-la, la-la-la-la, la-la-la-la, la-la-la-la, la

You won't break my soul
You won't break my soul
You won't break my soul
You won't break my soul
I'm tellin' everybody
Everybody
Everybody
Everybody

Now, I just fell in love
And I just quit my job
I'm gonna find new drive
Damn, they work me so damn hard
Work by nine
Then off past five
And they work my nerves
That's why I cannot sleep at night

I'm lookin' for motivation
I'm lookin' for a new foundation, yeah
And I'm on that new vibration
I'm buildin' my own foundation, yeah
Hold up, oh, baby, baby

You won't break my soul (Na, na)
You won't break my soul (No-no, na, na)
You won't break my soul (No-no, 

That's almost all fixed.

**How can we remove that first bit of info in the intro?**

Hints:
* `.` matches any character... except for newlines!
* You can do this with a few iterations if you need to
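The newline hint can be checked on a small made-up string. This sketch assumes a toy `text` whose first header is split across lines, like the intro header in the lyrics; `re.DOTALL` is the standard flag that lets `.` match newlines too.

```python
import re

# A made-up string: the first header spans three lines, the second is on one line
text = "[Intro: A & \nB\n]\nfirst line\n[Chorus: B]\nsecond line"

# By default . stops at a newline, so the multi-line header survives
default = re.sub(r"\[.*?\]", "", text)

# With re.DOTALL, . also matches "\n", so both headers are removed
dotall = re.sub(r"\[.*?\]", "", text, flags=re.DOTALL)

# Greedy .* with re.DOTALL runs from the first "[" to the last "]",
# deleting the lyrics in between as well
greedy_dotall = re.sub(r"\[.*\]", "", text, flags=re.DOTALL)

print(repr(default))        # '[Intro: A & \nB\n]\nfirst line\n\nsecond line'
print(repr(dotall))         # '\nfirst line\n\nsecond line'
print(repr(greedy_dotall))  # '\nsecond line'
```

The last result is why the lazy `.*?` matters once `.` is allowed to cross line boundaries.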

In [None]:
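pattern = r""     # fill in the pattern (left empty, the text comes back unchanged)
replacement = ""  # fill in the replacement
cleanest_lyrics = re.sub(pattern, replacement, cleaner_lyrics)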

print(cleanest_lyrics)

And with that, we're done!

Now, one last thing. We need to save our clean lyrics to a file:

In [None]:
# this is how you save a file from Colab to Google Drive

from google.colab import drive
drive.mount('/content/gdrive')

with open("/content/gdrive/My Drive/lyrics.txt", "w", encoding="utf-8") as file:
    # cleanest_lyrics is a single string, so write() is the right call
    file.write(cleanest_lyrics)
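Outside of Colab there is no Drive to mount. A minimal sketch, assuming a hypothetical `lyrics.txt` inside a temporary directory and a short stand-in string named `cleaned`, writes the text and reads it back to confirm that nothing was lost:

```python
import tempfile
from pathlib import Path

# A short stand-in for the cleaned lyrics (illustration only)
cleaned = "You won't break my soul\nI'm tellin' everybody\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical output path; swap in any folder you can write to
    out_path = Path(tmp_dir) / "lyrics.txt"

    # Explicit UTF-8 keeps accented characters such as "é" intact
    out_path.write_text(cleaned, encoding="utf-8")

    # Read the file back to check the round trip
    round_trip = out_path.read_text(encoding="utf-8")

print(round_trip == cleaned)  # True
```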

We did it!

_Lauren Klein wrote version 1.0 of this notebook. Dan Sinykin updated it in 2020 and Lauren Klein updated it again in 2021, 2022, and 2024._
