## Python Lyrics Data Collection Script Overview

This script gathers lyrics data from the Genius API. For details on how to register for the Genius API and obtain an API token, visit the [Genius API documentation](https://docs.genius.com/).

The code can be adapted to collect lyrics for any artist. Pass your API token to `Genius(...)`, replace "Taylor Swift" with the name of the artist whose lyrics you want, and replace the entries in the `albums` list with that artist's albums.
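
One way to keep the token out of the notebook is to read it from an environment variable and keep the artist and album names in one place. The sketch below is only an illustration: `load_token`, `ARTIST`, `ALBUMS`, and the variable name `GENIUS_ACCESS_TOKEN` are assumptions made for this example, not part of the original script.

```python
import os

# Settings to edit when targeting a different artist (names are illustrative)
ARTIST = "Taylor Swift"
ALBUMS = ["Lover", "folklore", "evermore", "Midnights"]

def load_token(env_var="GENIUS_ACCESS_TOKEN"):
    """Return the API token stored in the given environment variable."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set {env_var} to your Genius API token first")
    return token
```

The returned value would then be passed as `Genius(load_token())`, and `ARTIST` would take the place of the hard-coded artist name in the `genius.search_album` call.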


In [1]:
!pip install lyricsgenius


In [2]:
import re

def process_string(input_string):
    # Cut the beginning of the string till "Lyrics[Verse 1]\n"
    start_phrase = "Lyrics[Verse 1]\n"
    start_index = input_string.find(start_phrase)
    if start_index != -1:
        processed_string = input_string[start_index + len(start_phrase):]
    else:
        processed_string = input_string 

    # Remove all instances of text enclosed in square brackets
    processed_string = re.sub(r"\[[^\]]*\]", "", processed_string)

    # Cut the end of the string that contains a number followed by "Embed"
    # Also removing the number before "Embed"
    end_index = processed_string.rfind("Embed")
    if end_index != -1:
        # Find the start of the number preceding "Embed"
        number_start = end_index
        while number_start > 0 and processed_string[number_start-1].isdigit():
            number_start -= 1
        processed_string = processed_string[:number_start].rstrip()

    return processed_string
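
`process_string` only strips the page header when the lyrics open with `[Verse 1]`. As a hedged sketch, the variant below accepts any section tag after `Lyrics` (for example `[Intro]`) and uses regular expressions for the other two steps. The name `clean_lyrics` and the sample string are hypothetical. Unlike `process_string`, it only removes an `Embed` marker at the very end of the string, and it strips leading whitespace as well as trailing.

```python
import re

def clean_lyrics(raw):
    """Variant of process_string that accepts any section tag after 'Lyrics'."""
    # Drop the page header: everything up to "Lyrics" when a "[" tag follows it
    match = re.search(r"Lyrics(?=\[)", raw)
    text = raw[match.end():] if match else raw
    # Remove section tags such as [Verse 1] or [Chorus]
    text = re.sub(r"\[[^\]]*\]", "", text)
    # Remove a trailing "Embed" marker and any digits directly before it
    text = re.sub(r"\d*Embed\s*$", "", text)
    return text.strip()

# Hypothetical input, shaped like the strings process_string expects
sample = "Example Song Lyrics[Intro]\nFirst line\n\n[Chorus]\nSecond line12Embed"
print(clean_lyrics(sample))
```

Blank lines left behind by removed tags are kept, as in the original function; they are filtered out later when the files are read into a DataFrame.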

In [7]:
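import os
from lyricsgenius import Genius

genius = Genius('YOUR_TOKEN')  # replace with your Genius API token
albums = ["Lover", "folklore", "evermore", "Midnights"]
output_dir = "./songs"  # the DataFrame cell below reads from this directory
os.makedirs(output_dir, exist_ok=True)

for name in albums:
    album = genius.search_album(name, "Taylor Swift")
    for track in album.tracks:
        print(track.song.title)
        # One pass is enough: a second call would find nothing left to remove
        song_lyrics = process_string(track.song.lyrics)
        # Filename encodes album, release year, and title, separated by "_%_"
        filename = f"{name}_%_{album.release_date_components.year}_%_{track.song.title}"
        with open(os.path.join(output_dir, filename), 'w') as file:
            file.write(song_lyrics)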


Searching for "Lover" by Taylor Swift...
I Forgot That You Existed
Cruel Summer
Lover
The Man
The Archer
I Think He Knows
Miss Americana & The Heartbreak Prince
Paper Rings
Cornelia Street
Death By A Thousand Cuts
London Boy
Soon You’ll Get Better
False God
You Need To Calm Down
Afterglow
ME!
It’s Nice To Have A Friend
Daylight
Searching for "folklore" by Taylor Swift...
​the 1
​cardigan
​the last great american dynasty
​exile
​my tears ricochet
​mirrorball
​seven
​august
​this is me trying
​illicit affairs
​invisible string
​mad woman
​epiphany
​betty
​peace
​hoax
Searching for "evermore" by Taylor Swift...
​willow
​champagne problems
​gold rush
​’tis the damn season
​tolerate it
​no body, no crime
​happiness
​dorothea
​coney island
​ivy
​cowboy like me
​l​ong story short
​marjorie
​closure
​evermore
Searching for "Midnights" by Taylor Swift...
Lavender Haze
Maroon
Anti-Hero
Snow On The Beach
You’re On Your Own, Kid
Midnight Rain
Question...?
Vigilante Shit
Bejeweled
Labyrinth
Karma
S
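
The download loop encodes album, year, and track title in each filename, joined by `_%_`, and the next cell splits the filename on the same separator to recover them. A minimal sketch of that round trip follows. `build_filename` and `parse_filename` are hypothetical helpers, and replacing path separators in the title is an added precaution that the original loop does not take.

```python
SEPARATOR = "_%_"

def build_filename(album, year, title):
    """Join the three fields the same way the download loop does."""
    # Replace path separators so a title cannot point outside the target directory
    safe_title = title.replace("/", "-").replace("\\", "-")
    return SEPARATOR.join([album, str(year), safe_title])

def parse_filename(filename):
    """Split a filename back into (album, year, title); year stays a string."""
    album, year, title = filename.split(SEPARATOR, 2)
    return album, year, title

name = build_filename("folklore", 2020, "this is me trying")
print(name)                  # folklore_%_2020_%_this is me trying
print(parse_filename(name))  # ('folklore', '2020', 'this is me trying')
```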

In [14]:
import os
import pandas as pd

directory = "./songs"

data = []

# Process each file in the directory
i = 0
save_album = None
files_list = os.listdir(directory)
files_list.sort(reverse=True)

for filename in files_list:
    # Extract album, year, and track_title from the filename
    parts = filename.split('_%_')
    album, year, track_title = parts[0], parts[1], parts[2]
    # Restart the track counter whenever a new album begins
    if save_album and save_album != album:
        i = 0
    save_album = album

    with open(os.path.join(directory, filename), 'r') as file:
        lines = file.readlines()

    # Process each line in the file
    for line_number, line in enumerate(lines, start=1):
        if line.strip():  # Exclude empty lines
            data.append({
                'artist': 'Taylor Swift',
                'album': album,
                'track_title': track_title,
                # Position in the reverse-sorted file list, not the album's track order
                'track_n': i + 1,
                'lyric': line.strip(),
                'line': line_number,
                'year': year
            })
    i = i + 1

df = pd.DataFrame(data)
df['track_title'] = df['track_title'].str.replace('\u200b', '', regex=False)

df.head()  



| | artist | album | track_title | track_n | lyric | line | year |
|---|---|---|---|---|---|---|---|
| 0 | Taylor Swift | folklore | this is me trying | 1 | I've been having a hard time adjusting | 1 | 2020 |
| 1 | Taylor Swift | folklore | this is me trying | 1 | I had the shiniest wheels, now they're rusting | 2 | 2020 |
| 2 | Taylor Swift | folklore | this is me trying | 1 | I didn't know if you'd care if I came back | 3 | 2020 |
| 3 | Taylor Swift | folklore | this is me trying | 1 | I have a lot of regrets about that | 4 | 2020 |
| 4 | Taylor Swift | folklore | this is me trying | 1 | Pulled the car off the road to the lookout | 5 | 2020 |
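
The `line` column comes from `enumerate` over every line of the file, blank lines included, so its values can skip a number wherever a blank line was dropped. The sketch below isolates that behavior on a small string; `lyric_rows` is a hypothetical helper, not a function from the script.

```python
def lyric_rows(text):
    """Number every line, then keep only the non-empty ones."""
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if line.strip():  # blank lines are skipped but still counted
            rows.append({"line": line_number, "lyric": line.strip()})
    return rows

print(lyric_rows("First line\n\nSecond line\n"))
# [{'line': 1, 'lyric': 'First line'}, {'line': 3, 'lyric': 'Second line'}]
```

If consecutive numbering is preferred, the blank lines could be filtered out before calling `enumerate`.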


In [15]:
df.to_parquet('../data/taylor_swift_4_albums.parquet')