# Scraping lyrics for songs in a Spotify playlist

In this notebook, we'll use Spotipy to get the list of songs in a playlist. Then, we'll scrape the lyrics for these songs from Genius.com and write them to a `.txt` file for downstream analysis.

In [16]:
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from bs4 import Tag, NavigableString, BeautifulSoup
import requests
import itertools
from configparser import ConfigParser
import re
import string
from collections.abc import Iterable


# Get playlist songs

First, we'll read our client ID and client secret (used to authenticate the Spotify client) from a file called `config.ini`. Then, we'll use the values from this config file to initialize Spotipy.
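
For reference, a `config.ini` that matches the lookups in this notebook might look like the string in the sketch below. The section name `SPOTIFYINFO` and the keys `client_id` and `client_secret` come from the code in this notebook; the values are placeholders, and `read_string` is used only so the sketch runs without a file on disk.

```python
from configparser import ConfigParser

# Placeholder contents for config.ini; replace the values with your own
# credentials (these strings are assumptions for illustration only).
EXAMPLE_CONFIG = """
[SPOTIFYINFO]
client_id = your-client-id
client_secret = your-client-secret
"""

example_config = ConfigParser()
example_config.read_string(EXAMPLE_CONFIG)  # same parser, fed from a string

client_id = example_config["SPOTIFYINFO"]["client_id"]
client_secret = example_config["SPOTIFYINFO"]["client_secret"]
print(client_id, client_secret)
```

Keeping the credentials in a separate file also makes it easier to keep them out of version control.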

In [2]:
config_object = ConfigParser()
config_object.read("config.ini")

['config.ini']

In [4]:
client_credentials_manager = SpotifyClientCredentials(
    client_id=config_object['SPOTIFYINFO']['client_id'],
    client_secret=config_object['SPOTIFYINFO']['client_secret'],
)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

In [5]:
# read in a user's playlist link (and, if it's a URL, convert it to a URI)
def parse_user_input(x):
    # if it's already a URI, return it unchanged
    colon_split = x.split(":")
    if len(colon_split) == 3 and colon_split[0] == 'spotify' and colon_split[1] == 'playlist':
        return x
    # otherwise convert the URL to a URI, dropping any query string such as "?si=..."
    elif "open.spotify.com/playlist/" in x:
        playlist_id = x.split("?")[0].rstrip("/").split("/")[-1]
        return 'spotify:playlist:' + playlist_id
    # unrecognized input
    else:
        return ""

Once the client is set up, we'll fetch the playlist using a link to it, which can be either a URI or a URL. The Spotipy `playlist` function takes a URI, and the `parse_user_input` function defined above converts a URL to a URI.
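
As a sketch of the URL-to-URI conversion described above, the hypothetical helper below uses `urllib.parse` and checks the host name, so a trailing query string such as `?si=...` never ends up in the URI. The name `playlist_uri_from_link` is an assumption for illustration, not part of Spotipy.

```python
from urllib.parse import urlparse

def playlist_uri_from_link(link):
    """Hypothetical helper: return a Spotify playlist URI, or "" if unrecognized."""
    parts = link.split(":")
    if len(parts) == 3 and parts[:2] == ["spotify", "playlist"]:
        return link  # already a URI
    parsed = urlparse(link)
    segments = [s for s in parsed.path.split("/") if s]
    # Expect a path such as /playlist/<id>; the query string is ignored.
    if (parsed.netloc == "open.spotify.com"
            and len(segments) >= 2
            and segments[-2] == "playlist"):
        return "spotify:playlist:" + segments[-1]
    return ""

print(playlist_uri_from_link(
    "https://open.spotify.com/playlist/79yBBLuDegzKj6dTcH8xH1?si=1a07281e992847d1"
))
```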

In [7]:
plist = sp.playlist(parse_user_input("https://open.spotify.com/playlist/79yBBLuDegzKj6dTcH8xH1?si=1a07281e992847d1"))
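
Note that `plist['tracks']` is a paged result: the Spotify Web API returns at most 100 tracks per page, so for a longer playlist `plist['tracks']['items']` holds only the first page. A minimal sketch of collecting every page follows. It relies on Spotipy's `next` method, which takes a paged result and returns the following page (or `None`). The helper name `collect_playlist_items` and the `FakeClient` stand-in are assumptions that let the sketch run offline.

```python
def collect_playlist_items(client, first_page):
    """Hypothetical helper: gather the items from every page of a paged result."""
    items = list(first_page["items"])
    page = first_page
    while page.get("next"):
        page = client.next(page)  # with Spotipy, this fetches the following page
        if page is None:
            break
        items.extend(page["items"])
    return items


class FakeClient:
    """Stand-in for `sp` so the sketch runs offline; it mimics only `next`."""

    def __init__(self, pages):
        self._pages = pages

    def next(self, page):
        return self._pages.get(page["next"])


second = {"items": ["c"], "next": None}
first = {"items": ["a", "b"], "next": "page-2"}
print(collect_playlist_items(FakeClient({"page-2": second}), first))
```

With a real client, the call would be along the lines of `collect_playlist_items(sp, plist['tracks'])`.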

# Scrape lyrics from Genius

In [10]:
# given an entry (representing a song in our playlist), return the url to its genius page
def convert_to_genius_url(entry):
    name = clean_song_name(entry['track']['name'])
    # only the first listed artist is used in the url
    artist = clean_song_name(entry['track']['artists'][0]['name'])
    return f"https://genius.com/{artist}-{name}-lyrics"

In [11]:
# convert song name to its representation in the genius url (spaces -> dashes, remove punctuation, etc.)
def clean_song_name(name):
    if "(" in name:
        in_parenthesis = re.findall(r'\(.*?\)', name)

        for x in in_parenthesis:
            # drop the parenthetical if it marks a version, edit, remix, or remaster
            # (compare in lowercase so that e.g. "(Radio Edit)" is caught too)
            if any(word in x.lower() for word in ("version", "edit", "remix", "remaster")):
                name = name.replace(x, "")
            # otherwise it's part of the song name, so keep the text without the parentheses
            else:
                name = name.replace(x, x.replace("(", "").replace(")", ""))
    # drop suffixes such as " - Remastered"; split on " - " so hyphenated words survive
    if " - " in name:
        name = name.split(" - ")[0]

    regex = re.compile('|'.join(map(re.escape, [".", ",", "!", "?"])))
    return regex.sub("", name.strip().replace(" ", "-").lower())

First, we look at each song in our playlist and construct the URL of its expected page on genius.com.

In [14]:
urls = [convert_to_genius_url(t1) for t1 in plist['tracks']['items']]

In [15]:
urls[8]

'https://genius.com/queens-of-the-stone-age-no-one-knows-lyrics'
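
To make the mapping concrete, the sketch below builds the same URL from a mock playlist entry that contains only the fields this notebook reads (`track.name` and the first artist's `name`). The `slug` function is a hypothetical, simplified stand-in for `clean_song_name`, and the mock `entry` is an assumption about the minimal shape of a playlist item, not a full API response.

```python
import re

def slug(text):
    """Hypothetical, simplified stand-in for `clean_song_name`."""
    text = re.sub(r"[.,!?]", "", text.strip().lower())  # remove basic punctuation
    return re.sub(r"\s+", "-", text)                    # spaces -> dashes

# A mock playlist item with only the fields the notebook reads.
entry = {
    "track": {
        "name": "No One Knows",
        "artists": [{"name": "Queens of the Stone Age"}],
    }
}

artist = slug(entry["track"]["artists"][0]["name"])
title = slug(entry["track"]["name"])
url = f"https://genius.com/{artist}-{title}-lyrics"
print(url)
```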

In [15]:
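def scrape_lyrics(url):
    page = requests.get(url, timeout=10)
    if page.status_code == 200:
        html = BeautifulSoup(page.text, 'html.parser')
        # note: this class name looks auto-generated and may change if genius updates its site
        lyrics = html.find_all("div", class_="Lyrics__Container-sc-1ynbvzw-6 jYfhrf")
        return list(flatten_deep([x.contents for x in lyrics]))
    else:
        # page not found (or another error): return no lyrics
        return []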

In [17]:
def clean_line(line):
    # a tag (e.g. a link or <br>): recurse into all of its children
    if isinstance(line, Tag):
        if len(line.contents) > 0:
            return [clean_line(x) for x in line.contents]
        else:
            return ""
    # a plain text node: return its text
    elif isinstance(line, NavigableString):
        return str(line)
    else:
        return ""

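def remove_tags(line):
    # section labels that appear in brackets, e.g. "[chorus]" or "[verse 1]"
    # ('prechorus' has no hyphen because punctuation is stripped before the check)
    to_remove = ['chorus','verse','bridge','intro','outro','prechorus','hook','refrain','solo']
    # strip punctuation (including the brackets) before checking how the line starts;
    # note that a lyric line starting with one of these words is dropped as well
    res = re.sub(r'[^\w\s]', '', line)
    if any(res.startswith(x) for x in to_remove):
        return ""
    # drop fragments that are too short to be a lyric line
    elif len(line) < 3:
        return ""

    return line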
    
def clean_lyrics(lyrics):
    cleaned = [clean_line(x) for x in lyrics]
    cleaned = [x.lower() for x in flatten_deep(cleaned) if x != ""]
    cleaned = [remove_tags(x) for x in cleaned]
    return [x for x in cleaned if x != ""]
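
Because `remove_tags` checks only how a line starts, a lyric such as "hook me up tonight" would be filtered out along with the section headers. A stricter sketch is shown below, under the assumption that each header arrives as one fully bracketed, lowercased string; a header that is split across several HTML nodes would not be caught. The names `SECTION_HEADER` and `is_section_header` are hypothetical.

```python
import re

# Matches a whole line that is a bracketed header,
# e.g. "[chorus]" or "[verse 2: artist]".
SECTION_HEADER = re.compile(
    r"^\[(?:pre-chorus|chorus|verse|bridge|intro|outro|hook|refrain|solo)\b[^\]]*\]$"
)

def is_section_header(line):
    """Hypothetical helper: True only for fully bracketed section headers."""
    return bool(SECTION_HEADER.match(line.strip().lower()))

sample = ["[verse 1]", "hook me up tonight", "[chorus: artist]", "no one knows"]
print([line for line in sample if not is_section_header(line)])
```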

In [18]:
# adapted from: https://github.com/jorgeorpinel/flatten_nested_lists/blob/master/flatten.py
# rewritten as a recursive generator that leaves strings whole and does not modify its input
def flatten_deep(arr):
    """ Lazily flattens arbitrarily-nested list `arr` into a single-dimensional sequence. """

    for item in arr:
        if isinstance(item, list):  # nested list: flatten it recursively
            yield from flatten_deep(item)
        else:
            yield item  # strings and other non-list items are yielded as-is

This will be done within a loop, but let's take a quick look at how this works for one song:

In [19]:
lyric = scrape_lyrics(urls[8])

In [21]:
cleaned = clean_lyrics(lyric)
cleaned

['we get some rules to follow',
 'that and this, these and those',
 'no one knows',
 'we get these pills to swallow',
 'how they stick in your throat',
 'tastes like gold',
 'oh, what you do to me',
 'no one knows',
 "and i realize you're mine",
 'indeed a fool am i',
 "and i realize you're mine",
 'indeed a fool am i',
 'i journey through the desert',
 'of the mind with no hope',
 'i follow',
 'i drift along the ocean',
 'dead lifeboat in the sun',
 'end come undone',
 'pleasantly caving in',
 'i come undone',
 "and i realize you're mine",
 'indeed a fool am i',
 "and i realize you're mine",
 'indeed a fool am i',
 'heaven smiles above me',
 'what a gift here below',
 'but no one knows',
 'a gift that you give to me',
 'no one knows',
 '…buenas tardes señores y señoritas aquí está el',
 '"dj héctor bonifacio echevarría cervantes de la cruz arroyo rojas"',
 'esta es la radio quetzalcoatl',
 'estación donde el rock vive y no muere',
 'vamos a escuchar un par de temas de queens of the st

Now, we can scrape lyrics for all the songs in our playlist and save to a text file to use downstream.

In [22]:
all_lyrics = [clean_lyrics(scrape_lyrics(x)) for x in urls]
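
When scraping many pages in a row, it can help to pause between requests and to keep track of the URLs that returned nothing, since a guessed Genius URL may not exist. The sketch below shows one way to structure such a loop. The helper `scrape_all`, its `fetch` parameter, and the one-second default delay are all assumptions for illustration, and the demonstration uses a fake fetch function so that it runs offline.

```python
import time

def scrape_all(urls, fetch, delay_seconds=1.0):
    """Hypothetical helper: call `fetch` on each URL, pausing between requests.

    Returns (results, failed), where `failed` lists the URLs that produced nothing.
    """
    results, failed = [], []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # be gentle with the remote server
        lines = fetch(url)
        if lines:
            results.append(lines)
        else:
            failed.append(url)  # e.g. the guessed URL does not exist
    return results, failed

# Offline demonstration with a fake fetch function instead of real requests.
fake_pages = {"https://example.com/a": ["line one", "line two"]}
results, failed = scrape_all(
    ["https://example.com/a", "https://example.com/b"],
    fetch=lambda url: fake_pages.get(url, []),
    delay_seconds=0,
)
print(results, failed)
```

In this notebook, `fetch` could be something like `lambda u: clean_lyrics(scrape_lyrics(u))`, and the `failed` list would show which song names need manual correction.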

In [26]:
playlist_lyrics = list(flatten_deep(all_lyrics))

In [28]:
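# write one lyric line per row; set the encoding explicitly because lyrics can contain non-ASCII characters
with open('playlist.txt', 'w', encoding='utf-8') as f:
    for item in playlist_lyrics:
        f.write(f"{item}\n")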
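
As a quick round-trip check, the sketch below writes a couple of sample lines and reads them back, one lyric line per row. It assumes UTF-8 for both steps, which matters because lyrics can contain non-ASCII characters (as in the Spanish lines in the output above). The temporary file name is a placeholder so that `playlist.txt` is not overwritten.

```python
import os
import tempfile

lines = ["no one knows", "buenas tardes señores y señoritas"]

# Write to a temporary file so the sketch does not overwrite playlist.txt.
path = os.path.join(tempfile.mkdtemp(), "playlist_example.txt")
with open(path, "w", encoding="utf-8") as f:
    for line in lines:
        f.write(f"{line}\n")

# Read the lines back, dropping the trailing newline from each one.
with open(path, encoding="utf-8") as f:
    loaded = [line.rstrip("\n") for line in f]

print(loaded == lines)
```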