## Scraping song lyrics using Genius API

Import necessary packages

In [1]:
import sys
import json
import requests
import pandas as pd
from scrapy import Selector
from pprint import pprint
import re

# sys.path.append('../dees_tools')
# from deestools import *

Open JSON file containing credentials

In [2]:
credentials_file_path = '../credentials.json'

with open(credentials_file_path, 'r') as f:
    credentials = json.load(f)

Initialise a new session

In [3]:
my_session = requests.Session()

I created a custom function `generate_song_url` to generate the Genius page URL for a song using the title and artist of the song.

**Note: this may need to change to query the Genius API for the song URL instead, if there are too many differences between the artist names and song titles in the YouTube data and those in the Genius data.**

In [4]:
def generate_song_url(song_artist, song_title):
    '''
    Returns a string of the URL for the Genius page of the song

        Parameters:
            song_artist (str): The artist of the song
            song_title (str): The title of the song

        Returns:
            song_url (str): The URL for the Genius page of the song
    '''
    
    base_url = 'https://genius.com/'
    
    # format the artist name and song title
    song_artist = song_artist.replace('&', 'and')
    formatted_artist = song_artist.lower().replace(' ', '-')
    formatted_title = song_title.lower().replace(' ', '-')
    
    # generate the song URL by concatenating strings according to Genius formatting
    song_url = f'{base_url}{formatted_artist}-{formatted_title}-lyrics'

    return song_url
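
The note above mentions querying the Genius API for the song URL instead of building it by hand. A minimal sketch of that idea follows. It assumes the Genius search endpoint (`https://api.genius.com/search`) with a bearer access token, and that the JSON response lists matches under `response` then `hits`, each with a `result` containing a `url`. The function name `find_song_url` is hypothetical and not part of this notebook.

```python
def find_song_url(session, access_token, song_artist, song_title):
    '''
    Returns the Genius page URL of the first search hit, or None if nothing is found

        Parameters:
            session: An object with a requests-style get() method, e.g. requests.Session()
            access_token (str): A Genius API client access token
            song_artist (str): The artist of the song
            song_title (str): The title of the song
    '''
    response = session.get(
        'https://api.genius.com/search',  # assumed search endpoint
        params={'q': f'{song_artist} {song_title}'},
        headers={'Authorization': f'Bearer {access_token}'},
        timeout=10,
    )
    response.raise_for_status()

    # assumed response shape: {'response': {'hits': [{'result': {'url': ...}}, ...]}}
    hits = response.json().get('response', {}).get('hits', [])
    for hit in hits:
        url = hit.get('result', {}).get('url')
        if url:
            return url
    return None
```

It could be called with the session created earlier and a token read from `credentials`; the key under which the token is stored depends on how `credentials.json` is laid out. Taking the first hit is a simplification, so the returned URL may still need checking against the expected artist.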

I created a custom function `scrape_lyrics` to scrape song lyrics from the Genius page for any given song. 

Note that the lyrics are returned as a single string: section headers in square brackets (such as `[Chorus]`) are removed, and all whitespace, including the line breaks shown on the Genius page, is collapsed into single spaces.

In [5]:
def scrape_lyrics(session, song_url):
    '''
    Returns the song lyrics as a single string, with words separated by single spaces

        Parameters:
            session (requests.Session): The session that has been initialised for requesting from the Genius website
            song_url (str): The URL of the Genius page for the song

        Returns:
            lyrics (str): The lyrics of the song
    '''
    
    # reuse the initialised session so the connection to Genius is kept open between requests
    response = session.get(song_url)
    sel = Selector(text=response.text)
    
    # scrape lyrics into one large string
    # note: the class name below is generated by the Genius front end and may change
    raw_lyrics = ' '.join(sel.css('div.Lyrics__Container-sc-1ynbvzw-1.kUgSbL ::text').getall())

    # remove section headers in square brackets, e.g. [Chorus]
    pattern = r'\[.*?\]'
    result_string = re.sub(pattern, '', raw_lyrics)
    # collapse all whitespace, including line breaks, into single spaces
    lyrics = ' '.join(result_string.split())

    return lyrics
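
Because `scrape_lyrics` joins the text fragments with spaces and then collapses whitespace, the line breaks from the Genius page are lost. If one lyric per line is wanted, the cleaning step could be sketched as below. The function name `clean_lyrics_keep_lines` is hypothetical, and the sketch assumes each scraped text fragment corresponds to one line; fragments split by inline tags (links, italics) would end up on separate lines.

```python
import re


def clean_lyrics_keep_lines(text_fragments):
    '''
    Returns cleaned lyrics with one text fragment per line

        Parameters:
            text_fragments (list of str): Text nodes scraped from the lyrics container

        Returns:
            lyrics (str): The lyrics, with lines separated by a newline
    '''
    raw_lyrics = '\n'.join(text_fragments)

    # remove section headers such as [Chorus], even if they span several fragments
    without_headers = re.sub(r'\[.*?\]', '', raw_lyrics, flags=re.DOTALL)

    # tidy the spacing inside each line and drop lines that are now empty
    lines = (' '.join(line.split()) for line in without_headers.split('\n'))
    return '\n'.join(line for line in lines if line)
```

Inside `scrape_lyrics`, the list returned by `.getall()` could be passed to this helper instead of being joined with spaces.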

-----

### **Scrape lyrics of songs in CSV**

At this point in data collection, we have a pandas dataframe of songs that have already been selected and filtered using the YouTube API. Critically, the dataframe contains the title and artist of each song.

We now want to add the lyrics of each song into the dataframe.

In [6]:
test_df = pd.read_csv('../data/test_10_songs.csv')

In [7]:
# add Genius URL of each song to dataframe
test_df['Genius_URL'] = test_df.apply(lambda row: generate_song_url(row['Artist'], row['Song']), axis=1)

In [8]:
# add Genius lyrics of each song to dataframe
test_df['Genius_lyrics'] = test_df.apply(lambda row: scrape_lyrics(my_session, row['Genius_URL']), axis=1)
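
When a generated URL does not match a real Genius page, the CSS selector in `scrape_lyrics` finds nothing and the function returns an empty string, which is easy to miss in the dataframe. A minimal sketch of a wrapper that makes those cases visible and spaces out the requests is shown below. The name `scrape_or_none` and the one-second default pause are assumptions, not part of the original notebook.

```python
import time


def scrape_or_none(scrape_function, session, song_url, pause_seconds=1.0):
    '''
    Returns the scraped lyrics, or None if no lyrics were found

        Parameters:
            scrape_function (callable): A function taking (session, song_url), e.g. scrape_lyrics
            session: The session passed through to scrape_function
            song_url (str): The URL of the Genius page for the song
            pause_seconds (float): Time to wait before each request
    '''
    # pause briefly so consecutive requests are not sent back to back
    time.sleep(pause_seconds)

    lyrics = scrape_function(session, song_url)

    # an empty string means nothing was scraped; None shows up as a missing value in pandas
    return lyrics if lyrics else None
```

Used as `test_df['Genius_URL'].apply(lambda url: scrape_or_none(scrape_lyrics, my_session, url))`, the songs that failed can then be listed with `test_df[test_df['Genius_lyrics'].isna()]`.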

In [9]:
# save to CSV for use in data visualisation
# index=False avoids writing the row index as an extra unnamed column
test_df.to_csv('../data/10_songs_with_lyrics.csv', index=False)