# Task 1: Data Collection 

This project involves collecting song and artist data from the Genius open web API for four different artists:
- Elton John
- Hozier
- Kanye West
- Kendrick Lamar

This notebook covers <b>Task 1 - Data Collection</b>.

Because the API only returns a snapshot of the data at the moment it is requested, and some of the values change over time, the notebook code should be run several times at different intervals to collect sufficient data for analysis.


## Establishing Data Storage Options

In [10]:
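import json, requests
# urllib.parse is imported explicitly; a bare "import urllib" does not guarantee the submodule is loaded
import urllib.parse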
import pandas as pd
from pathlib import Path
from bs4 import BeautifulSoup 
from datetime import datetime
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

Establishing the settings for the API.

In [11]:
# API prefix
api_prefix = "https://api.genius.com"
# Personal access token required by Genius
client_access_token = "8b43-Edao94eWLI083TYFXgzj5ODmZiZmdkdH8UDjDj-PL2a4201crc_hU060-9N"
# The artists that we would like to study
artist_names = ['Kendrick Lamar', 'Kanye West','Elton John','Hozier']

Creating directories for data storage.

In [12]:
# Raw data storage from the API
dir_raw = Path("raw")
dir_raw.mkdir(parents=True, exist_ok=True)
# Text storage for the song lyrics
dir_lyrics = Path("lyrics")
dir_lyrics.mkdir(parents=True, exist_ok=True)

Defining the index positions at which each type of data is stored in the raw output files.

In [13]:
# Sections of the lists in the output files where different data is stored
search_data = 0
artist_data = 1

## Collecting the Data

We want to use the artists' names to collect information about them and their top songs.

The search capability in Genius allows us to obtain information about their top 10 most "interacted with" songs on Genius.

In [14]:
def fetch(path, params=None, headers=None):
    # Construct the URL
    url = "/".join([api_prefix, path])
    # Create the access token using 'Bearer' as specified by Genius
    access_token = "Bearer %s" % client_access_token
    # Copy any caller-supplied headers so the caller's dict is not modified
    headers = dict(headers) if headers else {}
    headers["Authorization"] = access_token

    print("\nFetching From %s" % url)

    # Fetch the page and stop early if the API reports an HTTP error
    response = requests.get(url=url, params=params, headers=headers)
    response.raise_for_status()

    # Decode the JSON body of the response
    return response.json()
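
To make the request that `fetch` sends more concrete, the sketch below assembles the URL and the `Authorization` header without contacting the API. `build_request_parts` and the placeholder token are hypothetical names introduced only for this illustration.

```python
import urllib.parse

api_prefix = "https://api.genius.com"

def build_request_parts(path, token):
    """Return the URL and headers a call for `path` would use (hypothetical helper)."""
    url = "/".join([api_prefix, path])
    headers = {"Authorization": "Bearer %s" % token}
    return url, headers

# quote() percent-encodes the space so the name is safe inside the query string
path = "search?q=%s" % urllib.parse.quote("Kendrick Lamar")
url, headers = build_request_parts(path, token="PLACEHOLDER_TOKEN")
print(url)
print(headers["Authorization"])
```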

Using the search capability and checking that the correct info is being obtained.

In [17]:
metadata_rows = []
artist_ID = {}

for artist_name in artist_names:
    data = fetch("search?q=%s" % urllib.parse.quote(artist_name))
    #Obtain the artist ID in order to obtain data using the 'artist' resource in Genius
    for hit in data['response']['hits']:
        primary_artist = hit['result']['primary_artist']
        # Account for the fact that they may not always be the main artist
        if primary_artist['name'] == artist_name:
            # Read the ID from the matching hit, not from the first hit
            print("%s\nArtist ID: %s\n" % (artist_name, primary_artist['id']))
            artist_ID[artist_name] = primary_artist['id']
            # After the artist ID is found once, break out of the loop
            break


    for item in data['response']['hits']:
        row = {"Artist": artist_name}
        row['Song'] = item['result']['title']
        row['ID'] = item['result']['id']
        row['Annotations'] = item['result']['annotation_count']
        row['User Interest'] = item['result']['pyongs_count'] #Shares
        row['Page Views'] = item['result']['stats']['pageviews']
        metadata_rows.append(row)

df = pd.DataFrame(metadata_rows).set_index('Artist')
df.head(10)


Fetching From https://api.genius.com/search?q=Kendrick%20Lamar
Kendrick Lamar
Artist ID: 1421


Fetching From https://api.genius.com/search?q=Kanye%20West
Kanye West
Artist ID: 72


Fetching From https://api.genius.com/search?q=Elton%20John
Elton John
Artist ID: 560


Fetching From https://api.genius.com/search?q=Hozier
Hozier
Artist ID: 73910



Artist,Song,ID,Annotations,User Interest,Page Views
Kendrick Lamar,HUMBLE.,3039923,20,1204,11751898
Kendrick Lamar,​euphoria,10341021,89,140,9634346
Kendrick Lamar,Not Like Us,10359264,61,144,8725125
Kendrick Lamar,​m.A.A.d city,90478,82,2239,7449028
Kendrick Lamar,Swimming Pools (Drank),81159,48,986,6659169
Kendrick Lamar,DNA.,3035222,38,693,6350122
Kendrick Lamar,Money Trees,90475,54,994,6420582
Kendrick Lamar,XXX.,3047142,26,245,5806411
Kendrick Lamar,"Bitch, Don’t Kill My Vibe",90473,36,705,5201090
Kendrick Lamar,​meet the grahams,10356410,75,112,5071904


In [18]:
# Use the 'artists' resource with a stored ID to verify that the correct artist ID was found
# (artist_name still holds the last artist from the loop above)
data = fetch("artists/%s" % artist_ID[artist_name])
print("Name: ",data['response']['artist']['name'])
print("Facebook Name: ",data['response']['artist']['facebook_name'])


Fetching From https://api.genius.com/artists/73910
Name:  Hozier
Facebook Name:  hoziermusic


Collect data for each of the artists and write it out to a timestamped file. The collection cell performs a single pass, so it needs to be re-run every 10 minutes.

In [19]:
def fetch_current_data(artist_name, artist_ID):
    # Create endpoints for each of the calls
    search_endpoint = "search?q=%s" % urllib.parse.quote(artist_name)
    artist_endpoint = "artists/%s" % artist_ID[artist_name]
    # Retrieve both the data using the search capability and the artist's ID
    search_data = fetch(search_endpoint)
    artist_data = fetch(artist_endpoint)
    data = [search_data, artist_data]
    # Write to raw datafile 
    date_suffix = datetime.now().strftime("%Y%m%d-%H%M")
    fname = "%s-%s.json" % (artist_name, date_suffix)
    out_path = dir_raw / fname
    print("Writing data to %s" % out_path)
    # The context manager closes the file even if the dump fails
    with open(out_path, "w") as fout:
        json.dump(data, fout, indent=4, sort_keys=True)

In [20]:
for artist_name in artist_names:
    fetch_current_data(artist_name, artist_ID)


Fetching From https://api.genius.com/search?q=Kendrick%20Lamar

Fetching From https://api.genius.com/artists/1421
Writing data to raw\Kendrick Lamar-20240723-1949.json

Fetching From https://api.genius.com/search?q=Kanye%20West

Fetching From https://api.genius.com/artists/72
Writing data to raw\Kanye West-20240723-1949.json

Fetching From https://api.genius.com/search?q=Elton%20John

Fetching From https://api.genius.com/artists/560
Writing data to raw\Elton John-20240723-1949.json

Fetching From https://api.genius.com/search?q=Hozier

Fetching From https://api.genius.com/artists/73910
Writing data to raw\Hozier-20240723-1949.json
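
Repeating the collection at a fixed interval can be sketched as below. This is a minimal illustration, not the approach used in this notebook: `collect_repeatedly` and its parameters are hypothetical names, and the example passes a dummy callback and a zero-second interval so that it runs instantly.

```python
import time

def collect_repeatedly(collect_once, rounds, interval_seconds, sleep=time.sleep):
    """Call collect_once() `rounds` times, pausing between consecutive calls."""
    for i in range(rounds):
        collect_once()
        # No need to wait after the final round
        if i < rounds - 1:
            sleep(interval_seconds)

# Dummy callback standing in for one collection pass over all artists
calls = []
collect_repeatedly(lambda: calls.append(len(calls)), rounds=3, interval_seconds=0)
print(calls)
```

In this notebook, `collect_once` could be a function that loops over `artist_names` and calls `fetch_current_data`, with `interval_seconds=600` for a 10-minute spacing. Because the file suffix has minute resolution, intervals shorter than a minute would overwrite earlier files.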


Create a function to scrape the lyrics of each artist's top songs, using the song path provided by the API, in order to get an approximate length (number of lyrics) of each song for later analysis. This data will be used to investigate correlations between the number of lyrics and the interaction with the song, through annotations, page views, etc.

(The length is approximate because the lyrics scraped from Genius contain some extra information, e.g. the section of the song, and separating some lyrics from others is challenging because of the way they are stored in each Genius HTML page. For the required analysis, an approximate length is sufficient.)
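
One way such an approximate length could be computed is sketched below. It assumes that "length" means a word count and that section labels appear in square brackets (e.g. `[Chorus]`); `approximate_length` is a hypothetical helper, not part of this notebook.

```python
import re

def approximate_length(lyrics):
    """Count whitespace-separated words after dropping bracketed section labels."""
    # Remove labels such as [Chorus] or [Verse 1: Artist]
    without_labels = re.sub(r"\[[^\]]*\]", " ", lyrics)
    return len(without_labels.split())

# Made-up text, not real lyrics
sample = "[Verse 1]\nOne two three\n[Chorus]\nFour five"
print(approximate_length(sample))
```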

In [21]:
def get_lyrics(song_path):
    url = "http://genius.com" + song_path
    # Class of the divs holding the lyric text in the Genius HTML;
    # generated class names like this one may change over time
    lyric_class_html_tag = "Lyrics__Container-sc-1ynbvzw-6 jYfhrf"
    print("Fetching Lyrics From %s" % url)
    response = requests.get(url)
    data = response.text
    # Employ BeautifulSoup
    html = BeautifulSoup(data, "html.parser")

    # Use the div class which signifies the lyric text in Genius HTML
    results = html.find_all("div", {"class": lyric_class_html_tag})

    # Join the text of every lyric container into one string
    lyrics = "".join(result.text.strip() for result in results)

    return lyrics

In [22]:
for artist_name in artist_names:
    data = fetch("search?q=%s" % urllib.parse.quote(artist_name))
    for x in data['response']['hits']:
        lyrics = get_lyrics(x['result']['path'])
    
        fname = "%s-%s-lyrics.txt" % (x['result']['title'], artist_name)
        out_path = dir_lyrics / fname
        print("Writing data to %s" % out_path)
        # The lyrics are stored as a single JSON-encoded string
        with open(out_path, "w") as fout:
            json.dump(lyrics, fout)


Fetching From https://api.genius.com/search?q=Kendrick%20Lamar
Fetching Lyrics From http://genius.com/Kendrick-lamar-humble-lyrics
Writing data to lyrics\HUMBLE.-Kendrick Lamar-lyrics.txt
Fetching Lyrics From http://genius.com/Kendrick-lamar-euphoria-lyrics
Writing data to lyrics\​euphoria-Kendrick Lamar-lyrics.txt
Fetching Lyrics From http://genius.com/Kendrick-lamar-not-like-us-lyrics
Writing data to lyrics\Not Like Us-Kendrick Lamar-lyrics.txt
Fetching Lyrics From http://genius.com/Kendrick-lamar-maad-city-lyrics
Writing data to lyrics\​m.A.A.d city-Kendrick Lamar-lyrics.txt
Fetching Lyrics From http://genius.com/Kendrick-lamar-swimming-pools-drank-lyrics
Writing data to lyrics\Swimming Pools (Drank)-Kendrick Lamar-lyrics.txt
Fetching Lyrics From http://genius.com/Kendrick-lamar-dna-lyrics
Writing data to lyrics\DNA.-Kendrick Lamar-lyrics.txt
Fetching Lyrics From http://genius.com/Kendrick-lamar-money-trees-lyrics
Writing data to lyrics\Money Trees-Kendrick Lamar-lyrics.txt
Fetchin