## Scraping song lyrics from Genius.com ##

This notebook contains the Python code described on my blog in the [scraping genius lyrics post](http://www.johnwmillr.com/blog/2017/scraping-genius-lyrics).

Head over to my [GitHub repository](https://github.com/johnwmillr/geniusapi) to clone my Python wrapper.


## Genius API ##

In [1]:
# Sign up for a free account at Genius.com to access the API
# http://genius.com/api-clients
client_access_token = 'aPa0OLqvXfY8ht00NeneEBzZyvCvVcCiy40c4U-n3nzjZbfLrhEd2trP0hP_1_X3'

In [2]:
# Let's take a look at how we might search for an artist using the Genius API.
import requests
import urllib.parse
import urllib.request

# Format a request URL for the Genius API
search_term = 'Andy Shauf'
_URL_API = "https://api.genius.com/"
_URL_SEARCH = "search?q="
# Percent-encode the search term so spaces and punctuation are URL-safe
querystring = _URL_API + _URL_SEARCH + urllib.parse.quote(search_term)
request = urllib.request.Request(querystring)
request.add_header("Authorization", "Bearer " + client_access_token)
# request.add_header("User-Agent","curl/7.9.8 (i686-pc-linux-gnu) libcurl 7.9.8 (OpnSSL 0.9.6b) (ipv6 enabled)")
request.add_header("User-Agent", "")
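
The URL formatting above can be wrapped in a small helper. This is only a sketch: `build_search_request` and `API_ROOT` are hypothetical names chosen for illustration (they are not part of the Genius API or of my wrapper), and the function only builds the request without sending it.

```python
import urllib.parse
import urllib.request

API_ROOT = "https://api.genius.com/"  # hypothetical constant name

def build_search_request(search_term, access_token):
    """Return an unsent Request for the Genius search endpoint."""
    # quote() percent-encodes the term, e.g. a space becomes %20
    url = API_ROOT + "search?q=" + urllib.parse.quote(search_term)
    request = urllib.request.Request(url)
    request.add_header("Authorization", "Bearer " + access_token)
    request.add_header("User-Agent", "")
    return request

# "YOUR_TOKEN" is a placeholder, not a working access token
request = build_search_request("Andy Shauf", "YOUR_TOKEN")
print(request.full_url)
```

Keeping the request construction separate from `urlopen` makes it easy to inspect the URL and headers before anything goes over the network.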

In [3]:
# Now that we’ve formatted the URL, we can make a request to the database.
import json
response = urllib.request.urlopen(request, timeout=3)
raw = response.read()
json_obj = json.loads(raw)

In [4]:
# The parsed JSON object is just a normal Python dictionary
json_obj.keys()

dict_keys(['meta', 'response'])

In [5]:
# The 'hits' key stores info on each song in the search result.
# From here it's easy to grab the song title, album, etc.

# List each key contained within a single search hit
[key for key in json_obj['response']['hits'][0]['result']]

['annotation_count',
 'api_path',
 'full_title',
 'header_image_thumbnail_url',
 'header_image_url',
 'id',
 'lyrics_owner_id',
 'lyrics_state',
 'path',
 'pyongs_count',
 'song_art_image_thumbnail_url',
 'stats',
 'title',
 'title_with_featured',
 'url',
 'primary_artist']

In [6]:
# View the song name for each search hit
[song['result']['title'] for song in json_obj['response']['hits']]

['The Magician',
 'Quite Like You',
 'Early to the Party',
 'To You',
 'Wendell Walker',
 'Martha Sways',
 'Twist Your Ankle',
 'The Worst in You',
 "You're Out Wasting",
 'Eyes of Them All']
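
The list comprehensions above can be folded into one helper that walks the hits and returns a title and artist name for each. This is a sketch under assumptions: `summarize_hits` is a hypothetical name, the `sample` dictionary is hand-written to mirror the structure shown above rather than copied from a real response, and I am assuming the search response carries the same `meta.status` field and `primary_artist.name` field that the song endpoint returns.

```python
def summarize_hits(search_json):
    """Return (title, artist name) pairs from a parsed search response."""
    # Assumption: 'meta' holds an HTTP-style status code, as in the song endpoint
    if search_json["meta"]["status"] != 200:
        raise ValueError("Search request did not succeed")
    pairs = []
    for hit in search_json["response"]["hits"]:
        result = hit["result"]
        pairs.append((result["title"], result["primary_artist"]["name"]))
    return pairs

# Hypothetical, hand-written response with the same nesting as the real one
sample = {
    "meta": {"status": 200},
    "response": {"hits": [
        {"result": {"title": "The Magician",
                    "primary_artist": {"name": "Andy Shauf"}}},
        {"result": {"title": "Quite Like You",
                    "primary_artist": {"name": "Andy Shauf"}}},
    ]},
}
print(summarize_hits(sample))
```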

In [7]:
# URL to artist image
print(json_obj['response']['hits'][0]['result']['primary_artist']['image_url'])

https://images.genius.com/16423bad48ffd400aac3ba86d5b86ed4.850x850x1.jpg


<img src="https://images.genius.com/16423bad48ffd400aac3ba86d5b86ed4.850x850x1.jpg" style="width: 200px;"/> 

### Access a song or artist directly by ID ###

In [8]:
# If you have an artist or song ID, you can access that entry 
# directly by reformatting the request URL.
song_id = 2299297
querystring = "https://api.genius.com/songs/" + str(song_id)
request = urllib.request.Request(querystring)
request.add_header("Authorization", "Bearer " + client_access_token)
request.add_header("User-Agent", "")
response = urllib.request.urlopen(request, timeout=3)
raw = response.read()
json_obj = json.loads(raw)
print(json_obj)
print((json_obj['response']['song']['title'],\
       json_obj['response']['song']['primary_artist']['name']))

{'meta': {'status': 200}, 'response': {'song': {'annotation_count': 1, 'api_path': '/songs/2299297', 'description': {'dom': {'tag': 'root', 'children': [{'tag': 'p', 'children': ['?']}]}}, 'embed_content': "<div id='rg_embed_link_2299297' class='rg_embed_link' data-song-id='2299297'>Read <a href='https://genius.com/The-young-wild-not-a-one-lyrics'>“Not a One” by The\xa0Young Wild</a> on Genius</div> <script crossorigin src='//genius.com/songs/2299297/embed.js'></script>", 'featured_video': True, 'full_title': 'Not a One by\xa0The\xa0Young Wild', 'header_image_thumbnail_url': 'https://images.rapgenius.com/0fcb3103057c4d76c158eb778fa1d935.300x300x1.jpg', 'header_image_url': 'https://images.rapgenius.com/0fcb3103057c4d76c158eb778fa1d935.1000x1000x1.jpg', 'id': 2299297, 'lyrics_owner_id': 93685, 'lyrics_state': 'complete', 'path': '/The-young-wild-not-a-one-lyrics', 'pyongs_count': 3, 'recording_location': None, 'release_date': '2016-08-26', 'song_art_image_thumbnail_url': 'https://images.

In [26]:
import random
import numpy as np

# Sample random song IDs and collect the page-view count for every
# sampled song that reports a nonzero 'pageviews' stat.

def findMeanPageViews(numSongs):
    songViews = np.empty((0,))
    print(songViews.dtype)
    for i in range(numSongs):
        song_id = random.randint(0,2299297)
        querystring = "https://api.genius.com/songs/" + str(song_id)
        request = urllib.request.Request(querystring)
        request.add_header("Authorization", "Bearer " + client_access_token)
        request.add_header("User-Agent", "")
        try:
            response = urllib.request.urlopen(request, timeout=3)
            json_obj = json.loads(response.read())
            # Not every song reports page views; skip missing or zero counts
            pageviews = json_obj['response']['song']['stats'].get('pageviews')
            if pageviews:
                print(pageviews)
                songViews = np.append(songViews, pageviews)
        except Exception:
            # Skip IDs that fail to load (e.g. no such song) or time out
            pass
    return songViews

pageViews = findMeanPageViews(100)

print(pageViews)
print(np.mean(pageViews))
print(len(pageViews))

float64
37044
11001
46681
[37044. 11001. 46681.]
31575.333333333332
3
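
Only 3 of the 100 random IDs produced a page-view count, so the mean printed above describes a very small sample. As a quick check on the arithmetic, the sketch below recomputes the mean from the three printed values and compares it with the median. It uses the standard-library `statistics` module instead of NumPy purely to keep the example dependency-free; the variable names are my own.

```python
import statistics

# Page-view counts copied from the output above
page_views = [37044, 11001, 46681]
num_sampled = 100  # number of random IDs requested in findMeanPageViews(100)

mean_views = statistics.mean(page_views)
median_views = statistics.median(page_views)
hit_rate = len(page_views) / num_sampled  # fraction of IDs that reported views

print(mean_views)
print(median_views)
print(hit_rate)
```

With so few hits, neither figure should be read as representative of Genius as a whole; songs that do not report page views are excluded from the sample entirely.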


In [65]:
def getSongData(numSongs):
    songData = {}
    for i in range(numSongs+1):
        song_id = i
        if (i % 5) == 0:
            print(i)
        querystring = "https://api.genius.com/songs/" + str(song_id)
        request = urllib.request.Request(querystring)
        request.add_header("Authorization", "Bearer " + client_access_token)
        request.add_header("User-Agent", "")
        try:
            response = urllib.request.urlopen(request, timeout=3)
            raw = response.read()
            json_obj = json.loads(raw)
            song = json_obj['response']['song']
            # Keep only songs that report a nonzero page-view count
            if song['stats'].get('pageviews'):
                title = song['title']
                song_id = song['id']
                artist = song['album']['artist']['name']
                release_date = song['release_date']
                # getLyrics returns bytes, so str() stores the b'...' repr of the lyrics
                lyrics = str(getLyrics(song['url']))
                url = song['url']
                pageviews = song['stats']['pageviews']
                currsongdata = {'url':url,'title':title,'artist':artist,'release_date':release_date,'lyrics':lyrics,'page_views':pageviews}
                songData[song_id] = currsongdata
        except Exception:
            # Skip IDs that fail to load or are missing an expected field
            pass
    return songData

def savejson(data):
    with open('songData.json','w') as outfile:
        json.dump(data,outfile)

        
mydata = getSongData(200)
savejson(mydata)

0
5
10
15
20
25
30
35
40
45
50
55
60
65
70
75
80
85
90
95
100
105
110
115
120
125
130
135
140
145
150
155
160
165
170
175
180
185
190
195
200
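
Because `getSongData` swallows every exception, a song with any missing field is silently dropped. One alternative is to pull the fields out defensively. The sketch below is an assumption-heavy illustration: `extract_record` is a hypothetical helper, the sample dictionary is hand-written (its page-view count is made up), and I am assuming that `album` can be absent or null for some songs, in which case the sketch falls back to `primary_artist`.

```python
def extract_record(song):
    """Flatten one song dict into the fields saved above (without lyrics)."""
    album = song.get("album") or {}  # assumption: 'album' may be missing or None
    artist_info = album.get("artist") or song.get("primary_artist") or {}
    stats = song.get("stats") or {}
    return {
        "url": song.get("url"),
        "title": song.get("title"),
        "artist": artist_info.get("name"),
        "release_date": song.get("release_date"),
        "page_views": stats.get("pageviews"),
    }

# Hypothetical song payload; the 'album' and 'pageviews' values are invented
sample_song = {
    "url": "https://genius.com/The-young-wild-not-a-one-lyrics",
    "title": "Not a One",
    "album": None,
    "primary_artist": {"name": "The Young Wild"},
    "release_date": "2016-08-26",
    "stats": {"pageviews": 1234},
}
print(extract_record(sample_song))
```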


In [56]:
with open('songData.json') as infile:
    data = json.load(infile)
# JSON object keys are always strings, so song ID 19 is looked up as '19'
print(data['19'])

{'url': 'https://genius.com/Camron-losing-weight-pt-2-lyrics', 'title': 'Losing Weight, Pt. 2', 'artist': "Cam'ron", 'release_date': '2002-05-14', 'lyrics': 'b\'\\n\\n[Chorus: Cam\\\'ron]\\nAyo, fuck losing weight\\nI\\\'m back on these highways moving cakes\\nLife\\\'s based upon what I\\\'ma do today\\nCop a car, new estate\\nNa, fuck it get the beef and brocs: blue and gray\\nBaby due today\\nI got to move an eighth\\nFuck the scrutiny\\nY\\\'all niggas screwing me\\nKilla never let the drama slide\\nY\\\'all gone hear a nigga momma die\\nYell out "homicide"\\n\\n[Verse 1: Cam\\\'ron]\\nYo, 18 months? Please, that ain\\\'t facing time\\nI\\\'m stressed anyway, need it for vacation time\\nI\\\'ma do the right thing though, take shock anyway\\n6 months, right back on the damn block anyway\\nBut look, money from across the street\\nThink it\\\'s sweet, think he get money across the street\\nMe and my peeps often meet\\nAnd 5-0 they work for us, walk the beat\\nWalk with heat \\\'cause 
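
Two details of the saved file are visible in the output above: the song ID is looked up as the string `'19'`, and the lyrics start with `b'` and contain escaped `\n` sequences. The sketch below reproduces both effects without touching the network. The `raw` bytes are a short stand-in for what `getLyrics` returns, not real scraped data.

```python
import json

raw = b"\n\n[Chorus]\nFirst line"       # stand-in for the bytes from getLyrics
song_data = {19: {"lyrics": str(raw)}}  # same shape as songData, integer key

# Round-trip through JSON text, as savejson and json.load do via a file
restored = json.loads(json.dumps(song_data))

print(list(restored.keys()))      # integer keys come back as strings
print(restored["19"]["lyrics"])   # the b'...' wrapper is part of the stored text
print(raw.decode("ascii"))        # decoding the bytes gives plain text instead
```

If plain text is what you want in the file, calling `.decode('ascii')` on the bytes (rather than `str(...)`) before saving is one way to get it.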

### Scrape song lyrics ###

In [27]:
from bs4 import BeautifulSoup
import re
def getLyrics(url):
    page = requests.get(url)
    html = BeautifulSoup(page.text, "html.parser") # Parse the page's HTML into a searchable tree

    # Scrape the song lyrics from the HTML. Non-ASCII characters are dropped,
    # and the result is a bytes object rather than a str.
    lyrics = html.find("div", class_="lyrics").get_text().encode('ascii','ignore')
    #lyrics = re.sub(r'\[.*\]','',lyrics) # Remove [Verse] and [Bridge] stuff
    #lyrics = re.sub(r'\n{2}','',lyrics)  # Remove gaps between verses
    #lyrics = str(lyrics).strip('\n')
    return lyrics
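
The commented-out lines in `getLyrics` hint at a cleanup step. Here is a minimal sketch of what that could look like on a `str` (bytes from `getLyrics` would need `.decode('ascii')` first). `clean_lyrics` is a hypothetical helper, and it differs slightly from the commented code: it uses a non-greedy pattern for the bracketed headers and collapses blank lines to a single newline instead of deleting them.

```python
import re

def clean_lyrics(text):
    """Remove bracketed section headers and collapse runs of blank lines."""
    text = re.sub(r"\[.*?\]", "", text)   # drop [Verse 1], [Chorus: ...], etc.
    text = re.sub(r"\n{2,}", "\n", text)  # collapse gaps between sections
    return text.strip("\n")

# Made-up sample text in the same layout as scraped lyrics
sample = "\n\n[Verse 1]\nline one\nline two\n\n[Chorus]\nline three\n"
print(clean_lyrics(sample))
```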

## Python wrapper ##
You may need to run this code from the terminal after cloning the repo:
https://github.com/johnwmillr/geniusapi

In [10]:
# Create an instance of the API interface
import lyricsgenius as genius
api = genius.Genius(client_access_token)

In [11]:
# Search for an artist
artist = api.search_artist('Andy Shauf', max_songs=5)
print(artist)

Searching for Andy Shauf...

Song 1: "Alexander All Alone"
Song 2: "Begin Again"
Song 3: "Comfortable With Silence"
Song 4: "Covered in Dust"
Song 5: "Crushes"

Reached user-specified song limit (5).
Found 5 songs.
Done.
Andy Shauf, 5 songs


In [12]:
# Search for a specific song
song = api.search_song('Wendell Walker', artist.name)
artist.add_song(song)
print(artist)
print(artist.songs[0].lyrics)

Searching for "Wendell Walker"...
Done.
Andy Shauf, 6 songs
Alexander all alone
Smoking a cigarette
The last pack he’d ever buy
At least that’s what he said
He stood up to stretch his back
And fell down to the ground

Alexander all alone
'Till the neighbour caught a glimpse
Cried out for his wife
To call the ambulance
Alexander all alone
Felt them check his pulse
He heard them pronounce him dead

Hell is found inside of me
And nothing else will set me free
If hell is found inside of me
Then open me up and spill me out

Alexander wondered why
No life flashed before his eyes
Why his soul did not depart
Why he found no peace of mind
Would it take a little while?
Was it the same for everyone?
Alexander realized

That hell is found inside of me
And nothing else will set me free
If hell is found inside of me
Then open me up and spill me out
