<div align="center">
  <h1> 1 - Getting Data - Eminem Lyrics</h1> <a name="0-bullet"></a>
</div>

- [1. Setup](#1-bullet)
    * [1.1 Set the working directory](#11-bullet)
- [2. Collect the data](#2-bullet)
    * [2.1 Connect to the Genius API](#21-bullet)
    * [2.2 Get the lyrics](#22-bullet)
- [3. Save/load the Genius Artist object](#3-bullet)
    * [3.1 Save the object](#31-bullet)
    * [3.2 Load the object](#32-bullet)
- [4. Convert to a Pandas DataFrame object](#4-bullet)
    * [4.1 Store the DataFrame object](#41-bullet)
    
> References: 
> * [LyricsGenius: a Python client for the Genius.com API](https://lyricsgenius.readthedocs.io/en/master/index.html)

---

If `lyricsgenius` is not installed yet, install it first:
>`pip install lyricsgenius`

In [None]:
import os
import json
import pickle

import numpy as np
import pandas as pd
import lyricsgenius as genius

from tqdm import tqdm

---

# 1. Setup <a name="1-bullet"></a> <a href="#0-bullet"> <sup><sup><sup>^</sup></sup></sup></a>

## 1.1 Set the working directory <a name="11-bullet"></a>

In [None]:
ROOT_DIR = "./eminem-lyrics-generator/notebooks/" 
IN_GOOGLE_COLAB = True

if IN_GOOGLE_COLAB:
    # mount google drive
    from google.colab import drive
    drive.mount('/content/gdrive')

    # change the current working directory
    %cd gdrive/'My Drive'

    # create a root directory if there's none
    if not os.path.isdir(ROOT_DIR):
        %mkdir $ROOT_DIR

    # change the current working directory
    %cd $ROOT_DIR

Mounted at /content/gdrive
/content/gdrive/My Drive
/content/gdrive/My Drive/eminem-lyrics-generator/notebooks


In [None]:
# load the settings file, which stores the paths to all files in the project
SETTINGS_FILE_PATH = os.path.join(os.path.abspath(".."), 'SETTINGS.json')
with open(SETTINGS_FILE_PATH) as settings_file:
    settings = json.load(settings_file)
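
The later cells read four keys from `SETTINGS.json`. A minimal sketch of a loader that fails early when one of them is missing could look like this. The helper name `load_settings` and the placeholder path values are assumptions for illustration, not part of the project.

```python
import json
import os
import tempfile

# Keys that this notebook reads from SETTINGS.json further down.
REQUIRED_KEYS = (
    "ARTIST_OBJECT_ALL_PATH",
    "ARTIST_OBJECT_SONGS_PATH",
    "EMINEM_DF_ALL_RAW_PATH",
    "EMINEM_DF_SONGS_RAW_PATH",
)


def load_settings(path):
    """Load a JSON settings file and raise if a required key is missing."""
    with open(path, encoding="utf-8") as handle:
        loaded = json.load(handle)
    missing = [key for key in REQUIRED_KEYS if key not in loaded]
    if missing:
        raise KeyError(f"Missing settings keys: {missing}")
    return loaded


# Demo with a throwaway file; the path values are placeholders,
# not the project's real paths.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "SETTINGS.json")
    with open(demo_path, "w", encoding="utf-8") as handle:
        json.dump({key: f"data/{key.lower()}.pkl" for key in REQUIRED_KEYS}, handle)
    demo_settings = load_settings(demo_path)
```

Checking the keys up front means a typo in the settings file surfaces here rather than after the long download in section 2.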

# 2. Collect the data <a name="2-bullet"></a> <a href="#0-bullet"> <sup><sup><sup>^</sup></sup></sup></a>

## 2.1 Connect to the Genius API <a name="21-bullet"></a>

In [None]:
# assign Genius.com credentials
GENIUSCREDS = "<YOUR-GENIUS-CREDENTIALS>"

api = genius.Genius(access_token=GENIUSCREDS,                   # access token provided by Genius
                    verbose=True,                               # print status messages while searching
                    remove_section_headers=False,               # keep section headers (e.g. [Chorus]) in the lyrics
                    skip_non_songs=False,                       # do not skip non-songs (e.g. track listings); set per run below
                    excluded_terms=["(Remix)", "(Live)"],       # exclude songs with these terms in their title
                    timeout=25,                                 # seconds to wait for a response before giving up
                    retries=5)                                  # number of retries in case of timeouts and errors
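
To avoid committing the token with the notebook, it can be read from an environment variable instead of a string literal. This is only a sketch: the variable name `GENIUS_ACCESS_TOKEN` and the helper `load_genius_token` are assumptions, not something the project or `lyricsgenius` defines.

```python
import os


def load_genius_token(env=os.environ, var_name="GENIUS_ACCESS_TOKEN"):
    """Return the token stored in `var_name`, or raise if it is unset or empty."""
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return token


# Demo with a plain dict standing in for os.environ.
demo_token = load_genius_token(env={"GENIUS_ACCESS_TOKEN": "  abc123  "})
```

In the cell above, `GENIUSCREDS = load_genius_token()` would then replace the hard-coded placeholder.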

## 2.2 Get the lyrics <a name="22-bullet"></a>

In [None]:
ARTIST_NAME = "Eminem"

### a) including non-songs (skits, snippets, etc.)

In [None]:
api.skip_non_songs = False

artist_all = api.search_artist(ARTIST_NAME)

Searching for songs by Eminem...

Song 1: "Rap God"
Song 2: "Killshot"
Song 3: "Godzilla"
Song 4: "Lose Yourself"
Song 5: "The Monster"
Song 6: "Lucky You"
Song 7: "The Ringer"
Song 8: "River"
Song 9: "Venom"
Song 10: "Stan"
Song 11: "Berzerk"
Song 12: "Without Me"
Song 13: "Not Alike"
Song 14: "Fall"
Song 15: "The Real Slim Shady"
Song 16: "’Till I Collapse"
Song 17: "Kamikaze"
Song 18: "Walk on Water"
Song 19: "Love the Way You Lie"
Song 20: "Bad Guy"
Song 21: "8 Mile: B-Rabbit vs Papa Doc"
Song 22: "Mockingbird"
Song 23: "Not Afraid"
Song 24: "Headlights"
Song 25: "No Love"
Song 26: "Survival"
Song 27: "Beautiful"
Song 28: "Greatest"
Song 29: "When I’m Gone"
Song 30: "Cleanin’ Out My Closet"
Song 31: "My Name Is"
Song 32: "Love Game"
Song 33: "Superman"
Song 34: "The Way I Am"
Song 35: "Legacy"
Song 36: "Unaccommodating"
Song 37: "Space Bound"
Song 38: "Like Toy Soldiers"
Song 39: "Guts Over Fear"
Song 40: "Sing for the Moment"
Song 41: "Gnat"
Song 42: "Marshall Mathers"
Song 43: "S

Song 309: "We’re Back"
Song 310: "The Conspiracy Freestyle"
Song 311: "Alfred (Intro)"
Song 312: "Soap (Skit)"
Song 313: "Emulate"
Song 314: "Curtains Close (Skit)"
Song 315: "I Get Money (Remix)"
Song 316: "Devil’s Night Intro"
Song 317: "50 Ways"
Song 318: "Rap City"
Song 319: "W.E.G.O. (Interlude)"
Song 320: "It’s Only Fair to Warn [Freestyle]"
Song 321: "Lounge (Skit)"
Song 322: "Letter to Tupac’s Mother"
Song 323: "The Real Slim Shady (Clean)"
Song 324: "Freak"
Song 325: "Mr. Mathers (Skit)"
Song 326: "Invasion (The Realest)"
Song 327: "So Many Styles"
Song 328: "Thus Far (Interlude)"
Song 329: "8 Mile: D’Phuzion vs B-Rabbit"
Song 330: "Just Rhymin’ with Proof"
Song 331: "8 Mile: Lily’s Lullaby"
Song 332: "Everything"
Song 333: "Fuck You"
Song 334: "Paul (Skit) [2004]"
Song 335: "Rap City Freestyle (Keeping It Raw)"
Song 336: "Things Get Worse (Lost Version)"
Song 337: "1-833-2GET-REV (REVIVAL Voicemail)"
Song 338: "Curtains Up (Skit) [2004]"
Song 339: "Nut Up"
Song 340: "Curtains

Song 523: "Superman (Clean Radio Edit)"
Song 524: "The Real Slim Shady (Murda Mixtape Version)"
Song 525: "Eminem Full Discography"
Couldn't find the lyrics section. Please report this if the song has lyrics.
Song URL: https://genius.com/Eminem-guilty-conscience-instrumental-lyrics
Song 526: "Guilty Conscience (Instrumental)"
Song 527: "The Kids (Demo)"
Song 528: "Eminem - Get You Mad (Türkçe Çeviri)"
Song 529: "Real Slim Shady (EDIT)"
Song 530: "WJLB 1998 Promo"
Song 531: "Interview With Hot 97 (Disses Benzino)"
Couldn't find the lyrics section. Please report this if the song has lyrics.
Song URL: https://genius.com/Eminem-cleanin-out-my-closet-instrumental-lyrics
Song 532: "Cleanin’ Out My Closet (Instrumental)"
Song 533: "Everything I Do"
Couldn't find the lyrics section. Please report this if the song has lyrics.
Song URL: https://genius.com/Eminem-the-real-slim-shady-instrumental-lyrics
Song 534: "The Real Slim Shady (Instrumental)"
Song 535: "Cannon for nick"
Song 536: "Rap Game 

### b) only songs

In [None]:
api.skip_non_songs = True

artist_songs = api.search_artist(ARTIST_NAME)

Searching for songs by Eminem...

Song 1: "Rap God"
Song 2: "Killshot"
Song 3: "Godzilla"
Song 4: "Lose Yourself"
Song 5: "The Monster"
Song 6: "Lucky You"
Song 7: "The Ringer"
Song 8: "River"
Song 9: "Venom"
Song 10: "Stan"
Song 11: "Berzerk"
Song 12: "Without Me"
Song 13: "Not Alike"
Song 14: "Fall"
Song 15: "The Real Slim Shady"
Song 16: "’Till I Collapse"
Song 17: "Kamikaze"
Song 18: "Walk on Water"
Song 19: "Love the Way You Lie"
Song 20: "Bad Guy"
Song 21: "8 Mile: B-Rabbit vs Papa Doc"
Song 22: "Mockingbird"
Song 23: "Not Afraid"
Song 24: "Headlights"
Song 25: "No Love"
Song 26: "Survival"
Song 27: "Beautiful"
Song 28: "Greatest"
Song 29: "When I’m Gone"
Song 30: "Cleanin’ Out My Closet"
Song 31: "My Name Is"
Song 32: "Love Game"
Song 33: "Superman"
Song 34: "The Way I Am"
Song 35: "Legacy"
Song 36: "Unaccommodating"
Song 37: "Space Bound"
Song 38: "Like Toy Soldiers"
Song 39: "Guts Over Fear"
Song 40: "Sing for the Moment"
Song 41: "Gnat"
Song 42: "Marshall Mathers"
Song 43: "S

Song 276: "My Name Is (Original)"
Song 277: "Em360 Rapcity Backroom Freestyle"
Song 278: "The Equalizer : Soundtrack"
Song 279: "Jealousy Woes II"
Song 280: "8 Mile: Marv Won vs B-Rabbit"
Song 281: "It’s Been Real"
Song 282: "Maxine"
Song 283: "Greg"
Song 284: "8 Mile Background Music"
"Tonya (Skit)" is not valid. Skipping.
Song 285: "Kurt Loder Car Freestyle"
"Steve Berman (Skit) [2002]" is not valid. Skipping.
"Bang (Remix)" is not valid. Skipping.
Song 286: "2012 Something From Nothing: Art of Rap Freestyle"
Song 287: "My Name Is (Original Version)"
Song 288: "Our House"
Song 289: "Rap City Freestyle"
Song 290: "Intro (Curtain Call)"
Song 291: "We’re Back"
Song 292: "The Conspiracy Freestyle"
Song 293: "Alfred (Intro)"
"Soap (Skit)" is not valid. Skipping.
Song 294: "Emulate"
"Curtains Close (Skit)" is not valid. Skipping.
"I Get Money (Remix)" is not valid. Skipping.
Song 295: "Devil’s Night Intro"
Song 296: "50 Ways"
"Unstoppable (Miqu remix)" is not valid. Skipping.
Song 297: "Ra

Song 423: "Rap Game (Solo Mix)"
Song 424: "Wicked Shit Freestyle"
Song 425: "Male Prostitute"
Song 426: "The Kids (Unreleased Demo)"
Song 427: "My Name Is (Snippet)"
"What’s Your Intent? (Live)" is not valid. Skipping.
Song 428: "Passin’ out concussions"
Song 429: "Drop It Like It’s Hot Freestyle"
Song 430: "Verse 2"
"Man With Van (Skit)" is not valid. Skipping.
"Nicole (Skit)" is not valid. Skipping.
"Writer’s Block [DJ Premier Remix]" is not valid. Skipping.
Song 431: "The Half-Time Show Freestyle"
"Fat Beats (Skit)" is not valid. Skipping.
Song 432: "Keys To The City"
Song 433: "Canibus Impression Freestyle"
"Record Store 1 (Skit)" is not valid. Skipping.
Song 434: "Insult to Injury"
"Fuck You (Lab Rat Remix)" is not valid. Skipping.
"My Name Is (Instrumental)" is not valid. Skipping.
Song 435: "Business (A Cappella)"
Song 436: "I’m Shady (Snippet)"
"Victory (Remix)" is not valid. Skipping.
Song 437: "Bump Heads - DJ Green Lantern Version"
Song 438: "Guilty Conscience (Snippet)"
"Ju

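The two runs differ only in which entries were skipped. A small sketch for listing them by comparing titles follows; the helper `skipped_titles` is hypothetical, and the sample titles are copied from the two logs above. With the real objects, the title lists could presumably be built as `[song.title for song in artist_all.songs]`, which is an assumption about the `Artist` object.

```python
def skipped_titles(all_titles, song_titles):
    """Return titles found in the full run but not in the songs-only run, in order."""
    kept = set(song_titles)
    return [title for title in all_titles if title not in kept]


# Titles copied from the logs: songs 311-315 of run a), songs 293-294 of run b).
all_run = [
    "Alfred (Intro)",
    "Soap (Skit)",
    "Emulate",
    "Curtains Close (Skit)",
    "I Get Money (Remix)",
]
songs_run = ["Alfred (Intro)", "Emulate"]

skipped = skipped_titles(all_run, songs_run)
# skipped == ["Soap (Skit)", "Curtains Close (Skit)", "I Get Money (Remix)"]
```
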
# 3. Save/load the Genius Artist object <a name="3-bullet"></a> <a href="#0-bullet"> <sup><sup><sup>^</sup></sup></sup></a>

Save and load the Artist objects, which contain the artist's lyrics and metadata, so the download in section 2 does not have to be repeated.

In [None]:
ARTIST_OBJECT_ALL_PATH = settings['ARTIST_OBJECT_ALL_PATH']
ARTIST_OBJECT_SONGS_PATH = settings['ARTIST_OBJECT_SONGS_PATH']

## 3.1 Save the object <a name="31-bullet"></a>

### a) all (songs, non-songs, etc.)

In [None]:
with open(ARTIST_OBJECT_ALL_PATH, 'wb') as handle:
    pickle.dump(artist_all, handle, protocol=pickle.HIGHEST_PROTOCOL)

### b) only songs

In [None]:
with open(ARTIST_OBJECT_SONGS_PATH, 'wb') as handle:
    pickle.dump(artist_songs, handle, protocol=pickle.HIGHEST_PROTOCOL)

## 3.2 Load the object <a name="32-bullet"></a>

### a) all (songs, non-songs, etc.)

In [None]:
with open(ARTIST_OBJECT_ALL_PATH, 'rb') as handle:
    artist_all = pickle.load(handle)

### b) only songs

In [None]:
with open(ARTIST_OBJECT_SONGS_PATH, 'rb') as handle:
    artist_songs = pickle.load(handle)
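
The four save/load cells repeat the same pattern, so they could be folded into two small helpers. The sketch below uses a plain dictionary as a stand-in for the Artist object; the names `save_object` and `load_object` are illustrative.

```python
import os
import pickle
import tempfile


def save_object(obj, path):
    """Pickle `obj` to `path` with the highest available protocol."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_object(path):
    """Load a pickled object from `path`."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# A plain dict stands in for the Artist object in this demo.
demo_artist = {"name": "Eminem", "songs": [{"title": "Rap God"}, {"title": "Killshot"}]}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "artist.pkl")
    save_object(demo_artist, demo_path)
    restored = load_object(demo_path)
```

Note that unpickling can execute arbitrary code, so only load pickle files you created yourself.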

# 4. Convert to a Pandas DataFrame object <a name="4-bullet"></a> <a href="#0-bullet"> <sup><sup><sup>^</sup></sup></sup></a>

Extract only the relevant fields from the raw data (song and album metadata, lyrics, and the top comments) and build a Pandas DataFrame object from them.
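
The two loops below build identical records, so the field extraction could live in one function. This is a sketch only: `build_song_record` is a hypothetical helper, the comments lookup is left out because it needs network access, and the sample values are taken from the "Killshot" row shown in the output below, with the lyrics replaced by a placeholder.

```python
def build_song_record(song, popularity):
    """Flatten one song dictionary into the record layout used in this section."""
    album = song.get("album")
    return {
        "song_id": song["id"],
        "state": song["lyrics_state"],
        "title": song["title"],
        "artist": song["artist"],
        "featured": [artist["name"] for artist in song["featured_artists"]],
        "popularity": popularity,
        # Singles without an album get None for both album fields.
        "album_id": album["id"] if album else None,
        "album_name": album["name"] if album else None,
        "release_date": song["release_date"],
        "lyrics": song["lyrics"],
        "url": song["url"],
    }


# Sample input shaped like one entry of artist.to_dict()['songs'].
sample_song = {
    "id": 3958196,
    "lyrics_state": "complete",
    "title": "Killshot",
    "artist": "Eminem",
    "featured_artists": [],
    "album": None,
    "release_date": "2018-09-14",
    "lyrics": "(lyrics omitted)",
    "url": "https://genius.com/Eminem-killshot-lyrics",
}

record = build_song_record(sample_song, popularity=2)
```

With such a helper, each loop reduces to `for rank, song in enumerate(songs, start=1)` plus the comments request.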

### a) all (songs, non-songs, etc.)

In [None]:
dataset_all = []

songs = artist_all.to_dict()['songs']

for popularity, song in tqdm(enumerate(songs)):
    song_record = {}
    
    try:
        # 1. song id
        song_record['song_id'] = song['id']
        # 2. lyrics state
        song_record['state'] = song['lyrics_state']
        # 3. title
        song_record['title'] = song['title']
        # 4. artist
        song_record['artist'] = song['artist']
        # 5. featured with
        song_record['featured'] = [artist['name'] for artist in song['featured_artists']]
        # 6. song popularity
        song_record['popularity'] = popularity + 1
        # 7. album id
        song_album = song['album']
        song_record['album_id'] = song_album['id'] if song_album else None
        # 8. album name
        song_record['album_name'] = song_album['name'] if song_album else None
        # 9. song's release date
        song_record['release_date'] = song['release_date']
        # 10. lyrics
        song_record['lyrics'] = song['lyrics']
        # 11. Genius lyrics page
        song_record['url'] = song['url']

        # 12. comments - sorted by votes (only the top 50 most voted comments)
        song_comments = api.song_comments(song['id'], per_page=50, page=1)
        song_record['comments'] = [comment['body']['plain'] for comment in song_comments['comments']]
        # 13. comments total count
        song_record['comments_total_count'] = song_comments['total_count']
        
        # add to the dataset
        dataset_all.append(song_record) 
        
    except Exception as e:
        print(f"\nSong: {song['title']} (id:{song['id']})")
        print(e, '\n')

586it [07:36,  1.28it/s]


In [None]:
eminem_all_df = pd.DataFrame(dataset_all)
eminem_all_df.head()

| | song_id | state | title | artist | featured | popularity | album_id | album_name | release_date | lyrics | url | comments | comments_total_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 235729 | complete | Rap God | Eminem | [] | 1 | 672689.0 | The Marshall Mathers LP2 (Deluxe) | 2013-10-14 | [Intro]\n"Look, I was gonna go easy on you not... | https://genius.com/Eminem-rap-god-lyrics | [It’s clear now to everyone who the King and t... | 459 |
| 1 | 3958196 | complete | Killshot | Eminem | [] | 2 | | | 2018-09-14 | [Intro]\nYou sound like a bitch, bitch\nShut t... | https://genius.com/Eminem-killshot-lyrics | [Eminem Sucks haha, GENIUS NOT EVEN READY LOL,... | 1252 |
| 2 | 5180439 | complete | Godzilla | Eminem | [Juice WRLD] | 3 | 594809.0 | Music to Be Murdered By | 2020-01-17 | [Intro]\nUgh, you're a monster\n\n[Verse 1: Em... | https://genius.com/Eminem-godzilla-lyrics | [This definitely takes the spot of most unexpe... | 624 |
| 3 | 207 | complete | Lose Yourself | Eminem | [] | 4 | 452012.0 | The Singles | 2002-10-28 | [Intro]\nLook, if you had one shot or one oppo... | https://genius.com/Eminem-lose-yourself-lyrics | [This was Eminem at the top of the entire rap ... | 265 |
| 4 | 235732 | complete | The Monster | Eminem | [Rihanna] | 5 | 672689.0 | The Marshall Mathers LP2 (Deluxe) | 2013-10-29 | [Intro: Rihanna]\nI'm friends with the monster... | https://genius.com/Eminem-the-monster-lyrics | [Why do people hate that Eminem is different n... | 171 |


### b) only songs

In [None]:
dataset_songs = []

songs = artist_songs.to_dict()['songs']

for popularity, song in tqdm(enumerate(songs)):
    song_record = {}
    
    try:
        # 1. song id
        song_record['song_id'] = song['id']
        # 2. lyrics state
        song_record['state'] = song['lyrics_state']
        # 3. title
        song_record['title'] = song['title']
        # 4. artist
        song_record['artist'] = song['artist']
        # 5. featured with
        song_record['featured'] = [artist['name'] for artist in song['featured_artists']]
        # 6. song popularity
        song_record['popularity'] = popularity + 1
        # 7. album id
        song_album = song['album']
        song_record['album_id'] = song_album['id'] if song_album else None
        # 8. album name
        song_record['album_name'] = song_album['name'] if song_album else None
        # 9. song's release date
        song_record['release_date'] = song['release_date']
        # 10. lyrics
        song_record['lyrics'] = song['lyrics']
        # 11. Genius lyrics page
        song_record['url'] = song['url']

        # 12. comments - sorted by votes (only the top 50 most voted comments)
        song_comments = api.song_comments(song['id'], per_page=50, page=1)
        song_record['comments'] = [comment['body']['plain'] for comment in song_comments['comments']]
        # 13. comments total count
        song_record['comments_total_count'] = song_comments['total_count']
        
        # add to the dataset
        dataset_songs.append(song_record) 
        
    except Exception as e:
        print(f"\nSong: {song['title']} (id:{song['id']})")
        print(e, '\n')

457it [06:53,  1.25s/it]


Song: WJLB 1998 Promo (id:2868227)
('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')) 



493it [07:20,  1.12it/s]
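
The `Connection aborted` error above caused one song to be dropped from the dataset. One possible way to make a single request more resilient is a small retry wrapper, sketched below. `call_with_retries` is a hypothetical helper, and which exception types to retry is an assumption: the demo uses Python's built-in `ConnectionError`, while the real call would likely need the exception class raised by the HTTP library.

```python
import time


def call_with_retries(func, attempts=3, delay_seconds=0.0, exceptions=(ConnectionError,)):
    """Call `func()` up to `attempts` times, sleeping between failed tries."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            time.sleep(delay_seconds)


# Demo: a stand-in request that fails twice before succeeding.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("Connection aborted.")
    return {"comments": [], "total_count": 0}


result = call_with_retries(flaky_fetch, attempts=3)
```

In the loop, the comments request would be wrapped as `call_with_retries(lambda: api.song_comments(song['id'], per_page=50, page=1), ...)`.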


In [None]:
eminem_songs_df = pd.DataFrame(dataset_songs)
eminem_songs_df.head()

| | song_id | state | title | artist | featured | popularity | album_id | album_name | release_date | lyrics | url | comments | comments_total_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 235729 | complete | Rap God | Eminem | [] | 1 | 672689.0 | The Marshall Mathers LP2 (Deluxe) | 2013-10-14 | [Intro]\n"Look, I was gonna go easy on you not... | https://genius.com/Eminem-rap-god-lyrics | [It’s clear now to everyone who the King and t... | 459 |
| 1 | 3958196 | complete | Killshot | Eminem | [] | 2 | | | 2018-09-14 | [Intro]\nYou sound like a bitch, bitch\nShut t... | https://genius.com/Eminem-killshot-lyrics | [Eminem Sucks haha, GENIUS NOT EVEN READY LOL,... | 1252 |
| 2 | 5180439 | complete | Godzilla | Eminem | [Juice WRLD] | 3 | 594809.0 | Music to Be Murdered By | 2020-01-17 | [Intro]\nUgh, you're a monster\n\n[Verse 1: Em... | https://genius.com/Eminem-godzilla-lyrics | [This definitely takes the spot of most unexpe... | 624 |
| 3 | 207 | complete | Lose Yourself | Eminem | [] | 4 | 452012.0 | The Singles | 2002-10-28 | [Intro]\nLook, if you had one shot or one oppo... | https://genius.com/Eminem-lose-yourself-lyrics | [This was Eminem at the top of the entire rap ... | 265 |
| 4 | 235732 | complete | The Monster | Eminem | [Rihanna] | 5 | 672689.0 | The Marshall Mathers LP2 (Deluxe) | 2013-10-29 | [Intro: Rihanna]\nI'm friends with the monster... | https://genius.com/Eminem-the-monster-lyrics | [Why do people hate that Eminem is different n... | 171 |


## 4.1 Store the DataFrame object <a name="41-bullet"></a>

### a) all (songs, non-songs, etc.)

In [None]:
EMINEM_DF_ALL_RAW_PATH = settings['EMINEM_DF_ALL_RAW_PATH']

with open(EMINEM_DF_ALL_RAW_PATH, 'wb') as handle:
    pickle.dump(eminem_all_df, handle, protocol=pickle.HIGHEST_PROTOCOL)

### b) only songs

In [None]:
EMINEM_DF_SONGS_RAW_PATH = settings['EMINEM_DF_SONGS_RAW_PATH']

with open(EMINEM_DF_SONGS_RAW_PATH, 'wb') as handle:
    pickle.dump(eminem_songs_df, handle, protocol=pickle.HIGHEST_PROTOCOL)