# 1.A. Collection -- Lyrics -- Genius API

This notebook kicks off our project. Lyrical data and artist metadata retrieved from **Genius.com**, a lyrics and music metadata site and community, will form the backbone of our dataset. To access information on Genius.com, we need to tap into their API. We'll leverage a Python library called **lyricsgenius**, which handles Genius API searches and has built-in functionality for handling errant calls. Our workflow will be as follows:

 1. We need to start with a comprehensive list of rappers. **Wikipedia** will be our seed source of rapper names. *NOTE: a significant amount of work was done after the initial Genius pull was complete to improve our initial set of rappers. This included combining, deduplicating, and manually cleaning artist names across all of our data sources.*
 2. With our list of rappers, we will leverage **lyricsgenius** to search for each artist. For the purposes of this project, we have elected to limit ourselves to **50 songs per artist.**
 3. For each call we make to Genius, we will extract up to 50 songs, along with fields such as *producers, featured artists, lyrics (in list format), album, and date*.
 

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import time
import requests
import seaborn as sns
import warnings
import matplotlib.pyplot as plt
import lyricsgenius

import sys
import spotipy
import spotipy.util as util

## Wikipedia Artist Pull

Wikipedia has a nice, comprehensive list of hip hop musicians that we'll grab first to kick off our project. Because we are using BeautifulSoup here and many of the `<a>` tags we pick up are errant, we'll take the time to clean out any gunk we extract alongside our list of artists.

In [3]:
#Build our URL for wikipedia and save our Soup
url = 'https://en.wikipedia.org/wiki/List_of_hip_hop_musicians'
res = requests.get(url)
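#Name the parser explicitly so BeautifulSoup does not warn and behaves the same on every machine
wiki_soup = BeautifulSoup(res.content, "lxml")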

#Find all 'a' tags
wiki_entries = wiki_soup.find_all('a')

#grab the text from our entries
wiki_text_list = [entry.text for entry in wiki_entries]

#manually clean out list of gunk
rapper_list_uncleaned_complete = wiki_text_list[37:1613]

#adjust our list to get rid of remaining gunk by checking elements
rapper_list_clean_complete = [x for x in rapper_list_uncleaned_complete if x and
                                                                           "[" not in x and
                                                                           "edit" not in x] 

pd.DataFrame(rapper_list_clean_complete).to_csv('rapper_names.csv', index=False, header=None)
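The filtering step above can be tried on a small sample without hitting Wikipedia. This is a minimal sketch: the helper name `clean_anchor_texts` and the sample strings are hypothetical, and only the three filter conditions come from the cell above.

```python
def clean_anchor_texts(texts):
    """Keep only strings that look like artist names.

    Mirrors the filter used above: drop empty strings, citation
    markers such as "[1]", and section "edit" links.
    """
    return [t for t in texts if t and "[" not in t and "edit" not in t]


# Hypothetical sample of anchor text, standing in for the real page
sample = ["2 Chainz", "", "[1]", "edit", "50 Cent", "A$AP Rocky"]
cleaned = clean_anchor_texts(sample)
print(cleaned)
```

One thing to keep in mind is that the `"edit"` check is a substring match, so any artist name that happens to contain those four letters in sequence would be dropped as well and would need to be added back by hand.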





## Authentication and Param Setting for APIs


With our list of artists, it's now time to grab songs for each corresponding artist from **Genius.com**. We start by instantiating a connection via lyricsgenius with our *token* for the Genius API. The library exposes settings that let us skip non-song entries and exclude songs with "(Live)" in the title, which cuts down on noise and duplication.

In [2]:
'''GENIUS'''

#using our token, connect to Genius. Avoid non-song entries, and avoid live versions of songs when grabbing lyrics
#NOTE: replace the placeholder below with your own Genius API token; never commit a real token to a shared notebook
genius = lyricsgenius.Genius('INSERT PASSCODE HERE')

#Set up some of our vars for pulling our data
genius_results_per_page = 50
max_songs_per_artist = 50
genius.skip_non_songs = True
genius.excluded_terms = ['(Live)']
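One way to keep the token out of the notebook entirely is to read it from an environment variable. The sketch below is an assumption about how that could look, not part of the original workflow: the helper `load_genius_token` and the variable name `GENIUS_ACCESS_TOKEN` are hypothetical names chosen for illustration.

```python
import os


def load_genius_token(env=None, var_name="GENIUS_ACCESS_TOKEN"):
    """Return the Genius API token stored in an environment variable.

    GENIUS_ACCESS_TOKEN is a hypothetical variable name used for this sketch.
    """
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set {var_name} before running this notebook.")
    return token


# Usage in the notebook (requires a real token in the environment):
# genius = lyricsgenius.Genius(load_genius_token())

# Demonstration with an explicit mapping instead of the real environment
demo_token = load_genius_token({"GENIUS_ACCESS_TOKEN": "  abc123  "})
print(demo_token)
```

Failing early with a clear message is a deliberate choice here: a missing token would otherwise only surface later as a failed API call in the middle of a long pull.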



### Bring in Genius Names, isolate to top names. Pass to API

In [18]:
#Bring in the list of artist names that we pulled from Wikipedia to begin our scraping process. NOTE: A lot of iterative
#and manual work was done here; multiple pulls were made later in the process. However, this was our base approach.
#Read the file with its header row so that the 'wiki_name' column can be selected by name.
rapper_list_for_genius_df = pd.read_csv('combined_rapper_data_1.csv')
rapper_list_for_genius = list(rapper_list_for_genius_df['wiki_name'])
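The deduplication mentioned in the workflow notes is not shown in this notebook, so the following is only a sketch of one plausible approach, assuming names should match regardless of case and surrounding whitespace. The function name `dedupe_names` and the sample names are hypothetical.

```python
def dedupe_names(names):
    """Drop repeated names, ignoring case and surrounding whitespace.

    The first spelling encountered is the one that is kept.
    """
    seen = set()
    unique = []
    for name in names:
        key = name.strip().casefold()
        if key and key not in seen:
            seen.add(key)
            unique.append(name.strip())
    return unique


# Hypothetical sample with case and whitespace variants
sample_names = ["Nas", "nas ", "MF DOOM", "", "MF Doom", "Rakim"]
deduped = dedupe_names(sample_names)
print(deduped)
```

Running something like this before the Genius pull avoids spending API calls on the same artist twice; it would not catch aliases or alternate spellings, which is presumably where the manual cleaning came in.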

## Sample Pull for Genius

In this step, we'll start by instantiating an empty dataframe. We'll then iteratively search for each artist's name using the lyricsgenius wrapper. For each song returned, we will capture its features (song, artist, album, date, lyrics, features, producers) and append each one to a running list for that feature. Once we're through an artist's songs, we'll read those lists into our dataframe.
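The per-song capture described above can be sketched without calling the API. This is a minimal illustration, not the notebook's implementation: `songs_to_rows` and `names_or_blank` are hypothetical helpers, and `SimpleNamespace` stands in for the song objects that lyricsgenius would return. The attribute names (`title`, `album`, `year`, `lyrics`, `featured_artists`, `producer_artists`) are the ones the notebook itself reads.

```python
from types import SimpleNamespace


def names_or_blank(entries):
    """Return the 'name' of each entry, or '' if the data is missing or malformed."""
    try:
        return [entry["name"] for entry in entries]
    except (TypeError, KeyError):
        return ""


def songs_to_rows(artist_name, songs):
    """Build one dict per song, keyed by the dataframe's column names."""
    rows = []
    for song in songs:
        rows.append({
            "artist": artist_name,
            "song": song.title,
            "album": song.album,
            "date": song.year,
            "lyrics": song.lyrics,
            "features": names_or_blank(getattr(song, "featured_artists", None)),
            "producers": names_or_blank(getattr(song, "producer_artists", None)),
        })
    return rows


# Stand-in object; a real run would pass artist.songs from lyricsgenius
fake_song = SimpleNamespace(title="Song A", album="Album A", year="2001-01-01",
                            lyrics="la la", featured_artists=[{"name": "Guest"}],
                            producer_artists=None)
rows = songs_to_rows("Some Artist", [fake_song])
print(rows[0]["features"], repr(rows[0]["producers"]))
```

A list of dicts like `rows` can be passed straight to `pd.DataFrame(rows)`, so collecting rows for all artists and building the frame once is an alternative to concatenating a new frame per artist.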

In [15]:
#instantiate a dataframe to store our data
rapper_df = pd.DataFrame(columns = ['artist', 'song', 'album', 'date', 'lyrics', 'features', 'producers'])

In [1]:
#We're going to keep a running tally of artists that weren't successfully pulled down from the API
skipped_artist_genius_pull = []

#We're going to have a flag that ensures we continue to run our process until we reach the tail end of our list
loop_iterator = 1
while(loop_iterator):
    
    #build a list of rappers
    rapper_query_list = rapper_list_for_genius

    #grab all the data we can from genius for each rapper
    for rapper in rapper_query_list:

        try:



                #conduct search for artist and corresponding data; store our results
                artist = genius.search_artist(rapper, per_page=genius_results_per_page,  max_songs=max_songs_per_artist)

                #Build temporary dict for each iteration of a different rapper
                rapper_dict ={}

                #store name
                rapper_dict['artist'] = artist.name

                #ready lists for appending our data to
                song_list = []
                lyrics_list = []
                date_list = []
                album_list = []
                featured_list = []
                producers_list = []



                #For each song we have for our artist, store that song's information
                for song in artist.songs:
                    lyrics_list.append(song.lyrics)
                    album_list.append(song.album)
                    song_list.append(song.title)
                    date_list.append(song.year)

                    #Grab track features. If we can't, leave it blank
                    try:
                        featured_list.append([feature['name'] for feature in song.featured_artists])
                    except Exception:
                        featured_list.append('')

                    #Grab track producers. If we can't, leave it blank
                    try:
                        producers_list.append([producer['name'] for producer in song.producer_artists])
                    except Exception:
                        producers_list.append('')

                #With our list of information for each type of data, store it in our dict.    
                rapper_dict['lyrics'] = lyrics_list
                rapper_dict['album'] = album_list
                rapper_dict['date'] = date_list
                rapper_dict['song'] = song_list
                rapper_dict['features'] = featured_list
                rapper_dict['producers'] = producers_list

                #attach our dict to our df. Rinse and repeat.
                rapper_df = pd.concat([rapper_df, pd.DataFrame(rapper_dict)], ignore_index=True, axis=0)
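        #Catch Exception rather than using a bare except so that a keyboard interrupt can still stop the run
        except Exception:
                skipped_artist_genius_pull.append(rapper)
                print('sleeping because of error, skipping artist')
                time.sleep(60)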
                
        #If we've hit the 7th to last artist, we can stop. We don't need the remainder
        if(rapper == rapper_list_for_genius[-7]):
            loop_iterator = 0
            break
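The loop above skips an artist after a single failure and sleeps for 60 seconds. A variation, sketched here under stated assumptions, is to retry a few times before giving up. `fetch_with_retry` is a hypothetical helper, `flaky_search` is a fake stand-in for the real search call, and the retry count is an arbitrary choice for illustration.

```python
import time


def fetch_with_retry(search, name, attempts=3, wait_seconds=60, sleep=time.sleep):
    """Call search(name), pausing and retrying when it raises.

    Returns the search result, or None once every attempt has failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return search(name)
        except Exception as err:
            print(f"attempt {attempt} for {name!r} failed: {err}")
            if attempt < attempts:
                sleep(wait_seconds)
    return None


# Demonstration with a fake search that fails once, then succeeds
calls = []


def flaky_search(name):
    calls.append(name)
    if len(calls) == 1:
        raise TimeoutError("simulated timeout")
    return {"name": name}


pauses = []  # record the requested waits instead of actually sleeping
result = fetch_with_retry(flaky_search, "Some Artist", sleep=pauses.append)
skipped = [] if result is not None else ["Some Artist"]
print(result, pauses, skipped)
```

In the notebook, `search` would be a small function wrapping `genius.search_artist` with the page and song limits set earlier. A `None` result, whether from repeated failures or from a search that found nothing, would send the artist to the skipped list.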


In [20]:
rapper_df.to_csv('tuesday_genius_dataset_additions_2.csv', index=False)