<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Synopsis" data-toc-modified-id="Synopsis-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Synopsis</a></span></li><li><span><a href="#Imports" data-toc-modified-id="Imports-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Data-Collection" data-toc-modified-id="Data-Collection-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data Collection</a></span></li><li><span><a href="#Sources" data-toc-modified-id="Sources-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Sources</a></span></li></ul></div>

# Synopsis

Drake is one of the most successful rappers of all time. However, he is not the best. There is a difference. In many people's eyes, Drake is an amazing singer, songwriter, rapper, and occasional dancer. Yet he would not be compared to the likes of Jay-Z, Nas, J. Cole, or Kendrick Lamar.

Why? I want to know what sets Drake apart and makes him one of the most successful rappers today despite not being considered a top lyricist.

# Imports

In [3]:
# import necessary libraries
import codecs
import csv
import glob
import json
import pickle
import re
import time

import lyricsgenius
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import tqdm

# importing warnings to turn them off (note: this silences every category, not only FutureWarning)
import warnings
warnings.simplefilter(action='ignore')

# Data Collection

We will understand Drake through the lens of his lyrics versus those of other artists. I will gather the lyrics through the Genius.com API. It lets you pass in an artist, set how many songs to fetch, and sort them by popularity or title. I chose the top 10 songs by popularity according to the Genius API. I felt this was appropriate because it lets us understand each artist the way the general public understands them. Obscure songs would probably give us a distorted view.

For now I chose artists who are popular and still relevant. Three of them (Jay-Z, Nas, and Eminem) are considered great lyricists by the general public (if you go by message boards and the like). The other two are Drake's contemporaries (Kanye West and Future). Hopefully we can draw a sharp contrast between the artists to show what makes Drake stand out.

Once the data is read in, I will combine it into one large corpus per artist, ready for pre-processing.
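
The "one corpus per artist" step can be sketched without touching the API. This is a minimal illustration, assuming each saved file parses to a dictionary with a `songs` list whose entries carry a `lyrics` string (the shape the notebook reads further down). The helper `build_corpus` and the sample payloads are hypothetical and not part of the notebook.

```python
# Minimal sketch: join every song's lyrics from parsed, Genius-style JSON payloads.
# build_corpus and sample_payloads are hypothetical names used only for illustration.

def build_corpus(payloads):
    """Concatenate the lyrics of every song in every payload, one song per block."""
    parts = []
    for payload in payloads:
        for song in payload.get("songs", []):
            lyrics = song.get("lyrics")
            if lyrics:  # skip songs whose lyrics are missing or empty
                parts.append(lyrics)
    return "\n".join(parts)


sample_payloads = [
    {"songs": [{"lyrics": "[Verse 1]\nfirst song"}, {"lyrics": None}]},
    {"songs": [{"lyrics": "[Hook]\nsecond song"}]},
]

corpus = build_corpus(sample_payloads)
print(f"Corpus is {len(corpus)} characters long")
```

Joining with a newline keeps the last word of one song from running into the first word of the next, which matters once the corpus is tokenized.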

In [3]:
# This is the list of artists that I want to view for this project

artist_list = ['Drake', 'Jay-z', 'Nas', 'Eminem', 'Future', 'Kanye West']

In [7]:
# Search for the artist by first establishing the Genius API client

import os

def search_artist(name, max_amount_of_songs, how_to_sort_it='popularity'):
    # Read the API token from the environment instead of hard-coding it.
    # GENIUS_ACCESS_TOKEN is an assumed variable name; set it before running.
    token = os.environ['GENIUS_ACCESS_TOKEN']
    genius = lyricsgenius.Genius(token)
    genius.verbose = False  # set to True to see progress messages
    artist = genius.search_artist(name, max_songs=max_amount_of_songs,
                                  sort=how_to_sort_it, get_full_info=True)
    return artist.save_lyrics()  # writes the lyrics to disk

In [None]:
# gather the data from the api

for art in tqdm.tqdm_notebook(artist_list):
    search_artist(art,10)
    time.sleep(60)

In [11]:
# Create a raw corpus of all the lyrics

d = {}
def to_raw_corpus(artists_name):
    """
    The glob module finds all the pathnames matching a specified pattern according
    to the rules used by the Unix shell, although results are returned in arbitrary order.
    
    This function allows you to glob or grab all of the files at once to place
    in the corpus for each artist. The first half will gather the information
    from where it is saved. The second half will place it into a string.
    """
    # fix name if it has unusual characters in it
    
    artists_name = re.sub(r'[^\w\s]', '', artists_name)  # drop punctuation
    artists_name = re.sub(r'\s+', '', artists_name)  # drop whitespace
    
    # Glob the files, aggregate the file names
    
    str_ = f'lyrics_{artists_name.lower()}*' # create a string of the artist for the file name
    print(f"This is what str_ looks like right now: {str_}")
    print()
    d[f'lyrics_{artists_name}'] = sorted(glob.glob(str_)) 
    print(f"This is what the {artists_name} key looks like right now: {d[f'lyrics_{artists_name}']}")
    
    # put it into its own corpus
    
    corpus_raw = ""
    for lyrics in d[f'lyrics_{artists_name}']:
        print(f'Reading {lyrics}')
        with codecs.open(lyrics, "r", encoding="utf-8") as lyrics_file:
            data = json.load(lyrics_file)
            # add every song in the file, not only the first one
            for song in data['songs']:
                corpus_raw += song['lyrics'] or ""
            print(f'Corpus is now {len(corpus_raw)} characters long')
            print()
            
    # save the corpus
    
    with codecs.open(f"../Datasets/Pickled_Files/{artists_name}_corpus_raw", "wb") as file:
        pickle.dump(corpus_raw, file)
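
The name clean-up at the top of `to_raw_corpus` can be checked in isolation. A minimal sketch, assuming a hypothetical helper `normalize_artist_name` (not part of the notebook) that strips punctuation first and whitespace second, so that only those characters are removed and every letter survives:

```python
import re


def normalize_artist_name(name):
    """Return the artist name with punctuation and whitespace removed."""
    no_punctuation = re.sub(r"[^\w\s]", "", name)  # keep word characters and spaces
    return re.sub(r"\s+", "", no_punctuation)      # then drop all whitespace


for artist in ["Drake", "Jay-z", "Kanye West"]:
    print(artist, "->", normalize_artist_name(artist))
```

The cleaned name is what ends up in the glob pattern and in the pickle file name, so both functions that use it need to clean names the same way.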

In [None]:
# ...and now that we have the function ready, let's collect.

for art in artist_list:
    print(f"initiating raw corpus read for {art}")
    to_raw_corpus(art)
    print("Done")
    print()

In [14]:
# creating a function to send the lyrics to a dataframe

def to_data_frame(artist_list):
    
    """
    We are going to need to place this information into a dataframe. Easier to
    manipulate that way. We will start by creating a dictionary of key: value
    pairs. The artist is the key, their lyrics are the value. After it is opened
    and saved to the artist's key we will return the data frame and establish
    the columns.
    """
    data_dict_for_DataFrame = {}
    for art in artist_list:
        art = re.sub(r'[^\w\s]', '', art)  # drop punctuation
        art = re.sub(r'\s+', '', art)  # drop whitespace
        print(f"This is the artist: {art}")
        with codecs.open(f'../Datasets/Pickled_Files/{art}_corpus_raw', 'rb') as pickle_file:
            data_dict_for_DataFrame[art] = pickle.load(pickle_file)
    return pd.DataFrame(list(data_dict_for_DataFrame.items()), columns=['Artist', 'Lyrics'])

In [15]:
# Using the function created above let's execute...

raw_dataframe = to_data_frame(artist_list)
raw_dataframe

This is the artist: Drake
This is the artist: Jayz
This is the artist: Nas
This is the artist: Eminem
This is the artist: Future
This is the artist: KanyeWest


| | Artist | Lyrics |
|---|---|---|
| 0 | Drake | [Produced by Boi-1da, Frank Dukes, Noah "40" S... |
| 1 | Jayz | [Intro: Hannah Williams]\nDo I find it so hard... |
| 2 | Nas | [Produced by Ron Browz]\n\n[Intro]\nFuck Jay Z... |
| 3 | Eminem | [Verse 1]\nNow this shit's about to kick off, ... |
| 4 | Future | [Intro]\nHigh Klassified な音楽\nI got the truth ... |
| 5 | KanyeWest | [Produced By Daft Punk & Kanye West]\n\n[Verse... |


In [None]:
# Save the file for the next notebook

with codecs.open('Raw_Dataframe', 'wb') as f:
    pickle.dump(raw_dataframe,f)
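
The next notebook has to read this file back. Below is a minimal sketch of the save-and-reload round trip, using a plain dictionary as a stand-in for `raw_dataframe` and a temporary directory in place of the real path. The placeholder lyrics and variable names are illustrative assumptions.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for raw_dataframe; any picklable object round-trips the same way.
records = {
    "Drake": "[Intro]\nplaceholder lyrics",
    "Nas": "[Verse 1]\nplaceholder lyrics",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "Raw_Dataframe"
    path.write_bytes(pickle.dumps(records))      # save, as this notebook does
    restored = pickle.loads(path.read_bytes())   # reload, as the next notebook would

print(restored == records)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.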

# Sources

The following are sources I used to help guide me through this project.

1. https://github.com/johnwmillr/LyricsGenius 
2. https://stackoverflow.com/questions/47400466/using-genius-api
3. http://www.storybench.org/download-song-lyrics-genius-using-python/
4. https://www.johnwmillr.com/trucks-and-beer/
5. http://jdaytn.com/posts/download-blink-182-data/
6. https://github.com/Hugo-Nattagh/2017-Hip-Hop
7. https://towardsdatascience.com/does-country-music-drink-more-than-other-genres-a21db901940b
8. https://towardsdatascience.com/49-years-of-lyrics-why-so-angry-1adf0a3fa2b4
9. https://medium.com/@RareLoot/how-to-download-an-artists-lyrics-from-genius-com-using-python-984d298951c6