## 01_Collection: Retrieving RIAA and Spotify Data for Analysis

**Description**: Extracting data from the RIAA Gold and Platinum website and from Spotify's API to serve as the dataset for this recommender system.

**Disclaimer**: Since certain processes within this notebook require API keys (which are not stored within this notebook), it is not possible to run every cell from start to finish. If you'd like to do so, you'll need to request Spotify API access with client credentials [here](https://developer.spotify.com/dashboard/login).

## Table of Contents

1. [Retrieving RIAA Data](#1)
2. [Retrieving Artists from Spotify API](#2)
3. [Retrieving Top-10 Tracks for Every Artist](#3)
4. [Moving Genres Into Separate Table & Creating Lookup Table](#4)

In [1]:
import csv
import json
import os
import pickle
import re
import time

import numpy as np
import pandas as pd

import splinter
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())

from library import song_feat_pull, genre_p_artist, genre_psong

<a name="1"></a>
## 1. Retrieving RIAA Data  

*Original Date of Extracted Elements: 9/22/18*

Here, I'm querying the RIAA Gold and Platinum [website](https://www.riaa.com/gold-platinum/) for all artists who have earned a sales award at any point in its history. These artists will then be queried against the Spotify API for their top-10 most popular songs. Retrieving songs from an assortment of RIAA award-winning artists should yield a relatively diverse song dataset, and one filled with titles that most users of this recommender would recognize and be able to suggest themselves. That matters because cosine similarity is precomputed between the titles within this dataset (more in `04_cosine.ipynb`).

The data is extracted using [Splinter](https://splinter.readthedocs.io/en/latest/), a Python library used for browser automation and website testing.

### 1a. Using Splinter to Access RIAA Website

In [3]:
from splinter import Browser
browse = Browser()
browse.visit('https://www.riaa.com/gold-platinum/?tab_active=awards_by_artist&col=artist&ord=asc#search_section')

#### Locating the "Load More" Button, Which Generates More Table Rows

In [4]:
load_more = browse.find_by_xpath('//*[@id="loadmore"]')
load_more.click()

#### Extracting all Artists in RIAA table

In [13]:
for i in range(200):
    load_more.click()
    time.sleep(2)
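The fixed count of 200 clicks above was chosen by hand. As a hedged alternative, the sketch below keeps clicking until the button can no longer be found. The helper name `click_until_exhausted` and its parameters are hypothetical, and it assumes the "Load More" button disappears from the page once every row is loaded, which this notebook does not confirm.

```python
import time
from typing import Callable, Sequence


def click_until_exhausted(find_button: Callable[[], Sequence],
                          max_clicks: int = 500,
                          pause: float = 2.0) -> int:
    """Click the first element returned by `find_button` until none is returned.

    `find_button` is any zero-argument callable that returns a (possibly
    empty) sequence of clickable elements. Returns the number of clicks made.
    """
    clicks = 0
    while clicks < max_clicks:          # hard cap so the loop always ends
        buttons = find_button()
        if len(buttons) == 0:           # button gone: assume the table is fully loaded
            break
        buttons[0].click()
        clicks += 1
        time.sleep(pause)               # give the page time to append new rows
    return clicks
```

With the browser from this section, the call would look like `click_until_exhausted(lambda: browse.find_by_xpath('//*[@id="loadmore"]'))`.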

### 1b. Extracting all artist information by `td class=artists_cell`

In [14]:
all_tds = browse.find_by_tag('td[class="artists_cell"]')
artist_list_td = [element.value for element in all_tds]

In [16]:
len(artist_list_td)

2514

#### Sanity Check of Artist Listing

In [5]:
artist_list_td[:10]

["'N SYNC",
 '"WEIRD AL" YANKOVIC',
 '10 YEARS',
 '10,000 MANIACS',
 '112',
 '2 CHAINZ',
 '2 LIVE CREW',
 '2 PAC',
 '2 PAC & OUTLAWZ',
 '2 UNLIMITED']

#### Pickling List for Access Later

In [17]:
with open('../pickle/artist_list_td.pkl', 'wb+') as f:
    pickle.dump(artist_list_td, f)

<a name="2"></a>
## 2. Retrieving Artists from Spotify API

Now that I have a base of artists to query the Spotify API with, I'll do so with the help of [Spotipy](https://spotipy.readthedocs.io/en/latest/#), a Python wrapper for the API.

### 2a. Logging Into Spotify


In [18]:
client_credentials_manager = SpotifyClientCredentials(client_id=os.getenv('S_CLIENT_ID'),
                                                      client_secret=os.getenv('S_CLIENT_SECRET'))
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

### 2b. Retrieving Artist IDs

In order to retrieve an artist's top-10 tracks, the `artist_id` for each artist must first be found through the API's `search` endpoint (shown below).

#### Example Search

As you can see, each artist search can return multiple results. Spotify lists the result it considers the closest match to your search term first, though it may not always be the correct artist.

In [11]:
artist = sp.search("'N SYNC", type='artist')
artist

{'artists': {'href': 'https://api.spotify.com/v1/search?query=%27N+SYNC&type=artist&offset=0&limit=10',
  'items': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/6Ff53KvcvAj5U7Z1vojB5o'},
    'followers': {'href': None, 'total': 768549},
    'genres': ['boy band', 'dance pop', 'pop', 'post-teen pop'],
    'href': 'https://api.spotify.com/v1/artists/6Ff53KvcvAj5U7Z1vojB5o',
    'id': '6Ff53KvcvAj5U7Z1vojB5o',
    'images': [{'height': 1173,
      'url': 'https://i.scdn.co/image/c56cf0cc89c8ecfec7145cf065ea2006d0706605',
      'width': 1000},
     {'height': 751,
      'url': 'https://i.scdn.co/image/b119bdb5ef8a56c1b4aaa552f3ed686fab4aaff0',
      'width': 640},
     {'height': 235,
      'url': 'https://i.scdn.co/image/fc8b9594cca9210bb91ec1d63172ef021217905b',
      'width': 200},
     {'height': 75,
      'url': 'https://i.scdn.co/image/02f562244708fb6e951a93b1b9131d2fb9bfba06',
      'width': 64}],
    'name': '*NSYNC',
    'popularity': 71,
    'type': 'artist',
  

#### Storing "Most Likely" Artist Results in List

As opposed to iterating through multiple results for every artist in my dataset, I only saved the first result of each search. Afterwards, I did a manual quality check of the entire list, and changed out a few artists who were incorrectly retrieved. 

In [9]:
with open('../pickle/artist_list_td.pkl', 'rb') as f:
    artist_list_td = pickle.load(f)

In [60]:
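confident_list = []
for artist in artist_list_td:
    results = sp.search(artist, type='artist')
    try:
        # Keep only the first (most likely) match for each artist
        confident_list.append(results['artists']['items'][0])
    except IndexError:
        # An empty `items` list means the search returned nothing
        confident_list.append('no_results')
    time.sleep(1)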
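Taking the first search result is what this notebook does. As a hedged alternative (not the method used here), the sketch below prefers a result whose name matches the query once punctuation, spacing, and case are ignored, and only falls back to the first result otherwise. The helpers `normalize` and `pick_artist` are hypothetical names introduced for illustration.

```python
import re
from typing import Optional


def normalize(name: str) -> str:
    """Lower-case a name and drop everything except ASCII letters and digits."""
    return re.sub(r'[^a-z0-9]', '', name.lower())


def pick_artist(query: str, items: list) -> Optional[dict]:
    """Return the item whose normalized name equals the normalized query.

    Falls back to the first item when nothing matches, and to None when
    `items` is empty.
    """
    target = normalize(query)
    for item in items:
        if normalize(item['name']) == target:
            return item
    return items[0] if items else None


# Toy search results: the intended artist is listed second.
items = [{'name': '8 ball & MJG & Mr E of RPS Fam'}, {'name': '8Ball & MJG'}]
print(pick_artist('8BALL & MJG', items)['name'])
```

Inside the search loop, this would be called as `pick_artist(artist, results['artists']['items'])`. A manual quality check is still worthwhile, since a normalized match can also land on the wrong artist.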

##### Ensuring all Artists Were Searched Through

In [61]:
len(confident_list)

2514

##### Example Result

In [362]:
confident_list[64]

{'external_urls': {'spotify': 'https://open.spotify.com/artist/3dkbV4qihUeMsqN4vBGg93'},
 'followers': {'href': None, 'total': 946690},
 'genres': ['classic soul',
  'funk',
  'memphis soul',
  'quiet storm',
  'soul',
  'soul blues',
  'southern soul'],
 'href': 'https://api.spotify.com/v1/artists/3dkbV4qihUeMsqN4vBGg93',
 'id': '3dkbV4qihUeMsqN4vBGg93',
 'images': [{'height': 640,
   'url': 'https://i.scdn.co/image/3204125720ef02294bbac0ff109a5919c141955f',
   'width': 640},
  {'height': 320,
   'url': 'https://i.scdn.co/image/6df201e877205c39bebe4c420f3b2ef84ad66b17',
   'width': 320},
  {'height': 160,
   'url': 'https://i.scdn.co/image/4c1d71cb0248f47f249b32c66f661f9ef3081ca8',
   'width': 160}],
 'name': 'Al Green',
 'popularity': 68,
 'type': 'artist',
 'uri': 'spotify:artist:3dkbV4qihUeMsqN4vBGg93'}

##### Pickling Listing for Later if Necessary

In [100]:
with open('../pickle/confident_list.pkl', 'wb+') as f:
    pickle.dump(confident_list, f)

##### Counting `no_results`

In [79]:
# Record the position of every artist whose search returned nothing
index_list = [i for i, artist in enumerate(confident_list) if artist == 'no_results']

In [81]:
len(index_list)

135

In [2]:
with open('../pickle/confident_list.pkl', 'rb') as f:
    confident_list = pickle.load(f)

##### Missing Artists: Artists Who Weren't Found via "Most Likely" Search

In [21]:
missing_artists = [(i, artist_list_td[i])
                   for i, result in enumerate(confident_list)
                   if result == 'no_results']

##### Sanity Check of List

In [23]:
missing_artists[:5]

[(8, '2 PAC & OUTLAWZ'),
 (11, '21 SAVAGE & METRO BOOMIN'),
 (36, 'A. DORATI / MINNEAPOLIS SYMPHONY'),
 (117, 'ANI DI FRANCO'),
 (165, 'BABY AKA THE #1 STUNNA')]

In [98]:
with open('../pickle/missing_artists.pkl', 'wb+') as f:
    pickle.dump(missing_artists, f)

##### Manually Finding Artists

This is the result of a manual search for artists through Spotify's app. Collected into a list, the artists can now have their IDs retrieved through the API.

In [112]:
mal = [
    (8, "2Pac Outlawz"), (36, "Minneapolis Symphony Orchestra"), (117, "Ani DiFranco"), (208, "Herbert von Karajan"),
    (215, "Big Audio Dynamite"), (268, "Mike Bloomfield"), (277, "Bob & Doug McKenzie"), (284, "Earl Klugh"),
    (287, "Bob Rivers"), (295, "Bobby McFerrin"), (328, "Brian McKnight"), (367, "C.W. McCall"), (374, "The Diplomats"),
    (388, "Mahavishnu John McLaughlin"), (397, "CeCe Peniston"), (520, "Dallas Holm"), (549, "WordHarmonic"),
    (627, "D.J. Magic Mike & M.C. Madness"), (641, "Don McLean"), (650, "Donnie McClurkin"), (662, "Mark Beshara"),
    (703, "Edwin McCain"), (759, "Extreme"), (811, "Francis Lai"), (820, "Francesca Battistelli"), (1013, "Nelson Eddy"),
    (1056, "Claude Bolling"), (1067, "The Outlaws"), (1075, "Jesse Johnson"), (1076, "Jesse McCartney"),
    (1115, "Joey McIntyre"), (1185, "KC & The Sunshine Band"), (1274, "Larry Elgart & His Manhatten Swing Orchestra"), 
    (1276, "LL Cool J"), (1309, "Les McCann & Eddie Harris"), (1319, "Lil Jon & The East Side Boyz"), (1323, "Lila McCann"), 
    (1333, "Lisa Lisa & Cult Jam"), (1354, "London Symphony Orchestra"), (1357, "Loreena McKennitt"),
    (1360, "Los Angeles Azules"), (1424, "Marco Antonio Solís"), (1430, "Marilyn McCoo & Billy Davis Jr."),
    (1463, "Frankie Beverly"), (1465, "McBride & The Ride"), (1466, "MC Eiht"), (1467, "McFadden & Whitehead"),
    (1524, "Mindy McCready"), (1569, "Leonard Bernstein"), (1588, "Neal McCoy"),(1678, "Maurice André"),
    (1799, "REO Speedwagon"), (1832, "Real McCoy"), (1833, "Reba McEntire"), (1844, "Return To Forever"),
    (1870, "Rob Base & DJ EZ Rock"), (1929, "SWV"), (1953, "Sarah McLachlan"), (2043, "Tha Eastsidaz"),
    (2206, "Dallas Brass"), (2209, "Deftones"), (2253, "Sesame Street"), (2279, "Talking Heads"),
    (2306, "38 Special"),(2366, "Trey Parker"),(2371, "Trin-I-Tee 5:7"), (2394, "UB40"),
    (2401, "U.S.A. For Africa"),(2418, "VeggieTales"),(2426, "Victoria Justice"),(2440, "Walter Murphy"),
]

In [125]:
len(missing_artists) - len(mal)

64

##### Missing Artists Not Included

If artists from the `missing_artists` list are not included in the `mal` list, it is for one of the following reasons:
- Spotify doesn't have the artist
- It's a combo of artists, and at least one of them appears elsewhere on the list

In total, 64 artists were left out of the `mal` list.
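To see which artists were left out, rather than only how many, the two lists can be compared by their index values. This is a minimal sketch on toy stand-ins for `missing_artists` and `mal`; the variable names `resolved` and `unresolved` are introduced here for illustration.

```python
# Toy stand-ins for the `missing_artists` and `mal` lists built above.
missing_artists = [(8, '2 PAC & OUTLAWZ'),
                   (11, '21 SAVAGE & METRO BOOMIN'),
                   (117, 'ANI DI FRANCO')]
mal = [(8, '2Pac Outlawz'), (117, 'Ani DiFranco')]

# Indices that received a manually corrected name
resolved = {idx for idx, _ in mal}

# Everything in `missing_artists` without a manual correction
unresolved = [(idx, name) for idx, name in missing_artists if idx not in resolved]
print(unresolved)  # [(11, '21 SAVAGE & METRO BOOMIN')]
```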

In [4]:
with open('../pickle/missing_artists.pkl', 'rb') as f:
    missing_artists = pickle.load(f)

##### Grabbing Found Artists Info

In [117]:
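found_artists = []
# Search with the manually corrected names in `mal`; the original RIAA
# spellings in `missing_artists` are the ones that returned no results.
for _, name in mal:
    results = sp.search(name, type='artist')
    try:
        found_artists.append(results['artists']['items'][0])
    except IndexError:
        found_artists.append('no_results')
    time.sleep(0.5)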

In [127]:
with open('../pickle/found_artists.pkl', 'wb+') as f:
    pickle.dump(found_artists, f)

##### Checking to Make Sure There's No Missing Results in `found_artists`

In [122]:
# Collect the positions of any searches that still returned nothing
index_list = [i for i, artist in enumerate(found_artists) if artist == 'no_results']

In [123]:
index_list

[]

### 2c. Creating Dataframe of Full Artist Listing

In [146]:
list(confident_list[0].keys())

['external_urls',
 'followers',
 'genres',
 'href',
 'id',
 'images',
 'name',
 'popularity',
 'type',
 'uri']

Passing `confident_list` (artist list 1 of 2) into a DataFrame by first passing it into a Series. This lets me keep the current structure even though the dict values are not all the same length.

In [206]:
df = pd.DataFrame(pd.Series(confident_list[0]).reset_index()).T
df.columns = ["external_urls", "followers", "genres", "href", "id", "images", 
              "name", "popularity",  "type", "uri"]
df.drop(labels='index', inplace=True)
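The cell above builds the frame from a single artist dict. As a hedged alternative (not the approach used in this notebook), `pd.DataFrame` also accepts a list of dicts directly, keeping list-valued fields such as `genres` intact in each cell. The list `confident_demo` is a toy stand-in for `confident_list`, including the `'no_results'` sentinel.

```python
import pandas as pd

# Toy stand-in for `confident_list`: artist dicts mixed with the sentinel string.
confident_demo = [
    {'id': 'a1', 'name': 'Artist A', 'genres': ['pop', 'rock'], 'popularity': 50},
    'no_results',
    {'id': 'b2', 'name': 'Artist B', 'genres': [], 'popularity': 30},
]

# Drop the sentinel entries, then let pandas build one row per dict.
records = [artist for artist in confident_demo if artist != 'no_results']
df_demo = pd.DataFrame(records)
print(df_demo.shape)  # (2, 4)
```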

##### Adding `found_artists` into DataFrame

In [250]:
df_2 = pd.DataFrame(pd.Series(found_artists[0]).reset_index()).T

In [251]:
count = 0

for i in found_artists[1:]:
    count += 1
    df_2.loc[count] = pd.Series(i).reset_index().T.iloc[1]

In [255]:
df_2.columns = ["external_urls", "followers", "genres", "href", "id", 
                "images", "name", "popularity",  "type", "uri"]
df_2.drop(labels='index', inplace=True)

##### Combining DataFrames

In [261]:
df = pd.concat([df, df_2])

In [266]:
df.head()

Unnamed: 0,external_urls,followers,genres,href,id,images,name,popularity,type,uri
0,{'spotify': 'https://open.spotify.com/artist/6...,"{'href': None, 'total': 747136}","[boy band, dance pop, europop, pop, post-teen ...",https://api.spotify.com/v1/artists/6Ff53KvcvAj...,6Ff53KvcvAj5U7Z1vojB5o,"[{'height': 1173, 'url': 'https://i.scdn.co/im...",*NSYNC,71,artist,spotify:artist:6Ff53KvcvAj5U7Z1vojB5o
1,{'spotify': 'https://open.spotify.com/artist/1...,"{'href': None, 'total': 337751}","[antiviral pop, comedy rock, comic]",https://api.spotify.com/v1/artists/1bDWGdIC2ha...,1bDWGdIC2hardyt55nlQgG,"[{'height': 563, 'url': 'https://i.scdn.co/ima...","""Weird Al"" Yankovic",59,artist,spotify:artist:1bDWGdIC2hardyt55nlQgG
2,{'spotify': 'https://open.spotify.com/artist/0...,"{'href': None, 'total': 243895}","[alternative metal, nu metal, post-grunge, rap...",https://api.spotify.com/v1/artists/0REMf7H0VP6...,0REMf7H0VP6DwfZ9MbuWph,"[{'height': 640, 'url': 'https://i.scdn.co/ima...",10 Years,60,artist,spotify:artist:0REMf7H0VP6DwfZ9MbuWph
3,{'spotify': 'https://open.spotify.com/artist/0...,"{'href': None, 'total': 108829}","[alternative rock, folk, folk-pop, lilith, mel...",https://api.spotify.com/v1/artists/0MBIKH9DjtB...,0MBIKH9DjtBkv8O3nS6szj,"[{'height': 640, 'url': 'https://i.scdn.co/ima...","10,000 Maniacs",52,artist,spotify:artist:0MBIKH9DjtBkv8O3nS6szj
4,{'spotify': 'https://open.spotify.com/artist/7...,"{'href': None, 'total': 455231}","[boy band, dance pop, gangster rap, hip hop, h...",https://api.spotify.com/v1/artists/7urq0VfqxEY...,7urq0VfqxEYEEiZUkebXT4,"[{'height': 130, 'url': 'https://i.scdn.co/ima...",112,68,artist,spotify:artist:7urq0VfqxEYEEiZUkebXT4


In [None]:
df.dropna(inplace=True)

In [398]:
df.to_csv('../data/spotify_artists_unclean.csv')

### 2c. Cleaning Up Artist DataFrame

In [293]:
di = {'href': None, 'total': 747136}
list(di.values())[1]

747136

##### Extracting `followers` Count from Dict Into Its Own Column

In [392]:
followers = []

for i in df['followers']:
    for k, v in i.items():
        if isinstance(v, int):
            followers.append(v)

In [395]:
df['followers_2'] = followers
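The loop above collects every integer value it finds, which lines up with the rows only when each `followers` dict holds exactly one integer. A minimal sketch of a more direct route, assuming every dict has a `total` key, reads that key per row. The frame `df_demo` is a toy stand-in for the artist DataFrame.

```python
import pandas as pd

# Toy stand-in for the artist DataFrame, with `followers` still stored as dicts.
df_demo = pd.DataFrame([
    {'name': '*NSYNC', 'followers': {'href': None, 'total': 747136}},
    {'name': '10 Years', 'followers': {'href': None, 'total': 243895}},
])

# Pull the `total` value out of each dict, one value per row.
df_demo['followers_2'] = df_demo['followers'].apply(lambda d: d['total'])
print(df_demo['followers_2'].tolist())  # [747136, 243895]
```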

##### Dropping the Original `followers` Column, as it's no Longer Necessary

In [400]:
df.drop('followers', axis=1, inplace=True)

In [406]:
df.to_csv('../data/spotify_artists_clean.csv')

### 2d. Spotify Artists Duplicate Entry Check

In [2]:
df[df['id'].duplicated()].sort_values('id').shape

(40, 11)

It turns out that 40 of the artists I initially pulled were duplicates, which resulted in a number of duplicate song pulls from the Spotify API.
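One way to avoid the duplicate song pulls is to de-duplicate the artist IDs before querying for top tracks. This is a minimal sketch on a toy list; `artist_ids_demo` is a hypothetical stand-in for the real list of IDs.

```python
# Toy list of artist IDs in which the first ID appears twice.
artist_ids_demo = [
    '6Ff53KvcvAj5U7Z1vojB5o',
    '1bDWGdIC2hardyt55nlQgG',
    '6Ff53KvcvAj5U7Z1vojB5o',
]

# dict.fromkeys keeps the first occurrence of each ID and preserves order.
unique_ids = list(dict.fromkeys(artist_ids_demo))
print(len(unique_ids))  # 2
```

On the DataFrame itself, `df.drop_duplicates(subset='id')` gives the same result, keeping the first row for each ID.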

### 2e. Cleaning & Adding Artists

I just noticed that one of the retrieved results, `8 ball & MJG & Mr E of RPS Fam`, was an unfortunate consequence of my "I'm feeling lucky" approach to gathering artists. I'll retrieve the correct `8Ball & MJG` entry and append it to the artist table.

In [5]:
result = sp.search('8Ball & MJG', type='artist')

In [8]:
result['artists']['items'][0]

{'external_urls': {'spotify': 'https://open.spotify.com/artist/7iUhmKPNkkPPS6FCQxqtNq'},
 'followers': {'href': None, 'total': 129015},
 'genres': ['deep southern trap',
  'dirty south rap',
  'gangster rap',
  'memphis hip hop',
  'pop rap',
  'southern hip hop',
  'trap music'],
 'href': 'https://api.spotify.com/v1/artists/7iUhmKPNkkPPS6FCQxqtNq',
 'id': '7iUhmKPNkkPPS6FCQxqtNq',
 'images': [{'height': 640,
   'url': 'https://i.scdn.co/image/fafd121594231b770a47e6cdd893e2992d672c33',
   'width': 640},
  {'height': 300,
   'url': 'https://i.scdn.co/image/0c56c95a331fc161b7e0b6147574d914b840db16',
   'width': 300},
  {'height': 64,
   'url': 'https://i.scdn.co/image/72f790df70c1d7c3bf5098de019db0e6576865e3',
   'width': 64}],
 'name': '8Ball & MJG',
 'popularity': 49,
 'type': 'artist',
 'uri': 'spotify:artist:7iUhmKPNkkPPS6FCQxqtNq'}

In [81]:
df.head()

Unnamed: 0,external_urls,genres,href,id,images,name,popularity,type,uri,followers_2
0,{'spotify': 'https://open.spotify.com/artist/6...,"['boy band', 'dance pop', 'europop', 'pop', 'p...",https://api.spotify.com/v1/artists/6Ff53KvcvAj...,6Ff53KvcvAj5U7Z1vojB5o,"[{'height': 1173, 'url': 'https://i.scdn.co/im...",*NSYNC,71,artist,spotify:artist:6Ff53KvcvAj5U7Z1vojB5o,747136
1,{'spotify': 'https://open.spotify.com/artist/1...,"['antiviral pop', 'comedy rock', 'comic']",https://api.spotify.com/v1/artists/1bDWGdIC2ha...,1bDWGdIC2hardyt55nlQgG,"[{'height': 563, 'url': 'https://i.scdn.co/ima...","""Weird Al"" Yankovic",59,artist,spotify:artist:1bDWGdIC2hardyt55nlQgG,337751
2,{'spotify': 'https://open.spotify.com/artist/0...,"['alternative metal', 'nu metal', 'post-grunge...",https://api.spotify.com/v1/artists/0REMf7H0VP6...,0REMf7H0VP6DwfZ9MbuWph,"[{'height': 640, 'url': 'https://i.scdn.co/ima...",10 Years,60,artist,spotify:artist:0REMf7H0VP6DwfZ9MbuWph,243895
3,{'spotify': 'https://open.spotify.com/artist/0...,"['alternative rock', 'folk', 'folk-pop', 'lili...",https://api.spotify.com/v1/artists/0MBIKH9DjtB...,0MBIKH9DjtBkv8O3nS6szj,"[{'height': 640, 'url': 'https://i.scdn.co/ima...","10,000 Maniacs",52,artist,spotify:artist:0MBIKH9DjtBkv8O3nS6szj,108829
4,{'spotify': 'https://open.spotify.com/artist/7...,"['boy band', 'dance pop', 'gangster rap', 'hip...",https://api.spotify.com/v1/artists/7urq0VfqxEY...,7urq0VfqxEYEEiZUkebXT4,"[{'height': 130, 'url': 'https://i.scdn.co/ima...",112,68,artist,spotify:artist:7urq0VfqxEYEEiZUkebXT4,455231


In [82]:
df.columns

Index(['external_urls', 'genres', 'href', 'id', 'images', 'name', 'popularity',
       'type', 'uri', 'followers_2'],
      dtype='object')

In [83]:
df = df.append({'external_urls':"{'spotify': 'https://open.spotify.com/artist/7iUhmKPNkkPPS6FCQxqtNq'}",
           'genres':['deep southern trap', 'dirty south rap','gangster rap','memphis hip hop','pop rap','southern hip hop','trap music'],
           'href':'https://api.spotify.com/v1/artists/7iUhmKPNkkPPS6FCQxqtNq',
           'id':'7iUhmKPNkkPPS6FCQxqtNq',
           'images':[{'height': 640,
           'url': 'https://i.scdn.co/image/fafd121594231b770a47e6cdd893e2992d672c33',
           'width': 640},
          {'height': 300,
           'url': 'https://i.scdn.co/image/0c56c95a331fc161b7e0b6147574d914b840db16',
           'width': 300},
          {'height': 64,
           'url': 'https://i.scdn.co/image/72f790df70c1d7c3bf5098de019db0e6576865e3',
           'width': 64}],
           'name':'8Ball & MJG',
           'popularity':49,
           'type':'artist',
           'uri':'spotify:artist:7iUhmKPNkkPPS6FCQxqtNq',
           'followers_2':129015}, ignore_index=True)

In [84]:
df.drop(labels = 24, inplace=True)

### 2e. Checking for Non-Music

Not every artist pulled from the Spotify listing is a musician or musical act. This is because the RIAA gives shipment awards for any type of audio content that's distributed within the U.S. In order to ensure my recommender is able to provide the best results, I've attempted to remove all non-musical artists programmatically.

##### Loading In Artist Table Again

In [14]:
df = pd.read_csv('../data/spotify_artists_clean.csv', index_col=0)

In [80]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2449 entries, 0 to 2449
Data columns (total 10 columns):
external_urls    2449 non-null object
genres           2449 non-null object
href             2449 non-null object
id               2449 non-null object
images           2449 non-null object
name             2449 non-null object
popularity       2449 non-null int64
type             2449 non-null object
uri              2449 non-null object
followers_2      2449 non-null int64
dtypes: int64(2), object(8)
memory usage: 210.5+ KB


#### Removing Extraneous Columns from Artist Dataframe

In [85]:
df.drop(labels=['external_urls',
                'href',
                'images',
                'type',
                'uri'],
        axis=1,
        inplace=True)
df.rename({'followers_2':'followers'}, axis=1, inplace=True)

### 2f. Sorting Through Artist Genres

Spotify, through its API, stores genre information with each artist (in a list), as opposed to each song. Even so, genres are still incredibly important at the song recommendation level, especially because Spotify takes them so [seriously](http://everynoise.com/EverynoiseIntro.pdf) and meticulously classifies each artist. Therefore, in order to gain a better understanding of the genres, I've extracted them from their list format and placed them into separate columns within the artist DataFrame.

##### Cleaning Genre Column Format

In [87]:
df['genres'] = df['genres'].apply(lambda x: str(x).strip("[]"))
df['genres'] = df['genres'].apply(lambda x: x.split(","))
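Splitting the string on commas leaves the quote characters and leading spaces inside each genre. As a hedged alternative, assuming the column holds the string form of a Python list (as it does after a round trip through CSV), `ast.literal_eval` from the standard library parses it back into a clean list in one step.

```python
import ast

# A genre cell as it looks after being read back from CSV.
raw = "['boy band', 'dance pop', 'europop', 'pop', 'post-teen pop']"

# Strip-and-split keeps quotes and leading spaces in each piece.
naive = raw.strip("[]").split(",")
print(naive[1])   # prints:  'dance pop'   (leading space and quotes included)

# literal_eval rebuilds the original list of clean strings.
parsed = ast.literal_eval(raw)
print(parsed[1])  # dance pop
```

Applied to the column, this would read `df['genres'] = df['genres'].apply(ast.literal_eval)`.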

In [91]:
df.head(2)

Unnamed: 0,genres,id,name,popularity,followers
0,"['boy band', 'dance pop', 'europop', 'pop',...",6Ff53KvcvAj5U7Z1vojB5o,*NSYNC,71,747136
1,"['antiviral pop', 'comedy rock', 'comic']",1bDWGdIC2hardyt55nlQgG,"""Weird Al"" Yankovic",59,337751


In [51]:
df.set_index('Unnamed: 0', inplace=True)

##### Artists with the Highest Number of Genres Tied to Them

In [94]:
df['genres'].apply(lambda x: len(x)).sort_values(ascending=False).head()

2236    19
2108    18
2098    16
2328    16
618     16
Name: genres, dtype: int64

#### Separating Genres into Separate Dataframe

In [114]:
df2 = pd.DataFrame(df['genres'].tolist(), columns=['genre_{}'.format(i) for i in range(19)])

In [122]:
df.reset_index(drop=True, inplace=True)

#### Merging Genres Dataframe into Artist Dataframe

In [124]:
df = pd.concat([df.drop('genres', axis=1), df2], axis=1)

In [140]:
df.replace('', value={'genre_0':None}, inplace=True)

In [160]:
df.set_index('id', inplace=True)

In [161]:
df.head(2)

id,name,popularity,followers,genre_0,genre_1,genre_2,genre_3,genre_4,genre_5,genre_6,...,genre_9,genre_10,genre_11,genre_12,genre_13,genre_14,genre_15,genre_16,genre_17,genre_18
6Ff53KvcvAj5U7Z1vojB5o,*NSYNC,71,747136,'boy band','dance pop','europop','pop','post-teen pop',,,...,,,,,,,,,,
1bDWGdIC2hardyt55nlQgG,"""Weird Al"" Yankovic",59,337751,'antiviral pop','comedy rock','comic',,,,,...,,,,,,,,,,


In [162]:
df.to_csv('../data/spotify_artists.csv')

<a name="3"></a>
## 3. Retrieving Top-10 Tracks for Every Artist

Here, I'm querying the Spotify API's artist top-tracks endpoint for each artist's top-10 tracks. Again, I'm using the Spotipy wrapper.

In [14]:
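song_list = []
for aid in artist_ids:
    try:
        song_list.append(sp.artist_top_tracks(aid))
    except Exception:
        # Record the failure and move on; a bare `except` would also
        # swallow KeyboardInterrupt during this long-running loop
        song_list.append('no_results')
    time.sleep(1)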

##### Checking Results

In [281]:
len(song_list)

2450

##### Song List to Pickle

In [16]:
with open('../pickle/song_list.pkl', 'wb+') as f:
    pickle.dump(song_list, f)

In [5]:
with open('../pickle/song_list.pkl', 'rb') as f:
    song_list = pickle.load(f)

#### Pulling Essential Song Information from `song_list` Into Separate Listing

In [7]:
master_song_list = song_feat_pull.song_feat_pull(song_list)
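The `song_feat_pull` helper lives in the project's `library` package and its source is not shown in this notebook. The sketch below is only a guess at the kind of flattening it performs, based on the column names that appear later in this notebook and on the `tracks` list in Spotify's top-tracks response. The function name `flatten_top_tracks` is hypothetical, and attributing each track to its first credited artist is an assumption.

```python
def flatten_top_tracks(song_list):
    """Turn a list of top-tracks responses into one flat dict per track."""
    rows = []
    for response in song_list:
        if response == 'no_results':       # sentinel stored when a request failed
            continue
        for track in response['tracks']:
            primary = track['artists'][0]  # assumption: first credited artist
            rows.append({
                'song_id': track['id'],
                'song_title': track['name'],
                'artist_id': primary['id'],
                'artist_name': primary['name'],
                'duration_ms': track['duration_ms'],
                'explicit': track['explicit'],
                'linked_album': track['album']['name'],
                'album_release_date': track['album']['release_date'],
            })
    return rows


# Toy response with made-up values, plus one failed request.
song_list_demo = [
    {'tracks': [{'id': 't1', 'name': 'Song One', 'duration_ms': 200000,
                 'explicit': False,
                 'album': {'name': 'Album One', 'release_date': '2000-01-01'},
                 'artists': [{'id': 'a1', 'name': 'Artist A'}]}]},
    'no_results',
]
rows = flatten_top_tracks(song_list_demo)
print(len(rows))  # 1
```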

In [206]:
master_song_list = pd.DataFrame(master_song_list)
master_song_list.set_index('song_id', inplace=True)
master_song_list.to_csv('../data/spotify_song_list.csv')

#### Duplicate Song Title Check

In [9]:
df_2 = pd.read_csv('../data/spotify_song_list.csv', index_col = 0)
df_2[df_2.duplicated()].sort_values('song_id').shape

(759, 7)

These will be deleted prior to EDA & Preprocessing of data.

<a name="4"></a>
## 4. Moving Genres Into Separate Table & Creating Lookup Table

In order to eventually merge the genres into the master song listing, I've separated the genres from the artist table and created a lookup table to match each song with the genres tied to the song's performing artist.

In [16]:
df['genres'] = df['genres'].apply(lambda x: x.split(sep=','))
df['genres'] = df['genres'].apply(lambda x: [i.replace('[', '') for i in x])
df['genres'] = df['genres'].apply(lambda x: [i.replace(']', '') for i in x])
df['genres'] = df['genres'].apply(lambda x: [i.replace('\'', '') for i in x])
df['genres'] = df['genres'].apply(lambda x: [i.strip() for i in x])

In [17]:
df['genres'].loc[0]

['boy band', 'dance pop', 'europop', 'pop', 'post-teen pop']

### 4a. Separating Genres and Their Associated `artist_id` Into a Separate Dataframe

In [18]:
genre_per_artist = df[['genres', 'id']]

In [19]:
genre_per_artist.head()

Unnamed: 0,genres,id
0,"[boy band, dance pop, europop, pop, post-teen ...",6Ff53KvcvAj5U7Z1vojB5o
1,"[antiviral pop, comedy rock, comic]",1bDWGdIC2hardyt55nlQgG
2,"[alternative metal, nu metal, post-grunge, rap...",0REMf7H0VP6DwfZ9MbuWph
3,"[alternative rock, folk, folk-pop, lilith, mel...",0MBIKH9DjtBkv8O3nS6szj
4,"[boy band, dance pop, gangster rap, hip hop, h...",7urq0VfqxEYEEiZUkebXT4


#### Dropping Non-Artists From `genre_per_artist`

In [103]:
non_artists = pd.read_csv('../data/non_artists.csv', index_col=0)
non_artists_id = non_artists['s_artist_id']

In [105]:
genre_todrop = genre_per_artist.query("id in @non_artists_id")

In [109]:
# Reassign rather than drop in place: `genre_per_artist` is a slice of `df`,
# and an in-place drop on it triggers pandas' SettingWithCopyWarning.
genre_per_artist = genre_per_artist.drop(labels=genre_todrop.index)


### 4b. Creating Genre-Song Table

In order to examine how each genre applies to each song, I've created the following lookup table.

In [25]:
song_df = pd.read_csv('../data/song_list_v2.csv', index_col=0)

In [6]:
song_df['s_song_id'].values

array(['6SluaPiV04KOaRTOIScoff', '5qEVq3ZEGr0Got441lueWS',
       '5kqIPrATaCc2LqxVWzQGbk', ..., '2ayphdNEBBaAErWdTMAhsm',
       '7BY005dacJkbO6EPiOh2wb', '09WwqFDqX2zv8rlvf4xYAk'], dtype=object)

In [8]:
song_df.columns

Index(['s_song_id', 'album_release_date', 'artist_id', 'artist_name',
       'duration_ms', 'explicit', 'linked_album', 'song_title'],
      dtype='object')

##### Making Genre-to-Artist Dictionary to Map Into Song Dataframe

In [21]:
genre_w_artists = genre_p_artist.genre_p_artist(genre_per_artist)

### 4c. Retrieving Genres from Master Song List to Create Separate Genres Dataframe

In [26]:
genre_songs = genre_psong.genre_psong(genre_w_artists, song_df)
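The `genre_p_artist` and `genre_psong` helpers come from the project's `library` package, and their source is not shown here. The sketch below is a guess at the two steps, inferred only from the `(gs_id, song_id, genre, artist_id)` row layout used later in this section. The function names `artist_genre_map` and `song_genre_rows` are hypothetical.

```python
def artist_genre_map(artist_rows):
    """Map artist_id -> list of genres from (artist_id, genres) pairs."""
    return {artist_id: list(genres) for artist_id, genres in artist_rows}


def song_genre_rows(genre_by_artist, songs):
    """Expand (song_id, artist_id) pairs into (gs_id, song_id, genre, artist_id) rows.

    Songs whose artist is unknown or has no genres get a single row with
    genre None, so genre-less songs remain visible in the output.
    """
    rows = []
    gs_id = 1
    for song_id, artist_id in songs:
        for genre in genre_by_artist.get(artist_id) or [None]:
            rows.append((gs_id, song_id, genre, artist_id))
            gs_id += 1
    return rows


# Toy data: artist A has two genres, B has none, C is missing entirely.
genre_by_artist = artist_genre_map([('A', ['pop', 'rock']), ('B', [])])
rows = song_genre_rows(genre_by_artist, [('s1', 'A'), ('s2', 'B'), ('s3', 'C')])
for row in rows:
    print(row)
```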

##### Checking for `null` (Genre-less) Results

In [23]:
for entry in genre_songs:
    if None in entry:
        print(entry)

In [24]:
genre_songs.insert(0,(0, 'song_id', 'genre', 'artist_id'))

### 4d. Creating Genre Table

Creating a separate table for just genres, in order to help map genres to each song.

In [21]:
genre_list = df.genres.tolist()
genre_list = [i.split(sep=',') for i in genre_list]

["['boy band'", " 'dance pop'", " 'europop'", " 'pop'", " 'post-teen pop']"]

In [38]:
genre_list = [[i.replace('[', '') for i in lst] for lst in genre_list]
genre_list = [[i.replace(']', '') for i in lst] for lst in genre_list]
genre_list = [[i.replace("\'", '') for i in lst] for lst in genre_list]
genre_list = [[i.strip() for i in lst] for lst in genre_list]

In [39]:
all_genres = []

for genre_set in genre_list:
    for genre in genre_set:
        all_genres.append(genre)

In [72]:
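# Sort for a deterministic order (a set has none), then drop the empty
# string left behind by artists with no genres.
all_genres = sorted(set(all_genres))
all_genres = [genre for genre in all_genres if genre]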

In [91]:
genre_df = pd.DataFrame(all_genres, columns=['genre'])

In [50]:
genre_df.head()

Unnamed: 0,genre
0,a cappella
1,acid house
2,acid jazz
3,acoustic blues
4,acoustic pop


In [51]:
genre_dict = genre_df['genre'].to_dict()

In [93]:
genre_df.to_csv('../data/genres.csv')

In [35]:
genre_df = pd.read_csv('../data/genres.csv', index_col=0)
genre_df.head()

Unnamed: 0,genre
0,a cappella
1,acid house
2,acid jazz
3,acoustic blues
4,acoustic pop


In [45]:
with open('../data/all_genres.pkl', 'wb') as f:
    pickle.dump(all_genres, f)

### 4e. Casting Genre Dict into Lookup Table

In [58]:
genre_ids_songs = []
for gs_id, song_id, genre, artist_id in genre_songs[1:]:
    for k, v in genre_dict.items():
        if genre == v:
            genre_ids_songs.append((gs_id, song_id, k, genre, artist_id))
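The nested loop above scans the whole genre dictionary for every song-genre row. Because each genre name appears only once in `genre_dict`, the same result can be reached by inverting the dictionary once and looking each name up directly. This is a minimal sketch on toy data; `genre_to_id` is a name introduced here for illustration.

```python
# Toy stand-ins for `genre_dict` and `genre_songs`.
genre_dict = {0: 'a cappella', 1: 'acid house', 2: 'acid jazz'}
genre_songs = [
    (0, 'song_id', 'genre', 'artist_id'),          # header row, skipped below
    (1, 'song_a', 'acid jazz', 'artist_x'),
    (2, 'song_a', 'a cappella', 'artist_x'),
    (3, 'song_b', 'unknown genre', 'artist_y'),    # not in genre_dict: dropped
]

# Invert once: genre name -> genre id
genre_to_id = {name: genre_id for genre_id, name in genre_dict.items()}

# One dictionary lookup per row instead of a scan over all genres
genre_ids_songs = [
    (gs_id, song_id, genre_to_id[genre], genre, artist_id)
    for gs_id, song_id, genre, artist_id in genre_songs[1:]
    if genre in genre_to_id
]
print(genre_ids_songs)
```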

In [60]:
len(genre_ids_songs)

121283

In [59]:
for genre in genre_ids_songs:
    if genre[4] == '7hbzXJ5K6AqdPnbUcCL1hL':
        print(genre)

##### Lookup Dict Into Lookup Table 

In [61]:
gsong_lookup = pd.DataFrame(genre_ids_songs, columns=['gs_id', 'song_id', 'genre_id', 'genre_name', 'artist_id']).set_index('gs_id')

In [62]:
gsong_lookup.shape

(121283, 4)

In [63]:
gsong_lookup.head()

gs_id,song_id,genre_id,genre_name,artist_id
1,6SluaPiV04KOaRTOIScoff,139,dance pop,6UE7nl9mha6s8z0wFQFIZ2
2,6SluaPiV04KOaRTOIScoff,189,electropop,6UE7nl9mha6s8z0wFQFIZ2
3,6SluaPiV04KOaRTOIScoff,196,europop,6UE7nl9mha6s8z0wFQFIZ2
4,6SluaPiV04KOaRTOIScoff,404,pop,6UE7nl9mha6s8z0wFQFIZ2
5,6SluaPiV04KOaRTOIScoff,410,pop rock,6UE7nl9mha6s8z0wFQFIZ2


In [64]:
gsong_lookup.to_csv('../data/gsong_lookup.csv')

#### Next notebook: 01b_collection_audio_feat