# 1.C. Collection -- Audio -- Spotify

In this notebook, we'll continue our collection workflow. To recap, we used a seed list of rappers from Wikipedia to extract song data from Genius.com with lyricsgenius. Now, we'll turn to **Spotify** to build on our current track-level features.

**Spotify** is one of the world's premier streaming platforms and has a robust API. There are two levels of API authorization: the commercial version, which unlocks access to deeper metrics on user-level data, and the free version, which requires account authentication and unlocks access to a variety of audio features and data about your own account. Here, we're using the latter to extract the following data:

 - **Pre-existing Audio Features**: *danceability* and *instrumentalness* are two of several features available through Spotify's API. We'll be tapping into these.
 - **Social data**: follower count and popularity are two social metrics that will help us rank artists.
 - **Text data**: album name and track title are some basic values we'll also want to capture.
 - **Audio Snippets**: We want to grab the `preview_url` provided by Spotify for each of our tracks. This will allow us to run custom audio analysis. *While Spotify's pre-existing audio features (such as danceability) are much more reliable for classification and recommendation, our goal is to explore some of the processing steps that lead up to these pre-engineered features.*
 
The workflow for this notebook is as follows:
 1. Instantiate a connection to Spotify via **Spotipy**, a wrapper which will handle errors and streamline data collection.
 2. Search for artists from our Genius pull and extract relevant information on them, including the **social features** mentioned above.
 3. With our list of artists, grab their full discography by iterating through album searches and the corresponding track searches for each album.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import time
import requests
import seaborn as sns
import warnings
import matplotlib.pyplot as plt
import lyricsgenius

import sys
import spotipy
import spotipy.util as util

## Authentication and Param Setting for APIs


In [2]:
'''SPOTIFY'''

scope = 'user-library-read'
username = 'Ocurtis'

#leverage Spotipy's authentication method. We need to refresh this token every time we pull data down.
token = util.prompt_for_user_token(
    username,scope,
    client_id='INSERT ID HERE',
    client_secret='INSERT SECRET HERE',
    redirect_uri='http://google.com/')
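
Hard-coding the client ID and secret in a notebook makes them easy to leak when the notebook is shared. Below is a minimal sketch of one alternative, assuming the credentials have been exported as environment variables. The variable names `SPOTIFY_CLIENT_ID` and `SPOTIFY_CLIENT_SECRET` and the helper `load_spotify_credentials` are hypothetical and not part of Spotipy.

```python
import os


def load_spotify_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables.

    The variable names below are an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    required = ('SPOTIFY_CLIENT_ID', 'SPOTIFY_CLIENT_SECRET')
    # Collect every missing name so the error message is complete
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError('Missing environment variables: ' + ', '.join(missing))
    return env['SPOTIFY_CLIENT_ID'], env['SPOTIFY_CLIENT_SECRET']
```

The returned pair could then be passed as `client_id` and `client_secret` to `util.prompt_for_user_token` in place of the string literals.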

In [3]:
token

'BQCXrTNl8yhMDeP7b8ycwlzpK3EMsC8Ri0mDyMuzAgL-fuHxbQGVGeR-3FW1paBYggAdndzXT8S3n-mlbYFavaC5KizSD12SxRKFwaCwIByhk1ErfcVKd8YQftgdRIjxazGolkc7AGIL14B3mgiRd_RTKayT'

## Grab Spotify Artists

In [6]:
#Read in the list of artists we need to pull down
artist_list_df = pd.read_csv('missing_artists_final_push.csv')

In [7]:
#Isolate for unique artist names
spotify_query_list = list(artist_list_df['artist'].unique())

In [8]:
#save our token and connect to Spotify via Spotipy
sp = spotipy.Spotify(auth=token)

In [14]:
#Set up our lists for appending as we loop through our spotify artist information
spotify_artist_list = []
spotify_artist_id_list = []
spotify_artist_image_list = []
spotify_artist_genres_list = []
spotify_artist_popularity_list = []
spotify_artist_follower_tot_list = []


#Search for each rapper and store the first hit's fields in our lists
for name in spotify_query_list:
        
    try:
    
        json_spot = sp.search(q='artist:' + str(name), type='artist')


        spotify_artist_list.append(json_spot['artists']['items'][0]['name'])
        spotify_artist_id_list.append(json_spot['artists']['items'][0]['id'])


        #try grabbing the first image. store an empty string if there's a problem        
        try:
            spotify_artist_image_list.append(json_spot['artists']['items'][0]['images'][0]['url'])
        except (KeyError, IndexError):
            spotify_artist_image_list.append('')

        #try grabbing the genre list. pass if there's a problem        
        try:
            spotify_artist_genres_list.append("|".join(json_spot['artists']['items'][0]['genres']))
        except:
            spotify_artist_genres_list.append('')
            
        #try grabbing the popularity. pass if there's a problem                        
        try:
            spotify_artist_popularity_list.append(json_spot['artists']['items'][0]['popularity'])
        except:
            spotify_artist_popularity_list.append('')
                    
        #try grabbing the followers. pass if there's a problem    
        try:
            spotify_artist_follower_tot_list.append(json_spot['artists']['items'][0]['followers']['total'])
        except:
            spotify_artist_follower_tot_list.append('')


    except:
        spotify_artist_list.append('')
        spotify_artist_id_list.append('')
        spotify_artist_image_list.append('')
        spotify_artist_genres_list.append('')
        spotify_artist_popularity_list.append('')
        spotify_artist_follower_tot_list.append('')
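
The nested try/except blocks above can be condensed by flattening the first search hit into a single record. Below is a minimal sketch, assuming the response has the same shape the loop indexes into (`json_spot['artists']['items']`) and that each artist carries an `images` list of dicts with a `url` field. The helper name `first_artist_record` is hypothetical, and the sample response is hand-written for illustration.

```python
def first_artist_record(search_response):
    """Flatten the first artist hit of a search response into a dict.

    Missing fields fall back to '' so every record has the same keys.
    """
    items = search_response.get('artists', {}).get('items', [])
    if not items:
        # No hit for this query: return an empty record
        return {'artist': '', 'artist_id': '', 'image': '',
                'genres': '', 'pop': '', 'follower': ''}
    hit = items[0]
    images = hit.get('images') or []
    return {
        'artist': hit.get('name', ''),
        'artist_id': hit.get('id', ''),
        'image': images[0].get('url', '') if images else '',
        'genres': '|'.join(hit.get('genres') or []),
        'pop': hit.get('popularity', ''),
        'follower': (hit.get('followers') or {}).get('total', ''),
    }


# Example with a hand-written response (values are made up for illustration)
fake_response = {'artists': {'items': [{
    'name': 'Example Artist', 'id': 'abc123',
    'images': [{'url': 'http://example.com/a.jpg'}],
    'genres': ['hip hop', 'rap'], 'popularity': 50,
    'followers': {'total': 1000}}]}}
record = first_artist_record(fake_response)
```

A list of such records can be passed directly to `pd.DataFrame`, which avoids keeping six parallel lists in sync.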

In [22]:
#Build dataframe out of our results
spotify_artist_df = pd.DataFrame({'artist':spotify_artist_list,
                                  'artist_id':spotify_artist_id_list,
                                  'genres':spotify_artist_genres_list,
                                  'image':spotify_artist_image_list,
                                  'pop':spotify_artist_popularity_list,
                                  'follower':spotify_artist_follower_tot_list,})

In [44]:
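#Drop anything with one follower or fewer -- these are junk artists.
#Failed lookups were stored as '', so coerce to numeric (NaN) before comparing.
followers = pd.to_numeric(spotify_artist_df['follower'], errors='coerce')
spotify_artist_df = spotify_artist_df.loc[followers > 1]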

In [45]:
#Save to CSV
spotify_artist_df.to_csv('spotify_artist_table.csv', index=False)

In [9]:
#Read back in
spotify_artist_df = pd.read_csv('spotify_artist_table.csv')

## Grab Spotify Album and Track Info

In [52]:
#Isolate our artist IDs
spotify_album_track_info_grab_list = list(spotify_album_track_info_grab_df['artist_id'])

In [9]:
#Set up track info df
spotify_track_info_df = pd.DataFrame(columns=['album_name', 'artist_id', 'album_id', 'track_name', 'track_id',
       'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'type', 'id', 'uri', 'track_href', 'analysis_url', 'duration_ms',
       'time_signature'])

In [10]:
#function that will grab all related album and track information for a given artist
def grab_album_track_info(artist_id):

    #we'll keep a running tally of the full discography here
    discog = []
    #we'll store the album objects we get in this list
    albums = []
    
    #search for albums for a given artist using Spotipy. put the return into albums
    results = sp.artist_albums(artist_id, album_type='album')
    albums.extend(results['items'])
    
    #while there are still pages of results to parse through (as signified by a 'next' key)
    while results['next']:
        
        #pass Spotipy the current results so it can fetch the next page
        results = sp.next(results)
        albums.extend(results['items'])
    
    #Here, we'll set up a set to deduplicate albums 
    seen = set() 
    
    #sort our albums alphabetically
    albums.sort(key=lambda album:album['name'].lower())
    
    #go through every album in our sorted list
    for album in albums:
        
        #if it is not in our set of albums
        name = album['name']
        if name not in seen:
            
            #add it.
            seen.add(name)
            
            #now that we've added, let's set up our call for track information. Start with an empty list
            tracks = []
            
            #Pass the album ID to Spotipy to grab tracks for the album. Add the object to our list
            track_results = sp.album_tracks(album['id'])
            tracks.extend(track_results['items'])
            
            #Same structure here. While there are more pages of tracks, keep paging through them
            while track_results['next']:
                track_results = sp.next(track_results)
                tracks.extend(track_results['items'])
                
            #for all the track objects we have stored
            for track in tracks:
                
                #instantiate a dictionary for each track and store relevant information
                single_track = {}
                single_track ['album_name'] = album['name']
                single_track ['artist_id'] = artist_id
                single_track ['album_id'] = album['id']               
                single_track['track_name'] = track['name']
                single_track['track_id'] = track['id']
                single_track['preview_url'] = track['preview_url']
    
                audio_features = sp.audio_features(track['id'])
                
                #for audio features, there are several. Iteratively grab them.
                #The entry can be None when a track has no audio features, so guard against that.
                for feature, value in (audio_features[0] or {}).items():
                    single_track[feature] = value
                
                #Append the single track to the discography list we're keeping
                discog.append(single_track)

            

    #return the discography
    return discog
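
The function above makes one `sp.audio_features` request per track. Spotipy's `audio_features` also accepts a list of track IDs, so the requests can be batched. Below is a minimal sketch, assuming a batch size of 100 (an assumption worth checking against the current Spotify API reference). The helper names `chunked` and `fetch_audio_features` are hypothetical, and the `FakeClient` class exists only so the example runs without credentials.

```python
def chunked(seq, size):
    """Yield successive slices of seq with at most `size` items each."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def fetch_audio_features(client, track_ids, batch_size=100):
    """Return {track_id: features dict} using one request per batch.

    `client` is any object with an audio_features(list_of_ids) method,
    such as a spotipy.Spotify instance. Tracks for which the client
    returns None are skipped.
    """
    features_by_id = {}
    for batch in chunked(list(track_ids), batch_size):
        for track_id, features in zip(batch, client.audio_features(batch)):
            if features is not None:
                features_by_id[track_id] = features
    return features_by_id


class FakeClient:
    """Stand-in for spotipy.Spotify, used only for this illustration."""

    def __init__(self):
        self.calls = 0

    def audio_features(self, tracks):
        self.calls += 1
        # Pretend the track with id 'missing' has no audio features
        return [None if t == 'missing' else {'id': t, 'tempo': 120.0}
                for t in tracks]


client = FakeClient()
ids = ['t%d' % i for i in range(250)] + ['missing']
features = fetch_audio_features(client, ids)
```

With a real client, the track IDs for an album could be collected first and looked up in a handful of requests, and each features dict then merged into the matching `single_track` record.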

In [34]:
spotify_track_info_df = spotify_track_info_df.drop_duplicates('track_href')

In [134]:
spotify_track_info_df.to_csv('full_set_of_spotify_trackdata.csv', index=False)

In [5]:
spotify_track_info_df = pd.merge(spotify_track_info_df,spotify_artist_df, on='artist_id')

In [7]:
spotify_track_info_df.to_csv('full_set_of_spotify_trackdata.csv', index=False)

In [47]:
final_spotify = pd.concat([spotify_track_info_df,spotify_track_info_df2]).drop_duplicates('track_href')

In [43]:
final_spotify.shape

(240556, 29)

In [48]:
final_spotify = pd.merge(final_spotify,spotify_artist_df, left_on='artist_id', right_on = 'artist_id', how='left')

In [49]:
final_spotify.to_csv('spotify_final_230k.csv', index=False)

In [17]:
final_spotify = pd.read_csv('spotify_final_230k.csv')

