# Spotify Song Data Scraping

### References:
#### Based on the following tutorials: <br />
Max Hilsdorf, "How to Create Large Music Datasets Using Spotipy", <i>Towards Data Science</i>, 25 April 2020: <br />
https://towardsdatascience.com/how-to-create-large-music-datasets-using-spotipy-40e7242cc6a6 <br />
Max Tingle, "Getting Started with Spotify’s API & Spotipy", <i>Towards Data Science</i>, 3 Oct 2019: <br />
https://medium.com/@maxtingle/getting-started-with-spotifys-api-spotipy-197c3dc6353b <br />
Sandra Radgowska, "How to use Spotify API and what data science opportunities can it open up?", <i>My Journey As A Data Scientist</i>, 18 August 2021:<br />
https://datascientistdiary.com/index.php/2021/03/04/how-to-use-spotify-api-and-what-data-science-opportunities-can-it-open-up/<br />
Angelica Dietzel, "How to Extract Any Artist’s Data Using Spotify’s API, Python, and Spotipy", <i>Better Programming</i>, 25 March 2020:<br />
https://betterprogramming.pub/how-to-extract-any-artists-data-using-spotify-s-api-python-and-spotipy-4c079401bc37

## Setup

### Import packages

In [1]:
import json
import time
import tqdm
import pandas as pd
import creds
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

### Spotify Credentials

#### Load credentials

This cell loads the `creds.py` file, which defines the variables `client_id` and `secret` in the following two lines. The file is gitignored so that the credentials are not shared along with the notebook.

client_id = 'Your Client ID Here'<br />
secret = 'Your secret here'

In [2]:
%run -i 'creds.py'
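
As an alternative to a credentials file, the same two values can be read from environment variables. The sketch below is illustrative only: `load_credentials` is a hypothetical helper, not part of the notebook. The names `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` are the environment variables Spotipy itself looks for when no credentials are passed explicitly.

```python
import os


def load_credentials(env=None, fallback=None):
    """Return (client_id, secret) from environment variables, else from fallback.

    `env` defaults to os.environ. `fallback` is an optional (client_id, secret)
    tuple, for example the values defined in creds.py.
    """
    env = os.environ if env is None else env
    client_id = env.get("SPOTIPY_CLIENT_ID")
    secret = env.get("SPOTIPY_CLIENT_SECRET")
    if client_id and secret:
        return client_id, secret
    if fallback is not None:
        return fallback
    raise RuntimeError("No Spotify credentials found")
```

The returned pair can then be passed to `SpotifyClientCredentials` in the same way as the values from `creds.py`.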

#### Set credentials

In [3]:
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

## Functions for data extraction

### Get track data including features 
#### Details: uri, name, album, artist name, release date, explicit T/F, duration in mins
#### Audio features: acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, time_signature

#### Create track container list

* Note that since the `tracks_with_features` list is not created in the cell that calls the function, subsequent calls append to the same list

In [4]:
tracks_with_features = []

#### Function to extract all the track ids from your playlist:

In [5]:
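def get_track_ids(playlist_id):
    """Return the track ids on the first page (up to 100 items) of a playlist."""
    music_id_list = []
    playlist = sp.playlist(playlist_id)
    for item in playlist['tracks']['items']:
        music_track = item['track']
        # Skip entries with no track object or id (e.g. removed or local tracks)
        if music_track and music_track.get('id'):
            music_id_list.append(music_track['id'])
    return music_id_list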
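
`sp.playlist` returns only the first page of a playlist's tracks, so playlists longer than 100 tracks would be truncated by the function above. A minimal sketch of paging through the whole playlist follows. It assumes the Spotipy client methods `playlist_items` and `next`; the function name `get_all_track_ids` and the `client` parameter (standing in for `sp`) are hypothetical.

```python
def get_all_track_ids(client, playlist_id):
    """Collect track ids from every page of a playlist."""
    track_ids = []
    page = client.playlist_items(playlist_id)
    while page:
        for item in page["items"]:
            track = item.get("track")
            # Removed or local tracks can come back without a track object or id
            if track and track.get("id"):
                track_ids.append(track["id"])
        # Follow the paging link until there is no further page
        page = client.next(page) if page.get("next") else None
    return track_ids
```

In the notebook this would be called as `get_all_track_ids(sp, playlist_id)`.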

#### Function to extract all the details and features of each track by passing its ID:

In [6]:
def get_track_data(track_id):
    meta = sp.track(track_id)
    features = sp.audio_features(track_id)
    analysis = sp.audio_analysis(track_id)
    track_details = {'uri': meta['uri'],
                    'name': meta['name'],
                    'album': meta['album']['name'],
                    'artist': meta['artists'][0]['name'],  # first credited track artist (the album artist can be "Various Artists")
                    'release_date': meta['album']['release_date'],
                    'explicit': meta['explicit'],
                    'duration_in_mins': round((meta['duration_ms'] * 0.001) / 60.0, 2),
                    'acousticness' : features[0]['acousticness'],
                    'danceability' : features[0]['danceability'],
                    'energy' : features[0]['energy'],
                    'instrumentalness' : features[0]['instrumentalness'],
                    'liveness' : features[0]['liveness'],
                    'loudness' : features[0]['loudness'],
                    'speechiness' : features[0]['speechiness'],
                    'tempo' : features[0]['tempo'],
                    'time_signature' : features[0]['time_signature'],
                    'track_duration_in_seconds' : analysis['track']['duration'],
                    'end_of_fade_in' : analysis['track']['end_of_fade_in'],
                    'start_of_fade_out' : analysis['track']['start_of_fade_out'],
                    'key' : analysis['track']['key'],
                    'mode' : analysis['track']['mode']
                    }
    return track_details
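
The function above makes three requests per track. Since `sp.audio_features` also accepts a list of ids (up to 100 per request), the audio features could be fetched in batches instead. The following is a sketch under that assumption; `chunked`, `get_features_in_batches`, and the `client` parameter (standing in for `sp`) are hypothetical names.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def get_features_in_batches(client, track_ids, batch_size=100):
    """Return {track_id: features_dict}, using one request per batch."""
    features_by_id = {}
    for batch in chunked(track_ids, batch_size):
        for track_id, features in zip(batch, client.audio_features(batch)):
            # The API returns None for tracks that have no audio features
            if features is not None:
                features_by_id[track_id] = features
    return features_by_id
```

Skipping `None` entries also avoids the `TypeError` that `features[0]['acousticness']` would raise for a track without audio features.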

#### Extract track data

Fetch the details and audio features of each track in the playlist.

For testing:  playlist_id = '0qfagBJB5ou0r1kwQDZ8Op'

In [7]:
# Get the ids for all the songs in your playlist
playlist_id = input('Enter the playlist id')
track_ids = get_track_ids(playlist_id)
print(len(track_ids))
print(track_ids)

# Loop over track ids and get their data points, pausing between requests
for track_id in track_ids:
    time.sleep(.5)
    track = get_track_data(track_id)
    tracks_with_features.append(track)

Enter the playlist id0qfagBJB5ou0r1kwQDZ8Op
21
['0bXpmJyHHYPk6QBFj25bYF', '0Xy9xPPs2zRRFqljGqKXel', '0RqRN88qFUEeOQZ7VklA14', '1QZJiYulh7ak7GpZ8OAdwI', '0wrBkSiN1Y1StFR1Q3ZC28', '4Q2UM2QSR7Gye03jvl4Rdw', '5urkJ8dcmxtvsnruNfx6ZS', '6ujlgkgbsrPskgWslEcZmR', '3pyTksNccLM1jRvzQ4zTke', '7AiIlhSSVKAyMJTygWuut2', '7pKfPomDEeI4TPT6EOYjn9', '0WWz2AaqxLoO0fa9ou6Fqc', '5qJbIYo8AitajCbnOGPIvI', '7IbMonU3CRITpQ0cRVThsV', '5yEPxDjbbzUzyauGtnmVEC', '2gANywSFYF58YFMPdDSAjC', '29zkoUsOE50f0I3n44LjjU', '6d3geXDfoj6hz882o9Ip9S', '2UKYMN7VnsQo40n0qCt6Sa', '6Prs4p7iVZxODcO62NIiA6', '7kCrYUDtWsPldohOKPTKPL']
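
In the loop above, a single failed request stops the whole extraction. One possible way to keep going is to record failures and continue, sketched below. The helper `collect` and its parameters are hypothetical; catching the broad `Exception` class is a simplification for illustration.

```python
import time


def collect(ids, fetch, delay=0.5, sleep=time.sleep):
    """Call fetch(id) for each id, pausing between calls and recording failures."""
    results, failed = [], []
    for item_id in ids:
        sleep(delay)
        try:
            results.append(fetch(item_id))
        except Exception as exc:  # e.g. a network error or a track with no analysis
            failed.append((item_id, str(exc)))
    return results, failed
```

For example, `tracks, failed = collect(track_ids, get_track_data)` followed by `tracks_with_features.extend(tracks)` would fill the same list while leaving the failed ids available for a retry.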


#### Create dataframe

In [8]:
df_features = pd.DataFrame(tracks_with_features)
df_features

| | uri | name | album | artist | release_date | explicit | duration_in_mins | acousticness | danceability | energy | ... | liveness | loudness | speechiness | tempo | time_signature | track_duration_in_seconds | end_of_fade_in | start_of_fade_out | key | mode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | spotify:track:0bXpmJyHHYPk6QBFj25bYF | Intro | xx | The xx | 2009-08-16 | False | 2.13 | 0.459 | 0.617 | 0.778 | ... | 0.128 | -8.871 | 0.027 | 100.363 | 4 | 127.92 | 2.48454 | 117.9922 | 9 | 0 |
| 1 | spotify:track:0Xy9xPPs2zRRFqljGqKXel | Pyro | Come Around Sundown | Kings of Leon | 2010-10-19 | False | 4.18 | 0.00757 | 0.365 | 0.606 | ... | 0.103 | -7.174 | 0.0415 | 114.995 | 4 | 250.53333 | 0.15075 | 242.08543 | 1 | 0 |
| 2 | spotify:track:0RqRN88qFUEeOQZ7VklA14 | Restless | Music Complete | New Order | 2015-09-25 | False | 5.47 | 0.00595 | 0.512 | 0.963 | ... | 0.0646 | -5.083 | 0.0498 | 143.012 | 4 | 328.24 | 1.00227 | 308.62802 | 11 | 0 |
| 3 | spotify:track:1QZJiYulh7ak7GpZ8OAdwI | Control | After the Disco | Broken Bells | 2014-01-13 | False | 3.69 | 0.00676 | 0.741 | 0.634 | ... | 0.337 | -6.625 | 0.0264 | 113.999 | 4 | 221.42667 | 0.31011 | 214.01833 | 9 | 0 |
| 4 | spotify:track:0wrBkSiN1Y1StFR1Q3ZC28 | All I'm Saying | La Petite Mort | James | 2014-06-02 | False | 4.97 | 0.246 | 0.483 | 0.781 | ... | 0.152 | -6.49 | 0.0927 | 118.824 | 4 | 297.93124 | 0.0 | 291.61362 | 5 | 1 |
| 5 | spotify:track:4Q2UM2QSR7Gye03jvl4Rdw | I'm Outta Time | Dig Out Your Soul | Oasis | 2008-10-06 | False | 4.17 | 0.00107 | 0.427 | 0.622 | ... | 0.431 | -5.467 | 0.028 | 77.406 | 4 | 250.01334 | 0.0 | 234.64925 | 4 | 0 |
| 6 | spotify:track:5urkJ8dcmxtvsnruNfx6ZS | Swans (Extended Version) | Swans Collection | Unkle Bob | 2006-10-23 | False | 4.12 | 0.289 | 0.486 | 0.669 | ... | 0.105 | -5.611 | 0.0293 | 118.464 | 4 | 247.29729 | 0.62707 | 237.18604 | 11 | 0 |
| 7 | spotify:track:6ujlgkgbsrPskgWslEcZmR | Ballad of the Mighty I | Chasing Yesterday | Noel Gallagher's High Flying Birds | 2015-02-27 | False | 5.25 | 0.00689 | 0.479 | 0.874 | ... | 0.178 | -4.563 | 0.0426 | 121.916 | 4 | 315.09332 | 0.0 | 292.00833 | 1 | 1 |
| 8 | spotify:track:3pyTksNccLM1jRvzQ4zTke | Never Tear Us Apart | INXS Remastered | INXS | 2011-01-01 | False | 3.08 | 0.00309 | 0.664 | 0.613 | ... | 0.175 | -7.56 | 0.0273 | 96.6 | 3 | 184.58667 | 0.17324 | 171.76381 | 0 | 1 |
| 9 | spotify:track:7AiIlhSSVKAyMJTygWuut2 | Hey Jude | Back In The U.S. | Paul McCartney | 2003-05-17 | False | 7.02 | 0.332 | 0.267 | 0.935 | ... | 0.967 | -3.016 | 0.0621 | 148.39 | 4 | 421.02667 | 0.0 | 421.02667 | 5 | 1 |


### Get artist data (id, artist name, genre, popularity, followers)

#### Create artist container list

* Note that since the `artists` list is not created in the cell that calls the function, subsequent calls append to the same list

In [9]:
artists = []

#### Function to extract all of the tracks' artist ids from your playlist:

In [10]:
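def get_artist_ids(playlist_id):
    """Return the first listed artist id for each track on the first page of a playlist."""
    artist_id_list = []
    playlist = sp.playlist(playlist_id)
    for item in playlist['tracks']['items']:
        music_track = item['track']
        # Skip entries with no track object or no artists
        if music_track and music_track.get('artists'):
            artist_id_list.append(music_track['artists'][0]['id'])
    return artist_id_list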

#### Function to extract all the details of each artist by passing their ID:

In [11]:
def get_artist_data(artist_id):
    meta = sp.artist(artist_id)
    artist_details = {'artist id': meta['id'],
                    'artist name': meta['name'],
                    'genres': meta['genres'],
                    'popularity': meta['popularity'],
                    'followers': meta['followers']['total']
                    }
    return artist_details
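
Artist details can also be requested in batches, since `sp.artists` accepts a list of ids (up to 50 per request) and returns them under the `'artists'` key. The sketch below builds the same dictionaries as `get_artist_data`; the function name `get_artists_in_batches` and the `client` parameter (standing in for `sp`) are hypothetical.

```python
def get_artists_in_batches(client, artist_ids, batch_size=50):
    """Return one details dict per artist id, using one request per batch."""
    rows = []
    for start in range(0, len(artist_ids), batch_size):
        batch = artist_ids[start:start + batch_size]
        for meta in client.artists(batch)["artists"]:
            rows.append({
                "artist id": meta["id"],
                "artist name": meta["name"],
                "genres": meta["genres"],
                "popularity": meta["popularity"],
                "followers": meta["followers"]["total"],
            })
    return rows
```

The resulting list can be passed to `pd.DataFrame` in the same way as the `artists` list.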

#### Extract artist data

Fetch the artist details for each track in the playlist.

For testing:  playlist_id = '0qfagBJB5ou0r1kwQDZ8Op'

In [12]:
# Get the artist ids for all the songs in your playlist
playlist_id = input('Enter the playlist id')
artist_ids = get_artist_ids(playlist_id)
print(len(artist_ids))
print(artist_ids)

# Loop over artist ids and get their data points, pausing between requests
for artist_id in artist_ids:
    time.sleep(.5)
    artist = get_artist_data(artist_id)
    artists.append(artist)

Enter the playlist id0qfagBJB5ou0r1kwQDZ8Op
21
['3iOvXCl6edW5Um0fXEBRXy', '2qk9voo8llSGYcZ6xrBzKx', '0yNLKJebCb8Aueb54LYya3', '6dgwEwnK0YtDfS9XhRwBTG', '0qLNsNKm8bQcMoRFkR8Hmh', '2DaxqgrOhkeH0fpeiQq2f4', '3Bf4u6r96pGx1eIbaGqfvf', '7sjttK1WcZeyLPn3IsQ62L', '1eClJfHLoDI4rZe5HxzBFv', '4STHEaNw4mPZ2tzheohgXB', '4x1nvY2FN8jxqAFA0DA02H', '4gzpq5DPGxSnKTe4SA8HAU', '0qLNsNKm8bQcMoRFkR8Hmh', '7sjttK1WcZeyLPn3IsQ62L', '2cGwlqi3k18jFpUyTrsR84', '2DaxqgrOhkeH0fpeiQq2f4', '0k17h0D3J5VfsdmQ1iZtE9', '4W48hZAnAHVOC2c8WH8pcq', '3OsRAKCvk37zwYcnzRf5XF', '63MQldklfxkjYDoUE4Tppz', '51Blml2LZPmy7TTiAg47vQ']
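
Three artist ids appear twice in the list above, so the loop requests the same artist more than once. A minimal sketch for dropping the repeats while keeping the playlist order is shown below; `unique_in_order` is a hypothetical helper name.

```python
def unique_in_order(ids):
    """Drop repeated ids while keeping first-seen order."""
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(ids))
```

Calling `unique_in_order(artist_ids)` before the loop would produce one row per artist instead of one row per track.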


#### Create dataframe

In [13]:
artist_df = pd.DataFrame(artists)
artist_df.head()

| | artist id | artist name | genres | popularity | followers |
|---|---|---|---|---|---|
| 0 | 3iOvXCl6edW5Um0fXEBRXy | The xx | [downtempo, dream pop, indietronica] | 70 | 3788801 |
| 1 | 2qk9voo8llSGYcZ6xrBzKx | Kings of Leon | [modern rock, rock] | 78 | 4896158 |
| 2 | 0yNLKJebCb8Aueb54LYya3 | New Order | [art rock, dance rock, madchester, new romanti... | 69 | 1617958 |
| 3 | 6dgwEwnK0YtDfS9XhRwBTG | Broken Bells | [alternative dance, alternative rock, indie po... | 61 | 514308 |
| 4 | 0qLNsNKm8bQcMoRFkR8Hmh | James | [britpop, madchester, new wave, new wave pop, ... | 64 | 429673 |


### Get track's audio features directly from playlist (for concept only, still a WIP)

#### Function to extract each track's audio features from a playlist directly

In [None]:
def get_playlist_tracks(playlist_id):
    track_attributes = sp.playlist_tracks(playlist_id)
    return track_attributes

In [None]:
playlist_tracks_data = []
playlist_ids = ['0qfagBJB5ou0r1kwQDZ8Op']

# Loop over playlist ids and get their data points, pausing between requests
for pl_id in playlist_ids:
    time.sleep(.5)
    playlist_track = get_playlist_tracks(pl_id)
    playlist_tracks_data.append(playlist_track)

In [None]:
playlist_df = pd.DataFrame(playlist_tracks_data)
playlist_df
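
Each element of `playlist_tracks_data` is a single paging dictionary whose tracks are nested under `'items'`, so the dataframe above has one row per playlist rather than one row per track. A possible next step for this work in progress is to flatten the items first, sketched below. `flatten_playlist_items` and the chosen output columns are hypothetical, and the result contains track metadata only, not audio features.

```python
def flatten_playlist_items(page):
    """Turn one playlist_tracks() page into a list of flat per-track dicts."""
    rows = []
    for item in page.get("items", []):
        track = item.get("track")
        if not track:
            continue  # skip entries without a track object
        artists = track.get("artists") or []
        rows.append({
            "added_at": item.get("added_at"),
            "track_id": track.get("id"),
            "name": track.get("name"),
            "album": (track.get("album") or {}).get("name"),
            "artist": artists[0]["name"] if artists else None,
        })
    return rows
```

With this helper, `pd.DataFrame(flatten_playlist_items(playlist_tracks_data[0]))` would give one row per track for the first playlist.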