# Get music features from Spotify API for Billboard Top 100 songs

This notebook uses [lyrics data for Billboard Top 100 songs between 1964 and 2015](https://github.com/walkerkq/musiclyrics). Each song is matched with its musical features from the [Spotify API](https://developer.spotify.com/), which provides 12 audio features per track: acousticness, liveness, speechiness, instrumentalness, energy, loudness, danceability, valence, duration, tempo, key, and mode.

In [1]:
#import libraries
import pandas as pd
import numpy as np
import collections
import re
import sys
import time
from nltk.tokenize import word_tokenize
#!pip install wordninja
import wordninja

In [2]:
#load billboard songs and lyrics from github.com/walkerkq
#the raw URL serves the CSV file itself; the /blob/ page URL returns an HTML page
file_path="https://raw.githubusercontent.com/walkerkq/musiclyrics/master/billboard_lyrics_1964-2015.csv"
df = pd.read_csv(file_path,delimiter=',', encoding='latin-1')

In [3]:
df.columns

Index(['Rank', 'Song', 'Artist', 'Year', 'Lyrics', 'Source'], dtype='object')

In [4]:
#get list of artists
artists=df['Artist'].to_list()

## 1. Data Cleaning
Artist and song names must be cleaned before the songs' musical features can be looked up through the Spotify API.

### Clean artist names   


In [5]:
#remove featured artists: keep only the text before ' featuring '
artists_clean=[artist.split(' featuring ', 1)[0] for artist in artists]
len(artists_clean)


5100
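
The effect of the cell above can be checked on a couple of names. This is a minimal sketch: the helper name `strip_featured` is hypothetical, and the two sample strings are taken from the `Artist` column shown later in this notebook.

```python
def strip_featured(artist, marker=" featuring "):
    """Return the part of `artist` before the first `marker` (hypothetical helper)."""
    # split at most once; names without the marker come back unchanged
    return artist.split(marker, 1)[0]

sample = ["neyo featuring juicy j", "four tops"]
cleaned = [strip_featured(name) for name in sample]
print(cleaned)
```

Names without the marker pass through unchanged, so the cleaned list always has the same length as the input.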

In [6]:
#for each artist string, check whether another artist string is contained in it and replace the longer string with the shorter one
#this eliminates remaining featured artists that would cause errors when looking up the song in the Spotify API

for i in range(len(artists_clean)):
    target_string=artists_clean[i]
    for search_string in artists_clean:
        #skip identical strings and very short names that would match by accident
        if target_string==search_string or len(search_string)<=3:
            continue
        #match whole words only, i.e. the shorter name surrounded by spaces
        if " "+search_string+" " in target_string:
            artists_clean[i]=search_string

len(artists_clean)


5100
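
To see what the containment heuristic does, it helps to run it on a toy list. The sketch below mirrors the loop above under the same rules (candidate longer than three characters, surrounded by spaces in the longer name). The function name `collapse_contained` and the artist names are made up for illustration.

```python
def collapse_contained(names):
    """Replace a name by another name that occurs inside it as whole words."""
    result = list(names)  # work on a copy so the input stays untouched
    for i in range(len(result)):
        target = result[i]
        for candidate in result:
            if candidate == target or len(candidate) <= 3:
                continue
            # the candidate must be surrounded by spaces, so it never
            # matches at the very start or end of the longer name
            if f" {candidate} " in target:
                result[i] = candidate
    return result

toy = ["dj example and jane roe and friends", "jane roe", "solo act"]
collapsed = collapse_contained(toy)
print(collapsed)
```

The first entry collapses to `jane roe` even though that is not the lead artist, which shows that the heuristic can pick a collaborator rather than the main act. Spot-checking a few replaced rows is therefore worthwhile.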

In [7]:
#append artists_clean column to data frame containing billboard songs, ranks and lyrics
df['Artists clean']=artists_clean

In [8]:
#make list of artist-tracks for billboard songs
songs=df['Artists clean']+" " +df['Song']
print(len(songs))

5100


In [9]:
#assign to artist_song1
df['artist_song1']=songs

### Clean song names

In [10]:
#titles may be messy and have words run together
#wordninja probabilistically splits concatenated words using NLP based on English Wikipedia unigram frequencies
#Reference: https://github.com/keredson/wordninja

#make an alternative title with the wordninja package for songs that cannot be found
#split each title into individual words so that it can be found on Spotify
#note that wordninja can also over-split, e.g. 'feelin' becomes 'feel in'
songs_clean=[]
for song in df.Song:
    #wordninja.split returns a list of words; rejoin them with single spaces
    songs_clean.append(" ".join(wordninja.split(song)))

In [11]:
df['songs_clean']=songs_clean
songs2=df['Artists clean']+" " +df['songs_clean']
df['artist_song2']=songs2

## 2. Retrieve musical features from Spotify API

In [None]:
#connect to spotify api
import requests
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials #to access authorised Spotify data
from tqdm import notebook
AUTH_URL = 'https://accounts.spotify.com/api/token'
CLIENT_ID = '' #insert your Spotify client ID
CLIENT_SECRET = '' #insert your Spotify client secret

auth_response = requests.post(AUTH_URL, {
    'grant_type': 'client_credentials',
    'client_id': CLIENT_ID,
    'client_secret': CLIENT_SECRET,
})

# convert the response to JSON
auth_response_data = auth_response.json()

# save the access token
access_token = auth_response_data['access_token']

BASE_URL = 'https://api.spotify.com/v1/'
headers = {
    'Authorization': 'Bearer {token}'.format(token=access_token)
}
client_credentials_manager = SpotifyClientCredentials(client_id=CLIENT_ID, client_secret=CLIENT_SECRET)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager) #spotify object to access API
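
Rather than pasting the client ID and secret into the notebook, they can be read from environment variables. The following is a minimal sketch, not part of the original workflow: the helper `load_spotify_credentials` is hypothetical, and the variable names `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` are assumed here because they follow the convention documented by spotipy.

```python
import os

def load_spotify_credentials(env=None):
    """Read the client ID and secret from a mapping (default: os.environ)."""
    env = os.environ if env is None else env
    client_id = env.get("SPOTIPY_CLIENT_ID", "")
    client_secret = env.get("SPOTIPY_CLIENT_SECRET", "")
    if not client_id or not client_secret:
        # fail early instead of sending an empty credential to the API
        raise RuntimeError("Set SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET first")
    return client_id, client_secret

# stand-in mapping so the example runs without real credentials
demo_env = {"SPOTIPY_CLIENT_ID": "my-id", "SPOTIPY_CLIENT_SECRET": "my-secret"}
CLIENT_ID, CLIENT_SECRET = load_spotify_credentials(demo_env)
```

In a real session, calling `load_spotify_credentials()` without arguments reads from the actual environment, which keeps secrets out of the saved notebook.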

In [13]:
df.head()

Unnamed: 0,Rank,Song,Artist,Year,Lyrics,Source,Artists clean,artist_song1,songs_clean,artist_song2
0,1,wooly bully,sam the sham and the pharaohs,1965,sam the sham miscellaneous wooly bully wooly b...,3.0,sam the sham and the pharaohs,sam the sham and the pharaohs wooly bully,wooly bully,sam the sham and the pharaohs wooly bully
1,2,i cant help myself sugar pie honey bunch,four tops,1965,sugar pie honey bunch you know that i love yo...,1.0,four tops,four tops i cant help myself sugar pie honey b...,i cant help myself sugar pie honey bunch,four tops i cant help myself sugar pie honey b...
2,3,i cant get no satisfaction,the rolling stones,1965,,1.0,the rolling stones,the rolling stones i cant get no satisfaction,i cant get no satisfaction,the rolling stones i cant get no satisfaction
3,4,you were on my mind,we five,1965,when i woke up this morning you were on my mi...,1.0,we five,we five you were on my mind,you were on my mind,we five you were on my mind
4,5,youve lost that lovin feelin,the righteous brothers,1965,you never close your eyes anymore when i kiss...,1.0,the righteous brothers,the righteous brothers youve lost that lovin f...,you ve lost that lovin feel in,the righteous brothers you ve lost that lovin ...


In [14]:
#placeholder row for songs that cannot be found
EMPTY_FEATURES={'danceability': '', 'energy': '', 'key': '', 'loudness': '', 'mode': '', 'speechiness': '', 'acousticness': '', 'instrumentalness': '', 'liveness': '', 'valence': '', 'tempo': '', 'type': '', 'id': '', 'uri': '', 'track_href': '', 'analysis_url': '', 'duration_ms': '', 'time_signature': ''}

#function to retrieve track information
def get_tracks(df, col1, col2):
    tracks={}
    for i in notebook.tqdm(range(df.shape[0])):
        #fall back to empty values if neither query finds the song
        tracks[i]=dict(EMPTY_FEATURES)
        #try the original title first, then the wordninja-cleaned title
        for col in (col1, col2):
            try:
                #search once and take the first hit
                item=sp.search(df[col][i])['tracks']['items'][0]
                #get the track ID from the track URI (a unique ID)
                track_id=item['uri'].replace("spotify:track:", "")
                r = requests.get(BASE_URL + 'audio-features/' + track_id, headers=headers)
                tracks[i]=r.json()
                break
            except Exception:
                #no hit or failed request: try the next query
                continue
    return tracks
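
The fallback logic of `get_tracks` (try the original title, then the cleaned title) can be tested without network access by passing the search function in as an argument. This is a sketch under stated assumptions: `first_match`, `fake_search`, and the catalogue entry with its fake URI are all hypothetical. With spotipy, the `search` argument could wrap `sp.search(q)['tracks']['items']`.

```python
def first_match(queries, search):
    """Return (query, first hit) for the first query that yields any result."""
    for query in queries:
        hits = search(query)
        if hits:
            return query, hits[0]
    return None, None

# stand-in catalogue: only the cleaned title is "known" in this example
fake_catalogue = {
    "the righteous brothers you ve lost that lovin feel in": [
        {"uri": "spotify:track:FAKE123"}
    ],
}

def fake_search(query):
    """Pretend search: look the query up in the stand-in catalogue."""
    return fake_catalogue.get(query, [])

query, hit = first_match(
    [
        "the righteous brothers youve lost that lovin feelin",
        "the righteous brothers you ve lost that lovin feel in",
    ],
    fake_search,
)
track_id = hit["uri"].replace("spotify:track:", "")
print(query, track_id)
```

Returning the query that matched also makes it possible to record which of the two title variants found each song.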


In [15]:
#check for track info by first using songs and then songs cleaned
tracklist=get_tracks(df, 'artist_song1', 'artist_song2')





In [200]:
df_tracks=pd.DataFrame.from_dict(tracklist).T

In [195]:
df_tracks

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature,error
0,0.696,0.692,0,-9.586,1,0.0575,0.577,0.000688,0.246,0.904,140.978,audio_features,0SRkuudTEWe2HOloI1Nssq,spotify:track:0SRkuudTEWe2HOloI1Nssq,https://api.spotify.com/v1/tracks/0SRkuudTEWe2...,https://api.spotify.com/v1/audio-analysis/0SRk...,138867,4,
1,0.667,0.599,0,-8.894,1,0.0291,0.245,0,0.107,0.971,127.935,audio_features,6b6IMqP565TbtFFZg9iFf3,spotify:track:6b6IMqP565TbtFFZg9iFf3,https://api.spotify.com/v1/tracks/6b6IMqP565Tb...,https://api.spotify.com/v1/audio-analysis/6b6I...,160280,4,
2,0.723,0.863,2,-7.89,1,0.0338,0.0383,0.0317,0.128,0.931,136.302,audio_features,2PzU4IB8Dr6mxV3lHuaG34,spotify:track:2PzU4IB8Dr6mxV3lHuaG34,https://api.spotify.com/v1/tracks/2PzU4IB8Dr6m...,https://api.spotify.com/v1/audio-analysis/2PzU...,222813,4,
3,0.507,0.56,1,-8.118,0,0.0473,0.554,0,0.0517,0.645,143.259,audio_features,36ckFm0oicmvX8bWEErIHd,spotify:track:36ckFm0oicmvX8bWEErIHd,https://api.spotify.com/v1/tracks/36ckFm0oicmv...,https://api.spotify.com/v1/audio-analysis/36ck...,155960,4,
4,0.369,0.305,1,-14.303,1,0.0267,0.522,0,0.0579,0.376,94.791,audio_features,6AeyHqzNHJthYJbn0tvJ4b,spotify:track:6AeyHqzNHJthYJbn0tvJ4b,https://api.spotify.com/v1/tracks/6AeyHqzNHJth...,https://api.spotify.com/v1/audio-analysis/6Aey...,226453,4,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5095,0.608,0.628,8,-6.968,1,0.106,0.212,0,0.0727,0.587,180.163,audio_features,6EYwUarLDfsZ0BMD3IYETF,spotify:track:6EYwUarLDfsZ0BMD3IYETF,https://api.spotify.com/v1/tracks/6EYwUarLDfsZ...,https://api.spotify.com/v1/audio-analysis/6EYw...,210756,4,
5096,0.706,0.572,8,-5.799,0,0.0326,0.0262,0,0.585,0.574,139.982,audio_features,7seTcUFOhn5caSDbiSfsp0,spotify:track:7seTcUFOhn5caSDbiSfsp0,https://api.spotify.com/v1/tracks/7seTcUFOhn5c...,https://api.spotify.com/v1/audio-analysis/7seT...,214726,4,
5097,0.672,0.52,8,-7.747,1,0.0353,0.859,0,0.115,0.37,120.001,audio_features,5O2P9iiztwhomNh8xkR9lJ,spotify:track:5O2P9iiztwhomNh8xkR9lJ,https://api.spotify.com/v1/tracks/5O2P9iiztwho...,https://api.spotify.com/v1/audio-analysis/5O2P...,226600,4,
5098,0.893,0.48,1,-3.728,0,0.356,0.00855,0,0.116,0.382,86.976,audio_features,5lFDtgWsjRJu8fPOAyJIAK,spotify:track:5lFDtgWsjRJu8fPOAyJIAK,https://api.spotify.com/v1/tracks/5lFDtgWsjRJu...,https://api.spotify.com/v1/audio-analysis/5lFD...,170638,4,


In [22]:
print("Missing songs: ", df_tracks.loc[df_tracks.uri==""].shape[0])

Missing songs:  63


In [30]:
missing_songs=df.iloc[df_tracks.loc[df_tracks.uri==""].index]

In [32]:
missing_songs.artist_song2.to_dict()

{199: 'wilson pickett 6345789 souls ville usa',
 278: 'keith 986',
 291: 'bill cosby little ole man uptight everything s alright',
 374: 'stevie wonder shoo be doo be doodad ay',
 602: 'carole king its too late i feel the earth move',
 625: 'murray head  the trindad singers superstar',
 633: 'daddy dewdrop chic kaboom dont ya jes love it',
 650: 'the osmonds yoyo',
 659: 'paul mccartney another day oh woman oh why',
 663: '8th day she s not just another woman',
 699: 'perry como somos novi os its impossible',
 837: 'bobby boris pickett and the cryptkickers monster mash',
 885: 'blue ridge rangers jamb a lay a on the bayou',
 1020: 'van mccoy  the soul city symphony the hustle',
 1081: 'al green s halala make me happy',
 1092: 'neil sedaka  elton john bad blood',
 1099: 'discotex and the sexolettes get dancin',
 1109: 'walter murphy  the big apple band a fifth of beethoven',
 1167: 'wing and a prayer fife and drum corps baby face',
 1270: 'meco star wars theme can tina band',
 1272: 'lo

In [179]:
#Manually edit song titles that could not be found
songs_edit={199: 'wilson pickett 6345789 souls ville usa',
 278: 'keith 98.6',
 291: 'bill cosby little old man',
 374: 'stevie wonder shoo be doo',
 602: 'carole king its too late',
 625: 'murray head superstar',
 633: 'daddy dewdrop chick a boom',
 650: 'the osmonds yo yo',
 659: 'paul mccartney another day',
 663: '8th day she s not just another woman',
 699: 'perry como its impossible',
 837: 'bobby boris pickett monster mash',
 885: 'blue ridge rangers jambalaya',
 1020: 'van mccoy the hustle',
 1081: 'al green make me happy',
 1092: 'neil sedaka bad blood',
 1099: 'discotex and the sexolettes get dancin',
 1109: 'walter murphy fifth of beethoven',
 1167: 'wing and a prayer fife and drum corps baby face',
 1270: 'meco star wars theme',
 1272: 'david dundas jeans on',
 1299: 'cj  company devils gun',
 1476: 'linda ronstadt ooh baby baby',
 1510: 'rupert holmes escape',
 1681: 'john schneider its now or never',
 1695: 'gary wright really want to know you',
 1741: 'buckner  garcia pac man fever',
 1796: 'greg guidry going down',
 1968: 'irene cara brea kdance',
 2244: 'michael jackson i just cant stop loving you',
 2361: 'gloria estefan sound machine 1 2 3',
 2408: 'will to power baby i love your way',
 2513: 'paula abdul opposites attract',
 2568: 'tyler collins girls nite out',
 2594: 'ame lorain whole wide world',
 2595: 'motley crue without you',
 2602: 'cc music factory gonna make you sweat',
 2607: 'hifive i like the way',
 2634: 'cc music factory here we go',
 2650: 'cc music factory things that make you go hmmm',
 2913: 'saltnpepa whatta man',
 2915: 'mariah carey without you',
 3100: "los del rio macarena",
 3114: 'gin blossoms follow you down',
 3116: '2pac california love',
 3128: 'alanis morissette you learn',
 3137: 'monica before you walk out of my life',
 3145: 'monica why i love you so much',
 3165: 'adam clayton mission impossible',
 3201: 'jewel foolish games you were meant for me',
 3258: 'toni braxton i dont want to',
 3267: 'brock and the bizz my baby daddy',
 3281: "los del rio macarena",
 3347: 'sylke fyne romeo juliet',
 3348: "mya sisqo its all about me",
 3386: 'jewel foolish games',
 3591: 'kevon edmonds 24-7',
 3877: 'erykah badu love of my life',
 4092: 'jayz numb encore',
 4274: 'ti big shit poppin',
 4461: 'jayz empire state of mind',
 4520: 'jayz empire state of mind',
 4547: 'kris allen live like were dying'}


In [180]:
#get musical features of edited song names

missing_tracks={}
for k,v in notebook.tqdm(songs_edit.items()):
    try:
        #search once and take the first hit
        item=sp.search(v)['tracks']['items'][0]
        #get the track ID from the track URI (a unique ID)
        track_id=item['uri'].replace("spotify:track:", "")
        r = requests.get(BASE_URL + 'audio-features/' + track_id, headers=headers)
        missing_tracks[k]=r.json()
    except Exception:
        #songs that still cannot be found are skipped
        pass





In [188]:
df_missing_tracks=pd.DataFrame.from_dict(missing_tracks).T
df_missing_tracks.shape

(47, 18)

In [190]:
df_tracks.shape

(5100, 19)

In [207]:
#drop songs without a match from df_tracks; df_missing_tracks is appended in the next cell
df_tracks_2=df_tracks.loc[df_tracks.uri!='']

In [209]:
#DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_alltracks=pd.concat([df_tracks_2, df_missing_tracks])

In [211]:
df_alltracks.shape

(5084, 19)

In [214]:
df_alltracks.sort_index().tail()

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature,error
5095,0.608,0.628,8,-6.968,1,0.106,0.212,0.0,0.0727,0.587,180.163,audio_features,6EYwUarLDfsZ0BMD3IYETF,spotify:track:6EYwUarLDfsZ0BMD3IYETF,https://api.spotify.com/v1/tracks/6EYwUarLDfsZ...,https://api.spotify.com/v1/audio-analysis/6EYw...,210756,4,
5096,0.706,0.572,8,-5.799,0,0.0326,0.0262,0.0,0.585,0.574,139.982,audio_features,7seTcUFOhn5caSDbiSfsp0,spotify:track:7seTcUFOhn5caSDbiSfsp0,https://api.spotify.com/v1/tracks/7seTcUFOhn5c...,https://api.spotify.com/v1/audio-analysis/7seT...,214726,4,
5097,0.672,0.52,8,-7.747,1,0.0353,0.859,0.0,0.115,0.37,120.001,audio_features,5O2P9iiztwhomNh8xkR9lJ,spotify:track:5O2P9iiztwhomNh8xkR9lJ,https://api.spotify.com/v1/tracks/5O2P9iiztwho...,https://api.spotify.com/v1/audio-analysis/5O2P...,226600,4,
5098,0.893,0.48,1,-3.728,0,0.356,0.00855,0.0,0.116,0.382,86.976,audio_features,5lFDtgWsjRJu8fPOAyJIAK,spotify:track:5lFDtgWsjRJu8fPOAyJIAK,https://api.spotify.com/v1/tracks/5lFDtgWsjRJu...,https://api.spotify.com/v1/audio-analysis/5lFD...,170638,4,
5099,0.673,0.859,11,-3.007,0,0.056,0.00426,0.015,0.128,0.111,127.978,audio_features,6rMwYjoFvxKiaEH4If8ZpZ,spotify:track:6rMwYjoFvxKiaEH4If8ZpZ,https://api.spotify.com/v1/tracks/6rMwYjoFvxKi...,https://api.spotify.com/v1/audio-analysis/6rMw...,256560,4,


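Of the 63 songs missing after the first pass, 47 were recovered through the manually edited titles. The remaining 16 songs had to be dropped because their musical features could not be found through the Spotify API.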
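
The row counts reported in the outputs above can be reconciled with a short calculation. The variable names below are illustrative; the numbers are taken from the cell outputs in this notebook.

```python
total_songs = 5100        # rows in the Billboard data frame
missing_first_pass = 63   # rows with an empty uri after get_tracks
recovered = 47            # rows in df_missing_tracks

found_first_pass = total_songs - missing_first_pass  # rows kept in df_tracks_2
with_features = found_first_pass + recovered         # rows in df_alltracks
dropped = total_songs - with_features                # songs without features
print(found_first_pass, with_features, dropped)  # 5037 5084 16
```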

In [215]:
#merge track feature dataframe with song title and rank dataframe
#join is a left join on the row index, so songs without features keep their row with NaN values
df_music=df.join(df_alltracks)
df_music

Unnamed: 0,Rank,Song,Artist,Year,Lyrics,Source,Artists clean,artist_song1,songs_clean,artist_song2,...,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature,error
0,1,wooly bully,sam the sham and the pharaohs,1965,sam the sham miscellaneous wooly bully wooly b...,3.0,sam the sham and the pharaohs,sam the sham and the pharaohs wooly bully,wooly bully,sam the sham and the pharaohs wooly bully,...,0.904,140.978,audio_features,0SRkuudTEWe2HOloI1Nssq,spotify:track:0SRkuudTEWe2HOloI1Nssq,https://api.spotify.com/v1/tracks/0SRkuudTEWe2...,https://api.spotify.com/v1/audio-analysis/0SRk...,138867,4,
1,2,i cant help myself sugar pie honey bunch,four tops,1965,sugar pie honey bunch you know that i love yo...,1.0,four tops,four tops i cant help myself sugar pie honey b...,i cant help myself sugar pie honey bunch,four tops i cant help myself sugar pie honey b...,...,0.971,127.935,audio_features,6b6IMqP565TbtFFZg9iFf3,spotify:track:6b6IMqP565TbtFFZg9iFf3,https://api.spotify.com/v1/tracks/6b6IMqP565Tb...,https://api.spotify.com/v1/audio-analysis/6b6I...,160280,4,
2,3,i cant get no satisfaction,the rolling stones,1965,,1.0,the rolling stones,the rolling stones i cant get no satisfaction,i cant get no satisfaction,the rolling stones i cant get no satisfaction,...,0.931,136.302,audio_features,2PzU4IB8Dr6mxV3lHuaG34,spotify:track:2PzU4IB8Dr6mxV3lHuaG34,https://api.spotify.com/v1/tracks/2PzU4IB8Dr6m...,https://api.spotify.com/v1/audio-analysis/2PzU...,222813,4,
3,4,you were on my mind,we five,1965,when i woke up this morning you were on my mi...,1.0,we five,we five you were on my mind,you were on my mind,we five you were on my mind,...,0.645,143.259,audio_features,36ckFm0oicmvX8bWEErIHd,spotify:track:36ckFm0oicmvX8bWEErIHd,https://api.spotify.com/v1/tracks/36ckFm0oicmv...,https://api.spotify.com/v1/audio-analysis/36ck...,155960,4,
4,5,youve lost that lovin feelin,the righteous brothers,1965,you never close your eyes anymore when i kiss...,1.0,the righteous brothers,the righteous brothers youve lost that lovin f...,you ve lost that lovin feel in,the righteous brothers you ve lost that lovin ...,...,0.376,94.791,audio_features,6AeyHqzNHJthYJbn0tvJ4b,spotify:track:6AeyHqzNHJthYJbn0tvJ4b,https://api.spotify.com/v1/tracks/6AeyHqzNHJth...,https://api.spotify.com/v1/audio-analysis/6Aey...,226453,4,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5095,96,el perdon,nicky jam and enrique iglesias,2015,enrique iglesias dime si es verdad me dijeron ...,3.0,nicky jam and enrique iglesias,nicky jam and enrique iglesias el perdon,el per don,nicky jam and enrique iglesias el per don,...,0.587,180.163,audio_features,6EYwUarLDfsZ0BMD3IYETF,spotify:track:6EYwUarLDfsZ0BMD3IYETF,https://api.spotify.com/v1/tracks/6EYwUarLDfsZ...,https://api.spotify.com/v1/audio-analysis/6EYw...,210756,4,
5096,97,she knows,neyo featuring juicy j,2015,,,neyo,neyo she knows,she knows,neyo she knows,...,0.574,139.982,audio_features,7seTcUFOhn5caSDbiSfsp0,spotify:track:7seTcUFOhn5caSDbiSfsp0,https://api.spotify.com/v1/tracks/7seTcUFOhn5c...,https://api.spotify.com/v1/audio-analysis/7seT...,214726,4,
5097,98,night changes,one direction,2015,going out tonight changes into something red ...,1.0,one direction,one direction night changes,night changes,one direction night changes,...,0.37,120.001,audio_features,5O2P9iiztwhomNh8xkR9lJ,spotify:track:5O2P9iiztwhomNh8xkR9lJ,https://api.spotify.com/v1/tracks/5O2P9iiztwho...,https://api.spotify.com/v1/audio-analysis/5O2P...,226600,4,
5098,99,back to back,drake,2015,oh man oh man oh man not againyeah i learned ...,1.0,drake,drake back to back,back to back,drake back to back,...,0.382,86.976,audio_features,5lFDtgWsjRJu8fPOAyJIAK,spotify:track:5lFDtgWsjRJu8fPOAyJIAK,https://api.spotify.com/v1/tracks/5lFDtgWsjRJu...,https://api.spotify.com/v1/audio-analysis/5lFD...,170638,4,


In [216]:
#save the data frame with pandas' to_pickle method (an alternative to CSV that preserves dtypes)
df_music.to_pickle("df_music")