# Spotify Search API
- Use the Spotify API to get the song_id for each song, given a pre-processed search string that concatenates the song title and artist name
- The same API call also returns the song's popularity. We'll use another endpoint to retrieve further track features
- Some further pre-processing (e.g. stripping diacritics from the search string)
- We use a language detector to determine which market to pass as a parameter of the search call (alpha-2 country code, e.g. 'us', 'it')
- The Spotify API requires authentication (see below)
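The market selection described in the bullets above could be sketched as a simple lookup from a detected language code to an alpha-2 country code. The mapping table, the function name `market_for_language`, and the `"us"` fallback are all assumptions for illustration; the notebook does not show which detector or mapping was actually used.

```python
# Hypothetical mapping from ISO 639-1 language codes (as a language
# detector might return them) to alpha-2 market codes for the search call.
LANG_TO_MARKET = {
    "en": "us",
    "it": "it",
    "fr": "fr",
    "de": "de",
}


def market_for_language(lang_code, default="us"):
    """Return a market code for a detected language, falling back to `default`."""
    if not lang_code:
        return default
    return LANG_TO_MARKET.get(lang_code.lower(), default)


print(market_for_language("it"))
print(market_for_language("xx"))  # unknown language, falls back to the default
```

Falling back to a single default market keeps every row searchable even when the detector is unsure.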


Set these environment variables for Spotify authentication:
- export SPOTIPY_CLIENT_ID='your-spotify-client-id'
- export SPOTIPY_CLIENT_SECRET='your-spotify-client-secret'


In [78]:
import os
## Set environment variables (note: this overwrites any values already exported in the shell)
os.environ['SPOTIPY_CLIENT_ID'] = ''
os.environ['SPOTIPY_CLIENT_SECRET'] = ''

In [79]:
## Checking that environment variables exist
# print(os.environ.get("SPOTIPY_CLIENT_ID"))
# print(os.environ.get("SPOTIPY_CLIENT_SECRET"))
# ! printenv
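A minimal sketch of a check that reports which credentials are missing without printing their values. The helper name `missing_env_vars` is hypothetical and not part of spotipy.

```python
import os

REQUIRED_VARS = ("SPOTIPY_CLIENT_ID", "SPOTIPY_CLIENT_SECRET")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty, without revealing any values."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing Spotify credentials:", ", ".join(missing))
else:
    print("Spotify credentials are set.")
```

An empty string counts as missing here, which matters because the cell above assigns `''` as a placeholder.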

## Simple Demo of Search API Call
- OAuth authentication (client credentials flow) creates the sp object
- Use the .search function (input: demo string, limit=1, type="track")
- Explore data fields

In [80]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import sys
import os
import pprint

import pandas as pd
import numpy as np
import re

from tqdm.notebook import tqdm

In [81]:
## WORKED, SINGLE SONG EXAMPLE
## making search API call
sp = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials())

demo_string = "the clash straight to hell"
## results is a dictionary
results = sp.search(q=demo_string, limit=1, type="track")
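As an alternative to a plain concatenated string, the Spotify search query also accepts field filters such as `track:` and `artist:`. The sketch below only builds such a query string. The helper name `build_field_query` is hypothetical, and whether this improves the hit rate on this dataset is untested.

```python
def build_field_query(title, artist):
    """Build a search string that uses the `track:` and `artist:` field filters."""
    return f"track:{title.strip()} artist:{artist.strip()}"


print(build_field_query("straight to hell", "the clash"))
# track:straight to hell artist:the clash
```

The resulting string would be passed as `q` to `sp.search` in place of `demo_string`.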


In [82]:
## pretty printer to better handle dictionary printing, see request response json;
## uncomment to check it out

# pp = pprint.PrettyPrinter()
# pp.pprint(results)

In [83]:
## get the target variables from the first (and only) hit
track = results["tracks"]["items"][0]
song_id = track.get("id")
song_name = track.get("name")
## note: this is the album's first artist, which can differ from the track's own artist (e.g. on compilations)
artist_name = track["album"]["artists"][0].get("name")
song_popularity = track.get("popularity")

print(song_id, song_name, artist_name, song_popularity)

2ax1vei61BzRGsEn6ckEdL Straight to Hell - Remastered The Clash 59
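The field extraction above assumes the search returned at least one item. A small helper that returns `None` on an empty response could look like the sketch below. The function name `extract_track_fields` and the hand-written response dictionary are illustrative assumptions, trimmed to the fields used in this notebook.

```python
def extract_track_fields(results):
    """Return (id, name, artist, popularity) for the first hit, or None if there is none."""
    items = results.get("tracks", {}).get("items", [])
    if not items:
        return None
    track = items[0]
    # Like the cell above, this reads the artist from the album entry;
    # the album is assumed to list at least one artist.
    artist = track["album"]["artists"][0].get("name")
    return (track.get("id"), track.get("name"), artist, track.get("popularity"))


# Hand-written stand-in for a search response (not real API output).
fake_response = {"tracks": {"items": [{
    "id": "abc123",
    "name": "Some Song",
    "popularity": 42,
    "album": {"artists": [{"name": "Some Artist"}]},
}]}}

print(extract_track_fields(fake_response))
print(extract_track_fields({"tracks": {"items": []}}))
```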


### Getting popularity for a small subset of our DF
- Eventually, of course, this will run on the full DF (MSD)

In [84]:
# df that feeds the search API (SongNumber, SearchStr, market)
search_df = pd.read_csv("data/search_subset.csv")

In [85]:
search_df.shape

(31054, 5)

In [86]:
## remove the entries where SearchStr isn't present or nan
search_df = search_df[~search_df["SearchStr"].isnull()]
search_df.shape

(31054, 5)

In [87]:
## space to try out more pre-processing techniques to improve Spotify response

# remove parenthesised text from the SearchStr? -- might not do it, since we're using Levenshtein edit distance
# (raw string, so that "\(" is read as a regex escape rather than an invalid string escape)
search_df["SearchStr_nopar"] = search_df.SearchStr.apply(lambda x: re.sub(r"\(.*\)", "", x))
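One thing to be aware of with the pattern above: `.*` is greedy, so a string with two parenthesised groups loses everything between the first "(" and the last ")". The sketch below compares it with a per-group pattern that also collapses the leftover whitespace. The helper name `strip_parentheses` and the sample string are made up for illustration.

```python
import re


def strip_parentheses(text):
    """Remove each parenthesised group separately, then tidy the whitespace."""
    without_groups = re.sub(r"\([^)]*\)", "", text)
    return re.sub(r"\s+", " ", without_groups).strip()


sample = "intro (live) my song (remastered) some artist"

# Greedy pattern: also drops "my song", which sits between the two groups.
print(repr(re.sub(r"\(.*\)", "", sample)))

# Per-group pattern: keeps the text between the groups.
print(repr(strip_parentheses(sample)))
```

Nested parentheses are not handled by either pattern.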

In [88]:
## https://stackoverflow.com/questions/34753821/remove-diacritics-from-string-for-search-function
## removing diacritics (special accents) from the strings?
import unicodedata
def shave_marks(txt):
    """This method removes all diacritic marks from the given string"""
    norm_txt = unicodedata.normalize('NFD', txt)
    shaved = ''.join(c for c in norm_txt if not unicodedata.combining(c))
    return unicodedata.normalize('NFC', shaved)

search_df["SearchStr_decode_nopar"] = search_df["SearchStr_nopar"].apply(shave_marks)
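`shave_marks` only removes combining marks, so letters without a Unicode decomposition (such as "ø" or "ß") pass through unchanged. The sketch below shows this and adds a hypothetical extra translation table. `EXTRA_FOLDS` and `fold_to_ascii_like` are illustrative names and the table is far from complete.

```python
import unicodedata


def shave_marks(txt):
    """Remove all diacritic marks from the given string (same logic as above)."""
    norm_txt = unicodedata.normalize("NFD", txt)
    shaved = "".join(c for c in norm_txt if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", shaved)


# Hypothetical, incomplete table for letters that NFD does not decompose.
EXTRA_FOLDS = str.maketrans({"ø": "o", "Ø": "O", "ß": "ss", "đ": "d", "ł": "l"})


def fold_to_ascii_like(txt):
    """Strip combining marks, then replace a few undecomposable letters."""
    return shave_marks(txt).translate(EXTRA_FOLDS)


print(shave_marks("münchener freiheit"))  # the umlaut is removed
print(shave_marks("søren"))               # "ø" survives, it has no decomposition
print(fold_to_ascii_like("søren straße"))
```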

In [89]:
search_df.head(2)

| | SongNumber | Title | ArtistName | SearchStr | market | SearchStr_nopar | SearchStr_decode_nopar |
|---|---|---|---|---|---|---|---|
| 0 | 77629 | Fire Dance | STRATOVARIUS | fire dance stratovarius | it | fire dance stratovarius | fire dance stratovarius |
| 1 | 575703 | Don't Worry | Fred Thomas | don't worry fred thomas | us | don't worry fred thomas | don't worry fred thomas |


In [90]:
print(search_df.shape)
search_df.sample(5)

(31054, 7)


| | SongNumber | Title | ArtistName | SearchStr | market | SearchStr_nopar | SearchStr_decode_nopar |
|---|---|---|---|---|---|---|---|
| 30792 | 478316 | Universal Daddy | Alphaville | universal daddy alphaville | us | universal daddy alphaville | universal daddy alphaville |
| 1085 | 404403 | This Land Is Your Land | The Seekers | this land is your land the seekers | us | this land is your land the seekers | this land is your land the seekers |
| 26747 | 311824 | Schuld war wieder die Nacht | Münchener Freiheit | schuld war wieder die nacht münchener freiheit | us | schuld war wieder die nacht münchener freiheit | schuld war wieder die nacht munchener freiheit |
| 1200 | 261244 | dark breaker | Sundial Aeon | dark breaker sundial aeon | us | dark breaker sundial aeon | dark breaker sundial aeon |
| 25400 | 233084 | Dyrt å spå | Hellbillies | dyrt å spå hellbillies | us | dyrt å spå hellbillies | dyrt a spa hellbillies |


In [91]:
## following the format above, build tuples for all of the songs in the subset;
## tuples are (SearchStr, market)
search_tups = []

## change the string column to be used for the search here
search_col = "SearchStr_nopar"

for search_str, market in tqdm(zip(search_df[search_col], search_df["market"]), total=len(search_df)):
    search_tups.append((search_str, market))
    


In [92]:
search_tups[:10]

[('fire dance stratovarius', 'it'),
 ("don't worry fred thomas", 'us'),
 ("guess who's coming to dinner chops", 'us'),
 ('deep river  the five blind boys of alabama', 'us'),
 ('intro/love line interlude1 deviants of reality', 'us'),
 ('danny boy foster & allen', 'us'),
 ("postcard finn's motel", 'us'),
 ('i come down joseph arthur', 'us'),
 ('move up tony roots', 'us'),
 ('automatic " scott wozniak_ angelica linares', 'us')]

In [93]:
## iterate over our search strings, use the spotipy package to get data
## result_list holds tuples (id<str>, song_name<str>, artist_name<str>, popularity<int>);
## searches with no match are stored as "API-fail <search string>" strings instead
import time

result_list = []

## chunks for API calls
# https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for o in range(0, len(lst), n):
        yield lst[o:o + n]

chunk_search_tups = list(chunks(search_tups, 1400))
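A quick worked check of the chunking arithmetic, using a dummy list of the same length as the subset (31054 rows) in place of `search_tups`:

```python
import math


def chunks(lst, n):
    """Yield successive n-sized chunks from lst (same helper as above)."""
    for o in range(0, len(lst), n):
        yield lst[o:o + n]


n_rows, chunk_size = 31054, 1400
dummy = list(range(n_rows))  # stand-in for search_tups
parts = list(chunks(dummy, chunk_size))

print(len(parts), math.ceil(n_rows / chunk_size))  # 23 23
print(len(parts[-1]))                              # 254
```

So there are 22 full chunks of 1400 plus a final chunk of 254 (22 × 1400 + 254 = 31054).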

In [95]:
# chunk_search_tups[1]


In [96]:
for chunk in tqdm(chunk_search_tups):

    ## pause between chunks of API calls
    time.sleep(3)

    for t in chunk:
        real_result = sp.search(q=t[0], limit=1, type="track", market=t[1])
        items = real_result["tracks"]["items"]

        ## no match: the response contains an empty list of items
        if not items:
            result_list.append("API-fail " + t[0])

        else:
            track = items[0]
            result_list.append((track.get("id"),
                                track.get("name"),
                                track["album"]["artists"][0].get("name"),
                                track.get("popularity")))


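The loop above stops at the first exception raised by a search call (network error, rate limiting, and so on), which would lose a long run. A minimal retry wrapper could look like the sketch below. `search_with_retry` is a hypothetical helper; it takes the search function as an argument so it can be tried without credentials, and it catches every exception, which is broader than you would want in production code.

```python
import time


def search_with_retry(search_fn, query, market, retries=3, wait_seconds=1.0):
    """Call `search_fn` and retry on any exception; return None if all attempts fail."""
    for attempt in range(1, retries + 1):
        try:
            return search_fn(q=query, limit=1, type="track", market=market)
        except Exception as exc:  # deliberately broad in this sketch
            print(f"attempt {attempt} failed for {query!r}: {exc}")
            if attempt < retries:
                time.sleep(wait_seconds * attempt)  # simple linear backoff
    return None
```

In the loop it would be called as `search_with_retry(sp.search, t[0], t[1])`, with a `None` return treated like a failed search.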

In [97]:
## computing the fraction of songs where the search returned no match
# 0.19 with parentheses (.head(100))
# 0.1 without parentheses (.head(100))
# 0.1 without parentheses or diacritics (.head(100))

perc_API_failed = len([i for i in result_list if "API-fail" in i]) / len(result_list)
perc_API_failed
# result_list

0.15762864687318864
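The `"API-fail" in i` test above works because `in` does a substring check on strings and an element comparison on tuples. An explicit type check makes that intent clearer. The sketch below uses a made-up three-entry list, and `failure_rate` is a hypothetical helper name.

```python
def failure_rate(results):
    """Share of entries that are failure strings rather than result tuples."""
    if not results:
        return 0.0
    failed = sum(1 for r in results if isinstance(r, str) and r.startswith("API-fail"))
    return failed / len(results)


sample_results = [
    ("id1", "Song", "Artist", 10),
    "API-fail some query",
    ("id2", "Other", "Artist", 0),
]
print(round(failure_rate(sample_results), 3))
```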

In [98]:
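## seeing which queries failed, and hopefully why
# len([i for i in result_list if not "API-fail" in i])
## first failed query with the "API-fail" prefix stripped (not displayed: only the last expression of a cell is shown)
" ".join([i for i in result_list if "API-fail" in i][0].split(" ")[1:])
result_list[:10]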


[('3YOhXYCLFRQxEmUlzjiWEJ', 'Fire Dance', 'Stratovarius', 17),
 ('53RfjM48r9xNywpDiog6zG', "Don't Worry", 'Fred Thomas', 1),
 "API-fail guess who's coming to dinner chops",
 ('3XeMKCwdyW5aRqbpR9zMal',
  'Look Where He Brought Me From',
  'The Blind Boys Of Alabama',
  11),
 ('73vcZCFNErwqHxuS1BXH0e',
  'Intro/Love Line Interlude 1',
  'Deviants Of Reality',
  0),
 ('5V2GVAhUtjXwEfYNUjDUyz', 'Danny Boy', 'Foster & Allen', 26),
 ('0Qdy0Vu9xir8mjc6iQ6vTA', 'Postcard', "Finn's Motel", 0),
 ('6llvyjyBy6iORyHIkcpVJW', 'I Come Down', 'Joseph Arthur', 1),
 ('1NccL2OKYnyukD6xMsmdG1', 'Move Up', 'Tony Roots', 0),
 ('1Hxm5EQNE94R0ZgRvmNQ33',
  'Automatic (feat. Angelica Linares) - Wozniak Vocal Mix',
  'Scott Wozniak',
  0)]

In [99]:
## entries where the search returned no match
failed_results = [i for i in result_list if "API-fail" in i]
failed_results[:3]

["API-fail guess who's coming to dinner chops",
 'API-fail cycle of the streets thought riot',
 'API-fail i can let go now lee ryan']

In [100]:
## we have to restructure the results list because the API-fail entries are strings;
## replace each with a 3-tuple ("", search string, nan) -- works; popularity is filled with nan further down
results_copy = result_list.copy()

for i, r in enumerate(result_list):
    # print(r)
    # break
    if "API-fail" in r:
        results_copy[i] = ("", " ".join(r.split(" ")[1:]), np.nan)
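A variant of this restructuring step pads failures to the same four fields as successful results, so later code does not need to check the tuple length. This is a sketch under assumptions, not the notebook's method: `normalize_result` is a hypothetical helper, and it uses `math.nan` (a plain float NaN, like `np.nan`) to stay free of dependencies.

```python
import math

PREFIX = "API-fail "


def normalize_result(entry):
    """Turn a failure string into a 4-field record so every entry has the same shape."""
    if isinstance(entry, str) and entry.startswith(PREFIX):
        query = entry[len(PREFIX):]
        # Fields: id, song name (the query is kept for inspection), artist, popularity.
        return ("", query, math.nan, math.nan)
    return entry


sample_entries = [
    ("id1", "Song", "Artist", 10),
    "API-fail guess who's coming to dinner chops",
]
normalized = [normalize_result(r) for r in sample_entries]
print([len(r) for r in normalized])  # [4, 4]
```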
        

In [101]:
# results_copy
for i in results_copy:
    if "" in i:
        print(i)
        break

('', "guess who's coming to dinner chops", nan)


In [102]:
results_copy[10]

('', 'cycle of the streets thought riot', nan)

In [103]:
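## get target variables from results_copy to add as features to our data; especially to evaluate performance
## (for failed searches, i[1] holds the search string, i[2] is nan, and popularity falls back to nan)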
_ids = [i[0] for i in results_copy]
_namesong = [i[1] for i in results_copy]
_nameartist = [i[2] for i in results_copy]
_pop = [i[3] if len(i) == 4 else np.nan for i in results_copy]

In [104]:
## now we have the song_id, title from spotify, and the popularity for our subset
search_df["SongID"] = _ids
search_df["SpotifySongTitle"] = _namesong
search_df["SpotifyArtistTitle"] = _nameartist
search_df["Popularity"] = _pop

In [105]:
## ! we still need a way to evaluate how well the popularity retrieval did;
## some rows match exactly, but others may need more pre-processing or light rule-based matching
search_df.sample(10)

| | SongNumber | Title | ArtistName | SearchStr | market | SearchStr_nopar | SearchStr_decode_nopar | SongID | SpotifySongTitle | SpotifyArtistTitle | Popularity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 30655 | 861944 | See You Tomorrow | John Powell | see you tomorrow john powell | us | see you tomorrow john powell | see you tomorrow john powell | 7D9Jhcu3r5UcVbqwKuZKsV | See You Tomorrow | John Powell | 49.0 |
| 29949 | 550959 | A Porter's Love Song | James P. Johnson | a porter's love song james p. johnson | us | a porter's love song james p. johnson | a porter's love song james p. johnson | 3EM3lXPY4hLtxZdj1HJJQf | A Porter's Love Song To A Cham | James P. Johnson | 4.0 |
| 1951 | 973161 | Before The Day | Newsong | before the day newsong | us | before the day newsong | before the day newsong | 3CsrLqHtqIT9ZGgh8LRBFf | Before The Day | Newsong | 17.0 |
| 1480 | 702992 | Homeward Strut | Tommy Bolin | homeward strut tommy bolin | us | homeward strut tommy bolin | homeward strut tommy bolin | 3q7C3yW4bmsNvqvqUdyqDY | Homeward Strut | Tommy Bolin | 13.0 |
| 28974 | 722393 | Time Out | Joe Walsh | time out joe walsh | us | time out joe walsh | time out joe walsh | 0VezDmUV7Vkpo0aLUu4jqJ | Time Out | Joe Walsh | 30.0 |
| 14376 | 748719 | The Valley People | Earlimart | the valley people earlimart | us | the valley people earlimart | the valley people earlimart | 4tXNAwcs2WzDAO0b5lGuKW | The Valley People | Earlimart | 4.0 |
| 28203 | 909008 | Conclusions / Concussions | Air Conditioning | conclusions / concussions air conditioning | fr | conclusions / concussions air conditioning | conclusions / concussions air conditioning | 2LvGlr6chqKC36t5tHAkR1 | Conclusions / Concussions | Air Conditioning | 0.0 |
| 18438 | 302059 | Walking In Memphis | Lonestar | walking in memphis lonestar | us | walking in memphis lonestar | walking in memphis lonestar | 3c9WJhG3QvTtHmwKb5wz3i | Walking In Memphis | Lonestar | 54.0 |
| 7332 | 819881 | Youth Goes | U.S. Bombs | youth goes u.s. bombs | us | youth goes u.s. bombs | youth goes u.s. bombs | 0Um6KiGWNVrEt0SwEavxFi | Youth Goes | U.S. Bombs | 10.0 |
| 21110 | 401251 | Equinoxe (Part 5) | New Electronic Soundsystem | equinoxe (part 5) new electronic soundsystem | us | equinoxe new electronic soundsystem | equinoxe new electronic soundsystem | 0NBsaHT4fIKgDC49lJAhHr | Equinoxe (Part 5) | New Electronic Soundsystem | 0.0 |
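One possible way to score how well a returned title matches the original is a string similarity ratio. The sketch below uses `difflib.SequenceMatcher` from the standard library as a stand-in. It is not the Levenshtein edit distance mentioned earlier, and any cut-off for calling a row a match would still have to be chosen by inspection. `title_similarity` is a hypothetical helper; the title pairs are taken from the outputs shown in this notebook.

```python
from difflib import SequenceMatcher


def title_similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


pairs = [
    ("See You Tomorrow", "See You Tomorrow"),
    ("A Porter's Love Song", "A Porter's Love Song To A Cham"),
    ("Deep River", "Look Where He Brought Me From"),
]
for original, returned in pairs:
    print(f"{title_similarity(original, returned):.2f}  {original!r} vs {returned!r}")
```

Applied to the frame, this could become a new column comparing `Title` with `SpotifySongTitle`, with low-scoring rows reviewed by hand.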


In [107]:
## write it to disk so we don't lose it; used later to merge with the track features data
# search_df.to_csv("data/dataSpotify.csv", index=False)
