Originally, I wrote this all in one big `.py` file, but it didn't lend itself nicely to presentation. This notebook provides a look at how playlists are analyzed. 

First, all the packages are imported and some boring but important stuff (authentication) is set up. 

In [2]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances
import pandas as pd
import numpy as np
import requests
import config  # used for auth stuff
import re
BASE_URL = 'https://api.spotify.com/v1/'
AUTH_URL = 'https://accounts.spotify.com/api/token'

In [3]:
# get authentication headers and stuff back (boring)
def auth_stuff(client_id: str, client_secret: str):
    
    auth_response = requests.post(AUTH_URL, {
        'grant_type': 'client_credentials',
        'client_id': client_id,
        'client_secret': client_secret,
    })
    auth_response_data = auth_response.json()

    # save the access token
    access_token_ = auth_response_data['access_token']

    headers_ = {
        'Authorization': f'Bearer {access_token_}'
    }

    return access_token_, headers_  # does this need the access_token_ to be returned?

This is where the playlist processing starts. The first step is to pull the playlist data from the Spotify API. To do that, you need the playlist ID, which is the last segment of the URL when you open a Spotify playlist link in a browser, like the bolded section below:
<pre>
https://open.spotify.com/playlist/<span style="font-weight: bold;">2wIJNW0Zn7LK6mi2slwdAF</span>
</pre>

That segment of the URL is a unique identifier for each playlist, and with it, you can pull back a large amount of metadata. For this analysis, I'll be working with a playlist called "4 aspen pt. 4". This is one I've been working on for my little sister while she's been away on a mission trip for our church, and the fourth in a series of playlists I have made for her over the past couple years. 
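The playlist ID described above can also be pulled out of a link programmatically. This is a minimal sketch: `extract_playlist_id` is a hypothetical helper, not part of the original script, and the `?si=abc123` query string is a made-up stand-in for the extra parameters a shared link may carry.

```python
from urllib.parse import urlparse


def extract_playlist_id(url: str) -> str:
    """Return the last path segment of a Spotify playlist URL (hypothetical helper)."""
    path = urlparse(url).path  # e.g. '/playlist/2wIJNW0Zn7LK6mi2slwdAF', query string excluded
    segments = [segment for segment in path.split('/') if segment]
    if len(segments) < 2 or segments[-2] != 'playlist':
        raise ValueError(f'not a playlist URL: {url!r}')
    return segments[-1]


# The query string here is invented for illustration.
playlist_id = extract_playlist_id(
    'https://open.spotify.com/playlist/2wIJNW0Zn7LK6mi2slwdAF?si=abc123'
)
print(playlist_id)  # 2wIJNW0Zn7LK6mi2slwdAF
```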

In [4]:
# pull playlist info back from api
# get track, artist, album ids from the playlist id
def playlist_processor(playlist_id: str, headers_: dict):
    playlist_info = requests.get(f'{BASE_URL}playlists/{playlist_id}/tracks', headers=headers_)

    # the response JSON holds the playlist's tracks under the 'items' key
    try:
        songs = playlist_info.json()['items']
    except KeyError as ke:
        print(f'\nOh no! This message indicates an error ({ke=}) occurred. Make sure the playlist you\'re attempting to analyze is not private.')
        songs = []

    song_ids = []
    album_ids = []
    artist_ids = []

    for i, item in enumerate(songs):
        track = item['track']
        if track is None:
            # some playlist items come back without track data; skip them so the three lists stay aligned
            print(f'item {i} has no track data, skipping it.')
            continue

        song_ids.append(track['id'])  # add the song id
        album_ids.append(track['album']['id'])  # add the album id
        artist_ids.append([artist['id'] for artist in track['artists']])  # one list of artist ids per song

    return song_ids, album_ids, artist_ids


# The API only returns info on a limited number of records per request (50 for tracks). This function splits your list of ids into nested lists of at most `members` ids each.
# You can set the `members` parameter to less than 50 if you want (e.g., if the API limits change).
def nest_id_lists(id_list: list, members: int):
    if len(id_list) > members:  # if the list is longer than `members` items, figure out how many API calls to make
        if len(id_list) % members != 0:
            how_many_times = (len(id_list) / members) + 1
        else:
            how_many_times = len(id_list) / members
        
        start = 0
        end = members

        nested_list = []

        for i in range(0, int(how_many_times)):
            nested_list.append(list(id_list)[start:end])
            start += members  # increment to get the next values starting at previous end
            if end + members > len(id_list):
                end = len(id_list)
            else:
                end += members
        
        song_ids_ = nested_list

    else:
        song_ids_ = []
        song_ids_.append(id_list)

    return song_ids_
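The batching that `nest_id_lists` performs can also be written with slicing. This is a sketch rather than a drop-in replacement: `chunk_ids` is a hypothetical name, the ids are made up, and unlike `nest_id_lists` it returns an empty list (not a list holding one empty list) when given no ids.

```python
def chunk_ids(ids: list, size: int) -> list:
    """Split ids into consecutive batches of at most `size` items (hypothetical helper)."""
    if size < 1:
        raise ValueError('size must be at least 1')
    return [ids[start:start + size] for start in range(0, len(ids), size)]


ids = [f'id{n}' for n in range(120)]  # made-up ids, for illustration only
batches = chunk_ids(ids, 50)
print([len(batch) for batch in batches])  # [50, 50, 20]
```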

One cool feature of Spotify's API is that you can pull back some internal metrics that they use, including one to quantify "popularity". It's scaled from 0-100, and indicates how popular a certain track or album is. 

Initially, I had wanted to come up with my own metric for tracking popularity, using a formula like the following. That was the genesis of this project:
$$
\mathrm{track\_popularity} = \frac{1}{\log(\mathrm{num\_of\_song\_plays})} \times \frac{1}{\log(\mathrm{artists\_monthly\_listeners})}
$$
However, some of the data points I wanted, like the play count for a track and the total monthly listeners for an artist, were not available via the API, so I made do with Spotify's precomputed "popularity" feature, and it worked well.
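To get a feel for how the formula above behaves, here is a small sketch that evaluates it. The natural logarithm is an assumption (the formula does not name a base), and the play and listener counts are invented, since the API does not expose the real ones.

```python
import math


def track_popularity(num_of_song_plays: int, artists_monthly_listeners: int) -> float:
    """Evaluate the formula above, assuming natural logs; both counts must exceed 1."""
    if num_of_song_plays <= 1 or artists_monthly_listeners <= 1:
        raise ValueError('both counts must be greater than 1')
    return (1 / math.log(num_of_song_plays)) * (1 / math.log(artists_monthly_listeners))


# Hypothetical counts, chosen only to show the direction of the metric.
small = track_popularity(10_000, 50_000)
large = track_popularity(100_000_000, 20_000_000)
print(small > large)  # True
```

The value shrinks as the counts grow, so under this formula a larger number corresponds to a less-played track.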

In [5]:
# get popularity metrics for songs and albums
def get_popularities(id_list: list, headers_: dict, endpoint: str, batch_size: int):
    # endpoint should be 'tracks', 'albums', or 'artists'
    try:
        ids_new = nest_id_lists(id_list, batch_size)
        popularity_list = []

        for list_ in ids_new:
            subset_of_ids = ','.join(list_)  # the API expects a comma-separated string of ids
            popularities_raw = requests.get(f'{BASE_URL}{endpoint}/?ids={subset_of_ids}', headers=headers_)
            batch = popularities_raw.json()[endpoint]  # parse the response once per batch

            for j in range(0, len(list_)):
                popularity_list.append(batch[j]['popularity'])
    
    except KeyError as k:
        print(f'{k=}. Did you use the wrong endpoint?')

    return popularity_list

Once I had that working, I let the project rest for a while. That part has been done since August, and it was initially all I set out to do. The other day, however, I decided to expand the analysis. After reading through some of the Spotify API docs, I decided to work with the "Audio Features" endpoint (`audio-features`) to get more data back. It offers mostly quantitative variables that express everything from how acoustic a song sounds to whether it's in a major or minor key.

In [6]:
def get_more_features(id_list: list, headers_: dict, endpoint: str, batch_size: int):
    # endpoint should be 'audio-features'
    try:
        ids_new = nest_id_lists(id_list, batch_size)
        dfs_to_concat = []

        for list_ in ids_new:
            subset_of_ids = ','.join(list_)  # the API expects a comma-separated string of ids
            features_raw = requests.get(f'{BASE_URL}{endpoint}/?ids={subset_of_ids}', headers=headers_)
            music_data = pd.json_normalize(features_raw.json()['audio_features'])
            # drop identifier columns and time_signature; track_href is kept so the song can be looked up later
            music_data.drop(['type', 'id', 'uri', 'analysis_url', 'time_signature'], axis=1, inplace=True)
            dfs_to_concat.append(music_data)

        useful_music_features = pd.concat(dfs_to_concat)
    
    except KeyError as k:
        print(f'{k=}. Did you use the wrong endpoint?')

    return useful_music_features


def preprocess_the_data(df: pd.DataFrame):
    track_ids = df['track_href']
    del df['track_href']
    data_array = np.array(df)

    return data_array, track_ids


def get_mean_song(df: np.array, ids: pd.Series, headers_):
    
    ids = np.array(ids).reshape(1, -1)
    mean_ = np.mean(df, axis=0)
    min_distance = np.inf  # start at infinity so the first song always becomes the initial candidate
    for i in range(0, df.shape[0]):
        # euclidean_distances returns a 1x1 array here, so pull out the scalar
        distance = euclidean_distances(df[i, :].reshape(1, -1), mean_.reshape(1, -1))[0, 0]
        if distance < min_distance:
            min_distance = distance
            mean_song = ids[0][i]

    ms = requests.get(mean_song, headers=headers_)
    track_name = ms.json()['name']
    artist = ms.json()['album']['artists'][0]['name']  ### TODO -- update this to support multiple artists

    track_message = f'The song that best represents this playlist is {track_name} by {artist}.'

    return track_message

Once the data is returned and the identifier columns are dropped, there are about a dozen numeric features to work with. I decided to find out what the "average" song was for a given playlist based on these metrics. To do that, I treated each row as a vector. To get the mean values across all the songs in the playlist, I take a column-wise average and treat the result as the "average song" vector. I then compute the distance between each song in the playlist and that vector. The song closest to it is returned as the "average" song, or the single song on the playlist that best represents the playlist.

The metric I decided to use was the Euclidean distance. Initially, I was using cosine similarity, but I wanted to take the magnitude of the vectors into account too. Euclidean distance is the distance between the tips of two vectors; intuitively, you can think of it as the distance between two points in space. In the context of this project, it measures the distance between the "average song" vector and a given song from the playlist, so the "average" song is the one whose tip is closest to the tip of the "average song" vector. While it has the disadvantage of not explicitly returning information about the angle between the two vectors, two vectors with a low Euclidean distance will likely point in a similar direction.
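The difference between the two metrics is easier to see on toy data. The sketch below uses plain Python so each step stays visible. The two-feature "songs" are made up, and `mean_vector` and `cosine_sim` are hypothetical helpers, not the scikit-learn functions imported at the top of the notebook.

```python
import math


def mean_vector(rows):
    """Column-wise average of equally sized rows."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def cosine_sim(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# Made-up two-feature "songs", for illustration only.
songs = {
    'song_a': [1.0, 1.0],
    'song_b': [5.0, 5.0],
    'song_c': [9.0, 9.5],
}
average = mean_vector(list(songs.values()))

distances = {name: math.dist(vec, average) for name, vec in songs.items()}
similarities = {name: cosine_sim(vec, average) for name, vec in songs.items()}

closest = min(distances, key=distances.get)
print(closest)  # song_b
```

Here `song_a` and `song_b` point in exactly the same direction, so cosine similarity cannot tell them apart, while Euclidean distance picks `song_b` because its magnitude is also close to the average. Euclidean distance is sensitive to scale, though, so a feature measured in large units can dominate the result.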

In [7]:
def get_artist_popularities(id_list: list, headers_: dict):
    
    popularity_list = []
    lengths = []
    starred = []

    for i in id_list:
        lengths.append(len(i))
        starred = [*starred, *i]

    raw_popularities = get_popularities(starred, headers_, 'artists', 20)  # request artists in batches of 20 ids per call

    start = 0
    end = 0
    for i in lengths:
        end += i
        popularity_list.append(raw_popularities[start:end])
        start = end

    for i in range(0, len(popularity_list)):
        if len(popularity_list[i]) != len(id_list[i]):
            print(f'Warning: song {i} has {len(id_list[i])} artists but {len(popularity_list[i])} popularity values.')

    return popularity_list


def get_artist_follower_count(id_list: list, headers_: dict, batch_size: int):  # TODO -- NOT TESTED YET
    # almost identical to the get popularity one, but it returns # of an artist's followers as opposed to popularity

    clean_followers_list = []
    lengths = []
    starred = []

    for i in id_list:
        lengths.append(len(i))
        starred = [*starred, *i]

    ids_new = nest_id_lists(starred, batch_size)

    followers_list = []

    for list_ in ids_new:
        subset_of_ids = ','.join(list_)  # the API expects a comma-separated string of ids
        followers_raw = requests.get(f'{BASE_URL}artists/?ids={subset_of_ids}', headers=headers_)
        artists_batch = followers_raw.json()['artists']  # parse the response once per batch
        for j in range(0, len(list_)):
            followers_list.append(artists_batch[j]['followers']['total'])

    start = 0
    end = 0
    for i in lengths:
        end += i
        clean_followers_list.append(followers_list[start:end])
        start = end

    return clean_followers_list


def get_artists_top_songs(id_list: list, headers_: dict, batch_size: int):
    clean_top_songs = []
    lengths = []
    starred = []

    for i in id_list:
        lengths.append(len(i))
        starred = [*starred, *i]

    top_songs = []

    for artist in starred:
        top_songs_raw = requests.get(f'{BASE_URL}artists/{artist}/top-tracks?market=US', headers=headers_)
        top_tracks = top_songs_raw.json()['tracks']  # parse the response once per artist
        for j in range(0, len(top_tracks)):
            top_songs.append(top_tracks[j]['album']['id'])  # note: this collects each top track's album id

    start = 0
    end = 0
    for i in lengths:
        end += i
        clean_top_songs.append(top_songs[start:end])
        start = end

    return clean_top_songs


def get_playlist_name(playlist_id: str, headers_: dict):

    playlist_name_raw = requests.get(f'{BASE_URL}playlists/{playlist_id}', headers=headers_)
    try:
        playlist_name_clean = playlist_name_raw.json()['name']
    except KeyError:
        print('\nThere seems to be a problem retrieving the playlist name. Double-check to make sure the playlist ID is correct, and that it is not set to private.')
        playlist_name_clean = None  # main() treats any non-string name as a failed lookup

    return playlist_name_clean
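The artist helpers above share one pattern: flatten the per-song lists of artist ids, fetch values for the flat list in batches, then regroup the results per song. This sketch isolates that pattern, with no API calls. `flatten_with_lengths` and `regroup` are hypothetical names, and the ids and popularity values are made up.

```python
def flatten_with_lengths(nested: list):
    """Flatten a list of lists and remember how long each inner list was."""
    lengths = [len(inner) for inner in nested]
    flat = [item for inner in nested for item in inner]
    return flat, lengths


def regroup(flat: list, lengths: list) -> list:
    """Rebuild the original nesting from a flat list and the saved lengths."""
    grouped, start = [], 0
    for length in lengths:
        grouped.append(flat[start:start + length])
        start += length
    return grouped


# Made-up artist ids: one inner list per song.
artist_ids = [['a1'], ['a2', 'a3'], ['a4']]
flat, lengths = flatten_with_lengths(artist_ids)

# Stand-in for the batched API call: one made-up value per flat id.
fake_popularities = [10 * n for n in range(1, len(flat) + 1)]

print(regroup(fake_popularities, lengths))  # [[10], [20, 30], [40]]
```

Regrouping only lines up when the fetched list has exactly one value per flat id, in the same order.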


Once all the preprocessing functions are up and running, it's time to tie it all together in the `main()` function. This bundles everything else in the script together, pulling data about artists, songs, and playlists, and summarizes it all in a concise message about the "indieness" rating and the most representative song. 

In [9]:
# run the full analysis for one or more comma-separated playlist ids
def main(playlist_ids: str, client_id: str, client_secret: str):
    access_token, headers = auth_stuff(client_id, client_secret)

    # playlists = get_all_playlist_ids(user_id, headers)
    
    # split the string into a list if necessary
    playlist_id_list = re.sub(r'\s+', '', playlist_ids).split(',')

    song_ids_all, album_ids_all, artist_ids_all = [], [], []
    
    for i in range(0, len(playlist_id_list)):
        song_ids, album_ids, artist_ids = playlist_processor(playlist_id_list[i], headers)
        if len(song_ids) > 0:
            song_ids_all.append(song_ids)
            album_ids_all.append(album_ids)
            artist_ids_all.append(artist_ids)

        medians = []

        for k in range(0, len(song_ids_all)):
            song_popularities = get_popularities(song_ids_all[k], headers, 'tracks', 50)
            
            more_features = get_more_features(song_ids_all[k], headers, 'audio-features', 50)

            scaled_features, ids = preprocess_the_data(more_features)

            track_message = get_mean_song(scaled_features, ids, headers)

            artist_popularities = get_artist_popularities(artist_ids_all[k], headers)

            artist_followers = get_artist_follower_count(artist_ids_all[k], headers, 50)

            album_popularities = get_popularities(album_ids_all[k], headers, 'albums', 20)

            all_raw_metrics = []
            for j in range(0, len(song_popularities)):
                jth_song_metric = sum(
                    [(100 - song_popularities[j]),  # invert the 0-100 popularity so less popular tracks score higher
                    (100 - np.mean(artist_popularities[j])),
                    (100 - np.mean(album_popularities[j])),
                    ((80 - (np.mean(artist_followers[j]) / 1000000)) * 1.25)]
                ) / 4
                # append all metric scores to a list for each playlist
                all_raw_metrics.append(jth_song_metric)
            
            medians.append(np.median(all_raw_metrics)) # get the median score for a given playlist

        playlist_name = get_playlist_name(playlist_id_list[i], headers)
        if not isinstance(playlist_name, str):
            print('\nYour playlist score could not be calculated. Be sure the playlist is public before trying again.')
        else:
            print(f'For the playlist {playlist_name}, this is your indieness score:\n{np.mean(medians)}\nIt is scaled from 0-100, with 100 being the most indie and 0 being the least indie.')
            print(f'\n{track_message}')

    return np.mean(medians)

main('2wIJNW0Zn7LK6mi2slwdAF', config.id_, config.secret)

For the playlist 4 aspen pt. 4, this is your indieness score:
49.220773593749996
It is scaled from 0-100, with 100 being the most indie and 0 being the least indie.

The song that best represents this playlist is Past Life by Tame Impala.


49.220773593749996

The output of the cell above shows the playlist's indieness score and the song that best represents it.
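For reference, the per-song arithmetic inside `main()` can be isolated as a small function. This is a sketch that mirrors the sum in the loop. `song_indieness` is a hypothetical name, it uses the standard library's `mean` and `median` in place of NumPy, and the inputs are made up.

```python
from statistics import mean, median


def song_indieness(song_pop, artist_pops, album_pop, artist_followers):
    """Average of four inverted-popularity terms, mirroring the sum in main()."""
    follower_term = (80 - mean(artist_followers) / 1_000_000) * 1.25
    total = (100 - song_pop) + (100 - mean(artist_pops)) + (100 - album_pop) + follower_term
    return total / 4


# Made-up inputs for two songs, for illustration only.
scores = [
    song_indieness(40, [50], 45, [2_000_000]),
    song_indieness(80, [90, 70], 85, [30_000_000, 10_000_000]),
]
print(scores)          # [65.625, 32.5]
print(median(scores))  # 49.0625
```

The follower term reaches 0 at an average of 80 million followers and turns negative beyond that, so a song's score can dip below what the other three terms alone would give.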