# Spotify Custom Playlist Generator

This Notebook uses the Spotify API to do the following:
    1. Return's a specified Spotify user's saved tracks from their music library
    2. Utilizes K-Medoids Clustering to identify clusters of similar types of music in the user's library
    3. Identifies external tracks that are similar to the tracks in each cluster
    4. Creates a playlist for each cluster
    5. Adds the playlists to the user's Spotify account

In [1]:
# Setting up the credentials to access the spotify API
import spotipy
import spotipy.util as util
import requests
import json
from spotipy.oauth2 import SpotifyClientCredentials
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 300)

# setting up credentials-- specify the spotify username you want to use and enter your client id and secret
username = 'username'
client_id = 'client_id'
client_secret = 'client_secret'

token = util.prompt_for_user_token(username= username,
                           scope = 'user-library-read playlist-modify-private',
                           client_id=client_id,
                           client_secret=client_secret,
                           redirect_uri='http://localhost:8888/callback')



            User authentication requires interaction with your
            web browser. Once you enter your credentials and
            give authorization, you will be redirected to
            a url.  Paste that url you were directed to to
            complete the authorization.

        
Opened https://accounts.spotify.com/authorize?client_id=41abc22336c94ca2b190816d8711ddd8&response_type=code&redirect_uri=http%3A%2F%2Flocalhost%3A8888%2Fcallback&scope=playlist-modify-private+user-library-read in your browser


Enter the URL you were redirected to: http://localhost:8888/callback?code=AQCe2fGBp-tADL_Nn444_yAQCPL-DrOvPmsCBIi6SQOaujJtlMpfFFBWKC5xjPM7xstuYsFdewtLKDDUNCwAWRew_NEQ6ipe28sZvf8lZkL9DY2KbXXbF1OF0IY5m0_KQYtbo8cJ-xo0EdafLH8g4O7JvWAXex1uJJ_RAkJ7j13VTJA_lUCr10D0O8RbouB-e8XnF_mtVwWfmizM-lb7DMDmpCR1ikP43BgK2ieIwW25qTZt_sDybDeLimIVje9u




## Loading User Data

In [2]:
artist_name = []
artist_id = []
track_name = []
popularity = []
track_id = []

# Finding the total number of tracks in the user's library
url = "https://api.spotify.com/v1/me/tracks"
headers = {'Authorization': "Bearer {}".format(token)}

request = requests.get(url, headers=headers)
parsed = json.loads(request.text)

total_songs = parsed['total']

# Looping through to return all tracks in the library
sp = spotipy.Spotify(auth=token)

for i in range(0,total_songs,50):
    track_results = sp.current_user_saved_tracks(limit=50, offset=i)
    for item in track_results['items']:
        track = item['track']
        artist_name.append(track['artists'][0]['name'])
        artist_id.append(track['artists'][0]['id'])
        track_name.append(track['name'])
        track_id.append(track['id'])
        
# Making a dataframe of the tracks
df_tracks = pd.DataFrame(columns=['artist','track','id'])
df_tracks['artist'] = artist_name
df_tracks['track'] = track_name
df_tracks['id'] = track_id
df_tracks['artist_id'] = artist_id

In [3]:
# returning the audio features for all of the tracks in the user's library

df = pd.DataFrame(
        columns=['acousticness', 'analysis_url', 'danceability', 'duration_ms', 'energy',
                'id', 'instrumentalness', 'key', 'liveness', 'loudness', 'mode',
                'speechiness', 'tempo', 'time_signature', 'track_href', 'type', 'uri',
                'valence'])

for tracks in track_id:
    track_feat = sp.audio_features(tracks)
    df = df.append(track_feat[0], ignore_index=True)

retrying ...1secs


In [4]:
# merging the features with the track information
df = df.merge(df_tracks, on='id', how='inner')

# dropping duplicate rows (tracks that may be on two different albumns in the user's library)
df = df.drop_duplicates(subset='id', keep="first")

In [5]:
df.shape

(274, 21)

# EDA

In [6]:
import pandas_profiling
df.profile_report()



Since the audio feature variables are skewed and contain outliers, we need to use a clustering approach that is not sensitive to outliers. Lets implement k-medoids

## Clustering the User Track Library with K Medoids

In [7]:
import pyclustering
from pyclustering.cluster.kmedoids import kmedoids
from pyclustering.utils.metric import distance_metric, type_metric
from sklearn.preprocessing import StandardScaler

np.random.seed(1)

# Creating a copy of the dataset
df_full = df

# Taking only the columns we want to cluster with 
features = df_full[['acousticness',  'danceability', 'energy',
       'instrumentalness', 'liveness', 'loudness',
       'speechiness', 'tempo','valence']]

# Standardizing the Features
scaler = StandardScaler()
scaler.fit(features)

X = scaler.transform(features)

### Identifying the Optimal Number of Clusters with the Silhouette Score

Using the Silhouette Score allows us to automate the process of selecting 'k'. The Silhouette score is a measure of how similar an object is to its own cluster compared to other clusters. The Silhouette score ranges from -1 to 1 with a higher score indicating the object is better matched to its own cluster. This code takes the average Silhouette score from each clustering iteration to determine the optimal number of clusters. This code tests up to 10 clusters.

In [8]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
np.random.seed(1)

# Choosing how many 'Ks' we want to test
range_n_clusters = [2, 3, 4, 5, 6, 7, 8, 9, 10]


# Setting the distance metric
metric = distance_metric(type_metric.EUCLIDEAN)

# initializing the final number of clusters variable
best_score = 0
best_k = 0

for n_clusters in range_n_clusters:
    initial_medoids = []
    for i in range(0,n_clusters): 
        initial_medoids.append(i+1)
    
    # Create instance of K-Medoids algorithm.
    kmedoids_instance = kmedoids(X, initial_medoids, metric = metric)
    
    # Run cluster analysis and obtain results.
    kmedoids_instance.process()
    clusters = kmedoids_instance.get_clusters()
    
    # Adding getting the output into the array we need for silhouette analysis
    clust = {}
    clust_output = []
    for k in range(0,len(clusters)):
        clust[k] = clusters[k]

    for k in range(0,len(X)):
        for j in range(0,len(clusters)):
            if k in clust[j]:
                clust_output.append(j)
    clust_output = np.asarray(clust_output)
    
    # Printing the Silhouette average for each cluster
    silhouette_avg = silhouette_score(X, clust_output, metric='euclidean')
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)
    
    # returning the number of clusters that results in the highest silhouette score
    if ((silhouette_avg > best_score) & (n_clusters > 2)):
        best_k = n_clusters
        best_score = silhouette_avg
        
print(' ')        
print('The optimal number of clusters is: ', best_k)

For n_clusters = 2 The average silhouette_score is : 0.15657621780840034
For n_clusters = 3 The average silhouette_score is : 0.07850896605403268
For n_clusters = 4 The average silhouette_score is : 0.10794322049857244
For n_clusters = 5 The average silhouette_score is : 0.08444904019677357
For n_clusters = 6 The average silhouette_score is : 0.10065261927806833
For n_clusters = 7 The average silhouette_score is : 0.10188526521246477
For n_clusters = 8 The average silhouette_score is : 0.1157886001706057
For n_clusters = 9 The average silhouette_score is : 0.08724401719204285
For n_clusters = 10 The average silhouette_score is : 0.09562751865324405
 
The optimal number of clusters is:  8


### Clustering with the Optimal Number of K

In [9]:
# Lets re-run the clustering algorithm with the optimal number of clusters
import pyclustering
from pyclustering.cluster.kmedoids import kmedoids
from pyclustering.utils.metric import distance_metric, type_metric
np.random.seed(1)

initial_medoids = []
for i in range(0,best_k): 
    initial_medoids.append(i+1)
    
# Setting the distance metric
metric = distance_metric(type_metric.EUCLIDEAN)

# Create instance of K-Medoids algorithm.
kmedoids = kmedoids(X, initial_medoids, metric = metric)

# Run cluster analysis and obtain results.
kmedoids.process()
clusters = kmedoids.get_clusters()

In [10]:
# Adding the clusters back to the dataframe
df_full['cluster'] = 0

clust = {}

for i in range(0,len(clusters)):
    clust[i] = clusters[i]

for i in range(0,len(df_full)):
    for j in range(0,len(clusters)):
        if i in clust[j]:
            df_full.iloc[i,21] = j
            
# Returning the value_counts of each cluster
df_full['cluster'].value_counts()

6    53
7    48
2    38
4    34
5    33
3    29
0    26
1    13
Name: cluster, dtype: int64

In [11]:
temp = pd.DataFrame(X, columns= ['acousticness',  'danceability', 'energy',
       'instrumentalness', 'liveness', 'loudness',
       'speechiness', 'tempo','valence'])

j = 0
medoids = kmedoids.get_medoids()
for i in kmedoids.get_medoids():
    medoids[j] = list(temp.iloc[i,:])
    j = j+1

temp['cluster'] = df_full['cluster']
temp['acousticness_med'] = 0
temp['danceability_med'] = 0
temp['energy_med'] = 0
temp['instrumentalness_med'] = 0
temp['liveness_med'] = 0
temp['loudness_med'] = 0
temp['speechiness_med'] = 0
temp['tempo_med'] = 0
temp['valence_med'] = 0

# Adding the medoid of the cluster each track is a part of as columns to the dataframe
for i in range(0,len(temp)):
    cluster = temp.iloc[i,9]
    temp.iloc[i,10] = medoids[cluster][0]
    temp.iloc[i,11] = medoids[cluster][1]
    temp.iloc[i,12] = medoids[cluster][2]
    temp.iloc[i,13] = medoids[cluster][3]
    temp.iloc[i,14] = medoids[cluster][4]
    temp.iloc[i,15] = medoids[cluster][5]
    temp.iloc[i,16] = medoids[cluster][6]
    temp.iloc[i,17] = medoids[cluster][7]
    temp.iloc[i,18] = medoids[cluster][8]

# Computing the Euclidean distance from the track to the medoid
temp['distance'] =  (((temp['acousticness']-temp['acousticness_med'])**2) +
                     ((temp['danceability']-temp['danceability_med'])**2) +
                     ((temp['energy']-temp['energy_med'])**2) +
                     ((temp['instrumentalness']-temp['instrumentalness_med'])**2) +
                     ((temp['liveness']-temp['liveness_med'])**2) +
                     ((temp['loudness']-temp['loudness_med'])**2) +
                     ((temp['speechiness']-temp['speechiness_med'])**2) +
                     ((temp['tempo']-temp['tempo_med'])**2) +
                     ((temp['valence']-temp['valence_med'])))**.5

# Adding the distance for each track to its medoid to the main table
df_full['distance'] = temp['distance']
df_full = df_full.sort_values(['cluster','distance'])

### Exploring the Clusters

In [12]:
# Printing the 3 most representative songs of each cluster
for i in range(0,len(clusters)):
    print(' ')
    print("The top 3 tracks for cluster {} are: ".format(i))
    print(df_full[df_full['cluster']==i][['artist','track']].head(3))

 
The top 3 tracks for cluster 0 are: 
                    artist                      track
10              Luke Combs                  Hurricane
113        KIDS SEE GHOSTS  Freeee (Ghost Town Pt. 2)
81   Red Hot Chili Peppers               Venice Queen
 
The top 3 tracks for cluster 1 are: 
             artist                                      track
204            Saba                                       MOST
2    Kendrick Lamar  Swimming Pools (Drank) - Extended Version
207            Saba                   The Billy Williams Story
 
The top 3 tracks for cluster 2 are: 
                    artist                    track
70   Red Hot Chili Peppers     Universally Speaking
268  Red Hot Chili Peppers        This Velvet Glove
161  Red Hot Chili Peppers  Runaway - 2006 Remaster
 
The top 3 tracks for cluster 3 are: 
            artist                                              track
12    Brad Paisley                                   She's Everything
117  Ariana Grande  Somewher

## Building the Playlists

### Finding New Artist Similar to those in the User's Library

The API has a function to return the artists that Spotify says are related to an input artist. Here we are getting all of the related artist to each artist in each cluster.

In [13]:
sp = spotipy.Spotify(auth=token)

df_new_artists = pd.DataFrame(columns=['artist','artist_id','popularity','cluster'])

for cluster in df_full['cluster'].unique():
    # Dropping duplicate artist_id's that are in the user's library
    cluster_artist_id = list(dict.fromkeys(list(df_full[df_full['cluster']==cluster]['artist_id'])))

    new_artist_name = []
    new_popularity = []
    new_artist_id = []
    results = []

    for id in cluster_artist_id:
        results = sp.artist_related_artists(id)
        for artist in results['artists']:
            new_artist_name.append(artist['name'])
            new_artist_id.append(artist['id'])
            new_popularity.append(artist['popularity'])

    # Making a dataframe of the tracks
    df_temp = pd.DataFrame(columns=['artist','artist_id','popularity','cluster'])
    df_temp['artist'] = new_artist_name
    df_temp['artist_id'] = new_artist_id
    df_temp['popularity'] = new_popularity
    df_temp['cluster'] = cluster

    # Dropping duplicates
    df_temp = df_temp.drop_duplicates()
    
    # Appending to the final dataframe
    df_new_artists = df_new_artists.append(df_temp)

retrying ...1secs


### Finding the Top 10 Songs for Each New Artist

The API has a function to return the top 10 tracks of any artist. Here we are getting the top 10 tracks of each new artist we found in the previous step

In [14]:
new_tracks = pd.DataFrame(columns=['id','artist_id','track name','popularity','cluster'])

for cluster in df_full['cluster'].unique():    
    new_artist_id = list(df_new_artists[df_new_artists['cluster']==cluster]['artist_id'])

    sp = spotipy.Spotify(auth=token)

    new_track_name = []
    new_track_id = []
    artist_list = []
    new_track_popularity = []
    
    for id in new_artist_id:
        results = sp.artist_top_tracks(id, country='US')
        for track in results['tracks']:
            new_track_name.append(track['name'])
            new_track_id.append(track['id'])
            new_track_popularity.append(track['popularity'])
            artist_list.append(id)
            

    # we will use the name and id columns later
    temp = pd.DataFrame()
    temp['id'] = new_track_id
    temp['artist_id'] = artist_list
    temp['track name'] = new_track_name
    temp['popularity'] = new_track_popularity
    temp['cluster'] = cluster
    # dropping duplicate rows (tracks that may be on two different albumns in the user's library)
    temp = temp.drop_duplicates(subset='id', keep="first")
    
    new_tracks = new_tracks.append(temp)

# dropping duplicate tracks
new_tracks = new_tracks.drop_duplicates()

retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...2secs
retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...1secs


In [15]:
# Removing tracks that are already in the User's Library
user_tracks = list(df['id'])
new_track_id = [x for x in new_track_id if x not in user_tracks]

# Filtering new tracks to include those not already in the User's Library
new_tracks = new_tracks[new_tracks['id'].isin(new_track_id)]

### Finding the Audio Features for Each New Track

This can take some time, particularly if the user has a large library of tracks

In [16]:
# Lets re-run the code for creating the token (sometimes it will expire before the playlist are created)
token = util.prompt_for_user_token(username= username,
                           scope = 'user-library-read playlist-modify-private',
                           client_id=client_id,
                           client_secret=client_secret,
                           redirect_uri='http://localhost:8888/callback')

# returning the features for all of the new tracks (this can take a while...)
sp = spotipy.Spotify(auth=token)

df_new = pd.DataFrame(
        columns=['acousticness', 'analysis_url', 'danceability', 'duration_ms', 'energy',
                'id', 'instrumentalness', 'key', 'liveness', 'loudness', 'mode',
                'speechiness', 'tempo', 'time_signature', 'track_href', 'type', 'uri',
                'valence'])

for tracks in new_track_id:
    track_feat = sp.audio_features(tracks)
    df_new = df_new.append(track_feat[0], ignore_index=True)

new_tracks = new_tracks.merge(df_new, how='inner', on='id')

retrying ...1secs
retrying ...1secs
retrying ...1secs
retrying ...2secs
retrying ...1secs
retrying ...1secs
retrying ...1secs


### Finding the Distance from Each New Track to it's Medoid

In [17]:
# Taking the columns we used for clustering
df_full_new = new_tracks

new_features = df_full_new[['acousticness',  'danceability', 'energy',
       'instrumentalness', 'liveness', 'loudness',
       'speechiness', 'tempo','valence']]

# Scaling so we can compute distance
X_new = scaler.transform(new_features) # here we are using same scaler that we used to scale the User Library

In [18]:
# Computing the distance from each new song to the medoid of the cluster it is associated with

X_new = pd.DataFrame(X_new, columns= ['acousticness',  'danceability', 'energy',
       'instrumentalness', 'liveness', 'loudness',
       'speechiness', 'tempo','valence'])

X_new['cluster'] = df_full_new['cluster']
X_new['acousticness_med'] = 0
X_new['danceability_med'] = 0
X_new['energy_med'] = 0
X_new['instrumentalness_med'] = 0
X_new['liveness_med'] = 0
X_new['loudness_med'] = 0
X_new['speechiness_med'] = 0
X_new['tempo_med'] = 0
X_new['valence_med'] = 0

# Adding the medoid of the cluster each track is a part of as columns to the dataframe
for i in range(0,len(X_new)):
    cluster = X_new.iloc[i,9]
    X_new.iloc[i,10] = medoids[cluster][0]
    X_new.iloc[i,11] = medoids[cluster][1]
    X_new.iloc[i,12] = medoids[cluster][2]
    X_new.iloc[i,13] = medoids[cluster][3]
    X_new.iloc[i,14] = medoids[cluster][4]
    X_new.iloc[i,15] = medoids[cluster][5]
    X_new.iloc[i,16] = medoids[cluster][6]
    X_new.iloc[i,17] = medoids[cluster][7]
    X_new.iloc[i,18] = medoids[cluster][8]

# Computing the Euclidean distance from the track to the centroid
X_new['distance'] = (((X_new['acousticness']-X_new['acousticness_med'])**2) +
                     ((X_new['danceability']-X_new['danceability_med'])**2) +
                     ((X_new['energy']-X_new['energy_med'])**2) +
                     ((X_new['instrumentalness']-X_new['instrumentalness_med'])**2) +
                     ((X_new['liveness']-X_new['liveness_med'])**2) +
                     ((X_new['loudness']-X_new['loudness_med'])**2) +
                     ((X_new['speechiness']-X_new['speechiness_med'])**2) +
                     ((X_new['tempo']-X_new['tempo_med'])**2) +
                     ((X_new['valence']-X_new['valence_med'])))**.5

# Adding the distance for each track to its centroid to the main table
df_full_new['distance'] = X_new['distance']
df_full_new = df_full_new.sort_values(['cluster','distance'])

In [19]:
# Adding the Artist Name to the dataframe
df_new_artists = df_new_artists[['artist_id','artist']]
df_new_artists = df_new_artists.drop_duplicates()
df_last = df_full_new.merge(df_new_artists, how='left', on='artist_id')

### Playlist Filters

Here you can filter the final playlist using a distance measurement and a popularity score. The distance is measured in Euclidean distance, with a smaller distance indicating more similarity to the cluster medoid. Popularity is rated by Spotify on a scale from 0 - 100 with a higher score indicating higher popularity

In [20]:
# Filtering for distance from the medoid -- This adds only the most similar tracks to the playlist
distance = 1.75

filtered = df_last[df_last['distance'] < distance]

In [21]:
# Filtering for popularity -- This adds tracks that reach a defined popularity threshold
popularity = 65

final = filtered[filtered['popularity'] >= popularity]

#### The Final Number of Tracks to be Added to Each Playlist

In [22]:
print(final['cluster'].value_counts())

7    153
0     43
5     33
6     27
2     20
4      6
3      4
1      1
Name: cluster, dtype: int64


In [23]:
# Viewing the tracks on the final playlists

final[['track name','artist','cluster']].sort_values('cluster')

Unnamed: 0,track name,artist,cluster
0,Nobody But You (Duet with Gwen Stefani),Blake Shelton,0
79,Walk,Foo Fighters,0
89,Ridin’ Roads,Dustin Lynch,0
94,Learn to Fly,Foo Fighters,0
96,Movement,Hozier,0
100,Peer Pressure,James Bay,0
109,Wherever You Are,Kodaline,0
112,Come As You Are,Nirvana,0
117,Human,The Killers,0
77,Somewhere On A Beach,Dierks Bentley,0


## Adding the Playlist to the User's Account

In [24]:
token = util.prompt_for_user_token(username= username,
                           scope = 'user-library-read playlist-modify-private',
                           client_id=client_id,
                           client_secret=client_secret,
                           redirect_uri='http://localhost:8888/callback')

user_id = sp.me()['uri'][13:]

In [25]:
sp = spotipy.Spotify(auth=token)

for i in final['cluster'].unique():
    playlist = sp.user_playlist_create(user_id,'cluster {} Playlist'.format(i) , public=False)
    for j in range(0,len(final[final['cluster']==i]), 5):
        sp.user_playlist_add_tracks(user_id, playlist['id'], np.asarray(final[final['cluster']==i].iloc[j:j+5,0]))

#### Code to Clear the Token 

In [26]:
# This is code to clear the token if you want to run this code for different user accounts 
# (You may also need to clear your browser cookies and cache)

import os
username = 'username'
os.remove(f".cache-{username}")

FileNotFoundError: [WinError 2] The system cannot find the file specified: '.cache-username'