## **Song recommendation with k-means clustering**

The purpose of this project is to write a Python function that takes a song as user input and recommends a similar song based on its audio features. To do this, we use the Spotify API to save a playlist of around 5000 songs, together with their audio features (such as danceability and loudness), as a pandas DataFrame. We then fit the k-means algorithm to the DataFrame to group the songs into clusters.

Finally, we write a function that takes a song as user input and checks whether it is already in the playlist DataFrame. If it is, the function recommends a different song from the same cluster. If it is not, the function fetches the song's audio features from the Spotify API, uses the fitted k-means model to predict which cluster the song belongs to, and recommends another song from that cluster.

In [1]:
# import libraries

import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans


# set up spotify api with my credentials

sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id="",
                                                           client_secret=""))

In [2]:
# this function returns a pandas df based on a spotify playlist with all of the audio features for each song

def playlist_df(playlist_id):
    
    # Create empty dataframe
    playlist_features_list = ["artist", "album", "track_name",  "track_id", "danceability", "energy", "key", "loudness", "mode", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "duration_ms", "time_signature"]
    
    # named df so the local variable does not shadow the function name
    df = pd.DataFrame(columns = playlist_features_list)
    
    # Fetch the first page of tracks, then request the remaining pages 100 tracks at a time

    results = sp.user_playlist_tracks("spotify", playlist_id)
    tracks = results['items']

    for oset in range(100, results['total'], 100):
        results = sp.user_playlist_tracks("spotify", playlist_id, offset=oset)
        tracks += results['items']

    # Loop through every track in the playlist, extract features and append the features to the playlist df

    for track in tracks:
        # Create empty dict
        playlist_features = {}
        # Get metadata
        playlist_features["artist"] = track["track"]["album"]["artists"][0]["name"]
        playlist_features["album"] = track["track"]["album"]["name"]
        playlist_features["track_name"] = track["track"]["name"]
        playlist_features["track_id"] = track["track"]["id"]
        
        # Get audio features (skip the track if spotify returns none for it)
        audio_features = sp.audio_features(playlist_features["track_id"])[0]
        if audio_features is None:
            continue
        for feature in playlist_features_list[4:]:
            playlist_features[feature] = audio_features[feature]
            
        # Concat the dfs
        track_df = pd.DataFrame(playlist_features, index = [0])
        df = pd.concat([df, track_df], ignore_index = True)
        
    return df
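
The loop above requests audio features one track at a time, which means several thousand API calls for a playlist of this size. As a hedged sketch of an alternative: spotipy's `audio_features` also accepts a list of track IDs, and the Spotify endpoint is documented to take up to 100 IDs per request, so the IDs could be split into batches first. The helper name `chunked` and the ID strings below are hypothetical and only for illustration; the API call itself is shown in a comment because it needs real credentials.

```python
def chunked(items, size=100):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# hypothetical track ids standing in for the ids collected from the playlist
track_ids = [f"id_{n}" for n in range(250)]

batches = list(chunked(track_ids, size=100))
batch_sizes = [len(batch) for batch in batches]

# with a configured client, each batch could then be fetched in one request:
#     features = sp.audio_features(batch)
print(batch_sizes)
```

Each returned list may still contain `None` entries for tracks without audio features, so the same guard as in the loop would be needed.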

In [3]:
# apply function to a playlist with 5000 songs from spotify

spotify_playlist = playlist_df("4rnleEAOdmFAbRcNCgZMpY")

In [4]:
spotify_playlist

,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,Hozier,Hozier (Deluxe),Take Me To Church,7dS5EaCoMnN7DzlpT6aRn2,0.566,0.664,4,-5.303,0,0.0464,0.63400,0,0.116,0.437,128.945,241688,4
1,Mike Posner,31 Minutes to Takeoff,Cooler Than Me - Single Mix,2V4bv1fNWfTcyRJKmej6Sj,0.768,0.820,7,-4.630,0,0.0474,0.17900,0,0.689,0.625,129.965,213293,4
2,"Tyler, The Creator",Flower Boy,See You Again (feat. Kali Uchis),7KA4W4McWYRpgf0fWsJZWB,0.558,0.559,6,-9.222,1,0.0959,0.37100,0.000007,0.109,0.620,78.558,180387,4
3,Bastille,Bad Blood,Pompeii,3gbBpTdY8lnQwqxNCcf795,0.679,0.715,9,-6.383,1,0.0407,0.07550,0,0.271,0.571,127.435,214148,4
4,Shakira,"Oral Fixation, Vol. 2 (Expanded Edition)",Hips Don't Lie (feat. Wyclef Jean),3ZFTkvIE7kyPt6Nu3PEa7V,0.778,0.824,10,-5.892,0,0.0707,0.28400,0,0.405,0.758,100.024,218093,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5290,MARINA,The Family Jewels,Hermit the Frog,4Zcz6saEkOII3PlXd9gN3o,0.609,0.679,0,-4.545,1,0.0312,0.24300,0,0.199,0.487,122.034,215960,4
5291,Olivia Rodrigo,deja vu,deja vu,61KpQadow081I2AsbeLcsb,0.439,0.610,9,-7.236,1,0.1160,0.59300,0.000011,0.341,0.172,181.088,215508,4
5292,BIA,FOR CERTAIN,WHOLE LOTTA MONEY,5yorXJWdBan1Vlh116ZtQ7,0.897,0.371,1,-5.019,1,0.3680,0.09040,0,0.325,0.441,81.008,156005,4
5293,Ashnikko,DEMIDEVIL,Slumber Party (feat. Princess Nokia),11ZulcYY4lowvcQm4oe3VJ,0.964,0.398,11,-8.981,0,0.0795,0.00151,0.000039,0.101,0.563,105.012,178405,4


In [5]:
# check the data types in case we need to convert some features to numerics

spotify_playlist.dtypes

artist               object
album                object
track_name           object
track_id             object
danceability        float64
energy              float64
key                  object
loudness            float64
mode                 object
speechiness         float64
acousticness        float64
instrumentalness     object
liveness            float64
valence             float64
tempo               float64
duration_ms          object
time_signature       object
dtype: object

In [6]:
# convert all relevant columns to integers or floats

spotify_playlist_copy = spotify_playlist.convert_dtypes(convert_integer=True, convert_floating=True)

In [7]:
spotify_playlist_copy.dtypes

artist               string
album                string
track_name           string
track_id             string
danceability        Float64
energy              Float64
key                   Int64
loudness            Float64
mode                  Int64
speechiness         Float64
acousticness        Float64
instrumentalness    Float64
liveness            Float64
valence             Float64
tempo               Float64
duration_ms           Int64
time_signature        Int64
dtype: object
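
To make the conversion step easier to follow, here is a minimal sketch on a toy frame (the frame `toy` is made up for illustration, reusing a few values from the playlist). It assumes the same situation as above: numeric values that ended up in `object` columns. `convert_dtypes` should infer the nullable `Int64` and `Float64` types for them, matching the dtypes listed above.

```python
import pandas as pd

# toy frame whose numeric columns are stored as generic Python objects
toy = pd.DataFrame({
    "key": pd.Series([4, 7, 6], dtype="object"),
    "instrumentalness": pd.Series([0, 0.000007, 0], dtype="object"),
    "tempo": [128.945, 129.965, 78.558],
})

converted = toy.convert_dtypes(convert_integer=True, convert_floating=True)

# compare the dtypes before and after the conversion
print(toy.dtypes)
print(converted.dtypes)
```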

In [8]:
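# set the track_id to be the index so we have a way of seeing which songs belong to which cluster later
# I will not use the song name or artist in the final dataset on which the model will be fitted
# set_index returns a new dataframe, so the result is assigned back before it is displayed

spotify_playlist_copy = spotify_playlist_copy.set_index("track_id")
spotify_playlist_copy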

track_id,artist,album,track_name,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
7dS5EaCoMnN7DzlpT6aRn2,Hozier,Hozier (Deluxe),Take Me To Church,0.566,0.664,4,-5.303,0,0.0464,0.634,0.0,0.116,0.437,128.945,241688,4
2V4bv1fNWfTcyRJKmej6Sj,Mike Posner,31 Minutes to Takeoff,Cooler Than Me - Single Mix,0.768,0.82,7,-4.63,0,0.0474,0.179,0.0,0.689,0.625,129.965,213293,4
7KA4W4McWYRpgf0fWsJZWB,"Tyler, The Creator",Flower Boy,See You Again (feat. Kali Uchis),0.558,0.559,6,-9.222,1,0.0959,0.371,0.000007,0.109,0.62,78.558,180387,4
3gbBpTdY8lnQwqxNCcf795,Bastille,Bad Blood,Pompeii,0.679,0.715,9,-6.383,1,0.0407,0.0755,0.0,0.271,0.571,127.435,214148,4
3ZFTkvIE7kyPt6Nu3PEa7V,Shakira,"Oral Fixation, Vol. 2 (Expanded Edition)",Hips Don't Lie (feat. Wyclef Jean),0.778,0.824,10,-5.892,0,0.0707,0.284,0.0,0.405,0.758,100.024,218093,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4Zcz6saEkOII3PlXd9gN3o,MARINA,The Family Jewels,Hermit the Frog,0.609,0.679,0,-4.545,1,0.0312,0.243,0.0,0.199,0.487,122.034,215960,4
61KpQadow081I2AsbeLcsb,Olivia Rodrigo,deja vu,deja vu,0.439,0.61,9,-7.236,1,0.116,0.593,0.000011,0.341,0.172,181.088,215508,4
5yorXJWdBan1Vlh116ZtQ7,BIA,FOR CERTAIN,WHOLE LOTTA MONEY,0.897,0.371,1,-5.019,1,0.368,0.0904,0.0,0.325,0.441,81.008,156005,4
11ZulcYY4lowvcQm4oe3VJ,Ashnikko,DEMIDEVIL,Slumber Party (feat. Princess Nokia),0.964,0.398,11,-8.981,0,0.0795,0.00151,0.000039,0.101,0.563,105.012,178405,4


In [9]:
# select numeric data to be scaled

numeric = spotify_playlist_copy.select_dtypes(include=["Float64", "Int64"])

# save scaler to use later

scaler = StandardScaler()

# fit and transform the data

transformer = scaler.fit(numeric)

X = transformer.transform(numeric)
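
For reference, `StandardScaler` rescales each column to zero mean and unit variance, using the population standard deviation. The sketch below reproduces that calculation for a single column in plain Python; the helper name `standardize` is hypothetical, and the three tempo values are taken from the first rows of the playlist rather than from the full dataset, so the numbers will differ from the real scaled values.

```python
import statistics


def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(value - mean) / std for value in values]


# tempo values of the first three songs shown in the playlist table
tempos = [128.945, 129.965, 78.558]
scaled = standardize(tempos)

print([round(z, 3) for z in scaled])
```

Scaling matters here because k-means relies on Euclidean distance: without it, a column such as `duration_ms` (hundreds of thousands) would dominate columns such as `danceability` (between 0 and 1).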

In [10]:
# initialise kmeans model with 6 clusters

kmeans = KMeans(n_clusters=6, random_state=93)

# fit model to data

kmeans.fit(X)

KMeans(n_clusters=6, random_state=93)

In [11]:
# use model to form clusters

clusters = kmeans.predict(X)

# check counts for the clusters

pd.Series(clusters).value_counts().sort_index()

0    1071
1    1476
2     492
3     148
4     603
5    1505
dtype: int64
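
Conceptually, `kmeans.predict` assigns each song to the cluster whose centre is closest to it. The sketch below shows that rule with made-up two-dimensional centroids; `nearest_centroid`, the centroid coordinates and the song vector are all hypothetical, while the fitted model stores its real centres in `kmeans.cluster_centers_`.

```python
import math


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, centroid) for centroid in centroids]
    return distances.index(min(distances))


# hypothetical centroids in a 2-D space (scaled danceability, scaled energy)
centroids = [(-1.0, -1.0), (0.0, 1.0), (1.5, 0.0)]

# hypothetical scaled features of a new song
new_song = (1.2, 0.3)

label = nearest_centroid(new_song, centroids)
print(label)
```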

In [12]:
# add a new column to the dataframe which contains the cluster the given song belongs to
# we will use this later to recommend similar songs to the user

spotify_playlist['cluster'] = clusters

spotify_playlist

,artist,album,track_name,track_id,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,cluster
0,Hozier,Hozier (Deluxe),Take Me To Church,7dS5EaCoMnN7DzlpT6aRn2,0.566,0.664,4,-5.303,0,0.0464,0.63400,0,0.116,0.437,128.945,241688,4,2
1,Mike Posner,31 Minutes to Takeoff,Cooler Than Me - Single Mix,2V4bv1fNWfTcyRJKmej6Sj,0.768,0.820,7,-4.630,0,0.0474,0.17900,0,0.689,0.625,129.965,213293,4,5
2,"Tyler, The Creator",Flower Boy,See You Again (feat. Kali Uchis),7KA4W4McWYRpgf0fWsJZWB,0.558,0.559,6,-9.222,1,0.0959,0.37100,0.000007,0.109,0.620,78.558,180387,4,2
3,Bastille,Bad Blood,Pompeii,3gbBpTdY8lnQwqxNCcf795,0.679,0.715,9,-6.383,1,0.0407,0.07550,0,0.271,0.571,127.435,214148,4,1
4,Shakira,"Oral Fixation, Vol. 2 (Expanded Edition)",Hips Don't Lie (feat. Wyclef Jean),3ZFTkvIE7kyPt6Nu3PEa7V,0.778,0.824,10,-5.892,0,0.0707,0.28400,0,0.405,0.758,100.024,218093,4,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5290,MARINA,The Family Jewels,Hermit the Frog,4Zcz6saEkOII3PlXd9gN3o,0.609,0.679,0,-4.545,1,0.0312,0.24300,0,0.199,0.487,122.034,215960,4,1
5291,Olivia Rodrigo,deja vu,deja vu,61KpQadow081I2AsbeLcsb,0.439,0.610,9,-7.236,1,0.1160,0.59300,0.000011,0.341,0.172,181.088,215508,4,0
5292,BIA,FOR CERTAIN,WHOLE LOTTA MONEY,5yorXJWdBan1Vlh116ZtQ7,0.897,0.371,1,-5.019,1,0.3680,0.09040,0,0.325,0.441,81.008,156005,4,4
5293,Ashnikko,DEMIDEVIL,Slumber Party (feat. Princess Nokia),11ZulcYY4lowvcQm4oe3VJ,0.964,0.398,11,-8.981,0,0.0795,0.00151,0.000039,0.101,0.563,105.012,178405,4,4


In [13]:
# get the audio features from spotify based on the track_id
# this function is needed for the recommend_song function to work

def get_audio_features(track_id):
    
    audio_features = pd.DataFrame(sp.audio_features(track_id)).set_index("id")

    # convert the same columns that were converted for the playlist dataframe
    cols_to_convert = ["instrumentalness", "duration_ms", "time_signature", "key", "mode"]
    audio_features[cols_to_convert] = audio_features[cols_to_convert].convert_dtypes(convert_integer=True, convert_floating=True)
    
    return audio_features
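
One thing to keep in mind is that the scaler and the k-means model expect the features of a new song in the same order as the columns they were fitted on. The sketch below shows one way to make that order explicit instead of relying on the order of the API response. `FEATURE_ORDER` and `to_feature_vector` are hypothetical names, and the dictionary only imitates the shape of an audio-features response, using the values of the first playlist row.

```python
# order of the numeric columns the scaler and k-means model were fitted on
FEATURE_ORDER = [
    "danceability", "energy", "key", "loudness", "mode", "speechiness",
    "acousticness", "instrumentalness", "liveness", "valence", "tempo",
    "duration_ms", "time_signature",
]


def to_feature_vector(audio_features, feature_order=FEATURE_ORDER):
    """Pick the model features from a response-like dict, in training order."""
    missing = [name for name in feature_order if name not in audio_features]
    if missing:
        raise KeyError(f"missing audio features: {missing}")
    return [audio_features[name] for name in feature_order]


# response-like dict in arbitrary key order; non-feature keys such as "id" are ignored
response = {
    "id": "7dS5EaCoMnN7DzlpT6aRn2", "tempo": 128.945, "duration_ms": 241688,
    "time_signature": 4, "danceability": 0.566, "energy": 0.664, "key": 4,
    "loudness": -5.303, "mode": 0, "speechiness": 0.0464, "acousticness": 0.634,
    "instrumentalness": 0, "liveness": 0.116, "valence": 0.437,
}

vector = to_feature_vector(response)
print(vector)
```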

In [23]:
def recommend_song(df):

    # ask for user input

    song = input("Please choose a song: ")

    # if the given song is in the spotify playlist the function will return the cluster to which it belongs and another random song from the cluster

    if song in df["track_name"].values:
        cluster_label = df[df['track_name'] == song]['cluster'].values[0]

        df_cluster = df[df['cluster'] == cluster_label]
        df_cluster = df_cluster[df_cluster['track_name'] != song]

        return cluster_label, df_cluster.sample(1)['track_name'].values[0]

    # if the song is not in the original playlist - search spotify for it and return the audio features and song info as a df

    query = f'track:{song}'
    results = sp.search(q=query)
    try:
        track_id = results['tracks']['items'][0]['id']
        song_to_predict = get_audio_features(track_id)
    except (IndexError, KeyError):
        # the search returned no tracks, or no audio features were available for the track
        return "Please check the spelling of your song"

    # now the function will scale the df using the same scaler we used for the original playlist

    X = transformer.transform(song_to_predict.select_dtypes(include=["float64", "int64", "Float64", "Int64"]))

    # predict which cluster the song belongs to

    predicted_cluster = kmeans.predict(X)[0]

    # after the model predicts the cluster the function filters the dataframe for all the songs which belong to this cluster

    df_cluster2 = df[df['cluster'] == predicted_cluster]

    # finally it returns a random song from the df which belongs to this cluster

    return "Recommended track: {0}".format(df_cluster2.sample(1)['track_name'].values[0])

In [27]:
recommend_song(spotify_playlist)

'Recommended track: Jonkun Ex'
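
Because `recommend_song` reads from `input()` and samples at random, it is awkward to test. As a hedged sketch, the sampling step could be pulled out into a small helper that takes the cluster label as an argument and returns `None` when the cluster has no other songs. `recommend_from_cluster` is a hypothetical name, and the toy frame only reuses a few track names and cluster labels from the table above.

```python
import pandas as pd


def recommend_from_cluster(df, cluster_label, exclude=None, random_state=None):
    """Return a random track name from `cluster_label`, or None if nothing is left."""
    candidates = df[df["cluster"] == cluster_label]
    if exclude is not None:
        # do not recommend the song the user asked about
        candidates = candidates[candidates["track_name"] != exclude]
    if candidates.empty:
        return None
    return candidates.sample(1, random_state=random_state)["track_name"].iloc[0]


# small stand-in for the clustered playlist dataframe
toy = pd.DataFrame({
    "track_name": ["Take Me To Church", "See You Again (feat. Kali Uchis)",
                   "Pompeii", "Hermit the Frog"],
    "cluster": [2, 2, 1, 1],
})

print(recommend_from_cluster(toy, 2, exclude="Take Me To Church"))
```

Passing a fixed `random_state` makes the recommendation reproducible, which is handy when checking the function by hand.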