# Spotify Recommendation Algorithm 

The purpose of this project is to design a music recommendation algorithm that matches my taste more closely. For a detailed write-up, please see this doc: https://docs.google.com/document/d/1XB79YVOwlACy4Q7-vdG1uTmbOqfpizOogoq2l9OrFM0/edit?usp=sharing

Thanks for checking out this project!

### Create datasets from Spotify playlists utilizing the Spotipy API

In [1]:
import spotipy
import pandas as pd
import numpy as np

In [2]:
# spotipy connectivity setup
# credentials are read from environment variables so that secrets stay out of the notebook
import os

CLIENT_ID = os.environ['SPOTIPY_CLIENT_ID']
CLIENT_SECRET = os.environ['SPOTIPY_CLIENT_SECRET']
client_credentials_manager = spotipy.oauth2.SpotifyClientCredentials(client_id=CLIENT_ID, client_secret=CLIENT_SECRET)
sp = spotipy.Spotify(client_credentials_manager = client_credentials_manager)

In [3]:
# stores Spotify playlist data and converts it into a df
# the code was adapted from this article:
# https://towardsdatascience.com/how-to-create-large-music-datasets-using-spotipy-40e7242cc6a6
def analyze_playlist(creator, playlist_id):
    
    # Create empty dataframe
    playlist_features_list = ["artist","album","track_name",  "track_id","danceability","energy","key","loudness","mode", "speechiness","instrumentalness","liveness","valence","tempo", "duration_ms","time_signature"]
    
    playlist_df = pd.DataFrame(columns = playlist_features_list)
    
    # Loop through every track in the playlist, extract its features, and append them to the playlist df
    # (take out the 'track' level dict if it's my own playlist)
    playlist = sp.user_playlist_tracks(creator, playlist_id)['tracks']["items"]
    for track in playlist:
        # Create empty dict
        playlist_features = {}
        # Get metadata
        playlist_features["artist"] = track["track"]["album"]["artists"][0]["name"]
        playlist_features["album"] = track["track"]["album"]["name"]
        playlist_features["track_name"] = track["track"]["name"]
        playlist_features["track_id"] = track["track"]["id"]
        
        # Get audio features
        audio_features = sp.audio_features(playlist_features["track_id"])[0]
        for feature in playlist_features_list[4:]:
            playlist_features[feature] = audio_features[feature]
        
        # Concat the dfs
        track_df = pd.DataFrame(playlist_features, index = [0])
        playlist_df = pd.concat([playlist_df, track_df], ignore_index = True)
        
    return playlist_df
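
The function above requests audio features one track at a time, which means one API call per song. As a hedged sketch of an alternative: the Spotify audio-features endpoint accepts several track IDs per request (up to 100, per Spotify's documentation), so the IDs can be grouped into batches first. The helper name `chunked` and the placeholder IDs are made up for illustration; the commented-out lines show how the batches might be passed to `sp.audio_features`, assuming the `sp` client defined earlier.

```python
def chunked(items, size=100):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# hypothetical track IDs, just to show the batching
track_ids = [f"track_{i}" for i in range(233)]
batches = list(chunked(track_ids))
print([len(b) for b in batches])  # [100, 100, 33]

# With a live client, each batch could then be fetched in a single call:
# features = []
# for batch in batches:
#     features.extend(sp.audio_features(batch))
```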

In [195]:
# this df includes songs in my current playlists and songs that I dislike
df = pd.read_csv('track data.csv')

### Data Wrangling &  Exploration

In [196]:
df.isnull().sum()

artist              0
album               0
track_name          0
track_id            0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
instrumentalness    0
liveness            0
valence             0
tempo               0
duration_ms         0
time_signature      0
label               0
dtype: int64

In [197]:
df.describe()

|  | danceability | energy | key | loudness | mode | speechiness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 233.0 | 233.0 | 233.0 | 233.0 | 233.0 | 233.0 | 233.0 | 233.0 | 233.0 | 233.0 | 233.0 | 233.0 | 233.0 |
| mean | 0.620039 | 0.541988 | 5.038627 | -8.583193 | 0.669528 | 0.067315 | 0.113319 | 0.16555 | 0.470689 | 118.453966 | 218081.008584 | 3.905579 | 0.553648 |
| std | 0.142777 | 0.2139 | 3.565068 | 3.251491 | 0.471396 | 0.06793 | 0.23795 | 0.128468 | 0.238778 | 29.910018 | 54586.865616 | 0.334269 | 0.498184 |
| min | 0.192 | 0.0529 | 0.0 | -23.465 | 0.0 | 0.0232 | 0.0 | 0.0264 | 0.0517 | 60.496 | 98544.0 | 1.0 | 0.0 |
| 25% | 0.532 | 0.381 | 2.0 | -10.381 | 0.0 | 0.0315 | 3e-06 | 0.0944 | 0.262 | 93.842 | 181893.0 | 4.0 | 0.0 |
| 50% | 0.624 | 0.537 | 5.0 | -8.216 | 1.0 | 0.0393 | 0.00089 | 0.112 | 0.472 | 117.401 | 209999.0 | 4.0 | 1.0 |
| 75% | 0.739 | 0.717 | 8.0 | -6.118 | 1.0 | 0.0688 | 0.0471 | 0.192 | 0.633 | 139.98 | 245426.0 | 4.0 | 1.0 |
| max | 0.887 | 0.968 | 11.0 | -2.724 | 1.0 | 0.44 | 0.938 | 0.954 | 0.967 | 203.908 | 515865.0 | 4.0 | 1.0 |
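
The mean of the label column gives the class balance, which is a useful yardstick for the accuracy scores reported later. The arithmetic below is a small sketch that uses only the count and mean from the describe() output; the variable names are mine.

```python
n_rows = 233            # count from df.describe()
label_mean = 0.553648   # mean of the binary label column

# number of rows with label == 1 and label == 0
n_ones = round(n_rows * label_mean)
n_zeros = n_rows - n_ones

# accuracy of always predicting the more common class
baseline = max(n_ones, n_zeros) / n_rows
print(n_ones, n_zeros, round(baseline, 4))  # 129 104 0.5536
```

A classifier that scores close to 0.55 on this data is doing little better than always guessing the majority label.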


## Model Training

### SVM

The linear kernel worked best, giving the highest score (mean accuracy on the training data). I also dropped the loudness and duration_ms columns to increase the accuracy.

In [73]:
from sklearn import svm
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [200]:
# features are every column from danceability through time_signature; label is the target
x = df.loc[:,'danceability':'time_signature']
y = df['label']

In [201]:
svm_model = svm.SVC(kernel = 'linear').fit(x, y)
svm_model.score(x, y)

0.575107296137339

In [109]:
# inspect the fitted coefficients (one per feature column)
svm_model.coef_

array([[-2.63222673e+00, -9.65268823e+00,  5.78623479e+01,
        -1.48507967e+02,  6.00000000e+00, -8.60379321e-01,
         2.26166693e+00,  1.07576171e-01, -5.47104210e+00,
        -1.03275776e+01,  7.48187154e-02, -5.00000000e+00]])
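
Raw coefficients of a linear model are hard to compare when the features sit on very different scales. One rough way to read them, sketched here, is to multiply each coefficient by its feature's standard deviation from the describe() output. This assumes the coefficients follow the same column order as the danceability-to-time_signature slice, which the notebook does not state explicitly.

```python
features = ["danceability", "energy", "key", "loudness", "mode", "speechiness",
            "instrumentalness", "liveness", "valence", "tempo", "duration_ms",
            "time_signature"]

# coefficients copied from svm_model.coef_ above
coefs = [-2.63222673e+00, -9.65268823e+00, 5.78623479e+01, -1.48507967e+02,
         6.00000000e+00, -8.60379321e-01, 2.26166693e+00, 1.07576171e-01,
         -5.47104210e+00, -1.03275776e+01, 7.48187154e-02, -5.00000000e+00]

# standard deviations copied from the std row of df.describe()
stds = [0.142777, 0.2139, 3.565068, 3.251491, 0.471396, 0.06793,
        0.23795, 0.128468, 0.238778, 29.910018, 54586.865616, 0.334269]

# change in the decision value for a one-standard-deviation move in each feature
effects = {name: abs(c) * s for name, c, s in zip(features, coefs, stds)}
ranked = sorted(effects.items(), key=lambda item: item[1], reverse=True)
print([name for name, _ in ranked[:3]])  # ['duration_ms', 'loudness', 'tempo']
```

On this reading duration_ms, despite its tiny raw coefficient, moves the decision value the most, which may be part of why removing it changes the score so much.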

In [204]:
# drop the features judged irrelevant, then refit
x = x.drop(columns = ['loudness','duration_ms'])
svm_model = svm.SVC(kernel = 'linear').fit(x, y)
svm_model.score(x, y)

0.7553648068669528
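
Dropping columns is one way to deal with features on very different scales; standardizing them is another. The sketch below is not what this notebook does. It fits a linear SVM inside a scikit-learn pipeline with `StandardScaler`, on synthetic stand-in data that I made up (one informative small-scale feature and one uninformative feature on a duration_ms-like scale).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# synthetic stand-in data, not the playlist data
signal = rng.normal(size=200)                           # informative, small scale
noise = rng.normal(loc=218000, scale=54000, size=200)   # uninformative, large scale
X = np.column_stack([signal, noise])
y = (signal > 0).astype(int)

# standardize each feature to zero mean and unit variance before the SVM sees it
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel='linear'))
scaled_svm.fit(X, y)
score = scaled_svm.score(X, y)
print(score)
```

With the real data, the same pipeline could be fitted on x and y in place of the bare `svm.SVC`, keeping every column.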

### K Nearest Neighbors

In [114]:
from sklearn.neighbors import KNeighborsRegressor

In [211]:
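# also tried n_neighbors =  2, 3, & 4
# note: KNeighborsRegressor.score returns R^2, not accuracy, so this number is not directly comparable to the classifier scores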
knn_model = KNeighborsRegressor(n_neighbors=5).fit(x,y)
knn_model.score(x,y)

0.18651460942158604
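
Because a regressor's score is the coefficient of determination (R²) rather than accuracy, the KNN number above cannot be read the same way as the other models' scores. The toy example below, with made-up labels and predictions, shows how far apart the two metrics can be on the same output.

```python
# made-up binary labels and regressor outputs (e.g. the mean label of the neighbors)
y_true = [1, 0, 1, 1, 0]
y_pred = [0.8, 0.2, 0.6, 0.4, 0.4]

# R^2 = 1 - (residual sum of squares / total sum of squares)
mean_y = sum(y_true) / len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

# accuracy after thresholding the regressor output at 0.5
labels = [1 if p >= 0.5 else 0 for p in y_pred]
accuracy = sum(t == l for t, l in zip(y_true, labels)) / len(y_true)

print(round(r2, 4), accuracy)  # 0.3667 0.8
```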

### Logistic Regression

In [205]:
from sklearn.linear_model import LogisticRegression

In [206]:
# tried solver = 'liblinear'
log_model = LogisticRegression(max_iter = 200, solver = 'lbfgs').fit(x,y)
log_model.score(x,y)

0.7510729613733905

### Neural Network

In [148]:
from sklearn.neural_network import MLPClassifier

In [237]:
# continued using lbfgs 
mlp = MLPClassifier(hidden_layer_sizes=(10,10),max_iter=500, solver='lbfgs', random_state = 1).fit(x,y)
mlp.score(x,y)

0.7939914163090128
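
All of the scores in this section come from calling `.score(x, y)` on the same rows the model was fitted on, so they are training accuracies. A hedged sketch of a held-out estimate using 5-fold cross-validation follows; it runs on synthetic data from `make_classification` (sized like the 233-row, 10-feature set) rather than the playlist data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the training set: 233 rows, 10 features
X, y = make_classification(n_samples=233, n_features=10, random_state=0)

model = LogisticRegression(max_iter=200, solver='lbfgs')

# each of the 5 folds is scored on rows the model did not see while fitting
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

The same call should work with any of the estimators above by swapping `model` and passing the real x and y.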

### Model Results
Here I output the results to CSV files for use in Excel, then manually classified the songs. Again, for the detailed write-up, please visit this page:
https://docs.google.com/document/d/1XB79YVOwlACy4Q7-vdG1uTmbOqfpizOogoq2l9OrFM0/edit?usp=sharing

Note that I did not export the results for "Discover Weekly" since the playlist only contains 30 songs and the classification results can be observed fairly easily (please refer to the dataframe in each model).
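
Once the songs have been labeled by hand, the comparison against the pred column can also be tallied in code rather than in a spreadsheet. The rows below are hypothetical (prediction, manual label) pairs, assuming 1 means like and 0 means dislike; they are not results from this project.

```python
from collections import Counter

# hypothetical rows: (model prediction, manual label), 1 = like, 0 = dislike
rows = [(1, 1), (1, 0), (0, 0), (1, 1), (0, 1), (1, 1)]

counts = Counter(rows)
true_pos = counts[(1, 1)]    # predicted like, actually liked
false_pos = counts[(1, 0)]   # predicted like, actually disliked
correct = counts[(1, 1)] + counts[(0, 0)]

accuracy = correct / len(rows)
precision = true_pos / (true_pos + false_pos)
print(round(accuracy, 3), precision)  # 0.667 0.75
```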

In [250]:
# returns a copy of the dataset with the model's predictions in a new 'pred' column
def pred_results(df, model):
    x = df.loc[:,'danceability':'time_signature']
    x = x.drop(columns = ['loudness','duration_ms'])
    # copy so that results from different models don't overwrite each other
    results = df.copy()
    results['pred'] = model.predict(x)
    return results

### SVM Results

Please note that Spotify playlist tracks get updated from time to time. Therefore, the metrics may be inconsistent by the time you run this code.

In [None]:
# import "Discover Weekly" playlist
df_discover = analyze_playlist("spotify","37i9dQZEVXcDjNKMX0xagW?si=b0f662cb410f4082")
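
The playlist IDs passed to analyze_playlist still carry the `?si=...` query string that Spotify appends to share links. Whether spotipy tolerates that suffix may depend on its version, so a small helper (the name `clean_playlist_id` is hypothetical) could strip it before the call:

```python
def clean_playlist_id(raw_id):
    """Drop anything after '?', such as the si=... share parameter."""
    return raw_id.split("?", 1)[0]

print(clean_playlist_id("37i9dQZEVXcDjNKMX0xagW?si=b0f662cb410f4082"))
# 37i9dQZEVXcDjNKMX0xagW
```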

In [244]:
# Get the results from the SVM model
# three misclassifications: Cloud9 (dislike), Silence (dislike), and What a Beautiful Night (like)
discover_weekly_svm = pred_results(df_discover, svm_model)
discover_weekly_svm

|  | artist | album | track_name | track_id | danceability | energy | key | loudness | mode | speechiness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature | pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 柯智棠 | The Joy Of Sorrow | The Joy Of Sorrow | 78oPAwgoMFL9QxnueTuNwy | 0.462 | 0.12 | 5 | -16.798 | 1 | 0.0396 | 1.9e-05 | 0.143 | 0.232 | 73.228 | 207435 | 4 | 1 |
| 1 | Tomoyo Harada | Candle Lights | ラヴ・ミー・テンダー - Haruomi Hosono Rework | 1U14MlCuOIk5XDhvJD8Ct0 | 0.456 | 0.168 | 9 | -13.825 | 1 | 0.0294 | 0.00135 | 0.106 | 0.066 | 137.711 | 210453 | 4 | 1 |
| 2 | Tokyo Incidents | 大人 | 黄昏泣き | 17OAQmU919MnDrmb8nDv7U | 0.319 | 0.385 | 2 | -11.213 | 0 | 0.0332 | 0.923 | 0.106 | 0.239 | 73.611 | 208267 | 4 | 1 |
| 3 | HONNE | nswy: dream edits | free love - dream edit | 3HAsf0o0TJY9WL4zKCzE3u | 0.67 | 0.22 | 8 | -10.214 | 1 | 0.0422 | 7.6e-05 | 0.111 | 0.139 | 119.604 | 249945 | 4 | 1 |
| 4 | Faye Wong | 王菲97 | 你快樂所以我快樂 | 31kKBUeeCGuMt9fPxDybde | 0.676 | 0.307 | 8 | -12.919 | 0 | 0.0279 | 0.00284 | 0.123 | 0.163 | 110.857 | 258427 | 4 | 1 |
| 5 | Hanare Gumi | 帰ってから、歌いたくなってもいいようにと思ったのだ。 | ねむるのまち~Tidur Tidur~ | 2CXT8P6nv9BfKMn4xSMU8V | 0.51 | 0.0328 | 7 | -22.611 | 1 | 0.039 | 0.083 | 0.205 | 0.428 | 131.64 | 310520 | 3 | 1 |
| 6 | Tanya Chua | 若你碰到他 | 若你碰到他 | 74vamOEacSToHKKvY076mq | 0.372 | 0.211 | 11 | -9.255 | 1 | 0.0294 | 0.0 | 0.0921 | 0.225 | 137.489 | 333680 | 3 | 1 |
| 7 | 中村佳穂 | AINOU | 忘れっぽい天使 | 3vVOzWA6SNSGXNYeg4LnYN | 0.443 | 0.065 | 0 | -14.52 | 1 | 0.0601 | 0.0 | 0.111 | 0.287 | 71.77 | 235506 | 4 | 1 |
| 8 | NELL | Let’s Part | Time Walking on Memories | 6XkrfYmgPGSvgufoivTQgj | 0.399 | 0.331 | 10 | -9.221 | 1 | 0.0478 | 1e-06 | 0.11 | 0.193 | 171.208 | 389853 | 4 | 1 |
| 9 | Anni Hung | 洪安妮 我喜歡你 | 我喜歡你 | 08cKcaqRuwzfEYZac1RKAF | 0.407 | 0.274 | 1 | -12.247 | 1 | 0.0392 | 0.000358 | 0.119 | 0.235 | 134.577 | 246924 | 4 | 1 |


In [212]:
# Global Top 50 playlist
df_global = analyze_playlist("spotify","37i9dQZEVXbMDoHDwVN2tF?si=5ec773049fe544c9")

In [259]:
# export the SVM results to csv
glob_svm = pred_results(df_global, svm_model)
glob_svm.to_csv('Global Results.csv', index_label = None)

In [222]:
# Ultimate Indie playlist
df_indie = analyze_playlist("spotify","37i9dQZF1DX2Nc3B70tvx0?si=016b1ff54ad04b4e")
indie_svm = pred_results(df_indie, svm_model)
indie_svm.to_csv('Indie Results.csv', index_label = None)

### Logistic Regression Results

In [253]:
discover_weekly_log = pred_results(df_discover, log_model)
discover_weekly_log

|  | artist | album | track_name | track_id | danceability | energy | key | loudness | mode | speechiness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature | pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 柯智棠 | The Joy Of Sorrow | The Joy Of Sorrow | 78oPAwgoMFL9QxnueTuNwy | 0.462 | 0.12 | 5 | -16.798 | 1 | 0.0396 | 1.9e-05 | 0.143 | 0.232 | 73.228 | 207435 | 4 | 1 |
| 1 | Tomoyo Harada | Candle Lights | ラヴ・ミー・テンダー - Haruomi Hosono Rework | 1U14MlCuOIk5XDhvJD8Ct0 | 0.456 | 0.168 | 9 | -13.825 | 1 | 0.0294 | 0.00135 | 0.106 | 0.066 | 137.711 | 210453 | 4 | 1 |
| 2 | Tokyo Incidents | 大人 | 黄昏泣き | 17OAQmU919MnDrmb8nDv7U | 0.319 | 0.385 | 2 | -11.213 | 0 | 0.0332 | 0.923 | 0.106 | 0.239 | 73.611 | 208267 | 4 | 1 |
| 3 | HONNE | nswy: dream edits | free love - dream edit | 3HAsf0o0TJY9WL4zKCzE3u | 0.67 | 0.22 | 8 | -10.214 | 1 | 0.0422 | 7.6e-05 | 0.111 | 0.139 | 119.604 | 249945 | 4 | 1 |
| 4 | Faye Wong | 王菲97 | 你快樂所以我快樂 | 31kKBUeeCGuMt9fPxDybde | 0.676 | 0.307 | 8 | -12.919 | 0 | 0.0279 | 0.00284 | 0.123 | 0.163 | 110.857 | 258427 | 4 | 1 |
| 5 | Hanare Gumi | 帰ってから、歌いたくなってもいいようにと思ったのだ。 | ねむるのまち~Tidur Tidur~ | 2CXT8P6nv9BfKMn4xSMU8V | 0.51 | 0.0328 | 7 | -22.611 | 1 | 0.039 | 0.083 | 0.205 | 0.428 | 131.64 | 310520 | 3 | 1 |
| 6 | Tanya Chua | 若你碰到他 | 若你碰到他 | 74vamOEacSToHKKvY076mq | 0.372 | 0.211 | 11 | -9.255 | 1 | 0.0294 | 0.0 | 0.0921 | 0.225 | 137.489 | 333680 | 3 | 1 |
| 7 | 中村佳穂 | AINOU | 忘れっぽい天使 | 3vVOzWA6SNSGXNYeg4LnYN | 0.443 | 0.065 | 0 | -14.52 | 1 | 0.0601 | 0.0 | 0.111 | 0.287 | 71.77 | 235506 | 4 | 1 |
| 8 | NELL | Let’s Part | Time Walking on Memories | 6XkrfYmgPGSvgufoivTQgj | 0.399 | 0.331 | 10 | -9.221 | 1 | 0.0478 | 1e-06 | 0.11 | 0.193 | 171.208 | 389853 | 4 | 1 |
| 9 | Anni Hung | 洪安妮 我喜歡你 | 我喜歡你 | 08cKcaqRuwzfEYZac1RKAF | 0.407 | 0.274 | 1 | -12.247 | 1 | 0.0392 | 0.000358 | 0.119 | 0.235 | 134.577 | 246924 | 4 | 1 |


In [254]:
# output top 50 global playlist to Excel for further calculation  
glob_log = pred_results(df_global, log_model)
glob_log.to_csv('Global Log.csv', index = False)

In [256]:
# output ultimate indie to Excel
indie_log = pred_results(df_indie, log_model)
indie_log.to_csv('Indie Log.csv', index_label = None)

### Neural Network Results

In [252]:
discover_weekly_nn = pred_results(df_discover, mlp)
discover_weekly_nn

|  | artist | album | track_name | track_id | danceability | energy | key | loudness | mode | speechiness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature | pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 柯智棠 | The Joy Of Sorrow | The Joy Of Sorrow | 78oPAwgoMFL9QxnueTuNwy | 0.462 | 0.12 | 5 | -16.798 | 1 | 0.0396 | 1.9e-05 | 0.143 | 0.232 | 73.228 | 207435 | 4 | 1 |
| 1 | Tomoyo Harada | Candle Lights | ラヴ・ミー・テンダー - Haruomi Hosono Rework | 1U14MlCuOIk5XDhvJD8Ct0 | 0.456 | 0.168 | 9 | -13.825 | 1 | 0.0294 | 0.00135 | 0.106 | 0.066 | 137.711 | 210453 | 4 | 1 |
| 2 | Tokyo Incidents | 大人 | 黄昏泣き | 17OAQmU919MnDrmb8nDv7U | 0.319 | 0.385 | 2 | -11.213 | 0 | 0.0332 | 0.923 | 0.106 | 0.239 | 73.611 | 208267 | 4 | 1 |
| 3 | HONNE | nswy: dream edits | free love - dream edit | 3HAsf0o0TJY9WL4zKCzE3u | 0.67 | 0.22 | 8 | -10.214 | 1 | 0.0422 | 7.6e-05 | 0.111 | 0.139 | 119.604 | 249945 | 4 | 1 |
| 4 | Faye Wong | 王菲97 | 你快樂所以我快樂 | 31kKBUeeCGuMt9fPxDybde | 0.676 | 0.307 | 8 | -12.919 | 0 | 0.0279 | 0.00284 | 0.123 | 0.163 | 110.857 | 258427 | 4 | 1 |
| 5 | Hanare Gumi | 帰ってから、歌いたくなってもいいようにと思ったのだ。 | ねむるのまち~Tidur Tidur~ | 2CXT8P6nv9BfKMn4xSMU8V | 0.51 | 0.0328 | 7 | -22.611 | 1 | 0.039 | 0.083 | 0.205 | 0.428 | 131.64 | 310520 | 3 | 1 |
| 6 | Tanya Chua | 若你碰到他 | 若你碰到他 | 74vamOEacSToHKKvY076mq | 0.372 | 0.211 | 11 | -9.255 | 1 | 0.0294 | 0.0 | 0.0921 | 0.225 | 137.489 | 333680 | 3 | 1 |
| 7 | 中村佳穂 | AINOU | 忘れっぽい天使 | 3vVOzWA6SNSGXNYeg4LnYN | 0.443 | 0.065 | 0 | -14.52 | 1 | 0.0601 | 0.0 | 0.111 | 0.287 | 71.77 | 235506 | 4 | 1 |
| 8 | NELL | Let’s Part | Time Walking on Memories | 6XkrfYmgPGSvgufoivTQgj | 0.399 | 0.331 | 10 | -9.221 | 1 | 0.0478 | 1e-06 | 0.11 | 0.193 | 171.208 | 389853 | 4 | 1 |
| 9 | Anni Hung | 洪安妮 我喜歡你 | 我喜歡你 | 08cKcaqRuwzfEYZac1RKAF | 0.407 | 0.274 | 1 | -12.247 | 1 | 0.0392 | 0.000358 | 0.119 | 0.235 | 134.577 | 246924 | 4 | 1 |


In [255]:
# output top 50 global playlist to Excel  
glob_nn = pred_results(df_global, mlp)
glob_nn.to_csv('Global NN.csv', index = False)

In [257]:
# output ultimate indie to excel
indie_nn = pred_results(df_indie, mlp)
indie_nn.to_csv('Indie NN.csv', index_label = None)