The classification model of choice is a multi-layer perceptron (MLP). After testing several classification methods, neural networks fit our needs best. A multi-layer perceptron is a good general-purpose function approximator, which also makes it well suited to classification.

In [1]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from spotipy.oauth2 import SpotifyOAuth

import random
import string
import pandas as pd
import numpy as np


#Write here the client ID and secret ID from spotify API
SPOTIPY_CLIENT_ID = ''
SPOTIPY_CLIENT_SECRET = ''
REDIRECT_URI = 'http://localhost:7000/callback'
scope = "user-library-read"

cache_handler = spotipy.cache_handler.MemoryCacheHandler()
auth_manager = SpotifyClientCredentials(client_id = SPOTIPY_CLIENT_ID, client_secret=SPOTIPY_CLIENT_SECRET, cache_handler=cache_handler)
sp = spotipy.Spotify(auth_manager = auth_manager)

### Next, we define a function that fetches a sample of songs from a specific genre, which can be used to test the playlist predictor

In [23]:
#function to fetch songs from a specific genre
#returns: dataframe with song name, artist name, and audio features
#Can be used to test the model

def fetch_songs(sp, genre, year, number):
    #DF where the songs are stored
    df = pd.DataFrame()
    
    #Fetch songs until there are more than number of songs in the DataFrame
    while (df.shape[0] < number):
        
        #Create empty list for storing songs with one fetch
        song_data = []
        
        #Make random search by some random letter
        offset = random.randint(1,1000)
        random_character = random.choice(string.ascii_letters)
        random_search = random.choice([random_character + '%'
                                       ,'%' + random_character
                                       ,'%' + random_character + '%'])
        #Field filters take the form field:value, with no space after the colon
        songs = sp.search(q = 'track:' + random_search + ' year:' + year + ' genre:' + genre, type = 'track', market = 'FI', offset = offset, limit = 50)
        
        #Go through all songs from the fetch and extract needed features
        for song in songs['tracks']['items']:
            audio_features = sp.audio_features(song['id'])

            #Skip tracks for which Spotify returns no audio features
            if not audio_features or audio_features[0] is None:
                continue

            row = {'song_name': song['name'],
                   'artist_name': song['artists'][0]['name'],
                   'popularity': song['popularity']}
            row.update(audio_features[0])
            song_data.append(row)

        #Concatenate the found songs to a dataframe and remove duplicates
        #(skipped when the search returned nothing usable)
        if song_data:
            new_df = pd.DataFrame(song_data)
            df = pd.concat([df, new_df], ignore_index = True)
            df = df.drop_duplicates(subset = ['id'])
    
    df = df.drop(['type', 'track_href', 'analysis_url', 'time_signature'], axis = 1)
    
    return df

#Test run, fetches at least 10 rap songs from 2020
data = fetch_songs(sp, 'rap', '2020', 10)
data.head()

Unnamed: 0,song_name,artist_name,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,id,uri,duration_ms
0,"...And to Those I Love, Thanks for Sticking Ar...",$uicideboy$,82,0.792,0.511,2,-6.876,1,0.0409,0.124,9e-05,0.14,0.111,113.983,30QR0ndUdiiMQMA9g1PGCm,spotify:track:30QR0ndUdiiMQMA9g1PGCm,168490
1,Take You Dancing,Jason Derulo,78,0.789,0.711,2,-4.248,1,0.041,0.0332,0.0,0.0876,0.753,112.985,59qrUpoplZxbIZxk6X0Bm3,spotify:track:59qrUpoplZxbIZxk6X0Bm3,190306
2,WAP (feat. Megan Thee Stallion),Cardi B,81,0.935,0.454,1,-7.509,1,0.375,0.0194,0.0,0.0824,0.357,133.073,4Oun2ylbjFKMPTiaSbbCih,spotify:track:4Oun2ylbjFKMPTiaSbbCih,187541
3,Took Her To The O,King Von,79,0.82,0.592,1,-7.002,1,0.29,0.0149,5e-06,0.121,0.4,159.98,7fEoXCZTZFosUFvFQg1BmW,spotify:track:7fEoXCZTZFosUFvFQg1BmW,196180
4,Our Time,Lil Tecca,78,0.895,0.439,5,-12.142,0,0.319,0.199,0.0,0.0683,0.633,109.988,2WxUIiq06XXPYWl9YcRJnD,spotify:track:2WxUIiq06XXPYWl9YcRJnD,98413
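
The function above requests audio features one track at a time. Spotify's audio features endpoint accepts up to 100 track IDs per request, so batching can cut the number of API calls considerably. The sketch below is one possible way to do that, not part of the original notebook: `chunked`, `fetch_features_batched` and `fake_fetch` are hypothetical names, and `fake_fetch` is a stand-in for `sp.audio_features` so the example runs without credentials.

```python
def chunked(items, size=100):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_features_batched(fetch, track_ids, batch_size=100):
    """Collect audio features for many tracks using one call per batch.

    `fetch` is any callable that takes a list of track IDs and returns a
    list of feature dicts (or None for unknown tracks) in the same order.
    """
    features = {}
    for batch in chunked(track_ids, batch_size):
        for track_id, feats in zip(batch, fetch(batch)):
            # Tracks without audio features come back as None and are skipped
            if feats is not None:
                features[track_id] = feats
    return features


def fake_fetch(ids):
    """Stand-in for sp.audio_features: IDs ending in 'x' have no features."""
    return [None if i.endswith('x') else {'id': i, 'tempo': 120.0} for i in ids]


demo_ids = ['a1', 'b2x', 'c3']
demo_features = fetch_features_batched(fake_fetch, demo_ids, batch_size=2)
```

In the notebook, `fetch` would presumably be `sp.audio_features` and `track_ids` the `id` values of the items returned by one search.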


In [15]:
df = pd.read_csv('data_processed.csv')
df.head()

Unnamed: 0,playlist_name,danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo,...,key|4,key|5,key|6,key|7,key|8,key|9,key|10,key|11,mode|0,mode|1
0,#vainsuomihitit,0.174043,0.113267,0.17625,0.029384,0.070482,1e-06,0.019962,0.192761,0.104197,...,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.5
1,#vainsuomihitit,0.108605,0.14627,0.169903,0.020346,0.106024,0.000172,0.019962,0.091003,0.161951,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0
2,#vainsuomihitit,0.107078,0.080666,0.156801,0.013733,0.113654,4e-06,0.018404,0.096794,0.111126,...,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.5
3,#vainsuomihitit,0.050802,0.145666,0.17013,0.005401,0.001078,1e-05,0.023745,0.002834,0.061054,...,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.5,0.0
4,#vainsuomihitit,0.131072,0.15935,0.165754,0.002182,0.001112,0.149146,0.033092,0.125336,0.064682,...,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0


Time to train the model. The response variable (class) is the playlist name, and the input data are the normalized audio features of each song in a playlist. The model's accuracy on the held-out test set is about 31%, which is good enough considering that each playlist has only a small number of data points and that many playlists contain quite similar songs, so they are easily confused. Since we return a list of the 5 most similar playlists, the first prediction does not need to be the actual playlist. In testing, the model seems to place the correct playlist among the 5 most similar ones.

In [17]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X = df.iloc[:,1:].values
y = df['playlist_name']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.10)
#Create the classifier, then fit it once
clf = MLPClassifier(solver = 'lbfgs', verbose = True)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


0.30973451327433627
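
The warning in the output above suggests either raising `max_iter` or scaling the data. The sketch below shows what that could look like on synthetic data from `make_classification`; the dataset, the `max_iter` value of 1000 and the use of `StandardScaler` are assumptions for illustration and have not been tuned or evaluated on the playlist data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the playlist data (assumption, not the real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=6, n_classes=3,
                                     random_state=0)

# Stratify so every class appears in both the train and the test split
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.10,
                                          stratify=y_demo, random_state=0)

# Scale the features and allow more L-BFGS iterations than the default of 200
model = make_pipeline(StandardScaler(),
                      MLPClassifier(solver='lbfgs', max_iter=1000,
                                    random_state=0))
model.fit(X_tr, y_tr)
demo_score = model.score(X_te, y_te)
```

Fixing `random_state` also makes the score reproducible between runs, which the notebook's split and classifier currently are not.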

The cell below uses the test data to predict the top 5 playlists that sound most similar to a given song. Seems to work surprisingly well!

In [18]:
import numpy as np

n = 5

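#Class probabilities for every song in the test set
probas = clf.predict_proba(X_test)
#Indices of the n most probable playlists per song, most probable first
top_n_labels_idx = np.argsort(-probas, axis=1)[:, :n]
top_n_probs = np.round(-np.sort(-probas), 3)[:, :n]
top_n_labels = [clf.classes_[i] for i in top_n_labels_idx]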

y_test = y_test.reset_index(drop = True)

results = list(zip(top_n_labels, top_n_probs))

labels = pd.concat([y_test, pd.DataFrame(results)], axis = 1)
labels

Unnamed: 0,playlist_name,0,1
0,Internet People,"[Bileräppiä, Internet People, Power Gaming, PO...","[0.433, 0.248, 0.233, 0.057, 0.007]"
1,Calming Acoustic,"[Calming Acoustic, Lava Lamp, Feel Good Beats,...","[0.965, 0.012, 0.009, 0.007, 0.005]"
2,Paras fiilis!,"[Paras fiilis!, Parhaat suomihitit 00-luvulta,...","[0.1, 0.098, 0.089, 0.084, 0.064]"
3,Lava Lamp,"[Lava Lamp, Evening Acoustic, Chill Vibes, Jyt...","[0.969, 0.012, 0.008, 0.007, 0.003]"
4,Big Country,"[Big Country, Jonnet ei muista, Ensisoitossa, ...","[0.295, 0.119, 0.106, 0.092, 0.091]"
...,...,...,...
334,Chill Vibes,"[Best New Pop, Chill Pop, Viikonloppufiilis, C...","[0.218, 0.209, 0.166, 0.06, 0.051]"
335,#vainsuomihitit,"[Jytää, purkkaa ja Finnhitsejä, Suomirockin kl...","[0.203, 0.168, 0.148, 0.068, 0.057]"
336,New Music Friday Suomi,"[Aitoa suomiräppiä, POLLEN, Viikonloppufiilis,...","[0.116, 0.114, 0.077, 0.073, 0.063]"
337,Jonnet ei muista,"[Matkalaulut, Jonnet ei muista, Jytää, purkkaa...","[0.345, 0.223, 0.101, 0.089, 0.06]"
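
The table above shows individual predictions, but the top-5 claim can also be summarized as a single number: the fraction of test songs whose true playlist appears among the n most probable ones. The sketch below is a minimal way to compute that; `top_k_hit_rate` is a hypothetical helper name and the toy arrays are made up for illustration.

```python
import numpy as np


def top_k_hit_rate(probas, classes, y_true, k=5):
    """Fraction of rows whose true label is among the k most probable classes."""
    probas = np.asarray(probas)
    classes = np.asarray(classes)
    y_true = np.asarray(y_true)
    # Indices of the k highest-probability classes in each row
    top_k_idx = np.argsort(-probas, axis=1)[:, :k]
    # Map the indices back to class labels, shape (n_samples, k)
    top_k_labels = classes[top_k_idx]
    # A row is a hit if any of its k labels equals the true label
    hits = (top_k_labels == y_true[:, None]).any(axis=1)
    return hits.mean()


# Toy example: 3 songs, 3 playlists (made-up numbers)
toy_classes = np.array(['A', 'B', 'C'])
toy_probas = np.array([[0.6, 0.3, 0.1],
                       [0.2, 0.3, 0.5],
                       [0.1, 0.2, 0.7]])
toy_truth = np.array(['B', 'A', 'C'])

top1 = top_k_hit_rate(toy_probas, toy_classes, toy_truth, k=1)
top2 = top_k_hit_rate(toy_probas, toy_classes, toy_truth, k=2)
```

With the notebook's variables, the call would be `top_k_hit_rate(probas, clf.classes_, y_test, k=n)`. Recent scikit-learn versions also provide `sklearn.metrics.top_k_accuracy_score`, which should give the same kind of figure.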
