Import the libraries for the self-organizing map (SOM) implementation and the Spotify API client.

In [1]:
import pandas as pd
import numpy as np
from minisom import MiniSom    
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

seed = 25
np.random.seed(seed)

# Credentials are read from the SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET
# environment variables rather than being hard-coded in the notebook.
auth_manager = SpotifyClientCredentials()
sp = spotipy.Spotify(auth_manager=auth_manager)

Load the Spotify song dataset, drop duplicate tracks, and restrict the dataset to the usable columns.

In [2]:
df = pd.read_csv('./datasets/spotify_songs.csv')
df.drop_duplicates('track_id', inplace=True)
df = df[['genre','track_id','popularity','acousticness','danceability','duration_ms','energy','instrumentalness','liveness','loudness','speechiness','tempo','valence']]

df.describe()

statistic,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence
count,176774.0,176774.0,176774.0,176774.0,176774.0,176774.0,176774.0,176774.0,176774.0,176774.0,176774.0
mean,36.273162,0.404135,0.541068,236127.2,0.557025,0.172073,0.224531,-10.137605,0.127395,117.203679,0.451595
std,17.391016,0.366302,0.190387,130513.2,0.275839,0.322936,0.211027,6.395551,0.204345,31.325091,0.26782
min,0.0,0.0,0.0569,15387.0,2e-05,0.0,0.00967,-52.457,0.0222,30.379,0.0
25%,25.0,0.0456,0.415,178253.0,0.344,0.0,0.0975,-12.851,0.0368,92.006,0.222
50%,37.0,0.288,0.558,219453.0,0.592,7e-05,0.13,-8.191,0.0494,115.0065,0.44
75%,49.0,0.791,0.683,268547.0,0.789,0.0908,0.277,-5.631,0.102,138.79975,0.667
max,100.0,0.996,0.989,5552917.0,0.999,0.999,1.0,3.744,0.967,242.903,1.0


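This section encodes the non-numeric columns (genre and track_id) as integers and standardizes the numeric features to zero mean and unit standard deviation, so that no single feature dominates the SOM's distance calculation. The genre column is deliberately left unnormalized. Its integer codes span a wider range than the standardized features, so genre carries a heavier weight in the map, which suits song recommendation.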

In [3]:
def flatten(l):
    return [item for sublist in l for item in sublist]

def encode_values(col):
    unique_items = col.unique().tolist()
    items_to_encoded = {x: i for i, x in enumerate(unique_items)}
    encoded_to_items = {i: x for i, x in enumerate(unique_items)}
    return (items_to_encoded, encoded_to_items)

genre_items_to_encoded, genre_encoded_to_items = encode_values(df['genre'])
df['genre'] = df['genre'].map(genre_items_to_encoded)

track_id_items_to_encoded, track_id_encoded_to_items = encode_values(df['track_id'])
df['track_id'] = df['track_id'].map(track_id_items_to_encoded)

# Do not normalize genre: a heavier weight on genre produces more preferable recommendations.
normalized_df = df[['popularity','acousticness','danceability','duration_ms','energy','instrumentalness','liveness','loudness','speechiness','tempo','valence']]
# Standardize each feature to zero mean and unit (sample) standard deviation.
normalized_df = (normalized_df - normalized_df.mean()) / normalized_df.std()
normalized_df.insert(0, "genre", df['genre'].to_numpy(), True)

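# Fairly hacky way of correlating a normalized row with a Spotify track id:
# key on the normalized popularity, acousticness, and danceability values.
# Rows that share all three values collide, and the last one wins.
mapping_to_track_id = {(item[1], item[2], item[3]): track_id for item, track_id in zip(normalized_df.to_numpy(), df['track_id'].to_numpy())}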
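
To illustrate why leaving genre unnormalized gives it a heavier weight, here is a minimal sketch in plain Python. The three songs and their values are hypothetical, and the helper `zscore` is only an illustration of the standardization applied above, not part of the notebook.

```python
import math
import statistics

def zscore(values):
    """Standardize a column to zero mean and unit (sample) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std (ddof=1), matching the pandas default
    return [(v - mean) / std for v in values]

# Hypothetical toy data: an integer genre code plus two audio features per song.
genre = [0, 0, 20]
tempo = [90.0, 150.0, 92.0]
energy = [0.30, 0.90, 0.32]

# Genre stays as a raw integer code; the audio features are standardized.
rows = list(zip(genre, zscore(tempo), zscore(energy)))

# Songs 0 and 1 share a genre but differ strongly in tempo and energy.
same_genre = math.dist(rows[0], rows[1])
# Songs 0 and 2 have almost identical audio features but different genre codes.
similar_audio = math.dist(rows[0], rows[2])

print('same genre, different audio:', round(same_genre, 2))
print('similar audio, different genre:', round(similar_audio, 2))
```

In this toy case the raw genre code dominates the Euclidean distance, so the two same-genre songs end up closer together than the two songs that sound alike.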

Create a 40x40 SOM, initialize its weights with PCA, and train it on the dataset for 100,000 iterations.

In [4]:
data = normalized_df.to_numpy()
shape = (40, 40)
som = MiniSom(shape[0], shape[1], data.shape[1], sigma=0.5, learning_rate=0.5)
som.pca_weights_init(data)
som.train_batch(data, 100000, verbose=True)

win_map = som.win_map(data)

 [ 100000 / 100000 ] 100% - 0:00:00 left 
 quantization error: 3.240989823347801
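
As described in MiniSom's documentation, the quantization error is the average distance between each sample and the weight vector of its best matching unit (the winning neuron). The following is a minimal sketch of that calculation in plain Python, not MiniSom's implementation. The 2x2 map, the data, and both function names are hypothetical.

```python
import math

def best_matching_unit(weights, sample):
    """Return the (row, col) of the neuron whose weight vector is closest to sample."""
    coords = ((i, j) for i in range(len(weights)) for j in range(len(weights[0])))
    return min(coords, key=lambda ij: math.dist(weights[ij[0]][ij[1]], sample))

def quantization_error(weights, data):
    """Average distance between each sample and its best matching unit."""
    total = 0.0
    for sample in data:
        i, j = best_matching_unit(weights, sample)
        total += math.dist(weights[i][j], sample)
    return total / len(data)

# Hypothetical 2x2 map with 2-dimensional weights (not the trained 40x40 map).
weights = [
    [[0.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 1.0]],
]
data = [[0.1, 0.0], [0.9, 1.0], [1.0, 0.2]]

print('winner for second sample:', best_matching_unit(weights, data[1]))
print('quantization error:', quantization_error(weights, data))
```

A lower value means the neurons sit closer to the samples they represent.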


This section covers the song recommendation. Given a song, we find its closest neuron (the winner). From that neuron, we get all of the tracks that map to it. The code below picks 5 of those tracks at random as the recommendations.

The Spotify API is used here to describe the chosen and recommended tracks in more detail.
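
The lookup described above can be sketched without MiniSom as follows. This is an illustration under stated assumptions, not the notebook's method: the neurons are a flat list, the data and track ids are hypothetical, and the functions `nearest_neuron` and `recommend` are made up for this example. Unlike the notebook code, the sketch groups row indices per neuron (so no feature-based key is needed to find the track id) and excludes the chosen song from its own recommendations.

```python
import math
import random
from collections import defaultdict

def nearest_neuron(weights, sample):
    """Index of the closest weight vector (neurons kept in a flat list for brevity)."""
    return min(range(len(weights)), key=lambda k: math.dist(weights[k], sample))

def recommend(weights, data, track_ids, chosen, n=5, seed=25):
    """Pick up to n other tracks that map to the same neuron as the chosen row."""
    # Group row indices by winning neuron, similar in spirit to a win map.
    rows_by_neuron = defaultdict(list)
    for row, sample in enumerate(data):
        rows_by_neuron[nearest_neuron(weights, sample)].append(row)

    winner = nearest_neuron(weights, data[chosen])
    neighbours = [row for row in rows_by_neuron[winner] if row != chosen]

    # Sample without replacement; fewer than n tracks may be available.
    rng = random.Random(seed)
    picked = rng.sample(neighbours, min(n, len(neighbours)))
    return [track_ids[row] for row in picked]

# Hypothetical map with two neurons and five songs.
weights = [[0.0, 0.0], [5.0, 5.0]]
data = [[0.1, 0.2], [4.9, 5.1], [0.3, 0.0], [5.2, 4.8], [0.0, 0.1]]
track_ids = ['a', 'b', 'c', 'd', 'e']

print(recommend(weights, data, track_ids, chosen=0))
```
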

In [9]:
def print_track(track_id):
    # Look up the track, then its artists, and print names and genres.
    track = sp.track(track_id)
    print('Track: ' + track['name'] + ' [' + track['id'] + ']')
    artists = sp.artists([artist['id'] for artist in track['artists']])['artists']
    artist_names = [artist['name'] for artist in artists]
    # Genres come from the artists, so merge and de-duplicate them across all artists.
    artist_genres = np.unique(flatten([artist['genres'] for artist in artists]))
    print('Artists: ' + ', '.join(artist_names))
    print('Genres: ' + ', '.join(artist_genres))

# Example track ids to try:
# 6C7RJEIUDqKkJRZVWdkfkH rap
# 3eHtVkc0qhvwr0EWzi0gra Spirited Away
# 5zHfh7X1ioaXz0r534ScWY classical
# 2xYlyywNgefLCRDG8hlxZq country
chosen_track_id = '5zHfh7X1ioaXz0r534ScWY'
index = track_id_items_to_encoded[chosen_track_id]
chosen_song = normalized_df.iloc[index].to_numpy()

print('Chosen Song:')
print_track(chosen_track_id)
print('')

winner = som.winner(chosen_song)
similar_tracks = win_map[winner]

array_shuffle = np.arange(len(similar_tracks))
np.random.shuffle(array_shuffle)

recommended_tracks = np.array(similar_tracks)[array_shuffle[:5]]
print('Recommended Songs:')
for track in recommended_tracks:
    print_track(track_id_encoded_to_items[mapping_to_track_id[(track[1], track[2], track[3])]])
    print('----')



Chosen Song:
Track: Nocturnes, Op. 9: No. 2, Andante in E-Flat Major [5zHfh7X1ioaXz0r534ScWY]
Artists: Frédéric Chopin, Henrik Måwe
Genres: classical, early romantic era, nordic classical piano, polish classical

Recommended Songs:
Track: Chopin: Nocturne No. 17 in B Major, Op. 62 No. 1 [3UiLTMeqt1zQOyZqmU5jiP]
Artists: Frédéric Chopin, Elisabeth Leonskaja
Genres: classical, classical piano, early romantic era, polish classical, russian classical piano
----
Track: Vivaldi: The Four Seasons, Violin Concerto in F Major, Op. 8 No. 3, RV 293 "Autumn": II. Adagio molto [4Fwh7LLCxx9Z3nKQSszq4d]
Artists: Antonio Vivaldi, Nigel Kennedy, English Chamber Orchestra
Genres: baroque, bow pop, british orchestra, chamber orchestra, classical, early music, italian baroque, violin
----
Track: Finale (From "East of Eden") [59eF6QxMRWpLGlFsKcHR7S]
Artists: London Symphony Orchestra, Charles Gerhardt, Lee Holdridge
Genres: british orchestra, classic soundtrack, classical, orchestra
----
Track: All Gone (P