# Collaborative Filtering

Collaborative Filtering means that interests of other users are taking into account for making a playlist prediction. Spotify uses a matrix where each row is a user and each column is a song, each entry therefore resembles how often a user has listened to a song. Since this data is not publicly available, we will try the same approach, but with a different matrix. The matrix will have one row for each playlist and each column will be a song. The entries of the matrix are going to be 1 if the song is in the playlist and 0 if not. Read more about collaborative filtering [here](TODO).

Let's first import all important libraries. pandas DataFrames are used to store the data and numpy arrays for matrix computations.

In [1]:
import warnings
import numpy as np
import pandas as pd

from scripts.matrix_factorization import MF  # for matrix factorization

Load the raw data.

In [2]:
df = pd.read_csv("data/sorted_processed_data_train.csv")
print(f"We have a dataset of {len(df)} entries")

We have a dataset of 64574 entries


We want to build a matrix where each row is a playlist, and each column resembles a song. The dimensions of our matrix resemble those of the playlist number and track number.

In [3]:
num_playlists = df["name"].nunique()  # count distinct values, this is the number of playlists
num_tracks = df["track_name"].nunique()  # count distinct values, this is the number of tracks
print(f"Playlists: {num_playlists} \nTracks: {num_tracks}")

Playlists: 833 
Tracks: 29384


Group the data by the playlist name. This results in a Series.

In [4]:
playlists = df.groupby('name')["track_name"].apply(list)
playlists.head()

name
 CHiLl         [Fresh Eyes, i hate u, i love u (feat. olivia ...
 Frozen        [Queen Elsa of Arendelle - Score Demo, Reindee...
 indie rock    [Back In Your Head, Be Good (RAC Remix), Bambi...
#Relaxed       [Bag Lady, On & On, I Can't Stop Loving You, L...
#Workout       [Can't Feel My Face - Martin Garrix Remix, Ign...
Name: track_name, dtype: object

We need a list of the unique songs to create each playlist vector.

In [5]:
unique_songs = list(df["track_name"].unique())  # list of unique songs, maps each song to an index
print(f"Number of unique songs: {len(unique_songs)}")

Number of unique songs: 29384


We can now iteratively build our matrix by creating each playlist in a vector of its songs. This is done by one-hot encoding, meaning every column is a song, and every row a playlist, and the row-column combination is one if the song was added to the playlist. 

In [6]:
one_hot_playlists = list()
for playlist in playlists:
    playlist_array = np.zeros(num_tracks)
    for song in playlist:
        playlist_array[unique_songs.index(song)] = 1  # set array to 1 at index of the song
    one_hot_playlists.append(playlist_array)
one_hot_playlists = np.array(one_hot_playlists)  # convert to one numpy array (matrix)

For example, the first playlist "CHiLl" includes the song "Make Me (Cry)" and does not include "Mr. Brightside". Check if the value is one and zero respectively at the corresponding positions.

In [7]:
print(one_hot_playlists[0][unique_songs.index("Make Me (Cry)")] == 1.0)
print(one_hot_playlists[0][unique_songs.index("Mr. Brightside")] == 0.0)

True
True


The shape of our playlists should be playlist number times distinct track number.

In [8]:
one_hot_playlists.shape

(833, 29384)

We now apply matrix factorization on our data. This means, we try to find two matrices, which multiplied are as close to the original matrix as possible. We train using gradient descent, meaning we try to minimize the error in each iteration.

In [9]:
mf = MF(one_hot_playlists, K=2, alpha=0.1, beta=0.01, iterations=100)
mf.train()

Iteration: 10 ; error = 2.1155
Iteration: 20 ; error = 1.1530
Iteration: 30 ; error = 0.7647
Iteration: 40 ; error = 0.5546
Iteration: 50 ; error = 0.4267
Iteration: 60 ; error = 0.3408
Iteration: 70 ; error = 0.2765
Iteration: 80 ; error = 0.2321
Iteration: 90 ; error = 0.1969
Iteration: 100 ; error = 0.1673


[(0, 15.295526520377306),
 (1, 8.377218879031965),
 (2, 5.834762190446305),
 (3, 4.557920102193173),
 (4, 3.760484254323002),
 (5, 3.2188950014340665),
 (6, 2.8396133221875415),
 (7, 2.5422054537340264),
 (8, 2.3147908110243582),
 (9, 2.115548275804261),
 (10, 1.970929788054907),
 (11, 1.818285740833489),
 (12, 1.6992442439892712),
 (13, 1.5895652702434653),
 (14, 1.499103205963745),
 (15, 1.4164011478723069),
 (16, 1.3395150733582029),
 (17, 1.2702212053383057),
 (18, 1.2063403710806984),
 (19, 1.1529861114380264),
 (20, 1.1005930957979086),
 (21, 1.0511870356714754),
 (22, 1.002157978870854),
 (23, 0.9671003053577073),
 (24, 0.9237745039057815),
 (25, 0.8890992072866498),
 (26, 0.8519255032738529),
 (27, 0.8233356365084263),
 (28, 0.7919509679083936),
 (29, 0.7647067113001538),
 (30, 0.7371832047002738),
 (31, 0.7141744922060613),
 (32, 0.6870928233026763),
 (33, 0.6724193711969266),
 (34, 0.6472667350298804),
 (35, 0.6257651187469201),
 (36, 0.607141264780451),
 (37, 0.5885560197090

As a result, we get a matrix where values should be close to their original values, but unknown values, in our case songs that are not in the playlist, are approximated by the matrix factorization. For example, if we look at the same song from earlier, the value is close to 1. Looking in at a song that has not been added to the playlist, the value is now approximated to how likely it should be added to the playlist.

In [10]:
print(mf.full_matrix()[0][unique_songs.index("Make Me (Cry)")])
print(mf.full_matrix()[0][unique_songs.index("Mr. Brightside")])

0.9993663650257003
1.00004585129019


We use this matrix for making our recommendation. We do so by looking for the songs with the least difference to the number 1, since this number was given to songs that were in the playlist. Let's predict 5 more songs for our first playlist "ChiLl".

In [11]:
rec_matrix = mf.full_matrix()
playlist = 0  # choose which playlist to look at
playlist_vals = rec_matrix[playlist]  # all predicted values for the first playlist
diff_vals = playlist_vals - 1  # subtract 1 from each value to get difference to 1

k=5  # find the k smallest values
smallest_k_indexes = np.argpartition(diff_vals, k)  # get list of indexes ordered according to descending order of values

i = 0  # record how many songs were output
for idx in np.flip(smallest_k_indexes):  # look at which song corresponds to found indexes
    if one_hot_playlists[playlist][idx] == 0:  # song was not in playlist already
        print(unique_songs[idx])  # use index to recover name of song
        i += 1
    if i == k:
        break

2 Step (feat. T-Pain, Jim Jones & E-40)
Main Chick REMIX
Manea
Your Boyfriend
Reindeer(s) Remix - Outtake


We can store our Matrix to make predictions in the future.

In [12]:
with open('data/matrix.npy', 'wb') as f:
    np.save(f, mf.full_matrix())

In [17]:
unique_songs_uris

['4nMlau89VAjmV7agkl7OY3',
 '7vRriwrloYVaoAe3a9wJHe',
 '152lZdxL1OR0ZMW6KquMif',
 '3eze1OsZ1rqeXkKStNfTmi',
 '2QbFClFyhMMtiurUjuQlAe',
 '04DwTuZ2VBdJCCC5TROn7L',
 '6j328qpqI6nScYgOTW5rBl',
 '5uKryGXlPNcs1hqq7nCbVe',
 '3xgK660fsZH7ZDcOMfIdfB',
 '5uDASfU19gDxSjW8cnCaBp',
 '6DNtNfH8hXkqOX1sjqmI7p',
 '3pndPhlQWjuSoXhcIIdBjv',
 '1wZqJM5FGDEl3FjHDxDyQd',
 '4vNeJ6DoAlTPzoYO9W4mJH',
 '68EMU2RD1ECNeOeJ5qAXCV',
 '5dKyZWlgjWw1oJgLa4GCZD',
 '5xV0Czdqefft6sPDqjmFBu',
 '2BrzlUj1u1CtvaJDGIKpsP',
 '4F7A0DXBrmUAkp32uenhZt',
 '78rIJddV4X0HkNAInEcYde',
 '3aJkV6DUTSCqOwVwaBDG9B',
 '0KdczQzjtSU7PnVx792TAl',
 '1SN5BiwLs8hIVlyetAfirB',
 '3LwmXEZGgMUdkCgHbz7tBr',
 '43HHWK1V3YvvCsG9WstCob',
 '2xTeBEE0wJ14jjsDp6Aw6C',
 '0fwPpZRks3445e4Gq0s0n1',
 '4WOSnPruog3605oiLpZAqx',
 '5DRuqedbCnIskJqp7mQMPZ',
 '7HXMQ2jWI2de8iZaLCM98N',
 '2vVQHScO9IkSgmofSIjYEx',
 '2HIK7ZGx4PvFD89vd9wxxq',
 '10HbvFUtf6Yo0VtFTRcRvF',
 '5Clxl180dvIgvHZsJaPRT0',
 '5tam15sgdLxoxcwuSeyvim',
 '5q9k7uDMyVrLe1C23f1bwy',
 '2ybG7j6iMJ6cbbDwqKyvc1',
 

Save track IDs and playlist names in order to be able to reconstruct the information.

In [19]:
track_name_to_uri = dict(zip(df.track_name, df.track_uri))
unique_songs_uris = [track_name_to_uri[x] for x in unique_songs]

with open('data/unique_songs_uris.txt', 'w') as f:
	f.write('\n'.join(unique_songs_uris))
    
with open('data/playlists.txt', 'w', encoding="utf-8") as f:
	f.write('\n'.join(list(playlists.index)))