# Collaborative Filtering

Collaborative filtering means that the interests of other users are taken into account when making a playlist prediction. Spotify uses a matrix in which each row is a user and each column is a song, so each entry represents how often a user has listened to a song. Since this data is not publicly available, we will try the same approach with a different matrix. Our matrix has one row per playlist and one column per song. An entry is 1 if the song is in the playlist and 0 if it is not. Read more about collaborative filtering [here](TODO).

Let's first import the libraries we need. pandas DataFrames store the data, and NumPy arrays are used for the matrix computations.

In [1]:
import warnings
import numpy as np
import pandas as pd

from scripts.matrix_factorization import MF  # for matrix factorization

Load the raw data.

In [2]:
df = pd.read_csv("data/processed_data_train.csv")
print(f"We have a dataset of {len(df)} entries")

We have a dataset of 60732 entries


We want to build a matrix where each row is a playlist and each column represents a song. The dimensions of the matrix are therefore the number of playlists by the number of distinct tracks.

In [3]:
num_playlists = df["name"].nunique()  # count distinct values, this is the number of playlists
num_tracks = df["track_name"].nunique()  # count distinct values, this is the number of tracks
print(f"Playlists: {num_playlists} \nTracks: {num_tracks}")

Playlists: 779 
Tracks: 27902


Group the data by playlist name. This results in a Series that maps each playlist name to the list of its tracks.

In [4]:
playlists = df.groupby('name')["track_name"].apply(list)
playlists.head()

name
 CHiLl         [Fresh Eyes, i hate u, i love u (feat. olivia ...
 Frozen        [Queen Elsa of Arendelle - Score Demo, Reindee...
 indie rock    [Back In Your Head, Be Good (RAC Remix), Bambi...
#Relaxed       [Bag Lady, On & On, I Can't Stop Loving You, L...
#Workout       [Can't Feel My Face - Martin Garrix Remix, Ign...
Name: track_name, dtype: object
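
To make the grouping step concrete, here is a minimal sketch on a made-up DataFrame. The toy rows are hypothetical; only the column names `name` and `track_name` are taken from our dataset.

```python
import pandas as pd

# Hypothetical toy data with the same column names as our dataset
toy_df = pd.DataFrame({
    "name": ["rock", "chill", "rock"],
    "track_name": ["Song A", "Song C", "Song B"],
})

# One list of track names per playlist; the index is sorted by playlist name
toy_playlists = toy_df.groupby("name")["track_name"].apply(list)
print(toy_playlists["rock"])
```

Within each group the tracks keep the order in which they appear in the DataFrame.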

We need a list of the unique songs to create each playlist vector.

In [5]:
unique_songs = list(df["track_name"].unique())  # list of unique songs; a song's position in the list is its column index
print(f"Number of unique songs: {len(unique_songs)}")

Number of unique songs: 27902


We can now build the matrix row by row by encoding each playlist as a vector over all songs. This is a one-hot style encoding: every column is a song, every row is a playlist, and an entry is 1 if the song was added to the playlist and 0 otherwise.

In [6]:
song_to_index = {song: i for i, song in enumerate(unique_songs)}  # dictionary lookup instead of repeated list.index calls
one_hot_playlists = np.zeros((len(playlists), num_tracks))  # one row per playlist, one column per song
for row, playlist in enumerate(playlists):
    for song in playlist:
        one_hot_playlists[row, song_to_index[song]] = 1  # mark the song as present in this playlist
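
As a cross-check, the same kind of binary matrix can be sketched with `pd.crosstab` on a small made-up DataFrame. The toy data below is hypothetical and only reuses the column names `name` and `track_name`. Note that `crosstab` sorts rows and columns alphabetically, so the column order would differ from `unique_songs`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same column names as our dataset
toy_df = pd.DataFrame({
    "name": ["rock", "rock", "chill", "chill", "chill"],
    "track_name": ["Song A", "Song B", "Song B", "Song C", "Song B"],
})

# Count playlist/song co-occurrences, then clip to 0/1 so duplicates count once
toy_matrix = pd.crosstab(toy_df["name"], toy_df["track_name"]).clip(upper=1)
print(toy_matrix)

# Plain NumPy matrix; rows follow the sorted playlist names, columns the sorted song names
toy_array = toy_matrix.to_numpy()
```

The clipping step matters because a song that appears twice in a playlist would otherwise produce an entry of 2 instead of 1.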

For example, the first playlist "CHiLl" includes the song "Make Me (Cry)" and does not include "Mr. Brightside". Check if the value is one and zero respectively at the corresponding positions.

In [7]:
print(one_hot_playlists[0][unique_songs.index("Make Me (Cry)")] == 1.0)
print(one_hot_playlists[0][unique_songs.index("Mr. Brightside")] == 0.0)

True
True


The shape of the matrix should be the number of playlists by the number of distinct tracks.

In [8]:
one_hot_playlists.shape

(779, 27902)

We now apply matrix factorization to our data. This means we try to find two matrices whose product is as close to the original matrix as possible. We train using gradient descent, which adjusts the two matrices in each iteration to reduce the error. The implementation for the training can be found [here](https://albertauyeung.github.io/2017/04/23/python-matrix-factorization.html/#a-simple-implementation-in-python).
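
To illustrate the idea, here is a simplified sketch of matrix factorization on a tiny made-up matrix. It is not the `MF` class from `scripts.matrix_factorization`: the function `factorize`, its parameters, and the toy matrix are assumptions for illustration, and it uses full-batch gradient descent over all entries with an L2 penalty, which may differ from the linked implementation.

```python
import numpy as np

def factorize(R, k=2, alpha=0.05, beta=0.01, iterations=2000, seed=0):
    """Approximate R (m x n) by P @ Q.T using plain full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    P = rng.normal(scale=0.1, size=(m, k))  # row (playlist) factors
    Q = rng.normal(scale=0.1, size=(n, k))  # column (song) factors
    for _ in range(iterations):
        error = R - P @ Q.T               # current reconstruction error
        P_step = error @ Q - beta * P     # negative gradient, including the L2 penalty
        Q_step = error.T @ P - beta * Q
        P += alpha * P_step
        Q += alpha * Q_step
    return P, Q

# Hypothetical 3 playlists x 4 songs matrix of rank 2
toy = np.array([[1., 1., 0., 0.],
                [1., 1., 0., 0.],
                [0., 0., 1., 1.]])
P, Q = factorize(toy)
approx = P @ Q.T  # product of the two learned matrices
print(np.round(approx, 2))
```

Because the toy matrix has rank 2 and `k=2`, the product of the two factors can get very close to the original, and the small penalty `beta` only shrinks the values slightly.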

In [9]:
mf = MF(one_hot_playlists, K=2, alpha=0.1, beta=0.01, iterations=100)
mf.train()

Iteration: 10 ; error = 2.1337
Iteration: 20 ; error = 1.1777
Iteration: 30 ; error = 0.7843
Iteration: 40 ; error = 0.5694
Iteration: 50 ; error = 0.4353
Iteration: 60 ; error = 0.3460
Iteration: 70 ; error = 0.2824
Iteration: 80 ; error = 0.2342
Iteration: 90 ; error = 0.1972
Iteration: 100 ; error = 0.1687


[(0, 15.137104681795547),
 (1, 8.395923543172806),
 (2, 5.905774329275563),
 (3, 4.589205610552773),
 (4, 3.7920265661266224),
 (5, 3.2604174176844274),
 (6, 2.8554104196691426),
 (7, 2.5697030001588086),
 (8, 2.332888275194447),
 (9, 2.1336808713877056),
 (10, 1.9775644845700677),
 (11, 1.8391702320970116),
 (12, 1.7228683193246555),
 (13, 1.620158779541546),
 (14, 1.5192618845276225),
 (15, 1.4358603341320344),
 (16, 1.364015615028337),
 (17, 1.295442136923394),
 (18, 1.2320106213865727),
 (19, 1.177677485889524),
 (20, 1.126242367235167),
 (21, 1.0779138481992756),
 (22, 1.0318510297308876),
 (23, 0.9830203439235599),
 (24, 0.9456917266809035),
 (25, 0.9125149857658436),
 (26, 0.8748441184023356),
 (27, 0.8421250134574092),
 (28, 0.8179708755395345),
 (29, 0.784317598906995),
 (30, 0.7565387196532855),
 (31, 0.7322976335835992),
 (32, 0.7052414052611392),
 (33, 0.6848554393884322),
 (34, 0.6712571454995647),
 (35, 0.6467243020136767),
 (36, 0.6207174500668287),
 (37, 0.6032783117007

As a result, we get a matrix whose values should be close to the original values, while unknown values, in our case songs that are not in the playlist, are approximated by the matrix factorization. For example, if we look at the same song from earlier, the value is close to 1. For a song that has not been added to the playlist, the value is now an estimate of how likely it is that the song should be added to the playlist.

In [10]:
predicted = mf.full_matrix()  # reconstruct the full matrix once and reuse it
print(predicted[0][unique_songs.index("Make Me (Cry)")])
print(predicted[0][unique_songs.index("Mr. Brightside")])

0.9961668579089007
1.000924550402289
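
One possible way to turn such approximated values into recommendations is to rank the songs that are not yet in a playlist by their score. The sketch below uses made-up rows and a hypothetical helper `recommend`; it is not necessarily how the evaluation notebook does it.

```python
import numpy as np

def recommend(predicted_row, known_row, song_names, top_n=2):
    """Return the top_n highest-scored songs that are not yet in the playlist."""
    # Songs already in the playlist get -inf so they can never be recommended
    scores = np.where(known_row == 1, -np.inf, predicted_row)
    best = np.argsort(scores)[::-1][:top_n]  # indices sorted by descending score
    return [song_names[i] for i in best]

toy_songs = ["Song A", "Song B", "Song C", "Song D"]
toy_known = np.array([1, 0, 0, 1])                   # hypothetical original playlist row
toy_predicted = np.array([0.99, 0.20, 0.85, 1.01])   # hypothetical reconstructed row
print(recommend(toy_predicted, toy_known, toy_songs))
```

Masking the known songs first keeps the ranking focused on new candidates, since songs already in the playlist tend to have scores near 1.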


We can store our matrix and use it in the [evaluation notebook](evaluation.ipynb) to make recommendations.

In [11]:
with open('data/matrix.npy', 'wb') as f:
    np.save(f, mf.full_matrix())

Save the track URIs and playlist names so that the columns and rows of the matrix can be mapped back to songs and playlists.

In [12]:
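# Map each track name to its URI; if a name occurs with several URIs, the last one wins
track_name_to_uri = dict(zip(df.track_name, df.track_uri))
unique_songs_uris = [track_name_to_uri[x] for x in unique_songs]

# One URI per line, in the same order as the matrix columns
with open('data/unique_songs_uris.txt', 'w') as f:
    f.write('\n'.join(unique_songs_uris))

# One playlist name per line, in the same order as the matrix rows
with open('data/playlists.txt', 'w', encoding="utf-8") as f:
    f.write('\n'.join(list(playlists.index)))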
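
Reading the stored files back could look roughly like the following sketch. It writes hypothetical stand-in data to a temporary directory so that it runs on its own; in the real setting the paths would be the files under `data/` written above.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-ins for the real matrix and playlist names
matrix = np.array([[0.9, 0.1], [0.2, 0.8]])
playlist_names = ["rock", "chill"]

with tempfile.TemporaryDirectory() as tmp:
    matrix_path = os.path.join(tmp, "matrix.npy")
    names_path = os.path.join(tmp, "playlists.txt")

    # Write the files the same way as in the cells above
    np.save(matrix_path, matrix)
    with open(names_path, "w", encoding="utf-8") as f:
        f.write("\n".join(playlist_names))

    # Read them back: row i of the matrix belongs to loaded_names[i]
    loaded_matrix = np.load(matrix_path)
    with open(names_path, encoding="utf-8") as f:
        loaded_names = f.read().split("\n")

print(loaded_matrix.shape, loaded_names)
```

Splitting on `"\n"` mirrors the `'\n'.join(...)` used for writing. A playlist name that itself contains a newline would break this one-name-per-line format.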