# Collaborative Filtering 

It is intuitive that people who listen to the same songs have similar tastes in music; this is the basic assumption of collaborative filtering. We built our collaborative filtering models on the Million Playlist Dataset (MPD), assuming that users with similar past behavior are likely to listen to the same songs in the future. We have three collaborative filtering models:

- Baseline Model
- Advanced Model
- Meta-Playlist Model

We used the [Spotlight library](https://maciejkula.github.io/spotlight/index.html) to build these models. The three models differ mainly in their training datasets.

## Baseline Model

DESCRIPTION + CODE

## Advanced Model

DESCRIPTION + CODE

## Meta-Playlist Model

To use as much data as possible, we created ‘meta-playlists’. We observed that the
MPD contains many sets of playlists with shared titles, the top five of which are ‘country’,
‘chill’, ‘rap’, ‘workout’, and ‘oldies’. We combined each set of playlists with the same title into a
meta-playlist. Each meta-playlist contains all the songs from its sub-playlists and keeps track of the
number of times each song appears across those sub-playlists.
We then fitted a collaborative filtering model on the meta-playlists. Compared with the previous model,
the resulting interaction matrix has fewer rows and more columns.
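
The merging step can be sketched as follows. This is a minimal illustration rather than the code we ran: the function name `build_meta_playlists`, the `(title, tracks)` input layout, and the lowercase/strip title normalization are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_meta_playlists(playlists):
    """Merge playlists that share a title and count song appearances.

    `playlists` is an iterable of (title, list_of_song_ids) pairs,
    a hypothetical layout chosen for this sketch.
    """
    meta = defaultdict(Counter)
    for title, tracks in playlists:
        key = title.strip().lower()  # assumed title normalization
        meta[key].update(tracks)     # add 1 for each appearance of a song
    return meta

playlists = [
    ("Chill", ["song_a", "song_b"]),
    ("chill", ["song_b", "song_c"]),
    ("workout", ["song_d"]),
]
meta = build_meta_playlists(playlists)

# Flatten into (playlist, song, rating) triples; rating = appearance count
triples = [(title, song, count)
           for title, counts in meta.items()
           for song, count in counts.items()]
```

Here `song_b` appears in both ‘chill’ playlists, so it ends up with a count of 2 in the ‘chill’ meta-playlist.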

Currently, we use the 100 most common titles as the titles of our 100 meta-playlists.
Together they cover 272,584 playlists from the MPD and 939,760 distinct songs, which gives us 4,620,153
(playlist, song, rating) triples. We treated each meta-playlist as one user whose music preference is
stated in its title, and the number of appearances of each song as that user's rating of the song.
The final model assigns a score to every song in the training dataset for every playlist
in the test dataset. For each test playlist, we can then generate the songs that the model
expects the playlist to score highly.
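
Picking the highest-scoring songs from a set of model scores can be sketched as below. The helper name `top_k` and the toy scores are hypothetical; the notebook code further down does the equivalent with `np.argsort`.

```python
import heapq

def top_k(scores, k):
    """Return the k (song, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda pair: pair[1])

# Toy scores, made up for illustration
scores = {"song_a": 0.2, "song_b": 1.5, "song_c": -0.3, "song_d": 0.9}
best = top_k(scores, 2)
```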

We built an implicit feedback model using matrix factorization. We fed the model
(playlist, song, rating) interactions; 80% of the data were used for
training and the remaining 20% for validation. We ran 10 epochs to get an initial idea of how this model performs.
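
As a rough picture of what a matrix factorization model computes: each playlist (user) and each song (item) gets a latent vector, and the predicted score is their dot product plus bias terms. The snippet below is a generic sketch of that idea with made-up vectors, not Spotlight's implementation; `predict_score` is a hypothetical helper.

```python
def predict_score(user_vec, item_vec, user_bias=0.0, item_bias=0.0):
    """Dot product of the latent vectors plus the two bias terms."""
    dot = sum(u * v for u, v in zip(user_vec, item_vec))
    return dot + user_bias + item_bias

# Made-up 3-dimensional embeddings for one meta-playlist and two songs
playlist_vec = [0.5, -1.0, 2.0]
songs = {
    "song_a": [1.0, 0.0, 1.0],
    "song_b": [0.0, 1.0, 0.0],
}

scores = {name: predict_score(playlist_vec, vec) for name, vec in songs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)  # best song first
```

Training adjusts the vectors so that songs a playlist actually contains score higher than songs it does not.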

In [None]:
import numpy as np

from spotlight.interactions import Interactions
from spotlight.cross_validation import random_train_test_split
from spotlight.datasets.movielens import get_movielens_dataset
from spotlight.evaluation import rmse_score
from spotlight.evaluation import mrr_score
from spotlight.evaluation import precision_recall_score
from spotlight.factorization.explicit import ExplicitFactorizationModel
from spotlight.factorization.implicit import ImplicitFactorizationModel

from enum import Enum

In [None]:
# Load data
import json
with open('CF_matrix_condensed.json', 'r') as f:
    data = json.load(f)
    
user = [u for u, i, r in data]
item = [i for u, i, r in data]
rating = [r for u, i, r in data]

In [None]:
# Create an Enum object to index the user names
User_id = Enum(value = 'User_id', names = list(set(user)))

In [None]:
# Replace the user name strings with integer indices specified in the Enum object
for i in range(len(user)):
    u = user[i]
    user[i] = User_id[u].value

In [None]:
# Replace the item id strings with integer indices (starting at 1) stored in a dict
item_id = {name: idx for idx, name in enumerate(set(item), start=1)}
item = [item_id[it] for it in item]

In [None]:
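# Train the full model with 10 epochs on an 80/20 train/validation split
# (n_iter=10 and test_percentage=0.2 are also Spotlight's defaults)
data = Interactions(user_ids=np.array(user), item_ids=np.array(item), ratings=np.array(rating))
train, test = random_train_test_split(data, test_percentage=0.2)
model_full = ImplicitFactorizationModel(n_iter=10)
model_full.fit(train, verbose=1)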


In [None]:
# Save the model
import torch
torch.save(model_full, 'meta_playlist_full')

In [None]:
# Load the validation data
with open('Val_X.json', 'r') as f:
    validation = json.load(f)

In [None]:
# Generate the score dictionaries
# Map integer item indices back to the original song id strings
reverse_item_id = {value: key for key, value in item_id.items()}

def top_500_dic(inp):
    """Return the 500 highest-scoring songs for a list of seed item indices."""
    dic = {}
    score = np.zeros(len(model_full.predict(1)))
    sum_w = 0
    for i in range(1, 101):  # user ids 1..100, one per meta-playlist
        # Weight each meta-playlist by how it scores the seed songs
        s = model_full.predict(i, np.array(inp))
        w = sum(s) / (np.linalg.norm(s) * np.sqrt(len(inp)))
        sum_w += w
        score = w * model_full.predict(i) + score
    score = (score / sum_w)[1:]  # drop the unused index 0; item ids start at 1
    for index in np.argsort(score)[-1:-501:-1]:
        dic[reverse_item_id[index + 1]] = score[index]  # map the song id string to its numeric score
    return dic

rec_playlist = []
for each_input in validation:
    input_id = []
    for it in each_input:
        try:
            input_id.append(item_id[it])
        except KeyError:  # Skip songs that are not in item_id
            pass
    if len(input_id) == 1:  # Model has to predict on more than one item
        input_id.append(input_id[0])
    if len(input_id) == 0:  # If no known input, fall back to items 1 and 2 as arbitrary seeds
        input_id.append(1)
        input_id.append(2)
    rec_playlist.append(top_500_dic(input_id))
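
The weighting inside `top_500_dic` can be read as follows: for each meta-playlist, the weight `w` is the cosine similarity between its scores on the seed songs and an all-ones vector, and the final score of each song is the weighted average of the per-meta-playlist scores. The standalone sketch below reproduces that arithmetic in plain Python on toy numbers; the helper names `blend_weight` and `blend` are hypothetical and no trained model is involved.

```python
import math

def blend_weight(seed_scores):
    """Cosine similarity between the seed-score vector and the all-ones vector."""
    norm = math.sqrt(sum(x * x for x in seed_scores))
    return sum(seed_scores) / (norm * math.sqrt(len(seed_scores)))

def blend(seed_scores_per_user, all_scores_per_user):
    """Weighted average of each user's full score vector."""
    weights = [blend_weight(s) for s in seed_scores_per_user]
    total = sum(weights)
    n_items = len(all_scores_per_user[0])
    return [
        sum(w * scores[j] for w, scores in zip(weights, all_scores_per_user)) / total
        for j in range(n_items)
    ]

# Two toy "meta-playlists": their scores on two seed songs and on all songs
seed_scores = [[1.0, 1.0], [2.0, 2.0]]
all_scores = [[0.0, 2.0], [2.0, 4.0]]
blended = blend(seed_scores, all_scores)
```

Both toy meta-playlists score the seeds uniformly, so both weights are 1 and the blend is a plain average. Note that the weights can be negative when a meta-playlist gives the seed songs negative scores, in which case the sum of weights used for normalization can get close to zero.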

In [None]:
# Output the results into a json file
with open('score_500_full.json', 'w') as outfile:
    json.dump(rec_playlist, outfile)