# Building a simple collaborative filtering recommender system

In this notebook, I explore recommender systems. The goal is to understand how a recommender system works and then to build a simple one.

## 1 Building a simple Recommender system

To start, I will build a simple collaborative filtering algorithm that recommends new items to a user by finding similar items or similar users.

### 1.1 Dataset

For testing, I decided to use the MovieLens movie dataset, which contains a list of movies, users, and ratings.

F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History
and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4,
Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872

Let's explore this dataset!

In [1]:
import pandas as pd

ratings_data = pd.read_csv("datasets/movies/ratings.dat", delimiter="::", engine='python', encoding='latin1', header=None, names=['UserID', 'MovieID', 'Rating', 'Timestamp'])

ratings_data.head()


| | UserID | MovieID | Rating | Timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |


This file contains the ratings. The first column is the user ID, the second is the movie ID, and the third is the rating $r$, an integer between 1 and 5: $$r \in \{ n \in \mathbb{Z} \mid 1 \leq n \leq 5 \}$$

We also have access to the timestamp of when the user published the rating.
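To make the timestamps readable, they can be converted to dates. This is a minimal sketch, assuming the values are seconds since the Unix epoch (UTC); the `sample` frame and the `RatedAt` column name are introduced here only for illustration.

```python
import pandas as pd

# Toy frame reusing the first two timestamps from the output above.
sample = pd.DataFrame({"Timestamp": [978300760, 978302109]})

# Assumption: timestamps are seconds since the Unix epoch (UTC).
sample["RatedAt"] = pd.to_datetime(sample["Timestamp"], unit="s")

print(sample)
```

The same call should work on `ratings_data["Timestamp"]` if the assumption about the unit holds.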

In [2]:
movies_data = pd.read_csv("datasets/movies/movies.dat", delimiter="::", engine='python', encoding='latin1', header=None, names=['MovieID', 'Title', 'Genres'])
movies_data.head()

| | MovieID | Title | Genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


The movies file uses the same format as the ratings file: columns are separated by "::", and each movie's genres are separated by "|". There is a third file, which we will come back to later.
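Because several genres are packed into one string, it can help to split them before doing any per-genre analysis. The sketch below uses a small hand-built frame with two rows copied from the output above; the `GenreList` column and the `genre_rows` name are illustrative, not part of the dataset.

```python
import pandas as pd

# Two rows copied from the movies output above.
movies = pd.DataFrame({
    "MovieID": [1, 3],
    "Title": ["Toy Story (1995)", "Grumpier Old Men (1995)"],
    "Genres": ["Animation|Children's|Comedy", "Comedy|Romance"],
})

# Split each genre string on the "|" separator into a Python list.
movies["GenreList"] = movies["Genres"].str.split("|")

# One row per (movie, genre) pair, convenient for grouping by genre.
genre_rows = movies.explode("GenreList")[["MovieID", "GenreList"]]

print(genre_rows)
```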

### 1.2 User-based collaborative filtering

Now that we have explored the dataset, the first implementation will be user-based, meaning that we will try to recommend a movie to a user by comparing users' ratings of the same movies.

To do that, I will use cosine similarity, and then retrieve each user's closest neighbors, i.e. the most similar users.
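As a reminder, the cosine similarity of two rating vectors $u$ and $v$ is $\frac{u \cdot v}{\|u\| \, \|v\|}$. Here is a small hand-rolled sketch with three made-up users (`alice`, `bob`, `carol` and their ratings are invented for illustration), just to show what the measure captures.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two rating vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Invented ratings for four movies (0 = no rating given).
alice = np.array([5.0, 3.0, 0.0, 1.0])
bob = np.array([4.0, 0.0, 0.0, 1.0])
carol = np.array([0.0, 0.0, 5.0, 0.0])

print(cosine(alice, bob))    # overlapping tastes: fairly close to 1
print(cosine(alice, carol))  # no movie in common: exactly 0
```

Since ratings are never negative, the similarity between two users always lies between 0 and 1.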

In [3]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# first let's create a matrix: each row represents a user, each column a movie,
# and each value is the rating the user gave to that movie (0 = no rating given)

ratings_matrix = ratings_data.pivot_table(index='UserID', columns='MovieID', values='Rating', fill_value=0)

ratings_matrix.head()

UserID \ MovieID,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
1,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Here we pivot the dataframe into a 2D array (matrix) in which each row is a user (IDs range from 1 to 6040) and each column is a movie (IDs range from 1 to 3952). Each value is the rating that user gave to that movie; if a user did not rate a film, the value defaults to 0.
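To see what the pivot does on something small enough to read, here is a sketch on an invented four-row ratings table (`toy` and `toy_matrix` are names made up for this example).

```python
import pandas as pd

# Invented ratings in the same long format as ratings_data.
toy = pd.DataFrame({
    "UserID": [1, 1, 2, 3],
    "MovieID": [10, 20, 10, 30],
    "Rating": [5, 3, 4, 2],
})

# Same call as above: users become rows, movies become columns,
# and missing (user, movie) pairs are filled with 0.
toy_matrix = toy.pivot_table(index="UserID", columns="MovieID",
                             values="Rating", fill_value=0)

print(toy_matrix)
```

Note that a movie only gets a column if at least one user rated it, so the number of columns can be smaller than the largest movie ID.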

In [4]:
# Now we compute the cosine similarity between users

user_similarity = cosine_similarity(ratings_matrix)

user_similarity

array([[1.        , 0.09638153, 0.12060981, ..., 0.        , 0.17460369,
        0.13359025],
       [0.09638153, 1.        , 0.1514786 , ..., 0.06611767, 0.0664575 ,
        0.21827563],
       [0.12060981, 0.1514786 , 1.        , ..., 0.12023352, 0.09467506,
        0.13314404],
       ...,
       [0.        , 0.06611767, 0.12023352, ..., 1.        , 0.16171426,
        0.09930008],
       [0.17460369, 0.0664575 , 0.09467506, ..., 0.16171426, 1.        ,
        0.22833237],
       [0.13359025, 0.21827563, 0.13314404, ..., 0.09930008, 0.22833237,
        1.        ]])

For each user, we have a score for every other user that represents how similar they are. The closer the cosine value is to 1, the more similar they are.

In [5]:
# for example, let's see how similar the user with ID 1 (row index 0) is to itself
user_similarity[0][0]

1.0

Indeed, if we compare a user with themselves, we can see that they are completely identical. Now that we have established this matrix, we can compute each user's closest neighbor(s).

In [6]:
import numpy as np

num_users = user_similarity.shape[0]  # Number of users
N = 5  # Number of nearest neighbors to find, excluding the user itself

nearest_neighbors = []

for user_idx in range(num_users):
    # Sort the other users' positions by similarity to the current user, in descending order
    sorted_user_similarities = np.argsort(-user_similarity[user_idx])
    
    # Skip position 0 (assumed to be the user itself, with similarity 1) and keep the next N
    top_user_neighbors = sorted_user_similarities[1:N+1]
    
    # Append the row indices of the top N neighbors (0-based positions, not UserIDs)
    nearest_neighbors.append(top_user_neighbors)

# Convert the list to a numpy array for easier handling
nearest_neighbors = np.array(nearest_neighbors)

# Now, `nearest_neighbors` contains the indices of the N nearest neighbors for each user

nearest_neighbors


array([[5342, 5189, 1480, 1282, 5704],
       [3107,   94, 2813, 4600, 2302],
       [2999,  478, 5690, 3499, 1903],
       ...,
       [2208, 3582, 4653,  757, 2798],
       [ 930, 3024, 4853, 1321, 4929],
       [1631, 1898, 2543, 3370,  587]], dtype=int64)

With the code above, we now know each user's five most similar users.
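The loop above relies on each user appearing first in their own sorted list. A possible variant, sketched here on an invented 3x3 similarity matrix, masks the diagonal explicitly and sorts all rows at once; `top_n_neighbors`, `toy_similarity` and `toy_user_ids` are hypothetical names, and the mapping back to IDs assumes the rows follow the order of the user index.

```python
import numpy as np

def top_n_neighbors(similarity, n):
    """Return, for each row, the positions of the n most similar other rows."""
    sim = np.array(similarity, dtype=float)  # copy, so the input is untouched
    np.fill_diagonal(sim, -np.inf)           # a user is never their own neighbor
    order = np.argsort(-sim, axis=1)         # descending similarity per row
    return order[:, :n]

# Invented similarity matrix for three users.
toy_similarity = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.5],
    [0.9, 0.5, 1.0],
])
toy_user_ids = np.array([1, 2, 3])  # hypothetical IDs, in row order

neighbor_positions = top_n_neighbors(toy_similarity, n=1)
neighbor_ids = toy_user_ids[neighbor_positions]  # positions -> user IDs

print(neighbor_positions)
print(neighbor_ids)
```

With the real data, something like `ratings_matrix.index.to_numpy()` could play the role of `toy_user_ids`.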

Now, let's predict how a user might rate an item based on how similar they are to other users.

In [7]:
def predict_ratings(ratings, similarity):
    weighted_ratings_sum = similarity @ ratings

    sum_of_similarities = np.abs(similarity).sum(axis=1)

    predicted_ratings = weighted_ratings_sum / sum_of_similarities[:, np.newaxis]

    return predicted_ratings


On the first line, we multiply the similarity matrix, which contains how similar each user is to every other user, by the matrix containing how every user rated every movie. On the second line, we compute the sum of similarities for each user, and on the third line we divide the weighted ratings by that sum, which normalizes the results. Finally, we return a matrix containing, for each user, a prediction for every item (movie). These predictions indicate how much the user might be interested in each movie.
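In formula form, the function computes $\hat{r}_{u,i} = \frac{\sum_v s_{u,v} \, r_{v,i}}{\sum_v |s_{u,v}|}$, where $s_{u,v}$ is the similarity between users $u$ and $v$ and $r_{v,i}$ is the rating of user $v$ for movie $i$. The sketch below repeats the function so that it runs on its own and applies it to an invented example with two users and two movies.

```python
import numpy as np

def predict_ratings(ratings, similarity):
    # Same computation as in the cell above.
    weighted_ratings_sum = similarity @ ratings
    sum_of_similarities = np.abs(similarity).sum(axis=1)
    return weighted_ratings_sum / sum_of_similarities[:, np.newaxis]

# Invented data: rows are users, columns are movies (0 = no rating given).
toy_ratings = np.array([
    [5.0, 0.0],
    [4.0, 2.0],
])
toy_similarity = np.array([
    [1.0, 0.5],
    [0.5, 1.0],
])

pred = predict_ratings(toy_ratings, toy_similarity)
print(pred)
```

For the first user and second movie, the score is $(1 \cdot 0 + 0.5 \cdot 2) / 1.5 \approx 0.67$, even though the only user who rated that movie gave it 2. Missing ratings are counted as zeros, so the values are better read as interest scores than as ratings on the 1 to 5 scale.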


In [10]:
items_predictions = predict_ratings(ratings_matrix, user_similarity)

items_predictions

MovieID,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
0,2.054946,0.504251,0.290642,0.098829,0.193052,0.677964,0.340108,0.051739,0.046035,0.620867,...,0.055924,0.003251,0.014833,0.036058,0.027046,0.605842,0.242890,0.040780,0.025325,0.271867
1,1.759311,0.497478,0.303615,0.101957,0.195575,0.891571,0.348699,0.041623,0.071578,0.795444,...,0.055026,0.002810,0.010542,0.045111,0.035651,0.604696,0.249614,0.043125,0.023568,0.294951
2,1.826247,0.503666,0.290386,0.087227,0.183043,0.779166,0.311113,0.041655,0.061570,0.746142,...,0.051699,0.002535,0.013342,0.041037,0.032039,0.585343,0.229307,0.038662,0.019378,0.249233
3,1.609167,0.462226,0.234894,0.068619,0.140951,0.825224,0.245626,0.036128,0.061268,0.716728,...,0.046308,0.002160,0.010942,0.036974,0.030079,0.515848,0.216983,0.037398,0.018002,0.233062
4,1.785284,0.444852,0.279900,0.110243,0.178421,0.900822,0.319379,0.038977,0.053720,0.662400,...,0.073144,0.003563,0.010534,0.044786,0.036399,0.669509,0.328265,0.050475,0.034345,0.340771
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6035,1.843988,0.520496,0.305181,0.113655,0.198643,0.870890,0.360357,0.045046,0.062374,0.722687,...,0.068391,0.003646,0.012192,0.044550,0.042927,0.630655,0.303603,0.052206,0.031766,0.320471
6036,1.700594,0.425311,0.245199,0.083872,0.152715,0.780486,0.288387,0.036849,0.047511,0.602439,...,0.061162,0.003087,0.009339,0.037237,0.043136,0.561067,0.266720,0.046884,0.027001,0.291260
6037,1.905858,0.457500,0.281252,0.096784,0.179220,0.681179,0.361495,0.040764,0.044078,0.619567,...,0.056699,0.003486,0.012451,0.034242,0.031968,0.582531,0.243922,0.043912,0.023285,0.269543
6038,1.841318,0.478597,0.273621,0.092350,0.177308,0.647794,0.348992,0.046448,0.041651,0.594751,...,0.063108,0.004631,0.012863,0.033035,0.046230,0.541132,0.243713,0.045698,0.026104,0.266003


There we go! For each user (row), we now have a prediction of how interested they would be in each item (column). Note that we have not excluded the items they have already seen. For example, the score in the first row and first column is comparatively high because that user is similar to themselves and rated this movie 5/5 stars.
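One possible way to exclude already-seen items is to mask them before ranking. This is only a sketch on invented numbers; `recommend_unseen` and the toy arrays are hypothetical, and the function returns column positions that would still need to be mapped back to movie IDs.

```python
import numpy as np

def recommend_unseen(predictions, ratings, k):
    """Return, for each user, the column positions of the k best unseen items."""
    scores = np.array(predictions, dtype=float)  # copy of the predicted scores
    scores[np.asarray(ratings) > 0] = -np.inf    # hide items the user already rated
    order = np.argsort(-scores, axis=1)          # best score first
    # Caveat: if a user has fewer than k unseen items, seen ones fill the tail.
    return order[:, :k]

# Invented predictions and ratings for two users and three movies.
toy_predictions = np.array([
    [4.7, 0.7, 1.2],
    [4.3, 1.3, 0.4],
])
toy_ratings = np.array([
    [5, 0, 0],
    [4, 2, 0],
])

top_items = recommend_unseen(toy_predictions, toy_ratings, k=1)
print(top_items)
```

With the real data, `ratings_matrix.columns` could be indexed with these positions to recover the corresponding `MovieID` values.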