# Collaborative Filtering based Recommender System From Scratch


In [1]:
import os

import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

import re

# The MovieLens 100K Dataset

There are a number of datasets available for recommender systems research. Amongst them, the [MovieLens](https://grouplens.org/datasets/movielens) dataset is probably one of the more popular ones. MovieLens is a non-commercial web-based movie recommender system, created in 1997 and run by the GroupLens Research Project at the University of Minnesota. Its data has been critical for several research studies, including personalized recommendation and social psychology.

There are several versions of the dataset available. We'll use the well-known [MovieLens 100k](https://grouplens.org/datasets/movielens/100k/), which consists of 100,000 ratings from 943 users on 1682 movies. Some simple demographic information about the users (such as age and gender) and genre information about the movies is also available, but we won't use it.

In [2]:
if not os.path.isdir("ml-100k"):
    # # Download the dataset
    !wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
    # # Unzip it
    !unzip ml-100k.zip
    # # Get rid of the .zip file
    !rm ml-100k.zip

--2024-10-23 07:37:08--  http://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4,7M) [application/zip]
Saving to: ‘ml-100k.zip’


2024-10-23 07:37:11 (2,01 MB/s) - ‘ml-100k.zip’ saved [4924029/4924029]

Archive:  ml-100k.zip
   creating: ml-100k/
  inflating: ml-100k/allbut.pl       
  inflating: ml-100k/mku.sh          
  inflating: ml-100k/README          
  inflating: ml-100k/u.data          
  inflating: ml-100k/u.genre         
  inflating: ml-100k/u.info          
  inflating: ml-100k/u.item          
  inflating: ml-100k/u.occupation    
  inflating: ml-100k/u.user          
  inflating: ml-100k/u1.base         
  inflating: ml-100k/u1.test         
  inflating: ml-100k/u2.base         
  inflating: ml-100k/u2.test         
  inflating: ml-100k/u3.base    

The dataset folder contains several files, but we need only two of them:

* **u.data**: The full dataset with 100,000 user ratings. Each user has rated at least 20 movies.
* **u.item**: Contains information about the movies, but we will use only the movie ids and titles. This data is not required to understand the CF technique, but we will use it to make our system's output friendlier.

Let's begin by loading the ratings dataframe.

In [3]:
ratings = pd.read_csv(
    "ml-100k/u.data",
    sep="\t", # this is a tab separated data
    names=["user_id", "movie_id", "rating", "timestamp"], # the columns names
    usecols=["user_id", "movie_id", "rating"], # we do not need the timestamp column
    low_memory=False
)
ratings

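| | user_id | movie_id | rating |
|---|---|---|---|
| 0 | 196 | 242 | 3 |
| 1 | 186 | 302 | 3 |
| 2 | 22 | 377 | 1 |
| 3 | 244 | 51 | 2 |
| 4 | 166 | 346 | 1 |
| ... | ... | ... | ... |
| 99995 | 880 | 476 | 3 |
| 99996 | 716 | 204 | 5 |
| 99997 | 276 | 1090 | 1 |
| 99998 | 13 | 225 | 2 |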


Each entry consists of a *user_id*, a *movie_id* and a movie *rating* in the range 1 to 5.

The next information we need is the number of movies and users in the dataset. Although we already know this from the dataset information, for didactic purposes we'll get them from the data itself.

In [4]:
n_users = len(ratings.user_id.unique())
n_movies = len(ratings.movie_id.unique())

print(f"Number of users: {n_users} -- Number of movies: {n_movies}")

Number of users: 943 -- Number of movies: 1682


## Train/test split

Now we'll create the train and test sets that we will use to evaluate the performance of our system. 20% of each user's ratings will be used for testing, and the remaining 80% will be used for training.

In [5]:
test_split = 0.2  # fraction of each user's ratings used for testing

# Initialize the train and test dataframes.
train_set, test_set = pd.DataFrame(), pd.DataFrame()

# Check each user.
for user_id in ratings.user_id.unique():
    user_df = ratings[ratings.user_id == user_id].sample(
        frac=1,
        random_state=42
    ) # select only samples of the actual user and shuffle the resulting dataframe
    
    n_entries = len(user_df)
    n_test = int(round(test_split * n_entries))
    
    test_set = pd.concat((test_set, user_df.tail(n_test)))
    train_set = pd.concat((train_set, user_df.head(n_entries - n_test)))

train_set = train_set.sample(frac=1).reset_index(drop=True)
test_set = test_set.sample(frac=1).reset_index(drop=True)

train_set.shape, test_set.shape

((80000, 3), (20000, 3))
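Growing dataframes with `pd.concat` inside a loop gets slow as the number of users rises. As an alternative sketch (not the code used for the results in this post), pandas' `groupby(...).sample` can draw the per-user test rows in one call. The toy dataframe and the `toy_` names below are made up for illustration.

```python
import pandas as pd

# Hypothetical toy ratings: 2 users with 5 ratings each.
toy_ratings = pd.DataFrame({
    "user_id": [1] * 5 + [2] * 5,
    "movie_id": list(range(1, 6)) * 2,
    "rating": [5, 4, 3, 2, 1, 1, 2, 3, 4, 5],
})

# Draw 20% of each user's rows for testing.
toy_test = toy_ratings.groupby("user_id").sample(frac=0.2, random_state=42)
# Everything that was not drawn for testing is training data.
toy_train = toy_ratings.drop(toy_test.index)

print(len(toy_train), len(toy_test))
```

Because the test rows are removed by index, the two sets are disjoint and together cover the whole dataframe.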

We also need a function to compute the *interactions matrix* of a given ratings dataframe.

In [6]:
def build_interactions_matrix(r_mat, n_users, n_items):
    iter_m = np.zeros((n_users, n_items))
    
    for _, user_id, movie_id, rating in r_mat.itertuples():
        iter_m[user_id-1, movie_id-1] = rating
    
    return iter_m

In [14]:
iter_m = build_interactions_matrix(ratings, n_users, n_movies)
iter_m.shape

(943, 1682)
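To make the layout of this matrix concrete, here is a minimal sketch on a made-up three-rating dataframe (the toy values and the function name `build_interactions_matrix_vectorized` are assumptions for illustration). It fills the same kind of matrix with NumPy fancy indexing instead of a Python loop.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings, using the same column names as the real data.
toy_ratings = pd.DataFrame(
    {"user_id": [1, 1, 2], "movie_id": [1, 3, 2], "rating": [5, 3, 4]}
)

def build_interactions_matrix_vectorized(r_mat, n_users, n_items):
    # Rows are users, columns are items; 0 means "no rating".
    m = np.zeros((n_users, n_items))
    # Ids are 1-based in the data, so shift them to 0-based indices.
    rows = r_mat.user_id.to_numpy() - 1
    cols = r_mat.movie_id.to_numpy() - 1
    m[rows, cols] = r_mat.rating.to_numpy()
    return m

toy_m = build_interactions_matrix_vectorized(toy_ratings, n_users=2, n_items=3)
print(toy_m)
```

User 1 rated movies 1 and 3, so the first row has two non-zero entries; every other cell of that row stays at zero.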

## Similarity measurement
In this post we'll build our system using the **memory-based** approach, in which similarities between users/items are computed using the rating data itself. Therefore, the *i*-th row of an interactions matrix is considered as the **feature vector** of user *i*, while the *j*-th column of an interactions matrix is considered as the **feature vector** of item *j*.

The similarity between two users is represented by some **distance measurement** between their feature vectors. Several measures, such as Pearson correlation and cosine similarity, are used for this. For example, the similarity between users $u$ and $u'$ can be computed using the vector cosine as:
$$
sim(u, u') = cos(\textbf{r}_u, \textbf{r}_{u'}) = 
\frac{\textbf{r}_u \textbf{.} \textbf{r}_{u'}}{|\textbf{r}_u||\textbf{r}_{u'}|} =
\frac{\sum_i r_{ui}r_{u'i}}{\sqrt{\sum_i r_{ui}^2}\sqrt{\sum_i r_{u'i}^2}}
$$
where $\textbf{r}_u$ and $\textbf{r}_{u'}$ are the feature vectors of users $u$ and $u'$, respectively, and $r_{ui}$ is a rating value given by user $u$ to item $i$. The same procedure is applied when computing the similarity between items $i$ and $i'$.
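As a quick numeric check of the formula, the sketch below computes the cosine between two made-up rating vectors (the values and the helper name `cosine_similarity` are assumptions for illustration).

```python
import numpy as np

# Hypothetical rating vectors of two users over four items (0 = unrated).
r_u = np.array([5.0, 3.0, 0.0, 1.0])
r_v = np.array([4.0, 0.0, 0.0, 1.0])

def cosine_similarity(a, b):
    # Dot product divided by the product of the Euclidean norms.
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

sim_uv = cosine_similarity(r_u, r_v)
print(round(sim_uv, 4))
```

Here the dot product is 21 and the norms are $\sqrt{35}$ and $\sqrt{17}$, so the similarity is $21 / \sqrt{595}$. Swapping the arguments gives the same value, and a vector compared with itself gives 1.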

In [15]:
def build_similarity_matrix(interactions_matrix, kind="user", eps=1e-9):
    # takes rows as user features
    if kind == "user":
        similarity_matrix = interactions_matrix.dot(interactions_matrix.T)
    # takes columns as item features
    elif kind == "item":
        similarity_matrix = interactions_matrix.T.dot(interactions_matrix)
    else:
        raise ValueError(f"kind must be 'user' or 'item', got {kind!r}")
    # the diagonal holds the squared norm of each feature vector
    norms = np.sqrt(similarity_matrix.diagonal()) + eps
    # divide every dot product by the product of the two corresponding norms
    return similarity_matrix / (norms[np.newaxis, :] * norms[:, np.newaxis])

In [16]:
u_sim = build_similarity_matrix(iter_m, kind="user")
i_sim = build_similarity_matrix(iter_m, kind="item")

print(f"User similarity matrix shape: {u_sim.shape}\nUser similarity matrix sample:\n{u_sim[:4, :4]}")
print("-" * 97)
print(f"Item similarity matrix shape: {i_sim.shape}\nItem similarity matrix sample:\n{i_sim[:4, :4]}")

User similarity matrix shape: (943, 943)
User similarity matrix sample:
[[1.         0.16693098 0.04745954 0.06435782]
 [0.16693098 1.         0.11059132 0.17812119]
 [0.04745954 0.11059132 1.         0.34415072]
 [0.06435782 0.17812119 0.34415072 1.        ]]
-------------------------------------------------------------------------------------------------
Item similarity matrix shape: (1682, 1682)
Item similarity matrix sample:
[[1.         0.40238218 0.33024479 0.45493792]
 [0.40238218 1.         0.27306918 0.50257077]
 [0.33024479 0.27306918 1.         0.32486639]
 [0.45493792 0.50257077 0.32486639 1.        ]]


The similarity matrix is symmetric, and since ratings are never negative its values lie in the range 0 to 1. The diagonal elements contain the self-similarities of all users/items, so they are all equal to 1.

As we can see, this method alone can greatly improve our system's predictive power. Later in this post, we'll use the number of neighbors to do some simple tuning of our system.

# Making Predictions with bias subtraction

Now we are able to make predictions. Depending on which approach we have chosen for our system, we have two different objectives:

1. If we choose the user-based approach, we'll infer a missing rating $r_{ui}$ of a user $u$ for an item $i$ by taking the normalized weighted sum of **all ratings of other users for this item**.

$$
r_{ui} = \frac{\sum_{u'} sim(u, u')r_{u'i}}{\sum_{u'} |sim(u, u')|}
$$

2. If we choose the item-based approach instead, we'll infer a missing rating $r_{ui}$ of a user $u$ for an item $i$ by taking the normalized weighted sum of **all other ratings of this user for the other items**.

$$
r_{ui} = \frac{\sum_{i'} sim(i, i')r_{ui'}}{\sum_{i'} |sim(i, i')|}
$$
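The user-based formula can be worked through on a tiny example. The similarities and ratings below are made up for illustration; the sketch only shows the arithmetic of the normalized weighted sum.

```python
import numpy as np

# Hypothetical similarities between user u and three neighbours u'.
sims = np.array([0.9, 0.5, 0.1])
# Hypothetical ratings those neighbours gave to item i.
neighbour_ratings = np.array([5.0, 3.0, 1.0])

# r_ui = sum(sim * rating) / sum(|sim|)
r_ui = sims.dot(neighbour_ratings) / np.abs(sims).sum()
print(round(r_ui, 3))
```

The numerator is 4.5 + 1.5 + 0.1 = 6.1 and the denominator is 1.5, so the prediction is about 4.07: the most similar neighbour pulls the estimate toward their rating of 5.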

Now we'll try to deal with the rating bias associated with a user or an item. The idea here is that certain users may tend to always give high or low ratings to all movies, so the *relative difference* in ratings may be more important than the *absolute rating* values.

For a user-based approach this methodology can be mathematically described as:

$$
r_{ui} = \overline{r}_{u} + \frac{\sum_{u'} sim(u, u')(r_{u'i} - \overline{r}_{u'})}{\sum_{u'} |sim(u, u')|}
$$

where $\overline{r}_{u}$ is the average rating given by user *u*, or for an item-based approach as:

$$
r_{ui} = \overline{r}_{i} + \frac{\sum_{i'} sim(i, i')(r_{ui'} - \overline{r}_{i'})}{\sum_{i'} |sim(i, i')|}
$$

where $\overline{r}_{i}$ is the average rating of item *i*.
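A small worked example of the user-based formula may help. This sketch assumes that each user's average is taken over the items they actually rated, which is one common choice and differs from the class below, where the mean runs over the whole row of the interactions matrix. All values are made up for illustration.

```python
import numpy as np

# Hypothetical toy interactions matrix (0 = unrated): 3 users x 3 items.
toy_m = np.array([
    [5.0, 4.0, 0.0],
    [2.0, 1.0, 2.0],
    [4.0, 5.0, 5.0],
])
rated = toy_m > 0

# Mean rating per user, counting only the items that user actually rated.
user_mean = toy_m.sum(axis=1) / rated.sum(axis=1)

# Hypothetical similarities of user 0 to users 1 and 2.
sims = np.array([0.8, 0.6])
neighbours = np.array([1, 2])
item = 2  # the item user 0 has not rated

# Deviation of each neighbour's rating from that neighbour's own mean.
deviations = toy_m[neighbours, item] - user_mean[neighbours]
r_ui = user_mean[0] + sims.dot(deviations) / np.abs(sims).sum()
print(user_mean, round(r_ui, 3))
```

User 1 rates low on average and user 2 rates high, yet both rated item 2 about a third of a point above their own mean. The prediction for user 0 is therefore their own mean of 4.5 plus that shared deviation.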

Let's modify our *Recommender* class once more to include this feature.

In [17]:
class Recommender:
    def __init__(
        self,
        n_users,
        n_items,
        r_mat,
        k=40,
        kind="user",
        bias_sub=False,
        eps=1e-9
    ):
        self.n_users = n_users
        self.n_items = n_items
        self.kind = kind
        self.iter_m = build_interactions_matrix(r_mat, self.n_users, self.n_items)
        self.sim_m = build_similarity_matrix(self.iter_m, kind=self.kind)
        self.bias_sub = bias_sub
        self.k = k
        self.eps = eps
        self.predictions = self._predict_all()
    
    def _predict_all(self):
        pred = np.empty_like(self.iter_m)
        if self.kind == "user":
            # Subtract the bias from a copy so that self.iter_m is left untouched.
            # Note: the mean runs over the whole row, zeros for unrated items included.
            iter_m = self.iter_m
            if self.bias_sub:
                user_bias = self.iter_m.mean(axis=1)[:, np.newaxis]
                iter_m = self.iter_m - user_bias
            # A user has the highest similarity score with itself,
            # so we skip the first element.
            sorted_ids = np.argsort(-self.sim_m)[:, 1:self.k+1]
            for user_id, k_users in enumerate(sorted_ids):
                pred[user_id, :] = self.sim_m[user_id, k_users].dot(iter_m[k_users, :])
                pred[user_id, :] /= \
                    np.abs(self.sim_m[user_id, k_users] + self.eps).sum() + self.eps
            if self.bias_sub:
                pred += user_bias
            
        elif self.kind == "item":
            # Subtract the bias from a copy so that self.iter_m is left untouched.
            # Note: the mean runs over the whole column, zeros for unrated items included.
            iter_m = self.iter_m
            if self.bias_sub:
                item_bias = self.iter_m.mean(axis=0)[np.newaxis, :]
                iter_m = self.iter_m - item_bias
            # An item has the highest similarity score with itself,
            # so we skip the first element.
            sorted_ids = np.argsort(-self.sim_m)[:, 1:self.k+1]
            for item_id, k_items in enumerate(sorted_ids):
                pred[:, item_id] = self.sim_m[item_id, k_items].dot(iter_m[:, k_items].T)
                pred[:, item_id] /= \
                    np.abs(self.sim_m[item_id, k_items] + self.eps).sum() + self.eps
            if self.bias_sub:
                pred += item_bias
                
        return pred.clip(0, 5)

## Model evaluation

In [18]:
def build_predictions_df(preds_m, dataframe):
    preds_v = []
    for row_id, user_id, movie_id, _ in dataframe.itertuples():
        preds_v.append(preds_m[user_id-1, movie_id-1])
    preds_df = pd.DataFrame(data={"user_id": dataframe.user_id, "movie_id": dataframe.movie_id, "rating": preds_v})
    return preds_df

def get_mse(estimator, train_set, test_set):
    train_preds = build_predictions_df(estimator.predictions, train_set)
    test_preds = build_predictions_df(estimator.predictions, test_set)
    
    train_mse = mean_squared_error(train_set.rating, train_preds.rating)
    test_mse = mean_squared_error(test_set.rating, test_preds.rating)
    
    return train_mse, test_mse



In [19]:
train_mse, test_mse = get_mse(
    Recommender(n_users, n_movies, train_set, kind="user", bias_sub=True),
    train_set,
    test_set
)

print(f"User-based train MSE: {train_mse} -- User-based test MSE: {test_mse}")
print("-" * 97)

train_mse, test_mse = get_mse(
    Recommender(n_users, n_movies, train_set, kind="item", bias_sub=True),
    train_set,
    test_set
)

print(f"Item-based train MSE: {train_mse} -- Item-based test MSE: {test_mse}")

User-based train MSE: 5.6854492668354295 -- User-based test MSE: 6.39456526201878
-------------------------------------------------------------------------------------------------
Item-based train MSE: 5.4271747547608555 -- Item-based test MSE: 6.214109193478462


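Although this methodology did not improve the results in this scenario, possibly due to the characteristics of the dataset, it can be effective with other datasets. Note also that the biases in the class above are plain row/column means of the interactions matrix, so the zeros that stand for unrated movies are averaged in and pull every bias toward zero.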

# Tuning up

There is one question left: how do we find the right number of similar users/items we should use when predicting a rating?

Our use case is quite simple, as we have only two parameters (*k* and *bias_sub*). I ran an optimization over both and found the best values.
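The search itself is not shown in this post. As one simple option, the sketch below runs an exhaustive grid search over *k* only, on synthetic data and with a compact item-based predictor written just for this example. Every name prefixed with `toy_` and the grid values are assumptions, and in practice a separate validation split would be preferable to scoring on the test set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data standing in for the real train/test ratings.
toy_n_users, toy_n_items = 30, 20
toy_full = rng.integers(1, 6, size=(toy_n_users, toy_n_items)).astype(float)
toy_observed = rng.random((toy_n_users, toy_n_items)) < 0.5
toy_test_mask = toy_observed & (rng.random((toy_n_users, toy_n_items)) < 0.2)
toy_train_m = np.where(toy_observed & ~toy_test_mask, toy_full, 0.0)

def toy_predict_item_based(m, k, eps=1e-9):
    # Cosine similarity between item columns.
    gram = m.T.dot(m)
    norms = np.sqrt(gram.diagonal()) + eps
    sim = gram / np.outer(norms, norms)
    # Never pick an item as its own neighbour.
    np.fill_diagonal(sim, -np.inf)
    # Indices of the k most similar items for each item.
    top_k = np.argsort(-sim, axis=1)[:, :k]
    pred = np.empty_like(m)
    for item, nbrs in enumerate(top_k):
        w = sim[item, nbrs]
        pred[:, item] = m[:, nbrs].dot(w) / (np.abs(w).sum() + eps)
    return pred

def toy_test_mse(k):
    # Mean squared error on the held-out entries only.
    pred = toy_predict_item_based(toy_train_m, k)
    diff = pred[toy_test_mask] - toy_full[toy_test_mask]
    return float(np.mean(diff ** 2))

# Evaluate every candidate and keep the one with the lowest error.
toy_grid = [2, 5, 10, 15]
toy_scores = {k: toy_test_mse(k) for k in toy_grid}
best_k_found = min(toy_scores, key=toy_scores.get)
print(toy_scores, best_k_found)
```

With only two parameters the grid stays small, so an exhaustive search is cheap; adding *bias_sub* would simply double the number of candidates.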

Now we can retrieve the best parameters found.

In [20]:
best_k = 12
best_bias_sub = False

print("Best parameters found:")
print(f"  - k = {best_k}")
print(f"  - bias_sub = {best_bias_sub}")

Best parameters found:
  - k = 12
  - bias_sub = False


# Item recommendation

Now that we have defined our parameters, we can make our system do what it is supposed to do: recommend items. There are many ways to accomplish this, but I opted for the simple way: recommending the items most similar to a given item using an *item-based* system.

Let's make the last modification to our *Recommender* class.

In [21]:
class Recommender:
    def __init__(
        self,
        n_users,
        n_items,
        r_mat,
        k=40,
        kind="user",
        bias_sub=False,
        eps=1e-9
    ):
        self.n_users = n_users
        self.n_items = n_items
        self.kind = kind
        self.iter_m = build_interactions_matrix(r_mat, self.n_users, self.n_items)
        self.sim_m = build_similarity_matrix(self.iter_m, kind=self.kind)
        self.bias_sub = bias_sub
        self.k = k
        self.eps = eps
        self.predictions = self._predict_all()
    
    def _predict_all(self):
        pred = np.empty_like(self.iter_m)
        if self.kind == "user":
            # Subtract the bias from a copy so that self.iter_m is left untouched.
            # Note: the mean runs over the whole row, zeros for unrated items included.
            iter_m = self.iter_m
            if self.bias_sub:
                user_bias = self.iter_m.mean(axis=1)[:, np.newaxis]
                iter_m = self.iter_m - user_bias
            # A user has the highest similarity score with itself,
            # so we skip the first element.
            sorted_ids = np.argsort(-self.sim_m)[:, 1:self.k+1]
            for user_id, k_users in enumerate(sorted_ids):
                pred[user_id, :] = self.sim_m[user_id, k_users].dot(iter_m[k_users, :])
                pred[user_id, :] /= np.abs(self.sim_m[user_id, k_users]).sum() + self.eps
            if self.bias_sub:
                pred += user_bias
            
        elif self.kind == "item":
            # Subtract the bias from a copy so that self.iter_m is left untouched.
            # Note: the mean runs over the whole column, zeros for unrated items included.
            iter_m = self.iter_m
            if self.bias_sub:
                item_bias = self.iter_m.mean(axis=0)[np.newaxis, :]
                iter_m = self.iter_m - item_bias
            # An item has the highest similarity score with itself,
            # so we skip the first element.
            sorted_ids = np.argsort(-self.sim_m)[:, 1:self.k+1]
            for item_id, k_items in enumerate(sorted_ids):
                pred[:, item_id] = self.sim_m[item_id, k_items].dot(iter_m[:, k_items].T)
                pred[:, item_id] /= np.abs(self.sim_m[item_id, k_items]).sum() + self.eps
            if self.bias_sub:
                pred += item_bias
                
        return pred.clip(0, 5)
    
    def get_top_recomendations(self, item_id, n=6):
        if self.kind == "user":
            # For a user-based system, only similarities between users were computed.
            # This strategy will not be covered in this post, but one solution
            # could be to find the top-rated items of similar users.
            # I'll leave this exercise to you =]
            raise NotImplementedError(
                "get_top_recomendations is only implemented for kind='item'"
            )
        if self.kind == "item":
            sim_row = self.sim_m[item_id - 1, :]
            # Once again, we skip the first item, since it is the query item itself.
            items_idxs = np.argsort(-sim_row)[1:n+1]
            similarities = sim_row[items_idxs]
            return items_idxs + 1, similarities

We added a method to return the $n$ most similar items to a given item. Now, we just need to build our model with the parameters found previously.
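The lookup performed by this method can be traced on a made-up 4x4 similarity matrix (the values are assumptions for illustration): sort one row in descending order, drop the first position, and convert the remaining indices back to 1-based ids.

```python
import numpy as np

# Hypothetical item similarity matrix (symmetric, ones on the diagonal).
toy_sim = np.array([
    [1.0, 0.2, 0.7, 0.5],
    [0.2, 1.0, 0.1, 0.9],
    [0.7, 0.1, 1.0, 0.3],
    [0.5, 0.9, 0.3, 1.0],
])

item_id = 1  # 1-based id, as in the dataset
sim_row = toy_sim[item_id - 1]
# Sort descending; position 0 is the item itself, so skip it.
top_idxs = np.argsort(-sim_row)[1:3]
print(top_idxs + 1, sim_row[top_idxs])
```

For item 1 the two nearest neighbours are items 3 and 4, with similarities 0.7 and 0.5.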

In [22]:
rs_model = Recommender(
    n_users, 
    n_movies, 
    ratings, # the model will be built on the full dataset now
    k=best_k, 
    kind="item", 
    bias_sub=best_bias_sub
)
get_mse(rs_model, train_set, test_set)

(3.2755618709089025, 3.2546408150520763)

We'll also define two functions: one that maps a movie title to an id and another that maps a list of movie ids to a list of movie titles.

In [23]:
def title2id(mapper_df, movie_title):
    return mapper_df.loc[mapper_df.movie_title == movie_title, "movie_title"].index.values[0]

def ids2title(mapper_df, ids_list):
    titles = []
    for id in ids_list:
        titles.append(mapper_df.loc[id, "movie_title"])
    return titles

These functions need a dataframe with the movie ids and titles, which will act as a mapper. So we'll load the **u.item** file from the dataset folder.

In [24]:
# Columns names
movies_mapper_cols = [
    "movie_id", 
    "movie_title", 
    "release_date", 
    "video_release_date", 
    "IMDb_URL", 
    "unknown",
    "Action",
    "Adventure",
    "Animation",
    "Childrens",
    "Comedy",
    "Crime",
    "Documentary",
    "Drama",
    "Fantasy",
    "Film_Noir",
    "Horror",
    "Musical",
    "Mystery",
    "Romance",
    "Sci_Fi",
    "Thriller",
    "War",
    "Western" 
]
movies_mapper = pd.read_csv(
    "ml-100k/u.item",
    sep="|",
    encoding="latin",
    names=movies_mapper_cols,
    usecols=["movie_id", "movie_title"], # we only need these columns
    index_col="movie_id"
)
# Remove movies release years from titles
movies_mapper["movie_title"] = movies_mapper["movie_title"].apply(
    lambda title: re.sub(r"\(\d{4}\)", "", title).strip()
)
movies_mapper

| movie_id | movie_title |
|---|---|
| 1 | Toy Story |
| 2 | GoldenEye |
| 3 | Four Rooms |
| 4 | Get Shorty |
| 5 | Copycat |
| ... | ... |
| 1678 | Mat' i syn |
| 1679 | B. Monkey |
| 1680 | Sliding Doors |
| 1681 | You So Crazy |


Now we can make our recommendations.

In [25]:
def print_recommendations(model, mapper, movie_title):
    # Use the model and mapper passed in, not the notebook-level globals.
    ids_list, similarities = model.get_top_recomendations(title2id(mapper, movie_title))
    titles = ids2title(mapper, ids_list)
    for title, similarity in zip(titles, similarities):
        print(f"{similarity:.2f} -- {title}")

In [26]:
print_recommendations(rs_model, movies_mapper, "Toy Story")

0.73 -- Star Wars
0.70 -- Return of the Jedi
0.69 -- Independence Day (ID4)
0.66 -- Rock, The
0.64 -- Mission: Impossible
0.64 -- Willy Wonka and the Chocolate Factory


In [27]:
print_recommendations(rs_model, movies_mapper, "Batman Returns")

0.71 -- Batman
0.64 -- Batman Forever
0.62 -- Stargate
0.62 -- Die Hard: With a Vengeance
0.61 -- True Lies
0.61 -- Crow, The


In [28]:
print_recommendations(rs_model, movies_mapper, "GoldenEye")

0.66 -- Under Siege
0.62 -- Top Gun
0.62 -- True Lies
0.62 -- Batman
0.60 -- Stargate
0.60 -- Cliffhanger


In [29]:
print_recommendations(rs_model, movies_mapper, "Godfather, The")

0.70 -- Star Wars
0.67 -- Godfather: Part II, The
0.65 -- Fargo
0.63 -- Return of the Jedi
0.59 -- Raiders of the Lost Ark
0.58 -- Pulp Fiction


In [30]:
print_recommendations(rs_model, movies_mapper, "Billy Madison")

0.50 -- Dumb & Dumber
0.49 -- Ace Ventura: Pet Detective
0.45 -- Hot Shots! Part Deux
0.44 -- Brady Bunch Movie, The
0.44 -- Young Guns II
0.43 -- Tommy Boy


In [31]:
print_recommendations(rs_model, movies_mapper, "Lion King, The")

0.75 -- Aladdin
0.69 -- Beauty and the Beast
0.68 -- Forrest Gump
0.66 -- Jurassic Park
0.65 -- E.T. the Extra-Terrestrial
0.65 -- Empire Strikes Back, The


In [32]:
print_recommendations(rs_model, movies_mapper, "Star Wars")

0.88 -- Return of the Jedi
0.76 -- Raiders of the Lost Ark
0.75 -- Empire Strikes Back, The
0.73 -- Toy Story
0.70 -- Godfather, The
0.69 -- Independence Day (ID4)


The notebook for this post can be found [here](https://github.com/TheCamilovisk/DSNotebooks/blob/main/RecommenderSystems/CollaborativeFilteringMemoryBased.ipynb).

For ensemble learning please see the blog post [here](https://www.kaggle.com/code/satishgunjal/ensemble-learning-bagging-boosting-stacking)