# Date Night Movie Recommender

In [1]:
import pandas as pd
import numpy as np
import random

## 1. Data Collection and Preprocessing

The MovieLens 1M dataset is downloaded first. It ships as three files (ratings, movies and users), each loaded below into its own dataframe: `df_ratings`, `df_items` and `df_users`.

In [2]:
!mkdir -p ../data
!cd ../data && wget https://files.grouplens.org/datasets/movielens/ml-1m.zip && unzip ml-1m.zip && rm ml-1m.zip

--2024-07-07 22:32:10--  https://files.grouplens.org/datasets/movielens/ml-1m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152, 64:ff9b::8065:4198
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5917549 (5.6M) [application/zip]
Saving to: ‘ml-1m.zip’


2024-07-07 22:32:12 (3.35 MB/s) - ‘ml-1m.zip’ saved [5917549/5917549]

Archive:  ml-1m.zip
   creating: ml-1m/
  inflating: ml-1m/movies.dat        
  inflating: ml-1m/ratings.dat       
  inflating: ml-1m/README            
  inflating: ml-1m/users.dat         


In [3]:
df_ratings = pd.read_csv(
    "../data/ml-1m/ratings.dat",
    delimiter="::",
    engine="python",
    names=["userId", "movieId", "rating", "timestamp"]
)

df_items = pd.read_csv(
    "../data/ml-1m/movies.dat",
    delimiter="::",
    engine="python",
    names=["movieId", "title", "genres"],
    encoding="ISO-8859-1"
)
df_items.set_index("movieId", inplace=True)

df_users = pd.read_csv(
    "../data/ml-1m/users.dat",
    delimiter="::",
    engine="python",
    names=["userId", "gender", "age", "occupation", "zipcode"],
    encoding="ISO-8859-1"
)
df_users.set_index("userId", inplace=True)

In [4]:
df_ratings

,userId,movieId,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291
...,...,...,...,...
1000204,6040,1091,1,956716541
1000205,6040,1094,5,956704887
1000206,6040,562,5,956704746
1000207,6040,1096,4,956715648


In [5]:
df_items["year"] = df_items["title"].str[-5:-1]
df_items["title"] = df_items["title"].str[:-7]
df_items

movieId,title,genres,year
1,Toy Story,Animation|Children's|Comedy,1995
2,Jumanji,Adventure|Children's|Fantasy,1995
3,Grumpier Old Men,Comedy|Romance,1995
4,Waiting to Exhale,Comedy|Drama,1995
5,Father of the Bride Part II,Comedy,1995
...,...,...,...
3948,Meet the Parents,Comedy,2000
3949,Requiem for a Dream,Drama,2000
3950,Tigerland,Drama,2000
3951,Two Family House,Drama,2000


In [6]:
df_users

userId,gender,age,occupation,zipcode
1,F,1,10,48067
2,M,56,16,70072
3,M,25,15,55117
4,M,45,7,02460
5,M,25,20,55455
...,...,...,...,...
6036,F,25,15,32603
6037,F,45,1,76006
6038,F,56,1,14706
6039,F,45,0,01060


Several IMDb datasets were also downloaded:
- **title.basics.tsv.gz**: general metadata about titles,
- **title.crew.tsv.gz**: information about directors and writers,
- **name.basics.tsv.gz**: information about actors,
- **title.ratings.tsv.gz**: average rating and number of votes.

In [7]:
!mkdir -p ../data/imdb/
!cd ../data/imdb/ && wget https://datasets.imdbws.com/title.basics.tsv.gz
!cd ../data/imdb/ && wget https://datasets.imdbws.com/title.crew.tsv.gz
!cd ../data/imdb/ && wget https://datasets.imdbws.com/name.basics.tsv.gz
!cd ../data/imdb/ && wget https://datasets.imdbws.com/title.ratings.tsv.gz

--2024-07-07 22:32:15--  https://datasets.imdbws.com/title.basics.tsv.gz
Resolving datasets.imdbws.com (datasets.imdbws.com)... 52.222.149.104, 52.222.149.89, 52.222.149.11, ...
Connecting to datasets.imdbws.com (datasets.imdbws.com)|52.222.149.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 191394068 (183M) [binary/octet-stream]
Saving to: ‘title.basics.tsv.gz’


2024-07-07 22:32:41 (7.06 MB/s) - ‘title.basics.tsv.gz’ saved [191394068/191394068]

--2024-07-07 22:32:41--  https://datasets.imdbws.com/title.crew.tsv.gz
Resolving datasets.imdbws.com (datasets.imdbws.com)... 52.222.149.37, 52.222.149.11, 52.222.149.89, ...
Connecting to datasets.imdbws.com (datasets.imdbws.com)|52.222.149.37|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 71386776 (68M) [binary/octet-stream]
Saving to: ‘title.crew.tsv.gz’


2024-07-07 22:32:54 (5.08 MB/s) - ‘title.crew.tsv.gz’ saved [71386776/71386776]

--2024-07-07 22:32:55--  https://datasets.imdbws

In [8]:
imdb_basics = pd.read_csv("../data/imdb/title.basics.tsv.gz", compression='gzip', sep='\t')
imdb_basics = imdb_basics[imdb_basics["titleType"] == "movie"]
imdb_basics


,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
8,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894,\N,45,Romance
144,tt0000147,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0,1897,\N,100,"Documentary,News,Sport"
498,tt0000502,movie,Bohemios,Bohemios,0,1905,\N,100,\N
570,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1906,\N,70,"Action,Adventure,Biography"
587,tt0000591,movie,The Prodigal Son,L'enfant prodigue,0,1907,\N,90,Drama
...,...,...,...,...,...,...,...,...,...
10911695,tt9916622,movie,Rodolpho Teóphilo - O Legado de um Pioneiro,Rodolpho Teóphilo - O Legado de um Pioneiro,0,2015,\N,57,Documentary
10911722,tt9916680,movie,De la ilusión al desconcierto: cine colombiano...,De la ilusión al desconcierto: cine colombiano...,0,2007,\N,100,Documentary
10911734,tt9916706,movie,Dankyavar Danka,Dankyavar Danka,0,2013,\N,\N,Comedy
10911744,tt9916730,movie,6 Gunn,6 Gunn,0,2017,\N,116,Drama


In [9]:
imdb_directors = pd.read_csv("../data/imdb/title.crew.tsv.gz", compression='gzip', sep='\t')
imdb_directors

,tconst,directors,writers
0,tt0000001,nm0005690,\N
1,tt0000002,nm0721526,\N
2,tt0000003,nm0721526,\N
3,tt0000004,nm0721526,\N
4,tt0000005,nm0005690,\N
...,...,...,...
10261738,tt9916848,nm1485677,"nm9187127,nm1485677,nm9826385,nm9299459,nm1628284"
10261739,tt9916850,nm1485677,"nm9187127,nm1485677,nm9826385,nm1628284"
10261740,tt9916852,nm1485677,"nm9187127,nm1485677,nm9826385,nm9299459,nm1628284"
10261741,tt9916856,nm10538645,nm6951431


In [10]:
"""
# Disabled: the cast features were dropped because they were too sparse.
imdb_cast = pd.read_csv("../data/imdb/name.basics.tsv.gz", compression='gzip', sep='\t')
imdb_cast = imdb_cast.dropna(subset=["primaryName"])

# One row per (person, known-for title) pair
imdb_cast["tconst"] = imdb_cast["knownForTitles"].str.split(',')
imdb_cast = imdb_cast.explode("tconst")

# Join the names of the people attached to each title into a "cast" column
imdb_cast = (
    imdb_cast.groupby("tconst")["primaryName"]
    .agg(','.join)
    .rename("cast")
    .reset_index()
)

imdb_cast
"""


In [11]:
imdb_ratings = pd.read_csv("../data/imdb/title.ratings.tsv.gz", compression='gzip', sep='\t')
imdb_ratings

,tconst,averageRating,numVotes
0,tt0000001,5.7,2063
1,tt0000002,5.6,279
2,tt0000003,6.5,2030
3,tt0000004,5.4,180
4,tt0000005,6.2,2796
...,...,...,...
1453786,tt9916730,7.0,12
1453787,tt9916766,7.1,23
1453788,tt9916778,7.2,37
1453789,tt9916840,7.2,10


All IMDb dataframes are merged into one.

In [12]:
imdb_basics = pd.merge(imdb_basics, imdb_directors, on="tconst")
#imdb_basics = pd.merge(imdb_basics, imdb_cast, on="tconst")
imdb_basics = pd.merge(imdb_basics, imdb_ratings, on="tconst")

The IMDb information is then added to the MovieLens movies dataframe, matching on title and release year, so that everything known about each movie sits in one place.

In [13]:
df_movies = pd.merge(df_items, imdb_basics, left_on=["title", "year"], right_on=["originalTitle", "startYear"])
df_movies

,title,genres_x,year,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres_y,directors,writers,averageRating,numVotes
0,Toy Story,Animation|Children's|Comedy,1995,tt0114709,movie,Toy Story,Toy Story,0,1995,\N,81,"Adventure,Animation,Comedy",nm0005124,"nm0005124,nm0230032,nm0004056,nm0710020,nm0923...",8.3,1076431
1,Jumanji,Adventure|Children's|Fantasy,1995,tt0113497,movie,Jumanji,Jumanji,0,1995,\N,104,"Adventure,Comedy,Family",nm0002653,"nm0378144,nm0852430,nm0833164,nm0885575",7.1,380163
2,Grumpier Old Men,Comedy|Romance,1995,tt0113228,movie,Grumpier Old Men,Grumpier Old Men,0,1995,\N,101,"Comedy,Romance",nm0222043,nm0425756,6.6,29874
3,Waiting to Exhale,Comedy|Drama,1995,tt0114885,movie,Waiting to Exhale,Waiting to Exhale,0,1995,\N,124,"Comedy,Drama,Romance",nm0001845,"nm0573334,nm0060103",6.0,12314
4,Father of the Bride Part II,Comedy,1995,tt0113041,movie,Father of the Bride Part II,Father of the Bride Part II,0,1995,\N,106,"Comedy,Family,Romance",nm0796124,"nm0352443,nm0329304,nm0583600,nm0796124",6.1,41946
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2341,Get Carter,Thriller,1971,tt0067128,movie,Get Carter,Get Carter,0,1971,\N,112,"Action,Crime,Thriller",nm0388198,"nm0388198,nm0507794",7.3,37018
2342,Meet the Parents,Comedy,2000,tt0212338,movie,Meet the Parents,Meet the Parents,0,2000,\N,108,"Comedy,Romance",nm0005366,"nm0322839,nm0164898,nm0381272,nm0357453",7.0,357427
2343,Requiem for a Dream,Drama,2000,tt0180093,movie,Requiem for a Dream,Requiem for a Dream,0,2000,\N,102,Drama,nm0004716,"nm0782968,nm0004716",8.3,904474
2344,Tigerland,Drama,2000,tt0170691,movie,Tigerland,Tigerland,0,2000,\N,101,"Drama,War",nm0001708,"nm0458462,nm0570077",6.9,43504
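The merge keeps only the MovieLens titles that match an IMDb `originalTitle` exactly. MovieLens stores some titles with the article moved to the end (for example "Secret Garden, The"), and such titles would presumably not match. The sketch below shows one way the titles could be normalised before merging. The helper name `move_trailing_article` is hypothetical, and only English articles are handled.

```python
import re

# Hypothetical helper: "Secret Garden, The" -> "The Secret Garden"
_TRAILING_ARTICLE = re.compile(r"^(?P<body>.+), (?P<article>The|A|An)$")


def move_trailing_article(title: str) -> str:
    """Move a trailing English article back to the front of the title."""
    match = _TRAILING_ARTICLE.match(title.strip())
    if match is None:
        return title.strip()
    return f"{match.group('article')} {match.group('body')}"


examples = ["Secret Garden, The", "Few Good Men, A", "Toy Story"]
normalised = [move_trailing_article(t) for t in examples]
print(normalised)
```

Applied to `df_items["title"]` before the merge, this could increase the number of matched rows; that effect has not been measured here.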


Only the columns that carry potentially useful features are kept.

In [14]:
df_movies = df_movies[["title", "startYear", "runtimeMinutes", "genres_x", "directors", "averageRating", "numVotes"]].copy()

In [15]:
# Values that are not numeric, such as IMDb's "\N" placeholder, become NaN
df_movies["startYear"] = pd.to_numeric(df_movies["startYear"], errors='coerce')
df_movies["runtimeMinutes"] = pd.to_numeric(df_movies["runtimeMinutes"], errors='coerce')


In [16]:
df_movies

,title,startYear,runtimeMinutes,genres_x,directors,averageRating,numVotes
0,Toy Story,1995,81.0,Animation|Children's|Comedy,nm0005124,8.3,1076431
1,Jumanji,1995,104.0,Adventure|Children's|Fantasy,nm0002653,7.1,380163
2,Grumpier Old Men,1995,101.0,Comedy|Romance,nm0222043,6.6,29874
3,Waiting to Exhale,1995,124.0,Comedy|Drama,nm0001845,6.0,12314
4,Father of the Bride Part II,1995,106.0,Comedy,nm0796124,6.1,41946
...,...,...,...,...,...,...,...
2341,Get Carter,1971,112.0,Thriller,nm0388198,7.3,37018
2342,Meet the Parents,2000,108.0,Comedy,nm0005366,7.0,357427
2343,Requiem for a Dream,2000,102.0,Drama,nm0004716,8.3,904474
2344,Tigerland,2000,101.0,Drama,nm0001708,6.9,43504


This dataframe would need some feature engineering to split the genres into several binary features. The cast was first included in the dataframe, but then discarded, because the features would have been too sparse if they included all the actors.

However, this dataframe was not used further. The methods considered initially would have required these features, but the final approach is a collaborative filtering technique, which only requires a user-item interaction matrix.
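For reference, the genre split mentioned above could be sketched with pandas' `Series.str.get_dummies`. The toy dataframe below is made up for illustration and only mimics the `genres_x` column.

```python
import pandas as pd

# Toy stand-in for df_movies, with the MovieLens "|"-separated genres
toy_movies = pd.DataFrame(
    {
        "title": ["Toy Story", "Grumpier Old Men", "Tigerland"],
        "genres_x": ["Animation|Children's|Comedy", "Comedy|Romance", "Drama"],
    }
)

# One binary column per genre
genre_features = toy_movies["genres_x"].str.get_dummies(sep="|")
toy_features = pd.concat([toy_movies[["title"]], genre_features], axis=1)
print(toy_features)
```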

## 2. Feature Engineering

The user-item interaction matrix for collaborative filtering is created through a pivot on the ratings dataframe.

In [17]:
ratings_matrix = df_ratings.pivot(index="userId", columns="movieId", values="rating").fillna(0)
ratings_matrix

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
1,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6036,0.0,0.0,0.0,2.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6038,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6039,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
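The matrix is mostly zeros. With the counts printed further down (1,000,209 ratings for 6,040 users and 3,706 movies), roughly 4.5% of the cells hold a rating. The toy example below, with made-up data, shows the same pivot on a handful of rows and how the density can be computed.

```python
import pandas as pd

toy_ratings = pd.DataFrame(
    {
        "userId": [1, 1, 2, 3],
        "movieId": [10, 20, 10, 30],
        "rating": [5, 3, 4, 2],
    }
)

# One row per user, one column per movie, 0 where no rating exists
toy_matrix = toy_ratings.pivot(index="userId", columns="movieId", values="rating").fillna(0)

# Fraction of cells that hold an actual rating
density = (toy_matrix.values > 0).mean()
print(toy_matrix)
print(f"density = {density:.2%}")
```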


## 3. Model Development

Let's start with helper functions to prepare the ratings dataset.

In [18]:
def split_ratings(
    ratings: np.ndarray, val_size: int
) -> tuple[np.ndarray, np.ndarray]:
    """Function to create a validation set of ratings out of a sparse matrix

    Parameters
    ----------
    ratings : np.ndarray
        The matrix of ratings, can be sparse, full of zeros or nan values.
    val_size : float or int
        The size of the validation set.
        If a float below 1, it is the fraction of the available ratings to hold out.
        If an int, it is the number of ratings to hold out.

    Returns
    -------
    Tuple[np.ndarray, np.ndarray]
        The tuple of training and validation ratings.
    """
    # Get the couples of indices with significant values
    rows, cols = np.where(ratings > 0)
    val_ratings = np.zeros(ratings.shape)
    train_ratings = ratings.copy()

    if val_size < 1:  # Interpreted as a fraction of the available ratings
        val_size = int(len(rows) * val_size)

    val_ids = random.sample(range(len(rows)), val_size)

    for row, col in zip(rows[val_ids], cols[val_ids]):
        train_ratings[row, col] = 0
        val_ratings[row, col] = ratings[row, col]

    return train_ratings, val_ratings

def normalizeRatings(Y: np.ndarray, R: np.ndarray, axis: int = 1) -> tuple:
    """
    Preprocess data by subtracting the mean rating taken along the given axis.
    Only real ratings are included: R(i,j)=1.

    Ynorm, Ymean = normalizeRatings(Y, R) centres Y so that the rated entries
    average to 0 along `axis`. Unrated entries keep the value 0, which is the
    mean rating after centring. With axis=0, Ynorm is returned transposed.
    The mean ratings are returned in Ymean.

    Parameters
    ----------
    Y : np.ndarray
        The rating array to normalise.
    R : np.ndarray
        The array indicating whether a user rated a movie.
    axis : int
        The numpy axis to use for taking the average rating.

    Returns
    -------
    Tuple
        The normalised array of ratings and the mean rating.

    """
    Ymean = (np.sum(Y * R, axis=axis) / (np.sum(R, axis=axis) + 1e-12)).reshape(-1, 1)

    if axis == 0:
        Ynorm = Y.T - np.multiply(Ymean, R.T)
    else:
        Ynorm = Y - np.multiply(Ymean, R)

    return (Ynorm, Ymean)

The matrices needed for Funk SVD are prepared:
- `Y`: the original ratings matrix, transposed to shape (items, users).
- `R`: binary mask indicating where ratings exist in `Y`.
- `Y_train` and `Y_val`: training and validation splits of `Y`, with 1000 ratings held out for validation.
- `Y_norm` and `Y_mean`: the training ratings centred on each user's mean rating, and those means. `Y_norm` has shape (users, items).
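To make the mask and the centring concrete, here is a minimal sketch on a made-up 3-item, 2-user matrix. It mirrors what `normalizeRatings(..., axis=0)` computes. The names `Y_toy`, `R_toy` and `user_mean` are illustrative only.

```python
import numpy as np

# Toy ratings with shape (items, users); 0 means "not rated"
Y_toy = np.array([[5.0, 0.0], [3.0, 4.0], [0.0, 2.0]])
R_toy = (Y_toy != 0).astype(int)

# Mean over the rated entries of each user (one column per user)
user_mean = (Y_toy * R_toy).sum(axis=0) / (R_toy.sum(axis=0) + 1e-12)

# Centre the rated entries only; unrated entries stay at 0.
# The result is transposed to (users, items), as with axis=0 above.
Y_toy_norm = Y_toy.T - user_mean.reshape(-1, 1) * R_toy.T
print(user_mean)
print(Y_toy_norm)
```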

In [19]:
Y = ratings_matrix.T.values
R = (Y != 0).astype(int)
Y_train, Y_val = split_ratings(Y, 1000)
Y_norm, Y_mean = normalizeRatings(Y_train, R, axis=0)

In [20]:
n_ratings = int(np.sum(R))
n_ratings_train = np.count_nonzero(Y_train)
n_ratings_val = np.count_nonzero(Y_val)
n_items, n_users = Y.shape

print(f"The number of users is {n_users}")
print(f"The number of items is {n_items}")
print(f"The number of ratings is {n_ratings}")
print(f"The number of training ratings is {n_ratings_train}")
print(f"The number of validation ratings is {n_ratings_val}")

The number of users is 6040
The number of items is 3706
The number of ratings is 1000209
The number of training ratings is 999209
The number of validation ratings is 1000


The user-item interaction matrix is decomposed into user and item matrices:
- `user_mat`: This matrix represents users' preferences across latent_dim dimensions. Each row corresponds to a user and each column to a latent feature.
- `item_mat`: This matrix represents items' characteristics across the same latent_dim dimensions. Each row corresponds to a latent feature and each column to an item.

In Funk SVD, the goal is to find the optimal values for user_mat and item_mat such that their product approximates Y.
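In its usual textbook form, this goal is written as a squared-error objective over the set of rated entries, minimised by gradient steps on the user and item vectors. The notation is introduced here for illustration: $\Omega$ is the set of rated (user, item) pairs, $p_u$ is row $u$ of `user_mat`, $q_i$ is column $i$ of `item_mat`, and $\eta$ is the learning rate, with constant factors absorbed into it.

```latex
\min_{P,\,Q} \; \sum_{(u,i) \in \Omega} \left( y_{ui} - p_u^\top q_i \right)^2,
\qquad
e_{ui} = y_{ui} - p_u^\top q_i,
\qquad
p_u \leftarrow p_u + \eta \, e_{ui} \, q_i,
\quad
q_i \leftarrow q_i + \eta \, e_{ui} \, p_u
```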

In [21]:
latent_dim = 120  # The dimension of the latent space is a hyperparameter of the model

user_mat = np.random.rand(n_users, latent_dim)
item_mat = np.random.rand(latent_dim, n_items)

assert (
    np.dot(user_mat, item_mat).shape == Y.T.shape
), "Something is wrong with matrices dimensions."

The Funk SVD model is defined following the implementation seen in class.

In [22]:
class FunkSVD:
    """Object to perform matrix factorisation using a basic form of FunkSVD with no regularisation.

    Parameters
    ----------
    latent_dim : int
        The dimension of the latent space for the FunkSVD algorithm.


    Attributes
    ----------
    user_mat_ : np.ndarray
        Attribute calculated only after calling the fit method.
        It is the latent matrix representing users encodings.

    item_mat_ : np.ndarray
        Attribute calculated only after calling the fit method.
        It is the latent matrix representing items encodings.

    mse_ : float
        Attribute calculated only after calling the fit method.
        It is the training error recorded at the last iteration of `fit`,
        computed over the rated entries only.

    hist : Dict
        Attribute containing lists of quantities varying during training.


    Methods
    -------
    fit
        The method performing the training for the model.

    predict_rating
        The method predicting a user rating for a certain item.

    """

    def __init__(self, latent_dim: int):
        """Constructor for the class"""
        self.latent_dim = latent_dim
        self.hist = {}

    @staticmethod
    def _random_initialise(n: int, m: int, seed: int = None) -> np.ndarray:
        """Private method to initialise a random matrix

        Parameters
        ----------
        n : int
            Number of rows
        m : int
            Number of columns
        seed : int, optional
            The random seed to make results reproducible, by default None.

        Returns
        -------
        np.ndarray
            The randomly initialised array of shape (n,m).
        """
        if seed is not None:
            np.random.seed(seed)

        return np.random.rand(n, m)

    def fit(
        self,
        ratings: np.ndarray,
        r_mask: np.ndarray,
        n_users: int,
        n_items: int,
        max_iter: int = 300,
        learning_rate: float = 1.0e-5,
    ):
        """Fit Method for the FunkSVD object.

        Parameters
        ----------
        ratings : np.ndarray
            The ratings matrix.
        r_mask : np.ndarray
            The mask matrix to determine whether the ratings are defined.
        n_users : int
            Number of users.
        n_items : int
            Number of items.
        max_iter : int, optional
            Number of iterations for the algorithm.
            Default: 300
        learning_rate : float, optional
            Learning rate for gradient descent updates.
            Default: 1.e-5

        Returns
        -------
        None
            One only has an updated version of the object.
            The method creates the attributes `user_mat_` and `item_mat_`.
        """
        # Initialise user and items matrices
        user_mat = self._random_initialise(n_users, self.latent_dim)
        item_mat = self._random_initialise(self.latent_dim, n_items)

        # Initialise an empty list for sum of errors
        self.hist["mse"] = []

        # Boolean mask of the rated entries (constant across iterations)
        mask = r_mask > 0

        # Loop over iterations
        for _ in range(max_iter):
            # Residuals between the ratings and their current reconstruction
            diff_mat = ratings - np.dot(user_mat, item_mat)

            # Scale the residuals by the number of rows of the ratings matrix
            n_rows = len(diff_mat)
            diff_mat /= n_rows

            # Error tracked during training: sum of the squared scaled
            # residuals over the rated entries only
            mse = np.sum(np.square(diff_mat[mask]))
            self.hist["mse"].append(mse)

            # Gradient step. The full residual matrix is used here, so the
            # unrated entries (stored as 0) also contribute to the update.
            user_mat += learning_rate * np.dot(diff_mat, item_mat.T)
            item_mat += learning_rate * np.dot(user_mat.T, diff_mat)

        self.mse_ = mse
        self.user_mat_ = user_mat
        self.item_mat_ = item_mat

        print(
            f"""Training complete.
        \nMean squared error: {self.mse_:.3f}"""
        )

    def predict_rating(self, user_id: int, item_id: int) -> float:
        """The method predicting the rating the user would give to the item.

        Parameters
        ----------
        user_id : int
            id of the user.
        item_id : int
            id of the item.

        Returns
        -------
        float
            The predicted rating.
        """
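        if not hasattr(self, "user_mat_"):
            # Imported here because the notebook has no module-level import for it
            from sklearn.exceptions import NotFittedError

            message = "The model has not been fitted yet. Please call the `fit` method before."
            raise NotFittedError(message)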

        # -1 because of the indexing in python
        return np.dot(self.user_mat_[user_id - 1, :], self.item_mat_[:, item_id - 1])
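For comparison, the textbook form of Funk SVD restricts the error to the rated entries. The sketch below shows one way to do that with a full-batch gradient step on made-up data. `masked_gradient_step` is a hypothetical helper, not part of the class above, and its effect on the scores reported in this notebook has not been measured.

```python
import numpy as np


def masked_gradient_step(ratings, mask, user_mat, item_mat, learning_rate=0.01):
    """One full-batch gradient step that uses only the rated entries."""
    # Residuals are zeroed where no rating exists, so they add no gradient
    residual = (ratings - user_mat @ item_mat) * mask
    # Both gradients are computed from the same (pre-update) factors
    grad_user = residual @ item_mat.T
    grad_item = user_mat.T @ residual
    new_user_mat = user_mat + learning_rate * grad_user
    new_item_mat = item_mat + learning_rate * grad_item
    # Mean squared error over the rated entries, before the update
    mse = float(np.sum(residual**2) / mask.sum())
    return new_user_mat, new_item_mat, mse


rng = np.random.default_rng(0)
toy_ratings = np.array([[5.0, 0.0, 3.0], [4.0, 2.0, 0.0], [0.0, 1.0, 4.0]])
toy_mask = (toy_ratings > 0).astype(float)
P = rng.random((3, 2))  # (users, latent_dim)
Q = rng.random((2, 3))  # (latent_dim, items)

history = []
for _ in range(200):
    P, Q, step_mse = masked_gradient_step(toy_ratings, toy_mask, P, Q)
    history.append(step_mse)

print(f"first MSE: {history[0]:.3f}, last MSE: {history[-1]:.3f}")
```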

The model was trained using the normalized user-item matrix. Several values were tested for the hyperparameters — specifically, the number of iterations and the learning rate — and those that achieved the lowest mean squared error with an acceptable training time were selected.

In [23]:
funk_model = FunkSVD(latent_dim=latent_dim)
funk_model.fit(Y_norm, R.T, n_users, n_items, max_iter=300, learning_rate=5e-4);

Training complete.
        
Mean squared error: 0.172


The function below allows inference by predicting ratings for all movies associated with a specified user using the trained SVD model.

In [24]:
def predict_all_ratings_for_user(Y_val: np.ndarray, model: FunkSVD, user_id: int):
    """Function to retrieve predicted ratings for all movies for a specified user.

    Parameters
    ----------
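    Y_val : np.ndarray
        The validation matrix. It is not used by this function.
    model : FunkSVD
        The model to produce predictions.
    user_id : int
        The user ID for which ratings are to be retrieved (1-based).

    Returns
    -------
    np.array
        A numpy array containing predicted ratings for all movies for the specified user,
        min-max rescaled to the [0, 5] range. The number of movies is read from the global `n_items`.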
    """
    predictions = []
    
    for movie_id in range(n_items):
        pred = float(model.predict_rating(user_id, movie_id + 1))
        predictions.append(pred)

    predictions = np.array(predictions)
    
    predictions_std = (predictions - predictions.min()) / (predictions.max() - predictions.min())
    predictions_scaled = predictions_std * 5

    return predictions_scaled
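The loop in this function computes one dot product per movie. The same values can be obtained with a single vector-matrix product, as the sketch below illustrates on random toy matrices. The `toy_*` names are made up for the example and stand in for `model.user_mat_` and `model.item_mat_`.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_user_mat = rng.random((4, 3))  # (users, latent_dim)
toy_item_mat = rng.random((3, 5))  # (latent_dim, items)
toy_user_id = 2  # 1-based, as in predict_rating

# Loop version: one dot product per item
looped = np.array(
    [toy_user_mat[toy_user_id - 1] @ toy_item_mat[:, j] for j in range(5)]
)

# Vectorised version: one vector-matrix product for all items
vectorised = toy_user_mat[toy_user_id - 1] @ toy_item_mat

# Same min-max rescaling to [0, 5] as in the function above
scaled = 5 * (vectorised - vectorised.min()) / (vectorised.max() - vectorised.min())
print(scaled)
```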

Let's predict the ratings of user 42, for example.

In [25]:
user_id = 42

results = pd.DataFrame()
results["y_pred"] = predict_all_ratings_for_user(Y_val, funk_model, user_id)

results.index = ratings_matrix.columns

results

movieId,y_pred
1,3.079956
2,1.654616
3,2.848276
4,2.037523
5,2.445163
...,...
3948,2.256516
3949,4.223109
3950,2.007138
3951,1.577772


## 4. Recommendation Algorithm

The "couple score" combines the predictions for two specified users: the two predicted ratings are multiplied and the product is divided by 5, which keeps the score in the [0, 5] range.
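A small sketch of this formula shows why a product is a reasonable choice for a couple: two pairs of predictions with the same average get different scores, and the pair where both users agree ranks higher. The function name `couple_score` is used here for illustration only.

```python
def couple_score(pred_1: float, pred_2: float, scale: float = 5.0) -> float:
    """Product of two predicted ratings, divided by the rating scale."""
    return pred_1 * pred_2 / scale


# Both pairs average 3 stars, but the product penalises disagreement
agree = couple_score(3.0, 3.0)     # both users moderately interested
disagree = couple_score(5.0, 1.0)  # one enthusiastic, one not
print(agree, disagree)
```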

Here is an example for users 42 and 5215:

In [26]:
user_id_1 = 42
user_id_2 = 5215

recommendation = pd.DataFrame()

recommendation["y_pred_1"] = pd.DataFrame(predict_all_ratings_for_user(Y_val, funk_model, user_id_1))
recommendation["y_pred_2"] = pd.DataFrame(predict_all_ratings_for_user(Y_val, funk_model, user_id_2))

recommendation["couple_score"] = recommendation["y_pred_1"] * recommendation["y_pred_2"] / 5

recommendation.index = ratings_matrix.columns

recommendation.sort_values("couple_score", ascending=False)

movieId,y_pred_1,y_pred_2,couple_score
3104,3.773292,4.439388,3.350221
2664,4.034015,3.834987,3.094079
1500,4.010251,3.740001,2.999669
2362,3.684595,3.915482,2.885393
2248,3.257531,4.368939,2.846391
...,...,...,...
2900,2.272218,0.186454,0.084733
531,0.129402,2.807865,0.072669
2268,2.961310,0.016319,0.009665
1941,0.000000,2.162179,0.000000


Let's see the recommended movies for these two users by merging the recommendations with the MovieLens movies dataframe:

In [27]:
pd.merge(recommendation, df_items, on="movieId").sort_values("couple_score", ascending=False)

movieId,y_pred_1,y_pred_2,couple_score,title,genres,year
3104,3.773292,4.439388,3.350221,Midnight Run,Action|Adventure|Comedy|Crime,1988
2664,4.034015,3.834987,3.094079,Invasion of the Body Snatchers,Horror|Sci-Fi,1956
1500,4.010251,3.740001,2.999669,Grosse Pointe Blank,Comedy|Crime,1997
2362,3.684595,3.915482,2.885393,Glen or Glenda,Drama,1953
2248,3.257531,4.368939,2.846391,Say Anything...,Comedy|Drama|Romance,1989
...,...,...,...,...,...,...
2900,2.272218,0.186454,0.084733,Monkey Shines,Horror|Sci-Fi,1988
531,0.129402,2.807865,0.072669,"Secret Garden, The",Children's|Drama,1993
2268,2.961310,0.016319,0.009665,"Few Good Men, A",Crime|Drama,1992
1941,0.000000,2.162179,0.000000,Hamlet,Drama,1948


## 5. Evaluation

The non-zero entries of `Y_val` are the ratings that were held out from training. Comparing them with the predicted ratings gives a score for the Funk SVD model.

In [28]:
non_zero_indices = np.nonzero(Y_val.T)

The following function predicts ratings for a specified number of known ratings using the trained SVD model and returns the predictions and the actual ratings. It will be used for scoring the model.

In [29]:
def evaluate_model_predictions(Y_val: np.ndarray, model: FunkSVD, num_ratings: int = 100):
    """Function to evaluate the FunkSVD model by predicting ratings and comparing them against actual ratings from a validation set.
    
    Parameters
    ----------
    Y_val : np.ndarray
        The validation matrix.
    model : FunkSVD
        The model to produce predictions.
    num_ratings : int
        The number of ratings to include.
    
    Returns
    -------
    (np.array, np.array)
        A tuple of two arrays containing the predicted and the actual ratings.
    """
    predictions = []
    ratings = []
    
    rows, cols = non_zero_indices
    for k in range(num_ratings):
        pred = float(model.predict_rating(rows[k] + 1, cols[k] + 1))
        rating = Y_val.T[rows[k], cols[k]]
        predictions.append(pred)
        ratings.append(rating)

    predictions = np.array(predictions)
    ratings = np.array(ratings)
    
    predictions_std = (predictions - predictions.min()) / (predictions.max() - predictions.min())
    predictions_scaled = predictions_std * 5

    return predictions_scaled, ratings

The final step consists in scoring the model using different metrics to compare `y_pred` (the predicted ratings) and `y_true` (the actual ratings).

The chosen metrics are:
- **Root Mean Square Error**: the square root of the mean squared difference between predicted and actual ratings, expressed in rating units.
- **R-Squared score**: indicates how well the model's predictions fit compared to a baseline model that always predicts the mean rating.
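Both metrics are short formulas. The sketch below writes them out with NumPy on made-up ratings, to show what the values 0 and negative mean for R-squared. The helper names `toy_rmse` and `toy_r2` are hypothetical and are not the scikit-learn functions used in the next cell.

```python
import numpy as np


def toy_rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Square root of the mean squared difference."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def toy_r2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """1 minus the ratio of residual to total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


toy_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
mean_baseline = np.full_like(toy_true, toy_true.mean())  # always predicts 3
reversed_pred = toy_true[::-1]  # systematically wrong predictions

print(toy_r2(toy_true, mean_baseline), toy_r2(toy_true, reversed_pred))
print(toy_rmse(toy_true, reversed_pred))
```

Always predicting the mean gives an R-squared of 0, so any negative value means the predictions are worse than that baseline.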

In [30]:
from sklearn.metrics import mean_squared_error, r2_score

y_pred, y_true = evaluate_model_predictions(Y_val, funk_model, num_ratings=1000)

# The square root is taken explicitly, which works across scikit-learn versions
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

print(f"Root Mean Square Error: {rmse}")
print(f"R-squared score: {r2}")



Root Mean Square Error: 1.6271098287940085
R-squared score: -1.15571586823323


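These metrics give a first picture of how well the model predicts movie ratings:
- **RMSE**: at about 1.63 on a 5-star scale, the score indicates large differences between the predicted and the actual ratings.
- **R2 score**: the negative value (about -1.16) means the model performs worse than always predicting the mean rating.

Two implementation choices may contribute to these scores: the predictions are min-max rescaled to [0, 5] rather than shifted back by each user's mean rating, and the gradient updates in `fit` also use the unrated entries, which are stored as 0.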
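Because the model is trained on mean-centred ratings (`Y_norm`), one option that could be tried is to undo the centring by adding each user's mean back to the raw prediction, then clipping to the valid star range. The sketch below illustrates the idea on made-up numbers. `denormalise` is a hypothetical helper, and whether it improves the scores above has not been tested in this notebook.

```python
import numpy as np


def denormalise(pred_centred: np.ndarray, user_mean: float) -> np.ndarray:
    """Add the user's mean rating back and clip to the 1-5 star range."""
    return np.clip(pred_centred + user_mean, 1.0, 5.0)


toy_centred = np.array([-2.5, -0.4, 0.0, 0.9, 1.8])  # raw model outputs
toy_user_mean = 3.6                                  # this user's mean rating
restored = denormalise(toy_centred, toy_user_mean)
print(restored)
```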