# Recommender system in PyTorch

PyTorch is primarily a deep learning framework, which means that we can use it to solve a wide range of real-world problems.

Even before the resurgence of deep learning, machine learning and data analysis were popular research topics. They were not studied only in academia: there were many applications, and among them the recommender system is a representative one that achieved great success in the real world.

A recommender system is a model that predicts which items a user will like. As prior knowledge, we are given data on the user's preferences for the items that the user has already rated. Additional information about users or items may be given as extra data. There are several types of models, but the most popular one predicts a user's preferences for items that the user has not rated yet. In other words, it predicts the missing data from the known data. Then, for a specific user, we can sort the unrated items by predicted preference in decreasing order and suggest the first few items to the user. There have been many methods for making such predictions, including many that do not rely on deep learning.

In this tutorial, we'll take a look at a classical recommender system. The *Netflix Prize* is perhaps the competition that most caught researchers' interest. Its task was to predict users' star ratings for movies, given some of the rating data. We'll follow a similar approach with similar data. We'll take a deeper look at a model that, although it was not the final winner of the competition, made a historic and arguably the most important breakthrough in this field: a collaborative filtering method that uses singular value decomposition.

Please keep in mind that neural networks will not appear in this tutorial. Instead, enjoy seeing how a deep learning framework can be used for a task that involves no deep learning!

### The Problem

First of all, let's define the problem. Suppose that you're maintaining a website where users rate movies. There are $N$ users and $M$ movies. Some of the users have already rated some of the movies. You are given this rating data as the input of the problem. As the output, you have to predict the rest of the rating data: how the users will rate the movies they haven't rated yet.

How would you solve this problem? If you have no idea, just think about the movies you like. Suppose that you like a movie with a particular actor, a movie directed by a particular director, or a movie of a specific genre. Then you'll probably like other movies with the same actor, the same director, or the same genre. Or suppose that you have a friend who has the same list of favorite movies. If there's a new movie that your friend likes, you'll probably like it as well. These are well-known approaches to this problem, and of course there are tons of other ideas.

But how is this related to recommendation? If you can predict a user's ratings for movies, then you can also predict the list of movies the user will like. Predict the ratings, sort them, and print the first few movies in the list whose ratings are over a pre-determined threshold! So predicting the ratings is essentially the whole job of a recommender system.
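The "predict, sort, threshold" step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `predicted` dictionary, the `recommend` function, and the values of `threshold` and `top_k` are hypothetical and do not come from the dataset or from any library.

```python
# Hypothetical predicted ratings for one user, keyed by movie ID.
# Only movies the user has not rated yet are included.
predicted = {10: 4.6, 20: 2.1, 30: 3.9, 40: 4.8, 50: 3.2}

def recommend(predicted_ratings, threshold=3.5, top_k=2):
    """Return up to top_k movie IDs whose predicted rating is over threshold, best first."""
    candidates = [(movie, score) for movie, score in predicted_ratings.items()
                  if score > threshold]
    # Sort by predicted rating in decreasing order.
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [movie for movie, _ in candidates[:top_k]]

print(recommend(predicted))  # [40, 10]
```

Everything interesting happens before this step: the quality of the list depends entirely on how good the predicted ratings are.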

### MovieLens Dataset

In this tutorial, we will use the MovieLens dataset. It contains movie ratings by users of a website called MovieLens. There is some additional data as well, but we will only use the ratings data here. Please refer to ``https://grouplens.org/datasets/movielens/`` for detailed information.

There are three datasets: a small one, a medium-sized one, and a large one. As you can see in the code below, each dataset has a different name. Let's use the smallest one for convenience.

In [1]:
DATASET_NAME = 'ml-latest-small' # MovieLens Latest Datasets (Small) recommended for education and development
# DATASET_NAME = 'ml-25m' # MovieLens 25M Dataset recommended for new research
# DATASET_NAME = 'ml-latest' # MovieLens Latest Datasets (Full) recommended for education and development

Now we will download the dataset file from the website. While unzipping the downloaded file, you can see the names of the files included. As mentioned above, we will just use ``ratings.csv`` data here.

In [2]:
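import os
# Download the zip archive; wget's -nv flag keeps the log output short.
os.system(f"wget -nv 'https://files.grouplens.org/datasets/movielens/{DATASET_NAME}.zip'")
# Extract the archive into a directory named after the dataset.
os.system(f"unzip {DATASET_NAME}.zip")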

Archive:  ml-latest-small.zip


2021-12-01 23:16:35 URL:https://files.grouplens.org/datasets/movielens/ml-latest-small.zip [978202/978202] -> "ml-latest-small.zip.2" [1]
replace ml-latest-small/links.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename:  NULL
(EOF or read error, treating as "[N]one" ...)


256

Let's load the data. To make it easy to work on the data, let's use ``pandas``.

In [3]:
import pandas as pd
data = pd.read_csv(f"{DATASET_NAME}/ratings.csv")

And let's take a quick look at the data.

In [4]:
print(data.sample(10))
print('min-----')
print(data.min())
print('max-----')
print(data.max())
print(f'Number of ratings: {len(data)}')

       userId  movieId  rating   timestamp
79424     493      493     4.0  1001562900
91721     594     8636     5.0  1137982172
79784     496    84374     3.0  1415166319
27908     190     7438     4.0  1504310074
18747     119   162350     4.5  1505072367
80793     509    97938     3.5  1435994828
69537     448     6618     4.0  1166303478
18407     117      280     3.0   844163543
85858     555     4040     3.0   978744803
52978     348    79132     5.0  1378850281
min-----
userId               1.0
movieId              1.0
rating               0.5
timestamp    828124615.0
dtype: float64
max-----
userId       6.100000e+02
movieId      1.936090e+05
rating       5.000000e+00
timestamp    1.537799e+09
dtype: float64
Number of ratings: 100836


As you can see, each row of the data contains a user ID, a movie ID, a rating, and a timestamp. The IDs are integers and the ratings are floating-point numbers. Judging from the maximum IDs, there are at most 610 users and 193609 movies, although the exact counts may be smaller if some IDs are unused. The dataset covers a large range of movies but not many users; it is surely just a portion of the original data, suitable for introductory purposes. Ratings start at 0.5 and do not exceed 5.0. We can easily guess that the step is 0.5 and that there are 10 unique rating values. The number of entries is tiny compared to the number of user-movie combinations (only about 0.08% of $610 \times 193609$!). We have to guess the rest from this really, really small portion of the data.
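The maximum ID is only an upper bound on the number of users or movies, so it can be worth counting the distinct IDs directly. Here is a minimal sketch using a toy table; `toy` and its values are made up for illustration, and in the notebook the same calls could be applied to `data` instead.

```python
import pandas as pd

# Toy stand-in for the ratings table (hypothetical values).
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 5],
    "movieId": [1, 7, 7, 20],
    "rating":  [4.0, 3.5, 5.0, 0.5],
})

n_users = toy["userId"].nunique()     # distinct users that actually appear
n_movies = toy["movieId"].nunique()   # distinct movies that actually appear
max_user, max_movie = toy["userId"].max(), toy["movieId"].max()

# Fraction of the (max user ID) x (max movie ID) grid that has a rating.
density = len(toy) / (max_user * max_movie)

print(f"distinct users: {n_users}, max userId: {max_user}")
print(f"distinct movies: {n_movies}, max movieId: {max_movie}")
print(f"density: {density:.2%}")
print(sorted(toy["rating"].unique()))  # the distinct rating values
```

If the distinct counts turn out to be much smaller than the maximum IDs, the grid contains rows or columns with no ratings at all.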

Okay, let's go further.

### Ratings as a Matrix

We can regard the ratings as a matrix, possibly with some missing entries. The rows of the matrix correspond to the users, and the columns correspond to the movies. If the $i$-th user rated the $j$-th movie with $r_{ij}$, then the $(i, j)$ element of the matrix has the value $r_{ij}$. If the data contains no rating for a given pair of user and movie, the $(i, j)$ element is missing and we have to predict its value. The problem thus becomes one of filling in the missing elements of a matrix.
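To make the matrix view concrete, here is a small sketch that builds such a matrix as a PyTorch tensor, using NaN to mark missing entries. The toy `ratings` list, the sizes, and the use of 0-based indices are assumptions made for illustration; the IDs in the dataset start at 1. A dense tensor is used only because the example is tiny.

```python
import torch

# Hypothetical toy ratings: (user index, movie index, rating), all 0-based.
ratings = [(0, 0, 4.0), (0, 2, 3.5), (1, 2, 5.0), (2, 1, 0.5)]
num_users, num_movies = 3, 4

# Start with every entry missing (NaN), then fill in the known ratings.
R = torch.full((num_users, num_movies), float("nan"))
for i, j, r in ratings:
    R[i, j] = r

known = ~torch.isnan(R)  # boolean mask of the observed entries
print(R)
print("known entries:", int(known.sum()), "of", R.numel())
```

The `known` mask separates the entries we are given from the ones we have to predict.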

As we saw in the sample data above, there can be a large number of users and/or movies, and the number of combinations of them is even larger. This really is a huge problem space, and we need an efficient algorithm.

One thing to mention is that we will allow arbitrary values for the predicted ratings. A value need not end in '.0' or '.5'. It could even be smaller than 0.5 or bigger than 5.0. As we will see soon, this does not get in the way of evaluating the result.

### Evaluation

After predicting missing values of the matrix, how can we evaluate the result? Or essentially, what does it mean to evaluate a recommendation?

It might be a little early to explain this topic, but let's think about it. It would be good if the movies the user actually likes are included in the recommendation. We measure this with *recall*: the fraction of the movies the user likes that appear in the recommendation. But what if the system just predicts that the user will like all of the movies? All of the to-be-favorite movies will be included in the recommendation, but we know that such a recommendation is useless. So we also have to measure the fraction of the recommended movies that the user actually likes, with a metric called *precision*. Here the system could cheat in the opposite direction and recommend nothing. Then we can say that none of the recommended movies was wrong, but this is also a useless case.
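The two metrics and their failure modes can be illustrated with sets of movie IDs. This is a sketch under assumptions: the function name `precision_recall` and the toy IDs are hypothetical, and returning 0.0 when a denominator is empty is just one possible convention, since precision is not really defined for an empty recommendation.

```python
def precision_recall(recommended, liked):
    """Precision and recall of a recommendation against the set of liked movies."""
    recommended, liked = set(recommended), set(liked)
    hits = len(recommended & liked)  # recommended movies the user really likes
    # Convention (an assumption): report 0.0 when the denominator is empty.
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(liked) if liked else 0.0
    return precision, recall

liked = {1, 2, 3, 4}  # hypothetical movies the user actually likes

# A mixed recommendation: two of three are liked, two of four liked are found.
print(precision_recall([1, 2, 9], liked))
# Recommending all 100 movies: perfect recall, poor precision.
print(precision_recall(range(1, 101), liked))   # (0.04, 1.0)
# Recommending a single safe movie: perfect precision, poor recall.
print(precision_recall([1], liked))             # (1.0, 0.25)
```

Neither number alone tells us whether a recommendation is useful, which is why they are usually considered together.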

To handle such cases, metrics designed to reflect both precision and recall are often used. But they share a fundamental drawback. Suppose that the recommender system outputs a movie that it predicts the user will like. We cannot know the user's actual response from the input data. We would have to actually recommend the movie to the user to find out.

Then how can we evaluate the result with just the given data? A widely used metric is the RMSE (Root Mean Square Error). We first set aside some of the known ratings; let's call them the test dataset. We then run the system on the rest of the data, called the train dataset. This gives us a predicted rating for each user-movie pair in the test dataset, and we also know the actual rating. So we can compute the errors, square them, take their mean, and take the square root of that mean. That's the RMSE score.
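Written as a formula, with $T$ denoting the set of user-movie pairs in the test dataset and $\hat{r}_{ij}$ the predicted rating (both symbols are introduced here for illustration), the score is $\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(i,j) \in T} (\hat{r}_{ij} - r_{ij})^2}$. A minimal PyTorch sketch follows; the helper `rmse` and the toy tensors are hypothetical and are not part of any library.

```python
import torch

def rmse(predicted, actual):
    """Root mean square error between two tensors of the same shape."""
    return torch.sqrt(torch.mean((predicted - actual) ** 2))

# Hypothetical actual and predicted ratings for four test pairs.
actual = torch.tensor([4.0, 3.0, 5.0, 1.0])
predicted = torch.tensor([3.5, 3.0, 5.5, 2.0])

# Errors are -0.5, 0.0, 0.5, 1.0, so the mean squared error is 0.375.
print(rmse(predicted, actual).item())
```

Note that one of the predictions is 5.5, above the valid rating range. The score is still well defined, which is why arbitrary predicted values cause no trouble here.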

For this purpose, let's split the dataset into a train dataset and a test dataset. For convenience, we will use 10% of the original data as the test dataset, as set by ``test_portion`` in the code below.