# torchrec MovieLens Tutorial

In [6]:
# ! wget https://files.grouplens.org/datasets/movielens/ml-25m.zip
%load_ext autoreload
%autoreload 2


## Table of contents
1. Instantiating MovieLens-25M dataset
2. Defining model
3. Training and evaluating model
4. Finding similar movies

## 1. Instantiating MovieLens-25M dataset

To start, we load the MovieLens-25M dataset using `torchrec.datasets.movielens.movielens_25m`. By default, the function loads only the user-movie ratings in `ratings.csv`; we pass `include_movies_data=True` so that it also adds the movie data from `movies.csv` to each user-movie sample.

In [7]:
# from torchrec.datasets.movielens import movielens_25m

# dp = movielens_25m("ml-25m", include_movies_data=True)
from movielens import movielens_25m
dp = movielens_25m("s3://torchrec-movielens-dataset/", include_movies_data=True)

Let's check out a single sample.

In [8]:
next(iter(dp))

{'userId': 1,
 'movieId': 296,
 'rating': 5.0,
 'timestamp': 1147880044,
 'title': 'Pulp Fiction (1994)',
 'genres': 'Comedy|Crime|Drama|Thriller'}

Seems reasonable.
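To make the shape of that sample concrete, here is a minimal sketch of how rating rows could be joined with movie rows on `movieId`. It is only an illustration: the helper `join_ratings_with_movies` and the in-memory CSV strings are hypothetical stand-ins, not how `movielens_25m` is actually implemented.

```python
import csv
import io

# Toy stand-ins for ratings.csv and movies.csv (values copied from the sample above).
RATINGS_CSV = "userId,movieId,rating,timestamp\n1,296,5.0,1147880044\n"
MOVIES_CSV = "movieId,title,genres\n296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller\n"


def join_ratings_with_movies(ratings_csv, movies_csv):
    """Hypothetical helper: attach title and genres to each rating row by movieId."""
    # Index the movie rows by their (string) movieId for constant-time lookup.
    movies = {row["movieId"]: row for row in csv.DictReader(io.StringIO(movies_csv))}
    for rating in csv.DictReader(io.StringIO(ratings_csv)):
        movie = movies[rating["movieId"]]
        yield {
            "userId": int(rating["userId"]),
            "movieId": int(rating["movieId"]),
            "rating": float(rating["rating"]),
            "timestamp": int(rating["timestamp"]),
            "title": movie["title"],
            "genres": movie["genres"],
        }


sample = next(join_ratings_with_movies(RATINGS_CSV, MOVIES_CSV))
print(sample)
```

Each yielded dictionary has the same keys as the sample shown above, which is all the downstream cells rely on.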

Next, we instantiate datapipes representing training and validation data splits and apply shuffling and batching.

In [9]:
from torchrec.datasets.utils import rand_split_train_val

train_dp, val_dp = rand_split_train_val(dp, 0.9)
batched_train_dp = train_dp.shuffle(buffer_size=int(1e5)).batch(8192)
batched_val_dp = val_dp.batch(8192)

Turns out that the integer user ids and movie ids referenced by the dataset aren't contiguous. Let's remap them to contiguous values so that we can use them with `torch.nn.Embedding` more easily downstream.

To do so, we first populate dictionaries that map movie and user ids to ids in contiguous ranges

In [10]:
# contig_movie_ids = {}
# contig_user_ids = {}
# movie_id_to_title_genre = {}

# available_movie_id = 0
# available_user_id = 0
# for sample in dp:
#     if sample["movieId"] not in contig_movie_ids:
#         contig_movie_ids[sample["movieId"]] = available_movie_id
#         available_movie_id += 1
#     if sample["userId"] not in contig_user_ids:
#         contig_user_ids[sample["userId"]] = available_user_id
#         available_user_id += 1
#     movie_id_to_title_genre[sample["movieId"]] = (sample["title"], sample["genres"])

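, and then define `_transform`, which uses those dictionaries to remap movie and user ids for a batch of data. In this version of the notebook the eager loop above is commented out: `_transform` is an instance of a `Transform` class that populates the dictionaries lazily as batches arrive, guarded by a lock, and the hard-coded id counts below are used later to size the embedding tables in advance. While we're at it, we'll also have `_transform` reformat the batch as tensors representing user ids, movie ids, and labels (numerical movie ratings given by users).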

In [11]:
unique_user_ids_num = 162541
unique_movie_ids_num = 59047

In [12]:
import torch
from threading import Lock

class Transform(object):
    def __init__(self):
        self._contig_movie_ids = {}
        self._contig_user_ids = {}
        self._movie_id_to_title_genre = {}
        
        self._lock = Lock()
        self._available_movie_id = 0
        self._available_user_id = 0
    
    def __call__(self, batch):
        self._update_ids(batch)
        return self._transform(batch)
        
    def _update_ids(self, batch):
        with self._lock:
            for sample in batch:
                if sample["movieId"] not in self._contig_movie_ids:
                    self._contig_movie_ids[sample["movieId"]] = self._available_movie_id
                    self._available_movie_id += 1
                if sample["userId"] not in self._contig_user_ids:
                    self._contig_user_ids[sample["userId"]] = self._available_user_id
                    self._available_user_id += 1
                self._movie_id_to_title_genre[sample["movieId"]] = (sample["title"], sample["genres"])
    
    def _transform(self, batch):
        user_ids = torch.tensor([self._contig_user_ids[sample["userId"]] for sample in batch], dtype=torch.int32)
        movie_ids = torch.tensor([self._contig_movie_ids[sample["movieId"]] for sample in batch], dtype=torch.int32)
        labels = torch.tensor([sample["rating"] for sample in batch], dtype=torch.float)
        return user_ids, movie_ids, labels
            

# def _transform(batch):
#     user_ids = torch.tensor([contig_user_ids[sample["userId"]] for sample in batch], dtype=torch.int32)
#     movie_ids = torch.tensor([contig_movie_ids[sample["movieId"]] for sample in batch], dtype=torch.int32)
#     labels = torch.tensor([sample["rating"] for sample in batch], dtype=torch.float)
#     return user_ids, movie_ids, labels

_transform = Transform()
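The remapping idea can be seen in isolation with a small sketch. The helper `remap_first_seen` below is hypothetical (it is not part of the notebook or of torchrec) and uses plain lists instead of tensors; it assigns the next free contiguous id to each raw id the first time that id appears, which is the same rule `Transform._update_ids` follows.

```python
def remap_first_seen(raw_ids, mapping):
    """Hypothetical helper: map raw ids to contiguous ids in first-seen order."""
    for raw_id in raw_ids:
        if raw_id not in mapping:
            # The next free contiguous id equals the number of ids seen so far.
            mapping[raw_id] = len(mapping)
    return [mapping[raw_id] for raw_id in raw_ids]


movie_mapping = {}
first = remap_first_seen([296, 306, 296, 5952], movie_mapping)
second = remap_first_seen([5952, 1, 306], movie_mapping)
print(first)   # [0, 1, 0, 2]
print(second)  # [2, 3, 1]
```

Because ids are handed out in first-seen order, the mapping depends on the order in which batches arrive, and the embedding tables must have at least as many rows as there are distinct raw ids. That is presumably why the user and movie counts are fixed ahead of time rather than read from the dictionaries.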

Finally, we configure our training and validation datapipes to apply `_transform` to each batch of data using `map`.

In [13]:
preproc_train_dp = batched_train_dp.map(_transform)
preproc_val_dp = batched_val_dp.map(_transform)

At this point, `preproc_train_dp` and `preproc_val_dp` are set up to produce the data that our model expects.

In [14]:
next(iter(preproc_train_dp))

(tensor([  0,   0,   1,  ...,  27,  53, 109], dtype=torch.int32),
 tensor([   0,    1,    2,  ...,  515, 3289, 1688], dtype=torch.int32),
 tensor([3.5000, 4.0000, 5.0000,  ..., 4.5000, 4.5000, 5.0000]))

## 2. Defining model

Next, we define the model we're going to train. We'll go with a simplified two-tower model, `TwoTowerModel`, resembling a matrix factorization model that attempts to learn a low-rank approximation of the user-movie ratings matrix $A \in \mathbb{R}^{u \times m}$, where $u$ is the number of users and $m$ the number of movies. More specifically, we want to find matrices $U \in \mathbb{R}^{u \times d}$ and $M \in \mathbb{R}^{m \times d}$ such that $U M^T \approx A$, where each row in $U$ represents a user embedding of dimension $d$ and each row in $M$ a movie embedding also of dimension $d$. Once we find matrices $U$ and $M$, we can infer the rating that the $i$-th user gives the $j$-th movie as $u_i^T m_j$, i.e. the dot product of the $i$-th row of $U$ and the $j$-th row of $M$ (writing those rows as column vectors $u_i$ and $m_j$).

`TwoTowerModel` represents $U$ and $M$ as embedding tables — instances of `torch.nn.Embedding`.

In [15]:
class TwoTowerModel(torch.nn.Module):
    def __init__(self, num_embeddings_0, num_embeddings_1, embedding_dim):
        super().__init__()
        self.model_0 = torch.nn.Embedding(num_embeddings_0, embedding_dim)
        self.model_1 = torch.nn.Embedding(num_embeddings_1, embedding_dim)
    
    def forward(self, input):
        embeddings_0 = self.model_0(input[0])
        embeddings_1 = self.model_1(input[1])
        return torch.sum(embeddings_0 * embeddings_1, axis=1)
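
As a sanity check on the link between the row-wise dot product in `forward` and the matrix product $U M^T$, here is a minimal sketch on tiny, randomly initialized tables. The sizes and the ids in the batch are arbitrary assumptions chosen for illustration.

```python
import torch

torch.manual_seed(0)
num_users, num_movies, dim = 4, 5, 3

# Embedding tables play the role of U (users) and M (movies).
user_table = torch.nn.Embedding(num_users, dim)
movie_table = torch.nn.Embedding(num_movies, dim)

# A batch of (user, movie) pairs, given as contiguous ids.
users = torch.tensor([0, 2, 3])
movies = torch.tensor([1, 1, 4])

with torch.no_grad():
    # Row-wise dot product: the same computation as TwoTowerModel.forward.
    pairwise = torch.sum(user_table(users) * movie_table(movies), dim=1)
    # Full predicted ratings matrix U M^T, then pick out the same entries.
    full = user_table.weight @ movie_table.weight.T
    picked = full[users, movies]

print(torch.allclose(pairwise, picked, atol=1e-6))
```

The model never materializes the full $u \times m$ matrix; it only computes the entries that appear in a batch, which is what keeps training feasible at MovieLens-25M scale.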

## 3. Training and evaluating model
We're ready to train our model. Let's instantiate the model we just defined

In [16]:
model = TwoTowerModel(
    unique_user_ids_num, #len(contig_user_ids),
    unique_movie_ids_num, #len(contig_movie_ids),
    32
)

, instantiate our loss function and optimizer

In [17]:
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-6)

, and define our train and test loops.

In [18]:
def train_loop(dp, model, loss_fn, optimizer):
    for batch, (users, movies, labels) in enumerate(dp):
        pred = model((users, movies))
        loss = loss_fn(pred, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            print(f"loss: {loss.item():>7f}; batch: {batch}")

def test_loop(dp, model, loss_fn):
    test_loss = 0
    batch_count = 0
    with torch.no_grad():
        for batch, (users, movies, labels) in enumerate(dp):
            pred = model((users, movies))
            test_loss += loss_fn(pred, labels).item()
            batch_count += 1
    
    print(f"Test loss: {test_loss / batch_count}")

And now, we train.

In [19]:
epochs = 3

for __ in range(epochs):
    train_loop(preproc_train_dp, model, loss_fn, optimizer)
    test_loop(preproc_val_dp, model, loss_fn)

loss: 45.422134; batch: 0
loss: 38.413521; batch: 100
loss: 34.630333; batch: 200
loss: 30.334789; batch: 300
loss: 30.588533; batch: 400
loss: 27.773775; batch: 500
loss: 27.661900; batch: 600
loss: 26.092945; batch: 700
loss: 26.207884; batch: 800
loss: 24.714758; batch: 900
loss: 23.758249; batch: 1000
loss: 23.802877; batch: 1100
loss: 22.610825; batch: 1200
loss: 22.531580; batch: 1300
loss: 20.955700; batch: 1400
loss: 21.457108; batch: 1500
loss: 21.612574; batch: 1600
loss: 20.770386; batch: 1700
loss: 20.645947; batch: 1800
loss: 20.181532; batch: 1900
loss: 20.713175; batch: 2000
loss: 19.967794; batch: 2100
loss: 19.591295; batch: 2200
loss: 20.147646; batch: 2300
loss: 19.731419; batch: 2400
loss: 20.435057; batch: 2500
loss: 19.381195; batch: 2600
loss: 20.029749; batch: 2700
Test loss: 18.510868870354944
loss: 17.948822; batch: 0
loss: 17.586926; batch: 100
loss: 17.682665; batch: 200
loss: 17.496422; batch: 300
loss: 17.567238; batch: 400
loss: 17.433815; batch: 500
loss

We've got a trained model!
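
Since `MSELoss` reports squared error, it can help to take the square root so that the number is in the same units as the ratings. The sketch below does this for the validation loss printed after the first epoch; the log above is truncated, so it says nothing about later epochs.

```python
import math

# Validation MSE printed after the first epoch in the training log above.
val_mse = 18.510868870354944

# Root-mean-square error, in the same units as the ratings (stars).
val_rmse = math.sqrt(val_mse)
print(f"validation RMSE: {val_rmse:.2f}")  # validation RMSE: 4.30
```

MovieLens ratings run from 0.5 to 5.0, so an RMSE of this size after one epoch suggests the model is still far from a good fit at that point.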

## 4. Finding similar movies
For kicks, let's see if we can use our model's trained embeddings to find movies that are most similar to some query movie. In theory, movies with embeddings that are similar should themselves be similar.

In [23]:
movie_id_to_title_genre = _transform._movie_id_to_title_genre
contig_to_movie_id = {v: k for k, v in _transform._contig_movie_ids.items()}
contig_movie_ids = _transform._contig_movie_ids

In [24]:
# contig_to_movie_id = {v: k for k, v in contig_movie_ids.items()}

def get_topk_sim_movies(movie_id, k=20):
    # Look up the query movie's embedding and the full movie embedding table.
    embedding = model.model_1(torch.tensor([contig_movie_ids[movie_id]]))
    movie_embeddings = model.get_parameter("model_1.weight")
    # Cosine similarity between the query and every movie; the clamp guards
    # against division by zero for all-zero embeddings.
    norms = torch.norm(embedding) * torch.norm(movie_embeddings, dim=1)
    movie_similarities = torch.sum(embedding * movie_embeddings, dim=1) / torch.clamp(norms, min=1e-12)
    # Use the k argument rather than a hard-coded 20.
    topk_sim = torch.topk(movie_similarities, k)
    contig_ids = topk_sim.indices.tolist()
    return [
        (*movie_id_to_title_genre[contig_to_movie_id[contig_id]], contig_to_movie_id[contig_id])
        for contig_id in contig_ids
    ]
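
The score computed in `get_topk_sim_movies` is cosine similarity. As a rough check, the sketch below compares that manual formula with `torch.nn.functional.cosine_similarity` on a small random table; the table, its size, and the choice of row 2 as the query are assumptions made purely for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
table = torch.randn(6, 4)  # stand-in for the movie embedding table
query = table[2:3]         # embedding of the "query movie", shape (1, 4)

# Manual cosine similarity, mirroring the formula in get_topk_sim_movies.
norms = torch.norm(query) * torch.norm(table, dim=1)
manual = torch.sum(query * table, dim=1) / torch.clamp(norms, min=1e-12)

# Built-in equivalent; expand the query so both inputs share a shape.
builtin = F.cosine_similarity(query.expand_as(table), table, dim=1)

top = torch.topk(manual, k=3)
print(torch.allclose(manual, builtin, atol=1e-6))
print(top.indices[0].item())  # index of the best match: the query row itself
```

The query always ranks first because its cosine similarity with itself is 1, which is why each result list below starts with the query movie.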

In [25]:
# Drive
get_topk_sim_movies(88129)

[('Drive (2011)', 'Crime|Drama|Film-Noir|Thriller', 88129),
 ('The Hunger Games (2012)', 'Action|Adventure|Drama|Sci-Fi|Thriller', 91500),
 ('Social Network, The (2010)', 'Drama', 80463),
 ('Prometheus (2012)', 'Action|Horror|Sci-Fi|IMAX', 94864),
 ('Harry Potter and the Order of the Phoenix (2007)',
  'Adventure|Drama|Fantasy|IMAX',
  54001),
 ('Looper (2012)', 'Action|Crime|Sci-Fi', 96610),
 ('The Butterfly Effect (2004)', 'Drama|Sci-Fi|Thriller', 7254),
 ('In Bruges (2008)', 'Comedy|Crime|Drama|Thriller', 57669),
 ('Kiss Kiss Bang Bang (2005)', 'Comedy|Crime|Mystery|Thriller', 38061),
 ('X-Men: First Class (2011)', 'Action|Adventure|Sci-Fi|Thriller|War', 87232),
 ('Old Boy (2003)', 'Mystery|Thriller', 27773),
 ('Lord of War (2005)', 'Action|Crime|Drama|Thriller|War', 36529),
 ('John Wick (2014)', 'Action|Thriller', 115149),
 ('Birdman: Or (The Unexpected Virtue of Ignorance) (2014)',
  'Comedy|Drama',
  112183),
 ('District 9 (2009)', 'Mystery|Sci-Fi|Thriller', 70286),
 ('Moon (2009

In [26]:
# Lost in Translation
get_topk_sim_movies(6711)

[('Lost in Translation (2003)', 'Comedy|Drama|Romance', 6711),
 ('Who Framed Roger Rabbit? (1988)',
  'Adventure|Animation|Children|Comedy|Crime|Fantasy|Mystery',
  2987),
 ('Top Gun (1986)', 'Action|Romance', 1101),
 ('Memento (2000)', 'Mystery|Thriller', 4226),
 ('Shining, The (1980)', 'Horror', 1258),
 ('High Fidelity (2000)', 'Comedy|Drama|Romance', 3481),
 ('Jaws (1975)', 'Action|Horror', 1387),
 ('Psycho (1960)', 'Crime|Horror', 1219),
 ('Groundhog Day (1993)', 'Comedy|Fantasy|Romance', 1265),
 ('Exorcist, The (1973)', 'Horror|Mystery', 1997),
 ('Ice Age (2002)', 'Adventure|Animation|Children|Comedy', 5218),
 ('Platoon (1986)', 'Drama|War', 1090),
 ('Wizard of Oz, The (1939)', 'Adventure|Children|Fantasy|Musical', 919),
 ('Seven Samurai (Shichinin no samurai) (1954)',
  'Action|Adventure|Drama',
  2019),
 ('Scream (1996)', 'Comedy|Horror|Mystery|Thriller', 1407),
 ('V for Vendetta (2006)', 'Action|Sci-Fi|Thriller|IMAX', 44191),
 ('Willy Wonka & the Chocolate Factory (1971)',
  'C

In [27]:
# Ratatouille
get_topk_sim_movies(50872)

[('Ratatouille (2007)', 'Animation|Children|Drama', 50872),
 ('Iron Man (2008)', 'Action|Adventure|Sci-Fi', 59315),
 ('Ice Age (2002)', 'Adventure|Animation|Children|Comedy', 5218),
 ('Dark Knight Rises, The (2012)', 'Action|Adventure|Crime|IMAX', 91529),
 ('Incredibles, The (2004)',
  'Action|Adventure|Animation|Children|Comedy',
  8961),
 ('Shrek 2 (2004)',
  'Adventure|Animation|Children|Comedy|Musical|Romance',
  8360),
 ('Interstellar (2014)', 'Sci-Fi|IMAX', 109487),
 ("Ocean's Twelve (2004)", 'Action|Comedy|Crime|Thriller', 8984),
 ('District 9 (2009)', 'Mystery|Sci-Fi|Thriller', 70286),
 ('Toy Story 3 (2010)',
  'Adventure|Animation|Children|Comedy|Fantasy|IMAX',
  78499),
 ('Collateral (2004)', 'Action|Crime|Drama|Thriller', 8798),
 ('Finding Nemo (2003)', 'Adventure|Animation|Children|Comedy', 6377),
 ('X-Men: First Class (2011)', 'Action|Adventure|Sci-Fi|Thriller|War', 87232),
 ('Sherlock Holmes (2009)', 'Action|Crime|Mystery|Thriller', 73017),
 ('Last Samurai, The (2003)', '

What do you think? Can we do better?