In this IPython notebook we demonstrate how to use graph techniques to improve matrix factorization for movie recommendation.

In our blogpost at [insert link] we provide more details. 

The underlying dataset is Kaggle's "The Movies Dataset": https://www.kaggle.com/rounakbanik/the-movies-dataset

This colab is structured as follows: 

0. Setup: make sure to upload an API token from kaggle into colab to load the dataset
1. [Load the data and build a NetworkX graph from it](#Load-Data)
2. [Split Edges](#Split-Edges)
3. [Simple Embedding (Matrix factorization) model](#Simple-Embedding-Model-(Matrix-Factorization))
4. [Improve model by changing loss function to BRP](#Simple-Embedding-Model-with-BRP-Loss)
5. [Improve model by smoothing the embeddings via LightGCN](#Embedding-Smoothing-with-LGCN)
6. [Summary and Outlook](#Summary-and-Outlook)

We recommend using a GPU for this Colab.

Please click Runtime and then Change runtime type. Then set the hardware accelerator to <b>GPU</b>.

### Setup

In [None]:
# install requirements 
! pip install dask
! pip install 'fsspec>=0.3.3'
! pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.10.0+cpu.html
! pip install kaggle

##### Before executing the following cell: download an API token from your Kaggle account (you can find it under Account settings) and then upload the kaggle.json file into Colab.
https://www.analyticsvidhya.com/blog/2021/06/how-to-load-kaggle-datasets-directly-into-google-colab/

In [None]:
# download the movies data
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! mkdir -p data

import kaggle
kaggle.api.dataset_download_files('rounakbanik/the-movies-dataset', path='data/', unzip=True)

In [None]:
# download supporting files (evaluation.py, utils.py) 
# from git (contains helper functions etc.)
! curl -o evaluation.py https://raw.githubusercontent.com/hlgchen/movie_recommendations_cs224w/main/colab/evaluation.py
! curl -o utils.py https://raw.githubusercontent.com/hlgchen/movie_recommendations_cs224w/main/colab/utils.py

In [None]:
import pandas as pd
import dask.dataframe as dd

import networkx as nx
import torch
from torch import nn
from torch.optim import Adam

from torch_scatter import scatter_add
from torch_geometric.utils.num_nodes import maybe_num_nodes
from torch_geometric.nn.conv import MessagePassing

from matplotlib import pyplot as plt

import evaluation
from utils import *

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1. Load Data <a name="Load-Data"></a>

In [None]:
ratings = pd.read_csv("data/ratings.csv")
ratings.columns = ratings.columns.str.lower()

# limit number of users and thus limit the dataset size 
# (otherwise graphs get too big for colab)
ratings = ratings.loc[ratings.userid < 1500].copy()

In [None]:
ratings.head()

### Map userid and movie_id to index

For subsequent models, it is helpful to map users and movies to the same index space.
We sort users by userid and movies by movieid. We map the concatenation of user and movies to nodeids. 

Example: if our data consists of users [5, 3] and movies [2321, 5], we would have the following mapping:
nodeid_userid = {1:3, 2:5},
nodeid_movieid = {3:5, 4:2321}

Note that we will only recommend movies that at least one user in the dataset has watched.
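
The mapping described above can be sketched in a few lines. This is only an illustration: `build_mapping` and its `start` parameter are hypothetical names, and the actual `get_mapping` helper in `utils.py` may be implemented differently (for instance, it may start numbering at a different value).

```python
def build_mapping(user_ids, movie_ids, start=1):
    """Map sorted users, then sorted movies, onto one consecutive node-id range.

    Hypothetical helper for illustration; `start` is the first node id to assign.
    """
    users = sorted(set(user_ids))
    movies = sorted(set(movie_ids))

    # users occupy the first block of node ids
    nodeid_userid = {start + i: u for i, u in enumerate(users)}

    # movies continue right after the last user node id
    offset = start + len(users)
    nodeid_movieid = {offset + i: m for i, m in enumerate(movies)}
    return nodeid_userid, nodeid_movieid


toy_nodeid_userid, toy_nodeid_movieid = build_mapping([5, 3], [2321, 5])
print(toy_nodeid_userid)   # {1: 3, 2: 5}
print(toy_nodeid_movieid)  # {3: 5, 4: 2321}
```

Because users and movies share one id space, a user id and a movie id that happen to be equal (5 in the example) still end up as two distinct nodes.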

In [None]:
nodeid_userid, nodeid_movieid, userid_nodeid, movieid_nodeid = get_mapping(ratings)

### Transform Data to Graph

In [None]:
%%time
# use dask for parallel computing (might not make a huge difference on colab)
ddata = dd.from_pandas(ratings, npartitions=10)

def create_edge(x): 
    return (userid_nodeid[int(x.userid)], movieid_nodeid[int(x.movieid)], x.rating)

edges = ddata.map_partitions(lambda df: df.apply((lambda row: create_edge(row)), axis=1)).compute() 
edges = edges.tolist() # list of tuples [(node1, node2, weight)]

In [None]:
%%time
G = nx.Graph()  # nx.Graph is undirected by default
G.add_weighted_edges_from(edges)

### Calculate some Summary Statistics

Since G is a networkx graph, we can use an array of networkx graph functions on it

In [None]:
print("number of nodes:", G.number_of_nodes())
print("number of edges:", G.number_of_edges())
avg_degree = 2 * G.number_of_edges() / G.number_of_nodes()
print("average node degree:", avg_degree)
print("density of network:", nx.density(G))

### Visualization of Subset of Data

Visualize the subgraph induced by nodes 2 and 4 and all neighbors of node 2 (take these nodes and all edges that exist between them in G).
Note that 2 and 4 are users and all neighbors of 2 are movies. 

In [None]:
nodes_plot = list(G.neighbors(2)) + [2, 4]
S = G.subgraph(nodes_plot)

In [None]:
values = [0 if node in nodeid_userid else 1 for node in S.nodes()]  # 0 = user, 1 = movie
pos = {2: (3, 1), 4: (7, 1)}
pos.update({n: (i, 2) for i, n in enumerate(S.neighbors(2))})

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))
nx.draw(
    S,
    cmap=plt.get_cmap("Greys"),
    pos=pos,
    node_color=values,
    with_labels=True,
    font_color="orange",
    node_size=1000,
    edgecolors="black",
    vmin=0,
    vmax=1,
    ax=ax,
)
ax.set_title("bipartite subgraph of G")
plt.show()

# 2. Split Edges <a name="Split-Edges"></a>

We split edges into training, validation and test edges. During training only training edges will be known to the model (note that we don't do any splits with nodes). During validation/test the model trained with training edges makes recommendations which are benchmarked against validation/test edges. 
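
As a rough illustration of such a split, the sketch below shuffles a toy edge list and cuts it according to the given fractions. `split_edges` is a hypothetical stand-in; the `transductive_edge_split` helper from `utils.py` used in the next cell may work differently in its details.

```python
import random


def split_edges(edge_list, split_dict, seed=0):
    """Shuffle edges and cut them into train/valid/test parts (illustrative only)."""
    rng = random.Random(seed)
    shuffled = list(edge_list)
    rng.shuffle(shuffled)

    n = len(shuffled)
    n_train = int(split_dict["train"] * n)
    n_valid = int(split_dict["valid"] * n)

    # whatever is left after train and valid goes to test
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],
    }


# 4 toy users (nodes 0-3) connected to 5 toy movies (nodes 4-8): 20 edges
toy_edges = [(u, m) for u in range(4) for m in range(4, 9)]
toy_split = split_edges(toy_edges, {"train": 0.75, "valid": 0.1, "test": 0.15}, seed=825)
print({key: len(part) for key, part in toy_split.items()})
# {'train': 15, 'valid': 2, 'test': 3}
```

All nodes stay visible in every split; only the edges are partitioned, which is what makes the setting transductive.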

In [None]:
pos_edge_list = graph_to_edge_list(G)

# split edges
split_dict = {"train": 0.75, "valid": 0.1, "test": 0.15}
edges = transductive_edge_split(pos_edge_list, split_dict, seed=825)

# 3. Simple Embedding Model (Matrix Factorization) <a name="Simple-Embedding-Model-(Matrix-Factorization)"></a>

The baseline model for our blogpost is a simple embedding model which is equivalent to the popular Matrix Factorization model. 

We try to learn embeddings for users and movies such that the dot product of a user embedding with a movie embedding is high if an edge between them exists and low otherwise.

For the loss function we will use the binary cross entropy loss. 
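
To make the scoring concrete, here is a minimal sketch on a toy graph with four nodes. The embedding size, node ids and labels are made up for illustration and are not the values used in this notebook.

```python
import torch
from torch import nn

torch.manual_seed(0)

# toy setup (assumption): nodes 0 and 1 are users, nodes 2 and 3 are movies
toy_emb = nn.Embedding(num_embeddings=4, embedding_dim=3)

toy_edge_index = torch.tensor([[0, 1],    # user node of each edge
                               [2, 3]])   # movie node of each edge
toy_labels = torch.tensor([1.0, 0.0])     # first edge exists, second does not

vectors = toy_emb(toy_edge_index)               # shape (2, n_edges, embedding_dim)
scores = (vectors[0] * vectors[1]).sum(dim=1)   # one dot product per edge
probs = torch.sigmoid(scores)                   # squash scores into (0, 1)

loss = nn.BCELoss()(probs, toy_labels)
print(probs.shape, loss.item())
```

Minimizing this loss pushes the dot product up for existing edges and down for non-existing ones, which is the behaviour described above.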

### Model Class definition

In [None]:
class SimpleEmbedding(nn.Module):
    """Simple embedding model for movie recommendations. The proximity of movies and users
    is calculated as the scalar product between the embeddings.
    """

    def __init__(self, emb):

        super(SimpleEmbedding, self).__init__()

        self.emb = emb
        self.sigmoid = nn.Sigmoid()

    def forward(self, edges):
        embedded_nodes = self.emb(edges) # replaces each scalar k in edges with the embedding at row k
        s_product = torch.mul(embedded_nodes[0], embedded_nodes[1]).sum(axis=1)
        out = self.sigmoid(s_product)
        return out

    def recommend(self, edges):
        """
        Takes possible edges and predicts how likely it is to exist,
        i.e. estimates how close both nodes in the edge are in terms of
        their embedding.
        Function needed for evaluation.

        Returns sorted edges (by prediction score) and the sorted prediction score
        """
        pred = self.forward(edges)
        ranking = torch.argsort(pred, descending=True)
        return edges[:, ranking], pred[ranking]

### Create Negative Samples and Labels

In order to train the model we will need negative edges (edges that don't exist). The model subsequently tries to assign high scores to edges that exist, and low scores to those that don't.
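
One simple way to obtain negative edges is rejection sampling: draw random (user, movie) pairs and keep those that are not in the graph. The sketch below illustrates the idea on toy data; `sample_negatives` is a hypothetical function, and the `sample_negative_edges` helper from `utils.py` used in the next cell may follow a different strategy.

```python
import random


def sample_negatives(pos_edges, users, movies, n_samples, seed=0):
    """Sample distinct (user, movie) pairs that are not positive edges (illustrative only)."""
    rng = random.Random(seed)
    existing = set(pos_edges)

    # guard against asking for more negatives than exist
    max_negatives = len(users) * len(movies) - len(existing)
    if n_samples > max_negatives:
        raise ValueError("not enough non-existing edges to sample from")

    negatives = set()
    while len(negatives) < n_samples:
        candidate = (rng.choice(users), rng.choice(movies))
        if candidate not in existing:
            negatives.add(candidate)
    return sorted(negatives)


toy_pos = [(0, 2), (0, 3), (1, 4)]  # toy users 0-1, toy movies 2-4
toy_neg = sample_negatives(toy_pos, users=[0, 1], movies=[2, 3, 4], n_samples=2)
print(toy_neg)
```

In a sparse graph like the ratings graph almost every random pair is a non-edge, so the rejection loop rarely has to retry.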

In [None]:
pos_edge_index = dict()
neg_edge_index = dict()
pos_label = dict()
neg_label = dict()

for key, ls in edges.items():
    pos_edge_index[key] = edge_list_to_tensor(ls)

    neg_edge_list = sample_negative_edges(G, len(ls))
    neg_edge_index[key] = edge_list_to_tensor(neg_edge_list)

    pos_label[key] = torch.ones(pos_edge_index[key].shape[1])
    neg_label[key] = torch.zeros(neg_edge_index[key].shape[1])

### Train Embeddings

#### Training Loop

In [None]:
def train(
    model,
    train_label,
    train_edge,
    valid_label=None,
    valid_edge=None,
    epochs=5000,
    early_stopping=3,
):

    """
    Training loop for SimpleEmbedding Model.

    Params:
        - model: SimpleEmbedding Model
        - train_label: torch.Tensor with labels corresponding to train_edges
                        shape: ([num_pos_edges + num_neg_edges])
        - train_edge: torch.Tensor with training edges (should be in same order as train_label)
                        shape: ([2, num_pos_edges + num_neg_edges])
        - valid_label: analogous to train_label
        - valid_edge: analogous to train_edge
        - epochs: number of maximum epochs to train
        - early_stopping: (int) if this value is greater than 0, training is stopped if the
                            validation accuracy goes down "early_stopping" times in a row.
    """

    learning_rate = 0.003
    optimizer = Adam(model.parameters(), lr=learning_rate)
    loss_fn = nn.BCELoss()

    decreasing = 0
    valid_accuracy = 0

    for i in range(epochs):
        optimizer.zero_grad()

        pred = model(train_edge)
        loss = loss_fn(pred, train_label)

        loss.backward()
        optimizer.step()

        if valid_edge is not None:
            pred_validation = model(valid_edge)
            valid_accuracy_new = accuracy(pred_validation, valid_label)
            if early_stopping > 0:
                if valid_accuracy_new < valid_accuracy:
                    decreasing += 1
                else:
                    decreasing = 0
                if decreasing == early_stopping:
                    break
            valid_accuracy = valid_accuracy_new

        if i % 500 == 0:
            print_message = f"epoch {i}: loss is: {loss:.3f}, accuracy train: {accuracy(pred, train_label)}"
            if valid_edge is not None:
                print_message += f" valid: {valid_accuracy}"
            print(print_message)

#### Initialize Embedding Training

In [None]:
torch.manual_seed(1)
emb = create_node_emb(num_node=G.number_of_nodes()).to(device)
simple_embedding_model = SimpleEmbedding(emb)

train_label = torch.cat([pos_label["train"], neg_label["train"]], dim=0).to(device)
train_edge = torch.cat([pos_edge_index["train"], neg_edge_index["train"]], dim=1).to(device)

valid_label = torch.cat([pos_label["valid"], neg_label["valid"]], dim=0).to(device)
valid_edge = torch.cat([pos_edge_index["valid"], neg_edge_index["valid"]], dim=1).to(device)

train(simple_embedding_model, train_label, train_edge, valid_label, valid_edge)

### Recall@100 on Testset

Recall@100 measures what percentage of the movies actually watched by a user appear in the top 100 recommendations made by the model.
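
For a single user the metric can be sketched as follows. `recall_at_k` is a hypothetical function for illustration; the `evaluation.avg_recall_at_k` helper used below averages over users and may differ in details, for example in how it treats users without test edges (returning 0.0 in that case is an assumption made here).

```python
def recall_at_k(ranked_movies, watched_movies, k):
    """Fraction of the watched movies that appear among the top-k ranked movies."""
    top_k = set(ranked_movies[:k])
    watched = set(watched_movies)
    if not watched:
        return 0.0  # assumption: users without held-out movies contribute 0
    return len(top_k & watched) / len(watched)


toy_ranked = [10, 11, 12, 13, 14]   # movies sorted by predicted score, best first
toy_watched = [11, 14, 99]          # movies the user actually watched (held out)

# only movie 11 is among the top 3, so recall@3 is 1/3
print(recall_at_k(toy_ranked, toy_watched, k=3))
```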

In [None]:
recall100_simple_embedding_model = evaluation.avg_recall_at_k(
    seen_edges=torch.cat([pos_edge_index["train"], pos_edge_index["valid"]], dim=1),
    test_edges=pos_edge_index["test"],
    model=simple_embedding_model,
    library=nodeid_movieid.keys(),
    users=nodeid_userid.keys(),
    k=100,
)

recall100_simple_embedding_model

# 4. Simple Embedding Model with BRP Loss <a name="Simple-Embedding-Model-with-BRP-Loss"></a>

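Because we care about Recall@k, the binary cross entropy loss might not be the best choice of loss function: it scores each edge in isolation, whereas Recall@k depends only on how movies are ranked for each user. The BRP loss (usually abbreviated BPR, for Bayesian Personalized Ranking) instead rewards the model for scoring a user's watched movies above unwatched ones, and with it we can achieve better performance.

For the BRP loss to work, we need a different way of sampling negative edges: negatives are drawn per user.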
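
As a point of reference, the textbook form of this ranking loss for one user can be sketched as below. `bpr_loss_sketch` is a hypothetical function operating on raw scores; the `brp_loss` helper from `utils.py` used in the training loop is not shown in this notebook and may be defined differently.

```python
import torch
import torch.nn.functional as F


def bpr_loss_sketch(pos_scores, neg_scores):
    """Pairwise ranking loss: -mean(log(sigmoid(pos - neg))) over all pos/neg pairs."""
    # broadcast to every (positive, negative) pair: shape (n_pos, n_neg)
    diff = pos_scores.unsqueeze(1) - neg_scores.unsqueeze(0)
    return -F.logsigmoid(diff).mean()


toy_pos_scores = torch.tensor([2.0, 1.5])        # scores of watched movies
toy_neg_scores = torch.tensor([0.5, -1.0, 0.0])  # scores of sampled unwatched movies

good = bpr_loss_sketch(toy_pos_scores, toy_neg_scores)  # positives ranked above negatives
bad = bpr_loss_sketch(toy_neg_scores, toy_pos_scores)   # ranking reversed
print(good.item(), bad.item())
```

The loss depends only on score differences within one user, which is why the negatives have to be sampled per user rather than globally.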

In [None]:
# negative edges will be sampled on the fly during training
pos_edge_index = dict()
for key, ls in edges.items():
    pos_edge_index[key] = edge_list_to_tensor(ls)

### Train Embeddings

#### Training Loop

In [None]:
def batch_train(
    model, train_edges, n_batches, valid_edges=None, epochs=181, early_stopping=2
):
    """
    Trains embeddings with BRP loss using user batches. Negative edges for each user
    are sampled on the fly.

    Params:
        - model: SimpleEmbedding Model
        - train_edges: torch.Tensor with shape (2, n_positive_edges).
                    Contains positive edges of training set.
        - n_batches: number of user batches.
        - valid_edges: analogous to train_edges
        - epochs: number of maximum epochs to train
        - early_stopping: (int) if this value is greater than 0, training is stopped if the
                            validation recall@100 (checked every 10 epochs) fails to improve
                            "early_stopping" times in a row.

    """

    learning_rate = 0.003
    optimizer = Adam(model.parameters(), lr=learning_rate)  # optimize the model passed in, not the global emb

    users, unique_users, index = get_pos_edges_users(train_edges)
    _, unique_movies, _ = get_pos_edges_movies(train_edges)

    decreasing = 0
    valid_recall_k = 0

    for i in range(epochs):
        user_batches = user_batch_generator(unique_users, n_batches)
        for batch in user_batches:
            optimizer.zero_grad()
            user_losses = []
            for u in batch:

                # get positive edges and sample 10 negative edges for each user
                # negative edges: (user, movie) pairs where the movie hasn't been watched by the user
                pos_edges_user, neg_edges_user = get_pos_neg_edges_for_user(
                    edges=train_edges,
                    users=users,
                    u=u,
                    unique_movies_set=set(unique_movies),
                )

                # make predictions and calculate loss
                f_pos = model.forward(pos_edges_user)
                f_neg = model.forward(neg_edges_user)

                # individual brp loss for one user
                ul = brp_loss(f_pos, f_neg)
                user_losses.append(ul)

            batch_loss = torch.stack(user_losses).mean()

            batch_loss.backward()
            optimizer.step()

        if i % 10 == 0:

            if valid_edges is not None:
                valid_recall_k_new = evaluation.avg_recall_at_k(
                    seen_edges=train_edges,
                    test_edges=valid_edges,
                    model=model,
                    library=nodeid_movieid.keys(),
                    users=nodeid_userid.keys(),
                    k=100,
                )
                if early_stopping > 0:
                    if valid_recall_k_new <= valid_recall_k:
                        decreasing += 1
                    else:
                        decreasing = 0
                    if decreasing == early_stopping:
                        break
                valid_recall_k = valid_recall_k_new

            print(
                f"epoch {i}: loss is: {batch_loss}, valid recall@100: {valid_recall_k}"
            )

#### Initialize Embedding Training

In [None]:
torch.manual_seed(1)
emb = create_node_emb(num_node=G.number_of_nodes()).to(device)

embedding_brp_model = SimpleEmbedding(emb)

# takes around 7min on colab with GPU 
batch_train(
    embedding_brp_model,
    pos_edge_index["train"],
    n_batches=100,
    valid_edges=pos_edge_index["valid"],
    early_stopping=2,
)

### Recall@100 on Testset

In [None]:
recall100_embedding_brp_model = evaluation.avg_recall_at_k(
    seen_edges=torch.cat([pos_edge_index["train"], pos_edge_index["valid"]], dim=1),
    test_edges=pos_edge_index["test"],
    model=embedding_brp_model,
    library=nodeid_movieid.keys(),
    users=nodeid_userid.keys(),
    k=100,
)

recall100_embedding_brp_model

# 5. Embedding Smoothing with LGCN <a name="Embedding-Smoothing-with-LGCN"></a>
Nodes that are close to each other tend to have similar properties. Using PyG we can implement LightGCN (LGCN) and improve our recommendations. Mathematically, we calculate the smoothed embedding matrix $\bar{H}$:

$$\bar{H} = \frac{1}{K+1}\sum_{k=0}^{K} H^{(k)}$$

Here $H^{(k)}$ is the embedding matrix after $k$ diffusion steps, with the embeddings of the nodes as rows, and $H^{(0)}$ is the embedding matrix learned by the base model. The final embedding acquired from an LGCN with $K$ layers is the average over the embeddings after $k$-step diffusion, $k \in \{0, \dots, K\}$.

In particular, for a single node $u$ the embedding at step $k+1$ is calculated as follows: $$h^{(k+1)}_u =\sum_{m \in N(u)} \frac{w_{u,m}}{\sqrt{d_u}\sqrt{d_m}} h_m^{(k)}$$

Here $w_{u,m}$ is the weight of the edge between $u$ and $m$ (1 if no edge weights are used) and $d_u$ is the (weighted) degree of node $u$.

If we interpret this as a GNN, $\frac{w_{u,m}}{\sqrt{d_u}\sqrt{d_m}} h_m^{(k)}$ is the message and the sum is the aggregation. In contrast to typical GNNs, LGCN doesn't use non-linearities. In addition, we decided not to use any learnable weights, so the model is very simple and easy to implement (no training needed).



We can interpret this process as embeddings being smoothed within neighborhoods (see the blogpost for more details).
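
The smoothing can be checked on a tiny graph with dense matrices. The sketch below is only an illustration with a made-up 4-node graph and random embeddings; it averages the initial embedding together with the $K$ diffused versions, as the `LightGCN` module defined further down does via `torch.stack(out_ls).mean(dim=0)`.

```python
import torch

# toy bipartite graph (assumption): nodes 0, 1 are users and nodes 2, 3 are movies
A = torch.tensor([
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
])

# symmetric normalization D^-0.5 A D^-0.5
deg = A.sum(dim=1)
d_inv_sqrt = deg.pow(-0.5)
d_inv_sqrt[torch.isinf(d_inv_sqrt)] = 0.0  # isolated nodes would have degree 0
A_norm = d_inv_sqrt.unsqueeze(1) * A * d_inv_sqrt.unsqueeze(0)

torch.manual_seed(0)
H0 = torch.randn(4, 2)  # initial embeddings, one row per node
K = 3                   # number of diffusion steps (layers)

layers = [H0]
for _ in range(K):
    layers.append(A_norm @ layers[-1])  # one diffusion step, no weights, no non-linearity

H_final = torch.stack(layers).mean(dim=0)  # average over K + 1 matrices
print(H_final.shape)  # torch.Size([4, 2])
```

The message passing implementation below computes the same quantity but works on an edge list, which scales to graphs where a dense adjacency matrix would not fit in memory.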

In [None]:
def gcn_norm(edge_index, edge_weight=None, num_nodes=None):
    """
    Returns edge index and edge weight that correspond to the diffusion matrix
    D^-0.5 A D^-0.5. A is the (weighted) adjacency matrix.

    Params:
        - edge_index: tensor of shape (2, n_edges) containing edges
        - edge_weight: tensor of shape (n_edges, ) containing weights of each edge
        - num_nodes: integer with number of nodes of the graph

    Returns:
        - edge_index: edge index that was passed to the function
        - weight: tensor of shape (n_edges, ) containing weights that correspond to diffusion

    """
    num_nodes = maybe_num_nodes(edge_index, num_nodes)

    if edge_weight is None:
        edge_weight = torch.ones(
            (edge_index.size(1),), dtype=torch.float, device=edge_index.device
        )

    row, col = edge_index[0], edge_index[1]
    deg = scatter_add(edge_weight, col, dim=0, dim_size=num_nodes)
    deg_inv_sqrt = deg.pow_(-0.5)
    deg_inv_sqrt.masked_fill_(deg_inv_sqrt == float("inf"), 0)
    return edge_index, deg_inv_sqrt[row] * edge_weight * deg_inv_sqrt[col]

### Model Class definition

In [None]:
class LightGCNConv(MessagePassing):
    def __init__(self):
        super(LightGCNConv, self).__init__()

    def forward(self, x, edge_index, edge_weight=None):
        # propagate starts the message passing process
        out = self.propagate(edge_index, x=x, edge_weight=edge_weight)
        return out

    def message(self, x_j, edge_weight):
        # constructs message that is to be passed from neighbor node j to central node i
        # here message is value of x_j (embedding of x_j at given step)
        # multiplied with edge_weight (that was calculated by gcn_norm in this tutorial)
        return x_j if edge_weight is None else edge_weight.view(-1, 1) * x_j


class LightGCN(torch.nn.Module):
    def __init__(self, num_layers):

        super(LightGCN, self).__init__()

        self.convs = torch.nn.ModuleList([LightGCNConv() for i in range(num_layers)])

    def forward(self, x, adj_t, adj_t_is_undirected=True, edge_weight=None):
        
        """
        Diffuses x through the graph num_layers times using the diffusion matrix D^-0.5 A D^-0.5
        Params: 
            - x: torch.tensor with shape (n_nodes, ..), for our purposes
                    this will be the embedding matrix 
            - adj_t: torch tensor with edges for message propagation
                    has shape (2, num_edges_for_message_propagation)
                    Can be seen as different form of adjacency matrix. 
                    adj_t is renamed to edge_index to follow the PyG conventions.
            - adj_t_is_undirected: if true it means adj_t is a list that contains unordered edges 
                    (edges are not directed)
            - edge_weight: torch tensor with weights for each edge. 
                    Should have shape [adj_t.shape[1]].
        """

        # For message passing, Edges have to be directed
        # networkx graph edges are undirected by default
        # implementation assumes that there are no self loops
        # otherwise these would be duplicated
        if adj_t_is_undirected:
            adj_t_reversed = adj_t[[1, 0], :] # create reversed versions of edges
            adj_t = torch.cat([adj_t, adj_t_reversed], dim=1)

            if edge_weight is not None:
                edge_weight = torch.cat([edge_weight, edge_weight])

        out_ls = [x]
        edge_index, edge_weight = gcn_norm(
            adj_t, edge_weight=edge_weight, num_nodes=x.size(0)
        )

        for i in range(len(self.convs)):
            x = self.convs[i](x, edge_index, edge_weight)
            out_ls.append(x)
        return torch.stack(out_ls).mean(dim=0)

### Helper function

In [None]:
def get_lgcn_embedding_model(emb, message_edges, n_layers, edge_weight=None):
    """
    Returns Embedding model where embedding weights are the outcome of LGCN smoothing.
    params:
        - emb: Embedding to be smoothed with LGCN
        - message_edges: edges along which LGCN should pass embeddings for smoothing
        - n_layers: number of LGCN layers
        - edge_weight: if specified smoothing takes edge weight into account

    """
    lgcn = LightGCN(n_layers)
    res = lgcn.forward(emb.weight, message_edges, edge_weight=edge_weight)

    lgcn_emb = nn.Embedding(emb.num_embeddings, emb.embedding_dim).to(device)
    lgcn_emb.weight = nn.Parameter(res)

    lgcn_emb_model = SimpleEmbedding(lgcn_emb)
    return lgcn_emb_model


def get_best_lgcn_layer(emb, min_i=2, max_i=20, verbose=False):
    """
    Returns the number of layers with the best recall@100 on the validation set.
    Prints validation recall@100 for different layers (hyperparameter tuning of layer number)
    params:
        - emb: embedding to be passed along
        - min_i: minimum layer number (inclusive)
        - max_i: maximum layer number (exclusive, as used by range)
        - verbose: (boolean) if True outputs validation recall for each layer tried,
                    else only the best layer
    """
    best_recall = 0
    best_param = None
    for i in range(min_i, max_i):
        lgcn_emb_model = get_lgcn_embedding_model(
            emb=emb, message_edges=pos_edge_index["train"], n_layers=i
        )

        recall_validation = evaluation.avg_recall_at_k(
            seen_edges=pos_edge_index["train"],
            test_edges=pos_edge_index["valid"],
            model=lgcn_emb_model,
            library=nodeid_movieid.keys(),
            users=nodeid_userid.keys(),
            k=100,
        )
        if verbose:
            print(f"n_layer {i} : ", recall_validation)
        if recall_validation > best_recall:
            best_param = i
            best_recall = recall_validation
    print(f"best param: {best_param}")
    return best_param

#### Tune Number of LGCN Layers with Validation Set

In [None]:
n1 = get_best_lgcn_layer(simple_embedding_model.emb)

In [None]:
n2 = get_best_lgcn_layer(embedding_brp_model.emb)

### Improve Base Models

In [None]:
lgcn_simple_embedding_model = get_lgcn_embedding_model(
    emb=simple_embedding_model.emb,
    message_edges=torch.cat([pos_edge_index["train"], pos_edge_index["valid"]], dim=1),
    n_layers=n1,
)

lgcn_embedding_brp_model = get_lgcn_embedding_model(
    emb=embedding_brp_model.emb,
    message_edges=torch.cat([pos_edge_index["train"], pos_edge_index["valid"]], dim=1),
    n_layers=n2,
)

### Recall@100 on Testset

In [None]:
recall100_lgcn_simple_embedding_model = evaluation.avg_recall_at_k(
    seen_edges=torch.cat([pos_edge_index["train"], pos_edge_index["valid"]], dim=1),
    test_edges=pos_edge_index["test"],
    model=lgcn_simple_embedding_model,
    library=nodeid_movieid.keys(),
    users=nodeid_userid.keys(),
    k=100,
)

recall100_lgcn_simple_embedding_model

In [None]:
recall100_lgcn_embedding_brp_model = evaluation.avg_recall_at_k(
    seen_edges=torch.cat([pos_edge_index["train"], pos_edge_index["valid"]], dim=1),
    test_edges=pos_edge_index["test"],
    model=lgcn_embedding_brp_model,
    library=nodeid_movieid.keys(),
    users=nodeid_userid.keys(),
    k=100,
)

recall100_lgcn_embedding_brp_model

### Bonus: Possible Improvement of Embedding by Using Ratings

In [None]:
def get_ratings(edges):
    """Returns tensor of shape [(number of edges)] with edge weights for each edge."""
    ls = []
    for i in range(edges.shape[1]):
        edge = edges[:, i]
        r = G.get_edge_data(*edge.tolist())["weight"]
        ls.append(r)
    return torch.tensor(ls, device=device)


edge_w_ratings = get_ratings(
    torch.cat([pos_edge_index["train"], pos_edge_index["valid"]], dim=1)
)

In [None]:
lgcn_embedding_brp_model_ratings = get_lgcn_embedding_model(
    emb=embedding_brp_model.emb,
    message_edges=torch.cat([pos_edge_index["train"], pos_edge_index["valid"]], dim=1),
    n_layers=n2,
    edge_weight=edge_w_ratings,
)

In [None]:
recall100_lgcn_embedding_brp_model_ratings = evaluation.avg_recall_at_k(
    seen_edges=torch.cat([pos_edge_index["train"], pos_edge_index["valid"]], dim=1),
    test_edges=pos_edge_index["test"],
    model=lgcn_embedding_brp_model_ratings,
    library=nodeid_movieid.keys(),
    users=nodeid_userid.keys(),
    k=100,
)

recall100_lgcn_embedding_brp_model_ratings

# 6. Summary and Outlook <a name="Summary-and-Outlook"></a>

In [None]:
print("Recall@100 values for different models on test-set:")
print(f"Simple Embedding: {recall100_simple_embedding_model:.6f}")
print(f"Simple Embedding with BRP: {recall100_embedding_brp_model:.6f}")
print(f"LGCN Simple Embedding: {recall100_lgcn_simple_embedding_model:.6f}")
print(f"LGCN Simple Embedding with BRP: {recall100_lgcn_embedding_brp_model:.6f}")
print(
    f"LGCN Simple Embedding with BRP and Ratings as Weights: {recall100_lgcn_embedding_brp_model_ratings:.6f}"
)

### Example Predictions

In [None]:
meta = pd.read_csv("data/movies_metadata.csv")
meta.id = pd.to_numeric(meta.id, errors="coerce")
links = pd.read_csv("data/links.csv")

meta = meta.merge(links, left_on="id", right_on="tmdbId")[["movieId", "original_title"]]
meta.index = meta.movieId
movie_names = meta.drop_duplicates().original_title.to_dict()

In [None]:
def recommend_movies(user, model, seen_edges, test_edges, library, k):
    
    """
    Returns recommended movies, actually watched movies, 
    the intersection of the former two and movies watched during training. 
    Movies are expressed in words and not as ids. 
    
    Params: 
        - user: (int) nodeid of user
        - model: model to make recommendations
        - seen_edges: edges that contains user movie interactions of the past 
                    (for testing, this should be train and validation edges)
        - test_edges: movie user interactions that are unknown to the model
                    these are movies that the user actually watched which will 
                    be used for benchmarking the predictions. 
        - library: arraylike object with all movie ids
        - k: number of recommendations to be made
    
    """

    unwatched = evaluation.get_unwatched_movies(user, seen_edges, library)
    unwatched = unwatched.unsqueeze(0)  # shape (1, n_unwatched_movies)
    ranked_edges, _ = evaluation.recommend_for(user, unwatched, model)

    # note that by construction movies are at index position 1
    recommendations = ranked_edges[1, :k].tolist()
    watched_test = evaluation.get_watched_movies(user, test_edges)

    recommendations = [nodeid_movieid[n_id] for n_id in recommendations]
    watched_test = [nodeid_movieid[n_id] for n_id in watched_test]

    recommendations = [movie_names.get(m_id) for m_id in recommendations]
    watched_test = [movie_names.get(m_id) for m_id in watched_test]

    hits = set(recommendations).intersection(set(watched_test))
    watched_train = evaluation.get_watched_movies(user, seen_edges)
    watched_train = [movie_names.get(nodeid_movieid[n_id]) for n_id in watched_train]

    return recommendations, watched_test, hits, watched_train

In [None]:
top10, actual, hits, watched = recommend_movies(
    user=765,
    seen_edges=torch.cat([pos_edge_index["train"], pos_edge_index["valid"]], dim=1),
    test_edges=pos_edge_index["test"],
    model=lgcn_embedding_brp_model_ratings,
    library=nodeid_movieid.keys(),
    k=10,
)

In [None]:
top10

In [None]:
hits

In [None]:
actual

In [None]:
top10, actual, hits, watched = recommend_movies(
    user=765,
    seen_edges=torch.cat([pos_edge_index["train"], pos_edge_index["valid"]], dim=1),
    test_edges=pos_edge_index["test"],
    model=lgcn_simple_embedding_model,
    library=nodeid_movieid.keys(),
    k=10,
)

In [None]:
top10

In [None]:
hits