**Modified from**: [link](https://colab.research.google.com/drive/1xpzn1Nvai1ygd_P5Yambc_oe4VBPK_ZT?usp=sharing)

In [22]:
import torch
from torch import Tensor
import pandas as pd
import numpy as np
from torch_geometric.nn import SAGEConv, to_hetero
import torch.nn.functional as F
from torch_geometric.loader import LinkNeighborLoader
from torch_geometric.data import HeteroData
import torch_geometric.transforms as T
import tqdm

# Link Prediction on MovieLens

This colab notebook shows how to load a set of `*.csv` files as input and construct a heterogeneous graph from it.
We will then use this dataset as input into a [heterogeneous graph model](https://pytorch-geometric.readthedocs.io/en/latest/notes/heterogeneous.html#hgtutorial), and use it for the task of link prediction.
A few code cells require user input to let the code run through successfully.
Parts of this tutorial are also available in [our documentation](https://pytorch-geometric.readthedocs.io/en/latest/notes/load_csv.html).

We are going to use the [MovieLens dataset](https://grouplens.org/datasets/movielens/) collected by the GroupLens research group.
This toy dataset describes ratings and tagging activity from MovieLens.
The dataset contains approximately 100k ratings across more than 9k movies from more than 600 users.
We are going to use this dataset to generate two node types holding data for movies and users, respectively, and one edge type connecting users and movies, representing the relation of whether a user has rated a specific movie.

The link prediction task then tries to predict missing ratings and can, for example, be used to recommend new movies to users.

## Heterogeneous Graph Creation

First, we download and extract the dataset to an arbitrary folder (in this case, the current directory) and store the paths to the two files we need:

In [3]:
movies_path = './ml-latest-small/movies.csv'
ratings_path = './ml-latest-small/ratings.csv'

Before we create the heterogeneous graph, let’s take a look at the data.

In [6]:
print('movies.csv:')
print('===========')
print(pd.read_csv(movies_path)[["movieId", "genres"]].head())
print()
print('ratings.csv:')
print('============')
print(pd.read_csv(ratings_path)[["userId", "movieId"]].head())

movies.csv:
   movieId                                       genres
0        1  Adventure|Animation|Children|Comedy|Fantasy
1        2                   Adventure|Children|Fantasy
2        3                               Comedy|Romance
3        4                         Comedy|Drama|Romance
4        5                                       Comedy

ratings.csv:
   userId  movieId
0       1        1
1       1        3
2       1        6
3       1       47
4       1       50


We see that the `movies.csv` file provides two useful columns: `movieId` assigns a unique identifier to each movie, while the `genres` column represents the genres of the given movie.
We can make use of this column to define a feature representation that can be easily interpreted by machine learning models.

In [7]:
# Load the entire movie data frame into memory:
movies_df = pd.read_csv(movies_path, index_col='movieId')

# Split genres and convert into indicator variables:
genres = movies_df['genres'].str.get_dummies('|')
print(genres[["Action", "Adventure", "Drama", "Horror"]])
print("="*50)

# Use genres as movie input features:
movie_feat = torch.from_numpy(genres.values).to(torch.float)
assert movie_feat.size() == (9742, 20)  # 20 genres in total.
print(movie_feat)

         Action  Adventure  Drama  Horror
movieId                                  
1             0          1      0       0
2             0          1      0       0
3             0          0      0       0
4             0          0      1       0
5             0          0      0       0
...         ...        ...    ...     ...
193581        1          0      0       0
193583        0          0      0       0
193585        0          0      1       0
193587        1          0      0       0
193609        0          0      0       0

[9742 rows x 4 columns]
tensor([[0., 0., 1.,  ..., 0., 0., 0.],
        [0., 0., 1.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 1., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
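To make the encoding step concrete, here is a minimal pure-Python sketch of what splitting on `|` and building indicator columns amounts to. The three genre strings are toy data chosen for illustration, and the sketch is a stand-in for `str.get_dummies('|')`, not its actual implementation.

```python
# Toy stand-in for the `genres` column; the notebook reads it from movies.csv.
raw_genres = [
    "Adventure|Animation|Children",
    "Comedy|Romance",
    "Comedy",
]

# Collect the sorted vocabulary of genre names across all movies.
vocabulary = sorted({g for row in raw_genres for g in row.split("|")})

# One row per movie, one column per genre: 1.0 if the movie has that genre.
multi_hot = [
    [1.0 if genre in row.split("|") else 0.0 for genre in vocabulary]
    for row in raw_genres
]

print(vocabulary)
for row in multi_hot:
    print(row)
```

Each movie ends up with a fixed-length vector regardless of how many genres it lists, which is what allows the rows to be stacked into a single feature matrix.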


The `ratings.csv` data connects users (as given by `userId`) and movies (as given by `movieId`).
For simplicity, we do not make use of the additional `timestamp` and `rating` information.
Here, we first read the `*.csv` file from disk, and create a mapping that maps entry IDs to a consecutive value in the range `{ 0, ..., num_rows - 1 }`.
This is needed as we want our final data representation to be as compact as possible, *e.g.*, the representation of a movie in the first row should be accessible via `x[0]`.

Afterwards, we obtain the final `edge_index` representation of shape `[2, num_ratings]` from `ratings.csv` by merging mapped user and movie indices with the raw indices given by the original data frame.

In [10]:
# Load the entire ratings data frame into memory:
ratings_df = pd.read_csv(ratings_path)
print(ratings_df)

# Create a mapping from unique user indices to range [0, num_user_nodes):
unique_user_id = ratings_df['userId'].unique()
unique_user_id = pd.DataFrame(data={
    'userId': unique_user_id,
    'mappedID': np.arange(len(unique_user_id)),
})
print("Mapping of user IDs to consecutive values:")
print("==========================================")
print(unique_user_id)

# Create a mapping from unique movie indices to range [0, num_movie_nodes):
unique_movie_id = pd.DataFrame(data={
    'movieId': movies_df.index,
    'mappedID': np.arange(len(movies_df)),
})
print("Mapping of movie IDs to consecutive values:")
print("===========================================")
print(unique_movie_id)

# Perform merge to obtain the edges from users and movies:
ratings_user_id = pd.merge(ratings_df['userId'], unique_user_id,
                            left_on='userId', right_on='userId', how='left')
ratings_user_id = torch.from_numpy(ratings_user_id['mappedID'].values)

ratings_movie_id = pd.merge(ratings_df['movieId'], unique_movie_id,
                            left_on='movieId', right_on='movieId', how='left')
ratings_movie_id = torch.from_numpy(ratings_movie_id['mappedID'].values)

# With this, we are ready to construct our `edge_index` in COO format
# following PyG semantics:
edge_index_user_to_movie = torch.stack([ratings_user_id, ratings_movie_id], dim=0)
assert edge_index_user_to_movie.size() == (2, 100836)

print()
print("Final edge indices pointing from users to movies:")
print("=================================================")
print(edge_index_user_to_movie)
# print("===================================================")
# edge_feat = torch.from_numpy(ratings_df['rating'].values)
# print(edge_feat)

        userId  movieId  rating   timestamp
0            1        1     4.0   964982703
1            1        3     4.0   964981247
2            1        6     4.0   964982224
3            1       47     5.0   964983815
4            1       50     5.0   964982931
...        ...      ...     ...         ...
100831     610   166534     4.0  1493848402
100832     610   168248     5.0  1493850091
100833     610   168250     5.0  1494273047
100834     610   168252     5.0  1493846352
100835     610   170875     3.0  1493846415

[100836 rows x 4 columns]
Mapping of user IDs to consecutive values:
     userId  mappedID
0         1         0
1         2         1
2         3         2
3         4         3
4         5         4
..      ...       ...
605     606       605
606     607       606
607     608       607
608     609       608
609     610       609

[610 rows x 2 columns]
Mapping of movie IDs to consecutive values:
      movieId  mappedID
0           1         0
1           2         
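The ID mapping and the resulting COO edge index can be sketched on toy data with plain dictionaries. The rating pairs below are made up for illustration, and unlike the notebook (which derives the movie mapping from `movies.csv`), this sketch assigns indices in order of first appearance in the ratings.

```python
# Toy ratings as (userId, movieId) pairs with non-consecutive raw IDs.
ratings = [(1, 47), (1, 50), (7, 47), (42, 193609)]

# Map each raw ID to a consecutive index in order of first appearance.
# Note: the notebook builds the movie mapping from movies.csv instead,
# so that movies without any rating still receive an index.
user_index = {}
movie_index = {}
for user_id, movie_id in ratings:
    user_index.setdefault(user_id, len(user_index))
    movie_index.setdefault(movie_id, len(movie_index))

# COO edge index: the first row holds source (user) indices,
# the second row holds destination (movie) indices.
edge_index = [
    [user_index[u] for u, _ in ratings],
    [movie_index[m] for _, m in ratings],
]
print(edge_index)
```

Column `i` of `edge_index` describes the `i`-th rating, so the two rows always have the same length as the ratings table.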

With this, we are ready to initialize our `HeteroData` object and pass the necessary information to it.
Note that we also pass in a `node_id` vector to each node type in order to reconstruct the original node indices from sampled subgraphs.
We also take care of adding reverse edges to the `HeteroData` object.
This allows our GNN model to use both directions of the edge for message passing:

In [11]:
data = HeteroData()

# Save node indices:
data["user"].node_id = torch.arange(len(unique_user_id))
data["movie"].node_id = torch.arange(len(movies_df))

# Add the node features and edge indices:
data["movie"].x = movie_feat
data["user", "rates", "movie"].edge_index = edge_index_user_to_movie

# We also need to make sure to add the reverse edges from movies to users
# in order to let a GNN be able to pass messages in both directions.
# We can leverage the `T.ToUndirected()` transform for this from PyG:
data = T.ToUndirected()(data)

print(data)

assert data.node_types == ["user", "movie"]
assert data.edge_types == [("user", "rates", "movie"),
                           ("movie", "rev_rates", "user")]
assert data["user"].num_nodes == 610
assert data["user"].num_features == 0
assert data["movie"].num_nodes == 9742
assert data["movie"].num_features == 20
assert data["user", "rates", "movie"].num_edges == 100836
assert data["movie", "rev_rates", "user"].num_edges == 100836

HeteroData(
  user={ node_id=[610] },
  movie={
    node_id=[9742],
    x=[9742, 20],
  },
  (user, rates, movie)={ edge_index=[2, 100836] },
  (movie, rev_rates, user)={ edge_index=[2, 100836] }
)
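Conceptually, the `rev_rates` edge type contains the same edges as `rates` with source and destination swapped. The following is a minimal sketch with plain lists on toy indices; it illustrates the idea and is not the `T.ToUndirected()` implementation.

```python
# Toy user -> movie edge index in COO format.
rates = [
    [0, 0, 1, 2],  # source user indices
    [0, 1, 0, 2],  # destination movie indices
]

# The reverse relation (movie -> user) swaps the two rows.
rev_rates = [rates[1], rates[0]]

# Every forward edge (u, m) now has a matching reverse edge (m, u).
forward = set(zip(*rates))
backward = set(zip(*rev_rates))
assert {(m, u) for u, m in forward} == backward
print(rev_rates)
```

This is why both edge types in the printed `HeteroData` object report the same number of edges.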


## Defining Edge-level Training Splits

Now that our data is ready to use, we can split the ratings of users into training, validation, and test splits.
This is needed in order to ensure that we leak no information about edges used during evaluation into the training phase.
For this, we make use of the [`transforms.RandomLinkSplit`](https://pytorch-geometric.readthedocs.io/en/latest/modules/transforms.html#torch_geometric.transforms.RandomLinkSplit) transformation from PyG.
This transform randomly divides the edges of the `("user", "rates", "movie")` edge type into training, validation, and test edges.
The `disjoint_train_ratio` parameter further separates edges in the training split into edges used for message passing (`edge_index`) and edges used for supervision (`edge_label_index`).
Note that we also need to specify the reverse edge type `("movie", "rev_rates", "user")`.
This allows the `RandomLinkSplit` transform to drop reverse edges accordingly to not leak any information into the training phase.

In [19]:
# For this, we first split the set of edges into
# training (80%), validation (10%), and testing edges (10%).
# Across the training edges, we use 70% of edges for message passing,
# and 30% of edges for supervision.
# We further want to generate fixed negative edges for evaluation with a ratio of 2:1.
# Negative edges during training will be generated on-the-fly.
# We can leverage the `RandomLinkSplit()` transform for this from PyG:
transform = T.RandomLinkSplit(
    num_val=0.1,
    num_test=0.1,
    disjoint_train_ratio=0.3,
    neg_sampling_ratio=2.0,
    add_negative_train_samples=False,
    edge_types=("user", "rates", "movie"),
    rev_edge_types=("movie", "rev_rates", "user"),
)

train_data, val_data, test_data = transform(data)
print("Training data:")
print("==============")
print(train_data)
print()
print("Validation data:")
print("================")
print(val_data)
print("Test data:")
print("================")
print(test_data)

assert train_data["user", "rates", "movie"].num_edges == 56469
assert train_data["user", "rates", "movie"].edge_label_index.size(1) == 24201
assert train_data["movie", "rev_rates", "user"].num_edges == 56469
# No negative edges added:
assert train_data["user", "rates", "movie"].edge_label.min() == 1
assert train_data["user", "rates", "movie"].edge_label.max() == 1

assert val_data["user", "rates", "movie"].num_edges == 80670
assert val_data["user", "rates", "movie"].edge_label_index.size(1) == 30249
assert val_data["movie", "rev_rates", "user"].num_edges == 80670
# Negative edges with ratio 2:1:
assert val_data["user", "rates", "movie"].edge_label.long().bincount().tolist() == [20166, 10083]

Training data:
HeteroData(
  user={ node_id=[610] },
  movie={
    node_id=[9742],
    x=[9742, 20],
  },
  (user, rates, movie)={
    edge_index=[2, 56469],
    edge_label=[24201],
    edge_label_index=[2, 24201],
  },
  (movie, rev_rates, user)={ edge_index=[2, 56469] }
)

Validation data:
HeteroData(
  user={ node_id=[610] },
  movie={
    node_id=[9742],
    x=[9742, 20],
  },
  (user, rates, movie)={
    edge_index=[2, 80670],
    edge_label=[30249],
    edge_label_index=[2, 30249],
  },
  (movie, rev_rates, user)={ edge_index=[2, 80670] }
)
Test data:
HeteroData(
  user={ node_id=[610] },
  movie={
    node_id=[9742],
    x=[9742, 20],
  },
  (user, rates, movie)={
    edge_index=[2, 90753],
    edge_label=[30249],
    edge_label_index=[2, 30249],
  },
  (movie, rev_rates, user)={ edge_index=[2, 90753] }
)
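The sizes printed above follow from simple arithmetic on the 100,836 rating edges. The sketch below reproduces them under the assumption that fractional counts are rounded down, which is consistent with the output but is not taken from the PyG source; `split_sizes` is a hypothetical helper for illustration.

```python
def split_sizes(num_edges, val_pct=10, test_pct=10, disjoint_pct=30, neg_ratio=2):
    """Edge counts per split, assuming fractional counts are rounded down."""
    num_val = num_edges * val_pct // 100
    num_test = num_edges * test_pct // 100
    num_train = num_edges - num_val - num_test

    # Training edges are divided into supervision and message-passing edges.
    train_supervision = num_train * disjoint_pct // 100
    train_message_passing = num_train - train_supervision

    return {
        "train_edge_index": train_message_passing,
        "train_edge_label": train_supervision,  # positives only
        "val_edge_index": num_train,
        "val_edge_label": num_val * (1 + neg_ratio),  # positives + negatives
        "test_edge_index": num_train + num_val,
        "test_edge_label": num_test * (1 + neg_ratio),
    }


sizes = split_sizes(100836)
print(sizes)
```

Validation uses all training edges for message passing, and testing additionally uses the validation edges, which is why `edge_index` grows from 56,469 to 80,670 to 90,753 across the three splits.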


## Defining Mini-batch Loaders

We are now ready to create a mini-batch loader that will generate subgraphs that can be used as input into our GNN.
While this step is not strictly necessary for small-scale graphs, it is absolutely necessary for applying GNNs to larger graphs that would not otherwise fit into GPU memory.
Here, we make use of the [`loader.LinkNeighborLoader`](https://pytorch-geometric.readthedocs.io/en/latest/modules/loader.html#torch_geometric.loader.LinkNeighborLoader) which samples multiple hops from both ends of a link and creates a subgraph from it.
Here, `edge_label_index` serves as the "seed links" to start sampling from.

In [20]:
# In the first hop, we sample at most 25 neighbors.
# In the second hop, we sample at most 10 neighbors.
# In addition, during training, we want to sample negative edges on-the-fly with
# a ratio of 2:1.
# We can make use of the `loader.LinkNeighborLoader` from PyG:

# Define seed edges:
edge_label_index = train_data["user", "rates", "movie"].edge_label_index
edge_label = train_data["user", "rates", "movie"].edge_label

train_loader = LinkNeighborLoader(
    data=train_data,
    num_neighbors=[25, 10],
    neg_sampling_ratio=2.0,
    edge_label_index=(("user", "rates", "movie"), edge_label_index),
    edge_label=edge_label,
    batch_size=128,
    shuffle=True,
)

# Inspect a sample:
sampled_data = next(iter(train_loader))

print("Sampled mini-batch:")
print("===================")
print(sampled_data)

assert sampled_data["user", "rates", "movie"].edge_label_index.size(1) == 3 * 128
assert sampled_data["user", "rates", "movie"].edge_label.min() == 0
assert sampled_data["user", "rates", "movie"].edge_label.max() == 1

Sampled mini-batch:
HeteroData(
  user={
    node_id=[609],
    n_id=[609],
    num_sampled_nodes=[3],
  },
  movie={
    node_id=[3037],
    x=[3037, 20],
    n_id=[3037],
    num_sampled_nodes=[3],
  },
  (user, rates, movie)={
    edge_index=[2, 19277],
    edge_label=[384],
    edge_label_index=[2, 384],
    e_id=[19277],
    num_sampled_edges=[2],
    input_id=[128],
  },
  (movie, rev_rates, user)={
    edge_index=[2, 8827],
    e_id=[8827],
    num_sampled_edges=[2],
  }
)
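The batch sizes seen in the sampled mini-batch can be checked with a short calculation. This is a sketch of the arithmetic only, using the 24,201 supervision edges from the training split; it does not call the loader.

```python
import math

batch_size = 128         # positive seed edges per mini-batch
neg_sampling_ratio = 2   # negatives sampled per positive
num_train_seeds = 24201  # supervision edges in the training split

# Number of mini-batches per epoch (the last one may be smaller).
num_batches = math.ceil(num_train_seeds / batch_size)

# Supervision edges in a full mini-batch: positives plus sampled negatives.
edges_per_batch = batch_size * (1 + neg_sampling_ratio)

print(num_batches, edges_per_batch)
```

The 384 supervision edges match the `edge_label=[384]` entry in the sampled batch, while `input_id=[128]` reflects the positive seed edges only.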


## Creating a Heterogeneous Link-level GNN

We are now ready to create our heterogeneous GNN.
The GNN is responsible for learning enriched node representations from the surrounding subgraphs, which can be then used to derive edge-level predictions.
For defining our heterogeneous GNN, we make use of [`nn.SAGEConv`](https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.conv.SAGEConv) and the [`nn.to_hetero()`](https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.to_hetero_transformer.to_hetero) function, which transforms a GNN defined on homogeneous graphs to be applied on heterogeneous ones.

In addition, we define a final link-level classifier, which simply takes both node embeddings of the link we are trying to predict, and applies a dot-product on them.

As users do not have any node-level information, we choose to learn their features jointly via a `torch.nn.Embedding` layer. In order to improve the expressiveness of movie features, we do the same for movie nodes, and simply add their shallow embeddings to the pre-defined genre features.

In [24]:
class GNN(torch.nn.Module):
    def __init__(self, hidden_channels):
        super().__init__()

        self.conv1 = SAGEConv(hidden_channels, hidden_channels)
        self.conv2 = SAGEConv(hidden_channels, hidden_channels)

    def forward(self, x: Tensor, edge_index: Tensor) -> Tensor:
        x = F.relu(self.conv1(x, edge_index))
        x = self.conv2(x, edge_index)
        return x

# Our final classifier applies the dot-product between source and destination
# node embeddings to derive edge-level predictions:
class Classifier(torch.nn.Module):
    def forward(self, x_user: Tensor, x_movie: Tensor, edge_label_index: Tensor) -> Tensor:
        # Convert node embeddings to edge-level representations:
        edge_feat_user = x_user[edge_label_index[0]]
        edge_feat_movie = x_movie[edge_label_index[1]]

        # Apply dot-product to get a prediction per supervision edge:
        return (edge_feat_user * edge_feat_movie).sum(dim=-1)


class Model(torch.nn.Module):
    def __init__(self, hidden_channels):
        super().__init__()
        # Since the dataset does not come with rich features, we also learn two
        # embedding matrices for users and movies:
        self.movie_lin = torch.nn.Linear(20, hidden_channels)
        self.user_emb = torch.nn.Embedding(data["user"].num_nodes, hidden_channels)
        self.movie_emb = torch.nn.Embedding(data["movie"].num_nodes, hidden_channels)

        # Instantiate homogeneous GNN:
        self.gnn = GNN(hidden_channels)

        # Convert GNN model into a heterogeneous variant:
        self.gnn = to_hetero(self.gnn, metadata=data.metadata())

        self.classifier = Classifier()

    def forward(self, data: HeteroData) -> Tensor:
        x_dict = {
          "user": self.user_emb(data["user"].node_id),
          "movie": self.movie_lin(data["movie"].x) + self.movie_emb(data["movie"].node_id),
        }

        # `x_dict` holds feature matrices of all node types
        # `edge_index_dict` holds all edge indices of all edge types
        x_dict = self.gnn(x_dict, data.edge_index_dict)
        pred = self.classifier(
            x_dict["user"],
            x_dict["movie"],
            data["user", "rates", "movie"].edge_label_index,
        )

        return pred


model = Model(hidden_channels=64)

print(model)

Model(
  (movie_lin): Linear(in_features=20, out_features=64, bias=True)
  (user_emb): Embedding(610, 64)
  (movie_emb): Embedding(9742, 64)
  (gnn): GraphModule(
    (conv1): ModuleDict(
      (user__rates__movie): SAGEConv(64, 64, aggr=mean)
      (movie__rev_rates__user): SAGEConv(64, 64, aggr=mean)
    )
    (conv2): ModuleDict(
      (user__rates__movie): SAGEConv(64, 64, aggr=mean)
      (movie__rev_rates__user): SAGEConv(64, 64, aggr=mean)
    )
  )
  (classifier): Classifier()
)
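The dot-product classifier is easiest to follow on a tiny example. The sketch below uses NumPy arrays as a stand-in for the tensors in `Classifier.forward`; the embedding values and edges are toy data chosen so the result can be checked by hand.

```python
import numpy as np

# Toy embeddings: 3 users and 4 movies, each with 2 hidden channels.
x_user = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
x_movie = np.array([[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])

# Supervision edges: row 0 = user indices, row 1 = movie indices.
edge_label_index = np.array([[0, 1, 2], [1, 2, 3]])

# Gather the embeddings of each edge's endpoints, then take a row-wise dot product.
edge_feat_user = x_user[edge_label_index[0]]
edge_feat_movie = x_movie[edge_label_index[1]]
scores = (edge_feat_user * edge_feat_movie).sum(axis=-1)

print(scores)
```

Each score is an unbounded logit: a larger value means the two embeddings point in more similar directions, which the model interprets as a more likely link.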


## Training a Heterogeneous Link-level GNN

Training our GNN is then similar to training any PyTorch model.
We move the model to the desired device, and initialize an optimizer (here, Adam) that takes care of adjusting model parameters via a variant of stochastic gradient descent.

The training loop then iterates over our mini-batches, applies the forward computation of the model, computes the loss from ground-truth labels and obtained predictions (here we make use of binary cross entropy), and adjusts model parameters via back-propagation and stochastic gradient descent.

In [26]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: '{device}'")

model = model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = total_examples = 0
    for sampled_data in tqdm.tqdm(train_loader):
        optimizer.zero_grad()

        sampled_data.to(device)
        pred = model(sampled_data)

        ground_truth = sampled_data["user", "rates", "movie"].edge_label
        loss = F.binary_cross_entropy_with_logits(pred, ground_truth)

        loss.backward()
        optimizer.step()
        total_loss += float(loss) * pred.numel()
        total_examples += pred.numel()
    print(f"Epoch: {epoch:03d}, Loss: {total_loss / total_examples:.4f}")

Device: 'cpu'


100%|█████████████████████████████████████████████████████████████████████████████████| 190/190 [00:07<00:00, 24.40it/s]


Epoch: 000, Loss: 0.3445


100%|█████████████████████████████████████████████████████████████████████████████████| 190/190 [00:05<00:00, 32.67it/s]


Epoch: 001, Loss: 0.3233


100%|█████████████████████████████████████████████████████████████████████████████████| 190/190 [00:05<00:00, 32.01it/s]


Epoch: 002, Loss: 0.3089


100%|█████████████████████████████████████████████████████████████████████████████████| 190/190 [00:08<00:00, 23.59it/s]


Epoch: 003, Loss: 0.2940


100%|█████████████████████████████████████████████████████████████████████████████████| 190/190 [00:09<00:00, 19.16it/s]


Epoch: 004, Loss: 0.2776


100%|█████████████████████████████████████████████████████████████████████████████████| 190/190 [00:05<00:00, 33.37it/s]


Epoch: 005, Loss: 0.2677


100%|█████████████████████████████████████████████████████████████████████████████████| 190/190 [00:05<00:00, 34.36it/s]


Epoch: 006, Loss: 0.2573


100%|█████████████████████████████████████████████████████████████████████████████████| 190/190 [00:10<00:00, 18.04it/s]


Epoch: 007, Loss: 0.2500


100%|█████████████████████████████████████████████████████████████████████████████████| 190/190 [00:06<00:00, 29.07it/s]


Epoch: 008, Loss: 0.2421


100%|█████████████████████████████████████████████████████████████████████████████████| 190/190 [00:09<00:00, 19.31it/s]

Epoch: 009, Loss: 0.2386
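The loss used in the training loop operates on raw scores (logits) rather than probabilities. As a rough illustration, the per-edge loss can be written in plain Python in a numerically stable form; `bce_with_logits` below is a hypothetical scalar helper, not the PyTorch function itself.

```python
import math


def bce_with_logits(logit, label):
    """Binary cross entropy on a raw score, in a numerically stable form."""
    # Equivalent to -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))].
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# A confident correct prediction yields a small loss,
# a confident wrong prediction a large one.
print(bce_with_logits(4.0, 1.0))
print(bce_with_logits(4.0, 0.0))
print(bce_with_logits(0.0, 1.0))
```

A logit of 0 corresponds to a predicted probability of 0.5, so its loss is `log(2)` regardless of the label.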





## Evaluating a Heterogeneous Link-level GNN

After training, we evaluate our model on unseen data coming from the validation set.
For this, we define a new `LinkNeighborLoader` (which now iterates over the edges in the validation set), obtain the predictions on validation edges by running the model, and finally evaluate the performance of the model by computing the AUC score over the set of predictions and their corresponding ground-truth edges (including both positive and negative edges).

In [27]:
# Define the validation seed edges:
edge_label_index = val_data["user", "rates", "movie"].edge_label_index
edge_label = val_data["user", "rates", "movie"].edge_label

val_loader = LinkNeighborLoader(
    data=val_data,
    num_neighbors=[20, 10],
    edge_label_index=(("user", "rates", "movie"), edge_label_index),
    edge_label=edge_label,
    batch_size=3 * 128,
    shuffle=False,
)

sampled_data = next(iter(val_loader))

print("Sampled mini-batch:")
print("===================")
print(sampled_data)

assert sampled_data["user", "rates", "movie"].edge_label_index.size(1) == 3 * 128
assert sampled_data["user", "rates", "movie"].edge_label.min() >= 0
assert sampled_data["user", "rates", "movie"].edge_label.max() <= 1

Sampled mini-batch:
HeteroData(
  user={
    node_id=[609],
    n_id=[609],
    num_sampled_nodes=[3],
  },
  movie={
    node_id=[2695],
    x=[2695, 20],
    n_id=[2695],
    num_sampled_nodes=[3],
  },
  (user, rates, movie)={
    edge_index=[2, 19204],
    edge_label=[384],
    edge_label_index=[2, 384],
    e_id=[19204],
    num_sampled_edges=[2],
    input_id=[384],
  },
  (movie, rev_rates, user)={
    edge_index=[2, 7711],
    e_id=[7711],
    num_sampled_edges=[2],
  }
)


In [28]:
from sklearn.metrics import roc_auc_score, accuracy_score
import numpy as np

preds = []
ground_truths = []
for sampled_data in tqdm.tqdm(val_loader):
    with torch.no_grad():
        sampled_data.to(device)
        preds.append(model(sampled_data))
        ground_truths.append(sampled_data["user", "rates", "movie"].edge_label)

pred = torch.cat(preds, dim=0).cpu().numpy()
ground_truth = torch.cat(ground_truths, dim=0).cpu().numpy()
auc = roc_auc_score(ground_truth, pred)
acc = accuracy_score(ground_truth, (pred > 0).astype('int'))
print()
print(f"Validation AUC: {auc:.4f}")
print(f"Validation ACC: {acc:.4f}")

100%|███████████████████████████████████████████████████████████████████████████████████| 79/79 [00:01<00:00, 41.08it/s]


Validation AUC: 0.9351
Validation ACC: 0.8686
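To build intuition for the AUC score, it can be read as the fraction of (positive, negative) edge pairs in which the positive edge receives the higher score. The sketch below computes it that way on toy data; `roc_auc` is a hypothetical helper with quadratic cost, meant for illustration rather than as a replacement for `roc_auc_score`.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    # A tie between a positive and a negative counts as half a correct pair.
    correct = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return correct / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc_value = roc_auc(labels, scores)
print(auc_value)
```

Because AUC depends only on the ranking of scores, it can be computed directly on the raw logits. The accuracy in the cell above, by contrast, thresholds the logits at 0, which is equivalent to thresholding the sigmoid probability at 0.5.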





In [62]:
# prompt: create a function that recommends the top 10 movies for a user, sorted from most to least recommended

def recommend_movies_for_user(user_id, model, data, top_k=10):
    # NOTE: `top_k` is currently unused; the full ranked list is returned and
    # the caller slices it. Already-rated movies are not filtered out.
    # Mapped indices of the movies this user has already rated:
    rates_edge_index = data["user", "rates", "movie"].edge_index
    user_rated_movie_nodes = rates_edge_index[1][rates_edge_index[0] == user_id].tolist()
    user_node_id = data["user"].node_id[user_id].item()
    all_movie_node_ids = data["movie"].node_id

    # Build one candidate (user, movie) edge for every movie:
    edge_index = torch.stack([
        torch.full_like(all_movie_node_ids, user_node_id),
        all_movie_node_ids,
    ], dim=0)
    # Score every (user, movie) pair in a single batch:
    loader = LinkNeighborLoader(
        data=data,
        num_neighbors=[25, 10],
        edge_label_index=(("user", "rates", "movie"), edge_index),
        batch_size=len(all_movie_node_ids),
        shuffle=False
    )

    model.eval()
    model_device = next(model.parameters()).device
    # Gradients are not needed for inference:
    with torch.no_grad():
        for batch in loader:
            prediction = model(batch.to(model_device)).cpu()
    predictions = [(all_movie_node_ids[x].item(), prediction[x].item()) for x in range(len(all_movie_node_ids))]

    # Sort movies by predicted score, highest first:
    predictions.sort(key=lambda x: x[1], reverse=True)

    return predictions, user_rated_movie_nodes

In [63]:
user_id_to_recommend = 0
recommended_movies, user_have_watched = recommend_movies_for_user(user_id_to_recommend, model, data)

print(f"Top movie recommendations for user with ID {user_id_to_recommend}:")
for movie_id, score in recommended_movies[:10]:
    print(f" - Movie ID: {movie_id}, Predicted score: {score:.4f}")

Top movie recommendations for user with ID 0:
 - Movie ID: 1061, Predicted score: 4.6063
 - Movie ID: 478, Predicted score: 4.5756
 - Movie ID: 379, Predicted score: 4.4897
 - Movie ID: 1503, Predicted score: 4.4736
 - Movie ID: 911, Predicted score: 4.4176
 - Movie ID: 1183, Predicted score: 4.4015
 - Movie ID: 902, Predicted score: 4.2634
 - Movie ID: 898, Predicted score: 4.2437
 - Movie ID: 314, Predicted score: 4.2223
 - Movie ID: 2327, Predicted score: 4.1963
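The function above scores every movie, including ones the user has already rated. One possible refinement, which is not part of the original notebook, is to drop those movies before taking the top entries. The sketch below shows the idea on toy data; `filter_unseen` is a hypothetical helper.

```python
def filter_unseen(ranked, seen, top_k=10):
    """Keep the top_k highest-ranked movies that the user has not rated yet."""
    seen = set(seen)
    unseen = [(movie, score) for movie, score in ranked if movie not in seen]
    return unseen[:top_k]


# Toy ranked list of (movie index, score), highest score first.
ranked = [(5, 4.6), (2, 4.5), (9, 4.1), (7, 3.8)]
already_rated = [2, 7]
top = filter_unseen(ranked, already_rated, top_k=2)
print(top)
```

Applied to the notebook's variables, this would correspond to passing `recommended_movies` as `ranked` and `user_have_watched` as `seen`.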


In [64]:
for movie_id, score in recommended_movies[:10]:
    print(movie_id)
    real_id = unique_movie_id[unique_movie_id['mappedID'] == movie_id]['movieId']
    movie_row = movies_df.loc[real_id]
    title = movie_row.title.values[0]
    genres = movie_row.genres.values[0]
    print(f"Title: {title}\nGenres: {genres}\n\n")

1061
Title: Young Guns (1988)
Genres: Action|Comedy|Western


478
Title: Super Mario Bros. (1993)
Genres: Action|Adventure|Children|Comedy|Fantasy|Sci-Fi


379
Title: Coneheads (1993)
Genres: Comedy|Sci-Fi


1503
Title: Saving Private Ryan (1998)
Genres: Action|Drama|War


911
Title: Star Wars: Episode VI - Return of the Jedi (1983)
Genres: Action|Adventure|Sci-Fi


1183
Title: Men in Black (a.k.a. MIB) (1997)
Genres: Action|Comedy|Sci-Fi


902
Title: Aliens (1986)
Genres: Action|Adventure|Horror|Sci-Fi


898
Title: Star Wars: Episode V - The Empire Strikes Back (1980)
Genres: Action|Adventure|Sci-Fi


314
Title: Forrest Gump (1994)
Genres: Comedy|Drama|Romance|War


2327
Title: World Is Not Enough, The (1999)
Genres: Action|Adventure|Thriller




In [65]:
for movie_id in user_have_watched:
    real_id = unique_movie_id[unique_movie_id['mappedID'] == movie_id]['movieId']
    movie_row = movies_df.loc[real_id]
    title = movie_row.title.values[0]
    genres = movie_row.genres.values[0]
    print(f"Title: {title}\nGenres: {genres}\n\n")

Title: Toy Story (1995)
Genres: Adventure|Animation|Children|Comedy|Fantasy


Title: Grumpier Old Men (1995)
Genres: Comedy|Romance


Title: Heat (1995)
Genres: Action|Crime|Thriller


Title: Seven (a.k.a. Se7en) (1995)
Genres: Mystery|Thriller


Title: Usual Suspects, The (1995)
Genres: Crime|Mystery|Thriller


Title: From Dusk Till Dawn (1996)
Genres: Action|Comedy|Horror|Thriller


Title: Bottle Rocket (1996)
Genres: Adventure|Comedy|Crime|Romance


Title: Braveheart (1995)
Genres: Action|Drama|War


Title: Rob Roy (1995)
Genres: Action|Drama|Romance|War


Title: Canadian Bacon (1995)
Genres: Comedy|War


Title: Desperado (1995)
Genres: Action|Romance|Western


Title: Billy Madison (1995)
Genres: Comedy


Title: Clerks (1994)
Genres: Comedy


Title: Dumb & Dumber (Dumb and Dumber) (1994)
Genres: Adventure|Comedy


Title: Ed Wood (1994)
Genres: Comedy|Drama


Title: Star Wars: Episode IV - A New Hope (1977)
Genres: Action|Adventure|Sci-Fi


Title: Pulp Fiction (1994)
Genres: Comedy|C

In [66]:
def calculate_hits_recall_precision_at_k(predictions, ground_truth, k):
    # Only the first k predictions are considered:
    if len(predictions) > k:
        predictions = predictions[:k]

    # Count how many of the top-k predictions appear in the ground truth:
    hits = 0
    for prediction in predictions:
        if prediction in ground_truth:
            hits += 1
    print(hits)

    # Hits@k is True if at least one of the top-k predictions is relevant:
    hits_at_k = hits > 0
    if len(ground_truth) == 0:
        recall_at_k = 0
    else:
        recall_at_k = hits / len(ground_truth)
    if len(predictions) == 0:
        precision_at_k = 0
    else:
        precision_at_k = hits / len(predictions)

    return hits_at_k, recall_at_k, precision_at_k


In [67]:
predictions = [movie_id for movie_id, score in recommended_movies]

In [68]:
calculate_hits_recall_precision_at_k(predictions, user_have_watched, 20)

9


(True, 0.03879310344827586, 0.45)