<a href="https://colab.research.google.com/github/kanru-wang/Graph_Neural_Network/blob/main/Link_Regression_Heterogeneous_Graph.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MovieLens Heterogeneous Graph Link Rating Regression

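The MovieLens `ml-latest-small` dataset contains about 100k ratings (100,836) from 610 users, and its movie table lists 9,742 movies tagged with 20 distinct genre labels.

The goal is to predict the rating a user gives a movie, treated as regression on the user-movie edges of a heterogeneous graph.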

See: https://pytorch-geometric.readthedocs.io/en/latest/get_started/colabs.html

In [1]:
import torch

print(torch.__version__)


2.1.0+cu118


In [2]:
# Install required packages
import os

os.environ['TORCH'] = torch.__version__
!pip install pyg-lib -f https://data.pyg.org/whl/nightly/torch-${TORCH}.html
!pip install git+https://github.com/pyg-team/pytorch_geometric.git

!pip install sentence_transformers
!pip3 install fuzzywuzzy[speedup]
!pip install captum

## Data Ingestion

In [3]:
from torch_geometric.data import download_url, extract_zip
import pandas as pd

dataset_name = 'ml-latest-small'

url = f'https://files.grouplens.org/datasets/movielens/{dataset_name}.zip'
extract_zip(download_url(url, '.'), '.')

movies_path = f'./{dataset_name}/movies.csv'
ratings_path = f'./{dataset_name}/ratings.csv'

Downloading https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Extracting ./ml-latest-small.zip


In [4]:
# Load the entire ratings dataframe into memory:
ratings_df = pd.read_csv(ratings_path)[["userId", "movieId", "rating"]]

# Load the entire movie dataframe into memory:
movies_df = pd.read_csv(movies_path, index_col='movieId')

print('movies.csv:')
print('===========')
print(movies_df[["genres", "title"]].head())
print(f"Number of movies: {len(movies_df)}")
print()
print('ratings.csv:')
print('============')
print(ratings_df[["userId", "movieId", "rating"]].head())
print(f"Number of ratings: {len(ratings_df)}")
print()

movies.csv:
                                              genres  \
movieId                                                
1        Adventure|Animation|Children|Comedy|Fantasy   
2                         Adventure|Children|Fantasy   
3                                     Comedy|Romance   
4                               Comedy|Drama|Romance   
5                                             Comedy   

                                      title  
movieId                                      
1                          Toy Story (1995)  
2                            Jumanji (1995)  
3                   Grumpier Old Men (1995)  
4                  Waiting to Exhale (1995)  
5        Father of the Bride Part II (1995)  
Number of movies: 9742

ratings.csv:
   userId  movieId  rating
0       1        1     4.0
1       1        3     4.0
2       1        6     4.0
3       1       47     5.0
4       1       50     5.0
Number of ratings: 100836



## Data Preprocessing

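Use each movie's genres and title as node features.

For the `title` features, use a pre-trained sentence transformer model (`all-MiniLM-L6-v2`) to encode each title into a 384-dimensional vector.

For the `genres` features, use a multi-hot encoding: a movie can belong to several genres, so its row has a 1 in every matching genre column.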
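
As a rough illustration of the genre encoding, the sketch below builds one 0/1 indicator column per genre in plain Python, using three genre strings from the `movies.csv` preview above. The helper name `multi_hot` is hypothetical, and the alphabetical column order is an assumption chosen to resemble what `str.get_dummies('|')` returns; it is not the notebook's implementation.

```python
def multi_hot(genre_strings, sep="|"):
    """Return (columns, rows): one 0/1 indicator column per distinct genre."""
    # Collect every distinct genre label and fix a column order (assumed: alphabetical).
    columns = sorted({g for s in genre_strings for g in s.split(sep)})
    index = {genre: i for i, genre in enumerate(columns)}
    rows = []
    for s in genre_strings:
        row = [0] * len(columns)
        for g in s.split(sep):
            row[index[g]] = 1  # a movie with several genres sets several columns
        rows.append(row)
    return columns, rows


# Genre strings taken from the movies.csv preview.
sample = [
    "Adventure|Children|Fantasy",  # Jumanji (1995)
    "Comedy|Romance",              # Grumpier Old Men (1995)
    "Comedy",                      # Father of the Bride Part II (1995)
]
columns, rows = multi_hot(sample)
print(columns)
for genres_string, row in zip(sample, rows):
    print(row, genres_string)
```

Each row has as many ones as the movie has genres, which is why the encoding is multi-hot rather than strictly one-hot.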

In [5]:
import numpy as np
import torch
from sentence_transformers import SentenceTransformer

# Multi-hot encode the genres (one indicator column per genre):
genres = movies_df['genres'].str.get_dummies('|').values
genres = torch.from_numpy(genres).to(torch.float)

# Load the pre-trained sentence transformer model and encode the movie titles:
model = SentenceTransformer('all-MiniLM-L6-v2')
with torch.no_grad():
    titles = model.encode(movies_df['title'].tolist(), convert_to_tensor=True, show_progress_bar=True)
    titles = titles.cpu()

# Concatenate the genres and title features:
movie_features = torch.cat([genres, titles], dim=-1)

# No user features are available, so use an identity matrix (one one-hot row per user):
user_features = torch.eye(len(ratings_df['userId'].unique()))


The `ratings.csv` data contains the ratings of users for movies.

Create a mapping from each `userId` to a unique consecutive index in the half-open range `[0, num_users)`, so that the representation of the first user is accessible via `x[0]`. Movie IDs are mapped in the same way.

In [6]:
# Create a mapping from the userId to a unique consecutive value in the range [0, num_users):
unique_user_id = ratings_df['userId'].unique()
unique_user_id = pd.DataFrame(data={
    'userId': unique_user_id,
    'mappedUserId': pd.RangeIndex(len(unique_user_id))
    })
print("Mapping of user IDs to consecutive values:")
print("==========================================")
print(unique_user_id.head())
print()

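# Create a mapping from the movieId to a unique consecutive value in the range [0, num_movies).
# Caution: this mapping follows the order in which movies first appear in ratings_df,
# whereas the rows of movie_features follow the row order of movies_df. The two orders
# differ, so mappedMovieId i does not generally index row i of movie_features.
# Building the mapping from movies_df.index instead would align indices and features.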
unique_movie_id = ratings_df['movieId'].unique()
unique_movie_id = pd.DataFrame(data={
    'movieId': unique_movie_id,
    'mappedMovieId': pd.RangeIndex(len(unique_movie_id))
    })
print("Mapping of movie IDs to consecutive values:")
print("===========================================")
print(unique_movie_id.head())
print()

# Merge mapped user and movie indices with the raw indices given by the original data frame.
ratings_df = ratings_df.merge(unique_user_id, on='userId')
ratings_df = ratings_df.merge(unique_movie_id, on='movieId')

edge_index = torch.stack([
    torch.tensor(ratings_df['mappedUserId'].values),
    torch.tensor(ratings_df['mappedMovieId'].values),
], dim=0)

assert edge_index.shape == (2, len(ratings_df))

print("Final edge indices pointing from users to movies:")
print("================================================")
print(edge_index[:, :10])

Mapping of user IDs to consecutive values:
   userId  mappedUserId
0       1             0
1       2             1
2       3             2
3       4             3
4       5             4

Mapping of movie IDs to consecutive values:
   movieId  mappedMovieId
0        1              0
1        3              1
2        6              2
3       47              3
4       50              4

Final edge indices pointing from users to movies:
tensor([[ 0,  4,  6, 14, 16, 17, 18, 20, 26, 30],
        [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0]])
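
Node features are looked up by position, so the consecutive index assigned to a movie has to match that movie's row in the feature matrix. The sketch below uses plain dictionaries and toy IDs modeled on the previews above to contrast a mapping built from order of first appearance in the ratings with one built from the row order of the movie table. All variable names here are illustrative and not part of the notebook.

```python
# Row order of the movie table: row i of the feature matrix describes movie_rows[i].
# (Toy IDs; the real table has 9,742 rows.)
movie_rows = [1, 2, 3, 4, 5, 6]
# movieId column of a few ratings, in the order they appear in the ratings table.
rated_movie_ids = [1, 3, 6, 1, 3]

# Mapping by order of first appearance in the ratings.
by_appearance = {}
for movie_id in rated_movie_ids:
    by_appearance.setdefault(movie_id, len(by_appearance))

# Mapping by row position in the movie table.
by_row = {movie_id: row for row, movie_id in enumerate(movie_rows)}

for movie_id in [1, 3, 6]:
    a = by_appearance[movie_id]
    r = by_row[movie_id]
    print(f"movieId {movie_id}: appearance index {a} selects features of movie "
          f"{movie_rows[a]}; row index {r} selects features of movie {movie_rows[r]}")
```

In pandas terms, the second mapping corresponds to building the lookup table from `movies_df.index` rather than from `ratings_df['movieId'].unique()`.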


## Heterogeneous Graph Construction

Build a `HeteroData` object with `user` and `movie` nodes connected by `rates` edges, then add reverse edges so that the GNN model can pass messages in both directions: from users to movies and from movies back to users.

In [7]:
import torch_geometric.transforms as T
from torch_geometric.data import HeteroData

data = HeteroData()

# Add user nodes
data['user'].x = user_features  # [num_users, num_features_users]

# Add movie nodes
data['movie'].x = movie_features  # [num_movies, num_features_movies]

# Add rating edges
data['user', 'rates', 'movie'].edge_index = edge_index  # [2, num_ratings]

# Add rating labels
rating = torch.from_numpy(ratings_df['rating'].values).to(torch.float)
data['user', 'rates', 'movie'].edge_label = rating  # [num_ratings]

# Add the reverse edges from movies to users in order to let a GNN be able to
# pass messages in both directions.
data = T.ToUndirected()(data)

# With the above transformation we also got reversed labels for the edges.
# We are going to remove them:
del data['movie', 'rev_rates', 'user'].edge_label

assert data['user'].num_nodes == len(unique_user_id)
assert data['user', 'rates', 'movie'].num_edges == len(ratings_df)

data

HeteroData(
  user={ x=[610, 610] },
  movie={ x=[9742, 404] },
  (user, rates, movie)={
    edge_index=[2, 100836],
    edge_label=[100836],
  },
  (movie, rev_rates, user)={ edge_index=[2, 100836] }
)
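
Conceptually, the `rev_rates` edge type holds the same user-movie pairs with source and destination swapped. The plain-Python sketch below illustrates this with the first few edges printed earlier; it mirrors the idea only and is not how `T.ToUndirected` is implemented.

```python
# First user -> movie edges from the printed edge_index.
rates = [
    [0, 4, 6, 14],  # row 0: source (mapped user index)
    [0, 0, 0, 0],   # row 1: destination (mapped movie index)
]

# Reverse edge type: swap the two rows so that movies become the sources.
rev_rates = [rates[1], rates[0]]

for (u, m), (m_rev, u_rev) in zip(zip(*rates), zip(*rev_rates)):
    print(f"user {u} -> movie {m}   |   movie {m_rev} -> user {u_rev}")
```

With both edge types present, a two-layer encoder can carry information from a user to a movie and back again, so a user's embedding can reflect other users who rated the same movies.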

## Dataset Splitting

In [8]:
train_data, val_data, test_data = T.RandomLinkSplit(
    num_val=0.1,
    num_test=0.1,
    neg_sampling_ratio=0.0,
    edge_types=[('user', 'rates', 'movie')],
    rev_edge_types=[('movie', 'rev_rates', 'user')],
)(data)
train_data, val_data

(HeteroData(
   user={ x=[610, 610] },
   movie={ x=[9742, 404] },
   (user, rates, movie)={
     edge_index=[2, 80670],
     edge_label=[80670],
     edge_label_index=[2, 80670],
   },
   (movie, rev_rates, user)={ edge_index=[2, 80670] }
 ),
 HeteroData(
   user={ x=[610, 610] },
   movie={ x=[9742, 404] },
   (user, rates, movie)={
     edge_index=[2, 80670],
     edge_label=[10083],
     edge_label_index=[2, 10083],
   },
   (movie, rev_rates, user)={ edge_index=[2, 80670] }
 ))
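
The edge counts in the printed splits can be checked by hand. The sketch below assumes that a fractional `num_val` or `num_test` is turned into an edge count by truncation, which is consistent with the shapes shown above. `split_sizes` is a hypothetical helper, not a PyG function.

```python
def split_sizes(num_edges, num_val=0.1, num_test=0.1):
    """Return (train, val, test) edge counts for fractional split sizes."""
    n_val = int(num_val * num_edges)    # assumption: fractional sizes are truncated
    n_test = int(num_test * num_edges)  # same assumption for the test split
    n_train = num_edges - n_val - n_test
    return n_train, n_val, n_test


n_train, n_val, n_test = split_sizes(100836)
print(n_train, n_val, n_test)
```

Under this assumption the split has 80,670 training edges and 10,083 validation edges, matching the `edge_label` shapes printed above. Note that the validation graph keeps the 80,670 training edges as its `edge_index` for message passing, while only the held-out edges appear in `edge_label_index`.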

## Graph Neural Network

In [9]:
from torch_geometric.nn import SAGEConv, to_hetero

class GNNEncoder(torch.nn.Module):
    def __init__(self, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = SAGEConv((-1, -1), hidden_channels)
        self.conv2 = SAGEConv((-1, -1), out_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index)
        return x


class EdgeDecoder(torch.nn.Module):
    """
    A decoder to predict the rating for the encoded user-movie combination.
    """
    def __init__(self, hidden_channels):
        super().__init__()
        self.lin1 = torch.nn.Linear(2 * hidden_channels, hidden_channels)
        self.lin2 = torch.nn.Linear(hidden_channels, 1)

    def forward(self, z_dict, edge_label_index):
        row, col = edge_label_index
        z = torch.cat([z_dict['user'][row], z_dict['movie'][col]], dim=-1)
        z = self.lin1(z).relu()
        z = self.lin2(z)
        return z.view(-1)


class Model(torch.nn.Module):
    def __init__(self, hidden_channels):
        super().__init__()
        self.encoder = GNNEncoder(hidden_channels, hidden_channels)
        self.encoder = to_hetero(self.encoder, data.metadata(), aggr='sum')
        self.decoder = EdgeDecoder(hidden_channels)

    def forward(self, x_dict, edge_index_dict, edge_label_index):
        z_dict = self.encoder(x_dict, edge_index_dict)
        return self.decoder(z_dict, edge_label_index)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = Model(hidden_channels=32).to(device)

print(model)

Model(
  (encoder): GraphModule(
    (conv1): ModuleDict(
      (user__rates__movie): SAGEConv((-1, -1), 32, aggr=mean)
      (movie__rev_rates__user): SAGEConv((-1, -1), 32, aggr=mean)
    )
    (conv2): ModuleDict(
      (user__rates__movie): SAGEConv((-1, -1), 32, aggr=mean)
      (movie__rev_rates__user): SAGEConv((-1, -1), 32, aggr=mean)
    )
  )
  (decoder): EdgeDecoder(
    (lin1): Linear(in_features=64, out_features=32, bias=True)
    (lin2): Linear(in_features=32, out_features=1, bias=True)
  )
)


## Training a Heterogeneous GNN


In [10]:
import torch.nn.functional as F

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

def train():
    model.train()
    optimizer.zero_grad()
    pred = model(train_data.x_dict, train_data.edge_index_dict,
                 train_data['user', 'movie'].edge_label_index)
    target = train_data['user', 'movie'].edge_label
    loss = F.mse_loss(pred, target)
    loss.backward()
    optimizer.step()
    return float(loss)

@torch.no_grad()
def test(data):
    data = data.to(device)
    model.eval()
    pred = model(data.x_dict, data.edge_index_dict,
                 data['user', 'movie'].edge_label_index)
    pred = pred.clamp(min=0, max=5)
    target = data['user', 'movie'].edge_label.float()
    rmse = F.mse_loss(pred, target).sqrt()
    return float(rmse)


for epoch in range(1, 301):
    train_data = train_data.to(device)
    loss = train()
    train_rmse = test(train_data)
    val_rmse = test(val_data)
    print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}, Train: {train_rmse:.4f}, '
          f'Val: {val_rmse:.4f}')

Epoch: 001, Loss: 13.3820, Train: 3.3970, Val: 3.3958
Epoch: 002, Loss: 11.5395, Train: 2.9613, Val: 2.9641
Epoch: 003, Loss: 8.7694, Train: 2.1584, Val: 2.1687
Epoch: 004, Loss: 4.6587, Train: 1.0826, Val: 1.0932
Epoch: 005, Loss: 1.1720, Train: 1.8252, Val: 1.7930
Epoch: 006, Loss: 5.3711, Train: 1.7821, Val: 1.7500
Epoch: 007, Loss: 3.5776, Train: 1.1381, Val: 1.1287
Epoch: 008, Loss: 1.2953, Train: 1.1416, Val: 1.1543
Epoch: 009, Loss: 1.3033, Train: 1.4852, Val: 1.4994
Epoch: 010, Loss: 2.2057, Train: 1.6746, Val: 1.6877
Epoch: 011, Loss: 2.8043, Train: 1.6854, Val: 1.6983
Epoch: 012, Loss: 2.8404, Train: 1.5575, Val: 1.5712
Epoch: 013, Loss: 2.4258, Train: 1.3357, Val: 1.3501
Epoch: 014, Loss: 1.7840, Train: 1.1038, Val: 1.1159
Epoch: 015, Loss: 1.2184, Train: 1.0251, Val: 1.0269
Epoch: 016, Loss: 1.0508, Train: 1.1783, Val: 1.1670
Epoch: 017, Loss: 1.3884, Train: 1.3418, Val: 1.3239
Epoch: 018, Loss: 1.8005, Train: 1.3177, Val: 1.3008
Epoch: 019, Loss: 1.7365, Train: 1.1510, Val

## Evaluation

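A validation RMSE of about 0.9 means predictions are typically off by roughly 0.9 stars on the 0.5 to 5 star rating scale. Because errors are squared before averaging, large misses weigh more heavily than they would in a mean absolute error.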

In [11]:
with torch.no_grad():
    test_data = test_data.to(device)
    pred = model(test_data.x_dict, test_data.edge_index_dict,
                 test_data['user', 'movie'].edge_label_index)
    pred = pred.clamp(min=0, max=5)
    target = test_data['user', 'movie'].edge_label.float()
    rmse = F.mse_loss(pred, target).sqrt()
    print(f'Test RMSE: {rmse:.4f}')

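# Note: edge_label_index holds the mapped (consecutive) user and movie indices,
# not the raw MovieLens IDs.
userId = test_data['user', 'movie'].edge_label_index[0].cpu().numpy()
movieId = test_data['user', 'movie'].edge_label_index[1].cpu().numpy()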
pred = pred.cpu().numpy()
target = target.cpu().numpy()

print(pd.DataFrame({'userId': userId, 'movieId': movieId, 'rating': pred, 'target': target}))

Test RMSE: 0.9104
       userId  movieId    rating  target
0         559      359  3.731366     4.0
1         413     2073  3.817492     4.0
2          43       98  3.503288     4.0
3         447      401  3.606769     4.0
4         607     1541  2.477752     3.0
...       ...      ...       ...     ...
10078     220     3734  3.930115     4.5
10079     419      399  3.464217     4.5
10080     124     4560  3.391508     4.5
10081     317     3136  3.916764     3.5
10082     390       69  4.122063     5.0

[10083 rows x 4 columns]
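
To make the RMSE figure concrete, the sketch below recomputes RMSE and mean absolute error (MAE) by hand for the first five rows of the table above, using only the standard library. Five rows are a tiny sample, so these values are not expected to match the full test RMSE.

```python
import math

# (prediction, target) pairs copied from the first five rows of the test table.
pairs = [
    (3.731366, 4.0),
    (3.817492, 4.0),
    (3.503288, 4.0),
    (3.606769, 4.0),
    (2.477752, 3.0),
]

errors = [pred - target for pred, target in pairs]
# RMSE: square each error, average, then take the square root.
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
# MAE: average of the absolute errors.
mae = sum(abs(e) for e in errors) / len(errors)

print(f"RMSE: {rmse:.4f}, MAE: {mae:.4f}")
```

On the same set of errors RMSE is never smaller than MAE, because squaring gives larger errors more weight.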
