# Practicing link regression on a heterogeneous graph
Source: https://colab.research.google.com/drive/1N3LvAO0AXV4kBPbTMX866OwJM9YS6Ji2 \
This notebook also uses [sentence-transformers](https://www.sbert.net/), which we can use to embed sentences and paragraphs. \
This might come in handy if we want to encode the diseases ourselves. \
More on sentence-transformers: https://huggingface.co/sentence-transformers

In [1]:
import torch
from torch_geometric.data import download_url, extract_zip, HeteroData
import pandas as pd

import numpy as np
from sentence_transformers import SentenceTransformer

import torch_geometric.transforms as T

from torch_geometric.nn import SAGEConv, to_hetero

import torch.nn.functional as F

from torch_geometric.explain import Explainer, CaptumExplainer

## Gathering data

### Downloading MovieLens data

In [2]:

dataset_name = 'ml-latest-small'

url = f'https://files.grouplens.org/datasets/movielens/{dataset_name}.zip'
extract_zip(download_url(url, 'datasets/'), 'datasets')

movies_path = f'datasets/{dataset_name}/movies.csv'
ratings_path = f'datasets/{dataset_name}/ratings.csv'

Using existing file ml-latest-small.zip
Extracting datasets/ml-latest-small.zip


In [3]:
ratings_df = pd.read_csv(ratings_path)
movies_df = pd.read_csv(movies_path)

In [4]:
print(movies_df.shape)
movies_df.head()

(9742, 3)


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
print(ratings_df.shape)
ratings_df.head()

(100836, 4)


Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


### Importing my IMDb ratings

In [6]:
my_user_id = ratings_df['userId'].max() + 1
print("My user id:", my_user_id)

# loading my ratings
my_ratings_df = pd.read_csv('datasets/my_ratings.csv')
print(my_ratings_df.shape)
my_ratings_df.head()

My user id: 611
(595, 13)


Unnamed: 0,Const,Your Rating,Date Rated,Title,URL,Title Type,IMDb Rating,Runtime (mins),Year,Genres,Num Votes,Release Date,Directors
0,tt1001526,9,2013-02-22,Megamind,https://www.imdb.com/title/tt1001526/,movie,7.3,95.0,2010,"Animation, Action, Comedy, Crime, Family, Myst...",285237.0,2010-10-28,Tom McGrath
1,tt0101414,9,2013-06-01,Beauty and the Beast,https://www.imdb.com/title/tt0101414/,movie,8.0,84.0,1991,"Animation, Family, Fantasy, Musical, Romance",471957.0,1991-09-29,"Kirk Wise, Gary Trousdale"
2,tt1029231,6,2014-05-25,Krrish 3,https://www.imdb.com/title/tt1029231/,movie,5.3,152.0,2013,"Action, Adventure, Sci-Fi",24624.0,2013-10-31,Rakesh Roshan
3,tt0103064,9,2013-05-21,Terminator 2: Judgment Day,https://www.imdb.com/title/tt0103064/,movie,8.6,137.0,1991,"Action, Sci-Fi",1147386.0,1991-07-01,James Cameron
4,tt1037705,8,2020-11-01,The Book of Eli,https://www.imdb.com/title/tt1037705/,movie,6.8,118.0,2010,"Action, Adventure, Drama, Thriller",332877.0,2010-01-11,"Allen Hughes, Albert Hughes"


#### Processing IMDb ratings dataframe

The IMDb ratings' column names are capitalized, but the MovieLens ones are not. \
So we lowercase the column names of the IMDb ratings.

In [7]:
my_ratings_df.columns = my_ratings_df.columns.str.strip().str.lower()
my_ratings_df.head()

Unnamed: 0,const,your rating,date rated,title,url,title type,imdb rating,runtime (mins),year,genres,num votes,release date,directors
0,tt1001526,9,2013-02-22,Megamind,https://www.imdb.com/title/tt1001526/,movie,7.3,95.0,2010,"Animation, Action, Comedy, Crime, Family, Myst...",285237.0,2010-10-28,Tom McGrath
1,tt0101414,9,2013-06-01,Beauty and the Beast,https://www.imdb.com/title/tt0101414/,movie,8.0,84.0,1991,"Animation, Family, Fantasy, Musical, Romance",471957.0,1991-09-29,"Kirk Wise, Gary Trousdale"
2,tt1029231,6,2014-05-25,Krrish 3,https://www.imdb.com/title/tt1029231/,movie,5.3,152.0,2013,"Action, Adventure, Sci-Fi",24624.0,2013-10-31,Rakesh Roshan
3,tt0103064,9,2013-05-21,Terminator 2: Judgment Day,https://www.imdb.com/title/tt0103064/,movie,8.6,137.0,1991,"Action, Sci-Fi",1147386.0,1991-07-01,James Cameron
4,tt1037705,8,2020-11-01,The Book of Eli,https://www.imdb.com/title/tt1037705/,movie,6.8,118.0,2010,"Action, Adventure, Drama, Thriller",332877.0,2010-01-11,"Allen Hughes, Albert Hughes"


Movie titles in the IMDb ratings are not in the same format as in MovieLens. \
IMDb titles are just the title, while MovieLens titles are the title followed by the year in parentheses. \
So we append the year to the IMDb titles.

In [8]:
my_ratings_df['title'] = my_ratings_df['title'] + ' (' + my_ratings_df['year'].astype(str) + ')'
my_ratings_df.head()

Unnamed: 0,const,your rating,date rated,title,url,title type,imdb rating,runtime (mins),year,genres,num votes,release date,directors
0,tt1001526,9,2013-02-22,Megamind (2010),https://www.imdb.com/title/tt1001526/,movie,7.3,95.0,2010,"Animation, Action, Comedy, Crime, Family, Myst...",285237.0,2010-10-28,Tom McGrath
1,tt0101414,9,2013-06-01,Beauty and the Beast (1991),https://www.imdb.com/title/tt0101414/,movie,8.0,84.0,1991,"Animation, Family, Fantasy, Musical, Romance",471957.0,1991-09-29,"Kirk Wise, Gary Trousdale"
2,tt1029231,6,2014-05-25,Krrish 3 (2013),https://www.imdb.com/title/tt1029231/,movie,5.3,152.0,2013,"Action, Adventure, Sci-Fi",24624.0,2013-10-31,Rakesh Roshan
3,tt0103064,9,2013-05-21,Terminator 2: Judgment Day (1991),https://www.imdb.com/title/tt0103064/,movie,8.6,137.0,1991,"Action, Sci-Fi",1147386.0,1991-07-01,James Cameron
4,tt1037705,8,2020-11-01,The Book of Eli (2010),https://www.imdb.com/title/tt1037705/,movie,6.8,118.0,2010,"Action, Adventure, Drama, Thriller",332877.0,2010-01-11,"Allen Hughes, Albert Hughes"


IMDb ratings range from 1 to 10. \
MovieLens ratings range from 0.5 to 5. \
So we halve the IMDb ratings. Note that the cell below also casts the result to `int`, which truncates half stars (a 9 becomes 4 rather than 4.5).

In [9]:
my_ratings_df['rating'] = (my_ratings_df['your rating'] / 2).astype(int)
my_ratings_df.head()

Unnamed: 0,const,your rating,date rated,title,url,title type,imdb rating,runtime (mins),year,genres,num votes,release date,directors,rating
0,tt1001526,9,2013-02-22,Megamind (2010),https://www.imdb.com/title/tt1001526/,movie,7.3,95.0,2010,"Animation, Action, Comedy, Crime, Family, Myst...",285237.0,2010-10-28,Tom McGrath,4
1,tt0101414,9,2013-06-01,Beauty and the Beast (1991),https://www.imdb.com/title/tt0101414/,movie,8.0,84.0,1991,"Animation, Family, Fantasy, Musical, Romance",471957.0,1991-09-29,"Kirk Wise, Gary Trousdale",4
2,tt1029231,6,2014-05-25,Krrish 3 (2013),https://www.imdb.com/title/tt1029231/,movie,5.3,152.0,2013,"Action, Adventure, Sci-Fi",24624.0,2013-10-31,Rakesh Roshan,3
3,tt0103064,9,2013-05-21,Terminator 2: Judgment Day (1991),https://www.imdb.com/title/tt0103064/,movie,8.6,137.0,1991,"Action, Sci-Fi",1147386.0,1991-07-01,James Cameron,4
4,tt1037705,8,2020-11-01,The Book of Eli (2010),https://www.imdb.com/title/tt1037705/,movie,6.8,118.0,2010,"Action, Adventure, Drama, Thriller",332877.0,2010-01-11,"Allen Hughes, Albert Hughes",4
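
The cast to `int` in the cell above drops half stars, although MovieLens ratings move in half-star steps. A minimal sketch of the difference, using a hypothetical helper named `imdb_to_movielens` that is not part of the notebook:

```python
def imdb_to_movielens(imdb_rating: int, truncate: bool = False) -> float:
    """Map a 1-10 IMDb rating onto the 0.5-5 MovieLens scale."""
    halved = imdb_rating / 2
    # `truncate=True` mimics the `.astype(int)` cast used in the notebook.
    return float(int(halved)) if truncate else halved


for imdb_rating in (9, 8, 1):
    print(
        imdb_rating,
        imdb_to_movielens(imdb_rating),
        imdb_to_movielens(imdb_rating, truncate=True),
    )
```

With truncation, an IMDb rating of 1 maps to 0.0, which lies below the smallest MovieLens rating of 0.5; keeping the float avoids that.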


Keeping only the titles that appear in both the IMDb ratings and the MovieLens movies.

In [10]:
my_ratings_df['userId'] = my_user_id

my_ratings_df['title'] = my_ratings_df['title'].str.strip()
movies_df['title'] = movies_df['title'].str.strip()

my_ratings_df_cropped = my_ratings_df[['title', 'rating', 'userId']].merge(movies_df[['movieId', 'title']], on='title')

# reordering columns to match the ratings_df
my_ratings_df_cropped = my_ratings_df_cropped[['userId', 'movieId', 'rating']]

print(my_ratings_df_cropped.shape)
my_ratings_df_cropped.head()

(298, 3)


Unnamed: 0,userId,movieId,rating
0,611,81564,4
1,611,595,4
2,611,589,4
3,611,97921,4
4,611,68954,4
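
Only 298 of the 595 IMDb ratings survive the inner merge on `title`, so it can be useful to see which titles failed to match. A hedged sketch on toy frames (the rows and ids below are made up for illustration), using the `indicator` option of `DataFrame.merge` with a left join:

```python
import pandas as pd

imdb = pd.DataFrame({
    "title": ["Megamind (2010)", "Krrish 3 (2013)", "The Book of Eli (2010)"],
    "rating": [4, 3, 4],
})
movielens = pd.DataFrame({
    "movieId": [1, 2],  # made-up ids
    "title": ["Megamind (2010)", "Book of Eli, The (2010)"],
})

# A left join keeps every IMDb row; `_merge` records whether a match was found.
merged = imdb.merge(movielens, on="title", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "title"].tolist()
print(unmatched)
```

Some MovieLens titles move a leading article to the end, as in "Browning Version, The (1951)", which is one plausible reason for exact string matches to fail.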


In [11]:
ratings_final_df = pd.concat([ratings_df, my_ratings_df_cropped])
print(ratings_final_df.shape)
ratings_final_df.tail()

(101134, 4)


Unnamed: 0,userId,movieId,rating,timestamp
293,611,90866,4.0,
294,611,1291,3.0,
295,611,2012,4.0,
296,611,1213,4.0,
297,611,586,4.0,


## Preprocessing data

### Feature generation

#### For movies from genres

In [12]:
movie_genres_one_hot = movies_df['genres'].str.get_dummies('|').values
movie_genres_one_hot = torch.from_numpy(movie_genres_one_hot).to(torch.float)
print(movie_genres_one_hot.shape)
movie_genres_one_hot

torch.Size([9742, 20])


tensor([[0., 0., 1.,  ..., 0., 0., 0.],
        [0., 0., 1.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 1., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
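
`str.get_dummies('|')` splits each genre string on `|` and emits one 0/1 column per distinct genre, so a movie with several genres gets several ones (a multi-hot row rather than a strictly one-hot one). The same idea in plain Python, as an illustrative sketch on three genre strings taken from the table above:

```python
genre_strings = [
    "Adventure|Children|Fantasy",
    "Comedy|Romance",
    "Comedy",
]

# One column per distinct genre; the columns are sorted alphabetically here.
vocabulary = sorted({genre for s in genre_strings for genre in s.split("|")})

multi_hot = [
    [1.0 if genre in s.split("|") else 0.0 for genre in vocabulary]
    for s in genre_strings
]

print(vocabulary)
for row in multi_hot:
    print(row)
```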

#### For movies from titles using `sentence-transformers`

In [13]:
sentence_encoder = SentenceTransformer('all-MiniLM-L6-v2')
sentence_encoder

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

In [14]:
with torch.no_grad():
    movie_titles_embeddings = sentence_encoder.encode(movies_df['title'].tolist(), convert_to_tensor=True, show_progress_bar=True)
    movie_titles_embeddings = movie_titles_embeddings.to(torch.float).cpu()

print(movie_titles_embeddings.shape)
movie_titles_embeddings


torch.Size([9742, 384])


tensor([[-0.0828,  0.0530,  0.0536,  ...,  0.0226,  0.0538,  0.1030],
        [-0.1053,  0.1508, -0.0264,  ...,  0.0106, -0.0726,  0.0086],
        [-0.0988,  0.0176, -0.0527,  ..., -0.0120,  0.0303,  0.0004],
        ...,
        [-0.1115,  0.0310, -0.0177,  ...,  0.0147,  0.0299,  0.0200],
        [ 0.0366,  0.0137,  0.0315,  ..., -0.0516, -0.0143,  0.1012],
        [-0.0500, -0.0141, -0.0031,  ...,  0.0320,  0.0546, -0.0271]])
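
The encoder printed above ends with a `Normalize()` module, so the title embeddings should have unit length, in which case a plain dot product between two rows equals their cosine similarity. A NumPy sketch with random stand-in vectors (not real embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(2, 384))  # stand-ins for two 384-dim embeddings

# Scale each row to unit L2 norm, as a normalization layer would.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

a, b = unit
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.isclose(a @ b, cosine))
```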

In [15]:
movie_features = torch.cat([
    movie_genres_one_hot, 
    movie_titles_embeddings
], dim=-1)
print(movie_features.shape)
movie_features

torch.Size([9742, 404])


tensor([[ 0.0000e+00,  0.0000e+00,  1.0000e+00,  ...,  2.2616e-02,
          5.3814e-02,  1.0297e-01],
        [ 0.0000e+00,  0.0000e+00,  1.0000e+00,  ...,  1.0561e-02,
         -7.2631e-02,  8.6104e-03],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ..., -1.2006e-02,
          3.0255e-02,  4.1660e-04],
        ...,
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  1.4684e-02,
          2.9905e-02,  2.0007e-02],
        [ 0.0000e+00,  1.0000e+00,  0.0000e+00,  ..., -5.1593e-02,
         -1.4267e-02,  1.0123e-01],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  3.1982e-02,
          5.4629e-02, -2.7146e-02]])

#### For user (identity matrix)

In [16]:
user_features = torch.eye(ratings_final_df['userId'].nunique()) # as we do not have any user features, we will use the identity matrix
print(user_features.shape)
user_features

torch.Size([611, 611])


tensor([[1., 0., 0.,  ..., 0., 0., 0.],
        [0., 1., 0.,  ..., 0., 0., 0.],
        [0., 0., 1.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 1., 0., 0.],
        [0., 0., 0.,  ..., 0., 1., 0.],
        [0., 0., 0.,  ..., 0., 0., 1.]])
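
With identity features, any linear layer applied directly to a user's feature row simply selects one row of its weight matrix, so each user effectively gets its own learnable embedding. A small NumPy sketch of that equivalence (the sizes and the matrix `W` are arbitrary placeholders, not values from the model):

```python
import numpy as np

num_users, hidden = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(num_users, hidden))  # stand-in for a learned weight matrix

user_features = np.eye(num_users)
projected = user_features @ W

# Row i of the projection is exactly row i of W.
print(np.allclose(projected, W))
```

The trade-off is that the feature matrix grows quadratically with the number of users (611 x 611 here).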

### Edge generation

In [17]:
def generate_edges(movies_df: pd.DataFrame, ratings_df: pd.DataFrame) -> tuple[torch.Tensor, torch.Tensor, np.ndarray, np.ndarray]:
    
    # creating movies id to index mapping
    unique_movie_ids = movies_df['movieId'].unique()
    movie_ids_to_idx_df = pd.DataFrame({
        'movieId': unique_movie_ids,
        'movie_idx': np.arange(len(unique_movie_ids))
    })
    
    # creating users id to index mapping
    unique_user_ids = ratings_df['userId'].unique()
    user_ids_to_idx_df = pd.DataFrame({
        'userId': unique_user_ids,
        'user_idx': np.arange(len(unique_user_ids))
    })
    
    # generating user to movie edges
    user_rates_movies = pd.merge( # edges
        left = pd.merge(
            left = ratings_df,
            right = user_ids_to_idx_df,
            on = 'userId',
            how = 'left'
        ),
        right = movie_ids_to_idx_df,
        on = 'movieId',
        how = 'left'
    ).loc[:, ['user_idx', 'movie_idx']].values

    ratings = ratings_df['rating'].values
    
    edge_index = torch.from_numpy(user_rates_movies).to(torch.long).t().contiguous()
    edge_labels = torch.from_numpy(ratings).to(torch.float)
    
    return edge_index, edge_labels, unique_user_ids, unique_movie_ids


In [18]:
edge_index, edge_labels, _, _ = generate_edges(movies_df, ratings_final_df)
print(edge_index.shape)
edge_index

torch.Size([2, 101134])


tensor([[   0,    0,    0,  ...,  610,  610,  610],
        [   0,    2,    5,  ..., 1487,  914,  504]])
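
`generate_edges` turns raw `userId` and `movieId` values into contiguous node indices, because `edge_index` has to refer to row positions in the feature matrices. The same mapping sketched with plain dictionaries on made-up ratings:

```python
movie_ids = [1, 3, 6, 47]  # order defines the movie node index
ratings = [(1, 1, 4.0), (1, 6, 4.0), (5, 47, 5.0)]  # (userId, movieId, rating)

movie_to_idx = {movie_id: i for i, movie_id in enumerate(movie_ids)}

# Users are indexed in order of first appearance, like `Series.unique()`.
user_to_idx = {}
for user_id, _, _ in ratings:
    user_to_idx.setdefault(user_id, len(user_to_idx))

edge_index = [
    [user_to_idx[u] for u, _, _ in ratings],   # source row: user indices
    [movie_to_idx[m] for _, m, _ in ratings],  # target row: movie indices
]
edge_labels = [r for _, _, r in ratings]

print(edge_index)
print(edge_labels)
```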

In [19]:
print(edge_labels.shape)
edge_labels

torch.Size([101134])


tensor([4., 4., 4.,  ..., 4., 4., 4.])

### Graph building

In [20]:
def build_graph(user_feats: torch.Tensor, movie_feats: torch.Tensor, edge_index: torch.Tensor, edge_labels: torch.Tensor) -> HeteroData:
    graph = HeteroData()

    graph['user'].x = user_feats
    graph['movie'].x = movie_feats


    graph['user', 'rates', 'movie'].edge_index = edge_index
    graph['user', 'rates', 'movie'].edge_label = edge_labels

    graph = T.ToUndirected()(graph)

    del graph['movie', 'rev_rates', 'user'].edge_label # we do not need reverse edges labels

    return graph


In [21]:
data = build_graph(user_features, movie_features, edge_index, edge_labels)
data

HeteroData(
  user={ x=[611, 611] },
  movie={ x=[9742, 404] },
  (user, rates, movie)={
    edge_index=[2, 101134],
    edge_label=[101134],
  },
  (movie, rev_rates, user)={ edge_index=[2, 101134] }
)
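
For this bipartite edge type, `T.ToUndirected()` adds a reverse edge type (`rev_rates`) whose `edge_index` is the original with its two rows swapped, which lets messages flow from movies back to users. A plain-Python illustration of the row swap on a tiny made-up edge list:

```python
rates = [
    [0, 0, 1],  # user indices (sources)
    [0, 2, 3],  # movie indices (targets)
]

# The reverse edge type swaps sources and targets.
rev_rates = [rates[1], rates[0]]
print(rev_rates)
```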

### Dataset splitting

In [22]:
transform = T.RandomLinkSplit(
    num_val=0.1,
    num_test=0.1,
    neg_sampling_ratio=0.0, # not now, will generate them on the fly later
    # disjoint_train_ratio=0.3,
    edge_types = [('user', 'rates', 'movie')],
    rev_edge_types = [('movie', 'rev_rates', 'user')] 
)

In [23]:
train_data, val_data, test_data = transform(data)
train_data, val_data, test_data

(HeteroData(
   user={ x=[611, 611] },
   movie={ x=[9742, 404] },
   (user, rates, movie)={
     edge_index=[2, 80908],
     edge_label=[80908],
     edge_label_index=[2, 80908],
   },
   (movie, rev_rates, user)={ edge_index=[2, 80908] }
 ),
 HeteroData(
   user={ x=[611, 611] },
   movie={ x=[9742, 404] },
   (user, rates, movie)={
     edge_index=[2, 80908],
     edge_label=[10113],
     edge_label_index=[2, 10113],
   },
   (movie, rev_rates, user)={ edge_index=[2, 80908] }
 ),
 HeteroData(
   user={ x=[611, 611] },
   movie={ x=[9742, 404] },
   (user, rates, movie)={
     edge_index=[2, 91021],
     edge_label=[10113],
     edge_label_index=[2, 10113],
   },
   (movie, rev_rates, user)={ edge_index=[2, 91021] }
 ))
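
The edge counts in the three splits follow from the 10% validation and 10% test fractions. The arithmetic below reproduces the numbers printed above; truncating with `int` is an assumption about how the fractions are rounded, chosen because it matches the output:

```python
num_edges = 101134

num_val = int(0.1 * num_edges)
num_test = int(0.1 * num_edges)
num_train = num_edges - num_val - num_test

# Validation reuses the training edges for message passing;
# test message passing can additionally use the validation edges.
test_message_edges = num_train + num_val

print(num_train, num_val, num_test, test_message_edges)
```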

## Model

### Encoder

In [24]:
class GNNEncoder(torch.nn.Module):
    def __init__(self, hidden_channels: int, out_channels: int):
        super().__init__()
        
        # self.conv = torch.nn.ModuleList()
        # self.conv.append(SAGEConv((-1, -1), hidden_channels))
        
        # for i in range(num_layers - 1):
        #     self.conv.append(SAGEConv(hidden_channels, hidden_channels))

        self.conv1 = SAGEConv((-1, -1), hidden_channels)
        self.conv2 = SAGEConv(hidden_channels, out_channels)

    
    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index)
        return x

### Decoder

In [25]:
class EdgeDecoder(torch.nn.Module):
    def __init__(self, input_channels: int):
        super().__init__()
        self.lin1 = torch.nn.Linear(2 * input_channels, input_channels) # 2 * input_channels because we are concatenating user and movie embeddings
        self.lin2 = torch.nn.Linear(input_channels, 1)

    def forward(self, z_dict: dict[str, torch.Tensor], edge_label_index: torch.Tensor) -> torch.Tensor:
        z: torch.Tensor = torch.cat([
            z_dict['user'][edge_label_index[0]], 
            z_dict['movie'][edge_label_index[1]]
        ], dim=-1) # concatenating user and movie embeddings, only the ones that are in the edge_label_index (supervised edges)

        z = self.lin1(z).relu()
        z = self.lin2(z)
        return z.view(-1)


### Model

In [26]:
class Model(torch.nn.Module):
    def __init__(self, hidden_channels: int, out_channels: int):
        super().__init__()

        self.encoder = GNNEncoder(hidden_channels, out_channels)
        self.encoder = to_hetero(self.encoder, data.metadata())
        
        self.decoder = EdgeDecoder(out_channels)

    def forward(self, x_dict: dict[str, torch.Tensor], edge_index_dict: dict[tuple[str, str, str], torch.Tensor], edge_label_index: torch.Tensor) -> torch.Tensor:
        z_dict = self.encoder(x_dict, edge_index_dict) ## message passing using edge_index
        return self.decoder(z_dict, edge_label_index) ## prediction for the edges in edge_label_index

## Training

In [27]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [28]:
model = Model(64, 32).to(device)
model

Model(
  (encoder): GraphModule(
    (conv1): ModuleDict(
      (user__rates__movie): SAGEConv((-1, -1), 64, aggr=mean)
      (movie__rev_rates__user): SAGEConv((-1, -1), 64, aggr=mean)
    )
    (conv2): ModuleDict(
      (user__rates__movie): SAGEConv(64, 32, aggr=mean)
      (movie__rev_rates__user): SAGEConv(64, 32, aggr=mean)
    )
  )
  (decoder): EdgeDecoder(
    (lin1): Linear(in_features=64, out_features=32, bias=True)
    (lin2): Linear(in_features=32, out_features=1, bias=True)
  )
)

In [29]:
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
optimizer

Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 0.01
    maximize: False
    weight_decay: 0
)

In [30]:
def train(data: HeteroData):
    data = data.to(device)
    
    model.train()
    optimizer.zero_grad()

    pred = model(
        data.x_dict,
        data.edge_index_dict, # for message passing
        data.edge_label_index_dict["user", "rates", "movie"] # for edge prediction
        # for message passing we're using edges from both directions, but for edge prediction we're only using edges from one direction (user -> movie)
    )

    target = data['user', 'movie'].edge_label
    loss = F.mse_loss(pred, target)
    loss.backward()
    optimizer.step()

    return loss.item()

#### Some notes and explorations

In [31]:
data.x_dict

{'user': tensor([[1., 0., 0.,  ..., 0., 0., 0.],
         [0., 1., 0.,  ..., 0., 0., 0.],
         [0., 0., 1.,  ..., 0., 0., 0.],
         ...,
         [0., 0., 0.,  ..., 1., 0., 0.],
         [0., 0., 0.,  ..., 0., 1., 0.],
         [0., 0., 0.,  ..., 0., 0., 1.]]),
 'movie': tensor([[ 0.0000e+00,  0.0000e+00,  1.0000e+00,  ...,  2.2616e-02,
           5.3814e-02,  1.0297e-01],
         [ 0.0000e+00,  0.0000e+00,  1.0000e+00,  ...,  1.0561e-02,
          -7.2631e-02,  8.6104e-03],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ..., -1.2006e-02,
           3.0255e-02,  4.1660e-04],
         ...,
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  1.4684e-02,
           2.9905e-02,  2.0007e-02],
         [ 0.0000e+00,  1.0000e+00,  0.0000e+00,  ..., -5.1593e-02,
          -1.4267e-02,  1.0123e-01],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  3.1982e-02,
           5.4629e-02, -2.7146e-02]])}

In [32]:
data.edge_index_dict

{('user',
  'rates',
  'movie'): tensor([[   0,    0,    0,  ...,  610,  610,  610],
         [   0,    2,    5,  ..., 1487,  914,  504]]),
 ('movie',
  'rev_rates',
  'user'): tensor([[   0,    2,    5,  ..., 1487,  914,  504],
         [   0,    0,    0,  ...,  610,  610,  610]])}

In [33]:
(data['user', 'movie'].edge_label == data['user', 'rates', 'movie'].edge_label).all()

tensor(True)

In [34]:
(data.edge_label_dict['user', 'rates', 'movie'] == data['user', 'movie'].edge_label).all()
# however data.edge_label_dict['user', 'movie'] will not work
# on `*_dict` we need the full 'identifier'

tensor(True)

In [35]:
train_data.edge_label_index_dict # from the looks of it, the `*_dict` attributes are generated on the fly in `HeteroData` instances
# for example, `x_dict` is generated from all the `x` attributes
# our `data` does not have any `edge_label_index`, so there is no `edge_label_index_dict` on `data`
# however, T.RandomLinkSplit generates `edge_label_index`, so we can get `edge_label_index_dict` for `train_data`, `val_data` and `test_data`

{('user',
  'rates',
  'movie'): tensor([[  21,  273,  134,  ...,  176,  216,  287],
         [5917, 1472, 1444,  ..., 2966, 1478, 1493]])}

#### Back to training

In [36]:
@torch.no_grad()
def test(data: HeteroData):
    data = data.to(device)
    
    model.eval()
    pred = model(
        data.x_dict,
        data.edge_index_dict, # for message passing
        data.edge_label_index_dict["user", "rates", "movie"] # for edge prediction
    )

    pred = pred.clamp(0, 5) # ratings are between 0 and 5
    target = data['user', 'movie'].edge_label
    
    rmse = F.mse_loss(pred, target).sqrt()
    return rmse.item()
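
The metric used here is the root of the mean squared error, computed after clamping predictions to the valid rating range. A dependency-free sketch with made-up numbers (the helper name `clamped_rmse` is hypothetical):

```python
import math


def clamped_rmse(preds, targets, low=0.0, high=5.0):
    """RMSE after clipping each prediction into [low, high]."""
    clamped = [min(max(p, low), high) for p in preds]
    mse = sum((p - t) ** 2 for p, t in zip(clamped, targets)) / len(targets)
    return math.sqrt(mse)


preds = [5.6, 3.0, -0.2]   # clamps to [5.0, 3.0, 0.0]
targets = [5.0, 4.0, 1.0]
print(clamped_rmse(preds, targets))
```

Clamping only happens at evaluation time in the notebook; the training loss is computed on the raw predictions.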


In [37]:
for epoch in range(1, 50):
    train_data = train_data.to(device)
    train_loss = train(train_data)
    val_rmse = test(val_data)
    print(f'Epoch: {epoch:03d}, Train Loss: {train_loss:.4f}, Val RMSE: {val_rmse:.4f}')

Epoch: 001, Train Loss: 12.2536, Val RMSE: 3.1912
Epoch: 002, Train Loss: 10.2703, Val RMSE: 2.5651
Epoch: 003, Train Loss: 6.6285, Val RMSE: 1.3560
Epoch: 004, Train Loss: 1.8384, Val RMSE: 1.8145
Epoch: 005, Train Loss: 5.3617, Val RMSE: 1.6027
Epoch: 006, Train Loss: 2.6254, Val RMSE: 1.0445
Epoch: 007, Train Loss: 1.0854, Val RMSE: 1.3300
Epoch: 008, Train Loss: 1.7650, Val RMSE: 1.5993
Epoch: 009, Train Loss: 2.5609, Val RMSE: 1.6518
Epoch: 010, Train Loss: 2.7326, Val RMSE: 1.5219
Epoch: 011, Train Loss: 2.3149, Val RMSE: 1.2702
Epoch: 012, Train Loss: 1.6046, Val RMSE: 1.0441
Epoch: 013, Train Loss: 1.0778, Val RMSE: 1.1371
Epoch: 014, Train Loss: 1.2896, Val RMSE: 1.3545
Epoch: 015, Train Loss: 1.8434, Val RMSE: 1.3106
Epoch: 016, Train Loss: 1.7233, Val RMSE: 1.1112
Epoch: 017, Train Loss: 1.2291, Val RMSE: 1.0159
Epoch: 018, Train Loss: 1.0187, Val RMSE: 1.0834
Epoch: 019, Train Loss: 1.1588, Val RMSE: 1.1760
Epoch: 020, Train Loss: 1.3686, Val RMSE: 1.2055
Epoch: 021, Train 

Avg RMSE at the end of training (the loop above runs 49 epochs, since `range(1, 50)` stops at 49): 0.9538 out of 5 = 19.076% error

## Evaluation

In [38]:
with torch.no_grad():
    test_data = test_data.to(device)

    pred = model(
        test_data.x_dict,
        test_data.edge_index_dict, # for message passing
        test_data['user', 'movie'].edge_label_index # for edge prediction
    )

    pred = pred.clamp(0, 5) # ratings are between 0 and 5

    target = test_data['user', 'movie'].edge_label

    rmse = F.mse_loss(pred, target).sqrt()
    print(f'Test RMSE: {rmse:.4f}')

userId = test_data["user", "movie"].edge_label_index[0].cpu().numpy()
movieId = test_data["user", "movie"].edge_label_index[1].cpu().numpy()

pred = pred.cpu().numpy()
target = target.cpu().numpy()

pred_df = pd.DataFrame({
    'user_idx': userId,
    'movie_idx': movieId,
    'pred': pred,
    'target': target
})
pred_df

Test RMSE: 0.9335


Unnamed: 0,user_idx,movie_idx,pred,target
0,508,5227,3.532783,3.5
1,90,1503,3.824166,4.5
2,110,8367,2.734814,3.0
3,304,510,4.037499,5.0
4,170,898,4.358857,4.0
...,...,...,...,...
10108,232,2372,3.722188,4.0
10109,461,2836,3.665640,4.0
10110,65,277,4.093874,4.5
10111,312,277,3.807434,3.0


In [39]:
pred_df[pred_df['user_idx'] == my_user_id-1]

Unnamed: 0,user_idx,movie_idx,pred,target
684,610,8159,3.692035,4.0
946,610,7710,3.563162,4.0
1090,610,8237,3.366368,3.0
1268,610,6254,3.563948,3.0
2047,610,1503,3.946443,4.0
2407,610,7569,3.649127,4.0
2530,610,8032,3.300668,4.0
2963,610,7064,3.543276,3.0
3091,610,836,3.769709,4.0
3229,610,8490,3.422407,3.0


## Making recommendations

### Selecting the movies I have not watched

In [40]:
# generating the id to index mapping for movies and users

# creating movies id to index mapping
unique_movie_ids = movies_df['movieId'].unique()
movie_ids_to_idx_df = pd.DataFrame({
    'movieId': unique_movie_ids,
    'movie_idx': np.arange(len(unique_movie_ids))
})

# creating users id to index mapping
unique_user_ids = ratings_final_df['userId'].unique()
user_ids_to_idx_df = pd.DataFrame({
    'userId': unique_user_ids,
    'user_idx': np.arange(len(unique_user_ids))
})

In [41]:
# fetching the index of my user id

my_user_idx = user_ids_to_idx_df[user_ids_to_idx_df['userId'] == my_user_id]['user_idx'].values[0]
my_user_idx

610

In [42]:
# selecting movies that my user has not rated

movies_not_rated_by_me = movies_df[~movies_df['movieId'].isin(my_ratings_df_cropped['movieId'])]
movies_not_rated_by_me

Unnamed: 0,movieId,title,genres
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
6,7,Sabrina (1995),Comedy|Romance
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


### Predicting the ratings

In [43]:
movie_to_predict_df = movies_not_rated_by_me.sample(1)
movie_to_predict_df

Unnamed: 0,movieId,title,genres
5248,8607,Tokyo Godfathers (2003),Adventure|Animation|Drama


In [44]:
movie_to_predict_idx = movie_ids_to_idx_df[movie_ids_to_idx_df['movieId'] == movie_to_predict_df['movieId'].values[0]]['movie_idx'].values[0]
movie_to_predict_idx

5248

In [45]:
edge_label_index = torch.tensor([
    my_user_idx,
    movie_to_predict_idx
])
edge_label_index

tensor([ 610, 5248], dtype=torch.int32)

In [46]:
with torch.no_grad():
    model.eval()
    test_data = test_data.to(device)
    pred = model(test_data.x_dict, test_data.edge_index_dict, edge_label_index)
pred

tensor([3.9499], device='cuda:0')

In [47]:
movies_not_rated_by_me["userId"] = my_user_id
movies_not_rated_by_me['rating'] = 0.0
movies_not_rated_by_me

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies_not_rated_by_me["userId"] = my_user_id
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies_not_rated_by_me['rating'] = 0.0


Unnamed: 0,movieId,title,genres,userId,rating
1,2,Jumanji (1995),Adventure|Children|Fantasy,611,0.0
2,3,Grumpier Old Men (1995),Comedy|Romance,611,0.0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,611,0.0
4,5,Father of the Bride Part II (1995),Comedy,611,0.0
6,7,Sabrina (1995),Comedy|Romance,611,0.0
...,...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,611,0.0
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,611,0.0
9739,193585,Flint (2017),Drama,611,0.0
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,611,0.0


In [48]:
edge_label_index2, _, _, _= generate_edges(movies_not_rated_by_me, movies_not_rated_by_me)
edge_label_index2

tensor([[   0,    0,    0,  ...,    0,    0,    0],
        [   0,    1,    2,  ..., 9441, 9442, 9443]])
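
In the tensor above, the user row is all zeros and the movie row runs from 0 to 9443, so these indices are positions within the filtered `movies_not_rated_by_me` frame rather than the graph's node indices (`my_user_idx` is 610, and movie nodes follow the order of the full `movies_df`). A hedged sketch of building the index pairs against the global mapping instead, on toy data with hypothetical values:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for `movies_df` and for the movies already rated.
movies_df = pd.DataFrame({"movieId": [1, 2, 3, 4, 5]})
rated_movie_ids = [1, 4]
my_user_idx = 2  # hypothetical graph index of the target user

# Graph node index of a movie = its position in the full `movies_df`.
movie_idx_by_id = pd.Series(np.arange(len(movies_df)), index=movies_df["movieId"])

unrated_ids = movies_df.loc[
    ~movies_df["movieId"].isin(rated_movie_ids), "movieId"
].to_numpy()
unrated_movie_idx = movie_idx_by_id.loc[unrated_ids].to_numpy()

# Row 0: the same user index repeated; row 1: global movie indices.
edge_label_index = np.vstack([
    np.full(len(unrated_movie_idx), my_user_idx),
    unrated_movie_idx,
])
print(edge_label_index)
```

Converting the result with `torch.from_numpy(edge_label_index).to(torch.long)` would give a tensor in the shape the model expects.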

In [49]:
with torch.no_grad():
    model.eval()
    test_data = test_data.to(device)
    edge_label_index2 = edge_label_index2.to(device)
    pred = model(test_data.x_dict, test_data.edge_index_dict, edge_label_index2)
pred

tensor([3.9558, 3.8174, 3.8707,  ..., 3.5639, 3.7004, 4.1601], device='cuda:0')

In [50]:
edge_label_index2.to(torch.device("cuda"))

tensor([[   0,    0,    0,  ...,    0,    0,    0],
        [   0,    1,    2,  ..., 9441, 9442, 9443]], device='cuda:0')

In [51]:
movies_not_rated_by_me["rating"] = pred.cpu().numpy()    

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies_not_rated_by_me["rating"] = pred.cpu().numpy()


### Final bulk recommendation

In [52]:
movies_not_rated_by_me.sort_values(by="rating", ascending=False).head(20)

Unnamed: 0,movieId,title,genres,userId,rating
5552,26694,Ju Dou (1990),Drama,611,4.896482
5957,34482,"Browning Version, The (1951)",Drama,611,4.773307
4441,6559,Little Giants (1994),Children|Comedy,611,4.770657
9006,140110,The Intern (2015),Comedy,611,4.761703
5200,8464,Super Size Me (2004),Comedy|Documentary|Drama,611,4.727155
7545,85056,I Am Number Four (2011),Action|Sci-Fi|Thriller|IMAX,611,4.719279
5519,26524,"Times of Harvey Milk, The (1984)",Documentary,611,4.718159
8927,136016,The Good Dinosaur (2015),Adventure|Animation|Children|Comedy|Fantasy,611,4.7177
8403,110387,"Unknown Known, The (2013)",Documentary,611,4.685498
3258,4404,Faust (1926),Drama|Fantasy|Horror,611,4.678583


In [53]:
movies_not_rated_by_me.sort_values(by="rating", ascending=False).tail(20)

Unnamed: 0,movieId,title,genres,userId,rating
8302,106648,Guilty of Romance (Koi no tsumi) (2011),Crime|Drama|Horror,611,3.075194
1287,1713,Mouse Hunt (1997),Children|Comedy,611,3.064205
9592,175569,Wind River (2017),Action|Crime|Mystery|Thriller,611,3.060419
8953,136666,Search Party (2014),Comedy,611,3.06016
9604,176371,Blade Runner 2049 (2017),Sci-Fi,611,3.059762
7343,78116,Please Give (2010),Comedy|Drama,611,3.058069
9485,169984,Alien: Covenant (2017),Action|Horror|Sci-Fi|Thriller,611,3.055403
6854,62155,Nick and Norah's Infinite Playlist (2008),Comedy|Drama|Romance,611,3.044075
9535,172547,Despicable Me 3 (2017),Adventure|Animation|Children|Comedy,611,3.043691
9075,142550,Ryuzo and the Seven Henchmen (2015),Action|Comedy,611,3.04114


## GNN Prediction Explanation

In [54]:
explainer = Explainer(
    model = model,
    algorithm = CaptumExplainer("IntegratedGradients"),
    explanation_type = "model",

    model_config = dict(
        mode = "regression",
        task_level = "edge",
        return_type = "raw"
    ),

    # not sure what these do yet
    node_mask_type=None,
    edge_mask_type="object",
)
explainer

<torch_geometric.explain.explainer.Explainer at 0x1ff9b317fd0>

In [55]:
explanation = explainer(test_data.x_dict, test_data.edge_index_dict, index=0, edge_label_index = edge_label_index2).cpu().detach()
explanation

HeteroExplanation(
  prediction=[9444],
  target=[9444],
  index=[1],
  edge_label_index=[2, 9444],
  user={ x=[611, 611] },
  movie={ x=[9742, 404] },
  (user, rates, movie)={
    edge_mask=[91021],
    edge_index=[2, 91021],
  },
  (movie, rev_rates, user)={
    edge_mask=[91021],
    edge_index=[2, 91021],
  }
)

In [56]:
user_to_movie = explanation["user", "rates", "movie"].edge_index.T.numpy()
print(user_to_movie.shape)
user_to_movie

(91021, 2)


array([[  21, 5917],
       [ 273, 1472],
       [ 134, 1444],
       ...,
       [ 168, 1545],
       [ 176, 6985],
       [ 181,  430]], dtype=int64)

In [57]:
user_to_movie_attr = explanation["user", "rates", "movie"].edge_mask.numpy().T
print(user_to_movie_attr.shape)
user_to_movie_attr

(91021,)


array([0.0000000e+00, 0.0000000e+00, 6.8767753e-07, ..., 0.0000000e+00,
       0.0000000e+00, 0.0000000e+00])

In [58]:
user_to_movie_df = pd.DataFrame(
    np.hstack([user_to_movie, user_to_movie_attr.reshape(-1, 1)]),
    columns=["user_idx", "movie_idx", "attr"]
)
print(user_to_movie_df.shape)
user_to_movie_df

(91021, 3)


Unnamed: 0,user_idx,movie_idx,attr
0,21.0,5917.0,0.000000e+00
1,273.0,1472.0,0.000000e+00
2,134.0,1444.0,6.876775e-07
3,609.0,2654.0,0.000000e+00
4,386.0,830.0,0.000000e+00
...,...,...,...
91016,508.0,3336.0,0.000000e+00
91017,533.0,6912.0,0.000000e+00
91018,168.0,1545.0,0.000000e+00
91019,176.0,6985.0,0.000000e+00


In [59]:
movie_to_user = explanation["movie", "user"].edge_index.T.numpy()
print(movie_to_user.shape)


movie_to_user_attr = explanation["movie", "user"].edge_mask.numpy().T
print(movie_to_user_attr.shape)
movie_to_user, movie_to_user_attr

(91021, 2)
(91021,)


(array([[5917,   21],
        [1472,  273],
        [1444,  134],
        ...,
        [1545,  168],
        [6985,  176],
        [ 430,  181]], dtype=int64),
 array([0.00000000e+00, 0.00000000e+00, 1.45788992e-06, ...,
        1.61141143e-06, 3.40972746e-07, 3.87258117e-07]))

In [60]:
movie_to_user_df = pd.DataFrame(
    np.hstack([movie_to_user, movie_to_user_attr.reshape(-1, 1)]),
    columns=["movie_idx", "user_idx", "attr"]
)
print(movie_to_user_df.shape)
movie_to_user_df

(91021, 3)


Unnamed: 0,movie_idx,user_idx,attr
0,5917.0,21.0,0.000000e+00
1,1472.0,273.0,0.000000e+00
2,1444.0,134.0,1.457890e-06
3,2654.0,609.0,2.713963e-07
4,830.0,386.0,0.000000e+00
...,...,...,...
91016,3336.0,508.0,8.140527e-07
91017,6912.0,533.0,6.345770e-07
91018,1545.0,168.0,1.611411e-06
91019,6985.0,176.0,3.409727e-07


In [61]:
explanation_df = pd.concat([user_to_movie_df, movie_to_user_df])
explanation_df[['user_idx', 'movie_idx']] = explanation_df[['user_idx', 'movie_idx']].astype(int)
print(explanation_df.shape)
explanation_df

(182042, 3)


Unnamed: 0,user_idx,movie_idx,attr
0,21,5917,0.000000e+00
1,273,1472,0.000000e+00
2,134,1444,6.876775e-07
3,609,2654,0.000000e+00
4,386,830,0.000000e+00
...,...,...,...
91016,508,3336,8.140527e-07
91017,533,6912,6.345770e-07
91018,168,1545,1.611411e-06
91019,176,6985,3.409727e-07


In [62]:
my_explanation_df = explanation_df[explanation_df['user_idx'] == my_user_idx]
my_explanation_df

Unnamed: 0,user_idx,movie_idx,attr
98,610,8200,0.000000e+00
316,610,1466,0.000000e+00
783,610,2836,4.562612e-07
884,610,6539,0.000000e+00
1394,610,8451,0.000000e+00
...,...,...,...
89490,610,5901,1.537246e-06
89566,610,7741,9.691935e-07
89635,610,7538,1.573790e-06
89976,610,6209,1.316606e-06


In [63]:
my_explanation_df = my_explanation_df.groupby('movie_idx').sum()
my_explanation_df # `user_idx` sums to 2 * my_user_idx = 1220 because each movie has 2 edges (user -> movie and movie -> user), and `sum()` adds that column up too

Unnamed: 0_level_0,user_idx,attr
movie_idx,Unnamed: 1_level_1,Unnamed: 2_level_1
0,1220,5.417223e-04
5,1220,1.952409e-06
18,1220,1.165269e-06
92,1220,1.396366e-06
98,1220,1.456762e-06
...,...,...
9437,1220,1.358864e-06
9464,1220,1.097243e-06
9557,1220,8.751189e-07
9570,1220,1.710010e-06
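
Summing every column also sums `user_idx`, which is why it shows 1220. Selecting only the attribution column before aggregating avoids that artifact; a small sketch on made-up attributions:

```python
import pandas as pd

edges = pd.DataFrame({
    "user_idx": [610, 610, 610, 610],
    "movie_idx": [0, 5, 0, 5],
    "attr": [0.25, 0.5, 0.125, 0.0],  # made-up attribution values
})

# Aggregate only `attr`, so the constant `user_idx` column is not summed.
per_movie = edges.groupby("movie_idx")["attr"].sum()
print(per_movie)
```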


### Influences for the predictions
These come from among the movies I have rated.

In [64]:
my_explanation_df = my_explanation_df.merge(movie_ids_to_idx_df, on='movie_idx', how='left').merge(movies_df, on='movieId', how='left')
my_explanation_df.sort_values(by='attr', ascending=False, key = lambda x: abs(x))[['title', 'attr']].head(20)

Unnamed: 0,title,attr
0,Toy Story (1995),0.000542
24,Bambi (1942),3e-06
23,Back to the Future Part III (1990),2e-06
18,L.A. Confidential (1997),2e-06
13,Goodfellas (1990),2e-06
6,Forrest Gump (1994),2e-06
109,Watchmen (2009),2e-06
134,Inception (2010),2e-06
32,Gladiator (2000),2e-06
138,Tangled (2010),2e-06


In [65]:
my_explanation_df.sort_values(by='attr', ascending=False, key = lambda x: abs(x))[['title', 'attr']].tail(20)

Unnamed: 0,title,attr
168,Ted (2012),9.089261e-07
181,Stand Up Guys (2012),8.959181e-07
148,Mr. Popper's Penguins (2011),8.839923e-07
261,Valerian and the City of a Thousand Planets (2...,8.751189e-07
205,Ride Along (2014),8.597492e-07
236,Insidious: Chapter 3 (2015),8.417633e-07
102,"10,000 BC (2008)",8.192851e-07
103,Horton Hears a Who! (2008),8.059745e-07
45,Mean Machine (2001),7.933298e-07
251,The Man from U.N.C.L.E. (2015),7.470864e-07
