This tutorial is adapted from [WikiNet — An Experiment in Recurrent Graph Neural Networks](https://medium.com/stanford-cs224w/wikinet-an-experiment-in-recurrent-graph-neural-networks-3f149676fbf3) by Alexander Hurtado.

# WikiNet

WikiNet tackles the target prediction problem on the Wikispeedia dataset: given the sequence of articles a player has clicked so far, predict the target article the player is trying to reach. The code below covers the model definitions, training, and evaluation for the experiments.

First, we install the necessary libraries and download the dataset.

In [1]:
!pip install torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+cu111.html
!pip install torch-sparse -f https://data.pyg.org/whl/torch-1.10.0+cu111.html
!pip install torch-geometric
!pip install class-resolver

!wget --no-cache https://github.com/alexanderjhurtado/cs224w_wikinet/raw/main/colab_starter_pack/graph_with_features.gml.zip
!wget --no-cache https://github.com/alexanderjhurtado/cs224w_wikinet/raw/main/colab_starter_pack/paths_and_labels.tsv
!unzip -o /content/graph_with_features.gml.zip

Looking in links: https://data.pyg.org/whl/torch-1.10.0+cu111.html
Collecting torch-scatter
  Downloading torch_scatter-2.1.1.tar.gz (107 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: torch-scatter
  Building wheel for torch-scatter (setup.py) ... done
  Created wheel for torch-scatter: filename=torch_scatter-2.1.1-cp310-cp310-linux_x86_64.whl size=3536424 sha256=59baa7a9d2226f6554091f257033cbc9e5a4813c5f229921d443fbb077ad5f76
  Stored in directory: /root/.cache/pip/wheels/ef/67/58/6566a3b61c6ec0f2ca0c2c324cd035ef2955601f0fb3197d5f
Successfully built torch-scatter
Installing collected packages: torch-scatter
Successfully installed torch-scatter-2.1.1
Looking in links: https://data.pyg.org/whl/torch-1.10.0+cu111.html
Collecting torch-sparse
  Downloading torch_sparse-0.6.17.tar.gz (209 kB)

Here, we import all libraries that will be used by the code.

In [2]:
import json
import pandas as pd
import time
import networkx as nx
from torch_geometric.utils import from_networkx

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCN, GAT, GraphSAGE
from torch.utils.data import Dataset, DataLoader

In [3]:
# Getting the dataset (already fetched and unzipped in the first cell).
# The /raw/ URLs return the files themselves; /blob/ URLs return GitHub's HTML page instead.
# -nc skips the download when the file is already present.
!wget -nc https://github.com/alexanderjhurtado/cs224w_wikinet/raw/main/colab_starter_pack/graph_with_features.gml.zip
!wget -nc https://github.com/alexanderjhurtado/cs224w_wikinet/raw/main/colab_starter_pack/paths_and_labels.tsv

In [None]:
nx_graph = nx.read_gml('graph_with_features.gml')
G = from_networkx(nx_graph, group_node_attrs=['out_degree', 'in_degree', 'category_multi_hot', 'article_embed'])

path_data = pd.read_csv('paths_and_labels.tsv', sep='\t', header=None)

The following function is called after each training epoch to score the model on the validation set, and once more at the end to score it on the test set. It returns the accuracy and a loss value.

In [None]:
def get_evaluation_metrics(model, device, dataloader, dataset_size):
    model.eval()
    avg_loss = 0
    num_correct = 0
    with torch.no_grad():
        for i, data in enumerate(dataloader):
            # move the batch of paths and target labels to the device
            inputs = data['indices'].to(device)
            labels = data['label'].to(device)
            outputs = model(inputs)  # log-probabilities over all articles
            # F.nll_loss averages over the batch by default (reduction='mean')
            loss = F.nll_loss(outputs, labels)
            avg_loss += loss.item()  # running sum of per-batch mean losses
            # count correct top-1 predictions
            pred = outputs.argmax(dim=1)
            correct = (pred == labels).sum()
            num_correct += correct
    acc = int(num_correct) / dataset_size
    # NOTE: a sum of batch means divided by the number of samples is a scaled
    # loss (roughly mean loss / batch size), not a per-sample mean
    avg_loss /= dataset_size
    return acc, avg_loss
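
One detail of the function above is worth spelling out: `F.nll_loss` averages over the batch by default, so the function adds up per-batch means and then divides by the number of samples. The plain-Python sketch below contrasts that quantity with a true per-sample mean. The loss values are made up and the helper names are hypothetical.

```python
from statistics import fmean


def batch_means(per_sample_losses, batch_size):
    """Mean loss of each consecutive batch (mirrors reduction='mean')."""
    return [
        fmean(per_sample_losses[start:start + batch_size])
        for start in range(0, len(per_sample_losses), batch_size)
    ]


def notebook_style_loss(per_sample_losses, batch_size):
    """Sum of per-batch means divided by the number of samples."""
    return sum(batch_means(per_sample_losses, batch_size)) / len(per_sample_losses)


def per_sample_mean_loss(per_sample_losses):
    """Ordinary mean over all samples."""
    return fmean(per_sample_losses)


losses = [2.0] * 8  # illustrative values only
scaled = notebook_style_loss(losses, batch_size=4)
true_mean = per_sample_mean_loss(losses)
print(scaled, true_mean)
```

When every batch is full, the first value equals the per-sample mean divided by the batch size, which would explain why the loss values printed in this notebook are small.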

This defines the dataset class we use to represent the path data.

In [None]:
class CustomPathDataset(Dataset):
    def __init__(self, path_data):
        self.x = path_data[0].apply(json.loads)
        self.labels = path_data[1]
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        x = torch.LongTensor(self.x[idx])
        label = self.labels[idx]
        sample = {"indices": x, "label": label}
        return sample
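
To make the expected input concrete, here is a small sketch of how one row of the path file maps to a sample. It assumes, based on the `json.loads` call above, that column 0 holds a JSON-encoded list of article indices and column 1 an integer target index. The rows and the helper name `parse_path_rows` are invented for illustration.

```python
import csv
import io
import json


def parse_path_rows(tsv_text):
    """Turn tab-separated (JSON path, label) rows into sample dicts."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [
        {"indices": json.loads(path), "label": int(label)}
        for path, label in reader
    ]


# Two made-up rows: a path of article indices, then the target index.
sample_tsv = "[3, 17, 42, 5]\t8\n[1, 2, 5, 5]\t9\n"
rows = parse_path_rows(sample_tsv)
print(rows[0])
```

Because the default `DataLoader` collation stacks the per-sample tensors, every path in a batch needs the same length, so the stored paths are presumably already padded to a fixed length.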

This is the class definition for the baseline model, an LSTM. Run this cell to be able to train the baseline model.

In [None]:
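class Baseline(torch.nn.Module):
    def __init__(self, graph, device, node_embed_size=64, lstm_hidden_size=32):
        super().__init__()
        self.graphX = graph.x.to(device)
        self.graphEdgeIndex = graph.edge_index.to(device)
        # raw node features are fed to the LSTM directly (node_embed_size is unused here)
        self.lstm_input_size = self.graphX.shape[1]
        self.lstm = nn.LSTM(input_size=self.lstm_input_size,
                            hidden_size=lstm_hidden_size,
                            batch_first=True)
        # one output logit per article (node) in the graph
        self.pred_head = nn.Linear(lstm_hidden_size, self.graphX.shape[0])

    def forward(self, indices):
        node_emb = self.graphX
        # append one all-zero row so that the padding index selects a zero vector;
        # use the tensor's own device instead of relying on a global `device`
        padding = torch.zeros((1, self.lstm_input_size), device=node_emb.device)
        node_emb_with_padding = torch.cat([node_emb, padding])
        paths = node_emb_with_padding[indices]
        _, (h_n, _) = self.lstm(paths)
        # h_n has shape (1, batch, hidden); drop only the layer dimension
        predictions = self.pred_head(h_n.squeeze(0))
        return F.log_softmax(predictions, dim=1)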
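
The padding trick in `forward` can be sketched without PyTorch: one all-zero row is appended to the feature table so that padded positions in a path look up a zero vector. This sketch assumes the padding index equals the number of nodes, which the source does not state; index `-1` would select the same row. The function name and feature values are illustrative.

```python
def lookup_with_padding(node_features, paths):
    """Replace each index in each path by its feature row, with an
    extra all-zero row available at index len(node_features)."""
    dim = len(node_features[0])
    table = node_features + [[0.0] * dim]  # zero row appended at the end
    return [[table[i] for i in path] for path in paths]


features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # three nodes, two features each
pad = len(features)  # assumed padding index
paths = [[0, 2, pad, pad], [1, 0, 2, pad]]
embedded = lookup_with_padding(features, paths)
print(embedded[0])
```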

This is the class definition for the Graph Neural Network-based model. GraphSAGE is used here because it performed best. To use GCN or GAT instead, replace `self.gnn = GraphSAGE(...)` with `self.gnn = GCN(...)` or `self.gnn = GAT(...)`, respectively. The arguments are the same for all three models.

This cell also defines the model weights file. This file will be generated during training, storing the weights for the best model based on validation accuracy during training.

In [None]:
MODEL_WEIGHT_PATH = "model_weights.pth"

class Model(torch.nn.Module):
    def __init__(self, graph, device, sequence_path_length=32, gnn_hidden_size=128, node_embed_size=64, lstm_hidden_size=32):
        super().__init__()
        self.graphX = graph.x.to(device)
        self.graphEdgeIndex = graph.edge_index.to(device)
        self.gnn = GraphSAGE(in_channels=self.graphX.shape[1],
                       hidden_channels=gnn_hidden_size,
                       num_layers=3,
                       out_channels=node_embed_size,
                       dropout=0.1)
        # paths have shape (batch, sequence_path_length, node_embed_size), so this
        # normalizes each path position over the batch and embedding dimensions
        self.batch_norm_lstm = nn.BatchNorm1d(sequence_path_length)
        # normalizes the final LSTM hidden state before the prediction head
        self.batch_norm_linear = nn.BatchNorm1d(lstm_hidden_size)
        self.lstm_input_size = node_embed_size
        self.lstm = nn.LSTM(input_size=self.lstm_input_size,
                            hidden_size=lstm_hidden_size,
                            batch_first=True)
        # one output logit per article (node) in the graph
        self.pred_head = nn.Linear(lstm_hidden_size, self.graphX.shape[0])

    def forward(self, indices):
        # recompute node embeddings with the GNN on every forward pass
        node_emb = self.gnn(self.graphX, self.graphEdgeIndex)
        # append one all-zero row so that the padding index selects a zero vector;
        # use the tensor's own device instead of relying on a global `device`
        padding = torch.zeros((1, self.lstm_input_size), device=node_emb.device)
        node_emb_with_padding = torch.cat([node_emb, padding])
        paths = node_emb_with_padding[indices]
        paths = self.batch_norm_lstm(paths)
        _, (h_n, _) = self.lstm(paths)
        # h_n has shape (1, batch, hidden); drop only the layer dimension
        h_n = self.batch_norm_linear(h_n.squeeze(0))
        predictions = self.pred_head(h_n)
        return F.log_softmax(predictions, dim=1)
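
Because `nn.BatchNorm1d(sequence_path_length)` receives a tensor of shape `(batch, sequence, embedding)`, it treats each path position as a channel and pools statistics over the batch and embedding dimensions. The plain-Python sketch below mimics only that training-mode normalization step, leaving out the learned scale and shift and the running statistics. The function name and input values are illustrative assumptions.

```python
from statistics import fmean


def normalize_per_position(batch, eps=1e-5):
    """Standardize batch[b][t][f] separately for each position t,
    using statistics pooled over all paths b and features f."""
    seq_len = len(batch[0])
    out = [[None] * seq_len for _ in batch]
    for t in range(seq_len):
        values = [v for path in batch for v in path[t]]
        mean = fmean(values)
        var = fmean([(v - mean) ** 2 for v in values])  # biased variance
        scale = (var + eps) ** -0.5
        for b, path in enumerate(batch):
            out[b][t] = [(v - mean) * scale for v in path[t]]
    return out


# Two paths, two positions, two embedding features (made-up numbers).
batch = [
    [[1.0, 3.0], [10.0, 10.0]],
    [[5.0, 7.0], [30.0, 30.0]],
]
normalized = normalize_per_position(batch)
print(normalized[0][1])
```

After this step, every position has roughly zero mean and unit variance across the batch, regardless of how far along the path it sits.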

Here, we set up the `train / val / test` split as `90 / 5 / 5` and define the batch size. The remaining hyperparameters, the optimizer (Adam) and its learning rate, are set in the model setup cells further down.

In [None]:
# get the dataset + splits
dataset = CustomPathDataset(path_data)
train_size = int(0.9 * len(dataset))
test_size = int(0.05 * len(dataset))
val_size = len(dataset) - train_size - test_size  # validation set takes the remainder
train_dataset, val_dataset, test_dataset = torch.utils.data.random_split(dataset, [train_size, val_size, test_size])

# set up for training + validation
batch_size = 1024
trainloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=2)
validloader = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size, shuffle=True, num_workers=2)
testloader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=True, num_workers=2)
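
The split arithmetic above truncates the train and test sizes and gives whatever is left to the validation set, so the three sizes always add up to the dataset size. A small sketch with a hypothetical dataset size of 1003 shows the effect; the function name `split_sizes` is illustrative.

```python
def split_sizes(n, train_frac=0.9, test_frac=0.05):
    """Return (train, val, test) sizes using the same arithmetic as the cell above."""
    train = int(train_frac * n)
    test = int(test_frac * n)
    val = n - train - test  # remainder goes to validation
    return train, val, test


sizes = split_sizes(1003)  # 1003 is a made-up dataset size
print(sizes)
```

Computing the validation size as a remainder matters because `random_split` with integer lengths expects them to sum to the dataset length.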

## Baseline Model

In [None]:
# set up the model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Baseline(G, device).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

This is the training script. We train the model for 50 epochs and print the training loss, validation loss, validation accuracy, and time spent for each epoch.

Moreover, we train by running one batch through the model at a time and using the Negative Log Likelihood loss function. We also save the model weights for the best validation accuracy we see after an epoch. These weights will be used in the evaluation step.

In [None]:
best_acc = 0
training_losses = []
validation_losses = []
validation_accs = []
for epoch in range(50):  # loop over the dataset multiple times
    print('Epoch:', epoch+1)
    model.train()  # back to training mode (evaluation below switches to eval mode)
    epoch_loss = 0
    start_time = time.time()
    for i, data in enumerate(trainloader):
        # data is a dict with keys 'indices' (paths) and 'label' (targets)
        inputs = data['indices'].to(device)
        labels = data['label'].to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = model(inputs)
        loss = F.nll_loss(outputs, labels)  # mean NLL over the batch
        epoch_loss += loss.item()  # running sum of per-batch mean losses
        loss.backward()
        optimizer.step()

    # validate epoch and print results
    # (sum of batch means divided by sample count: a scaled loss, not a per-sample mean)
    training_losses.append(epoch_loss / train_size)
    print('Training Loss:', training_losses[-1])
    acc, valid_loss = get_evaluation_metrics(model, device, validloader, val_size)
    validation_losses.append(valid_loss)
    validation_accs.append(acc)
    if acc > best_acc:
        # keep the weights with the best validation accuracy seen so far
        torch.save(model.state_dict(), MODEL_WEIGHT_PATH)
        best_acc = acc
    print("Validation accuracy:", acc)
    print("Validation loss:", valid_loss)
    print('Time elapsed:', time.time() - start_time)
    print()
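
The checkpointing rule in the loop above saves weights only when validation accuracy strictly improves, so a later epoch that merely ties the best score does not overwrite the file. The sketch below replays that rule on a made-up list of validation accuracies; the function name is hypothetical.

```python
def best_checkpoint_epoch(validation_accs):
    """Return (epoch, accuracy) of the checkpoint the loop would keep.

    Epochs are numbered from 1, and an epoch replaces the saved weights
    only if its accuracy is strictly greater than the best so far.
    """
    best_acc = 0
    best_epoch = None
    for epoch, acc in enumerate(validation_accs, start=1):
        if acc > best_acc:
            best_acc = acc
            best_epoch = epoch
    return best_epoch, best_acc


kept = best_checkpoint_epoch([0.10, 0.25, 0.25, 0.20])  # illustrative values
print(kept)
```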

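This code runs evaluation on the test dataset. The `load_state_dict` line that restores the weights with the best validation accuracy is commented out, so as written the cell evaluates the weights from the final epoch; uncomment that line to evaluate the best checkpoint instead.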

This cell will print out the "loss" and accuracy on the testing dataset.

In [None]:
# uncomment to restore the weights with the best validation accuracy
# model.load_state_dict(torch.load(MODEL_WEIGHT_PATH))
model.eval()  # switch to evaluation mode
acc, test_loss = get_evaluation_metrics(model, device, testloader, test_size)
print("Test accuracy:", acc)
print("Test loss:", test_loss)

Test accuracy: 0.27262090483619345
Test loss: 0.006432513922871368


## Graph Neural Network

In [None]:
# set up the model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Model(G, device).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

This is the training script. We train the model for 50 epochs and print the training loss, validation loss, validation accuracy, and time spent for each epoch.

Moreover, we train by running one batch through the model at a time and using the Negative Log Likelihood loss function. We also save the model weights for the best validation accuracy we see after an epoch. These weights will be used in the evaluation step.

In [None]:
best_acc = 0
training_losses = []
validation_losses = []
validation_accs = []
for epoch in range(50):  # loop over the dataset multiple times
    print('Epoch:', epoch+1)
    model.train()  # back to training mode (evaluation below switches to eval mode)
    epoch_loss = 0
    start_time = time.time()
    for i, data in enumerate(trainloader):
        # data is a dict with keys 'indices' (paths) and 'label' (targets)
        inputs = data['indices'].to(device)
        labels = data['label'].to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = model(inputs)
        loss = F.nll_loss(outputs, labels)  # mean NLL over the batch
        epoch_loss += loss.item()  # running sum of per-batch mean losses
        loss.backward()
        optimizer.step()

    # validate epoch and print results
    # (sum of batch means divided by sample count: a scaled loss, not a per-sample mean)
    training_losses.append(epoch_loss / train_size)
    print('Training Loss:', training_losses[-1])
    acc, valid_loss = get_evaluation_metrics(model, device, validloader, val_size)
    validation_losses.append(valid_loss)
    validation_accs.append(acc)
    if acc > best_acc:
        # keep the weights with the best validation accuracy seen so far
        torch.save(model.state_dict(), MODEL_WEIGHT_PATH)
        best_acc = acc
    print("Validation accuracy:", acc)
    print("Validation loss:", valid_loss)
    print('Time elapsed:', time.time() - start_time)
    print()

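This code runs evaluation on the test dataset. The `load_state_dict` line that restores the weights with the best validation accuracy is commented out, so as written the cell evaluates the weights from the final epoch; uncomment that line to evaluate the best checkpoint instead.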

This cell will print out the "loss" and accuracy on the testing dataset.

In [None]:
# uncomment to restore the weights with the best validation accuracy
# model.load_state_dict(torch.load(MODEL_WEIGHT_PATH))
model.eval()  # switch to evaluation mode
acc, test_loss = get_evaluation_metrics(model, device, testloader, test_size)
print("Test accuracy:", acc)
print("Test loss:", test_loss)

Test accuracy: 0.3654446177847114
Test loss: 0.005760757115999362
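
To put the two test accuracies side by side, the short calculation below takes the values printed in this notebook and computes the absolute and relative gain of the GNN-based model over the LSTM baseline. Both figures come from a single run each, so the comparison is indicative rather than conclusive.

```python
baseline_acc = 0.27262090483619345  # test accuracy printed for the LSTM baseline
gnn_acc = 0.3654446177847114        # test accuracy printed for the GraphSAGE model

absolute_gain = gnn_acc - baseline_acc
relative_gain = absolute_gain / baseline_acc

print(f"absolute gain: {absolute_gain:.4f}")
print(f"relative gain: {relative_gain:.1%}")
```

In this run the gain is about 9 percentage points, or roughly a third over the baseline.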
