# Stochastic Training of GNN for Node Classification on Large Graphs

The final project deals with how to train a GNN model for node classification on the Amazon Co-purchase Network provided by OGB.

The dataset contains 2.4 million nodes and 61 million edges, so training on the full graph at once will not fit in the memory of a single GPU.

The purpose of this project is mainly to show you an end-to-end working pipeline for training on a large graph, which you could reuse in your own projects. This notebook introduces a **lot** of new concepts, so we encourage you to spend time understanding it first and to start by making minor changes. Time permitting, you can go ahead and try to implement new layer architectures.

By the end of this project you will have learned how to train a GNN model on a single machine with a single GPU, on a graph of any size.

In [7]:
!pip install dgl-cu101
!pip install torch==1.9.0

Successfully installed dgl-cu101-0.6.1
Successfully installed torch-1.9.0


## About the OGB Products Dataset

The [ogbn-products](https://ogb.stanford.edu/docs/nodeprop/#ogbn-products) dataset is an undirected and unweighted graph, representing an Amazon product co-purchasing network. Nodes represent products sold on Amazon, and an edge between two products indicates that the products are purchased together. Node features are generated by extracting bag-of-words features from the product descriptions, followed by a Principal Component Analysis that reduces the dimension to 100.

**Prediction task** : The task is to predict the category of a product in a multi-class classification setup, where the 47 top-level categories are used for target labels.

**Dataset splitting** : Instead of randomly assigning 90% of the nodes for training and 10% of the nodes for testing (without use of a validation set), the sales ranking (popularity) is used to split nodes into training/validation/test sets. Specifically, the products are sorted according to their sales ranking and the top 8% are used for training, next top 2% for validation, and the rest for testing. This is a more challenging splitting procedure that closely matches the real-world application where labels are first assigned to important nodes in the network and ML models are subsequently used to make predictions on less important ones.
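As a rough illustration of this splitting rule, the sketch below sorts a handful of made-up products by sales rank and slices them into 8% / 2% / 90%. The `sales_rank` array and the node count are hypothetical; the real split is precomputed and shipped with the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes = 100
# Hypothetical sales ranks: 1 is the best-selling product
sales_rank = rng.permutation(num_nodes) + 1

# Node IDs ordered from most to least popular
order = np.argsort(sales_rank)

# Integer arithmetic avoids floating-point rounding in the split sizes
n_train = num_nodes * 8 // 100
n_valid = num_nodes * 2 // 100
train_ids = order[:n_train]
valid_ids = order[n_train:n_train + n_valid]
test_ids = order[n_train + n_valid:]

print(len(train_ids), len(valid_ids), len(test_ids))
```

Every training node here is more popular than every validation node, which in turn is more popular than every test node. This ordering is what makes the split harder than a random one.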

# Section 1 : Load Dataset

Although you can directly use the Python package provided by OGB, for demonstration, we will instead manually download the dataset, peek into its contents, and load it with only `pandas`.

In [1]:
# Setup

import boto3
import os

course_ID = "MLA-GML"
bucket_name = "mlu-courses-datalake"


remote_dir_name = course_ID + "/data/final_project/"
local_dir_name = "dataset/"

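s3_resource = boto3.resource('s3')
bucket = s3_resource.Bucket(bucket_name)
for obj in bucket.objects.filter(Prefix=remote_dir_name):
    # Mirror the remote folder layout locally, so files land under MLA-GML/data/final_project/
    os.makedirs(os.path.dirname(obj.key), exist_ok=True)

    try:
        bucket.download_file(obj.key, obj.key)
    except Exception as err:
        # Some keys (e.g. folder placeholders) cannot be downloaded as files; skip them
        print('Skipping {}: {}'.format(obj.key, err))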

The original dataset contains a lot of files. We will use only the following, each stored gzip-compressed (`.csv.gz`):

* `edge.csv` (source-destination pairs)
* `node-feat.csv` (node features)
* `node-label.csv` (node labels)
* `train.csv` (node IDs in the training set)
* `valid.csv` (node IDs in the validation set)
* `test.csv` (node IDs in the test set)

In [1]:
import pandas as pd
edges = pd.read_csv('MLA-GML/data/final_project/edge.csv.gz', header=None).values
node_features = pd.read_csv('MLA-GML/data/final_project/node-feat.csv.gz', header=None).values
node_labels = pd.read_csv('MLA-GML/data/final_project/node-label.csv.gz', header=None).values[:, 0]

# pd.read_csv yields a DataFrame with one column, so we make them one-dimensional arrays.
train_nids = pd.read_csv('MLA-GML/data/final_project/train.csv.gz', header=None).values[:, 0]
valid_nids = pd.read_csv('MLA-GML/data/final_project/valid.csv.gz', header=None).values[:, 0]
test_nids = pd.read_csv('MLA-GML/data/final_project/test.csv.gz', header=None).values[:, 0]

In [2]:
test_nids.shape

(2213091,)

### Loading Node IDs into DGL

<div class="alert alert-info">
    <b>Note:</b> The node IDs should be consecutive integers from 0 to the number of nodes minus 1.  If your node ID is not consecutive or not starting from 0 (e.g., starting from 100000), you need to relabel them yourself.  The <code>astype</code> method in pandas DataFrame can conveniently relabel the IDs by converting the type to <code>"category"</code>.
</div>
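A minimal sketch of the relabeling mentioned in the note, assuming a small made-up edge list whose IDs start at 100000 and have gaps. The `raw` DataFrame and its column names are hypothetical. A single shared categorical type is used for both columns so that the same original ID maps to the same new ID on either side of an edge.

```python
import pandas as pd

# Hypothetical edge list with non-consecutive node IDs
raw = pd.DataFrame({"src": [100000, 100005, 100009],
                    "dst": [100005, 100009, 100000]})

# One shared categorical type, built from every ID that appears in either column
id_type = pd.CategoricalDtype(sorted(set(raw["src"]) | set(raw["dst"])))

# The category codes are consecutive integers starting from 0
relabeled = pd.DataFrame({
    "src": raw["src"].astype(id_type).cat.codes,
    "dst": raw["dst"].astype(id_type).cat.codes,
})
print(relabeled)
```

Converting each column to `"category"` separately would also produce consecutive codes, but the two columns could then disagree on which code belongs to which node.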

------
------

# Section 2 : Construct DGL Graph

We construct the graph as follows:

In [3]:
import dgl
import torch

Using backend: pytorch


In [4]:
graph = dgl.graph((edges[:, 0], edges[:, 1]))
graph = dgl.add_self_loop(graph)
node_features = torch.FloatTensor(node_features)
node_labels = torch.LongTensor(node_labels)

# Save the graph, features, labels, and the training/validation node IDs for later use.
# Note that test_nids is not saved here; it stays in memory for the rest of this notebook.
import pickle
with open('data.pkl', 'wb') as f:
    pickle.dump((graph, node_features, node_labels, train_nids, valid_nids), f)

Using backend: pytorch


In [4]:
# Load the graph back from the file we saved
import dgl
import torch
import numpy as np
import pickle
with open('data.pkl', 'rb') as f:
#     graph, node_features, node_labels, train_nids, valid_nids, test_nids = pickle.load(f)
    graph, node_features, node_labels, train_nids, valid_nids = pickle.load(f)

In [5]:
graph

Graph(num_nodes=2449029, num_edges=64308169,
      ndata_schemes={}
      edata_schemes={})

We can see the size of the graph, features, and labels as follows.

In [6]:
print('Graph')
print(graph)
print('Shape of node features:', node_features.shape)
print('Shape of node labels:', node_labels.shape)

num_features = node_features.shape[1]
num_classes = (node_labels.max() + 1).item()
print('Number of classes:', num_classes)

Graph
Graph(num_nodes=2449029, num_edges=64308169,
      ndata_schemes={}
      edata_schemes={})
Shape of node features: torch.Size([2449029, 100])
Shape of node labels: torch.Size([2449029])
Number of classes: 47


----
----

# Section 3 : Define a Data Loader with Neighbor Sampling

But first, a short review of message passing.

### Message passing overview

The formulation of message passing usually has the following form:

$$
\begin{gathered}
  \boldsymbol{a}_v^{(l)} = \rho^{(l)} \left(
    \left\lbrace
      \boldsymbol{h}_u^{(l-1)} : u \in \mathcal{N} \left( v \right)
    \right\rbrace
  \right)
\\
  \boldsymbol{h}_v^{(l)} = \phi^{(l)} \left(
    \boldsymbol{h}_v^{(l-1)}, \boldsymbol{a}_v^{(l)}
  \right)
\end{gathered}
$$

where $\rho^{(l)}$ and $\phi^{(l)}$ are parameterized functions, and $\mathcal{N}(v)$ represents the set of predecessors (or equivalently *neighbors*) of $v$ on graph $\mathcal{G}$, with $\mathbb{E}$ the edge set and $s(e)$ and $t(e)$ the source and destination nodes of edge $e$:
$$
\mathcal{N} \left( v \right) = \left\lbrace
  s \left( e \right) : e \in \mathbb{E}, t \left( e \right) = v
\right\rbrace
$$



For instance, to perform message passing that updates the red node in the following graph:

![Imgur](./MLA-GML/data/final_project/1.png)

You need to aggregate the node features of its neighbors, shown as green nodes:

![Imgur](./MLA-GML/data/final_project/2.png)
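One message passing step can be sketched in a few lines of numpy. This is only an illustration under stated assumptions: $\rho$ is taken to be the mean of the neighbors' features, and $\phi$ a ReLU applied to a linear map of the concatenation of $\boldsymbol{h}_v^{(l-1)}$ and $\boldsymbol{a}_v^{(l)}$. The toy edge list, the dimensions, and the random weights are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy directed edges (source, destination); node 0 receives from nodes 1, 2, 3
edges = np.array([[1, 0], [2, 0], [3, 0], [0, 1]])
num_nodes, in_dim, out_dim = 4, 5, 3
h_prev = rng.normal(size=(num_nodes, in_dim))   # h^{(l-1)} for every node
W = rng.normal(size=(2 * in_dim, out_dim))      # weights of phi (assumed linear)

def message_passing_step(v):
    # N(v): sources of all edges whose destination is v
    neighbors = edges[edges[:, 1] == v, 0]
    # rho: mean of the neighbors' previous-layer features
    a_v = h_prev[neighbors].mean(axis=0)
    # phi: combine the node's own feature with the aggregate
    return np.maximum(np.concatenate([h_prev[v], a_v]) @ W, 0.0)

h0 = message_passing_step(0)
print(h0.shape)
```

Only the rows of `h_prev` that belong to node 0 and its three predecessors are read, which is the dependency the figures above illustrate.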

Let's consider how multi-layer message passing works for computing the output of a single node.  In the following text, we refer to the nodes whose outputs the GNN will compute as seed nodes.


### Multi-layer message passing 

Consider computing with a 2-layer GNN the output of the seed node 8, colored red, in the following graph:

![Imgur](./MLA-GML/data/final_project/seed.png)

By the formulation:

$$
\begin{gathered}
  \boldsymbol{a}_8^{(2)} = \rho^{(2)} \left(
    \left\lbrace
      \boldsymbol{h}_u^{(1)} : u \in \mathcal{N} \left( 8 \right)
    \right\rbrace
  \right) = \rho^{(2)} \left(
    \left\lbrace
      \boldsymbol{h}_4^{(1)}, \boldsymbol{h}_5^{(1)},
      \boldsymbol{h}_7^{(1)}, \boldsymbol{h}_{11}^{(1)}
    \right\rbrace
  \right)
\\
  \boldsymbol{h}_8^{(2)} = \phi^{(2)} \left(
    \boldsymbol{h}_8^{(1)}, \boldsymbol{a}_8^{(2)}
  \right)
\end{gathered}
$$

We can tell that, to compute $\boldsymbol{h}_8^{(2)}$, we need messages from nodes 4, 5, 7, and 11 (colored green) along the edges visualized below.

![Imgur](./MLA-GML/data/final_project/3.png)

The values of $\boldsymbol{h}_\cdot^{(1)}$ are the outputs from the first GNN layer.

To compute those values for the red and green nodes, we further need to perform message passing on the edges visualized below.

![Imgur](./MLA-GML/data/final_project/4.png)

Therefore, to compute the 2-layer GNN representation of the red node, we need the input features of the red node as well as of the green and yellow nodes.  Note that the red node's own neighbors are needed again at this layer, because $\boldsymbol{h}_8^{(1)}$ is itself an input to the second layer.

You may notice that the procedure which determines computation dependency is in the reverse direction of message aggregation: you start from the layer closest to the output and work backward to the input.
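The backward walk described above can be sketched in plain Python. The neighbor list of node 8 matches the text; every other entry of `in_neighbors` is made up for illustration and does not reproduce the figures.

```python
# In-neighbor lists for a toy graph. N(8) = {4, 5, 7, 11} matches the text;
# every other entry is a made-up assumption.
in_neighbors = {
    8: [4, 5, 7, 11],
    4: [1, 5, 8],
    5: [2, 4, 8],
    7: [6, 8, 10],
    11: [8, 10, 12],
}

def required_nodes(seeds, num_layers):
    """Walk backward from the output layer, collecting the nodes each layer needs."""
    frontier = set(seeds)
    per_layer = [frontier]
    for _ in range(num_layers):
        needed = set(frontier)  # a node's own previous-layer state is needed too
        for v in frontier:
            needed.update(in_neighbors.get(v, []))
        per_layer.append(needed)
        frontier = needed
    return per_layer

layers = required_nodes([8], num_layers=2)
for depth, nodes in enumerate(layers):
    print(f"{depth} layer(s) below the output: {sorted(nodes)}")
```

Even in this tiny graph, a single seed node pulls in 5 nodes after one step and 10 after two.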

In summary:
* Computing the representation of even a small number of nodes often requires the input features of a significantly larger number of nodes.

* Taking all neighbors for message aggregation is often too costly, since the nodes needed can easily cover a large portion of the graph.

### Neighbor Sampling Overview

Neighbor sampling addresses this issue by selecting a random subset of the neighbors to perform aggregation.

For example, to compute $\boldsymbol{h}_8^{(2)}$, we can choose to sample 2 of node 8's neighbors and aggregate only their messages.

![Imgur](./MLA-GML/data/final_project/5.png)

Similarly, to compute the red and green nodes' first layer representation, we can also do neighbor sampling that takes 2 neighbors for each node.  Note that we should take the red node's neighbors again for this layer.

![Imgur](./MLA-GML/data/final_project/6.png)

You can see that this method requires input features from fewer nodes.
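A small pure-Python sketch of the idea, assuming uniform sampling without replacement and a toy graph in which only the neighbor list of node 8 comes from the text. The `fanouts` list is read from the output layer backward.

```python
import random

random.seed(0)

# Toy in-neighbor lists; N(8) matches the text, the rest is made up
in_neighbors = {
    8: [4, 5, 7, 11],
    4: [1, 5, 8],
    5: [2, 4, 8],
    7: [6, 8, 10],
    11: [8, 10, 12],
}

def sample_dependencies(seeds, fanouts):
    """Collect the input nodes needed when each node keeps at most `fanout` neighbors per layer."""
    frontier = set(seeds)
    for fanout in fanouts:
        needed = set(frontier)  # every node also needs its own previous state
        for v in sorted(frontier):
            candidates = in_neighbors.get(v, [])
            k = min(fanout, len(candidates))
            needed.update(random.sample(candidates, k))
        frontier = needed
    return frontier

sampled_inputs = sample_dependencies([8], fanouts=[2, 2])
# A fanout at least as large as every degree keeps all neighbors
full_inputs = sample_dependencies([8], fanouts=[10, 10])
print(len(sampled_inputs), "input nodes with sampling,", len(full_inputs), "without")
```

With a fanout of 2, the seed needs at most 3 nodes after one step and at most 9 after two, regardless of how large the true neighborhoods are.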

### Other graph sampling strategies
* Neighborhood sampling (GraphSAGE)
* Control-variate-based sampling (VRGCN)
* Layer-wise sampling (FastGCN, LADIES)
* Random-walk-based sampling (PinSage)
* Subgraph sampling (ClusterGCN, GraphSAINT)


### Defining neighbor sampler and node data loader in DGL



DGL provides useful tools to generate such computation dependencies while iterating over the dataset in minibatches and performing neighbor sampling.

For node classification, you can use
* [`dgl.dataloading.NodeDataLoader`](https://docs.dgl.ai/en/0.6.x/api/python/dgl.dataloading.html#dgl.dataloading.pytorch.NodeDataLoader) for iterating over the dataset, and
* [`dgl.dataloading.MultiLayerNeighborSampler`](https://docs.dgl.ai/en/0.6.x/api/python/dgl.dataloading.html#dgl.dataloading.neighbor.MultiLayerNeighborSampler) to generate computation dependencies of the nodes with neighbor sampling.

The syntax of `dgl.dataloading.NodeDataLoader` is mostly similar to a PyTorch `DataLoader`, with the addition that it needs a graph to generate computation dependency from, a set of node IDs to iterate on, and the neighbor sampler you defined.

Let's consider training a 3-layer GNN with neighbor sampling, where each node gathers messages from 4 neighbors at each layer.  The code defining the data loader and neighbor sampler looks like the following.

In [7]:
sampler = dgl.dataloading.MultiLayerNeighborSampler([4, 4, 4])
train_dataloader = dgl.dataloading.NodeDataLoader(
    graph, train_nids, sampler,
    batch_size=1024,
    shuffle=True,
    drop_last=False,
    num_workers=0
)

We can peek at the first item in the data loader we created and see what it gives us.

In [8]:
example_minibatch = next(iter(train_dataloader))
print(example_minibatch)

[tensor([158124, 162988,  12835,  ...,  45071,  39915,  76617]), tensor([158124, 162988,  12835,  ..., 181952,  52808, 138705]), [Block(num_src_nodes=33317, num_dst_nodes=14939, num_edges=52863), Block(num_src_nodes=14939, num_dst_nodes=4411, num_edges=16293), Block(num_src_nodes=4411, num_dst_nodes=1024, num_edges=3882)]]


`NodeDataLoader` gives us three items per iteration.

* The input node list: the nodes whose input features are needed to compute the outputs.
* The output node list: the nodes whose GNN representations are to be computed.
* The list of computation dependencies, one per layer.

In [9]:
input_nodes, output_nodes, bipartites = example_minibatch
print("To compute {} nodes' output we need {} nodes' input features".format(len(output_nodes), len(input_nodes)))

To compute 1024 nodes' output we need 33317 nodes' input features


The variable `bipartites` has the message passing computation dependency for each layer.

It is named suggestively, because it can be thought of as a **list** of bipartite graphs.

So why does DGL return a list of *bipartite* graphs when training on a *homogeneous* graph?

The reason is to distinguish, at each layer, between the source nodes that send messages and the destination nodes that are updated.

Recall the sampled sub-graph from the example above:

![Imgur](./MLA-GML/data/final_project/6.png)


The first GNN layer outputs the representation of three nodes (two green nodes and one red node), but requires input from 7 nodes (the green nodes and red node, plus 4 yellow nodes).  

A bipartite graph easily captures this computation dependency:

![](./MLA-GML/data/final_project/bipartite.png)

Let's look at each *bipartite* graph in `bipartites`:

In [10]:
for block in bipartites:
    print(block)

Block(num_src_nodes=33317, num_dst_nodes=14939, num_edges=52863)
Block(num_src_nodes=14939, num_dst_nodes=4411, num_edges=16293)
Block(num_src_nodes=4411, num_dst_nodes=1024, num_edges=3882)


These are the bipartite graphs used at each layer of the graph unrolling, ordered from the input layer to the output layer.

Minibatch training of GNNs usually involves message passing on such bipartite graphs.
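As a sanity check on the printed block sizes, the sketch below compares them with the upper bounds implied by a fanout of 4. It assumes, as DGL's blocks do by default, that the source nodes of a block include its destination nodes. The numbers are copied from the output above.

```python
fanout = 4
# (num_src_nodes, num_dst_nodes, num_edges) copied from the printed blocks,
# ordered from the input layer to the output layer
blocks = [(33317, 14939, 52863), (14939, 4411, 16293), (4411, 1024, 3882)]

for num_src, num_dst, num_edges in blocks:
    max_edges = num_dst * fanout      # each destination samples at most `fanout` neighbors
    max_src = num_dst * (fanout + 1)  # sampled neighbors plus the destinations themselves
    print(f"dst={num_dst:6d}  edges={num_edges:6d} (max {max_edges:6d})  "
          f"src={num_src:6d} (max {max_src:6d})")
```

The observed counts stay below the bounds because some nodes have fewer than 4 neighbors and because many sampled neighbors are shared between destination nodes.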

---
---

# Section 4 : Main Project Task - Defining Model Architecture

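The example below is a graph attention network (GAT) built from DGL's `GATConv` layers. You can swap in other layer types, such as the GraphSAGE layers introduced previously.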

An example model can be written as follows:

In [11]:
import torch.nn as nn
import torch.nn.functional as F
import dgl.nn as dglnn

# TODO : Try out different layer implementations here. For example, you could replace GATConv with
#        another layer type, try different aggregation strategies, or change the number of layers.
#        If you change the number of heads, note that the value 3 is hard-coded both in this model
#        and in the inference function further down.
#        An advanced implementation could also include a custom layer like the one we wrote in the
#        Day 3 notebook. It is advisable to keep the rest of the code structure the same.

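class GAT(nn.Module):
    def __init__(self, in_feats, n_hidden, n_classes, n_layers):
        super().__init__()
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.n_classes = n_classes

        self.layers = nn.ModuleList()

        for i in range(n_layers):
            # Each GATConv has 3 attention heads with n_hidden output features per head
            self.layers.append(dglnn.GATConv(in_feats, n_hidden, num_heads=3, attn_drop=0.05, activation=F.relu))
            # The heads are concatenated in forward(), so the next layer sees n_hidden * 3 features
            in_feats = n_hidden * 3

        # Final linear layer maps the concatenated heads to class scores
        self.classify = nn.Linear(n_hidden*3, n_classes)

    def forward(self, bipartites, x):
        # Apply one GNN layer per bipartite graph, from the input side to the output side
        for layer, bipartite in zip(self.layers, bipartites):
            # GATConv returns a tensor of shape (num_dst_nodes, num_heads, n_hidden)
            x = layer(bipartite, x)
            # Concatenate the heads: (num_dst_nodes, num_heads * n_hidden)
            x = x.flatten(1)
        x = self.classify(x)
        return x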

Note that `forward` iterates over pairs of a GNN layer and the matching bipartite graph generated by the data loader.
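The `x.flatten(1)` call in `forward` concatenates the outputs of the attention heads. The shape bookkeeping can be checked with a numpy stand-in, where `reshape` plays the role of `flatten(1)`. The sizes are the ones used in this notebook: a batch of 1024 output nodes, 3 heads, and 128 hidden units per head.

```python
import numpy as np

num_dst_nodes, num_heads, n_hidden = 1024, 3, 128
# Stand-in for the output of one attention layer: one n_hidden vector per head per node
per_head = np.zeros((num_dst_nodes, num_heads, n_hidden))

# Equivalent of x.flatten(1): merge every dimension after the first
flat = per_head.reshape(num_dst_nodes, -1)
print(flat.shape)  # (1024, 384)
```

This is why every layer after the first, as well as the final classifier, takes `n_hidden * 3` input features.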

---
---

# Section 5 : Defining the Training Loop


The following initializes the model and defines the optimizer.

In [12]:
# Note the .cuda() which will send the model to the GPU
model = GAT(num_features, 128, num_classes, 3).cuda()


opt = torch.optim.Adam(model.parameters())

In [13]:
model

GAT(
  (layers): ModuleList(
    (0): GATConv(
      (fc): Linear(in_features=100, out_features=384, bias=False)
      (feat_drop): Dropout(p=0.0, inplace=False)
      (attn_drop): Dropout(p=0.05, inplace=False)
      (leaky_relu): LeakyReLU(negative_slope=0.2)
    )
    (1): GATConv(
      (fc): Linear(in_features=384, out_features=384, bias=False)
      (feat_drop): Dropout(p=0.0, inplace=False)
      (attn_drop): Dropout(p=0.05, inplace=False)
      (leaky_relu): LeakyReLU(negative_slope=0.2)
    )
    (2): GATConv(
      (fc): Linear(in_features=384, out_features=384, bias=False)
      (feat_drop): Dropout(p=0.0, inplace=False)
      (attn_drop): Dropout(p=0.05, inplace=False)
      (leaky_relu): LeakyReLU(negative_slope=0.2)
    )
  )
  (classify): Linear(in_features=384, out_features=47, bias=True)
)

### Dataloader for validation

When computing the validation score for model selection, usually you can also do neighbor sampling.  To do that, you need to define another data loader.

In [14]:
valid_dataloader = dgl.dataloading.NodeDataLoader(
    graph, valid_nids, sampler,
    batch_size=1024,
    shuffle=False,
    drop_last=False,
    num_workers=0
)

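The following is a training loop that performs validation every epoch.  It also saves the model with the best validation accuracy to a file.  The loss and accuracy shown in the training progress bar are those of the most recent minibatch only, so the values left on screen at the end of an epoch come from the last (possibly very small) minibatch and can be noisy.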

In [15]:
import tqdm
import sklearn.metrics

best_accuracy = 0
best_model_path = 'model.pt'

In [16]:

for epoch in range(10):
    # set the model to train mode as required by PyTorch
    model.train()
    
    # tqdm simply draws progress bars
    with tqdm.tqdm(train_dataloader) as tq:
        
        # Iterate over the training data (nodes)
        for step, (input_nodes, output_nodes, bipartites) in enumerate(tq):
            
            # Send each of the layers of the graph structures to be used, to the GPU
            bipartites = [b.to(torch.device('cuda')) for b in bipartites]
            
            # Send the node features & labels to the GPU
            inputs = node_features[input_nodes].cuda()
            labels = node_labels[output_nodes].cuda()
            
            # Make predictions with the current model parameters
            predictions = model(bipartites, inputs)

            # Calculate the classification loss between the predictions and labels
            loss = F.cross_entropy(predictions, labels)

            # Standard PyTorch update: clear old gradients, backpropagate, take an optimizer step
            opt.zero_grad()
            loss.backward()
            opt.step()

            # Calculate the accuracy of the predicted labels w.r.t. the ground truth labels
            accuracy = sklearn.metrics.accuracy_score(labels.cpu().numpy(), predictions.argmax(1).detach().cpu().numpy())

            # Show the loss and accuracy of this minibatch in the progress bar
            tq.set_postfix({'loss': '%.03f' % loss.item(), 'acc': '%.03f' % accuracy}, refresh=False)
    
    # Switch to eval mode for checking the validation set
    model.eval()
    
    predictions = []
    labels = []
    
    # Same loop as above, albeit a bit compressed
    with tqdm.tqdm(valid_dataloader) as tq, torch.no_grad():
        for input_nodes, output_nodes, bipartites in tq:
            bipartites = [b.to(torch.device('cuda')) for b in bipartites]
            inputs = node_features[input_nodes].cuda()
            labels.append(node_labels[output_nodes].numpy())
            predictions.append(model(bipartites, inputs).argmax(1).cpu().numpy())
            
        predictions = np.concatenate(predictions)
        labels = np.concatenate(labels)
        accuracy = sklearn.metrics.accuracy_score(labels, predictions)
        
        print('Epoch {} Validation Accuracy {}'.format(epoch, accuracy))
        if best_accuracy < accuracy:
            best_accuracy = accuracy
            torch.save(model.state_dict(), best_model_path)

100%|██████████| 193/193 [00:15<00:00, 12.47it/s, loss=1.104, acc=0.857]
100%|██████████| 39/39 [00:03<00:00, 12.59it/s]
Epoch 0 Validation Accuracy 0.8413396739821478

100%|██████████| 193/193 [00:15<00:00, 12.63it/s, loss=0.616, acc=0.714]
100%|██████████| 39/39 [00:02<00:00, 13.26it/s]
Epoch 1 Validation Accuracy 0.8445693360120031

100%|██████████| 193/193 [00:15<00:00, 12.39it/s, loss=0.763, acc=0.857]
100%|██████████| 39/39 [00:03<00:00, 12.60it/s]
Epoch 2 Validation Accuracy 0.8554026905373445

100%|██████████| 193/193 [00:15<00:00, 12.32it/s, loss=0.362, acc=0.857]
100%|██████████| 39/39 [00:02<00:00, 13.06it/s]
Epoch 3 Validation Accuracy 0.8534445489916842

100%|██████████| 193/193 [00:15<00:00, 12.41it/s, loss=1.160, acc=0.714]
100%|██████████| 39/39 [00:03<00:00, 12.68it/s]
Epoch 4 Validation Accuracy 0.8407293441497342

100%|██████████| 193/193 [00:15<00:00, 12.38it/s, loss=0.476, acc=0.857]
100%|██████████| 39/39 [00:03<00:00, 12.43it/s]
Epoch 5 Validation Accuracy 0.8604887724741246

100%|██████████| 193/193 [00:15<00:00, 12.60it/s, loss=0.044, acc=1.000]
100%|██████████| 39/39 [00:03<00:00, 12.72it/s]
Epoch 6 Validation Accuracy 0.8690333901279149

100%|██████████| 193/193 [00:15<00:00, 12.18it/s, loss=0.130, acc=1.000]
100%|██████████| 39/39 [00:02<00:00, 13.36it/s]
Epoch 7 Validation Accuracy 0.8673549830887776

100%|██████████| 193/193 [00:15<00:00, 12.55it/s, loss=0.427, acc=0.857]
100%|██████████| 39/39 [00:03<00:00, 12.71it/s]
Epoch 8 Validation Accuracy 0.86521882867533

100%|██████████| 193/193 [00:15<00:00, 12.43it/s, loss=0.063, acc=1.000]
100%|██████████| 39/39 [00:03<00:00, 12.44it/s]
Epoch 9 Validation Accuracy 0.871067822902627




---
---

# Section 6 : Generate a submission


## Offline Inference without Neighbor Sampling

Usually for offline inference it is desirable to aggregate over the entire neighborhood to eliminate the randomness introduced by neighbor sampling.  However, reusing the training methodology for this is not efficient, because it involves a lot of redundant computation.  Moreover, simply running the neighbor sampler with all neighbors taken will often exhaust GPU memory, because the number of nodes whose input features are required may be too large.

Instead, you need to compute the representations layer by layer: you first compute the output of the first GNN layer for all nodes, then you compute the output of second GNN layer for all nodes using the first GNN layer's output as input, etc.  

This gives us a different algorithm from what is being used in training.  During training we have an outer loop that iterates over the nodes, and an inner loop that iterates over the layers.  In contrast, during inference we have an outer loop that iterates over the layers, and an inner loop that iterates over the nodes.
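The loop order for inference can be sketched with numpy on a toy graph. This is an illustration under assumptions, not the notebook's model: a dense adjacency matrix with self-loops stands in for the graph, and each "layer" is a mean aggregation over all neighbors followed by a linear map and a ReLU. All sizes and weights are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, dims, batch_size = 10, [4, 6, 3], 4

# Toy dense adjacency with self-loops; row v marks the in-neighbors of node v
adj = (rng.random((num_nodes, num_nodes)) < 0.3).astype(float)
np.fill_diagonal(adj, 1.0)
weights = [rng.normal(size=(dims[i], dims[i + 1])) for i in range(len(dims) - 1)]

def layer(rows, h, W):
    """Mean-aggregate ALL neighbors of the given rows, then apply a linear map and ReLU."""
    a = adj[rows] @ h / adj[rows].sum(axis=1, keepdims=True)
    return np.maximum(a @ W, 0.0)

h = rng.normal(size=(num_nodes, dims[0]))

# Reference: every layer applied to the whole graph at once
full = h
for W in weights:
    full = layer(np.arange(num_nodes), full, W)

# Layer-wise inference: outer loop over layers, inner loop over node batches
x = h
for W in weights:
    out = np.zeros((num_nodes, W.shape[1]))  # buffer for this layer's output
    for start in range(0, num_nodes, batch_size):
        rows = np.arange(start, min(start + batch_size, num_nodes))
        out[rows] = layer(rows, x, W)
    x = out  # becomes the input of the next layer

print(np.allclose(x, full))
```

Each batch needs only a one-hop neighborhood at a time, and each node's intermediate representation is computed exactly once per layer, which avoids the redundant work mentioned above.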

If you do not care about randomness too much (e.g., during model selection in validation), you can still use the `dgl.dataloading.MultiLayerNeighborSampler` and `dgl.dataloading.NodeDataLoader` to do offline inference, since it is usually faster for evaluating a small number of nodes.

![Animation](./MLA-GML/data/final_project/anim.gif)

In [19]:
def inference(model, graph, input_features, batch_size):
    nodes = torch.arange(graph.number_of_nodes())
    
    sampler = dgl.dataloading.MultiLayerNeighborSampler([None])  # one layer at a time, taking all neighbors
    dataloader = dgl.dataloading.NodeDataLoader(
        graph, nodes, sampler,
        batch_size=batch_size,
        shuffle=False,
        drop_last=False,
        num_workers=0
    )
    
    # Make sure the model is in eval mode
    model.eval()
    
    with torch.no_grad():
        for l, layer in enumerate(model.layers):
            # Allocate a buffer of output representations for every node
            # Note that the buffer is on CPU memory.
            # This buffer will maintain the node representations at each level of the graph unrolling
            output_features = torch.zeros(
                graph.number_of_nodes(), model.n_hidden * 3
            )

            for input_nodes, output_nodes, bipartites in tqdm.tqdm(dataloader):
                # bipartites has a single element here: one graph unroll step
                # with ALL neighbors of the nodes in this batch
                bipartite = bipartites[0].to(torch.device('cuda'))

                x = input_features[input_nodes].cuda()

                # the following code is identical to the loop body in model.forward()
                x = layer(bipartite, x)
                x = x.flatten(1)

                output_features[output_nodes] = x.cpu()
            input_features = output_features
          
        # Apply the final linear classifier in batches, again buffering the results on CPU
        output_features = torch.zeros(graph.number_of_nodes(), model.n_classes)
        for start in range(0, graph.number_of_nodes(), batch_size):
            x = input_features[start:start + batch_size].to(torch.device('cuda'))
            output_features[start:start + batch_size] = model.classify(x).cpu()
            
    return output_features

The following code loads the best model from the file saved previously and performs offline inference.  It computes the accuracy on the test set afterwards.

In [18]:
model.load_state_dict(torch.load(best_model_path))
all_predictions = inference(model, graph, node_features, 8192)

test_predictions = all_predictions[test_nids].argmax(1)
sklearn.metrics.accuracy_score(node_labels[test_nids], test_predictions)

100%|██████████| 299/299 [01:18<00:00,  3.82it/s]
100%|██████████| 299/299 [01:08<00:00,  4.39it/s]
100%|██████████| 299/299 [01:06<00:00,  4.48it/s]


0.7570822889795313

In [20]:
test_predictions = all_predictions[test_nids[:663927]].argmax(1)
sklearn.metrics.accuracy_score(node_labels[test_nids[:663927]], test_predictions)

0.7451376431445024

In [22]:
import csv

def write_to_file(filename, ids, predictions):
    # newline='' lets the csv module control line endings (avoids blank rows on Windows)
    with open(filename, 'w', newline='') as out:
        csv_out = csv.writer(out)
        csv_out.writerow(['ID','Label'])
        for index, label in zip(ids, predictions):
            csv_out.writerow((index, label))

test_predictions = all_predictions[test_nids].argmax(1)
write_to_file('submission.csv', range(len(test_predictions)), test_predictions.numpy())

## Conclusion

In this tutorial, you have learned how to train a multi-layer GNN with neighbor sampling on a large dataset that cannot fit into GPU memory at once.  The method you have learned can scale to a graph of any size, and works on a single machine with a single GPU.

Running the last cell above writes your predictions to a file called `submission.csv` in your current folder. A sample submission file has also been included so that you can make a test submission first and then improve upon it with more experimentation. Download your file to your local machine and upload it to https://leaderboard.corp.amazon.com/tasks/703/submit

A single submission is enough to qualify you for completion.

Experiment with various GNN models and see if you can beat the performance of the simple network defined above.


**NOTE** : Since the test data is public, you could, in theory, just make a submission using the ground truth labels and get a 1.0 on the leaderboard. Doing so will not help you learn anything, and although it qualifies as a completion, it will likely disqualify you automatically from being considered a top submission.