Copyright (c) 2023 Graphcore Ltd. All rights reserved.

# Sampling large graphs on IPUs using PyTorch Geometric



[![Run on Gradient](../../gradient-badge.svg)](https://console.paperspace.com/github/<runtime-repo>?machine=Free-IPU-POD4&container=<dockerhub-image>&file=<path-to-file-in-repo>)  [![Join our Slack Community](https://img.shields.io/badge/Slack-Join%20Graphcore's%20Community-blue?style=flat-square&logo=slack)](https://www.graphcore.ai/join-community)

>### Link for the Run on Gradient button

> Once the notebook is available on Paperspace Gradient we like to have a "Run on Gradient" button. The link for the button needs to be configured. The example above shows the convention for how to form the link. 

> - The SVG image file for the button should be a local file in the repo (as shown in the example above). You can also use the [image file on Paperspace](https://assets.paperspace.io/img/gradient-badge.svg) but this is not reliable as GitHub's caching occasionally breaks.
> - `<runtime-repo>` represents the "organisation/repository-name" of the public repository that contains the notebook.
> - `<dockerhub-image>` is the name and tag of a public Docker Hub container.
> - `<path-to-file-in-repo>` is the location of the notebook inside the repo starting with a leading `/`.
>
> Note the part after the `?` in the link needs to be URL-encoded. You can use an online [URL encoder](https://www.urlencoder.org/) or you can use the [Paperspace link builder](https://docs.paperspace.com/gradient/notebooks/run-on-gradient/).
>
> Example of a fully-formed link for the "Run on Gradient" button:
> https://console.paperspace.com/github/gradient-ai/Graphcore-Pytorch?machine=Free-IPU-POD4&container=graphcore/pytorch-jupyter:3.1.0-ubuntu-20.04-20230104&file=/temporal-graph-networks/Train_TGN.ipynb

In the previous tutorials we focused on datasets comprising many small graphs. Some modern applications, however, need to operate on much larger graphs, with node counts in the range 10M-10B and edge counts in the range 100M-100B. Imagine building a recommendation system for a social network: the input graph can be made up of a huge number of users (nodes) and relationships (edges). (TODO: mention OGB benchmarks as well?)

There are two routes for approaching large graph problems:
- full batch training: this is the approach we used in [Tutorial 2](TODO add link) when working with a single, relatively small graph. The aim is to generate embeddings for all the nodes at the same time, which means keeping the entire graph and all the node embeddings in memory. As the size of the computational graph increases, the amount of memory required to hold the graph and the embeddings becomes challenging.
- mini-batching: alternatively, we can sample mini-batches from the graph, similarly to what we did in Tutorial 3 or 4 [TODO add links] where the dataset was a collection of many small graphs. When sampling from a larger graph, however, we need to take extra care that the sampled nodes do not end up isolated from each other. If they do, the mini-batches are no longer representative of the whole graph, which negatively impacts our machine learning task. We therefore need sampling methods that keep the message passing scheme effective on large graphs.
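To get a feel for why full batch training stops scaling, the back-of-the-envelope estimate below counts only the memory needed to hold one embedding per node per layer. The helper name, the hidden size of 256, the 3 layers and the 4-byte float32 values are illustrative assumptions, not measurements from this tutorial. The graph structure, gradients and optimiser state would come on top of this figure.

```python
def full_batch_embedding_bytes(num_nodes, hidden_dim, num_layers, bytes_per_value=4):
    """Rough lower bound on the memory needed to keep every node embedding
    of every layer resident at once (hypothetical helper, float32 by default)."""
    return num_nodes * hidden_dim * num_layers * bytes_per_value


# Illustrative sizes only: 10M nodes is the low end of the range quoted above.
estimate = full_batch_embedding_bytes(num_nodes=10_000_000, hidden_dim=256, num_layers=3)
print(f"{estimate / 1e9:.2f} GB")  # prints "30.72 GB"
```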

In this tutorial, we will demonstrate two approaches widely used in the literature to cope with increasing graph size by performing message passing over mini-batches. We will leverage the Graphcore IPU architecture, which is a very good fit for GNN applications [TODO link blogs], and our PyTorch Geometric (PyG) integration. You will learn how to:
- effectively cluster the nodes of your input graph, then train your GNN on IPUs to classify papers from the PubMed dataset
- sample neighbouring nodes of your input graph, then train your GNN on IPUs to (TODO link prediction task)

This notebook assumes some familiarity with PopTorch as well as PyTorch Geometric (PyG). For additional resources please consult:
- [PopTorch Documentation](https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/index.html),
- [PopTorch Examples and Tutorials](https://docs.graphcore.ai/en/latest/examples.html#pytorch),
- [PyTorch Geometric](https://pytorch-geometric.readthedocs.io/en/latest/),
- [PopTorch Geometric Documentation](https://docs.graphcore.ai/projects/poptorch-geometric-user-guide/en/latest/index.html).


In [None]:
# Make imported Python modules reload automatically when their files change.
# This needs to run before the first import.
%load_ext autoreload
%autoreload 2
# TODO: remove at the end of notebook development

## Environment setup

The best way to run this demo is on Paperspace Gradient's cloud IPUs because everything is already set up for you. To run the demo using other IPU hardware, you need to have the Poplar SDK enabled and the latest PopTorch Geometric wheel installed. Refer to the [getting started guide](https://docs.graphcore.ai/en/latest/getting-started.html#getting-started) for your system for details on how to enable the Poplar SDK and install the PopTorch wheels.

> You can install requirements directly from a notebook. You can:
>
> 1. Run commands by starting the line in a code cell with `!`, as shown in the first code block below. 
> 2. Install Python requirements with the `%pip` [magic command](https://ipython.readthedocs.io/en/stable/interactive/magics.html).
> 
> Use these methods to make it easier for your user to set up the environment they need.

In [None]:
%pip install -q -r requirements.txt

To make it easier for you to run this tutorial, we read in some configuration related to the environment you are running the notebook in.

In [None]:
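import os

# Number of IPUs and Pod type available in this environment
number_of_ipus = int(os.getenv("NUM_AVAILABLE_IPU", 16))
pod_type = os.getenv("GRAPHCORE_POD_TYPE", "pod16")
# Directory used to cache compiled Poplar executables between runs
executable_cache_dir = os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "/tmp/exe_cache/")
# These two have no default and are None if the variable is not set
dataset_directory = os.getenv("DATASETS_DIR")
checkpoint_directory = os.getenv("CHECKPOINT_DIR")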

> As the notebook writer, you only need to define a variable if you intend to use the value in the rest of the notebook.
> You can choose to use the default values of any variables if it is better for your development workflow.
> When you write the notebook, you need to use those variables that you have defined to configure the execution.
>
> Note on Poplar executables: We also set the standard PyTorch, PopART and TensorFlow environment variables, so
> if you do not customise the behaviour then you don't need to read them from the environment.
> For more information refer to [Writing a Paperspace notebook](https://graphcore.atlassian.net/wiki/spaces/PM/pages/3098345498/Writing+a+Paperspace+notebook).

## Clustering the computation graph for node classification

The idea is to split the entire graph into small subgraphs that individually fit in memory and on which we can calculate layer-wise embeddings, performing message passing on one subgraph at a time.
The subgraphs should also retain the connectivity of the original graph. To achieve that, we make sure that the small communities that make up the original graph are mirrored in the subgraphs.
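The toy experiment below illustrates why community-aware subgraphs matter. It builds a small synthetic graph of dense communities joined by single bridge edges, then compares how many edges survive in a subgraph made of uniformly sampled nodes against a subgraph made of one whole community. The graph, the helper names and the sizes are all made up for illustration and are not part of PyG or PopTorch Geometric.

```python
import random


def build_community_graph(num_communities=10, community_size=10):
    """Toy undirected graph: dense communities joined by one bridge edge each."""
    edges = set()
    for c in range(num_communities):
        start = c * community_size
        members = range(start, start + community_size)
        # Fully connect the nodes inside one community.
        edges.update((u, v) for u in members for v in members if u < v)
        # One bridge edge to the next community, if there is one.
        if c + 1 < num_communities:
            edges.add((start + community_size - 1, start + community_size))
    return edges


def induced_edge_count(edges, nodes):
    """Number of edges whose two endpoints both survive in the sampled node set."""
    nodes = set(nodes)
    return sum(1 for u, v in edges if u in nodes and v in nodes)


edges = build_community_graph()
rng = random.Random(0)
random_nodes = rng.sample(range(100), 10)  # 10 nodes picked uniformly at random
cluster_nodes = range(10)                  # the 10 nodes of one whole community

random_kept = induced_edge_count(edges, random_nodes)
cluster_kept = induced_edge_count(edges, cluster_nodes)
print(f"random sample keeps {random_kept} edges, one cluster keeps {cluster_kept}")
```

Both subgraphs contain 10 nodes, but only the community-based one is guaranteed to keep all 45 of its internal edges available for message passing.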

A well-known approach is Cluster-GCN (TODO link the example). The steps are:
- pre-processing: given a large graph, we partition it into groups of nodes, which we call subgraphs
- mini-batch training: we load one subgraph at a time into device memory and apply message passing over it to compute the loss
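The two steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the Cluster-GCN implementation: `partition_nodes` uses naive contiguous chunks as a stand-in for a real graph partitioner such as METIS, and both helper names are hypothetical. Each mini-batch groups a few randomly chosen clusters, keeps only the edges inside the batch and renumbers the nodes from 0.

```python
import random


def partition_nodes(num_nodes, num_parts):
    """Pre-processing stand-in: contiguous chunks instead of a METIS partition.
    Returns at most num_parts clusters."""
    part_size = -(-num_nodes // num_parts)  # ceiling division
    return [list(range(s, min(s + part_size, num_nodes)))
            for s in range(0, num_nodes, part_size)]


def cluster_batches(edges, clusters, clusters_per_batch, seed=0):
    """Mini-batch step: yield (nodes, relabelled_edges) for random groups of clusters."""
    order = list(range(len(clusters)))
    random.Random(seed).shuffle(order)
    for i in range(0, len(order), clusters_per_batch):
        nodes = sorted(n for c in order[i:i + clusters_per_batch] for n in clusters[c])
        local_id = {n: j for j, n in enumerate(nodes)}
        # Keep only edges inside the batch and renumber their endpoints from 0.
        sub_edges = [(local_id[u], local_id[v]) for u, v in edges
                     if u in local_id and v in local_id]
        yield nodes, sub_edges


# Toy graph: a ring of 12 nodes.
edges = [(i, (i + 1) % 12) for i in range(12)]
clusters = partition_nodes(num_nodes=12, num_parts=4)
batches = list(cluster_batches(edges, clusters, clusters_per_batch=2))
for nodes, sub_edges in batches:
    # A real training loop would run message passing and compute the loss here.
    print(len(nodes), "nodes,", len(sub_edges), "edges")
```

Grouping more than one cluster per batch lets some of the edges between clusters reappear in the mini-batch, at the cost of a larger subgraph.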


### Cluster-GCN in PyG on IPUs

Notes:
- In our Cluster-GCN example we use METIS for the pre-processing step. The PyG tutorial on 'scaling GNNs' uses the `ClusterData` PyG API instead. Shall I try to use the latter to be more PyG-like?
- Given a user-defined `batch_size`, they use `ClusterLoader` to implement the stochastic partitioning scheme. On our side, we should use `FixedSizeClusterLoader` in PopTorch Geometric to comply with the ahead-of-time (AOT) compilation requirement.
- To demonstrate that clustering does not complicate the GNN model implementation, it would be good to re-use a GCN model like the one we defined in, for example, Tutorial 2.
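Ahead-of-time compilation means every mini-batch must have the same tensor shapes, whereas clusters naturally differ in their number of nodes and edges. The hand-written sketch below shows the general padding idea only. It is not how `FixedSizeClusterLoader` is implemented, and the helper name, its arguments and the choice of self-loops on a padding node are assumptions made for illustration.

```python
def pad_subgraph(features, edges, max_nodes, max_edges, pad_value=0.0):
    """Pad one subgraph to fixed node and edge counts (hypothetical helper).

    features: non-empty list of per-node feature lists.
    edges: list of (src, dst) pairs using local node indices.
    Returns padded features, padded edges and a mask marking the real nodes.
    """
    num_nodes, num_edges = len(features), len(edges)
    # At least one padding node is required to host the padding edges.
    if num_nodes >= max_nodes or num_edges > max_edges:
        raise ValueError("subgraph does not fit in the fixed-size batch")
    feature_dim = len(features[0])
    pad_nodes = max_nodes - num_nodes
    padded_features = features + [[pad_value] * feature_dim for _ in range(pad_nodes)]
    # Padding edges are self-loops on the last padding node, so they never
    # send messages to a real node.
    dummy = max_nodes - 1
    padded_edges = edges + [(dummy, dummy)] * (max_edges - num_edges)
    node_mask = [True] * num_nodes + [False] * pad_nodes
    return padded_features, padded_edges, node_mask


x, e, mask = pad_subgraph([[1.0, 2.0], [3.0, 4.0]], [(0, 1)], max_nodes=4, max_edges=3)
print(len(x), len(e), mask)  # prints "4 3 [True, True, False, False]"
```

The mask is what allows the loss to ignore the padding nodes, so padding changes the tensor shapes without changing what the model learns from.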

### Training a GNN to classify papers in the PubMed dataset

Idea: node classification on the PubMed dataset from the Planetoid node classification benchmarking suite.

## Neighbourhood sampling the computation graph for link prediction

### Neighbour sampling in PyG on IPUs 

### Training a GNN to predict XYZ in KKK dataset 

> ### Useful tips and known challenges
>
> #### Working with `argparse` and command line arguments
>
> If you have encountered problems related to `argparse` while writing a notebook, these tips may help you resolve your problem:
> - Try to disentangle your application from any argument parsing logic.
> - Manually create an [`argparse.Namespace`](https://docs.python.org/3/library/argparse.html#argparse.Namespace).
> - Define custom parsing logic in your app to detect when it's running in a Jupyter Notebook, for example as shown in the [simple parsing utilities](https://github.com/graphcore/examples-utils/blob/f8673d362fdc7dc77e1fee5f77cbcd81dd9e4a2e/examples_utils/parsing/simple_parsing_tools.py#L118).
> 
> Often with these kinds of problems, the issue is rooted in the structure of the app, so consider using the [Applications common code interface](https://graphcore.atlassian.net/wiki/spaces/PM/pages/3164995668/Making+applications+notebook+ready+RFC#Proposal) to write an app that is easier to use.
>
> #### Detaching from IPUs
>
> Notebooks continue running after the last cell has been run, so you need to make sure that all IPUs are released at the end. This ensures that other users have resources available to run their notebooks.

In [None]:
if model.isAttachedToDevice():
    model.detachFromDevice()

## Conclusion

In this tutorial we explored the main methods for dealing with large graphs that would otherwise not fit in memory, using two different sampling approaches and dedicated dataloaders to optimise performance on Graphcore IPUs. While we have worked with homogeneous graphs in this tutorial, scaled-up GNN problems are also very well suited to heterogeneous graphs (for example, citation graphs can be huge).

TODO add more details about what we covered

> This section should describe the conclusions from this notebook:
>
> - Summarise the main steps that were performed in the demo making it clear what
>  your user got to do. This can be similar to the learning outcomes listed at the beginning of the notebook, but can contain more details. Try to link the specific feature, method or class that was used to achieve a specific outcome. Remember we want to highlight how we can solve the user's problems, not sell a feature. (short paragraph: 3-6 sentences)
> - Provide resources for the user's next steps. These can be links to other tutorials, to specific 
>  documentation (for example user guides, tech notes), to code examples in the public Graphcore [examples](https://github.com/graphcore/examples) repo, or to other deployments. (2-4 suggestions)
>
> If you want to link to a notebook in the same runtime, then point the user to where the file is rather than using an explicit link. For example: "Please see our [name of tutorial] tutorial in `<folder_name>/<notebook_name>.ipynb`". For relative links, the Paperspace platform will download the file locally if the machine is running, and if the machine is not running it will throw a 404 error. New windows are opened for full path links.