Copyright (c) 2022 Graphcore Ltd. All rights reserved.

Prediction of molecular properties using GPS++ on IPUs (OGB-LSC)
==========

This notebook demonstrates inference with GPS++, the model we used for our OGB-LSC PCQM4Mv2 submission, first on a single sample and then on the entire validation dataset split. We give a brief overview of GPS++ in this notebook; for more details see [GPS++: Reviving the Art of Message Passing for Molecular Property Prediction](https://arxiv.org/abs/2302.02947).

We will use IPUs for this, which allows us to do inference on the entire validation dataset split in less than a minute.

In the process of doing this we will see some of the additional features we generate from the original dataset and feed into the model.

### Running on Paperspace

The Paperspace environment lets you run this notebook with no setup. To improve your experience we preload datasets and pre-install packages. This can take a few minutes, so if you experience errors immediately after starting a session, please try restarting the kernel before contacting support. If a problem persists or you want to give us feedback on the content of this notebook, please reach out through our community of developers using our [Slack channel](https://www.graphcore.ai/join-community) or raise a [GitHub issue](https://github.com/gradient-ai/Graphcore-Tensorflow2/issues).

**Requirements:**

* Python packages installed with `pip install -r requirements.txt`

In [None]:
%pip install -q -r requirements.txt

In [None]:
from examples_utils import notebook_logging
%load_ext gc_logger

This example requires building a few things:
* An optimised method to get the path lengths of a graph
* IPU-optimised grouped gather/scatter operations

In [None]:
# %%capture
import os

code_directory = os.getenv("OGB_SUBMISSION_CODE", ".")
! cd {code_directory} && make -C data_utils/feature_generation
! cd {code_directory} && make -C static_ops

In [None]:
%matplotlib inline
# Need notebook utils as first import as it modifies the path
import notebook_utils

import os
import yaml
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import random
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem, Draw


import tensorflow as tf
from argparser import parse_dict
from data_utils.load_dataset import CustomGraphData, load_raw_dataset
from data_utils.pcq_dataset_28features import smiles2graph_large, CustomPCQM4Mv2Dataset
from data_utils.preprocess_dataset import preprocess_dataset
from notebook_utils import predict

# Set the tensorflow log level
tf.get_logger().setLevel("ERROR")

When running on Paperspace we will run some additional configuration steps below. If you aren't running on Paperspace, ensure you have the following environment variables set: `OGB_DATASETS_DIR` (location of the dataset), `OGB_CHECKPOINT_DIR` (location of any checkpoints), `OGB_SUBMISSION_CODE` (location of this example's code), and `POPLAR_EXECUTABLE_CACHE_DIR` (location of any Poplar executable caches). Each of these falls back to the current directory if it is not set. Alternatively, you can update the paths manually in the following two cells.
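
As a quick sanity check outside Paperspace, a sketch like the following lists which of these variables are unset. The variable names are the ones read by the cells in this notebook; the helper `missing_env_vars` is a hypothetical name introduced only for this illustration.

```python
import os

# Environment variables read by the cells in this notebook.
REQUIRED_VARS = (
    "OGB_DATASETS_DIR",
    "OGB_CHECKPOINT_DIR",
    "OGB_SUBMISSION_CODE",
    "POPLAR_EXECUTABLE_CACHE_DIR",
)


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print("Not set, the notebook will fall back to '.':", ", ".join(missing))
```

Any variable reported here is replaced by the current directory in the cells that follow, so the check is informational rather than a hard requirement.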

In [None]:
checkpoint_directory = os.getenv("OGB_CHECKPOINT_DIR", ".")
dataset_directory = os.getenv("OGB_DATASETS_DIR", ".")
code_directory = Path(os.getenv("OGB_SUBMISSION_CODE", "."))

Let's also set a few things to enable us to use the executable caches, saving us from recompiling the model.

In [None]:
executable_cache_dir = os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", ".")
os.environ["TF_POPLAR_FLAGS"] = f"--executable_cache_path='{executable_cache_dir}'"

### Loading a configuration

There are three configurations available, with model sizes of 11M (`GPS_4layer`), 22M (`GPS_8layer`) and 44M (`GPS_16layer`) parameters.
Choosing a model name selects the corresponding config path and checkpoint path.


In [None]:
# Choose model
model_name = "GPS_4layer" # choose from ["GPS_4layer", "GPS_8layer", "GPS_16layer"]

# Set configs
model_dict = {"GPS_4layer": "GPS_PCQ_4gps_11M.yaml", 
              "GPS_8layer": "GPS_PCQ_8gps_22M.yaml",
              "GPS_16layer": "GPS_PCQ_16gps_44M.yaml"}
cfg_path = code_directory / "configs" / model_dict[model_name]
cfg_yaml = yaml.safe_load(cfg_path.read_text())
cfg = parse_dict(cfg_yaml)

# Set the checkpoint path for the corresponding config
sub_directory = model_dict[model_name].split(".")[0]
checkpoint_path = Path(checkpoint_directory).joinpath(f"{sub_directory}/model-FINAL")

# Turn off dataset caching for this notebook
cfg.dataset.save_to_cache = False
cfg.dataset.load_from_cache = False

### Predicting the HOMO-LUMO gap of a single molecule

The competition requires predicting the HOMO-LUMO energy gap of a number of molecules. To demonstrate inference with our model on the IPU we will start by predicting the HOMO-LUMO gap of a single molecule.

Let's create a SMILES string:

In [None]:
smile_string = "CC(NCC[C@H]([C@@H]1CCC(=CC1)C)C)C"

To get an idea of what this molecule looks like, let's visualise this SMILES string using [rdkit](https://www.rdkit.org/docs/GettingStartedInPython.html).

In [None]:
mol = Chem.MolFromSmiles(smile_string)
AllChem.Compute2DCoords(mol)
Draw.MolToImage(mol)

We now convert the SMILES string into features our model can understand. The competition provides a method to do this (`smiles2graph`). We have written a custom version of this that uses [rdkit](https://www.rdkit.org/docs/GettingStartedInPython.html) to generate more features from the SMILES string.

In [None]:
graph_obj = smiles2graph_large(smile_string)

We can take a look at the features that have been created that represent this molecule.

In [None]:
graph_obj.keys()

We can see the features that have been created:
* `edge_index`: The sending and receiving node indices, representing the edges in the graph.
* `edge_feat`: The features of each of the edges in the graph.
* `node_feat`: The features of each of the nodes in the graph.
* `num_nodes`: Number of nodes in the graph.
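
To make this layout concrete, here is a hand-built toy graph for a three-atom chain. It is only a sketch: it assumes the same convention as the OGB `smiles2graph` output, where each bond is stored as two directed edges, and the feature values are placeholders rather than what `smiles2graph_large` would produce.

```python
# Toy chain of three heavy atoms (C-C-O), written out by hand.
bonds = [(0, 1), (1, 2)]

# Store every undirected bond as two directed edges (a -> b and b -> a).
senders = [s for a, b in bonds for s in (a, b)]
receivers = [r for a, b in bonds for r in (b, a)]

toy_graph = {
    "edge_index": [senders, receivers],    # shape (2, num_edges)
    "edge_feat": [[0] for _ in senders],   # one placeholder feature per edge
    "node_feat": [[6], [6], [8]],          # placeholder: atomic number only
    "num_nodes": 3,
}

print(toy_graph["edge_index"])
```

The two rows of `edge_index` are read column by column, so the first column describes an edge from node 0 to node 1.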

One thing to note is the size of the node features: each node has 28 features, compared to the 9 created by the `smiles2graph` function that OGB-LSC provides.

In [None]:
len(graph_obj["node_feat"][0])

Now we will wrap this single graph in our dataset wrapper, which contains some metadata and the dataset object itself.

In [None]:
graphs = [graph_obj]
graph_raw = CustomGraphData(
    graphs,
    use_extended_features=True,
    use_conformers=True,
    trim_chemical_features=cfg.dataset.trim_chemical_features,
    use_periods_and_groups=cfg.dataset.use_periods_and_groups,
    do_not_use_atomic_number=cfg.dataset.do_not_use_atomic_number,
    chemical_node_features=cfg.dataset.chemical_node_features,
    chemical_edge_features=cfg.dataset.chemical_edge_features,
)

Now let's feed this through our dataset preprocessing pipeline.

In [None]:
graph_preprocessed = preprocess_dataset(dataset=graph_raw, options=cfg)

Let's take a look at the additional features created.

In [None]:
preprocessed_single_graph = graph_preprocessed.dataset[0][0]
preprocessed_single_graph.keys()

You can see we now have many more features. Features of note are:

* `lap_eig_vals`: The Laplacian eigenvalues used as positional encodings in the model.
* `lap_eig_vecs`: The Laplacian eigenvectors used as positional encodings in the model.
* `random_walk_landing_probs`: The random walk feature used as structural encodings in the model.
* `shortest_path_distances`: The shortest path distances (2D graph structure information) used as attention bias.
* `centrality_encoding`: Degrees of the atoms.
* `ogb_conformer`, `ogb_bond_lengths` and `atom_distances` require the 3D position information which is only provided for training data. You will see these values are NaNs or zeros for our inference case.
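
To give some intuition for three of these features, here is a minimal pure-Python sketch on a toy path graph. It is not the preprocessing code used by this notebook: the function names are hypothetical, and the exact definitions in `preprocess_dataset` (for example the number of random walk steps or any normalisation) may differ. The Laplacian encodings are left out because they need an eigendecomposition.

```python
from collections import deque

# Toy path graph 0 - 1 - 2 (made up, not taken from the dataset).
num_nodes = 3
edges = [(0, 1), (1, 2)]

adjacency = {n: [] for n in range(num_nodes)}
for a, b in edges:
    adjacency[a].append(b)
    adjacency[b].append(a)

# Node degrees, the quantity described for `centrality_encoding`.
degrees = [len(adjacency[n]) for n in range(num_nodes)]


def bfs_distances(source):
    """Hop counts from `source` to every node; -1 marks unreachable nodes."""
    dist = [-1] * num_nodes
    dist[source] = 0
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if dist[neighbour] == -1:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


# All-pairs shortest path distances, cf. `shortest_path_distances`.
shortest_paths = [bfs_distances(n) for n in range(num_nodes)]


def return_probabilities(steps):
    """Probability of a uniform random walk being back at its start node
    after k = 1..steps steps, one list per start node."""
    result = []
    for start in range(num_nodes):
        probs = [0.0] * num_nodes
        probs[start] = 1.0
        per_step = []
        for _ in range(steps):
            nxt = [0.0] * num_nodes
            for node, p in enumerate(probs):
                for neighbour in adjacency[node]:
                    nxt[neighbour] += p / degrees[node]
            probs = nxt
            per_step.append(probs[start])
        result.append(per_step)
    return result


print(degrees)
print(shortest_paths)
print(return_probabilities(2))
```

On a path graph a walk can never return after an odd number of steps, which is why the first entry for every node is zero. The middle node always returns after two steps, while the end nodes do so only half of the time, so the values differ by position in the graph.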

Checking the size of the node features, you will see that it is smaller than before. From the 28 features we had before, we have selected 11 based on how beneficial the features were for training. For example, we found the features related to the atomic number group, period and family to be very beneficial.

In [None]:
len(preprocessed_single_graph["node_feat"][0])

Now we are ready to run on the IPU. We have wrapped this functionality in a single function for simplicity. We encourage you to check the contents of this function.

The first time we run inference on IPUs, the model must be compiled, which can take a few minutes. We have cached the executable, so subsequent runs will not require this compile phase.

Now let's predict the HOMO-LUMO gap of the molecule.

In [None]:
prediction, _ = predict(graph_preprocessed, checkpoint_path, "test-challenge", cfg)

In [None]:
prediction[0]

### Predictions on the validation dataset split

Doing a single prediction at a time is very inefficient. In this next section we will generate predictions for the entire `valid` dataset split, which is very fast on the IPU.

In [None]:
cases = ("valid",)
split_mode = "original"

Now let's load and preprocess the dataset. These steps are the same as those we took when preprocessing the single molecule, but we do the preprocessing only on the molecules in the `valid` dataset split. This will be done from scratch so can take a few minutes.

In [None]:
graph_data = load_raw_dataset(
    cfg.dataset.dataset_name,
    dataset_directory,
    cfg,
    split=cases,
    load_ensemble_cache=True,
    split_mode=split_mode,
    ensemble=True,
)
graph_data = preprocess_dataset(
    dataset=graph_data,
    options=cfg,
    load_ensemble_cache=False,
    folds=cases,
    split_mode=split_mode,
    ensemble=True,
)

Now let's get predictions on the entire `valid` dataset split.

In [None]:
predictions, ground_truth = predict(graph_data, checkpoint_path, cases[0], cfg)

Let's take a look at the mean, standard deviation and histogram of the predictions, and a histogram of the ground truth labels.

In [None]:
mean = predictions.astype(float).mean()
std = predictions.astype(float).std()

plt.hist(predictions, 30, alpha=0.7, label="Predictions")
plt.hist(ground_truth, 30, alpha=0.5, label="Ground truth")
plt.xlabel("HOMO-LUMO Gap (eV)")
plt.ylabel("Counts")
plt.title("Histogram of HOMO-LUMO gap predictions")
plt.legend()
plt.text(5, 15000, f"mean: {mean:.2f}, std: {std:.2f} \n")
plt.show()

We can compare the predicted values with the true labels and calculate the MAE (mean absolute error) as shown below. The model we test here is the smallest one (11M parameters). To achieve a better MAE we could select one of the larger models (22M or 44M) and re-run the predictions. For the challenge submission we ran the largest model (44M) with 16 GPS layers.
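
For reference, with predictions $\hat{y}_i$ and ground truth labels $y_i$ for $N$ molecules, the MAE is

$$
\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_i - y_i \right|
$$

so it is expressed in the same unit as the HOMO-LUMO gap (eV).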

In [None]:
mean_MAE = np.mean(np.abs(np.array(predictions) - np.array(ground_truth)))
mean_MAE

Let's have a look at some molecules with their corresponding label and prediction. First, we have to load the raw data.

In [None]:
# Load raw SMILES strings with corresponding label and select only validation molecules
smiles_only = CustomPCQM4Mv2Dataset(
    root=dataset_directory, only_smiles=True, split_path=cfg.dataset.split_path
)
valid_idx = smiles_only.get_idx_split()["valid"]
valid_smiles = []
for i in valid_idx:
    valid_smiles.append(smiles_only[i])

Now we can plot random molecules in the `valid` dataset.

In [None]:
# Choose 6 random indices from validation dataset
r_idx = random.sample(range(len(valid_smiles)), 6)
# Extract SMILES, RDKit molecular objects, ground_truth labels, and predictions for these molecules
Smiles = [valid_smiles[r][0] for r in r_idx]
GT = [valid_smiles[r][1] for r in r_idx]
Mols = [Chem.MolFromSmiles(r) for r in Smiles]
Pred = [predictions[r] for r in r_idx]

# Create labels
labelList = [
    "True: " + ("%.3f" % gt) + ", Pred: " + str(pred)
    for gt, pred in zip(GT, Pred)
]
# Display molecules with labels
Draw.MolsToGridImage(
    Mols,
    molsPerRow=3,
    legends=labelList,
    subImgSize=(250, 250),
    useSVG=False,
)

Let's now look at the mean absolute error compared to a global molecular property such as number of nodes.

First, we have to extract the number of nodes from our graph data.

In [None]:
num_nodes_list = []
for i in valid_idx:
    num_nodes_list.append(graph_data.dataset[i][0]["num_nodes"])

For simplicity, we create a dataframe with columns for ground truth, prediction, and number of nodes.

In [None]:
df = pd.DataFrame(
    list(zip(ground_truth, predictions, num_nodes_list)),
    columns=["ground_truth", "prediction", "num_nodes"],
)
df.head()

Now, let's plot the mean HOMO-LUMO gap and mean prediction per molecule size.

In [None]:
# plotting a line graph
plt.plot(
    df.groupby("num_nodes")["ground_truth"].mean(),
    color="red",
    label="Mean ground truth",
)
plt.plot(
    df.groupby("num_nodes")["prediction"].mean(), color="green", label="Mean prediction"
)
plt.legend()
plt.ylabel("HOMO-LUMO Gap (eV)")
plt.xlabel("Number of Atoms")
plt.show()

Now, let's add the mean absolute error to our dataframe and compare it to the molecule size.

In [None]:
# Add MAE to dataframe
df["mae"] = np.abs(df["ground_truth"] - df["prediction"])
# plot
plt.plot(df.groupby("num_nodes")["mae"].mean(), color="orange", label="Mean MAE")
plt.legend()
plt.ylabel("Mean Absolute Error (eV)")
plt.xlabel("Number of Atoms")
plt.show()

It seems that our model struggles to predict the HOMO-LUMO gap for larger molecules. However, there are only very few molecules (31 in total) that consist of 40 or more atoms. Therefore, let's see how much the molecules of each size contribute to the overall error.
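
One way to make "contribution to the overall error" concrete is the share of the summed absolute error that falls into each size group, which is not the same as the share of molecules in that group. The following sketch uses made-up numbers purely for illustration; none of the values come from the dataset.

```python
from collections import defaultdict

# Made-up (num_nodes, absolute_error) pairs, for illustration only.
samples = [(10, 0.25), (10, 0.25), (20, 0.5), (40, 1.0)]

error_per_size = defaultdict(float)
count_per_size = defaultdict(int)
for num_nodes, abs_error in samples:
    error_per_size[num_nodes] += abs_error
    count_per_size[num_nodes] += 1

total_error = sum(error_per_size.values())

# Percentage of the total absolute error coming from each molecule size.
error_share = {n: 100 * e / total_error for n, e in error_per_size.items()}
# Percentage of molecules that have each size.
molecule_share = {n: 100 * c / len(samples) for n, c in count_per_size.items()}

print(error_share)
print(molecule_share)
```

In this toy data the single 40-atom molecule is a quarter of the samples but carries half of the total error, so the two percentages can tell quite different stories.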

In [None]:
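# Recompute the per-molecule absolute error so this cell can run on its own
df["mae"] = np.abs(df["ground_truth"] - df["prediction"])
# Share of the total absolute error contributed by each molecule size
error_share = df.groupby("num_nodes")["mae"].sum() / df["mae"].sum() * 100
plt.plot(error_share, color="green")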
plt.ylabel("Contribution in % to final MAE")
plt.xlabel("Number of Atoms")
plt.show()

### Follow up tasks

* Try changing the SMILES string and seeing what HOMO-LUMO gap is predicted.
* Try a larger model (`GPS_PCQ_8gps_22M.yaml` or `GPS_PCQ_16gps_44M.yaml`) and the corresponding checkpoint.
* Take a look at the `predict` function to get an understanding of what is required to run on IPUs.
* Use the 'test-challenge' split of the dataset to predict HOMO-LUMO gaps for the molecules in the challenge.

Or head to `notebook_training.ipynb` to try out training the model from scratch using IPUs!

In [None]:
from tensorflow.python import ipu

ipu.config.reset_ipu_configuration()