Copyright (c) 2023 Graphcore Ltd. All rights reserved.

# Molecular Prediction on IPUs using MolFeat 
## PyTorch Geometric with Lipophilicity and QM9 Datasets



### Introduction: MolFeat on the IPU

The popular [MolFeat library](https://molfeat.datamol.io) provides an open-source hub of pre-trained featurizers for molecules that can be deployed directly into ML workflows. 
In this notebook we'll show exactly how to do that, using the Graphcore IPU to train a PyTorch Geometric graph neural network (GNN) on the `Lipophilicity` dataset from MoleculeNet to predict octanol/water distribution coefficients. 
Then we show how this can be extended to any regression task on a new dataset.



### Summary table
|  Domain | Tasks | Model | Datasets | Workflow |   Number of IPUs   | Execution time |
|---------|-------|-------|----------|----------|--------------|--------------|
|   Molecules   |  Regression  | GINE | Lipophilicity / QM9 | Training, evaluation, inference | recommended: 4x (min: 1x) | 5 min    |

### Learning outcomes
In this demo you will learn how to:
- Predict molecular properties using a MolFeat featurizer and a PyTorch Geometric GNN on the IPU
- Build an inference workflow for single-molecule predictions


### Links to other resources
For more information about MolFeat, check out the [documentation](https://molfeat.datamol.io). For the datasets used in this notebook, see [MoleculeNet](https://moleculenet.org/datasets-1). 
This notebook assumes familiarity with GNNs and PyTorch Geometric. To refresh on any details, see the tutorials on using PyTorch Geometric on the IPU [here](https://github.com/graphcore/Gradient-Pytorch-Geometric/tree/main/learning-pytorch-geometric-on-ipus).


[![Join our Slack Community](https://img.shields.io/badge/Slack-Join%20Graphcore's%20Community-blue?style=flat-square&logo=slack)](https://www.graphcore.ai/join-community)


## Dependencies

This notebook is currently only supported for Graphcore's SDK 3.2.1. 

In [None]:
import os

sdk_version = os.getenv("SDK_VERSION", "")
if sdk_version == "3.2.1":
    print("SDK check passed.")
else:
    raise Exception(
        f"The current SDK version {sdk_version} is incompatible with this notebook. We recommend you relaunch this notebook with the SDK 3.2.1."
    )

## Running on Paperspace

The Paperspace environment lets you run this notebook with no setup. To improve your experience, we preload datasets and pre-install packages. This can take a few minutes. If you experience errors immediately after starting a session, please try restarting the kernel before contacting support. If a problem persists or you want to give us feedback on the content of this notebook, please reach out through our community of developers using our [Slack channel](https://www.graphcore.ai/join-community) or raise a [GitHub issue](https://github.com/graphcore/examples).

Requirements:

* Python packages installed with `pip install -r ./requirements.txt`

In order to improve usability and support for future users, Graphcore would like to collect information about the
applications and code being run in this notebook. The following information will be anonymised before being sent to Graphcore:

- User progression through the notebook
- Notebook details: number of cells, code being run and the output of the cells
- Environment details

You can disable logging at any time by running `%unload_ext graphcore_cloud_tools.notebook_logging.gc_logger` from any cell.

In [None]:
%pip install  -r ./requirements.txt
%load_ext graphcore_cloud_tools.notebook_logging.gc_logger

## The Problem...

Let's start by posing a toy problem. Imagine we have a dataset of molecules that we know something about. In this case it is the lipophilicity of the molecules (their ability to dissolve in fats, oils and other non-polar solvents rather than in water), but it could be any property we wish. This dataset has been built from experimental results, which are very expensive and time-consuming to produce or extend. 

So, like any good scientist, we wonder: can we take the data we have and build a model that describes the physical results, and then use it to produce reliable predictions for new molecules where experiments haven't been carried out? 

Below we can see an example molecule from the dataset (ChEMBL ID `CHEMBL1940306`), represented as a SMILES string and visualised in 3D. The table shows the expected (measured) result, while the predicted value is left empty for now.
The measured value is the target we are aiming to reproduce, and by the end of this notebook we will have a model to fill in the rest of the table. 

In [None]:
from utils import report_molecule_regression

report_molecule_regression(
    "CHEMBL1940306", 1.51, None, r"CS(=O)(=O)c1ccc(Oc2ccc(cc2)C#C[C@]3(O)CN4CCC3CC4)cc1"
)

In [None]:
# Make imported python modules automatically reload when the files are changed
# needs to be before the first import.
%load_ext autoreload
%autoreload 2

import torch
import pandas as pd
import numpy as np
import torch.nn.functional as F
from tqdm.auto import tqdm
from IPython.display import clear_output
import ipywidgets as widgets

import poptorch
from poptorch_geometric.dataloader import CustomFixedSizeDataLoader
from utils import (
    plot_smoothed_loss,
    report_molecule_regression,
)

## PyG integration

As seen in the [molfeat integration tutorial](https://github.com/datamol-io/molfeat/blob/main/docs/tutorials/pyg_integration.ipynb), molfeat integrates easily with the PyTorch ecosystem. In this tutorial, we will demonstrate how you can integrate molfeat with [PyG](https://pytorch-geometric.readthedocs.io/en/latest/) for training SOTA GNNs.


In [None]:
from molfeat.trans.graph.adj import PYGGraphTransformer
from molfeat.calc.atom import AtomCalculator
from molfeat.calc.bond import EdgeMatCalculator

### Featurizer

The key advantage of using MolFeat as part of the GNN pipeline is the pre-trained featurizers. These featurizers allow us to take a molecule and build a set of node and edge features that is already discriminative and describes the relationships between atoms and bonds. 
This is because the pre-trained featurizers have been trained on large quantities of data from a range of domains in order to capture the basic underlying physical relationships between atoms, which are transferable between downstream tasks. 

We first start by defining our featurizer. We will use the `PYGGraphTransformer` from molfeat with atom and bond featurizers.

In [None]:
featurizer = PYGGraphTransformer(
    atom_featurizer=AtomCalculator(), bond_featurizer=EdgeMatCalculator()
)

### Dataset

For the dataset, we will use the `Lipophilicity` dataset (LogD) from MoleculeNet, which contains experimental results of octanol/water distribution coefficient at pH=7.4.
This contains a list of molecules given by their CMPD_CHEMBLID, the SMILES string describing the molecular structure, and the expected result. 
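
As a reminder of what the target measures, the distribution coefficient can be written as below. This is the standard textbook definition rather than something stated in the dataset itself, and the concentrations are assumed to cover all forms of the compound (ionised and neutral) in each phase at pH 7.4:

$$
\log D_{7.4} = \log_{10}\left(\frac{[\text{compound}]_{\text{octanol}}}{[\text{compound}]_{\text{water}}}\right)
$$

A positive value therefore means the compound prefers the octanol phase, and a negative value means it prefers water.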

We can look at the data in the dataframe in the output below. Take note of the length of the dataset. 
This is partly where the MolFeat featurizer starts to make a lot of sense: with only 4200 molecules in the dataset, we need to maximise the learning on the downstream task to get good performance, and having to learn meaningful chemical features from scratch would make this a significantly harder problem.
This featurization can be offloaded to `MolFeat`, because the core features describing how molecules are constructed are consistent and transferable between tasks. 


In [None]:
df = pd.read_csv(
    "https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/Lipophilicity.csv"
)
print(df.head())
print(f"Length of dataset: {len(df)}")

Now let's visualise the dataset a bit better. We can plot a 3D representation of a molecule from our dataset, and change `mol_id` in the cell below to step through the molecules and see how they look alongside their expected target values. 

In [None]:
mol_id = 10  # SET THIS VALUE AND RE-RUN THE CELL TO EXPLORE THE DATASET

view = report_molecule_regression(
    df.CMPD_CHEMBLID.values[mol_id],
    df.exp.values[mol_id],
    None,
    df.smiles.values[mol_id],
)
view.show()

**Extension:** To use a different dataset, load it from CSV and note the column headings in your dataframe. You will likely need to change the ID heading (`CMPD_CHEMBLID` for `Lipophilicity`, `mol_id` for `QM9`) and the target heading (`exp` for `Lipophilicity`, `gap` for `QM9`). The commented-out cell below shows these changes for `QM9`. When you build the dataset later on, make the same changes so that the correct columns are read from the dataframe. Be careful in the Demo section at the end: you will need to make the same changes there. 
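
One way to keep these column changes in a single place is a small lookup table. The sketch below is only a suggestion and is not part of this notebook's `utils`: the `COLUMNS` dictionary and the `pick_columns` helper are hypothetical names, it uses the standard `csv` module instead of pandas so that it runs on its own, and the two inline rows are made-up placeholders rather than real dataset entries.

```python
import csv
import io

# Hypothetical lookup: which CSV headings hold the molecule ID and the target.
COLUMNS = {
    "Lipophilicity": {"id": "CMPD_CHEMBLID", "target": "exp"},
    "QM9": {"id": "mol_id", "target": "gap"},
}


def pick_columns(rows, dataset_name, smiles_col="smiles"):
    """Return (ids, targets, smiles) lists for the named dataset."""
    cols = COLUMNS[dataset_name]
    ids = [row[cols["id"]] for row in rows]
    targets = [float(row[cols["target"]]) for row in rows]
    smiles = [row[smiles_col] for row in rows]
    return ids, targets, smiles


# Placeholder rows (not real data) laid out like the Lipophilicity CSV.
csv_text = "CMPD_CHEMBLID,exp,smiles\nMOL_A,1.5,CCO\nMOL_B,-0.3,CCN\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
ids, targets, smiles = pick_columns(rows, "Lipophilicity")
print(ids, targets, smiles)
```

With a lookup like this, switching dataset would mean changing one name rather than editing every cell that reads a column.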

In [None]:
# Alternatively we can use the QM9 dataset
# df = pd.read_csv("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/qm9.csv")
# print(df.head())
# print(f"Length of dataset: {len(df)}")
# mol_id = 10  # Pick an index within the QM9 dataframe to explore
# report_molecule_regression(df.mol_id.values[mol_id], df.gap.values[mol_id], None, df.smiles.values[mol_id])

Since training a network with PyTorch requires defining a dataset and dataloader, we can define our custom dataset that will take 
**(1)** the SMILES string of the molecule,
**(2)** the LogD measurement, and 
**(3)** our molfeat transformer
as input to generate the data point we need for model training.

In [None]:
from torch.utils.data import Dataset
from torch_geometric.utils import degree


class DTset(Dataset):
    def __init__(self, smiles, y, featurizer):
        super().__init__()
        self.smiles = smiles
        self.featurizer = featurizer
        self.featurizer.auto_self_loop()
        self.y = torch.tensor(y.astype("float16")).unsqueeze(-1).float()
        self.transformed_mols = self.featurizer(smiles)
        self._degrees = None

    @property
    def num_atom_features(self):
        return self.featurizer.atom_dim

    @property
    def num_output(self):
        return self.y.shape[-1]

    def __len__(self):
        return len(self.transformed_mols)

    @property
    def num_bond_features(self):
        return self.featurizer.bond_dim

    @property
    def degree(self):
        if self._degrees is None:
            max_degree = -1
            for data in self.transformed_mols:
                d = degree(
                    data.edge_index[1], num_nodes=data.num_nodes, dtype=torch.long
                )
                max_degree = max(max_degree, int(d.max()))
            # Compute the in-degree histogram tensor
            deg = torch.zeros(max_degree + 1, dtype=torch.long)
            for data in self.transformed_mols:
                d = degree(
                    data.edge_index[1], num_nodes=data.num_nodes, dtype=torch.long
                )
                deg += torch.bincount(d, minlength=deg.numel())
            self._degrees = deg
        return self._degrees

    def collate_fn(self, **kwargs):
        # luckily the molfeat featurizer provides a collate function for PyG
        return self.featurizer.get_collate_fn(**kwargs)

    def __getitem__(self, index):
        return self.transformed_mols[index], self.y[index]

Now process the dataset with our custom class, and split into test and train datasets. 

In [None]:
dataset = DTset(df.smiles.values, df.exp.values, featurizer)
# The line below can be used for the QM9 dataset - or edit this line for your own dataset as suggested above.
# dataset = DTset(df.smiles.values, df.gap.values, featurizer)
generator = torch.Generator().manual_seed(42)
train_dt, test_dt = torch.utils.data.random_split(
    dataset, [0.8, 0.2], generator=generator
)

For the PopTorch dataloader we need fixed-size batches, so for starters we can make a conservative estimate of the maximum possible size of a batch: the maximum number of nodes (or edges) in any single graph multiplied by the batch size. (For a more efficient approach see the tutorial [here](https://console.paperspace.com/github/graphcore/Gradient-Pytorch-Geometric?machine=Free-IPU-POD4&container=graphcore%2Fpytorch-geometric-jupyter%3A3.2.0-ubuntu-20.04-20230314&file=%2Flearning-pytorch-geometric-on-ipus%2F4_small_graph_batching_with_packing.ipynb&utm_source=Medium&utm_medium=content&utm_campaign=PyG+Launch) on packed batches.)
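
To get a feel for how conservative this estimate is, here is a minimal pure-Python sketch of the same arithmetic. The helper names (`fixed_size_budget`, `padding_fraction`) and the toy graph sizes are assumptions made for illustration; they are not taken from the dataset or from PopTorch Geometric.

```python
def fixed_size_budget(graph_sizes, batch_size):
    """Worst-case node and edge budget for one batch.

    graph_sizes is a list of (num_nodes, num_edges) pairs, one per graph.
    """
    max_nodes = max(n for n, _ in graph_sizes)
    max_edges = max(e for _, e in graph_sizes)
    return max_nodes * batch_size, max_edges * batch_size


def padding_fraction(batch_graph_sizes, node_budget):
    """Fraction of the node budget that is padding for one batch."""
    real_nodes = sum(n for n, _ in batch_graph_sizes)
    return 1 - real_nodes / node_budget


# Toy (num_nodes, num_edges) pairs, not taken from the dataset.
toy_graphs = [(10, 22), (25, 54), (40, 88), (5, 8)]
node_budget, edge_budget = fixed_size_budget(toy_graphs, batch_size=4)
print(node_budget, edge_budget)  # 160 352
print(padding_fraction(toy_graphs, node_budget))  # 0.5
```

In this toy case half of the node budget is padding, which is the overhead that packing aims to reduce.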

In [None]:
def max_nodes_edges(dataset):
    max_nodes, max_edges = 0, 0

    for data in dataset:
        data = data[0]
        num_nodes = data.num_nodes
        num_edges = data.num_edges

        max_nodes = max(max_nodes, num_nodes)
        max_edges = max(max_edges, num_edges)

    return max_nodes, max_edges


max_nodes, max_edges = max_nodes_edges(dataset)

In [None]:
BATCH_SIZE = 64

The custom dataset object has a collate function that wraps the one provided by the featurizer. We want to combine this with the PopTorch Geometric `CustomFixedSizeDataLoader`, which ensures the batches are all of fixed size. To do this we can write a light custom dataloader: its collater first calls the collate function from the dataset and then passes the result to the fixed-size collater from `CustomFixedSizeDataLoader`.
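
The idea of chaining two collaters can be sketched without any IPU libraries. In the toy version below each graph is just a list of node values, and `merge_collater`, `make_fixed_size_collater` and `compose` are hypothetical stand-ins for the featurizer collater, the fixed-size collater and the combination of the two. The real collaters operate on PyG `Data` objects and do considerably more.

```python
def merge_collater(samples):
    """Stand-in for the featurizer collater: merge graphs into one batch."""
    nodes, batch = [], []
    for graph_id, sample in enumerate(samples):
        nodes.extend(sample)
        batch.extend([graph_id] * len(sample))
    return {"x": nodes, "batch": batch, "num_graphs": len(samples)}


def make_fixed_size_collater(num_nodes, pad_value=0.0):
    """Stand-in for the fixed-size collater: pad nodes up to num_nodes."""

    def pad(batch):
        n_pad = num_nodes - len(batch["x"])
        if n_pad < 0:
            raise ValueError("batch exceeds the fixed node budget")
        pad_graph = batch["num_graphs"]  # padding nodes go to an extra graph
        return {
            "x": batch["x"] + [pad_value] * n_pad,
            "batch": batch["batch"] + [pad_graph] * n_pad,
            "nodes_mask": [True] * len(batch["x"]) + [False] * n_pad,
        }

    return pad


def compose(first, second):
    """Apply `first` to the samples, then `second` to its output."""
    return lambda samples: second(first(samples))


collate = compose(merge_collater, make_fixed_size_collater(num_nodes=8))
out = collate([[1.0, 2.0], [3.0, 4.0, 5.0]])
print(out["batch"])  # [0, 0, 1, 1, 1, 2, 2, 2]
```

The order matters: the graphs have to be merged into one batch before that batch can be padded to the fixed size.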

In [None]:
from poptorch_geometric.dataloader import CustomFixedSizeDataLoader


class CombinedDataloader(CustomFixedSizeDataLoader):
    def _create_collater(self, **collater_args):
        fixed_size_collater = super()._create_collater(**collater_args)
        featurizer_collater_fn = dataset.collate_fn(return_pair=False)

        def my_collater(input):
            input = featurizer_collater_fn(input)
            input = fixed_size_collater(input)
            return input

        return my_collater

In [None]:
collate_fn = dataset.collate_fn(return_pair=False)
from torch_geometric.transforms import Pad

pad_transform = Pad(
    max_num_nodes=max_nodes * BATCH_SIZE, max_num_edges=max_edges * BATCH_SIZE
)


def padded_collate_fn(data_list):
    # Collate with the featurizer, then pad the batch to a fixed size
    batch = collate_fn(data_list)
    batch = pad_transform(batch)
    return batch

In [None]:
from poptorch_geometric.dataloader import DataLoader

train_opts = poptorch.Options()
train_opts.deviceIterations(1)
train_opts.Training.gradientAccumulation(1)
train_opts.replicationFactor(1)

train_loader = DataLoader(
    dataset=train_dt,
    batch_size=BATCH_SIZE,
    drop_last=True,
    options=train_opts,
    collate_fn=padded_collate_fn,
)

Then we can use the new dataloader to create the `train_loader` and `test_loader` iterables that we will use in training. The arguments we pass in for the collate function are as normal for PopTorch Geometric, but now we have fixed-size batches with all the utility from the featurizer. 

Here we set the PopTorch options for both training and testing. For simplicity we set device iterations, gradient accumulation and the number of replicas all to 1. For more information on tuning these parameters, you can look [here](https://console.paperspace.com/github/graphcore/Gradient-Pytorch-Geometric?machine=Free-IPU-POD4&container=graphcore%2Fpytorch-geometric-jupyter%3A3.2.0-ubuntu-20.04-20230314&file=%2Flearning-pytorch-geometric-on-ipus%2F1_at_a_glance.ipynb&utm_source=Medium&utm_medium=content&utm_campaign=PyG+Launch).

In [None]:
train_opts = poptorch.Options()
train_opts.deviceIterations(1)
train_opts.Training.gradientAccumulation(1)
train_opts.replicationFactor(1)

test_opts = poptorch.Options()
test_opts.deviceIterations(1)
test_opts.Training.gradientAccumulation(1)
test_opts.replicationFactor(1)

train_loader = CombinedDataloader(
    dataset=train_dt,
    batch_size=BATCH_SIZE,
    num_nodes=max_nodes * BATCH_SIZE,
    collater_args=dict(num_edges=max_edges * BATCH_SIZE, add_masks_to_batch=True),
    drop_last=True,
    options=train_opts,
)

test_loader = CombinedDataloader(
    dataset=test_dt,
    batch_size=BATCH_SIZE,
    num_nodes=max_nodes * BATCH_SIZE,
    collater_args=dict(num_edges=max_edges * BATCH_SIZE, add_masks_to_batch=True),
    drop_last=True,
    options=test_opts,
)

To see how the collater works, we can inspect a single batch. We can see that the batch has been padded to the fixed size we specified and a mask has been added for the additional nodes. 

In [None]:
sample = next(iter(train_loader))
sample

### Network + Training
We are almost ready to go; we just need to define our GNN. For this example we are going to build a [GINE](https://arxiv.org/abs/1905.12265) model. 
We use GINE because it allows us to exploit the edge features provided by MolFeat.
The PyTorch Geometric docs give a good description of the core `GINEConv` layer [here](https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.nn.conv.GINEConv.html#torch_geometric.nn.conv.GINEConv).
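
For reference, the update rule documented for `GINEConv` can be written as follows, where $\mathbf{x}_i$ are node features, $\mathbf{e}_{j,i}$ are edge features and $h_{\Theta}$ is a neural network:

$$
\mathbf{x}_i' = h_{\Theta}\left((1+\epsilon)\cdot\mathbf{x}_i + \sum_{j\in\mathcal{N}(i)} \mathrm{ReLU}\left(\mathbf{x}_j + \mathbf{e}_{j,i}\right)\right)
$$

In the model below, $h_{\Theta}$ is the `phi` MLP, $\epsilon$ is fixed at 0, and passing `edge_dim` lets the layer linearly project the edge features to the node feature size before the sum.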

This model is a further extension of the `GIN` model used in the `<note/book>.ipynb`.
If you are unfamiliar with the adaptations needed when building a PyTorch model on the IPU, then please see our tutorials. The main difference from vanilla PyTorch is that the loss is computed and returned by the model itself, as can be seen at the end of the `forward` method in the cell below. 

In [None]:
import torch.nn as nn
from torch_geometric.nn.conv import GINConv, GINEConv
from torch_geometric.nn.pool import global_add_pool
from torch_geometric.nn.models import MLP


class GINE(nn.Module):
    """
    Graph Isomorphism Network with edge features (GINE), modified to take
    into account fixed size tensor inputs created with padding.

    params:
        in_channels (int): number of features each node is represented by
        hidden_channels (int): number of hidden units for all MLP layers
        out_channels (int): number of units in each linear readout layer
        num_conv_layers (int): number of GINEConv layers in the network
        num_mlp_layers (int): number of layers in each GINEConv MLP
        batch_size (int): maximum number of graphs in a batch
        edge_dim (int): number of features each edge is represented by
    """

    def __init__(
        self,
        in_channels,
        hidden_channels,
        out_channels,
        num_conv_layers,
        num_mlp_layers,
        batch_size,
        edge_dim,
    ):
        super().__init__()

        self.batch_size = batch_size

        # `num_conv_layers` layers for AGGREGATE and COMBINE
        self.hop_k_gin_layers = nn.ModuleList()

        # linear READOUT nets for (sum) graph pooling
        # the first pooling occurs on the input nodes' features (0-hop)
        self.hop_k_readout_layers = nn.ModuleList(
            [nn.Linear(in_features=in_channels, out_features=out_channels)]
        )

        # Final Regression Head
        self.out_layer = nn.Linear(in_features=out_channels, out_features=1)

        for k_hop in range(num_conv_layers):
            phi = MLP(
                in_channels=in_channels if k_hop == 0 else hidden_channels,
                hidden_channels=hidden_channels,
                out_channels=hidden_channels,
                num_layers=num_mlp_layers,
                act="relu",
                norm="layer_norm",
                plain_last=False,
            )

            # performs the initial (sum) neighbour pooling + (1+eps) * hv, then applies phi
            # i.e phi o f = MLP((1+eps)*hv + (sum) neighbour k_hop representation)
            self.hop_k_gin_layers.append(
                GINEConv(nn=phi, eps=0, train_eps=False, edge_dim=edge_dim)
            )

            # READOUT is performed on each k_hop node representation for each graph
            self.hop_k_readout_layers.append(
                nn.Linear(in_features=hidden_channels, out_features=out_channels)
            )

    def forward(
        self, x, edge_index, batch, graphs_mask=None, target=None, edge_attr=None
    ):
        # perform k-hop aggregation using GINEConv, and return all layer outputs
        hop_k_outputs = [x]
        h = x
        for gin_layer in self.hop_k_gin_layers:
            h = gin_layer(h, edge_index, edge_attr)
            hop_k_outputs.append(h)

        # perform readout over all nodes in each graph in each layer
        score_over_layer = torch.zeros((1))
        for i, linear in enumerate(self.hop_k_readout_layers):
            pooled_h = global_add_pool(
                x=hop_k_outputs[i], batch=batch, size=self.batch_size
            )
            # compute scores
            score_over_layer = score_over_layer + nn.functional.dropout(
                linear(pooled_h), training=self.training
            )
        score_over_layer = self.out_layer(score_over_layer)
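        if self.training:
            # Compute the loss, dropping the last graph in the batch,
            # which is the padding graph added by the fixed-size collater
            loss = F.mse_loss(score_over_layer.squeeze()[:-1], target[:-1])
            return score_over_layer, loss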

        return score_over_layer

## Build the training model 

Now we create the model for the prediction task and prepare it for training.

We can tune a number of key parameters below:
* `LEARNING_RATE` - learning rate for the optimizer
* `NUM_EPOCHS` - training duration, in epochs
* `HIDDEN_CHANNELS`  - hidden dimension size of the MLP in each GINE layer 
* `OUT_CHANNELS` - output dimension of the linear readout layers that follow the message passing
* `NUM_CONV_LAYERS` - number of conv layers (that is, message passing steps)
* `NUM_MLP_LAYERS` - number of layers in the MLP inside each conv layer

(Note: Each conv layer also adds a linear readout layer.)

We can see the impact of model size by tweaking these parameters in the summary printed below. 

In [None]:
# Tunable Parameters
LEARNING_RATE = 5e-4
NUM_EPOCHS = 15
HIDDEN_CHANNELS = 256
OUT_CHANNELS = 256  # dataset.num_output
NUM_CONV_LAYERS = 4
NUM_MLP_LAYERS = 2


model = GINE(
    dataset.num_atom_features,
    HIDDEN_CHANNELS,
    OUT_CHANNELS,
    NUM_CONV_LAYERS,
    NUM_MLP_LAYERS,
    BATCH_SIZE,
    dataset.num_bond_features,
)
model.train()

optimizer = poptorch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
poptorch_training_model = poptorch.trainingModel(
    model, options=train_opts, optimizer=optimizer
)


from torchinfo import summary

summary(poptorch_training_model)

## Train the model 

Now we can start training the model. 


In [None]:
# Train
epoch_losses = []
with tqdm(range(NUM_EPOCHS), colour="#FF6F79") as pbar:
    for epoch in pbar:
        losses = []
        for data in train_loader:
            out, loss = poptorch_training_model(
                data.x,
                data.edge_index,
                batch=data.batch,
                graphs_mask=data.graphs_mask,
                edge_attr=data.edge_attr,
                target=data.y,
            )
            losses.append(loss.item())
            epoch_losses.append(loss.item())
        print(f"Epoch {epoch} - Loss {np.mean(losses):.3f}")

        pbar.set_description(f"Epoch {epoch} - Loss {np.mean(losses):.3f}")

In [None]:
poptorch_training_model.detachFromDevice()

In [None]:
# It's always good to look at the curve of the training loss
plot_smoothed_loss(epoch_losses)

## Evaluation

We can now test our model. To keep this tutorial simple, no hyperparameter search or evaluation of the best atom/bond featurization was performed. This inevitably impacts the performance, and improving on it is left as an exercise for the reader.

A manual hyperparameter search is easy to start: tweak the hyperparameters given when building the model above and run the training loop again. (Remember to run both the building cell and the training cell.)
The model compiles and trains in a couple of minutes on the IPU, so it's easy to experiment quickly, even with quite large models, and tune the model to your specific dataset. 
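
If you would rather loop over settings than edit the cell by hand, a small grid search could look like the sketch below. Note that `train_and_evaluate` is a hypothetical placeholder: here it returns a dummy score so that the loop runs on its own, and in practice you would replace its body with the model-building, training and evaluation cells from this notebook and return, for example, the test MAE. The grid values are arbitrary examples.

```python
import itertools


def train_and_evaluate(learning_rate, hidden_channels, num_conv_layers):
    """Hypothetical placeholder: returns a dummy score, lower is better.

    Replace the body with real training and evaluation code.
    """
    return (
        abs(learning_rate - 5e-4) * 1e3
        + abs(hidden_channels - 256) / 256
        + abs(num_conv_layers - 4)
    )


# Arbitrary example values for each hyperparameter.
grid = {
    "learning_rate": [1e-4, 5e-4, 1e-3],
    "hidden_channels": [128, 256],
    "num_conv_layers": [3, 4],
}

results = []
for values in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    results.append((train_and_evaluate(**params), params))

# Keep the configuration with the lowest score.
best_score, best_params = min(results, key=lambda item: item[0])
print(best_score, best_params)
```

Because each configuration needs its own compilation and training run, it is worth keeping the grid small to begin with.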

In [None]:
from sklearn.metrics import r2_score, mean_absolute_error
from matplotlib import pyplot as plt

model.eval()
poptorch_inference_model = poptorch.inferenceModel(model, options=test_opts)
test_y_hat = []
test_y_true = []
with torch.no_grad():
    for data in test_loader:
        out = poptorch_inference_model(
            data.x,
            data.edge_index,
            batch=data.batch,
            graphs_mask=data.graphs_mask,
            edge_attr=data.edge_attr,
        )[:-1]
        # out = global_add_pool(out, data.batch)
        test_y_hat.append(out.detach().cpu().squeeze())
        test_y_true.append(data.y[:-1])


test_y_hat = torch.cat(test_y_hat).numpy()
test_y_true = torch.cat(test_y_true).numpy()

r2 = r2_score(test_y_true, test_y_hat)
mae = mean_absolute_error(test_y_true, test_y_hat)
poptorch_inference_model.detachFromDevice()

We can visualise the distribution of results by plotting the predicted values against the expected true values from the dataset. The $R^2$ and MAE are given on the plot.
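
To make the two metrics concrete, here is a minimal pure-Python sketch of how they are defined. The function names `mae_sketch` and `r2_sketch` are made up so as not to shadow the scikit-learn functions used above, and the numbers are toy values rather than model outputs.

```python
def mae_sketch(y_true, y_pred):
    """Mean absolute error: average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_sketch(y_true, y_pred):
    """Coefficient of determination: 1 - residual / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values for illustration only.
toy_true = [1.0, 2.0, 3.0, 4.0]
toy_pred = [1.5, 2.0, 2.5, 4.0]
print(mae_sketch(toy_true, toy_pred), r2_sketch(toy_true, toy_pred))
```

An $R^2$ of 1 means the predictions match the targets exactly, while a value of 0 means the model does no better than always predicting the mean of the targets.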

In [None]:
from utils import plot_contours

plot_contours(test_y_true, test_y_hat, r2, mae)

print(min(test_y_hat), max(test_y_hat))
print(min(test_y_true), max(test_y_true))

## Demo Molecules

We can see the MAE and $R^2$ score above, and while they seem pretty good, it's hard to put them in the context of a real use case. 
So instead let's apply the model to individual molecules and report the performance.

For demonstration purposes we'll build a new dataloader for inference that holds a single molecule per batch. For simplicity we'll just take the full dataset. In reality this might be a different dataset, or new molecules as they need to be evaluated, but this gives an idea of the workflow. 

This is a similar idea to the demo in the [Transformer notebook](https://console.paperspace.com/github/graphcore/Gradient-HuggingFace?machine=Free-IPU-POD4&container=graphcore/pytorch-jupyter%3A3.2.0-ubuntu-20.04-20230331&file=dolly2-instruction-following%2FDolly2-an-OSS-instruction-LLM.ipynb), but here we need to provide a little more information to ensure the graphs are fixed size and will work with any molecule in our dataset. The batch size is 2 to allow for the dummy padding graph alongside the single real molecule (see tutorial X for more information). We also provide a really generous maximum number of nodes and edges, which is overkill for most graphs but keeps the dataloader simple and, when processing one molecule at a time, is perfectly sufficient for our needs. 
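
The role of the padding graph can be illustrated with a toy example. The sketch below is pure Python with hypothetical helper names (`pad_single_graph`, `sum_pool`) and one made-up value per node. It only mimics the idea behind the fixed-size batch and the sum pooling in the model, not the actual PopTorch Geometric implementation.

```python
def pad_single_graph(node_values, num_nodes):
    """Pad one graph to num_nodes; padding nodes belong to graph 1."""
    n_pad = num_nodes - len(node_values)
    x = list(node_values) + [0.0] * n_pad
    batch = [0] * len(node_values) + [1] * n_pad
    return x, batch


def sum_pool(x, batch, num_graphs):
    """Sum node values per graph, like a global add pool."""
    pooled = [0.0] * num_graphs
    for value, graph_id in zip(x, batch):
        pooled[graph_id] += value
    return pooled


# One real molecule with three nodes, padded to a budget of six nodes.
x, batch = pad_single_graph([0.5, 1.5, 2.0], num_nodes=6)
pooled = sum_pool(x, batch, num_graphs=2)
prediction = pooled[:-1]  # drop the last entry, which is the padding graph
print(pooled, prediction)  # [4.0, 0.0] [4.0]
```

This is why a batch size of 2 is needed for a single molecule: one slot holds the real graph and the other absorbs all the padding nodes.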

In [None]:
inf_opts = poptorch.Options()
inf_opts.deviceIterations(1)
inf_opts.replicationFactor(1)
inf_df = df
# These two lines can be swapped to choose the QM9 dataset as an extension
dataset = DTset(inf_df.smiles.values, inf_df.exp.values, featurizer)
# dataset = DTset(inf_df.smiles.values, inf_df.gap.values, featurizer)

inf_loader = CombinedDataloader(
    dataset=dataset,
    batch_size=2,
    num_nodes=max_nodes * BATCH_SIZE,
    collater_args=dict(num_edges=max_edges * BATCH_SIZE, add_masks_to_batch=True),
    drop_last=True,
    options=inf_opts,
)
model.batch_size = 2
model.eval()
inference_model = poptorch.inferenceModel(model, options=inf_opts)

In [None]:
sampler = iter(inf_loader)
smiles = iter(df.smiles.values)
# These lines can be selected to run QM9 dataset if desired
# names = iter(df.mol_id.values)
names = iter(df.CMPD_CHEMBLID.values)

In [None]:
from utils import Emoji


def next_molecule():
    # =============================================================
    # |                                                           |
    # |             DEMO MOLECULE PREDICTION                      |
    # |               (Re-run this block)                         |
    # |                                                           |
    # =============================================================
    clear_output(True)
    sample = next(sampler)
    name = next(names)
    smile = next(smiles)
    # print(name)
    out = inference_model(
        sample.x,
        sample.edge_index,
        batch=sample.batch,
        graphs_mask=sample.graphs_mask,
        edge_attr=sample.edge_attr,
    )
    view = report_molecule_regression(name, sample.y[0], out.squeeze()[0], smile)
    view.show()

In [None]:
# RE-RUN THIS CELL TO SEE INFERENCE RESULTS ON INDIVIDUAL MOLECULES

next_molecule()

## Conclusion

In this notebook we've shown how to use pretrained featurizers to train a model on a downstream task, including how to use this model with new molecules.

**Next steps:**
- Try a hyperparameter sweep - in the section on building the model there are some suggested parameters to try tuning. Try changing the learning rate or number of hidden layers and see the impact on the training dynamics.  
- Try changing the dataset - this example starts with the `Lipophilicity` dataset (LogD) from MoleculeNet, and code is provided in comments with hints explaining how to switch to the `QM9` dataset for a regression task. 
- Try exploring more datasets from [MoleculeNet](https://moleculenet.org/datasets-1), which are labelled by the type of task, and see if you can tune the model on these datasets. 
- Try a different model from MolFeat - the `ChemGPT-4.7M` model is provided as an alternative featurizer / model to finetune - this is in the same cell as the `QM9` dataset.

- For an alternative use of the MolFeat library please see the integration of MolFeat with PyTorch Geometric on the IPU in  `<folder_name/notebook_name>.ipynb`


In [None]:
# Finally detach the inference model - this line is at the end to make sure it's not run accidentally before finishing the notebook.
inference_model.detachFromDevice()