# Downstream adaptation with MiniMol

This example shows how MiniMol can featurise molecules that then serve as input to another model trained on a small downstream dataset from TDC ADMET. This transfers the knowledge captured by the pre-trained MiniMol to a new task.

Before we start, let's make sure that the TDC package is installed in the environment. 

In [None]:
%pip install PyTDC

## Step 1: Getting the data
Next, we will build a predictor for the `HIA Hou` dataset, one of the binary classification benchmarks for `absorption`-type problems in the TDC ADMET group. We split the data into training, validation and test sets based on molecular scaffolds.

In [1]:
from tdc.benchmark_group import admet_group

DATASET_NAME = 'hia_hou'

admet = admet_group(path="admet-data/")

mols_test = admet.get(DATASET_NAME)['test']
mols_train, mols_val = admet.get_train_valid_split(benchmark=DATASET_NAME, split_type='scaffold', seed=42)

Found local copy...
generating training, validation splits...
generating training, validation splits...
100%|██████████| 461/461 [00:00<00:00, 3650.21it/s]
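
A scaffold split keeps molecules that share a core structure in the same subset, so the validation set probes generalisation to unseen chemotypes. The sketch below illustrates the general idea in plain Python. It is not TDC's implementation: the scaffold labels, the `scaffold_split` helper and the 80/20 ratio are hypothetical stand-ins, and in practice the scaffolds would be computed from the SMILES strings with a cheminformatics toolkit.

```python
import random
from collections import defaultdict


def scaffold_split(scaffolds, valid_frac=0.2, seed=42):
    """Split molecule indices so that no scaffold spans both subsets.

    `scaffolds` holds one (hypothetical) scaffold label per molecule.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Shuffle whole scaffold groups, never individual molecules.
    group_list = list(groups.values())
    random.Random(seed).shuffle(group_list)

    n_valid_target = int(valid_frac * len(scaffolds))
    train_idx, valid_idx = [], []
    for group in group_list:
        # Fill the validation set first, as long as the group still fits.
        if len(valid_idx) + len(group) <= n_valid_target:
            valid_idx.extend(group)
        else:
            train_idx.extend(group)
    return train_idx, valid_idx


# Toy example: ten molecules spread over four made-up scaffolds.
toy_scaffolds = ["A", "A", "A", "B", "B", "C", "C", "C", "C", "D"]
train_idx, valid_idx = scaffold_split(toy_scaffolds)
print("train:", train_idx)
print("valid:", valid_idx)
```

Because groups are assigned as a whole, the validation set may end up slightly smaller than the requested fraction.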


In [2]:
print(f"Dataset - {DATASET_NAME}\n")
print(f"Val split ({len(mols_val)} mols): \n{mols_val.head()}\n")
print(f"Test split ({len(mols_test)} mols): \n{mols_test.head()}\n")
print(f"Train split ({len(mols_train)} mols): \n{mols_train.head()}\n")

Dataset - hia_hou

Val split (58 mols): 
                 Drug_ID                                               Drug  Y
0         Atracurium.mol  COc1ccc(C[C@H]2c3cc(OC)c(OC)cc3CC[N@@+]2(C)CCC...  0
1  Succinylsulfathiazole          O=C(O)CCC(=O)Nc1ccc(S(=O)(=O)Nc2nccs2)cc1  0
2            Ticarcillin  CC1(C)S[C@H]2[C@@H](NC(=O)[C@@H](C(=O)O)c3ccsc...  0
3          Raffinose.mol  OC[C@@H]1O[C@@H](OC[C@@H]2O[C@@H](O[C@]3(CO)O[...  0
4          Triamcinolone  C[C@@]12C=CC(=O)C=C1CC[C@@H]1[C@H]3C[C@@H](O)[...  1

Test split (117 mols): 
                Drug_ID                                               Drug  Y
0         Trazodone.mol         O=c1n(CCCN2CCN(c3cccc(Cl)c3)CC2)nc2ccccn12  1
1          Lisuride.mol  CCN(CC)C(=O)N[C@H]1C=C2c3cccc4[nH]cc(c34)C[C@@...  1
2  Methylergonovine.mol  CC[C@H](CO)NC(=O)[C@H]1C=C2c3cccc4[nH]cc(c34)C...  1
3      Methysergide.mol  CC[C@H](CO)NC(=O)[C@H]1C=C2c3cccc4c3c(cn4C)C[C...  1
4       Moclobemide.mol                       O=C(NCCN1CCOCC1)c1ccc(Cl

## Step 2: Generating molecular fingerprints
After splitting the dataset into training, validation and test sets, we will use MiniMol to embed all molecules. The embeddings will be added as an extra column in the dataframes returned by TDC.

In [3]:
from minimol import Minimol

model = Minimol()

In [4]:
mols_val['Embedding'] = model(mols_val['Drug'])
mols_test['Embedding'] = model(mols_test['Drug'])
mols_train['Embedding'] = model(mols_train['Drug'])

featurizing_smiles, batch=1:   0%|          | 0/58 [00:00<?, ?it/s]

Casting to FP32: 100%|██████████| 58/58 [00:00<00:00, 16144.79it/s]
100%|██████████| 1/1 [00:00<00:00,  6.06it/s]


featurizing_smiles, batch=3:   0%|          | 0/39 [00:00<?, ?it/s]

Casting to FP32: 100%|██████████| 117/117 [00:00<00:00, 17139.94it/s]
100%|██████████| 2/2 [00:00<00:00,  6.25it/s]


featurizing_smiles, batch=13:   0%|          | 0/31 [00:00<?, ?it/s]

Casting to FP32: 100%|██████████| 403/403 [00:00<00:00, 12412.10it/s]
100%|██████████| 5/5 [00:00<00:00,  7.30it/s]


The model is small, so it took us 7.3 seconds to generate the embeddings for all 578 molecules. Here is a preview of the training split after the new column has been added:

In [5]:
print(mols_train.head())

           Drug_ID                                               Drug  Y  \
0        Guanadrel                      N=C(N)NC[C@@H]1COC2(CCCCC2)O1  1   
1      Cefmetazole  CO[C@@]1(NC(=O)CSCC#N)C(=O)N2C(C(=O)O)=C(CSc3n...  0   
2   Zonisamide.mol                           NS(=O)(=O)Cc1noc2ccccc12  1   
3   Furosemide.mol            NS(=O)(=O)c1cc(Cl)cc(NCc2ccco2)c1C(=O)O  1   
4  Telmisartan.mol  CCCc1nc2c(n1Cc1ccc(-c3ccccc3C(=O)O)cc1)=C[C@H]...  1   

                                           Embedding  
0  [0.24859753, 0.18472305, 0.4028932, 0.22700065...  
1  [0.7069565, 0.41227153, 1.0127053, 2.3176281, ...  
2  [0.19019875, -0.14087728, 0.8896561, 1.2718395...  
3  [0.11933186, 0.38785577, 1.5808605, 1.999807, ...  
4  [0.99853146, 1.1408926, 2.2468193, 1.3438487, ...  


## Step 3: Training a model
Now that the molecules are featurised, leveraging the representation MiniMol learned during pre-training, we will set up the training of a simple multi-layer perceptron on the newly generated embeddings and the labels from the `HIA Hou` dataset. We will use PyTorch.

Let's start by defining a new class for the dataset and then creating the dataloaders for different splits.

In [6]:
import torch
from torch.utils.data import DataLoader, Dataset


class AdmetDataset(Dataset):
    """Wraps a TDC split: MiniMol embeddings as inputs, `Y` as float targets."""

    def __init__(self, samples):
        self.samples = samples['Embedding'].tolist()
        self.targets = [float(target) for target in samples['Y'].tolist()]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = torch.tensor(self.samples[idx])
        target = torch.tensor(self.targets[idx])
        return sample, target


# Only the training loader is shuffled; evaluation does not depend on sample order.
val_loader = DataLoader(AdmetDataset(mols_val), batch_size=128, shuffle=False)
test_loader = DataLoader(AdmetDataset(mols_test), batch_size=128, shuffle=False)
train_loader = DataLoader(AdmetDataset(mols_train), batch_size=32, shuffle=True)
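
The batch sizes determine how many optimisation steps one epoch contains. As a rough, dependency-free sketch of what a `DataLoader` does with its indices (the `iterate_batches` helper is hypothetical and ignores details such as collating samples into tensors), the 403 training molecules yield 13 batches of at most 32 samples, while the 58 validation molecules fit into a single batch:

```python
import random


def iterate_batches(n_samples, batch_size, shuffle, seed=0):
    """Yield lists of sample indices, one list per batch."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]


train_batches = list(iterate_batches(403, batch_size=32, shuffle=True))
val_batches = list(iterate_batches(58, batch_size=128, shuffle=False))

print(len(train_batches), len(train_batches[-1]))  # 13 19
print(len(val_batches), len(val_batches[0]))       # 1 58
```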

Our model will be a simple 3-layer perceptron with batch normalisation and dropout. 

In [51]:
import torch.nn as nn
import torch.nn.functional as F


class TaskHead(nn.Module):
    def __init__(self):
        super(TaskHead, self).__init__()
        self.dense1 = nn.Linear(512, 512)
        self.dense2 = nn.Linear(512, 512)
        self.dense3 = nn.Linear(512, 1)
        self.bn1 = nn.BatchNorm1d(512)
        self.bn2 = nn.BatchNorm1d(512)
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        x = self.dense1(x)
        x = self.bn1(x)
        x = self.dropout(x)
        x = F.relu(x)

        x = self.dense2(x)
        x = self.bn2(x)
        x = self.dropout(x)
        x = F.relu(x)

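        x = self.dense3(x)
        # Softmax over the output dimension, set explicitly because the implicit `dim` is deprecated
        return F.softmax(x, dim=1)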
    

predictor = TaskHead()
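
The final activation decides what range of values the head can output. The plain-Python sketch below (the helper names are illustrative and not part of the notebook) compares a softmax taken over a single value with a sigmoid. The softmax normalises over one entry and therefore always returns 1, whereas a sigmoid, the usual choice for a single-logit binary classifier, maps each logit to its own probability.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(value):
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


for logit in (-3.0, 0.0, 5.0):
    # A softmax over one entry is always [1.0], whatever the logit is.
    print(logit, softmax([logit]), round(sigmoid(logit), 4))
```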

Below we declare the basic hyperparameters and choose the optimiser, loss function, learning-rate scheduler and weight-decay regularisation.

In [52]:
import math
import torch.optim as optim

lr = 0.001
epochs = 10
warmup = 5

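# Binary cross-entropy on the single probability produced by the task head
loss_fn = nn.BCELoss()
# LambdaLR multiplies the optimiser's base learning rate by this factor:
# it rises from 0 to 1 over the warm-up epochs, then decays along a cosine back to 0
lr_fn = lambda epoch: (1 + math.cos(math.pi * (epoch - warmup) / (epochs - warmup))) / 2
optimiser = optim.Adam(predictor.parameters(), lr=lr, weight_decay=0.0001)
scheduler = optim.lr_scheduler.LambdaLR(optimiser, lr_fn)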
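
To get a feel for the schedule, the cosine term can be tabulated without PyTorch. The sketch below evaluates only that term for the `epochs = 10` and `warmup = 5` used above (the `cosine_factor` name is illustrative). Keep in mind that `LambdaLR` multiplies the optimiser's base learning rate by whatever the scheduling function returns.

```python
import math

epochs = 10
warmup = 5


def cosine_factor(epoch):
    """Cosine term of the schedule: 0 at epoch 0, 1 at `warmup`, 0 at `epochs`."""
    return (1 + math.cos(math.pi * (epoch - warmup) / (epochs - warmup))) / 2


for epoch in range(epochs + 1):
    print(f"epoch {epoch:2d}: factor = {cosine_factor(epoch):.3f}")
```

The term climbs during the first five epochs, peaks at the end of the warm-up and then falls symmetrically, so the first and last epochs run with a learning rate close to zero.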

Before we start training, let's evaluate the randomly initialised model on the validation split.

In [59]:
import torch
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate(predictor, dataloader, loss_fn):
    predictor.eval()
    total_loss = 0
    all_probs = []
    all_targets = []

    with torch.no_grad():
        
        for inputs, targets in dataloader:
            
            # Predictions of shape (batch,), one value per molecule
            probs = predictor(inputs.float()).squeeze(-1)

            loss = loss_fn(probs, targets.float())
            total_loss += loss.item()

            all_probs.extend(probs.tolist())
            all_targets.extend(targets.tolist())

    loss = total_loss / len(dataloader)
    return roc_auc_score(all_targets, all_probs), average_precision_score(all_targets, all_probs)


auroc, avpr = evaluate(predictor, val_loader, loss_fn)
print(
    f"auroc =  {auroc:.4f}\n"
    f"avpr  =  {avpr:.4f}"
)

auroc =  0.5000
avpr  =  0.8276


For the randomly initialised task head, the softmax over a single output value returns 1 for every input, so all molecules receive the same score. Since the dataset is not class-balanced, this constant predictor still reaches 82.8% average precision, but the AUROC of 0.5, which accounts for both sensitivity and specificity, shows that the model is no better than random. Let's see how good it gets after training.
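
Both numbers can be reproduced by hand for a predictor that returns the same score for every molecule. The sketch below uses a small pairwise AUROC implementation and assumes 48 positive and 10 negative validation labels. That class count is not printed anywhere in the notebook output; it is inferred here from 48/58 ≈ 0.8276, so treat it as an assumption.

```python
def pairwise_auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision_constant(labels):
    """Average precision when every sample gets the same score.

    With a single threshold, the precision equals the positive rate.
    """
    return sum(labels) / len(labels)


# Assumed class balance (inferred from the reported score, see above).
labels = [1] * 48 + [0] * 10
scores = [1.0] * len(labels)

print(f"auroc = {pairwise_auroc(labels, scores):.4f}")      # 0.5000
print(f"avpr  = {average_precision_constant(labels):.4f}")  # 0.8276
```

A high average precision on an imbalanced split is therefore not evidence of a useful model on its own; the AUROC exposes that the scores carry no ranking information.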