In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
!nvidia-smi

Sun May  7 10:14:23 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  Off  | 00000000:13:00.0  On |                  N/A |
|  0%   54C    P3    34W / 220W |    988MiB /  8192MiB |      9%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
!wandb login

[34m[1mwandb[0m: Currently logged in as: [33milkersigirci[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [4]:
# !pip install deepchem simpletransformers transformers torch jupyter
!pip list | egrep "deepchem|simpletransformers|transformers|torch"

deepchem                   2.7.1
simpletransformers         0.63.9
torch                      2.0.0
transformers               4.27.4


## Imports


In [5]:
# NOTE: Tutorial version was 2.5.0

import deepchem
import logging

# from rdkit import Chem
# Test if NVIDIA apex training tool works. TODO: Why is this needed?
# from apex import amp

from thesis_work.chemberta.molnet_dataloader import (
    load_molnet_dataset,
    write_molnet_dataset_for_chemprop,
)

deepchem.__version__

Skipped loading some Tensorflow models, missing a dependency. No module named 'tensorflow'
Skipped loading modules with pytorch-geometric dependency, missing a dependency. No module named 'torch_geometric'
Skipped loading modules with pytorch-geometric dependency, missing a dependency. cannot import name 'DMPNN' from 'deepchem.models.torch_models' (/home/ilker/Documents/MyRepos/thesis-work/.venv/lib/python3.10/site-packages/deepchem/models/torch_models/__init__.py)
Skipped loading modules with pytorch-lightning dependency, missing a dependency. No module named 'pytorch_lightning'
Skipped loading some Jax models, missing a dependency. No module named 'jax'


'2.7.1'

In [6]:
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

# What is Transfer Learning, and how does ChemBERTa utilize it?

Transfer learning is a research problem in machine learning that focuses on **storing knowledge gained while solving one problem and applying it to a different but related problem**.

By pre-training directly on SMILES strings and teaching ChemBERTa to predict masked tokens in each string, the model learns a strong molecular representation. We can then take this model, trained on a structural chemistry task, and apply it to a suite of classification tasks in the MoleculeNet suite, from Tox21 to BBBP!
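
To make the masked-token objective concrete, here is a minimal, dependency-free sketch. It is a simplification and not ChemBERTa's actual pre-training code: tokens are single characters, a fixed 15% of positions are hidden, and the helper `mask_tokens` with its parameters is a hypothetical name used only for illustration.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Hide a fraction of tokens and remember what was hidden (toy version)."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))

    masked = list(tokens)
    targets = {}  # position -> original token the model would have to predict
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = mask_token
    return masked, targets


# Assumption: character-level tokens, purely for illustration.
tokens = list("C1=C(C(=O)NC(=O)N1)F")
masked, targets = mask_tokens(tokens)
print("".join(masked))
print(targets)
```

During pre-training, the model sees the masked sequence and is scored on how well it recovers the entries in `targets`.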


# Fine-tuning ChemBERTa on a Small Molecular Dataset

Our fine-tuning dataset, ClinTox, consists of qualitative data of drugs approved by the FDA and those that have failed clinical trials for toxicity reasons.

The ClinTox dataset consists of 1478 binary labels for toxicity, using the SMILES representations for identifying molecules. The computational models produced from the dataset could become decision-making tools for government agencies in determining which drugs are of the greatest potential concern to human health. Additionally, these models can act as drug screening tools in the drug discovery pipelines for toxicity.


# But why use a custom SmilesTokenizer over BPE?

In this tutorial, we compare the BPE tokenization algorithm with a **custom SmilesTokenizer** based on a regex pattern, which we have released as part of DeepChem. To compare tokenizers, we pretrained an identical model on the PubChem-1M set, tokenized with this new tokenizer. In the paper, the pretrained model was evaluated on BBBP and Tox21. We found that the SmilesTokenizer narrowly outperformed the BPE algorithm by ∆PRC-AUC = $+0.021$.

Though this result suggests that a more semantically relevant tokenization may provide performance benefits, further benchmarking on additional datasets is needed to validate this finding. **In this tutorial, we aim to do so, by testing this alternate model on the ClinTox dataset.**
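
To illustrate what "regex-based" tokenization means in practice, here is a small sketch using only the standard library. The pattern below is an assumption: it is of the kind commonly used for SMILES, and DeepChem's SmilesTokenizer may differ in detail. The names `SMILES_PATTERN` and `tokenize_smiles` are hypothetical.

```python
import re

# Assumption: an illustrative SMILES pattern, not necessarily DeepChem's.
# Bracket atoms and two-letter halogens are matched before single letters.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_PATTERN.findall(smiles)
    # Fail loudly if the pattern skipped any character.
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not tokenize: {smiles!r}")
    return tokens


print(tokenize_smiles("CC(=O)Cl"))
print(tokenize_smiles("N[C@H](C)O"))
```

Unlike a character-level split, `Cl`, `Br`, and bracket atoms such as `[C@H]` each stay together as one token, which is the "semantically relevant" property discussed above.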

Let's fetch the SmilesTokenizer's character-per-line vocabulary file, which can be loaded from the DeepChem S3 data bucket:


In [7]:
# TODO: Needed for tokenizer training I think?

# !wget https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/vocab.txt

## Data


In [16]:
import pandas as pd
import numpy as np

# protein_type = "clintox"
protein_type = "kinase"
# protein_type = "gpcr"
# protein_type = "protease"

In [17]:
if protein_type == "clintox":
    tasks, (train_df, valid_df, test_df), transformers = load_molnet_dataset(
        "clintox", tasks_wanted=None
    )

else:
    data_path = f"../../thesis_work/data/{protein_type}_smiles.csv"

    df = pd.read_csv(data_path)
    df.columns = ["text", "labels"]

    df = df.sample(frac=1, random_state=42)

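    # df was already shuffled above with a fixed seed, so split it directly;
    # calling df.sample(frac=1) again without random_state would make the
    # split differ between runs.
    train_df, valid_df, test_df = np.split(
        # 60/20/20 alternative: [int(0.6 * len(df)), int(0.8 * len(df))]
        df,
        [int(0.8 * len(df)), int(0.9 * len(df))],
    )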

In [18]:
# Check that the train, validation, and test dataframes are set up properly. Each should have only two columns: the SMILES string and its corresponding label.
print("Train Dataset: {}".format(train_df.shape))
print("Eval Dataset: {}".format(valid_df.shape))
print("Test Dataset: {}".format(test_df.shape))

Train Dataset: (53047, 2)
Eval Dataset: (6631, 2)
Test Dataset: (6631, 2)
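
As a quick sanity check, the shapes above can be reproduced from the cut points alone. The sketch below mirrors the `int(0.8 * len(df))` and `int(0.9 * len(df))` cuts used in the split cell; `split_sizes` is a hypothetical helper, and the row total is simply the sum of the three printed shapes.

```python
def split_sizes(n_rows, cuts=(0.8, 0.9)):
    """Return (train, valid, test) sizes for cut points at int(frac * n_rows)."""
    first_cut = int(cuts[0] * n_rows)
    second_cut = int(cuts[1] * n_rows)
    return first_cut, second_cut - first_cut, n_rows - second_cut


n_rows = 53047 + 6631 + 6631  # total rows implied by the shapes above
sizes = split_sizes(n_rows)
print(sizes)  # (53047, 6631, 6631)
```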


In [19]:
train_df
# valid_df
# test_df

| | text | labels |
|---|---|---|
| 39020 | C=CC(=O)Nc1cccc(-c2nc(Nc3ccc(N4CCOCC4)c(F)c3)n... | 0 |
| 64077 | O=C(Nc1cccc(C(F)(F)F)c1)Nc1cn(CCNc2ncnc3ccsc23... | 0 |
| 10698 | CN(CC1(C#N)CCCCC1)C1CCN(C(=O)c2ccc(-n3cc(NC(=O... | 1 |
| 49210 | CN1C(=O)c2c(c3c4ccccc4n(C4OC(CO)C(O)C(O)C4O)c3... | 0 |
| 15819 | Cc1cccc(NC(=O)c2cc(Br)ccc2OCc2ccccc2)c1 | 1 |
| ... | ... | ... |
| 21861 | FC(F)(F)C1(Nc2cc(-c3ccncc3)nc3cnccc23)CCC1 | 1 |
| 66188 | C=CC(=O)Nc1cccc(CNc2nc(Nc3cc(F)c(N4CCOCC4)c(F)... | 0 |
| 30227 | CC(C)n1cc(C(=O)c2cncc(NCCc3cccnc3)n2)c2c(N)ncnc21 | 1 |
| 53790 | COc1cc2c(cc1OC)-c1n[nH]c(-c3ccc(-c4cn(C)cn4)cc... | 0 |


## Model


- Now, using `simpletransformers`, let's load the pre-trained model from the HuggingFace model hub. We set the number of epochs in the arguments (`num_train_epochs=1` in the cell below), but you can train for longer and enable early stopping through the same arguments to prevent overfitting.
- Also make sure that `auto_weights` is set to True for automatic class-weight balancing, since toxicity datasets are imbalanced.
  - FIXME: Not working with `ClassificationArgs`. If provided as a dict, it doesn't complain, but is it actually used?
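
Since `auto_weights` is not working here, one option is to compute class weights by hand. The sketch below uses the inverse-frequency heuristic `n_samples / (n_classes * count)`, the same formula behind scikit-learn's `class_weight="balanced"`. Whether this matches what `auto_weights` would have done is not assumed, and `balanced_class_weights` is a hypothetical helper name.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights, one per class, ordered by class id."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return [n_samples / (n_classes * counts[cls]) for cls in sorted(counts)]


# Toy labels for illustration: three negatives, one positive.
labels = [0, 0, 0, 1]
weights = balanced_class_weights(labels)
print(weights)
```

A list like this could then be tried as the optional `weight` argument of `ClassificationModel` mentioned in the next cell, for example with `train_df["labels"].tolist()` as input.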


In [20]:
from simpletransformers.classification import ClassificationModel, ClassificationArgs
import os

# NOTE: Necessary for training outside of Colab
os.environ["TOKENIZERS_PARALLELISM"] = "false"


model_args = ClassificationArgs(
    evaluate_each_epoch=True,
    evaluate_during_training_verbose=True,
    no_save=True,
    num_train_epochs=1,
    # overwrite_output_dir=True,
    # auto_weights=True, # NOTE: Not working
    # NOTE: Necessary for training outside of Colab
    use_multiprocessing=False,
    # dataloader_num_workers=0,
    # process_count=1,
    use_multiprocessing_for_evaluation=False,
)

# Early stopping
# model_args.use_early_stopping = True
# model_args.early_stopping_delta = 0.01
# model_args.early_stopping_metric = "mcc"
# model_args.early_stopping_metric_minimize = False
# model_args.early_stopping_patience = 5
# model_args.evaluate_during_training_steps = 1000

# model_type = "seyonec/PubChem10M_SMILES_BPE_396_250"  # BPE tokenizer
# model_type = "seyonec/SMILES_tokenized_PubChem_shard00_160k"  # Custom SMILES tokenizer

# model_type = "DeepChem/ChemBERTa-10M-MTR"
model_type = "DeepChem/ChemBERTa-77M-MLM"

# output_dir = "BPE_PubChem_10M_ClinTox_run"
# output_dir = "SmilesTokenizer_PubChem_10M_ClinTox_run"
# output_dir = f"{protein_type.upper()}_77M_MLM_Scaffold"

output_dir = f"{protein_type.upper()}_77M_MLM_Shuffle_80_10_10_epoch1"

# if not os.path.exists(f"results/{output_dir}"):
#     os.makedirs(output_dir)


# You can set class weights by using the optional weight argument
model = ClassificationModel(
    "roberta",
    model_type,
    args=model_args,
    # use_cuda=False,
)

Some weights of the model checkpoint at DeepChem/ChemBERTa-77M-MLM were not used when initializing RobertaForSequenceClassification: ['lm_head.decoder.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at DeepChem/ChemBERTa-77M-MLM and are newly initialized: ['classifier.out_proj.weight', 'classifier.den

In [21]:
print(model.tokenizer)

RobertaTokenizerFast(name_or_path='DeepChem/ChemBERTa-77M-MLM', vocab_size=591, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': AddedToken("[MASK]", rstrip=False, lstrip=True, single_word=False, normalized=False)})


Train the model on the training split selected above (the original tutorial uses the scaffold-split training set of ClinTox), and monitor the run using W&B. We evaluate the model on the validation set after each epoch.


In [None]:
# Create directory to store model weights (change path accordingly to where you want!)
# !mkdir BPE_PubChem_10M_ClinTox_run

# Train the model
model.train_model(
    train_df,
    eval_df=valid_df,
    output_dir=f"results/{output_dir}",
    args={"wandb_project": output_dir},
)

We will be using the accuracy and PRC-AUC metrics (average precision score).

- Clintox_77M_MLM_Scaffold:

  - 'auroc': 0.6598721023181455, 'auprc': 0.9686162534366902, 'acc': 0.9391891891891891, 'eval_loss': 0.21732922604209498
  - 'auroc': 0.6598721023181455, 'auprc': 0.9686162534366902, 'acc': 0.9391891891891891, 'eval_loss': 0.21732922604209498

- Kinase_BPE_60_20_20:

  - 'auprc': 0.5819903832629333, 'acc': 0.5187000452420449, 'eval_loss': 0.6947489678140716
  - 'auprc': 0.5819903832629333, 'acc': 0.5549285423253196, 'eval_loss': 0.6947489678140716

- Kinase_77M_MLM_Random:

  - 'auroc': 0.7816713295079376, 'auprc': 0.7923752900766724, 'acc': 0.696576685266174, 'eval_loss': 0.5885900806172766
  - 'auroc': 0.7816713295079376, 'auprc': 0.7923752900766724, 'acc': 0.6778493619534837, 'eval_loss': 0.5885900806172766

- Kinase_77M_MLM_Random_epoch_1:

  - None
  - 'auroc': 0.7543682288335132, 'auprc': 0.7674025333716489, 'acc': 0.6585454274106363, 'eval_loss': 0.588208513610757

- Kinase_77M_MLM_Scaffold_epoch_10:

  - 'auroc': 0.8009122719662067, 'auprc': 0.8054774057518067, 'acc': 0.7210073895340069, 'eval_loss': 0.5609703987872787
  - 'auroc': 0.8009122719662067, 'auprc': 0.8054774057518067, 'acc': 0.6842033467149147, 'eval_loss': 0.5609703987872787

- GPCR_BPE_60_20_20:

  - 'auroc': 0.5052311430060601, 'auprc': 0.5437956080844786, 'acc': 0.5227907660638141, 'eval_loss': 0.6958835787944132
  - 'auroc': 0.5052311430060601, 'auprc': 0.5437956080844786, 'acc': 0.5486169810712278, 'eval_loss': 0.6958835787944132

- PROTEASE_BPE_60_20_20:

  - 'auroc': 0.5058419844941522, 'auprc': 0.453862209356516, 'acc': 0.4530363223609535, 'eval_loss': 0.721208622182268
  - 'auroc': 0.5058419844941522, 'auprc': 0.453862209356516, 'acc': 0.44255565858638773, 'eval_loss': 0.721208622182268

- PROTEASE_77M_MLM_Scaffold_epoch_10:
  - 'auroc': 0.7980674578306961, 'auprc': 0.7416403521458773, 'acc': 0.7346765039727582, 'eval_loss': 0.5834547029871519
  - 'auroc': 0.7980674578306961, 'auprc': 0.7416403521458773, 'acc': 0.6234580182996616, 'eval_loss': 0.5834547029871519
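
For readers unfamiliar with average precision, here is a dependency-free sketch of how it can be computed from ranked scores: sort by score, then average the precision observed at each positive. This is an illustration under simplifying assumptions (binary 0/1 labels, no special handling of tied scores), not the implementation behind the numbers above, and `average_precision` is a hypothetical helper.

```python
def average_precision(labels, scores):
    """Mean of the precision values measured at each positive, in score order."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("Need at least one positive label.")

    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` items
    return total / n_pos


# Toy data: positives ranked 1st and 3rd out of four.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(labels, scores)
print(round(ap, 4))
```

A ranking that places every positive ahead of every negative scores 1.0, which is why the metric is informative on imbalanced data.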


In [15]:
# import torch
# import gc

# torch.cuda.empty_cache()
# gc.collect()

In [16]:
# import wandb
# run = wandb.init(project=output_dir)

In [None]:
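# FIXME: Takes too much time, 17 min for 6k samples

import sklearn.metrics

## accuracy
# result, model_outputs, wrong_predictions = model.eval_model(
#     test_df, acc=sklearn.metrics.accuracy_score
# )

## PRC-AUC (average precision)
# NOTE: the keyword name becomes the key in `result`, so the "acc" entry
# holds the average precision score here, not the accuracy.
result, model_outputs, wrong_predictions = model.eval_model(
    test_df, acc=sklearn.metrics.average_precision_score
)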

In [24]:
result

{'mcc': 0.36808729171911037,
 'tp': 2537,
 'tn': 2012,
 'fp': 1049,
 'fn': 1033,
 'auroc': 0.7543682288335132,
 'auprc': 0.7674025333716489,
 'acc': 0.6585454274106363,
 'eval_loss': 0.588208513610757}
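
The confusion counts above are enough to recompute several of these numbers by hand, which helps when reading the `acc` entry. The sketch below uses only the standard library. One inference, not something the notebook states: the `acc` value matches the average precision of hard 0/1 predictions (a two-step precision-recall curve), which is consistent with the metric function receiving thresholded predictions rather than probabilities.

```python
import math

# Confusion counts copied from the result above.
tp, tn, fp, fn = 2537, 2012, 1049, 1033
n = tp + tn + fp + fn

accuracy = (tp + tn) / n
precision = tp / (tp + fp)
recall = tp / (tp + fn)
prevalence = (tp + fn) / n  # fraction of positives in the test set

# Matthews correlation coefficient.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

# Average precision when the "scores" are hard 0/1 predictions:
# one step at (recall, precision), one at (1, prevalence).
ap_hard = recall * precision + (1 - recall) * prevalence

print(f"accuracy={accuracy:.4f} mcc={mcc:.4f} ap_hard={ap_hard:.4f}")
```

Plain accuracy from these counts is about 0.686, while `ap_hard` reproduces the reported `acc` of about 0.6585.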

In [25]:
model_outputs

array([[-0.79931641,  0.85546875],
       [-0.65283203,  0.68603516],
       [ 0.01628113, -0.01928711],
       ...,
       [-0.05737305,  0.08734131],
       [-0.0067749 ,  0.06628418],
       [ 0.01933289,  0.02659607]])

In [26]:
wrong_predictions

[{'guid': 0, 'text_a': 'COc1cc(C(=O)NC2CC2)ccc1-c1cnc2c(NCCCN)nc(C#CCO)cn12', 'text_b': None, 'label': 0},
 {'guid': 4, 'text_a': 'CCNC1=N/C(=C\\c2c[nH]c3ncccc23)C(=O)N1', 'text_b': None, 'label': 1},
 {'guid': 5, 'text_a': 'CC1=CC(=O)N(Nc2nc(-c3cccs3)nc3sc(-c4ccc(OCCNCC(C)C)cc4)c(C)c23)C1=O.Cl', 'text_b': None, 'label': 1},
 {'guid': 8, 'text_a': 'CC(Oc1cc(-n2cnc3cc(OCCCN(C)C)ccc32)ccc1C(N)=O)c1ccccc1C(F)(F)F', 'text_b': None, 'label': 0},
 {'guid': 11, 'text_a': 'O=Cc1cn2c3c1ccc1c4ccccc4n(c13)CCCC2', 'text_b': None, 'label': 1},
 {'guid': 15, 'text_a': 'CCC1C(=O)N(C)c2cnc(Nc3cc(F)c(C(=O)NC4CCCC4)cc3OCCO)nc2N1C1CCCC1', 'text_b': None, 'label': 0},
 {'guid': 17, 'text_a': 'CN(c1ccc(F)cc1)c1cc2cnnc(-c3ccc(F)cc3F)c2n(C)c1=O', 'text_b': None, 'label': 0},
 {'guid': 18, 'text_a': 'NCCNc1nc(-c2ccc3[nH]ncc3c2)nc2ccccc12', 'text_b': None, 'label': 1},
 {'guid': 20, 'text_a': 'COc1c(F)cc(-c2cnc3[nH]ccc3c2-c2ccc(N3CCN(C)CC3)cc2)cc1F', 'text_b': None, 'label': 0},
 {'guid': 21, 'text_a': 'c1cc2c

On ClinTox, the model performs pretty well, reaching roughly 97% PRC-AUC after training on only ~1400 data samples and 150 positive leads in a couple of minutes! We can clearly see the predictive power of transfer learning, and approaches like these are becoming increasingly popular in the pharmaceutical industry, where larger datasets are scarce. By training for more epochs and on more tasks, we can probably boost the accuracy as well!

Let's evaluate the model on one last string from ClinTox's test set for toxicity. The model should predict 1, meaning the drug failed clinical trials for toxicity reasons and wasn't approved by the FDA.


In [18]:
# Let's input a molecule with a toxicity value of 1
mol = "C1=C(C(=O)NC(=O)N1)F"
# mol = "Nc1nc(NC2CC2)c2ncn([C@H]3C=C[C@@H](CO)C3)c2n1"

predictions, raw_outputs = model.predict([mol])

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/1 [00:00<?, ?it/s]

In [19]:
print(predictions)
print(raw_outputs)

[0]
[[ 1.46777344 -1.59082031]]
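
The raw outputs are logits, one per class. To read them as class probabilities, we can apply a softmax. This is a small standard-library sketch using the two values printed above; `softmax` is written out by hand here rather than taken from any library.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


raw = [1.46777344, -1.59082031]  # logits for class 0 and class 1, from above
probs = softmax(raw)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, [round(p, 3) for p in probs])
```

The larger logit belongs to class 0, so the predicted label is 0, matching `predictions` above.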
