# End-to-End Conditional SMILES Generation Using a GPT Model

**Author:** Mahmoud Ebrahimkhani 

**Date:** 2025-04-11

**Course:** Applied Deep Learning and Generative AI in Healthcare

**Reference:** DOI: [10.1021/acs.jcim.1c00600](https://doi.org/10.1021/acs.jcim.1c00600)

---

## Introduction

In this notebook, you will train a transformer-based language model to design novel drug-like small molecules. We'll walk through an end-to-end pipeline in which a GPT-style model (a decoder-only transformer) is trained to generate valid SMILES (Simplified Molecular Input Line Entry System) strings, *conditioned on desired molecular properties*. These properties (scaffold, LogP, QED, and TPSA) are relevant in early drug discovery for assessing bioavailability, drug-likeness, and molecular complexity.

---

## What You'll Do

1. **Load and explore** the MOSES dataset ([github.com/molecularsets/moses](https://github.com/molecularsets/moses)), which contains pre-filtered drug-like molecules.
2. **Compute molecular descriptors**, including:
    - Scaffold (core structure of a molecule): The central framework of ring systems and linkers that defines the molecule's basic shape and connectivity.
    - LogP (lipophilicity): The logarithm of the octanol-water partition coefficient, a measure of how readily a molecule dissolves in fats versus water; it is important for drug absorption and distribution.
    - QED (quantitative estimate of drug-likeness): A score between 0 and 1 that indicates how similar a molecule's properties are to those of known drugs.
    - TPSA (topological polar surface area): The summed surface area of the polar atoms (mainly oxygen and nitrogen, with their attached hydrogens) in the molecule, which helps predict drug absorption.
3. **Format the data** for *conditional* generation, so that the model learns to generate SMILES strings that match specified property values.
4. **Train a GPT-like transformer model** to generate molecules.
5. **Evaluate the quality of the generated molecules** using:
    - Validity of SMILES strings: Parsing each generated SMILES string with RDKit to verify that it represents a valid chemical structure.
    - Uniqueness of generated molecules: Computing the ratio of unique SMILES strings to the total number of generated molecules.
    - Tanimoto similarity to training molecules: Calculating molecular-fingerprint similarity between generated and training-set molecules to assess novelty.
    - Alignment with the conditional property distributions: Comparing the statistical distributions of properties (LogP, QED, TPSA) between generated and training molecules.
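
As a preview of step 5, the first three checks can be sketched in a few lines of RDKit, assuming RDKit is already installed. This is a minimal illustration, not the evaluation code used later in the notebook: the function names, the Morgan fingerprint settings (radius 2, 2048 bits), and the toy SMILES lists are all assumptions. Uniqueness is taken here over the valid molecules only, which is also an assumption.

```python
from rdkit import Chem, DataStructs, RDLogger
from rdkit.Chem import rdFingerprintGenerator

RDLogger.DisableLog("rdApp.*")  # silence RDKit parse errors for invalid SMILES

# Assumed fingerprint settings (radius 2, 2048 bits); not taken from the notebook.
_MORGAN = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)


def canonicalize(smiles):
    """Return the canonical SMILES, or None if RDKit cannot parse the string."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def fingerprint(smiles):
    """Morgan bit-vector fingerprint of a SMILES string that is known to be valid."""
    return _MORGAN.GetFingerprint(Chem.MolFromSmiles(smiles))


def evaluate_generation(generated, training):
    """Validity, uniqueness, and nearest-training-molecule Tanimoto similarity.

    Assumes `training` is a non-empty list of valid SMILES strings.
    """
    valid = [c for c in (canonicalize(s) for s in generated) if c is not None]
    unique = sorted(set(valid))
    train_fps = [fingerprint(s) for s in training]
    # For each unique molecule, keep its highest similarity to any training molecule.
    nearest = [
        max(DataStructs.BulkTanimotoSimilarity(fingerprint(s), train_fps))
        for s in unique
    ]
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "nearest_train_similarity": dict(zip(unique, nearest)),
    }


# Toy data: "OCC" duplicates "CCO" after canonicalization; "C1CC" has an unclosed ring.
report = evaluate_generation(
    generated=["CCO", "OCC", "c1ccccc1", "C1CC"],
    training=["CCO", "CCN"],
)
print(report)
```

A nearest-neighbor similarity of 1.0 indicates that a generated molecule has the same fingerprint as a training molecule, so lower values suggest more novel structures.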

## Environment Setup

You may need to install a few dependencies first:

In [None]:

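# MOSES is published on PyPI as `molsets` (imported as `moses`); RDKit's current PyPI name is `rdkit`.
%pip install rdkit transformers torch tqdm scikit-learn matplotlib molsets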

## Import Packages

In [None]:
import os
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from rdkit import Chem
from rdkit.Chem import AllChem, Draw, Descriptors, QED
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem.Scaffolds.MurckoScaffold import GetScaffoldForMol
from rdkit.Chem import rdMolDescriptors, rdDistGeom
from rdkit.Chem.MolStandardize import rdMolStandardize

import torch
from torch.utils.data import Dataset, DataLoader
from tqdm.auto import tqdm

from transformers import (
    GPT2LMHeadModel,
    GPT2Tokenizer,
    GPT2Config,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling
)

from moses.dataset import get_dataset
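
With these imports available, the four descriptors from step 2 can be computed for a single molecule roughly as follows. This is a minimal sketch rather than the notebook's actual preprocessing: `compute_descriptors` and `to_conditional_text` are hypothetical helper names, and the tagged text layout in `to_conditional_text` is only one assumed way to serialize the conditions in front of a SMILES string.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, QED, rdMolDescriptors
from rdkit.Chem.Scaffolds.MurckoScaffold import GetScaffoldForMol


def compute_descriptors(smiles):
    """Return scaffold, LogP, QED, and TPSA for one SMILES, or None if it is invalid."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return {
        "smiles": Chem.MolToSmiles(mol),                      # canonical SMILES
        "scaffold": Chem.MolToSmiles(GetScaffoldForMol(mol)),  # Murcko scaffold
        "logp": Descriptors.MolLogP(mol),                     # Crippen LogP estimate
        "qed": QED.qed(mol),                                  # drug-likeness in [0, 1]
        "tpsa": rdMolDescriptors.CalcTPSA(mol),               # polar surface area
    }


def to_conditional_text(d):
    """Hypothetical layout: property tags first, then the SMILES to be generated."""
    return (
        f"<logp={d['logp']:.2f}><qed={d['qed']:.2f}>"
        f"<tpsa={d['tpsa']:.1f}><scaffold={d['scaffold']}>{d['smiles']}"
    )


# Aspirin as a small, familiar example molecule.
aspirin = compute_descriptors("CC(=O)Oc1ccccc1C(=O)O")
print(aspirin)
print(to_conditional_text(aspirin))
```

Placing the property tags before the SMILES means a decoder-only model sees the conditions first, so at generation time the same tags can be supplied as a prompt.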